#285: Dask as a Platform Service with Coiled Transcript

Recorded on Wednesday, Aug 12, 2020.

00:00 If you're into data science, you've probably heard about Dask. It's a package that feels like familiar APIs such as NumPy, pandas, and scikit-learn, yet it can scale that computation across CPU cores on your local machine, all the way to distributed, grid-based computing in large clusters. While powerful, this takes some setup to execute in its full glory. That's why Matthew Rocklin has teamed up with Hugo Bowne-Anderson and others to launch a business to help Python-loving data scientists run Dask workloads in the cloud. And they're here to tell us all about how they built this open source foundational business. And you know what? They must be on to something: between recording and releasing this episode, they just raised $5 million in VC funding. This is Talk Python To Me, Episode 285, recorded August 12, 2020.

01:03 Welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is brought to you by brilliant.org and monday.com. Please check out what they're offering during their segments. It really helps support the show. Talk Python To Me is partially supported by our training courses. Do you want to learn Python, but you can't bear to subscribe to yet another service? At Talk Python Training, we hate subscriptions too. That's why our course bundle gives you full access to the entire library of courses for one fair price. That's right, with the course bundle, you save 70% off the full price of our courses, and you own them all forever. That includes courses published at the time of the purchase, as well as courses released within about a year of the bundle. So stop subscribing and start learning at talkpython.fm/everything.

02:04 Matthew, Hugo, welcome back to Talk Python To Me. Hi, Michael.

02:07 Hey, Michael. Thanks for having us again, after a couple years.

02:10 Yeah, it's fabulous to have you both back. You were both guests previously. Hugo, you were on a really popular episode, 139, Paths into a data science career, which I think is great, helping people do that. And Matthew, you were on 207, Parallelizing computation with Dask. And

02:27 I guess beat me to

02:30 petition. But if it was your right, you would have won.

02:33 Yes, that's right. But I think this talk is going to be a really interesting extension of the one that you were on Matthew, where we talked about the technical stuff of dask and the computation. And now, you know, it's really grown a lot. That was February 2019. That was a time in the world when you could go places like you could just go out. And you could go to a place where they made food and you could physically touch it on the person and then just take it and you wouldn't worry about it. It's amazing.

02:59 That's right. Yeah, the actual, like, remote experience of developing software hasn't changed a ton. I think actually people in our space have more or less the same workflow. We just, you know, don't see each other at conferences any longer.

03:10 Yeah, if we were accused of being antisocial as developers, well, we've taken it up a notch, or other people have just come to join us. This is a weird world. But yeah, actually, I think software is pretty social, as you know, even though it gets categorized not so much that way. So it's great to have you both back. I'm looking forward to talking about Dask, but especially Coiled, this new company and new platform you all are starting, sort of based on taking that to the next level. So it's gonna be a really fun conversation.

03:39 Yeah, looking forward to it as well. Yeah, absolutely.

03:41 Yeah. Now, before we get into it, though, you both were on before, and I asked how you got into programming and how you got into Python. So, you know, maybe keep it short since there's two of you, and we did cover some of this stuff a little before. But let's try a different question. How did you get into data science? Hugo, you go first.

03:57 Sure. I was working in applied math in cell biology and biophysics. And my job was ostensibly to do a bunch of mathematical modeling. But all my collaborators were generating loads of data that I needed to figure out how to analyze to collaborate with them, and even, you know, play a big role in the iterative design of experiments and all this type of stuff. This was 2011, 2012.

04:21 Excel didn't seem like a great answer. That's where

04:23 I started. And then I went to Python and R kind of concurrently. Because I was working with biologists, R was quite popular. But IPython Notebook was becoming more and more popular. It was a great resource for teaching oneself and teaching others. So I jumped in then and worked in research, worked in education a lot, and then moved to industry, to tech, and started working in tech to educate people around all of these tools as well. Yeah,

04:50 excellent. Matthew, how about yourself? I think for me, probably the culmination point, the inflection point, was just after I graduated from university. I graduated with a physics and astronomy degree. I was looking at job applications. I didn't like any of the jobs that I was qualified for.

05:04 But the feeling when I, when I studied math, I'm like, I love this stuff. But what am I gonna do this looks that great to me.

05:12 But I did know how to code. I had been coding since I was, like, a kid, on a little TI calculator. And all of those programming-mixed-with-science jobs, those were fascinating. So I went to graduate school for computer science, scientific computing. And that was a really good fit for me. I got to touch all sorts of different domains. But the inflection point was probably looking at job applications after having already graduated college and really thinking, I think I made a mistake. Maybe I should have turned left earlier. Yeah,

05:37 that's really interesting. Because a lot of people, they look around, they see folks who have successfully created these amazing open source libraries and projects and initiatives, you know, like Dask, and they think: the people who get to do that, they've been programming since they were five, and they got a computer science degree, and they just knew they wanted to do that. And I think that's not actually true as often as people think it is.

05:59 Right? I think none of us got Computer Science degrees. Is that fair?

06:02 Yeah, I didn't. I got a computer science book. I decided I didn't want to work in math anymore, got, like, you know, a C++ book and started going on that. That's not the same as a degree, though. Good plan.

06:12 I feel like there's an analogue in the data science world as well, with respect to, you know, all these boot camps that happen now, and now master's degrees in data science at prestigious universities. Relatively few people I know, at least, have taken the university track. I think boot camps and junior data scientist positions are far more prevalent. But for more on that, you can check out the previous episode that I recorded with Michael.

06:34 Yes, exactly. We've got a whole show on that one. So, all right. Now, the last time we spoke, Hugo, you were at DataCamp, teaching people Python online, which was awesome. Yeah, cool. And Matthew, you were, I think, just moving to do a little bit of collaborative stuff with Nvidia. But now you both have, you know, teamed up and joined forces to do something else, right? What are you doing nowadays?

06:59 So we're building a company called Coiled, essentially to productize, for the enterprise, a lot of features of the PyData ecosystem, and Dask in particular. So the v1 of our product, which we're just about to launch (it's mid-August 2020), is managing Dask and distributed compute in the cloud for you, and handling security, Docker environments, team management, this type of stuff. And that doesn't really say what we do on a day-to-day basis. I think I could list what we don't do on a day-to-day basis more quickly than what we do. But having, you know, incorporated the company in February, it's kind of waking up every day and asking what the most impactful stuff we can do actually is. And that's really exciting. I'm really excited to work with Matt and the whole Coiled team.

07:43 Matthew, how about you? What do you do day to day, now that you've joined up with Hugo on Coiled?

07:46 Yeah, so day to day, you know, I still help maintain Dask. So there's a ton of GitHub conversations, discord, Stack Overflow. But now we're also making a company. So we're, you know, figuring out sales, we're figuring out legal. There's a ton of sort of nuts and bolts of running a company that is quite difficult. As you're listening, what you can do is different

08:04 for you can Yeah, exactly. So you went to school for physics and astronomy, and then you got a master's degree in computer science. And then you did some really interesting data science around dask. I bet none of that taught you about marketing funnels, or accounting, or taxation law, or any of that fun stuff. Right?

08:24 you'd actually be surprised, like running really late. Yeah, maintaining an open source project includes lots of those things. Okay. So there's a lot of like, we're hiring. We do a lot of community management. There's a lot of like, employer employee kinds of relationships that occur, right?

08:37 People are new to the

08:38 project. Okay. Yeah. You have relationships with large companies like Nvidia or Google or Microsoft. Running a sizable open source project is actually not so dissimilar from running a business. Yeah. Now, okay, laws and legal agreements, certainly more of them now. Probably so.

08:53 But yeah, you guys probably wear a lot of hats as you're getting the stuff off the ground, indeed,

08:57 Very much. Yeah. And just to speak to Matt's point: when we first started working together, he was like, okay, what's the call to action on the landing page? And I was like, oh, great, let's have that conversation. And his, you know, appreciation of design ethics, sorry, design aesthetics, and marketing funnels, I was actually pleasantly surprised.

09:17 I mean, there are probably, you know, 10 to 15 to 20 people who work on Dask at different companies, you know, many of whom we've, like, tried to get jobs. So, you know, we do marketing, we do sales, we do all of those kinds of things you do inside of companies. We just do it without paying anybody directly or without having any actual control over everything. It's kinda like running a company without any power. It's a lot of soft power. It's like product management. Yeah,

09:42 I can't imagine. Sure it does. All right. Cool. I guess I'll also just throw out there really quick: Hugo, you did the DataFramed podcast, right? So people may also, you know, know you from there as well. Are you still doing that?

09:53 No, no, I'm not. And I miss podcasting a great deal. I'm very much enjoying going on podcasts. But Matt and I have some secret plans to start podcasting in the near future. So that may be a bit of a teaser there. But yeah, that was a really fun podcast to put out weekly. And of course, I came on this show before doing that. I just want to let everyone know that Michael was actually incredibly helpful in getting the podcast up and running, and letting me know all the things that he learned along the way. You know, I still learned a lot along the way, but there are several things that Michael helped so wonderfully with. So

10:28 thank you for that, Michael. Oh, you're very welcome. I'm glad to see you having success there. And I'm looking forward to whatever secret project you guys have in mind. As am I. Awesome. So we're gonna talk about Dask and open source, and where maybe some sort of company making this a little more powerful or amplifying the message there might come from. But I wanted to start the conversation a little more broad and philosophical. So I'm going to talk about two things real quick. First of all, in 2017, Stack Overflow published an article called The Incredible Growth of Python. And it showed Python being pretty flat for many years, relatively, along with a bunch of other languages. And then around 2012, you know, somebody kicked the derivative of the second derivative, and it just took off, right? It went up and up. And they said, look at this, it's about to pass these other languages, and we predict, look how crazy it's gonna get. And you know, that was three years ago. We can look back, and I've actually linked to the Stack Overflow Trends now. Those predictions underestimated how popular Python was going to be, but more so they overestimated the stability of the other languages, which are actually going down rather than staying flat as predicted. I think all of that is really interesting. My theory is a lot of folks came into Python because of the data science stack around that timeframe. And so I guess my question to you all is, you know, how much is this incredible growth of Python the incredible growth of data science in Python, versus more broad stories around there?

12:01 I think the data science or scientific stack inside of Python certainly played a large role in that. And I think that's because the scientific stack targeted very much scientific users, who were a good proxy for data science users today. Yeah. And so we had the same combination of performance and accessibility that was necessary to meet that need. So we were sort of ahead of the game by accident. That's why a lot of the maintainers have a science background. Right. What I would say, though, is that Python is really only powerful today because of that union of the data science stack with the web stack, with the visualization stack, with the system operations stack. And it's really the fact that we can do all of those things together which makes Python the standard default place to build advanced applications today. If we were only MATLAB, for example, and we couldn't do web servers, we would be sort of a fringe language. You'd be a niche. If we were only, you know, Scala, and you did only Spark, we would also be sort of a niche. Python can do everything. And that is where you can build these really rich and awesome applications. Yeah,

13:02 I think that's a really good point. I mean, people talk about that being why Node.js had the interest that it did, because you could sort of solve the problem in all the places you had to work. And I feel like that's kind of the story as well for Python on the scientific computation side, and to some degree on the web side as well. I mean, we all have to write HTML and stuff. But, you know, the way I think of it is, if there's a biologist who's coming in, and she has to do a little bit of computation, she's like, okay, I just need to do this, I need to make this graph, this data is too big for whatever. So if I can write these 10 lines of code, not even in a function, just top to bottom, straight down, and get this amazing output by using all these libraries, eventually the idea comes up of, well, maybe I need to pass different data, so maybe I need to write a function. And then, you know, you get a little bit farther: now I can reuse this, I'm gonna make this a package. And, you know, two years later, you look back and it's like, how did I become a programmer? I thought I was a biologist. I never ever intended to be that. So I feel like Python brings a lot of people in through that sort of gravity. But because it'll solve the problems that are more advanced and has this rich sort of top end, you don't have to abandon it like you would MATLAB and go learn something else. You just get to stay there. And I think people are sticky in that regard. I agree.

14:17 This is a really common theme in how we think about data science, which is the on-ramp, right? You know nothing, or you know how to use Excel. You use a little bit of pandas, you install Dask on one machine, you have to scale on the cloud, and you can run on the biggest supercomputers, right? And it's that smooth experience that we think about. I'm getting ahead of myself a little bit talking about Dask. But that smooth experience of starting from nothing and working up to being amazing is what I think a lot of the Python ethos is about. I mean, yeah,

14:45 I agree. I agree with all of that. And I do think it's the rise of, you know, an entire community, a network of tooling in the Python landscape. It's also the rise of, you know, tireless educators such as the Carpentries, Data Carpentry and Software Carpentry, doing this stuff around the world. I'm very humbled to have played a small role in my work at DataCamp to spread the gospel of Python. I've never used that term before, we'll see how that lands. The other thing, I think, that I've been really excited about is seeing the wider Python community embrace data science as a fundamental part of the Python landscape. Now, I don't know if anyone's done a data analysis of, let's say, data science talks and keynotes and tutorials at PyCon over the past 10 years, but at least anecdotally, we've seen a huge embrace of it. I mean, it was where we first met, in your hometown of Portland, Oregon, right? Yeah. Right. Katy Huff and Jake VanderPlas were two keynotes several years ago. I was like, oh, well, they talked

15:39 about the Mosaic, and it was exactly, and we're having

15:44 that idea. Exactly.

15:46 Yeah. Super interesting. So I do think that a lot of the magic of Python is that it welcomes people from these different areas. And there was a really interesting survey, I think it was the JetBrains and PSF combined survey, that asked something like, how many of you are data scientists? And what percentage of the Python community do you believe data scientists are? And data scientists made up almost an equal share with web, which were the two biggest groups, but they thought they were much less represented, because, I think, they felt like they were working more individually. But there were just so many of them. So it was really interesting that they believed that they were not as big of a part as they actually were. But I think that perception is starting to change.

16:28 I'd be interested in what type of, like, self-identification bias there is with respect to data science. I almost get the impression everyone has to identify as a data scientist these days, whatever they do, even if they're a gardener.

16:42 It's true. If you have touched pandas, then you are.

16:48 This portion of Talk Python To Me is brought to you by brilliant.org. Brilliant has digestible courses in topics from the basics of scientific thinking all the way up to high-end science, like quantum computing. And while quantum computing may sound complicated, Brilliant makes complex learning uncomplicated and fun. It's super easy to get started, and they've got so many science and math courses to choose from. I recently used Brilliant to get into rocket science for an upcoming episode, and it was a blast. The interactive courses are presented in a clean and accessible way, and you can go from knowing nothing about a topic to having a deep understanding. Put your spare time to good use and hugely improve your critical thinking skills. Go to talkpython.fm/brilliant and sign up for free. The first 200 people that use that link get 20% off the premium subscription. That's talkpython.fm/brilliant, or just click the link in the show notes. Another thing that I want to ask you about, on this sort of philosophical idea: you know, I mostly do web stuff with Python. So I run my online stuff, my online training company and various things, with Python. But obviously, I'm interested in all of it. When I look at the web world, I feel like once Python 3 really got fully adopted, or at least became the default thing to do, a whole bunch of older ideas were all of a sudden still relevant, but they became, you know, that's neat, but let's rethink this now that we can. So I'm thinking of things like FastAPI, where they use type annotations to mean stuff, or, you know, things like that, or we're gonna build this from the ground up with async and await, and so on. So there were just so many different web frameworks and other tools that came out of nowhere. And some of them are not really maintained anymore. But all these flowers bloomed, and it was just a really interesting thing to see. I don't have the same visibility on the data science side for that. Like, did that happen? And what was it like?

18:41 I think to a certain extent it did. I think probably larger is the widespread adoption of tools, and not necessarily the creation of new tooling, and the development of specific tooling as well. But there are several tools. I think the first one that really is so important is the Jupyter ecosystem. I referred to IPython notebooks, and I did that accidentally, but also intentionally, because they were IPython, old school, back then. Right. That was super cool. And I feel like people forget about IPython, but it's so important in the Jupyter ecosystem. But yes, since 3.6, we've seen, two years ago, like, JupyterLab 1.0, which I think brings the possibility of so many new people using the entire ecosystem. I always think JupyterLab does for Python what the RStudio IDE does for the R ecosystem: it reduces the barrier to entry for so many people. But then, as we've noted in other conversations, tools like Streamlit are really cool. And I think this is going to be the next evolution of data tooling: bridging the gap between local computation and iterative science in Python, and production science, right, and machine learning. So the ability to build machine learning products quite quickly is wonderful.

20:00 Yeah, Streamlit was definitely one of the things that came to mind for me. It's like, this is really a different take on how this works, but it's kind of a neat, modern way to do it. Exactly.

20:09 So for me, it wasn't so much the move to Python 3.6. I think the scientific stack, the PyData stack, actually adopted Python 3 like two or three years before the web did.

20:19 Yeah, they definitely had a head start there. That was great.

20:22 So I think that wasn't as transformational for us. We had a much more smooth transition. I think around the same time, though, there was this huge influx of new people and new actors in the system. You know, we shifted from being the sort of fringe scientific computing language to becoming the de facto standard language for ML and AI workloads. With the adoption of, like, TensorFlow and PyTorch, you've got, you know, Google and Facebook acting, you've got Nvidia in there now. And you have just, you know, a 10x in the user growth. That is really, I think, our change moment: the, like, TensorFlow engagement of companies, for better or for worse.

21:00 Yeah. What about enterprises and large corporations seeing the PyData stack as legitimately the thing that they could adopt over, I don't know, saying they have to do it with Java or something like that? It seems like that's changed as well.

21:13 Yeah, there's a huge acceptance of Python, in both in the data science machine learning space, but also in production, right as sort of Docker came up. And as Kubernetes came up, like Python became a thing you could easily deploy and depend upon. And that's new as of the last decade. And with that comes a bunch of money to be made, which again, brings in this like, other slew of actors in the space?

21:32 Yeah. both good and bad. Yeah,

21:34 I think this is something which started for the most part in tech, where we saw massive, massive wins in open source adoption, and then bled out slowly. I mean, finance is a great example, seeing how much places like JPMC and Capital One use Python now. But then we're seeing it in retail. I mean, one of the great use cases of Dask and RAPIDS and XGBoost is at Walmart. Going back to tech, there's the fact that Netflix runs, like, hundreds of thousands of Jupyter notebook batch jobs a day, or something like that. And they're hiring. I mean, they've got a bunch of Jupyter core devs. Right. Yeah,

22:07 their whole Papermill thing? Yeah. Yeah, that's really interesting. I guess there was the early start on Python 3, because a lot of the tooling was improving, and there wasn't as much legacy Python data science. I think that was part of the key to being able to just say, no, we're just going to start something new on this new set of tools, a new machine learning library, let's just use Python 3, as opposed to, well, we're still on this web framework, we can't change versions, and that thing depends on this, so we're stuck, you know, seven or eight years in the past. So yeah, pretty cool. I mean,

22:40 there was effort, but I think there's, oh yes, like, there was less production use of the data science stack than there was production use of the web stack. And that's also really hard to change.

22:49 I'd also add that data scientists require flexibility and nimbleness and agility in their tooling. The questions they're answering are changing so quickly, and the way these questions integrate into making business decisions, that interface, is changing so quickly, that the tooling is kind of moving alongside that. And there's this wonderful coevolution between the tooling and the techniques and the questions that we're required to answer.

23:12 Yeah, that's a good point. All right. So let's set the foundation for this project, this company that you all are working on. And that would be Dask. Matthew, want to kick us off? Give us the summary of Dask and all the cool things that it does. We did talk about it a while ago, but it was over a year ago, and probably not everyone heard it anyway.

23:31 Yeah, sure. So Dask is an open source Python library that was designed to parallelize other Python libraries. We first started Dask at Anaconda, and the goal was to parallelize NumPy and pandas and scikit-learn, just sort of the foundation of the scientific Python, PyData stack. What we found pretty quickly is that for about half the users we were targeting, that's exactly what they wanted. They had a big table or a big, you know, array of numbers, and they wanted a bigger version of those things with the same APIs and the same feel, and so Dask gave that to them. But about half of our users wanted to do something totally different. They wanted to use the internals of Dask to parallelize some other crazy thing that they were building. As we mentioned, in Python, people are building all sorts of new libraries, and they wanted to add a little parallelism, a little bit of scalability, into their library. Yeah. And the internals of Dask allow them to do that. So Dask is, at its core, a general-purpose library for parallel computing. You can think of it in the same way you think of the threading module or the concurrent.futures module, or anything like that in the standard library. But it runs at scale and gives you full scalability.
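
The comparison with the standard library can be made concrete. The snippet below is only a sketch using `concurrent.futures` from the standard library: work is submitted to a pool and the results are collected from futures. The function names `square` and `parallel_sum_of_squares` are made up for illustration. The distributed scheduler in Dask exposes a similar submit-and-collect style through futures, but nothing here is Dask code, and it runs on local threads only.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """A stand-in for any independent unit of work."""
    return x * x


def parallel_sum_of_squares(values, max_workers=4):
    """Submit one task per value, then combine the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each submit() returns a Future immediately; the work runs in the pool.
        futures = [pool.submit(square, v) for v in values]
        # result() blocks until that particular task has finished.
        return sum(f.result() for f in futures)


result = parallel_sum_of_squares(range(10))
print(result)
```

The point of the analogy in the conversation is that the same pattern, many small tasks handed to a pool of workers, is what gets scaled from one machine to a cluster.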

24:35 Right. When you say at scale, you mean you could have a cluster of 20 machines, or 200? Yeah. Something that looks like a pandas DataFrame, and you ask it to do computation, and that computation happens on all those machines. Exactly. Yeah. It's really smart about not moving data, because if you've got a lot of data, the movement of it around might actually be the slowest part, right? Yeah.

24:56 And like 20 things. Yeah. So we think very, very deeply about how to run complicated task graphs at scale. And that allows us to do things like big pandas. But also, you could use it in the same way you'd use Celery behind a web application. Or you could use it, you know, backing a big machine learning application. So Dask is, you know, a very flexible tool on which many people have built these sort of more special-purpose scheduler tools. You can think of it kind of like a tool that you would use to build Spark, or the tool that you would use to build, you know, a parallel Airflow, for example.
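
To give a feel for what a task graph is, here is a toy sketch in plain Python. It is loosely modeled on the dictionary-of-tuples style that Dask has documented for graphs, where a key maps either to a value or to a tuple of a function followed by its arguments. The evaluator `get` below is a hypothetical, sequential illustration written for this example; it is not Dask's scheduler and does nothing in parallel.

```python
from operator import add, mul

# Keys map either to plain data or to a task: (function, arg1, arg2, ...).
# Arguments that are themselves keys in the graph are dependencies.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "product": (mul, "sum", 10),
}


def get(graph, key, cache=None):
    """Evaluate one key by recursively evaluating its dependencies first."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    node = graph[key]
    if isinstance(node, tuple) and node and callable(node[0]):
        func, *args = node
        # Resolve arguments that name other keys; pass literals through.
        resolved = [
            get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        value = func(*resolved)
    else:
        value = node
    cache[key] = value
    return value


answer = get(graph, "product")
print(answer)
```

A real scheduler would look at the same structure, notice which tasks do not depend on each other, and run those at the same time on different workers.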

25:26 I'll just build on this briefly by quoting Matt, and I always quote him on this, so he's heard me quote him many times on this. This is a blog post he wrote called A Brief History of Dask, which you can find on coiled.io/blog. He speaks to three goals of the original idea that he and several other people at Anaconda were thinking about. There were two technical goals: the first was to harness the power of all the cores on your laptop in parallel, and the second was to support larger-than-memory computation. And these we know about. But he also mentions a social goal, which is to invent nothing, and I quote, we wanted to be as familiar as possible to what users already knew in the PyData stack. And I think as the PyData stack grew and garnered a lot of adoption, the fact that there was a social goal to invent nothing allowed pandas and NumPy users to then use it immediately. And Matt says, you know, Dask doesn't have an API, because it uses the API of the packages that it allows people to do distributed compute with.
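
The three goals can be illustrated with a small sketch, assuming both `numpy` and `dask` are installed. The array shape and chunk size are arbitrary values chosen for this example. The Dask array is split into blocks that can be processed on separate cores and need not all be in memory at once, while the calling code looks almost the same as the NumPy version.

```python
import numpy as np
import dask.array as da

# Plain NumPy: the whole array lives in memory at once.
x_np = np.ones((1000, 1000))
np_total = x_np.sum()

# Dask array: the same logical array, split into 250 x 250 blocks.
x_da = da.ones((1000, 1000), chunks=(250, 250))

# Same method name as NumPy; this only builds a lazy task graph.
lazy_total = x_da.sum()

# compute() runs the graph, block by block, and returns a concrete number.
da_total = lazy_total.compute()

print(np_total, da_total)
```

The only visible differences are the `chunks` argument and the explicit `compute()` call, which is roughly what "invent nothing" looks like from the user's side.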

26:20 yeah, that's a really good goal. Because it doesn't matter how amazing the thing you came up with, if you say, Okay, well, you used to talk to databases, but now use our special API that's like a database, but it's not exactly because it's better. And we have this graphing thing, and it's like Altair, but it's not really Altair, it's, it's just motivated by so you can just redo all your stuff, and it's gonna be better. And, you know, a lot of times it's like, you know, what, the tools that I have really work well, and I don't want to reinvent my world. Exactly. Yeah.

26:49 I call this the the principle of minimal creativity. I love it. It's like

26:53 productive. It's beautiful. And it's similar to the productive, like, laziness of developers or something like that. Right. Sure.

27:02 Yeah, that we thought was great creativity. You know, building a new wonderful thing is actually horrible. Like, you really don't want to impose your thoughts on others

27:11 building boring tool. Yeah,

27:12 right. Building boring tool. Yeah,

27:14 yeah. Otherwise known as getting stuff done, right.

27:21 This portion of Talk Python To Me is sponsored by monday.com. monday.com is an online platform that powers over 100,000 teams' daily work. It's an easy-to-use, flexible, and visual teamwork platform, beautifully designed to manage any team, organization, or online process. Now, most of us missed our chance to build the first apps ever in the mobile app stores. It was a once-in-a-lifetime opportunity. But it's one that's coming around again. monday.com is launching their marketplace and running a contest for the best new apps featured right from the get-go. Want to be one of the first in the monday.com Apps Marketplace? Start building today. They're even giving away $184,000 in prizes, including three Teslas, ten MacBooks, and more. Build your idea for an app and get in front of hundreds of thousands of users on day one. Start building today by visiting monday.com/python, or just click the link in your podcast player's show notes. My understanding is that Dask is used for all these different projects. And sometimes you'll find Dask being used, and you're not even really aware that Dask is somehow powering, maybe, its parallelism, or whatever. So maybe give us a couple of examples of places you found Dask being used that surprised you or that you're proud of.

28:37 Yeah, sure. I'll maybe talk about Pangeo first. I think I mentioned Pangeo at the end of our last podcast. So Pangeo is amazing. Pangeo was a collaboration of Earth scientists, so like climate scientists, meteorologists, oceanographers, and a bunch of open source software developers who teamed up to make a new software stack to solve a lot of climate science problems. We combined things like JupyterHub, dask, Kubernetes, and other libraries like xarray, Intake, all sorts of things. Yeah. And we just, like, revolutionized the way that that sector computes. There were decades-old software stacks that we just immediately showed were not nearly as powerful as what we were able to put together, both in terms of computation and in terms of accessibility. Yeah, it's at the point where, you know, if you are an undergraduate student in the Philippines, you can go and look at cloud-based data sets that are many terabytes large, and see, you know, what climate change will do with sea level rise, for example. That's something that we all had within like six to 12 months. It was amazing. And that's such an important problem that we need to solve. And so to know that some of the software you helped create is central to making that happen, that's really cool. Me and like 1000 other people.

29:48 Yeah, of course. Absolutely.

29:50 Yeah. But it was great. It's a good example of a bunch of different kinds of people all working together. If it was only computer scientists working together, we wouldn't have done it correctly. If it was only oceanographers working together, it wouldn't have worked either. It needed to be a collaboration,

30:02 right? Yeah, how many oceanographers are like, you know what, I'm gonna set up a Kubernetes cluster? So we can do automatic scale out, but then scale back down. You know what? No, no, I, I know.

30:12 You'd be surprised. They ended up having to learn those skill sets over time, which is partially why we made Coiled, to stop that from having to be the case. They're also sharp folks. People who are in that space are quite sharp. Another example might be, yeah, so RAPIDS. When we last talked, I had just joined NVIDIA. At the time, NVIDIA was building RAPIDS, a data science suite that is GPU accelerated. So they have like NumPy and pandas equivalents that all run on the GPU. It's a little bit like what dask does for parallelism. They're somewhat attempting to do that for the CUDA cores and GPU. Right. Yeah, exactly. So you know, dask allows you to go from pandas to parallel pandas. RAPIDS allows you to go from pandas to GPU-accelerated pandas, and RAPIDS and dask together let you go to GPU-accelerated pandas across a cluster of machines.

31:00 Yeah, that's a great way to put it together. Yeah,

31:02 a really great result we had, though, which was super surprising to me, was they did a benchmark, the TPCx-BB benchmark, where there's a bunch of sort of data science and business analytics queries. I think it was something like 40 times faster, and like seven times cheaper, than the next solution, which is like Spark or MapReduce on a bunch of Dell machines.

31:21 Yeah. And so in some sense, that's like 400x, right? Because it's, you know, 40x faster, and it's cheaper, right? Because normally, you think I can get a lot faster, but I spend a lot more money, or I can get cheaper, but it goes a lot slower. But if you get it to expand in both axes, that's awesome. Now, to be honest here, it's a more expensive machine. We're running for a shorter amount of time.

31:41 Yeah. So it's not 400x, it's generally 40x faster, and only seven x cheaper. Yeah, no, it's amazing. And what I love about that is that there are all of these sort of old guard companies building out solutions for this benchmark. You've got, oh, HP, and they're all using like monolithic software projects like Spark. And this benchmark was different. It was NVIDIA, and it was people at Coiled, and it was people at BlazingSQL, another startup company. There are a bunch of small startups and a bunch of small software projects that all collaborated together to just smash this benchmark, right? All the other companies were fighting over sort of 10%, 20% gains. And we come in with like a 40x speed improvement. And so, just sort of, yeah, we just wiped everyone out, which was great. Yeah,

32:26 that's a super cool project, RAPIDS. I mean, it seems like that's a pretty early stage project. And it's probably just gonna grow. Early stage, we're moving fast.

32:33 I mean, that's an example of a company investing in open source with a team of, you know, 50 people. It's impressive. Yeah, the last one I'll mention is napari, which is kind of like Pangeo, in that they are using dask to advance another field, in this case, biomedical imaging, but their approach is really different. A lot of lab bench scientists, people who are looking at microscopes, don't know how to program. And so we can't give them a Jupyter notebook that they can play with. Instead, they're used to sort of point-and-click interfaces. And so napari is an image viewer for large images with a point-and-click interface. You sort of pan and zoom around these, you know, a picture of a cell, a picture of some sort of cancer. The thing I love about this is that they use other parts of Python. They use the whole Qt side of Python. It's a Qt-based application, and they're not targeting data scientists, they're targeting actual scientists. And like, that's a great space for growth of Python in the future: to build applications.

33:26 Yeah, because it's one thing to say we built it for data scientists, who are kind of like Python programmers in general, but they have this special visualization and data handling skill set, as opposed to the person who took your x-rays, or whatever.

33:41 Yeah, exactly. And also, this is an order of magnitude growth that we can have.

33:45 Yeah. And also, to add to that, I spoke with several users of napari very recently, and the way it changes their approach to science and the scientific flow. I mean, they can do a bunch of imaging, and whereas previously they'd have to wait until the next day to check out the images, they can literally go away for 15 to 30 minutes, come back, check out the images, then plan the next experiment. So it changes the absolute flow state of scientific research and the rapid iterative cycle for them.

34:11 Yeah, that's really awesome. I mean, it's got to feel great to work on that project and just see, you know, it's making a true impact for like, everyday people. Yeah, that's cool. The one

other thing I'll add, which I think we hinted at before, is that dask is famously used for leveraging clusters and parallelization, but it also does it locally. So before you scale out to your cluster, or whatever, you can scale up to do out-of-core computing on larger data sets yourself. So it allows a lot of people to do more locally than they'd otherwise be able to, across all the types of questions we've been discussing, which is pretty cool.

34:47 Yeah, absolutely. Well, you know, the MacBook that I'm sitting here with, a MacBook Pro, is a Core i9 with six hyper-threaded cores. So basically, if I go and do Python stuff really hard on it, I get 1/12 of the CPU, yeah, consumption of what's available. And my gaming machine has, you know, 16 cores, right? It's like that stuff is just mostly idle in the Python world unless you find a way to take advantage of it. And so that actually really surprised me about dask. I thought of dask, you know, the way we sort of opened the conversation, as this way to take huge data that has to fit into a cluster, and not just onto my little laptop, and run it distributed in terms of distributed machines. But the fact that it also runs and sort of scales up and takes better advantage of your local hardware is pretty awesome. The median

35:31 cluster size is one, which is just a laptop. And so we optimized for that case. Yeah, yeah, that's really cool.
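
As a rough illustration of the "cluster of one laptop" case described here, the following is a minimal sketch using dask.array on a single machine. The array size and chunk shape are arbitrary choices for the example, not values from the conversation.

```python
import dask.array as da

# Build a lazy 4000 x 4000 array of uniform random numbers, split into
# 16 chunks of 1000 x 1000. Nothing is computed yet; dask only records
# a task graph describing the work.
x = da.random.random((4000, 4000), chunks=(1000, 1000))

# compute() executes the graph. By default dask arrays run on a local
# thread pool, so the chunks are processed in parallel on this machine.
mean_value = x.mean().compute()
print(f"mean = {mean_value:.4f}")
```

Because the work is expressed per chunk, the same pattern applies when the full array would not fit in memory at once, which is the out-of-core use mentioned above.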

35:37 I was gonna add, the other thing to note is the interoperability with a lot of other packages. So if you're using scikit-learn, and you want to parallelize your hyperparameter search, or whatever it is, you can use the, I think it's the n_jobs kwarg or something like that, and you can use dask in the back end there. Recently, I was so excited to see, Matt, we're doing these YouTube live streams weekly, with people using dask and a bunch of other stuff. So subscribe to our YouTube channel if you're interested. But marketing aside, we did one with somebody from Grubhub, Alex Egg, who used Snorkel for weak supervision, which is a clever way to label your data for supervised learning without actually hand labeling it. And I saw Matt discover a submodule of Snorkel called Snorkel dask. So seeing, you know, the whole ecosystem start to incorporate dask is really exciting, right?
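
A minimal sketch of the scikit-learn integration mentioned above, assuming scikit-learn, joblib, and dask.distributed are installed. The dataset, model, and parameter grid are made up for illustration; the dask-specific part is only the joblib backend switch.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Start a small in-process dask cluster (threads only) on this machine.
client = Client(processes=False)

# Synthetic data and a tiny hyperparameter grid, purely illustrative.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=3,
    n_jobs=-1,
)

# scikit-learn parallelizes through joblib; selecting the "dask" backend
# sends the individual fits to the dask workers instead of local processes.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

best_c = search.best_params_["C"]
print("best C:", best_c)
client.close()
```

Pointing the `Client` at a remote scheduler instead of a local one would run the same search on a cluster without changing the scikit-learn code.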

36:25 Yeah. So I had no idea that was integrated. It was like, oh, I just learned something today.

36:30 They're using this stuff. Great. That's really, really neat. And yeah, I'll put a link to your YouTube channel there so people can check it out. I think it's really interesting that people are doing stuff on Twitch and programming all these like live streaming places that I just, I never associated with programming for the longest time. It was always, you know, World of Warcraft, or some other sort of gaming thing that it was in a different world. It all starts

36:51 with gaming men gaming, the code,

36:56 switching my for profit hat just for a second, like now in COVID times, like marketing is completely the Wild West again, like, we got to figure out how to reach our audience. I used to go to conferences, just know everybody. Now, how do we do that? You know, twitch or live streaming is like a thing to play with? Probably not the right thing, or we're gonna find out and experimenting and finding out the right way to engage people today is actually a really fun and interesting problem for which there are no right answers yet.

37:21 Yeah, it's definitely it's the Wild West. And it's a fun time, right? Like you said, it's unknown. But it's really cool to be able to just explore all these different ideas

37:30 as you go. Marketing idea for you: let's get on all the cool Python podcasts and see if they can cross-market the Coiled thing.

37:37 I think that's a great idea. Let me just note that down.

37:39 Yeah. Great.

37:40 Yeah. Well, I don't really know which ones you should talk to you. But

37:45 Should we Talk Python, too?

37:48 All right, so let's take it over to the business side a little bit. We have dask, we talked about some of the things that it does. One of the interesting challenges people often run into when they have these successful projects is, how do I take this really amazing and powerful thing that I built, and allow me to keep working on it by somehow getting paid to do so? Right? And I think probably the first step is, we've seen a really big adoption of open source software in the enterprise. Like, it used to be, oh, what are you gonna be doing? Are you doing, you know, Java and Oracle? Or are you doing Microsoft? Like, which type of company are you at? Right? There are even, like, Gartner reports talking about how open source software has not just lowered the cost for enterprises, but actually increased the quality at the same time, which is like one of these double wins as well. How do you guys see it?

38:41 So I'll add one more piece of the puzzle to the question, in the sense that not only are we seeing a lot of people in enterprise adopting open source. I'm actually going to paraphrase Brian Granger, who said this a couple of years ago at JupyterCon: we're seeing a phase transition from a lot of people using open source individually in enterprise to large scale enterprise adoption of open source. So having individuals use it within an organization, I think, is something where the open source provides nearly all the value they need. But for an institution to actually adopt OSS at scale across an organization, there are other moving parts that need to be in place that I think commercial companies can solve for. So having said that, maybe Matt can talk about why we're starting Coiled.

39:27 Yeah, I love that Brian phrased it in phase transition language. That sounds exactly like Brian, who's a physicist who then made Jupyter notebooks. Yeah, the way I think of it is that open source software won, right? So like, no new companies are installing Oracle or SAS today. They're installing open source. But as we tossed out all of these sort of enterprise software companies and their enterprise software stacks, we accidentally tossed out all the things they sold us that weren't software, right? So like, Oracle would sell you the database, but also things like the service level agreement, the "how do we get help if we're stuck and it's not working," all that kind of stuff, right? The sort of training, the safety blanket of commercial software. Yeah, training, the safety blanket of enterprise support, but also, you know, hooking into enterprise auth systems, or network security, or sort of plugging all those things together, integrating with other technologies. And now that all these companies are adopting this open source stack, all of that stuff is missing. And so it is, again, the Wild West, and it's sort of an awkward place as a company to be adopting this technology. I was in a conversation once with folks at NASA, and they were saying, hey, look, what hosted notebook solution should we use? At the time there was Domino Data Lab, Anaconda Enterprise was selling something, there was AWS SageMaker. I said, hey, have you considered just using JupyterHub? It's actually a really pleasant experience. And it's free. And it's easy to use. And he said, no, we need to buy something. Like, we're not going to manage JupyterHub, we need to buy something, because we're

40:55 and the open source thought is, it's just free, why don't you just take it? It's not so hard. But of course, for them, they don't want another puppy, right? Exactly. A thing they've got to take care of and walk even in the rain.

41:07 Yeah. And so I think filling in those gaps, and providing all that sort of supporting infrastructure, which is both technological and cultural infrastructure, is a great place where for-profit companies can augment the open source software stacks that we have spent all this time building, and that have really outcompeted the proprietary software stacks. But of course, doing those two together requires a lot of nuance. I think we haven't yet figured out the right models to be a good actor in the open source community, which we love and which is highly principled, and also build a successful business. Yeah, and that's fun. And that's what we're excited about experimenting with.

41:41 Well, and I think it's about time, right? Like, the fact that these companies, for lack of a better word, they've started to feel the pain. Like, they've decided they want to take on open source, they see the value, and they're gonna move there, but they're like, there are still these problems, or these new problems, that we don't know how to solve. And so I think there are opportunities for people who are knowledgeable in open source stuff to come along and help give them the support they need, so that they can go to the CTO or whoever and say, we can build on this. And here's how we're going to get the support that we need. Not, you know, Sarah's pretty good with dask, she could probably fork it and fix it.

42:21 Well, they've done that for a while. There were problems. Yes. But to be clear, though, I think that isn't on them. You said it's about time, so there's sort of blame. But I think that blame is on us. It's on the open source community. Like, we needed to learn how to speak corporate. And we're learning that. We're learning how to write legal agreements that they can actually sign. We're learning how to give them institutions that they can engage with on a peer to peer basis, right? You know, Ford can't sign a contract with dask. dask is not an entity that you can sue if they break your company. Coiled you can sue. Like, great, they can sign a contract with Coiled. But I think that's really on us. Our community needs to figure out how to build more of those institutions to engage better with companies, and eventually pull money out of them to feed back into the open source innovation that we've all been building so successfully over the last decade.

43:08 I think another part of the value commercial entities like Coiled can add is, as Matt mentioned, the PyData ecosystem is for the most part developed by a huge number of scientists and people who were answering scientific questions themselves. So the PyData ecosystem is very good at helping scientists and data scientists answer scientific, research, or industry based questions. It doesn't necessarily solve for all the cultural challenges that occur within organizations and the enterprise. Now, actually, this isn't as well thought out as I hope it will be in the future. But these types of cultural concerns, I think, are about making sure everyone across an organization is happy, such as IT, checking all the boxes from IT that Matt mentioned before, like network security or usage controls, and making sure management is on board with everything that's happening. So advanced telemetry for management, so that they can enable collaboration and make sure they're aware of costs and usage and that type of stuff. So not necessarily solving only for the nodes of the network of an organization, but solving for the edges as well, I think, is something key that entities like Coiled can help solve.

44:11 Yeah, absolutely. I definitely think that the donate button is not the right answer, right? In the corporate world, making a donation, it's like, it doesn't even make sense. Like, how are the shareholders gaining value from this donation or whatever,

44:25 it just doesn't even make sense. You can get pizza money pretty easily. Yeah.

44:30 That's hard. But yeah, having a career or a business around it is really tricky. So I think you're right that speaking corporate while keeping the ethos of open source is pretty important. So it sounds to me like you guys decided, I'm gonna guess, November to December time, you decided there's this gap. And you guys can start a company, which you founded in February, Coiled, to just start addressing some of these problems in the data science distributed computing world. People doing dask stuff. That's right. All right. Yeah. So tell us, what is Coiled? What slice of this enterprise challenge or corporate challenge have you taken on?

45:10 Yeah, so Coiled is a for-profit company, based around scaling Python, or sort of based around dask, whose mission is to make computing accessible, right? So we care very deeply about enabling individuals to scale out comfortably, and also extending those capabilities into corporations. We interacted with hundreds of companies who are using dask, and all had more or less the same needs. I mentioned them before, all the needs that you have around adopting open source software: training, support, deployment problems, security, auth, IT issues. We were more or less begged by enough companies to make this thing that we decided to make it.

45:44 Yeah, that's a good place to be

45:46 data scientists, as I mentioned before, require a lot of nimbleness and flexibility in the tools they use. And at the moment, they need to do all types of crazy stuff. Let's say they want to build something at scale. They need to, as we mentioned before, figure out Kubernetes and Docker containers, and then like battle with AWS and all of this stuff. So do a whole bunch of DevOps-y stuff, right, that they shouldn't be doing. Yeah, it probably makes their managers nervous, because, oh, yeah, all we had to do to make this work is we created the Kubernetes cluster, and then we put the data into this S3 bucket, and we're able to just talk to it, it was great. Like, wait, wait, wait, wait,

46:22 what's the access level? Or what's the public versus, you know, locked-down state of that S3 bucket? Are we going to end up on the news next week because of this?

46:32 There's a story around Pangeo. I mentioned Pangeo before, this Earth science collaboration, climate science group. That was actually a really eye opening experience for me, because I really started to understand what enterprise software looked like. So what we did, I was giving a talk at, like, the American Meteorological Society. And over the weekend, a bunch of us, you know, hacked this thing together. It was JupyterHub on the cloud with dask enabled. And so we computed on this big data set, and then we showed that everyone in the room could do the exact same computation on the cloud. So I did this fantastic thing on stage. And then we said, hey, everyone open your laptops, you can do the same thing, too. And that blew up. I think within about six months, we went from, you know, three groups, Columbia University, NCAR, and Anaconda, to about 50 groups, right? So everybody was excited about this. We sort of revolutionized how we do Earth system science. But then that's when the nightmare started, right? Because suddenly we have all these groups trying to use this public access thing. They just figured out, like, hey, wait a minute, this is costing us money. So we had to implement user access controls. Okay, who's whitelisted on this thing? Who can use it? You know, suddenly, there were different groups on different continents who wanted to use different regions on the cloud. Now we have to have not one Kubernetes cluster, but, you know, a dozen Kubernetes clusters. Everyone wanted their own software to be installed. You know, the oceanographers have a different software stack from the satellite imagery people, it turns out. And so they're all asking me, hey, can you just pip install this one more package, please? And eventually, you know, you can't add all those packages together. So now we had to make, you know, software environment switchers. It turns out that we couldn't figure out how to do AWS credentials.
So people couldn't store their data on the cloud. It was read-only: you could read any data, but you couldn't then transform it and store it into your own buckets. And so, figuring out auth and passing it around. And then security: it was open for an embarrassing amount of time. A few Bitcoin folks showed up. With all of those problems, eventually I stepped back. And as you sort of predicted, all the oceanographers started taking over Kubernetes. And that was what happened. And that experience really showed me personally all of the challenges around doing this kind of science, data science at scale, both from a technical perspective, but also from a social, cultural perspective. There's a ton more than just, like, algorithms work to do.

48:44 Yeah, it seems so easy just to set it up. And then, like, just one more thing, and one more thing. And eventually, yeah, you've got oceanographers doing Kubernetes, which is fine, they can do it. But you shouldn't have to understand how your car works just to go get groceries, right? Or to go to work. Right. If that's not your intent, it should just get out of the way.

49:00 I mean, some of them did become auto mechanics as a result, which is a fine Career

49:04 Career change. Yeah, absolutely. If you want it. But if you're trying to do something else, you don't want to work on your car, you want to get there. Awesome. Okay, so tell us exactly, like, what is it that Coiled is doing? So we can sort of understand where it fits in trying to solve this enterprise open source business story. If I want to be your customer, what can I do? Is it a SaaS service? Is it like hosted notebooks? What do I do?

49:27 Yeah, so Coiled sells many things. Mostly, we focus on the Coiled product, which is a hosted dask solution. So we solve all of the infrastructure problems around hosting dask. So if you're on our beta, for example, I actually invited you just half an hour ago, like

49:44 yeah, and I'm sorry, I didn't get to it quick enough to sign up and play. I wish I, so

49:48 if you did, you could pip install coiled wherever you run Python. It could be in a Python session, could be a Jupyter notebook, could be on the cloud. And then you would ask for a cluster. You'd say coiled.Cluster, and it then would allocate a bunch of machines for you on the cloud somewhere. You would then connect to those machines in a super secure, super potentialized, very nice way. And you'd be off to the races. So if you started on our beta, we sort of tried to minimize the time from signing up to running computations on the cloud. Currently, the number is around two or three minutes. We can get it down to about one minute, I think. So

50:21 just enough time to make a cup of coffee,

50:22 just enough time to make a cup of coffee. Yeah,

50:24 yeah, if you have a Keurig, with the pre made filter thing, if you got to get the coffee out, maybe it won't be.

50:30 So from your perspective, it just looks like dask. We have Kubernetes, but there's lots of other stuff around that that we manage. So first, managed dask clusters. Second, customizable software environments: you want to use your favorite software libraries, and we're gonna help you manage that between your local environment and also in Docker images on the cloud. We'll help you share that with your colleagues. So often, we find there's sort of one person who manages software that they share with other folks. And then third, user management, cost management, and telemetry. So Coiled will help you understand what's going on, or more likely will help your boss understand what you're doing and make sure that you don't break the bank. Yeah,

51:07 I've definitely gotten some very high cloud computing bills before, although in my defense, I said this is going to be very expensive, we should use this other service. No, no, we have a contract and agreement with this cloud service, so we're going to use it. Then I got a message: why did we spend $15,000 in that last month? Yeah. Because you told me to use it. It would have been 500 over there. But you know, the surprises are unwelcome.

51:30 Yeah, so just elevating that to visibility, right? There's a page which has all the running clusters. It shows how much they're costing per hour and how much they cost in total. You can aggregate those costs across users. You know, you can set policies. Hey, I mean, by default, if you leave a cluster running, which is a super common mistake, we'll shut it off after 20 minutes for you. So there are lots of things that you can do that save you a ton of money if you're doing them correctly. So Coiled is just our way of running Python at scale in the cloud, in a way that's a bit opinionated. It is the way that I find to be the best way to run

52:00 to run data. And it also allows you to be opinionated as a user. You can do it from the comfort of wherever you do your data science, right? So if you like doing it in JupyterLab, or Jupyter notebooks, or IPython, you can do it all from there. So we're trying to meet end users where they are. So we will spin up all the AWS stuff for you, and you don't need to go there. And I think one of the biggest pain points, which Matt hit the nail on the head with, is making sure that you can move seamlessly from local computation to cluster computation, where your environments and your data are all matched up, which is super exciting. Maybe I'm doing some work in Jupyter, working with dask to try to take advantage of the 16 cores that my machine has or whatever. And at some point, I decide, we're going to give it more data, we're going to productionize

52:46 it and it's going to actually need to be faster or work with more data. I can basically switch where I'm talking to. I can import coiled and ask for a cluster and go here. And that's pretty much what that switch looks like, yeah,

52:59 you can run Coiled from anywhere you can run Python. In that story, you're actually probably not starting with dask, you're probably starting with pandas on a very small sample of data. In that same notebook and that same process you say, hey, wait, it's time to scale up: import dask, use dask locally, you know, use your 16 cores. Then you say, well, actually this is a little slow, I want to kick it up a notch, I want to go to the cloud, or go to your local Kubernetes cluster, whatever. And then you import coiled, push a button, and now you're running your computation elsewhere on some remote cluster of other machines, all in the same user flow. You mentioned before, like, hey, are you hosting notebooks? And actually, intentionally, we are not. We do not want to own the user's environment, because your user environment is going to look so much different from everybody else's. Maybe you don't use Jupyter, maybe you use PyCharm or VS Code. Maybe you're actually just running this in a cron job. We don't want to own everything. We just want to sprinkle robust, secure parallelism and scalability into your existing workflows. Right. So this is again sort of the ethos of invent nothing, and being minimal, minimal invention. I don't want

54:00 no, minimal. Minimal creativity. Yeah.

54:02 The principle of minimal creativity. Yeah. So you don't have to do very much to switch to Yeah,

54:08 it's a small change. So you talk about the different places. I could be using JupyterLab, which would be cool. I could be using VS Code or PyCharm, that would be cool. Another place that seems like it might be really interesting to use this is, I'm in a FastAPI back end, and I want to answer quickly, with a lot of data. Right? I guess what I'm getting at is, like, production? Yeah, absolutely. You can use Coiled from anywhere you can use Python. What's the story with data? So you know, one of the reasons I might switch away from just running pandas or dask would be that I have a lot of data at the same time. How do I get the data to you guys, to your clusters? Because that, like I said, could be the slow thing, and moving it around a cluster, if you've got to, like, get that out of one data center to another, could even make it worse.

54:52 Yeah, what we often find is that if you have, you know, gigabytes of data on your machine, it's more likely that you've already downloaded the data from somewhere else. You had your data on the cloud or in your local data center, and it would have been better just not to download it. And so Coiled really helps you run your computations in that remote space very comfortably. So you know, if you have that on AWS S3, you know, there's no reason to download it. Open up Coiled, and you can just run on the data that's there, right? Get an authenticated URL, like a signed URL over to that S3 bucket or something? I mean, you already have the URL, right? Yeah, I mean, what we do is we scrape your local AWS credentials, generate a security token, and pass that to all the Coiled workers. So those Coiled workers look like you. You then can operate on that data as though you were sitting there. You have the ergonomic experience of being on your laptop, but you have the computational experience of being on the cloud. And it feels much more proximate. Yeah.

55:42 Okay, that's really neat. Where do these Kubernetes clusters that run my desk stuff? Where do they live?

55:49 Yeah, so there's two answers to that. One is you might have your own set of machines on prem, and you want to run Coiled there. This is actually really common today with, like, GPU clusters. There's really a lot of people buying GPU clusters, and they don't really know how to handle them. Coiled is a great way for that. But if you're on AWS... so we currently support AWS and Kubernetes, going to other clouds whenever that makes sense for us. If you're on AWS, so they're running on the cloud, we actually don't use Kubernetes. We chose to use Elastic Container Service, ECS, which is like an older, slightly simpler system. They can be running on our account if you want to use our hosted thing, or we can deploy it inside of your AWS account. Okay.

56:25 Yeah. And I guess if it deploys on your AWS account, you could say, I'd like this to run in the Bahrain data center, or the Sydney one

56:32 or wherever. Sure, I mean, we can also do that anywhere you can run, like, our public version of Coiled. So we host a public version, which is mostly for individuals, to increase that access. And that can run on any region. So because we're using systems like ECS, we don't have to maintain Kubernetes clusters everywhere. We can run your code anywhere AWS is running. Can you

56:53 also run against an arbitrary Kubernetes cluster? So like, if I have set up a Kubernetes cluster at Linode, and I give you the connection and credentials to the Kubernetes cluster, could I say run it there? Yeah, you'd

57:03 be wanting you would want to run coiled very close to that system. So we have access everywhere.

57:08 But yes, yeah. Okay. That sounds quite neat. It's fairly close conceptually, to an infrastructure as a service, sort of model in terms of like, what I'm buying what I'm getting what I'm paying for, but specifically focused on helping data scientists not actually care about the infrastructure. I'm asking them like, what do I pay for? Right? Do I pay for like the CPU cycle computations? Sort of like lambda, like the function call? Do I that is running?

57:34 So that is a super sick question for which, honestly, I don't have a good answer yet. We are, like, talking about this every week right now. Right now, either it's free for individuals, since we're currently in beta, or, like, big companies are signing us some check with some other pricing involved. Yeah, I actually recently wrote a blog post on cloud pricing, which I think is broken in lots of ways. Like, the obvious thing to do is to charge you a surcharge on top of whatever your cloud provider charges you. If Amazon charges you five bucks, we'll charge you 100 bucks, too. That's kind of like the Databricks model. Databricks charges about a 100% markup, AWS has about, like, a 40 or 50% markup. I actually hate this model, because it totally misaligns incentives. You know, if we do this, then I'm incentivized to make your Dask workload as inefficient as possible and use as many resources as possible. You know, reserved instances or spot instances? We don't need any of that. Give me the most expensive flavor of a through a large three, or whatever the machine name is, right? Like, that would be the incentive, which is actually the exact opposite of what you would hope. It's like if Toyota or Ford were incentivized to make you burn more gas, right? It is like the car companies got a kickback from all the oil companies for burning gas. And you'd never get the Prius in that situation, or the Tesla. Yeah, and we want to make the Tesla, right? We want to make the super sexy, super ergonomic experience that's highly efficient and saves you money in the long run. So we're still thinking about that. If your listeners have thoughts on the right pricing model, we'd love to hear from you at hello@coiled.io.
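The incentive problem with a percentage surcharge can be seen with a little arithmetic. The sketch below is illustrative only: the function names and dollar amounts are made up, and the 100% rate simply echoes the rough figure mentioned in the conversation, not any vendor's actual price list.

```python
def vendor_revenue(cloud_cost: float, markup_rate: float) -> float:
    """Vendor income under a percentage surcharge on the customer's cloud bill."""
    return cloud_cost * markup_rate


def customer_total(cloud_cost: float, markup_rate: float) -> float:
    """What the customer pays: the cloud bill plus the vendor's surcharge."""
    return cloud_cost + vendor_revenue(cloud_cost, markup_rate)


MARKUP = 1.0  # a 100% markup, used here purely as an example rate

# The same workload, run efficiently versus wastefully (made-up cloud bills).
efficient_cloud_cost = 100.0
wasteful_cloud_cost = 300.0

efficient_revenue = vendor_revenue(efficient_cloud_cost, MARKUP)
wasteful_revenue = vendor_revenue(wasteful_cloud_cost, MARKUP)

# Under this model the vendor earns more when the customer wastes more.
print(f"vendor revenue, efficient run: ${efficient_revenue:.2f}")
print(f"vendor revenue, wasteful run:  ${wasteful_revenue:.2f}")
print(f"customer pays, efficient run:  ${customer_total(efficient_cloud_cost, MARKUP):.2f}")
print(f"customer pays, wasteful run:   ${customer_total(wasteful_cloud_cost, MARKUP):.2f}")
```

Because revenue is proportional to the cloud bill, anything that makes the workload cheaper for the customer also shrinks the vendor's income, which is the misalignment described above.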

59:00 It is tricky. I don't know. I mean, I was just doing this new project with the video players for my online courses. So there's a whole, like, custom player, just HTML5 and JavaScript, Python. And I want, when you hover over the little scrubber to change the time, for it to just show you a preview of each thing. And I thought, okay, well, what I can do is I can re-encode all the videos really small, and then put in a little hidden player and just set the time and move that around. It all works beautifully. But I've got 200 hours of high-def video that I now need to turn into that. And so I literally sat down and said, okay, well, how much is it going to cost me to feed that through Elastic Transcoder at AWS, and bandwidth, and get it back, so I actually have those pictures I could show? That's gonna cost about $500 or $600. All right, this feature, having the data to back this feature, is worth $600 to me. I'll do it. But I just think this thinking about how do I pay for a thing, and the trade-offs, is really interesting.

59:56 Yeah, no, I'm putting on my sales hat for a second. Like, that is a very clear win. Like, you need video, it will give you this much money in sales. Great, it makes sense. When you're talking about accelerating data science in a company, the value they're providing is so far removed. Yeah. So like, if I was going to sell you an efficient refrigerator, I would say, hey, look, it's gonna save you, you know, $200 in electricity bills over the year; over 10 years, that's $2,000. Great. Pay me $1,000. That's a very clear... yeah. Plus, yeah, but when it's like, oh, yeah, I'm gonna help you scale Python, that's not quite as clearly valuable. We need to get into a longer conversation of, like, well, why do you care about Python? Why do you care about data science? How much value does that bring to you? What, actually, are you trying to solve for? Oh, you're trying to increase, you know, retention in your customers? Like, you have to get a lot more in depth with your customers to figure out what the value proposition is of, like, more ergonomic scalable Python. You get into things like, well,

01:00:53 what if we could get your recommender system, the next version, into production a month sooner? Yeah. How much money would that get you? And what if we could get you answers 100 milliseconds quicker? How many fewer bouncing users would you have leaving your site? Like, these are... they're meaningful, but they're not as clear cut. See, they're interesting discussions, I think. I think

01:01:15 you've raised several important concerns there. And one in particular is, like, a whole bunch of companies who have hired lots of data scientists, because that's what you do these days, and they actually haven't seen the value delivered. They've got data scientists who are getting, you know, really serious salaries. And yeah, the data scientist will recognize the value of Coiled, but the economic buyer in the organization will be like, wait a second, we're paying you all this money, why are we paying more for tooling? We thought you used all this OSS stuff. So there's a slight mismatch there as well, which is something culturally that we're working to figure out as well.

01:01:46 Well, I think, just putting on my, like, "I used to work at Big Company X" or whatever thinking hat, you know, from the people who make the decisions, there's also avoiding security leaks and downtime. And, you know, there's not just always the positive selling point or the positive advantages. There's the, you know, you're avoiding these three really bad potential outcomes by productizing your data science, and you can do it quicker. Right? So these are also valuable. Yeah,

01:02:15 the things that we've seen that work really well so far are what you just said. I mean, actually, the biggest one is: Databricks is expensive. And, like, they're expensive not in a good way. They're expensive because we leave machines on all day. If it was a better experience, that wouldn't happen. That's what we want, we want to reduce our costs. Must feel

01:02:30 really bad to just waste money, right? to literally just leave it running. And not actually even because I it just took a while, right?

01:02:37 There's what he was talking about. Like, a common theme we hear about is: I've got this data science team. They use pandas and Python and PyTorch on a single machine. I've got, like, the scalable data engineering team. And there's a lot of crosstalk between them that's really slow. And we want to be able to allow our data science team to go directly to scale without having to interface with a different team, because that communication, just that iteration cycle, is just killing our performance. Yeah, another common sort of theme is: oops, we bought a bunch of hardware. We bought, like, you know, 50 GPU boxes, and we have no idea how to use them well. Like, there's only so many TensorFlow Keras runs we can do. We've seen, like, the Dask RAPIDS XGBoost thing, we want that, but we just have, like, a bunch of hardware sitting in a rack somewhere. Can you sell us a product to manage that rack for us, and give us a lot of utilization, a lot of value out of our pre-existing purchase? And those are the things that really tend to hit home. Boom.

01:03:31 Well, I do hope that you guys have a lot of success. Because it, I'm always excited to see some interesting company taking a really powerful open source project, and just adding value to that space and being successful. So it seems like you're on a good track. Thanks.

01:03:47 Yeah, we've been on the track for a while. I mean, Dask is sort of unique among open source projects in the Python space. We've always had really good funding, but there are a lot of funded Dask developers out there, because we've always been sort of scrambling a little bit to get money. And this is maybe just the next step in that process. Yeah.

01:04:02 Cool. We're going a little long in time. But I did want to just ask you really quickly some meta questions. So are you guys using Coiled to build Coiled? Is there, like, what's the Python inside story? The Coiled, Dask inside story.

01:04:17 Right now we're not. That actually did come up recently. We actually find ourselves building a lot of conda environments for our users. And that is actually something where we've got a web back end, we've got the scalability need, and Dask is like an obvious way to scale it out. We're currently using a single machine for that, we're not that big, but that would be the next step, as a build farm. And you

01:04:36 guys are using Django. Is that right? Sort of the API layer?

01:04:39 Yeah. So Coiled actually looks kind of like a normal vanilla Python web application. It's, you know, Django and Postgres and, you know, Amazon ECS, as we mentioned, and it's been great. It's actually my first time using Django, like a decade or two late, but it's been fantastic seeing how much is ready for us out of the box. So yeah,

01:04:58 couldn't be happier. Yeah. That's great. It sounds like there's a non-trivial team size. I know a lot of people are seeing it and possibly looking for remote work. Is that a thing that you guys are doing? Are you looking for more people? Are you kind of steady state until you get a little farther? What's the status there?

01:05:13 Yeah, so Coiled is a fully remote company. We were actually born in the era of COVID, which is fun, but we've always been remote. We've always worked remotely. Yeah, we're maybe, like, five or six full time, and two or three part time right now. Right now, I think we're actually kind of liking staying small. We raised a bunch of money, but we're sort of not in a hurry to burn it. By the time this episode airs, I wouldn't be surprised if we're looking for more folks. If you go to coiled.io/jobs, I think we'll just call that out, we'll have active job listings. Yeah. And that's both data science folks and web folks and DevOps people. It's a nice group of needs.

01:05:47 Excellent. Well, like I said, really interesting project. And I hope you guys make a lot of progress with it. Now, before you get out of here. Let me ask you the final two questions. If you're gonna write some Python code, what editor do you use?

01:05:57 Do you really want to go first?

01:05:58 I use JupyterLab these days. Right on. Matthew?

01:06:01 So if I write Python code, I'm actually going to give you a cheeky answer. I'll say GitHub. So most of the Python code that I write today is indirect. I mostly try to nerd snipe other people into writing code. So maybe this is, like, the community maintainer's answer.

01:06:13 Yeah, GitHub. And I'll also put in a plug for Whereby as well, which is my favorite video conferencing tool. Super, okay. And a notable PyPI package? We've already got pip install dask, pip install coiled, right? Those are good ones. But you know, something like, oh, this is really cool, it's not super well known, maybe you should check out

01:06:34 X. I'll point to napari, which I mentioned earlier. Yeah, this sort of image viewer. It's really cool. It's actually fun seeing a rich client application written in Python. I didn't know that was cool anymore.

01:06:43 Oh, I'll point to Snorkel, which I mentioned earlier, which is for weak supervision. And it helps you build training data. I think training data is one of the biggest challenges we have in the ML space. Like, I think Scale AI raised 100 million dollars last year or something like that. And they still hand-label data for people. Like, they're called Scale AI, but they hand-label data. ML to me. Yeah, Mechanical Turk is still, like, a huge thing, right? So anyways, if we can figure out how to do this programmatically, that is really cool. And the Snorkel API allows domain experts to come in and write, using basic functions and decorators, heuristics for labeling data, which is super cool. That's what I'll shout out today.
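To make "functions and decorators as heuristics for labeling data" concrete, here is a minimal plain-Python sketch of the weak supervision idea. It deliberately does not use Snorkel itself: the decorator, the label constants, the three heuristics, and the majority vote are all hypothetical stand-ins, and Snorkel's own decorator and label model are more sophisticated than this.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1
LABELING_FUNCTIONS = []


def labeling_function(fn):
    """Register a heuristic that returns a label or ABSTAIN for one example."""
    LABELING_FUNCTIONS.append(fn)
    return fn


@labeling_function
def lf_contains_link(text: str) -> int:
    return SPAM if "http" in text.lower() else ABSTAIN


@labeling_function
def lf_mentions_prize(text: str) -> int:
    return SPAM if "prize" in text.lower() else ABSTAIN


@labeling_function
def lf_short_message(text: str) -> int:
    return HAM if len(text.split()) <= 4 else ABSTAIN


def weak_label(text: str) -> int:
    """Combine the heuristics by majority vote; abstain on no votes or a tie."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


for message in [
    "Claim your prize at http://example.com now",
    "see you at lunch",
    "The quarterly report is attached for your review",
]:
    print(weak_label(message), message)
```

Each heuristic is cheap for a domain expert to write and is allowed to be noisy or to abstain. The combined votes give approximate training labels without hand-labeling every example.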

01:07:23 All right, super cool. I guess I'll throw in one as well: missingno, missing N-O, which is a quick visualization for missing data. In, like, pandas and stuff like that, ask what's missing and you'll get a cool little graph, like, usually if the state is missing. Also, I don't know. Bowen is missing, like all sorts of cool stuff. So people can check that out as well. All right, so final call to action. People are interested in Dask, they're interested in Coiled, they want to parallelize some things. What do you guys tell them?

01:07:50 I would love... so I've got a call to action for Coiled, Matt may have one for Dask. All the features we've just described of Coiled are in rapid development. And we'd love to involve anybody who likes to break things at scale in this development cycle. So go to coiled.io, sign up for our beta, and take our product for a test drive. Crash it, and let's figure out how to build it together. So I'd love to involve you in that conversation. Excellent.

01:08:12 Matthew, how do you go with that one? I'm actually gonna... sorry, I just want to mention one thing on that. It is super fun to run a beta. As an open source software developer, I'm used to never seeing how people use my software. And running a beta on, like, a public service, there's a bunch of creepy spyware I can put on our website, which tells me exactly what you've done, which, you know, we'll take off at some point. It is amazing getting this kind of immediate feedback. Yeah, a different

01:08:39 level of visibility right into fuel use and telemetry and whatnot. Right? Yeah.

01:08:44 So yeah, I'll just echo Hugo: go play with the beta. It's super fun, mostly for us, but also for you. And yeah, you can play with that on the cloud. It's cool. Is there a free thing I can do? Everything is free right now. Yeah, yeah. Well, like, we'll limit how much you can spend, because that's one of the features that we care about. Yeah, I think we just, like, limit you to, like, 100 cores at a time. So you can't go totally wild. But you can do lots of fun things. Just

01:09:05 100 cores. That's awesome. Just Of course. Yeah. That's really cool. All right. Well, you guys. Thanks again for being on the show. Good luck with coiled it's it's a nice natural extension of dask. I think so excited to see you doing it.

01:09:19 Yeah. Thanks, Michael. Go sign up for the beta.

01:09:21 Thanks so much for having us on the show. Michael. It's always fun. Yeah, you bet.

01:09:25 Bye, guys. This has been another episode of Talk Python to Me. Our guests on this episode were Matthew Rocklin and Hugo Bowne-Anderson. It's been brought to you by brilliant.org and monday.com. Brilliant.org encourages you to level up your analytical skills and knowledge. Visit talkpython.fm/brilliant and get Brilliant Premium to learn something new every day. Build your idea for an app and get it in front of hundreds of thousands of users on day one. Start building today at the monday.com marketplace by visiting monday.com/python. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course. Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our everything bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Get out there and write some Python code.
