Dask as a Platform Service with Coiled

If you're into data science, you've probably heard about Dask. It's a package that feels like familiar APIs such as NumPy, pandas, and scikit-learn. Yet it can scale that computation across CPU cores on your local machine all the way to distributed grid-based computing in large clusters.

While powerful, this may take some serious setup to execute in its full glory. That's why Matthew Rocklin has teamed up with Hugo Bowne-Anderson and others to launch a business to help Python-loving data scientists run Dask workloads in the cloud. And they are here to tell us about their open-source foundational business.

And they must be on to something: between recording and releasing this episode, they raised $5M in VC funding.

Episode Deep Dive

Guests introduction and background

Matthew Rocklin and Hugo Bowne-Anderson are seasoned data scientists and Python developers deeply involved in the PyData ecosystem. Matthew is the original creator of Dask, a powerful library for parallel and distributed computing in Python. Hugo has extensive experience in education and data science communication, and has helped popularize Python for analytics and data science at scale. Together, they co-founded Coiled to provide a straightforward platform that brings Dask’s power to organizations and individuals without the heavy lifting of configuring infrastructure and cluster management.

What to Know If You're New to Python

If you’re just getting started with Python, here are a few points to help you get more out of this Dask episode:

Basic familiarity with libraries like pandas (especially DataFrame-style data) will help you follow along with how Dask extends familiar APIs.
Understanding the concept of parallel computing (running tasks simultaneously) and why it’s beneficial for large datasets or CPU-bound tasks will be important.
Knowing how to install and import Python packages with pip install or conda install will help you get started with Dask or Coiled quickly.
A sense of using cloud services (e.g., AWS S3) or at least reading large files from disk can give you insight into how Dask handles out-of-memory data.
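
As a rough illustration of the out-of-memory idea in the last point, here is a minimal standard-library sketch that computes a mean one chunk at a time, so the full dataset never has to sit in memory at once. It is not Dask code and not a description of how Dask is implemented; the helper names `read_in_chunks` and `chunked_mean` are hypothetical, and the generator stands in for a file too large to load whole.

```python
from typing import Iterable, Iterator, List


def read_in_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most chunk_size items (hypothetical helper)."""
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        # Emit the final, possibly shorter, chunk.
        yield chunk


def chunked_mean(values: Iterable[float], chunk_size: int = 1_000) -> float:
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in read_in_chunks(values, chunk_size):
        # Each chunk contributes a partial sum and a partial count.
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("cannot take the mean of an empty dataset")
    return total / count


# A generator stands in for data that is streamed rather than loaded whole.
data_stream = (float(i) for i in range(10_000))
mean_value = chunked_mean(data_stream, chunk_size=1_000)
```

Libraries such as Dask apply the same split-then-combine idea to whole DataFrames and arrays, and can also run the per-chunk work in parallel.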

Key points and takeaways

Dask as the Central Theme: Dask is a Python library for parallel and distributed computing, designed to scale computations from single machines to large clusters. It offers familiar APIs (like those of pandas or NumPy) but works with bigger-than-memory datasets and multiple CPU cores or machines.
- Links / Tools:
  - Dask
  - pandas
The Founding of Coiled: Coiled was formed to address the enterprise and infrastructure challenges data scientists often face when they try to scale up Python-based workflows with Dask. Rather than reinventing solutions for environments, security, and cluster management, Coiled provides a “no-devops-required” way of spinning up Dask clusters in the cloud or on-prem.
- Links / Tools:
  - Coiled
Bridging the Gap Between Local Development and the Cloud: One of Coiled’s main selling points is that you can prototype on your laptop using Python and pandas, then seamlessly switch to a remote Dask cluster to handle larger-scale data. The user-friendly approach removes the friction of learning Kubernetes or other orchestration platforms for distributed workloads.
- Links / Tools:
  - AWS ECS
  - Kubernetes
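
A minimal sketch of that local-to-remote switch, assuming the `dask` and `distributed` packages are installed. The `connect` helper and its `scheduler_address` parameter are hypothetical names chosen for illustration; the episode shows no code, and Coiled's own client API is deliberately not used here. The point is only that the analysis function stays the same while the `Client` target changes.

```python
import dask.array as da
from dask.distributed import Client


def connect(scheduler_address=None):
    """Return a Dask client (hypothetical helper, not a Coiled API)."""
    if scheduler_address is None:
        # Local prototype: an in-process, thread-based cluster with no dashboard.
        return Client(processes=False, n_workers=1, threads_per_worker=2,
                      dashboard_address=None)
    # Remote run: attach to an already-running scheduler at the given address.
    return Client(scheduler_address)


def sum_of_squares(n):
    """Analysis code that does not care where the cluster lives."""
    x = da.arange(n, chunks=max(n // 4, 1))  # split the range into about 4 chunks
    return int((x ** 2).sum().compute())


# With no address given, the computation runs on the local machine.
with connect() as client:
    local_result = sum_of_squares(1_000)
```

Passing a scheduler address (for example a placeholder such as `tcp://scheduler-host:8786`, which is not a real endpoint) would send the same `sum_of_squares` call to a remote cluster instead.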
Pangeo and Open-Source Collaboration: The conversation highlighted Pangeo, a community of Earth scientists using Dask, JupyterHub, and cloud computing to study massive climate and oceanography datasets. This example shows how Dask can transform research workflows by allowing scientists to explore terabytes of data interactively and collaboratively without excessive overhead.
- Links / Tools:
  - Pangeo Project
  - JupyterHub
RAPIDS and GPU Acceleration: The RAPIDS suite (created by NVIDIA) integrates with Dask to provide GPU-accelerated data science (e.g., for pandas-like operations on the GPU). This combination can drastically speed up certain workloads, from data analytics to machine learning, all while keeping the Pythonic syntax consistent.
- Links / Tools:
  - RAPIDS
  - Numba (often used in GPU contexts, though only briefly alluded to)
Cost Management and Team Coordination: Enterprises care about cost visibility and usage monitoring, especially when running large clusters. Coiled’s platform offers dashboards and policies to keep an eye on how long clusters run, track cloud spend, and ensure data scientists don’t unintentionally run up massive bills.
- Links / Tools:
  - Coiled Cloud (for cluster cost management)
Minimal Creativity and Familiar APIs: One of the guiding philosophies behind Dask is to invent as little as possible. Rather than forcing new APIs on users, Dask tries to mirror the approach of existing Python tools (like pandas and NumPy), allowing data scientists to scale out with minimal new learning.
- Links / Tools:
  - NumPy
  - scikit-learn
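
To make the familiar-API point concrete, here is a small sketch assuming `numpy` and `dask` are installed. The same expression is written once against a NumPy array and once against a Dask array; the array size and chunk size are arbitrary choices for illustration, not values from the episode.

```python
import numpy as np
import dask.array as da

# Plain NumPy: the whole array lives in memory and is computed eagerly.
x_np = np.arange(1_000_000, dtype="float64")
np_result = (x_np ** 2).mean()

# Dask: the same data split into 10 chunks of 100,000 elements each.
x_da = da.from_array(x_np, chunks=100_000)

# The expression reads the same as the NumPy version, but it only builds
# a task graph; nothing is computed until .compute() is called.
lazy_mean = (x_da ** 2).mean()
da_result = lazy_mean.compute()
```

The only new concepts relative to NumPy are the `chunks` argument and the explicit `.compute()` call; the arithmetic itself is unchanged.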
Data Scientists vs. Infrastructure: Traditionally, data scientists were forced to handle DevOps tasks, like setting up Kubernetes, Docker containers, and security roles, to run large computations. Coiled aims to free them from that overhead, letting them focus on analysis and modeling while the platform automates cluster setup and teardown.
- Links / Tools:
  - Docker
  - conda
Dask in Production (Web APIs and Beyond): You can use Dask not only interactively, but also in production-like scenarios, such as powering a background job in a web API (e.g., with FastAPI or Flask). The conversation hints at possibilities for building robust, scalable services that handle large amounts of data behind the scenes.
- Links / Tools:
  - FastAPI
  - Flask
Addressing the Enterprise “Buy vs Build” Dilemma: Enterprises often look for a “product” to purchase rather than building from scratch with open-source pieces. Coiled offers that product-like experience (support, legal contracts, training, environment management) so large organizations can confidently adopt open source (Dask) for big data workloads.
- Links / Tools:
  - Dask on GitHub
  - Coiled on GitHub

Interesting quotes and stories

"The principle of minimal creativity: We wanted to be as familiar as possible to what users already knew in the PyData stack." — Matthew Rocklin

"They needed to buy something because they're NASA. They don't want another puppy to take care of; they want someone else to handle that for them." — Matthew Rocklin

"We can share the same environment on the cloud so you don’t have to talk to some devops or data engineering team and wait for them to handle your environment." — Hugo Bowne-Anderson

Key definitions and terms

Dask: A parallel computing library in Python that extends pandas-like workflows to multiple cores or machines.
Distributed Computing: The practice of spreading workloads across multiple systems to scale performance or handle larger data than fits in one place.
Coiled: A managed Dask service that simplifies cluster setup, environment sharing, and security.
GPU Acceleration (RAPIDS): Leveraging the power of GPUs for data processing and analytics, significantly speeding up tasks compared to CPU alone.
Kubernetes: An orchestration system for deploying, scaling, and managing containerized applications.

Learning resources

Here are a few places to continue your learning:

Python for Absolute Beginners: Perfect if you’re new to the language and want to build a solid foundation.
Getting started with Dask: A dedicated course on how to scale pandas workflows and Python computations with Dask.
Fundamentals of Dask: Explore more advanced or in-depth parallelization patterns and integrations with the broader PyData ecosystem.

Overall takeaway

Dask has become a linchpin in Python’s data science stack for parallel and distributed computing. By founding Coiled, Matthew Rocklin and Hugo Bowne-Anderson are turning years of open-source learnings into a straightforward product that addresses real-world deployment, environment, and security hurdles. Whether you’re an individual data scientist upgrading from pandas or an enterprise looking for cost management and compliance, the combination of Dask and Coiled can help you harness the full potential of modern hardware and cloud services with minimal friction.

Links from the show

Hugo on Twitter: @hugobowne
Matthew on Twitter: @mrocklin
Coiled: coiled.io
Coiled raised $5M in Sept: twitter.com
A brief history of dask article: coiled.io/blog
Coiled: Dask for Everyone, Everywhere: medium.com
The incredible growth of python: stackoverflow.blog
Growth updated (SO Trends current): insights.stackoverflow.com
Coiled Youtube channel: youtube.com
Snorkel package: pypi.org

Episode #285 deep-dive: talkpython.fm/285
Episode transcripts: talkpython.fm

---== Don't be a stranger ==---
YouTube: youtube.com/@talkpython

Bluesky: @talkpython.fm
Mastodon: @talkpython@fosstodon.org
X.com: @talkpython

Michael on Bluesky: @mkennedy.codes
Michael on Mastodon: @mkennedy@fosstodon.org
Michael on X.com: @mkennedy

Episode Transcript

00:00 If you're into data science, you've probably heard about Dask.

00:02 It's a package that feels like familiar APIs such as NumPy, Pandas, and scikit-learn.

00:07 Yet it can scale that computation across CPU cores on your local machine

00:12 all the way to distributed grid-based computing in large clusters.

00:16 While powerful, this takes some setup to execute in its full glory.

00:21 That's why Matthew Rocklin has teamed up with Hugo Bowne-Anderson and others to launch a business to help Python-loving data scientists

00:28 run Dask workloads in the cloud.

00:30 And they're here to tell us all about how they've built this open-source foundational business.

00:35 And you know what? They must be on to something.

00:38 Between recording and releasing this episode, they just raised $5 million in VC funding.

00:44 This is Talk Python To Me, episode 285, recorded August 12, 2020.

00:57 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

01:09 This is your host, Michael Kennedy.

01:11 Follow me on Twitter where I'm @mkennedy.

01:13 Keep up with the show and listen to past episodes at talkpython.fm.

01:17 And follow the show on Twitter via @talkpython.

01:20 This episode is brought to you by Brilliant.org and Monday.com.

01:23 Please check out what they're offering during their segments.

01:25 It really helps support the show.

01:27 Talk Python To Me is partially supported by our training courses.

01:30 Do you want to learn Python, but you can't bear to subscribe to yet another service?

01:35 At Talk Python Training, we hate subscriptions too.

01:38 That's why our course bundle gives you full access to the entire library of courses

01:42 for one fair price.

01:44 That's right.

01:45 With the course bundle, you save 70% off the full price of our courses, and you own them all forever.

01:51 That includes courses published at the time of the purchase, as well as courses released within about a year of the bundle.

01:57 So stop subscribing and start learning at talkpython.fm/everything.

02:04 Matthew, Hugo, welcome back to Talk Python To Me.

02:06 Hi, Michael.

02:07 Hey, Michael.

02:08 Thanks for having us again after a couple of years.

02:10 Yeah, it's fabulous to have you both back.

02:12 You all were guests previously.

02:14 Hugo, you were on a really popular episode 139, Paths into a Data Science Career,

02:20 which I think is great helping people do that.

02:22 And Matthew, you were 207, Parallelizing Computation with Dask.

02:27 Oh, I think you'll beat me.

02:28 You beat me on the show, Hugo.

02:30 It's not a competition.

02:31 But if it was, you're right.

02:33 You would have won.

02:33 Yes, that's right.

02:34 But I think this talk is going to be a really interesting extension of the one that you were on, Matthew,

02:41 where we talked about the technical stuff of Dask and the computation.

02:45 And now, you know, it's really grown a lot.

02:47 That was February 2019.

02:49 That was a time in the world when you could go places.

02:51 Like you could just go out and you could go to a place where they made food

02:55 and you could physically touch another person and then just take it and you wouldn't worry about it.

02:59 It was amazing.

02:59 That's right.

03:00 Yeah.

03:00 The actual remote experience of developing software hasn't changed a ton.

03:03 I think actually people in our space have more or less the same workflow.

03:07 We just, you know, don't see each other at conferences any longer.

03:10 Yeah.

03:11 Yeah.

03:11 If we were accused of being antisocial as developers, well, we've taken it up a notch

03:15 or everyone's just come.

03:16 Other people have come to join us in this weird world.

03:19 But yeah, actually, I think software is pretty social as, you know, even though it gets categorized,

03:25 not so much that way.

03:26 So it's great to have you both back.

03:28 I'm looking forward to talking about Dask, but especially Coiled and this new company you

03:32 are starting and this new platform you're starting sort of based on taking that to the next level.

03:37 So it's going to be a really fun conversation.

03:39 Yeah.

03:39 I'm looking forward to it.

03:40 As am I.

03:40 Yeah, absolutely.

03:41 Yeah.

03:41 Now, before we get into it, though, you both were on before and I asked how you got into

03:45 programming and how you got into Python.

03:47 So I might have a really, you know, maybe keep it short since there's two of you and

03:51 we did cover some of the stuff a little before, but let's try a different question.

03:54 How did you get into data science?

03:55 Hugo, you want to go first?

03:57 Sure.

03:57 I was working in applied math in cell biology and biophysics.

04:01 And my job was ostensibly to do a bunch of mathematical modeling.

04:05 But all my collaborators were generating loads of data that I needed to figure out how to

04:11 analyze to collaborate with them and even, you know, play a big role in the iterative design

04:16 of experiments and all of this type of stuff.

04:18 This was 2011, 2012.

04:20 Excel didn't seem like a great answer.

04:22 That's where I started.

04:23 And then I went to R and then I went to Python and kind of learned them concurrently because

04:28 I was working with biologists.

04:29 R was quite popular, but the IPython notebook was becoming more and more popular.

04:34 It was a great resource for teaching oneself and teaching others.

04:38 So I jumped in then and worked in research, worked in education a lot, and then moved to

04:43 industry, to tech, and started working in ed tech to educate people around all of these

04:49 tools as well.

04:49 Yeah.

04:49 Excellent.

04:50 Matthew, how about yourself?

04:51 I think for me, that probably the culmination point or the inflection point was just after

04:55 I graduated from university.

04:57 I graduated with like a physics and astronomy degree.

04:59 And I was looking at job applications.

05:01 I didn't like any of the jobs that I was qualified for.

05:04 But the jobs that I loved.

05:05 I had the same feeling when I studied math.

05:08 I'm like, I love this stuff, but what am I going to do with this?

05:11 And none of this looks that great to me.

05:12 But I did know how to code.

05:13 I had been coding since I was like a kid on an old TI calculator.

05:16 And all of the programming mixed with science jobs, those were fascinating.

05:21 So I actually went into graduate school for computer science, scientific computing, and that

05:24 was a really good fit for me.

05:27 I love that I got to touch all sorts of different domains.

05:29 But that was probably the inflection point is looking at job applications after having

05:33 already graduated college and realizing, ah, I think I made a mistake.

05:35 Maybe I should have turned left earlier.

05:37 That's really interesting because a lot of people, they look around and they see folks

05:41 who have successfully created these amazing open source libraries and projects, initiatives,

05:46 you know, like Dask.

05:48 And they're like, the people who get to do that, they've been programming since they were

05:51 five and they got a computer science degree and they just knew they wanted to do that.

05:55 And I think that's not actually as true as often people think it is.

05:59 Right.

05:59 I think none of us got computer science degrees.

06:01 Is that fair?

06:02 Yeah, I didn't.

06:03 I got a computer science book after I decided I didn't want to work in math anymore and started

06:08 like, you know, C++ book and started going on that.

06:10 That's not the same as a degree though.

06:12 Good plan.

06:12 I feel like there's an analog in the data science world as well with respect to, you know,

06:16 all these boot camps that happen now and now masters in data science at prestigious

06:19 universities and relatively few people I know at least have taken the university track.

06:25 I think boot camp to junior data scientist positions far more prevalent.

06:29 But for more on that, you can check out the previous episode that I recorded with Michael.

06:33 Yes, exactly.

06:35 We've got a whole show on that one.

06:37 So, all right.

06:38 Now, the last time we spoke, Hugo, you were at DataCamp doing stuff with teaching people

06:44 online Python stuff.

06:46 That was awesome.

06:46 And Matthew, yeah, yeah, cool.

06:49 And Matthew, you were just moving to do a little bit of collaborative stuff with NVIDIA, but

06:54 now you both have, you know, teamed up, joined forces to do something else, right?

06:58 What are you doing nowadays?

06:59 So, we're building a company called Coiled, essentially to productize for the enterprise

07:06 a lot of features of the PyData ecosystem and Dask in particular.

07:09 So, the V1 of our product, we're just about to launch.

07:12 So, it's mid-August 2020, we're about to launch, is managing Dask and distributed compute

07:16 in the cloud for you and handling security, conda and Docker environments, team management, this

07:21 type of stuff.

07:22 And that doesn't really say what we do on a day-to-day basis.

07:24 I think I could list what we don't do on a day-to-day basis more quickly than what we do.

07:29 Yeah, sure.

07:30 But having, you know, incorporated the company in February, it's kind of waking up every day

07:35 and seeing what the most impactful stuff we can do actually is.

07:38 And that's really exciting.

07:39 And it's a great, I'm really excited to work with Matt and the whole Coiled team.

07:43 Matthew, how about you?

07:43 What do you do day to day?

07:44 You've joined up with Hugo on Coiled.

07:46 Yeah.

07:47 So, day-to-day, you know, I still help maintain Dask.

07:49 So, there's a ton of GitHub conversations, you know, Discord, Stack Overflow.

07:53 But now, we're also making a company.

07:54 So, we're, you know, figuring out sales.

07:56 We're figuring out legal.

07:57 There's a ton of sort of nuts and bolts of running a company.

08:00 That is quite difficult.

08:01 As Hugo was saying, listing what you can do is different from what you can't.

08:05 Yeah, exactly.

08:07 So, you went to school for physics and astronomy, and then you got a master's degree in computer science, and then you did some really interesting data science around Dask.

08:16 I bet none of that taught you about marketing funnels or accounting or taxation law or any of that fun stuff, right?

08:24 You'd actually be surprised.

08:25 Like, running...

08:26 Like, yeah.

08:26 Maintaining an open source project includes lots of those things.

08:29 Okay.

08:30 So, there's actually a lot of, like, we're hiring.

08:31 We do a lot of community management.

08:33 There's a lot of, like, employer-employee kinds of relationships that occur.

08:37 Right.

08:37 People are new to the project.

08:38 Okay.

08:39 Yeah.

08:39 You need to have relationships with large companies like NVIDIA or Google or Microsoft.

08:42 Like, running a sizable open source project is actually not so dissimilar from running a business.

08:47 Yeah.

08:48 Now, taxation laws and legal agreements, there's certainly more of them now.

08:52 Probably so.

08:53 But, yeah, you guys probably wear a lot of hats as you're getting this stuff off the ground.

08:57 Indeed.

08:57 Very much so.

08:58 Yeah.

08:58 And to speak to Matt's point, when we first started working together, he was like, okay, what's the call to action on the landing page?

09:04 And I was like, oh, great.

09:05 Let's have that conversation.

09:07 And his, you know, appreciation of design ethics, sorry, design aesthetics and marketing funnels, I think is actually, I was pleasantly surprised.

09:17 Yeah.

09:17 Well, I mean, so there are probably, you know, 10 to 15 to 20 people who work on Dask at different companies.

09:22 You know, many of whom we've, like, tried to get jobs.

09:25 So, you know, we do marketing.

09:27 We do sales.

09:27 We do all of those kinds of things you do inside of companies.

09:30 We just do it without paying anybody directly or without having any actual control over everything.

09:37 Running with responsibility is kind of like running a company without any power.

09:39 It's a lot of soft power.

09:40 Sounds like product management.

09:42 Yeah, I can imagine.

09:43 Sure.

09:44 It does.

09:44 I don't know.

09:45 All right.

09:45 Cool.

09:46 And I guess also just throw out there really quick, Hugo, you did the DataFramed podcast for a while.

09:50 So people may also know you from there as well, right?

09:52 Are you still doing that?

09:53 No, no, I'm not.

09:54 And I miss podcasting a great deal.

09:56 I'm very much enjoying going on podcasts.

09:58 But Matt and I have some secret plans to start podcasting in the nearish future.

10:04 So that may be a bit of a teaser there.

10:07 But yeah, that was a really fun podcast to put out weekly.

10:10 And of course, I came on your show, on this show before doing that.

10:13 I just want to let everyone know that Michael was actually incredibly helpful in getting the podcast up and running and letting me know all the things that he learned along the way that, you know, I still learned a lot along the way.

10:25 But there are several things that Michael helped so wonderfully with.

10:28 So thank you for that, Michael.

10:29 Oh, you're really welcome.

10:30 I was glad to see you having success there.

10:31 And I'm looking forward to whatever secret project you guys have in mind.

10:36 As am I.

10:36 Awesome.

10:37 So we're going to talk about Dask and open source and where maybe some sort of company making this a little more powerful or amplifying the message there might come from.

10:47 But I wanted to start a conversation a little more broad and philosophical.

10:52 So I'm going to talk about two things real quick.

10:55 First of all, we saw in 2017, Stack Overflow published an article called The Incredible Growth of Python.

11:02 And it showed Python being pretty flat for many years, relatively, along with a bunch of other languages.

11:08 And then around 2012, you know, somebody kicked the derivative up or the second derivative.

11:14 And it just took off.

11:15 Right.

11:16 And it went up and up.

11:17 And they said, look at this.

11:18 It's about to pass these other languages.

11:19 And if we predict out, look how crazy it's going to get.

11:21 And, you know, that was three years ago.

11:23 We can look back, you know, I'll actually link to the Stack Overflow trends now.

11:28 And those predictions underestimated how popular Python was going to be.

11:33 But more so, they underestimated the stability of the other languages.

11:36 And they're actually going down more than the flatness that they predicted, which is, I think all those kinds of things is really interesting.

11:42 My theory is a lot of folks came into Python because of the data science stack around that timeframe.

11:49 And so I guess my question to you all is, you know, how much is this incredible growth of Python, the incredible growth of data science in Python versus more broad stories around there?

12:01 I think there's a lot of, I think the data science or scientific stack inside of Python certainly played a large role in that.

12:07 And, you know, I think that's because we understood, I think that the scientific stack targeted very much scientific users, which were a good proxy for data science users today.

12:15 Yeah.

12:15 And so we had the same combination of performance and accessibility that was necessary to meet that need.

12:20 So we're sort of ahead of the game by accident.

12:22 Those are why a lot of the maintainers have a science background.

12:25 Right.

12:25 What I would say, though, is that Python is really only powerful today because of that union of the data science stack with the web stack, with the visualization stack, with the system operations stack.

12:37 And it's really the fact that we can do all of those things together, which make Python sort of the standard default place to build advanced applications today.

12:46 You know, if we were only MATLAB, for example, and we couldn't do web servers, we would be sort of a fringe language.

12:52 We would be a niche.

12:53 If we were only, you know, Scala and could do only Spark, we would also be sort of a niche.

12:58 Python can do everything.

12:59 And that is where you can build these really rich and awesome applications.

13:02 Yeah.

13:02 I think that's a really good point.

13:04 I mean, people talk about that's why Node.js had the interest that it did, because you could sort of solve the problem in all the places you had to work.

13:12 And I feel like that's kind of the story as well for Python on the scientific computation side and to some degree on the web side as well.

13:20 I mean, we all have to write HTML and stuff.

13:22 But, you know, the way I think of it is, you know, if there's a biologist who's coming in and she has to do a little bit of computation, she's like, okay, I just need to do this.

13:31 I need to make this graph.

13:32 This data is too big for whatever.

13:34 If I can write these 10 lines of code, not even in a function, just top to bottom, straight down, write these five lines, 10 lines of code, get this amazing output by using all these libraries.

13:44 Eventually, the idea of like, well, maybe I need to pass different data.

13:46 So maybe I need to write a function.

13:48 And then, you know, you get a little bit farther.

13:49 No, now I'm going to reuse this.

13:51 I'm going to make this package.

13:52 Right.

13:53 You know, two years later, you look back and it's like, how did I become a programmer?

13:56 I thought I was a biologist.

13:57 I never, ever intended to be that.

13:59 So I feel like Python brings a lot of people in through that sort of gravity.

14:04 But because it'll solve the problems that are more advanced and has this rich sort of top end, you don't have to abandon it like you would MATLAB and go learn something else.

14:13 You just get to stay there.

14:14 And I think people are, it's sticky in that regard.

14:17 I agree.

14:18 This is a really common theme in how we think about data science.

14:21 There's this on-ramp, right?

14:22 You know nothing or you know how to use Excel.

14:25 You can use a little bit of pandas.

14:26 You can use a little bit of Dask on one machine.

14:27 You can go out to scale on a cloud.

14:29 You can run on the world's biggest supercomputers, right?

14:32 And it's that smooth experience that we think about.

14:34 I'm getting ahead of myself a little bit talking about Dask.

14:36 But that smooth experience of starting from nothing and working up to being amazing is what I think a lot of the Python ethos is about.

14:45 Yeah, I agree.

14:46 I agree with all of that.

14:47 And I do think it's the rise of an entire community and network of tooling in the Python landscape.

14:53 It's also the rise of tireless educators such as the carpentries, data carpentry and software carpentry bringing this stuff around the world.

15:01 I'm very humbled to have played a small role in my work at DataCamp to spread the gospel of Python.

15:07 I've never used that term before.

15:08 We'll see how that lands.

15:10 The other thing I think that I've been really excited about is seeing the wider Python community embrace data science as a fundamental part of the Python landscape now.

15:20 I don't know if anyone's done a data analysis of, let's say, data science talks and keynotes and tutorials at PyCon over the past 10 years.

15:27 But at least anecdotally, we've seen a huge embrace there.

15:30 I mean, it was where we first met in your hometown of Portland, right?

15:33 Oregon.

15:34 Yeah.

15:34 Right.

15:35 Kathryn Huff and Jake VanderPlas were two keynotes several years ago.

15:39 And I was like, oh, wow.

15:39 Yeah, they talked about the mosaic and it was beautiful.

15:42 And we're having keynotes from the Python community.

15:45 Exactly.

15:45 Yeah.

15:46 Super interesting.

15:47 So I do think that that's a lot of the magic of Python is that it welcomes people from these different areas.

15:54 And there was a really interesting survey.

15:56 I think it was the JetBrains PSF combined survey that asked something like, how many of you are data scientists and what percentage of the Python community do you believe data scientists are?

16:09 And data scientists made up almost like the equal partition of web, which were the two biggest groups, but they thought they were much less represented because they, I think they felt like they were working more individually, but there were just so many of them.

16:22 So it was really interesting that they believed that they were not as big of a part as they actually were.

16:26 But I think that that perception is starting to change.

16:28 I'd be interested in what type of like self-identification bias there is with respect to data science.

16:33 Like I almost get the impression everyone has to identify as a data scientist these days, even like whatever they do, even if they're a gardener.

16:40 It's true.

16:41 It's true.

16:43 Have you, have you touched pandas?

16:44 Then you are.

16:45 This portion of Talk Python To Me is brought to you by Brilliant.org.

16:51 Brilliant has digestible courses in topics from the basics of scientific thinking all the way up to high-end science like quantum computing.

17:00 And while quantum computing may sound complicated, Brilliant makes complex learning uncomplicated and fun.

17:05 It's super easy to get started, and they've got so many science and math courses to choose from.

17:09 I recently used Brilliant to get into rocket science for an upcoming episode, and it was a blast.

17:14 The interactive courses are presented in a clean and accessible way, and you could go from knowing nothing about a topic to having a deep understanding.

17:21 Put your spare time to good use and hugely improve your critical thinking skills.

17:26 Go to talkpython.fm/brilliant and sign up for free.

17:29 The first 200 people that use that link get 20% off the premium subscription.

17:34 That's talkpython.fm/brilliant, or just click the link in the show notes.

17:41 Another thing that I want to ask you in this sort of philosophical idea is that I mostly do web stuff with Python.

17:49 So I run my online stuff, my online training company, and various things with Python.

17:54 But obviously, I'm interested in all of it.

17:57 When I look at the web world, I feel like once Python 3 really got fully adopted, or at least became the default thing to do, a whole bunch of older ideas were all of a sudden still relevant, but they became, you know, that's neat.

18:12 But let's rethink this now that we can.

18:14 So I'm thinking of things like FastAPI, where they use type annotations to mean stuff, or, you know, things like that, or we're going to build this from the ground up with async and await, and so on.

18:24 So there were just so many different web frameworks and other tools that just came out of nowhere.

18:30 And some of them are not really maintained anymore.

18:32 But all these flowers bloomed.

18:34 And it was just a really interesting thing to see.

18:36 I don't have the same visibility in the data science side for that.

18:39 Did that happen?

18:40 And what was it like?

18:41 I think to a certain extent, it did.

18:43 I think probably larger is the widespread adoption of tools and not necessarily the creation of new tooling and the development of specific tooling as well.

18:53 But there are several tools.

18:54 I think the first one that really is so important is the Jupyter ecosystem.

18:59 I referred to IPython notebooks, and I did that accidentally, but also intentionally, because they were IPython notebooks back then, right?

19:06 That was super cool.

19:07 And I feel like people forget about IPython, but it's so important in the Jupyter ecosystem.

19:13 But yeah, since 3.6, we've seen two years ago, like JupyterLab 1.0, which I think brings the possibility of so many new people to use the entire ecosystem.

19:24 I always think JupyterLab does for Python, I think, what the RStudio IDE does for the R ecosystem.

19:32 It reduces the barrier to entry for so many people.

19:35 But then, as we've noted in other conversations, tools like Streamlit are really cool.

19:41 And I think this is going to be the next evolution of data tooling is bridging the gap between local computation and iterative science in Python and productionized science, right?

19:51 And machine learning.

19:52 So the ability to kind of build a machine learning product quite quickly is wonderful there.

19:59 Yeah, Streamlit was definitely one of the things that came to mind for me is like, this is really a different take on how this works, but it's kind of a neat, modern way to do it.

20:08 Exactly.

20:09 So for me, it wasn't so much the change of Python 3.6.

20:12 I think the scientific stack, the PyData stack actually adopted Python 3 like two or three years before the web stack.

20:19 Yeah, they definitely had a head start there.

20:22 That was great.

20:22 So I think that wasn't as transformational.

20:25 We had a much more smooth transition, I think.

20:27 I think around the same time, though, there was this huge influx of new people and new actors in the system.

20:34 So there's like, you know, as we shifted from being this sort of fringe scientific computing language to becoming the de facto standard language for ML AI workloads, with the adoption of like TensorFlow and PyTorch, you've got, you know, Google and Facebook acting, you've got Nvidia in there now.

20:49 And you have just, you know, a 10x in the user growth.

20:53 That is really, I think, our like, change moment is the like the TensorFlow engagement of companies for better or for worse.

21:00 Yeah.

21:01 What about enterprises and large corporations seeing the PyData stack as legitimately the thing that they could adopt over, I don't know, saying they have to do with Java or something like that?

21:11 Like, it seems like that's changed as well.

21:13 Yeah, there's a huge acceptance of Python both in the data science and machine learning space, but also in production, right?

21:19 As sort of Docker came up and as Kubernetes came up, like Python became a thing you could easily deploy into production.

21:24 And that's new as of the last decade.

21:27 And with that comes a bunch of money to be made, which again, brings in this like other slew of actors in this space.

21:32 Yeah.

21:32 Both good and bad.

21:34 Yeah.

21:34 I think this is something which started for the most part in tech, where we saw massive, massive wins in open source adoption and then bled out slowly.

21:42 I mean, finance is a great example, like seeing how much places like JPMC and Capital One and these places use Python now.

21:49 But then we're seeing it in retail.

21:51 I mean, one of the great use cases of Dask and Rapids and XGBoost is at Walmart.

21:56 Going back to tech, the fact that Netflix runs like hundreds of thousands of Jupyter notebook batch jobs a day or something like that.

22:04 And they're hiring.

22:04 I mean, they've got a bunch of Jupyter core devs, right?

22:07 Yeah, their whole Papermill thing.

22:08 Yeah.

22:09 Yeah, that's really interesting.

22:11 I guess the early start on Python 3, because a lot of the tooling was improving and there wasn't as much like legacy Python data science.

22:21 I think that was part of the key to being able to just say, you know what?

22:25 We're just going to start something new on these new set of tools, new machine learning library.

22:29 Let's just use Python 3 as opposed to, well, we're still on this web framework.

22:33 We can't change versions.

22:35 And that thing depends on this.

22:36 So we're stuck, you know, seven or eight years in the past.

22:39 So, yeah, pretty cool.

22:40 I mean, it was still a huge effort.

22:41 But I think there's just like there was less production use of the data science stack.

22:46 It seemed like there was production use of the web stack.

22:48 And that's what's really hard to change.

22:49 And I'll also add that data scientists require flexibility and nimbleness and agility in their tooling.

22:56 And the questions they're answering are changing so quickly.

22:59 The way these questions integrate into making business decisions.

23:03 And that interface is changing so quickly that the tooling is kind of moving alongside that.

23:07 And there's this wonderful co-evolution between the tooling and the techniques and the questions that we're required to answer.

23:12 Yeah, that's a good point.

23:13 All right.

23:14 So let's set the foundation for this project, this company that you all are working on.

23:19 And that would be Dask.

23:22 Matthew, you want to kick us off?

23:23 Tell us, you give us the summary of Dask and all the cool things that it does.

23:27 I mean, we did talk about it a while ago, but that was over a year ago and probably not everyone heard it anyway.

23:31 Yeah, sure.

23:32 So Dask is an open source Python library that was designed to parallelize other Python libraries.

23:37 So we first started Dask at Anaconda.

23:40 And the goal was to parallelize out NumPy and Pandas and scikit-learn, sort of the foundation of the scientific Python stack.

23:48 What we found pretty quickly is that for about half the users we were targeting, that's exactly what they wanted.

23:53 They had a big table or a big array of numbers.

23:56 And they wanted a bigger version of those things with the same APIs and the same feel.

24:01 And so Dask gave that to them.

24:02 But about half of our users wanted to do something totally different.

24:06 They wanted to use the internals of Dask to parallelize out some other crazy thing that they were building.

24:11 As we mentioned, in Python, people were building all sorts of new libraries.

24:14 And they wanted to add a little parallelism, a little bit of scalability into their library.

24:18 Yeah.

24:18 And the internals of Dask allowed them to do that.

24:21 So Dask is, at its core, a general purpose library for parallel computing.

24:25 You can think of it in the same way you think of the threading module or the concurrent futures module

24:30 or anything like that in the standard library.

24:32 But it runs at scale and gives you full scalability.
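
As a rough illustration of the standard-library pattern mentioned here (this is not code from the episode), the sketch below pushes a toy function through `concurrent.futures`. The `square` function and its inputs are made up for the example. Dask's distributed client exposes similarly named `submit` and `map` methods, which is one way to read the comparison.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    # Hypothetical stand-in for any independent unit of work.
    return x * x


# Run the function over several inputs on a small pool of threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(square, n) for n in range(5)]
    results = [f.result() for f in futures]

print(results)
```

The standard-library pool is limited to one machine; the point made in the conversation is that Dask keeps this style of programming while spreading the work across many machines.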

24:35 Right.

24:36 When you say at scale, you mean you could have a cluster of 20 machines.

24:39 Yeah.

24:40 Or 200.

24:40 Yeah.

24:41 Something that looks like a Pandas data frame.

24:43 And you ask it to do computation.

24:45 And that computation happens on all those machines.

24:48 Exactly.

24:48 Yeah.

24:48 And it's really smart about not moving data.

24:51 Because if you've got a lot of data, the movement of it around might actually be the slowest

24:55 part, right?

24:55 Yeah.

24:55 That and like 20 things.

24:57 Yeah.

24:57 So we think very, very deeply about how to run complicated task graphs at scale.

25:01 And that allows for us to do things like big pandas.

25:04 But also, you could use it in the same way you use Celery behind a web application.

25:08 Or you could use it backing a big machine learning application.

25:12 So Dask is a very flexible tool on which many people have built these sort of more special

25:18 purpose scalability tools.

25:19 You can think of it kind of like a tool that you would use to build Spark or the tool that

25:23 you would use to build a parallel Airflow, for example.
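
To make the idea of a task graph a little more concrete, here is a minimal sketch using `dask.delayed`. The `load`, `total`, and `combine` functions are hypothetical placeholders, not anything discussed in the episode, and the sketch assumes Dask is installed.

```python
import dask


@dask.delayed
def load(n):
    # Hypothetical loader: pretend each call reads one chunk of data.
    return list(range(n))


@dask.delayed
def total(chunk):
    # Reduce one chunk to a single number.
    return sum(chunk)


@dask.delayed
def combine(parts):
    # Merge the per-chunk results.
    return sum(parts)


# Nothing runs yet: these calls only record tasks and their dependencies.
partials = [total(load(n)) for n in (3, 4, 5)]
graph = combine(partials)

# compute() hands the whole graph to a scheduler, which can run
# independent tasks in parallel.
result = graph.compute()
print(result)
```

The three `load`/`total` chains do not depend on each other, so a scheduler is free to run them at the same time before the final `combine` step.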

25:26 I'll just build on this briefly by quoting Matt.

25:29 And I always quote him on this.

25:30 So he's heard me quote him many times on this.

25:32 But this is a blog post he wrote called A Brief History of Dask, which you can find on

25:36 coiled.io/blog.

25:39 But he speaks to three goals from the original idea that he and several other people at Anaconda

25:45 were thinking about.

25:46 Two technical goals, which were to harness the power of all the cores on your laptop in

25:50 parallel.

25:50 A second one to support larger than memory computation.

25:53 And these we know about.

25:55 But he also mentions a social goal, which is to invent nothing.

25:57 And I quote, we wanted to be as familiar as possible to what users already knew in the

26:02 PyData stack.

26:03 And I think as the PyData stack grew and garnered a lot of adoption, the fact that there was

26:08 a social goal to invent nothing allows pandas and NumPy users then to use it immediately.

26:13 And Matt says, you know, Dask doesn't have an API because it uses the APIs of the packages that

26:18 it allows people to do distributed compute with.

26:20 Yeah, that's a really good goal because it doesn't matter how amazing the thing you came

26:25 up with.

26:25 If you say, OK, well, you used to talk to databases, but now use our special API that's

26:31 like a database, but it's not exactly because it's better.

26:34 And we have this graphing thing and it's like Altair, but it's not really Altair.

26:38 It's just motivated by it.

26:40 So you can just redo all your stuff and it's going to be better.

26:42 And, you know, a lot of times it's like, you know what?

26:44 The tools that I have really work well and I don't want to reinvent my world.

26:48 Exactly.

26:49 Yeah.

26:49 I call this the principle of minimal creativity.

26:52 I love it.

26:53 It's like productive.

26:54 That's a beautiful one.

26:55 It's similar to the productive, like laziness of developers or something like that.

27:01 Right?

27:02 Sure.

27:02 Yeah.

27:03 This thing that we thought was great, creativity, you know, building a new wonderful thing is

27:06 actually horrible.

27:07 Like you really don't want to impose your thoughts on others.

27:11 Building boring tools.

27:12 Yeah.

27:12 Right.

27:12 Building boring tools.

27:13 Yeah.

27:14 Yeah.

27:15 Otherwise known as getting stuff done, right?

27:21 This portion of Talk Python To Me is sponsored by Monday.com.

27:24 Monday.com is an online platform that powers over 100,000 teams' daily work.

27:30 It's an easy to use, flexible and visual teamwork platform beautifully designed to manage any

27:35 team organization or online process.

27:37 Now, for most of us, we missed our chance to build the first apps ever in the mobile app

27:42 stores.

27:43 It was a once in a lifetime opportunity, but it's one that's coming around again.

27:47 Monday.com is launching their marketplace and running a contest for the best new apps featured

27:53 right from the get go.

27:54 Want to be one of the first in the Monday.com apps marketplace?

27:57 Start building today.

27:59 They're even giving away $184,000 in prizes, including three Teslas, 10 MacBooks, and more.

28:06 Build your idea for an app and get in front of hundreds of thousands of users on day one.

28:10 Start building today by visiting monday.com slash Python, or just click the link in your

28:15 podcast player's show notes.

28:16 My understanding is that Dask is used for all these different projects.

28:22 And like sometimes you'll find Dask being used and you're not even really aware that

28:27 Dask is somehow powering maybe its parallelism or whatever.

28:30 So maybe give us a couple of examples of places you found Dask being used that surprised you or

28:36 you're proud of.

28:37 Yeah, sure.

28:38 I'll maybe talk about Pangeo first.

28:41 So I think I mentioned Pangeo at the end of our last podcast.

28:45 Yeah.

28:45 So Pangeo is amazing.

28:47 Pangeo is a collaboration of earth scientists, so like climate scientists, meteorologists,

28:51 oceanographers, and a bunch of open source software developers who teamed up to make a new software

28:57 stack to solve a lot of climate science problems.

28:59 We combined things like JupyterHub, Dask, Kubernetes, and other libraries on top of that, like Xarray,

29:04 intake, all sorts of things.

29:06 Yeah.

29:06 And we just like, we revolutionized the way that that sector computes.

29:11 There were sort of decades old software stacks that we just immediately showed were not nearly

29:16 as powerful as we were able to do together, both in terms of computation and in terms of

29:20 accessibility.

29:21 It's at the point where, you know, if you were an undergraduate student in the Philippines,

29:25 you can go and you can look at cloud-based data sets that are many terabytes large and

29:31 see, you know, what climate change will do with sea level rise, for example.

29:34 And that's something that we were all, that happened in like six or 12 months.

29:37 It was amazing.

29:37 And that's such an important problem that we need to solve.

29:40 And so to know that some of the software you helped create is central to making that happen.

29:46 That's really cool.

29:46 Me and like a thousand other people.

29:48 Yeah, yeah, yeah.

29:48 Of course.

29:49 Absolutely.

29:50 Yeah.

29:50 But it was a great, it was a great example of a bunch of different kinds of people all

29:55 working together.

29:55 If it was a bunch of like computer scientists working together,

29:57 we wouldn't have done it correctly.

29:58 If it was just the oceanographers working together, it wouldn't have gotten done at all.

30:01 It needed to be a collaboration.

30:02 Right.

30:03 Yeah.

30:03 How many oceanographers are like, you know what?

30:05 I'm going to set up a Kubernetes cluster so we can do automatic scale out, but then scale

30:09 back down.

30:10 You'd be surprised.

30:10 You know what?

30:10 No, no, I know.

30:12 You'd be surprised.

30:13 They ended up having to learn those skill sets over time, which is partially why we made Coiled,

30:17 is to stop that from having to be the case.

30:20 They're also sharp folks.

30:21 I mean, people who are in that space are quite sharp.

30:22 Another example might be Rapids.

30:25 So when we last talked, I had just joined NVIDIA.

30:28 So at the time, NVIDIA was building Rapids, a data science suite that is GPU accelerated.

30:33 So they have like NumPy, Pandas, scikit-learn equivalents that all run on the GPU.

30:38 It's a little bit like what Dask does for parallelism.

30:40 They're somewhat attempting to do that for the CUDA cores and GPU, right?

30:45 Yeah, exactly.

30:45 So, you know, Dask allowed you to go from Pandas to parallel Pandas.

30:50 Rapids allows you to go from Pandas to GPU accelerated Pandas.

30:54 And then Rapids and Dask together allow you to go to, you know, GPU accelerated Pandas across

30:58 the cluster of machines.

31:00 Yeah, that's a great way to put it together.

31:02 Yeah.

31:02 The really great result we had out there, which is super surprising to me, was they did a

31:06 benchmark, the TPCx-BB benchmark, where there's like a bunch of sort of data science and business

31:11 analytics queries.

31:12 And they got something like 40 times faster and like seven times cheaper than the next solution,

31:18 which is like Spark or MapReduce on a bunch of Dell machines.

31:21 Yeah.

31:21 And so in some sense, that's like 400, right?

31:24 Because it's, you know, the 40X faster and the cheaper, right?

31:28 Because normally you think I can get a lot faster, but I spend a lot more money or I can get

31:31 cheaper, but it goes a lot slower.

31:32 But if you get it to expand in both axes, that's awesome.

31:35 No, to be honest here, it's a more expensive machine, but running for a much shorter amount

31:41 of time.

31:41 Yeah.

31:41 So it's not 400X, it's genuinely 40X faster and only 7X cheaper.

31:45 But yeah, no, it's amazing.

31:46 And I think what I love about that is that there are all of these sort of old guard companies

31:51 building out solutions for this benchmark.

31:53 You've got Dell, HP, and they're all using like monolithic software projects like Spark.

31:59 And this benchmark was different.

32:00 It was NVIDIA and it was people at Coiled and it was people at BlazingSQL, another startup

32:06 company.

32:06 There are a bunch of small startups and a bunch of small software projects that all collaborated

32:11 together to just smash this benchmark, right?

32:14 All the other companies were fighting over sort of 10%, 20% gains.

32:18 And we come in with like a 40X speed improvement.

32:21 And so it just sort of, we just wiped everyone out, which was great.

32:26 Yeah, that's a super cool project.

32:28 And Rapids, I mean, it seems like that's a pretty early stage project and it's probably

32:31 just going to grow.

32:32 Early stage and moving fast.

32:33 I mean, that's an example of a company investing in open source with a team of 50 people.

32:38 It's impressive.

32:39 And then the last one I might mention is Napari, which is kind of like Pangeo in that

32:44 they are using Dask to advance another field, in this case, biomedical imaging.

32:48 But their approach is really different.

32:49 A lot of lab bench scientists, people who are looking at microscopes, don't know how to program.

32:54 And so we can't give them a Jupyter notebook that they can play with.

32:58 Instead, they're used to sort of point and click interfaces.

33:00 And so Napari is an image viewer for large images.

33:04 With a point and click interface, you sort of pan and zoom around these, you know, picture

33:08 of a cell or picture of some sort of cancer thing.

33:10 And what I love about this is that they use other parts of Python.

33:12 They use the whole Qt side of Python.

33:14 So that's a Qt-based application.

33:17 And they're not targeting data scientists.

33:19 They're targeting actual scientists.

33:21 And I think that's like a great space for growth of Python in the future is to build applications.

33:26 Yeah, yeah, yeah.

33:27 Because it's one thing to say we build it for data scientists.

33:29 They're kind of like Python programmers in general, but they have this special visualization data

33:35 handling skill set as opposed to the person who took your x-rays or whatever, right?

33:41 Yeah, exactly.

33:42 And also there's an order of magnitude growth that we can have.

33:46 Yeah.

33:46 And I'll also add to that that I've spoken with several users of Napari recently and the way

33:51 it changes their approach to science and their scientific flow.

33:55 I mean, they can do a bunch of imaging.

33:56 And whereas previously they'd have to wait until the next day to check out the images,

34:00 they can literally go away for 15 to 30 minutes, come back, check out the images, then plan the

34:05 next experiment.

34:05 So it changes their absolute flow state of scientific research and the rapid iterative

34:11 cycle for them.

34:11 Yeah, that's really awesome.

34:12 I mean, it's got to feel great to work on that project and just see, you know, it's making a true

34:17 impact for everyday people.

34:19 Yeah, that's cool.

34:20 The one other thing I'll add, which I think we hinted at before, is that Dask is famously used for

34:26 leveraging clusters and parallelization, but it also does it locally.

34:31 So before you scale out to your cluster or whatever, you can scale up to do out-of-core

34:36 computing on larger data sets yourself.

34:39 So it allows a lot of people to do more locally than they'd otherwise be able to across all

34:44 the types of questions we've been discussing, which is pretty cool.

34:47 Yeah, absolutely.

34:48 Well, you know, the MacBook that I'm sitting here with, the MacBook Pro is a core i9 with

34:53 six hyper-threaded cores.

34:55 So 12, basically.

34:57 And if I go and do Python stuff really hard on it, I get one twelfth the CPU consumption

35:02 of what's available.

35:02 And my gaming machine is, you know, 16 cores, right?

35:06 It's like, if that stuff is just mostly idle in the Python world, unless you find a way to

35:11 take advantage of it.

35:12 And so that actually really surprised me about Dask.

35:13 I thought of Dask as, you know, the way we sort of opened the conversation.

35:17 It's this way to take huge data that has to fit into a cluster and not just onto my little

35:22 laptop and run it distributed in terms of distributed machines.

35:26 But the fact that it also runs and sort of scales up and takes better advantage of your

35:30 local hardware is pretty awesome.

35:31 The median cluster size is one, which is just a laptop.

35:34 And so we optimize for that case.

35:36 Yeah.

35:36 Yeah, that's really cool.

35:37 I was going to add, the other thing to note is the interoperability with a lot of other

35:41 packages.

35:42 So if you're using scikit-learn and you want to parallelize your hyperparameter search or

35:47 whatever it is, you can use the, I think it's the n_jobs kwarg or something like that.

35:50 And you can use Dask in the backend there.
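
One way this can look in practice is sketched below, assuming scikit-learn, joblib, and Dask's distributed scheduler are installed. The dataset, the estimator, and the parameter grid are invented for illustration; the piece that matters is the `n_jobs` argument combined with joblib's `"dask"` backend.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Tiny synthetic dataset and grid, purely for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=3, n_jobs=-1)

# An in-process local cluster; a real deployment would connect to
# remote workers instead.
with Client(processes=False) as client:
    # Route scikit-learn's joblib-based parallelism through Dask.
    with joblib.parallel_backend("dask"):
        search.fit(X, y)

print(search.best_params_)
```

The scikit-learn code itself is unchanged; only the surrounding context manager decides where the parallel work runs.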

35:52 Recently, I was so excited to see Matt.

35:54 We're doing these YouTube live streams weekly of people using Dask and a bunch of other

35:59 stuff.

35:59 So subscribe to our YouTube channel if you're interested.

36:02 But marketing aside, somebody from Grubhub, Alex Egg, used Snorkel for weak supervision, which

36:09 is a clever way to label your data for supervised learning without actually hand labeling it.

36:15 And I saw Matt discover a sub-module of Snorkel called Snorkel Dask.

36:20 So seeing, you know, the whole array, the whole ecosystem start to incorporate Dask is really

36:24 exciting, right?

36:25 Yeah.

36:25 Yeah, that's cool.

36:26 I had no idea they'd integrated it.

36:27 I was like, oh, I just learned something today.

36:29 They're using this stuff.

36:31 Great.

36:32 That's really, really neat.

36:33 And yeah, I'll put a link to your YouTube channel there so people can check it out.

36:37 I think it's really interesting that people are doing stuff on Twitch and programming,

36:40 all these live streaming places that I never associated with programming for the longest

36:45 time.

36:46 It was always, you know, World of Warcraft or some other sort of gaming thing that was in

36:50 a different world.

36:51 It all starts with gaming, man.

36:53 Gaming's ahead of the curve, right?

36:56 Switching, putting on my for-profit hat just for a second.

36:59 Like now in COVID times, like marketing is completely the Wild West again.

37:04 Like we've got to figure out how to reach our audience.

37:06 I used to go to conferences and just know everybody.

37:08 Now, how do we do that?

37:10 You know, Twitch or live streaming is like a thing to play with.

37:12 Probably not the right thing, but we're going to find out.

37:14 And experimenting and finding out the right way to engage people today is actually a really

37:19 fun and interesting problem for which there are no right answers yet.

37:22 Yeah, yeah.

37:22 It definitely is the Wild West and it's a fun time, right?

37:25 Like you said, it's unknown, but it's really cool to be able to just explore all these

37:29 different ideas.

37:30 Hey, Hugo, marketing idea for you.

37:32 Let's get on all the cool Python podcasts and see if they can cross-market to Coiled things.

37:37 I think that's a great idea.

37:38 Let me just note that down.

37:39 Yeah, great.

37:40 Yeah, well, I don't really know which ones you should talk to, but we talk after.

37:44 We embrace them after.

37:45 Who should we talk Python to?

37:46 I know, I know.

37:47 All right, so let's take it over to the business side a little bit.

37:52 We have Dask.

37:53 We talked about some of the things that it does.

37:55 One of the interesting challenges people often run into when they have these successful projects

38:00 is how do I take this really amazing and powerful thing that I built and allow me to keep working

38:07 on it by somehow getting paid to do so, right?

38:10 And I think probably the first step is we've seen a really big adoption of open source software in the enterprise.

38:17 It used to be, oh, what are you going to be doing?

38:20 Are you doing Java and Oracle?

38:22 Or are you doing Microsoft?

38:25 Which type of company are you at, right?

38:27 Like there's a lot of, there's even like Gartner reports talking about how open source software has not just lowered the cost for enterprises,

38:36 but actually increased the quality at the same time, which is like one of these double wins as well.

38:41 How do you guys see it?

38:41 So I'll add one more piece of the puzzle to the question in the sense that not only are we seeing a lot of people in enterprise adopting open source,

38:50 we're seeing, I'm actually going to paraphrase Brian Granger, who said this a couple of years ago at JupyterCon.

38:55 We're seeing a phase transition from a lot of people using open source individually in enterprise to large scale enterprise adoption of open source.

39:04 So having individuals use it within an organization, I think is, is something where the open source provides nearly all the value they, they need there.

39:12 But for an institution to actually adopt OSS at scale across an organization,

39:18 there are other moving parts that need to be in place that I think commercial companies can solve for.

39:24 So having said that, maybe Matt can talk about why we're starting Coiled.

39:27 Yeah.

39:27 I love the Brian phrase, phase transition language.

39:30 That sounds exactly like Brian, who's a physicist.

39:32 Physicist.

39:32 Physicist who then made Jupyter notebooks.

39:34 Yeah.

39:35 The way I think of it is that, sorry, so open source software won, right?

39:39 So like no new company is installing Oracle or SAS today.

39:42 They're installing open source.

39:43 But as we tossed out all of these sort of enterprise software companies and their enterprise software stacks,

39:49 we accidentally tossed out all the things they sold us that weren't software, right?

39:54 So like they would, Oracle would sell you, you know, their database things.

39:57 Like service level agreements.

39:58 Exactly.

39:59 How do, how do we get help if we're stuck and it's not working?

40:02 All that kind of stuff, right?

40:03 The sort of training safety blanket of commercial software.

40:06 Yeah.

40:06 Training, the safety blanket of enterprise support, but also, you know, hooking into enterprise office systems

40:12 or network security or sort of plugging all of those things together, integrating with other technologies.

40:18 And now that all these companies are adopting this open source stack, all of that stuff is missing.

40:22 And so it is again, the wild west.

40:24 And it's sort of a, it's an awkward place as a company to be adopting this technology.

40:28 I was in a conversation once with folks at NASA and they were saying, hey, look, what hosted notebook solution should we use?

40:35 At the time there was, there was Domino Data Lab.

40:37 There was Anaconda Enterprise was selling something.

40:39 There was AWS SageMaker.

40:41 And I said, hey, have you considered just using JupyterHub?

40:43 It's actually a really pleasant experience and it's free and it's easy to use.

40:48 And they said, no, but like we need to buy something.

40:50 Like we're not going to manage JupyterHub.

40:52 We need to buy something because we're NASA.

40:55 And the open source thought is just, it's free.

40:58 Why don't you just take it?

40:59 It's not so hard.

41:00 But of course for them, they don't want another puppy.

41:04 Right, exactly.

41:04 A thing they got to take care of and walk even in the rain.

41:07 Yeah.

41:07 And so that, I think filling in those gaps and providing all of that sort of supporting infrastructure,

41:12 which is both technological and cultural infrastructure, that I think is a great place where for-profit

41:17 companies can augment the open source software stacks that we have spent all this time building

41:23 and have really just out-competed the proprietary software stacks.

41:26 But of course, doing those two together requires a lot of nuance.

41:29 I think we haven't yet figured out the right models to be a good actor in the open source

41:34 community that we know and love, that's highly principled, and also build a successful business.

41:38 Yeah.

41:38 And that's fun.

41:39 And that's what we're excited about experimenting with.

41:41 Well, and I think it's about time, right?

41:43 The fact that these companies, for lack of a better word, they've started to feel the pain.

41:50 They've decided they want to take on open source.

41:52 They see the value and they're going to move there, but they're like, but there's still these

41:57 problems or these new problems that we don't know how to solve.

41:59 And so I think there's opportunities for people to come along, knowledgeable in open source stuff

42:04 to help give them the support they need so that they can go to the CTO or whatever, say,

42:10 we can build on this and here's how we're going to get the support that we need.

42:15 Not, you know, Sarah's pretty good with Dask.

42:17 She could probably fork it and fix whatever.

42:19 Well, they've done that for a while and they've run into problems.

42:23 Yes, I know.

42:23 To be clear, though, I think that isn't on them.

42:26 You said it's about time as though there's sort of blame.

42:28 But I think that blame is on us.

42:30 It's on the open source community.

42:32 Like we needed to learn how to speak corporation.

42:35 And like we're learning that.

42:36 We're learning how to write legal agreements that they can actually sign.

42:39 We're learning how to give them institutions that they can engage with on a peer-to-peer basis.

42:44 Right.

42:44 You know, Ford can't sign a contract with Dask.

42:46 Dask is not an entity that you can sue if they break your company.

42:50 Coiled, you can sue.

42:52 Like, great.

42:52 They can sign a contract with Coiled.

42:54 But I think that's really on us.

42:55 Our community needs to figure out how to build more of those institutions to engage better with

43:00 companies and eventually pull money out of them to feed back the open source innovation

43:05 that we've all been building really successfully over the last decade.

43:08 I think another part of the value commercial entities like Coiled can add is, as Matt mentioned,

43:12 the PyData ecosystem is, for the most part, developed by a huge number of scientists and

43:18 people who are answering scientific questions themselves.

43:20 So the PyData ecosystem is very good at helping scientists and data scientists answer scientific

43:27 research or industry-based questions.

43:29 It doesn't necessarily solve for all the cultural challenges that occur within organizations

43:33 and the enterprise.

43:35 Now, I actually haven't...

43:36 This isn't as well thought out as I hope it to be in the future.

43:40 But these types of cultural concerns, I think, are about making sure everyone across an organization

43:45 is happy, such as IT, checking all the boxes from IT that I mentioned before, like network

43:49 security, auth, usage controls, making sure management is on board with everything that's happened.

43:54 So advanced telemetry for management so that they can enable collaboration and make sure

43:59 they're aware of costs and usage and that type of stuff.

44:01 So not necessarily solving for the nodes of the network of an organization, but solving for

44:06 the edges as well, I think, is something key that entities like Coiled can help solve for.

44:11 Yeah, absolutely.

44:12 I definitely think that the donate button is not the right answer, right?

44:17 In the corporate world, making a donation, it doesn't even make sense.

44:22 How are the shareholders gaining value from this donation or whatever?

44:26 It just doesn't even make sense.

44:27 You can get pizza money pretty easily.

44:30 Yeah.

44:30 But going beyond, that's hard.

44:31 But yeah, having a career or a business around it is really tricky.

44:35 So I think you're right that the speaking corporate while keeping the ethos of open source

44:41 is pretty important.

44:41 So it sounds to me like you guys decided, let me guess, November to December time, you

44:48 decided there's this gap and you guys can start a company, which you founded in February, called

44:53 Coiled to start addressing some of these problems in the data science, distributed computing world,

44:59 people doing Dask stuff.

45:01 That's right.

45:02 All right.

45:02 Yeah.

45:03 So tell us, what is Coiled?

45:05 What slice of this enterprise challenge or corporate challenge are you taking on?

45:10 Yeah.

45:11 So Coiled is a for-profit company based around scaling Python or sort of based around Dask,

45:16 whose mission is to make computing accessible, right?

45:19 So we care very deeply about enabling individuals to scale out comfortably and also extending those

45:24 capabilities into corporations.

45:26 We interacted with hundreds of companies who are using Dask and all had more or less the

45:31 same needs.

45:32 We mentioned them before, all the needs that you have around adopting open source software,

45:35 training, support, deployment problems, security, auth, IT issues.

45:40 We were more or less begged by enough companies to make this thing that we decided to make it.

45:44 Yeah.

45:44 That's a good place to be.

45:46 Data scientists, as I mentioned before, require a lot of nimbleness and flexibility in the tools

45:52 they use.

45:52 And at the moment, they need to do all types of crazy stuff from, let's say they want to build

45:57 something at scale.

45:58 They need to, as we mentioned before, figure out Kubernetes and Docker containers.

46:02 And then like battle with AWS and all of this stuff.

46:06 So do a whole bunch of DevOps-y stuff, right?

46:08 That they shouldn't be doing.

46:10 Yeah.

46:10 And it probably makes their managers nervous because, oh yeah, all we had to do to make

46:14 this work is we created the Kubernetes cluster and then we put the data into this S3 bucket

46:18 and we were able to just talk to it and it was great.

46:20 Like, wait, wait, wait, wait, wait.

46:22 What's the access level or what's the publicity versus lockdown state of that S3 bucket?

46:30 Are we going to end up on the news next week because of this?

46:32 There's a story around Pangeo.

46:34 We mentioned Pangeo before.

46:35 The sort of Earth Science Collaboration Climate Science Group.

46:38 That was actually a really eye-opening experience for me because I really started to understand

46:42 what enterprise software looked like.

46:43 So what we did, I think we were giving a talk at the American Meteorological Society.

46:48 And over the weekend, a bunch of us hacked this thing together.

46:51 It was JupyterHub on the cloud with Dask enabled.

46:54 And so we computed this big data set.

46:57 And then we showed that everyone in the room could do the exact same computation on the cloud.

47:01 So I did this fantastic thing on stage.

47:03 And then we said, hey, everyone, open up your laptops.

47:05 You can do the same thing too.

47:06 And that blew up.

47:08 I think within about six months, we went from three groups in Columbia University, NCAR,

47:14 and Anaconda to about 50 groups.

47:16 So everybody was excited about this.

47:18 We sort of revolutionized how we do Earth System Science.

47:21 But then that's when the nightmare started.

47:23 Because suddenly we have all these groups trying to use this public access thing.

47:27 We had to figure out like, hey, wait a minute, this is costing us money.

47:30 So we had to implement user access controls.

47:32 Okay, who's whitelisted on this thing?

47:33 Who can use it?

47:35 Suddenly, there were different groups in different continents who wanted to use different regions on the cloud.

47:39 Now we have to have not one Kubernetes cluster, but a dozen Kubernetes clusters.

47:42 Everyone wanted their own software to be installed.

47:45 The oceanographers have a different software stack from the satellite imagery people, it turns out.

47:49 And so they were all asking me, hey, can you just pip install this one more package, please?

47:53 And eventually, you can't add all those packages together.

47:56 So now we had to make software environment switchers.

47:59 It turns out that no one had...

48:01 We couldn't figure out how to do AWS credentials.

48:04 So people couldn't store their data on the cloud.

48:06 It was read...

48:07 You could read any data, but you couldn't then transform it and then store it into your own buckets.

48:11 And so figuring out auth and passing that around.

48:14 And then security, it was open for an embarrassing amount of time before Bitcoin folks showed up.

48:20 It was all of those problems.

48:21 Eventually, I stepped back.

48:22 And as you sort of predicted, all the oceanographers started taking over Kubernetes.

48:27 And that was what happened.

48:28 And that experience really showed me personally all of the challenges around doing this kind of science or data science at scale, both on a technical perspective, but also on a social cultural perspective.

48:41 There's a ton more than just algorithms work to do.

48:44 Yeah, it seems so easy just to set it up.

48:46 And then there's just one more thing and one more thing.

48:48 And eventually, yeah, you've got oceanographers doing Kubernetes, which is fine.

48:52 They can do it.

48:53 But you shouldn't have to understand how your car works just to go get groceries, right?

48:56 Or to go to work, right?

48:57 If that's not your intent, it should just get out of the way.

49:00 I mean, some of them then did become auto mechanics as a result, which is a fine career change.

49:05 Yeah, absolutely.

49:06 If you want it.

49:07 But if you're trying to do something else, you don't want to work on your car.

49:09 You just want to get there.

49:10 Awesome.

49:10 Okay, so tell us exactly what is it that Coiled is doing so we can sort of understand where it fits in trying to solve this enterprise, open source business story.

49:19 If I want to be your customer, what can I do?

49:22 Is it a SaaS service?

49:23 Is it like hosted notebooks?

49:26 What do I do?

49:27 Yeah, so Coiled sells many things.

49:29 Mostly, we focus on the Coiled product, which is a hosted Dask solution.

49:33 So we solve all of the infrastructure problems around hosting Dask.

49:37 So if you want to, if you're on our beta, for example, you could actually, I invited you just half an hour ago, Mike.

49:42 Yeah, and I'm sorry, I didn't get to it quick enough to sign up and play with it, but I wish I had.

49:48 So if you did, you could pip install coiled wherever you run Python.

49:51 It could be in a Python session, could be in Jupyter Notebook, could be on the cloud.

49:55 And then you would ask for a cluster.

49:57 You'd say coiled.Cluster.

49:59 And that would allocate a bunch of machines for you on the cloud somewhere.

50:02 You would then connect to those machines in a super secure, super credentialized, very nice way.

50:07 And you'd be off to the races.
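
The flow described here (install the client, ask for a cluster, connect to it) might look roughly like the sketch below. It assumes the `coiled` and `dask.distributed` packages and a Coiled account. The function name `start_remote_session` and the `n_workers` value are illustrative choices, not details from the conversation.

```python
def start_remote_session(n_workers=10):
    """Ask Coiled for a cluster and attach a Dask client to it.

    The imports live inside the function so this file can be loaded
    even where `coiled` is not installed (pip install coiled).
    """
    import coiled
    from dask.distributed import Client

    # Allocates machines in the cloud; requires a Coiled account.
    cluster = coiled.Cluster(n_workers=n_workers)
    # From here on, Dask computations run on the remote workers.
    client = Client(cluster)
    return cluster, client
```

Calling `start_remote_session()` would start billing-relevant resources, so it is left uncalled here.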

50:09 So if you start our beta, we sort of try to minimize the time of signing up to running computations on the cloud.

50:16 Currently, the number is around two or three minutes.

50:18 We can get it down to about one minute, I think.

50:20 Just enough time to make a cup of coffee.

50:22 Just enough time to make a cup of coffee.

50:24 Yeah.

50:25 If you have a Keurig with the pre-made filter thing, if you got to get the coffee out, maybe it won't be nice.

50:29 Yeah.

50:30 So from your perspective, it just looks like Dask.

50:33 You didn't have to set up Kubernetes.

50:34 But there's a bunch of other stuff around that that we manage.

50:37 So first, managed Dask clusters.

50:38 Second, customizable software environments.

50:41 You want to use your favorite software libraries.

50:44 And we're going to help you manage that between your local environment and also in Docker images on the cloud.

50:49 We also help you share that with your colleagues.

50:51 So often we find a sort of one person who manages software and they share it with other folks.

50:55 And then third, user management, cost management, and telemetry.

50:58 So Coiled will help you understand what's going on.

51:01 Or more likely, we'll help your boss understand what you're doing and make sure that you don't break the bank.

51:06 Yeah.

51:07 I've definitely gotten some very high cloud computing bills before.

51:12 Although in my defense, I said, this is going to be very expensive.

51:15 We should use this other service.

51:16 No, no.

51:16 We have a contract in agreement with this cloud service.

51:19 So we're going to use it.

51:20 Then I got a message.

51:22 Why did we spend $15,000 on that last month?

51:25 Yeah.

51:25 Because you told me to use it.

51:26 It would have been $500 over there.

51:28 But these surprises are unwelcome.

51:30 Yeah.

51:30 And so just elevating that to you visibly, right?

51:33 There's a page which shows all the running clusters.

51:34 It shows how much they're costing per hour and how much they've cost total.

51:37 You can aggregate those costs across users.

51:39 You can set policies.

51:41 Hey, I mean, by default, if you leave a cluster running, this is a super common mistake.

51:45 We'll just shut it off after 20 minutes for you.

51:47 So there's lots of things that you can do that save you a ton of money if you're doing them correctly.
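
The cost page and the automatic shutdown policy described above can be pictured with a small, self-contained sketch. This is not Coiled's implementation. The record fields, user names, and dollar figures are invented for illustration, and it assumes the 20-minute limit refers to idle time.

```python
from dataclasses import dataclass

IDLE_LIMIT_MINUTES = 20  # the default mentioned in the conversation


@dataclass
class ClusterRecord:
    owner: str
    cost_per_hour: float   # dollars per hour (illustrative)
    hours_running: float
    minutes_idle: float

    @property
    def total_cost(self) -> float:
        return self.cost_per_hour * self.hours_running


def cost_by_user(clusters):
    """Aggregate total spend per user across all their clusters."""
    totals = {}
    for c in clusters:
        totals[c.owner] = totals.get(c.owner, 0.0) + c.total_cost
    return totals


def clusters_to_shut_down(clusters, idle_limit=IDLE_LIMIT_MINUTES):
    """Return the clusters that have been idle for at least the limit."""
    return [c for c in clusters if c.minutes_idle >= idle_limit]


clusters = [
    ClusterRecord("ana", cost_per_hour=2.0, hours_running=3.0, minutes_idle=0),
    ClusterRecord("ana", cost_per_hour=4.0, hours_running=0.5, minutes_idle=45),
    ClusterRecord("raj", cost_per_hour=1.5, hours_running=2.0, minutes_idle=5),
]
print(cost_by_user(clusters))
print([c.owner for c in clusters_to_shut_down(clusters)])
```

Only the second cluster has been idle past the limit, so it is the only one flagged for shutdown.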

51:51 So Coiled is just our way of running Python at scale in the cloud in a way that's a bit opinionated.

51:58 It is the way that I find to be the best way to run that.

52:01 And it also allows you to be opinionated as a user.

52:03 You can do it from the comfort of wherever you do your data science, right?

52:07 So if you like doing it in JupyterLab or Jupyter Notebooks or IPython, you can do it all from there.

52:12 So we're trying to meet end users where they are.

52:15 So we will spin up all the AWS stuff for you.

52:18 And you don't need to go there.

52:19 And I think one of the biggest pain points, which Matt hit the nail on the head with, is making sure that you can move seamlessly from local computation to cluster computation, where your environments and your data are all matched up, which is super exciting.

52:33 Maybe I'm doing some work in Jupyter, working with Dask to try to take advantage of the 16 cores that my machine has or whatever.

52:42 And at some point, I decide we're going to give it more data.

52:46 We're going to productionize it.

52:47 And it's going to actually need to be faster or work with more data.

52:50 I can basically switch where I'm talking to.

52:53 I can import coiled, say, give me a cluster and go here.

52:56 And that's pretty much what that switch looks like.

52:59 Yeah.

52:59 You can run Coiled from anywhere

53:00 you can run Python.

53:01 In that story, you're actually probably not starting with Dask.

53:03 You're probably starting with pandas on a very small sample of data.

53:06 In that same notebook, in that same process, you say, hey, wait, it's time to scale up.

53:09 Import Dask.

53:10 Use Dask locally.

53:11 You know, use your 16 cores.

53:13 Say, wait, actually, this is running a little slow.

53:15 I want to kick it up a notch.

53:16 I want to go to the cloud or go to your local Kubernetes cluster or whatever.

53:20 And then, you know, you import coiled, hit a button, and now you're running your computation

53:24 elsewhere on some remote cluster of a hundred machines, all in the same user flow.
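
The progression described here, pandas on a sample and then Dask on the same logic, can be sketched as follows. The column names `key` and `value` and the file paths are hypothetical. The sketch assumes `pandas` and `dask[dataframe]` are installed.

```python
def mean_by_key_pandas(csv_path):
    """Stage 1: pandas on a small sample that fits in memory."""
    import pandas as pd

    df = pd.read_csv(csv_path)
    return df.groupby("key")["value"].mean()


def mean_by_key_dask(csv_glob):
    """Stages 2 and 3: the same logic expressed with Dask.

    With no distributed client, Dask uses its local scheduler and
    your own cores. If a dask.distributed.Client connected to a
    remote cluster has been created first, compute() runs there.
    """
    import dask.dataframe as dd

    df = dd.read_csv(csv_glob)                    # lazy; may span many files
    result = df.groupby("key")["value"].mean()    # still lazy
    return result.compute()                       # triggers the actual work
```

The groupby line is the same in both functions, which is the point being made: the analysis code barely changes, only where it runs. When running on a remote cluster, the path passed in has to be readable by the workers, for example an object-store URL rather than a local file.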

53:28 You mentioned before, like, hey, are you hosting notebooks?

53:30 And actually, intentionally, we are not.

53:32 We do not want to own the user's environment because your user environment is going to look

53:38 so much different from everybody else's.

53:40 Maybe you don't use Jupyter.

53:41 Maybe you use PyCharm or VS Code.

53:43 Maybe you're actually just like running this in a cron job.

53:45 We don't want to own everything.

53:47 We just want to sprinkle in robust, secure parallelism and scalability into your existing

53:53 workflows, right?

53:54 So this is, again, in sort of the Dask ethos of invent nothing and being minimally creative.

53:58 Minimal invention.

53:58 Yeah.

53:59 Minimal creativity.

54:02 Yeah.

54:03 The principle of minimal creativity.

54:04 Yeah.

54:05 So you don't have to do very much to switch to using Coiled.

54:07 It's a small change.

54:09 So you talk about the different places I might, like, I could be using JupyterLab, which would

54:13 be cool.

54:13 I could be using VS Code or PyCharm, and that would be cool.

54:16 Another place that seems like it might be really interesting to use this is I'm in a FastAPI

54:21 backend, and I want to answer quickly with a lot of data, right?

54:26 It could be, I guess what I'm getting at is like production.

54:29 Yeah, absolutely.

54:29 You can use Coiled from anywhere you can use Python.

54:32 What's the story of data?

54:33 So, you know, one of the reasons I might switch away from just running Pandas or Dask would

54:38 be I have a lot of data.

54:39 At the same time, how do I get the data to you guys, to your clusters?

54:44 Because that, like I said, could be the slow thing moving around a cluster.

54:48 If you got to, like, get that out of one data center to another, that could even make it worse.

54:52 Yeah.

54:52 What we often find is that if you have, you know, gigabytes of data on your machine, it's

54:57 more likely that you've already downloaded the data from somewhere else.

54:59 If you had your data on the cloud or in your local data center, it would have been better

55:03 just not to download it.

55:04 And so Coiled really helps you run your computations in that remote space very comfortably.

55:09 So, you know, if you have data on AWS S3, you know, there's no reason to download it, open

55:14 up Coiled, and you can just run on the data that's there.

55:16 Right.

55:16 Get an authenticated URL, like a signed URL over to that S3 bucket or something.

55:21 I mean, you already have the URL, right?

55:23 Yeah.

55:23 I mean, what we do is we scrape your local AWS credentials, generate a secure token, pass that

55:28 to all the Coiled workers.

55:29 So those Coiled workers look like you.

55:32 You then can operate on that data as though you were sitting, again, you have the ergonomic

55:36 experience of being on your laptop, but you have the computational experience of being on

55:40 the cloud.

55:40 And it feels much more proximate.

55:42 Yeah.

55:42 Okay.

55:43 That's really neat.

55:44 Where do these Kubernetes clusters that run my Dask stuff, where do they live?

55:49 Yeah.

55:49 So there's two answers to that.

55:51 One is you might have your own set of machines on-prem, and you want to run Coiled there.

55:56 This is actually really common today with GPU clusters.

55:58 There's really people buying all the GPU clusters.

56:00 They don't really know how to handle them.

56:01 Coiled is a great way for that.

56:03 But if you're on AWS, so we currently support AWS and Kubernetes, going to other clouds whenever

56:08 that makes sense for us.

56:09 If you're on AWS, they're running on the cloud.

56:12 We actually don't use Kubernetes.

56:13 We chose to use Elastic Container Service, ECS, which is like an older, slightly simpler system.

56:19 They could be running on our account if you want to use our hosted thing, or we can deploy

56:23 it inside of your AWS account.

56:24 Okay.

56:25 Yeah.

56:25 And I guess if it deploys on your AWS account, you could say, I'd like this to run in the

56:29 Bahrain data center or the Sydney one or wherever.

56:33 Sure.

56:33 I mean, we could also do that anywhere.

56:35 You can run like our public version of Coiled.

56:38 So we host a public version, which is designed mostly for individuals to increase data access.

56:43 And that can run on any region.

56:45 So because we're using systems like ECS, we don't have to maintain Kubernetes clusters

56:49 everywhere.

56:50 We can run your code anywhere AWS is running.

56:52 Can you also run against an arbitrary Kubernetes cluster?

56:55 So like if I have set up a Kubernetes cluster at Linode, and I give you the connection and

57:00 credentials to the Kubernetes cluster, could I say run it there?

57:03 Yeah, you would want to run Coiled very close to that system so that we have the right access

57:07 everywhere.

57:08 Oh, yes.

57:08 Yeah.

57:09 Okay.

57:09 Yeah.

57:10 That sounds quite neat.

57:11 It's fairly close conceptually to an infrastructure as a service sort of model in terms of like

57:17 what I'm buying, what I'm getting, what I'm paying for, but specifically focused on helping

57:22 data scientists not actually care about the infrastructure.

57:24 I'm asking those like, what do I pay for?

57:26 Right.

57:26 Do I pay for like the CPU cycle computation, sort of like a Lambda, like the function call?

57:31 Do I pay for the machines that are running?

57:33 That is a super interesting question for which honestly, I don't have a good answer yet.

57:37 We are like talking about this every week right now.

57:40 Right now, you either pay, it's free for individuals currently, we're in beta, or like big companies

57:45 are signing us some check with some other pricing involved.

57:48 Yeah.

57:48 I actually recently wrote a blog post on cloud pricing, which I think is broken in lots of

57:51 ways.

57:52 Yeah.

57:52 The obvious thing to do is to charge you a surcharge on top of whatever your cloud provider

57:56 charges you.

57:57 If Amazon charges you a hundred bucks, we'll charge you a hundred bucks too.

58:00 That's kind of like the Databricks model.

58:01 Databricks charges about a hundred percent markup.

58:03 AWS charges about like a 40 or 50 percent markup.

58:06 I actually hate this model because it totally misaligns incentives.

58:11 So if we do this, then I'm incentivized to make your Dask workload as inefficient as possible

58:17 and use as many resources as possible.
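
The incentive problem with a percentage surcharge is easy to see with a little arithmetic. The sketch below uses the 100% markup from the example just given. The two cloud bills are made-up figures for the same workload run efficiently and wastefully.

```python
MARKUP = 1.0  # 100% surcharge: the vendor bills as much again as the cloud does


def vendor_revenue(cloud_bill, markup=MARKUP):
    """What the vendor earns under a percentage-of-spend model."""
    return cloud_bill * markup


def customer_total(cloud_bill, markup=MARKUP):
    """What the customer pays: the cloud bill plus the surcharge."""
    return cloud_bill + vendor_revenue(cloud_bill, markup)


efficient_bill = 100.0  # workload tuned to use few resources (illustrative)
wasteful_bill = 300.0   # same workload, run inefficiently (illustrative)

for label, bill in [("efficient", efficient_bill), ("wasteful", wasteful_bill)]:
    print(label, vendor_revenue(bill), customer_total(bill))
```

Under this model the vendor earns more when the customer's workload wastes more, which is the misalignment being described.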

58:19 You know what?

58:20 Reserved instances and spot instances, we don't need any of that.

58:24 Give me the most expensive flavor of a large three or whatever the machine name is, right?

58:30 Like that would be the incentive, which is actually the exact opposite of what you would hope.

58:33 It's like if Toyota or Ford were incentivized to make you burn more gas, right?

58:38 It's as though like the car companies got a kickback from all the oil companies for burning

58:42 gas and you'd never get the Prius in that situation or the Tesla.

58:45 And we want to make the Tesla, right?

58:47 We want to make the super sexy, super ergonomic experience that's highly efficient and saves

58:52 you money in the long run.

58:52 So we're still thinking about that.

58:54 If your listeners have thoughts on the right pricing model, we'd love to hear from you at

58:58 hello at coiled.io.

59:00 It is tricky.

59:01 I don't know.

59:01 I mean, I was just, I'm doing this new project where the video players for my online courses,

59:07 I would like, so this is a whole like custom player, just HTML5 and JavaScript, Python.

59:11 And I want when you hover over the little scrubber to like change the time, it to show

59:15 you a preview of each thing.

59:17 And I, okay, well, what I can do is I can re-encode all the videos really small and then

59:22 like put a little hidden player and like just set the time and move that around.

59:25 It all works beautifully, but I've got 200 hours of high def video that I need to turn

59:31 into that.

59:32 And so I literally sat down and said, okay, well, how much is it going to cost me to feed

59:35 that through elastic transcoder at AWS and in bandwidth and get it back and get it.

59:40 So I actually have that as those pictures I could show like, all right, that's going to

59:44 cost about $500 or $600.

59:45 All right.

59:46 This feature, having the data to back this feature is worth $600 to me.

59:50 I'll do it.

59:50 But I just think this, this thinking about how do I pay for a thing and the trade-offs is

59:55 really interesting.

59:57 Yeah, no, I'm putting on my sales hat for a second too.

01:00:00 Like that is a very clear win where like you need video.

01:00:03 It will give you this much money in sales.

01:00:05 Great.

01:00:06 It makes sense.

01:00:06 When you're talking about accelerating data science in a company, like the value that you're

01:00:11 providing is so far removed.

01:00:13 Yeah.

01:00:13 So like, you know, if I was going to sell you an efficient refrigerator, I would say,

01:00:17 Hey, look, it's going to save you, you know, $200 in electricity bill over the year,

01:00:21 over 10 years is $2,000.

01:00:23 Great.

01:00:24 Pay me a thousand dollars.

01:00:25 That's a very clear calculus.

01:00:27 Yeah.

01:00:27 But when it's like, Oh yeah, I'm going to help you scale Python.

01:00:29 That's like not quite as clearly valuable.

01:00:31 We need to get into a longer conversation of like, well, why do you care about Python?

01:00:34 Why do you care about data science?

01:00:35 How much value does that bring to you?

01:00:37 What actually are you trying to solve for?

01:00:39 Oh, you're trying to increase, you know, retention in your customers.

01:00:43 Like it's like, you have to get a lot more in depth with your customers to figure out what

01:00:48 the value proposition is of like more ergonomic scaled Python.

01:00:52 You get into things like, well, what if we could get your recommender system, the next

01:00:57 version into production a month sooner?

01:00:59 Yeah.

01:01:00 How much money would that get you?

01:01:01 And what if we could get you answers 100 milliseconds quicker?

01:01:06 How many fewer bouncing users would you have leaving your site?

01:01:09 Like these are, they're meaningful, but they're, they're not as clear cut.

01:01:13 So they are interesting discussions.

01:01:14 I think.

01:01:15 I think you've raised several important concerns there.

01:01:17 And one in particular is like, there are a whole bunch of companies who have hired lots

01:01:21 of data scientists.

01:01:22 Cause that's what you do these days.

01:01:23 And they actually haven't seen the value delivered.

01:01:25 They've got data scientists who are getting, you know, really serious, serious salaries.

01:01:29 And the data scientists will recognize the value of Coiled, but the economic buyer in an organization

01:01:34 will be like, wait a second, we're paying you all this money.

01:01:36 Like, why are we paying more for tooling?

01:01:38 We thought you used all this OSS stuff.

01:01:40 So there's a slight mismatch there as well, which is something culturally that we're, we're

01:01:45 working to figure out as well.

01:01:46 Well, I think just putting on my, like, I used to work at big company X or whatever thinking

01:01:52 hat, you know, from the people who make the decisions, it also avoids security leaks

01:01:58 and downtime.

01:02:00 And, you know, there's, there's not just always the positive selling point or the positive advantages.

01:02:05 There's the, you know, you were avoiding these three really bad potential outcomes by productizing

01:02:11 your data science and you can do it quicker.

01:02:13 Right.

01:02:14 So these are also valuable.

01:02:15 Yeah.

01:02:16 Now the things that we've seen that work really well so far are what you just said.

01:02:19 I mean, actually the biggest one is Databricks is expensive and they like, they're expensive,

01:02:23 not in a good way.

01:02:24 They're expensive because we leave machines on all day.

01:02:26 If it was a better experience, that won't happen.

01:02:28 And so we want that.

01:02:29 We want to reduce our costs.

01:02:30 It must feel really bad to just waste money, right?

01:02:32 To literally just leave it running and not actually even because I, it just took a while.

01:02:37 Right.

01:02:37 Yeah.

01:02:37 There's what Hugo was talking about of like a common theme we hear about is like, I've got this

01:02:42 data science team.

01:02:43 They use pandas and Python and PyTorch on a single machine.

01:02:45 I've got like the scalable data engineering team and there's a lot of crosstalk between

01:02:50 them.

01:02:50 That's really slow.

01:02:51 And we want to be able to allow our data science team to go directly to scale without having

01:02:56 to interface with a different team.

01:02:57 Because that communication, just that iteration cycle is just killing our performance.

01:03:02 Another common sort of theme is, oops, we bought a bunch of hardware.

01:03:05 We bought like, you know, 50 GPU boxes and we have no idea how to use them.

01:03:09 Well, like there's only so many TensorFlow Keras runs we can do.

01:03:12 We've seen like the Dask Rapids XGBoost thing.

01:03:15 We want that.

01:03:16 But like, we just have like a bunch of hardware sitting in a rack somewhere.

01:03:19 Can you sell us a product to manage that rack for us and give us a lot of utilization,

01:03:25 a lot of value out of our pre-existing purchase?

01:03:28 And those are the things that really tend to hit home.

01:03:30 Well, I do hope that you guys have a lot of success because I'm always excited to see some

01:03:36 interesting company taking a really powerful open source project and just adding value to

01:03:43 that space and being successful.

01:03:45 So it seems like you're on a good track.

01:03:47 Thanks.

01:03:48 Yeah.

01:03:48 We've been on the track for a while.

01:03:49 I mean, Dask is sort of unique among open source projects in the Python space.

01:03:52 We've always had really good funding.

01:03:55 Like there are a lot of funded Dask developers.

01:03:57 And it's because we've always been sort of scrambling a little bit to get money.

01:04:00 And this is maybe just the next step in that process.

01:04:02 Yeah.

01:04:02 Cool.

01:04:03 We're going a little long in time, but I did want to just ask you really quickly some meta

01:04:07 questions.

01:04:07 So are you guys using Coiled to build Coiled?

01:04:11 Is there like, what's the Python inside story, the Coiled Dask inside story?

01:04:16 Right now we're not.

01:04:18 This actually did come up recently.

01:04:19 We actually find ourselves building a lot of conda environments for our users.

01:04:23 And that is actually something that we've got, you know, web backend.

01:04:26 We've got the scalability need.

01:04:28 Dask is like an obvious way to scale that out.

01:04:31 Currently we're using a single machine for that.

01:04:32 Our needs aren't that big.

01:04:33 But yeah, that would be the next step as a build farm.

01:04:36 And you guys are using Django.

01:04:37 Is that right?

01:04:38 Sort of the API layer?

01:04:39 Yeah.

01:04:40 So Coiled actually looks kind of like a normal vanilla Python web application.

01:04:45 It's, you know, Django and Postgres and, you know, Amazon ECS, as we mentioned.

01:04:49 And it's been great.

01:04:50 It's actually, it's my first time using Django.

01:04:51 I'm like a decade or two late.

01:04:53 But it's been fantastic seeing how much is ready for us out of the box.

01:04:57 So yeah, couldn't be happier.

01:04:59 Yeah, that's great.

01:05:00 And it sounds like there's a non-trivial team size.

01:05:04 I know a lot of people are saying I'm possibly looking for remote work.

01:05:07 Is that a thing that you guys are doing?

01:05:09 Are you looking for more people?

01:05:10 Are you kind of steady state until you get a little farther?

01:05:11 What's the status there?

01:05:13 Yeah.

01:05:13 So Coiled is a fully remote company.

01:05:15 We were actually born in the era of COVID, which is fun.

01:05:18 We've always been remote.

01:05:19 We've always worked remotely.

01:05:20 Yeah.

01:05:20 We're maybe like five or six full-time and two or three part-time right now.

01:05:25 Right now, I think I actually kind of like staying small.

01:05:28 We raised a bunch of money, but we're sort of not in a hurry to burn it.

01:05:31 By the time this episode airs, yeah, I wouldn't be surprised if we're looking for more folks.

01:05:34 If you go to coiled.io slash jobs, I think, or just coiled.io, we'll have active job listings.

01:05:40 Yeah.

01:05:41 And that's both data science folks and web folks and DevOps people.

01:05:45 It's a nice group of needs.

01:05:47 Excellent.

01:05:47 Well, like I said, really interesting project.

01:05:49 And I hope you guys make a lot of progress with it.

01:05:51 Now, before you get out of here, let me ask you the final two questions.

01:05:54 If you're going to write some Python code, what editor do you use?

01:05:57 Do you want to go first?

01:05:58 I use JupyterLab these days.

01:06:00 Right on.

01:06:00 Matthew?

01:06:01 So if I write Python code, I'm actually going to give a cheeky answer.

01:06:04 I'll say GitHub.

01:06:05 So most of the Python code that I write today is indirect.

01:06:08 I mostly try to nerd snipe other people into writing code.

01:06:11 So maybe this is like the community maintainer's answer.

01:06:14 Yeah.

01:06:14 GitHub and I'll say, I'll put in a plug for Whereby as well, which is my favorite video conferencing tool.

01:06:20 Oh, super.

01:06:21 Okay.

01:06:21 And notable PyPI package.

01:06:23 We've already got pip install dask, pip install coiled.

01:06:27 Right?

01:06:28 Those are good ones.

01:06:28 But, you know, something like, oh, this is really cool.

01:06:30 It's not super well known.

01:06:32 Maybe people should check out X.

01:06:34 I'll point to Napari, which I mentioned earlier.

01:06:36 Yeah.

01:06:36 The image viewer.

01:06:37 It's really cool.

01:06:38 It's actually fun seeing a rich client application written in Python.

01:06:41 Didn't know that was cool anymore.

01:06:43 I'll point to Snorkel, which I mentioned earlier, which is for weak supervision.

01:06:47 And it helps you build training data.

01:06:49 I think training data is one of the biggest challenges we have in the ML space.

01:06:53 Like I think Scale AI raised a hundred million dollars last year or something like that.

01:06:58 And they still hand label data for people.

01:07:01 Like they're called Scale AI, but they hand label data.

01:07:05 ML Turk.

01:07:06 Yeah.

01:07:06 Mechanical Turk is still like a huge thing.

01:07:08 Right?

01:07:09 So any way we can figure out to do this programmatically is really cool.

01:07:13 And the Snorkel API allows domain experts to come in and write, using basic functions and decorators,

01:07:19 heuristics for labeling data, which is super cool.

01:07:22 That's what I'll shout out today.
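
To make the idea of labeling heuristics written as decorated functions concrete, here is a minimal plain-Python sketch. It does not use Snorkel's actual API: the `labeling_function` decorator, the label constants, the three heuristics, and the majority vote are all hypothetical stand-ins chosen for illustration, and Snorkel itself may combine the votes in a more sophisticated way.

```python
from collections import Counter

# Hypothetical label values for a toy spam/ham task.
ABSTAIN, HAM, SPAM = -1, 0, 1

# Registry of heuristics; filled in by the decorator below.
LABELING_FUNCTIONS = []


def labeling_function(func):
    """Register a heuristic so it is applied to every example."""
    LABELING_FUNCTIONS.append(func)
    return func


@labeling_function
def lf_contains_link(text):
    # Messages with links are guessed to be spam; otherwise no opinion.
    return SPAM if "http" in text.lower() else ABSTAIN


@labeling_function
def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN


@labeling_function
def lf_short_message(text):
    # Very short messages are guessed to be legitimate.
    return HAM if len(text.split()) < 5 else ABSTAIN


def weak_label(text):
    """Combine the heuristics' votes with a simple majority vote."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


texts = [
    "Claim your prize at http://example.com now please",
    "see you soon",
    "The quarterly report is attached for your review",
]
labels = [weak_label(t) for t in texts]
print(labels)
```

Each heuristic is allowed to abstain, which is what lets a domain expert write many narrow rules without needing any single rule to cover every example.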

01:07:23 All right.

01:07:24 Super cool.

01:07:24 I guess I'll throw in as well.

01:07:26 Missingno.

01:07:27 Missing N-O (missingno on PyPI), which is a quick visualization for missing data in pandas and stuff like that.

01:07:33 Ask what's missing.

01:07:34 You'll get a cool little graph.

01:07:36 Like usually if the state is missing, also, I don't know, phone is missing.

01:07:40 Like all sorts of cool stuff.

01:07:41 So people can check that out as well.
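
As a rough illustration of what "ask what's missing" means, the sketch below builds a text version of a missing-data matrix in plain Python. It is not missingno's API: the sample rows, the column names, and both helper functions are made up for this example, and the real package draws proper plots on top of pandas data.

```python
# Made-up records; None marks a missing value.
rows = [
    {"name": "Ada", "state": "OR", "phone": "555-0100"},
    {"name": "Grace", "state": None, "phone": None},
    {"name": "Linus", "state": "WA", "phone": None},
    {"name": "Guido", "state": None, "phone": None},
]
columns = ["name", "state", "phone"]


def nullity_matrix(rows, columns):
    """One string per row: '#' where a value is present, '.' where missing."""
    return [
        "".join("#" if row.get(col) is not None else "." for col in columns)
        for row in rows
    ]


def missing_counts(rows, columns):
    """Number of missing values in each column."""
    return {col: sum(row.get(col) is None for row in rows) for col in columns}


for line in nullity_matrix(rows, columns):
    print(line)
print(missing_counts(rows, columns))
```

In this toy data, every row with a missing state also has a missing phone, which is the kind of pattern such a picture makes easy to spot.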

01:07:42 All right.

01:07:43 So final call to action.

01:07:44 People are interested in Dask.

01:07:46 They're interested in Coiled.

01:07:47 They want to parallelize some things.

01:07:49 What do you guys tell them?

01:07:50 I would love.

01:07:51 So I've got a call to action for Coiled.

01:07:53 Matt may have one for Dask.

01:07:55 All the features of Coiled we've just described are in rapid development.

01:07:58 And we'd love to involve anybody who likes to break things at scale in this development cycle.

01:08:03 So go to coiled.io, sign up for our beta.

01:08:06 And take our product for a test drive, crash it, and let's figure out how to build it together.

01:08:10 So I'd love to involve you in that conversation.

01:08:12 Excellent.

01:08:12 Matthew?

01:08:13 I'll have you go with that one.

01:08:14 I'm actually going to...

01:08:15 Sorry, I'm going to go out of thinking this.

01:08:16 Mention one thing on that.

01:08:17 It is super fun to run a beta.

01:08:21 As an open source software developer, I'm used to never seeing how people use my software.

01:08:26 And running a beta on a public service, there's a bunch of creepy spyware I can put on our website,

01:08:31 which tells me exactly what you've done, which we'll take off at some point.

01:08:35 It is amazing getting this kind of immediate feedback.

01:08:38 Yeah.

01:08:39 A different level of visibility, right?

01:08:41 Into what people use and telemetry and whatnot, right?

01:08:44 So yeah, I'll just echo Hugo.

01:08:45 Go play with the beta.

01:08:47 It's super fun, mostly for us, but also for you.

01:08:50 And yeah, you can play with that on the cloud.

01:08:52 It's cool.

01:08:53 Is there a free thing I can do?

01:08:55 Everything is free right now.

01:08:56 Yeah.

01:08:56 We'll limit how much you can spend, because that's one of the features that we care about.

01:08:59 But yeah, I think we just limit you to 100 cores at a time.

01:09:03 So you can't go totally wild, but you can do lots of fun things.

01:09:05 Just 100 cores, huh?

01:09:07 That's awesome.

01:09:07 Just 100 cores.

01:09:09 Yeah, that's really cool.

01:09:10 All right.

01:09:10 Well, you guys, thanks again for being on the show.

01:09:12 Good luck with Coiled.

01:09:14 It's a nice, natural extension of Dask, I think.

01:09:18 So excited to see you doing it.

01:09:19 Yeah.

01:09:20 Thanks, Michael.

01:09:20 Go sign up for the beta.

01:09:21 Thanks so much for having us on the show, Michael.

01:09:23 It's always fun.

01:09:24 Yeah, you bet.

01:09:25 Bye, guys.

01:09:25 This has been another episode of Talk Python To Me.

01:09:29 Our guests on this episode were Matthew Rocklin and Hugo Bowne-Anderson.

01:09:33 And it's been brought to you by Brilliant.org and Monday.com.

01:09:38 Brilliant.org encourages you to level up your analytical skills and knowledge.

01:09:41 Visit talkpython.fm/brilliant and get Brilliant Premium to learn something new every day.

01:09:48 Build your idea for an app and get it in front of hundreds of thousands of users on day one.

01:09:52 Start building today at the Monday.com marketplace by visiting monday.com/python.

01:09:58 Want to level up your Python?

01:10:00 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

01:10:06 Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python.

01:10:13 And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle.

01:10:18 It's like a subscription that never expires.

01:10:20 Be sure to subscribe to the show.

01:10:22 Open your favorite podcatcher and search for Python.

01:10:25 We should be right at the top.

01:10:26 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

01:10:35 This is your host, Michael Kennedy.

01:10:37 Thanks so much for listening.

01:10:38 I really appreciate it.

01:10:39 Now get out there and write some Python code.

