Parallelizing computation with Dask

Episode #207, published Sun, Apr 14, 2019, recorded Wed, Feb 20, 2019

What if you could write standard numpy and pandas code but have it run on a distributed computing grid for incredible parallel processing right from Python? How about just splitting it across multiprocessing to escape the limitations of the GIL on your local machine? That's what Dask was built to do.

On this episode, you'll meet Matthew Rocklin to talk about its origins, use-cases, and a whole bunch of other interesting topics.
Dask: dask.org
Matthew on Twitter: @mrocklin
Matthew's website: matthewrocklin.com
Dask examples: github.com
PyCon presentation: youtube.com
PyCon presentation slides: matthewrocklin.com/slides
Episode transcripts: talkpython.fm

Episode Transcript

00:00 What if you could write standard NumPy and Pandas code, but have it run on a distributed

00:04 computing grid for incredible parallel processing right from Python? How about just splitting the

00:09 work across multi-processing to escape the limitations of the GIL right on your local

00:13 machine? That's what Dask was built to do. On this episode, you'll meet Matthew Rocklin to

00:17 talk about its origins, use cases, and a whole bunch of other interesting topics.

00:21 This is Talk Python to Me, episode 207, recorded February 20th, 2019.

00:27 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the

00:44 ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter

00:49 where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm

00:54 and follow the show on Twitter via @talkpython.

00:56 Hey folks, a quick announcement before we get to our conversation with Matthew. People

01:01 have been enjoying the courses over at Talk Python Training for years now, but one shortcoming

01:06 we've been looking to address is to give you offline access to your content while you're

01:10 traveling or if you're just in a location with spotty internet speeds. We also want to

01:14 make taking our courses on your mobile devices the best they can be. That's why I'm thrilled

01:19 to tell you about our new apps, which solve both of these problems at once. Just visit

01:23 training.talkpython.fm/apps and download the Android version right now. The iOS version

01:28 is just around the corner. Should be out in a week or so. You'll have access to all of your

01:33 existing courses. But we've also added a single tap feature to allow you to join our two free

01:38 courses, the Responder Web Framework mini course and the MongoDB quick start course. So log in,

01:44 tap the free course, and you'll be right in there. Please check out the apps and try one of the

01:48 free courses today. Now, let's talk Dask with Matthew. Matthew, welcome to Talk Python to me.

01:54 Thanks, Michael. I've listened to the show a lot. It's really an honor to be here. Thank you for

01:57 having me. It's an honor to have you as well. You've done so much cool work in the distributed

02:02 computing space, in the data science and computation space. It's going to be super fun to dig into all

02:08 that and especially dig into Dask. But before we get to that, of course, let's start with your story.

02:13 How'd you get into programming in Python?

02:14 Yeah. So I got into programming originally actually on a TI basic calculator. So I was in math class,

02:20 plugging along, spent my few thousand hours not being a good programmer there. Picked up some C++

02:25 in high school, did some MATLAB and IDL and C# in college, did some engineering jobs,

02:32 did some science. And then in grad school, just tried out Python on a whim. I found it was all of

02:37 those things at the same time. So it was fun to work in object-oriented mode, like with C#.

02:43 It was fun to do data analytics, like with MATLAB. It was fun to do some science work that I was doing

02:48 in astronomy at the time, like with IDL. So it was sort of all things to all people.

02:51 And eventually I was in a computer science department and I found that Python was a good

02:55 enough language that computer scientists were happy with it, but also usable enough language that actual

03:00 scientists were happy with it. I think it's probably where Python gets a lot of its excitement,

03:04 at least on the sort of numeric Python side.

03:07 I think that's a really nice insight. And it sounds like a cool way you got into programming.

03:12 I feel like one of Python's superpowers is it is kind of what I think of as a full spectrum

03:17 language. You can make it what you need it to be at the beginning. But like you said, professional

03:22 developers and scientists can use it as well. And you kind of touched on that, right? Like you can

03:27 use it like C# or you can use it like MATLAB or you can kind of be what you need it to be.

03:32 But if you need more, it's not like, well, now you go right in C++. It's just like,

03:36 we'll now use these other features.

03:38 Yeah, definitely. I'd also say it has a lot of different parts to it. Like before,

03:41 there was the web programming community, and there was a community that used it kind of like

03:45 Bash, like a sysops community. And there was the, you know, scientific community. Now those are

03:49 all kind of the same and it's really cool to see all those groups come together and use each other's

03:53 stuff. You see people using scikit-learn inside of a Flask app deployed with some other Python thing.

03:59 Ansible or something, right? Yeah, for sure.

04:01 And the fact that all those things were around makes it sort of the cool state that it's in today.

04:04 Yeah, it's just getting better, right?

04:06 But it was already good. It's just like connecting all the good parts.

04:08 Exactly. The interconnections are getting a little bit stronger across these disciplines,

04:12 which it's kind of rare, right? There's not that many languages and ecosystems where that happens,

04:18 right? Like think of Swift for iOS, right? Swift is a pretty cool language. It has some issues that

04:23 I could take with it, but it's pretty cool. But it's not also used for data science and for DevOps,

04:27 right? It's just not.

04:28 Right.

04:28 Yeah.

04:29 That's pretty cool. Also, you said you did some IDL. It sounds like you were doing astronomy before,

04:34 right?

04:34 Yeah. This is just an undergrad. I was taking some astronomy courses. It was fun though,

04:38 because I knew how to program and I was just really productive as a scientist because I was

04:43 fluent in programming. That ability to get more productive was really good. It took me away from

04:48 science and towards computer science, but in service of science. So that was my experience.

04:53 Now you're building these platforms where legitimate world-class science is happening.

04:58 Yeah. No, it's cool.

04:59 Right? You must be really proud, right? To think of what people are doing with some of the software

05:04 you're building and you're collaborating on.

05:05 Yeah, the software that all of us build. And yeah, it's definitely very cool. It's very satisfying to

05:10 have that kind of impact, not just in one domain, but across domains. And I get to work with lots of

05:15 different kinds of people every day that are solving, I think, high impact problems. And yeah,

05:19 that's definitely a strong motivator for me.

05:20 Yeah, I'm sure it is. Programming is fun, but it's really not fun to work on a project and have

05:26 nobody use it. But to see it thrive and being used, it really is a gratifying experience.

05:31 Yeah. I'll go so far as to say that. Programming is no longer fun. It's just the way we get things done

05:36 and seeing them done, as you're saying, is really where a lot of the satisfaction comes from.

05:39 Yeah, that's cool. So how about today? What are you up to these days?

05:42 You were, to put it past tense, you were doing a lot of stuff with Anaconda Inc.,

05:48 previously named Continuum, but same company, but recently you've made a change, right?

05:52 Yeah. So I like to say it's the same job, just a different employer. Yeah. So I guess my job today

05:57 is a few things. I work in the open source numeric Python ecosystem. So this is usually

06:03 called like SciPy or PyData. And I mostly think about how that system scales out to work with

06:10 multi-core machines or with distributed clusters. I usually do that with a project called Dask.

06:14 That's maybe like the flagship project that I help maintain, but I'm just generally active

06:18 throughout the ecosystem. So I talk to the NumPy developers, the Pandas developers,

06:22 scikit-learn and Jupyter developers. I think about file formats and protocols and conventions.

06:27 There are a few of us- Networking. Networking. Yeah. Lots of things. So generally in service

06:31 of those science use cases or in business use cases, there's tons of things that go wrong. And

06:36 I tend to think about whatever it is needs to get handled. But generally I tend to focus on Dask,

06:40 which I suspect we'll talk about more today. Yeah. A tiny bit. So speaking of Dask, let's set the

06:46 stage, talk a little bit about, you know, why does Dask exist and so on. And I guess maybe start with a

06:53 statement that you made at PyCon, which I thought was pretty interesting. And you said that Python has

06:59 a strong analytical computation landscape and it's really, really powerful, like Pandas and NumPy and

07:05 so on. But it's mostly focused on single core computation and importantly, data that fits into RAM.

07:11 You want to elaborate on that a little? Sure. So everything you just said, so all of those libraries,

07:17 so NumPy and Pandas and scikit-learn are loved today by data scientists because the APIs are,

07:23 they do everything they want them to do. They seem to be well tailored to their use cases.

07:26 And they're also decently fast, right? So they're not written in pure Python. They're written in C,

07:30 Fortran, C++, Cython, LLVM, whatever. Right. And so much of the interaction is like,

07:36 I have the data in Python. I just pass it off. Or maybe I don't even have it in Python. I have like

07:41 a NumPy array, which is a pointer to data down at the C level, which I then pass off to another

07:46 part. And then internally down in C, it cranks on it and gives it back like five or something,

07:51 right? Like, so it's not, it's a whole, I think the whole story around performance and is Python

07:58 slow, is Python fast? Like it gets really interesting really quick because of those types of things.

08:04 Yeah. So we're, we have really nice usable Python APIs, which are kind of like the front end to a

08:08 data scientist. And that's hitting a backend, which is C++ Fortran code, which is very fast.

08:13 And that combination is really useful. Those libraries also have like now decades of expertise,

08:19 PhD theses thrown into them, just tons of work to make them fit. And people seem to like them a lot.

08:25 And it's also not just those packages. There's hundreds of packages on top of NumPy, Pandas,

08:29 scikit-learn, which do biology or do signals processing or do time series forecasting or do whatever.

08:35 And there's this ecosystem of thousands of packages, but that code, that code that's very

08:41 powerful, it's really just powerful on a single core, usually on data that fits in RAM. So as sort

08:46 of big data comes up or as these clusters become more available, the whole PyData SciPy stack starts

08:51 to look more and more antiquated. And so the goal that I was working on for a number of years and

08:56 continue to work on is how do we elevate all of those packages, you know, NumPy, Pandas,

09:00 scikit-learn, but also everything that depends on them to run well on distributed machines,

09:04 to run well on new architectures, to run well on large data sets? How do we sort of refactor an

09:09 ecosystem? That's a really hard problem.

09:11 You know, that's a super hard problem. And it's, I think, one of the challenges that PyPy,

09:16 P-Y-P-Y, the JIT-compiled alternative to CPython, has faced. I think that would have really taken off

09:24 had it not been for, yes, we can do it this other way, but you have to lose NumPy,

09:30 or you have to lose this C library, or you lose the C speedups for SQLAlchemy, or whatever, right?

09:36 You're trying to balance this problem of like, I want to make this distributed and fast and parallel,

09:40 but it has to come along as is, right?

09:43 Right. Yeah, we're not going to rewrite the ecosystem. That's not a feasible project today.

09:47 Right. I mean, theoretically, you could write some alternate NumPy, but people trust NumPy.

09:53 And like you said, it's got so much polish, it's insane to try to recreate it.

09:57 Yeah. It also has, even if you recreated it, there's a lot of weight on it. A lot of other

10:02 packages depend on NumPy in very arcane ways, and you would have a hard time reproducing all of those.

10:06 Right. Yeah. It's like boiling the ocean, or trying to boil the ocean, pretty much. Okay. And then,

10:13 so that's the problem is like, we have problems that are too big for one machine. We have great

10:20 libraries that we can run on one machine. How do you take that ecosystem and elevate it in a way

10:26 that's roughly the same API, right? We had certain things like that, right? Like Spark,

10:32 for example, is one where you can do kind of distributed computation, and there's Python

10:37 interactions with Spark, right?

10:38 Yeah, definitely. But Spark is again, going down that rewrite path, right? Spark is not using NumPy

10:43 or Pandas. They're rewriting everything in Scala, and that's just a huge undertaking. Now,

10:46 they've done a great job. Spark is a fantastic project. If Spark works for you, you should use

10:50 it. But if you're used to Pandas, it's awkward to move over. Most good Python Pandas devs that I know

10:55 who moved to Spark eventually move to Scala. They eventually give up the ecosystem,

10:58 they move over. And that makes sense, given how things work.

11:01 Right.

11:01 But if you wanted to say...

11:02 Because if you're going to work there, you might as well be native in that space, right?

11:05 But if you wanted to say use NumPy, there is no Spark NumPy, nor is there a Spark SciPy,

11:09 nor is there a Spark time series prediction.

11:12 Everything built on it.

11:13 Yeah. So Spark really handled sort of the Pandas use case to a certain extent,

11:18 and like a little bit of the Scikit-learn use case. But there's this other sort of 90%

11:22 ecosystem that's not handled. And that's where Dask tends to be a bit more shiny.

11:26 Right. And isn't... I don't do much with Spark, so you'll have to keep me honest here. But isn't

11:32 Spark very like MapReduce-y? It has a few types of operations and pipelines you can build,

11:38 but you can't just run arbitrary code on it, right?

11:40 Yes and no. So yes, underneath the hood, from a distributed systems point of view,

11:45 Spark implements sort of the same abstraction as MapReduce, but it's much more efficient and it's

11:49 much more usable. The APIs around it are much nicer. So if you do something like a group-

11:53 by aggregation, that is sort of a map and then a reduction under the hood. And so as long as you

11:59 sort of stay within the set of algorithms that you can represent well with MapReduce, Spark is a good fit.

12:04 Not everything in the sort of the PyData SciPy stack looks like that, it turns out. And so we looked at

12:08 Spark early on and, hey, can we sort of use the underlying system to parallelize out PyData? The answer

12:12 came back pretty quickly, no. It's just not flexible enough.

12:15 Yeah. And you know, it's, I love that this format is audio, but for this little part, I wish we could

12:21 show some of the pictures that you've had and some of your presentations and stuff, because

12:26 the graphs of how all the data flows together is super, super interesting and complicated and

12:31 sort of self-referential and all sorts of stuff.

12:33 Yeah, no, those visuals are really nice and it's nice that they're built in. You get a good sense of

12:37 what's going on.

12:37 Yeah, it's cool. All right. So let's talk about Dask. You've mentioned it a couple of times.

12:42 Dask is a distributed computation framework focused on parallelizing and distributing the data science

12:51 stack of Python, right?

12:52 Yeah. So let me describe it maybe from a 10,000 foot view first, and then I'll dive in a little bit.

12:57 So from a 10,000 foot view, the tagline is that Dask scales PyData. Dask was designed to parallelize

13:02 the existing ecosystem of libraries like NumPy, Pandas, and Scikit-Learn, either on a single machine,

13:07 scaling out from memory to disk using all of your cores, or across a cluster of machines.

13:11 So what that means is if you like Pandas, you like NumPy, and you want to scale those up to

13:16 many terabytes of data, you should maybe investigate Dask. So diving in a little bit, Dask is kind of

13:22 two different levels. I'll describe Dask at a low level first, and then a high level. So at a low

13:26 level, Dask is a generic Python library for parallel computing. Just think of it like the threading

13:31 module or like multiprocessing, but just more. There's more sophistication in there, it's

13:36 more complex. It handles things like parallelism, like task scheduling, load balancing, deployment,

13:42 resilience, node failover, that kind of stuff. Now using that very general purpose library,

13:47 we've built up a bunch of libraries that look a lot like PyData equivalents. So if you sort of take

13:54 Dask and NumPy and smash them together, you get Dask Array, which is a large scalable array

13:59 implementation. If you take Dask and Pandas and squash them together, you get Dask DataFrame,

14:03 which uses many Pandas DataFrames across a cluster, but then can sort of give you the same Pandas API.

14:08 So it looks and feels a lot like the underlying sort of normal libraries, NumPy or Pandas, but it scales

14:14 out. So today when people talk about Dask, they often mean Dask DataFrame or Dask Array.

14:20 Yeah. So just to give people a sense, like this, when you interact with one of these, say,

14:25 Dask Arrays, it looks like a NumPy array. You treat it like a NumPy array mostly. But what may

14:32 actually be happening is you might have 10 different machines in a cluster. And on each

14:38 of those machines, you might have some subset of that data in a true NumPy array, right? And that's,

14:43 they know how to coordinate operations. So if you do like a dot product on that with some other bit,

14:48 it can figure out the coordination and communication across all the different

14:51 machines in the cluster to compute that answer in a parallel way. But you think of it as like,

14:57 well, I just do a dot product on this array and we're good, right? Just like you would in NumPy.

15:01 That's exactly correct. Yeah. So Dask is like a very efficient secretary. It's doing all the

15:05 coordination work. NumPy is doing all the work on every node. And then we've written a lot of

15:09 parallel algorithms around NumPy that teach you how to do a big matrix multiply with a lot of small

15:13 matrix multiplies. Or on the pandas side, you know, a big join with a lot of smaller joins.
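
For reference, a minimal sketch of that NumPy-style front end over chunked, parallel computation. The array size and chunk shape here are illustrative, not from the episode:

    import dask.array as da

    # A 20,000 x 20,000 array of random numbers, split into 1,000 x 1,000
    # NumPy chunks; each chunk can live on a different core or worker.
    x = da.random.random((20_000, 20_000), chunks=(1_000, 1_000))

    # Looks just like NumPy: a matrix multiply followed by a reduction.
    y = x.dot(x.T).mean()

    # Nothing has run yet; .compute() executes the chunked graph in
    # parallel and returns a plain Python float.
    print(y.compute())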

15:17 That sounds like a lot of coordination and a challenging problem to do in the general sense.

15:23 Yeah, it's super fun.

15:24 The general solution of that. Yeah, I bet it is. A lot of linear algebra and other types of stuff in

15:30 there, I'm guessing.

15:30 Sure. Or a lot of pandas work or a lot of machine learning work.

15:32 Yeah, yeah.

15:34 Or a lot of other work. So I want to point out here that the lower level Dask library,

15:38 the thing that just does that coordination, is also used separately from NumPy and pandas. And I'm

15:42 sure we'll get to that in the future. But so Dask is sort of this core library that handles parallelism.

15:46 You can use it for anything. There's also like a few big uses of it for the major PyData libraries.

15:52 Yeah. We have Dask Array and we have Dask DataFrame and so on. And these are really cool. And

15:57 we said the API is really similar, but there's one significant difference. And that is that

16:04 everything is lazy, right? Like a good programmer?

16:07 Sure. So Dask Array and Dask DataFrame both do lazy operations. And so yeah, at the end of your,

16:13 you know, you do DataFrame, you do, you know, read parquet, filter out some rows,

16:17 do a group aggregation, get out some small results. You then call .compute at the end of that.

16:22 It's a method on your Dask DataFrame object, and that will then ship off the sort of recipe for

16:27 your computation off to a scheduler, which will then execute that and give you back a result,

16:31 probably as a pandas DataFrame.
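
A minimal sketch of that lazy, recipe-then-compute flow. The file and column names here are hypothetical:

    import dask.dataframe as dd

    # Each step just records work in a task graph; nothing runs yet.
    df = dd.read_parquet("events.parquet")           # hypothetical file
    df = df[df.value > 0]                            # filter out some rows
    summary = df.groupby("category").value.mean()    # group aggregation

    # .compute() ships the recipe to the scheduler, executes it, and
    # returns the small result as a regular pandas object.
    print(summary.compute())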

16:33 There's two major parts here. One is the data structures that you talked about. And the other

16:37 is the scheduler idea, right? And those are almost two separate independent things that make up Dask.

16:42 Two separate code bases even.

16:43 Okay.

16:46 This portion of Talk Python to me is brought to you by Linode. Are you looking for hosting that's

16:51 fast, simple, and incredibly affordable? Well, look past that bookstore and check out Linode at

16:56 talkpython.fm/Linode. That's L-I-N-O-D-E. Plans start at just $5 a month for a dedicated server

17:03 with a gig of RAM. They have 10 data centers across the globe. So no matter where you are or where your

17:08 users are, there's a data center for you. Whether you want to run a Python web app, host a private

17:13 Git server, or just a file server, you'll get native SSDs on all the machines, a newly upgraded

17:18 200 gigabit network, 24-7 friendly support, even on holidays, and a seven-day money-back guarantee.

17:25 Need a little help with your infrastructure? They even offer professional services to help you with

17:30 architecture, migrations, and more. Do you want a dedicated server for free for the next four months?

17:34 Just visit talkpython.fm/Linode.

17:37 I want to dig into the schedulers because there's a lot of interesting stuff going on

17:43 there. One thing that I think maybe it might be worth digging into a little bit for folks is this

17:49 demo that you did at PyCon around working with, basically you had this question you want to answer,

17:56 it's like, what is the average or median tip that a person gives to a taxicab driver plotted by time of

18:04 day? You know, like Friday evening, Monday morning, and so on. And so there's a huge data set,

18:10 maybe you can tell people where to get it, but it's the taxicab data set for some year,

18:15 and it's got like 20 gigs on disk and 60 gigs in memory. Now, I have a MacBook Pro here that's like

18:22 all knobs to 11, and that would barely hold a half of that data, right? So like even on a high-end

18:28 computer, it's still like, I need something more to run this because I can't load that into memory.

18:34 Well, to be fair, actually, that's not big by like big data standards, it's actually pretty small.

18:38 People use Dask on like 20 terabytes rather than 20 gigabytes pretty often. But you're right that

18:42 it's like, it's awkward to run on a single machine.

18:44 Yeah, you can do a lot of paging.

18:45 Yeah, it's a pretty easy data set, though, to run in a conference demo, just because it's

18:49 pretty interpretable.

18:50 Yeah.

18:50 Yeah.

18:50 Everyone can immediately understand average tip by time of day and day of the week.

18:54 Yeah, so it's a good demo. I'll maybe talk through it briefly here, but I encourage people to look at the PyCon talk. It's probably the PyCon 2017 one, I think, for Dask.

19:02 Yeah, and I'll put a link to the talk in the show notes.

19:04 Yeah, so in that example, we had a bunch of CSV files living on like Amazon S3 or Google

19:10 Cloud Storage, I don't remember which.

19:11 Yeah, Google Cloud Storage.

19:12 And you could use the pandas read CSV function to read a bit of those files, but not the entire

19:17 thing. But if I had to read everything, it would fill up RAM and pandas would just halt.

19:21 And that's kind of the problem we run into in PyData today when we hit big data. Everything's

19:25 great until you run out of RAM, then you're kind of, you're out of luck.

19:28 So we sort of switched out the import of pandas for Dask DataFrame, called the Dask DataFrame

19:33 read CSV function, which has all of the same keyword arguments of pandas read CSV, which

19:37 if you know pandas are numerous. And instead of reading one file, we sort of give it a glob

19:41 string. We'd ask it to read all the files.

19:43 Yeah, we'd like did normal pandas API stuff. I think we filtered out some bad rows or like

19:49 some free rides in New York City that we had to remove. We made a new column, which is

19:53 a tip fraction. We then use pandas datetime functionality, I think, to group by the hour of the day.

19:58 And the day of the week. And if you know pandas, like that API is pretty comfortable to you.

20:03 Just the data's too big.

20:04 Yeah, it was the same experience, pandas. Like we switched out the import, we put a star in

20:09 our file name to read all the CSV files. We might have asked for a cluster. So we might

20:13 have asked something like Kubernetes or Yarn or Slurm or some other cluster manager to

20:17 give us a bunch of Dask workers. That one's probably using Kubernetes because it's on Google.

20:20 Those showed up. So Google was fine enough to give us virtual machines. We deployed Dask on

20:25 them. And then, yeah, then we hit compute. And then all of the machines in our cluster

20:28 went ahead and they probably called little pandas read CSV functions on different ranges in those CSV

20:34 files coming off of Google Cloud Storage. They then did different group operations. They did some

20:39 filtering operations on their own. They're probably somewhat related to the ones that we asked them to

20:43 do, but a little bit different. They had to communicate between each other. So those machines

20:47 had different pandas data frames. They had to share results back and forth between each other.

20:50 If one of those machines went down, Dask had to ask for a new one or had to recover the data that

20:55 machine had. And at the very end, we got back a pandas data frame and we plotted it. And I think

21:00 the punchline for that one is that the tips for like three or 4 a.m. are absurdly high. It's like

21:06 40% or something like that.
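
A rough sketch of the demo as described. The bucket path and column names are recalled from public versions of this example, so treat them as assumptions rather than the exact code from the talk:

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client()  # or a client pointed at a Kubernetes-backed cluster

    # The glob string asks it to read every CSV file, not just one.
    df = dd.read_csv("gcs://anaconda-public-data/nyc-taxi/csv/2015/*.csv",
                     parse_dates=["tpep_pickup_datetime"])

    df = df[df.fare_amount > 0]                      # drop the free rides
    df["tip_fraction"] = df.tip_amount / df.fare_amount

    # Pandas datetime functionality: group by the hour of the day.
    hourly = df.groupby(df.tpep_pickup_datetime.dt.hour).tip_fraction.mean()
    print(hourly.compute())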

21:07 Yeah. Did you have any insight why that is? Do people know?

21:12 I mean, the hypothesis is that it's people coming home from bars. I was actually giving the same demo

21:15 at a conference once and someone was like, oh, filter out all the short rides and I'll bet it

21:20 goes away. And so we did that and the spike does go away. And so it's probably just people getting

21:24 $5 cab rides from the bar. And we were looking at tip fraction, which is sort of going to have a lot

21:29 of noise. But it was fun because this guy said, hey, try this thing out. And he knew how to do it.

21:33 He'd never seen Dask before. He'd just known pandas before. Gave him the keyboard. He typed in the

21:37 pandas commands and it just worked. We got a new insight out of it. And so that sort of experience

21:42 shows us that a tool like Dask data frame gives us the ability to do pandas operations,

21:47 to extend our pandas knowledge, but across a cluster. That data set could have been 20 terabytes

21:51 and the same thing would have worked just fine. You would have wanted more machines. It would have

21:55 been fine. Or one machine. One machine? Yeah. So Dask also works well on a single machine

21:59 just by using small RAM in intelligent ways. Oh, it does. Okay. So it can be more efficient.

22:03 You could give it a ton of data, but it can realize I can't take it all in one big bite. I got to eat

22:09 the elephant sort of thing. Yeah. So if you can run through your computation in small RAM,

22:12 Dask can usually find a way to do so. And so that's actually like Dask can be used on clusters. And

22:16 that's the flashy thing to show to people at a PyCon talk. But most Dask users are using just their

22:22 laptops. So for a long time, Dask actually just ran on single machines for the first year of its

22:26 life. We built all the parallel algorithms, but didn't have a distributed scheduler. And there people

22:30 just wanted to sort of stream through their data, but with pandas or NumPy APIs. So people were

22:35 dealing with 100 gigabyte data sets on their laptops pretty comfortably.

22:38 Oh, that's interesting. I didn't realize that that was a thing it could do. That's cool.

22:41 Could it be used in the extreme? Like if I've got a really small system with not much RAM at all,

22:49 I'm thinking like little IoT things. Could it let me process like not big data,

22:53 but like normal data and like a tiny device?

22:55 Sure. Depending on how much data or how much you're looking for. Sure.

22:59 Yeah. Okay. Interesting. That's a super cool feature. Now, one of the first things that struck me

23:03 when you ran this, like obviously you've got all these machines. I think you had 32 systems with

23:09 two cores each when you ran the demo and it went really quickly, which is cool. I expected it to

23:14 come up with the answer and you could say, we ran it, we can't even run it locally, and here it is,

23:18 you know, in 10 seconds or something. But instead, what I saw when you did that was there's like a

23:24 beautiful graph that's alive and animated describing like the life of your cluster and the computation

23:29 and like different colors, meaning different things. Can you describe like that diagnostic

23:34 style page? Just so people know. Yeah, sure. It's hard to describe a picture, much less like an

23:40 interactive dashboard. But yeah, so maybe I'll describe a little bit of like the architecture

23:44 of Dask first. That'll make this a bit more clear. Okay. So Dask has a bunch of workers that are on

23:49 different machines that are doing work, holding onto data, communicating with each other. Then there's a

23:53 centralized scheduler, which is keeping track of all those workers. This is like the,

23:57 like the foreman at a job site, maybe of all these workers telling the workers what to do,

24:01 telling them to share with each other, et cetera. And that foreman, that scheduler has a ton of

24:06 information about what all the workers are doing, all the tasks that are being run, the memory usage,

24:11 network communications, you know, file descriptors that are open on every machine, et cetera.

24:15 And in trying to benchmark and profile and debug a lot of Dask work, we found that we wanted access to

24:21 all that information. And so we built this really nice dashboard. And so, yeah, it tells you all that

24:26 information. You saw one page at that demo, but there are like 10 or 20 different pages of different

24:31 sets of information. So there's just a ton of state in the scheduler. And we wanted to expose all that

24:36 state visually. This ends up being, it was originally designed for the developers, but it ends up being

24:40 critical for users because understanding performance is really hard in the distributed system, right?

24:45 Our understanding of what is fast and slow in a single machine gets totally turned around when you

24:49 start adding concurrent access to disk and network communications and everything. So yeah, it's a live

24:55 dashboard. We used Bokeh. That's a Bokeh server application to build it. So it's got, you know,

25:00 it's updating at something like a hundred milliseconds frame rates. So it looks live to the human eye. Yeah.

25:05 It's showing you progress. It's showing you every task that gets done and when it gets done and where

25:09 it gets done. It's showing you memory use on all different machines. You can dive in. You can actually

25:13 get like line by line profiling statistical information on all of your code.

25:18 So that's coordinated across machines. So like this function is taking this long, but actually it's

25:22 summing up the profiling across all the cluster. So the scheduler is just aware of everything and the

25:27 workers are all gathering tons of information. There's tons of diagnostics on that machine.

25:30 Yeah. So maybe a way to say this is that in order to make good decisions about where to schedule certain

25:35 tasks, Dask needs to have a bunch of information about how long previous tasks have taken, the current

25:39 status everywhere. And so we've gotten very, very good at collecting telemetry and keeping it

25:45 in a nice index form on the scheduler. And so now that we have all that information,

25:48 we can just present to people visually.

25:50 Might as well share.

25:50 Yeah. I find myself sometimes using Dask just on sequential computations just because I want

25:55 the dashboard. Like I don't want parallelism. I just want the dashboard. It's just the best

25:58 profiler I have.

25:59 It is a good insight into what's happening there. That's super cool. And I love that it's live. So

26:04 it seems really nice. I was surprised at how polished that part of it was.

26:08 That is honestly just due to Bokeh. If people, if they like dashboards, go look at Bokeh server.

26:13 I didn't know any JavaScript. I didn't know about Bokeh before. It was super easy to use.

26:17 I integrated it in my application. It was clean. It was nice. And it looks polished, as you said.

26:21 That's super cool. Now you did say that you can run Dask locally and there's a lot of advantages

26:26 to doing that as you already described, but it's super easy, right? Maybe just describe,

26:31 like it's just a couple of lines to set up a little mini cluster running on your system,

26:36 right?

26:37 Yeah. So it's even easier than that, but yes. So if you do from Dask, so if you, first

26:40 of all, you would pip install dask or conda install dask. It's a Python library. It's pure

26:45 Python. You would then, if you want the sort of the local cluster experience with the dashboard

26:49 and everything, you say from dask.distributed, import Client. And then you create a client, which

26:53 with no arguments, it would create for you like a few workers on your local machine and a scheduler

26:57 locally. And then all the rest of your Dask work would just run on that. You also don't need

27:01 to do that. You can just import Dask and there's a thread pool waiting for you if you don't

27:05 set anything up. So a lot of people, again, don't use the distributed part of Dask. They

27:10 just use Dask locally and Dask can operate just as a normal library. It spins up a thread

27:14 pool, runs your computations there and you're done. So there's no actual setup you need to

27:18 do.

27:18 Right.
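
In code, both paths are just a couple of lines. A sketch (the file and column names are hypothetical; dashboard_link is available on recent versions of the distributed library):

    # Option 1: no setup at all. Dask falls back to a local thread pool.
    import dask.dataframe as dd

    df = dd.read_csv("data-*.csv")   # hypothetical files with an "x" column
    print(df.x.mean().compute())     # runs on the default local scheduler

    # Option 2: the local cluster experience, with the dashboard.
    from dask.distributed import Client

    client = Client()                # local workers plus a local scheduler
    print(client.dashboard_link)     # URL of the live Bokeh dashboard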

27:18 Sorry. There was a great tweet about maybe a few weeks ago. Someone was using the X-Array

27:23 library. X-Array is a popular library for array computing in like the geoscience world.

27:28 Someone says like, yeah, someone was recommending X-Array and Dask on Twitter. And someone says,

27:32 yeah, I've heard about X-Array. Use it all the time. It's great. Never heard of Dask though.

27:35 Don't know what you're talking about. And it was hilarious because X-Array uses Dask under

27:39 the hood. The guy had just never realized that he was using Dask. So that's a real success

27:43 point.

27:43 Yeah. If you don't have to poke your head up and like make it really hard and obvious that

27:48 this thing is part of it, right? That's great.

27:50 Dask had disappeared into being just infrastructure.

27:52 Yeah. So another thing that I thought was a super cool feature about this is you were

27:58 talking about some kind of working with a really big tree of data, right? And you start

28:04 processing it and it's going okay, but it's going somewhat slowly and it's a live talk with

28:10 not that much time. So you're like, all right, well, let's add some more workers to it. And

28:14 while the computation is running, you add like 10 more workers or something. And it just on

28:19 your cool bokeh graph and everything, you just see it ramp up and take off. And so you can

28:23 dynamically add and remove these workers to the cluster, right?

28:27 Yeah, definitely. So Dask does all the things you'd expect from any modern distributed system.

28:32 It handles resilience, handles dynamically adding things. You can deploy it on any common job

28:37 scheduler. So it does sort of all the things that you'd expect from something like Spark,

28:40 you mentioned before, or TensorFlow or Flink or any of the sort of popular distributed systems today,

28:46 but it does sort of adjust those things. So you can sort of sprinkle in that magical dust

28:50 into existing projects. And that's really the benefit. Yeah. So we focus a lot on how do we

28:55 add clusters efficiently or how do we add workers to a cluster dynamically?

28:57 Yeah. It seems really useful. Just as you're doing more work, you just throw it in. And

29:02 I'm guessing these clusters can be long lived things and different jobs can come and go. And

29:06 are they typically set up for like research groups or other people in longer term

29:11 situations or are they like sprung up out of Kubernetes, like it burst into the cloud and

29:16 then go away?

29:17 I'm going to say both actually. So what we'll often see is that the institution, which will

29:21 have something like Kubernetes or Yarn or Slurm, and they'll give their users access to spin up

29:26 these clusters. So an analyst will sit down at their desk, they'll open up a Jupyter notebook,

29:29 they'll import Dask, use one of the Dask deployment solutions to ask for a Kubernetes Dask cluster.

29:35 Then like their own little scheduler and workers will be created on that machine,

29:39 on that cluster. They'll do some work for an hour or so, and they'll go away. And that cluster will go

29:43 away. What IT really likes about the sort of adding and removing of workers is that we can add and

29:49 remove workers during their computation. So previously they would go into the cluster, they'd ask for,

29:53 hey, I want 100 machines. They'd use those machines for like a minute while they load their data,

29:57 and they would stare at a plot for an hour. And so it's not so much the ability to add new workers,

30:01 it is the ability to remove those workers when they're not being used. This gives like really good

30:05 utilization. You can support a lot more people who have really bursty workloads. So a lot of data

30:10 scientists, a lot of scientists, their analysis is very bursty. They do a bunch of work and they stare

30:15 at their screen for half an hour. They do a bunch more work and they stare at the screen for half an

30:19 hour. And so that sort of dynamism is really helpful in those sorts of use cases.
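
A sketch of that dynamic scaling using the dask-kubernetes project. The class and method names follow its documented API, but cluster configuration details are omitted here:

    from dask_kubernetes import KubeCluster
    from dask.distributed import Client

    cluster = KubeCluster()                 # ask Kubernetes for Dask workers
    cluster.adapt(minimum=0, maximum=100)   # grow and shrink with the load

    client = Client(cluster)
    # Bursty analysis goes here: workers appear while you compute,
    # and get released while you stare at the plot.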

30:22 Yeah, I can imagine. That's super cool. Now we talked about the Dask data frame and Dask array and

30:28 those being basically exact parallels of the NumPy and Pandas version. But another thing that you can do

30:37 is you can work in a more general purpose way with this thing called a delayed and a compute, right?

30:43 Which will let you take more arbitrary Python code and fan it out in this way, right?

30:48 Yeah. So let me first add a disclaimer. Dask data frame and Dask array are not exactly like

30:53 NumPy and Pandas. They're just very close. Close enough, you'll be fine.

30:55 Similar, okay.

30:56 If I don't say that, people will yell at me afterwards. So yeah, so the creation of Dask

31:00 delayed is actually really interesting. So when we first built Dask, we were aiming for something that

31:05 was Spark-like. Like, oh, we're going to build this like very specific system to parallelize NumPy and

31:10 parallelize Pandas and that'll handle everything. Everyone was built on top of those things. That did not

31:14 end up being the case. Some people definitely wanted parallel Pandas and parallel NumPy,

31:18 but a lot of people said, well, yeah, like I use NumPy and Pandas, but I don't want a big NumPy

31:22 and Pandas. I'm using something that's like really different, really custom, really bespoke.

31:25 And they wanted something else as though we had built like a really fast car. And they said,

31:31 great, that's a beautiful car. I actually just want the engine because I'm building a rocket ship

31:35 or I'm building a submarine or I'm building like a mechatronic cat or something.

31:39 It turns out that like in the Python space, people build like really diverse things. Like

31:46 Python developers are just on the forefront. They're doing new things. They're not solving

31:50 the same classic business intelligence problem over and over again. They're building things that are new.

31:54 And so Dask delayed was like a really low-level API where they could build out their own task graphs,

31:59 their own parallel computations, sort of one chunk at a time using normal Python for loops.

32:04 So it allows you to build like really custom, really complex systems with normal looking Python

32:09 code, but still get all the parallelism that Dask can provide. So you still get the dashboard,

32:13 you still get the resilience. You can build out your own system in that way.
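
A minimal sketch of building a custom graph with dask.delayed. The functions and file names are stand-ins:

    import dask

    @dask.delayed
    def load(filename):
        # Stand-in for custom, bespoke I/O.
        return len(filename)

    @dask.delayed
    def process(data):
        # Stand-in for your own logic.
        return data * 2

    # Normal Python for-loops build the task graph one chunk at a time.
    results = [process(load(f)) for f in ["a.dat", "b.dat", "c.dat"]]
    total = dask.delayed(sum)(results)

    print(total.compute())  # executes the whole graph in parallel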

32:15 Yeah. You even have the graph and the dashboard, as you said, which is really cool.

32:19 Yeah.

32:19 I love it. So when I saw that, I was thinking, well, this is really interesting. And I see how that lets

32:25 me solve a more general numerical problem. Could I solve other things with it, right? Like suppose I have

32:31 some interesting website that needs better performance, right? I've got something that

32:36 has really nothing at all to do with computation. Maybe it talks to databases or other things,

32:41 right? Could I put a DASC backend on like a high-performance website and get it to do,

32:48 you know, bizarre cluster computations to go faster? Or is it really more focused on just data science?

32:53 Yeah. So people would use Dask in that setting in the same way they might use like a multiprocessing

32:58 pool or a thread pool, or Celery maybe, but they would scale it out across a cluster.

33:03 So Dask satisfies like the concurrent futures interface, if you use concurrent futures,

33:07 and also does async await stuff. So people definitely integrate it into systems like what you're talking

33:10 about. I think at UC Berkeley, there's a project called Cesium that does something like that.

33:14 I gave a talk at PyGotham like a year or two ago that also talked about that. So yeah,

33:19 like people definitely integrate Dask into more webby kinds of systems. It's not accelerating

33:25 the web server. It's just accelerating the sort of computations that the web server is hosting.

33:29 It's useful because it has like millisecond latencies and scales out really nicely and

33:34 is like very Pythonic. It fits all the Python APIs you'd expect.
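
That futures-style use looks roughly like this; the work function is a stand-in:

    from dask.distributed import Client

    client = Client()  # a local cluster here; could be a remote one

    def fib(n):
        # Stand-in for whatever your web backend needs to compute.
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    future = client.submit(fib, 25)  # concurrent.futures-style API
    print(future.result())           # block for the answer

With an async framework, you would create the client with Client(asynchronous=True) and await the future instead.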

33:37 Yeah. And the fact that it implements async and await is super cool. If you use it with something

33:41 like Sanic or Quart or one of these async enabled web frameworks, it's just await

33:47 some execution, and the scalability, you know, spreads across maybe some Dask Kubernetes cluster or

33:54 something. That's pretty awesome. You talked about the latency and I was pretty impressed

33:58 with how those were working. You were quoting some really good times. And actually you said

34:02 that's one of the reasons you created your own scheduler and you didn't just try to piggyback on

34:07 Spark is because you needed extremely low latency because it's very, it's kind of chatty, right? If I'm

34:12 going to take the dot product of a thing that's all over the place, they've got to coordinate quite a bit

34:16 back and forth, right? That can't be too bad.

34:18 Yeah. So Spark actually internally has okay latencies today. It didn't at the time,

34:22 but they're also at sort of the millisecond range. Where Spark falls down is in complexity.

34:25 They don't handle sort of arbitrary task graphs. Sort of the other side of that is systems like

34:30 Airflow, Luigi, and Celery, which can do more arbitrary task graphs. You can say, you know,

34:35 download this particular file. And if it works, then email this person, or if it doesn't work,

34:39 parse this other thing. You can build these dependencies into Airflow, Luigi, Celery.

34:43 Where those things fall down is that they don't have good inter-task communication. So you can't

34:48 easily say, I've got these two different pandas dataframes with different machines,

34:51 have them talk to each other. They don't handle that. They also have long latencies in sort of

34:55 the like tens to hundreds of milliseconds, which can be...

34:58 Right. Which is not that long, but when you amplify it across, you know, a thousand times,

35:02 then all of a sudden it's huge.

35:03 Yeah. Or when your pandas computations take 20 milliseconds or one millisecond,

35:07 and you want to do a million of them, then suddenly latency does become a big issue. So we needed sort of

35:11 the scalability and performance of Spark, but with the sort of flexibility of Airflow,

35:17 Luigi, Celery. And that's sort of where Dask is sort of that mix between those two camps.

35:21 This portion of Talk Python to Me is sponsored by Backlog from NewLab. Developers know the importance

35:29 of organization and efficiency when it comes to collaborating on a team. And Backlog is the

35:34 perfect collaborative task management software for your team. With Backlog, you can create tasks,

35:39 track bugs, make changes, give feedback, and have team conversations right next to your code. You track

35:44 progress with features like Gantt and burndown charts. You document your processes right alongside your

35:50 wikis. You can integrate with the tools you use every day like Slack, Jira, and Google Sheets. You can

35:54 automatically register issues from email or web form submissions. Take your work on the go using their

35:59 top-rated mobile apps available on Android and iOS. Try Backlog for your team for free for 30 days using

36:05 our special URL at talkpython.fm/backlog. That's talkpython.fm/backlog.

36:14 I was really impressed with the low latency that you guys were able to achieve. And so maybe that's a good

36:18 way to segue over to some of the internal implementations. You talked about using gevent,

36:23 Tornado. What are some of the internal pieces that are at work there?

36:28 Yeah. So on the distributed system, we use Tornado for concurrency. This is because we were

36:32 supporting both Python 2 and 3 at the time. Tornado is becoming more asyncIO friendly. That's sort of

36:37 becoming more common. So Tornado for concurrency. Also Tornado a little bit for TCP communications,

36:41 although we've had to improve Tornado's bandwidth for high-performance computers.

36:46 That's cool. So has Tornado gotten a little bit better because you guys have been

36:49 pushing it to the limit? Yeah, definitely.

36:50 That's good. Yeah, that's good.

36:52 Dask touches a ton of the ecosystem. And so Dask developers tend to get pretty comfortable with

36:57 lots of other projects. So Antoine Pitrou used to work on Dask a lot. He did a lot of the

37:01 networking stuff in Python, and he also worked on Tornado and did lots of things. So he was handling that.

37:06 So yeah, Tornado for concurrency and TCP communication. So all the Dask workers are

37:11 TCP servers that connect to each other. So it's kind of like a peer-to-peer application.

37:15 A bunch of just raw, basic Python data structures for our internal state, just because those are

37:21 the fastest we can find. The Python dictionary is actually a pretty efficient data structure.

37:25 So that's how we handle most of the internal logic. And then for computation, if you're using

37:29 Dask data frame, we're using pandas for computation. If you're using Dask array,

37:32 we're using numpy for computation. So we didn't have to rebuild those things.

37:35 And then, you know, compression libraries, encryption libraries, Kubernetes libraries.

37:40 There's sort of a broad set of things we end up having to touch.
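
A small illustration of that division of labor, runnable on a laptop: Dask coordinates, while each partition is a real pandas DataFrame that pandas itself computes on.

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"x": range(10)})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # Dask schedules the work; pandas runs it on each underlying partition.
    print(ddf.map_partitions(len).compute())  # row count per partition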

37:43 I can imagine. Is the cluster doing like HTTPS or other types of encrypted communication and

37:48 stuff like that between the workers and the supervisor?

37:50 Yeah. So not HTTPS. HTTP is sort of too slow for us. We tend to use TCP.

37:55 Okay.

37:56 Dask does support TLS out of the box, or SSL, you might know it as. But the comms are also

38:00 pluggable. So we're actually looking at right now a high-performance networking library called

38:04 UCX to do like InfiniBand and other stuff. Security is standard in that way, assuming

38:08 you're happy with TLS.

38:09 Yeah. Yeah. I suspect most people are. Until the quantum computers break it, and then we have

38:13 a problem.

38:14 Sure. Comms are extensible. We can add something else.

38:17 Like a quantum-safe algorithm.

38:19 Yeah. Yeah. I suspect we're going to be scrambling to solve the banking problem and e-commerce

38:24 problem faster first, though. That'll be where the real fear and panic will be. But luckily,

38:31 that's not here yet. So this just sounds like you guys are really pushing the limits of so much

38:36 of these technologies, which sounds like it must be a lot of fun to work on.

38:39 Yeah. It's been a wild ride.

38:40 I can imagine.

38:41 Yeah. It's actually been interesting also building conventions and protocols among the community.

38:46 I think Dask is at an interesting spot because we do talk to everybody. We talk to most scientific

38:52 disciplines. We talk to banking. We talk to Fortune 500 companies. And we see that they all have the

38:58 same problems. So I think recently I put together a bunch of people on file formats. Not because I care

39:04 deeply about file formats, because I had talked to all of them. We also talked to all the different

39:07 library developers. And it's really interesting to see, you know, OK, Tornado 4.5. Are we ready to

39:12 drop it or not, for example? And we talked to the Bokeh developers, the Jupyter developers, the Dask

39:16 developers, the Tornado developers. It's a well-integrated project into the ecosystem. I want to sort of like

39:21 hang there for a moment and say that because, again, Dask was designed to affect an ecosystem,

39:25 move an ecosystem forward, it actually has like its sort of tendrils into lots of different

39:30 community groups. And that's been a really fun thing to see. Sort of not on a technological side,

39:34 just on a social side. It's been very sort of satisfying and very interesting to see all these

39:39 different groups work together. Yeah, I can imagine you could have worked with not just the library

39:43 developers on all the different parts of the ecosystem, but you're also on the other side of

39:47 the scientists and the data scientists and people solving really cool problems, right?

39:50 You'd be amazed how often like climate scientists and quantitative traders have exactly the same problems.

39:56 You know, that's something I noticed doing professional training for 10 years. You know,

40:01 I taught a class at Edwards Air Force Base. I taught one at a hedge fund in New York and I taught one at a

40:07 startup in San Francisco and like 80% that's the same. It's all the same, right? Like it's good

40:13 computation. It's algorithmic thinking. It's like distributed systems. And then there's the special

40:18 sauce that probably I don't even understand anyway, right? But it's super interesting to realize that how

40:23 similar these different disciplines can be.

40:25 Yeah. And from an open source standpoint, like this is new because previously all those domains had

40:30 like a very different technology stack. Now they're all using the same technology stack. And so as a

40:35 person who sort of is paid to think about software, but also cares about advancing humanity, it's like a

40:41 really nice opportunity, right? We can take all this money from these quant traders, build that

40:45 technology and then apply that technology to genomics or something.

40:48 Right. Or climate science or something that like really needs to get it right quickly.

40:52 That's sort of a nice opportunity that we have right now, especially in Python, I think that we can

41:04 hire a bunch of people to do a bunch of good work that advances humanity. It also helps people get rich,

41:04 makes people rich playing the stock market, but you know, we can let that happen.

41:07 Hey, yeah. If they're helping drive it forward, you know, it might be okay.

41:10 Sure.

41:11 For the good, maybe that's debatable, but probably. It sounds like such a cool project. Now maybe let's

41:17 talk a little bit more of the human side of it. Like obviously you work a lot on it,

41:21 as you said, but who else? Like, are there a lot of people that maintain Dask or what's the story

41:26 there?

41:26 I've worked on it for a while within, so we started within Continuum or what is now called Anaconda.

41:31 And there's a bunch of other engineers within Anaconda who also work on Dask.

41:34 They do sort of halftime something else and halftime work on Dask.

41:37 Are they still?

41:37 Definitely. People like Jim Crist, Tom Augspurger, Martin Durant, and others throughout the company.

41:41 And so yeah, they'll maintain various parts of the project or push it forward. So like Jim recently has

41:46 been working on making Dask easier to deploy on Hadoop or Spark systems using Yarn. And that's

41:51 something that is important for Anaconda's business and also just generally useful for people. So there's

41:55 a sort of nice overlap between helping out the open source project and also getting paid to do the work.

42:00 Yeah.

42:01 Outside of Anaconda, so I'm now at NVIDIA, there's a growing group at NVIDIA who's also working on

42:05 Dask, which is pretty exciting. And also there's a bunch of other people who are working in other

42:09 research labs or other companies who need to use Dask for their work and need to build out the

42:14 Kubernetes connection or something like that. And they maintain that both because it's good for

42:18 their company and also they just sort of enjoy doing that. So I would say in sort of the periphery of

42:23 Dask, most of the work is happening in other companies and other people. And there's hundreds of people who do

42:29 work on Dask in any given month or two.

42:31 There's probably a lot of people that make like one or two little contributions to like fix their

42:36 edge case, but it all adds up, right?

42:37 There's that. There's also maybe like 20 or 30 people who are maintaining, not just adding a fix,

42:42 but like maintaining some Dask foo package. So there's like, there's a researcher, Alistair Miles,

42:48 who works in the Oxford area, who works on genomics processing. And he has like a Dask enabled

42:54 genomics suite called scikit-allel. So that's his job, right?

42:58 So he's working on that day to day. There's, you know, the scikit-learn developers maintain the Dask

43:03 scikit-learn connection through Joblib. So Olivier Grisel has commit rights over all of Dask. He's

43:07 the main scikit-learn contributor. So there's a bunch of sort of, as Dask expands out into the rest of

43:13 the ecosystem, that it's sort of being adopted by many, many smaller groups who are in other

43:18 institutions.
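
That Joblib connection looks roughly like this. A sketch: it assumes scikit-learn and the distributed library are installed, since importing the latter registers the "dask" joblib backend:

    import joblib
    from dask.distributed import Client
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    client = Client()  # makes the "dask" backend available

    X, y = make_classification(n_samples=1_000)
    search = GridSearchCV(RandomForestClassifier(),
                          {"n_estimators": [50, 100, 200]})

    # scikit-learn's internal parallelism fans out to the Dask cluster.
    with joblib.parallel_backend("dask"):
        search.fit(X, y)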

43:19 So it sounds to me like there might be some decent opportunities for folks to contribute

43:23 to Dask in the sense that, you know, maybe there's somebody that's doing some kind of research

43:29 and their library or something like it would really benefit from having a Dask level version of that.

43:36 So if people are looking to contribute to open source and the data science space, are you guys

43:41 looking for that kind of stuff?

43:42 Yeah, definitely. I mean, you don't have to look these days. People are just doing it.

43:45 What I would say is that there's probably, in whatever domain you're in, there's probably some

43:50 group that's already thinking about how to use Dask in that space and jump onto that group.

43:55 There's a lot of sort of like low-hanging fruit right now. Pick your favorite PyData library.

43:59 Think about how it could scale. And there's probably some small effort around that currently.

44:03 And there's a lot of, you know, if you want to get into open source, it's a great place to be.

44:07 You're like right on the very exciting edge of big data and scalability. Things are pretty easy today.

44:12 We've solved a lot of the major issues with sort of the core projects like NumPy and

44:17 Pandas and scikit-learn. So yeah, there's a ton of pretty exciting work that's like

44:21 just starting to pick up now. So yeah, if you're interested in open source, it's a great place to start.

44:25 It seems like there's so many data science libraries that there's probably a lot of opportunity to

44:30 like bring some of them into the fold. Pretty low-hanging fruit there. So let me ask you some

44:35 kind of "how wild is it" questions. One is, what's the biggest cluster you've ever seen it run on?

44:43 The biggest I've had-

44:44 Or heard about it running on?

44:45 Yeah. I would say like a thousand machines is probably the biggest that I've seen.

44:51 And that's probably within an order of magnitude of the limit today. But those machines are big machines.

44:56 They have like 30 cores on them. Yeah. Like that's also just mostly for benchmarking. Like

45:00 most problems in data science aren't that big. Right. A thousand is sort of it.

45:03 That's awesome. I was thinking of places like CERN that have their big grid computing stuff and things

45:08 like that. Like they could potentially spin up something pretty incredible.

45:12 That gets used on a fair number of supercomputers today. But it's very, very rarely used on like

45:18 the entire supercomputer. There's really no system outside of MPI that I would expect to run on the full

45:24 like, CERN federated grid. So most of the sort of data-science-oriented libraries end up sort of maxing

45:31 out around a thousand machines. I mean, hypothetically you go higher, but the overheads will be enough to

45:36 kill you. Does it have a sort of a, like how many people's hands do you have to shake to shake

45:40 everyone's hand in a group? You know, like that n-factorial type of issue. As you add more,

45:45 does it get like harder and harder to add the next worker to the cluster or is it okay?

45:49 No, it's fine. Everything's linear in terms of the number of workers. It's not,

45:53 like, a full mesh. Workers talk to the scheduler to find out who they have to talk

45:56 to. They'll then talk peer-to-peer to those workers. But in any given computation, you only talk to, like,

46:01 you know, five to 10 people at once.

46:02 Okay.

46:03 It's fine.

46:03 Yeah. Interesting. Okay. The other one is what's the coolest thing that you've seen

46:07 computed or run on Dask or built with Dask?

46:09 I'm going to say, there's things I can't, yeah, things I can't talk about, but.

46:13 Some are, the most amazing ones are super secret.

46:16 I'm going to point people to the Pangeo project. So this is actually my most recent

46:19 Okay.

46:19 PyCon talk in 2018, talking about other people's work. So Pangeo is pretty cool.

46:23 Pangeo is a bunch of earth scientists who are doing climate change, meteorology, hydrology,

46:28 figuring out the earth. And they have these, you know, hundred-terabyte to petabyte arrays that

46:33 they're trying to work with. They haven't quite gotten to the petabyte scale yet.

46:35 And yeah, so they usually set up something like JupyterHub on either a supercomputer or

46:40 on the cloud with Kubernetes. They then analyze tens or hundreds of terabytes of data with

46:45 Xarray and Dask under the hood and get out, you know, nice plots that show that indeed sea

46:50 levels are rising. I think it's pretty cool in that it's a good community of people. So there's

46:55 like a bunch of different groups, NASA, universities, USGS, UK Met Office, anyone who's looking at

47:01 large amounts of data, companies, Planet Labs, et cetera. So it's a good community.
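
To give a flavor of that workflow, here's a minimal sketch of the Xarray-on-Dask pattern Pangeo uses. The bucket URL and variable name are hypothetical stand-ins, not a real Pangeo dataset, and reading from Google Cloud Storage would also need gcsfs installed.

```python
import xarray as xr

# Hypothetical Zarr store of gridded ocean data; opening it lazily
# gives Dask-backed arrays rather than loading anything yet.
ds = xr.open_zarr("gs://hypothetical-bucket/sea-surface-height.zarr")

# Build the computation lazily: a spatial mean for every timestep.
trend = ds["sea_surface_height"].mean(dim=["lat", "lon"])

# Only now does Dask read and reduce the data, out on the cluster.
print(trend.compute())
```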

47:05 They're like figuring out how to store one copy of the data in the cloud and then have everyone

47:11 compute on that using cool technologies like Kubernetes. They're inventing cool technologies

47:15 like how do we store multi-dimensional arrays in the cloud. So yeah, like that's maybe a good

47:19 effort. Not necessarily cool computation, like from a Dask perspective, it's actually pretty

47:24 like vanilla, but it's cool to see a scientific domain advance in a significant way using some

47:31 of these newer technologies like JupyterHub and Dask and Kubernetes. And I'd like to see that

47:34 repeated. So I think we could do the same thing with medical imaging and with genomics. So that's maybe

47:39 like a cool example. It's not so much a technical situation, it's a community situation.

47:43 Yeah. Well, it sounds like a cool example that could be just replicated across verticals.

47:48 Definitely. We're starting to see that, which is really exciting.

47:50 Okay. Those are great. So when I was looking into Dask, I ran across some other libraries that are

47:55 like, kind of like you described, Dask-foo. So there was Dask-CUDA, Dask-ML, Dask-Kubernetes. Do you want to

48:02 just talk a little briefly about some of those types of things?

48:05 Sure. And why you might use them?

48:06 I'll go in reverse order. So Dask-Kubernetes is a library that helps you deploy Dask clusters

48:11 on Kubernetes.

48:12 So you point it at, like, a Kubernetes cluster and you say, go forth and use that, and it could

48:18 just create the pods and run them and all that?

48:20 Definitely. So it creates pods for all the workers. It manages those pods. There are,

48:23 like, little buttons; you ask it to scale up or scale down. It'll manage all that stuff for you

48:27 using the Kubernetes API. It presents users with like a super simple interface. They can open up

48:32 a Jupyter notebook and just sort of play around. So that's what that does. And it has peers like

48:36 Dask-Yarn for Hadoop/Spark clusters and Dask-Jobqueue for more high-performance computing schedulers like Slurm,

48:42 PBS, SGE, LSF. So that's Dask-Kubernetes. That's like on the deployment side.
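
For a sense of what that looks like in code, here's a minimal sketch of the Dask-Kubernetes pattern from around this era; the worker-spec.yml file name and the worker counts are example assumptions, not anything prescribed on the show.

```python
from dask.distributed import Client
from dask_kubernetes import KubeCluster

# worker-spec.yml is an ordinary Kubernetes pod spec describing the
# Dask worker container; the file name here is just an example.
cluster = KubeCluster.from_yaml("worker-spec.yml")

cluster.scale(10)                        # ask Kubernetes for ten worker pods
# cluster.adapt(minimum=1, maximum=50)   # or let it autoscale with load

client = Client(cluster)                 # a regular Dask client, pointed at the pods
```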

48:48 Dask-ML is an effort to pull those together. There were a lot of sort of interesting ways of using Dask with machine learning algorithms,

48:53 things like GLMs, generalized linear models, XGBoost, scikit-learn, that were all different

48:59 technical approaches. And Dask-ML is almost like a namespace library that collects all of them

49:03 together, but also using the sort of scikit-learn-style APIs. So if you like scikit-learn and you want

49:08 to scale that across a cluster, either with simple things like hyperparameter searches or with large

49:14 data things like logistic regression or XGBoost, you should look at Dask-ML, and it'll help you do that.
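
To give readers a feel for that scikit-learn-style API, here's a minimal sketch; the sample counts and chunk sizes are arbitrary example values, not anything from the episode.

```python
from dask.distributed import Client
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

client = Client()  # local cluster; swap in a real scheduler address to scale out

# A dataset materialized as chunked Dask arrays, so it never has to
# fit in memory all at once.
X, y = make_classification(n_samples=1_000_000, chunks=100_000)

# Same fit/predict interface as scikit-learn, trained across the cluster.
lr = LogisticRegression()
lr.fit(X, y)
print(lr.predict(X[:5]).compute())
```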

49:18 Dask-CUDA is a new thing I've been playing with since moving to NVIDIA. People

49:23 have been using Dask with GPUs for a long time and they've developed like a bunch of weird wonky

49:27 scripts to like set things up just the right way. There were a ton of these within NVIDIA when I

49:31 showed up. Dask-CUDA is just a few, like, nice, convenient Python ways to set up Dask to run on either one

49:38 machine with many GPUs or on a cluster of machines with lots of GPUs and how to make that easy.

49:42 So it's kind of like Dask-Kubernetes in that it's about deployment, but it's more focused on

49:47 deploying around GPUs and there are some intricacies to doing that well.
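
A minimal sketch of that convenience layer, assuming a machine with one or more NVIDIA GPUs and the dask-cuda package installed.

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per visible GPU, with device-to-worker pinning
# handled for you.
cluster = LocalCUDACluster()
client = Client(cluster)

print(client)  # reports one worker per GPU on this machine
```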

49:51 I imagine you can scale like crazy if you put the right number of GPUs on a bunch of machines and let

49:58 it go.

49:58 No, GPUs are actually pretty game-changing. It's fun.

50:02 Yeah. It's kind of near the limits of human understanding to think about how much computation

50:08 per second happens on a GPU.

50:10 I think CPUs are also well beyond the limits of human understanding.

50:14 But this is like the galaxy versus the universe, you know, like...

50:20 Sure.

50:20 They're both outside of our understanding.

50:22 I should like mention that I work for NVIDIA now, so please don't trust anything I say.

50:25 But yeah, GPUs are like fundamentally like a game changer in terms of computation. Like you can get a

50:30 couple orders of magnitude performance increase out of the same power, which I mean, from like a science

50:35 perspective, like has the opportunity to unlock some things that we couldn't do before. GPUs are also

50:39 notoriously hard to program, which is why most scientists don't use them. And so that's sort of

50:44 part of why I work at NVIDIA today. They're building out a GPU data science stack and they're using

50:49 Dask to parallelize around it. And so it's fun to see... It's fun both to sort of grow Dask within NVIDIA and have

50:54 another large company behind it. But also from a science perspective, like I think we can do some

50:59 good work to improve accessibility to different kinds of hardware like GPUs that might dramatically

51:04 change the kinds of problems that we can solve, which is exciting in different ways.

51:08 Yeah, very incredible. And I would much rather see GPU cycles spent on that than on cryptocurrency.

51:13 Sure.

51:14 Although cryptocurrency is good for NVIDIA and all the graphics card makers,

51:18 it doesn't seem as useful as, like, solving science problems on them.

51:22 All right. We're getting close to our time that we have left. So I'll ask you just a couple of

51:28 quick questions before we wrap it up. So we talked about where Dask is. Like, where is it going? It

51:34 sounds like there's a lot of headroom for it to grow. And NVIDIA is investing in it by sort of this work

51:40 that you all are doing. And it sounds like they're also building something amazing. So yeah, what's next?

51:44 NVIDIA aside for a moment, I'll say that Dask technologically is sort of done. The core of it,

51:49 there's plenty of bugs and plenty of features we can add, but we're sort of at like the incremental

51:52 advancement stage. I would say where I'm seeing a lot of change right now is in broadening it out

51:58 to other domains. So I mentioned Pangeo a little bit before. I think that's a good example of Dask

52:02 spreading out and solving a particular domain's problem. And we're seeing, I think, a lot more of

52:06 that. So I think we'll see a lot more social growth and a lot more applications to new domains.

52:11 That's really what I'm pretty excited about.

52:12 That's cool. So it's like Dask is basically ready, but there's all these other parts of the ecosystem

52:16 that could now build on it and really amp up what they're doing.

52:20 Yeah, let's go tackle those. Let's go look at genomics.

52:22 Yeah, for sure.

52:23 Let's go look at imaging. Let's go look at medicine. Let's go look at all of the parts of the world that

52:28 need computation on a large scale, but that didn't obviously fit into the sort of Spark,

52:33 Hadoop, or TensorFlow regime. And Dask, because it's more general, can probably handle those things a lot

52:38 better.

52:38 Yeah, it's amazing. I guess, let me ask you one more Dask question before we wrap it up real quick.

52:43 So Dask, is Dask 100% Python?

52:47 The core part, yeah.

52:48 Or near?

52:48 Yes. You're probably also using NumPy and Pandas, which are not pure Python.

52:51 Which, yeah, right, as we discussed, but yeah.

52:53 Dask itself is pure Python.

52:54 I think it's really interesting because you think of this really high-performance computing

52:59 sort of core, right? And it's not written in Go or C, but it's written in Python. And I just

53:06 think that's an interesting observation, right?

53:08 Python is actually not that bad when it comes to core data structure manipulation,

53:13 tuples, lists, dictionaries. It's maybe like a factor of two to five slower, not a hundred times

53:17 slower. It's also not that bad at networking. It does that decently well. Socially, the fact

53:21 that Python had both a data science stack and a networking stack in the same language made it

53:26 sort of the obvious right choice. That being said, I wouldn't be surprised if in a couple of years,

53:30 we do rewrite the core bits in C++ or something, but we'll see.

53:33 Sure. Eventually. When it's kind of finely baked and you're looking for that extra 20,

53:37 30% here or there, right?

53:38 Yeah, we'll see. We haven't yet reached the point where it's a huge bottleneck to most people,

53:42 but like the climate science people do have petabyte arrays and sometimes the scheduler does

53:45 become a bottleneck there.

53:46 Right. It's cool. It's interesting. So final two questions. If you're going to write some

53:51 Python code, what editor do you use?

53:53 I use Vim, not for any particular reason. I just showed up at the computer lab one day and said,

53:57 how do I use Linux? And the guy there was a Vim user. So now I'm also a Vim user.

54:01 I think I haven't moved off just because I'm so often on other machines. So like rich editors just

54:07 don't move over as easily.

54:08 Yeah. I'm on the cluster. I need to edit something. It doesn't work to fire up PyCharm and do that so

54:13 easily.

54:14 Right. Yeah. So Vim for now, but not religiously, just anecdotally.

54:18 Sure. Sounds good. Yeah. I'm sure you do some stuff with Jupyter Notebooks as well every now and then.

54:23 Sure. I'll maybe also put in a plug for JupyterLab. I think everyone should switch from the classic

54:27 notebook to JupyterLab, especially if you're running Dask because you can get all those dashboards

54:31 inside JupyterLab really nicely. It's a really nice environment.

54:33 Oh yeah. That sounds awesome. All right. And then notable PyPI package. I'll go ahead and throw

54:38 pip install Dask out there for you.

54:40 So I'll cheat a little bit here. I'm going to list a few. First, on the theme of the NumPy space,

54:46 I really like how NumPy is expanding beyond just the NumPy implementation.

54:49 So there's CuPy, which is a NumPy implementation on the GPU.

54:53 There's Sparse, which is a NumPy implementation with sparse arrays. And there's Aura, which is a

54:58 new datetime dtype for NumPy. I like that all of these were built on the NumPy API, but were built

55:04 outside of NumPy. I think that kind of heterogeneity is something we're going to move towards in the

55:08 future. And so it's cool to see NumPy develop protocols and conventions that allow that kind

55:12 of experimentation. And it's cool to see people picking that up and building things around it.

55:16 So that's pretty exciting.
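
To make the shared-API point concrete, here's a tiny sketch with CuPy; the array sizes are arbitrary, and a CUDA-capable GPU with cupy installed is assumed.

```python
import numpy as np
import cupy as cp

x_cpu = np.random.random((5_000, 5_000))
x_gpu = cp.asarray(x_cpu)            # move the array onto the GPU

# The exact expression you'd write in NumPy, now running on the GPU.
result = cp.tanh(x_gpu @ x_gpu.T).sum()

print(float(result))                 # pull the scalar back to the host
```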

55:17 Those are great picks. It's cool that I learned one API and now I can do other things,

55:23 right? Like I know how to do the NumPy stuff, so now I can do it on a GPU.

55:26 Right. Or I've got my NumPy code. I want to have better date times. I can just add this new library

55:31 and suddenly all my NumPy code works. That's the sort of extensibility, which is something that we

55:35 need, I think, going forward.

55:37 Yeah, absolutely. All right, Matthew. Well, this was super interesting. Thank you for being on the show.

55:41 You have a final call to action. People want to get involved with Dask. What do they do?

55:44 Try it out. You should go to examples.dask.org. And there's a launch binder link on that website.

55:51 If you click on that link, you'll be taken to a JupyterLab notebook running in the cloud,

55:54 which has everything set up for you, with a bunch of examples. So you can try things out,

55:57 see what interests you, and play around. So again, that's examples.dask.org.

56:00 Yeah, and I also link to the Dask examples GitHub repo that you have on your account

56:04 in the show notes. Yeah, so people can check that out too.

56:07 Awesome. All right. Well, thank you for being on the show. It was great to chat with you and

56:10 keep up the good work. It looks like you're making a big difference.

56:13 Great. Thank you so much for having me, Michael. Talk to you soon.

56:15 Yep. Bye.

56:15 This has been another episode of Talk Python to Me. Our guest on this episode was Matthew Rocklin,

56:22 and it's been brought to you by Linode and Backlog. Linode is your go-to hosting for whatever

56:27 you're building with Python. Get four months free at talkpython.fm/Linode. That's L-I-N-O-D-E.

56:34 With Backlog, you can create tasks, track bugs, make changes, give feedback, and have team

56:39 conversations right next to your code. Try Backlog for your team for free for 30 days using the special

56:45 URL talkpython.fm/Backlog. Want to level up your Python? If you're just getting started,

56:52 try my Python Jumpstart by Building 10 Apps course. Or if you're looking for something more advanced,

56:58 check out our new async course that digs into all the different types of async programming you can do

57:03 in Python. And of course, if you're interested in more than one of these, be sure to check out our

57:07 everything bundle. It's like a subscription that never expires. Be sure to subscribe to the show.

57:12 Open your favorite podcatcher and search for Python. We should be right at the top. You can also find

57:17 the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at

57:23 /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really

57:29 appreciate it. Now get out there and write some Python code.

57:31 Bye.
