
#207: Parallelizing computation with Dask (Transcript)

Recorded on Wednesday, Feb 20, 2019.

00:00 Michael Kennedy: What if you could write standard NumPy and Pandas code, but have it run on a distributed computing grid for incredible parallel processing right from Python? How about just splitting the work across multiprocessing to escape the limitations of the GIL right on your local machine? That's what Dask was built to do. On this episode, you'll meet Matthew Rocklin to talk about its origins, use cases, and a whole bunch of other interesting topics. This is Talk Python to Me, Episode 207, recorded February 20th, 2019. Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @MKennedy, keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. Hey folks, a quick announcement before we get to our conversation with Matthew. People have been enjoying the courses over at Talk Python Training for years now, but one shortcoming we've been looking to address is to give you offline access to your content while you're traveling, or if you're just in a location with spotty internet speeds. We also want to make taking our courses on your mobile devices the best they can be. That's why I'm thrilled to tell you about our new apps, which solve both of these problems at once. Just visit training.talkpython.fm/apps and download the Android version right now. The iOS version is just around the corner. It should be out in a week or so. You'll have access to all of your existing courses, but we've also added a single tap feature to allow you to join our two free courses, the Responder web framework mini-course, and the MongoDB quick start course. So, log in, tap the free course, and you'll be right in there. Please check out the apps and try one of the free courses today. Now, let's talk Dask with Matthew. Matthew, welcome to Talk Python to Me.

01:54 Matthew Rocklin: Thanks, Michael. I've listened to the show a lot. It's really an honor to be here. Thank you for having me.

01:58 Michael Kennedy: It's an honor to have you, as well. You've done so much cool work in the distributed computing space, and the data science and computation space. It's going to be super fun to dig into all that, and especially dig into Dask. But before we get to that, of course, let's start with your story. How did you get into programming with Python?

02:14 Matthew Rocklin: Yes, I got into programming originally, actually on a TI basic calculator. So, I was in math class plugging along. Spent my few thousand hours not being a good programmer there. C++ in high school, did some MATLAB, and IDL, and C# in college. Did some engineering jobs and some science, and then in grad school, just tried out Python on a whim. I found it was all of those things at the same time, so it was fun to work in object oriented mode like C#. It was fun to do data analytics like with MATLAB. It was fun to do some science work that I was doing in astronomy at the time like with IDL. So, it's sort of all things to all people, and eventually, I was in a computer science department and I found that Python was a good enough language that computer scientists were happy with it, but also a usable enough language that actual scientists were happy with it. I think that's probably where Python gets a lot of its excitement, at least on the sort of numeric Python side.

03:07 Michael Kennedy: I think that's a really nice insight. It sounds like a cool way you got into programming. I feel like one of Python's superpowers is it's kind of what I think of as a full spectrum language. You can make it what you need it to be at the beginning, but like you said, professional developers and scientists can use it, as well. You kind of touched on that, right? You can use it like C#, or you can use it like MATLAB, or you can kind of be what you need it to be, but if you need more it's not like, well, now you go write in C++. It's just like well now use these other features.

03:38 Matthew Rocklin: Yeah. Definitely. I'd also say it has a lot of different parts to it. Before, there was the web programming community, there was the community using it kind of like Bash, the sysops community, and there was the scientific community. Now those are all kind of the same, and it's really cool to see all those groups come together and use each other's stuff. You see people using scikit-learn inside of a Flask app deployed with some other Python thing.

03:59 Michael Kennedy: And something similar, right? Yeah. For sure.

04:01 Matthew Rocklin: And the fact that all the things were around makes it sort of the cool state that it's in today.

04:05 Michael Kennedy: Yeah, it's just getting better, right?

04:06 Matthew Rocklin: But it was already good. It's just like connecting all the good parts.

04:09 Michael Kennedy: Exactly. The interconnections are building a little bit stronger across these disciplines, which is kind of rare, right? There's not that many languages and ecosystems where that happens, right? Like you think of Swift for iOS, right? Swift is a pretty cool language. It has some issues that I could take with it, but it's pretty cool, but it's not also used for data science and for devops, right? It's just not.

04:28 Matthew Rocklin: Right. Yeah.

04:29 Michael Kennedy: That's pretty cool. Also, you said you did some IDL. It sounds like you were doing astronomy before, right?

04:34 Matthew Rocklin: Yeah, this is just in undergrad. I was taking some astronomy courses. It was fun, though, 'cause I knew how to program and I was really productive as a scientist because I was fluent in programming. That ability to be more productive really took me away from science and towards computer science, but in service of science. So, that was my experience.

04:52 Michael Kennedy: Now you're building these platforms where legitimate world class science is happening.

04:58 Matthew Rocklin: Yeah. No, it's cool.

04:59 Michael Kennedy: Right? You must be really proud, right? To think of what people are doing with some of the software you're building, and you're collaborating...

05:06 Matthew Rocklin: Yeah, the software that all of us build, and it's definitely very cool. It's very satisfying to have that kind of impact. Not just in one domain, but across domains. I get to work with lots of different kinds of people everyday that are solving, I think, high impact problems. I think that's definitely a strong motivator for me.

05:20 Michael Kennedy: Yeah, I'm sure it is. Programming is fun, but it's really not fun to work on a project and have nobody use it. But to see it thrive and being used, it really is a gratifying experience.

05:31 Matthew Rocklin: Yeah, I'll go so far as to say that programming is no longer fun. It's just the way we get things done, and seeing them done, as you say, is really where a lot of the satisfaction comes from.

05:39 Michael Kennedy: Yeah, it's cool. So, how about today? What are you up to these days? You were, put it in past tense, you were doing a lot of stuff with Anaconda Inc, previously named Continuum, but same company.

05:50 Matthew Rocklin: Mm-hmm.

05:51 Michael Kennedy: Recently, you've made a change, right?

05:52 Matthew Rocklin: Yeah, so I like to say it's the same job, just a different employer. Yeah, so I guess my job today is a few things. I work in the open source numeric Python ecosystem, so those are usually called SciPy or PyData. And I mostly think about how that system scales out to work with multicore machines or with distributed clusters. I usually do that with a project called Dask. That's maybe like the flagship project that I help maintain, but I am just generally active throughout the ecosystem. So, I talk to the NumPy developers, the Pandas developers, scikit-learn and Jupyter developers. I think about file formats, and protocols, and conventions. There are a few of us...

06:27 Michael Kennedy: Networking?

06:28 Matthew Rocklin: Networking. Yeah, lots of things. So, generally in service of those science use cases, or in business use cases, there's tons of things that go wrong, and I tend to think about whatever it is that needs to get handled. But generally, I tend to focus on Dask, which I suspect we'll talk about more today.

06:42 Michael Kennedy: Yeah, a tiny bit. So, speaking of Dask, let's set the stage. Talk a little bit about why does Dask exist, and so on? And I guess maybe start with a statement that you made at PyCon, which I thought was pretty interesting, and you said that Python has a strong analytical computation landscape, and it's really, really powerful. Like Pandas, and NumPy, and so on, but it's mostly focused on single core computation, and importantly, data that fits into RAM. Want to elaborate on that a little?

07:13 Matthew Rocklin: Sure, so everything you just said, so all of those libraries. So, NumPy, and Pandas, and scikit-learn are loved today by data scientists because the APIs are, they do everything they want them to do. They seem to be well tailored to their use cases, and they're also decently fast, right? So, they're not written in pure Python, they're written in C, Fortran, C++, Cython, LLVM, whatever.

07:34 Michael Kennedy: Right, and so much of the interaction is like, I have the data in Python. I just pass it off. Or maybe I don't even have it in Python. I have a NumPy array, which is a pointer to data down at the C level, which I then pass off to another part, and then internally, down in C, it cranks on it and gives the answer back, like, five. Or something, right? So, I think the whole story around performance, and is Python slow, is Python fast? It gets really interesting really quick because of those types of things.

08:04 Matthew Rocklin: Yeah. So, we have really nice, usable Python APIs, which is kind of like the front end to a data scientist, and that's hitting a backend, which is C, C++, Fortran code, which is very fast. And that combination's really useful. Those libraries also have now decades of expertise, PhD theses thrown into them. Just tons of work to make them fit, and people seem to like them a lot. And it's also not just those packages. There's hundreds of packages on top of NumPy, Pandas, scikit-learn, which do biology, or do signals processing, or do time series forecasting, or do whatever, and there's this ecosystem of thousands of packages, but that code, that code that's very powerful, is really just powerful on a single core, usually, on data that fits in RAM. So, as sort of big data comes up, or as these clusters become more available, the whole PyData, SciPy stack starts to look more and more antiquated. And so, the goal that I was working on for a number of years, and continuing to work on, is how do we elevate all of those packages? NumPy, Pandas, scikit-learn, but also, everything that depends on them, to run well on distributed machines, to run well in new architectures, to run well on large data sets? How do we sort of refactor an ecosystem? That's a really hard problem.

09:11 Michael Kennedy: You know, that's a super hard problem, and it's, I think, one of the challenges that PyPy, P-Y-P-Y, the JIT-compiled alternative to CPython, has. I think that would've really taken off had it not been for, yes, we can do it this other way, but you have to lose NumPy, or you have to lose this C library, or you lose the C speedups for SQLAlchemy, right? You're trying to balance this problem of like, I want to make this distributed, and fast, and parallel, but it has to come along as is, right?

09:43 Matthew Rocklin: Right. Yeah, we're not going to rewrite the ecosystem. That's not a feasible project today.

09:47 Michael Kennedy: Right. I mean, theoretically, you could write some alternate NumPy, but people trust NumPy. And like you said, it's got so much polish it's insane to try to recreate it.

09:57 Matthew Rocklin: Yeah, it also has, even if you recreated it, there's a lot of weight on it. A lot of other packages depend on NumPy in very arcane ways, and you'd have a hard time reproducing all of those.

10:06 Michael Kennedy: Right. Yeah, it's like boiling the ocean, or trying to boil the ocean, pretty much. Okay, and then, so that's the problem. We have problems that are too big for one machine. We have great libraries that we can run on one machine. How do you take that ecosystem and elevate it in a way that's roughly the same API, right? We had certain things like that, right? Like Spark, for example, is one, where you can do kind of distributed computation, and there's Python interactions with Spark, right?

10:38 Matthew Rocklin: Yeah, definitely. But Spark is again, going down that rewrite path, right? Spark is not using NumPy or Pandas. They're rewriting everything in Scala, and that's just a huge undertaking. Now, they've done a great job. Spark is a fantastic project. If Spark works for you, you should use it, but if you're used to Pandas, it's awkward to move over. Most good Python Pandas devs that I know who move to Spark eventually move to Scala. They eventually give up on the Python side and move over, and that makes sense, given how things...

11:01 Michael Kennedy: Right.

11:02 Matthew Rocklin: But if you wanted to say...

11:02 Michael Kennedy: If you're going to work there, you might as well be native in that space, right?

11:05 Matthew Rocklin: But if you wanted to say, use NumPy, like there is no Spark NumPy, nor is there a Spark SciPy, nor is there a Spark time series prediction.

11:13 Michael Kennedy: Everything built on it.

11:13 Matthew Rocklin: Yeah. So, Spark really handled sort of the Pandas use case, to a certain extent, and like a little bit of the scikit-learn use case, but there's this other sort of 90% that uses them that's not handled, and that's where Dask tends to be a bit more shiny.

11:26 Michael Kennedy: Right. I don't do much with Spark, so you'll have to keep me honest here, but isn't Spark very map-reduce-y? It has a few types of operations and pipelines you can build but you can't just run arbitrary code on it, right?

11:40 Matthew Rocklin: Yes and no. So yes, underneath the hood, from, like, a true systems point of view, Spark implements sort of the same abstraction as MapReduce, but it's much more efficient, and it's much more usable. The APIs around it are much nicer. So, if you do something like a groupby aggregation, that is sort of a map and then a reduction under the hood. And so, as long as you sort of stay within the set of algorithms that you can represent well with MapReduce, Spark is a good fit. Not everything in sort of the PyData, SciPy stack looks like that, it turns out, and so we looked at Spark early on. Hey, can we sort of use the underlying system to parallelize out PyData? The answer came back pretty quickly: no. It's just not flexible enough.

12:16 Michael Kennedy: Yeah, and I love that this format is audio, but for this little part, I wish we could show some of the pictures that you've had in some of your presentations and stuff, because the graphs of how all the data flows together is super, super interesting, and complicated, and sort of self-referential, and all sorts of stuff.

12:33 Matthew Rocklin: Yeah, those visuals are really nice, and it's nice that they're built in. You get a good sense of what's going on.

12:37 Michael Kennedy: Yeah, it's cool. Alright, so let's talk about Dask. You've mentioned it a couple of times. Dask is a distributed computation framework focused on parallelizing and distributing the data science stack of Python, right?

12:52 Matthew Rocklin: Yeah, so let me describe it maybe, I'll just do it from a 10,000 foot view first, and then I'll dive in a little bit. So, from the 10,000 foot view, the tagline is that Dask scales PyData. Dask was designed to parallelize the existing ecosystem of libraries like NumPy, Pandas, and scikit-learn, either on a single machine, scaling up from memory to disk, using all of your cores, or across a cluster of machines. So, what that means is if you like Pandas, you like NumPy, and you want to scale those up to many terabytes of data, you should maybe investigate Dask. So, diving in a little bit, Dask is kind of two different levels. I'll describe Dask at a low level first, and then a high level. So, at a low level, Dask is a generic Python library for parallel computing. Just think of it like the threading module, or like multiprocessing, but just more. There's more sophistication and there's more complexity. It handles things like parallelism, task scheduling, load balancing, deployment, resilience, node failover, that kind of stuff. Now, using that very general purpose library, we've built up a bunch of libraries that look a lot like PyData equivalents. So, if you sort of take Dask and NumPy and smash 'em together, you get Dask array, which is a large, scalable array implementation. If you take Dask and Pandas and squash 'em together, you get Dask DataFrame, which uses many Pandas DataFrames across a cluster, but then can sort of give you the same Pandas API. So, it looks and feels a lot like the underlying, sort of, normal libraries, NumPy or Pandas, but it scales out. So, today when people talk about Dask, they often mean Dask DataFrame or Dask array.

14:20 Michael Kennedy: Yeah, and so just to give people a sense, when you interact with one of these, say, Dask arrays, it looks like a NumPy array. You treat it like a NumPy array mostly, but what may actually be happening is you might have 10 different machines in a cluster, and on each of those machines, you might have some subset of that data in a true NumPy array, right? They know how to coordinate operations. So, if you do, like, a dot product on that with some other array, it can figure out the coordination and communication across all the different machines in the cluster to compute that answer in a parallel way, but you think of it as, like, well, I just do a dot product on this array and we're good, right? Just like you would in NumPy.

15:01 Matthew Rocklin: That's exactly correct. Yeah, so Dask is like a very efficient secretary that's doing all the coordination work. NumPy's doing all the work on every node, and then we've written a lot of parallel algorithms around NumPy that know how to do a big matrix multiply with a lot of small matrix multiplies. On the Pandas side, a big join with a lot of smaller joins.
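
As a minimal illustration of that pattern, here's a dask.array sketch; the shapes and chunk sizes are just made up for the example:

```python
import dask.array as da

# A large logical array backed by many small NumPy chunks;
# each 1000x1000 chunk is a plain NumPy array under the hood.
x = da.random.random((10000, 10000), chunks=(1000, 1000))

# Looks like NumPy, but only builds a task graph: one big matrix
# multiply expressed as many small matrix multiplies.
y = x.dot(x.T)

# Nothing executes until a concrete result is requested.
print(y.mean().compute())
```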

15:17 Michael Kennedy: It sounds like a lot of coordination, and a challenging problem to do in the general sense.

15:23 Matthew Rocklin: Yeah, it's super fun.

15:24 Michael Kennedy: The general solution of that. Yeah, I bet it is. A lot of linear algebra and other types of stuff in there, I'm guessing.

15:30 Matthew Rocklin: Sure, or a lot of Pandas work, or a lot of machine learning work.

15:33 Michael Kennedy: Yeah, yeah, for sure.

15:34 Matthew Rocklin: Or a lot of other work. So, I want to point out here that the lower level Dask library, the thing that just does that coordination is also used separately from NumPy and Pandas. I'm sure we'll get to that in the future. So, Dask is sort of this core library. Dask handles parallelism. You can use it for anything. There's also like a few big uses of it for the major PyData libraries.

15:52 Michael Kennedy: Yeah, we have Dask array and we have Dask DataFrame, and so these are really cool. We said the API is really similar, but there's one significant difference, and that is that everything is lazy, right? Like a good programmer.

16:08 Matthew Rocklin: Sure, so Dask array and Dask DataFrame both do lazy operations. And so yeah, at the end of your computation, you do a DataFrame, you read parquet, filter out some rows, do a groupby aggregation, get out some small result. You then call .compute at the end of that. There's a method on your Dask DataFrame object, and that will then ship the sort of recipe for your computation off to a scheduler, which will then execute that and give you back a result, probably as a Pandas DataFrame.
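
A rough sketch of that lazy workflow; the path and column names here are made up:

```python
import dask.dataframe as dd

# Lazy: this records how to read the data, but reads nothing yet
df = dd.read_parquet("s3://bucket/rides.parquet")  # hypothetical path

# These calls also just extend the task graph
df = df[df.fare > 0]                               # filter out some rows
result = df.groupby("passenger_count").tip.mean()  # groupby aggregation

# Only .compute() ships the recipe to the scheduler and runs it,
# returning an ordinary Pandas object
answer = result.compute()
```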

16:32 Michael Kennedy: Right, so there's two major parts here. One is the data structures that you talked about, and the other is the scheduler idea, right? And those are almost two separate, independent things that make up Dask, right?

16:43 Matthew Rocklin: Two separate code bases even.

16:43 Michael Kennedy: Okay. This portion of Talk Python to Me is brought to you by Linode. Are you looking for hosting that's fast, simple, and incredibly affordable? Well, look past that bookstore and check out Linode at talkpython.fm/linode. That's L-I-N-O-D-E. Plans start at just $5 a month for a dedicated server with a gig of RAM. They have 10 data centers across the globe, so no matter where you are or where your users are, there's a data center for you. Whether you want to run a Python web app, host a private Git server, or just a file server, you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24/7 friendly support even on holidays, and a seven day money-back guarantee. Need a little help with your infrastructure? They even offer professional services to help you with architecture, migrations, and more. Do you want a dedicated server for free for the next four months? Just visit talkpython.fm/linode. I want to dig into the scheduler 'cause there's a lot of interesting stuff going on there. One thing that I think maybe might be worth digging into a little bit for folks is this demo that you did at PyCon around working with basically, you had this question you want answered. It's like, what is the average or median tip that a person gives to a taxicab driver plotted by time of day? Like Friday evening, Monday morning, and so on. And so, there's a huge data set, maybe you tell people where to get it, but it's the taxicab data set for some year, and it's got 20 gigs on disk, and 60 gigs in memory. Now, I have a MacBook Pro here that's like all knobs to 11, and that would barely hold half of that data, right? So, even on a high end computer, it's still, like I need something more to run this 'cause I can't load that into memory.

18:34 Matthew Rocklin: Well, to be fair, actually, that's not, by big data standards, it's actually pretty small. People use Dask on like...

18:39 Michael Kennedy: No, yeah. I know.

18:40 Matthew Rocklin: 20 terabytes, rather than 20 gigabytes pretty often. But you're right that it's awkward to run on a single machine.

18:44 Michael Kennedy: Yeah, you can't do a lot of paging.

18:46 Matthew Rocklin: Yeah, it's a pretty easy data set, though, to run in a conference demo just 'cause it's pretty interpretable.

18:51 Michael Kennedy: Everyone can immediately understand average tip by time of day, and day of the week.

18:55 Matthew Rocklin: Yeah, so it's a good demo. I'll maybe talk through it briefly here, but I encourage you all to look at the PyCon. It's probably PyCon 2017, I think, for Dask.

19:02 Michael Kennedy: Yeah, and I'll put a link to the talk in the show notes.

19:04 Matthew Rocklin: Yeah, so in that example, we had a bunch of CSV files living on Amazon S3 or Google Cloud storage. I don't remember which.

19:11 Michael Kennedy: Yeah, Google Cloud storage.

19:13 Matthew Rocklin: You could use the Pandas read CSV function to read a bit of those files, but not the entire thing. But if you try to read everything, it would fill up RAM, and Pandas would just halt, and that's kind of the problem we run into in PyData today when we hit big data. Everything's great until you run out of RAM, then you're kind of, you're out of luck. So, we sort of switched out the import of Pandas for Dask DataFrame, called the Dask DataFrame read CSV function, which has all of the same keyword arguments as Pandas read CSV, which, if you know Pandas, are numerous. And instead of reading one file, we sort of give it a glob string. We'd ask it to read all the files. Yeah, we did normal Pandas API stuff. Like we filtered out some bad rows, or some free rides in New York City that we had to remove. We made a new column, the tip fraction. We then used Pandas datetime functionality, I think, to group by the hour of the day and the day of the week. And if you know Pandas, that API is pretty comfortable to you.
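
A hedged reconstruction of the kind of code in that demo; the bucket path and column names are stand-ins, not the exact demo code:

```python
import dask.dataframe as dd  # was: import pandas as pd

# A glob string reads many CSV files as one logical DataFrame;
# read_csv takes the same (numerous) keyword arguments as Pandas
df = dd.read_csv(
    "gcs://bucket/nyc-taxi-2015-*.csv",  # hypothetical path
    parse_dates=["pickup_datetime"],     # hypothetical column name
)

df = df[df.fare_amount > 0]                          # remove the free rides
df["tip_fraction"] = df.tip_amount / df.fare_amount  # new column

# Group by day of week and hour of day using the datetime accessors
tips = df.groupby(
    [df.pickup_datetime.dt.dayofweek, df.pickup_datetime.dt.hour]
).tip_fraction.mean()

result = tips.compute()  # returns an ordinary Pandas Series
```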

20:03 Michael Kennedy: Just the data's too big.

20:04 Matthew Rocklin: Yeah, it was the same experience. Pandas, like, we switched out the import, we put a star in our file name to read all the CSV files, and we might've asked for a cluster. So, we might've asked something like Kubernetes, or Yarn, or Slurm, or some other cluster manager to give us a bunch of Dask workers. That one's probably using Kubernetes 'cause it's on Google. Those showed up, so Google signed off on giving us virtual machines. We deployed Dask on them, and then yeah, then we hit compute, and then all of the machines in our cluster went ahead, and they probably called little Pandas read CSV functions on different ranges in those CSV files coming off of Google Cloud storage. They then did different group operations. They did some filtering operations on their own. They're probably somewhat related to the ones that we asked them to do, but a little bit different. They had to communicate between each other. So, those machines had different Pandas DataFrames. They had to share results back and forth between each other. If one of those machines went down, Dask had to ask for a new one, or had to recover the data that that machine had. And at the very end, we got back a Pandas DataFrame and we plotted it. I think the punchline for that one is that the tips for three or four a.m. were absurdly high. It's like 40%, or something like that.

21:09 Michael Kennedy: Yeah, and did you have any insight to why that is? Do people know?

21:12 Matthew Rocklin: I mean, the hypothesis is that it's people coming home from bars. I was actually giving this same demo at a conference once, and someone was like, oh, filter out all the short rides. I'll bet it goes away. And so, we did that, and the spike does go away. And so, it's probably just people getting $5 cab rides from the bar, and we were looking at tip fraction, which is sort of going to have a lot of noise.

21:30 Michael Kennedy: Right.

21:31 Matthew Rocklin: But it was fun 'cause this guy said, hey, try this thing out, and he knew how to do it. He'd never seen Dask before. He'd just known Pandas before. Gave him the keyboard, he typed in the Pandas commands, and it just worked, we got new insight out of it. And so, that sort of experience shows us that a tool like Dask DataFrame gives us the ability to do Pandas operations, to extend our Pandas knowledge, but across a cluster. That data set could've been 20 terabytes, and the same thing would've worked just fine. You would've wanted more machines. It would've been fine.

21:55 Michael Kennedy: Wait, or more time.

21:56 Matthew Rocklin: Or more time. Yeah. So, Dask also works well on a single machine, just by using small RAM in intelligent ways.

22:01 Michael Kennedy: Oh, it does? Okay, so it can be more efficient. You could give it a ton of data, but it can realize I can't take it all in one big bite. I got to eat the elephant, sort of thing?

22:09 Matthew Rocklin: Yeah, so if you can run through your computation with small RAM, Dask can usually find a way to do so. And so, Dask can be used on clusters, and that's the flashy thing to show to people at a PyCon talk, but most Dask users are using just their laptops. So, for a long time, Dask actually just ran on single machines, for like the first year of its life. We built out all the parallel algorithms, but didn't have a distributed scheduler. And there were people who just wanted to sort of stream through their data, but with Pandas or NumPy APIs. So, they were dealing with 100 gigabyte data sets on their laptops pretty comfortably.

22:38 Michael Kennedy: Oh, that's interesting. I didn't realize that, that was a thing it could do. That's cool. Could it be used in the extreme? Like if I've got a really small system with not much RAM at all. I'm thinking like little IoT things. Could it let me process not big data, but normal data, on like a tiny device?

22:55 Matthew Rocklin: Sure, depending on how much data or how much you're looking for. Sure.

22:59 Michael Kennedy: Yeah, okay. Interesting. That's a super cool feature. Now, one of the first things that struck me when you ran this, like obviously, you've got all these machines. I think you had 32 systems with two cores each when you ran the demo, and it went really quickly, which is cool. I expected it to come up with a graph, and you could say run it, we can even run it locally here, in 10 seconds, or something. But instead what I saw when you did that was there's like a beautiful graph that's alive and animated, describing the life of your cluster, and the computation, and different colors meaning different things. Can you describe that diagnostic style page so people know.

23:36 Matthew Rocklin: Yeah, sure. It's hard to describe a picture, much less, an interactive dashboard. But yeah, so maybe I'll describe a little bit of the architecture of Dask first. That'll make this a bit more clear. So, Dask has a bunch of workers on different machines that are doing work, holding onto data, communicating with each other. Then there's a centralized scheduler, which is keeping track of all those workers. This is like the foreman at a job site, maybe: you have all these workers, and it's telling the workers what to do, telling them to share with each other, et cetera. And that foreman, that scheduler, has a ton of information about what all the workers are doing, all the tasks that are being run, the memory usage, the network communications, file descriptors that are open on every machine, et cetera, and in trying to benchmark, and profile, and debug a lot of Dask work, we found we wanted access to all that information, and so we built this really nice dashboard. So, yeah, it tells you all that information. You saw one page at that demo, but there are like, 10 or 20 different pages of different sets of information. So, there's just a ton of state in this scheduler, and we wanted to expose all that state visually. It was originally designed for the developers. It has been critical for users because understanding performance is really hard in a distributed system, right? Our understanding of what is fast and slow on a single machine gets totally turned around when you start adding concurrent access to disk, and network communications, and everything. So yeah, it's a live dashboard. We use Bokeh. It's a Bokeh server application to build it. So, it's updating at something like 100 millisecond frame rates, so it looks live to the human eye. Yeah, it's showing you progress, it's showing you every task that gets done, and when it gets done, and where it gets done. It's showing you memory use on all different machines. You can dive in. You can get line-by-line statistical profiling information on all of your code.

25:17 Michael Kennedy: Will that coordinate across machines? So like, this function is taking this long, but actually it's summing up the profiling across all the cluster.

25:24 Matthew Rocklin: Yeah, so the scheduler is just aware of everything, and the workers are all gathering tons of information. There's tons of diagnostics on every machine. Yeah, so maybe a way to say this is that in order to make good decisions about where to schedule certain tasks, Dask needs to have a bunch of information about how long previous tasks have taken, the current status everywhere. And so, we've gotten very, very good at collecting telemetry, and keeping it in a nice, indexed form on the scheduler. And so, now that we have all that information, we can just present it to people visually.

25:50 Michael Kennedy: Might as well share.

25:51 Matthew Rocklin: Yeah, I find myself sometimes using Dask computations just 'cause I want the dashboard. I don't want parallelism. I just want the dashboard. It's just the best profiler I have.

25:59 Michael Kennedy: It is a good insight into what's happening there. That's super cool. I love that it's live, so it seems really nice. I was surprised at how polished that part of it was.

26:08 Matthew Rocklin: That is honestly just due to Bokeh. People, if they like dashboards, go look at Bokeh server. I didn't know any JavaScript. I didn't know about Bokeh before. It was super easy to use. I integrated it into my application. It was clean. It was nice. And it looks polished, as you said.

26:21 Michael Kennedy: That's super cool. Now, you did say that you can run Dask locally, and there's a lot of advantages to doing that, as you already described. But it's super easy, right? Maybe just describe like it's just a couple of lines to set up a little, mini-cluster running on your system, right?

26:36 Matthew Rocklin: Yeah. So, it's even easier than that, but yes. So, first of all, you would pip install dask, or conda install dask. It's a Python library, pure Python. Then, if you want sort of the local cluster experience with the dashboard and everything, you say from dask.distributed import Client, and then you create a client with no arguments. It would create for you, like, a few workers on your local machine, and a scheduler locally, and then all the rest of your Dask work would just run on that. You also don't need to do that. You can just import Dask, and there's a thread pool waiting for you if you don't set anything up. So, a lot of people, again, don't use the distributed part of Dask. They just use Dask locally. And Dask can operate just as a normal library. Spins up a thread pool, runs your computations there, and you're done. So, there's no actual setup you need to do.
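
In code, that local setup is roughly this minimal sketch:

```python
# Zero setup: importing Dask alone gives you a default thread pool
import dask.dataframe as dd

# Optional: the local mini-cluster experience, with the dashboard
from dask.distributed import Client

client = Client()  # no arguments: a few local workers plus a scheduler
print(client.dashboard_link)  # URL of the live diagnostic dashboard
```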

27:18 Michael Kennedy: Right.

27:19 Matthew Rocklin: There was a great tweet maybe a few weeks ago. Someone was using the xarray library. Xarray is a popular library for array computing in, like, the geoscience world. So, someone was recommending xarray and Dask on Twitter, and someone else says, yeah, I've heard about xarray. Use it all the time. It's great. Never heard of Dask, though. Don't know what you're talking about. And it was hilarious 'cause xarray uses Dask under the hood. The guy had just never realized that he was using Dask. So, that's a real success point.

27:43 Michael Kennedy: Yeah. It doesn't have to poke its head up and make it really obvious that this thing is part of it, right? That's great.

27:50 Matthew Rocklin: Dask had disappeared into being just infrastructure.

27:52 Michael Kennedy: Yeah. So, another thing that I thought was a super cool feature about this is you were talking about some kind of, working with a really big tree of data, right? And you start processing it, and it's going okay, but it's going somewhat slowly, and it's a live talk with not that much time. So, you're like, alright, well, let's add some more workers to it, and while the computation is running you add like, 10 more workers, or something, and it's just on your cool Bokeh graph and everything, you just see it ramp up and take off. So, you can dynamically add and remove these workers to the cluster, right?

28:27 Matthew Rocklin: Yeah. Definitely. So, Dask does all the things you'd expect from any modern distributed system. Handles resilience, handles dynamically adding things. You can deploy it on any common job scheduler. So, it does sort of all the things that you'd expect from something like Spark, you mentioned before, or TensorFlow, or Flink, or any of the sort of popular distributed systems today, but it does sort of just those things. So, you can sort of sprinkle in that magical dust into existing projects, and that's really the benefit. So yeah, we focus a lot on how do we add clusters efficiently, or how do we add workers to a cluster dynamically?

28:57 Michael Kennedy: Yeah, it seems really useful. Just as you're doing more work, you just throw it in, and I'm guessing these clusters can be long-lived things, and different jobs can come and go. Are they typically set up for research groups or other people in longer term situations, or are they sprung up out of Kubernetes, like burst into the Cloud, and then go away?

29:17 Matthew Rocklin: I want to say both, actually. So, what we'll often see is that the big institution, which will house something like Kubernetes, or Yarn, or Slurm, and they'll give their users access to spin up these clusters. So, an analyst will sit down at their desk, they'll open up a Jupyter notebook, they'll import Dask, use one of the Dask deployment solutions to ask for a Kubernetes Dask cluster, and then, like, their own little scheduler and workers will be created on that cluster. They'll do some work for an hour or so, and then it'll go away, and that cluster will go away. What IT really likes about this sort of adding and removing of workers is that we can add or remove workers during their computation. So, previously, they would go onto the cluster, they'd ask for, hey, I want 100 machines. They'd use those machines for like a minute while they load their data, and then they would stare at a plot for an hour. And so, it's not so much the ability to add new workers. It is the ability to remove those workers when they're not being used. This gives really good utilization. You can support a lot more people who have really bursty workloads. So, a lot of data scientists, a lot of scientists, their analysis is very bursty. They do a bunch of work, and then they stare at their screen for half an hour. They do a bunch more work, and they stare at their screen for half an hour. And so, that's where dynamism is really helpful in those sorts of use cases.
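
A sketch of that dynamic scaling on a local cluster; the same scale and adapt calls apply across the Kubernetes, Yarn, and job-queue deployment backends:

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2)  # start small
client = Client(cluster)

# Add workers by hand while a computation is running...
cluster.scale(10)

# ...or let the cluster grow and shrink with the workload, releasing
# idle workers while the analyst stares at a plot
cluster.adapt(minimum=0, maximum=10)
```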

30:22 Michael Kennedy: Yeah, I can imagine. That's super cool. Now, we talked about the Dask DataFrame and Dask array, and those being basically, exact parallels of the NumPy and Pandas version, but another thing that you can do is you can work in a more general purpose way with this thing called a delayed and a compute, right? Which will let you take more arbitrary Python code and fan it out in this way, right?

30:48 Matthew Rocklin: Yeah. So, let me first add a disclaimer. Dask DataFrame and Dask array are not exactly like NumPy and Pandas. They're just very close. Close enough, you'll be fine.

30:55 Michael Kennedy: Similar. Okay.

30:56 Matthew Rocklin: If I don't say that, people will yell at me afterwards. Yeah, so, the creation of Dask delayed was actually really interesting. So, when we first built Dask, we were aiming for something that was Spark-like. Oh, we're going to build this very specific system to parallelize NumPy, and parallelize Pandas, and that'll handle everything. Everyone else will build on top of those things. That did not end up being the case. Some people definitely wanted parallel Pandas and parallel NumPy, but a lot of people said, well, yeah, I use NumPy and Pandas, but I don't want a big NumPy and Pandas. I'm using something that's really different, really custom, really bespoke, and they wanted something else. As though we had built like a really fast car, and they said, great, that's a beautiful car. I actually just want the engine 'cause I'm building a rocket ship, or I'm building a submarine, or I'm building, like, a megatronic cat, or something. It turns out that in the Python space, people build really diverse things. Python developers are just on the forefront. They're doing new things. They're not solving the same, classic, business intelligence problem over and over again. They're building things that are new. And so, Dask delayed was, like, a really low level API where they could build out their own task graphs, their own parallel computations, sort of one chunk at a time, using normal Python for loops. So, it allows you to build really custom, really complex systems, with normal looking Python code, but still get all the parallelism that Dask can provide. So, you still get the dashboard, you still have that resilience. You can build out your own system in that way.
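
A minimal sketch of that dask.delayed style, with made-up stand-in functions:

```python
from dask import delayed

@delayed
def load(filename):
    return len(filename)  # stand-in for real I/O

@delayed
def process(x):
    return x * 2          # stand-in for real work

@delayed
def summarize(xs):
    return sum(xs)

filenames = ["a.csv", "b.csv", "c.csv"]  # hypothetical inputs

# Normal Python for-loops build a custom task graph, one chunk at a time
partials = [process(load(f)) for f in filenames]
total = summarize(partials)

print(total.compute())  # only now does the whole graph run in parallel
```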

32:15 Michael Kennedy: Yeah, you even have the graph and the dashboard, as you said, which is really cool. I love it. So, when I saw that I was thinking, well, this is really interesting, and I see how that lets me solve a more general numerical problem. Could I solve other things with it, right? Like suppose I have some interesting website that needs better performance, or I've got something that has really nothing at all to do with computation. Maybe it talks to databases or other things, right? Could I put a Dask backend on a high performance website and get it to do bizarre cluster computations to go faster, or is it really more focused on just data science?

32:53 Matthew Rocklin: Yeah, so people would use Dask in that setting in the same way they might use a multiprocessing pool, or a thread pool, or Celery, maybe, but they would scale it out across a cluster. So, Dask satisfies the concurrent futures interface, if you use concurrent futures. It also does async await stuff, so people definitely integrate it into systems like what you're talking about. Like at UC Berkeley, there's a project called Cesium that does something like that. I gave a talk at PyGotham like a year or two ago that also talked about that. So yeah, people definitely integrate Dask into more webby kinds of systems. It's not accelerating the web server. It's just accelerating the sort of computations that the web server is hosting. It's useful 'cause it has like, millisecond latencies, and scales out really nicely, and is very Pythonic. It fits all the Python APIs you'd expect.
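
A small sketch of that concurrent.futures-style usage; the function and inputs are made up:

```python
from dask.distributed import Client

client = Client()  # local by default; can also point at a scheduler address

def slow_square(x):
    return x ** 2  # stand-in for real work

# Same shape as concurrent.futures, but the work can span a cluster
futures = [client.submit(slow_square, i) for i in range(100)]
results = client.gather(futures)

# For async web frameworks, Client(asynchronous=True) gives futures
# that can be awaited inside a coroutine.
```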

33:37 Michael Kennedy: Yeah, and the fact that it implements async and await, that's super cool. If you use it with something like Sanic, or Quart, or one of these async-enabled web frameworks, it's just a line: await some execution, and the scalability spreads, maybe to some Dask Kubernetes cluster, or something. That's pretty awesome. You talked about the latency, and I was pretty impressed with how those were working. You were quoting some really good times. And actually, you said that's one of the reasons you created your own scheduler and you didn't just try to piggyback on Spark, is 'cause you needed extremely low latency. It's kind of chatty, right? If I'm going to take the dot product of a thing that's all over the place, they've got to coordinate quite a bit back and forth, right? That can't be too...

34:18 Matthew Rocklin: Yeah, so Spark actually internally, has okay latencies today. It didn't at the time, but they're also at sort of the millisecond range. Where Spark falls down is in complexity. They don't handle sort of arbitrary task graphs. Sort of the other side of that is systems like Airflow, Luigi and Celery, which can do more arbitrary task graphs. You can say download this particular file, and if it works, then email this person, or if it doesn't work, parse this other thing. You can build all these dependencies into Airflow, Luigi, Celery. Where those things fall down is that they don't have good inter-task communication. So, you can't easily say, about these two different Pandas DataFrames on different machines, have them talk to each other. They don't handle that. They also have long latencies, in sort of the like, tens to hundreds of milliseconds.

34:58 Michael Kennedy: Right, which is not that long, but when you amplify it across a thousand times, then all of a sudden, it's huge.

35:03 Matthew Rocklin: Yeah, or when your Pandas computations take 20 milliseconds, or one millisecond, and you want to do millions of them, then suddenly latency does become a big issue. So, we needed sort of the scalability and performance of Spark, but with the sort of flexibility of Airflow, Luigi, Celery. That's sort of where Dask is sort of that mix between those two camps.

35:24 Michael Kennedy: This portion of Talk Python to Me is sponsored by Backlog from Nulab. Developers know the importance of organization and efficiency when it comes to collaborating on a team. And Backlog is the perfect collaborative task management software for your team. With Backlog, you can create tasks, track bugs, make changes, give feedback, and have team conversations right next to your code. You track progress with features like Gantt and Burndown Charts. You document your processes right alongside your wikis. You could integrate with the tools you use every day like Slack, Jira, and Google Sheets. You could automatically register issues from email, or web form submissions. Take your work on the go using their top-rated mobile apps available on Android and iOS. Try Backlog for your team for free for 30 days using our special URL at talkpython.fm/backlog. That's talkpython.fm/backlog. I was really impressed with the low latency that you guys were able to achieve, and so maybe that's a good way to segue over to some other internal implementations. You talked about using Gevent, Tornado. What are some of the internal pieces that are at work there?

36:28 Matthew Rocklin: Yeah, so on the distributed system, we use Tornado for concurrency. This is 'cause we were supporting both Python 2 and 3 at the time, though it's going to become more asyncio-friendly. That's sort of becoming more common. So, Tornado for concurrency. Also, Tornado a little bit for TCP communications, although we've had to improve Tornado's bandwidth for high performance computers.

36:46 Michael Kennedy: That's cool. So, has Tornado gotten a little bit better because you guys had been pushing it to the limits?

36:51 Matthew Rocklin: Yes, definitely.

36:51 Michael Kennedy: That's great.

36:52 Matthew Rocklin: Yeah, Dask touches a ton of the ecosystem, and so Dask developers tend to get pretty comfortable with lots of other projects. So, Antoine Pitrou used to work on Dask a lot. He did a lot of the networking stuff in Python, and he also worked on Tornado and did lots of things, so he was handling that. So yeah, Tornado for concurrency and TCP communication. So, all the Dask workers are TCP servers that connects to each other, so it's kind of like a peer-to-peer application. A bunch of just raw, basic Python data structures for internal state, just 'cause those are the fastest we can find. The Python dictionary is actually a pretty efficient data structure, so that's how we handle most of the internal logic. And then for computation, if you're using Dask DataFrame, we're using Pandas for computation. If you're using Dask array, we're using NumPy for computation, so we didn't have to rebuild those things. And then compression libraries, encryption libraries, Kubernetes libraries, they're sort of a broad set of things we end up having to touch.

37:43 Michael Kennedy: I can imagine. Is the cluster doing HTTPS, or other types of encrypted communication, and stuff like that between the workers and the supervisor?

37:51 Matthew Rocklin: Yeah. So, not HTTPS. HTTP is sort of too slow for us. We tend to use TCP. Dask does support TLS out of the box, or SSL, as you might know it. But the comms are also pluggable, so we're actually looking right now at a high performance networking library called UCX to do InfiniBand, and other stuff. Security is standard in that way, assuming you're happy with TLS.

38:09 Michael Kennedy: Yeah, yeah, I suspect most people are, until the quantum computers break it, and then we have a problem.

38:14 Matthew Rocklin: Sure, comms are extensible. We can add something else.

38:17 Michael Kennedy: A different algorithm then. Yeah, I suspect we're going to be scrambling to solve the banking problem and e-commerce problem faster first, though. That'll be where the real fear and panic will be. But luckily, that's not here yet. So, it sounds like you guys are really pushing the limits of so much of these technologies, which sounds like it must be a lot of fun to work on.

38:39 Matthew Rocklin: Yeah, it's been a wild ride.

38:40 Michael Kennedy: I can imagine.

38:41 Matthew Rocklin: Yeah, it's actually been interesting also, building like conventions and protocols among the community. I think Dask is at an interesting spot because we do talk to everybody. We talk to most scientific disciplines. We talk to banking, we talk to Fortune 500 companies, and we see they all have kind of the same problems. So, I think recently I put together a whole bunch of people on file formats. Not because I care deeply about file formats, it's 'cause I had talked to all of them. We also talked to all the different library developers, and it's really interesting to see, okay, Tornado 4.5. Are we ready to drop it or not, for example? And we talked to the Bokeh developers, the Jupyter developers, the Dask developers, we talked to the Tornado developers. It's a well-integrated project into the ecosystem. I wanted to sort of, like, hang there for a moment and say that because again, Dask was designed to affect an ecosystem, move an ecosystem forward, it actually sort of has tendrils into lots of different community groups, and that's been a really fun thing to see. Sort of not on a technological side, but just on a social side, it's been very sort of satisfying and very interesting to see all these different groups work together.

39:40 Michael Kennedy: Yeah, I can imagine. You get to work with not just the library developers on all the different parts of the ecosystem, but also, on the other side, the scientists, and the data scientists, and people solving really cool problems, right?

39:51 Matthew Rocklin: You'd be amazed how often climate scientists and quantitative traders have exactly the same problems.

39:57 Michael Kennedy: That's something I noticed doing professional training for 10 years. I taught a class at Edwards Air Force Base, I taught one at a hedge fund in New York, and I taught one at a startup in San Francisco, and like 80%, that's the same. It's all the same, right? It's good computation, it's algorithmic thinking, it's like distributed systems, and then there's the special sauce that probably I don't even understand anyway, right? But it's super interesting to realize that. How similar these different disciplines can be.

40:25 Matthew Rocklin: Yeah, and from an open source standpoint, like, this is new 'cause previously, all those domains had a very different technology stack. Now they're all using the same technology stack, and so as a person who's paid to think about software, but also cares about advancing humanity, it's like a really nice opportunity. We can take all this money from these quant traders, build the technology, and then apply that technology to genomics, or something.

40:48 Michael Kennedy: Right, or climate science, or something that really needs to get it right quickly.

40:52 Matthew Rocklin: That's sort of a nice opportunity that we have right now, especially in Python, I think. That we can hire a bunch of people to do a bunch of good work that advances humanity. It also makes people rich playing the stock market, but we can let that happen.

41:08 Michael Kennedy: Hey, yeah. If they're helping drive it forward, you know? It might be okay.

41:10 Matthew Rocklin: Sure.

41:11 Michael Kennedy: For the good, maybe. That's debatable, but probably. It sounds like such a cool project. Maybe let's talk a little bit more about the human side of it. Obviously, you work a lot on it, as you said, but who else? Are there a lot of people that maintain Dask, or what's the story there?

41:26 Matthew Rocklin: I've worked on it for awhile. So, I started within Continuum, or what is now called Anaconda, and there's a bunch of other engineers within Anaconda who also work on Dask. They sort of spend half time on something else, and half time on Dask.

41:37 Michael Kennedy: Are they still?

41:38 Matthew Rocklin: Definitely. People like Jim Crist, Tom Augspurger, Martin Durant, and others throughout the company. And so yeah, they'll maintain various parts of the project, or push it forward. So, Jim recently has been working on making Dask easier to deploy on Hadoop/Spark systems using Yarn. That's something that is important for Anaconda's business, and it's also just generally useful for people, so there's this sort of nice overlap between helping out the open source project, and also getting paid to do the work. Outside of Anaconda, so I'm now at NVIDIA, there's a growing group at NVIDIA who's also working on Dask, which is pretty exciting. Also, there's a bunch of other people who are working in other research labs, or other companies, who need to use Dask for their work, and needed to build out the Kubernetes connection, or something like that. And they maintain that both because it's good for their company, and because they just sort of enjoy doing that. So, I would say in sort of the periphery of Dask, most of the work is happening in other companies, with other people, and there's hundreds of people who do work on Dask in any given month or two.

42:31 Michael Kennedy: There's probably a lot of people that make one or two little contributions to fix their edge case, but it all adds up, right?

42:37 Matthew Rocklin: There's that. There's also maybe, like, 20 or 30 people who are maintaining, not just adding a fix, but maintaining some Dask-foo package. So, there's a researcher, Alistair Miles, who works in the Oxford area on genomics processing, and he has a Dask-enabled genomics suite called scikit-allel, and so that's his job, right? So, he's working on that day-to-day. The scikit-learn developers maintain the Dask scikit-learn connection through Joblib, and one of the main scikit-learn contributors has commit rights over all of Dask. So, as Dask expands out into the rest of the ecosystem, it's sort of being adopted by many, many smaller groups who are in other institutions.

43:19 Michael Kennedy: So, it sounds to me like there might be some decent opportunities for folks to contribute peripherally to Dask in the sense that maybe there's somebody that's doing some kind of research in their library, or something like it, would really benefit from having a Dask level version of that. So, if people are looking to contribute to open source and the data science space, are you guys looking for that kind of stuff?

43:42 Matthew Rocklin: Yeah, definitely. I mean, we don't have to look these days. People are just doing it. What I would say is that, in whatever domain you're in, there's probably some group that's already thinking about how to use Dask in that space, so jump onto that group. There's a lot of sort of, like, low-hanging fruit right now. Pick your favorite PyData library. Think about how it could scale, and there's probably some small effort around that currently. And if you want to get into open source, it's a great place to be. You're right on the very exciting edge of big data and scalability. Things are pretty easy today. We've solved a lot of the major issues with sort of the core projects like NumPy, and Pandas, and scikit-learn. So yeah, there's a ton of pretty exciting work that's just starting to pick up now. If you're interested in open source, it's a great place to start.

44:25 Michael Kennedy: It seems like there's so many data science libraries. There's probably a lot of opportunity to bring some of them into the fold. Pretty low-hanging fruit there. So, let me ask you some kind of, how wild does it get, type of questions. One is, what's the biggest cluster you've ever seen it run on?

44:43 Matthew Rocklin: The biggest I've had...

44:44 Michael Kennedy: Or heard about it running on?

44:45 Matthew Rocklin: Yeah. I would say like 1,000 machines is probably the biggest that I've seen. That's probably within an order of magnitude of the limits today, but those machines are big machines. They have 30 cores on them. That's also just mostly for benchmarking. Most problems in data science aren't that big.

45:02 Michael Kennedy: Right.

45:03 Matthew Rocklin: Thousands is sort of the...

45:04 Michael Kennedy: That's awesome. I was thinking of places like CERN that have their big grid computing stuff, and things like that. They could potentially spin up something pretty incredible.

45:12 Matthew Rocklin: Dask gets used on a number of supercomputers today, but it's very, very rarely used on, like, the entire supercomputer. There's really no system outside of MPI that I would expect to run on the full, like, CERN federated grid. So, most of the sort of data oriented libraries end up sort of maxing out around 1,000 machines. I mean, hypothetically, you could go higher, but the overheads will be enough to kill you.

45:36 Michael Kennedy: Does it have sort of a, how many hands do you have to shake for everyone in a group to have shaken everyone else's hand, that N-squared type of issue? As you add more, does it get harder and harder to add the next worker to the cluster, or is it okay?

45:49 Matthew Rocklin: No, it's fine. Everything's linear in terms of the number of workers. It's not like a mesh grid. Workers talk to the scheduler to find out who they have to talk to, and then they'll talk peer-to-peer to those workers, but in any given computation, you only talk to five to 10 peers at once. It's fine.
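To make that topology concrete, here is a minimal local sketch using the public dask.distributed API; the worker counts are arbitrary, and the same pattern works against a remote scheduler:

```python
# One scheduler, N workers, one client -- the topology described above.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

# The client sends the task graph to the scheduler; workers learn from
# the scheduler which peers hold their inputs and transfer data
# peer-to-peer, so no single link sees all the traffic.
futures = client.map(lambda x: x ** 2, range(100))
total = client.submit(sum, futures)
print(total.result())  # 328350

cluster.scale(8)  # adding workers is a linear cost, not quadratic
```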

46:04 Michael Kennedy: Yeah, interesting. Okay, the other one is, what's the coolest thing that you've seen computed with Dask, or built with Dask?

46:09 Matthew Rocklin: I'm going to say, well there's things I can't talk about. But...

46:14 Michael Kennedy: The most amazing ones are super secret.

46:16 Matthew Rocklin: I'm going to point people to the Pangeo project. This is actually the subject of my most recent PyCon talk, in 2018, talking about other people's work. So, Pangeo's pretty cool. Pangeo's a bunch of Earth scientists who are doing climate change, meteorology, hydrology, figuring out the Earth, and they have these 100-terabyte to petabyte arrays that they're trying to work with. They haven't quite gotten to the petabyte scale yet. Yeah, so they usually set up something like JupyterHub on either a supercomputer or on the Cloud in Kubernetes. They then analyze tens or hundreds of terabytes of data with xarray, with Dask array under the hood, and get out nice plots that show that, indeed, sea levels are rising. I think that's pretty cool in that it's a good community of people. So, there's a bunch of different groups: NASA, universities, USGS, anyone who's looking at large amounts of data, companies like Planet Labs, et cetera. So, it's a good community. They're figuring out how to store one copy of the data in the Cloud, and then have everyone compute on that using cool technologies like Kubernetes. They're inventing cool technologies, like how do we store multidimensional arrays in the Cloud? So yeah, that's maybe a good effort, not necessarily a cool computation. From a Dask perspective, it's actually pretty vanilla, but it's cool to see a scientific domain advance in a significant way using some of these newer technologies like JupyterHub, and Dask, and Kubernetes, and I'd like to see that repeated. I think we could do the same thing with medical imaging and with genomics. So, that's maybe a cool example. It's not so much a technical situation as a community situation.
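The shape of that workflow in code looks roughly like this; the Zarr store path and variable names below are hypothetical, but xr.open_zarr and the lazy Dask-backed arrays it returns are the real mechanics:

```python
# The typical Pangeo pattern: open a chunked, cloud-hosted dataset as
# lazy Dask-backed arrays, reduce it, and only then pull the result.
import xarray as xr

# Hypothetical Zarr store; gs:// access assumes gcsfs is installed.
ds = xr.open_zarr("gs://hypothetical-bucket/sea-surface-height.zarr")

# Lazy computation: annual mean sea level per grid cell. Nothing is
# read from storage until .compute() is called.
trend = ds["sea_level"].groupby("time.year").mean(dim="time")
result = trend.compute()  # runs across the Dask cluster's workers
```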

47:43 Michael Kennedy: Yeah, well it sounds like a cool example that could be just replicated across verticals.

47:48 Matthew Rocklin: Definitely, we're starting to see that, which is really exciting.

47:50 Michael Kennedy: Okay, those are great. So, when I was looking into Dask, I ran across some other libraries that are kind of like you described, Dask-foo. There was Dask CUDA, Dask-ML, Dask Kubernetes. Do you want to just talk real briefly about some of those?

48:06 Matthew Rocklin: I'll go in reverse order. So, Dask Kubernetes is a library that helps you deploy Dask clusters on Kubernetes.

48:12 Michael Kennedy: So, you point it at like a Kubernetes cluster, and you say, go forth and use that, and it can just create the pods and run 'em.

48:20 Matthew Rocklin: Definitely, so it creates pods for all the workers, and it manages those pods. There are little buttons: you ask it to scale up or scale down, and it'll manage all that stuff for you using the Kubernetes API. It presents users with a super simple interface. They can open up a Jupyter notebook and just sort of play around, so that's what that does. And it has peers like Dask-Yarn for Hadoop/Spark clusters, and Dask-jobqueue for high performance computing schedulers like Slurm, PBS, SGE, LSF. So, that's Dask Kubernetes; that's on the deployment side. Dask-ML: there are a lot of interesting ways of using Dask with machine learning algorithms, things like GLMs, generalized linear models, XGBoost, scikit-learn, which were all different technical approaches, and Dask-ML is almost like a namespace library that collects all of them together, but using the scikit-learn-style APIs. So, if you like scikit-learn and you want to scale that across a cluster, either with simple things like hyper-parameter searches, or with large-data things like linear regression or XGBoost, just look at Dask-ML. It'll help you do that. Dask CUDA is a new thing I've been playing with since moving to NVIDIA. People have been using Dask with GPUs for a long time, and they've developed a bunch of weird, wonky scripts to set things up just the right way; there were a ton of these at NVIDIA when I showed up. Dask CUDA is just a few nice, convenient, Pythonic ways to set up Dask to run on either one machine with many GPUs, or on a cluster of machines with lots of GPUs, and to make that easy. So, it's kind of like Dask Kubernetes in that it's about deployment, but it's more focused on deploying around GPUs, and there are some intricacies to doing that well.
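Hedged sketches of what each of those looks like at the top level; the pod spec file and parameter grid are hypothetical, and the GPU lines are commented out because they assume a machine with GPUs:

```python
# Dask Kubernetes: create and scale worker pods via the Kubernetes API.
from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_yaml("worker-spec.yaml")  # hypothetical pod spec
cluster.scale(10)  # ask Kubernetes for 10 worker pods

# Dask-ML: scikit-learn-style APIs that fan work out across the cluster.
from dask.distributed import Client
from dask_ml.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client(cluster)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]})
# search.fit(X, y)  # each parameter combination becomes a cluster task

# Dask CUDA: one worker per local GPU, with sensible CUDA settings.
# from dask_cuda import LocalCUDACluster
# gpu_cluster = LocalCUDACluster()
# gpu_client = Client(gpu_cluster)
```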

49:51 Michael Kennedy: I imagine you can scale like crazy if you put the right number of GPUs on a bunch of machines and let it go.

49:59 Matthew Rocklin: No, GPUs are actually pretty game-changing. It's fun.

50:02 Michael Kennedy: Yeah. It's kind of near the limits of human understanding to think about how much computation per second happens on a GPU.

50:10 Matthew Rocklin: I think CPUs are also well beyond the limits of human understanding.

50:15 Michael Kennedy: But this is like the galaxy versus the universe, you know?

50:20 Matthew Rocklin: Sure.

50:20 Michael Kennedy: They're both outside of our understanding.

50:22 Matthew Rocklin: I should mention that I work for NVIDIA now, so please don't trust anything I say, but yeah, GPUs are, like, fundamentally a game-changer in terms of computation. You can get a couple orders of magnitude performance increase out of the same power, which, from a science perspective, has the opportunity to unlock some things we couldn't do before. GPUs are also notoriously hard to program, which is why most scientists don't use them. And so, that's part of why I work at NVIDIA today. They're building out a GPU data science stack, and they're using Dask to parallelize around it. And so, it's fun both to grow Dask with NVIDIA and to have a large company behind it. But also, from a science perspective, I think we could do some good work to improve accessibility to different kinds of hardware like GPUs. That might dramatically change the kinds of problems that we can solve, which is exciting in different ways.
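One way that "Dask parallelizing around GPUs" shows up in practice, as a hedged sketch: a Dask array whose chunks are CuPy (GPU) arrays rather than NumPy ones, so the same array API runs on GPU memory. The sizes are arbitrary, and this assumes a machine with a CUDA-capable GPU:

```python
# Back a Dask array's chunks with CuPy arrays instead of NumPy arrays.
# The downstream code is unchanged -- only the chunk type differs.
import cupy
import dask.array as da

# Build chunks with CuPy's RandomState so every chunk is allocated
# and computed on the GPU.
rs = da.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.normal(10, 1, size=(20_000, 20_000), chunks=(5_000, 5_000))

result = (x + x.T).mean(axis=0).compute()  # each chunk runs on the GPU
```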

51:08 Michael Kennedy: Yeah, very incredible. And I would much rather see GPU cycles spent on that than on cryptocurrency.

51:14 Matthew Rocklin: Sure.

51:14 Michael Kennedy: Although, cryptocurrency is good for NVIDIA and all the GPU, the graphics cards makers. They don't seem as useful as solving science problems on. Alright, we're getting close to our time that we have left, so I'll ask you just a couple quick questions before we wrap it up. So, we talked about where Dask is. Where is it going? It sounds like there's a lot of headroom for it to grow, and NVIDIA's investing in it by sort of this work that you all are doing, and it sounds like they're also building something amazing. So yeah, what's next?

51:44 Matthew Rocklin: NVIDIA aside, for a moment, I'll say that Dask technologically is sort of done, the core of it. There's plenty of bugs and plenty of features we can add, but we're sort of at the incremental advancement stage. I would say where I'm seeing a lot of change right now is in broadening it out to other domains. So, I mentioned Pangeo a little bit before. I think that's a good example of Dask spreading out and solving a particular domain's problem. And we're seeing, I think, a lot more of that, so I think we'll see a lot more social growth and a lot more applications in new domains, and that's what I'm pretty excited about.

52:12 Michael Kennedy: That's cool. So, it's like Dask is basically ready, but there's all these other parts of the ecosystem that could now build on it, and really amp up what they're doing.

52:20 Matthew Rocklin: Yeah, let's go tackle those. Let's go look at genomics, let's go look at imaging, let's go look at medicine, let's go look at all of the parts of the world that need computation on the large scale, but that didn't obviously fit into the sort of Spark, Hadoop, or TensorFlow regime. And Dask, 'cause it's more general, can probably handle those things a lot better.

52:38 Michael Kennedy: Yeah, it's amazing. I guess, let me ask you one more Dask question before we wrap it up real quick. So, is Dask 100% Python?

52:47 Matthew Rocklin: The core parts, yeah.

52:48 Michael Kennedy: Or near?

52:49 Matthew Rocklin: Yes. You're probably also using NumPy and Pandas, which are not pure Python.

52:52 Michael Kennedy: Yeah, right, as we discussed, but yeah.

52:53 Matthew Rocklin: Dask itself is pure Python.

52:54 Michael Kennedy: I think it's really interesting 'cause you think of this really high performance computing sort of core, right? And it's not written in Go or C, but it's written in Python. And I just think that's an interesting observation, right?

53:08 Matthew Rocklin: Python is actually not that bad when it comes to core data structure manipulation. Tuples, lists, dictionaries, it's maybe a factor of two to five slower, not 100 times slower. It's also not that bad at networking. Like it does that decently well. Socially, the fact that Python had both the data science stack and a networking stack in the same language made it sort of the obvious right choice. That being said, I wouldn't be surprised if in a couple years we do rewrite the core bits in C++, or something. But we'll see.
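To put a rough number on that "factor of two to five" claim, here's an informal micro-benchmark of the kind of dict and list manipulation a task scheduler does constantly; the task structure is made up for illustration, and absolute timings will vary by machine:

```python
# Time a pass over 10,000 fake tasks: pure-Python dict/list work of
# the flavor a scheduler performs on every update.
import timeit

setup = "tasks = {('inc', i): i for i in range(10_000)}"
stmt = "ready = [key for key, value in tasks.items() if value % 2 == 0]"

per_run = timeit.timeit(stmt, setup=setup, number=1_000) / 1_000
print(f"{per_run * 1e6:.0f} microseconds per pass over 10,000 tasks")
```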

53:33 Michael Kennedy: Sure, eventually, when it's kind of finely baked and you're looking for that extra 20, 30% here or there, right?

53:38 Matthew Rocklin: Yeah, we'll see. We haven't yet reached the point where it's a huge bottleneck to most people. But the climate science people do have petabyte arrays, and sometimes the scheduler does become a bottleneck there.

53:46 Michael Kennedy: Right. It's cool. It's interesting. So, final two questions. If you're going to write some Python code, what editor do you use?

53:53 Matthew Rocklin: I use Vim. Not for any particular reason. I just showed up at the computer lab one day and said, how do I use Linux? And the guy there was a Vim user, so now I'm also a Vim user. I think I haven't moved off just 'cause I'm so often on other machines. So, rich editors just don't move over as easily.

54:09 Michael Kennedy: Yeah. I'm on the cluster and I need to edit something. It doesn't work to fire up PyCharm and do that so easily.

54:14 Matthew Rocklin: Right. Yeah. So, Vim for now, but not religiously. Just anecdotally.

54:18 Michael Kennedy: Sure. Sounds good. Yeah, I'm sure you do some stuff with Jupyter notebooks, as well, right now, then.

54:23 Matthew Rocklin: Sure. I'll maybe also put in a plug for JupyterLab. I think everyone should switch from the classic notebook to JupyterLab, especially if you're running Dask, 'cause you can get all those dashboards inside JupyterLab really nicely. It's a really nice environment.

54:33 Michael Kennedy: Oh yeah. That sounds awesome. Alright, and then notable PyPI package. I'll go ahead and throw pip install dask out there for you.

54:41 Matthew Rocklin: So, I'll cheat a little bit here; I'm going to list a few. First, on the theme of the NumPy space, I really like how NumPy is expanding beyond just the NumPy implementation. So, there's CuPy, which is a NumPy implementation on the GPU. There's Sparse, which is a NumPy implementation with sparse arrays, and there's Ora, which is a new datetime dtype for NumPy. I like that all of these were built on the NumPy API, but built outside of NumPy. I think that kind of heterogeneity is something we're going to move towards in the future, and so it's cool to see NumPy develop the protocols and conventions that allow that kind of experimentation, and it's cool to see people picking that up and building things around it. So, that's pretty exciting.
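A small sketch of that protocol-based dispatch, using the pydata/sparse package mentioned above; NumPy functions route to the sparse implementation via the __array_function__ protocol, so the calling code doesn't change:

```python
# The same NumPy API, a different backend: pydata/sparse implements
# the NumPy interface, and NumPy dispatches to it automatically.
import numpy as np
import sparse

s = sparse.random((1_000, 1_000), density=0.01)  # a COO sparse array

print(np.sum(s))              # np.sum dispatches to sparse's implementation
print(np.transpose(s).shape)  # so does np.transpose -- same calling code
```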

55:17 Michael Kennedy: Those are great picks. It's cool that I learned one API, and now I can do other things, right? Like, I know how to do NumPy stuff, so now I can do it on a GPU.

55:26 Matthew Rocklin: Right, or I've got my NumPy code and I want to have better datetimes. I can just add this new library, and suddenly my NumPy code works with it, because of that sort of extensibility, which is something that we need, I think, going forwards.

55:36 Michael Kennedy: Yeah, absolutely. Alright, Matthew. Well, this was super interesting. Thank you for being on the show. You have a final call to action? People want to get involved with Dask, what do they do?

55:45 Matthew Rocklin: Try it out. You should go to examples.dask.org. There's a launch binder link on that website. You click on that link, you'll be taken to a JupyterLab notebook running in the Cloud, which has everything set up for you, pretty much, with examples. So, you can try things out, see what interests you, and play around. So again, that's examples.dask.org.

56:00 Michael Kennedy: Yeah, and I also linked to the Dask examples GitHub repo that you have on your account in the show notes. Yeah, so people can check that out, too. Awesome. Alright, well thank you for being on the show. It was great to chat with you, and keep up the good work. It looks like you're making a big difference.

56:13 Matthew Rocklin: Great. Thank you so much for having me, Michael. Talk to you soon.

56:15 Michael Kennedy: Yeah. Bye. This has been another episode of Talk Python to Me. Our guest on this episode was Matthew Rocklin, and it's been brought to you by Linode and Backlog. Linode is your go-to hosting for whatever you're building with Python. Get four months free at talkpython.fm/linode. That's L-I-N-O-D-E. With Backlog, you can create tasks, track bugs, make changes, give feedback, and have team conversations right next to your code. Try Backlog for your team for free for 30 days using the special URL, talkpython.fm/backlog. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course, or if you're looking for something more advanced, check out our new Async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our everything bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now, get out there and write some Python code.
