#207: Parallelizing computation with Dask Transcript
00:00 What if you could write standard NumPy and Pandas code, but have it run on a distributed
00:04 computing grid for incredible parallel processing right from Python? How about just splitting the
00:09 work across multi-processing to escape the limitations of the GIL right on your local
00:13 machine? That's what Dask was built to do. On this episode, you'll meet Matthew Rocklin to
00:17 talk about its origins, use cases, and a whole bunch of other interesting topics.
00:21 This is Talk Python to Me, episode 207, recorded February 20th, 2019.
00:27 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the
00:44 ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter
00:49 where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm
00:54 and follow the show on Twitter via @talkpython.
00:56 Hey folks, a quick announcement before we get to our conversation with Matthew. People
01:01 have been enjoying the courses over at Talk Python Training for years now, but one shortcoming
01:06 we've been looking to address is to give you offline access to your content while you're
01:10 traveling or if you're just in a location with spotty internet speeds. We also want to
01:14 make taking our courses on your mobile devices the best they can be. That's why I'm thrilled
01:19 to tell you about our new apps, which solve both of these problems at once. Just visit
01:23 training.talkpython.fm/apps and download the Android version right now. The iOS version
01:28 is just around the corner. Should be out in a week or so. You'll have access to all of your
01:33 existing courses. But we've also added a single tap feature to allow you to join our two free
01:38 courses, the Responder Web Framework mini course and the MongoDB quick start course. So log in,
01:44 tap the free course, and you'll be right in there. Please check out the apps and try one of the
01:48 free courses today. Now, let's talk Dask with Matthew. Matthew, welcome to Talk Python to Me.
01:54 Thanks, Michael. I've listened to the show a lot. It's really an honor to be here. Thank you for
01:57 having me. It's an honor to have you as well. You've done so much cool work in the distributed
02:02 computing space, in the data science and computation space. It's going to be super fun to dig into all
02:08 that and especially dig into Dask. But before we get to that, of course, let's start with your story.
02:13 How'd you get into programming in Python?
02:14 Yeah. So I got into programming originally actually on a TI basic calculator. So I was in math class,
02:20 plugging along, spent my few thousand hours not being a good programmer there. Picked up some C++
02:25 in high school, did some MATLAB and IDL and C# in college, did some engineering jobs,
02:32 did some science. And then in grad school, just tried out Python on a whim. I found it was all of
02:37 those things at the same time. So it was fun to work in object-oriented mode, like with C#.
02:43 It was fun to do data analytics, like with MATLAB. It was fun to do some science work that I was doing
02:48 in astronomy at the time, like with IDL. So it was sort of all things to all people.
02:51 And eventually I was in a computer science department and I found that Python was a good
02:55 enough language that computer scientists were happy with it, but also usable enough language that actual
03:00 scientists were happy with it. I think it's probably where Python gets a lot of its excitement,
03:04 at least on the sort of numeric Python side.
03:07 I think that's a really nice insight. And it sounds like a cool way you got into programming.
03:12 I feel like one of Python's superpowers is it is kind of what I think of as a full spectrum
03:17 language. You can make it what you need it to be at the beginning. But like you said, professional
03:22 developers and scientists can use it as well. And you kind of touched on that, right? Like you can
03:27 use it like C# or you can use it like MATLAB or you can kind of be what you need it to be.
03:32 But if you need more, it's not like, well, now you go right in C++. It's just like,
03:36 we'll now use these other features.
03:38 Yeah, definitely. I'd also say it has a lot of different parts to it. Like there
03:41 was the web programming community, and there was a community that used it kind of like
03:45 Bash, like a sysops community. And there was the, you know, scientific community. Now those are
03:49 all kind of the same, and it's really cool to see all those groups come together and use each other's
03:53 stuff. You see people using scikit-learn inside of a Flask app deployed with some other Python thing.
03:59 Ansible or something, right? Yeah, for sure.
04:01 And the fact that all those things were around makes it sort of the cool state that it's in today.
04:04 Yeah, it's just getting better, right?
04:06 But it was already good. It's just like connecting all the good parts.
04:08 Exactly. The interconnections are getting a little bit stronger across these disciplines,
04:12 which it's kind of rare, right? There's not that many languages and ecosystems where that happens,
04:18 right? Like think of Swift for iOS, right? Swift is a pretty cool language. It has some issues that
04:23 I could take with it, but it's pretty cool. But it's not also used for data science and for DevOps,
04:27 right? It's just not.
04:28 Right.
04:28 Yeah.
04:29 That's pretty cool. Also, you said you did some IDL. It sounds like you were doing astronomy before,
04:34 right?
04:34 Yeah. This is just an undergrad. I was taking some astronomy courses. It was fun though,
04:38 because I knew how to program and I was just really productive as a scientist because I was
04:43 fluent in programming. That ability to get more productive was really good. It took me away from
04:48 science and towards computer science, but in service of science. So that was my experience.
04:53 Now you're building these platforms where legitimate world-class science is happening.
04:58 Yeah. No, it's cool.
04:59 Right? You must be really proud, right? To think of what people are doing with some of the software
05:04 you're building and you're collaborating on.
05:05 Yeah, the software that all of us build. And yeah, it's definitely very cool. It's very satisfying to
05:10 have that kind of impact, not just in one domain, but across domains. And I get to work with lots of
05:15 different kinds of people every day that are solving, I think, high impact problems. And yeah,
05:19 that's definitely a strong motivator for me.
05:20 Yeah, I'm sure it is. Programming is fun, but it's really not fun to work on a project and have
05:26 nobody use it. But to see it thrive and being used, it really is a gratifying experience.
05:31 Yeah. I'll go so far as to say that programming is no longer fun. It's just the way we get things done
05:36 and seeing them done, as you're saying, is really where a lot of the satisfaction comes from.
05:39 Yeah, that's cool. So how about today? What are you up to these days?
05:42 You were, to put it in the past tense, you were doing a lot of stuff with Anaconda Inc.,
05:48 previously named Continuum, but same company, but recently you've made a change, right?
05:52 Yeah. So I like to say it's the same job, just a different employer. Yeah. So I guess my job today
05:57 is a few things. I work in the open source numeric Python ecosystem. So this is usually
06:03 called like SciPy or PyData. And I mostly think about how that system scales out to work with
06:10 multi-core machines or with distributed clusters. I usually do that with a project called Dask.
06:14 That's maybe like the flagship project that I help maintain, but I'm just generally active
06:18 throughout the ecosystem. So I talk to the NumPy developers, the Pandas developers,
06:22 scikit-learn and Jupyter developers. I think about file formats and protocols and conventions.
06:27 There are a few of us- Networking. Networking. Yeah. Lots of things. So generally in service
06:31 of those science use cases or in business use cases, there's tons of things that go wrong. And
06:36 I tend to think about whatever it is that needs to get handled. But generally I tend to focus on Dask,
06:40 which I suspect we'll talk about more today. Yeah. A tiny bit. So speaking of Dask, let's set the
06:46 stage, talk a little bit about, you know, why does Dask exist and so on. And I guess maybe start with a
06:53 statement that you made at PyCon, which I thought was pretty interesting. And you said that Python has
06:59 a strong analytical computation landscape and it's really, really powerful, like Pandas and NumPy and
07:05 so on. But it's mostly focused on single core computation and importantly, data that fits into RAM.
07:11 You want to elaborate on that a little? Sure. So everything you just said, so all of those libraries,
07:17 so NumPy and Pandas and scikit-learn are loved today by data scientists because the APIs are,
07:23 they do everything they want them to do. They seem to be well tailored to their use cases.
07:26 And they're also decently fast, right? So they're not written in pure Python. They're written in C,
07:30 Fortran, C++, Cython, LLVM, whatever. Right. And so much of the interaction is like,
07:36 I have the data in Python. I just pass it off. Or maybe I don't even have it in Python. I have like
07:41 a NumPy array, which is a pointer to data down at the C level, which I then pass off to another
07:46 part. And then internally down in C, it cranks on it and gives it back like five or something,
07:51 right? Like, so it's not, it's a whole, I think the whole story around performance and is Python
07:58 slow, is Python fast? Like it gets really interesting really quick because of those types of things.
08:04 Yeah. So we're, we have really nice usable Python APIs, which are kind of like the front end to a
08:08 data scientist. And that's hitting a backend, which is C++ Fortran code, which is very fast.
08:13 And that combination is really useful. Those libraries also have like now decades of expertise,
08:19 PhD theses thrown into them, just tons of work to make them fit. And people seem to like them a lot.
08:25 And it's also not just those packages. There's hundreds of packages on top of NumPy, Pandas, and
08:29 Scikit-Learn, which do biology or do signals processing or do time series forecasting or do whatever.
08:35 And there's this ecosystem of thousands of packages, but that code, that code that's very
08:41 powerful, it's really just powerful on a single core, usually on data that fits in RAM. So as sort
08:46 of big data comes up or as these clusters become more available, the whole PyData SciPy stack starts
08:51 to look more and more antiquated. And so the goal that I was working on for a number of years and
08:56 continue to work on is how do we elevate all of those packages, you know, NumPy, Pandas,
09:00 scikit-learn, but also everything that depends on them to run well on distributed machines,
09:04 to run well on new architectures, to run well on large data sets? How do we sort of refactor an
09:09 ecosystem? That's a really hard problem.
09:11 You know, that's a super hard problem. And it's, I think, one of the challenges that PyPy,
09:16 P-Y-P-Y, the JIT-compiled alternative to CPython, has faced. I think that would have really taken off
09:24 had it not been for, yes, we can do it this other way, but you have to lose NumPy,
09:30 or you have to lose this C library, or you lose the C speedups for SQLAlchemy, or whatever, right?
09:36 You're trying to balance this problem of like, I want to make this distributed and fast and parallel,
09:40 but it has to come along as is, right?
09:43 Right. Yeah, we're not going to rewrite the ecosystem. That's not a feasible project today.
09:47 Right. I mean, theoretically, you could write some alternate NumPy, but people trust NumPy.
09:53 And like you said, it's got so much polish, it's insane to try to recreate it.
09:57 Yeah. It also has, even if you recreated it, there's a lot of weight on it. A lot of other
10:02 packages depend on NumPy in very arcane ways, and you would have a hard time reproducing all of those.
10:06 Right. Yeah. It's like boiling the ocean, or trying to boil the ocean, pretty much. Okay. And then,
10:13 so that's the problem is like, we have problems that are too big for one machine. We have great
10:20 libraries that we can run on one machine. How do you take that ecosystem and elevate it in a way
10:26 that's roughly the same API, right? We had certain things like that, right? Like Spark,
10:32 for example, is one where you can do kind of distributed computation, and there's Python
10:37 interactions with Spark, right?
10:38 Yeah, definitely. But Spark is again, going down that rewrite path, right? Spark is not using NumPy
10:43 or Pandas. They're rewriting everything in Scala, and that's just a huge undertaking. Now,
10:46 they've done a great job. Spark is a fantastic project. If Spark works for you, you should use
10:50 it. But if you're used to Pandas, it's awkward to move over. Most good Python Pandas devs that I know
10:55 who moved to Spark eventually move to Scala. They eventually give up the ecosystem,
10:58 they move over. And that makes sense, given how things work.
11:01 Right.
11:01 But if you wanted to say...
11:02 Because if you're going to work there, you might as well be native in that space, right?
11:05 But if you wanted to, say, use NumPy, there is no Spark NumPy, nor is there a Spark SciPy,
11:09 nor is there a Spark time series prediction.
11:12 Everything built on it.
11:13 Yeah. So Spark really handled sort of the Pandas use case to a certain extent,
11:18 and like a little bit of the Scikit-learn use case. But there's this other sort of 90%
11:22 ecosystem that's not handled. And that's where Dask tends to be a bit more shiny.
11:26 Right. And isn't... I don't do much with Spark, so you'll have to keep me honest here. But isn't
11:32 Spark very like MapReduce-y? It has a few types of operations and pipelines you can build,
11:38 but you can't just run arbitrary code on it, right?
11:40 Yes and no. So yes, underneath the hood, from a distributed systems point of view,
11:45 Spark implements sort of the same abstraction as MapReduce, but it's much more efficient and it's
11:49 much more usable. The APIs around it are much nicer. So if you do something like a group-by
11:53 aggregation, that is sort of a mapping and then a reduction under the hood. And so as long as you
11:59 sort of stay within the set of algorithms that you can represent well with MapReduce, Spark is a good fit.
12:04 Not everything in the sort of the PyData SciPy stack looks like that, it turns out. And so we looked at
12:08 Spark early on and, hey, can we sort of use the underlying system to parallelize out PyData? The answer
12:12 came back pretty quickly, no. It's just not flexible enough.
12:15 Yeah. And you know, it's, I love that this format is audio, but for this little part, I wish we could
12:21 show some of the pictures that you've had and some of your presentations and stuff, because
12:26 the graphs of how all the data flows together is super, super interesting and complicated and
12:31 sort of self-referential and all sorts of stuff.
12:33 Yeah, no, those visuals are really nice and it's nice that they're built in. You get a good sense of
12:37 what's going on.
12:37 Yeah, it's cool. All right. So let's talk about Dask. You've mentioned it a couple of times.
12:42 Dask is a distributed computation framework focused on parallelizing and distributing the data science
12:51 stack of Python, right?
12:52 Yeah. So let me describe it maybe from a 10,000 foot view first, and then I'll dive in a little bit.
12:57 So from a 10,000 foot view, the tagline is that Dask scales PyData. Dask was designed to parallelize
13:02 the existing ecosystem of libraries like NumPy, Pandas, and Scikit-Learn, either on a single machine,
13:07 scaling out from memory to disk using all of your cores, or across a cluster of machines.
13:11 So what that means is if you like Pandas, you like NumPy, and you want to scale those up to
13:16 many terabytes of data, you should maybe investigate Dask. So diving in a little bit, Dask is kind of
13:22 two different levels. I'll describe it Dask at a low level first, and then a high level. So at a low
13:26 level, Dask is a generic Python library for parallel computing. Just think of it like the threading
13:31 module or like multiprocessing, but just more. There's more sophistication in there; it's
13:36 more complex. It handles things like parallelism, like task scheduling, load balancing, deployment,
13:42 resilience, node failover, that kind of stuff. Now using that very general purpose library,
13:47 we've built up a bunch of libraries that look a lot like PyData equivalents. So if you sort of take
13:54 Dask and NumPy and smash them together, you get Dask Array, which is a large scalable array
13:59 implementation. If you take Dask and Pandas and squash them together, you get Dask DataFrame,
14:03 which uses many Pandas DataFrames across a cluster, but then can sort of give you the same Pandas API.
14:08 So it looks and feels a lot like the underlying normal libraries, NumPy or Pandas, but it scales
14:14 out. So today when people talk about Dask, they often mean Dask DataFrame or Dask Array.
14:20 Yeah. So just to give people a sense, like this, when you interact with one of these, say,
14:25 Dask Arrays, it looks like a NumPy array. You treat it like a NumPy array mostly. But what may
14:32 actually be happening is you might have 10 different machines in a cluster. And on each
14:38 of those machines, you might have some subset of that data in a true NumPy array, right? And that's,
14:43 they know how to coordinate operations. So if you do like a dot product on that with some other bit,
14:48 it can figure out the coordination and communication across all the different
14:51 machines in the cluster to compute that answer in a parallel way. But you think of it as like,
14:57 well, I just do a dot product on this array and we're good, right? Just like you would in NumPy.
15:01 That's exactly correct. Yeah. So Dask is like a very efficient secretary. It's doing all the
15:05 coordination work. NumPy is doing all the work on every node. And then we've written a lot of
15:09 parallel algorithms around NumPy that teach you how to do a big matrix multiply with a lot of small
15:13 matrix multiplies. Or on the pandas side, you know, a big join with a lot of smaller joins.
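To make that concrete, here is a minimal sketch of the Dask Array idea; the array size and chunk shape are illustrative, not from the episode:

```python
import dask.array as da

# One big logical array, backed by a 4 x 4 grid of 500 x 500 NumPy chunks.
x = da.random.random((2_000, 2_000), chunks=(500, 500))

# Same API as NumPy, but lazy: this records a task graph in which the big
# matrix multiply is expressed as many small NumPy matrix multiplies.
y = x @ x.T

# .compute() runs the graph (in a local thread pool by default)
# and hands back an ordinary NumPy result.
print(y.mean().compute())
```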
15:17 That sounds like a lot of coordination and a challenging problem to do in the general sense.
15:23 Yeah, it's super fun.
15:24 The general solution of that. Yeah, I bet it is. A lot of linear algebra and other types of stuff in
15:30 there, I'm guessing.
15:30 Sure. Or a lot of pandas work or a lot of machine learning work.
15:32 Yeah, yeah.
15:34 Or a lot of other work. So I want to point out here that the lower level Dask library,
15:38 the thing that just does that coordination, is also used separately from NumPy and pandas. And I'm
15:42 sure we'll get to that in the future. But so Dask is sort of this core library that handles parallelism.
15:46 You can use it for anything. There's also like a few big uses of it for the major PyData libraries.
15:52 Yeah. We have Dask Array and we have Dask DataFrame and so on. And these are really cool. And
15:57 we said the API is really similar, but there's one significant difference. And that is that
16:04 everything is lazy, right? Like a good programmer?
16:07 Sure. So Dask Array and Dask DataFrame both do lazy operations. And so yeah, so at the end of your,
16:13 you know, you do DataFrame, you do, you know, read parquet, filter out some rows,
16:17 do a group aggregation, get out some small result. At the end of that, you then call .compute,
16:22 which is a method on your Dask DataFrame object. And that will then ship off the sort of recipe for
16:27 your computation off to a scheduler, which will then execute that and give you back a result,
16:31 probably as a pandas DataFrame.
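As a rough sketch of that lazy workflow (the file path and column names here are made up for illustration):

```python
import dask.dataframe as dd

df = dd.read_parquet("data/*.parquet")         # lazy: nothing is read yet
df = df[df.amount > 0]                         # filter out some rows
result = df.groupby("category").amount.mean()  # a group-by aggregation

# Only now does work happen: the recipe is shipped to the scheduler,
# and the small result comes back as a regular pandas object.
print(result.compute())
```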
16:33 There's two major parts here. One is the data structures that you talked about. And the other
16:37 is the scheduler idea, right? And those are almost two separate independent things that make up Dask.
16:42 Two separate code bases even.
16:43 Okay.
16:46 This portion of Talk Python to me is brought to you by Linode. Are you looking for hosting that's
16:51 fast, simple, and incredibly affordable? Well, look past that bookstore and check out Linode at
16:56 talkpython.fm/Linode. That's L-I-N-O-D-E. Plans start at just $5 a month for a dedicated server
17:03 with a gig of RAM. They have 10 data centers across the globe. So no matter where you are or where your
17:08 users are, there's a data center for you. Whether you want to run a Python web app, host a private
17:13 Git server, or just a file server, you'll get native SSDs on all the machines, a newly upgraded
17:18 200 gigabit network, 24-7 friendly support, even on holidays, and a seven-day money-back guarantee.
17:25 Need a little help with your infrastructure? They even offer professional services to help you with
17:30 architecture, migrations, and more. Do you want a dedicated server for free for the next four months?
17:34 Just visit talkpython.fm/Linode.
17:37 I want to dig into the schedulers because there's a lot of interesting stuff going on
17:43 there. One thing that I think maybe it might be worth digging into a little bit for folks is this
17:49 demo that you did at PyCon around working with, basically you had this question you want to answer,
17:56 it's like, what is the average or median tip that a person gives to a taxicab driver plotted by time of
18:04 day? You know, like Friday evening, Monday morning, and so on. And so there's a huge data set,
18:10 maybe you can tell people where to get it, but it's the taxicab data set for some year,
18:15 and it's got like 20 gigs on disk and 60 gigs in memory. Now, I have a MacBook Pro here that's like
18:22 all knobs to 11, and that would barely hold half of that data, right? So like even on a high-end
18:28 computer, it's still like, I need something more to run this because I can't load that into memory.
18:34 Well, to be fair, actually, that's not by like big data standards, it's actually pretty small.
18:38 People use Dask on like 20 terabytes rather than 20 gigabytes pretty often. But you're right that
18:42 it's like, it's awkward to run on a single machine.
18:44 Yeah, you can do a lot of paging.
18:45 Yeah, it's a pretty easy data set, though, to run in a conference demo, just because it's
18:49 pretty interpretable.
18:50 Yeah.
18:50 Yeah.
18:50 Everyone can immediately understand average tip by time of day and day of the week.
18:54 Yeah, so it's a good demo. I'll maybe talk through it briefly here, but I encourage people to look at the PyCon talk. It's probably PyCon 2017, I think, for Dask.
19:02 Yeah, and I'll put a link to the talk in the show notes.
19:04 Yeah, so in that example, we had a bunch of CSV files living on like Amazon S3 or Google
19:10 Cloud Storage, I don't remember which.
19:11 Yeah, Google Cloud Storage.
19:12 And you could use the pandas read CSV function to read a bit of those files, but not the entire
19:17 thing. But if I had to read everything, it would fill up RAM and pandas would just halt.
19:21 And that's kind of the problem we run into in PyData today when we hit big data. Everything's
19:25 great until you run out of RAM, and then you're out of luck.
19:28 So we sort of switched out the import of pandas for Dask DataFrame, called the Dask DataFrame
19:33 read CSV function, which has all of the same keyword arguments as pandas read CSV, which,
19:37 if you know pandas, are numerous. And instead of reading one file, we sort of give it a glob
19:41 string. We'd ask it to read all the files.
19:43 Yeah, then we did normal pandas API stuff. I think we filtered out some bad rows or like
19:49 some free rides in New York City that we had to remove. We made a new column, which is
19:53 a tip fraction. We then use pandas datetime functionality, I think, to group by the hour of the day.
19:58 And the day of the week. And if you know pandas, like that API is pretty comfortable to you.
20:03 Just the data's too big.
20:04 Yeah, it was the same experience as pandas. Like we switched out the import, we put a star in
20:09 our file name to read all the CSV files. We might have asked for a cluster. So we might
20:13 have asked something like Kubernetes or Yarn or Slurm or some other cluster manager to
20:17 give us a bunch of Dask workers. That one's probably using Kubernetes because it's on Google.
20:20 Those showed up. So Google was fine enough to give us virtual machines. We deployed Dask on
20:25 them. And then, yeah, then we hit compute. And then all of the machines in our cluster
20:28 went ahead and they probably called little pandas read CSV functions on different ranges in those CSV
20:34 files coming off of Google Cloud Storage. They then did different group operations. They did some
20:39 filtering operations on their own. They're probably somewhat related to the ones that we asked them to
20:43 do, but a little bit different. They had to communicate between each other. So those machines
20:47 had different pandas data frames. They had to share results back and forth between each other.
20:50 If one of those machines went down, Dask had to ask for a new one or had to recover the data that
20:55 machine had. And at the very end, we got back a pandas data frame and we plotted it. And I think
21:00 the punchline for that one is that the tips for like three or 4 a.m. are absurdly high. It's like
21:06 40% or something like that.
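A hedged reconstruction of roughly what that demo looks like in code; the bucket path and column names follow the NYC taxi data's conventions but may differ from the actual demo, and reading from gcs:// assumes the gcsfs package is installed:

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # connect to the running cluster

# A glob string instead of one file name reads all of the CSVs.
df = dd.read_csv(
    "gcs://bucket/yellow_tripdata_2015-*.csv",
    parse_dates=["tpep_pickup_datetime"],
)

df = df[df.fare_amount > 0]                          # drop the free rides
df["tip_fraction"] = df.tip_amount / df.fare_amount  # new column

# Pandas datetime machinery: group by day of week and hour of day.
df["day"] = df.tpep_pickup_datetime.dt.dayofweek
df["hour"] = df.tpep_pickup_datetime.dt.hour
tips = df.groupby(["day", "hour"]).tip_fraction.mean()

tips.compute().plot()  # the cluster does the work; a pandas Series comes back
```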
21:07 Yeah. Did you have any insight why that is? Do people know?
21:12 I mean, the hypothesis is that it's people coming home from bars. I was actually giving the same demo
21:15 at a conference once and someone was like, oh, filter out all the short rides and I'll bet it
21:20 goes away. And so you did that and the spike does go away. And so it's probably just people getting
21:24 $5 cab rides from the bar. And we were looking at tip fraction, which is sort of going to have a lot
21:29 of noise. But it was fun because this guy said, hey, try this thing out. And he knew how to do it.
21:33 He'd never seen Dask before. He'd just known pandas before. Gave him the keyboard. He typed in the
21:37 pandas commands and it just worked. We got a new insight out of it. And so that sort of experience
21:42 shows us that a tool like Dask data frame gives us the ability to do pandas operations,
21:47 to extend our pandas knowledge, but across a cluster. That data set could have been 20 terabytes
21:51 and the same thing would have worked just fine. You would have wanted more machines. It would have
21:55 been fine. Wait, say that one more time? One more time. Yeah. So Dask also works well on a single machine
21:59 just by using small RAM in intelligent ways. Oh, it does. Okay. So it can be more efficient.
22:03 You could give it a ton of data, but it can realize I can't take it all in one big bite. I got to eat
22:09 the elephant one bite at a time sort of thing. Yeah. So if you can run through your computation in small RAM,
22:12 Dask can usually find a way to do so. And so that's actually like Dask can be used on clusters. And
22:16 that's the flashy thing to show to people at a PyCon talk. But most Dask users are using just their
22:22 laptops. So for a long time, Dask actually just ran on single machines for the first year of its
22:26 life. We built all the parallel algorithms, but didn't have a distributed scheduler. And there people
22:30 just wanted to sort of stream through their data, but with pandas or NumPy APIs. So people were
22:35 dealing with 100 gigabyte data sets on their laptops pretty comfortably.
22:38 Oh, that's interesting. I didn't realize that that was a thing it could do. That's cool.
22:41 Could it be used in the extreme? Like if I've got a really small system with not much RAM at all,
22:49 I'm thinking like little IoT things. Could it let me process like not big data,
22:53 but like normal data and like a tiny device?
22:55 Sure. Depending on how much data or how much you're looking for. Sure.
22:59 Yeah. Okay. Interesting. That's a super cool feature. Now, one of the first things that struck me
23:03 when you ran this, like obviously you've got all these machines. I think you had 32 systems with
23:09 two cores each when you ran the demo and it went really quickly, which is cool. I expected it to
23:14 come up with a graph and you could say, we ran it, even though we can't even run it locally here, and it did it in,
23:18 you know, 10 seconds or something. But instead, what I saw when you did that was there's like a
23:24 beautiful graph that's alive and animated describing like the life of your cluster and the computation
23:29 and like different colors, meaning different things. Can you describe like that diagnostic
23:34 style page? And so people know. Yeah, sure. It's hard to describe a picture, much less like an
23:40 interactive dashboard. But yeah, so maybe I'll describe a little bit of like the architecture
23:44 of Dask first. I'll make this a bit more clear. Okay. So Dask has a bunch of workers that are on
23:49 different machines that are doing work, holding onto data, communicating with each other. Then there's a
23:53 centralized scheduler, which is keeping track of all those workers. This is like the,
23:57 like the foreman at a job site, maybe of all these workers telling the workers what to do,
24:01 telling them to share with each other, et cetera. And that foreman, that scheduler has a ton of
24:06 information about what all the workers are doing, all the tasks that are being run, the memory usage,
24:11 network communications, you know, file descriptors that are open on every machine, et cetera.
24:15 And in trying to benchmark and profile and debug a lot of Dask work, we found that we wanted access to
24:21 all that information. And so we built this really nice dashboard. And so, yeah, it tells you all that
24:26 information. You saw one page at that demo, but there are like 10 or 20 different pages of different
24:31 sets of information. So there's just a ton of state in the scheduler. And we wanted to expose all that
24:36 state visually. This ends up being, it was originally designed for the developers, but it ends up being
24:40 critical for users because understanding performance is really hard in the distributed system, right?
24:45 Our understanding of what is fast and slow in a single machine gets totally turned around when you
24:49 start adding concurrent access to disk and network communications and everything. So yeah, it's a live
24:55 dashboard. We used Bokeh. That's a Bokeh server application to build it. So it's got, you know,
25:00 it's updating at something like hundred-millisecond frame rates. So it looks live to the human eye. Yeah.
25:05 It's showing you progress. It's showing you every task that gets done and when it gets done and where
25:09 it gets done. It's showing you memory use on all different machines. You can dive in. You can actually
25:13 get like line by line profiling statistical information on all of your code.
25:18 So that coordinate across machines. So like this function is taking this long, but actually it's
25:22 summing up the profiling across all the cluster. So the scheduler is just aware of everything and the
25:27 workers are all gathering tons of information. There's tons of diagnostics on that machine.
25:30 Yeah. So maybe a way to say this is that in order to make good decisions about where to schedule certain
25:35 tasks, DAS needs to have a bunch of information about how long previous tasks have taken, the current
25:39 status everywhere. And so we've gotten very, very good at collecting telemetry and keeping it
25:45 in a nice index form on the scheduler. And so now that we have all that information,
25:48 we can just present to people visually.
25:50 Might as well share.
25:50 Yeah. I find myself sometimes using DASC just on sequential computations just because I want
25:55 the dashboard. Like I don't want parallelism. I just want the dashboard. It's just the best
25:58 profiler I have.
25:59 It is a good insight into what's happening there. That's super cool. And I love that it's live. So
26:04 it seems really nice. I was surprised at how polished that part of it was.
26:08 That is honestly just due to Bokeh. If people, if they like dashboards, go look at Bokeh server.
26:13 I didn't know any JavaScript. I didn't know about Bokeh before. It was super easy to use.
26:17 I integrated it in my application. It was clean. It was nice. And it looks polished, as you said.
26:21 That's super cool. Now you did say that you can run DASC locally and there's a lot of advantages
26:26 to doing that as you already described, but it's super easy, right? Maybe just describe,
26:31 like it's just a couple of lines to set up a little mini cluster running on your system,
26:36 right?
26:37 Yeah. So it's even easier than that, but yes. So first
26:40 of all, you would pip install dask or conda install dask. It's a Python library. It's pure
26:45 Python. You would then, if you want the sort of the local cluster experience with the dashboard
26:49 and everything, say from dask.distributed import Client. And then you create a client, which,
26:53 with no arguments, would create for you a few workers on your local machine and a scheduler
26:57 locally. And then all the rest of your Dask work would just run on that. You also don't need
27:01 to do that. You can just import dask and there's a thread pool waiting for you if you don't
27:05 set anything up. So a lot of people, again, don't use the distributed part of Dask. They
27:10 just use Dask locally and Dask can operate just as a normal library. It spins up a thread
27:14 pool, runs your computations there and you're done. So there's no actual setup you need to
27:18 do.
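In code, the whole local setup he describes is roughly this (the CSV files are hypothetical, for illustration):

```python
# pip install "dask[distributed]"   (or: conda install dask)
from dask.distributed import Client
import dask.dataframe as dd

# With no arguments, Client() starts a few local workers plus a scheduler,
# and client.dashboard_link points at the live diagnostic dashboard.
client = Client()

df = dd.read_csv("data/*.csv")  # hypothetical files
print(df.head())                # head() triggers a small computation on the workers
```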
27:18 Right.
27:18 Sorry. There was a great tweet maybe a few weeks ago. Someone was using the Xarray
27:23 library. Xarray is a popular library for array computing in like the geoscience world.
27:28 Someone was recommending Xarray and Dask on Twitter. And someone says,
27:32 yeah, I've heard about Xarray. Use it all the time. It's great. Never heard of Dask though.
27:35 Don't know what you're talking about. And it was hilarious because Xarray uses Dask under
27:39 the hood. The guy had just never realized that he was using Dask. So that's a real success
27:43 point.
27:43 Yeah. It doesn't have to poke its head up and make it really obvious that
27:48 this thing is part of it, right? That's great.
27:50 Dask had disappeared into being just infrastructure.
27:52 Yeah. So another thing that I thought was a super cool feature about this is you were
27:58 talking about some kind of working with a really big tree of data, right? And you start
28:04 processing it and it's going okay, but it's going somewhat slowly and it's a live talk with
28:10 not that much time. So you're like, all right, well, let's add some more workers to it. And
28:14 while the computation is running, you add like 10 more workers or something. And it just on
28:19 your cool bokeh graph and everything, you just see it ramp up and take off. And so you can
28:23 dynamically add and remove these workers to the cluster, right?
28:27 Yeah, definitely. So Dask does all the things you'd expect from any modern distributed system.
28:32 It handles resilience, handles dynamically adding things. You can deploy it on any common job
28:37 scheduler. So it does sort of all the things that you'd expect from something like Spark,
28:40 you mentioned before, or TensorFlow or Flink or any of the sort of popular distributed systems today,
28:46 but it does sort of all of those things. So you can sort of sprinkle in that magical dust
28:50 into existing projects. And that's really the benefit. Yeah. So we focus a lot on how do we
28:55 add clusters efficiently or how do we add workers to a cluster dynamically?
28:57 Yeah. It seems really useful. Just as you're doing more work, you just throw it in. And
29:02 I'm guessing these clusters can be long lived things and different jobs can come and go. And
29:06 are they typically set up for like research groups or other people in longer term
29:11 situations or are they like sprung up out of Kubernetes, like it burst into the cloud and
29:16 then go away?
29:17 I'm going to say both actually. So what we'll often see is that the institution, which will
29:21 have something like Kubernetes or Yarn or Slurm, and they'll give their users access to spin up
29:26 these clusters. So an analyst will sit down at their desk, they'll open up a Jupyter notebook,
29:29 they'll import Dask, use one of the Dask deployment solutions to ask for a Kubernetes Dask cluster.
29:35 Then like their own little scheduler and workers will be created on that machine,
29:39 on that cluster. They'll do some work for an hour or so, and they'll go away. And that cluster will go
29:43 away. What IT really likes about the sort of adding and removing of workers is that we can add and
29:49 remove workers during their computation. So previously they would go into the cluster, they'd ask for,
29:53 hey, I want 100 machines. They'd use those machines for like a minute while they load their data,
29:57 and they would stare at a plot for an hour. And so it's not so much the ability to add new workers,
30:01 it is the ability to remove those workers when they're not being used. This gives like really good
30:05 utilization. You can support a lot more people who have really bursty workloads. So a lot of data
30:10 scientists, a lot of scientists, their analysis is very bursty. They do a bunch of work and they stare
30:15 at their screen for half an hour. They do a bunch more work and they stare at the screen for half an
30:19 hour. And so that sort of dynamism is really helpful in those sorts of use cases.
30:22 Yeah, I can imagine. That's super cool. Now we talked about the Dask DataFrame and Dask Array and
30:28 those being basically exact parallels of the NumPy and Pandas version. But another thing that you can do
30:37 is you can work in a more general purpose way with this thing called a delayed and a compute, right?
30:43 Which will let you take more arbitrary Python code and fan it out in this way, right?
30:48 Yeah. So let me first add a disclaimer. Dask DataFrame and Dask Array are not exactly like
30:53 NumPy and Pandas. They're just very close. Close enough, you'll be fine.
30:55 Similar, okay.
30:56 If I don't say that, people will yell at me afterwards. So yeah, so the creation of Dask
31:00 delayed is actually really interesting. So when we first built Dask, we were aiming for something that
31:05 was Spark-like. Like, oh, we're going to build this like very specific system to parallelize NumPy and
31:10 parallelize Pandas, and that'll handle everything everyone has built on top of those things. That did not
31:14 end up being the case. Some people definitely wanted parallel Pandas and parallel NumPy,
31:18 but a lot of people said, well, yeah, like I use NumPy and Pandas, but I don't want a big NumPy
31:22 and Pandas. I'm using something that's like really different, really custom, really bespoke.
31:25 And they wanted something else as though we had built like a really fast car. And they said,
31:31 great, that's a beautiful car. I actually just want the engine because I'm building a rocket ship
31:35 or I'm building a submarine or I'm building like a mechatronic cat or something.
31:39 It turns out that like in the Python space, people build like really diverse things. Like
31:46 Python developers are just on the forefront. They're doing new things. They're not solving
31:50 the same classic business intelligence problem over and over again. They're building things that are new.
31:54 And so Dask delayed was like a really low-level API where they could build out their own task graphs,
31:59 their own parallel computations, sort of one chunk at a time using normal Python for loops.
32:04 So it allows you to build like really custom, really complex systems with normal looking Python
32:09 code, but still get all the parallelism that Dask provides. So you still get the dashboard,
32:13 you still get the resilience. You can build out your own system in that way.
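A small, hypothetical sketch of that dask.delayed style, with ordinary functions and for loops building a custom task graph:

```python
import dask

@dask.delayed
def load(filename):
    return open(filename).read()  # stand-in for custom loading logic

@dask.delayed
def process(text):
    return len(text)              # stand-in for bespoke computation

@dask.delayed
def summarize(counts):
    return sum(counts)

filenames = ["a.txt", "b.txt", "c.txt"]          # hypothetical inputs
counts = [process(load(f)) for f in filenames]   # a normal Python for loop
result = summarize(counts)

# Nothing above has run yet; this executes the whole graph in parallel,
# with the same dashboard and resilience as the higher-level collections.
print(result.compute())
```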
32:15 Yeah. You even have the graph and the dashboard, as you said, which is really cool.
32:19 Yeah.
32:19 I love it. So when I saw that, I was thinking, well, this is really interesting. And I see how that lets
32:25 me solve a more general numerical problem. Could I solve other things with it, right? Like suppose I have
32:31 some interesting website that needs better performance, right? I've got something that
32:36 has really nothing at all to do with computation. Maybe it talks to databases or other things,
32:41 right? Could I put a DASC backend on like a high-performance website and get it to do,
32:48 you know, bizarre cluster computations to go faster? Or is it really more focused on just data science?
32:53 Yeah. So people would use Dask in that setting in the same way they might use like a multiprocessing
32:58 pool or a thread pool, or Celery maybe, but they would scale it out across a cluster.
33:03 So Dask satisfies like the concurrent.futures interface, if you use concurrent futures,
33:07 and also does async await stuff. So people definitely integrate it into systems like what you're talking
33:10 about. I think at UC Berkeley, there's a project called Cesium that does something like that.
33:14 I gave a talk at PyGotham like a year or two ago that also talked about that. So yeah,
33:19 like people definitely integrate Dask into more webby kinds of systems. It's not accelerating
33:25 the web server. It's just accelerating the sort of computations that the web server is hosting.
33:29 It's useful because it has like millisecond latencies and scales out really nicely and
33:34 is like very Pythonic. It fits all the Python APIs you'd expect.
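For example, the futures-style interface he mentions mirrors concurrent.futures; the work function here is a placeholder:

```python
from dask.distributed import Client

client = Client()  # local workers, or point it at a cluster address

def score(x):
    return x ** 2  # stand-in for whatever the app needs computed

future = client.submit(score, 10)  # returns immediately with a Future
print(future.result())             # block for the answer: 100

# Fan out many tasks and gather the results, concurrent.futures-style.
futures = client.map(score, range(100))
print(sum(client.gather(futures)))
```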
33:37 Yeah. And the fact that it implements async and await is super cool. If you use it with something
33:41 like Sanic or Quart or one of these async-enabled web frameworks, it's just a line: await
33:47 some execution, and the scalability, you know, spreads across maybe some Dask Kubernetes cluster or
33:54 something. That's pretty awesome. You talked about the latency and I was pretty impressed
33:58 with how those were working. You were quoting some really good times. And actually you said
34:02 that's one of the reasons you created your own scheduler and you didn't just try to piggyback on
34:07 Spark is because you needed extremely low latency because it's very, it's kind of chatty, right? If I'm
34:12 going to take the dot product of a thing that's all over the place, they've got to coordinate quite a bit
34:16 back and forth, right? That can't be too bad.
34:18 Yeah. So Spark actually internally has okay latencies today. It didn't at the time,
34:22 but they're also at sort of the millisecond range. Where Spark falls down is in complexity.
34:25 They don't handle sort of arbitrary task graphs. Sort of the other side of that is systems like
34:30 Airflow, Luigi, and Celery, which can do more arbitrary task graphs. You can say, you know,
34:35 download this particular file. And if it works, then email this person, or if it doesn't work,
34:39 parse this other thing. You can build these dependencies into Airflow, Luigi, Celery.
34:43 Where those things fall down is that they don't have good inter-task communication. So you can't
34:48 easily say, I've got these two different pandas dataframes on different machines,
34:51 have them talk to each other. They don't handle that. They also have long latencies in sort of
34:55 the like tens to hundreds of milliseconds, which can be...
34:58 Right. Which is not that long, but when you amplify it across, you know, a thousand times,
35:02 then all of a sudden it's huge.
35:03 Yeah. Or when your pandas computations take 20 milliseconds or one millisecond,
35:07 and you want to do a million of them, then suddenly latency does become a big issue. So we needed sort of
35:11 the scalability and performance of Spark, but with the sort of flexibility of Airflow,
35:17 Luigi, Celery. And that's sort of where Dask is sort of that mix between those two camps.
35:21 This portion of Talk Python to Me is sponsored by Backlog from NewLab. Developers know the importance
35:29 of organization and efficiency when it comes to collaborating on a team. And Backlog is the
35:34 perfect collaborative task management software for your team. With Backlog, you can create tasks,
35:39 track bugs, make changes, give feedback, and have team conversations right next to your code. You track
35:44 progress with features like Gantt and burndown charts. You document your processes right alongside your
35:50 wikis. You can integrate with the tools you use every day like Slack, Jira, and Google Sheets. You can
35:54 automatically register issues from email or web form submissions. Take your work on the go using their
35:59 top-rated mobile apps available on Android and iOS. Try Backlog for your team for free for 30 days using
36:05 our special URL at talkpython.fm/backlog. That's talkpython.fm/backlog.
36:14 I was really impressed with the low latency that you guys were able to achieve. And so maybe that's a good
36:18 way to segue over to some of the internal implementations. You talked about using gevent,
36:23 Tornado. What are some of the internal pieces that are at work there?
36:28 Yeah. So on the distributed system, we use Tornado for concurrency. This is because we were
36:32 supporting both Python 2 and 3 at the time. Tornado is becoming more asyncio friendly. That's sort of
36:37 becoming more common. So Tornado for concurrency. Also Tornado a little bit for TCP communications,
36:41 although we've had to improve Tornado's bandwidth for high-performance computers.
36:46 That's cool. So has Tornado gotten a little bit better because you guys have been
36:49 pushing it to the limit? Yeah, definitely.
36:50 That's good. Yeah, that's good.
36:52 Dask touches a ton of the ecosystem. And so Dask developers tend to get pretty comfortable with
36:57 lots of other projects. So Antoine Pitrou used to work on Dask a lot. He did a lot of the
37:01 networking stuff in Python, and he also worked on Tornado and did lots of things. So he was handling that.
37:06 So yeah, Tornado for concurrency and TCP communication. So all the Dask workers are
37:11 TCP servers that connect to each other. So it's kind of like a peer-to-peer application.
37:15 A bunch of just raw, basic Python data structures for our internal state, just because those are
37:21 the fastest we can find. The Python dictionary is actually a pretty efficient data structure.
37:25 So that's how we handle most of the internal logic. And then for computation, if you're using
37:29 Dask data frame, we're using pandas for computation. If you're using Dask array,
37:32 we're using numpy for computation. So we didn't have to rebuild those things.
37:35 And then, you know, compression libraries, encryption libraries, Kubernetes libraries.
37:40 There's sort of a broad set of things we end up having to touch.
37:43 I can imagine. Is the cluster doing like HTTPS or other types of encrypted communication and
37:48 stuff like that between the workers and the supervisor?
37:50 Yeah. So not HTTPS. HTTP is sort of too slow for us. We tend to use TCP.
37:55 Okay.
37:56 Dask does support TLS out of the box, or SSL, you might know it as. But the comms are also
38:00 pluggable. So we're actually looking at right now a high-performance networking library called
38:04 UCX to do like InfiniBand and other stuff. Security is standard in that way, assuming
38:08 you're happy with TLS.
38:09 Yeah. Yeah. I suspect most people are. Until the quantum computers break it, and then we have
38:13 a problem.
38:14 Sure. Comms are extensible. We can add something else.
38:17 Like a quantum-safe algorithm.
38:19 Yeah. Yeah. I suspect we're going to be scrambling to solve the banking problem and e-commerce
38:24 problem faster first, though. That'll be where the real fear and panic will be. But luckily,
38:31 that's not here yet. So this just sounds like you guys are really pushing the limits of so much
38:36 of these technologies, which sounds like it must be a lot of fun to work on.
38:39 Yeah. It's been a wild ride.
38:40 I can imagine.
38:41 Yeah. It's actually been interesting also building conventions and protocols among the community.
38:46 I think Dask is at an interesting spot because we do talk to everybody. We talk to most scientific
38:52 disciplines. We talk to banking. We talk to Fortune 500 companies. And we see that they all have the
38:58 same problems. So I think recently I put together a bunch of people on file formats. Not because I care
39:04 deeply about file formats, because I had talked to all of them. We also talked to all the different
39:07 library developers. And it's really interesting to see, you know, OK, Tornado 4.5. Are we ready to
39:12 drop it or not, for example? And we talked to the Bokeh developers, the Jupyter developers, the DAS
39:16 developers, the Tornado developers. It's a well-integrated project into the ecosystem. I want to sort of like
39:21 hang there for a moment and say that because, again, Dask was designed to affect an ecosystem,
39:25 move an ecosystem forward, it actually has like its sort of tendrils into lots of different
39:30 community groups. And that's been a really fun thing to see. Sort of not on a technological side,
39:34 just on a social side. It's been very sort of satisfying and very interesting to see all these
39:39 different groups work together. Yeah, I can imagine. You get to work with not just the library
39:43 developers on all the different parts of the ecosystem, but also, on the other side, with
39:47 the scientists and the data scientists and people solving really cool problems, right?
39:50 You'd be amazed how often like climate scientists and quantitative traders have exactly the same problems.
39:56 You know, that's something I noticed doing professional training for 10 years. You know,
40:01 I taught a class at Edwards Air Force Base. I taught one at a hedge fund in New York and I taught one at a
40:07 startup in San Francisco and like 80% that's the same. It's all the same, right? Like it's good
40:13 computation. It's algorithmical thinking. It's like distributed systems. And then there's the special
40:18 sauce that probably I don't even understand anyway, right? But it's super interesting to realize that how
40:23 similar these different disciplines can be.
40:25 Yeah. And from an open source standpoint, like this is new because previously all those domains had
40:30 like a very different technology stack. Now they're all using the same technology stack. And so as a
40:35 person who sort of is paid to think about software, but also cares about advancing humanity, it's like a
40:41 really nice opportunity, right? We can take all this money from these quant traders, build that
40:45 technology and then apply that technology to genomics or something.
40:48 Right. Or climate science or something that like really needs to get it right quickly.
40:52 That's sort of a nice opportunity that we have right now, especially in Python, I think that we can
40:58 hire a bunch of people to do a bunch of good work that advances humanity. It also
41:04 makes people rich playing the stock market, but you know, we can let that happen.
41:07 Hey, yeah. If they're helping drive it forward, you know, it might be okay.
41:10 Sure.
41:11 For the good, maybe that's debatable, but probably. It sounds like such a cool project. Now maybe let's
41:17 talk a little bit more of the human side of it. Like obviously you work a lot on it,
41:21 as you said, but who else? Like, are there a lot of people that maintain Dask, or what's the story
41:26 there?
41:26 I've worked on it for a while within, so we started within Continuum or what is now called Anaconda.
41:31 And there's a bunch of other engineers within Anaconda who also work on Dask.
41:34 They do sort of halftime something else and halftime work on Dask.
41:37 Are they still?
41:37 Definitely. People like Jim Crist, Tom Augspurger, Martin Durant, and others throughout the company.
41:41 And so yeah, they'll maintain various parts of the project or push it forward. So like Jim recently has
41:46 been working on making Dask easier to deploy on Hadoop or Spark systems using Yarn. And that's
41:51 something that is important for Anaconda's business and also just generally useful for people. So there's
41:55 a sort of nice overlap between helping out the open source project and also getting paid to do the work.
42:00 Yeah.
42:01 Outside of Anaconda, so I'm now at NVIDIA, there's a growing group at NVIDIA who's also working on
42:05 Dask, which is pretty exciting. And also there's a bunch of other people who are working in other
42:09 research labs or other companies who need to use Dask for their work and need to build out the
42:14 Kubernetes connection or something like that. And they maintain that both because it's good for
42:18 their company and also they just sort of enjoy doing that. So I would say in sort of the periphery of
42:23 Dask, most of the work is happening in other companies and other people. And there's hundreds of people who do
42:29 work on Dask in any given month or two.
42:31 There's probably a lot of people that make like one or two little contributions to like fix their
42:36 edge case, but it all adds up, right?
42:37 There's that. There's also maybe like 20 or 30 people who are maintaining, not just adding a fix,
42:42 but like maintaining some Dask foo package. So there's like, there's a researcher, Alistair Miles,
42:48 who works in the Oxford area, who works on genomics processing. And he has like a Dask enabled
42:54 genomics suite called scikit-allel. So that's his job, right?
42:58 So he's working on that day to day. There's, you know, the scikit-learn developers maintain the Dask
43:03 scikit-learn connection through Joblib. So Olivier Grisel has commit rights over all of Dask. He's
43:07 the main scikit-learn contributor. So there's a bunch of sort of, as Dask expands out into the rest of
43:13 the ecosystem, that it's sort of being adopted by many, many smaller groups who are in other
43:18 institutions.
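That Joblib connection mentioned above looks roughly like the sketch below: creating a Dask client registers a "dask" joblib backend, so joblib-parallel estimators fan out to Dask workers. The estimator and data here are placeholders, and the exact registration details vary by version:

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client()  # also makes the "dask" backend available to joblib

X, y = make_classification(n_samples=1_000)  # toy data for illustration
clf = RandomForestClassifier(n_estimators=200)

# Inside this block, joblib's parallelism runs on Dask workers
# instead of local threads or processes.
with joblib.parallel_backend("dask"):
    clf.fit(X, y)
```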
43:19 So it sounds to me like there might be some decent opportunities for folks to contribute
43:23 to Dask in the sense that, you know, maybe there's somebody that's doing some kind of research
43:29 and their library or something like it would really benefit from having a Dask-enabled version of that.
43:36 So if people are looking to contribute to open source and the data science space, are you guys
43:41 looking for that kind of stuff?
43:42 Yeah, definitely. I mean, you don't have to look these days. People are just doing it.
43:45 What I would say is that there's probably, in whatever domain you're in, there's probably some
43:50 group that's already thinking about how to use Dask in that space and jump onto that group.
43:55 There's a lot of sort of like low-hanging fruit right now. Pick your favorite PyData library.
43:59 Think about how it could scale. And there's probably some small effort around that currently.
44:03 And there's a lot of, you know, if you want to get into open source, it's a great place to be.
44:07 You're like right on the very exciting edge of big data and scalability. Things are pretty easy today.
44:12 We've solved a lot of the major issues with sort of the core projects like NumPy and
44:17 Pandas and Scikit-learn. So yeah, there's a ton of ton of pretty exciting work that's like
44:21 just starting to pick up now. So yeah, if you're interested in open source, it's a great place to start.
44:25 It seems like there's so many data science libraries that there's probably a lot of opportunity to
44:30 like bring some of them into the fold. Pretty low-hanging fruit there. So let me ask you some
44:35 kind of how wild is that type of thing. One is what's the biggest cluster you've ever seen it run on?
44:43 The biggest I've had-
44:44 Or heard about it running on?
44:45 Yeah. I would say like a thousand machines is probably the biggest that I've seen.
44:51 And that's probably within an order of magnitude of the limit today. But those machines are big machines.
44:56 They have like 30 cores on them. That's also just mostly for benchmarking;
45:00 most problems in data science aren't that big. A thousand is sort of the extreme.
45:03 That's awesome. I was thinking of places like CERN that have their big grid computing stuff and things
45:08 like that. Like they could potentially spin up something pretty incredible.
45:12 That gets used on a fair number of supercomputers today. But it's very, very rarely used on like
45:18 the entire supercomputer. There's really no system outside of MPI that I would expect to run on the full
45:24 like CERN federated grid. So most of the data-science-oriented libraries end up maxing
45:31 out around a thousand machines. Hypothetically you could go higher, but the overheads will be enough to
45:36 kill you. Does it have a sort of, how many handshakes does it take for everyone in a group to shake
45:40 everyone else's hand? You know, that n-factorial type of issue. As you add more,
45:45 does it get like harder and harder to add the next worker to the cluster or is it okay?
45:49 No, it's fine. Everything's linear in terms of the number of workers. It's not
45:53 like a mesh grid. Workers talk to the scheduler to find out who they have to talk
45:56 to, and then they talk peer-to-peer to those workers. But in any given computation, a worker only talks to,
46:01 you know, five to ten peers at once.
46:02 Okay.
46:03 It's fine.
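To make that scheduler-and-workers picture concrete, here's a minimal local sketch; the worker counts and array sizes are arbitrary, and the same Client API works whether the cluster is one laptop or a thousand machines:

    from dask.distributed import Client, LocalCluster
    import dask.array as da

    # One scheduler process plus four worker processes on this machine
    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    client = Client(cluster)

    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    # Workers fetch each other's chunks peer-to-peer as the graph executes
    print(x.mean().compute())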
46:03 Yeah. Interesting. Okay. The other one is what's the coolest thing that you've seen
46:07 computed or run on Dask or built with Dask?
46:09 I'm going to say, there are things I can't, yeah, things I can't talk about, but...
46:13 Some of the most amazing ones are super secret.
46:16 I'm going to point people to the Pangeo project. So this is actually my most recent
46:19 Okay.
46:19 PyCon talk, in 2018, talking about other people's work. So Pangeo is pretty cool.
46:23 Pangeo is a bunch of earth scientists who are doing climate change, meteorology, hydrology,
46:28 figuring out the earth. And they have these, you know, 100-terabyte, approaching-petabyte arrays that
46:33 they're trying to work with. They haven't quite gotten to the petabyte scale yet.
46:35 And yeah, so they usually set up something like JupyterHub on either a supercomputer or
46:40 on the cloud with Kubernetes. They then analyze tens or hundreds of terabytes of data with
46:45 Xarray, with Dask under the hood, and get out, you know, nice plots that show that indeed sea
46:50 levels are rising. I think it's pretty cool in that it's a good community of people. So there's
46:55 a bunch of different groups: NASA, universities, USGS, the UK Met Office, anyone who's looking at
47:01 large amounts of data, companies, Planet Labs, et cetera. So it's a good community.
47:05 They're figuring out how to store one copy of the data in the cloud and then have everyone
47:11 compute on that using cool technologies like Kubernetes. They're inventing cool technologies,
47:15 like how to store multi-dimensional arrays in the cloud. So yeah, that's maybe a good
47:19 effort. Not necessarily cool computation, from a Dask perspective it's actually pretty
47:24 vanilla, but it's cool to see a scientific domain advance in a significant way using some
47:31 of these newer technologies like JupyterHub and Dask and Kubernetes. And I'd like to see that
47:34 repeated. I think we could do the same thing with medical imaging and with genomics. So that's maybe
47:39 a cool example. It's not so much a technical situation, it's a community situation.
47:43 Yeah. Well, it sounds like a cool example that could be just replicated across verticals.
47:48 Definitely. We're starting to see that, which is really exciting.
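For a sense of what that Pangeo-style workflow looks like in code, here's a minimal sketch; the bucket path and variable names are hypothetical, and it assumes xarray, zarr, and a cloud filesystem library like gcsfs are installed:

    import xarray as xr

    # Hypothetical Zarr store; Pangeo publishes catalogs of real cloud datasets
    ds = xr.open_zarr("gs://example-bucket/sea-surface-height.zarr")

    # Lazy, chunked computation: Xarray keeps the labels, Dask does the work
    trend = ds["sea_level"].mean(dim=["lat", "lon"])
    trend.compute()  # or trend.plot() in a notebook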
47:50 Okay. Those are great. So when I was looking into Dask, I ran across some other libraries that are
47:55 kind of like you described, Dask-foo. So there was Dask-CUDA, Dask-ML, Dask-Kubernetes. Do you want to
48:02 just talk a little briefly about some of those types of things?
48:05 Sure. And why you might use them?
48:06 I'll go in reverse order. So Dask-Kubernetes is a library that helps you deploy Dask clusters
48:11 on Kubernetes.
48:12 So you point it at a Kubernetes cluster and you say, go forth and use that, and it can
48:18 just create the pods and run them and all that?
48:20 Definitely. So it creates pods for all the workers and it manages those pods. If you ask it,
48:23 there are like little buttons, you can ask it to scale up or scale down, and it'll manage all that stuff for you
48:27 using the Kubernetes API. It presents users with a super simple interface. They can open up
48:32 a Jupyter notebook and just sort of play around. So that's what that does. And it has peers like
48:36 Dask-Yarn for Hadoop/Spark clusters and Dask-Jobqueue for high-performance computing schedulers like Slurm,
48:42 PBS, SGE, LSF. So that's Dask-Kubernetes; that's on the deployment side.
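Here's a minimal sketch of what that deployment looks like; the pod-spec file name is hypothetical, and the dask-kubernetes API has evolved across versions (newer releases use an operator-based KubeCluster), so check the current docs:

    from dask_kubernetes import KubeCluster
    from dask.distributed import Client

    # worker-spec.yml is a pod template you would write for your cluster
    cluster = KubeCluster.from_yaml("worker-spec.yml")
    cluster.scale(10)  # ask Kubernetes for ten worker pods
    # cluster.adapt(minimum=1, maximum=50)  # or scale with the workload

    client = Client(cluster)  # now Dask computations run on those pods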
48:48 Dask-ML is another effort. There were a lot of interesting ways of using Dask with machine learning algorithms,
48:53 things like GLMs, generalized linear models, XGBoost, scikit-learn, that were all different
48:59 technical approaches. And Dask-ML is almost like a namespace library that collects all of them
49:03 together, but also using the scikit-learn style APIs. So if you like scikit-learn and you want
49:08 to scale that across a cluster, either with simple things like hyperparameter searches or with large-
49:14 data things like logistic regression or XGBoost, you should look at Dask-ML and it'll help you do
49:18 that.
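As a taste of that scikit-learn style API, here's a minimal sketch using Dask-ML's estimators on chunked Dask arrays; the sizes are arbitrary and it assumes the dask-ml package is installed:

    from dask_ml.datasets import make_classification
    from dask_ml.linear_model import LogisticRegression

    # A large dataset materialized as chunked Dask arrays
    X, y = make_classification(n_samples=1_000_000, chunks=100_000)

    clf = LogisticRegression()  # familiar fit/score interface
    clf.fit(X, y)               # training runs across the cluster
    print(clf.score(X, y))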
49:23 Dask-CUDA is a new thing I've been playing with since moving to NVIDIA. People have been using Dask with GPUs for a long time and they've developed a bunch of weird, wonky
49:27 scripts to set things up just the right way. There were a ton of these within NVIDIA when I
49:31 showed up. Dask-CUDA is just a few nice, convenient Python ways to set up Dask to run on either one
49:38 machine with many GPUs or on a cluster of machines with lots of GPUs, and to make that easy.
49:42 So it's kind of like Dask-Kubernetes in that it's about deployment, but it's more focused on
49:47 deploying around GPUs, and there are some intricacies to doing that well.
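Here's a minimal sketch of the single-machine, many-GPU case with dask-cuda; it assumes a machine with NVIDIA GPUs and the dask-cuda package installed:

    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client

    # Starts one Dask worker per visible GPU on this machine
    cluster = LocalCUDACluster()
    client = Client(cluster)
    print(client)  # GPU workers are now available to Dask computations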
49:51 I imagine you can scale like crazy if you put the right number of GPUs on a bunch of machines and let
49:58 it go.
49:58 No, GPUs are actually pretty game-changing. It's fun.
50:02 Yeah. It's kind of near the limits of human understanding to think about how much computation
50:08 per second happens on a GPU.
50:10 I think CPUs are also well beyond the limits of human understanding.
50:14 But this is like the galaxy versus the universe, you know, like...
50:20 Sure.
50:20 They're both outside of our understanding.
50:22 I should like mention that I work for NVIDIA now, so please don't trust anything I say.
50:25 But yeah, GPUs are like fundamentally like a game changer in terms of computation. Like you can get a
50:30 couple orders of magnitude performance increase out of the same power, which I mean, from like a science
50:35 perspective, like has the opportunity to unlock some things that we couldn't do before. GPUs are also
50:39 notoriously hard to program, which is why most scientists don't use them. And so that's sort of
50:44 part of why I work at NVIDIA today. They're building out a GPU data science stack and they're using
50:49 Dask to parallelize around it. And so it's fun both to sort of grow Dask within NVIDIA and have
50:54 another large company behind it. But also from a science perspective, like I think we can do some
50:59 good work to improve accessibility to different kinds of hardware like GPUs that might dramatically
51:04 change the kinds of problems that we can solve, which is exciting in different ways.
51:08 Yeah, very incredible. And I would much rather see GPU cycles spent on that than on cryptocurrency.
51:13 Sure.
51:14 Although cryptocurrency is good for NVIDIA and all the graphics card makers,
51:18 it doesn't seem as useful as solving science problems on them.
51:22 All right. We're getting close to our time that we have left. So I'll ask you just a couple of
51:28 quick questions before we wrap it up. So we talked about where Dask is. Like, where is it going? It
51:34 sounds like there's a lot of headroom for it to grow. And NVIDIA is investing in it through the work
51:40 that you all are doing. And it sounds like they're also building something amazing. So yeah, what's next?
51:44 NVIDIA aside for a moment, I'll say that Dask technologically is sort of done. The core of it,
51:49 there's plenty of bugs and plenty of features we can add, but we're sort of at the incremental
51:52 advancement stage. I would say where I'm seeing a lot of change right now is in broadening it out
51:58 to other domains. So I mentioned Pangeo a little bit before. I think that's a good example of Dask
52:02 spreading out and solving a particular domain's problem. And we're seeing, I think, a lot more of
52:06 that. So I think we'll see a lot more social growth and a lot more applications to new domains.
52:11 That's really what I'm pretty excited about.
52:12 That's cool. So it's like Dask is basically ready, but there are all these other parts of the ecosystem
52:16 that could now build on it and really amp up what they're doing.
52:20 Yeah, let's go tackle those. Let's go look at genomics.
52:22 Yeah, for sure.
52:23 Let's go look at imaging. Let's go look at medicine. Let's go look at all of the parts of the world that
52:28 need computation on a large scale, but that didn't obviously fit into the sort of Spark,
52:33 Hadoop, or TensorFlow regime. And Dask, because it's more general, can probably handle those things a lot
52:38 better.
52:38 Yeah, it's amazing. I guess, let me ask you one more Dask question before we wrap it up real quick.
52:43 So Dask, is Dask 100% Python?
52:47 The core part, yeah.
52:48 Or near?
52:48 Yes. You're probably also using NumPy and Pandas, which are not pure Python.
52:51 Which, yeah, right, as we discussed, but yeah.
52:53 Dask itself is pure Python.
52:54 I think it's really interesting because you think of this really high-performance computing
52:59 sort of core, right? And it's not written in Go or C, but it's written in Python. And I just
53:06 think that's an interesting observation, right?
53:08 Python is actually not that bad when it comes to core data structure manipulation,
53:13 tuples, lists, dictionaries. It's maybe a factor of two to five slower, not a hundred times
53:17 slower. It's also not that bad at networking. It does that decently well. Socially, the fact
53:21 that Python had both a data science stack and a networking stack in the same language made it
53:26 sort of the obvious right choice. That being said, I wouldn't be surprised if in a couple of years,
53:30 we do rewrite the core bits in C++ or something, but we'll see.
53:33 Sure. Eventually. When it's kind of finely baked and you're looking for that extra 20,
53:37 30% here or there, right?
53:38 Yeah, we'll see. We haven't yet reached the point where it's a huge bottleneck to most people,
53:42 but the climate science people do have petabyte arrays, and sometimes the scheduler does
53:45 become a bottleneck there.
53:46 Right. It's cool. It's interesting. So final two questions. If you're going to write some
53:51 Python code, what editor do you use?
53:53 I use Vim, not for any particular reason. I just showed up at the computer lab one day and said,
53:57 how do I use Linux? And the guy there was a Vim user. So now I'm also a Vim user.
54:01 I think I haven't moved off just because I'm so often on other machines. So like rich editors just
54:07 don't move over as easily.
54:08 Yeah. I'm on the cluster. I need to edit something. It doesn't work to fire up PyCharm and do that so
54:13 easily.
54:14 Right. Yeah. So Vim for now, but not religiously, just anecdotally.
54:18 Sure. Sounds good. Yeah. I'm sure you do some stuff with Jupyter Notebooks as well every now and then.
54:23 Sure. I'll maybe also put in a plug for JupyterLab. I think everyone should switch from the classic
54:27 notebook to JupyterLab, especially if you're running Dask because you can get all those dashboards
54:31 inside JupyterLab really nicely. It's a really nice environment.
54:33 Oh yeah. That sounds awesome. All right. And then notable PyPI package. I'll go ahead and throw
54:38 pip install Dask out there for you.
54:40 So I'll cheat a little bit here. I'm going to list a few. First, in a theme of the NumPy space,
54:46 I really like how NumPy is expanding beyond just the NumPy implementation.
54:49 So there's CuPy, which is a NumPy implementation on the GPU.
54:53 There's Sparse, which is a NumPy implementation with sparse arrays. And there's Aura, which is a
54:58 new date-time dtype for NumPy. I like that all of these were built on the NumPy API but built
55:04 outside of NumPy. I think that kind of heterogeneity is something we're going to move towards in the
55:08 future. And so it's cool to see NumPy develop protocols and conventions that allow that kind
55:12 of experimentation. And it's cool to see people picking that up and building things around it.
55:16 So that's pretty exciting.
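To illustrate those protocols, here's a minimal sketch using the pydata/sparse package; the shape and density are arbitrary, and it relies on NumPy's dispatch mechanisms (__array_ufunc__ and __array_function__) handing the work to the sparse implementation:

    import numpy as np
    import sparse  # the pydata/sparse package

    # A mostly-empty 10,000 x 10,000 array stored in COO format
    x = sparse.random((10_000, 10_000), density=0.0001)

    # NumPy's API dispatches to the sparse implementation,
    # so the result stays sparse instead of densifying
    y = np.sum(x, axis=0)
    print(type(y))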
55:17 Those are great picks. It's cool that I learned one API and now I can do other things,
55:23 right? Like I know how to do the NumPy stuff, so now I can do it on a GPU.
55:26 Right. Or I've got my NumPy code. I want to have better date times. I can just add this new library
55:31 and suddenly all my NumPy code works, because of that sort of extensibility, which is something that we
55:35 need, I think, going forward.
55:37 Yeah, absolutely. All right, Matthew. Well, this was super interesting. Thank you for being on the show.
55:41 You have a final call to action. People want to get involved with Dask. What do they do?
55:44 Try it out. You should go to examples.dask.org. And there's a launch binder link on that website.
55:51 If you click on that link, you'll be taken to a JupyterLab notebook running in the cloud,
55:54 which has everything set up for you, with a bunch of examples. So you can try things out,
55:57 see what interests you, and play around. So again, that's examples.dask.org.
56:00 Yeah, and I also link to the Dask examples GitHub repo that you have on your account
56:04 in the show notes. Yeah, so people can check that out too.
56:07 Awesome. All right. Well, thank you for being on the show. It was great to chat with you and
56:10 keep up the good work. It looks like you're making a big difference.
56:13 Great. Thank you so much for having me, Michael. Talk to you soon.
56:15 Yep. Bye.
56:15 This has been another episode of Talk Python to Me. Our guest on this episode was Matthew Rocklin,
56:22 and it's been brought to you by Linode and Backlog. Linode is your go-to hosting for whatever
56:27 you're building with Python. Get four months free at talkpython.fm/Linode. That's L-I-N-O-D-E.
56:34 With Backlog, you can create tasks, track bugs, make changes, give feedback, and have team
56:39 conversations right next to your code. Try Backlog for your team for free for 30 days using the special
56:45 URL talkpython.fm/Backlog. Want to level up your Python? If you're just getting started,
56:52 try my Python Jumpstart by Building 10 Apps course. Or if you're looking for something more advanced,
56:58 check out our new async course that digs into all the different types of async programming you can do
57:03 in Python. And of course, if you're interested in more than one of these, be sure to check out our
57:07 everything bundle. It's like a subscription that never expires. Be sure to subscribe to the show.
57:12 Open your favorite podcatcher and search for Python. We should be right at the top. You can also find
57:17 the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at
57:23 /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really
57:29 appreciate it. Now get out there and write some Python code.
57:31 Bye.