#361: Pangeo Data Ecosystem Transcript

Recorded on Friday, Apr 1, 2022.

00:00 Python's place in climate research is an important one. In this episode, you'll meet Joe Hamman and Ryan Abernathey, two researchers using powerful cloud computing systems and Python to understand how the world around us is changing. They are both involved in the Pangeo project, which brings a great set of tools for scaling complex compute in the cloud with Python. This is Talk Python to Me, episode 361, recorded April 1, 2022.

00:39 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy, keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. We've started streaming most of our episodes live on YouTube. Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.

01:04 This episode is sponsored by SignalWire and Sentry. Transcripts for this and all of our episodes are brought to you by AssemblyAI. Do you need a great automatic speech to text API? Get human level accuracy in just a few lines of code. Visit talkpython.fm/assemblyai. Joe, Ryan, welcome to Talk Python to Me.

01:24 Thanks so much for having us.

01:25 Hey, it's great to be here.

01:26 It's fantastic to have you here. I'm really excited to talk about Earth science and all the cool large scale computing stuff and cloud computing and things like that with you. It'll be a lot of fun, so I'm really looking forward to getting into that. Now, before we dive into the topic, let's just start with your story. Joe, I guess you go first. How did you get into programming and Python?

01:47 My path came through grad school. I was studying civil engineering and climate modeling as a graduate student at the University of Washington, and I was in a computational hydrology group, so we were doing lots of computer things. My PhD adviser at the time was like, we want to do Python stuff. I don't know much about it. You should be the guinea pig student to bring our group into the modern era. So that kind of threw me to the wolves, and I ended up taking on a role where I not only learned it, but then started teaching other people. I ended up contributing to open source packages, and the rest is history.

02:22 Oh, that's fantastic. What were you using before? You said you brought them into the modern era. Where were you coming from? What was the Dark Ages?

02:29 Some terrible mix of Perl and C shell and C and Fortran and a bunch of other shell scripting languages. So it was total spaghetti land.

02:40 Wow, that is a spaghetti land. I would say Python and Jupyter. And the stack probably sounds a little simpler.

02:45 Yeah, the connections are a little more natural, for sure.

02:48 Indeed. How did people receive it?

02:50 It's been great. I think for me personally, it was quite a revolution in what was possible. But then passing it around the lab and then around the research community since then, it's been overwhelmingly positively received, and we're doing totally different things than we could have done without it. And I think that's the biggest thing. It's not just that it's a little easier to program, but that you can do things that you couldn't have done before.

03:13 Yeah. Were people worried, coming from C and Fortran, that Python wasn't fast enough? Were they like, we can't use this, it's one of these slow scripting languages? Or had you already proven that it was fine?

03:24 No, I think you hear that on occasion, but I think the developer velocity outweighs that in 99% of cases. You can always optimize the 1% case further, for sure.

03:35 Ryan, how about you? How did you get into programming and Python?

03:37 I've kind of been a lifelong programmer. I was actually just thinking back when you asked that. I think I wrote my first BASIC code at age seven. My dad worked for IBM, so I've really been a kind of lifelong computer nerd. I drifted away from that a little bit in college, where I majored in physics. But then in graduate school, I did a PhD in climate physics and chemistry at MIT, and I had a huge need for scientific computing in that. So all of a sudden my computer stuff started really coming back to the center of my world. It was a MATLAB shop. MATLAB and Fortran was the stack there. After I did my first project around 2006, 2007 in MATLAB, I decided, you know, I'd been doing open source hackery in other languages for many years, and I was like, oh, I need an open source scientific computing solution. So I tried Python, and got into Python in 2008. So I've been at it long enough to have compiled NumPy.

04:38 Yeah.

04:38 That was really when NumPy was basically coming out, right around the time of Anaconda and the other one. Then I just rolled with it ever since. I was the early adopter Python guy around there. I helped get a lot of other people into it, but it was still always just in my own projects.

04:56 Right.

04:56 Like there was no community.

04:58 I really got into open source community development probably around 2014, 2015, when I discovered Xarray. That is the project that really turned me from just a user into a contributor.

05:11 Fantastic.

05:12 Yeah.

05:12 That's like multidimensional NumPy goodness, right?

05:18 Was that other distribution, besides Anaconda, Canopy? I see that out in the audience. Thank you, Erie.

05:25 Cool.

05:25 The other one I was thinking of was ActiveState. So, yeah, there are these different distributions people can get, optimized for different stuff.

05:32 Great.

05:32 Ryan, what are you doing day to day? It sounds like you both are still doing research and university-type things.

05:37 Absolutely. So I'm a professor here at Columbia University in the Lamont-Doherty Earth Observatory, and I manage a medium sized scientific research lab and teach at the university. But then I wear this total other hat as an open source developer and contributor. It can sometimes be a little exhausting to try and wear both those hats at the same time, but I really enjoy it. Our work in our lab is focused on computational oceanography, trying to understand the role the ocean plays in the climate system, particularly the role of small scale ocean processes, the eddies, fronts, and instabilities that occur at, say, 10 to 100 km scale, and how that sort of turbulence and variability influences the large scale ocean and the role it plays in our changing climate.

06:26 Right.

06:26 The way we do that research is by working with large scale satellite data from NASA. I'm involved in the NASA Surface Water and Ocean Topography science team, a new satellite mission that's launching this year, and with a ton of numerical modeling, simulating the ocean with computers. At the end of the day, we're doing data analysis, some of which is traditional statistics and visualization. Increasingly, machine learning, AI and ML, are part of the toolkit that we use to try and understand the ocean. But the bottom line is just working with a lot of data every day on diverse projects. That has really forced me to center the role of these tools in our work and recognize them as our instrument. A lot of my colleagues in other buildings at our lab will have a million dollar mass spectrometer. They have an instrument they use for science, some crazy laser or something.

07:20 Right.

07:20 And they turn the crank. We're a data driven lab. And so I think Pangeo, in a way, is our instrument that we all contribute to and maintain, and that then helps us do all the research projects that we want to pursue.

07:34 Yeah. How fascinating. You talked about running a research lab and then also doing this open source thing, this dual hat thing. The world may have changed, but when I took a couple of computer science classes at university during my undergraduate degree, I didn't feel like a lot of the instructors or professors there really had much real world experience in programming. And I think things like this, like contributing to Xarray, must give you this really grounded sense of not only these are the tools that you can use, but here's how it works, these are the people doing it, you're in the trenches. Do you think that makes a big difference?

08:10 I think we have a major education problem around computational science. And I say this as a university professor: we have no curriculum to teach someone how to be an effective computational scientist. And particularly in the context of open source, community driven software development, we have computer science classes that will teach you a lot of great things about algorithms and data structures and even machine learning, but they won't teach you how to write effective software. And as you said, currently we assume you have to learn that in the trenches, just getting into a project, or maybe you work at a company. For me, definitely, I upped my software engineering game so much after I got involved in community open source, because there were people like Stephan Hoyer, a Google staff engineer, who were reviewing my PRs.

08:58 Right.

08:58 So that was huge. But I wish actually the University could teach this skill set because I think it would help the world a lot.

09:05 Yeah, I think it would as well. I think we can absolutely find space in the curriculum for it. I think math might need to give up a little bit to allow for computational math rather than symbolic math, if I could pick. I don't want to derail us too much, because Joe needs a chance to talk as well. But I think geometry, going through proofs and thinking about how to take axioms and use proofs to solve problems, is exactly the same mindset and way of thinking as writing a computer program or solving a problem with a program. And I haven't applied geometry that much, but I've sure applied a lot of computing. So anyway, I'll put that out there for people.

09:43 You might be a little biased.

09:45 I may be a little biased, although I do have two math degrees, for what it's worth, but I'm still willing to put geometry out there. Anyway, Joe, how about you? What are you doing these days?

09:53 Yeah, so I wear a couple of hats as well. My main hat is as the technology director at CarbonPlan, a nonprofit that's working on the climate problem. Our focus is really on improving the transparency and the scientific integrity and quality of climate solutions. We do that by building open source tools and open data and doing research into the various climate solutions that are out there. So we use a lot of open source software to do that, and we build a lot of open source tooling, including software to help tell stories about different climate solutions. The other hat is that I'm a scientist at the National Center for Atmospheric Research, where that part of my role looks a lot like Ryan's day to day. We study different things, but I work with big climate model data, do data analysis, all of that.

10:38 Yeah. Sounds really fascinating. At CarbonPlan, what kind of data do you all use? Do you hook into electric grids? Do you analyze the mixture of electric grids or transportation? What kind of problems are you solving there?

10:52 It's a bit of everything we do, big data, little data and everything in between.

10:57 One of the areas we spend a lot of time working in is forest offsets, trying to ask questions about the quality and the potential of using forests as a climate solution. There we're using everything from in situ observations, where people go out and measure trees with a tape measure, and they do that every five years. That's actually a fairly small data set. Even though there are a lot of trees out there, there are only so many measurements you can make. And then we also work with climate model data. We study climate risk to forests, and so there we're building models of future forest fire risk and trying to understand how that's going to change in the future. So that's the big data stuff. We're doing lots and lots of model building and machine learning and that sort of thing on top of all that data.

11:42 Sounds fascinating. You and I are both in the Pacific Northwest, and last year was not too terrible for fire season. But recently we've had some pretty bad forest fires up here. And it's definitely concerning.

11:55 Yeah. It definitely feels like I'm studying the world around me more than ever when we're working on these fire problems and then also experiencing the smoky weather that we've been having in the summers and the West Coast last year.

12:07 Yeah, it's been pretty crazy.

12:08 And I mean, I think that's a theme in climate science right now. It's gone very quickly from this academic problem to something that so much of society and our economy is engaging with. Companies are just getting to work on adapting to climate change because they are feeling it in their bottom line. And it's different than things were ten years ago.

12:31 Well, that's good to hear. I'm both pessimistic and optimistic about how things could go.

12:37 There are so many cool discoveries. Folks like you are using Python and computation to really understand exactly what's happening and keep your finger on the pulse of where things are going. And then I also see a mom and her one small kid in a Chevy Suburban next to me idling in traffic. And it's like, I don't know, people do have to internalize it, I think, a little bit more. But I wonder if maybe businesses are starting to see it and starting to react a little bit sooner, right? I think companies feel the pressure of economics sooner than people do a lot of the time.

13:16 No, I think that's true. I think there are also social pressures that are pushing companies to act sooner. And so sometimes it's altruistic, but also, I think there's marketing involved and a range of other...

13:29 Right. Just don't want to look like the bad company if they can put on a good image and it's worth it. But whatever gets them to do it, I don't care if it's.

13:37 Yeah, that's what I was just going to say. We don't have to shame that. I think that there's a lot of action that needs to happen in a lot of sectors right now. And so to the extent we can motivate that through one mechanism or another that sounds like a good idea.

13:51 This portion of Talk Python to Me is brought to you by SignalWire. Let's kick this off with a question. Do you need to add multiparty video calls to your website or app? I'm talking about live video conference rooms that host 500 active participants, run in the browser, work within your existing stack, and even support 1080p without devouring the bandwidth and CPU on your users' devices. SignalWire offers the APIs, the SDKs, and edge networks around the world for building the real estate of real time voice and video communication apps with less than 50 milliseconds of latency. Their core products use WebSockets to deliver 300% lower latency than APIs built on REST, making them ideal for apps where every millisecond of responsiveness makes a difference. Now you may wonder how they get 500 active participants in a browser based app. Most current approaches use a limited but more economical approach called SFU, or selective forwarding units, which leaves the work of mixing and decoding all those video and audio streams of every participant to each user's device. Browser based apps built on SFU struggle to support more than 20 interactive participants, so SignalWire mixes all the video and audio feeds on the server and distributes a single, unified stream back to every participant. So you can build things like live streaming fitness studios where instructors demonstrate every move from multiple angles, or even live shopping apps that highlight the charisma of the presenter and the charisma of the products they're pitching at the same time. SignalWire comes from the team behind FreeSWITCH, the open source telecom infrastructure toolkit used by Amazon, Zoom, and tens of thousands more to build mass scale telecom products. So sign up for your free account at talkpython.fm/signalwire and be sure to mention Talk Python to Me to receive an extra 5000 video minutes. That's talkpython.fm/signalwire, and mention Talk Python to Me for all those credits.

15:41 We're getting maybe down this climate rabbit hole, but this is probably the most important issue of our time, so let's go down it. You've got to distinguish between what we call mitigation and adaptation, right? Mitigating climate change is doing things like burning less fossil fuels that are going to reduce the potential impacts of climate change. Adaptation is just accepting that climate change is happening, is going to happen, and changing our behavior, like infrastructure. And so when I see a lot of companies taking action, I see especially a lot of companies taking action, from where I sit, on adaptation: using system data, using our projections from our climate models to make business decisions under this changing climate. Mitigation is what we've been calling for for decades, and that's where the Chevy Suburban comes in. I guess I really push back against the idea that personal choices are an important part of mitigation. This is a narrative that has actually been counterproductive. I don't think we need to rely on personal ethical choices about which type of bags to bring to the grocery store. I mean, it's important, but this is a very large scale problem.

16:50 It's solving the problem very much on the edges when you're just going. But there's a huge middle part.

16:55 What we need is global scale regulation around carbon emissions in order to mitigate climate change. And that's a political problem.

17:02 It is. Well, the renewable energy story seems to be coming on faster than people thought recently, so there is a lot of hope in that space. Now, you both mentioned a little bit of this blend of the open source side of things and the science side of things. Let's talk for a moment about some general best practices with open source and science. One of the things, I guess, Ryan, is that you talked about having these high end software engineers reviewing your code. And I suspect there are a lot of lessons you've learned that you can bring back to the science world from that open source experience.

17:43 I think there's this whole spectrum of open science and open source activities, right? Right now, it's pretty common in scientific fields to encourage research projects to publish their research code under an open license or put it on GitHub or something like that. I see that as just a very first step towards a much more transformative way we do science as a community. As you all know, just putting a repo up on GitHub has essentially no impact, right?

18:18 No one uses it. If someone pushes a commit in a forest and no one hears it land, okay, great, you checked the box. The real goal of open science is to encourage more reuse, more collaboration, and accelerate the velocity of scientific discovery. And that takes more than just putting your code out there. Of course, putting your code out there is the first step, but it actually takes making sure that people can run it, that they have the environment for it, that they can access the data it needs to run, that they understand what it can do, and that it is coded in a way that is extensible and modular. All of those things are a lot more than a license. They're about essentially writing good scientific code. And so I do think just the process of getting involved in open source is a huge form of education for scientists about how collaboration can work, not just in code, but in general. The way the collaboration process works in a well functioning open source project is kind of miraculous.

19:21 Yeah, it absolutely is. There are a lot of barometers people use when they go and look at an open source project to decide, can I trust this thing? The obvious ones are how many stars and forks it has: is it a popular thing that people seem to care about? But others are, does it have tests, or does it seem to be operated in a way that is going to lead to contributors being able to contribute and the software evolving over time in a way that they could depend upon? Right?

19:48 Yeah. I think another thing here that's really important to think about is what the incentive structures are for a researcher working on a scientific programming problem.

19:57 Absolutely.

19:57 For most graduate students or researchers at institutions like the ones Ryan and I work at, the goal is to write a paper or to produce a data set. And the software has been kind of thought of as a tool you use to get there, but not necessarily a tool that you pick up and improve along the way. I think one of the things that we've been trying to do is break that pattern a little bit and think of the whole ecosystem of tools that we're working with as improvable, so that we don't have to reinvent the wheel as individual researchers.

20:29 Maybe just speaking for myself, it's being able to say, okay, rather than take the shortcut path to get to the end of the paper, I'm going to improve the ecosystem so that later on I can reuse this improvement, and Ryan and others in the community can reuse it too. That's a fundamentally different way of thinking about the tools you're using.

20:48 That's interesting. It sure is. I've only spent a couple of years in that space, but my experience was so much of the code, at least traditionally, had been written just to solve a very focused problem and not in a way that could be adapted to future problems. Right. It'd be like, well, we're changing the algorithm. It's slightly different data. So we'll make a copy of the script and we'll copy it over there, and we'll just like, maybe there's not a single function in the whole thing. It's just top to bottom. And I suspect adopting some of these techniques to sort of produce more of a library out of it, even if you put it on GitHub and nobody comes, it still would benefit you and your research over time, I would imagine.

21:27 And I would say that is a big part of why Python is a good tool for science, because it is easy to build higher level abstractions. Coming from the MATLAB world, you've basically got MATLAB and its toolboxes, and then you've got your scripts, and there's this hard divide between the platform and the tool and your own work. You're used to thinking, well, these are the primitives provided by the tool, and here's what I have to do. But Python allows you to build very flexibly and has this great ecosystem. I think the segue is naturally into Xarray. Many of us, say in 2014, had our own private version of code that did what Xarray did, thinking, that's code that needs to live in user space, that's not a package provided by the ecosystem. But then once Xarray started to catch on and we realized how powerful and how cool it was and what a solid foundation it had, many of us immediately stopped working on our own private Xarray-like thing and started contributing to Xarray. And we've seen over, say, the past five or six years, really steady growth in the capabilities of Xarray, both in terms of features and robustness, that we never would have achieved if there hadn't been that coalescence around, okay, we're all going to work together on this.

22:47 Yeah. And then you have these knock-on effects, right? There are now other libraries and other systems that use Xarray, so if you're programming against it, it's super easy to plug into. It's kind of like what Pandas and Dask are doing. If you program against Pandas, you kind of automatically get this scaled-up version, because Dask is just Pandas, but more.

23:06 And you don't have to write code to read CSV.

23:10 Yeah, exactly. So I gave the elevator pitch for Xarray, but let's go ahead and dive into that, and maybe give us the story of Pangeo, where Xarray is one of the things covered under that umbrella.

23:25 Yeah.

23:25 Whoever wants to take it.

23:26 Yes, I'll start. But I think just to take maybe a slight step back and say what Xarray is one more time: Xarray is a Python package for working with multidimensional labeled arrays and datasets. It integrates with NumPy and Pandas. In many ways, you can think of it as a multidimensional Pandas. It's used really widely in the climate science community and the geosciences, but it's also used in fields outside of the geosciences.
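
To make "multidimensional labeled arrays" concrete, here is a minimal sketch. The variable name `sst`, the grid, and the values are made up for illustration and do not come from the episode. It shows selecting by coordinate label and reducing over a named dimension rather than an axis number.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical sea surface temperature values: 3 days x 2 latitudes x 2 longitudes.
values = np.arange(12, dtype=float).reshape(3, 2, 2)

sst = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2022-01-01", periods=3),
        "lat": [10.0, 20.0],
        "lon": [100.0, 110.0],
    },
    name="sst",
)

# Select by coordinate label instead of by integer position.
point = sst.sel(lat=20.0, lon=100.0)

# Reduce over a named dimension instead of remembering which axis is time.
time_mean = sst.mean(dim="time")
```

With plain NumPy the same steps would be `values[:, 1, 0]` and `values.mean(axis=0)`, which work only as long as you remember what each axis and index means.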

23:53 Right. It could be finance or all sorts of.

23:55 Yeah. Finance, biomedical, bioimaging, etc.

24:00 Give us a sense of the data that you might load up for something in oceanography.

24:04 I think our go-to data set for oceanography is sea surface temperature. Satellites observe the ocean from space. Infrared or microwave observations can tell how warm the water is. That gets processed by NASA. They distribute essentially a bunch of NetCDF files that are up on basically an HTTP or FTP server, one file per day for the past 30 years, at quarter degree resolution. Each file is a couple of megabytes or something like that. We want to do an analysis on that data, right? And so Xarray can open that individual file, but it can also open that collection of thousands of files as one coherent dataset object.

24:44 Interesting. So do you give it something like a directory and a file pattern and it just somehow does a sort and then loads them up?

24:52 It can do a glob, or it can accept a list. That is one of the killer features of Xarray that I think brought a lot of people into it, because we were all kind of used to writing code around files. Like, okay, I've got this analysis, here's 100 files, for each file in my list of files, do this. And instead, the workflow changes with Xarray. It's like, okay, open multi-file dataset, mean, done, right?

25:18 And so it's just like this cognitive load that's lifted.
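
The shift from "loop over files" to "one dataset, then reduce" can be sketched as follows. To keep the example runnable without any files on disk, it builds small in-memory stand-ins for the daily files and combines them with `xr.concat`. With real NetCDF files, `xr.open_mfdataset` (which accepts a glob string or a list of paths) plays that combining role. The helper `make_daily_dataset` and all values are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

def make_daily_dataset(day, value):
    """Stand-in for one daily file: a single time step on a tiny 2x2 grid."""
    return xr.Dataset(
        {"sst": (("time", "lat", "lon"), np.full((1, 2, 2), value))},
        coords={"time": [day], "lat": [0.0, 0.25], "lon": [0.0, 0.25]},
    )

days = pd.date_range("2022-01-01", periods=4)
daily = [make_daily_dataset(day, float(i)) for i, day in enumerate(days)]

# File-by-file style: loop over the pieces, accumulate, then divide.
total = sum(ds["sst"].isel(time=0, drop=True) for ds in daily)
loop_mean = total / len(daily)

# Combined style: treat the pieces as one dataset along time, then reduce once.
combined = xr.concat(daily, dim="time")
combined_mean = combined["sst"].mean(dim="time")
```

Both routes give the same time mean. The difference is that the combined form has no bookkeeping code, and more complex analyses stay just as short.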

25:22 That's cool. Yeah. Especially in the data science space, I see a lot of these things, and it's almost about learning about the packages and the ways that you can use them. An example that really quickly comes to mind for me: if I wanted to get a table out of an HTML page on a website and then pull that in and process it, I could go get the page with requests. I could do some Beautiful Soup thing to find the table. And then I could, I don't know, try to parse it and convert the elements. If they were really supposed to be numbers, you've got to parse them as numbers and then get that into some data structure. Or you go to Pandas and you say read_html, bracket two, or something like that. Those kinds of things seem to appear so much in these data science libraries. Like, oh, you could do this big, long computer science thing, or you could call this function over here and get the same outcome. And it's really about knowing that those exist, right?

26:15 Yeah, totally. So, okay, to go back to your original question. We can come back, I think we've spent a long time on Xarray, but I wanted to make the connection since you have the Pangeo website up on the screen. After a couple of years of working on Xarray, a handful of us were starting to think, okay, we're onto something here.

26:31 This is really the beginnings of kind of a platform for doing research. So we all got together at Columbia University. Ryan hosted a workshop in late 2016, and I don't even remember what the name of the workshop was, Atmospheric Ocean Sciences or something like that. It was a name that got dropped pretty quickly. It was probably 20 of us that worked mostly on Xarray and Dask, a mix of software developers and scientists. We got together and shared the use cases and the problems that we were wrestling with. Out of that grew the Pangeo project and a few ideas. The mission that you read on the website today is what we wrote that weekend, which is to try to tackle a few key problems that were facing our community, mostly big data and reproducibility, and really aiming at supporting the software ecosystem that connected all those dots. Since then, the Pangeo project has grown into a wider community project that has a lot of software packages involved, not just Xarray and Dask. That's the origin story. It really started with Xarray as the beginnings, and grew from there.

27:43 Okay, yeah, very cool. Maybe we could talk a little bit about the other packages besides Xarray. There's a list of packages on the website, and one is Iris. I know you don't do too much with Iris, but maybe just tell us really quick about that.

27:55 It's at the same level of the stack as Xarray, so you're probably using either Xarray or Iris. Iris, I would say, is maybe a little bit more opinionated than Xarray, in the sense that it's scoped to geo data, or data that has things built into it that are more specific to that domain.

28:16 Like some of the specific file formats, which I'm not familiar with, like GRIB and those kinds of things?

28:22 No, actually, Xarray gets all of those file formats.

28:24 Okay.

28:25 I think it's more about the API, like an understanding of what latitude and longitude actually mean, and supporting things like regridding directly rather than, say, through third party packages like we would use with Xarray. It's a great project, and it is in many ways very complementary to Xarray and highly interoperable as well. You can go from Xarray to an Iris dataset,

28:48 and from an Iris dataset to Xarray. You can think of them all as wrappers, higher level data structures around arrays, right? If you work with NumPy at any level, you've probably had a dictionary with NumPy arrays in it, multiple different arrays you want to keep together. And at that point, I would say just use Xarray whenever you start programming that pattern.

29:12 Yeah. Because that's basically what Xarray is, right?

29:15 How do you label it? That's the keys and then multi dimensional is multiple arrays, right?

29:21 Yes. And then understanding relationships between these, and then metadata is another huge part of this, right?

29:27 Both Xarray and Iris, and anything in this space, are going to really understand the metadata that comes with those things, so things like units or conventions that tell you how the variables are related to each other. And then they can do things with that metadata computationally, not just drag it around for posterity, but actually leverage it to enable certain syntax or certain computations.
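
As a rough illustration of moving from a dictionary of NumPy arrays to a single labeled container that carries metadata, consider the sketch below. The variable names, values, and unit strings are invented for the example. The `units` attribute here is just a plain entry in each variable's `attrs` dictionary.

```python
import numpy as np
import xarray as xr

# The "dictionary of NumPy arrays" pattern: nothing ties these arrays together.
loose = {
    "temperature": np.array([[280.0, 281.0], [282.0, 283.0]]),
    "salinity": np.array([[35.0, 35.1], [34.9, 35.2]]),
}
units = {"temperature": "K", "salinity": "psu"}  # illustrative unit strings

# The same arrays as one Dataset with shared, labeled dimensions.
# Each variable is given as (dims, data, attrs), so metadata travels with it.
ds = xr.Dataset(
    {
        name: (("lat", "lon"), arr, {"units": units[name]})
        for name, arr in loose.items()
    },
    coords={"lat": [10.0, 20.0], "lon": [100.0, 110.0]},
)

# One label-based selection applies to every variable at once.
row = ds.sel(lat=20.0)

# Metadata can drive logic, for example picking variables by their units.
kelvin_vars = [
    name for name, var in ds.data_vars.items() if var.attrs.get("units") == "K"
]
```

The selection keeps each variable's attributes, so `row["salinity"].attrs["units"]` is still available after slicing.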

29:51 You filter by all the ones that are tagged by state or whatever.

29:55 Okay. Then the next one under the overall Pangeo banner is Dask. I've had Matthew Rocklin on the show before to talk about Dask, but it's been a while, so maybe tell folks about Dask.

30:08 Yeah. So Dask is a library for doing parallel computing in Python, and it has a bunch of different containers. There's Dask array, which is what Xarray uses. But there's also Dask DataFrame, which does parallel chunked operations on Pandas data frames. And then there's the catch-all, the Dask bag, which does graph style parallel computing. Where this comes in for Xarray, and actually for Iris as well, since we were just talking about Iris, is that the arrays in an Xarray dataset can be backed by a Dask array instead of a NumPy array. And by just swapping that out, it's almost a behind-the-scenes swap. You do a chunk on your Xarray dataset, and then your operations are going to be handled by Dask, which means they're going to be streamed through the scheduler. You'll be able to scale out to a cluster of workers, and instead of, say, gigabyte scale operations, you can do terabyte scale or even, someday, petabyte scale data analysis. So Dask is the thing that gives Xarray its horizontal scalability.
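
The "behind the scenes swap" can be sketched like this, assuming both Xarray and Dask are installed. The array contents and the chunk size are arbitrary choices for the example. Calling `.chunk()` replaces the NumPy array with a Dask array, later operations only build a task graph, and `.compute()` runs it.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(24, dtype=float).reshape(6, 4),
    dims=("time", "x"),
    name="signal",
)

# Swap the in-memory NumPy array for a Dask array split into chunks of 2 time steps.
chunked = data.chunk({"time": 2})

# Same API as before, but this only builds a task graph; nothing is computed yet.
lazy_mean = chunked.mean(dim="time")

# compute() hands the graph to a Dask scheduler and returns a NumPy-backed result.
result = lazy_mean.compute()
```

On a laptop this runs on a local scheduler. The same calls work unchanged when a distributed cluster is attached, which is the horizontal scalability described above.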

31:15 Yeah, very cool. So scaling across machines. Now, when I learned about Dask, I saw it as: it's like local Pandas, or the other types of things that it models, but you set up a cluster and it runs there. And then when I spoke to Matthew about it, he pointed out that it's useful even on a single machine some of the time. Right? You've got a ton of data but not enough RAM to hold it, or even, like, my pretty simple computer here has eight cores. If I run something on Pandas, I get one core's worth of processing power.

31:50 Right.

31:51 So maybe, Joe, you're shaking your head like tell people about that use case.

31:55 Yeah.

31:55 So Dask has a bunch of schedulers, and some of those are local schedulers that run on a single machine, and they can use either Python's multiprocessing module or multiple threads to do computation. It also has distributed schedulers that might live on Kubernetes or on an HPC machine. There are now companies, too: Matt Rocklin has gone on to start Coiled, which has managed Dask clusters for you. But the idea is that at a small scale, when you're using the threaded scheduler, it's going to stream computation. So when you're, say, taking the average of a terabyte-sized array, it's going to use its chunks and process those chunks one at a time, and then aggregate those processed chunks into the final result.
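
A small sketch of choosing a local scheduler, assuming dask (with its array extras) is installed. The array size and chunking are arbitrary choices for illustration.

```python
import dask.array as da

# A lazy 4000 x 4000 array split into sixteen 1000 x 1000 chunks.
x = da.ones((4000, 4000), chunks=(1000, 1000))
total = x.sum()  # builds a task graph, computes nothing yet

# Run the same graph on the local thread-pool scheduler...
threaded = total.compute(scheduler="threads")

# ...or on the single-threaded scheduler, which is handy for debugging.
serial = total.compute(scheduler="synchronous")

print(x.numblocks, threaded, serial)
```

Dask also accepts `scheduler="processes"` for the multiprocessing-based local scheduler; it is left out here only because process pools need extra care when launched from a plain script.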

32:36 Yeah. A lot of times, if it's the simple path, if you go find some tutorial or example code or something on Stack Overflow, it's just like, well, first you just load this up, you read the CSV or you load the JSON file, and then you go over it like this. But I have a terabyte of data and 16 gigs of RAM, so you need this sort of iterative streaming style to get there.
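
To make the streaming idea concrete, here is a plain-Python sketch of a chunked mean. It is not how Dask is implemented; it only illustrates the pattern of reducing each chunk and aggregating the partial results, with a made-up generator standing in for data read from disk.

```python
from typing import Iterable, Iterator, List


def iter_chunks(n_values: int, chunk_size: int) -> Iterator[List[float]]:
    """Yield the values 0, 1, ..., n_values - 1 one chunk at a time."""
    for start in range(0, n_values, chunk_size):
        stop = min(start + chunk_size, n_values)
        yield [float(v) for v in range(start, stop)]


def streaming_mean(chunks: Iterable[List[float]]) -> float:
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)   # partial result for this chunk
        count += len(chunk)
    if count == 0:
        raise ValueError("cannot take the mean of zero values")
    return total / count      # aggregate the partial results


mean_value = streaming_mean(iter_chunks(n_values=1_000_000, chunk_size=10_000))
print(mean_value)
```

Only 10,000 values exist in memory at any moment, even though the mean covers a million of them.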

32:59 The brilliance of Dask, the game-changing flavor of Dask, is that in many cases the user doesn't really have to rewrite their code at all to scale out. Typically with Xarray, when we teach it and we really want to give people a sense of the power, we start by downloading, like, a ten-megabyte file, opening it with Xarray, and doing some analysis, and then they learn the API and they use it. And then we point people to, like, a massive 100-gigabyte data set in the cloud and a Dask cluster, and we say, write the same code, and it just works, and it's pretty fast, and it is able to scale out without really any expertise on the user's side about distributed computing. I love that feature. On the other hand, I've also come around to the feeling that sometimes it's a double-edged sword, because some things actually just fail if you don't think hard about the parallelization strategy. It's not magic; it depends on the operation that you want to do. And so the flip side of that ease of parallelization is that sometimes users will think Dask is smarter or more capable than it really could ever be, and expect it to just automatically parallelize anything, even, say, I/O patterns that are just not parallelizable, not scalable, or other operations that can't be accelerated.

34:24 Yeah, it seems like it just automatically fixes the problem. It just makes it faster with magic programming dust, and then obviously at some point that comes undone. Right.

34:35 This portion of Talk Python To Me is brought to you by Sentry. How would you like to remove a little stress from your life? Do you worry that users may be encountering errors, slowdowns, or crashes with your app right now? Would you even know it until they sent you that support email? How much better would it be to have the error or performance details immediately sent to you, including the call stack and values of local variables and the active user recorded in the report? With Sentry, this is not only possible, it's simple. In fact, we use Sentry on all the Talk Python web properties. We've actually fixed a bug triggered by a user and had the upgrade ready to roll out as we got the support email. That was a great email to write back: hey, we already saw your error and have already rolled out the fix. Imagine their surprise. Surprise and delight your users. Create your Sentry account at talkpython.fm/sentry, and if you sign up with the code 'talkpython', all one word, it's good for two free months of Sentry's business plan, which will give you up to 20 times as many monthly events as well as other features. Create better software, delight your users, and support the podcast. Visit talkpython.fm/sentry and use the coupon code talkpython.

Coiled, Matthew Rocklin's company that he started with some other folks, I believe. And this is a really interesting story. Like, we talked about how you've got your Xarray-like code or your Pandas-like code that you just wrote for yourself. And then by adopting one of these libraries, and maybe even contributing to it and building it up, you get these knock-on effects. Right? So I gave the Pandas to Dask example. Here's the next step in that chain. Right? Like, now you have, oh, I can just spin up a cluster on the cloud automatically through Coiled with one or two lines of code, because I built on Dask, because I built on Pandas.

36:31 That chain just keeps going, this synergy between all of them.

36:35 Absolutely. And I mean, I would say it's not just coincidental. Matt Rocklin was at that Pangeo meeting at Columbia in 2016 or whatever. And actually, we've watched this evolution. So a big part of what we have been doing in Pangeo is experimenting with cloud computing, I think a little bit earlier, and in a different way, than a lot of the rest of the scientific community was. I think we can thank Matt for that to some degree, because the story with Pangeo Cloud is that after that workshop, we wrote our first proposal to the NSF, and we got a grant to develop some of this stuff, develop Pangeo, and support scientific use cases. And what we had put into that original grant was a bunch of servers that we wanted to buy to host data and run.

37:23 That's what you do.

37:24 Like when you write a scientific grant. And NSF asked us to trim our budget, and we decided, okay, we'll cut out these servers, but why don't you give us some credits on the cloud? Because at the time, NSF was running this pilot program called Big Data, where they had a partnership with Google and Amazon and so on. And so we got like $100,000 worth of Google Cloud credits. And we just started playing around to see how well we could make this stack work in the cloud. And Matt was instrumental, actually. At that time, he was really involved and was helping us figure out how to deploy stuff on the cloud. And we learned all about Kubernetes and object storage and all of this stuff. And it was incredibly fortunate for us to have that, because I think we really figured out a lot of stuff early on about how scientific research can interact with cloud computing. And that's where a lot of our focus and energy is today.

38:15 That's cool.

38:15 Yeah.

38:15 There's a lot of interesting things about large data sets. Right. You can put them, as you said, in object storage, and then people can come into that cloud and use the data without trying to download it or move it around. Some of these data sets are terabytes. Right. What are you going to do to get those shared? Right?

38:31 Yeah, I know some of them are petabytes at this point.

38:34 Terabytes, you can do. Petabytes, that might be a little bit beyond what your ISP is going to let you download, right?

38:41 Yeah. Where are you going to put it?

38:42 Where can you get a hard drive for it? Right. Where are you going to put it?

38:44 Yeah, exactly. The thing that our work on cloud computing really unlocked for us was this idea that we could federate access not only to compute (everyone kind of knew you can spin up a VM), but also to the data. Federated access to the data in a way that is infinitely scalable, both in terms of access and in terms of storage, is a total game changer. And so as we've gone down this rabbit hole, and we've gone fairly deep at this point, the idea of putting data in an object store and letting anyone in the research community access it has kind of revolutionized the way we think about what scientific computing platforms should look like going forward.

39:22 Yeah, absolutely. It solves like half of the problem. One is the computational time and power, but the other is just the storage and the data and the memory and all those kinds of things. Right, absolutely.

39:31 And I think data providers are really reckoning now with what the cloud means. Right? Because what we had seen in geosciences and climate sciences is that there are a lot of data portals out there. An agency or a group would decide, we have a data set we want to share, let's make a portal, which was almost always a highly customized website. And maybe you had to click through some JavaScript and interact with the browser to get data files, and then you would get some data files. Maybe it seemed like a good idea; there are reasons why they wanted to have a portal. But from a user point of view, especially an expert user point of view, it's just incredibly frustrating to interact with data that way. With cloud object storage, I linked in the chat this blog post I wrote about this, my fantasy. It's this facetious post about how to create a big data portal. There's, like, one step: upload your data to S3, right?

40:28 Exactly.

40:29 I think it's provocative. But I think the fact is there's a lot of vested energy and expertise within the scientific community that has built and maintained these really bespoke data access solutions, when I think really we should be moving to a very gen... we should just really be using object storage and the scalability of cloud-style computing to distribute scientific data. It doesn't mean we need to just go all in on AWS, although actually that's exactly what NASA has done. There are a lot of cloud-storage-like things that provide a really scalable base layer of storage for Internet-enabled computing. There's, like, Wasabi, an alternative to Amazon. Cloudflare is now launching a data storage service, which I'm super excited about.

41:14 Interesting.

41:15 Yeah.

41:15 And you might end up with a copy of that data. Maybe you have a copy in AWS, copy in Azure, maybe even like Linode, Digital Ocean and places like that. Right. But that's only four or five copies. Not every researcher trying to figure out how they're going to deal with it. Right.

41:31 Exactly.

41:32 Or you outsource that sort of mirroring to a service that knows how to scale that sort of thing. Another really interesting player in this space is IPFS and the distributed web. Where we're at now is that people in science are very excited about S3 and cloud computing. But on the other hand, the scientific community is wary of being too dependent on the big tech giants, and it has always had a sort of do-it-yourself, distributed approach to infrastructure. So I'm really excited about IPFS, the InterPlanetary File System, which is a distributed yet highly performant way of sharing petabyte-scale data on the Internet.

42:12 That sounds really interesting. Really quick, James out in the audience is asking, are there any Pangeo-specific resources to help with that transition, that jump from the workstation to cloud computing?

42:22 Yeah.

42:23 I think this is a good opportunity to show off and talk about the list of tutorials and examples that we have. So Pangeo has a collection of Jupyter notebooks that show how to use this. This is gallery.pangeo.io. There's a list of Jupyter notebooks here that walk through real-world examples of working with large geoscientific data in the cloud. And so I'd encourage people who are listening to pull this site up and take a look.

42:53 And I'll put a link in the show notes for people as well.

42:55 Yeah. So I think this is a good example of one of the best resources we can point folks to.

43:01 Okay. But I need to get a little bit more specific. The idea is actually that you don't have to change very much at all about your workflow when moving to the cloud. Right? That's what we are aiming for with Pangeo. And a key part of this, another pillar of Pangeo we haven't discussed yet, is Jupyter. So Jupyter is a key part of our effort. We work very closely with the Jupyter developers and also the team at 2i2c, who I know you've talked to recently. Right? And Jupyter has been, I think, an amazing sort of Trojan horse for cloud computing, because when you use Jupyter for the first time, it launches a notebook in your browser.

43:34 Honestly, the hardest way to use Jupyter is to try to use it locally because you got to install it and configure it and run it.

43:40 Exactly.

43:40 The cloud story is just even easier.

43:42 Exactly.

43:42 So we now have the scientific community, a huge number of scientists in many different languages, all just used to the idea that we're going to do our science in our browser through this type of IDE, essentially. And JupyterLab especially has so many more features than the classic notebook that make for a really rich interactive scientific computing environment. So then moving the work to the cloud is trivial for the user. In Pangeo, we operate some cloud-based JupyterHubs, and we have also been operating a Binder. And with those JupyterHubs, basically, you just log on. Anyone out there can actually sign up for the Pangeo hubs, or apply to get access to these hubs, and then you just have a notebook environment in the cloud. Of course, just having your notebook in the cloud is not that cool. But what we can augment it with are some capabilities to run Dask. So we use Dask Gateway in those hubs as a Dask deployment solution, but Coiled is another example of a Dask deployment solution. And then the key, though, what is going to bring scientists to the cloud, is the data. Right? That is what makes this appealing and game changing: the fact that now you log into this hub, okay, you're in Google Cloud us-central1, and we've got a petabyte of data from the World Climate Research Programme's Coupled Model Intercomparison Project sitting there, organized and analysis-ready, that you can start doing science with. So before, you know, if a grad student decided, okay, I'm going to work with these kinds of models and do this research project, they would literally spend months downloading, organizing, and sorting that data on a computer before they could get started. Now you can start in five minutes and be processing data.

45:24 That's amazing.

45:25 That's what gets me excited and motivates me to devote energy to this project.

45:31 Maybe that also involved writing a grant so they could get enough compute to locally hold and work on that. And you've got to wait. I'm going to order our Silicon Graphics or Cray or whatever it is you're getting, whatever the supercomputers are these days, and then wait for some big thing to show up. Or you fire up a notebook on AWS, take your pick. Right? I suspect this has also democratized computing somewhat as well, right? You don't need as much compute to set up something like this and access the existing cloud database.

46:04 That's true. And I think part of that is that in the cloud model there's a separation of concerns. The storage is not necessarily directly attached to compute, whereas in a supercomputing center, it's all one warehouse, right?

46:17 Yeah.

46:17 And so, yeah, if you're doing a small problem, you get a small VM or a small Dask cluster. If you're doing a big problem, you scale out, so you can arrange your compute infrastructure appropriately for the problem you're working on. And the other thing is that the way we're storing the data allows for partial queries of these larger arrays.

46:41 The data that Ryan was talking about, the CMIP6 data, is a petabyte in scale, but you don't have to open the whole thing and you don't have to load the whole thing. We're able to slice into that and grab out just the parts that are interesting for the research project.
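
A rough sketch of that "slice into it" idea, assuming dask is installed. A lazily generated array of ones stands in for a large cloud-hosted data set; real Pangeo data would be opened from object storage instead, and the shapes here are arbitrary.

```python
import dask.array as da

# Nominally about 80 GB of float64 values, but nothing is allocated yet:
# this is only a description of the array, split into 1000 x 1000 chunks.
big = da.ones((100_000, 100_000), chunks=(1_000, 1_000))
nominal_gb = big.nbytes / 1e9

# Slice out just the region of interest; still lazy.
subset = big[:500, 2_000:2_500]

# Only the chunk that overlaps the slice is actually materialized.
subset_sum = subset.sum().compute()

print(nominal_gb, subset.shape, subset_sum)
```

The full array never has to fit in memory; the work done is proportional to the slice that was requested, not to the size of the whole data set.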

46:54 Right? Exactly. That's awesome. One thing I did want to ask you, too, about real quick.

46:59 We're getting short on time here. But do projects like JupyterLite, which is Jupyter running on WebAssembly in the browser, or Pyodide, a bit of the Python stack running in the browser, offer any benefits to you all? Are you tracking them? Are you interested?

47:16 JupyterLite and Pyodide are super cool. They really lower the bar to just getting something up and running. The motivation for getting Pangeo into the cloud, a big part of it, was big data. We have large data sets. Right? And so we want to do data-proximate computing. So by putting a hub in the cloud, we're putting our compute next to the data. Pyodide actually takes it back to the laptop.

47:40 Yeah. This doesn't solve the data problem. This kind of wrecks that, but it does give computational capabilities without.

47:48 And if you couple that with something like Coiled or any Dask solution, where you can actually then call out to a data processing layer, that is interesting.

47:58 It would make the need for us to operate those JupyterHubs potentially be reduced or go away.

48:03 Oh, that's interesting. So you've got the Dask cluster or whatever next to the data in the cloud, and this is just handling the results of all that.

48:11 Okay.

48:12 I hadn't thought about combining those in that way. Very cool.

48:14 If you want to process, like many terabytes of data, you're not going to do it in your browser. You're not going to do it in your laptop at all. You do need a big computer, right?

48:22 Yeah. But still, it certainly expands what's possible beyond maybe the first impressions.

48:27 I was going to say, a quick shout-out: we recently added a JupyterLite example tutorial to the Xarray homepage at xarray.dev. So if folks want to go try out Xarray really quick without having to start up a JupyterLab server or something, it's there and you can run through it.

48:44 Oh, that's fantastic.

48:46 Yeah.

48:46 We've got Binder, which sort of creates a cloud instance to run all these examples, but a lot of them could actually just run like this, which is great. It makes it a lot simpler. All right. We could talk for much longer, but we also know that we don't have too much time left. So let me ask you both really quickly the final two questions. Joe, you can go first. If you're going to write some Python code, what editor do you use?

49:08 VS Code.

49:09 All right.

49:10 Or Vim, if I can't.

49:11 Okay. I use Atom and I feel like I'm behind the times, but that's what I use.

49:16 That's like OG VS Code.

49:18 I guess what I need to do Don, but I feel like I'm missing out on things, so I probably need to upgrade. I do feel like my development environment is increasingly owned by Microsoft, so I have some mild reluctance to switch to VS Code.

49:32 Sure. I hear you.

49:33 Awesome.

49:34 And then notable PyPI package some library you've come across lately that you thought was awesome. Obviously. Shout out to Pangeo and all of it.

49:41 But it's not a package.

49:43 Now, I know I'm going to go with one of my favorites, which is not something I came across recently.

49:48 But I think everybody that works as well.

49:51 It's fsspec, and it's a library for accessing data across a bunch of different file storage systems. And it is a game changer for working with remote data. Everyone should know about it.

50:01 Yeah, absolutely. That's a super interesting one. Basically, you can connect it to all these different back ends and stuff, right? I don't remember where to find them, but yeah, you can connect S3 like it was a file and stuff like that, right?

50:13 Yes, exactly.
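
A short sketch of the "like it was a file" idea, assuming fsspec is installed. It uses fsspec's built-in in-memory filesystem so it runs without credentials; the path demo/greeting.txt is made up, and pointing the same code at S3 would additionally need the s3fs package and a real bucket.

```python
import fsspec

# The URL scheme selects the backend. "memory://" is an in-process store;
# an "s3://bucket/key" URL would go through S3 instead (needs s3fs).
url = "memory://demo/greeting.txt"

# Write through the generic file-like interface.
with fsspec.open(url, "wb") as f:
    f.write(b"hello from fsspec")

# Read it back the same way you would read a local file.
with fsspec.open(url, "rb") as f:
    content = f.read().decode("utf-8")

# The filesystem object offers the usual operations such as exists and ls.
fs = fsspec.filesystem("memory")
exists = fs.exists(url)

print(content, exists)
```

Swapping the storage backend is a matter of changing the URL, which is why libraries like Xarray and Dask can read local and remote data with the same code.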

50:14 Yeah. Fantastic. All right, Ryan.

50:16 Well, mine is a shameless self-promotion, but it's for a new project we have called Pangeo Forge, which is basically an ETL tool for this Xarray scientific data space. Right? So what we found is that a lot of the ETL tools that exist for business-style data analytics don't necessarily play well with the multidimensional data model that we use in geosciences and science more broadly. And so we're building this open source Python package called Pangeo Forge Recipes that is designed to help us with all the data movement and data transformation that needs to happen as we're migrating so many legacy data sets into this cloud-native format.

50:57 That looks fantastic. All right, good shout-out. And we'll put a link to that so people can check it out. All right, Joe, Ryan, it's been a lot of fun. Quick final call to action: people want to get started with Pangeo, what do you say?

51:08 Come to our forum.

51:10 Yeah, you mentioned before we hit record, you mentioned that the discuss page is like, really where the action is at right now. The discuss forum. Right.

51:17 Sorry, not discuss.

51:18 Yeah, Discourse. The tools and the stuff are self-documented on the website. But what we really emphasize about Pangeo is the community aspect of it. We are not just trying to build a tool and put it out there; we're really trying to build a community where scientists are talking to software developers, are talking to infrastructure maintainers, are talking to data providers, and are collectively trying to keep the flywheel of innovation and development spinning, ultimately with the goal of empowering more scientific discoveries. And so if you have questions, if you have breaks, if you don't know if it's the right tool for you, this forum is where we can welcome you. We don't have a Slack. We try to be more open as a community than a Slack.

52:00 So this is where we have I can second that. It's a good idea. Awesome. And then Joe, you also wanted to throw out Xarray.dev, right?

52:09 Yeah, exactly. This is a relatively new site that sits on top of our documentation site, but the JupyterLite interface is down below the fold, and it's a great starting point.

52:19 Thank you both for being here. This has been really interesting and thanks for all the hard work.

52:23 Thanks for having us.

52:24 Thank you for inviting us.

52:25 Yeah, you bet.

52:25 Bye.

52:27 This has been another episode of Talk Python To Me. Thank you to our sponsors. Be sure to check out what they're offering. It really helps support the show. Add high-performance multiparty video calls to any app or website with SignalWire. Visit talkpython.fm/signalwire and mention that you came from Talk Python To Me to get started and grab those free credits.

52:48 Take some stress out of your life. Get notified immediately about errors and performance issues in your web or mobile applications with Sentry. Just visit talkpython.fm/sentry and get started for free, and be sure to use the promo code 'talkpython', all one word. Want to level up your Python? We have one of the largest catalogs of Python video courses over at Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async. And best of all, there's not a subscription in sight. Check it out for yourself at training.talkpython.fm. Be sure to subscribe to the show: open your favorite podcast app and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

53:37 We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.
