#334: Microsoft Planetary Computer Transcript
00:00 On this episode. Rob Emanuel and Tom Augspurger join us to talk about building and running Microsoft's Planetary Computer project. This project is dedicated to providing the data around climate records and the compute necessary to process it with the mission of helping us all understand climate change better. It combines multiple petabytes of data with a powerful hosted Jupyter Lab notebook environment to process it. This is Talk Python to Me episode 334, recorded September 9th, 2021.
00:43 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy. Follow me on Twitter where I'm '@mkennedy' and keep up with the show and listen to past episodes at 'talkpython.fm', and follow the show on Twitter via '@talkpython'. We've started streaming most of our episodes live on YouTube, subscribe to our YouTube channel over at 'talkpython.fm/youtube' to get notified about upcoming shows and be part of that episode. This episode is brought to you by Shortcut, formerly known as Clubhouse.IO, and us over at Talk Python Training, and the transcripts are brought to you by 'Assembly AI'. Rob, Tom.
01:21 Welcome to Talk Python to Me.
01:22 Thank you.
01:23 Good to have you both here. We get to combine a bunch of fun topics and important topics.
01:29 Data science, Python, the cloud, big data, as in physically lots of data to deal with, and then also climate change and being proactive about studying that, making predictions and doing science on huge amounts of data, for sure.
01:43 Look forward to it.
01:44 Yeah, that'll be fun.
01:45 Yeah, absolutely. Before we get into those, let's just start real quickly. How did you get into programming and Python? Rob, start with you.
01:51 Yeah, sure.
01:52 Been a developer for, I don't know, let's say 14 years. I started at a shop that was doing Sybase PowerBuilder, which goes way back. And I actually come from a math background, so I didn't know a lot about programming. I started using Python just sort of on the side to parse some bank statements and do some personal stuff, and then started actually integrating some of our source control at the company with Python and had to write some C extensions. So I got into the Python source code and started reading that code and being like, oh, this is how programming should work. This is really good code. And that year I went to my first PyCon.
02:34 I was just like, I'm all in, I need to get a different job where I'm not doing PowerBuilder. I really credit PyCon and the Python code base with setting me on a better development path for sure.
02:45 Oh, that's super cool. PyCon is a fun experience, isn't it? Yeah, it's like my geek holiday, but sadly, the geek holiday has been canceled the last two years.
02:54 Oh, no.
02:55 Yeah.
02:56 Tom, how about you?
02:57 Kind of similar to a lot of your guests, I think. I was in grad school and had to pick up programming for research and simulations. This was for economics.
03:07 They started us on MATLAB and Fortran, which goes back maybe even further, almost as far as you can go. Anyway, I didn't really care for MATLAB, so I moved over to Python pretty quickly and then just started enjoying the data analysis side more than the research side. I got into that whole open source ecosystem around Pandas and statsmodels, an econometrics library, started contributing to open source, dropped out, got a job doing data science stuff, and then moved on to Anaconda, where I worked on open source libraries like Pandas and Dask for a few years. Yeah.
03:44 In a weird coincidence, the previous episode was with Stan Siebert, who you worked with over there.
03:51 Right, that's him, Director of Community Innovation. It was a great place to work, I really enjoyed it. And then I came onto this team at Microsoft almost a year ago now, working on the Planetary Computer. Yeah.
04:04 Cool.
04:04 Well, the planetary computer stuff sounds super neat. You get to play with all the high end computers and the big data and whatnot right? Yeah.
04:11 It's a lot of fun. Although I did have a chance to play on, I think it was Summit, which is one of our nation's supercomputers, at my last job. That was a lot of fun too.
04:19 Okay, well, it's hard to beat that, right? That's one of those machines that takes up a whole room, a huge room. That's pretty fantastic. Awesome. All right. Well, what are you doing today? You're both on the Planetary Computer project, you're working at Microsoft. What are you doing there?
04:34 Yeah. So we're on a pretty small team that's building out the Planetary Computer, which really is sort of three components. There's a data catalog hosting many petabytes of openly licensed satellite imagery and other data sets on Azure Blob storage. We're building and running API services: we ETL the data, encode metadata according to the STAC specification, which we can get into later, put it into a Postgres database, and then build API services on top of that. A lot of what I do is manage the ETL pipelines and the APIs and then expose that data to users, environmental data scientists, and really anybody. It's just publicly accessible. That's sort of my side. And then there's a compute platform, which Tom can talk about.
05:26 Yeah. So all this is in service of environmental sustainability. Our primary users are people who know how to code, mostly in Python, but they're not developers.
05:37 And so we don't want them having to worry about things like Kubernetes or whatever to set up a distributed compute cluster.
05:44 So that's where the Hub comes in. It's a place where users can go log in and get a nice, convenient computing platform built on top of JupyterHub and Dask, where they can scale out to these really large workflows, do whatever analysis they need, and produce whatever derived data sets they need to pass along to their decision makers in environmental sustainability.
06:07 It's super cool, the platform you all are building, for people who might have some Python skills, some data science skills, but not necessarily high end cloud programming, right? Handling lots of data, setting up clusters, all those kinds of things. You just push a button and end up in a notebook, and the notebook is nearby.
06:27 Petabytes of data. Right.
06:28 Right. Exactly. So we'll talk a lot about cloud native computing and data analysis. Really, what that means is just putting the compute as close to the data as possible, so in the same Azure region.
06:40 So you just need a big hard drive.
06:42 Really, really, really big hard drive.
06:44 That's what the cloud is.
06:45 One hard drive.
06:46 Exactly.
06:47 It is. Yeah. So super neat. Before we get into it though, let's just maybe talk real briefly about Microsoft and the environment. This obviously is an initiative you are putting together to help climate scientists study the climate and whatnot. But I was really excited to see last year that you announced that Microsoft will be carbon negative by 2030. Yeah, for sure.
07:10 I mean, prior to joining Microsoft, I didn't know any of this, but Microsoft has been on the forefront of corporate efforts in environmental sustainability for a long time. There's been an internal carbon tax that we place on business groups, where there are actual payments made based on how much carbon emission each business group creates.
07:32 And that's been used to fund the environmental sustainability team and all these efforts. That sort of culminated in these four focus areas and commitments that were announced in 2020. So carbon's a big one, not just carbon negative by 2030, but by 2050 actually having removed more carbon than Microsoft has ever produced since its inception. And that's over scope one, scope two, and scope three, which means accounting for downstream and upstream providers. Then there are a couple more focus areas: around waste, by 2030 achieving zero waste, and around water, becoming water positive and ensuring accessibility to clean drinking and sanitation water for more than 1.5 million people. There's an ecosystem element too: by 2025, protecting more land than we use, and then also creating a planetary computer, which is really using Azure resources in the effort to model, monitor, and ultimately manage Earth's natural systems. That's awesome.
08:36 That's the part where you all come in, right?
08:38 Yeah.
08:38 Exactly. The Planetary Computer is in that ecosystem commitment, and that's what we're working towards.
08:42 Yeah. Very cool.
08:44 Removing all the historical carbon, I think, is pretty fantastic. And being carbon negative, so much stuff runs on Azure and on these couple of large clouds, that's actually a statement about a large portion of data center usage as well.
08:58 For sure.
08:59 How many data centers does Azure have, like, three or four?
09:02 Right.
09:02 I think it's the most out of any of them.
09:04 It's a lot, right.
09:05 Over 50 or something like that, large data centers. I don't remember, but they add them all the time. Yeah. It's like constant. So that's a big deal. Super cool. All right. Let's talk about this Planetary Computer. You told us a little bit about the motivation there, and it's made up of three parts, right?
09:24 Alright.
09:24 Yeah. So tell us about it.
09:25 So there's technically four parts.
09:27 We recognize that technology for technology's sake is just kind of spinning your wheels. We have to be building all of this data, all this data access, the analytics platform, towards applying data and insights to actually making an impact on environmental sustainability concerns. And that's not done by us, an engineering team, kind of trying to figure out the climate science. We're engaging with organizations to build out applications specifically on these data and services, and partnering with organizations that have specific goals. So there's an applications pillar to the Planetary Computer. But from an engineering standpoint, we're mostly focused on the data catalog, the APIs, and the Hub that we touched on briefly.
10:12 Yeah. And then the applications are what the partners and other people building on top of it are really doing, right?
10:18 Exactly. And we participate in that and help bring different organizations together to build out the applications and use the money that we have to actually fund applications that are specifically aimed at different use cases. Yeah.
10:34 Very cool. There are some other things that are somewhat like this, right? Like Google Earth Engine and AWS, and you probably could just grab this data yourself. Do you want to do a compare and contrast, either of you?
10:45 Yeah. So Google Earth Engine is sort of the bar that's set as far as using cloud compute resources for Earth science.
10:53 And it's an amazing platform that's been around for a long time. It's really just like a giant compute cluster that has an API and sort of a JavaScript interface into it that you can run geospatial analytics on. Like I said, it's a great tool, can't sing its praises enough. One of the aspects that makes it less useful in certain contexts is that it is a little bit of a black box, right? The operations, the geospatial operations that you can do on it, the way that you can manipulate the data, are sort of whatever Google Earth Engine provides.
11:30 If you wanted to run a PyTorch model against a large set of satellite imagery, that's a lot more difficult. You can't really do that inside Google Earth Engine. You have to ship data out and ship data in, and getting data in and out of the system is a little tough because it's sort of a singular solution that you optimize a lot around. So the approach we're taking is a more modular approach, leaning heavily on the open source ecosystem and its tools, trying to make sure that open source users are first class users that we're thinking of first. If people want to just use our data, we have cloud optimized GeoTIFFs, these flat file formats, on Blob storage.
12:09 Go ahead.
12:09 And you don't have to use any of the other stuff that we're building. But if you want to do spatiotemporal searches over it, we provide a free-access API that allows you to do searches and get metadata about the data, so you don't have to actually read in the data bytes. Then we're also providing the Hub experience, which brings together that really rich open source ecosystem of Python tooling, including our tooling. And we're building out other mechanisms to access this data.
12:37 The current focus is really on that Python data science side, but yeah, we're considering the open source ecosystem sort of as our user experience and trying to treat that as the first class use case.
12:49 Yeah, that's fantastic. Tom, tell me if I have this right. I feel like my limited experience working with this is: you've got these incredible amounts of data, but they're super huge. You all built these APIs that let you ask questions and filter it down to, I just want the map data for this polygon or whatever. And then you provide a Jupyter notebook and the compute to do stuff on that result. Is that pretty close?
13:14 Yeah.
13:14 That's pretty good. The API is so crucial to have, and we'll get into what it's built on. But just for a Python analogy here: imagine that you only had lists for your data structure, you don't have dictionaries. Now you have to traverse this entire list of files to figure out, where is this one in space on Earth? Where is it, or what time period is it covering? The nice thing about the API is you're able to do very fast lookups over space and time to get down to the subset you care about, and then bring those data sets into memory, ideally on machines that are in the same Azure region, using tools like Xarray or Pandas and Dask, things like that.
13:59 Yeah.
13:59 Very cool.
14:00 So, Rob, you mentioned the Postgres database. Do you parse this data and generate the metadata and all that, then store some of that information in the database so you can get to it super quick, and then you've got the raw files in Blob storage, something like that?
14:13 Yeah, for sure. I mean, we capture as much metadata as we can to describe the data, so that you can do what Tom said and ignore the stuff that you don't care about and just get to the area that you care about. We try to extract that, and we do it according to a really interesting, community-driven spec. One of the biggest complaints about dealing with satellite imagery and Earth observation imagery is that it's kind of a mess. There are a lot of different scientific variables and sensor variables and things. So there's been a community effort over the past three or four years to develop specifications that make this type of information machine readable. We've bought fully into that and have processes that look at the data, extract the STAC metadata, which is just a JSON schema specification with extensions, and then write that into Postgres.
15:08 And one of the things that we have been trying to do for transparency and contribution to open source is that a lot of that ETL code base, the Python code that actually works over the files and extracts the metadata, is open source in the stac-utils GitHub organization. So we're trying to contribute to that body of work of how to generate STAC metadata for these different data types.
15:34 You want the STAC metadata for the exact same image coming from a USGS public sector data set to be identical whether you're using our API or Google Earth Engine, which also provides a STAC API. So we're working together on these kinds of shared core infrastructure libraries.
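For readers curious what that machine-readable metadata looks like in practice, here is a rough sketch of building a single STAC item with the pystac library. The ID, footprint, cloud cover value, and asset URL are all made up for illustration; they are not a real Planetary Computer record.

```python
from datetime import datetime, timezone

import pystac

# Hypothetical scene footprint (GeoJSON polygon) and acquisition time.
geometry = {
    "type": "Polygon",
    "coordinates": [[[-122.2, 47.5], [-122.0, 47.5], [-122.0, 47.7],
                     [-122.2, 47.7], [-122.2, 47.5]]],
}

item = pystac.Item(
    id="example-scene-20210101",
    geometry=geometry,
    bbox=[-122.2, 47.5, -122.0, 47.7],
    datetime=datetime(2021, 1, 1, tzinfo=timezone.utc),
    properties={"eo:cloud_cover": 12.5},  # extension fields live in properties
)

# Assets point at the actual files, e.g. a cloud optimized GeoTIFF in Blob storage.
item.add_asset(
    "visual",
    pystac.Asset(
        href="https://example.blob.core.windows.net/scenes/visual.tif",
        media_type=pystac.MediaType.COG,
    ),
)

print(item.to_dict())  # the JSON document that would be indexed into Postgres
```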
15:56 This portion of Talk Python to Me is brought to you by Shortcut, formerly known as Clubhouse.IO. Happy with your project management tool? Most tools are either too simple for a growing engineering team to manage everything, or way too complex for anyone to want to use them without constant prodding. Shortcut is different, though, because it's worse. No, wait, no, I mean it's better. Shortcut is project management built specifically for software teams. It's fast, intuitive, flexible, powerful, and many other nice positive adjectives. Key features include team-based workflows: individual teams can use default workflows or customize them to match the way they work. Org-wide goals and roadmaps: the work in these workflows is automatically tied into larger company goals. It takes one click to move from a roadmap to a team's work to individual updates and back. Tight version control integration: whether you use GitHub, GitLab, or Bitbucket, Shortcut ties directly into them so you can update progress from the command line. A keyboard-friendly interface: the rest of Shortcut is just as friendly as their power bar, allowing you to do virtually anything without touching your mouse. Throw that thing in the trash. Iteration planning: set weekly priorities and let Shortcut run the schedule for you with accompanying burndown charts and other reporting. Give it a try over at 'talkpython.fm/shortcut'. Again, that's 'talkpython.fm/shortcut'. Choose Shortcut, because you shouldn't have to project manage your project management.
17:25 Well, let's dive into some of the data, actually, and talk a little bit about all these data sets. So a lot of data, as we said, over here, maybe highlight some of the important data sets that you all have on offer.
17:39 So Sentinel-2 is our largest and is incredibly important. It's multispectral optical imagery at ten meter resolution, so it's the highest resolution. When we talk about satellites, we often talk about what resolution is captured, because something like Landsat, which we also have, Landsat 8, is 30 meter resolution. Once you get down to street level, you can't really see, everything's blurry.
18:07 Each pixel represents 30 meters on the ground.
18:11 Right.
18:12 Okay. So Sentinel is ten meter, you get a lot clearer picture. If you're doing deforestation monitoring, for instance, you can really track the edge of deforestation a lot better with ten meter imagery, or glaciers.
18:26 And you want to understand the boundary of it or something.
18:28 Exactly.
18:29 And it's still pretty low resolution compared to commercially available imagery, but as far as open data sets go, it's high resolution. It's passively collected. The revisit rate, I should have this offhand, I think it's eight days. So you can really do monitoring use cases with it. It generates petabytes and petabytes of data, so it's a lot to work over when generating the STAC metadata.
18:53 You've got to fire up, like, 10,000 cores to run through that. You end up actually reaching the limits of how fast you can read and write from different services.
19:04 My gosh.
19:05 Yeah, but it's a really great data set. A lot of work is being done against Sentinel-2.
19:11 A lot of what I'm seeing reading through here is, this one is annual, or this one is from 2000 to 2006, or the one you're just speaking about is from 2016 on. So this data is getting refreshed. And can I ask questions like, how did this polygon of a map look two years ago versus last year versus today? Totally. Yeah.
19:32 And you can do that with the API, to say, okay, here's my polygon of interest, this is over my house or whatever, fetch me all the images. But a lot of satellite imagery, I mean, most of it is clouds. The Earth is covered in clouds, so you're going to get a lot of clouds. So there's also metadata about the cloudiness, and you can say, okay, well, give me these images over time.
19:54 But I want the scenes to be under 10% cloudy, right?
19:58 I'm willing for it to not be exactly 365 days apart, but maybe 350, because then I get a clear view. If I do that, something like this?
20:05 Exactly. And then you can make a little time lapse of how that area has changed over time. In fact, I think there was somebody who demoed a similar type of time lapse, just grabbing the satellite imagery and turning it into a video over an area. I forget who exactly that was.
20:22 Yeah, that one, Sentinel, the large one, the revisit time is every five days. That's a lot of data.
20:29 Yeah.
20:30 It ends up a lot of data.
20:32 A lot of cloud.
20:41 Yeah. What about some of these other ones here? Daymet, which is gridded estimates of weather parameters in North America.
20:41 That's pretty interesting.
20:43 So Daymet's actually an example of, well, a lot of our data is geospatial satellite imagery or things derived from that, like elevation data sets, where you're using imagery to figure out the elevation of the land, or things like the land cover data set. If you scroll down just a tad, the land cover data set there is based off Sentinel, actually. The idea is, for every pixel in Sentinel, they took like a mosaic over a year.
21:12 What is that pixel being used for? Is it water, trees, buildings, roads, things like that. So those are examples based off of satellite imagery or aerial photography. And Daymet's an example of something that's the output of a climate or weather model.
21:30 So these are typically higher dimensional. You're going to have things like temperature, maximum and minimum temperature, water vapor pressure, all sorts of things that are stored in this really big n-dimensional cube at various coordinates: latitude, longitude, time, maybe height above surface.
21:49 So those are stored typically in formats like Zarr, which is this cloud native, very object-storage-friendly way of storing chunked n-dimensional arrays.
21:59 Is it, like, streaming friendly? You can stream part of it and seek into it that kind of thing.
22:03 Exactly. And all the metadata is consolidated, so you can load in the whole data set in less than a few hundred milliseconds, but then access a specific subset very efficiently.
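To give a feel for that access pattern, here is a hedged sketch of opening a Zarr store with xarray and pulling one small subset. The storage URL, account name, variable, and date range are placeholders, not the actual Daymet location.

```python
import fsspec
import xarray as xr

# Hypothetical Zarr store sitting in Azure Blob storage (adlfs handles the az:// protocol).
store = fsspec.get_mapper(
    "az://example-container/daymet/na.zarr",
    account_name="example-storage-account",
)

# consolidated=True reads one small metadata file instead of touching every chunk,
# so opening the whole multi-terabyte dataset takes well under a second.
ds = xr.open_zarr(store, consolidated=True)

# Only the chunks covering this time subset are actually downloaded.
subset = ds["tmax"].sel(time=slice("2019-06-01", "2019-08-31"))
print(subset.mean().compute())
```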
22:13 Sure.
22:14 Yeah.
22:14 Very neat. Another one that's not directly based off of satellite is the high resolution electricity access. I'm guessing, I guess you could sort of approximate it from lights. Do you think it's light output measured from satellites?
22:28 Yeah, I think it's all.
22:30 Yeah. It's via satellite.
22:32 Okay.
22:32 So this is basically just studying the light. Interesting.
22:35 And we have a few more that are coming online shortly which are more tabular. So, like, the US Census gives you the polygons: the state of Iowa has these counties or census blocks, which are this shape. It gives you all those shapes, and each has this population, things like that.
22:53 Then there's GBIF, which I think is on there now, which has occurrences, like observations: somebody spotted this animal or plant at this latitude and longitude at this time, things like that. So lots of different types of data.
23:07 Interesting. A mink was spotted running through the streets.
23:10 Okay.
23:10 Yeah.
23:10 You have one for agriculture.
23:12 That's pretty interesting. If you are doing something with agriculture and farming and then trying to do ML against that.
23:18 That's interesting, because that's actually run by the National Agriculture Imagery Program. It's aerial imagery, RGB, red, green, blue, and also infrared, that's collected about every three years.
23:31 So that's an example of high resolution imagery; I think it's 1 meter resolution.
23:37 Yeah.
23:38 You can see the little trees and stuff. It's very accurate.
23:42 It's a great data set, specific to the US. So again, Sentinel-2 is global in scope, but if you are doing things in the United States, NAIP is a great data set to use.
23:52 Yeah. You've got the USGS 3D elevation for topography. That's cool.
23:57 And then you have some additional data sets. What's the difference between the main ones and these additional ones? Why are they separated?
24:04 We're catching up to where our STAC API has all of the data sets we host.
24:09 But the AI for Earth program, which hosts all these data sets, has been going on since 2017. So there's plenty of data sets that they've been hosting that haven't yet made their way into the API.
24:21 And that's just because we're getting there. It's a bunch of work. I see.
24:25 So for these additional ones, maybe I could directly access them out of Blob storage, but I can't ask API questions.
24:30 Exactly. And another point, which is kind of interesting, getting back to the tabular data, is that some of these data formats aren't quite there. I mean, rasters and imagery fit really nicely in STAC, and we know how to do spatiotemporal queries over them. But some of these other data formats aren't as mature as the raster formats, or it's not as clear how to host them in a cloud optimized format and then index them in a spatiotemporal API. So we're actually having to do work to say, okay, what are the standards? Is it GeoParquet? What are the formats that we're going to use for hosting these data sets, and then how do we actually index the metadata through the API? So there's a lot of data format and metadata specification work before we can actually host all of these in the API.
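As a small illustration of the tabular side, here is a sketch of reading a GeoParquet file with GeoPandas. The file name and column names are hypothetical, just to show the shape of the workflow.

```python
import geopandas as gpd

# A GeoParquet file previously downloaded from Blob storage (placeholder path).
blocks = gpd.read_parquet("census_blocks.parquet")

# Geometry and attributes come back together, so normal spatial operations just work.
iowa = blocks[blocks["state"] == "IA"]
print(iowa[["geoid", "population"]].head())
print("Combined area in map units:", iowa.geometry.area.sum())
```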
25:15 Really nice. A lot of good data here, and quite large. Let's talk about the ETL for just a minute, because you threw out some crazy numbers there. We're looking at the Sentinel-2 data, it gets refreshed every five days, and it's the whole Earth. Talk us through what has to happen there.
25:31 Yeah.
25:31 So for the Sentinel, it's actually daily.
25:35 It's passive satellite collection. So the satellites are just always monitoring, always grabbing new imagery.
25:42 And so that comes off to ground stations through the European Space Agency.
25:46 And then we have some partners who are taking that, converting it to the cloud optimized GeoTIFF format, and putting it on Blob storage, at which point we run our ingest pipeline: look for new imagery, extract the STAC metadata, insert that into the database. We have that running in an Azure service called Azure Batch, which allows us to run parallel tasks on clusters that can auto scale. So if we're doing an ingest of a data set for the first time, there are going to be a lot of files to process, and we can scale that up. It runs Docker containers, so we just have a project that defines the Docker commands that can run, and then we can submit tasks for chunks of the files we're processing. That creates the STAC items, and then another separate process takes the STAC items and inserts them into the database.
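For a rough idea of what submitting that kind of containerized, chunked work can look like, here is a heavily simplified sketch with the azure-batch Python SDK. The account URL, key, job ID, container image, and command are all invented placeholders, and the team's real pipeline is more involved than this.

```python
from azure.batch import BatchServiceClient, models
from azure.batch.batch_auth import SharedKeyCredentials

# Hypothetical Batch account; the job is assumed to already exist on an auto-scaling pool.
credentials = SharedKeyCredentials("examplebatchaccount", "EXAMPLE_KEY")
client = BatchServiceClient(
    credentials, batch_url="https://examplebatchaccount.eastus.batch.azure.com"
)

# One task per chunk of newly landed imagery; each runs a metadata-extraction container.
tasks = [
    models.TaskAddParameter(
        id=f"create-stac-items-{i:04d}",
        command_line=f"python create_items.py --chunk chunks/{i:04d}.txt",
        container_settings=models.TaskContainerSettings(
            image_name="example.azurecr.io/etl/stac-ingest:latest"
        ),
    )
    for i in range(100)
]

client.task.add_collection(job_id="sentinel-2-ingest", value=tasks)
```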
26:37 That's cool. So it's a little bit data driven, rather than something like Azure Functions or AWS Lambda.
26:45 Batch processing: you just kind of get all the data and work through it at scale. Interesting.
26:51 Yeah, for sure. Right now, it's a little bit, we're still building the plane as we're flying it. But the next iteration is actually going to be a lot more reactive, based on another Azure service called Event Grid, where you can get notifications of new Blobs going into storage, and then put messages into queues that can then turn into these Azure Batch tasks that are run. Right.
27:13 I see. So something just drops it in Blob storage, that kicks off everything from there, and you don't have to worry about it.
27:19 Yeah.
27:19 And then we publish those to our users, saying, hey, this is ready now, if they subscribe to that Event Grid topic.
27:27 Oh, that's cool. There's a way to get notified of refreshes and things like that.
27:32 Not yet. We're hoping to get that end of the year, but the idea is that we would have basically a live feed of new imagery.
27:39 What I would really like to see, just for myself, my own interest, is to be able to have my areas of interest and then just go to a page that shows almost an Instagram feed of Sentinel images over that area. Like, this new one is not cloudy, look at that one. That's something I'm monitoring. But yeah, generally we'll be publishing new STAC items, so that if you're running AI models off of the imagery as it comes in, you can do that processing based off of events. Yeah.
28:06 That'd be cool. I'm only interested in Greenland. I don't care if you've updated Arizona or not. Just tell me exactly if Greenland has changed, then I'm going to rerun my model on it or something.
28:15 Right.
28:15 Cool.
28:15 All right. We've seen the data part, the data catalog, and then we have the API and the Hub, which I want to get to. But I kind of want to put some perspective on what people have been doing with this, some of your partner stuff under the applications. Which ones do you think we should highlight as interesting?
28:34 We kind of talked about the land cover data set. We worked with Impact Observatory to do that, and we had some tips about how to use Azure Batch, because that's a very big Azure Batch job to generate that land cover map: pulling down the Sentinel data that we're hosting and then running their model over it. So that was a fun data set to see come together. And then there's the Carbon Plan carbon risk assessment application. That's really cool.
29:04 It's a cool, like JavaScript application that you can view risk on.
29:09 These companies are buying carbon assets, for trees that are planted to offset carbon. But there's a problem that we know about now, which is that wildfires are burning down some of those forests.
29:25 And it doesn't help if you planted a bunch of trees to offset your carbon if they go up in smoke, right?
29:30 Right. Yeah. So Carbon Plan did a bunch of research, essentially. They did the research before our Hub existed, but we're working with these community members, and they have a very similar setup to what we have now, to do the research, to train the models, and all of that goes into this visualization here of what the different risks are for each plot of land in the US. So that was a great collaboration there.
29:59 One of the things I was wondering when I was looking at these is, you all are hosting these large amounts of data and you're offering compute to study them. How does something like Carbon Plan take that data and build this seemingly independent website? Does that run directly on that data?
30:15 Or do they export some stuff and run it on their side, or what's the story? They did all the heavy duty compute ahead of time, to train the models and everything, to gather the statistics necessary to power this. So at that point, it's just a static JavaScript application running in your browser.
30:34 Ah, interesting.
30:35 And I think that's a good point because it's running against our data, but it's running in their own infrastructure.
30:41 So it's sort of on the Planetary Computer, but really, in this case, it's using the Planetary Computer data sets in a production setting, in infrastructure that they own, which is a use case we really want to support. If they need to use search in order to find the images that they need, they can use our STAC APIs, but really it's just an application running in Azure. In certain cases, with our grants program, we'll end up supporting and sponsoring Azure subscriptions to run this type of infrastructure. But at the end of the day, it's really just applications running in the cloud.
31:16 It's just better if it's in Azure, so that it's nearby, but they could run anywhere, technically, right? And just get signed Blob storage access or whatever.
31:24 Yeah. Well, we'll throttle access at a certain point if you're trying to egress too much.
31:28 But yeah, I can imagine.
31:29 Yeah.
31:30 Yeah. Very cool. I can come over here and zoom in on Portland, and it looks like we're in a decent bit of green still. It does rain up here a lot. Same for Seattle.
31:39 Yeah, quite cool.
31:41 You talked about this grants program. What's the story for people out there listening who are like, I want to get into working with this data and building things? A grant might sound good to them. What is that?
31:50 Awesome. Yeah. Look up AI for Earth grants. We have rounds of supporting folks that are doing environmental sustainability work, and there's a range of grant awards. The lowest level is giving Azure credits, being able to sponsor an account or sponsor resources for applications that are being developed or research that's being done for environmental sustainability. We have folks running the grants program and taking the applications, and there are different classes that we have and summits for each of the classes. Then there are more involved grants and larger grants; usually as people show progress, we can end up bringing additional resources or paid projects to accomplish specific goals. But if anybody's out there and they're doing work in environmental sustainability that could benefit from the cloud, we'd love to work with you.
32:46 Just to clarify, the grants are great if you have a complex deployment that's using a ton of Azure services, you want to integrate it all together, and you want to use the Planetary Computer data; then the grants are a great approach. If you're just an individual researcher, a team of researchers, or whoever who wants to use this data, the data is there. It's publicly accessible. And if you need a place to compute from that's in Azure, close to the data, and you don't already have an Azure subscription, then you can sign up for a Planetary Computer account. That's a way lower barrier to entry: you just sign up for an account, you get approved by us, and then you're off to the races.
33:26 That's a good point.
33:27 If you think you need a grant to use the cloud, try using the planetary computer first, because you might not.
33:33 Very good. Talk Python to Me is partially supported by our training courses. When you need to learn something new, whether it's foundational Python, advanced topics like async, or web apps and web APIs, be sure to check out our over 200 hours of courses at Talk Python. And if your company is considering how they'll get up to speed on Python, please recommend they give our content a look. Thanks.
33:58 So what's the business model around this? Is there going to be a fee for it? Is there some free level? Is it always free but restricted in how you can use it? Because right now it's in a private beta, right? I can come down and request access to it.
34:12 Yeah, it's a preview. We're still gating access by requiring requests, and there's a larger number of requests we're approving over time.
34:21 We're still coming up with the eventual final target. Most likely there will be some sort of limits around what you can do as far as compute and data storage, once we have features around that, and with clear offboarding: if you're an enterprise organization that wants to utilize this technology, there should be paid services that let you do it just as easily as you do on the Planetary Computer. But if you have low usage, or if your use case is super environmental sustainability focused and you apply for a grant, you could end up still using a paid service but with us covering those costs through our grants program.
35:03 So we're still figuring that out.
35:06 I don't see this as something that we're trying to turn into a paid service necessarily. I think there are a number of enterprise level services that could end up looking a lot like the Planetary Computer, but really, we want to continue to support usage, particularly for environmental sustainability use cases, through this avenue.
35:28 One of the nice things about our overall approach, since we're so invested in the open source side of things, is that you might have requested an account a while ago and we're very slowly going through them, there's so much to do. But if we're too slow approving your account, you can replicate the Hub in your own Azure subscription. If we're blocking you, or if your needs are vastly beyond what we can provide within this one subscription, then you can go ahead and do your own setup on Azure and get access to our data from your own subscription.
36:02 Right. Because the Blob storage is public, right?
36:05 Exactly. Yep. Okay. Yeah.
36:07 Very nice.
36:08 Maybe the next two things to talk about are the API and the Hub, but I think maybe those would be good to see together. What do you think?
36:17 Yeah, exactly.
36:18 I think I'll let you talk us through some scenarios here, Tom.
36:21 Cool. So in this case, I've logged into the Hub here.
36:26 Before you go further: there's a choice you get when you go there. You've got an account and you click to start your notebook up.
36:35 Yeah.
36:35 It's actually going to fire up a machine, and it gives you four choices, right? Python with four cores and 32 gigs of memory and a Pangeo notebook, R with eight cores and R geospatial, GPU PyTorch, as well as QGIS, which I don't really know what that is. We started it.
36:55 So this is a JupyterHub deployment. JupyterHub is this really nice project, I think it came out of UC Berkeley when they were teaching data science courses to, like, thousands of students at once. Even with conda or whatever, you don't want to be trying to manage a thousand students' conda installations.
37:16 So that's just a nightmare. So they had this kind of cloud based set up where you just log in with your credentials or whatever. You get access to a computer environment to do your homework in that case, or do your geospatial data analysis in this case.
37:31 And you mentioned Pangeo. This is the ecosystem of geoscientists who are trying to do scalable geoscience on the cloud, which Anaconda was involved with. They kind of pioneered this concept of a JupyterHub deployment on Kubernetes that's tied to Dask, so you can easily get a single node compute environment, in this case the Python environment, or multiple nodes, a cluster of machines, to do your analysis using Dask and Dask Gateway. Yeah, let's just say it's a Kubernetes based computing environment. That's cool.
38:07 And I noticed right away the Dask integration, which is good for these massive amounts of data, right? Because it allows you to scale across machines, or stream data when you don't have enough memory to hold it, things like that.
38:19 Yeah. Exactly. So this is a great thing that we get for Python. That said, Dask is Python specific. We do have the other environments, like R.
38:28 If you're doing geospatial in R, where there are a lot of really great libraries, that's an option, but it is unfortunately single node. There's not really a Dask equivalent there, though there's some cool stuff being worked on, like multidplyr and things like that.
38:43 Cool. For people who haven't seen Dask running in a Jupyter notebook, there's the whole cluster visualization and the sort of progress computation stuff. It's super neat to see it go.
38:53 Yeah. So when you're doing these distributed computations, it's really key to have an understanding of what your cluster is up to. It's crucial to have that information there.
39:05 And the example code that you've got there, the cloudless mosaic Sentinel-2 notebook, it just has the basics.
39:14 Create a cluster in Dask, get the client, create four to 24 workers, and then off it goes, right?
39:23 Yeah. Exactly.
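For readers following along at home, the cluster setup in a notebook like that typically looks something like this with Dask Gateway. The 4-to-24 worker range mirrors the example being discussed; the rest is generic Dask Gateway usage rather than the notebook's exact code.

```python
from dask_gateway import GatewayCluster

# Ask the hub's Dask Gateway for a cluster and let it scale between 4 and 24 workers
# depending on how much work is queued (adaptive mode).
cluster = GatewayCluster()
cluster.adapt(minimum=4, maximum=24)

# The client routes Dask work to the cluster and exposes the dashboard link.
client = cluster.get_client()
print(client.dashboard_link)
```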
39:23 What are the limits, and how does that work? As part of getting an account on there, you get access to this cluster.
39:30 So this is the first thing we've talked about today that does require an account. The Hub requires an account, but accessing the STAC API, which we'll see in a second, and even downloading the data does not require an account; you can just do that anonymously. Yeah. And in this case, I think the limit is a thousand or something like that, with some memory limit as well, so that's the one you run into there. You can get quite a bit out of this.
39:55 That's real computing, right?
39:56 Definitely. And in this case, we're using Dask's adaptive mode. So we're saying, right now there's nothing to do, it's just sitting around idly, so I have three or four workers.
40:08 But once I start to actually do a computation that's using Dask, it'll automatically scale up in the background, which is a neat feature of Dask. The basic computation, the problem that we're trying to do here, is we have some area of interest, which I think is over Redmond, Washington, Microsoft headquarters, which we're defining as this exact square area.
40:28 Yeah.
40:28 So I think that's a square Polygon.
40:33 Anyway, we draw that out and then we say, okay, give me all of the Sentinel-2 items that cover that area. So again, back to what we were talking about at the start: if you just had files in Blob storage, that would be extremely difficult to do.
40:48 But thanks to this nice STAC API, which we can connect to at planetarycomputer.microsoft.com, we're able to quickly say, hey, give me all the images from 2016 to 2020 from Sentinel that intersect with our area of interest here. And we're even throwing in a query saying, hey, I only want scenes where the cloud cover is less than 25%, according to the metadata.
41:13 Very likely summer in Seattle.
41:15 Within a second or two, we get back 138 scenes, items, out of, I don't know how many there are in total, but hundreds of thousands, millions of individual STAC items, maybe 20 million, that comprise Sentinel-2. So we're quickly able to filter that down.
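A sketch of what that search looks like with the pystac-client library (assuming a recent release) against the Planetary Computer STAC endpoint. The bounding box is an arbitrary square near Redmond, and the collection name and query syntax shown are the commonly used ones rather than a copy of the notebook.

```python
from pystac_client import Client

catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

# Arbitrary small bounding box around Redmond, WA: (west, south, east, north).
bbox = [-122.17, 47.62, -122.05, 47.72]

search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=bbox,
    datetime="2016-01-01/2020-12-31",
    query={"eo:cloud_cover": {"lt": 25}},  # only scenes under 25% cloud cover
)

items = list(search.items())
print(f"Found {len(items)} matching scenes")
```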
41:39 Next up, we have a bit of signing. So this is the bit we talked about, where you can do all of this anonymously, but in order to actually access the data, we have you sign the items, which basically appends a little token to the URLs. At that point, they can be opened up by any geospatial program, like QGIS. It turns a private Blob storage URL into a temporary public one.
42:03 Yeah.
42:03 Exactly.
42:04 Exactly.
42:04 So you do that.
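Continuing the hypothetical search above, the signing step with the planetary-computer package (pip install planetary-computer) is roughly one line per item; the asset key shown is the red band as it is commonly named in this collection.

```python
import planetary_computer

# Signing appends a short-lived SAS token to each asset URL so the private
# Blob storage files can be read like ordinary public URLs.
signed_items = [planetary_computer.sign(item) for item in items]

print(signed_items[0].assets["B04"].href)  # red band, now a readable tokenized URL
```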
42:05 It's this kind of incidental happenstance that STAC and Dask pair extremely nicely. The way Dask operates is all about working lazily: constructing a task graph of computations, and then at the end of whatever you're doing, computing it all at once. That gives really nice room for optimizations and maximizing parallelization wherever possible. The thing about geospatial is, again, if you didn't have STAC, you'd have to open up these files to understand: where on Earth is it? What latitude, longitude does it cover?
42:43 What? You'd have to open all 20 million files and look at each one's metadata, right?
42:47 Yeah.
42:48 Okay. And in this case, we have, like, 138 times three files, so 400 or so files here. Opening each one of those takes maybe 200, 400, 500 milliseconds. So it's not awful, but it's too slow to really do interactively at the scale of any large number of STAC items.
43:12 That's where STAC comes in. It has all the metadata, so for this TIFF file, this cloud optimized GeoTIFF file that contains the actual data, we know exactly where it is on Earth, what latitude and longitude it covers, what time period it covers, what asset it actually represents, which wavelength. So we're able to very quickly stack these together into this Xarray DataArray.
43:33 In this case, it's fairly small since we've chopped it down. If we left out the filtering, it'd be much, much larger, because these are really large scenes. But anyway, we're able to really quickly generate these data arrays, and then using Dask, using our Dask cluster, we can actually load those, persist those, in distributed memory on all the workers on our cluster. So that's very cool, very easy. It's a few lines of code, a single function call, but it represents years of effort to build up the STAC specification and all the metadata, and then the integration with Dask.
44:07 So it's just a fantastic result that we have. And it's even cool, once you just call .persist() on the Dask array,
44:17 you can just see in the Dask dashboard, like, all these workers firing up and all this data getting processed?
44:25 Yes, exactly. So in this case, since we have that adaptive mode, we'll see additional workers come online here as we start to stress the cluster, and it says, oh, I've got a bunch of unfinished tasks, I should bring some more workers online. That will take either a few seconds, if there's empty space on our cluster, or a bit longer.
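To show the shape of that step, here is a hedged sketch using the stackstac library with the signed items from the earlier sketch. The band names and resolution are illustrative choices, not necessarily the notebook's exact parameters.

```python
import stackstac

# Lazily assemble the signed STAC items into one (time, band, y, x) Dask-backed
# xarray DataArray; nothing is downloaded until a computation is requested.
data = stackstac.stack(
    signed_items,
    assets=["B04", "B03", "B02"],  # red, green, blue
    resolution=10,
)

# Load the chunks into distributed memory across the Dask workers,
# which is when the dashboard lights up.
data = data.persist()
print(data)
```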
44:43 Yeah, I feel like with this, if it just sat there and said it's going to take two minutes and just spun with the little star, the Jupyter star, that would be boring. With Dask it's more like, I want to just watch it go.
44:54 Look at that guy.
44:55 It's kind of like defragging your hard drive back in the old days.
44:59 You watch these little bars go across. It's bizarrely satisfying.
45:03 I will definitely just spend some time sitting here watching it. It's ostensibly monitoring, like, there's a lot of communication here, there shouldn't be. But really, I'm just watching the lines move.
45:13 While that thing is working, let me take a question from the live stream. Samaria asks, can users bring their own data to this sort of processing? Because you've got the data sets that you have, is there a way to bring other research data over?
45:28 Yeah. So the answer now is, yes, but you kind of have to do a lot of work to get it there. For your own data, maybe you do have your own STAC API and database set up and all of that, something that's publicly accessible or that you have a token for, but most users don't already have that. So there's this real divide between the data sets that we provide with our nice STAC API and your own custom data set, which might be a pile of files in Blob storage; you can certainly access it that way, but there's kind of a divide there. That is definitely something we're interested in improving: making user data sets that are private to you feel as nice to work with as our own public data sets.
46:12 Yeah. Another thing that I saw when I was looking through: it said, under the data sets available, if you have your own data and you'd like to contribute, contact us. And that's a slightly different question.
46:22 What they were just asking is one question: I have my own data and I want to use it. This is more like, I work at a university or something, I've got all this data, and I want to make it available to the world. What's the story there?
46:32 We have a backlog of data sets that we're onboarding onto Azure Blob storage and then importing into the API.
46:39 We're still working through that backlog, but we're always on the lookout for good data sets that have real use cases in environmental sustainability. If there's a group that's doing research or building applications that have environmental sustainability impact and they need a data set, that certainly bumps it up on our list. So I would love to hear from anybody that has data sets they're looking to expose publicly, host on the Planetary Computer for anybody to use, and needs a place to host them.
47:05 Yeah. Yeah.
47:05 Very cool. All right, Tom, your graph stopped moving around. It might be done.
47:10 Yeah. So we spent quite a while loading up the data, and that's just how it goes. You spend a bunch of time loading up data, and then once it's in memory, computations tend to be pretty quick. So in this case, we're taking a median over time.
47:23 Is this the median of the image? What exactly is that a median of? Yeah.
47:27 So right now we have a four...
47:28 For, like, a list of numbers, sure, but I'm not sure what it means for an image.
47:31 Yeah. So this is a median over time. Our stack here, our data array, is a four dimensional array, and the dimensions are time, first of all, so we had, like, 138 time slices; wavelengths, so red, green, blue, near infrared, Sentinel captures like ten or twelve wavelengths; and then latitude and longitude. So we took the median over time. The idea here is that stuff like roads and mountains and forests tends not to move over time.
48:03 They're relatively static compared to something like clouds. So again, clouds are always a problem, and once you take the median over time, you kind of get the average image over this entire time period, which turns out to be an image that doesn't have too many clouds in it.
48:19 Yeah. It might have no clouds, because you kind of average them out across all of them, and because you already filtered it down pretty low. Yeah.
48:25 So now we can see a picture of the Seattle area that's a cloud free composite.
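Sketched out, and assuming the hypothetical data array persisted in the earlier sketch, that reduction is a one-liner in xarray; the band selection used for plotting is an illustrative choice.

```python
# Median over the time dimension: clouds move around, terrain doesn't, so the
# per-pixel median comes out close to a cloud-free composite.
median = data.median(dim="time")

# Pull out red/green/blue and render a true-color image.
median.sel(band=["B04", "B03", "B02"]).plot.imshow(rgb="band", robust=True)
```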
48:31 Looks like maybe that's Lake Washington, and you've got Rainier there and all sorts of good stuff.
48:38 Yeah, I'm sure. I actually do not know the geography that well, but I have been looking at lots of pictures. We tend to use this as our example area a lot. Super cool.
48:49 One nice thing here is, again, we're investing heavily in open source and building off of open source, so we have all the power of Xarray to use. Xarray is this very general purpose n-dimensional array computing library that kind of combines the best of NumPy and Pandas. In this case, we can do something like a group by. If you're familiar with Pandas, you're familiar with group bys. We can group by time.month.
49:14 I want to do, like, a monthly mosaic. Maybe I don't want to combine images from January, which might have snow on them, with images from July, which won't have as much. So I can do that.
49:24 We have, like, twelve different images or something like that. Here's what it kind of averaged out to be in February.
49:30 Exactly. And so now we have a stack of images, twelve of them, each representing a median. We have multiple years, and we group all of the ones from January together and take the median of those, and then we get a nice little group of cloud free mosaics here, one for each month. Yeah.
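Again assuming the same hypothetical data array, the monthly grouping looks roughly like this:

```python
# Group every scene by calendar month, then take the per-month median, giving
# twelve mosaics (all Januaries together, all Februaries together, and so on).
monthly = data.groupby("time.month").median(dim="time")

# Facet the true-color composites, one small panel per month.
monthly.sel(band=["B04", "B03", "B02"]).plot.imshow(
    rgb="band", col="month", col_wrap=4, robust=True
)
```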
49:48 Sure enough, there is a little less snow around Rainier in the summer than in the winter, as you would expect, and in the Cascades.
49:53 Yep. Definitely. So that's a fun little introductory example of what the Hub gives you: a single node environment, which alone is quite a bit. You don't have to mess with fighting to get the right set of libraries installed, which can be especially challenging when you're interfacing with C and C++ libraries like GDAL. That environment is all set up, mostly compatible, and should all work for you on a single node. And then if you do have these larger computations, we saw it took a decent while to load the data, even with the fast interconnect between the storage machines and the compute machines in the same Azure region, but you can scale that out on enough machines that your computations complete in a reasonable amount of time.
50:35 Because of the animations, you don't even mind it. It's super cool. So you use the API to really narrow it down from 20 million to 150 or 138 images and then work on it. One thing that I was wondering when I was looking at this is, what libraries come included that I can import?
50:55 If there's something that's not there, maybe I really want to use HTTPX and you only have Requests or whatever, is there a way to get additional libraries and packages in there?
51:03 We do have a focus on geospatial, so we'll have most of that there already: Xarray, Dask, rasterio, and all those things. But if there is something missing, well, these are our containers.
51:16 These are all Docker images built from conda environments. That all comes from this repository, microsoft/planetary-computer-containers. So if you want HTTPX, you add it to the environment and we'll get a new image built and then available from the Planetary Computer. These are public images, they're just on the Microsoft Container Registry. So if you want to use our image, like you don't want to fight with getting a compatible version of, say, PyTorch and libjpeg, not that I was doing that recently, but if you want to avoid that pain, then you can just use our images locally, from your laptop, and you can even connect to our Dask Gateway using our images from your local laptop and do some really fun setups there.
52:02 Yeah, I see, because most of the work would be happening in the clusters, the Dask clusters, not locally anyway.
52:08 Yeah. So all the compute happens there, and then you bring back this little image. That's your plot, your result.
52:13 Okay.
52:14 Very cool. So how do I get mine in here? I see the containers. I see you have the last commit here.
52:20 Yeah. Right now, honestly, the easiest way is to open up an issue and I'll take care of it for you. We just haven't quite gotten the continuous delivery working yet. There's an environment.yml file there that gets, oh, yeah.
52:34 So you see, there's quite a few packages in here already.
52:37 And those are just the ones we explicitly asked for; all their dependencies get pulled into a lock file, and they're built into Docker images. This is building off of projects from Pangeo, that group of geoscientists I mentioned earlier who have been working on this problem for several years now. They have a really nice Dockerized setup, and we're just building off that base image.
52:59 Cool. Based on the Pangeo container. Very cool. Simapari asks, how long is the temporary URL active, the signed URL for Blob storage?
53:08 So that actually depends on whether or not you're authenticated. We have some controls there. The Planetary Computer Hub requires access, but you can also get an API token, which gives you a little bit longer lasting tokens.
53:21 I forget what the actual current expiries are. If you use the Planetary Computer Python library, you just pip install planetary-computer and use that .sign method. It will actually request a token, and then as the token is about to expire, request a new one. So it reuses the token and caches it. But it should be long enough for actually pulling down the data files that we have available, right? Because we're working with smaller cloud optimized formats, there aren't these 100 gig files that you'd have to pull down and need a single SAS token to last a really long time for. You can re-request if you need a new one as it expires, and like I said, that library actually takes care of that logic for you.
54:08 That's cool. Yeah, very nice. All right, guys. Really good work with this. And it seems like it's early days. It seems like it's getting started. There's probably gonna be a lot more going on with this.
54:17 Yeah, for sure. It's really fun.
54:18 I'm gonna go out on a limb and make a big prediction that understanding the climate and climate change is going to be more important, not less important, in the future. So I suspect interest in this is also going to grow.
54:31 In the new report, the IPCC is making some heavy predictions, and within the decade we might reach plus 1.5 Celsius, and we're already in it, we're already feeling the effects. This is the data about our Earth, and it's going to become more and more important as we mitigate and adapt to these effects. So yeah, I agree. I think that's a good prediction.
54:56 Thanks. If we are going to plan our way out of it, plan for the future, and science our way out of it, we're going to need stuff like this, so well done.
55:06 All right.
55:06 I think we're about out of time, so let me ask you both the final two questions here. If you're going to write some Python code, what editor do you use? Rob? VS Code, I suspect, I could guess that, but yeah.
55:17 Actually, I was a big Emacs user, and then when I got this job I switched over to VS Code. It just integrated better with Windows, and then I really got into Pylance and the typing system, doing type annotations and basically having a compiler for the Python code. Instead of having all of the types in my head and having to worry about all that, actually having the type hinting, which was something I wasn't doing a year ago, has drastically improved my development experience.
55:46 It's a huge difference, and I'm all about that as well. People talk about how the type hints are super important for things like mypy and other stuff, and in a lot of cases they can be. But to me, the primary use case is when I hit dot after a thing, I want it to tell me what I can do. If I have to go to the documentation,
56:05 then it's kind of like something is failing.
56:07 I shouldn't need documentation. I should be able to just autocomplete my way through the world, mostly. Totally.
56:12 And I come from, well, I was a Scala developer for, you know, about six years.
56:18 So I was used to a very heavily typed system, and when I got away from that with Python, I was like, you know what, I like that there are no types. But I feel like the Python ecosystem has really hit that sweet spot of introducing just enough typing, which is really great.
56:32 And then the inference flies along for the rest of the program.
56:35 Yeah, totally.
56:36 All right, Tom, how about you VS Code as well for most stuff and then Emacs for Mega Magic the Git client and then a bit of them every now and then.
56:45 Right on. Very cool. Alright, then the other question is for either of you there's like a cool notable PyPI or condo package that I came across. This. It was amazing. People should know about it. Any idea how you going? Sure.
56:59 I'll go Seaborn. It's plotting library from Michael Wakem. Built on top of Matplotlib. It's just really great for exploratory data analysis easily create these great visualizations for mostly tabular data sets, but not exclusively.
57:15 That's interesting. I know a Seabourn new Matplotlib. I didn't realize that Seaborn was like, let's make Matplotlib easier.
57:21 Yeah, essentially for this very specific use case.
57:24 Matplotlib is extremely flexible, but there's a lot of boilerplate and Seaborn just wraps that all up nicely.
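As a tiny illustration of that point about wrapping up the boilerplate, here is a sketch using one of Seaborn's bundled example data sets:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One of Seaborn's built-in example tables (downloaded on first use).
tips = sns.load_dataset("tips")

# A single call handles the colors, legend, and axis labels that would take a
# fair amount of raw Matplotlib code to set up.
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")
plt.show()
```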
57:30 Yeah, super cool. All right. Well, thank you so much for being here. Final call to action. People wanna get started with Microsoft Planetary Computer. Maybe they've got some climate research.
57:38 What do they do? planetarycomputer.microsoft.com, that'll get you anywhere you need to go, and then if you want an account, it's /account/request, I believe. Yeah.
57:47 There's a big request access button you can click. That's awesome.
57:51 Exactly. All right.
57:52 Rob, Tom, thank you for being here, and thanks for all the good work.
57:55 Thanks for having us. It's great.
57:56 Awesome.
57:57 Thanks so much.
57:58 Bye.
57:59 This has been another episode of Talk Python to Me. Our guests in this episode were Rob Emanuel and Tom Augspurger. It's been brought to you by Shortcut, formerly Clubhouse.IO, us over at Talk Python Training, and the transcripts are brought to you by 'Assembly AI'.
58:15 Choose Shortcut, formerly Clubhouse.IO, for tracking all of your project's work, because you shouldn't have to project manage your project management. Visit 'talkpython.fm/shortcut'. Do you need a great automatic speech-to-text API?
58:29 Get human level accuracy in just a few lines of code?
58:32 Visit 'talkpython.fm/assemblyAI'. When you're ready to level up your Python, we have one of the largest catalogs of Python video courses over at Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async, and best of all, there's not a subscription in sight. Check it out for yourself at 'training.talkpython.fm'. Be sure to subscribe to the show: open your favorite podcast app and search for Python. We should be right at the top.
58:57 You can also find the itunes feed at /itunes, the Google Play feed at /Play and the Direct RSS feed at /rss on Talk Python FM.
59:06 We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air. Be sure to subscribe to our YouTube channel at 'talkpython.fm/youtube'.
59:18 This is your host, Michael Kennedy. Thanks so much for listening.
59:21 I really appreciate it.
59:22 Now get out there and write some Python code.