Microsoft Planetary Computer
Episode Deep Dive
Guests Introduction and Background
Rob Emanuele and Tom Augspurger are both seasoned Python developers at Microsoft, working on the Planetary Computer project. Rob has a background in mathematics and initially coded in PowerBuilder and C extensions for Python before diving into the open-source ecosystem. Tom began using Python for econometric research and contributed to open-source libraries like pandas and Dask while at Anaconda. They joined Microsoft’s small Planetary Computer team to combine environmental and climate data with modern cloud computing, enabling large-scale analytics for sustainability and climate research.
What to Know If You're New to Python
This conversation is heavy on data science and cloud computing, but don’t let that discourage you if you’re just starting out. Here are a few pointers:
- Understand fundamental data structures like dictionaries and lists—this will help you parse and manage datasets.
- Familiarize yourself with basic libraries such as requests (for HTTP), pandas (for tabular data), and possibly Dask (for parallel workloads).
- Know that Jupyter and notebooks are common for data exploration; you’ll see references to them throughout the conversation.
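As a toy illustration of those fundamentals, here is a short sketch (the JSON payload is made up for this example) that parses an API-style response using only dictionaries, lists, and the standard library:

```python
import json

# A made-up JSON payload shaped like a typical API response:
# a list of records, each a dictionary of fields.
payload = """
{
  "items": [
    {"station": "A", "temp_c": 21.5},
    {"station": "B", "temp_c": 19.0}
  ]
}
"""

data = json.loads(payload)               # str -> dict
records = data["items"]                  # dict lookup -> list of dicts
temps = [r["temp_c"] for r in records]   # list comprehension over records

print(sum(temps) / len(temps))           # average temperature: 20.25
```

With pandas, that same list of dicts could be loaded directly via pandas.DataFrame(records) for tabular analysis.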
Key Points and Takeaways
- Microsoft Planetary Computer: Core Mission and Components
The Planetary Computer is a cloud-based platform at Microsoft that combines petabytes of open Earth observation data with powerful compute resources in a hosted Jupyter environment. By gathering data on topics such as climate and biodiversity, it allows researchers and organizations to process and analyze massive datasets without worrying about setting up their own infrastructure.
- Petabyte-Scale Data and Storage
The Planetary Computer stores many petabytes of satellite imagery and environmental data in Azure Blob Storage. A key part of the project is providing a scalable way to host data in cloud-friendly formats (like Cloud-Optimized GeoTIFF and Zarr) so that researchers can efficiently query exactly the slices of data they need.
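Cloud-friendly formats work because object storage supports HTTP range requests: a client can fetch just the bytes for one tile instead of the whole file. The following stdlib-only sketch (the "object store" is simulated with an in-memory buffer) illustrates the idea:

```python
import io

# Simulate a large object in blob storage: a 1 MiB blob where each
# "tile" is a fixed-size 1024-byte block.
TILE_SIZE = 1024
blob = io.BytesIO(bytes(range(256)) * 4096)  # 1 MiB object

def read_tile(store: io.BytesIO, tile_index: int) -> bytes:
    """Read only one tile via a seek + bounded read, the way a COG
    reader issues an HTTP Range request for a single window."""
    store.seek(tile_index * TILE_SIZE)
    return store.read(TILE_SIZE)

# Fetch tile 3 without touching the rest of the object.
tile = read_tile(blob, 3)
print(len(tile))  # 1024 bytes transferred, not 1 MiB
```

In practice, libraries like rasterio perform these range reads against Cloud-Optimized GeoTIFFs, and Zarr achieves the same effect by storing each chunk as its own object.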
- API-Driven Access through STAC
They use STAC (SpatioTemporal Asset Catalog) to index and serve data. Rather than downloading huge files just to figure out what area they cover, STAC-based APIs make it quick to search for the region or timeframe needed, so a user only processes relevant images or records.
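A STAC API search is just an HTTP request against the catalog's /search endpoint. As a sketch (the collection name, bounding box, and dates are illustrative), here is how a query payload might be built following the STAC Item Search spec:

```python
import json

# Illustrative search: Sentinel-2 scenes over a small bounding box
# (west, south, east, north) for one month. Field names follow the
# STAC API "Item Search" specification.
search_body = {
    "collections": ["sentinel-2-l2a"],
    "bbox": [-122.5, 47.4, -122.2, 47.7],
    "datetime": "2021-06-01/2021-06-30",
    "limit": 10,
}

# POSTing this to <stac-root>/search returns a GeoJSON FeatureCollection
# of matching items, e.g. with the requests library:
#   requests.post(f"{stac_root}/search", json=search_body)
print(json.dumps(search_body, indent=2))
```

In practice the pystac-client library wraps this workflow, opening the catalog root and paging through search results for you.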
- Hosted JupyterLab Environment (The “Hub”)
The Planetary Computer includes a JupyterHub-based environment, so you can log in and immediately start coding against the data. Users can scale their computations via Dask clusters running under the hood—no need to spin up Kubernetes directly.
- Data Processing with Dask and Xarray
Large workloads become manageable with libraries such as Xarray, which uses Dask to process arrays in parallel. You can load thousands of files, evaluate them lazily, and then produce transformations (e.g., cloudless mosaics) by simply calling .persist() or .compute() to offload tasks to the cluster.
- Geospatial Data Catalog Highlights
Key datasets include Sentinel-2 imagery (10-meter resolution, updated every five days), Landsat 8 (30-meter resolution), DayMet (gridded weather), and more. These openly licensed remote sensing datasets are pivotal for research on land cover, climate patterns, and ecological change.
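Dask's key trick, mentioned above, is lazy evaluation: operations build a task graph, and nothing executes until you call .compute() (or .persist() to keep results on the cluster). This toy stand-in (not Dask's actual implementation) shows the shape of that pattern:

```python
# Toy lazy-evaluation wrapper mimicking the Dask pattern: building a
# pipeline only records work; .compute() is what actually runs it.
class Lazy:
    def __init__(self, func):
        self.func = func

    def map(self, op):
        # Chain another step without executing anything yet.
        return Lazy(lambda: [op(x) for x in self.func()])

    def compute(self):
        # Only now does the whole pipeline execute.
        return self.func()

# Build a pipeline over "scenes" (here just numbers): nothing runs yet.
pipeline = Lazy(lambda: [1, 2, 3]).map(lambda x: x * 10)

result = pipeline.compute()
print(result)  # [10, 20, 30]
```

With real Dask and Xarray, the same idea scales up: opening many files builds the task graph lazily, and .compute() fans the work out across the cluster.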
- Event-Driven Ingest and Azure Batch
New satellite imagery arrives daily, and the Planetary Computer pipeline automatically ingests the data in parallel using Azure Batch. Tools like Event Grid can trigger tasks on every new blob, meaning fresh imagery gets indexed and becomes available quickly for climate scientists.
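An Event Grid blob-created notification arrives as a small JSON event, so the ingest pipeline only has to filter for the right event type and hand the blob URL to a worker. A minimal sketch of such a handler (the event shape is simplified from Azure's published blob-storage event schema; queuing to a list stands in for a hypothetical Azure Batch submission helper):

```python
def handle_event(event: dict, tasks: list) -> None:
    """Dispatch a single Event Grid notification. Appending to `tasks`
    stands in for submitting an ingest job to Azure Batch."""
    if event.get("eventType") != "Microsoft.Storage.BlobCreated":
        return  # ignore deletes, metadata updates, etc.
    blob_url = event["data"]["url"]
    if blob_url.endswith(".tif"):  # only ingest new imagery
        tasks.append(blob_url)

# Simplified sample events, modeled on the blob-storage event schema.
events = [
    {"eventType": "Microsoft.Storage.BlobCreated",
     "data": {"url": "https://example.blob.core.windows.net/scenes/a.tif"}},
    {"eventType": "Microsoft.Storage.BlobDeleted",
     "data": {"url": "https://example.blob.core.windows.net/scenes/b.tif"}},
]

queued: list = []
for ev in events:
    handle_event(ev, queued)
print(queued)  # only the newly created .tif is queued
```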
- Collaboration with Partners and Grants
Microsoft’s AI for Earth grants program supports organizations in building new sustainability-focused applications. Some partners, like CarbonPlan, create public tools that leverage the data on Planetary Computer for mapping wildfire risk or land cover changes.
- Open-Source Ecosystem and Community Focus
Much of the ETL and metadata extraction pipeline is open source, encouraging community contributions. It’s important that the data remain truly “open,” allowing external projects to tap into Planetary Computer data or replicate its architecture in their own Azure subscriptions.
- Future of Sustainability and Climate Research
The conversation emphasizes that climate change research and large-scale environmental data science will become increasingly critical. Platforms like Microsoft’s Planetary Computer aim to empower specialists, scientists, and even curious developers to dive into these crucial datasets without immense startup costs.
Interesting Quotes and Stories
- Rob on discovering Python: “I come from a math background... I actually credit Python… setting me on a better development path for sure.”
- Tom on Dask dashboards: “It’s kind of like defragging your hard drive... you watch these little bars go across… it’s bizarrely satisfying.”
- Future-looking statement: “We’re already in it. We’re already feeling the effects... This is the data about our Earth, and it’s going to become more and more important as we mitigate and adapt…”
Key Definitions and Terms
- Planetary Computer: Microsoft’s cloud-based platform combining environmental datasets with large-scale compute resources.
- STAC (SpatioTemporal Asset Catalog): A specification for describing and searching geospatial data.
- Dask: A Python library for parallel computing, often used with Xarray and pandas to process large amounts of data efficiently.
- Cloud-Optimized GeoTIFF: A TIFF-based raster format designed for efficient reading in cloud environments, without downloading an entire file.
Learning Resources
If you want to improve your Python fundamentals or need a refresher on concepts mentioned during the episode, here are some courses that might help you go deeper.
- Python for Absolute Beginners: A solid entry point into Python if you’re new or returning after a long break.
- Fundamentals of Dask: If you want to learn how Dask helps process large datasets and leverage parallel computing in Python.
Overall Takeaway
The Microsoft Planetary Computer aims to make massive amounts of Earth science and climate data accessible to anyone, from seasoned researchers to hobbyist coders. By leveraging open-source standards like STAC, plus flexible Python libraries including Dask and Xarray, the platform provides a seamless way to explore petabyte-scale data—no specialized infrastructure expertise required. This forward-looking initiative not only unlocks critical information needed to understand and address climate change but also showcases how Python and the cloud can accelerate real-world scientific discovery.
Links from the show
Tom Augspurger on Twitter: @TomAugspurger
Video of the example walkthrough by Tom, if you want to follow along: youtube.com?t=2360
Planetary computer: planetarycomputer.microsoft.com
Applications in public: planetarycomputer.microsoft.com
Microsoft's Environmental Commitments
Carbon negative: blogs.microsoft.com
Report: microsoft.com
AI for Earth grants: microsoft.com
Python SDK: github.com
Planetary computer containers: github.com
IPCC Climate Report: ipcc.ch
Episode transcripts: talkpython.fm
--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy