Learn Python with Talk Python's 270 hours of courses

Microsoft Planetary Computer

Episode #334, published Sat, Sep 18, 2021, recorded Thu, Sep 9, 2021

On this episode, Rob Emanuele and Tom Augspurger join us to talk about building and running Microsoft's Planetary Computer project. This project is dedicated to providing climate and environmental data, along with the compute necessary to process it, with the mission of helping us all understand climate change better. It combines multiple petabytes of data with a powerful hosted JupyterLab notebook environment for processing it.

Watch this episode on YouTube
Play on YouTube
Watch the live stream version

Episode Deep Dive

Guests Introduction and Background

Rob Emanuele and Tom Augspurger are both seasoned Python developers at Microsoft, working on the Planetary Computer project. Rob has a background in mathematics, initially coding in PowerBuilder and C extensions for Python, before diving into the open-source ecosystem. Tom began using Python for econometric research, contributing to open-source libraries like pandas and Dask while at Anaconda. They joined Microsoft’s small team focused on the Planetary Computer to combine environmental and climate data with modern cloud computing, enabling large-scale analytics for sustainability and climate research.

What to Know If You're New to Python

This conversation is heavy on data science and cloud computing, but don’t let that discourage you if you’re just starting out. Here are a few pointers:

  • Understand fundamental data structures like dictionaries and lists—this will help you parse and manage datasets.
  • Familiarize yourself with basic libraries such as requests (for HTTP), pandas (for tabular data), and possibly Dask (for parallel workloads).
  • Know that Jupyter and notebooks are common for data exploration; you’ll see references to them throughout the conversation.
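To make the first pointer concrete, here is a tiny sketch of using dictionaries and lists to parse a dataset record. The record below is invented for illustration (its field names are not from any real API), but it is similar in spirit to the JSON metadata you get back from geospatial catalogs:

```python
import json

# A hypothetical satellite-scene record (field names are made up for
# this example) -- similar in shape to JSON metadata from data APIs.
record = json.loads("""
{
  "id": "scene-001",
  "datetime": "2021-09-09T00:00:00Z",
  "cloud_cover": 12.5,
  "bands": ["red", "green", "blue", "nir"]
}
""")

# Dictionaries give you named access to individual fields...
print(record["id"], record["cloud_cover"])

# ...and lists hold ordered collections, like the available bands.
for band in record["bands"]:
    print(band)
```

From here, a library like pandas can turn many such records into a table for analysis.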

Key Points and Takeaways

  1. Microsoft Planetary Computer: Core Mission and Components The Planetary Computer is a cloud-based platform at Microsoft that combines petabytes of open Earth observation data with powerful compute resources in a hosted Jupyter environment. By gathering data on topics such as climate and biodiversity, it allows researchers and organizations to process and analyze massive datasets without worrying about setting up their own infrastructure.
  2. Petabyte-Scale Data and Storage The Planetary Computer stores many petabytes of satellite imagery and environmental data in Azure Blob Storage. A key part of the project is providing a scalable way to host data in cloud-friendly formats (like Cloud-Optimized GeoTIFF and Zarr) so that researchers can efficiently query exactly the slices of data they need.
  3. API-Driven Access through STAC They use STAC (SpatioTemporal Asset Catalog) to index and serve data. Rather than downloading huge files just to figure out where they cover, STAC-based APIs make it quick to search for the region or timeframe needed. This ensures a user only processes relevant images or records.
  4. Hosted JupyterLab Environment (The “Hub”) The Planetary Computer includes a JupyterHub-based environment, so you can log in and immediately start coding against the data. Users can scale their computations via Dask clusters running under the hood—no need to spin up Kubernetes directly.
  5. Data Processing with Dask and Xarray Large workloads become manageable with libraries such as Xarray, which uses Dask to process arrays in parallel. You can load thousands of files, lazily evaluate them, and then produce transformations (e.g., cloudless mosaics) by simply calling .persist() or .compute() to offload tasks to the cluster.
  6. Geospatial Data Catalog Highlights Key datasets include Sentinel-2 imagery (10-meter resolution, updated every five days), Landsat 8 (30-meter resolution), DayMet (gridded weather), and more. These openly licensed remote sensing datasets are pivotal for research on land cover, climate patterns, and ecological changes.
  7. Event-Driven Ingest and Azure Batch New satellite imagery arrives daily, and the Planetary Computer pipeline automatically ingests the data in parallel using Azure Batch. Tools like Event Grid can trigger tasks on every new blob, meaning fresh imagery gets indexed and becomes available quickly for climate scientists.
  8. Collaboration with Partners and Grants Microsoft’s AI for Earth grants program supports organizations in building new sustainability-focused applications. Some partners, like CarbonPlan, create public tools that leverage the data on Planetary Computer for mapping wildfire risk or land cover changes.
  9. Open-Source Ecosystem and Community Focus Much of the ETL and metadata extraction pipeline is open source, encouraging community contributions. It’s important that the data remain truly “open,” allowing external projects to tap into Planetary Computer data or replicate its architecture in their own Azure subscriptions.
  10. Future of Sustainability and Climate Research The conversation emphasizes that climate change research and large-scale environmental data science will become increasingly critical. Platforms like Microsoft’s Planetary Computer aim to empower specialists, scientists, and even curious developers to dive into these crucial datasets without immense startup costs.
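The STAC-based search described in point 3 can be sketched by building a search payload as a plain dictionary. The endpoint below is the Planetary Computer's public STAC API; the collection ID, bounding box, dates, and cloud-cover filter are illustrative values, not taken from the episode:

```python
# Sketch of a STAC API search payload. The endpoint is the Planetary
# Computer's public STAC API; the bbox, dates, and cloud-cover cutoff
# are example values. In practice you would POST this with requests,
# or use the pystac-client library, which wraps the same API.
search_url = "https://planetarycomputer.microsoft.com/api/stac/v1/search"

search_body = {
    "collections": ["sentinel-2-l2a"],         # Sentinel-2 Level-2A imagery
    "bbox": [-122.27, 47.54, -121.96, 47.75],  # west, south, east, north
    "datetime": "2021-06-01/2021-09-01",       # time range of interest
    "query": {"eo:cloud_cover": {"lt": 20}},   # mostly cloud-free scenes
    "limit": 10,
}

# To execute (requires network access):
# import requests
# items = requests.post(search_url, json=search_body).json()["features"]
```

Because the catalog is searched by region and time up front, only the matching scenes are ever read, rather than downloading whole archives to inspect their coverage.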

Interesting Quotes and Stories

  • Rob on discovering Python: “I come from a math background... I actually credit Python… setting me on a better development path for sure.”
  • Tom on Dask dashboards: “It’s kind of like defragging your hard drive... you watch these little bars go across… it’s bizarrely satisfying.”
  • Future-looking statement: “We’re already in it. We’re already feeling the effects... This is the data about our Earth, and it’s going to become more and more important as we mitigate and adapt…”

Key Definitions and Terms

  • Planetary Computer: Microsoft’s cloud-based platform combining environmental datasets with large-scale compute resources.
  • STAC (SpatioTemporal Asset Catalog): A specification for describing and searching geospatial data.
  • Dask: A Python library for parallel computing, often used with Xarray and pandas to process large amounts of data efficiently.
  • Cloud-Optimized GeoTIFF: A TIFF-based raster format designed for efficient reading in cloud environments, without downloading an entire file.
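The "cloudless mosaic" workflow from point 5 can be illustrated in miniature: stack several observations of the same pixels over time, mask out cloudy values, and take a per-pixel median. At Planetary Computer scale this pattern runs through Xarray backed by Dask (finishing with .persist() or .compute()); the NumPy-only toy below, with fabricated random data, keeps the idea self-contained:

```python
import numpy as np

# Toy cloudless-mosaic sketch using fake data. Real workflows would
# load satellite bands via Xarray/Dask; the masking-and-median logic
# is the same.
rng = np.random.default_rng(0)

# 6 timestamps of a 4x4 single-band image (fake reflectance, 0-1).
stack = rng.random((6, 4, 4))

# Fake cloud mask: True marks a cloudy pixel at that timestamp.
clouds = rng.random((6, 4, 4)) > 0.7

# Replace cloudy pixels with NaN, then take the median over time,
# ignoring NaNs -- each output pixel uses only its clear observations.
masked = np.where(clouds, np.nan, stack)
mosaic = np.nanmedian(masked, axis=0)

print(mosaic.shape)  # (4, 4)
```

Swapping NumPy for Dask arrays makes the same computation lazy: nothing is read or computed until .compute() is called, which is what lets the hosted clusters chew through thousands of files in parallel.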

Learning Resources

If you want to improve your Python fundamentals or need a refresher on concepts mentioned during the episode, here are some courses that might help you go deeper.

Overall Takeaway

The Microsoft Planetary Computer aims to make massive amounts of Earth science and climate data accessible to anyone, from seasoned researchers to hobbyist coders. By leveraging open-source standards like STAC, plus flexible Python libraries including Dask and Xarray, the platform provides a seamless way to explore petabyte-scale data—no specialized infrastructure expertise required. This forward-looking initiative not only unlocks critical information needed to understand and address climate change but also showcases how Python and the cloud can accelerate real-world scientific discovery.

Links from the show

Rob Emanuele on Twitter: @lossyrob
Tom Augspurger on Twitter: @TomAugspurger

Video of the example walkthrough by Tom if you want to follow along: youtube.com?t=2360

Planetary computer: planetarycomputer.microsoft.com
Applications in public: planetarycomputer.microsoft.com

Microsoft's Environmental Commitments
Carbon negative: blogs.microsoft.com
Report: microsoft.com

AI for Earth grants: microsoft.com
Python SDK: github.com
Planetary computer containers: github.com
IPCC Climate Report: ipcc.ch
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
