Pangeo Data Ecosystem
Episode Deep Dive
Guests Introduction and Background
Joe Hamman is the Technology Director at CarbonPlan and also a scientist at the National Center for Atmospheric Research (NCAR). He comes from a civil engineering and computational hydrology background, where he transitioned from Perl, Fortran, and other older tools to Python. Joe has helped bring modern Python tools and open-source contributions into his research lab and broader scientific communities, with a special focus on hydrology, forest management, and climate data analysis.
Ryan Abernathey is a professor at Columbia University’s Lamont-Doherty Earth Observatory, where he leads a research lab specializing in computational oceanography. He has a longstanding background in programming, dating back to writing BASIC code at age seven. Today, Ryan uses Python for analyzing massive Earth science datasets (e.g., satellite data for ocean surface temperatures) and for open-source work, especially around Xarray and the broader Pangeo ecosystem.
What to Know If You're New to Python
Below are some quick points to help you get more out of this episode. These will prepare you to follow the discussion on powerful Python libraries and data-focused workflows:
- Make sure you understand basic Python scripting, functions, and data structures (lists, dicts, etc.).
- Experiment with data libraries like pandas to see how Python handles tabular data.
- Try out Jupyter notebooks for an interactive, iterative coding approach that many scientists use.
Key Points and Takeaways
Pangeo as a Community-Driven Ecosystem
This project started with a small workshop on scaling climate and weather data analytics with Python. The community has grown into a broad collaboration that brings scientists, software developers, and data providers together to solve the massive data challenges in Earth sciences. The Pangeo ecosystem centers on open-source projects (like Xarray and Dask) and focuses on reproducibility, big-data workflows, and cloud computing.
- Links and Tools:
- Pangeo Project
- Pangeo Discourse Forum (the main place to ask questions and connect with the community)
- Pangeo Gallery (Jupyter notebooks demonstrating real-world use cases)
Xarray: Multidimensional Arrays with Labels
Often dubbed “multi-dimensional pandas,” Xarray allows you to handle large collections of netCDF or related scientific files as one coherent dataset. It supports labeled dimensions (e.g., time, latitude, longitude), understands metadata, and can open multiple files as if they were one big dataset. This eliminates repetitive code for reading files in loops and handling manual indexing, greatly simplifying research workflows.
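To make the labeled-dimension idea concrete, here is a minimal sketch. It assumes xarray, pandas, and NumPy are installed, and it uses small synthetic arrays in place of real netCDF files. The `make_month` helper, the variable name `sst`, and all values are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_month(start, value):
    """Hypothetical stand-in for opening one file with xr.open_dataset."""
    time = pd.date_range(start, periods=3, freq="D")
    lat = np.array([-10.0, 0.0, 10.0])
    lon = np.array([100.0, 110.0])
    data = np.full((time.size, lat.size, lon.size), value)
    return xr.Dataset(
        {"sst": (("time", "lat", "lon"), data, {"units": "degC"})},
        coords={"time": time, "lat": lat, "lon": lon},
    )


january = make_month("2020-01-01", 20.0)
february = make_month("2020-02-01", 21.0)

# Combine the pieces along the time dimension into one dataset.
# With real files on disk, xr.open_mfdataset can do this step directly.
combined = xr.concat([january, february], dim="time")

# Select by label instead of by integer position.
february_only = combined["sst"].sel(time="2020-02")
point = combined["sst"].sel(lat=0.0, lon=110.0)

# Reduce over a named dimension rather than an axis number.
point_mean = point.mean("time")
print(float(point_mean))
```

The selections name the dimension (`time`, `lat`, `lon`) rather than relying on axis order, which is what removes most of the manual indexing described above.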
Dask for Scalable Parallel Computing
Dask provides on-the-fly chunking and parallelization for Python data structures, including dataframes, arrays, and Xarray objects. This means you can write code almost identical to the single-machine version, but under the hood, Dask will handle distributing work across cores or even multiple servers. It’s not “pure magic” (understanding chunk sizes and data access patterns is still crucial), but it unlocks workflows for datasets far larger than RAM.
- Links and Tools:
- Dask
- Coiled (commercial Dask hosting solution)
- Fundamentals of Dask (in-depth course on how to use Dask effectively)
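As a rough illustration of the chunking idea, the sketch below builds a small Dask array and runs a NumPy-style reduction on it. It assumes Dask (with its array support) is installed. The array shape and chunk sizes are arbitrary choices for the example, not recommendations.

```python
import dask.array as da

# A 2000 x 2000 array split into four 1000 x 1000 chunks.
# Nothing is computed yet; Dask only records how to build each chunk.
x = da.ones((2000, 2000), chunks=(1000, 1000))

# NumPy-like syntax. This builds a task graph rather than a result.
column_means = (x * 2).mean(axis=0)

# compute() executes the graph, working through the chunks in parallel
# on whatever scheduler is active (local threads by default).
result = column_means.compute()

print(x.numblocks)   # number of chunks along each axis
print(result.shape)
```

The chunk size is the main tuning knob here: chunks that are too small add scheduling overhead, while chunks that are too large may not fit in memory.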
Cloud Object Storage as the New Data Portal
Traditional “data portals” often involved custom websites for file downloads. Pangeo encourages storing large scientific datasets (e.g., satellite or climate data) in object storage like Amazon S3, Google Cloud Storage, or other vendors. This approach decouples the storage from a single server and allows easy partial reads, integration with tools like Dask, and more flexible scaling for diverse users.
- Links and Tools:
- Amazon S3
- Google Cloud Storage
- FSSpec (Python library for uniform file access across various backends)
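The partial-read idea can be sketched with fsspec's built-in in-memory filesystem, used here as a stand-in for a real object store so the example runs without credentials or a network connection. The bucket and object names are hypothetical. With the s3fs or gcsfs backends installed, the same calls would apply to `s3` or `gcs` filesystems.

```python
import fsspec

# "memory" is an in-memory filesystem that ships with fsspec.
# It stands in for an object store in this sketch.
fs = fsspec.filesystem("memory")
path = "/demo-bucket/sst.bin"  # hypothetical bucket and object name

# Write a 100-byte object.
with fs.open(path, "wb") as f:
    f.write(bytes(range(100)))

# Partial read: fetch only bytes 10 through 14 instead of the whole object.
chunk = fs.cat_file(path, start=10, end=15)

size = fs.size(path)
print(size, list(chunk))
```

Against a real object store, a range read like this becomes a request for just those bytes, which is what lets tools such as Dask pull individual chunks of a large dataset.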
Open Science Requires More Than Uploading Code to GitHub
Both guests emphasized that open science is about collaborative workflows, reproducible data processing, and building on each other’s tools. Simply uploading code with a license is insufficient; you also need accessible data, environment details, documentation, and an active community to help others adopt and extend your work. This broader perspective drives projects like Pangeo, where packages like Xarray, Dask, and Jupyter are co-evolving to meet real research needs.
- Links and Tools:
- GitHub (host your code but think beyond a simple repo)
- Pangeo Discourse Forum (active community for open-science questions)
Transition from Spaghetti Scripts to Collaborative Frameworks
Many labs were using scattered scripts in Perl, C, Fortran, or shell. Shifting to Python, with tools like Xarray and practices like conda environments, has dramatically accelerated research. Researchers can prototype faster, share knowledge, and scale out to HPC or cloud-based clusters with fewer custom workflows.
- Tools Mentioned:
- conda (environment management for scientific Python)
- Fortran/Perl/C to Python migrations (various references in the transcript; no official link)
Jupyter Notebooks in the Cloud
An essential part of Pangeo is running Jupyter or JupyterLab on cloud platforms, close to large datasets. Instead of downloading massive data, you spin up notebooks on Google Cloud, AWS, or HPC systems, saving time and storage costs. JupyterLite is an interesting twist for small data or demos (running Python in the browser), but the real large-scale number crunching happens when you connect notebooks to Dask clusters running on remote infrastructure.
Pangeo Forge for ETL (Extract-Transform-Load) Pipelines
Many open-source ETL tools are geared toward business-style data rather than multi-dimensional science data. Pangeo Forge aims to address this gap by automating the conversion of diverse data sources into cloud-friendly, chunked, and metadata-rich formats that Xarray and Dask can easily load. This improves accessibility for large, disparate datasets like climate model outputs and satellite imagery.
- Links and Tools:
- Pangeo Forge (GitHub organization)
- Pangeo Forge Recipes
Big Data in Earth Sciences: Petabytes of Climate Data
Modern research in climate science, oceanography, and meteorology deals with massive data from satellites (for example, sea surface temperature, or SST), high-resolution climate models, and new NASA missions. The standard approach is to keep these data on HPC or cloud storage. Tools like Dask and Xarray let scientists handle these petabyte-scale datasets piecewise, retrieving only what they need without rewriting custom code for each study.
- Examples:
- NASA’s upcoming Surface Water and Ocean Topography (SWOT) mission data
- CMIP6 (Coupled Model Intercomparison Project) datasets
Collaboration Between Academia and Industry
Researchers are noticing that businesses are increasingly adopting climate data for both adaptation (planning for changing climate) and mitigation (reducing emissions). Python, with Pangeo’s open-source ecosystem, helps unify these efforts. This cross-pollination has improved the tooling and driven new features like ephemeral Dask clusters and integrated data catalogs.
- Links and Tools:
- CarbonPlan (Joe’s organization working on open-data climate solutions)
- Azure, AWS, and GCP climate data marketplaces (various references)
Interesting Quotes and Stories
On the shift to Python in research:
Joe: “We were doing lots of computer things…my PhD advisor said, ‘We want to do Python stuff. You should be the guinea pig student to bring our group into the modern era.’”
On open science in practice:
Ryan: “Just putting a repo up on GitHub has essentially no impact if no one can run it. We want to accelerate velocity of discovery, and that means documentation, data, everything.”
On the power of Jupyter as a Trojan horse for cloud:
Ryan: “The hardest way to run Jupyter is probably on your own machine. Once you’re on the cloud, it’s easy—and the data is already there.”
Key Definitions and Terms
- Xarray: A Python library that provides labeled multi-dimensional arrays and datasets, making it simpler to handle netCDF files, satellite data, and other high-dimensional data.
- Dask: A parallel computing library for Python that scales from a single core to multi-node clusters, often used for big-data analysis in the Python ecosystem.
- Object Storage: A form of data storage (e.g., AWS S3, Google Cloud Storage) where data is stored as objects rather than in a traditional filesystem or database, enabling large-scale data sharing and partial retrieval.
- ETL (Extract, Transform, Load): A process for collecting data from various sources, transforming it into a usable format, and loading it into a destination system.
Learning Resources
Below are some resources to help you dive deeper into Python and data workflows discussed in this episode:
- Python for Absolute Beginners — A foundational course to help you learn Python from square one.
- Fundamentals of Dask — A detailed look at parallelizing workflows with Dask, which is central to Pangeo.
- Xarray Documentation — Official docs showing how to use Xarray.
- Pangeo Discourse Forum — The main community hub for discussing open science, big data, and climate analytics.
Overall Takeaway
The Pangeo ecosystem stands at the intersection of big-data analytics, cloud computing, and open science, empowering climate researchers and scientists in other data-heavy fields. By leveraging Python tools like Xarray, Dask, and Jupyter, teams can analyze massive datasets without reinventing the wheel or building unwieldy, one-off scripts. Pangeo’s core ethos of community collaboration, reproducibility, and a shared commitment to transparency offers a scalable model for any data-intensive field, inspiring both scientists and developers to build tools that are robust, open, and truly impactful.
Links from the show
Joe Hamman: @HammanHydro
Pangeo: pangeo.io
xarray: xarray.dev
Pangeo Forge: pangeo-forge.org
fsspec: filesystem-spec.readthedocs.io
Step-by-Step Guide to Building a Big Data Portal: medium.com
Coiled: coiled.io
Pangeo Gallery: gallery.pangeo.io
Pangeo Quickstart: pangeo.io
JupyterLite: jupyterlite.readthedocs.io
Jupyter: jupyter.org
Pangeo Packages: pangeo.io
Pangeo Discourse: discourse.pangeo.io
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm