Pangeo Data Ecosystem

Episode #361, published Sat, Apr 16, 2022, recorded Fri, Apr 1, 2022

Python's place in climate research is an important one. In this episode, you'll meet Joe Hamman and Ryan Abernathey, two researchers using powerful cloud computing systems and Python to understand how the world around us is changing. They are both involved in the Pangeo project which brings a great set of tools for scaling complex compute with Python.

Episode Deep Dive

Guests Introduction and Background

Joe Hamman is the Technology Director at CarbonPlan and also a scientist at the National Center for Atmospheric Research (NCAR). He comes from a civil engineering and computational hydrology background, where he transitioned from Perl, Fortran, and other older tools to Python. Joe has helped bring modern Python tools and open-source contributions into his research lab and broader scientific communities, with a special focus on hydrology, forest management, and climate data analysis.

Ryan Abernathey is a professor at Columbia University’s Lamont-Doherty Earth Observatory, where he leads a research lab specializing in computational oceanography. He has a longstanding background in programming, dating back to writing BASIC code at age seven. Today, Ryan uses Python for analyzing massive Earth science datasets (e.g., satellite data for ocean surface temperatures) and for open-source work, especially around Xarray and the broader Pangeo ecosystem.

What to Know If You're New to Python

Below are some quick points to help you get more out of this episode. These will prepare you to follow the discussion on powerful Python libraries and data-focused workflows:

  • Make sure you understand basic Python scripting, functions, and data structures (lists, dicts, etc.).
  • Experiment with data libraries like pandas to see how Python handles tabular data.
  • Try out Jupyter notebooks for an interactive, iterative coding approach that many scientists use.

Key Points and Takeaways

  1. Pangeo as a Community-Driven Ecosystem
    This project started with a small workshop on scaling climate and weather data analytics with Python. The community has grown into a broad collaboration that brings scientists, software developers, and data providers together to solve the massive data challenges in Earth sciences. The Pangeo ecosystem centers on open-source projects (like Xarray and Dask) and focuses on reproducibility, big-data workflows, and cloud computing.

  2. Xarray: Multidimensional Arrays with Labels
    Often dubbed “multi-dimensional Pandas,” Xarray allows you to handle large collections of netCDF or related scientific files as one coherent dataset. It supports labeled dimensions (e.g., time, latitude, longitude), understands metadata, and can open multiple files as if they were one big dataset. This eliminates repetitive code for reading files in loops and handling manual indexing, greatly simplifying research workflows.

  3. Dask for Scalable Parallel Computing
    Dask provides on-the-fly chunking and parallelization for Python data structures, including dataframes, arrays, and Xarray objects. This means you can write code almost identical to the single-machine version, but under the hood, Dask will handle distributing work across cores or even multiple servers. It’s not “pure magic”—understanding chunk sizes and data access patterns is still crucial—but it unlocks workflows for datasets far larger than RAM.

  4. Cloud Object Storage as the New Data Portal
    Traditional “data portals” often involved custom websites for file downloads. Pangeo encourages storing large scientific datasets (e.g., satellite or climate data) in object storage like Amazon S3, Google Cloud Storage, or other vendors. This approach decouples the storage from a single server and allows easy partial reads, integration with tools like Dask, and more flexible scaling for diverse users.

  5. Open Science Requires More Than Uploading Code to GitHub
    Both guests emphasized that open science is about collaborative workflows, reproducible data processing, and building on each other’s tools. Simply uploading code with a license is insufficient; you also need accessible data, environment details, documentation, and an active community to help others adopt and extend your work. This broader perspective drives projects like Pangeo, where packages like Xarray, Dask, and Jupyter are co-evolving to meet real research needs.

  6. Transition from Spaghetti Scripts to Collaborative Frameworks
    Many labs were using scattered scripts in Perl, C, Fortran, or shell. Shifting to Python with best practices like Xarray and conda environments has dramatically accelerated research. Researchers can prototype faster, share knowledge, and scale out to HPC or cloud-based clusters with fewer custom workflows.

    • Tools Mentioned:
      • conda (environment management for scientific Python)
      • Migrations from Fortran, Perl, and C to Python (various references in the transcript; no official link)

  7. Jupyter Notebooks in the Cloud
    An essential part of Pangeo is running Jupyter or JupyterLab on cloud platforms, close to large datasets. Instead of downloading massive data, you spin up notebooks on Google Cloud, AWS, or HPC systems, saving time and storage costs. JupyterLite is an interesting twist for small data or demos (running Python in the browser), but the real large-scale number crunching happens when you connect notebooks to Dask clusters running on remote infrastructure.

  8. Pangeo Forge for ETL (Extract-Transform-Load) Pipelines
    Many open-source ETL tools are geared toward business-style data rather than multi-dimensional science data. Pangeo Forge aims to address this gap by automating the conversion of diverse data sources into cloud-friendly, chunked, and metadata-rich formats that Xarray and Dask can easily load. This improves accessibility for large, disparate datasets like climate model outputs and satellite imagery.

  9. Big Data in Earth Sciences: Petabytes of Climate Data
    Modern research in climate science, oceanography, and meteorology deals with massive data from satellites (SST—sea surface temperature), high-resolution climate models, and new NASA missions. The standard approach is to keep these data on HPC or cloud storage. Tools like Dask + Xarray let scientists handle these petabyte-scale datasets piecewise, retrieving only what they need without rewriting custom code for each study.

    • Examples:
      • NASA’s upcoming Surface Water and Ocean Topography (SWOT) mission data
      • CMIP6 (Coupled Model Intercomparison Project Phase 6) datasets

  10. Collaboration Between Academia and Industry
    Researchers are noticing that businesses are increasingly adopting climate data for both adaptation (planning for changing climate) and mitigation (reducing emissions). Python, with Pangeo’s open-source ecosystem, helps unify these efforts. This cross-pollination has improved the tooling and driven new features like ephemeral Dask clusters and integrated data catalogs.

    • Links and Tools:
      • CarbonPlan (Joe’s organization working on open-data climate solutions)
      • Climate data marketplaces on Azure, AWS, and GCP (various references in the episode)
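
To make takeaways 2 and 3 concrete, here is a minimal sketch of the Xarray + Dask workflow the guests describe. Everything in it is synthetic and hypothetical: the `make_month` helper, the `sst` variable name, the grid, and the chunk size are assumptions for illustration, and two in-memory datasets stand in for separate netCDF files. It assumes `numpy`, `pandas`, `xarray`, and `dask` are installed.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_month(start, seed):
    """Build one synthetic 'file': 30 days of SST on a tiny lat/lon grid."""
    rng = np.random.default_rng(seed)
    time = pd.date_range(start, periods=30, freq="D")
    lat = np.array([-10.0, 0.0, 10.0])
    lon = np.array([100.0, 110.0, 120.0, 130.0])
    data = 20.0 + rng.random((time.size, lat.size, lon.size))
    return xr.Dataset(
        {"sst": (("time", "lat", "lon"), data)},
        coords={"time": time, "lat": lat, "lon": lon},
    )


# Two stand-in "files" covering consecutive 30-day windows.
parts = [make_month("2022-01-01", seed=0), make_month("2022-01-31", seed=1)]

# Combine them along the labeled time dimension into one dataset.
ds = xr.concat(parts, dim="time")

# Label-based selection: no manual index arithmetic.
point = ds["sst"].sel(lat=0.0, lon=110.0)                     # time series at one grid cell
feb = ds["sst"].sel(time=slice("2022-02-01", "2022-02-28"))   # all of February

# Hand the array to Dask in chunks of 20 time steps; the mean stays lazy
# until .compute() is called.
lazy = ds["sst"].chunk({"time": 20})
time_mean = lazy.mean(dim="time")
result = time_mean.compute()

print(result.shape)  # one value per (lat, lon) cell
```

With real data, the `xr.concat` step is typically replaced by `xr.open_mfdataset` over a set of file paths, and the chunk size is tuned to the data's layout and access pattern, which is the "not pure magic" part mentioned above.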

Interesting Quotes and Stories

  • On the shift to Python in research:
    Joe: “We were doing lots of computer things…my PhD advisor said, ‘We want to do Python stuff. You should be the guinea pig student to bring our group into the modern era.’”

  • On open science in practice:
    Ryan: “Just putting a repo up on GitHub has essentially no impact if no one can run it. We want to accelerate velocity of discovery, and that means documentation, data, everything.”

  • On the power of Jupyter as a Trojan horse for cloud:
    Ryan: “The hardest way to run Jupyter is probably on your own machine. Once you’re on the cloud, it’s easy—and the data is already there.”

Key Definitions and Terms

  • Xarray: A Python library that provides labeled multi-dimensional arrays and datasets, making it simpler to handle netCDF files, satellite data, and other high-dimensional data.
  • Dask: A parallel computing library for Python that scales from a single core to multi-node clusters, often used for big-data analysis in the Python ecosystem.
  • Object Storage: A form of data storage (e.g., AWS S3, Google Cloud Storage) where data is stored as objects rather than in a traditional filesystem or database, enabling large-scale data sharing and partial retrieval.
  • ETL (Extract, Transform, Load): A process for collecting data from various sources, transforming it into a usable format, and loading it into a destination system.
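
The "partial retrieval" in the Object Storage definition can be illustrated with a toy example. This is only a sketch under stated assumptions: an `io.BytesIO` buffer stands in for an object in a bucket, the data layout (little-endian float64 values, time-major order on a 2x3 grid) is invented for the demo, and `byte_range` and `read_steps` are hypothetical helpers. Real object stores such as S3 and Google Cloud Storage expose the same idea through HTTP range requests, usually wrapped by a library rather than written by hand.

```python
import io
import struct

VALUE_SIZE = 8          # bytes per float64 value
N_LAT, N_LON = 2, 3     # assumed grid points per time step
PER_STEP = N_LAT * N_LON


def byte_range(t_start, t_stop):
    """Byte offsets [start, stop) that cover time steps t_start..t_stop-1."""
    step_bytes = PER_STEP * VALUE_SIZE
    return t_start * step_bytes, t_stop * step_bytes


def read_steps(obj, t_start, t_stop):
    """Fetch only the requested time steps instead of the whole object."""
    start, stop = byte_range(t_start, t_stop)
    obj.seek(start)
    raw = obj.read(stop - start)
    values = struct.unpack(f"<{len(raw) // VALUE_SIZE}d", raw)
    # Regroup the flat values into one list per time step.
    return [list(values[i:i + PER_STEP]) for i in range(0, len(values), PER_STEP)]


# Stand-in "object": 10 time steps, where every value in step t equals float(t).
n_steps = 10
payload = b"".join(
    struct.pack("<d", float(t)) for t in range(n_steps) for _ in range(PER_STEP)
)
obj = io.BytesIO(payload)

# Read steps 4 and 5 only: 96 of the 480 bytes are touched.
subset = read_steps(obj, 4, 6)
print(byte_range(4, 6))
print(subset)
```

The point of the design is that the reader needs only the layout metadata and a byte range, not a dedicated server, which is why chunked, metadata-rich formats pair well with object storage.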

Overall Takeaway

The Pangeo ecosystem stands at the intersection of big-data analytics, cloud computing, and open science, empowering climate researchers and beyond. By leveraging Python tools like Xarray, Dask, and Jupyter, teams can seamlessly analyze massive datasets without reinventing the wheel or building unwieldy, one-off scripts. Pangeo’s core ethos—community collaboration, reproducibility, and a shared commitment to transparency—offers a scalable model for any data-intensive field, inspiring both scientists and developers to build tools that are robust, open, and truly impactful.

Links from the show

Ryan Abernathey: @rabernat
Joe Hamman: @HammanHydro
Pangeo: pangeo.io
xarray: xarray.dev
Pangeo Forge: pangeo-forge.org
fsspec: filesystem-spec.readthedocs.io
Step-by-Step Guide to Building a Big Data Portal: medium.com
Coiled: coiled.io
Pangeo Gallery: gallery.pangeo.io
Pangeo Quickstart: pangeo.io
JupyterLite: jupyterlite.readthedocs.io
Jupyter: jupyter.org
Pangeo Packages: pangeo.io
Pangeo Discourse: discourse.pangeo.io
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
