Data Pipelines with Dagster

Episode #454, published Thu, Mar 21, 2024, recorded Thu, Jan 11, 2024

Do you have data that you pull from external sources, or that is generated and appears at your digital doorstep? I bet that data needs to be processed, filtered, transformed, distributed, and much more. One of the biggest tools for creating these data pipelines with Python is Dagster, and we are fortunate to have Pedram Navid on the show this episode. Pedram is the Head of Data Engineering and DevRel at Dagster Labs. And we're talking data pipelines this week at Talk Python.

Episode Deep Dive

About Pedram Navid

Pedram Navid is the Head of Data Engineering and Developer Relations at Dagster Labs. He’s been working with Python for many years, tracing his start back to version 2.7, when he sought to automate repetitive tasks (what he calls “productive laziness”). With experience across fields like banking, insurance, and ad tech, he now focuses heavily on data pipelines, data orchestration, and bridging the gap between product engineering and real-world data challenges.

  1. Pedram’s Path into Python
    Pedram began using Python out of necessity—he wanted to automate mundane server checks at a bank. Over time, his interest in “productive laziness” (creating simple tools to avoid repetitive tasks) guided him further into scripting and data automation. This path eventually led him to orchestrating complex pipelines, culminating in his role at Dagster.

    • Quote Highlight: “I got started with Python like I do with many things—just out of sheer laziness.” – Pedram
  2. Foundations of Data Pipelines
    Data pipelines are crucial for acquiring, transforming, and distributing data. Pedram highlights how companies produce data constantly, and the processing stages can grow complex. Instead of manually refreshing data for dashboards or analysis, pipelines automate tasks to handle ingestion, cleaning, and onward transformations.

    • Pipelines can be as simple as a single schedule and script or as elaborate as orchestrating dozens of data sources and destinations.
    • Frequent reliability headaches—like failed schedules, schema changes, or missing files—underscore the need for solid orchestration and observability.
  3. From Task-Based to Asset-Based Orchestration
    Traditional workflow managers like Airflow treat each step (e.g., fetch file, run transform) as a separate task. Dagster flips that model: instead of focusing on tasks, it emphasizes the assets produced (e.g., tables, models, CSV files). A minimal asset sketch appears after this list.

    • This asset-based perspective makes it easier to track freshness, lineage, and usage of your data outputs.
    • It also provides a more intuitive interface for stakeholders who want to see which data objects exist, when they were last updated, and how they connect.
  4. What Makes Dagster Different
    Dagster is open source (dagster.io) and offers a powerful UI plus Python-driven configuration. Its architecture supports everything from local development on a laptop to deployments on Kubernetes or through Dagster Cloud.

    • The Dagster Open Platform repository exemplifies how Dagster Labs uses its own tool in production, helping teams see real patterns and best practices.
    • Users can either self-host or opt for Dagster Cloud. The latter offloads infrastructure concerns—no more debugging random server errors or scaling issues by yourself.
  5. Observability and Debugging
    A standout feature is Dagster’s structured logs: everything from Python exceptions to external service logs (like dbt runs) flows into one place. The UI highlights:

    • Which step failed, at what time, and why.
    • The ability to rerun just the failed step without reprocessing earlier successes.
    • Complete metadata for each “asset,” such as row counts or timestamps for tracking data drift and usage over time.
  6. Parallelism and Partitions
    Dagster automatically parallelizes tasks if the asset dependency graph allows it. Users can also define partitions (e.g., by date, region, or client) to scale transformations horizontally. A partitioned-asset sketch appears after this list.

    • Pedram described how partitions let you reprocess only part of the dataset (the “backfill”) if data in a certain date or partition changes.
    • This approach saves time and resources while keeping the end-to-end data pipeline consistent.
  7. Ecosystem Integrations
    The show touched on several tools that work well with Dagster (a short DuckDB-and-Polars sketch appears after this list):

    • dbt for SQL-centric data transformations in data warehouses.
    • DuckDB for fast, in-process columnar analytics—ideal for local dev or embedded analytics.
    • Polars as a high-performance, Rust-based DataFrame library.
    • Apache Arrow as a powerful in-memory data format that underpins many modern analytics tools.
    • DLT (Data Load Tool) for easily pulling data from various sources into your pipelines.
    • And of course, standard Python libraries like requests or boto3 for AWS services.
  8. Open Source and Cloud Business Model
    Dagster’s “open core” strategy means the core orchestration framework is free, with most features available under an open-source license. Pedram explained that this fosters community adoption and contributions while letting Dagster Labs build a cloud offering for teams that prefer hands-off infrastructure and advanced hosting.

    • The self-hosted version (OSS) is robust and can run on a single server or scale across clusters.
    • Dagster Cloud uses the same codebase but adds convenience and enterprise-grade features such as serverless compute options.
  9. Dagster’s Roadmap and Future
    Dagster’s focus is on simplifying data orchestration while offering deeper integrations:

    • More tutorials (including “Dagster University”) to help data teams adopt best practices.
    • Enhancements for dbt usage, sensor-based pipelines (triggering on new data), and advanced partitioning.
    • Upcoming expansions to cover new data ecosystem tools and better developer experience in the local dev environment.
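
Code Sketches from the Deep Dive

Asset-based orchestration (item 3): a minimal sketch of how a small Dagster asset graph can look, using the @asset decorator, Definitions, and materialize. The asset names and data are made up for illustration.

    from dagster import Definitions, asset, materialize


    @asset
    def raw_orders():
        # In a real pipeline this might pull from an API, S3, or a warehouse.
        return [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 25.5}]


    @asset
    def order_totals(raw_orders):
        # Dagster infers the dependency on raw_orders from the parameter name,
        # so the asset graph (and its lineage in the UI) comes from plain Python.
        return sum(row["amount"] for row in raw_orders)


    # Definitions expose the assets to the Dagster UI and daemon; materialize()
    # runs them in-process, which is handy for local development and tests.
    defs = Definitions(assets=[raw_orders, order_totals])

    if __name__ == "__main__":
        result = materialize([raw_orders, order_totals])
        print(result.success)

Because the asset is the unit of work, the UI can show when order_totals was last materialized and which upstream data it came from, which is the freshness and lineage story described above.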
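
Partitions (item 6): a sketch of a daily-partitioned asset, assuming Dagster's DailyPartitionsDefinition and AssetExecutionContext. The asset name and payload are invented for illustration.

    from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset

    daily = DailyPartitionsDefinition(start_date="2024-01-01")


    @asset(partitions_def=daily)
    def daily_events(context: AssetExecutionContext):
        # Each run materializes exactly one partition (one day); a backfill can
        # replay a chosen range of days without touching the rest of the asset.
        day = context.partition_key  # e.g. "2024-03-21"
        context.log.info(f"Processing events for {day}")
        return [{"date": day, "count": 42}]  # placeholder payload

Launching a backfill over, say, a week of partitions reprocesses only those days, which is the "reprocess only part of the dataset" behavior Pedram described.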
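
Ecosystem (item 7): a small sketch of DuckDB querying a Polars DataFrame in-process, with Apache Arrow handling the hand-off between the two. The data is made up, and this assumes duckdb, polars, and pyarrow are installed.

    import duckdb
    import polars as pl

    # A Polars DataFrame built in memory; DuckDB can scan it by variable name.
    orders = pl.DataFrame({"region": ["eu", "us", "eu"], "amount": [10.0, 25.5, 7.25]})

    # Run SQL over the DataFrame and get the result back as Polars via Arrow.
    totals = duckdb.sql(
        "SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY region"
    ).pl()

    print(totals)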

Overall Takeaway
Dagster offers a fresh, asset-focused approach to building and monitoring data pipelines in Python—one that goes beyond traditional, task-based workflow managers. By centralizing the definition of data assets, logs, and observability, teams can debug, reprocess, and collaborate far more effectively. Whether you’re a small group looking to move beyond basic Cron jobs or a larger enterprise wanting an open-source foundation with optional cloud features, Dagster can provide a modern orchestration layer that streamlines your entire data workflow.

Links from the show

Rock Solid Python with Types Course: training.talkpython.fm

Pedram on Twitter: twitter.com
Pedram on LinkedIn: linkedin.com
Ship data pipelines with extraordinary velocity: dagster.io
dagster-open-platform: github.com
The Dagster Master Plan: dagster.io
data load tool (dlt): dlthub.com
DataFrames for the new era: pola.rs
Apache Arrow: arrow.apache.org
DuckDB is a fast in-process analytical database: duckdb.org
Ship trusted data products faster: www.getdbt.com
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
