
Apache Airflow Open-Source Workflow with Python

Episode #330, published Fri, Aug 20, 2021, recorded Thu, Aug 5, 2021

If you are working with data pipelines, you definitely need to give Apache Airflow a look. This pure-Python workflow framework is one of the most popular and capable out there. You create your workflows by writing Python code using clever language operators, and then you can monitor and even debug them visually once they're running.

Stop writing manual, cron-job-based code to create data pipelines and check out Airflow instead. We're joined by three excellent guests from the Airflow community: Jarek Potiuk, Kaxil Naik, and Leah Cole.


Episode Deep Dive

Guests Introduction and Background

  • Leah Cole: Works in the data and AI space at Google Cloud, first learned Python in college and has grown to love how straightforward it is for tasks like data engineering and machine learning.
  • Jarek Potiuk: Initially coded in COBOL, Delphi, and many other languages before falling in love with Python’s simplicity and power around six years ago. Active in open-source communities, particularly around Apache Airflow.
  • Kaxil Naik: Started coding Python around 2016 after using Java and R. He has become a prominent contributor and committer to Apache Airflow and works at Astronomer helping organizations run Airflow at scale.

What to Know If You’re New to Python

If you’re new to Python and want to follow along with the discussion of Airflow, here are a few key tips to ensure you’re ready for this episode’s concepts:

  • Understand Python’s basic syntax (loops, functions, imports).
  • Learn how virtual environments work so you can safely install and manage Python packages.
  • Be comfortable with writing simple scripts and command-line interactions.

Key Points and Takeaways

  1. Why Use Apache Airflow for Data Orchestration: Apache Airflow is a Python-based orchestrator for complex workflows, especially data pipelines. Instead of manually scheduling with cron, you define tasks in Python (as directed acyclic graphs, or DAGs) and let Airflow handle dependencies, error handling, and scheduling.
  2. Defining Workflows in Python (DAGs as Code): Airflow workflows (DAGs) are just Python scripts, making it easy to leverage conditionals, loops, and standard libraries. This allows teams to generate large, programmatic DAGs and share them via version control without a separate XML or JSON config approach. (See the DAG sketch after this list.)
  3. Operators, Sensors, and Providers: Airflow’s “operators” and “sensors” let you run and monitor tasks on external services, databases, or APIs. The community contributes “provider packages” for everything from Google Cloud to Slack, so you can orchestrate tasks anywhere.
  4. Installation and Production Setups: While you can install Airflow with pip install apache-airflow, a more common approach for local development is Docker and Docker Compose. For production, there are Helm charts for Kubernetes deployments or managed services like Astronomer, AWS’s Managed Workflows for Apache Airflow (MWAA), and Google Cloud Composer.
  5. Monitoring and Debugging with the Airflow UI: Airflow has a built-in web UI that shows DAGs, their status, and logs of individual task runs. You can visually see which tasks succeeded or failed, re-run tasks that require new data, or handle partial restarts if needed. Airflow 2.0 introduced a refreshed UI with features like auto-refresh for live feedback.
  6. Community and Contributing: The Apache Airflow community is large and welcoming, with an active Slack workspace and frequent meetups. Contributions range from small documentation fixes to writing providers for new services, and can grow into more significant roles such as committer or PMC member.
  7. The CLI and REST API: Beyond the UI, Airflow offers a feature-rich command-line interface for tasks like backfilling historical data and debugging performance issues. With Airflow 2.x, you also get a robust, OpenAPI-based REST API, enabling custom UIs and integrations in any language. (A small trigger example follows the list.)
  8. Handling Failures and Reruns: With DAGs as code, you can define retries for tasks and fallback strategies (like skipping or sending alerts), as shown in the sketch below. If your data changes or a job fails halfway, you can trigger partial reruns or fully backfill older data to ensure accuracy in downstream systems.
  9. Future Directions (Timetables and Async Sensors): The discussion highlighted Airflow’s evolving features, like more flexible timetables to handle complex scheduling (e.g., the “third trading day” use case) and deferrable operators for sensors that use async instead of continuous polling. These changes aim to improve performance and reduce resource usage.
  10. Managed vs. Self-Hosted Solutions: Airflow can be run in many ways: fully managed (Cloud Composer on Google Cloud, AWS MWAA, or Astronomer) or self-hosted on Kubernetes, VMs, or Docker. Each approach offers different trade-offs, such as less maintenance vs. more flexibility in customization.
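
To make points 2, 3, and 8 concrete, here is a minimal DAG sketch for Airflow 2.x. It is not taken from the episode: the DAG id, file path, Bash command, and retry settings are illustrative assumptions.

```python
# A minimal Airflow 2.x DAG sketch: a sensor, a Python task, and a Bash task.
# Everything here (ids, path, command, retry policy) is an illustrative assumption.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def transform(**context):
    # Placeholder transform step; a real pipeline would read and write actual data.
    print("transforming data for", context["ds"])


with DAG(
    dag_id="example_daily_pipeline",           # hypothetical DAG id
    start_date=datetime(2021, 8, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "retries": 2,                          # point 8: retries defined in code
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    # A sensor waits for a condition (here, a local file via the default
    # fs_default connection) before downstream tasks run.
    wait_for_file = FileSensor(task_id="wait_for_file", filepath="/tmp/incoming.csv")

    transform_data = PythonOperator(task_id="transform_data", python_callable=transform)

    notify = BashOperator(task_id="notify", bash_command="echo 'pipeline finished'")

    # Dependencies are plain Python: sensor -> transform -> notify
    wait_for_file >> transform_data >> notify
```

Dropping a file like this into Airflow's DAGs folder is enough for the scheduler to pick it up and for the web UI to visualize the three tasks and their dependencies.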
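
And to illustrate point 7, here is a hedged sketch of triggering that DAG through the Airflow 2.x stable REST API. The host, credentials, and enabled basic-auth backend are assumptions that depend on how your deployment is configured.

```python
# Trigger a DAG run over the stable REST API (Airflow 2.x).
# Assumptions: the webserver runs on localhost:8080, basic auth is enabled,
# and a DAG with this id exists; adjust all three for a real deployment.
import requests

AIRFLOW_URL = "http://localhost:8080"
DAG_ID = "example_daily_pipeline"  # the hypothetical DAG from the sketch above

response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("admin", "admin"),                               # depends on your auth backend
    json={"conf": {"triggered_by": "rest-api-example"}},   # optional run configuration
)
response.raise_for_status()
run = response.json()
print(run["dag_run_id"], run["state"])
```

Because the API is described with OpenAPI, you can also generate clients in other languages instead of calling it with requests directly.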

Interesting Quotes and Stories

  • On Python’s Simplicity: “After years of programming in Java, I was amazed at what I could do in one line of Python code.” - Jarek
  • On DAGs as code: “It’s puzzles. You get to define everything in Python, including your loops and conditionals, and that’s what makes Airflow powerful.” - Leah
  • On Contributing to Open Source: “If something drives you nuts, fix it or clearly file an issue. You don’t have to wait for somebody else to fix it for you!” - Leah

Key Definitions and Terms

  • DAG (Directed Acyclic Graph): A collection of tasks organized so that no cycles can form, meaning no task can depend on itself, directly or indirectly.
  • Operator (Airflow): A class defining a single task, e.g., running a Bash script, spinning up a Kubernetes pod, or submitting a Spark job.
  • Sensor (Airflow): A special operator that waits for a condition (such as a file appearing in S3 or GCS) before letting downstream tasks proceed.
  • Provider (Airflow): A separately packaged integration that bundles operators, hooks, and sensors for external systems (e.g., AWS, Google Cloud).
  • Backfill: Rerunning tasks for past dates or data intervals to “fill in” missing or updated results without re-running everything from scratch.

Learning Resources

Here are a few resources to level up your Python skills and to better understand or contribute to Airflow.

Overall Takeaway

Apache Airflow has become the go-to orchestration platform for data pipelines and beyond because it speaks Python all the way. Rather than forcing you into rigid UIs or config files, Airflow leverages the full power of Python to code tasks and schedules dynamically. A welcoming open-source community, extensive integrations (“providers”), and rich dev tooling make it possible for new users, experienced Pythonistas, and large enterprise teams to use Airflow for everything from advanced machine learning workflows to simple daily tasks. If you’re looking to streamline your data operations (or orchestrate just about anything) and you enjoy working in Python, Airflow is a must-have in your toolkit.

Links from the show

Jarek Potiuk: linkedin.com
Kaxil Naik: @kaxil
Leah Cole: @leahecole

Airflow site: airflow.apache.org
Airflow on GitHub: github.com
Airflow community: airflow.apache.org
UI: github.com
Helm Chart for Apache Airflow: airflow.apache.org
Airflow Summit: airflowsummit.org
Astronomer: astronomer.io
Astronomer Registry (Easy to search for official and community Providers): registry.astronomer.io
REST API: airflow.apache.org
Contributing: github.com
Airflow Loves Kubernetes talk: airflowsummit.org
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
