
Kedro for Maintainable Data Science

Episode #337, published Sat, Oct 9, 2021, recorded Fri, Oct 1, 2021

Have you heard of Kedro? It's a Python framework for creating reproducible, maintainable and modular data science code.

We all know that reproducibility and related concerns are important in data science. Yet much of the magic is the freedom to pop open a notebook and just start exploring.

Yet, that free-form style can lead to difficulties in versioning, reproducibility, collaboration, and moving to production. Solving these problems is the goal of Kedro. And we have 3 great guests from the Kedro community here to give us the rundown: Yetunde Dada, Waylon Walker, and Ivan Danov.


Episode Deep Dive

Guests Introduction and Background

Yetunde Dada: Principal Product Manager for Kedro at QuantumBlack (McKinsey). With a background in mechanical engineering, Yetunde has guided Kedro’s growth for over three years, focusing on enabling data science teams to adopt software engineering best practices.

Ivan Danov: Kedro’s tech lead, bringing a deep software engineering background with experience in web, distributed systems, and AI projects. Ivan joined QuantumBlack five years ago and helped build Kedro from the ground up, emphasizing clean and maintainable code for data science projects.

Waylon Walker: A mechanical engineer turned data science team lead who adopted Kedro in real-world engagements. Waylon highlights how Kedro’s structure and collaboration model streamline data science workflows, particularly for teams juggling multiple projects.

What to Know If You're New to Python

Here are a few quick pointers to help you get the most out of this discussion:

  • Understanding basic Python functions, modules, and file structures will help you follow Kedro’s pipeline and node concepts.
  • Familiarity with Python packaging tools (like pip and virtual environments) makes installing and using Kedro simpler.
  • Be aware of Jupyter notebooks as a data exploration tool—Kedro encourages moving production-ready logic into Python scripts for better maintainability.
  • If you’d like a guided introduction, the Kedro spaceflights tutorial and the “Hello World” guide linked at the end of these notes are good starting points.

Key Points and Takeaways

  1. Kedro’s Core Purpose for Data Science: Kedro is a framework dedicated to creating reproducible, maintainable, and modular data science code. It helps solve the challenges that arise from ad hoc notebook work, specifically when teams need to version, collaborate on, and productionize large data projects.
  2. Move from Notebooks to Production Code: Although Jupyter notebooks are excellent for initial exploration, Kedro encourages you to refactor logic into Python modules. This approach avoids many pitfalls of notebook-only workflows, such as hidden state or the difficulty of collaborating when multiple people work on one .ipynb file.
  3. Project Templating with Cookiecutter: Kedro’s kedro new command relies on cookiecutter templates to quickly scaffold a new project, ensuring a consistent structure. This organization helps even non-traditional software developers, such as data scientists, adopt proven coding and file-structure practices.
  4. Data Catalog for Easy Loading and Saving: Kedro’s Data Catalog centralizes how your project loads and saves datasets from various sources (local files, cloud storage, databases). By abstracting the storage details, you can switch between file systems (S3, local disk, GCP) without rewriting large chunks of code (see the code sketch after this list).
  5. Pipeline Abstraction and Visualizations: Kedro treats each data transformation step as a pure function (a node) and ties nodes into pipelines with explicit input-output dependencies. This DAG-based abstraction removes confusion about execution order, and Kedro Viz generates a clear visual map of your workflow for debugging or stakeholder demos. The same sketch after this list shows nodes wired into a pipeline.
  6. Collaboration and Reusability: With Kedro’s modular structure, teams can focus on smaller function-based nodes and reassemble them into pipelines. This approach promotes code reuse and a smoother onboarding process, which is crucial for large data teams rotating members in and out of a project.
    • Tools:
      • Version control workflows (Git, GitHub, GitLab)
      • Node-based pipeline building in Kedro
  7. Deployment Flexibility: Kedro doesn’t force you to pick a specific orchestrator or deployment target. There are official plugins for Docker and Airflow, plus community guides for AWS Batch, SageMaker, Databricks, and more. This lets you develop a robust codebase first and choose your deployment strategy later.
  8. Spaceflights Tutorial: A fun and practical tutorial helps users learn Kedro by predicting futuristic space flight prices. It walks you through pipelines, Data Catalog usage, and building a machine learning model, all within a straightforward, story-driven example.
  9. fsspec and Dynaconf Under the Hood: The Kedro team highlighted key open-source libraries they rely on. fsspec simplifies reading from various file systems (see the short example after this list), while Dynaconf enables lazy-loaded configuration, so a pipeline run only loads the settings it actually needs.
  10. Future Roadmap (Experiment Tracking): The Kedro team is actively working on built-in experiment tracking, logging parameters, metrics, and outputs in a unified way. This will let data scientists compare runs and handle model versioning without leaving Kedro or wiring up external tools from scratch.
    • Tools:
      • Potential MLflow integration in Kedro
      • Enhanced front-end in Kedro Viz for experiment results
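
To make the Data Catalog and pipeline ideas from points 4 and 5 concrete, here is a minimal, self-contained sketch. It assumes a Kedro 0.17-era API (roughly what was current when this episode was recorded); the dataset name and the tiny preprocessing function are made up for illustration and are not taken from the show.

```python
import pandas as pd

from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner


def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """A pure function: inputs in, outputs out, no hidden notebook state."""
    companies["company_rating"] = (
        companies["company_rating"].str.rstrip("%").astype(float)
    )
    return companies


# Nodes declare their inputs and outputs by name; Kedro resolves execution
# order from those names (a DAG), not from notebook cell order.
data_pipeline = Pipeline(
    [
        node(
            func=preprocess_companies,
            inputs="companies",
            outputs="preprocessed_companies",
            name="preprocess_companies_node",
        )
    ]
)

# The Data Catalog maps dataset names to storage. Here everything lives in
# memory; in a real project the same names would point at CSV, Parquet,
# S3 objects, database tables, and so on.
catalog = DataCatalog(
    {"companies": MemoryDataSet(pd.DataFrame({"company_rating": ["90%", "75%"]}))}
)

# Run the pipeline; outputs not registered in the catalog come back as a dict.
results = SequentialRunner().run(data_pipeline, catalog)
print(results["preprocessed_companies"])
```

In a full Kedro project the catalog entries live in conf/base/catalog.yml and pipelines are registered inside the project package, so pointing the companies dataset at a CSV on disk or an object in S3 becomes a configuration change rather than a code change.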
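
Point 9’s fsspec is the layer that lets that same catalog read from local disk, S3, GCS, and other backends through one interface. Here is a tiny sketch of the idea with hypothetical file paths; remote protocols require the matching extra package (for example s3fs for s3:// URLs).

```python
import fsspec

# Local file: works out of the box with plain fsspec.
with fsspec.open("data/01_raw/companies.csv", "r") as f:
    print(f.readline())

# Same call, different backend: only the URL scheme changes.
# Uncomment once s3fs is installed and the bucket exists.
# with fsspec.open("s3://my-bucket/companies.csv", "r") as f:
#     print(f.readline())
```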

Interesting Quotes and Stories

  • Waylon Walker on Jupyter Notebooks in Teams: “Three people are on a project; two people are sitting idle while one person has the notebook checked out.” This underscores how traditional notebook workflows can bottleneck collaboration.
  • Ivan on the Shift from Notebooks: “I was coming from a software engineering background and wondered, ‘Where are the frameworks here?’ Everything was just notebooks.”
  • Yetunde on NASA Using Kedro: “We found out NASA was using Kedro… it was like we went full circle—our Space Flight tutorial in real life.”

Key Definitions and Terms

  • Node: A pure Python function within a Kedro pipeline that takes named inputs and produces named outputs.
  • Pipeline: A directed acyclic graph (DAG) of nodes that defines data transformations in a clear, maintainable structure.
  • Data Catalog: Kedro’s system for loading and saving data without hardcoding paths or worrying about local vs. remote storage details.
  • Orchestrator: A workflow management tool (Airflow, Prefect, Luigi, etc.) that schedules and runs pipelines in production.


Overall Takeaway

Kedro addresses one of the biggest pain points in data science: turning prototypes and notebooks into maintainable, production-grade code. Its strong emphasis on project structure, modular pipelines, and a unified data catalog helps teams of any size collaborate more efficiently. Whether you’re new to Python or a seasoned engineer looking to bring best practices into data science, Kedro’s flexible framework and bright roadmap promise a well-organized, scalable approach to your next data project.

Links from the show

Waylon on Twitter: @_WaylonWalker
Yetunde on Twitter: @yetudada
Ivan on Twitter: @ivandanov

Kedro: kedro.readthedocs.io
Kedro on GitHub: github.com
Join the Kedro Discord: discord.gg

Articles about Kedro by Waylon: waylonwalker.com
Kedro spaceflights tutorial: kedro.readthedocs.io
“Hello World” on Kedro: kedro.readthedocs.io
Kedro Viz: quantumblacklabs.github.io
Spaceflights Tutorial video: youtube.com
Dynaconf package: dynaconf.com
fsspec: Filesystem interfaces for Python: filesystem-spec.readthedocs.io
Neovim: neovim.io
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
