
The Data Engineering Landscape in 2021

Episode #302, published Thu, Feb 4, 2021, recorded Fri, Jan 29, 2021

I'm sure you're familiar with data science. But what about data engineering? Are these the same or how are they related?

Data engineering is dedicated to overcoming data-processing bottlenecks, data-cleanup issues, and data-flow problems for applications that work with lots of data.

On this episode, we welcome back Tobias Macey to give us the 30,000 ft view of the data engineering landscape in 2021.


Episode Deep Dive

Guest Introduction and Background

Tobias Macey is a seasoned data engineer and podcaster, best known for hosting the Data Engineering Podcast and Podcast.__init__. He leads the platform and data engineering team at MIT’s Open Learning department, where he works on cloud architectures, data pipelines, and analytics to optimize the global learner experience. He also consults with businesses and venture capital firms to highlight challenges and opportunities in the data ecosystem. Tobias has spent years exploring how companies handle data flows and infrastructure, and he’s passionate about making data engineering more accessible through thoughtful software design and tooling.

What to Know If You’re New to Python

If this is your first deep dive into a Python-centric conversation, it helps to have a grasp of basic concepts like virtual environments, installing packages, and working with data libraries such as pandas. You should also be comfortable with reading code samples (e.g., loops and functions) and knowing your way around a simple project directory. Lastly, keep in mind that Python’s community and vast ecosystem make it a top choice for data workflows—everything from small scripts to large-scale data pipelines.

Key Points and Takeaways

  • 1) Data Engineering vs. Data Science: Data engineering focuses on building pipelines, cleaning data, and ensuring data flows reliably for downstream analysis or machine learning. It often involves a wider scope of system integration, infrastructure, and tooling. This contrasts with data science, which targets model building and analytics. Both roles complement one another, but data engineers create the foundation that data scientists rely on.
  • 2) Real-Time Data Pipelines and Streaming: Organizations increasingly require immediate insights, so waiting for a once-a-day batch job no longer suffices. Real-time processing with tools like Apache Kafka or Spark Streaming provides near-instant updates and alerts. This capability is key to modern data-driven businesses, letting fresh data flow continuously into dashboards, alerts, and real-time analytics. Implementations can be trickier than batch pipelines, but the business benefits often outweigh the complexity (see the consumer sketch after this list).
  • 3) Shift from ETL to ELT: Traditional pipelines followed Extract-Transform-Load (ETL), transforming data before placing it into a data warehouse with a strict schema. Today, many teams use Extract-Load-Transform (ELT), loading raw or semi-structured data into a cloud data warehouse and then transforming it with SQL-based tooling such as dbt. This lets teams store all of their data, even if it is only partially understood at ingestion time, while transformations evolve as insights or structures become clearer. It also opens collaboration with analysts who are comfortable with SQL (a minimal load-then-transform sketch follows this list).
  • 4) Data Orchestration with Airflow, Dagster, and Prefect: Managing dependencies and scheduling a series of data tasks is crucial, especially when data has to move from one system to another. Tools like Apache Airflow, Dagster, and Prefect let you define these workflows (often called Directed Acyclic Graphs, or DAGs) cleanly. Each has unique strengths: Airflow is the longstanding standard, Dagster focuses on “software-defined data assets,” and Prefect emphasizes a clean Python API; all lead to better reliability and maintainability for data pipelines (see the Airflow sketch after this list).
  • 5) Data Quality and Testing with Great Expectations: In data engineering, code testing is only half the battle; you must also validate the data itself. Great Expectations automates checks on expected value ranges, missing fields, and schema integrity. This ensures that any data passed downstream is trustworthy and avoids the painful situation where everything “runs” but the results are quietly incorrect. Integrated data testing helps detect mismatched data or schema drift before your stakeholders see flawed reports (an example follows this list).
  • 6) Scaling Workloads with Dask: Python’s Global Interpreter Lock can limit CPU-bound tasks, but Dask circumvents that by orchestrating parallel or distributed computation. Originally designed with scientific workloads in mind, Dask is equally relevant for data pipelines. It is often seen as a Pythonic answer to Spark, with APIs that integrate seamlessly with pandas and NumPy. Dask lets you process larger-than-memory datasets and scale out to clusters with minimal rewrites (see the sketch after this list).
  • 7) Cloud Data Warehouses and On-Prem Trade-Offs: Many teams are migrating to cloud data warehouses (e.g., Snowflake, Amazon Redshift, Google BigQuery) for elasticity and maintenance ease. On-premises solutions still exist for those with strict privacy or security needs, but cloud providers now offer encryption at rest, encryption in transit, and key management solutions. This maturity has alleviated a lot of the early security concerns, making cloud-based platforms attractive for many organizations.
  • 8) Meltano for Data Integration: Meltano is an open source framework originally from GitLab, designed to unify various data ingestion tasks. It provides a cohesive approach to scheduling jobs, monitoring state, and integrating taps and targets from the Singer specification. Instead of stitching together custom scripts or paying for hosted solutions, you can keep your pipelines versioned in Git while seamlessly managing multiple data sources. Meltano helps teams adopt a DevOps-like approach to data.
  • 9) Avro, Parquet, and Schema Evolution: As data grows more complex, formats like Avro and Parquet help keep schemas organized and data efficient. Avro embeds the schema in each file, aiding schema evolution while maintaining compatibility over time. Parquet is columnar, offering better compression and faster reads for analytical queries. Both formats integrate with most modern data lake architectures, allowing flexible storage and more powerful analytics (see the Parquet sketch after this list).
  • 10) Graph Databases and GraphQL: Graph-based technologies solve problems involving highly interconnected data, such as social networks or recommendation engines. Tools like Neo4j or Dgraph store relationships efficiently, while GraphQL is a flexible query language that fits well in microservices and dynamic front ends. Even in data engineering contexts, GraphQL-based APIs allow more tailored data fetching and reduce under- and over-fetching issues (a small client sketch follows this list).
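
To make the real-time pipeline idea from point 2 concrete, here is a minimal consumer sketch using the kafka-python client. The topic name, broker address, and event fields are hypothetical; the point is simply that events are handled as they arrive rather than in a nightly batch.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Subscribe to a hypothetical "page-views" topic and process events as they arrive
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # React immediately: update a dashboard, fire an alert, enrich and forward, etc.
    print(f"user={event.get('user_id')} path={event.get('path')}")
```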
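
The ELT pattern from point 3 can be sketched with nothing but the standard library: land raw JSON in a staging table first, then derive clean, typed tables with SQL afterward. This assumes a SQLite build with the JSON1 functions; in practice the same shape applies to Snowflake, BigQuery, or Redshift, with a tool like dbt handling the transform step.

```python
import json
import sqlite3

conn = sqlite3.connect("warehouse.db")

# Load: store raw, semi-structured records as-is (no upfront schema decisions)
conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (payload TEXT)")
records = [{"id": 1, "total": 19.99}, {"id": 2, "total": 5.00}]
conn.executemany(
    "INSERT INTO raw_orders (payload) VALUES (?)",
    [(json.dumps(r),) for r in records],
)

# Transform: shape the raw data into an analysis-ready table inside the warehouse
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS orders AS
    SELECT json_extract(payload, '$.id') AS order_id,
           json_extract(payload, '$.total') AS total
    FROM raw_orders
    """
)
conn.commit()
```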
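
For the orchestration tools in point 4, a DAG is just a set of tasks plus their dependencies. Below is a minimal Airflow 2.x sketch; Dagster and Prefect express the same idea with their own decorators and APIs. The extract and transform functions are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def transform():
    print("clean and reshape the extracted data")

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # The dependency arrow defines the acyclic graph: extract runs before transform
    extract_task >> transform_task
```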
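
Point 5 is easiest to see with a tiny example. The sketch below uses the older pandas-flavored Great Expectations API (newer releases organize the same checks around data contexts and expectation suites); the file name and columns are made up.

```python
import great_expectations as ge
import pandas as pd

# Wrap an ordinary DataFrame so expectations can be attached to it
df = ge.from_pandas(pd.read_csv("orders.csv"))

# Declare what "good data" looks like
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("total", min_value=0)

# Validate before anything flows downstream
results = df.validate()
if not results["success"]:
    raise ValueError("Data quality checks failed; stopping the pipeline")
```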
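
To illustrate point 6, Dask mirrors the pandas API while deferring execution until .compute() is called, so the same code can process larger-than-memory data or fan out across a cluster. The file pattern and column name below are hypothetical.

```python
import dask.dataframe as dd

# Treat a whole directory of CSV files as one logical, lazily evaluated dataframe
df = dd.read_csv("events-*.csv")

# Familiar pandas-style operations; work runs in parallel when .compute() is called
counts_by_type = df.groupby("event_type").size().compute()
print(counts_by_type)
```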
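
Point 9 in practice: pandas can write and read Parquet directly (with pyarrow or fastparquet installed), and the columnar layout means analytical reads can pull only the columns they need.

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "country": ["US", "DE", "BR"]})

# Write a columnar, compressed Parquet file
df.to_parquet("users.parquet", index=False)

# Read back only the column the analysis needs
countries = pd.read_parquet("users.parquet", columns=["country"])
print(countries)
```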
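
Finally, for point 10, a GraphQL client simply POSTs a query describing exactly the fields it wants. The endpoint, query shape, and fields here are hypothetical; the takeaway is that the caller, not the server, decides how much data comes back.

```python
import requests

# Hypothetical GraphQL endpoint and schema
query = """
{
  user(id: "42") {
    name
    orders {
      total
    }
  }
}
"""

response = requests.post("https://api.example.com/graphql", json={"query": query})
response.raise_for_status()
print(response.json()["data"])
```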

Interesting Quotes and Stories

“Podcasting opened doors like nothing else. People who wouldn't normally talk to you are suddenly ready to spend an hour together.” – A reflection on how podcast hosting helps build networks and communities.

“A lot of data engineers come from software engineering or data science backgrounds—there’s no one ‘official’ path. It’s about wanting to make data reproducible and valuable.” – Tobias emphasizing the accessibility and varied backgrounds in data engineering.

Key Definitions and Terms

  • Data Engineering: The practice of designing systems and workflows to collect, store, transform, and serve data for analysis or applications.
  • ETL (Extract, Transform, Load): A traditional sequence where data is transformed before being loaded into a final destination.
  • ELT (Extract, Load, Transform): A newer approach that loads raw data first, then applies transformations within the data warehouse.
  • DAG (Directed Acyclic Graph): A graph structure for defining job dependencies, ensuring tasks run in a logical, non-circular sequence.
  • Data Lake: A central storage repository that holds a vast amount of raw data in its native format.
  • Data Warehouse: A structured store optimized for analysis, often using a columnar format for queries and aggregates.

Learning Resources

If you want to deepen your understanding of Python and the data engineering ecosystem, check out the courses available at Talk Python Training.

Overall Takeaway

Data engineering is essential to transform raw information into reliable, usable data for analytics and applications. By embracing modern tools for orchestration, real-time processing, data quality checks, and distributed computing, Python developers can drive faster and more efficient pipelines. Whether you’re creating quick scripts or enterprise-level systems, investing in solid data engineering practices ensures that everyone in your organization—from data scientists to decision-makers—benefits from high-quality, well-managed data. Above all, curiosity, continuous learning, and experimentation are key to thriving in this rapidly evolving field.

Links from the show

Live Stream Recordings:
YouTube: youtube.com

Tobias Macey: boundlessnotions.com

Podcast.__init__: pythonpodcast.com
Data Engineering podcast: dataengineeringpodcast.com

Designing Data-Intensive Applications Book: amazon.com
wally: github.com
lakeFS: lakefs.io
A Beginner’s Guide to Data Engineering: medium.com
Apache Airflow: airflow.apache.org
Dagster: dagster.io
Prefect: prefect.io
#68 Crossing the streams with Podcast.__init__: talkpython.fm/68
dbt: getdbt.com
Great Expectations: github.com
Dask: dask.org
Meltano: meltano.com
Languages trends on StackOverflow: insights.stackoverflow.com
DVC: dvc.org
Pandas: pandas.pydata.org
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
