The Data Engineering Landscape in 2021
Data engineering is dedicated to overcoming the data-processing bottlenecks, data-cleanup, data-flow, and data-handling problems that come with applications that use lots of data.
On this episode, we welcome back Tobias Macey to give us the 30,000 ft view of the data engineering landscape in 2021.
Episode Deep Dive
Guest Introduction and Background
Tobias Macey is a seasoned data engineer and podcaster, best known for hosting the Data Engineering Podcast and Podcast.__init__. He leads the platform and data engineering team at MIT’s Open Learning department, where he works on cloud architectures, data pipelines, and analytics to optimize the global learner experience. He also consults with businesses and venture capital firms to highlight challenges and opportunities in the data ecosystem. Tobias has spent years exploring how companies handle data flows and infrastructure, and he’s passionate about making data engineering more accessible through thoughtful software design and tooling.
What to Know If You’re New to Python
If this is your first deep dive into a Python-centric conversation, it helps to have a grasp of basic concepts like virtual environments, installing packages, and working with data libraries such as pandas. You should also be comfortable reading code samples (e.g., loops and functions) and know your way around a simple project directory. Lastly, keep in mind that Python’s community and vast ecosystem make it a top choice for data workflows, from small scripts to large-scale data pipelines.
Key Points and Takeaways
- 1) Data Engineering vs. Data Science
Data engineering focuses on building pipelines, cleaning data, and ensuring data flows reliably for downstream analysis or machine learning. It often involves a wider scope of system integration, infrastructure, and tooling. This contrasts with data science, which targets model building and analytics. Both roles complement one another, but data engineers create the foundation that data scientists rely on.
- 2) Real-Time Data Pipelines and Streaming
Organizations increasingly require immediate insights, so waiting for a once-a-day batch job no longer suffices. Real-time processing via tools like Apache Kafka or Spark Streaming provides near-instant updates and alerts. This capability is key to modern data-driven businesses, allowing fresh data to flow continuously for dashboards, alerts, and real-time analytics. Implementations can be trickier than batch pipelines, but the business benefits often outweigh the complexity.
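To make the streaming idea concrete, here is a minimal sketch of a Python consumer reading JSON events from a Kafka topic using the kafka-python client. The broker address, topic name, consumer group, and event shape are placeholder assumptions for illustration, not details from the episode.

```python
# Minimal streaming-consumer sketch (kafka-python); broker, topic, and payload
# shape are assumptions for illustration only.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                              # hypothetical topic name
    bootstrap_servers="localhost:9092",    # assumed local broker
    group_id="dashboard-updater",          # hypothetical consumer group
    value_deserializer=lambda raw: json.loads(raw),
    auto_offset_reset="earliest",
)

# Each message arrives as soon as it is produced, so dashboards and alerts
# can react to fresh data instead of waiting for a nightly batch.
for message in consumer:
    event = message.value
    print(f"new event: {event}")
```

Spark Structured Streaming covers similar ground when the processing itself (joins, windowed aggregations) needs to be distributed.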
- 3) Shift from ETL to ELT
Traditional pipelines followed Extract-Transform-Load (ETL), transforming data before placing it into a data warehouse with a strict schema. Today, many teams use Extract-Load-Transform (ELT), dumping raw or semi-structured data into a cloud data warehouse and then transforming with SQL-based tooling. This helps teams store all data, even if only partially understood at ingestion time, while transformations evolve as insights or structures become clearer. It also opens collaboration with analysts who are comfortable with SQL.
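As a toy illustration of the ELT pattern, the sketch below loads raw JSON records untouched and only reshapes them afterwards with SQL. SQLite stands in for a cloud warehouse here (assuming a build with the JSON1 functions), and the table and field names are invented for the example; in practice a tool like dbt would manage these SQL transformations.

```python
# ELT in miniature: load raw data first, transform later with SQL.
# Table and field names are made up for the example.
import json
import sqlite3

raw_orders = [
    {"id": 1, "amount": "19.99", "country": "us"},
    {"id": 2, "amount": "5.00", "country": "de"},
]

conn = sqlite3.connect(":memory:")

# Load: keep the raw, semi-structured payloads exactly as they arrived.
conn.execute("CREATE TABLE raw_orders (payload TEXT)")
conn.executemany(
    "INSERT INTO raw_orders (payload) VALUES (?)",
    [(json.dumps(order),) for order in raw_orders],
)

# Transform: shape the raw records into an analytics-friendly table with SQL.
conn.execute("""
    CREATE TABLE orders AS
    SELECT
        json_extract(payload, '$.id')                   AS order_id,
        CAST(json_extract(payload, '$.amount') AS REAL) AS amount,
        UPPER(json_extract(payload, '$.country'))       AS country
    FROM raw_orders
""")
print(conn.execute("SELECT * FROM orders").fetchall())
```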
- 4) Data Orchestration: Airflow, Dagster, and Prefect
Managing dependencies and scheduling a series of data tasks is crucial, especially when data has to move from one system to another. Tools like Apache Airflow, Dagster, and Prefect let you define these workflows (often called Directed Acyclic Graphs or DAGs) cleanly. Each has unique strengths: Airflow is a longstanding standard, Dagster focuses on “software-defined data assets,” and Prefect emphasizes a clean Python API, all leading to better reliability and maintainability for data pipelines.
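For a feel of what these workflows look like in code, here is a minimal Airflow-style DAG with two dependent tasks. The dag_id, schedule, and task bodies are placeholders rather than anything discussed on the show; Dagster and Prefect express the same idea with their own decorators and APIs.

```python
# Minimal Airflow DAG sketch: two tasks with an explicit dependency.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling data from the source system")


def load():
    print("loading data into the warehouse")


with DAG(
    dag_id="nightly_ingest",            # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task           # extract must finish before load runs
```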
- 5) Data Quality and Testing with Great Expectations
In data engineering, code testing is only half the battle; you must also validate the data itself. Great Expectations automates checks on expected value ranges, missing fields, or schema integrity. This ensures that any data passed downstream is trustworthy and avoids painful issues where everything “runs,” but the results are unexpectedly incorrect. Integrated testing helps detect mismatched data or schema drift before your stakeholders see flawed reports.
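A small sketch of what a data check can look like with Great Expectations, using the pandas-dataset style API that was current around the time of this episode (newer releases reorganize the entry points); the column names and ranges are made up for illustration.

```python
# Validate a DataFrame with Great Expectations (older pandas-dataset API).
# Columns and bounds are illustrative assumptions.
import great_expectations as ge
import pandas as pd

records = pd.DataFrame({"user_id": [1, 2, 3], "age": [34, 28, 45]})
df = ge.from_pandas(records)

df.expect_column_values_to_not_be_null("user_id")                      # no missing IDs
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)

results = df.validate()
print("data passed all expectations:", results.success)
```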
- 6) Scaling Workloads with Dask
Python’s Global Interpreter Lock can limit CPU-bound tasks, but Dask circumvents that by orchestrating parallel or distributed computation. Originally designed with scientific workloads in mind, Dask is equally relevant for data pipelines. It’s often seen as a Pythonic answer to Spark, with straightforward APIs that integrate seamlessly with pandas and NumPy. Dask lets you process larger-than-memory datasets and scale to clusters with minimal rewrites.
- Links and Tools:
- Dask
- Fundamentals of Dask course (see “Learning Resources” section below)
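To show how little changes when moving from pandas to Dask, here is a minimal sketch; the CSV glob and column names are placeholders.

```python
# Scale a pandas-style workflow with Dask; file pattern and columns are
# placeholders for illustration.
import dask.dataframe as dd

# Read many CSVs lazily as one logical dataframe (larger than memory is fine).
df = dd.read_csv("events-*.csv")

# Familiar pandas-style operations build a task graph instead of running eagerly...
daily_totals = df.groupby("event_date")["amount"].sum()

# ...and .compute() executes that graph in parallel across cores or a cluster.
print(daily_totals.compute())
```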
- 7) Cloud Data Warehouses and On-Prem Trade-Offs
Many teams are migrating to cloud data warehouses (e.g., Snowflake, Amazon Redshift, Google BigQuery) for elasticity and maintenance ease. On-premises solutions still exist for those with strict privacy or security needs, but cloud providers now offer encryption at rest, encryption in transit, and key management solutions. This maturity has alleviated a lot of the early security concerns, making cloud-based platforms attractive for many organizations.
- 8) Meltano for Data Integration
Meltano is an open source framework originally from GitLab, designed to unify various data ingestion tasks. It provides a cohesive approach to scheduling jobs, monitoring state, and integrating taps and targets from the Singer specification. Instead of stitching together custom scripts or paying for hosted solutions, you can keep your pipelines versioned in Git while seamlessly managing multiple data sources. Meltano helps adopt a DevOps-like approach to data.
- 9) Avro, Parquet, and Schema Evolution
As data grows more complex, formats like Avro and Parquet help keep schemas organized and data efficient. Avro includes a built-in schema definition per file, aiding in schema evolution while maintaining compatibility over time. Parquet is columnar, offering better compression and faster reads for analytical queries. These formats integrate with most modern data lake architectures, allowing flexible data storage and more powerful analytics.
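The sketch below shows both formats from Python: writing an Avro file with its schema embedded via fastavro, and writing and reading a columnar Parquet file through pandas with the pyarrow engine. The file names, record schema, and columns are invented for the example.

```python
# Illustrative only: schema, file names, and columns are assumptions.
import pandas as pd
from fastavro import parse_schema, writer

# Avro: the schema travels with the file, which helps with schema evolution.
schema = parse_schema({
    "name": "Order",
    "type": "record",
    "fields": [
        {"name": "user_id", "type": "int"},
        {"name": "amount", "type": "double"},
    ],
})
with open("orders.avro", "wb") as fo:
    writer(fo, schema, [{"user_id": 1, "amount": 19.99}])

# Parquet: columnar layout means analytical reads can pull only needed columns.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["us", "de", "jp"],
    "amount": [19.99, 5.00, 42.50],
})
df.to_parquet("orders.parquet", engine="pyarrow", compression="snappy")

subset = pd.read_parquet("orders.parquet", columns=["country", "amount"])
print(subset)
```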
- 10) Graph Databases and GraphQL
Graph-based technologies solve problems involving highly interconnected data, such as social networks or recommendation engines. Tools like Neo4j or Dgraph store relationships efficiently, while GraphQL is a flexible query language that fits well in microservices and dynamic front-ends. Even in data engineering contexts, GraphQL-based APIs allow more tailored data fetching and reduce under/over-fetching issues.
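Since GraphQL is just a query posted over HTTP, calling it from Python needs nothing more than requests; the endpoint URL and the fields in the query below are hypothetical.

```python
# Ask a GraphQL API for exactly the fields you need and nothing more.
# The endpoint and schema here are hypothetical.
import requests

query = """
query {
  user(id: "42") {
    name
    followers { name }
  }
}
"""

resp = requests.post(
    "https://example.com/graphql",   # hypothetical GraphQL endpoint
    json={"query": query},
)
resp.raise_for_status()
print(resp.json()["data"])
```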
Interesting Quotes and Stories
“Podcasting opened doors like nothing else. People who wouldn't normally talk to you are suddenly ready to spend an hour together.” – A reflection on how podcast hosting helps build networks and communities.
“A lot of data engineers come from software engineering or data science backgrounds—there’s no one ‘official’ path. It’s about wanting to make data reproducible and valuable.” – Tobias emphasizing the accessibility and varied backgrounds in data engineering.
Key Definitions and Terms
- Data Engineering: The practice of designing systems and workflows to collect, store, transform, and serve data for analysis or applications.
- ETL (Extract, Transform, Load): A traditional sequence where data is transformed before being loaded into a final destination.
- ELT (Extract, Load, Transform): A newer approach that loads raw data first, then applies transformations within the data warehouse.
- DAG (Directed Acyclic Graph): A graph structure for defining job dependencies, ensuring tasks run in a logical, non-circular sequence.
- Data Lake: A central storage repository that holds a vast amount of raw data in its native format.
- Data Warehouse: A structured store optimized for analysis, often using a columnar format for queries and aggregates.
Learning Resources
If you want to deepen your understanding of Python and the data engineering ecosystem, check out these courses from Talk Python Training:
- Python for Absolute Beginners: Perfect if you’re just getting started with Python.
- Fundamentals of Dask: Learn how to scale pandas workflows and leverage distributed computing in Python.
- Move from Excel to Python with Pandas: Transition from spreadsheet-based analysis to more flexible and powerful Python solutions.
Overall Takeaway
Data engineering is essential to transform raw information into reliable, usable data for analytics and applications. By embracing modern tools for orchestration, real-time processing, data quality checks, and distributed computing, Python developers can drive faster and more efficient pipelines. Whether you’re creating quick scripts or enterprise-level systems, investing in solid data engineering practices ensures that everyone in your organization—from data scientists to decision-makers—benefits from high-quality, well-managed data. Above all, curiosity, continuous learning, and experimentation are key to thriving in this rapidly evolving field.
Links from the show
YouTube: youtube.com
Tobias Macey: boundlessnotions.com
Podcast.__init__: pythonpodcast.com
Data Engineering podcast: dataengineeringpodcast.com
Designing Data-Intensive Applications Book: amazon.com
wally: github.com
lakeFS: lakefs.io
A Beginner’s Guide to Data Engineering: medium.com
Apache Airflow: airflow.apache.org
Dagster: dagster.io
Prefect: prefect.io
#68 Crossing the streams with Podcast.__init__: talkpython.fm/68
dbt: getdbt.com
Great Expectations: github.com
Dask: dask.org
Meltano: meltano.com
Language trends on StackOverflow: insights.stackoverflow.com
DVC: dvc.org
Pandas: pandas.pydata.org
Episode transcripts: talkpython.fm
--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy