
#302: The Data Engineering Landscape in 2021 Transcript

Recorded on Friday, Jan 29, 2021.

00:00 I'm sure you're familiar with data science, but what about data engineering?

00:03 Are these the same thing, or how are they related?

00:06 Data engineering is dedicated to overcoming data processing bottlenecks,

00:10 data cleanup, data flow, and data handling problems for applications that utilize a lot of data.

00:15 On this episode, we welcome back Tobias Macey to give us a 30,000-foot view of the data engineering landscape in 2021.

00:22 This is Talk Python to Me, episode 302, recorded January 29, 2021.

00:27 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

00:47 This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy,

00:51 and keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython.

00:58 This episode is brought to you by Datadog and Retool.

01:01 Please check out what they're offering during their segments. It really helps support the show.

01:05 Tobias, you ready to kick it off?

01:07 Yeah, sounds good. Thanks for having me on, Mike.

01:08 Yeah, great to have you here. Good to have you back.

01:11 I was recently looking at my podcast page here, and it says you were on show 68, which was a lot of fun.

01:18 That was when Chris Patti was with you as well on Podcast.__init__.

01:22 But boy, that was 2016.

01:24 Yeah, it's been a while.

01:26 We've been at this a while. I mean, ironically, we started within a week of each other, but yeah, we're still going, both of us.

01:31 Yeah, it's definitely been a fun journey and a lot of great sort of unexpected benefits and great people that I've been able to meet as a result of it.

01:38 So definitely glad to be on the journey with you.

01:41 Yeah, yeah. Same here. Podcasting opened doors like nothing else. It's crazy.

01:45 People who wouldn't normally want to talk to you are like, hey, you want to be on the show?

01:48 Yeah, let's spend an hour together all of a sudden, right? It's fantastic.

01:51 What's new since 2016? What have you been up to?

01:54 Definitely a number of things. I mean, one being that I actually ended up going solo as the host.

01:59 So I've been running the Podcast.__init__ show by myself. I don't remember exactly when it happened, but I think probably sometime around 2017.

02:07 I know around the same time that I was on your show, you were on mine. So we kind of flip-flopped.

02:12 And then you've been on the show again since then, talking about your experience working with MongoDB and Python.

02:18 Yeah.

02:18 Beyond that, I also ended up starting a second podcast. So I've got Podcast.__init__, which focuses on Python and its community.

02:25 So a lot of stuff about DevOps, data science, machine learning, web development, you name it.

02:31 Anything that people are doing with Python, I've had them on.

02:33 But I've also started a second show focused on data engineering.

02:37 So going beyond just the constraints of Python into a separate niche.

02:41 So more languages, but more tightly focused problem domain.

02:46 And so I've been enjoying learning a lot more about the area of data engineering.

02:50 And so it's actually been a good companion between the two, where there's a lot of data science that happens in Python.

02:57 So I'm able to cover that side of things on Podcast.__init__.

02:59 And then data engineering is all of the prep work that makes data scientists' lives easier.

03:05 And so just learning a lot about the technologies and challenges that happen on that side of things.

03:09 Yeah, that's super cool.

03:10 And to be honest, one of the reasons I invited you on the show is because I know people talk about data engineering.

03:15 And I know there's neat tools in there.

03:18 They feel like they come out of the data science space, but not exactly.

03:21 And so I'm really looking forward to learning about them along with everyone else listening.

03:25 So it's going to be a lot of fun.

03:26 Yeah.

03:27 Before we dive into that, let people maybe know, what are you doing day to day these days?

03:31 Are you doing consulting?

03:32 Or do you have a full-time job?

03:33 Or what's the plan?

03:34 Yes.

03:35 Yes to all of that.

03:36 So, yeah, I mean, I run the podcasts as a side thing, just sort of a hobby.

03:41 And for my day to day, I actually work full time at MIT in the open learning department and help run the platform engineering and data engineering team there.

03:51 So responsible for making sure that all the cloud environments are set up and secured and servers are up and running and applications stay available.

03:59 And working through building out a data platform to provide a lot of means for analytics and gaining insights into the learning habits and the behaviors that global learners have and how they interact with all of the different platforms that we run.

04:13 That's fantastic.

04:14 Yeah, it's definitely a great place to work.

04:16 I've been happy to be there for a number of years now.

04:18 And then, you know, I run the podcast.

04:20 So those go out every week.

04:22 So a lot of stuff that happens behind the scenes there.

04:24 And then I also do some consulting where lately it's been more of the advisory type where it used to be I'd be hands on keyboard, but I've been able to level up beyond that.

04:34 And so I've been working with a couple of venture capital firms to help them understand the data ecosystem.

04:40 So data engineering, data science.

04:42 I've also worked a little bit with a couple of businesses, just helping them understand sort of what are the challenges and what's the potential in the data marketplace and data ecosystem to be able to go beyond just having an application and then being able to use the information and the data that they gather from that to be able to build more interesting insights into their business, but also products for their customers.

05:04 Oh, yeah, that sounds really fun.

05:06 I mean, working for MIT sounds amazing.

05:08 And then those advisory roles are really neat because you kind of get a take, especially as a podcaster, you get this broad view because you talk to so many people and, you know, they've got different situations and different contexts.

05:19 And so you can say, all right, look, here's kind of what I see.

05:21 You seem to fit into this realm.

05:23 And so this might be the right path.

05:25 Absolutely.

05:25 Yeah.

05:26 I mean, the data ecosystem in particular is very fast moving.

05:29 So it's definitely very difficult to be able to keep abreast of all the different aspects of it.

05:35 And so because that's kind of my job as a podcaster, I'm able to dig deep into various areas of it and be able to learn from and take advantage of all of the expertise and insight that these various innovators and leaders in the space have and kind of synthesize that because I'm talking to people across the storage layer to the data processing layer and orchestration and analytics and machine learning and operationalizing all of that.

06:04 Whereas if I were one of the people who's deep in the trenches, it's, you know, you get a very detailed but narrow view, whereas I've got a very shallow and broad view across the whole ecosystem.

06:14 So I'm able to.

06:15 Which is the perfect match for the high level view, right?

06:17 Exactly.

06:18 Nice.

06:19 All right.

06:19 Well, let's jump into our main topic.

06:21 And we touched on it a little bit, but I know what data science is, I think.

06:25 And there's a really interesting interview I did with Emily and Jacqueline.

06:30 I don't remember both their last names, but it was recently, about building a career in data science.

06:34 And they talked about basically three areas of data science that you might be in, like production and machine learning versus making predictions and so on.

06:43 And data engineering, it feels like it's kind of in that data science realm, but it's not exactly that.

06:48 Like it could kind of be databases and other stuff too, right?

06:51 Like what is this data engineering thing, maybe compare and contrast against data science as people probably know that pretty well.

06:57 Yeah.

06:58 So it's one of those kind of all encompassing terms that, you know, the role depends on the organization that you're in.

07:05 So in some places, data engineer might just be the person who, you know, used to be the DBA or the database administrator.

07:12 And other places, they might be responsible for maintaining streaming systems.

07:20 One way that I've seen it broken down is into two sort of broad classifications of data engineering: there's the SQL-focused data engineer, where they might have a background as a database administrator.

07:32 And so they do a lot of work in managing the data warehouse.

07:35 They work with SQL-oriented tools, where there are a lot of them coming out now where you can actually use SQL for being able to pull data from source systems into the data warehouse and then, you know, build transformations to provide to analysts and data scientists.

07:51 And then there is the more engineering oriented data engineer, which is somebody who writes a lot of software.

07:58 They're building complex infrastructure and architectures using things like Kafka or Flink or Spark.

08:04 They're working with the database.

08:06 They're working with data orchestration tools like Airflow or Dagster or Prefect.

08:11 They might be using Dask.

08:12 And so they're much more focused on actually writing software and delivering code as the output of their efforts.

08:18 Right.

08:18 Okay.

08:19 But the shared context across however you define data engineering, the shared aspect of it is that they're all working to bring data from multiple locations into a place that is accessible for various end users where the end users might be analysts or data scientists or the business intelligence tools.

08:40 And they're tasked with making sure that those workflows are repeatable and maintainable and that the data is clean and organized so that it's useful because, you know, everybody knows the whole garbage in garbage out principle.

08:53 Yeah.

08:53 If you're a data scientist and you don't have all the context of where the data is coming from, you just have a small narrow scope of what you need to work with.

09:02 You're kind of struggling with that garbage in garbage out principle.

09:04 And so the data engineer's job is to get rid of all the garbage and give you something clean that you can work from.

09:09 I think that's really a tricky problem in the data science side of things.

09:13 You take your data, you run it through a model or through some analysis graphing layer, and it gives you a picture.

09:19 And you're like, well, that's the answer.

09:20 Maybe.

09:21 Maybe it is.

09:22 Right.

09:22 Did you give it the right input?

09:23 Did you train the models in the right data?

09:26 Who knows, right?

09:27 Right.

09:27 That's, you know, definitely a big challenge.

09:29 And that's one of the reasons why data engineering has become so multifaceted is because what you're doing with the data informs the ways that you prepare the data.

09:38 You know, you need to make sure that you have a lot of the contextual information as well to make sure that the data scientists and data analysts are able to answer the questions accurately.

09:47 Because data in isolation, if you just give somebody the number five, it's completely meaningless.

09:52 But if you tell them that a customer ordered five of this unit, well, then now you can actually do something with it.

09:58 So the context helps to provide the information about the isolated number and understanding where it came from and why it's important.

10:07 Yeah, absolutely.

10:07 You know, two things come to mind for me when I hear data engineering: one is pipelines of data.

10:12 You know, maybe you've got to bring in data and do transformations to it to get it ready.

10:16 This is part of that data cleanup, maybe, and taking disparate sources and unifying them under one canonical model or something in representation.

10:24 And then ETL, where it's kind of like we get something terrible like FTP uploads of CSV files.

10:30 And we've got to turn those into databases like overnight jobs, right?

10:33 Or things like that, which probably still exist.

10:36 They existed not too long ago.

10:37 Yeah.

10:37 Every sort of legacy technology that you think has gone away because you're not working with it anymore is still in existence somewhere, which is why we still have COBOL.

10:47 Exactly.

10:48 Oh, my gosh.

10:49 I've got some crazy, crazy COBOL stories for you that probably shouldn't go out public.

10:54 But ask me over the next conference.

10:56 The next time we get to travel somewhere, you know?

10:57 All right.

10:58 Sounds good.

10:58 For sure.

10:59 So let's talk about trends.

11:01 I made that joke, right?

11:02 Like, well, maybe it used to be CSV files or text files and FTP and then a job that would put that into a SQL database or some kind of relational database.

11:10 What is it now?

11:11 It's got to be better than that, right?

11:12 I mean, again, it depends where you are.

11:14 I mean, CSV files are still a thing.

11:16 You know, it might not be FTP anymore.

11:18 It's probably going to be living in object storage like S3 or Google Cloud Storage.

11:22 But, you know, you're still working with individual files in some places.

11:26 A lot of it is coming from APIs or databases where you might need to pull all of the information from Salesforce to get your CRM data.

11:34 Or you might be pulling data out of Google Analytics via their API.

11:38 You know, there are a lot of evolutionary trends that have happened.

11:42 Sort of first big movement in data engineering beyond just the sort of, well, there have been a few generations.

11:49 So the first generation was the data warehouse where you took a database appliance, whether that was Oracle or Microsoft SQL Server or Postgres.

11:57 You put all of your data into it.

11:59 And then you had to do a lot of work to model it so that you could answer questions about that data.

12:05 So in an application database, you're liable to just overwrite a record when something changes,

12:10 where in a data warehouse, you want that historical information about what changed and the evolution of that data.

12:17 What about like normalization in operational databases?

12:20 It's all about one source of truth.

12:23 We better not have any duplication.

12:24 It's fine if there's four joins to get there.

12:26 Whereas in warehousing, it's maybe better to have that duplication so you can run different types of reports real quickly and easily.

12:33 Exactly.

12:34 Yeah.

12:34 I mean, you still need to have one source of truth, but you will model the tables differently than in an application database.

12:41 So there are things like the star schema or the snowflake schema that became popular in the initial phase of data warehousing.

12:47 So Ralph Kimball is famous for building out the sort of star schema approach with facts and dimensions.

12:53 Yeah.

12:53 Maybe describe that a little bit for people because maybe they don't know these terms.

12:57 Sure.

12:57 So facts are things like, you know, a fact is Tobias Macey works at MIT.

13:05 And then a dimension might be he was hired in 2016 or whatever year it was.

13:11 And another dimension of it is he, you know, his work anniversary is X date.

13:17 And so the way that you model it makes it so a fact is something that's immutable, and then dimensions are things that might evolve over time.
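
To make the facts-and-dimensions idea concrete, here is a minimal star schema sketch in pandas; all table names, column names, and values are invented for illustration. The fact table stays append-only and immutable, while the dimension tables carry the descriptive attributes that can evolve over time.

```python
import pandas as pd

# Toy star schema: one fact table of order events plus two dimension tables.
# All table and column names here are hypothetical.
dim_customer = pd.DataFrame({
    "customer_id": [1, 2],
    "customer_name": ["Acme Corp", "Globex"],
    "region": ["US-East", "EU-West"],
})

dim_product = pd.DataFrame({
    "product_id": [10, 11],
    "product_name": ["Widget", "Gadget"],
    "category": ["Hardware", "Hardware"],
})

# The fact table holds the immutable measurements (what happened, how many, for how much)
# plus foreign keys pointing at the dimensions that describe the context.
fact_orders = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "customer_id": [1, 2, 1],
    "product_id": [10, 10, 11],
    "quantity": [5, 3, 7],
    "amount_usd": [50.0, 30.0, 84.0],
})

# Analysts join facts to dimensions and aggregate, e.g. revenue by region and category.
report = (
    fact_orders
    .merge(dim_customer, on="customer_id")
    .merge(dim_product, on="product_id")
    .groupby(["region", "category"])["amount_usd"]
    .sum()
)
print(report)
```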

13:25 And then the sort of next iteration of data engineering and data management was the, quote unquote, big data craze, where Google released their paper about MapReduce.

13:35 And so Hadoop came out as an open source option for that.

13:39 And so everybody said, oh, I've got to get.

13:40 Yeah.

13:41 MapReduce was going to take over the world.

13:43 Right.

13:43 Like that was the only way you could do anything.

13:45 If you had big data, then you had to MapReduce it.

13:47 And then maybe it had to do with one of these large scaled out databases.

13:51 Right.

13:52 Spark or Cassandra or who knows something like that.

13:54 Yeah.

13:55 I mean, Spark and Cassandra came after Hadoop.

13:57 So, I mean, Hadoop was your option in the early 2000s.

14:01 And so everybody said, oh, big data is the answer.

14:03 If I just throw big data at everything, it'll solve all my problems.

14:07 And so people built these massive data lakes using Hadoop and built these MapReduce jobs and then realized that what are we actually doing with all this data?

14:14 It's costing us more money than it's worth.

14:15 MapReduce jobs are difficult to scale.

14:18 They're difficult to understand the order of dependencies.

14:21 And so that's when things like Spark came out to use the data that you were already collecting, but be able to parallelize the operations and run it a little faster.

14:29 And so, you know, that was sort of the era of batch oriented workflows.

14:34 And then with the advent of things like Spark streaming and Kafka and, you know, there are a whole number of other tools out there now, like Flink and Pulsar.

14:42 The sort of real time revolution is where we're at now, where it's not enough to be able to understand what happened the next day.

14:50 You have to understand what's happening, you know, within five minutes.

14:53 And so there are principles like change data capture, where every time I write a new record into a database, it goes into a Kafka queue, which then gets replicated out to an Elasticsearch cluster and to my data warehouse.

15:05 And so within five minutes, my business intelligence dashboard is updated with the fact that customer A bought product B rather than having to wait 24 hours to get that insight.
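
As a rough sketch of the consumer side of that change data capture flow, here is what a small Python job using the kafka-python client might look like. The topic name, message shape, and downstream writes are hypothetical stand-ins; a real setup would have a CDC tool publishing row changes into the topic and this job upserting them into Elasticsearch and the warehouse.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical CDC consumer: every row change from the application database lands in a
# Kafka topic, and this job fans the changes out to downstream stores.
consumer = KafkaConsumer(
    "app_db.public.orders",                # hypothetical change-data-capture topic
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",          # the durable log lets you replay history later
)

for message in consumer:
    change = message.value
    # In a real pipeline you would upsert into Elasticsearch and the data warehouse here;
    # printing stands in for those writes in this sketch.
    print(change.get("op"), change.get("after"))
```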

15:15 I think that makes tons of sense.

15:16 So instead of going like, we're just going to pile the data into this, you know, some sort of data lake type thing, then we'll grab it and we'll do our reports nightly or hourly or whatever.

15:25 You just keep pushing it down the road as it comes in or as it's generated, right?

15:29 Right.

15:30 Yeah.

15:30 So, I mean, there are still use cases for batch.

15:33 I mean, and there are different ways of looking at it.

15:35 So, I mean, a lot of people view batch as just a special case of streaming where, you know, streaming is sort of micro batches where as a record comes in, you operate on it.

15:43 And then for large batch jobs, you're just doing the same thing, but multiple times for a number of records.

15:48 Yeah.

15:49 I mean, there are a lot of paradigms that are building up.

15:50 People are getting used to the idea.

15:52 I mean, batch is still the easier thing to implement.

15:54 It requires fewer moving pieces, but with streaming, the sort of availability of different technologies is making it more feasible for more people to be able to actually take advantage of that.

16:04 And so there are managed platforms that help you with that problem.

16:08 There are a lot of open source projects that approach it.

16:10 Yeah.

16:11 There are whole platforms that are just around to do data streaming for you, right?

16:15 To just like sort of manage that and keep that alive.

16:18 And with the popularization of web hooks, right, it's easy to say if something changes here, you know, notify this other thing and that thing can call other things.

16:26 And it seems like it's coming along.

16:28 Yeah.

16:28 Yeah.

16:28 One of the interesting aspects, too, of a lot of the work that's been going into the data engineering space is that you're starting to see some of the architectural patterns and technologies move back into the application development domain.

16:40 Where a lot of applications, particularly if you're working with microservices, will use something like a Kafka or a Pulsar queue as the communication layer for being able to propagate information across all the different decoupled applications.

16:54 And that's the same technology and same architectural approaches that are being used for these real-time data pipelines.

17:00 Yeah.

17:00 Man, aren't queues amazing for adding scale to systems, right?

17:04 And if it's going to take too long, throw it in a queue and let the thing crank on it for 30 seconds.

17:09 It'll be good.

17:09 Absolutely.

17:10 I mean, Celery is, you know, the same idea.

17:12 It's just a smaller scale.

17:14 And so, you know, RabbitMQ, it's more ephemeral.

17:16 Whereas when you're putting it into these durable queues, you can do more with the information where you can rewind time to be able to say, okay, I changed my logic.

17:25 I now want to reprocess all of these records from the past three months.

17:29 Whereas if you had that on RabbitMQ, all of those records are gone unless you wrote them out somewhere else.

17:44 where the issue is coming from or how to solve it? Datadog seamlessly correlates logs and traces

17:49 at the level of individual requests, allowing you to quickly troubleshoot your Python application.

17:53 Plus, their continuous profiler allows you to find the most resource-consuming parts of your

17:58 production code all the time at any scale with minimal overhead. Be the hero that got that app

18:04 back on track at your company. Get started today with a free trial at talkpython.fm/datadog,

18:10 or just click the link in your podcast player's show notes. Get the insight you've been missing

18:14 with Datadog. A couple of comments from the live stream. Defria says Airflow, Apache Airflow is

18:20 really cool for sure. We're going to talk about that. But I did want to ask you about the cloud.

18:23 Stefan says, I'm a little bit skeptical about the privacy and security on the cloud. So kind of

18:29 want to use their own server more often. So maybe that's a trend that you could speak to that you've

18:34 seen with folks you've interviewed. This kind of data is really sensitive sometimes and people are

18:39 very protective of it or whatever. Right. So what is the cloud story versus, oh, we got to do this all

18:45 on prem or maybe even some hybrid thereof? Right. So I mean, it's definitely an important question and

18:50 something that is, it's a complicated problem. There are ways to solve it. I mean, data governance

18:55 is kind of the umbrella term that's used for saying, I want to keep control of my data and make sure that I

19:01 am complying with the appropriate regulatory requirements and making sure that I am filtering out private information or

19:09 encrypting data at rest, encrypting data in transit. And so there are definitely ways that you can

19:14 keep tight control over your data, even when you're in the cloud. And a lot of the cloud platforms have

19:19 been building out capabilities to make it easier for you. So for instance, if you're on Amazon,

19:23 they have their key management service that you can use to encrypt all of your storage at rest.

19:28 You can provide your own keys if you don't trust them to hold the keys to the kingdom there so that you

19:34 are the person who's in control of being able to encrypt and decrypt your data.

19:38 You know, there are a class of technologies used in data warehouses called privacy enhancing

19:42 technologies, where you can actually have all of the rows in your database fully encrypted.

19:48 And then you can encrypt the predicate of a SQL query to be able to see if the data matches the

19:56 values in the database without ever actually having to decrypt anything so that you could do some

20:00 rudimentary analytics like aggregates on that information so that it all stays safe.
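
As a very simplified, toy illustration of that equality-predicate idea (not any of the actual privacy-enhancing products being described), deterministic keyed hashing lets you compare a sealed predicate value against sealed column values without ever touching plaintext. All keys, values, and column names below are invented.

```python
import hmac
import hashlib

# Toy illustration only: store a keyed, deterministic digest of a sensitive column so
# that equality predicates can be evaluated against digests instead of plaintext.
# Real privacy-enhancing / searchable-encryption systems are far more sophisticated.
SECRET_KEY = b"not-a-real-key"

def seal(value: str) -> str:
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# "Encrypted" rows as they might sit in the warehouse (values are hypothetical).
rows = [
    {"customer": seal("alice@example.com"), "order_total": 120},
    {"customer": seal("bob@example.com"), "order_total": 80},
]

# Evaluate WHERE customer = 'alice@example.com' without decrypting anything:
predicate = seal("alice@example.com")
matching_total = sum(r["order_total"] for r in rows if r["customer"] == predicate)
print(matching_total)  # aggregate computed entirely over sealed values -> 120
```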

20:05 There's also a class of technologies that are still a little bit in the experimental phase called

20:11 homomorphic encryption, where the data is never actually decrypted. So it lives in this

20:17 encrypted enclave, your data processing job operates within that encrypted space. And so there's never any

20:24 actual clear text information stored anywhere, not even in your computer's RAM.

20:29 Wow. So if one of those like weird CPU bugs that lets you jump through the memory of like different

20:35 VMs or something like that, even then you're probably okay, right?

20:39 Absolutely. Yeah. I mean, the homomorphic encryption, there are some companies out there that are

20:43 offering that as a managed service. And, you know, it's becoming more viable. It's been something that's

20:49 been discussed and theorized about for a long time. But because of the computational cost, it was something

20:54 that was never really commercialized. But there are a lot of algorithms that have been discovered to help

20:59 make it more feasible to actually use in production contexts.

21:02 Yeah. I don't know about the other databases. I know MongoDB, they added some feature where you

21:06 can encrypt just certain fields, right? So maybe here's a field that is sensitive, but you don't

21:11 necessarily need to query by for your reports, but it needs to be in there with, say, a user or an order

21:16 or something like that. So even going to that part might be a pretty good step. But yeah, the clouds are

21:21 both amazing and scary, I suppose.

21:23 Yeah. Yeah. I mean, there's definitely a lot of options. It's something that requires a bit of

21:27 understanding and legwork, but it's definitely possible to make sure that all your data stays

21:32 secured and that you are in full control over where it's being used.

21:37 Yeah. So one of the next things I wanted to ask you about is languages. So you're probably familiar

21:43 with this chart here, right? Which if people are not watching the stream, this is the Stack Overflow

21:49 trend showing Python just trouncing the other languages, including Java. But I know Java had been

22:00 maybe one of the main languages there, which probably has to do with Spark and whatnot to some degree.

22:00 What do you see Python's role relative to other technologies here?

22:04 So Python has definitely been growing a lot in the data engineering space,

22:08 largely because of the fact that it's so popular in data science. And so there are data scientists

22:14 who have been moving further down the stack into data engineering as a requirement of their job.

22:19 And so they are bringing Python into those layers of the stack. It's also being used as just a unifying

22:25 language so that data engineers and data scientists can work on the same code bases. As you mentioned,

22:31 Java has been popular for a long time in the data ecosystem because of things like Hadoop and Spark.

22:37 And looking at the trend graph, I'd be interested to see what it looks like if you actually combine the

22:43 popularities of Java and Scala because Scala has become the strong contender in that space as well

22:49 because of things like Spark and Flink that have native support for Scala. It's a bit more of an

22:55 esoteric language, but it's used a lot in data processing. But Python has definitely gained a lot

23:00 of ground. And also because of tools like Airflow, which was kind of the first generation tool built for

23:07 data engineers by data engineers to be able to manage these dependency graphs of operations so

23:12 that you can have these pipelines to say, you know, I need to pull data out of Salesforce and then land

23:18 it into S3. And then I need to have another job that takes that data out of S3 and puts it into the

23:22 database. And then also that same S3 data needs to go into an analytics job. Then once those two jobs

23:29 are complete, I need to kick off another job that then runs a SQL query against the data warehouse to be

23:34 able to provide some aggregate information to my sales and marketing team to say, this is what your

23:40 customer engagement is looking like or whatever it might be. Yeah. And that was all written in Python.
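
A stripped-down sketch of what a DAG like the one just described might look like in Airflow. The DAG name and task bodies are placeholders, and real connections to Salesforce, S3, and the warehouse are omitted; import paths follow Airflow 2.x and may differ in older releases.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path

# Placeholder task bodies; a real DAG would call the Salesforce API, S3, and the warehouse.
def extract_salesforce_to_s3(**_):
    print("pull CRM data from Salesforce, land it in S3")

def load_s3_to_warehouse(**_):
    print("copy the landed files from S3 into the data warehouse")

def run_analytics_job(**_):
    print("run the analytics job against the same S3 data")

def build_sales_aggregates(**_):
    print("run the SQL aggregation for the sales and marketing dashboard")

with DAG(
    dag_id="salesforce_pipeline",   # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_salesforce", python_callable=extract_salesforce_to_s3)
    load = PythonOperator(task_id="load_warehouse", python_callable=load_s3_to_warehouse)
    analytics = PythonOperator(task_id="analytics_job", python_callable=run_analytics_job)
    aggregates = PythonOperator(task_id="sales_aggregates", python_callable=build_sales_aggregates)

    # The aggregate job only runs once both the warehouse load and the analytics job finish.
    extract >> [load, analytics] >> aggregates
```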

23:44 And also just because of the massive ecosystem of libraries that Python has for being able to

23:49 interconnect across all these different systems and data engineering at a certain level is really just

23:55 a systems integration task where you need to be able to have information flowing across all of these

24:01 different layers and all these different systems and get good control over it. Some of the interesting tools

24:05 that have come out as a sort of generational improvement over Airflow are Dagster and Prefect.

24:10 I've actually been using Dagster for my own work at MIT and been enjoying that tool. I'm always happy to dig into

24:16 that.

24:17 Let's sort of focus on those things. And one of the themes I wanted to cover is maybe the five most important

24:21 packages or libraries for data engineering. And you kind of hit the first one that will group together as a

24:27 trifecta, right? So Airflow, Dagster, and Prefect. You want to maybe tell us about those three?

24:33 Yeah.

24:33 Which one do you prefer?

24:35 So I personally use Dagster. I like a lot of the abstractions and the interface design that they

24:40 provide, but they're all three grouped into a category of tools called sort of workflow management

24:45 or data orchestration. And so the responsibility there is that you need to have a way to build these

24:52 pipelines, build these DAGs or directed acyclic graphs of operations where the edges of the graph

24:59 are the data and the nodes are the jobs or the operations being performed on them. And so you

25:05 need to be able to build up this dependency chain because you need to get information out of a source

25:10 system. You need to get it into a target system. You might need to perform some transformations either

25:14 en route or after it's been landed. You know, one of the common trends that's happening is it used to

25:20 be extract, transform, and then load because you needed to have all of the information in that

25:25 specialized schema for the data warehouse that we were mentioning earlier.

25:28 Right. Right. And for the relational database, it's got to have these exact columns.

25:33 It can't be a long character field. It's got to be a VARCHAR(10) or whatever.

25:37 Right. And then with the advent of the cloud data warehouses that have been happening in the past few

25:43 years that was kicked off by Redshift from Amazon and then carried on by things like Google BigQuery,

25:48 Snowflake that a lot of people will probably be aware of. You know, there are a number of other

25:53 systems and platforms out there. Presto, out of Facebook, which is now an open source project,

25:58 actually renamed to Trino. Those systems are allowing people to be very SQL oriented,

26:02 but because of the fact that they're scalable and they provide more flexible data models,

26:07 the trend has gone to extract, load, and then transform because you can just replicate the schema

26:12 as is into these destination systems. And then you can perform all of your transformations in SQL.
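
Here is a toy extract-load-then-transform flow with SQLite standing in for a cloud warehouse; the schema and rows are invented. The raw data is loaded as-is, and the reshaping happens afterwards in SQL, which is essentially the step that tools like DBT (discussed next) formalize and test.

```python
import sqlite3

# Toy ELT flow: load the raw records as-is, then do the transformation in SQL afterwards.
# SQLite stands in for a cloud warehouse here, and the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer TEXT, product TEXT, quantity INTEGER)")

# "Extract + load": dump the source rows without reshaping them first.
raw_rows = [("acme", "widget", 5), ("globex", "widget", 3), ("acme", "gadget", 7)]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# "Transform": build the analytics-friendly table inside the warehouse with plain SQL.
conn.execute("""
    CREATE TABLE orders_by_customer AS
    SELECT customer, SUM(quantity) AS total_quantity
    FROM raw_orders
    GROUP BY customer
""")

for row in conn.execute("SELECT * FROM orders_by_customer ORDER BY customer"):
    print(row)
```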

26:18 And so that brings us into another tool that is in the Python ecosystem that's been gaining a lot of

26:23 ground called DBT or data build tool. And so this is a tool that actually elevates data analysts and

26:30 improves their skill set, makes them more self-sufficient within the organization, and

26:37 provides a great framework for them to operate in an engineering mindset where it helps to build up a

26:43 specialized DAG within the context of the data warehouse to take those source data sets that are

26:49 landed into the data warehouse from the extract and load jobs and build these transformations.

26:54 So you might have the user table from your application database and the orders table.

27:00 And then you also have the Salesforce information that's landed in a separate table.

27:05 And you want to be able to combine all of those to be able to understand your customer order,

27:10 customer buying patterns. And so you use SQL to build either a view or build a new table out of that

27:16 source information in the data warehouse and DBT will handle that workflow.

27:21 It also has support for being able to build unit tests in SQL into your workflow.

27:27 Oh, how interesting. Yeah. That's something that you hadn't really heard very much of

27:31 10 years ago was testing in databases. It was usually, how do I get the database out of the

27:37 picture so I can test without depending upon it or something like that? That was the story.

27:41 Yeah. That's another real growing trend is the overall aspect of data quality and confidence in your

27:47 data flows. So things like in Dagster and Prefect and Airflow, they have support for being able to

27:54 unit test your pipelines, which is another great aspect of the Python ecosystem is you can just

27:58 write pytest code to ensure that all the operations on your data match your expectations and you don't

28:03 have regressions and bugs. Right. Right. Absolutely.

28:06 The complicating aspect of data engineering is that it's not just the code that you need to make sure

28:11 is right, but you also need to make sure that the data is right. And so another tool that is helping

28:16 in that aspect, again, from the Python ecosystem, is Great Expectations.

28:19 Right. And that's right in the realm of this testing your data.

28:23 Yeah. Exactly. Absolutely.

28:24 So you can say, you know, I'm pulling data out of my application database. I expect the schema to have

28:30 these columns in it. I expect the data distribution within this column to, you know, the values are only

28:36 going to range from zero to five. And then if I get a value outside of that range, then I can,

28:41 you know, it will fail the test and it will notify me that something's off. So you can build these very

28:45 expressive and flexible expectations of what your data looks like, what your data pipeline is going to do

28:51 so that you can gain visibility and confidence into what's actually happening as you are propagating

28:57 information across all these different systems. So do you make this part of your continuous

29:01 integration tests? Absolutely. Yeah. So it would be part of your continuous integration as you're

29:05 delivering new versions of your pipeline, but it's also something that executes in the context of the

29:10 nightly batch job or of your streaming pipeline. So it's both a build time and a runtime expectation.
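
A minimal sketch of that idea with Great Expectations, using the library's older pandas-dataset style API (newer releases have reorganized this interface); the data frame, column names, and thresholds are invented.

```python
import great_expectations as ge
import pandas as pd

# A small batch of data as it might arrive from the application database (values invented).
batch = pd.DataFrame({"order_id": [1, 2, 3], "rating": [4, 5, 2]})

# Wrap it with Great Expectations so expectations can be evaluated against it.
df = ge.from_pandas(batch)

# Declare what "good" data looks like: the column exists and ratings stay within 0-5.
df.expect_column_to_exist("rating")
result = df.expect_column_values_to_be_between("rating", min_value=0, max_value=5)

# In a pipeline run this would fail the job or page someone instead of just printing.
print(result.success)
```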

29:16 Yeah, yeah, yeah. So it's like a pre test. It's like an if test for your function.

29:21 But for your data, right? Like, let's make sure everything's good before we run through this and

29:25 actually drop the answer on to the dashboard for the morning or something like that.

29:29 Okay, right. Yeah, it helps to build up that confidence because anybody who's been working

29:33 in data has had the experience of I delivered this report, I feel great about it. I'm happy that I was

29:40 able to get this thing to run through and then you hand it off to your CEO or your CTO and they look at it

29:45 and they say, well, this doesn't quite look right. And then you go back and realize, oh, crud,

29:49 that's because I forgot to pull in this other column or whatever it is. And so this way you

29:53 can not have to have that sinking feeling in your gut when you hand off the report.

29:57 That would be bad. What would be worse is we decided to invest by buying a significant position

30:03 in this other company. Oh, but it turned out, whoops, it was actually we had a negative sign.

30:07 It wasn't really good for you to invest in this.

30:09 Absolutely. Yep.

30:10 Especially if actions have already been taken on it.

30:15 The next question of Talk Python to me is brought to you by Retool. Do you really need a full dev team to build

30:19 that simple internal app at your company? I'm talking about those back office apps,

30:23 the tool your customer service team uses to access your database, that S3 uploader you built last year for the marketing team,

30:29 the quick admin panel that lets you monitor key KPIs, or maybe even the tool your data science team hacked together

30:35 so they could provide custom ad spend insights.

30:38 Literally every type of business relies on these internal tools, but not many engineers love building these tools,

30:45 let alone get excited about maintaining or supporting them over time.

30:48 They eventually fall into the please don't touch it. It's working category of apps.

30:52 And here's where Retool comes in. Companies like DoorDash, Brex, Plaid, and even Amazon use Retool to build internal tools super fast.

31:01 The idea is that almost all internal tools look the same.

31:04 Forms over data. They're made up of tables, dropdowns, buttons, text input,

31:08 and so on.

31:09 Retool gives you a point, click, and drag and drop interface that makes it super simple to build internal UIs like this in hours, not days.

31:18 Retool can connect to any database or API.

31:20 Want to pull data from Postgres? Just write a SQL query and drag the table onto your canvas.

31:25 Search across those fields, add a search input bar and update your query.

31:29 Save it, share it, super easy.

31:31 Retool is built by engineers, explicitly for engineers.

31:36 It can be set up to run on-prem in about 15 minutes using Docker, Kubernetes, or Heroku.

31:41 Get started with Retool today.

31:42 Just visit talkpython.fm/retool or click the Retool link in your podcast player show notes.

31:51 Hey, let me jump us back really quick to that language trends question.

31:54 So Anthony Lister asks if R is still widely used and sort of a strong competitor, let's say, to Python.

32:02 What's your thoughts these days?

32:03 I kind of honestly hear a little bit less of it in my world for some reason.

32:06 Yeah, so there are definitely a lot of languages.

32:09 R is definitely one of them that's still popular in the data space.

32:12 I don't really see R in the data engineering context.

32:15 It's definitely still used for a lot of statistical modeling, machine learning, data science workloads.

32:21 There's a lot of great interoperability between R and Python now, especially with the Arrow project, which is an in-memory columnar representation that provides an interoperable in-memory space where you can actually exchange data between R and Python and Java without having to do any I/O or copying between them.

32:41 So it helps to reduce a lot of the impedance mismatch between those languages.
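
A small sketch of that hand-off with PyArrow; the column names and output file name are invented. The same Arrow table, or the Parquet file it writes, can be read from R's arrow package without another serialization step.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an Arrow table; the same in-memory columnar layout can be handed to pandas,
# R (via the arrow package), Spark, and others without re-serializing the data.
table = pa.table({
    "customer_id": [1, 2, 3],
    "amount_usd": [50.0, 30.0, 84.0],
})

# Hand-off to pandas on the Python side.
df = table.to_pandas()
print(df.dtypes)

# Writing a Parquet file is another common exchange point between languages.
pq.write_table(table, "orders.parquet")
```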

32:46 Another language that's been gaining a lot of ground in the data ecosystem is Julia, and they're actually under the NumFOCUS organization that supports a lot of the Python data ecosystem.

32:57 Yeah.

32:58 So Julia has been gaining a lot of ground, but Python, just because of its broad use, is still very popular.

33:03 And there's an anecdote that I've heard a number of times, I don't remember where I first came across it, that Python isn't the best language for anything, but it's the second best language for everything.

33:12 Yeah, that's a good quote.

33:14 I think it does put a lot of perspective on it.

33:16 I feel like it's just so approachable, right?

33:19 Exactly.

33:19 And there's a lot of these languages that might make slightly more sense for a certain use case like R and statistics, but you better not want to have to build some other thing that reaches outside of what's easily possible, right?

33:32 Like, right, you want to make that an API now?

33:34 Well, all of a sudden, it's not so easy or whatever, right?

33:37 Something along those lines.

33:38 Exactly.

33:38 All right.

33:39 Next in our list here is Dask.

33:41 Yeah.

33:42 So Dask is a great tool.

33:44 I kind of think about it as the Python version of Spark.

33:48 There are a number of reasons that's not exactly accurate, but it's a tool that lets you parallelize your Python operations, scale it out into clusters.

33:58 It also has a library called Dask.distributed that's used a lot for just scaling out Python independent of actually building the directed acyclic graphs in Dask.

34:09 So one of the main ways that Spark is used is as an ETL engine.

34:13 So you can build these graphs of tasks in Spark.

34:15 You can do the same thing with Dask.

34:17 It was actually built originally more for the hard sciences and for scientific workloads and not just for data science.

34:24 Yeah.

34:24 But Dask is actually also used as a foundational layer for a number of the data orchestration tools out there.

34:31 So Dask is the foundational layer for Prefect.

34:35 You can use it as an execution substrate for the Dagster library, the Dagster framework.

34:45 It's also supported in Airflow as an execution layer.

34:45 And there are also a number of people who are using it as a replacement for things like Celery as just a means of running asynchronous tasks outside of the bounds of a request response cycle.

34:54 So it's just growing a lot in the data ecosystem, both for data engineering and data science.

34:59 And so it just provides that unified layer of being able to build your data engineering workflows and then hand that directly off into machine learning so that you don't have to jump between different systems.

35:11 You can do it all in one layer.

35:12 Yeah, that's super neat.

35:13 And Dask, I never really appreciated it.

35:15 Sort of it's different levels at which you can use it, I guess I should say.

35:19 You know, when I thought about it, OK, well, this is like parallel computing for Pandas or for NumPy or something like that.

35:25 Right.

35:25 But it's also it works well on just your single laptop.

35:29 Right.

35:29 It'll let you run multi-core stuff locally because Python doesn't always do that super well.

35:34 And I think it'll even do caching and stuff so it can actually work with more data than you have RAM, which is hard with straight NumPy.

35:43 But then, of course, you can point it at a cluster and go crazy.

35:45 Exactly.

35:46 Yeah.

35:46 And because of the fact that it has those transparent API layers for being able to swap out the upstream Pandas with the Dask Pandas library and NumPy, it's easy to go from working on your laptop to just changing an import statement.

36:02 And now you're scaling out across a cluster of hundreds of machines.
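
A minimal sketch of that pandas-to-Dask swap; the file pattern and column names are hypothetical.

```python
# With pandas you might write:
#   import pandas as pd
#   df = pd.read_csv("orders.csv")
# The Dask version keeps almost the same API but partitions the work, so it can use
# every core on a laptop, spill to disk for larger-than-RAM data, or run on a cluster.
import dask.dataframe as dd

df = dd.read_csv("orders-*.csv")          # file pattern is hypothetical
totals = df.groupby("customer")["amount_usd"].sum()

# Nothing has executed yet; .compute() runs the task graph on the default local scheduler,
# or on a cluster if you first connect a dask.distributed Client.
print(totals.compute())
```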

36:04 Yeah, that's pretty awesome, actually.

36:06 Yeah.

36:06 Maybe that has something as well to do with the batch to real time.

36:10 Right.

36:10 If you've got to run it on one core on one machine, it's a batch job.

36:14 If you can run it on an entire cluster that's, you know, sitting around idle.

36:18 Well, then all of a sudden it's real time.

36:20 Right.

36:21 Yeah, there's a lot of interesting real time stuff.

36:23 There's an interesting project, sort of a side note here called Wallaroo that's built for building stateful stream processing jobs using Python.

36:32 And interestingly, it's actually implemented in a language called Pony.

36:35 But Pony?

36:37 Yeah.

36:37 An interesting project that, you know, levels up your ability to scale out the speed of execution and, sort of, just being able to build these complex pipelines and real-time jobs

36:50 without having to build all of the foundational layers of it.

36:53 Yeah.

36:54 Okay.

36:54 Interesting.

36:55 I have not heard of this one.

36:56 That sounds fun.

36:57 Yeah.

36:57 It's not as widely known.

36:58 I interviewed the creator of it on the data engineering podcast a while back, but it's a tool that comes up every now and then.

37:05 Interesting approach to it.

37:06 Yeah.

37:07 Right in that stream processing real time world.

37:09 Right.

37:10 The next one that you put on our list here is...

37:12 Meltano.

37:13 Meltano.

37:13 Meltano.

37:14 I got to say it right.

37:15 Yeah.

37:15 Yeah.

37:16 So that one is an interesting project.

37:18 It came from the GitLab folks.

37:20 It's still supported by them.

37:21 And in its earliest stage, they actually wanted it to be the full end-to-end solution for data analytics for startups.

37:32 Meltano is actually an acronym for, if I can remember correctly, model, extract, load, transform, analyze, notebook, and orchestrate.

37:41 Okay.

37:42 Yeah.

37:43 That's quite a wild one to put into something you can say well.

37:47 Exactly.

37:47 And, you know, about a year, year and a half ago now, they actually decided that they were being a little too ambitious and trying to boil the ocean and scoped it down to doing the extract and load portions of the workflow really well.

38:02 Because it's a very underserved market where you would think that given the amount of data we're all working with, point-to-point data integration and extract and load would be a solved problem, easy to do.

38:11 But there's a lot of nuance to it.

38:13 And there isn't really one easy thing to say, yes, that's the tool you want to use all the time.

38:18 And so there are some paid options out there that are good.

38:21 Meltano is aiming to be the default open source answer for data integration.

38:26 And so it's building on top of the Singer specification, which is sort of an ecosystem of libraries that was built by a company called Stitch Data.

38:35 But the idea is that you have what they call taps and targets, where a tap will tap into a source system, pull data out of it, and then the targets will load that data into a target system.

38:47 And they have this interoperable specification that's JSON-based so that you can just wire together any two taps and targets to be able to pull data from a source into a destination system.
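
A toy sketch of what those line-delimited, JSON-based Singer messages look like from the tap side; the stream name, schema, and records are invented, and a real tap would be wired to an actual source system and then piped into a target.

```python
import json
import sys
from datetime import datetime, timezone

# Toy Singer-style "tap": the spec is line-delimited JSON on stdout, so a tap can be piped
# into any target. Stream and record fields here are hypothetical; real taps emit richer
# SCHEMA messages and handle incremental extraction properly.
def emit(message: dict) -> None:
    sys.stdout.write(json.dumps(message) + "\n")

emit({
    "type": "SCHEMA",
    "stream": "orders",
    "schema": {"properties": {"order_id": {"type": "integer"}, "amount": {"type": "number"}}},
    "key_properties": ["order_id"],
})

emit({"type": "RECORD", "stream": "orders", "record": {"order_id": 1001, "amount": 50.0}})
emit({"type": "RECORD", "stream": "orders", "record": {"order_id": 1002, "amount": 30.0}})

# A STATE message acts as a bookmark so the next run can extract incrementally.
emit({"type": "STATE", "value": {"orders": {"last_synced": datetime.now(timezone.utc).isoformat()}}})
```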

38:57 Nice.

38:58 Yeah, it's definitely a well-designed specification.

39:00 A lot of people like it.

39:02 There are some issues with the way that the ecosystem was sort of created and fostered.

39:07 So there's a lot of uncertainty or variability in terms of the quality of the implementations of these taps and targets.

39:13 And there was never really one cohesive answer to this is how you run these in a production context, partially because Stitch Data was the answer to that.

39:23 So they wanted you to buy into this open source ecosystem so that you would then use them as the actual execution layer.

39:29 And so Meltano is working to build an open source option for you to be able to wire together these taps and targets and be able to just have an easy out-of-the-box data integration solution.

39:39 So it's a small team from GitLab, but there's a large and growing community helping to support it.

39:45 And they've actually been doing a lot to help push forward the state-of-the-art for the Singer ecosystem, building things like a starter template for people building taps and targets so that there's a common baseline of quality built into these different implementations without having to wonder about, you know, is this tap going to support all of the features of the specification that I need?

40:06 Nice.

40:06 Is this actually from GitLab?

40:08 Yeah, so it's sponsored by GitLab.

40:10 The source code is within the GitLab organization on GitLab.com, but it's definitely a very community-driven project.

40:18 Yeah, Stefan is quite excited about the open source and default open source choice.

40:23 Yeah.

40:23 Well, I think there's two things.

40:24 One, open source is amazing, but two, you get this paradox of choice, right?

40:29 It's like, well, it's great.

40:30 You can have anything, but there's so many things and I'm new to this.

40:33 What do I do?

40:34 Right.

40:34 And so, yeah, Meltano is trying to be the answer to that: you know, you just run meltano init, you have a project, you say, I want these sources and destinations, and then it will help you handle things like making sure that the jobs run on a schedule, handling and tracking the state of the operations, because you can do either full extracts and loads every time, or you can do incremental, because you don't necessarily want to dump a 4 million line source table every single time it runs.

41:00 You just want to pull the 15 lines that changed since the last operation.

41:03 So it will help track that state for you.

41:06 Oh, that's cool.

41:06 And try to be real efficient and just get what it needs.

41:09 Yeah.

41:09 And it builds in some of the monitoring information that you want to be able to see as far as like execution time, performance of these jobs.

41:17 And it actually, out of the box, will use Airflow as the orchestration engine for being able to manage these schedules, but everything is pluggable.

41:24 So if you wanted to write your own implementation that will use Dagster as the orchestrator instead, then you can do that.

41:30 There's actually a ticket in their tracker for doing that work.

41:33 So it's very pluggable, very flexible, but gives you a lot of out of the box answers to being able to just get something up and running quickly.

41:40 Yeah.

41:40 And it looks like you can build custom loaders and custom extractors.

41:43 So if you've got some internal API, that's who knows, maybe it's a SOAP XML endpoint or some random thing, right?

41:51 You could do that.

41:51 Exactly.

41:52 Yeah.

41:52 And they actually lean on DBT, another tool that we were just talking about, as the transformation layer.

41:58 So they hook directly into that so that you can very easily do the extract and load and then jump into DBT for doing the transformations.

42:06 Yeah.

42:06 Now, you didn't put this one on the list, but I do want to ask you about it.

42:09 What's the story of something like Zapier in this whole, you know, get notified about these changes, push stuff here?

42:15 I mean, it feels like if you were trying to wire things together, I've seen more than one Python developer reach for Zapier.

42:21 Yeah. So Zapier is definitely a great platform, particularly for doing these event-based workflows.

42:27 You can use it as a data engineering tool if you want, but it's not really what it's designed for.

42:33 It's more just for business automation aspects or maybe automation of my application did this thing,

42:39 and now I want to have it replicate some of that state out to a third-party system.

42:43 Zapier isn't really meant for the sort of full-scale data engineering workflows, maintaining visibility.

42:50 It's more just for this evented IO kind of thing.

42:53 Yeah. So here on the Meltano site, it says, pipelines as code, ready to be version controlled and containerized and deployed continuously.

43:00 The CI-CD side sounds pretty interesting, right?

43:04 Especially with these workflows that might be in flight while you're making changes.

43:08 How does that work?

43:09 How does that work? Do you know?

43:10 Basically, the point with Meltano is that everything is versioned in Git.

43:15 So that's another movement that's been happening in the data engineering ecosystem where early on,

43:19 a lot of the people coming to it were systems administrators, database administrators,

43:24 maybe data scientists who had a lot of the domain knowledge, but not as much of the engineering expertise to be able to build these workflows in a highly engineered,

43:34 highly repeatable way.

43:35 And the past few years has been seeing a lot of movement of moving to data ops and ML ops to

43:41 make sure that all of these workflows are well-engineered, well-managed,

43:46 you know, version controlled, tested.

43:49 And so having this DevOps oriented approach to data integration is what Meltano is focusing on,

43:55 saying all of your configuration, all of your workflows, it lives in Git.

43:58 You can run it through your CI-CD pipeline to make sure that it's tested.

44:01 And then when you deliver it, you know that you can trust that it's going to do what you want it to do

44:06 rather than I just push this config from my laptop and hopefully it doesn't blow up.

44:11 Right.

44:12 It also sounds like there's a lot of interplay between these things like Meltano might be leveraging Airflow

44:18 and DBT and maybe you want to test this through CI with great expectations before it goes through its CD side,

44:26 like continuous deployment.

44:27 Seems like there's just a lot of interflow here.

44:29 Definitely.

44:30 And there have been a few times where I've been talking to people and they've asked me to kind of

44:34 categorize different tools or like draw nice lines about what are the dividing layers of the different parts

44:40 of the data stack.

44:40 And it's not an easy answer because so many of these tools fit into a lot of different boxes.

44:46 So, you know, Spark is a streaming engine, but it's also an ELT tool.

44:52 And, you know, Dagster is a data orchestration tool, but it can also be used for managing delivery.

45:00 You can write it to do arbitrary tasks.

45:02 So you can build up these chains of tasks.

45:04 So if you wanted to use it for UCI-CD, you could.

45:07 Quite what it's built for.

45:08 But, you know, and then different databases have been growing a lot of different capabilities

45:14 where, you know, it used to be you had your SQL database or you had your document database

45:18 or you had your graph database.

45:20 And then you have things like ArangoDB, which can be a graph database and a document database

45:26 and a SQL database all on the same engine.

45:28 So, you know, there's a lot of multimodal databases.

45:31 It's all of the SQL and all the NoSQL, all in one.

45:34 Right.

45:34 And, you know, JSON is being pushed into relational databases and data warehouses.

45:39 So there's a lot of crossover between the different aspects of the data stack.

45:43 Yeah, there probably is more of that, I would say, in this like data warehousing stuff.

45:48 You know, in an operational database, it doesn't necessarily make a ton of sense to jam JSON

45:52 blobs all over the place.

45:54 You might as well just make tables and columns.

45:55 Yeah.

45:56 No, it makes some sense, but not that much.

45:57 But in this space, you might get a bunch of things you don't really know what their shape

46:00 is or exactly you're not ready to process it.

46:02 You just want to save it and then try to deal with it later.

46:05 So do you see more of that, those kind of JSON columns or more NoSQL stuff?

46:09 Absolutely.

46:10 Basically, any data warehouse worth its salt these days has to have some sort of support for nested

46:15 data.

46:15 A lot of that, too, comes out of the outgrowth of, you know, we had the first generation data

46:20 warehouses.

46:21 They did their thing, but they were difficult to scale and they were very expensive.

46:24 And you had to buy these beefy machines so that you were planning for the maximum capacity

46:29 that you're going to have.

46:30 And then came things like Hadoop, where you said, oh, you can scale out as much as you

46:34 want, just add more machines.

46:35 They're all commodity.

46:36 And so that brought in the era of the data lake.

46:40 And then things like S3 became inexpensive enough that you could put all of your data storage

46:45 in S3, but then still use the rest of the Hadoop ecosystem for doing MapReduce jobs

46:49 on that.

46:50 And then that became the next generation data lake.

46:53 And then things like Presto came along to be able to build a data warehouse interface on

46:58 top of this distributed data and these various data sources.

47:01 And then you had the dedicated data warehouses built for the cloud, where they were designed

47:07 to be able to ingest data from S3, where you might have a lot of unstructured information.

47:12 And then you can clean it up using things like DBT to build these transformations, have these

47:17 nicely structured tables built off of this nested or messy data that you're pulling in from various

47:23 data sources.

47:23 Yeah.

47:24 Interesting.

47:24 What do you see as the story of versioning for this, the data itself, I'm thinking?

47:30 So I've got this huge pile of data I've built up and we're using to drive these pipelines.

47:35 But it seems like the kind of data that could change.

47:38 Or I brought in a new source now that we've switched credit card providers or we're now screen

47:43 scraping extra data.

47:44 Do you see anything interesting happen there?

47:46 Yeah.

47:46 So there's definitely a lot of interesting stuff happening in the data versioning space.

47:49 So I mean, one tool that was kind of early to the party is a platform called Pachyderm.

47:55 They're designed as an end-to-end solution built on top of Kubernetes for being able to do data

48:01 science and data engineering and data versioning.

48:03 So your code and your data all gets versioned together.

48:06 There's a system called LakeFS that was released recently that provides a Git-like workflow on top

48:13 of your data that lives in S3.

48:15 And so they act as a proxy to S3, but it lets you branch your data to say,

48:21 I want to bring in this new data source.

48:23 And as long as everything is using LakeFS as the interface, then your main branch won't see any of this new data source until you are happy with it.

48:32 And then you can commit it and merge it back into the main branch and then it becomes live.

48:36 And so this is a way to be able to experiment with different processing workflows to say,

48:40 I want to try out this new transformation job or this new batch job, or I want to bring in this new data source, but I'm not quite confident about it yet.

48:47 And so it brings in this versioning workflow.
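
As a rough illustration of that branch-then-merge idea, here is a hedged sketch using boto3 against lakeFS's S3-compatible gateway. The endpoint, credentials, repository name, branch names, and object keys are all assumptions made up for the example, and committing or merging the branch happens through lakeFS's own API or UI rather than through S3.

```python
import boto3

# Assumption: a lakeFS deployment exposing its S3-compatible gateway, with a repository
# named "analytics". In the gateway, the bucket is the repository and the first path
# segment of the object key is the branch.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # hypothetical endpoint
    aws_access_key_id="LAKEFS_KEY_ID",           # hypothetical credentials
    aws_secret_access_key="LAKEFS_SECRET",
)

# Land the experimental data source on a feature branch, not on main.
s3.put_object(
    Bucket="analytics",
    Key="new-credit-card-feed/raw/transactions/2021-01-29.json",
    Body=b'{"amount": 12.50, "currency": "USD"}',
)

# Anything reading from the main branch never sees the new feed until it is merged.
main_listing = s3.list_objects_v2(Bucket="analytics", Prefix="main/raw/transactions/")
print([obj["Key"] for obj in main_listing.get("Contents", [])])

# Committing the branch and merging it into main would be done via lakeFS's own API or UI
# (not shown here); after the merge, this same listing would include the new data.
```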

48:49 There's another system, really a combination of two tools. One is Iceberg, which is a table format for use in these large-scale data lakes and

48:58 data warehouses, and it hooks into things like Spark and Presto.

49:01 And there's another accompanying project called Nessie that is inspired by Git for being able

49:06 to do the same type of branching and merging workflow for bringing in new data sources

49:10 or changing table schemas and things like that.

49:12 Wow.

49:13 These all sound like such fun tools to learn and they're all solving painful problems.

49:17 Right.

49:17 And then another one actually from the Python ecosystem is DVC or data version control that's

49:22 built for machine learning and data science workflows that actually integrates with your source code

49:30 management so that you git commit and git push.

49:33 There's some additional commands, but they're modeled after Git where you commit your code

49:37 and then you also push your data and it lives in S3 and it will version the data assets so that

49:42 as you make different versions of your experiment with different versions of your data,

49:46 it all lives together so that it's repeatable and easier for multiple data scientists or data engineers

49:51 to be able to collaborate on it.
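
For a flavor of what that looks like from code, here is a minimal sketch using DVC's Python API; the repository URL, file path, and tag are hypothetical, and the data would have been added and pushed earlier with the dvc command line alongside the usual Git workflow.

```python
import dvc.api

# Assumption: a Git repo where data/train.csv is tracked by DVC and pushed to remote
# storage (for example S3), and where a tag like "v1.0" pins a version of code plus data.
with dvc.api.open(
    "data/train.csv",                              # hypothetical DVC-tracked file
    repo="https://github.com/example/ml-project",  # hypothetical repository
    rev="v1.0",                                    # Git tag or commit pinning the data version
) as f:
    print(f.readline())  # peek at the header of that exact version of the data
```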

49:53 Yeah, the version control story around data has always been interesting, right?

49:59 It's super tricky.

50:00 Absolutely.

50:01 I mean, on one hand, your schemas might have to evolve over time.

50:04 If you've got a SQLAlchemy model trying to talk to a database, it really hates it if there's a mismatch at all, right?

50:11 And so you want those things to go together, the database schema maybe changing along with your code with, like, migrations or something.

50:17 But then the data itself, yeah, that's tricky.

50:19 Yeah.

50:20 And so there's actually a tool called Avro and another one called Parquet.

50:24 Well, they're tools, they're data serialization formats.

50:27 And Avro in particular has a concept of schema evolution for, you know, what are compatible evolutions of a given schema.

50:36 So an Avro file has the schema co-located with the records in it.

50:40 So it's kind of like a binary version of JSON, but the schema is embedded with it.

50:44 Oh, wow.

50:44 Okay, that's interesting.

50:45 Yeah. So if you say, I want to change the type of this column from an int to a float, then, you know, maybe that's a supported conversion.

50:54 And so it will let you change the schemas or add columns.

50:58 But if you try to change the schema in a way that is not backwards compatible, it will actually throw an error.

51:04 I see. Like a float to an int might drop data, but an int to a float probably wouldn't.

51:08 Exactly. So it will let you evolve your schemas.
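
Here is a small sketch of that promotion rule, assuming the fastavro package (the core avro package behaves similarly): records are written with an int field and read back with a schema that widened it to a double, which Avro's resolution rules allow, whereas narrowing it the other way would be rejected.

```python
import io
from fastavro import writer, reader, parse_schema

# Writer schema: "amount" starts life as an int.
writer_schema = parse_schema({
    "type": "record", "name": "Payment",
    "fields": [{"name": "id", "type": "string"}, {"name": "amount", "type": "int"}],
})

buf = io.BytesIO()
writer(buf, writer_schema, [{"id": "a", "amount": 10}, {"id": "b", "amount": 25}])
buf.seek(0)

# Reader schema: "amount" has evolved to a double, a promotion Avro permits.
reader_schema = parse_schema({
    "type": "record", "name": "Payment",
    "fields": [{"name": "id", "type": "string"}, {"name": "amount", "type": "double"}],
})

for record in reader(buf, reader_schema=reader_schema):
    print(record)  # amount comes back as a float; a double -> int change would error instead
```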

51:11 And Parquet is actually built to be interoperable with Avro for being able to handle those schema evolutions as well, where Avro is a row or record oriented format.

51:20 And Parquet is column oriented, which is more powerful for being able to do aggregate analytics.

51:25 And it's more efficient so that you're not pulling all of the data for every row.

51:29 You're just pulling all of the data for a given column.

51:31 So it's also more compressible.

51:32 Yeah. I think I need to do more thinking to really fully grok the column oriented data stores.

51:37 Yeah.

51:37 It's a different way of thinking.

51:38 Yeah. The column oriented aspect is also a major revolution in how data warehousing has come about, where, you know, the first generation was all built on the same databases that we were using for our application.

51:49 So it was all row oriented. And that was one of the inherent limits to how well they could scale their compute.

51:55 Whereas all of the modern cloud data warehouses or all the modern, even non-cloud data warehouses are column oriented.

52:02 And so if you have, you know, one column that is street addresses and another column that's integers and another column that is, you know, varchar 15, then within each column all of the values are the same data type.

52:14 And so they can compress them down a lot more than if you have one row that is a street address and a text field and an integer and a float and a JSON array.

52:23 If you try to compress all of those together, they're not compatible data types.

52:27 So you have a lot more inefficiency in terms of how well you can compress it.

52:31 And then also as you're scanning, you know, a lot of analytics jobs are operating more on aggregates of information than on individual records.

52:39 And so if you want to say, I want to find out what is the most common street name across all the street addresses that I have in my database, all I have to do is pull all the information out of that street address column.

52:49 It's all co-located on disk.

52:51 So it's a faster seek time and it's all compressed the same.

52:55 And that way you don't have to read all of the values for all of the rows to get all of the street addresses, which is what you would do in a relational database.
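
To ground that, here is a small pandas/Parquet sketch with invented data showing the "read just one column" pattern; pd.read_parquet with columns= only pulls the street column from disk instead of every row's full record (it assumes pyarrow or fastparquet is installed).

```python
import pandas as pd

# Invented example data: a few addresses written to a column-oriented Parquet file.
df = pd.DataFrame({
    "street": ["Main St", "Oak Ave", "Main St", "Elm St", "Main St"],
    "city": ["Portland", "Boston", "Portland", "Austin", "Denver"],
    "zip": ["97201", "02101", "97203", "73301", "80202"],
})
df.to_parquet("addresses.parquet")

# Columnar payoff: only the street column is read from disk, not whole rows.
streets = pd.read_parquet("addresses.parquet", columns=["street"])
print(streets["street"].value_counts().idxmax())  # most common street name: "Main St"
```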

53:02 Right.

53:03 Because probably those are co-located on disk by row.

53:06 Exactly.

53:06 Whereas if you're going to ask about all the streets across everyone, then it's better to store all the streets together and then all the cities or whatever.

53:14 Right.

53:15 Exactly.

53:15 Interesting.

53:16 Cool.

53:16 Well, I think I actually understand a little bit better now.

53:18 Thanks.

53:18 The final one that you put on the list, just maybe to put a pin in it, is the very, very popular pandas.

53:24 I never cease to be amazed with what you can do with pandas.

53:26 Yeah.

53:27 So, I mean, pandas, it's one of the most flexible tools in the Python toolbox.

53:31 I've used it in web development contexts.

53:34 I've used it for data engineering.

53:36 I've used it for data analysis.

53:37 And it's definitely the Swiss army knife of data.

53:40 So it's absolutely one of the more critical tools in the toolbox of anybody who's working with data, regardless of the context.

53:47 And so it's absolutely no surprise that data engineers reach for it a lot as well.

53:51 So pandas is supported natively in things like Dagster, where it will give you a lot of rich metadata information about the column layouts and the data distributions.

53:59 But yeah, it's just absolutely indispensable.

54:02 You know, it's been covered enough times in both your show and mine.

54:05 We don't need to go too deep into it.

54:07 But yeah, if you're working with data, absolutely get at least a little bit familiar with pandas.

54:11 Well, just to give people a sense, like one of the things I learned yesterday, I think it was, Chris Moffitt was showing off some things with pandas.

54:19 And he's like, oh, over on this Wikipedia page, three-fourths of the way down, there's a table.

54:24 The table has a header that has a name.

54:26 And you can just say, load HTML, give me the table called this as a data frame from screen scraping as part of the page.

54:33 It's amazing.
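
Here is roughly what that one-liner looks like; the URL and the match text are placeholders rather than the exact page from the conversation.

```python
import pandas as pd

# read_html fetches the page and returns a list of DataFrames, one per HTML table.
# match= keeps only tables whose text matches the given string or regex.
tables = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)",
    match="Population",
)

df = tables[0]
print(df.head())
```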

54:34 Yeah.

54:35 Another interesting aspect of the pandas ecosystem is the pandas extension arrays library that lets you create plugins for pandas to support custom data types.

54:45 So I know that they have support for things like GeoJSON and IP addresses so that you can do more interesting things out of the box in terms of aggregates and group-bys and things like that.

54:56 So, you know, if you have the IP address pandas extension, then you can say, give me all of the rows that are grouped by this network prefix and things like that.

55:06 Whereas pandas out of the box will just treat it as an object.

55:09 And so you have to do a lot more additional coding around it and it's not as efficient.

55:13 So that was an interesting aspect to the pandas ecosystem as well.
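
As a contrast with what such an extension dtype buys you, here is the manual version using plain pandas plus the standard-library ipaddress module: the addresses sit in an object column and the grouping by network prefix has to be coded by hand (the addresses and the /24 prefix length are made up for the example).

```python
import ipaddress
import pandas as pd

# Made-up addresses stored the way plain pandas sees them: as plain strings (object dtype).
df = pd.DataFrame({"ip": ["10.0.1.5", "10.0.1.9", "10.0.2.7", "192.168.0.3"]})

# The hand-rolled work an IP-aware extension dtype would otherwise do for you:
# parse each string and compute its /24 network prefix.
df["network"] = df["ip"].map(
    lambda addr: str(ipaddress.ip_network(f"{addr}/24", strict=False))
)

print(df.groupby("network").size())
```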

55:18 Nice.

55:18 One quick question.

55:19 And then I think we should probably wrap this up. Stefan threw out some stuff about graph databases, particularly GraphQL.

55:26 Or that's actually the API, right?

55:27 It's efficient, but what about its maturity?

55:29 Like, what do you think about some of these new API endpoints?

55:33 GraphQL is definitely gaining a lot of popularity.

55:36 I mean, as you mentioned, there's sometimes a little bit of confusion because they both have the word graph in the name.

55:41 So GraphQL and GraphDB.

55:42 I read it too quickly.

55:43 Yeah.

55:43 I'm like, oh, yeah, like Neo4j.

55:45 Wait, no, it has nothing to do with that.

55:46 Right.

55:46 So, you know, GraphQL is definitely a popular API design.

55:51 Interesting side note is that the guy who created Dagster is also one of the co-creators of GraphQL.

55:57 And Dagster has a really nice web UI that comes out of the box that has a GraphQL API to it so that you can do things like trigger jobs or introspect information about the running system.

56:07 Another interesting use of GraphQL is that there's a database engine called Dgraph that uses GraphQL as its query language.

56:16 So it's a native graph storage engine.

56:18 It's scalable, horizontally distributable.

56:21 And so you can actually model your data as a graph and then query it using GraphQL.

56:26 So certainly seeing a lot of interesting use cases within the data ecosystem as well.
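
For flavor, querying a GraphQL endpoint from Python is just an HTTP POST carrying a query string; this sketch uses the requests library against a hypothetical endpoint with made-up field names, so it illustrates the shape of the call rather than any particular tool's schema.

```python
import requests

# Hypothetical endpoint and schema; the field names are illustrative only.
query = """
{
  pipelines {
    name
    lastRunStatus
  }
}
"""

resp = requests.post(
    "https://graphql.example.com/graphql",  # placeholder endpoint
    json={"query": query},
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["data"])
```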

56:31 Yeah.

56:31 For the right kind of data, a graph database seems like it would really light up the speed of accessing certain things.

56:37 Absolutely.

56:37 Yeah.

56:38 So the funny thing is you have this concept of a relational database, but it's actually not very good at storing information about relationships.

56:45 It is.

56:46 The joins make them so slow and so on.

56:48 Exactly.

56:49 The lazy loading or whatever.

56:50 Yeah.

56:50 Right.

56:51 So graph databases are entirely optimized for storing information about relationships so that you can do things like network traversals or understanding

57:01 within this structure of relations.

57:02 You know, things like social networks are kind of the natural example of a graph problem where I want to understand what are the degrees of separation between these people.

57:11 So, you know, the six degrees of Kevin Bacon kind of thing.

57:13 Yeah.

57:13 Yeah.

57:14 Seems like you could also model a lot of interesting things like the, I don't know how real it is, but, you know, the bananas are at the back or the milk is at the back of the store.

57:21 So you have to walk all the way through the store and you can find those kind of traversing those like behaviors and relationships.

57:27 Yeah.

57:27 The traveling salesman problem, stuff like that.

57:29 Yeah.

57:29 Yeah, exactly.

57:30 All right.

57:31 Well, so many tools, way more than five that we actually made our way through, but very, very interesting because I think there's just so much out there and it sounds like a really fun place to work, like a technical space to work.

57:42 Absolutely.

57:42 You know, a lot of these ideas also seem like they're probably really ripe for people who have programming skills and software engineering mindsets, like CI/CD, testing, and so on.

57:53 Absolutely.

57:54 They'll come in and say, I could make a huge impact.

57:56 We have this organization that's going to be able to do that.

58:26 You know, the first step is, you know, just kind of start to take a look at it.

58:30 You know, you probably have data problems in the applications that you're working with, where maybe you're just using a sequence of Celery jobs and hoping that they complete in the right order.

58:39 You know, maybe take a look at something like Dagster or Prefect to build a more structured graph of execution.

58:45 If you don't want to go for a full fledged framework like that, there are also tools like Bonobo that are just command line oriented that help you build up that same structured graph of execution.

58:54 So, you know, definitely just start to take a look and try and understand, like, what are the data flows in your system?

58:59 If you think about it as more than just flows of logic, and think about it in terms of flows of data, then it starts to become a more natural space to solve with some of these different tools and practices.

59:09 So getting familiar with thinking about it in that way.
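
As a tiny sketch of what "a structured graph of execution" can look like, here is a minimal Dagster-style job, assuming a recent Dagster release with the @op/@job API; the steps are placeholders rather than a real pipeline, and Prefect or Bonobo express the same idea with their own decorators and graphs.

```python
from dagster import job, op

@op
def extract():
    # Placeholder for pulling raw records from an API, a queue, or a file drop.
    return [{"amount": 12.5}, {"amount": 7.0}]

@op
def transform(records):
    # Placeholder cleanup/enrichment step, e.g. dollars to cents.
    return [int(r["amount"] * 100) for r in records]

@op
def load(amounts):
    # Placeholder for writing the cleaned records into a warehouse table.
    print(f"loaded {len(amounts)} rows")

@job
def payments_pipeline():
    # Wiring outputs to inputs is what defines the dependency graph (the DAG).
    load(transform(extract()))

if __name__ == "__main__":
    payments_pipeline.execute_in_process()
```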

59:12 Another really great book, if you're definitely interested in data engineering and want to kind of get deep behind the scenes, is Designing Data-Intensive Applications.

59:20 I read that book recently and learned a whole lot more than I thought I would about just the entire space of building applications oriented around data.

59:28 So great resource there.

59:30 Nice. We'll put those in the show notes.

59:31 Yeah. And also just kind of raise your hand and say to your management or your team, hey, it looks like we have some data problems.

59:38 I'm interested in digging into it.

59:40 And chances are they'll welcome the help.

59:42 You know, lots of great resources out there if you want to learn more about it.

59:46 You know, shameless plug.

59:47 The Data Engineering Podcast is one of them.

59:49 I'm always happy to help answer questions.

59:52 Yeah. I mean, basically just start to dig into the space, take a look at some of the tools and frameworks and just try to implement them in your day to day work.

59:59 You know, a lot of data engineers come from software engineering backgrounds.

01:00:02 A lot of data engineers might come from database administrator positions because they're familiar with the problem domain of the data.

01:00:09 And then it's a matter of learning the actual engineering aspects of it.

01:00:13 A lot of people come from data analyst or data scientist backgrounds where they actually decide that they enjoy working more with getting the data clean and well managed than doing the actual analysis on it.

01:00:24 So there's not really any one concrete background to come from.

01:00:27 It's more just a matter of being interested in making the data reproducible, helping make it valuable.

01:00:34 Interesting note is that if you look at some of the statistics around it, there are actually more data engineering positions open, at least in the U.S., than there are data scientist positions because of the fact that it is such a necessary step in the overall lifecycle of data.

01:00:51 How interesting.

01:01:21 solid results because of solid output.

01:01:22 Those are extremely marketable skills.

01:01:25 That's awesome.

01:01:25 Exactly.

01:01:26 All right.

01:01:26 Well, Tobias, thanks so much for covering that.

01:01:29 Before we get out of here, though, final two questions.

01:01:32 So if you're going to write some Python code, what editor do you use these days?

01:01:36 So I've been using Emacs for a number of years now.

01:01:38 I've tried out things like PyCharm and VS Code here and there, but it just never feels quite right just because my fingers have gotten so used to Emacs.

01:01:46 You just want to have an entire operating system as your editor, not just a piece of software.

01:01:50 Exactly.

01:01:51 Got you.

01:01:52 And it has that ML background with Lisp as its language.

01:01:54 Right.

01:01:55 And then notable PyPI package or packages.

01:01:58 Yeah.

01:01:58 You should check out.

01:01:59 I mean, we kind of touched on some, right?

01:02:00 Yeah, exactly.

01:02:01 I mean, a lot of them in the list here.

01:02:03 I'll just mention again, Dagster, DBT, and Great Expectations.

01:02:07 Yeah.

01:02:07 Very nice.

01:02:08 All right.

01:02:08 Final call to action.

01:02:09 People are excited about this.

01:02:11 What should they do?

01:02:11 Listen to the Data Engineering Podcast.

01:02:13 Listen to podcast.init if you want to understand a little bit more about the whole ecosystem.

01:02:18 Because since I do spend so much time in the data engineering space, I sometimes have crossover where if there's a data engineering tool that's implemented in Python, I'll have them on podcast.init just to make sure that I can get everybody out there.

01:02:29 And yeah, feel free to send questions my way.

01:02:32 I'll add the information about the podcast in the show notes.

01:02:36 And yeah, just be curious.

01:02:38 Yeah, absolutely.

01:02:39 Well, like I said, it looks like a really interesting and growing space that has got a lot of low-hanging fruit.

01:02:45 So it sounds like a lot of fun.

01:02:46 Absolutely.

01:02:46 Yeah.

01:02:47 All right.

01:02:47 Well, thanks for being here.

01:02:48 And thanks, everyone, for listening.

01:02:49 Thanks for having me.

01:02:50 This has been another episode of Talk Python to Me.

01:02:53 Our guest in this episode was Tobias Macy, and it's been brought to you by Datadog and Retool.

01:02:58 Datadog gives you visibility into the whole system running your code.

01:03:02 Visit talkpython.fm/datadog and see what you've been missing.

01:03:06 We'll throw in a free t-shirt with your free trial.

01:03:09 Supercharge your developers and power users.

01:03:12 Let them build and maintain their internal tools quickly and easily with Retool.

01:03:16 Just visit talkpython.fm/retool and get started today.

01:03:21 Want to level up your Python?

01:03:23 We have one of the largest catalogs of Python video courses over at Talk Python.

01:03:27 Our content ranges from true beginners to deeply advanced topics like memory and async.

01:03:32 And best of all, there's not a subscription in sight.

01:03:35 Check it out for yourself at training.talkpython.fm.

01:03:38 Be sure to subscribe to the show.

01:03:40 Open your favorite podcast app and search for Python.

01:03:42 We should be right at the top.

01:03:44 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct

01:03:50 RSS feed at /rss on talkpython.fm.

01:03:54 We're live streaming most of our recordings these days.

01:03:57 If you want to be part of the show and have your comments featured on the air, be sure to subscribe

01:04:01 to our YouTube channel at talkpython.fm/youtube.

01:04:05 This is your host, Michael Kennedy.

01:04:06 Thanks so much for listening.

01:04:08 I really appreciate it.

01:04:09 Now get out there and write some Python code.

01:04:11 Bye.

01:04:11 Bye.

