#302: The Data Engineering Landscape in 2021 Transcript
00:00 I'm sure you're familiar with data science. But what about data engineering? Are these the same thing? Or how are they related? Data engineering is dedicated to overcoming data processing bottlenecks, data cleanup, data flow, and data handling problems for applications that utilize a lot of data. On this episode, we welcome back Tobias Macey to give us a 30,000-foot view of the data engineering landscape in 2021. This is Talk Python To Me, Episode 302, recorded January 29, 2021.
00:41 Welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy, keep up with the show and listen to past episodes at 'talkpython.fm', and follow the show on Twitter via @talkpython. This episode is brought to you by 'Datadog' and 'Retool'. Please check out what they're offering during their segments. It really helps support the show. Tobias, ready to kick it off?
01:07 Yeah, sounds good. Thanks for having me on, Mike. Yeah, great to have you here. Good to have you back. I was recently looking at my podcast page here, and it says you were on show 68, which was a lot of fun. That was when Chris Patti was with you as well, around 'Podcast.__init__'. But boy, that was 2016.
01:25 It has been a while. We've been at this a while. I mean, ironically, we started within a week of each other. But yeah, we're still going, both of us. It's definitely been a fun journey, with a lot of great, sort of unexpected benefits and great people that I've been able to meet as a result of it. So definitely glad to be able to be on the journey with you. Yeah, same here. Podcasting opens doors like nothing else. It's crazy: people who wouldn't normally want to talk to you, you say, hey, you want to be on the show? And it's, yeah, let's spend an hour together, all of a sudden, right? It's fantastic. What's new since 2016? What have you been up to? Definitely a number of things. I mean, one being that I actually ended up going solo as the host. So I've been running the 'Podcast.__init__' show by myself. I don't remember exactly when it happened, but I think probably sometime around 2017. I know around the same time that I was on your show, you were on mine, so we kind of flip-flopped, and then you've been on the show again since then, talking about your experience working with MongoDB and Python. Yeah. You know, beyond that, I also ended up starting a second podcast. So I've got 'Podcast.__init__', which focuses on Python and its community, so a lot of stuff about DevOps, data science, machine learning, web development, you name it. Anything that people are doing with Python, I've had them on. But I've also started a second show focused on data engineering. So going beyond just the constraints of Python into this separate niche: more languages, but a more tightly focused problem domain. And so I've been enjoying learning a lot more about the area of data engineering. So the two have actually been good companions: there's a lot of data science that happens in Python, so I'm able to cover that side of things on 'Podcast.__init__', and then data engineering is all of the prep work that makes data scientists' lives easier.
And so just learning a lot about the technologies and challenges that happen on that side of things.
03:09 Yeah, that's super cool. And to be honest, one of the reasons I invited you on the show is because I know people talk about data engineering, and I know there are neat tools. They feel like they come out of the data science space, but not exactly. And so I'm really looking forward to learning about them, along with everyone else listening. It's gonna be a lot of fun. Absolutely. Yeah. Before we dive into that, let people know: what are you doing day to day these days? Are you doing consulting? Or have you got a full-time job? What's the plan?
03:34 Yes. So, I mean, I run the podcasts as a side, just sort of hobby. And for my day to day, I actually work full time at MIT in the Open Learning department and help run the platform engineering and data engineering team there. So I'm responsible for making sure that all the cloud environments are set up and secured, servers are up and running, and applications stay available, and for working through building out a data platform to provide a lot of means for analytics and gaining insights into the learning habits and behaviors that global learners have and how they interact with all of the different platforms that we run. That's fantastic. Yeah, it's definitely a great place to work, and I've been happy to be there for a number of years now. And then, you know, I run the podcasts. Those go out every week, so there's a lot of stuff that happens behind the scenes there. And then I also do some consulting, where lately it's been more of the advisory type. It used to be I'd be hands on keyboard, but I've been able to level up beyond that. And so I've been working with a couple of venture capital firms to help them understand the data ecosystem, so data engineering and data science. I've also worked a little bit with a couple of businesses, just helping them understand what the challenges are and what the potential is in the data marketplace and data ecosystem to go beyond just having an application, and to use the information and the data that they gather from that to build more interesting insights into their business, but also products for their customers. Oh, yeah,
05:05 that sounds really fun. I mean, working at MIT sounds amazing. And then those advisory roles are really neat, because you kind of get a take. Especially as a podcaster, you get this broad view, because you talk to so many people, and they've got different situations in different contexts. And so you can say, all right, look, here's kind of what I see. You seem to fit into this realm, so this might be the right path.
05:25 Absolutely. I mean, the data ecosystem in particular is very fast moving. So it's definitely very difficult to be able to keep abreast with all the different aspects of it. And so because that's kind of my job as a podcaster, I'm able to dig deep into various areas of it and be able to learn from and take advantage of all of the expertise and insight that these various innovators and leaders in the space have and kind of synthesize that because I'm talking to people across, you know, the storage layer to the data processing layer and orchestration and analytics, and, you know, machine learning and operationalizing, all of that. Whereas if I were one of the people who's deep in the trenches, it's, you get a very detailed but narrow view. Whereas I've got a very shallow and broad view across the whole ecosystem. So I'm able to,
06:15 which is the perfect match for the high level view. Right? Exactly. Nice. All right, well, let's jump into our main topic. We touched on it a little bit, but I know what data science is, I think, and there's a really interesting interview I did recently with Emily and Jacqueline, I don't remember both their last names, about building a career in data science. They talked about basically three areas of data science that you might be in, like production and machine learning, versus making predictions, and so on. And data engineering, it feels like it's kind of in that data science realm, but it's not exactly that. Like, it could kind of be databases and other stuff too, right? So what is this data engineering thing? Maybe compare and contrast it against data science, as people probably know that pretty well.
06:57 Yeah. So it's one of those kind of all-encompassing terms where the role depends on the organization that you're in. In some places, the data engineer might just be the person who used to be the DBA, or database administrator. In other places, they might be responsible for cloud infrastructure. And in another place, they might be responsible for maintaining streaming systems. One way that I've seen it broken down is into two broad classifications of data engineering. There's the SQL-focused data engineer, where they might have a background as a database administrator. They do a lot of work in managing the data warehouse, and they work with SQL-oriented tools, and there are a lot of them coming out now where you can actually use SQL to pull data from source systems into the data warehouse and then build transformations to provide to analysts and data scientists. And then there is the more engineering-oriented data engineer, which is somebody who writes a lot of software. They're building complex infrastructure and architectures using things like Kafka or Flink or Spark, they're working with the database, they're working with data orchestration tools like 'Airflow' or 'Dagster' or 'Prefect', and they might be using 'Dask'. And so they're much more focused on actually writing software and delivering code as the output of their efforts. Right, okay. But however you define data engineering, the shared aspect of it is that they're all working to bring data from multiple locations into a place that is accessible for various end users, where the end users might be analysts or data scientists or the business intelligence tools. And they're tasked with making sure that those workflows are repeatable and maintainable, and that the data is clean and organized so that it's useful, because, you know, everybody knows the whole garbage in, garbage out principle.
Yeah, if you're a data scientist and you don't have all the context of where the data is coming from, you just have a small, narrow scope of what you need to work with, and you're kind of struggling with that garbage in, garbage out principle. And so the data engineer's job is to get rid of all the garbage and give you something clean that you can work from. I think that's a really tricky problem on the data science side of things. You take your data, you run it through a model or through some analysis and graphing layer, and it gives you a picture, like, well, that's the answer. Maybe. Maybe it is, right? Did you give it the right input? And did you train the models on the right data? Who knows? Right, right. That's definitely a big challenge. And that's one of the reasons why data engineering has become so multifaceted: what you're doing with the data informs the ways that you prepare the data. You need to make sure that you have a lot of the contextual information as well, so that the data scientists and data analysts are able to answer the questions accurately, because data in isolation is meaningless. If you just give somebody the number five, it's completely meaningless. But if you tell them that a customer ordered five of this unit, well, now you can actually do something with it. So the context helps to provide the information about that isolated number, and an understanding of where it came from and why it's important.
10:07 Yeah, absolutely. You know, two things come to mind when I hear data engineering. One is pipelines of data: maybe you've got to bring in data and do transformations on it to get it ready. This is part of that data cleanup, maybe, taking disparate sources and unifying them under one canonical model or representation. And then ETL, where we get something terrible, like FTP uploads of CSV files, and we've got to turn those into databases with overnight jobs, right? Or things like that, which probably still exist. They existed not too long ago.
10:37 Yeah. Every sort of legacy technology that you think has gone away because you're not working with it anymore is still in existence somewhere, which is why we still have 'COBOL'. Exactly. Oh, my gosh, I've got some crazy, crazy COBOL stories for you that probably shouldn't go out in public. Ask me at the next conference, the next time we get to travel somewhere, you know? All right, sounds good. For sure. So let's talk about trends. I made that joke, right? Like, well, maybe it used to be CSV files or text files and FTP, and then a job that would put that into a SQL database or some kind of relational database. What is it now? It's got to be better than that, right? I mean, again, it depends where you are. CSV files are still a thing. It may not be FTP anymore; it's probably going to be living in object storage, like S3 or Google Cloud Storage. But you're still working with individual files in some places. A lot of it is coming from APIs or databases, where you might need to pull all of the information from Salesforce to get your CRM data, or you might be pulling data out of Google Analytics via their API. There are a lot of evolutionary trends that have happened, and there have been a few generations. The first generation was the data warehouse, where you took a database appliance, whether that was Oracle or Microsoft SQL Server or Postgres, and you put all of your data into it. And then you had to do a lot of work to model it so that you could answer questions about that data. In an application database, you're liable to just overwrite a record when something changes, whereas in a data warehouse, you want that historical information about what changed and the evolution of that data.
What about normalization? In operational databases, it's all about one source of truth: we'd better not have any duplication, and it's fine if there are four joins to get there. Whereas in warehousing, it's maybe better to have that duplication, so you can run different types of reports really quickly and easily. Exactly. Yeah. I mean, you still need to have one source of truth, but you will model the tables differently than in an application database. So things like the 'star schema' or the 'snowflake schema' became popular in the initial phase of data warehousing. Ralph Kimball is famous for building out the star schema approach with facts and dimensions. Yeah, maybe describe that a little bit for people, because maybe they don't know these terms. Sure. So a fact is something like, Tobias Macey works at MIT. And then a dimension might be that he was hired in 2016, or whatever year it was. And another dimension of it is that his work anniversary is X date. And so the way that you model it makes it so that a fact is something that's immutable, and dimensions are things that might evolve over time. And then the next iteration of data engineering and data management was the sort of, quote unquote, big data craze, where Google released their paper about MapReduce, and Hadoop came out as an open source option for that. And so everybody said, oh, I've got to get... Yeah, MapReduce was gonna take over the world, right? Like, that was the only way you could do anything. If you had big data, then you had to MapReduce it. And then maybe it had to do with one of these large-scale databases, right? Spark or Cassandra, or who knows, something like that. Yeah. I mean, Spark and 'Cassandra' came after Hadoop. So Hadoop was your option in the, you know, early 2000s. And so everybody said, oh, big data is the answer. If I just throw big data at everything, it'll solve all my problems.
And so people built these massive data lakes using Hadoop and built these MapReduce jobs, and then realized: what are we actually doing with all this data? It's costing us more money than it's worth. MapReduce jobs are difficult to scale, and it's difficult to understand the order of dependencies. And so that's when things like Spark came out, to use the data that you're already collecting but be able to parallelize the operations and run them a little faster. That was sort of the era of batch-oriented workflows. And then with the advent of things like Spark Streaming and Kafka, and there are a whole number of other tools out there now, like Flink and Pulsar, the sort of real-time revolution is where we're at now, where it's not enough to understand what happened the next day; you have to understand what's happening within five minutes. And so there are principles like change data capture, where every time I write a new record into a database, it goes into a Kafka queue, which then gets replicated out to an Elasticsearch cluster and to my data warehouse. And so within five minutes, my business intelligence dashboard is updated with the fact that customer A bought product B, rather than having to wait 24 hours to get that insight.
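The change data capture idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the in-memory `change_log` list stands in for a Kafka topic, the two dictionaries stand in for a search cluster and a data warehouse, and every name here (`SourceTable`, `ChangeEvent`, `replicate`) is hypothetical rather than part of any real library.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    op: str    # "insert" or "update"
    key: int   # primary key of the changed row
    row: dict  # full copy of the row after the change


@dataclass
class SourceTable:
    """Application table that records every write in an append-only change log."""
    rows: dict = field(default_factory=dict)
    change_log: list = field(default_factory=list)  # stand-in for a Kafka topic

    def write(self, key, row):
        op = "update" if key in self.rows else "insert"
        self.rows[key] = row
        self.change_log.append(ChangeEvent(op, key, dict(row)))


def replicate(change_log, offset, sink):
    """Apply all events from `offset` onward to a sink and return the new offset."""
    for event in change_log[offset:]:
        sink[event.key] = event.row
    return len(change_log)


orders = SourceTable()
search_index, warehouse = {}, {}  # stand-ins for the downstream systems

orders.write(1, {"customer": "A", "product": "B"})

# Each consumer tracks its own position in the log, so both receive every change.
search_offset = replicate(orders.change_log, 0, search_index)
warehouse_offset = replicate(orders.change_log, 0, warehouse)
```

Because each consumer keeps its own offset into the same log, a new downstream system could be added later and caught up from offset zero without touching the application that writes the data.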
15:15 I think that makes tons of sense. So instead of going, like, we're just gonna pile the data into this, you know, some sort of data lake type thing, then we'll grab it, and we'll do our reports, nightly, or hourly or whatever, you just keep pushing it down the road as it comes in or as it's generated. Right, right.
15:30 Yes. I mean, there are still use cases for batch, and there are different ways of looking at it. A lot of people view batch as just a special case of streaming, where streaming is sort of micro-batches: as a record comes in, you operate on it. And then for large batch jobs, you're just doing the same thing, but multiple times for a number of records. There are a lot of paradigms that are building up, and people are getting used to the idea. Batch is still the easier thing to implement. It requires fewer moving pieces. But the availability of different technologies is making streaming more feasible for more people to actually take advantage of. There are managed platforms that help you with that problem, and there are a lot of open source projects that approach it.
16:10 Yeah, there are whole platforms that are just around to do data streaming for you, right? To sort of manage that and keep that alive. And with the popularization of webhooks, right? It's easy to say, if something changes here, notify this other thing, and that thing can call other things. And it seems like it's coming along. Yeah,
16:28 yeah, one of the interesting aspects of a lot of the work that's been going into the data engineering space is that you're starting to see some of the architectural patterns and technologies move back into the application development domain, where a lot of applications, particularly if you're working with microservices, will use something like a Kafka or a Pulsar queue as the communication layer for propagating information across all the different decoupled applications. And that's the same technology and the same architectural approaches that are being used for these real-time data pipelines. Yeah. And aren't queues amazing for adding scale to systems, right? It's gonna take too long? Throw it in a queue and let the thing crank on it for 30 seconds. It'll be good. Absolutely. I mean, Celery is the same idea, just at a smaller scale. And with RabbitMQ, it's more ephemeral, whereas when you're putting it into these durable queues, you can do more with the information. You can rewind time and say, okay, I changed my logic, I now want to reprocess all of these records from the past three months. Whereas if you had that on 'RabbitMQ', all those records are gone unless you wrote them out somewhere else. This portion of Talk Python To Me is brought to you by Datadog. Are you having trouble visualizing latency and CPU or memory bottlenecks in your app? Not sure where the issue is coming from or how to solve it? Datadog seamlessly correlates logs and traces at the level of individual requests, allowing you to quickly troubleshoot your Python application. Plus, their continuous profiler allows you to find the most resource-consuming parts of your production code all the time, at any scale, with minimal overhead. Be the hero that got that app back on track at your company. Get started today with a free trial at 'talkpython.fm/datadog', or just click the link in your podcast player show notes.
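The "batch is a special case of streaming" view can be illustrated with a short sketch: one per-record function is written once and then driven either by a stream or by a whole batch. The record fields and function names below are assumptions made up for the example.

```python
def enrich(record):
    """Per-record logic, shared by the streaming and the batch path."""
    return {**record, "total": record["quantity"] * record["unit_price"]}


def process_stream(records):
    """Streaming path: handle each record as it arrives, one at a time."""
    for record in records:
        yield enrich(record)


def process_batch(records):
    """Batch path: the same operation, applied to every record in one go."""
    return list(process_stream(records))


nightly = [
    {"quantity": 5, "unit_price": 2.0},
    {"quantity": 1, "unit_price": 9.5},
]

batch_result = process_batch(nightly)

# Simulate a live feed by pulling a single record through the streaming path.
live = process_stream(iter(nightly))
first = next(live)
```

The design choice worth noticing is that only `enrich` contains business logic; the batch and streaming paths differ only in how records are fed to it.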
Get the insight you've been missing with Datadog.
18:16 A couple comments from the live stream. Defra says Airflow, Apache Airflow, is really cool. For sure, we're going to talk about that. But I did want to ask you about the cloud. Stefan says, I'm a little bit skeptical about the privacy and security on the cloud, so I kind of want to use my own server more often. So maybe that's a trend that you could speak to, that you've seen with folks you've interviewed. This kind of data is really sensitive sometimes, and people are very protective of it or whatever. Right?
18:42 So what is the cloud story versus, oh, we've got to do this all on-prem, or maybe even some hybrid thereof, right? So, I mean, it's definitely an important question, and it's a complicated problem, but there are ways to solve it. Data governance is kind of the umbrella term that's used for saying, I want to keep control of my data and make sure that I am meeting the appropriate regulatory requirements, and making sure that I'm filtering out private information, encrypting data at rest, and encrypting data in transit. And so there are definitely ways that you can keep tight control over your data, even when you're in the cloud. A lot of the cloud platforms have been building out capabilities to make it easier for you. So for instance, if you're on Amazon, they have their Key Management Service that you can use to encrypt all of your storage at rest. You can provide your own keys if you don't trust them to hold the keys to the kingdom, so that you are the person who's in control of being able to encrypt and decrypt your data. There is also a class of technologies used in data warehouses called privacy-enhancing technologies, where you can actually have all of the rows in your database fully encrypted, and then you can encrypt the predicate of a SQL query to see if the data matches the values in the database without ever actually having to decrypt anything, so that you can do some rudimentary analytics, like aggregates, on that information and it all stays safe. There's also a class of technologies that are still a little bit in the experimental phase called homomorphic encryption, where the data is never actually decrypted. It lives in this encrypted enclave, and your data processing job operates within that encrypted space. And so there's never any actual clear text information stored anywhere, not even in your computer's RAM. Wow.
So if there's one of those weird CPU bugs that lets you jump through the memory of different VMs or something like that, great, even then you're probably okay, right? Absolutely. Yeah, I mean, with homomorphic encryption, there are some companies out there that are offering that as a managed service, and it's becoming more viable. It's something that's been discussed and theorized about for a long time, but because of the computational cost, it was never really commercialized. But there are a lot of algorithms that have been discovered to help make it more feasible to actually use in production contexts.
21:02 Yeah, I don't know about the other databases, but I know MongoDB added a feature where you can encrypt just certain fields, right? So maybe here's a field that is sensitive, but you don't necessarily need to query by it for your reports. But it needs to be in with, say, a user or an order or something like that. So even going to that point might be a pretty good step. But yeah, the clouds are both amazing and scary, I suppose. Yeah. Yeah, I
21:24 mean, there's definitely a lot of options. It's something that requires a bit of understanding and legwork, but it's definitely possible to make sure that all your data stays secured, and that you are in full control over where it's being used.
21:36 Yeah. One of the next things I wanted to ask you about is languages. So you're probably familiar with this chart here, right? For people who are not watching the stream, this is the Stack Overflow Trends chart, showing Python just trouncing the other languages, including Java. But I know Java had been maybe one of the main ways, and that probably has to do with Spark and whatnot to some degree. What do you see as Python's role relative to other technologies here?
22:04 So Python has definitely been growing a lot in the data engineering space, largely because it's so popular in data science. There are data scientists who have been moving further down the stack into data engineering as a requirement of their job, and so they are bringing Python into those layers of the stack. It's also being used as a unifying language so that data engineers and data scientists can work on the same code bases. As you mentioned, Java has been popular for a long time in the data ecosystem because of things like Hadoop and Spark. And looking at the trend graph, I'd be interested to see what it looks like if you actually combine the popularity of Java and Scala, because Scala has become a strong contender in that space as well, because of things like Spark and Flink that have native support for Scala. It's a bit more of an esoteric language, but it's used a lot in data processing. But Python has definitely gained a lot of ground, also because of tools like Airflow, which was kind of the first-generation tool built for data engineers, by data engineers, to manage these dependency graphs of operations so that you can have these pipelines to say, you know, I need to pull data out of Salesforce and then land it in S3. And then I need to have another job that takes that data out of S3 and puts it into the database. And then that same S3 data also needs to go into an analytics job. Then, once those two jobs are complete, I need to kick off another job that runs a SQL query against the data warehouse to provide some aggregate information to my sales and marketing team to say, this is what your customer engagement is looking like, or whatever it might be. Yeah, and that was all written in Python. And also just because of the massive ecosystem of libraries that Python has for being able to interconnect across all these different systems.
And data engineering, at a certain level, is really just a systems integration task, where you need to have information flowing across all of these different layers and all these different systems and get good control over it. Some of the interesting tools that have come out as a sort of generational improvement over Airflow are Dagster and Prefect. I've actually been using Dagster for my own work at MIT and have been enjoying that tool. I'm always happy to dig into that. Let's sort of focus on those things. One of the themes I wanted to cover is maybe the five most important packages or libraries for data engineering, and you kind of hit the first one, which we'll group together as a trifecta, right? So, right: Airflow, Dagster, and Prefect. Do you want to maybe tell us about those three, and which one you prefer? So I personally use Dagster. I like a lot of the abstractions and the interface design that they provide, but all three are grouped into a category of tools called workflow management or data orchestration. The responsibility there is that you need to have a way to build these pipelines, to build these DAGs, or directed acyclic graphs, of operations, where the edges of the graph are the data and the nodes are the jobs or operations being performed on them. And so you need to be able to build up this dependency chain, because you need to get information out of a source system, you need to get it into a target system, and you might need to perform some transformations either en route or after it's been landed. You know, one of the common trends that's happening: it used to be extract, transform, and then load, because you needed to have all of the information in that specialized schema for the data warehouse that we were mentioning earlier. Right, right. The relational database actually had to have these columns in it, and it can't be a long character field, it's got to be a varchar(10) or whatever, right?
22:04 And then, with the advent of the cloud data warehouses over the past few years, which was kicked off by Redshift from Amazon and then carried on by things like Google BigQuery and Snowflake, which a lot of people will probably be aware of, and there are a number of other systems and platforms out there, like Presto, out of Facebook, which is now an open source project actually renamed to 'Trino', those systems are allowing people to be very SQL oriented. But because they're scalable and they provide more flexible data models, the trend has gone to extract, load, and then transform, because you can just replicate the schema as is into these destination systems, and then you can perform all of your transformations in SQL. And so that brings us to another tool in the Python ecosystem that's been gaining a lot of ground, called dbt, or data build tool. This is a tool that actually brings in data analysts and improves their skill set, makes them more self-sufficient within the organization, and provides a great framework for them to operate with an engineering mindset. It helps to build up a specialized DAG within the context of the data warehouse, to take those source data sets that are landed into the data warehouse from the extract and load jobs and build these transformations. So you might have the users table from your application database and the orders table, and then you also have the Salesforce information that's landed in a separate table, and you want to combine all of those to understand your customer buying patterns. And so you use SQL to build either a view or a new table out of that source information in the data warehouse, and dbt will handle that workflow. It also has support for building unit tests in SQL into your workflow. Oh, how interesting.
Yeah, that's something that you really hadn't heard very much of 10 years ago. Testing with databases was usually, how do I get the database out of the picture so I can test without depending upon it, or something like that. That was the story. Yeah, that's another real growing trend: the overall aspect of data quality and confidence in your data flows. So Dagster, Prefect, and Airflow have support for unit testing your pipelines, which is another great aspect of the Python ecosystem, as you can just write 'pytest' code to ensure that all the operations on your data match your expectations and you don't have regressions and bugs. Right.
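The dependency graph from the Salesforce example above can be sketched with nothing but the standard library. This is not how Airflow, Dagster, or Prefect are actually used; it only shows the underlying idea of running jobs in dependency order. It assumes Python 3.9 or newer for `graphlib`, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it can start.
dependencies = {
    "extract_salesforce_to_s3": set(),
    "load_s3_into_warehouse": {"extract_salesforce_to_s3"},
    "run_analytics_on_s3": {"extract_salesforce_to_s3"},
    "aggregate_report": {"load_s3_into_warehouse", "run_analytics_on_s3"},
}


def run_pipeline(dependencies, tasks):
    """Run every task after all of its upstream tasks; return the order used."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


executed = []
# Placeholder tasks that only record that they ran; real jobs would move data.
tasks = {name: (lambda n=name: executed.append(n)) for name in dependencies}

order = run_pipeline(dependencies, tasks)
```

The two middle jobs have no dependency on each other, so an orchestrator is free to run them in either order or in parallel, while the aggregate report always waits for both.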
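To make the extract, load, then transform pattern concrete, here is a small sketch that uses an in-memory SQLite database as a stand-in for a cloud data warehouse. It does not use dbt itself; the table names, the view name, and the sample rows are all invented for illustration. The raw tables are loaded as is, the transformation is a SQL view, and the check at the end is a query that selects rows violating an expectation, so an empty result means the check passed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
conn.executescript("""
    -- "Load" step: raw source tables replicated as is.
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);

    -- "Transform" step: expressed purely in SQL inside the warehouse.
    CREATE VIEW customer_order_totals AS
    SELECT u.email, COUNT(o.id) AS order_count, SUM(o.amount) AS total_spent
    FROM users AS u
    JOIN orders AS o ON o.user_id = u.id
    GROUP BY u.email;
""")

totals = conn.execute(
    "SELECT email, order_count, total_spent "
    "FROM customer_order_totals ORDER BY email"
).fetchall()

# SQL-level check: select the rows that break the rule; no rows means it passes.
bad_rows = conn.execute(
    "SELECT * FROM customer_order_totals WHERE total_spent < 0"
).fetchall()
```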
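A unit test for a single pipeline step might look like the sketch below. The `normalize_order` transformation and its field names are hypothetical; the test functions follow the pytest convention of a `test_` prefix with plain `assert` statements, so pytest would collect them if they lived in a test file, and they can also be called directly.

```python
def normalize_order(raw):
    """Transformation under test: clean up one raw order record."""
    return {
        "customer_id": int(raw["customer_id"]),
        "email": raw["email"].strip().lower(),
        "quantity": int(raw["quantity"]),
    }


def test_normalize_order_cleans_fields():
    raw = {"customer_id": "42", "email": "  Ada@Example.COM ", "quantity": "5"}
    assert normalize_order(raw) == {
        "customer_id": 42,
        "email": "ada@example.com",
        "quantity": 5,
    }


def test_normalize_order_rejects_bad_quantity():
    raw = {"customer_id": "42", "email": "ada@example.com", "quantity": "five"}
    try:
        normalize_order(raw)
    except ValueError:
        return  # int("five") raises ValueError, which is the expected outcome
    raise AssertionError("expected ValueError for a non-numeric quantity")
```

Keeping each step a plain function of its input is what makes this kind of test cheap: no database or orchestrator has to be running to check the logic.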
22:04 Right. Absolutely.
22:04 The complicating aspect of data engineering is that it's not just the code they need to make sure is right. But you also need to make sure that the data is right. And so another tool that is helping in that aspect, again, from the Python ecosystem is great expectations, right? And that's right, in the realm of this testing your data. Exactly, absolutely. So you can say, you know, I'm pulling data out of my application database, I expect the schema to have these columns in it, I expect the data distribution within this column to you know, the values are only going to range from zero to five. And then if I get a value outside of that range, then I can, it will fail the test, and it will notify me that something's off. So you can build these very expressive and flexible expectations of what your data looks like what your data pipeline is going to do, so that you can gain visibility and confidence into what's actually happening as you are propagating information across all these different systems. Should you make this part of your continuous integration tests? Absolutely. Yeah. So it would be part of your continuous integration as you're delivering new versions of your pipeline, but it's also something that executes in the context of the nightly batch job or of your streaming pipeline. So it's both a build time and a runtime expert. Yeah. So it's like a pre test. It's like an IF test for your function. But for your data, right? Like, let's make sure everything's good before we run through this and actually drop the answer directly on to the dashboard for the morning or something like that. Okay, right. Yeah, it helps to build up that confidence, because anybody who's been working in data has had the experience of I delivered this report, I feel great about it, I'm happy that I was able to get this thing to run through and then you hand it off to your CEO or your CTO and they look at it and they say, well, this doesn't quite look right. 
And then you go back and realize, oh, crap, that's because I forgot to pull in that other column or whatever it is. And so this way, you don't have to have that sinking feeling in your gut when you hand off the report. That would be bad, versus: we decided to invest by buying a significant other company, but it turned out, whoops, we actually had a negative sign. It wasn't really good for you to invest in this. Absolutely.
22:04 Yep. Actions have been taken.
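A rough sketch of the runtime data check described above: the expected columns are present and a value stays within zero to five. These are hand-rolled checks in plain Python, not the Great Expectations API; the column names and the `check_expectations` helper are assumptions made for the example.

```python
# Illustrative only: hand-rolled checks, not the Great Expectations API.
EXPECTED_COLUMNS = {"order_id", "rating"}  # assumed schema for the example


def check_expectations(rows, low=0, high=5):
    """Return a list of human-readable failures; an empty list means the data passed."""
    failures = []
    for index, row in enumerate(rows):
        missing = EXPECTED_COLUMNS - set(row)
        if missing:
            failures.append(f"row {index}: missing columns {sorted(missing)}")
            continue
        if not low <= row["rating"] <= high:
            failures.append(f"row {index}: rating {row['rating']} outside [{low}, {high}]")
    return failures


good = [{"order_id": 1, "rating": 4}, {"order_id": 2, "rating": 0}]
bad = [{"order_id": 3, "rating": 7}, {"order_id": 4}]

print(check_expectations(good))  # no failures
print(check_expectations(bad))   # one out-of-range value, one missing column
```

The same function could be called from a CI test and again at the start of the nightly job, which is the build-time plus runtime use mentioned in the conversation.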
22:04 This portion of Talk Python To Me is brought to you by Retool. Do you really need a full dev team to build that simple internal app at your company? I'm talking about those back office apps: the tool your customer service team uses to access your database, that S3 uploader you built last year for the marketing team, the quick admin panel that lets you monitor key KPIs, or maybe even the tool your data science team hacked together so they could provide custom ad spend insights. Literally every type of business relies on these internal tools. But not many engineers love building these tools, let alone get excited about maintaining or supporting them over time. They eventually fall into the "please don't touch it, it's working" category of apps. And here's where Retool comes in. Companies like DoorDash, Brex, Plaid, and even Amazon use Retool to build internal tools super fast. The idea is that almost all internal tools look the same: forms over data. They're made up of tables, dropdowns, buttons, text inputs, and so on. Retool gives you a point, click, and drag-and-drop interface that makes it super simple to build internal UIs like this in hours, not days. Retool can connect to any database or API. Want to pull data from Postgres? Just write a SQL query and drag the table onto your canvas. Want to search across those fields? Add a search input bar and update your query. Save it, share it, super easy. Retool is built by engineers, explicitly for engineers. It can be set up to run on-prem in about 15 minutes using Docker or Kubernetes or Heroku. Get started with Retool today. Just visit 'talkpython.fm/retool' or click the Retool link in your podcast player show notes.
22:04 I'll be jumping back really quick to that language trends question. So Anthony Lister asks if R is still widely used as sort of a strong competitor, let's say, to Python, and what are your thoughts these days? I honestly hear a little bit less of it in my world for some reason. Yeah. So there are definitely a lot of languages, and R is definitely one of them that's still popular in the data space. I don't really see R in the data engineering context; it's definitely still used for a lot of statistical modeling, machine learning, and data science workloads. There's a lot of great interoperability between R and Python now, especially with the Arrow project, which is an in-memory columnar representation that provides an in-memory space where you can actually exchange data between R and Python and Java without having to do any I/O or copying between them. So it helps to reduce a lot of the impedance mismatch between those languages. Another language that's been gaining a lot of ground in the data ecosystem is Julia. And they're actually under the NumFOCUS organization that supports a lot of the Python data ecosystem. Yeah, so Julia has been gaining a lot of ground, but Python, just because of its broad use, is still very popular. And there's an anecdote that I've heard a number of times, I don't remember where I first came across it, that Python isn't the best language for anything, but it's the second best language for everything.
22:04 Yeah, that's a good quote, I think it does put a lot of perspective on it. I feel like it's just so approachable, right? Exactly. And there's a lot of these languages that might make slightly more sense for certain use case like R and statistics. But you better not want to have to, you know, build some other thing that reaches outside of what's easily possible, right? Like, right, you want to make that an API now? Well, all of a sudden, it's not so easy or whatever, right? Something along those lines. Exactly. Alright, next in our list here is dask.
22:04 Yeah, so Dask is a great tool. I kind of think about it as the Python version of Spark. There are a number of reasons that's not exactly accurate, but it's a tool that lets you parallelize your Python operations and scale them out into clusters. It also has a library called 'dask.distributed' that's used a lot for just scaling out Python, independent of actually building the directed acyclic graphs in Dask. So one of the main ways that Spark is used is as an ETL engine. You can build these graphs of tasks in Spark, and you can do the same thing with Dask. It was actually built originally more for the hard sciences and for scientific workloads, and not just for data science. Yeah, but Dask is actually also used as a foundational layer for a number of the data orchestration tools out there. So Dask is the foundational layer for Prefect, you can use it as an execution substrate for the Dagster framework, and it's also supported in Airflow as an execution layer. Then there are also a number of people who are using it as a replacement for things like Celery, as just a means of running asynchronous tasks outside of the bounds of a request-response cycle. So it's just growing a lot in the data ecosystem, both for data engineering and data science. And so it provides that unified layer of being able to build your data engineering workflows, and then hand that directly off into machine learning so that you don't have to jump between different systems. You can do it all in one layer.
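To make "graphs of tasks" concrete, here is a small sketch of a directed acyclic graph of tasks executed in dependency order. It uses only the standard library (`graphlib`, Python 3.9+) and runs tasks one at a time; it is not how Dask or Spark are implemented, and the `run_graph` helper and task names are invented for the example. A real scheduler would also run independent tasks in parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_graph(tasks, dependencies):
    """Run each task once, after all tasks it depends on, and return the results.

    tasks:        name -> function taking the results of its dependencies
    dependencies: name -> list of task names it depends on
    """
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        inputs = [results[dep] for dep in dependencies.get(name, [])]
        results[name] = tasks[name](*inputs)
    return results


# A tiny extract -> transform -> report graph with made-up data.
tasks = {
    "extract": lambda: [1, 2, 3, 4],
    "double": lambda rows: [r * 2 for r in rows],
    "total": lambda rows: sum(rows),
    "report": lambda doubled, total: f"{len(doubled)} rows, total {total}",
}
dependencies = {
    "extract": [],
    "double": ["extract"],
    "total": ["double"],
    "report": ["double", "total"],
}

print(run_graph(tasks, dependencies)["report"])  # 4 rows, total 20
```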
22:04 Yeah, that's super neat. And with Dask, I never really appreciated the different levels at which you can use it, I guess I should say. You know, when I thought about it, okay, well, this is like parallel computing for pandas, or NumPy, or something like that, right? But it also works well on just your single laptop, right? It'll let you run multi-core stuff locally, because Python doesn't always do that super well.
22:04 And I think it'll even do caching and stuff, so it can actually work with more data than you have RAM, right? That's hard with just straight NumPy. But then, of course, you can point it at a cluster and go crazy. Exactly, yeah. And because of the fact that it has those transparent API layers for being able to swap out upstream pandas and NumPy with the Dask equivalents, it's easy to go from working on your laptop to just changing an import statement, and now you're scaling out across a cluster of hundreds of machines. Yeah, that's pretty awesome. Actually, maybe that has something to do with the batch to real time shift as well, right? If you've got to run it on one core on one machine, it's a batch job. If you can run it on the entire cluster that's sitting around idle, well, then all of a sudden it's real time, right? Yeah, there's a lot of interesting real time stuff. There's an interesting project, sort of a side note here, called 'Wallaroo', that's built for building stateful stream processing jobs using Python. And interestingly, it's actually implemented in a language called Pony. Pony? Yeah. It's an interesting project. You know, it levels up your ability to scale out the speed of execution, and being able to build these complex pipelines and real time jobs without having to build all of the foundational layers of it. Yeah.
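One way to picture "working with more data than you have RAM" is to process the data in chunks and combine partial results. The sketch below does that for a mean, using only the standard library. It illustrates the general partition-and-combine idea under that assumption and is not Dask's implementation or API; `chunks` and `chunked_mean` are names made up for the example.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield lists of at most `size` items without materializing the whole input."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def chunked_mean(values, chunk_size=1000):
    """Compute a mean from per-chunk partial sums, holding one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("mean of empty input")
    return total / count


# range() is lazy, so the full million values never sit in memory at once.
print(chunked_mean(range(1_000_000), chunk_size=10_000))  # 499999.5
```

Because each chunk only contributes a partial sum and a count, the chunks could equally be handed to different cores or machines and combined at the end.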
22:04 Okay. And I see I have not heard of this one. That sounds fun.
22:04 Yeah, it's not as widely known. I interviewed the creator of it on the Data Engineering Podcast a while back, but it's a tool that comes up every now and then, and it's an interesting approach. Yeah. Right. In that stream processing, real time world, right. The next one that you put on our list here is Meltano. Meltano, am I saying it right? Yeah. Yeah. So that one is an interesting project. It came from the GitLab folks, and it's still supported by them. And in its earliest stage, they actually wanted it to be the full end-to-end solution for data analytics for startups. So Meltano is actually an acronym for, if I can remember correctly, model, extract, load, transform, analyze, notebook, and orchestrate. Okay. That's quite a wild one to put into something you can say. Well, exactly. And you know, about a year, year and a half ago now, they actually decided that they were being a little too ambitious and trying to boil the ocean, and scoped it down to doing the extract and load portions of the workflow really well, because it's a very underserved market. You would think that, given the amount of data we're all working with, point-to-point data integration and extract and load would be a solved problem, easy to do. But there's a lot of nuance to it. And there isn't really one easy thing to say, yes, that's the tool you want to use all the time. And so there are some paid options out there that are good. Meltano is aiming to be the default open source answer for data integration. And so it's building on top of the Singer specification, which is sort of an ecosystem of libraries that was built by a company called Stitch Data. But the idea is that you have what they call taps and targets, where a tap will tap into a source system and pull data out of it, and then the targets will load that data into a target system.
And they have this interoperable specification that's JSON based, so that you can just wire together any two taps and targets to be able to pull data from a source into a destination system. Nice. Yeah, it's definitely a well designed specification, and a lot of people like it. There are some issues with the way that the ecosystem was sort of created and fostered, so there's a lot of uncertainty or variability in terms of the quality of the implementations of these taps and targets. And there was never really one cohesive answer to "this is how you run these in a production context", partially because Stitch Data was the answer to that. So they wanted you to buy into this open source ecosystem, so that you would then use them as the actual execution layer. And so Meltano is working to build an open source option for you to be able to wire together these taps and targets and be able to just have an easy, out-of-the-box data integration solution. So yeah, it's a small team from GitLab, but there's a large and growing community helping to support it, and they've actually been doing a lot to help push forward the state of the art for the Singer ecosystem, building things like a starter template for people building taps and targets, so that there's a common baseline of quality built into these different elements, without having to wonder about, you know, is this tap going to support all the features of the specification that I need? Nice. Is this actually from GitLab? Yeah. So it's sponsored by GitLab. The source code is within the GitLab organization on 'Gitlab.com'. But it's definitely a very community driven project. Yeah. Stefan is quite excited about the open source choice. Yeah, well, I think there's two things. One, open source is amazing. But two, you get this paradox of choice, right? It's like, well, it's great, you can have anything, but there are so many things. And I'm new, what do I do, right?
And so yeah, Meltano is trying to be the answer to that. You know, you just run meltano init, you have a project, you say I want these sorts of sources and destinations. And then it will help you handle things like making sure that the jobs run on a schedule and tracking the state of the operations, because you can do either full extracts and loads every time or you can do incremental ones. You don't necessarily want to dump a 4 million line source table every single time it runs, you just want to pull the 15 lines that changed since the last operation. So it will help track that state for you. Oh, that's cool. It tries to be really efficient and pull exactly what it needs. And it builds in some of the monitoring information that you want to be able to see, as far as execution time and performance of these jobs. And out of the box it will actually use Airflow as the orchestration engine for being able to manage these schedules. But everything is pluggable. So if you wanted to write your own implementation that will use 'Dagster' as the orchestrator instead, then you can do that; there's actually a ticket in their tracker for doing that work. So it's very pluggable, very flexible, but gives you a lot of out-of-the-box answers to being able to just get something up and running quickly. And it looks like you can build custom loaders and custom extractors. So if you've got some internal API, who knows, maybe it's a SOAP XML endpoint or some random thing, right? You could do that.
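A simplified sketch of the tap and target idea with incremental state, as discussed above. The message types (SCHEMA, RECORD, STATE) and their basic fields follow the Singer specification, but everything else is an assumption for illustration: the `users` stream, the `updated_at` bookmark column, and the in-memory source and destination. Real taps and targets are separate programs that exchange these JSON lines over stdout and stdin; here they are two functions in one process.

```python
import json

# Made-up source table; `updated_at` is an assumed change-tracking column,
# and rows are assumed to be ordered by it.
SOURCE_ROWS = [
    {"id": 1, "name": "Ada", "updated_at": 10},
    {"id": 2, "name": "Grace", "updated_at": 20},
    {"id": 3, "name": "Linus", "updated_at": 30},
]


def tap(state):
    """Emit Singer-style JSON lines for rows changed since the saved bookmark."""
    bookmark = state.get("users", 0)
    yield json.dumps({
        "type": "SCHEMA",
        "stream": "users",
        "schema": {"properties": {
            "id": {"type": "integer"},
            "name": {"type": "string"},
            "updated_at": {"type": "integer"},
        }},
        "key_properties": ["id"],
    })
    for row in SOURCE_ROWS:
        if row["updated_at"] > bookmark:
            yield json.dumps({"type": "RECORD", "stream": "users", "record": row})
            bookmark = row["updated_at"]
    yield json.dumps({"type": "STATE", "value": {"users": bookmark}})


def target(lines, destination):
    """Load RECORD messages into `destination` and return the last STATE value seen."""
    state = {}
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            destination.append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
    return state


warehouse = []
state = target(tap({}), warehouse)       # first run: full extract of 3 rows
SOURCE_ROWS.append({"id": 4, "name": "Guido", "updated_at": 40})
state = target(tap(state), warehouse)    # second run: only the 1 new row
print(len(warehouse), state)             # 4 {'users': 40}
```

Persisting the returned state between runs is what turns a full reload into an incremental one, which is the bookkeeping the conversation says Meltano handles for you.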
22:04 Exactly. Yeah. And they actually lean on dbt, another tool that we were just talking about, as the transformation layer. So they hook directly into that so that you can very easily do the extract and load and then jump into dbt for doing the transformations. Yeah. Now, you didn't put this one on the list, but I do want to ask you about it. What's the story of something like Zapier in this whole "get notified about these changes, push them over here" space? It feels like if you are trying to wire things together, I've seen more than one Python developer reach for Zapier. Yeah. So Zapier is definitely a great platform, particularly for doing these event based workflows. You can use it as a data engineering tool if you want, but it's not really what it's designed for. It's more just for business automation aspects, or maybe automation of "my application did this thing and now I want to have it replicate some of that state out to a third party system". Zapier isn't really meant for the sort of full scale data engineering workflows and maintaining visibility; it's more just for this evented I/O kind of thing. Yeah. So here on the Meltano site, it says pipelines are code, ready to be version controlled, containerized, and deployed continuously. The CI/CD side sounds pretty interesting, right? Especially with these workflows that might be changing in flight. How does that work? You know, basically the point with Meltano is that everything is versioned in git. So that's another movement that's been happening in the data engineering ecosystem, where early on, a lot of the people coming to it were systems administrators, database administrators, maybe data scientists who had a lot of the domain knowledge, but not as much of the engineering expertise to be able to build these workflows in a highly engineered, highly repeatable way.
And the past few years have been seeing a lot of movement to DataOps and MLOps, to make sure that all of these workflows are well engineered, well managed, you know, version controlled, tested. And so having this DevOps oriented approach to data integration is what Meltano is focusing on, saying all of your configuration, all of your workflows, live in git, and you can run them through your CI/CD pipeline to make sure that they're tested. And then when you deliver it, you know that you can trust that it's going to do what you want it to do, rather than "I just pushed this config from my laptop, and hopefully it doesn't blow up". Right? It also sounds like there's a lot of interplay between these things. Like, Meltano might be leveraging Airflow and dbt, and maybe you want to test this through CI with Great Expectations before it goes through the CD side, the continuous deployment. Seems like there's just a lot of interflow here. Definitely. And there have been a few times where I've been talking to people and they've asked me to kind of categorize different tools, or draw nice lines about what the dividing layers of the data stack are. And it's not an easy answer, because so many of these tools fit into a lot of different boxes. So you know, Spark is a streaming engine, but it's also an ELT tool. And, you know, 'Dagster' is a data orchestration tool, but you can also write it to do arbitrary tasks, so you can build up these chains of tasks. So if you wanted to use it for CI/CD, you could, right? It's not what it's built for, but you know. And then different databases have been growing a lot of different capabilities where, you know, it used to be you had your SQL database, or you had your document database, or you had your graph database. And then you have things like ArangoDB, which can be a graph database and a document database and a SQL database all on the same engine.
So there's a lot of multimodal databases. It's all of the SQL and all the NoSQL, all in one. Right. And you know, JSON is being pushed into relational databases and data warehouses. So there's a lot of crossover between the different aspects of the data stack.
22:04 Yeah, there probably is more of that, I would say, in this data warehousing stuff. You know, in an operational database, it doesn't necessarily make a ton of sense to jam JSON blobs all over the place; you might as well just make tables and columns. Yep. You know, it makes some sense, but not that much. But in this space, you might get a bunch of things where you don't really know what their shape is exactly, or you're not ready to process it, you just want to save it and then try to deal with it later. So do you see more of those kind of JSON columns, or more NoSQL stuff?
22:04 Absolutely. Basically, any data warehouse worth its salt these days has to have some sort of support for nested data. A lot of that, too, comes as an outgrowth of, you know, we had the first generation data warehouses, and they did their thing, but they were difficult to scale. And they were very expensive. And you had to buy these beefy machines so that you were planning for the maximum capacity that you were going to have. And then came things like Hadoop, where you said, oh, you can scale out as much as you want, just add more machines, they're all commodity. And so that brought in the era of the data lake. And then things like S3 became inexpensive enough that you could put all of your data storage in S3, but then still use the rest of the Hadoop ecosystem for doing MapReduce jobs on that. And then that became the next generation data lake. And then things like Presto came along, to be able to build a data warehouse interface on top of this distributed data and these various data sources. And then you had the, you know, dedicated data warehouses built for the cloud, where they were designed to be able to ingest data from S3, where you might have a lot of unstructured information. And then you can clean it up using things like dbt to build these transformations, so that you have these nicely structured tables built off of this, you know, nested or messy data that you're pulling in from various data sources.
22:04 Yeah, interesting. What do you see as the story of versioning the data itself? I'm thinking, so I've got this huge pile of data I've built up, and we're using it to drive these pipelines. But it seems like the kind of data that could change, or I brought in a new source now that we've switched credit card providers, or we're now screen scraping extra data. Do you see anything interesting happening there?
22:04 Yeah, so there's definitely a lot of interesting stuff happening in the data versioning space. So I mean, one tool that was kind of early to the party is a platform called 'Pachyderm'. They're designed as an end-to-end solution built on top of Kubernetes for being able to do data science, and data engineering, and data versioning. So your code and your data all get versioned together. There's a system called 'LakeFS' that was released recently that provides a git-like workflow on top of your data that lives in S3. And so they act as a proxy to S3. But it lets you branch your data to say, I want to bring in this new data source. And as long as everything is using LakeFS as the interface, then your main branch won't see any of this new data source until you are happy with it. And then you can commit it and merge it back into the main branch, and then it becomes live. And so this is a way to be able to experiment with different processing workflows, to say I want to try out this new transformation job or this new batch job, or I want to bring in this new data source, but I'm not quite confident about it yet. And so it brings in this versioning workflow. There's another combination of tools: one is Iceberg, which is a table format for use in these large scale data lakes and data warehouses that hooks into things like Spark and Presto. And there's another project called Nessie that is inspired by git, for being able to do this same type of branching and merging workflow for bringing in new data sources, or changing table schemas and things like that.
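The branch, experiment, merge workflow for data described above can be pictured with a toy in-memory model. This is only an illustration of the isolation idea; it is not the LakeFS or Nessie API, and the `ToyDataRepo` class, branch names, and object keys are all invented for the example.

```python
class ToyDataRepo:
    """In-memory stand-in for a versioned object store. Not LakeFS's API."""

    def __init__(self):
        self.branches = {"main": {}}

    def branch(self, name, source="main"):
        # Copy the key -> object mapping so writes stay isolated from the source branch.
        self.branches[name] = dict(self.branches[source])

    def put(self, branch, key, value):
        self.branches[branch][key] = value

    def get(self, branch, key):
        return self.branches[branch].get(key)

    def merge(self, source, into="main"):
        # Naive merge: the source branch wins on conflicts.
        self.branches[into].update(self.branches[source])


repo = ToyDataRepo()
repo.put("main", "sales/2021-01.csv", "existing data")

repo.branch("new-card-provider")
repo.put("new-card-provider", "payments/2021-01.csv", "experimental data")
print(repo.get("main", "payments/2021-01.csv"))   # None: main cannot see it yet

repo.merge("new-card-provider")
print(repo.get("main", "payments/2021-01.csv"))   # experimental data
```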
22:04 These all sound like such fun tools to learn, and they're all solving painful problems, right?
22:04 And then another one, actually from the Python ecosystem, is DVC, or Data Version Control, that's built for machine learning and data science workflows. It actually integrates with your source code management so that you git commit and git push. You know, there are some additional commands, but they're modeled after git, where you commit your code, and then you also push your data and it lives in S3, and it will version the data assets, so that as you make different versions of your experiment with different versions of your data, it all lives together, so that it's repeatable and easier for multiple data scientists or data engineers to be able to collaborate on it.
22:04 Well, yeah, the version control story around data has always been interesting, right? It's super tricky. On one hand, your schemas might have to evolve over time. Like, if you've got a SQLAlchemy model trying to talk to a database, it really hates it if there's a mismatch at all, right? And so you want the database schema to change along with your code, with migrations or something. But then the data itself, yeah, that's tricky.
22:04 Yeah. And so there's actually a tool called 'Avro' and another one called 'Parquet'. Well, they're not tools, they're data serialization formats. And Avro in particular has a concept of schema evolution for, you know, what are compatible evolutions of a given schema. So each record in an Avro file has the schema co-located with it. So it's kind of like a binary version of JSON, but the schema is embedded with it. Oh, okay. That's interesting. Yeah. So if you say, I want to change the type of this column from an int to a float, then, you know, maybe that's a supported conversion. And so it will let you change the schema or add columns. But if you try to change the schema in a manner that is not backwards compatible, it will actually throw an error. I see, like a float to an int might drop data, but an int to a float probably wouldn't. Exactly. So it will let you evolve your schemas, and Parquet is actually built to be interoperable with Avro for being able to handle those schema evolutions as well, where Avro is a row or record oriented format, and Parquet is column oriented, which is more powerful for being able to do aggregate analytics. And it's more efficient, so that you're not pulling all of the data for every row, you're just pulling all of the data for a given column. So it's also more compressible. Yeah, I think I need to do more thinking to really fully grok the column oriented data stores. Yeah, it's a different way of thinking. Yeah, the column oriented aspect is also a major revolution in how data warehousing has come about where, you know, the first generation was all built on the same databases that we were using for our applications. So it was all row oriented. And that was one of the inherent limits to how well they could scale their compute. Whereas all of the modern cloud data warehouses, or all the modern even non-cloud data warehouses, are column oriented.
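The int-to-float example from the schema evolution discussion can be sketched as a compatibility check. The promotion table below reflects the primitive type promotions listed in the Avro specification's schema resolution rules; the helper functions and the simplified field-name-to-type schemas are assumptions for illustration, and this does not use any Avro library.

```python
# Primitive promotions a reader may apply, per the Avro specification's
# schema resolution rules: writer type -> reader types it can be promoted to.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def can_read(writer_type, reader_type):
    """True if data written as `writer_type` can be read as `reader_type`."""
    return writer_type == reader_type or reader_type in PROMOTIONS.get(writer_type, set())


def incompatible_fields(writer_fields, reader_fields):
    """Return names of fields whose type change is not a supported promotion.

    Both arguments map field name -> primitive type name. Fields that exist on
    only one side are ignored here; real Avro resolves those with defaults.
    """
    return sorted(
        name
        for name, writer_type in writer_fields.items()
        if name in reader_fields and not can_read(writer_type, reader_fields[name])
    )


old = {"id": "int", "price": "int"}
print(incompatible_fields(old, {"id": "long", "price": "float"}))   # []
print(incompatible_fields({"price": "float"}, {"price": "int"}))    # ['price']
```

Widening an int to a float is accepted, while narrowing a float to an int is flagged, matching the "might drop data" intuition from the conversation.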
And so if you have, you know, one column that is street addresses, and another column that's integers, and another column that is, you know, VARCHAR(15), all of the values within each column are the same data type. And so they can compress them down a lot more than if you have one row that is a street address, and a text field, and an integer, and a float, and a JSON array. If you try to compress all of those together, they're not compatible data types, and so you have a lot more inefficiency in terms of how well you can compress it. And then also, as you're scanning, you know, a lot of analytics jobs are operating more on aggregates of information than on individual records. And so if you want to say, I want to find out what is the most common street name across all the street addresses that I have in my database, all I have to do is pull all the information out of that street address column. It's all co-located on disk, so it's a faster seek time, and it's all compressed the same. And that way, you don't have to read all of the values for all of the rows to get all of the street addresses, which is what you would do in a relational database.
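A small sketch of the "most common street name" query against a row layout and a column layout. The records and field names are made up, and plain Python lists stand in for on-disk storage, so this only illustrates which values each layout has to touch, not real I/O or compression behavior.

```python
from collections import Counter

# Made-up records; the field names are illustrative.
rows = [
    {"street": "Main St", "city": "Boston", "amount": 12},
    {"street": "Oak Ave", "city": "Austin", "amount": 7},
    {"street": "Main St", "city": "Denver", "amount": 30},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited just to read one field.
most_common_by_row = Counter(record["street"] for record in rows).most_common(1)[0]

# Column layout: only the one contiguous list of street names is read.
most_common_by_column = Counter(columns["street"]).most_common(1)[0]

print(most_common_by_row, most_common_by_column)  # ('Main St', 2) ('Main St', 2)
```

Each column list also holds values of a single type, which is the property the conversation credits for better compression.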
22:04 Right, because probably those are co-located on disk by row. Whereas if you're going to ask all about the streets across everyone, then it's better to put all the streets, and then all the cities, or whatever, right? Exactly. Interesting. Cool. I think I actually understand a little bit better now. Thanks.
22:04 The final one that you put on the list, just maybe to put a pin in it, is the very, very popular pandas. I never cease to be amazed with what you can do with pandas. Yeah, so I mean, pandas is one of the most flexible tools in the Python toolbox. I've used it in web development contexts, I've used it for data engineering, I've used it for data analysis. And it's definitely the Swiss Army knife of data. So it's absolutely one of the more critical tools in the toolbox of anybody who's working with data, regardless of the context. And so it's absolutely no surprise that data engineers reach for it a lot as well. So pandas is supported natively in things like 'Dagster', where it will, you know, give you a lot of rich metadata information about the column layouts and data distributions. But yeah, it's just absolutely indispensable. You know, it's been covered enough times in both your show and mine, so we don't need to go too deep into it. But yeah, if you're working with data, absolutely get at least a little bit familiar with pandas. Well, just to give people a sense, one of the things I learned yesterday, I think it was Chris Moffitt, was showing off some things with pandas. And he's like, oh, over on this Wikipedia page, three fourths of the way down, there's a table. The table has a header that has a name. And you could just say, load HTML, give me the table called this as a DataFrame, screen scraped from part of the page. It's amazing. Yeah, another interesting aspect of the pandas ecosystem is the pandas extension arrays library that lets you create plugins for pandas to support custom data types. So I know that they have support for things like GeoJSON and IP addresses, so that you can do more interesting things out of the box in terms of aggregates and group-bys and things like that.
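As a rough picture of what an IP-aware data type makes possible, here is a group-by on network prefix using only the standard library `ipaddress` module. This is a stand-in for the idea, not a pandas extension array and not any pandas API; the `group_by_prefix` helper and the sample rows are invented for the example.

```python
from collections import defaultdict
from ipaddress import ip_network


def group_by_prefix(rows, prefix_length=24):
    """Group rows by the network containing each row's "ip" value."""
    groups = defaultdict(list)
    for row in rows:
        # strict=False lets a host address be masked down to its network.
        network = ip_network(f"{row['ip']}/{prefix_length}", strict=False)
        groups[str(network)].append(row)
    return dict(groups)


rows = [
    {"ip": "10.0.1.5", "bytes": 100},
    {"ip": "10.0.1.9", "bytes": 250},
    {"ip": "10.0.2.7", "bytes": 40},
]
for network, members in group_by_prefix(rows).items():
    print(network, sum(member["bytes"] for member in members))
```

Without a type that understands addresses, the same values would just be opaque strings or objects, and this masking logic would have to be written by hand around every aggregation.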
So you know, if you have the IP address pandas extension, then you can say, give me all of the rows that are grouped by this network prefix, and things like that. Whereas just pandas out of the box will just treat it as an object, and so you have to do a lot more additional coding around it, and it's not as efficient. So there's an interesting ecosystem that's grown up around pandas as well. Nice. One quick question, and then I think we should probably wrap this up. Stefan threw out some stuff about graph databases, particularly 'GraphQL', or, that's actually the API, right? It's efficient, but what about its maturity? Like, what do you think about some of these new API endpoints?
22:04 GraphQL is definitely gaining a lot of popularity. I mean, so as you mentioned, there's sometimes a little bit of confusion because they both have the word graph in the name, GraphQL and graph DB. I read it too quickly, like, oh, yeah, like Neo4j. Wait, no, it has nothing to do with that. Right. So you know, GraphQL is definitely a popular API design. An interesting side note is that the guy who created 'Dagster' is also one of the co-creators of GraphQL, and 'Dagster' has a really nice web UI that comes out of the box that has a GraphQL API to it, so that you can do things like trigger jobs or introspect information about the running system. Another interesting use of GraphQL is that there's a database engine called Dgraph that uses GraphQL as its query language. It's a native graph storage engine, it's scalable, horizontally distributable. And so you can actually model your data as a graph, and then query it using GraphQL. So I'm definitely seeing a lot of interesting use cases within the data ecosystem as well. Yeah. For the right type of data, a graph database seems like it would really light up the speed of accessing it. Sure, absolutely, yeah. So the funny thing is, you have this concept of a relational database, but it's actually not very good at storing information about relationships. It isn't; the joins make them so slow. Exactly. So, lazy loading or whatever. Yeah, right. So graph databases are entirely optimized for storing information about relationships, so that you can do things like network traversals, or understanding within this structure of relations. You know, things like social networks are kind of the natural example of a graph problem, where I want to understand what are the degrees of separation between these people. So you know, the Six Degrees of Kevin Bacon kind of thing? Yeah. Yeah, it seems like you could also model a lot of interesting things. Like, I don't know how real it is.
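The degrees-of-separation question is a shortest-path traversal over relationships. A minimal sketch using breadth-first search over adjacency lists follows; the people and connections are made up, and a graph database would answer this through its own storage engine and query language rather than Python code like this.

```python
from collections import deque


def degrees_of_separation(graph, start, goal):
    """Breadth-first search: number of hops from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for neighbor in graph.get(person, ()):
            if neighbor == goal:
                return distance + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


# Made-up, undirected "knows" relationships stored as adjacency lists.
graph = {
    "ana": ["ben"],
    "ben": ["ana", "cy"],
    "cy": ["ben", "dee"],
    "dee": ["cy"],
    "eve": [],
}
print(degrees_of_separation(graph, "ana", "dee"))  # 3
print(degrees_of_separation(graph, "ana", "eve"))  # None
```

In a relational schema each hop would typically be another self-join on a relationships table, which is the cost the conversation attributes to joins.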
But you know, the bananas are at the back, or the milk is at the back of the store, so you have to walk all the way through the store. And you can find those kinds of things by traversing those, like, relations. You know, the traveling salesman problem, stuff like that. Yeah, yeah, exactly.
22:04 All right. Well, so many tools, way more than five that we actually made our way through, but very, very interesting, because I think there's just so much out there. And it sounds like a really fun place to work, like a technical space to work. Absolutely.
22:04 You know, a lot of these ideas also seem like they're probably really ripe for people who have programming skills and software engineering mindsets, like CI/CD, testing, and so on, to absolutely come in and say, I could make a huge impact. This organization has tons of data, and people work with the data, but not in this formalized way. If people are interested in getting started with this kind of work, what would you recommend? There's actually one resource I'll recommend; I'll see if I can pick up the link after the show. There's a gentleman named Jesse Davidson who wrote a really great resource. It's a short ebook of kind of, you know, you think you might want to be a data engineer, here's a good way to understand if that's actually what you want to do. So I'll share that. But more broadly, if you're interested in data engineering, you know, the first step is just to kind of start to take a look at it. You probably have data problems in the applications that you're working with, where maybe you're just using a sequence of Celery jobs and hoping that they complete in the right order. You know, maybe take a look at something like Dagster or Prefect to build a more structured graph of execution. If you don't want to go for a full fledged framework like that, there are also tools like 'Bonobo' that are just command line oriented, that help you build up that same structured graph of execution. So definitely start to take a look and try to understand, like, what are the data flows in your system. If you think about it as more than just flows of logic, and think about it in flows of data, then it starts to become a more natural space to solve with some of these different tools and practices. So it's getting familiar with thinking about it in that way.
Another really great book, if you're definitely interested in data engineering and want to get deep behind the scenes, is 'Designing Data-Intensive Applications'. I read that book recently and learned a whole lot more than I thought I would about just the entire space of building applications oriented around data. So great resource there. Nice. We'll put those in the show notes. Yeah. And also just kind of raise your hand and say to your management or your team, hey, it looks like we have some data problems, I'm interested in digging into it. And chances are they'll welcome the help. You know, there are lots of great resources out there. If you want to learn more about it, you know, shameless plug, the Data Engineering Podcast is one of them. I'm always happy to help answer questions. I mean, basically just start to dig into the space. Take a look at some of the tools and frameworks and just try to implement them in your day to day work. A lot of data engineers come from software engineering backgrounds. A lot of data engineers might come from database administrator positions, because they're familiar with the problem domain of the data, and then it's a matter of learning the actual engineering aspects of it. A lot of people come from data analyst or data scientist backgrounds, where they actually decide that they enjoy working more with getting the data clean and well managed than doing the actual analysis on it. So there's not really any one concrete background to come from. It's more just a matter of being interested in making the data reproducible and helping make it valuable. An interesting note is that if you look at some of the statistics around it, there are actually more data engineering positions open, at least in the US, than there are data scientist positions, because it is such a necessary step in the overall lifecycle of data. Yeah, how interesting.
And probably traditionally, those might have been just merged together into one group, right, in the category of data science, but now it's getting a little more fine grained. Exactly. And, you know, with the advent of DataOps and MLOps, a lot of organizations are understanding that this is actually a first class consideration that they need dedicated people to be able to help build. And it's not just something that they can throw on the plate of the person who's doing the data science. Yeah, certainly, if you can help organizations go from batch to real time, or maybe from shaky results because of shaky input to solid results because of solid input, those are extremely marketable skills. That's exactly right. All right. Well, Tobias, thanks so much for covering that. Before we get out of here, though, final two questions. So if you're going to write some Python code, what editor do you use these days? So I've been using Emacs for a number of years now. I've tried out things like PyCharm and VS Code here and there, but it just never feels quite right, just because my fingers have gotten so used to Emacs. You just want to have an entire operating system as your editor, not just a piece of software. It has that ML background with Lisp as its language, right. And then notable PyPI package or packages? Yeah, we kind of touched on some, right? Yeah, exactly. I mean, a lot of them are in the list here. I'll just mention again 'Dagster', dbt, and Great Expectations. Yeah, very nice. All right, final call to action. You know, people feel excited about this, what should they do? Listen to the Data Engineering Podcast. Listen to Podcast.__init__ if you want to understand a little bit more about the whole ecosystem, because since I do spend so much time in the data engineering space, I sometimes have crossover where, if there's a data engineering tool that's implemented in Python, I'll have them on Podcast.__init__, just to make sure that I can get everybody out there.
And yeah, feel free to send questions my way. All the information about the podcasts is in the show notes. And yeah, just be curious.
01:02:38 Yeah, absolutely. Well, like I said, it looks like a really interesting and growing space that has got a lot of low hanging fruit. So it sounds like a lot of fun. Absolutely. Yeah. All right. Well, thanks for being here. And thanks, everyone, for listening.
01:02:49 Thanks for having me.
01:02:51 This has been another episode of Talk Python To Me. Our guest in this episode was Tobias Macy. And it's been brought to you by 'Datadog' and 'Retool'. 'Datadog' gives you visibility into the whole system running your code. Visit 'talkpython.fm/datadog' and see what you've been missing. They'll throw in a free T-shirt with your free trial. Supercharge your developers and power users. Let them build and maintain their internal tools quickly and easily with 'Retool'. Just visit 'talkpython.fm/retool' and get started today. Want to level up your Python? We have one of the largest catalogs of Python video courses over at Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async. And best of all, there's not a subscription in sight. Check it out for yourself at 'training.talkpython.fm'. Be sure to subscribe to the show. Open your favorite podcast app and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.