#454: Data Pipelines with Dagster Transcript

Recorded on Thursday, Jan 11, 2024.

00:00 Do you have data that you pull from external sources or that is generated and appears at

00:05 your digital doorstep?

00:06 I bet that data needs to be processed, filtered, transformed, distributed, and much more.

00:10 One of the biggest tools to create these data pipelines with Python is Dagster.

00:15 And we're fortunate to have Pedram Navid on the show to tell us about it.

00:19 Pedram is the head of data engineering and DevRel at Dagster Labs.

00:23 And we're talking data pipelines this week here at Talk Python.

00:28 This is Talk Python to me, episode 454, recorded January 11th, 2024.

00:33 Welcome to Talk Python to me, a weekly podcast on Python.

00:51 This is your host, Michael Kennedy.

00:53 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using

00:57 @talkpython, both on fosstodon.org.

01:00 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

01:05 We've started streaming most of our episodes live on YouTube.

01:09 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified

01:14 about upcoming shows and be part of that episode.

01:18 This episode is sponsored by Posit Connect from the makers of Shiny.

01:22 Publish, share, and deploy all of your data projects that you're creating using Python.

01:27 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, reports, dashboards, and APIs.

01:33 Posit Connect supports all of them.

01:35 Try Posit Connect for free by going to talkpython.fm/posit.

01:40 Posit Connect.

01:41 And it's also brought to you by us over at Talk Python Training.

01:46 Did you know that we have over 250 hours of Python courses?

01:50 Yeah, that's right.

01:51 Check them out at talkpython.fm/courses.

01:56 Last week, I told you about our new course, Build an AI Audio App with Python.

02:00 Well, I have another brand new and amazing course to tell you about.

02:05 This time, it's all about Python's typing system and how to take the most advantage of it.

02:10 It's a really awesome course called Rock Solid Python with Python Typing.

02:15 This is one of my favorite courses that I've created in the last couple of years.

02:19 Python type hints are really starting to transform Python, especially from the ecosystem's perspective.

02:25 Think FastAPI, Pydantic, BearType, etc.

02:28 This course shows you the ins and outs of Python typing syntax, of course,

02:34 but it also gives you guidance on when and how to use type hints.

02:37 Check out this four and a half hour in-depth course at talkpython.fm/courses.

02:43 Now, on to those data pipelines.

02:46 Pedram, welcome to Talk Python to me.

02:50 It's amazing to have you here.

02:51 Michael, great to have you.

02:52 Good to be here.

02:53 Yeah.

02:54 Good to talk about data, data pipelines, automation.

02:57 And boy, oh boy, let me tell you, have I been in the DevOps side of things this week.

03:02 And I'm going to have a special, special appreciation of it.

03:07 I can tell already.

03:08 So excited to talk.

03:10 My good old, sis.

03:11 Indeed.

03:13 So before we get to that, though, before we talk about Dagster and data pipelines and orchestration more broadly,

03:19 let's just give a little bit of background on you.

03:22 Introduce yourself for people.

03:23 How did you get into Python and data orchestration and all those things?

03:27 Of course.

03:27 Yeah.

03:27 So my name is Pedram Navid.

03:29 I'm the head of data engineering and DevRel at Dagster.

03:33 That's a mouthful.

03:34 And I've been a longtime Python user since 2.7.

03:38 And I got started with Python like I do with many things just out of sheer laziness.

03:42 I was working at a bank and there was this rote task, something involving going into servers,

03:48 opening up a text file and seeing if a patch was applied to a server.

03:52 A nightmare scenario when there's 100 servers to check and 15 different patches to confirm.

03:57 Yeah.

03:57 So this kind of predates like the cloud and all that automation and stuff, right?

04:01 So this is definitely before cloud.

04:03 This was like right between Python 2 and Python 3.

04:06 And we were trying to figure out how to use print statements correctly.

04:09 That's when I learned Python.

04:10 I was like, there's got to be a better way.

04:12 And honestly, I've not looked back.

04:13 I think if you look at my entire career trajectory, you'll see it's just

04:17 punctuated by finding ways to be more lazy in many ways.

04:21 Yeah.

04:23 Who was it?

04:24 I think it was Matthew Rocklin that had the phrase something like productive laziness or

04:29 something like that.

04:30 I'm going to find a way to leverage my laziness to force me to build automation.

04:36 So I never, ever have to do this sort of thing again.

04:39 I got that sort of principle.

04:40 It's very motivating to not have to do something and I'll do anything to not do something.

04:44 Yeah.

04:44 Yeah.

04:45 Yeah.

04:45 It's incredible.

04:46 And like that DevOps stuff I was talking about and just, you know, one command and there's

04:51 maybe eight or nine new apps with all their tiers redeployed, updated, resynced.

04:55 And it's, it took me a lot of work to get there, but now I never have to think about it again,

05:01 at least not for a few years.

05:02 And it's amazing.

05:03 I can just be productive.

05:04 It's like right in line, in line with that.

05:07 So what are some of the Python projects you've been, you've worked on, talked about different

05:11 ways to apply this over the years?

05:13 Oh yeah.

05:13 So it started with internal, just like Python projects, trying to automate, like I said, some

05:18 rote tasks that I had and that accidentally becomes, you know, a bigger project.

05:23 People see it and they're like, oh, I want that too.

05:25 And so, well, now I have to build like a GUI interface because most people don't speak

05:29 Python.

05:29 And so that got me into PyGUI, I think it was called way back when.

05:34 That was a fun journey.

05:36 And then from there, it's really taken off.

05:38 A lot of it has been mostly personal projects.

05:40 Trying to understand open source was a really big learning path for me as well.

05:45 Really being absorbed by things like SQLAlchemy and requests back when they were coming out.

05:50 Eventually it led to more of a data engineering type of role where I got involved with

05:55 tools like Airflow and trying to automate data pipelines instead of patches on a server.

06:00 That one day led to, I guess, making a long story short, a role at Dagster, where now I

06:06 contribute a little bit to Dagster.

06:07 I work on Dagster, the core project itself, but I also use Dagster internally to build our

06:12 own data pipelines.

06:13 I'm sure it's interesting to see how you all both build Dagster and then consume Dagster.

06:19 Yeah, it's been wonderful.

06:21 I think there's a lot of great things about it.

06:23 One is getting access to Dagster before it's fully released, right?

06:27 So internally, we dogfood new features, new concepts, and we work with the product team,

06:33 the engineering team to say, hey, this makes sense.

06:35 This doesn't.

06:35 This works really well.

06:36 That doesn't.

06:37 And that feedback loop is so fast and so iterative that for me personally, being able to see that

06:43 come to fruition is really, really compelling.

06:45 But at the same time, it's like I get to work at a place that's building a tool for me, right?

06:49 You don't often get that luxury.

06:51 I've worked in ads.

06:53 I've worked in insurance.

06:54 It's like banking.

06:55 It's like, these are nice things, but it's not built for me, right?

06:59 And so for me, that's probably been the biggest benefit, I would say.

07:01 Right.

07:02 If you work in some marketing thing, you're like, you know, I retargeted myself so well

07:06 today.

07:06 You wouldn't believe it.

07:07 I really enjoyed it.

07:09 I've seen the ads that I've created before.

07:11 So it's a little fun, but it's not the same.

07:14 Yeah.

07:15 I've heard of people who are really, really good at ad targeting and finding groups where

07:21 they've like pranked their wife or something or just had an ad that would only show up for

07:26 their wife by running.

07:27 It was like so specific and, you know, freaked them out a little bit.

07:29 That's pretty clever.

07:30 Yeah.

07:32 Maybe it wasn't appreciated, but it is clever.

07:34 Who knows?

07:36 All right.

07:37 Well, before we jump in, you said that, of course, you built GUIs with PyGUI and those

07:43 sorts of things because people didn't speak Python back then, in the 2.7 days and whatever.

07:47 Is that different now?

07:48 Not that people speak Python, but is it different in the sense that like, hey, I could give them

07:53 a Jupyter notebook or I could give them Streamlit or one of these things, right?

07:58 Like a little more or less you build and just like plug it in?

08:00 I think so.

08:01 I mean, yeah, like you said, it's not different in that, you know, most people probably still

08:05 to this day don't speak Python.

08:07 Yeah.

08:07 I know we had this like movement a little bit back where everyone was going to learn

08:10 like SQL and everyone was going to learn to code.

08:12 I was never that bullish on that trend because like if I'm a marketing person, I've got 10,000

08:18 things to do and learning to code isn't going to be a priority ever.

08:22 So I think building interfaces for people that are easy to use and speaks well to them is always

08:27 useful.

08:28 That never has gone away.

08:29 But I think the tooling around it has been better, right?

08:32 I don't think I'll ever want to use PyGUI again.

08:34 Nothing wrong with the platform.

08:35 It's just like not fun to write.

08:36 Streamlit makes it so easy to do that.

08:39 So does something like Retool, and there's like a thousand other ways now that you can

08:43 bring these tools in front of your stakeholders and your users that just wasn't possible before.

08:47 I think it's a pretty exciting time.

08:48 There are a lot of pretty polished tools.

08:51 Yeah.

08:51 It's gotten so good.

08:52 Yeah.

08:53 There are some interesting ones like OpenBB.

08:55 Do you know that?

08:55 The financial dashboard thing.

08:58 I've heard of this.

08:58 I haven't seen it.

08:59 Yeah.

08:59 It's basically for traders, but it's like a terminal type thing that has a bunch of

09:05 Matplotlib and other interactive stuff that pops up compared to, say, Bloomberg dashboard

09:11 type things.

09:13 But yeah, that's one sense where like maybe like traders go and learn Python because it's

09:18 like, all right, there's enough value here.

09:19 But in general, I don't think, yeah, I don't think people are going to stop what they're

09:23 doing and learn to code.

09:24 So these new UI things are not.

09:25 All right.

09:26 Let's dive in and talk about this general category first of data pipelines, data orchestration,

09:32 all those things.

09:33 And we'll talk about Dagster and some of the trends and that.

09:36 So let's grab some random internet search for what does a data pipeline maybe look like?

09:41 But, you know, people out there listening who don't necessarily live in that space, which

09:45 I think is honestly many of us, maybe we should, but maybe in our minds, we don't think we live

09:50 in data pipeline land.

09:52 Tell them about it.

09:53 Yeah, for sure.

09:54 It is hard to think about if you haven't done or built one before.

09:57 In many ways, a data pipeline is just a series of steps that you apply to some data set that

10:02 you have in order to transform it to something a little bit more valuable at the very end.

10:08 That's a simplified version.

10:09 The devil's in the details.

10:10 But really, like at the end of the day, you're in a business.

10:13 The production of data sort of happens by the very nature of operating that business.

10:17 It tends to be the core thing that, you know, all businesses have in common.

10:21 And then the other sort of output is you have people within that business who are trying

10:25 to understand how the business is operating.

10:27 And this used to be easy when all we had was a single spreadsheet that we could look at once

10:31 a month.

10:31 Yeah.

10:32 I think businesses have gotten a little bit more complex these days.

10:35 Computers.

10:36 And the expectations.

10:36 Like they expect to be able to see almost real time, not I'll see it at the end of the month.

10:41 Sort of.

10:42 That's right.

10:42 Yeah.

10:43 I think people have gotten used to getting data too, which is both good and bad.

10:47 Good in the sense that now people are making better decisions.

10:49 Bad in that there's more work for us to do.

10:51 And we can't just sit on our feet for half a day, half a month, waiting for the next

10:55 request to come in.

10:56 There's just an endless stream that seems to never end.

10:58 So that's really what a pipeline is all about.

11:00 It's like taking these data and making it consumable in a way that users, tools will understand.

11:06 That helps people make decisions at the very end of the day.

11:09 That's sort of the nuts and bolts of it.
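
The description above, a series of steps applied to raw data until it is something people can consume, can be sketched in plain Python. This is only an illustration: the step names, the sample records, and the `run_pipeline` helper are all made up for this example and do not come from Dagster.

```python
from collections import defaultdict
from typing import Any, Callable

Rows = list[dict[str, Any]]


def drop_incomplete(rows: Rows) -> Rows:
    # Discard records that arrived without an amount.
    return [r for r in rows if r.get("amount") not in (None, "")]


def normalize(rows: Rows) -> Rows:
    # Clean up region names and turn amounts into numbers.
    return [
        {"region": r["region"].strip().upper(), "amount": float(r["amount"])}
        for r in rows
    ]


def total_by_region(rows: Rows) -> Rows:
    # Aggregate into the small table a stakeholder actually wants to read.
    totals: dict[str, float] = defaultdict(float)
    for r in rows:
        totals[r["region"]] += r["amount"]
    return [{"region": k, "total": v} for k, v in sorted(totals.items())]


def run_pipeline(rows: Rows, steps: list[Callable[[Rows], Rows]]) -> Rows:
    # A pipeline here is just each step applied to the previous step's output.
    for step in steps:
        rows = step(rows)
    return rows


raw = [
    {"region": " emea ", "amount": "10.5"},
    {"region": "EMEA", "amount": "4.5"},
    {"region": "apac", "amount": ""},
    {"region": "apac", "amount": "7"},
]
report = run_pipeline(raw, [drop_incomplete, normalize, total_by_region])
```

The devil is in everything this leaves out: scheduling, retries, and knowing how fresh `report` is.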

11:10 In your mind, does data acquisition live in this land?

11:14 So for example, maybe we have a scheduled job that goes and does web scraping, calls an

11:19 API once an hour, and that might kick off a whole pipeline of processing.

11:24 Or we watch a folder for people to upload over FTP, like a CSV file or something horrible

11:32 like that.

11:32 You don't even know what it's unspeakable.

11:33 But something like that where you say, oh, a new CSV has arrived for me to get, right?

11:38 Yeah.

11:38 I think that's the beginning of all data pipeline journeys in my mind.

11:42 Very much, right?

11:43 Like an FTP, as much as we hate it, it's not terrible.

11:46 I mean, there are worse ways to transfer files, but I think it's still very much in use today.

11:52 And every data pipeline journey at some point has to begin with that consumption of data from

11:57 somewhere.

11:58 Yeah.

11:58 Hopefully it's SFTP, not just straight FTP, like the encrypted.

12:02 Don't just send your password in the plain text.

12:06 Oh, well.

12:07 I've seen that go wrong.

12:09 That's a story for another day, honestly.

12:10 All right.

12:12 Well, let's talk about the project that you work on.

12:14 We've been talking about it in general, but let's talk about Dagster.

12:17 Like, where does it fit in this world?

12:19 Yes.

12:20 Dagster, to me, is a way to build a data platform.

12:24 It's also a different way of thinking about how you build data pipelines.

12:28 Maybe it's good to compare it with kind of what the world was like, I think, before Dagster

12:32 and how it came about to be.

12:34 So if you think of Airflow, I think it's probably the most canonical orchestrator out there.

12:39 But there are other ways which people used to orchestrate these data pipelines.

12:43 They were often task-based, right?

12:45 Like, I would download file.

12:47 I would unzip file.

12:48 I would upload file.

12:50 These are sort of the words we use to describe the various steps within a pipeline.

12:55 Some of those little steps might be Python functions that you write.

12:58 Maybe there's some pre-built other ones.

13:00 Yeah.

13:00 There might be Python.

13:01 Could be a bash script.

13:03 Could be logging into a server and downloading a file.

13:05 Could be hitting requests to download something from the internet and zipping it.

13:09 Just a various, you know, hodgepodge of commands that would run.

13:11 That's typically how we thought about it.

13:13 For more complex scenarios where your data is bigger, maybe it's running against a Hadoop cluster or a Spark cluster.

13:18 The compute's been offloaded somewhere else.

13:20 But the sort of conceptual way you tended to think about these things is in terms of tasks, right?

13:25 Process this thing, do this massive data dump, run a bunch of things, and then your job is complete.

13:31 With Airflow, or sorry, with Dagster, we kind of flip it on its head a little bit.

13:36 And we say, instead of thinking about tasks, what if we flipped that around and thought about the actual underlying assets that you're creating?

13:43 What if you told us not, you know, the steps that you're going to take, but the thing that you produce?

13:48 Because it turns out as people and as data people and stakeholders, really, we don't care about the task.

13:54 Like, we just assume that you're going to do it.

13:56 What we care about is, you know, that table, that model, that file, that Jupyter notebook.

14:01 And if we model our pipeline through that, then we get a whole bunch of other benefits.

14:06 And that's sort of the Dagster sort of pitch, right?

14:08 Like, if you want to understand the things that are being produced by these tasks,

14:13 tell us about the underlying assets.

14:15 And then when a stakeholder comes to you and says, you know, how old is this table?

14:19 Has it been refreshed lately?

14:20 Well, you don't have to go look at a specific task.

14:22 And remember that task ABC had model XYZ.

14:25 You just go and look up model XYZ directly there, and it's there for you.

14:29 And because you've defined things in this way, you get other nice things like a lineage graph.

14:33 You get to understand how fresh your data is.

14:36 You can do event-based orchestration and all kinds of nice things that are a lot harder to do in a task world.

14:40 Yeah, more declarative, less imperative, I suppose.

14:44 Yeah, it's been the trend, I think, in lots of tooling.

14:48 React, I think, was famous for this as well, right?

14:50 In many ways.

14:51 Yeah.

14:51 It was a hard framework, I think, for people to sort of get their heads around initially,

14:55 because we were so used to, like, the jQuery imperative style of doing things.

15:01 Yeah, how do I hook the event that makes the thing happen?

15:03 Right?

15:03 And React said, let's think about it a little bit differently.

15:05 Let's do this event-based orchestration, really.

15:08 And I think the proof's in the pudding: React's everywhere now, and jQuery, maybe not so much.

15:12 Yeah.

15:13 There's still a lot of jQuery out there, but there's not a lot of active jQuery.

15:18 But I imagine there's some.

15:19 There is, there is, yes.

15:20 Yeah, just because people are like, you know what, don't touch that, that works.

15:23 Which is probably the smartest thing people can do, I think.

15:26 Yeah, honestly.

15:27 Even though new frameworks are shiny.

15:29 And, you know, if there's any ecosystem that loves to chase the shiny new idea,

15:34 it's the JavaScript web world.

15:36 Oh, yeah.

15:36 There's no shortage of new frameworks coming out every time.

15:39 Yeah, I mean, we do too, but not as much as like, that's six months old.

15:43 That's so old, we can't possibly do that anymore.

15:46 We're rewriting it.

15:46 We're going to do the big rewrite again.

15:48 Mm-hmm.

15:48 Yep.

15:49 Fun.

15:49 Okay, so Dagster is the company, but also is open source.

15:53 What's the story around, like, can I use it for free?

15:57 Is it open source?

15:57 Do I pay for it?

15:58 100%.

15:58 Okay.

15:59 So Dagster Labs is the company.

16:00 Dagster open source is the product.

16:03 It's 100% free; we're very committed to the open source model.

16:06 I would say 95% of the things you can get out of Dagster are available through open source.

16:11 And we tend to try to release everything through that model.

16:14 You can run very complex pipelines and you can deploy it all on your own if you wish.

16:19 There is a Dagster Cloud product, which is really the hosted version of Dagster.

16:22 If you want a hosted plane, we can do that for you through Dagster Cloud, but it all runs on the same code base

16:28 and the modeling and the files all essentially look the same.

16:31 Mm-hmm.

16:31 Okay.

16:32 So obviously you could get, like I talked about at the beginning, you could go down the DevOps side,

16:36 get your own open source Dagster set up, schedule it, run it on servers,

16:41 all those things.

16:41 But if we just wanted something real simple, we could just go to you guys and say,

16:46 hey, I built this with Dagster.

16:47 Will you run it for me?

16:49 Pretty much.

16:49 Yeah.

16:49 Right.

16:50 So there's two options there.

16:51 You can do the serverless model, which says, you know, Dagster, just run it.

16:54 We take care of the compute, we take care of the execution for you, and you just write the code

16:58 and upload it to GitHub or any, you know, repository of your choice and we'll sync to that and then run it.

17:04 The other option is to do the hybrid model.

17:06 So you basically do the CI/CD aspect.

17:08 You just say, you push to name your branch.

17:11 If you push to that branch, that means we're just going to deploy a new version

17:15 and whatever happens after that, it'll be in production, right?

17:18 Exactly.

17:18 Yeah.

17:18 And we offer some templates that you can use in GitHub for workflows in order to accommodate that.

17:23 Excellent.

17:24 Then I cut you off.

17:25 You're saying something about hybrid.

17:27 Hybrid is the other option.

17:28 For those of you who want to run your own compute, you don't want the data leaving your ecosystem,

17:32 you can say, we've got this Kubernetes cluster, this ECS cluster, but we still want to use the Dagster Cloud product

17:38 to sort of manage the control plane.

17:40 Dagster Cloud will do that.

17:41 And then you can go off and execute things on your own environment if that's something you wish to do.

17:45 Oh yeah, that's pretty clever because running stuff in containers isn't too bad,

17:49 but running container clusters, all of a sudden, you're back doing a lot of work, right?

17:54 Exactly, yeah.

17:55 Yeah.

17:55 Okay, well, let's maybe talk about Dagster for a bit. Then I want to talk about some of the trends as well,

18:01 but let's just talk through maybe setting up a pipeline.

18:03 Like, what does it look like?

18:05 You know, you talked about in general, less imperative, more declarative,

18:08 but what does it look like?

18:10 Be careful about talking about code on audio, but you know, just give us a sense

18:14 of what the programming model feels like for us.

18:16 As much as possible, it really feels like just writing Python.

18:19 It's pretty easy.

18:21 You add a decorator on top of your existing Python function that does something.

18:25 That's a simple decorator called asset.

18:27 And then your pipeline, that function becomes a data asset.

18:31 That's how it's represented in the Dagster UI.

18:33 So you could imagine you've got a pipeline that gets like maybe Slack analytics

18:38 and uploads that to some dashboard, right?

18:41 Your first pipeline, your function would be called something like Slack data,

18:44 and that would be your asset.

18:45 In that function is where you do all the transform, the downloading of the data

18:50 until you've really created that fundamental data asset that you care about.

18:53 And that could be stored either in, you know, a data warehouse or S3,

18:57 however you sort of want to persist it, that's really up to you.
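
To make the decorator idea concrete, here is a toy stand-in written with only the standard library. It is not Dagster's implementation; in Dagster itself the decorator is imported from the `dagster` package. The registry, the `materialize` helper, and the `slack_summary` asset are hypothetical names for this sketch, and treating a function's parameter names as its upstream assets is a simplifying assumption.

```python
import inspect
from typing import Any, Callable

# Toy registry: asset name -> (function, names of the assets it depends on).
ASSETS: dict[str, tuple[Callable[..., Any], list[str]]] = {}


def asset(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register fn as an asset; its parameter names are read as upstream assets."""
    upstream = list(inspect.signature(fn).parameters)
    ASSETS[fn.__name__] = (fn, upstream)
    return fn


@asset
def slack_data() -> list[dict[str, Any]]:
    # A real asset would call the Slack API here; these rows are made up.
    return [
        {"channel": "general", "messages": 120},
        {"channel": "help", "messages": 45},
    ]


@asset
def slack_summary(slack_data: list[dict[str, Any]]) -> dict[str, int]:
    # Depends on slack_data simply by naming it as a parameter.
    return {
        "channels": len(slack_data),
        "messages": sum(row["messages"] for row in slack_data),
    }


def materialize(name: str) -> Any:
    """Run an asset after first running everything upstream of it."""
    fn, upstream = ASSETS[name]
    return fn(*(materialize(dep) for dep in upstream))


summary = materialize("slack_summary")
```

The point of the sketch is that the function is still ordinary Python; the decorator only records what it produces and what it needs.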

18:59 And then resources are sort of where the power, I think, of a lot of Dagster comes in.

19:04 So the asset is sort of like declaration of the thing I'm going to create.

19:07 The resource is how I'm going to, you know, operate on that, right?

19:11 Because sometimes you might want to have a, let's say, a DuckDB instance locally

19:16 because it's easier and faster to operate.

19:18 But when you're moving to the cloud, you want to have a Databricks or a Snowflake.

19:22 You can swap out resources based on environments and your asset can reference that resource.

19:27 And as long as it has that same sort of API, you can really flexibly change between

19:32 where that data is going to be persisted.
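
The swap between environments works because the asset only relies on a shared API. A minimal sketch of that idea in plain Python follows; the `Warehouse` protocol, both classes, and `make_warehouse` are hypothetical names, the cloud class is a stub rather than a real Snowflake or Databricks client, and SQLite stands in for the local database. Dagster's own resource classes are not shown.

```python
import sqlite3
from typing import Protocol

Row = tuple[str, int]


class Warehouse(Protocol):
    """The small API every storage resource in this sketch must offer."""

    def save(self, table: str, rows: list[Row]) -> None: ...
    def count(self, table: str) -> int: ...


class SQLiteWarehouse:
    """Local development resource backed by an in-memory SQLite database."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")

    def save(self, table: str, rows: list[Row]) -> None:
        # Table names cannot be bound as SQL parameters, so this sketch
        # only ever passes trusted, hard-coded names.
        self.conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{table}" (name TEXT, value INTEGER)'
        )
        self.conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows)
        self.conn.commit()

    def count(self, table: str) -> int:
        return self.conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


class StubCloudWarehouse:
    """Placeholder for a production client; it just keeps rows in a dict."""

    def __init__(self) -> None:
        self.tables: dict[str, list[Row]] = {}

    def save(self, table: str, rows: list[Row]) -> None:
        self.tables.setdefault(table, []).extend(rows)

    def count(self, table: str) -> int:
        return len(self.tables.get(table, []))


def make_warehouse(env: str) -> Warehouse:
    # Pick the resource by environment; the asset code never changes.
    return SQLiteWarehouse() if env == "local" else StubCloudWarehouse()


def slack_data(warehouse: Warehouse) -> int:
    rows = [("general", 120), ("help", 45)]
    warehouse.save("slack_data", rows)
    return warehouse.count("slack_data")


local_rows = slack_data(make_warehouse("local"))
prod_rows = slack_data(make_warehouse("prod"))
```

Because `slack_data` only calls `save` and `count`, either resource can be passed in without touching the asset.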

19:34 Does Dagster know how to talk to those different platforms?

19:36 Does it like natively understand DuckDB and Snowflake?

19:40 Yeah.

19:40 So it's interesting.

19:41 People often look to Dagster and like, oh, does it do X?

19:44 And the question is like, Dagster does anything you can do with Python.

19:48 Which is most things.

19:49 Yeah.

19:49 Which is most things.

19:50 So I think if you come from the Airflow world, you're very much used to like these Airflow providers.

19:54 And if you want to use...

19:55 That's kind of what I was thinking.

19:56 Yeah.

19:56 Yeah.

19:57 You want to use a Postgres, you need to find the Postgres provider.

19:59 You want to use S3, you need to find the S3 provider.

20:01 With Dagster, you kind of say, you don't have to do any of that.

20:04 If you want to use Snowflake, for example, install the Snowflake connector package from Snowflake,

20:09 and you use that as a resource directly.

20:10 And then you just run your SQL that way.

20:13 There are some places where we do have integrations that help.

20:17 If you want to get into the weeds with IO manager, it's where we persist the data on your behalf.

20:21 And so for S3, for Snowflake, for example, there's other ways where we can persist that data for you.

20:26 But if you're just trying to run a query, just trying to execute something,

20:30 just trying to save something somewhere, you don't have to use that system at all.

20:33 You can just use whatever Python package you would use anyway to do that.

20:37 So maybe some data is expensive for us to get as a company.

20:42 Like maybe we're charged on a usage basis or super slow or something.

20:46 I could write just Python code that goes and says, well, look in my local database.

20:50 If it's already there and it's not too stale, use that.

20:52 Otherwise, then do actually go get it, put it there, and then get it back.

20:57 And like that kind of stuff would be up to me to put together.
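
That look-locally-first logic is ordinary Python. Here is one possible sketch using SQLite as the local store; the `cached_fetch` helper, the `cache` table layout, and the `expensive_api` function are assumptions made for illustration, and the `now` argument exists only so the example can control the clock.

```python
import json
import sqlite3
import time
from typing import Any, Callable, Optional


def cached_fetch(
    conn: sqlite3.Connection,
    key: str,
    fetch: Callable[[], Any],
    max_age_seconds: float,
    now: Optional[float] = None,
) -> Any:
    """Return a cached copy if it is fresh enough, otherwise fetch and store it."""
    now = time.time() if now is None else now
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache "
        "(key TEXT PRIMARY KEY, payload TEXT, fetched_at REAL)"
    )
    row = conn.execute(
        "SELECT payload, fetched_at FROM cache WHERE key = ?", (key,)
    ).fetchone()
    if row is not None and now - row[1] <= max_age_seconds:
        return json.loads(row[0])  # Already there and not too stale.

    data = fetch()  # The slow or costly call happens only on a miss.
    conn.execute(
        "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
        (key, json.dumps(data), now),
    )
    conn.commit()
    return data


conn = sqlite3.connect(":memory:")
calls: list[int] = []


def expensive_api() -> dict[str, float]:
    calls.append(1)  # Count how often the paid source is actually hit.
    return {"rate": 1.25}


first = cached_fetch(conn, "fx", expensive_api, 3600, now=0)
second = cached_fetch(conn, "fx", expensive_api, 3600, now=1800)  # still fresh
third = cached_fetch(conn, "fx", expensive_api, 3600, now=7200)  # stale, refetch
```

Across the three calls the expensive source is hit twice: once to fill the cache and once after the entry has gone stale.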

21:00 Yeah.

21:01 And that's the nice thing is you're not really limited by like anyone's data model

21:05 or world view on how data should be retrieved or saved or augmented.

21:09 You could do it a couple of ways.

21:10 You could say, whenever I'm working locally, use this persistent data store that we're just going to use for development purposes.

21:17 Fancy database called SQLite, something like that.

21:20 Exactly.

21:20 Yes.

21:20 A wonderful database.

21:21 And actually it's, yeah, it'll work really, really well.

21:24 And then you just say, when I'm in a different environment, when I'm in production,

21:27 swap out my SQLite resource for, name your favorite cloud warehouse resource, and go

21:33 fetch that data from there.

21:34 Or I want to use MinIO locally.

21:36 I want to use S3 on prod.

21:38 It's very simple to swap these things out.

21:40 Okay.

21:40 Yeah.

21:41 So it looks like you build up these assets as y'all call them, these pieces of data,

21:46 Python code that accesses them.

21:48 And then you have a nice UI that lets you go and build those out kind of workflow style,

21:54 right?

21:54 Yeah, exactly.

21:55 This is where we get into the wonderful world of DAGs, which stands for directed acyclic graph.

22:00 I think it stands for a bunch of things that are not connected in a circle,

22:04 but are connected in some way.

22:05 So there can't be any loops, right?

22:07 Because then you never know where to start or where to end.

22:08 Could be a diamond, but not a circle, right?

22:11 Not a circle.

22:12 As long as there's like a path through this data set with the beginning and an end,

22:17 then we can kind of start to model this connected graph of things.

22:21 And then we know how to execute them, right?

22:23 We can say, well, this is the first thing we have to run because that's where all dependencies start.

22:26 And then we can either branch off in parallel or we can continue linearly until everything

22:31 is complete.

22:31 And if something breaks in the middle, we can resume from that broken spot.
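
The ordering rules described here can be tried out with Python's standard `graphlib` module. This is a sketch of the general idea, not how Dagster schedules work internally; the asset names and the `run` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each asset maps to the set of assets it depends on: a diamond, not a circle.
deps = {
    "raw_events": set(),
    "sessions": {"raw_events"},
    "purchases": {"raw_events"},
    "daily_report": {"sessions", "purchases"},
}
order = list(TopologicalSorter(deps).static_order())

# A loop has no valid starting point, so sorting it raises CycleError.
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
    loop_rejected = False
except CycleError:
    loop_rejected = True


def run(order: list[str], done: set[str], failing: frozenset = frozenset()):
    """Run assets in order, skipping finished ones; return the first failure."""
    for name in order:
        if name in done:
            continue  # Already materialized in an earlier attempt.
        if name in failing:
            return name  # Stop here; everything before it stays done.
        done.add(name)
    return None


done: set[str] = set()
failed_at = run(order, done, failing=frozenset({"daily_report"}))
resumed = run(order, done)  # Picks up at the broken spot, nothing is rerun.
```

`sessions` and `purchases` both depend only on `raw_events`, so they are the pair that could run in parallel.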

22:35 Okay.

22:35 Excellent.

22:36 And is that the recommended way?

22:38 Like if I write all this Python code that works on the pieces, then the next recommendation would be to fire up the UI and start building it?

22:45 Or do you say, ah, you should really write it in code and then you can just visualize it or monitor it?

22:50 Everything in Dagster is written as code.

22:51 The UI reads that code and it interprets it as a DAG and then it displays that for you.

22:57 There are some things to do with the UI.

22:58 Like you can materialize assets, you can make them run, you can do backfills,

23:02 you can view metadata, you can sort of enable and disable schedules.

23:06 But the core, we really believe this at Dagster, like the core declaration of how things are done,

23:11 it's always done through code.

23:13 Okay.

23:13 Excellent.

23:13 So when you say materialize, maybe I have an asset, which is really a Python function I wrote that goes and pulls down a CSV file.

23:22 The materialize would be, I want to see kind of representative data in this,

23:26 in the UI.

23:28 So I could go, all right, I think this is right.

23:30 Let's keep passing it down.

23:31 Is that what that means?

23:32 Materialize really means just run this particular asset, make this asset new again,

23:36 fresh again, right?

23:37 As part of that materialization, we sometimes output metadata.

23:41 And you can kind of see this on the right.

23:42 If you're looking at the screen here, where we talk about what the timestamp was,

23:46 the URL, there's a nice little graph of like number of rows over time.

23:51 All that metadata is something you can emit.

23:53 And we emit some ourselves by default with the framework.

23:56 And then as you materialize these assets, as you run that asset over and over again,

24:00 over time, we capture all that.

24:01 And then you can really get a nice overview of, you know, this asset's lifetime,

24:05 essentially.

24:06 Nice.

24:06 Yeah.

24:07 I think the asset, the metadata is really pretty excellent, right?

24:10 Over time, you can see how the data's grown and changed.

24:14 Yeah.

24:14 The metadata is, is really powerful.

24:16 And it's one of the nice benefits of being in this asset world, right?

24:19 Because you don't really want metadata on, like, this task that ran.

24:22 You want to know like this table that I created, how many rows has it had every single time it's run?

24:27 Because if that number drops by like 50%, that's a big problem.

24:30 Conversely, if the runtime is slowly increasing every single day, you might not notice it,

24:35 but over a month or two, it went from a 30 second pipeline to 30 minutes.

24:39 Maybe that's a great place to start optimizing that one specific asset.
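
Both warning signs mentioned here, a sudden drop in rows and a runtime that creeps up, are easy to check once the per-run metadata is kept. A rough sketch follows; the `Materialization` record, the `check_latest` function, and the thresholds are assumptions for illustration rather than anything Dagster ships.

```python
from dataclasses import dataclass


@dataclass
class Materialization:
    """Metadata recorded for one run of an asset."""

    row_count: int
    runtime_seconds: float


def check_latest(
    history: list[Materialization],
    max_row_drop: float = 0.5,
    max_slowdown: float = 10.0,
) -> list[str]:
    """Compare the newest run against earlier ones and describe anything odd."""
    if len(history) < 2:
        return []
    first, previous, latest = history[0], history[-2], history[-1]
    warnings = []
    # A sharp fall in rows versus the previous run suggests missing data.
    if latest.row_count <= previous.row_count * (1 - max_row_drop):
        warnings.append(
            f"row count fell from {previous.row_count} to {latest.row_count}"
        )
    # Slow drift is easiest to see against the very first recorded run.
    if latest.runtime_seconds >= first.runtime_seconds * max_slowdown:
        warnings.append(
            f"runtime grew from {first.runtime_seconds}s to {latest.runtime_seconds}s"
        )
    return warnings


history = [
    Materialization(row_count=1000, runtime_seconds=30),
    Materialization(row_count=1010, runtime_seconds=45),
    Materialization(row_count=480, runtime_seconds=1800),
]
problems = check_latest(history)
healthy = check_latest(history[:2])
```

With this history the last run trips both checks, while the first two runs on their own raise nothing.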

24:43 Right.

24:43 And what's cool if it's just Python code, you know how to optimize that probably,

24:47 right?

24:47 Hopefully.

24:48 Yeah.

24:48 Well, as much as you're going to, yeah, you got, you have all the power of Python and you should be able to,

24:55 as opposed to it's deep down inside some framework that you don't really.

24:57 Exactly.

24:58 Yeah.

24:58 You use Python, you can benchmark it.

25:00 There's probably, you probably knew you didn't write it that well when you first started and you can

25:05 always find ways to improve it.

25:07 So this UI is something that you can just run locally.

25:09 Kind of like Jupyter.

25:10 A hundred percent.

25:11 Just type dagster dev and then you get the full UI experience.

25:15 You get to see the runs, all your assets.

25:17 Is it a web app?

25:18 It is.

25:19 Yeah.

25:19 It's a web app.

25:20 There's a Postgres backend.

25:21 And then there's a couple of services that run the web server, the GraphQL,

25:25 and then the workers.

25:26 Nice.

25:26 Yeah.

25:27 So pretty serious web app.

25:28 It sounds like. But yeah, it's just something you run, probably all containers or something.

25:36 You just fire up when you download it.

25:38 Right.

25:38 Locally it doesn't even use containers.

25:39 It's just all pure Python for that.

25:42 But once you deploy, yeah, I think you might want to go down the container route,

25:46 but it's nice not having to have Docker just to like run a simple test deployment.

25:50 Yeah.

25:50 I guess not everyone's machine has that for sure.

25:53 So question from the audience here.

25:55 Jazzy asks, does it hook into AWS in particular?

25:59 Is it compatible with existing pipelines like ingestion lambdas or transform lambdas?

26:04 Yeah, you can hook into AWS.

26:06 So we have some AWS integrations built in.

26:09 Like I mentioned before, there's nothing stopping you from importing Boto3 and doing anything really you want.

26:14 So a very simple use case, like let's say you already have an existing transformation being triggered in AWS through

26:20 some Lambda.

26:20 You could just model that within Dagster and say, you know, trigger that Lambda with Boto3.

26:25 Okay.

26:26 Then the asset itself is really that representation of that pipeline, but you're not actually running that code within Dagster itself.

26:32 That's still occurring on the AWS framework.

26:34 And that's a really simple way to start adding a little bit of observability and orchestration

26:38 to existing pipelines.
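
A minimal sketch of that idea: a function that triggers an existing Lambda and returns a little metadata about the call. The `invoke` keyword arguments match boto3's Lambda client, but everything else is an assumption for illustration. `FakeLambdaClient`, `trigger_existing_transform`, and the function name `nightly-transform` are hypothetical, and the fake client exists only so the example runs without AWS credentials. In real use you would pass `boto3.client("lambda")` and call the function from inside an asset.

```python
import json


class FakeLambdaClient:
    """Offline stand-in; in real use pass boto3.client("lambda") instead."""

    def __init__(self):
        self.calls = []

    def invoke(self, FunctionName, InvocationType, Payload):
        # Record the call and pretend the Lambda succeeded.
        self.calls.append((FunctionName, InvocationType, json.loads(Payload)))
        return {"StatusCode": 200}


def trigger_existing_transform(client, function_name, payload):
    """Invoke the Lambda synchronously and return metadata worth recording."""
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(payload).encode("utf-8"),
    )
    if response["StatusCode"] != 200:
        raise RuntimeError(f"{function_name} returned {response['StatusCode']}")
    return {"function": function_name, "status_code": response["StatusCode"]}


client = FakeLambdaClient()
result = trigger_existing_transform(
    client, "nightly-transform", {"date": "2024-01-11"}
)
print(result)
```

The transformation still runs in AWS; the orchestrator only records that it was triggered and how it went.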

26:39 Okay.

26:40 That's pretty cool because now you have this nice UI and this metadata and this history,

26:45 but it's someone else's cloud.

26:47 Exactly.

26:47 Yeah.

26:47 Yeah.

26:48 And you can start to pull more information in there.

26:50 And over time, you might decide, you know, this, you know, Lambda that I had,

26:53 it's starting to get out of hand.

26:54 I want to kind of break it apart into multiple assets, or I want to sort of optimize it a

26:59 little, and Dagster can help you along the way.

27:01 Yeah.

27:02 Excellent.

27:02 How do you set up, like triggers or observability inside Dagster?

27:07 Like Jazzy's asking about S3, but like in general, right?

27:11 If a row is entered into a database, something is dropped in a blob storage or the date changes.

27:16 I don't know.

27:16 Yeah.

27:17 Those are great questions.

27:18 You have a lot of options.

27:19 In Dagster, we do model every asset with a couple little flags, I think that are really useful to think about.

27:24 One is whether the code of that particular asset has changed.

27:28 Right.

27:28 And then the other one is whether anything upstream of that asset has changed.

27:32 And those two things really power a lot of automation functionality that we can get downstream.

27:37 So let's start with the, I think the S3 example is the easiest to understand.

27:41 You have a bucket and there is, you know, a file that gets uploaded every day.

27:45 You don't know what time that file gets uploaded.

27:47 You don't know when it'll be uploaded, but you know at some point it will be.

27:51 In Dagster, we have a thing called a sensor, which you can just connect to an S3 location.

27:55 You can define how it looks into that file or into that folder.

27:58 And then you would just poll every 30 seconds until something happens.

28:02 When that something happens, that triggers sort of an event.

28:06 And that event can trickle at your will downstream to everything that depends on it

28:10 as you sort of connect to these things.

28:11 So it gets you away from this, like, oh, I'm going to schedule something to run every hour.

28:16 Maybe the data will be there, but maybe it won't.

28:18 And you can have a much more event-based workflow.

28:20 When this file runs, I want everything downstream to know that this data has changed.

28:24 And as sort of data flows through these systems, everything will sort of work its way down.

28:28 Yeah, I like it.
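
The polling idea can be sketched without any framework: each tick lists the location, compares it against a cursor of keys already seen, and reports only what is new. This is an assumption-laden illustration, not Dagster's sensor API. `check_for_new_files` is a hypothetical helper, and the in-memory `bucket` list stands in for an S3 listing.

```python
def check_for_new_files(list_keys, cursor):
    """One sensor tick: return unseen keys plus the updated cursor.

    list_keys is any callable returning the keys currently in the location;
    cursor is the set of keys already handled by earlier ticks.
    """
    current = set(list_keys())
    new_keys = sorted(current - cursor)
    return new_keys, cursor | current


# Simulated bucket listing; a real sensor would list objects in S3 instead.
bucket = ["2024-01-10.csv"]
cursor = set()

first_tick, cursor = check_for_new_files(lambda: bucket, cursor)
second_tick, cursor = check_for_new_files(lambda: bucket, cursor)
bucket.append("2024-01-11.csv")  # the daily file finally arrives
third_tick, cursor = check_for_new_files(lambda: bucket, cursor)

print(first_tick, second_tick, third_tick)
```

Only the first and third ticks report a file, so downstream work is triggered when data actually arrives rather than on every poll.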

28:29 This portion of Talk Python to Me is brought to you by Posit, the makers of Shiny,

28:35 formerly RStudio, and especially Shiny for Python.

28:39 Let me ask you a question.

28:41 Are you building awesome things?

28:43 Of course you are.

28:44 You're a developer or data scientist.

28:45 That's what we do.

28:46 And you should check out Posit Connect.

28:48 Posit Connect is a way for you to publish, share, and deploy all the data products that you're building using Python.

28:55 People ask me the same question all the time.

28:58 Michael, I have some cool data science project or notebook that I built.

29:02 How do I share it with my users, stakeholders, teammates?

29:05 Do I need to learn FastAPI or Flask or maybe Vue or React.js?

29:10 Hold on now.

29:11 Those are cool technologies, and I'm sure you'd benefit from them, but maybe stay focused on the data

29:15 project.

29:16 Let Posit Connect handle that side of things.

29:18 With Posit Connect, you can rapidly and securely deploy the things you build in Python.

29:23 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, reports, dashboards,

29:28 and APIs.

29:29 Posit Connect supports all of them.

29:32 And Posit Connect comes with all the bells and whistles to satisfy IT and other enterprise requirements.

29:38 Make deployment the easiest step in your workflow with Posit Connect.

29:42 For a limited time, you can try Posit Connect for free for three months by going to talkpython.fm/posit.

29:48 That's talkpython.fm/P-O-S-I-T.

29:52 The link is in your podcast player show notes.

29:53 Thank you to the team at Posit for supporting Talk Python.

29:59 The sensor concept is really cool because I'm sure that there's a ton of cloud machines,

30:05 people provisioned, just because this thing runs every 15 minutes, that runs every 30 minutes,

30:10 and you add them up and aggregate, we need eight machines just to handle the automation rather than,

30:16 you know, because they're hoping to catch something without too much latency,

30:19 but maybe like that actually only changes once a week.

30:21 Exactly.

30:22 And I think that's where we have to like sometimes step away from the way we're so used to thinking

30:27 about things.

30:28 And I'm guilty of this.

30:29 When I create a data pipeline, my natural inclination is to create a schedule where it's

30:33 a, is this a daily one?

30:34 Is this weekly?

30:35 Is this monthly?

30:36 But what I'm finding more and more is when I'm creating my pipelines, I'm not adding a schedule.

30:40 I'm using Dagster's auto-materialize policies, and I'm just telling it, you figure it out.

30:45 I don't have to think about schedules.

30:46 Just figure out when the things should be updated.

30:48 When its, you know, parents have been updated, you run.

30:51 When the data has changed, you run.

30:53 And then just like figure it out and leave me alone.

30:55 Yeah.

30:55 And it's worked pretty well for me so far.
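
The two flags mentioned earlier, whether the asset's code changed and whether anything upstream changed, are enough to decide when a refresh is due. Here is a minimal sketch of that decision. The `needs_refresh` function and the version dictionaries are hypothetical and only illustrate the idea; they are not how Dagster stores or compares versions internally.

```python
def needs_refresh(last_run, code_version, upstream_versions):
    """Return True if the asset has never run, its code changed,
    or any upstream asset has a different version than last time."""
    if last_run is None:
        return True
    code_changed = last_run["code_version"] != code_version
    upstream_changed = last_run["upstream_versions"] != upstream_versions
    return code_changed or upstream_changed


# What was recorded the last time this asset was materialized.
last_run = {"code_version": "v1", "upstream_versions": {"raw_orders": 3}}

print(needs_refresh(last_run, "v1", {"raw_orders": 3}))  # nothing changed
print(needs_refresh(last_run, "v2", {"raw_orders": 3}))  # code changed
print(needs_refresh(last_run, "v1", {"raw_orders": 4}))  # upstream changed
```

A check like this replaces the question "what schedule should this run on?" with "has anything it depends on changed?"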

30:57 I think it's great.

30:57 I have a search, refresh the search index on the various podcast pages that runs and it

31:04 runs every hour, but the podcast ships weekly, right?

31:08 But I don't know which hour it is.

31:09 And so it seems like that's enough latency, but it would be way better to put just a little

31:14 bit of smarts.

31:15 Like what was the last date that anything changed?

31:18 Was that since the last time you saw it?

31:19 Maybe we'll just leave that alone.

31:21 You know, you're starting to inspire me to go write more code, but pretty cool.

31:25 All right.

31:26 So on the homepage at Dagster.io, you've got a nice graphic that shows you both how to write

31:33 the code, like some examples of the code, as well as how that looks in the UI.

31:37 And one of them is called, says to launch backfills.

31:41 What is this backfill thing?

31:43 Oh, this is my favorite thing.

31:44 Okay.

31:44 Okay.

31:45 So when you first start your data journey as a data engineer, you sort of have a pipeline

31:50 and you build it and it just runs on a schedule and that's fine.

31:53 What you soon find is, you know, you might have to go back in time.

31:58 You might say, I've got this data set that updates monthly.

32:01 Here's a great example.

32:02 AWS cost reporting, right?

32:04 AWS will send you some data around, you know, all your instances and your S3 bucket, all

32:10 that.

32:10 And it'll update that data every day or every month or whatever have you.

32:14 Due to some reason, you've got to go back in time and refresh data that AWS updated due

32:18 to some like discrepancy.

32:19 Backfill is sort of how you do that.

32:21 And it works hand in hand with this idea of a partition.

32:24 A partition is sort of how your data is naturally organized.

32:27 And it's like a nice way to represent that natural organization.

32:31 It has nothing to do with like the fundamental way, how often you want to run it.

32:34 It's more around like, I've got a data set that comes in once a month, it's represented monthly.

32:39 It might be updated daily, but the representation of the data is monthly.

32:43 So I will partition it by month.

32:44 It doesn't have to be dates.

32:45 It could be strings.

32:47 It could be a list.

32:48 You could have a partition for every company or every client or every domain you have,

32:55 whatever you sort of think is a natural way to think about breaking apart that pipeline.

32:59 And once you do that partition, you can do these nice things called backfills, which says,

33:04 instead of running this entire pipeline on all my data, I want you to pick that one month

33:08 where your data went wrong or that one month where data was missing and just run the partition

33:12 on that range.

33:13 And so you limit compute, you save resources and get a little bit more efficient.

33:18 And it's just easier to like think about your pipeline because you've got this natural

33:22 built-in partitioning system.
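
As a rough sketch of partitions and backfills: the data is keyed by month, and a backfill recomputes only the keys you select instead of the whole history. The names `monthly_partitions` and `backfill` and the dictionary used as storage are hypothetical; Dagster provides its own partition definitions and backfill UI, which this plain-Python version does not reproduce.

```python
from datetime import date


def monthly_partitions(start, end):
    """Return 'YYYY-MM' keys from start to end, inclusive."""
    year, month = start.year, start.month
    keys = []
    while (year, month) <= (end.year, end.month):
        keys.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys


def backfill(compute, partition_keys, store):
    """Recompute only the selected partitions, overwriting their old results."""
    for key in partition_keys:
        store[key] = compute(key)
    return store


all_keys = monthly_partitions(date(2023, 10, 1), date(2024, 1, 1))
store = {key: "old" for key in all_keys}

# The vendor re-uploaded November, so only that partition is recomputed.
backfill(lambda key: f"recomputed {key}", ["2023-11"], store)

print(all_keys)
print(store["2023-11"], "|", store["2023-12"])
```

Because each partition is written under its own key, the other months are left untouched.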

33:23 Excellent.

33:24 So maybe you missed some important event.

33:27 Maybe your automation went down for a little bit, came back up.

33:31 You're like, oh no, we've, we've missed it.

33:33 Right.

33:33 But you don't want to start over for three years.

33:36 So maybe we could just go and run the last day.

33:39 It's worth it.

33:40 Exactly.

33:40 Or another one would be your vendor says, hey, by the way, we actually screwed up.

33:44 We uploaded this file from two months ago, but the numbers were all wrong.

33:48 So we've uploaded a new version to that destination.

33:51 Can you update your data set?

33:52 One way is to recompute the entire universe from scratch.

33:55 But if you've partitioned things, you can say, no, limit that to just this one partition

34:00 for that month.

34:01 And that one partition can trickle down all the way to all your other assets that depend

34:04 on that one.

34:05 Do you have to pre decide, do you have to think about this partitioning beforehand or can you

34:10 do it retroactively?

34:11 You could do it retroactively.

34:12 And I have done that before as well.

34:14 It really depends on, on where you're at.

34:16 I think if it's your first asset ever,

34:18 probably don't bother with partitions, but it really isn't a lot of work to get them to

34:23 get them started.

34:23 Okay.

34:24 Yeah.

34:25 Really neat.

34:25 I like a lot of the ideas here.

34:27 I like that it's got this visual component that you can see what's going on, inspect it.

34:33 I also see you can debug runs.

34:34 What happens there?

34:35 Like, obviously, when you're pulling data from many different sources, maybe it's not your

34:40 data you're taking in.

34:41 Fields can vanish.

34:42 It can be the wrong type.

34:43 Systems can go down.

34:44 I'm sure.

34:45 Sure.

34:46 The debugging is interesting.

34:47 So it looks a little bit like a web browser dev tools thing.

34:52 So for the record, my code never fails.

34:54 I've never had a bug in my life.

34:56 But for the other you have.

34:57 Yeah.

34:58 Well, mine doesn't.

34:59 I only do it to make an example and remind me how others.

35:02 Yes.

35:03 If I do, it's intentional, of course.

35:05 Yeah.

35:05 To humble myself a little bit.

35:07 Exactly.

35:07 This view is one of my favorite.

35:10 I mean, so many favorite views.

35:11 But this is, it's actually really fun to watch.

35:13 Watch this actually run when you execute this pipeline.

35:16 But really, like, let's go back to, you know, the world before orchestrators.

35:20 We used cron, right?

35:22 We'd have a bash script that would do something.

35:24 And we'd have a cron job that said, make sure this thing runs.

35:27 And then hopefully it was successful, but sometimes it wasn't.

35:30 And it's the "sometimes it wasn't"

35:32 that's always been the problem, right?

35:34 It's like, well, what do I do now?

35:35 How do I know why it failed?

35:36 What was, when did it fail?

35:38 You know, at what point or what steps did it fail?

35:40 That's really hard to do.

35:42 What this debugger really is, is a structured log of every step that's been going on through your pipeline, right?

35:48 So in this view, there's three assets that we can kind of see here.

35:51 One is called users.

35:53 One is called orders.

35:54 And one is to run dbt.

35:56 So presumably there's these two, you know, tables that are being updated.

35:59 And then a dbt job, it looks like that's being updated at the very end.

36:02 Once you execute this pipeline, all the logs are captured from each of those assets.

36:07 So you can manually write your own logs.

36:10 You have access to a Python logger and you can use your info, your error, whatever have you, and log output that way.

36:16 And it'll be captured in a structured way.

36:18 But it'll also capture logs from the integrations.

36:21 So using dbt, we capture those logs as well.

36:24 You can see it processing every single asset.

36:26 So if anything does go wrong, you can filter down and understand at what step, at what point did something go wrong.
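
To make "structured log" concrete, here is a small sketch using only the standard `logging` module: each record is tagged with the step that produced it, so failures can be filtered by step afterwards. The `ListHandler` class, the logger name, and the log messages are invented for illustration; Dagster's own run log viewer is a separate feature that this does not reproduce.

```python
import logging


class ListHandler(logging.Handler):
    """Collect log records as dictionaries so they can be filtered later."""

    def __init__(self):
        super().__init__()
        self.events = []

    def emit(self, record):
        self.events.append({
            "step": getattr(record, "step", None),
            "level": record.levelname,
            "message": record.getMessage(),
        })


logger = logging.getLogger("pipeline_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example's output self-contained
handler = ListHandler()
logger.addHandler(handler)

# The `extra` argument attaches the step name to each record.
logger.info("loaded 120 rows", extra={"step": "users"})
logger.error("column 'total' missing", extra={"step": "orders"})
logger.info("skipped", extra={"step": "run_dbt"})

failures = [event for event in handler.events if event["level"] == "ERROR"]
print(failures)
```

Filtering on the `step` and `level` fields answers "at what step did something go wrong?" without reading the whole log.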

36:33 That's awesome.

36:34 And just the historical aspect.

36:36 Because just going through logs, especially multiple systems, can be really, really tricky to figure out what's the problem.

36:43 What actually caused this to go wrong?

36:45 But come back and say, oh, it crashed.

36:47 Pull up the UI and see.

36:49 All right.

36:49 Well, show me what this run did.

36:52 Show me what this job did.

36:53 And it seems like it's a lot easier to debug than your standard web API or something like that.

36:57 You can click onto any of these assets and get that metadata that we had earlier as well.

37:02 If, you know, one step failed and it's kind of flaky, you can just click on that one step and say, just rerun this.

37:08 Everything else is fine.

37:09 We don't need to restart from scratch.

37:10 Okay.

37:10 And it'll keep the data from before.

37:13 So you don't have to rerun that.

37:15 Yeah.

37:15 I mean, it depends on how you built the pipeline.

37:18 We like to build idempotent pipelines, is how we sort of talk about it in the data engineering landscape, right?

37:23 So you should be able to run something multiple times and not break anything in a perfect world.

37:27 That's not always possible.

37:28 But ideally, yes.

37:29 And so we can presume that if users completed successfully, then we don't have to run that again because that data was persisted in, you know, a database, S3, somewhere.

37:38 And if orders was the one that was broken, we can just only run orders and not have to worry about rewriting the whole thing from scratch.

37:46 So idempotent for people who maybe don't know, you run it once or you perform the operation once or you perform it 20 times, same outcome.

37:53 Shouldn't have side effects, right?

37:55 That's the idea.

37:55 Yeah.

37:56 That's the idea.

37:56 Easier said than done sometimes.

37:58 It sure is.

38:00 Sometimes it's easy.

38:00 Sometimes it's very hard.

38:01 But the more you can build pipelines that way, the easier your life becomes in many ways.
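
A tiny illustration of the difference: appending rows is not idempotent because every rerun duplicates data, while writing rows keyed by an identifier gives the same result no matter how many times it runs. The function names and sample rows are hypothetical.

```python
def append_rows(table, rows):
    """Not idempotent: every rerun adds the same rows again."""
    table.extend(rows)


def upsert_rows(table, rows):
    """Idempotent: rows are keyed by id, so a rerun overwrites
    the existing entries instead of duplicating them."""
    for row in rows:
        table[row["id"]] = row


rows = [{"id": 1, "total": 10}, {"id": 2, "total": 25}]
appended, upserted = [], {}

# Simulate rerunning the same step three times.
for _ in range(3):
    append_rows(appended, rows)
    upsert_rows(upserted, rows)

print(len(appended), len(upserted))
```

After three runs the appended table holds six rows and the keyed table still holds two, which is why rerunning a single failed step is safe only for the second style.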

38:07 Generally, not always, but generally true for programming as well, right?

38:10 If you talk to functional programming people, they'll say like, it's an absolute, but.

38:14 Yes.

38:14 Functional programmers love this kind of stuff.

38:17 And it actually does lend itself really well to data pipelines.

38:20 Data pipelines, unlike maybe some of the software engineering stuff, it's a little bit different in that the data changing is what causes often most of the headaches, right?

38:30 It's less so the actual code you write, but more the expectations tend to change so frequently and so often in new and novel and interesting ways that you would often never expect.

38:40 And so the more you can sort of make that function so pure that you can provide any sort of data set and really easily test

38:48 these expectations when they change, the easier it is to sort of debug these things and build on them in the future.

38:54 Yeah.

38:54 And cache them as well.

38:55 Yes.

38:56 It's always nice.

38:57 Yeah.

38:57 So speaking of that kind of stuff, like what's the scalability story?

39:02 If I've got some big, huge, complicated data pipeline, can I parallelize them and have them run multiple pieces?

39:10 Like if there's different branches or something like that?

39:12 Yeah, exactly.

39:13 That's one of the key benefits, I think, in writing your assets in this DAG way, right?

39:19 Anything that is parallelizable will be parallelized.

39:22 Now, sometimes you might want to put limits on that.

39:24 Sometimes too much parallelization is bad.

39:25 Your poor little database can't handle it.

39:27 And you can say, you know, maybe a concurrency limit on this one just for today is worth putting.

39:33 Or if you're hitting an API for an external vendor, they might not appreciate 10,000 requests a second on that one.

39:38 So maybe you would slow it down.

39:40 But in essence...

39:41 Or rate limiting, right?

39:41 You can run into too many requests and then your stuff crashes.

39:45 Then you got to start.

39:45 It can be a whole thing.

39:46 It can be a whole thing.

39:47 There's memory concerns.

39:48 But let's pretend the world is simple.

39:50 Anything that can be parallelized will be through Dagster.

39:54 And that's really the benefit of writing these DAGs is that there's a nice algorithm for determining what that actually looks like.
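
One such algorithm is a topological sort: run everything whose dependencies are finished, in parallel, then repeat. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) and a thread pool. It is an illustration of the general technique, not Dagster's scheduler, and the asset names `raw` and `report` and the helper `run_in_waves` are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Each key depends on the assets in its set (a diamond shape).
graph = {
    "raw": set(),
    "users": {"raw"},
    "orders": {"raw"},
    "report": {"users", "orders"},
}


def run_in_waves(graph, run_asset, max_workers=4):
    """Run every asset whose dependencies are done, one wave at a time."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            # Everything in `ready` is independent, so it can run in parallel.
            list(pool.map(run_asset, ready))
            sorter.done(*ready)
            waves.append(ready)
    return waves


waves = run_in_waves(graph, lambda name: name)
print(waves)
```

In the diamond, `users` and `orders` land in the same wave because neither depends on the other, and `report` waits for both.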

39:59 Yeah.

39:59 I guess if you have a diamond shape or any sort of split, right?

40:03 Those two things now become...

40:04 Because it's acyclical.

40:06 They can't turn around and then eventually depend on each other again.

40:09 So...

40:09 That's a perfect chance to just go fork it out.

40:12 Exactly.

40:12 And that's kind of where partitions are also kind of interesting.

40:15 If you have a partitioned asset, you could take your data set, partition it into five buckets and run all five partitions at once, knowing full well that because you've written this in an idempotent and partitioned way, the first pipeline will only operate on apples and the second one only operates on bananas.

40:31 And there is no commingling of apples and bananas anywhere in the pipeline.

40:35 Oh, that's interesting.

40:36 I hadn't really thought about using the partitions for parallelism, but of course...

40:39 Yeah.

40:39 It's a fun little way to break things apart.

40:43 So if we run this on the Dagster cloud or even on our own, this is pretty much automatic.

40:49 We don't have to do anything.

40:51 Like Dagster just looks at it and says, this looks parallelizable and it'll go or...

40:55 That's right.

40:55 Yeah.

40:55 As long as you've got the full deployment, whether it's OSS or cloud, Dagster will basically parallelize it for you as much as possible.

41:02 Excellent.

41:02 You can set global concurrency limits.

41:04 So you might say, you know, 64 is more than enough, you know, parallelization that I need.

41:09 Or maybe I want less because I'm worried about overloading systems.

41:12 It's really up to you.
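
A concurrency limit can be sketched with a semaphore: many tasks are submitted, but only a fixed number may be inside the guarded section at once. This is a generic pattern under stated assumptions, not Dagster's configuration. `run_with_limit`, `make_task`, and the limit of 2 are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_limit(tasks, limit, pool_size=8):
    """Run tasks in a thread pool while allowing at most `limit` at once.

    Returns the results in order plus the peak number of tasks
    that were running at the same time."""
    gate = threading.Semaphore(limit)
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def guarded(task):
        with gate:  # blocks while `limit` tasks are already running
            with lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            try:
                return task()
            finally:
                with lock:
                    state["running"] -= 1

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(guarded, tasks))
    return results, state["peak"]


def make_task(n):
    def task():
        time.sleep(0.01)  # pretend to call a fragile database or API
        return n * n
    return task


results, peak = run_with_limit([make_task(n) for n in range(10)], limit=2)
print(results, "peak concurrency:", peak)
```

Even with eight worker threads available, the peak never exceeds the limit, which is the behavior you want in front of a small database or a rate-limited vendor API.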

41:13 I'm putting this on a $10 server.

41:15 Please don't.

41:16 Please don't kill it.

41:18 Just respect that it's somewhat wimpy, but that's okay.

41:21 It'll get the job done.

41:22 It'll get the job done.

41:23 All right.

41:24 I want to talk about some of the tools and some of the tools that are maybe at play here when working with Dagster and some of the trends and stuff.

41:31 But before that, maybe speak to where you could see people adopt a tool like Dagster, but they generally don't.

41:38 They don't realize like, oh, actually, there's a whole framework for this, right?

41:42 Like I could, sure, I could go and build just an HTTP server and hook into the request and start writing to it.

41:50 But like, maybe I should use Flask or FastAPI.

41:52 Like there's these frameworks that we really naturally adopt for certain situations like APIs and others, background jobs, data pipelines, where I think there's probably a good chunk of people who could benefit from stuff like this.

42:06 But they just don't think they need a framework for it.

42:08 Like Cron is enough.

42:10 Yeah, it's funny because sometimes Cron is enough.

42:12 And I don't want to encourage people not to use Cron.

42:15 But think twice, at least, is what I would say.

42:19 So probably the first like trigger for me of thinking of, you know, is that actually a good choice is like, am I trying to ingest data from somewhere?

42:26 That's something that fails.

42:28 Like, I think we just can accept that, you know, if you're moving data around, the data source will break.

42:33 The expectations will change.

42:34 You will need to debug it.

42:36 You will need to run it.

42:37 And doing that in Cron is a nightmare.

42:38 So I would say definitely start to think about an orchestration system if you're ingesting data.

42:43 If you have a simple Cron job that sends one email, like, you're probably fine.

42:47 I don't think you need to implement all of Dagster just to do that.

42:50 But the closer you get to data pipelining, I think the better your life will be if you're not trying to debug an obtuse process that no one really understands six months from now.

43:04 Excellent.

43:04 All right.

43:05 Maybe we could touch on some of the tools that are interesting.

43:08 You see people using, you talked about DuckDB and DBT, a lot of D's starting here.

43:13 But give us a sense of like some of the supporting tools you see a lot of folks using that are interesting.

43:19 Yeah, for sure.

43:19 I think in the data space, probably DBT is one of the most popular choices.

43:25 And DBT, in many ways, it's nothing more than a command line tool that runs a bunch of SQL in a DAG as well.

43:34 So there's actually a nice fit with Dagster and DBT together.

43:37 DBT is really used by people who are trying to model that business process using SQL against typically a data warehouse.

43:45 So if you have your data in, for example, Postgres, Snowflake, Databricks, Microsoft SQL, these types of data warehouses, generally you're trying to model some type of business process.

43:57 And typically people use SQL to do that.

44:00 Now you can do this without DBT, but DBT has provided a nice clean interface to doing so.

44:06 It makes it very easy to connect these models together, to run them, to have a development workflow that works really well.

44:12 And then you can push it to prod and have things run again in production.

44:15 So that's DBT.

44:16 We find it works really well.

44:18 And a lot of our customers are actually using DBT as well.

44:21 There's DuckDB, which is a great, it's like the SQLite for columnar databases, right?

44:27 Yeah.

44:27 It's in process.

44:28 It's fast.

44:29 It's written by the Dutch.

44:30 There's nothing you can't like about it.

44:32 It's free.

44:32 We love that.

44:33 It feels very comfortable in Python itself.

44:36 It does.

44:37 It's so easy.

44:37 Yes, exactly.

44:39 The Dutch have given us so much and they've asked nothing of us.

44:42 So I'm always very thankful for them.

44:44 It's fast.

44:45 It's so fast.

44:45 It's like if you've ever used Pandas for processing large volumes of data, you've occasionally hit memory limits or inefficiencies in doing these large aggregates.

44:56 I won't go into all the reasons of why that is, but DuckDB sort of changes that because it's a fast serverless sort of C++ written tooling to do really fast vectorized work.

45:08 And by that, I mean like it works on columns.

45:10 So typically in like SQLite, you're doing transactions.

45:13 You're doing single row updates, writes, inserts, and SQLite is great at that.

45:18 Where typical transactional databases fail or aren't as powerful is when you're doing aggregates, when you're looking at an entire column, right?

45:25 Just the way they're architected.

45:26 If you want to know the average, the median, the sum of some large number of columns, and you want to group that by a whole bunch of things, you want to know the first date someone did something and the last one.

45:37 Those types of vectorized operations, DuckDB is really, really fast at doing.

45:42 And it's a great alternative to, for example, Pandas, which can often hit memory limits and be a little bit slow in that regard.

45:50 Yeah, it looks like it has some pretty cool aspects, transactions, of course.

45:53 But it also says direct Parquet, CSV, and JSON querying.

45:59 So if you've got a CSV file hanging around and you want to ask questions about it or JSON or some of the data science stuff through Parquet, turn an indexed proper query engine against it.

46:10 Don't just use a dictionary or something, right?

46:12 Yeah.

46:12 It's great for reading CSVs, zip files, tar files, Parquet, partitioned Parquet files, all that stuff that usually was really annoying to do and operate on.

46:22 You can now install DuckDB.

46:24 It's got a great CLI, too.

46:25 So before you go and program your entire pipeline, you just run DuckDB and you start writing SQL against CSV files and all this stuff to really understand your data and just really see how quick it is.

46:37 I used it on a bird data set that I had as an example project and there were millions of rows and I was joining them together and doing massive group bys and it was done in like seconds.

46:48 And it's just hard for me to believe that it was even correct because it was so quick.

46:52 So it's wonderful.

46:53 I must have done that wrong somehow because it's done.

46:57 It shouldn't be done.

46:57 Yeah.

46:58 Yeah.

46:58 And the fact it's in process means there's not a server for you to babysit, patch, make sure it's still running.

47:06 It's accessible, but not too accessible.

47:08 All that, right?

47:09 It's a pip install away, which is always, we love that, right?

47:12 Yeah, absolutely.

47:12 You mentioned, or I guess I mentioned Parquet, but also Apache Arrow seems like it's making its way into a lot of different tools as sort of foundational, high-performance in-memory processing.

47:25 Have you used this any?

47:26 I've used it, especially through like working through different languages.

47:29 So moving data between Python and R is where I last used this.

47:34 I know Arrow's great at that.

47:35 I believe Arrow is underneath some of the Rust-to-Python stuff as well.

47:40 It's working there.

47:41 So typically I don't use Arrow directly myself, but it's in much of the tooling I use.

47:47 Right.

47:47 It's a great product.

47:48 And like so much of the ecosystem is now built on Arrow.

47:52 Yeah, I think a lot of it is, I feel like the first time I heard about it was through Polars.

47:55 That's right.

47:56 Yeah.

47:56 I'm pretty sure, which is another Rust story for kind of like Pandas, but a little bit more fluent, lazy API.

48:04 Yes.

48:05 We live in such great times, to be honest.

48:07 So Polars is Python bindings for Rust, I believe, is kind of how I think about it.

48:13 It does all the transformation in Rust, but you have this Python interface to it and it makes things, again, incredibly fast.

48:20 I would say similar in speed to DuckDB.

48:22 They both are quite comparable sometimes.

48:24 Yeah, it also claims to have vectorized and columnar processing and all that kind of stuff.

48:29 Yeah, it's pretty incredible.

48:30 So not a drop-in replacement for Pandas, but if you have the opportunity to use it and you don't need to use the full breadth of what Pandas offers, because Pandas is quite a huge package.

48:40 There's a lot it does.

48:41 But if you're just doing simple transforms, I think Polars is a great option to explore.

48:45 Yeah, I talked to Ritchie Vink, who is part of that, and I think they explicitly chose to not try to make it a drop-in replacement for Pandas, but tried to choose an API that would allow the engine to be smarter and go like, I see you're asking for this, but the step before you wanted this other thing.

49:01 So let me do that transformation all in one shot and a little bit like a query optimization engine.

49:06 Yeah.

49:07 What else is out there?

49:08 We've got time for just a couple more.

49:09 If there's anything that you're like, oh, yeah, people use this all the time.

49:12 Obviously, the databases, you've said, Postgres, Snowflake, et cetera.

49:15 Yeah, there's so much.

49:17 So another little one I like, it's called DLT, DLT Hub.

49:20 It's getting a lot of traction as well.

49:23 And what I like about it is how lightweight it is.

49:25 I'm such a big fan of lightweight tooling that's not, you know, massive frameworks.

49:29 Loading data is, I think, still kind of yucky in many ways.

49:32 It's not fun.

49:33 And DLT makes it a little bit simpler and easier to do so.

49:36 So that's what I would recommend people just look into if you got to either ingest data from, you know, some API, some website, some CSV file.

49:45 It's a great way to do that.

49:47 It claims it's the Python library for data teams loading data into unexpected places.

49:52 Very interesting.

49:53 Yes, that's great.

49:54 Yeah, this looks cool.

49:55 All right.

49:56 Well, I guess maybe let's talk about, let's talk business and then we can talk about what's next and then we'll probably be out of time.

50:02 I'm always fascinated.

50:04 I think there's starting to be a bit of a blueprint for this, but companies that take a thing, they make it and they give it away and then they have a company around it.

50:11 And, you know, congratulations to you all for doing that.

50:14 Right.

50:14 And a lot of it seems to kind of center around the open core model, which I don't know if that's exactly how you would characterize yourself.

50:23 But maybe you talk about the business side, because I know there's many successful open source projects that don't necessarily result in full time jobs or companies if people were to want that.

50:32 It's a really interesting place.

50:33 And I don't think it's one that anyone has truly figured out well.

50:38 I can't say this is the way forward for everyone, but it is something we're trying.

50:41 And I think for Dexter, I think it's working pretty well.

50:44 And what I think is really powerful about Dexter is like the open source product is really, really good.

50:49 And it hasn't really been limited in many ways in order to drive like cloud product consumption.

50:55 Yeah, sure.

50:55 We really believe that there's actual value in that separation of these things.

50:58 There are some things that we just can't do in the open source platform.

51:01 For example, there's pipelines on cloud that involve, you know, ingesting data through our own systems in order to do reporting, which just doesn't make sense to do on the open source system.

51:10 It makes the product way too complex.

51:12 But for the most part, I think Dagster open source, like we really believe that like just getting it in the hands of developers is the best way to prove the value of it.

51:19 And if we can build a business on top of that, I think we're all super happy to do so.

51:23 It's nice that we get to sort of drive both sides of it.

51:27 To me, that's like one of the more exciting parts, right?

51:29 A lot of the development that we do in Dagster open source is driven by people who are paid through, you know, what happens on Dagster cloud.

51:36 And I think from what I can tell, there's no better way to build an open source product than to have people who are adequately paid to develop that product.

51:44 Otherwise, it can be, you know, a labor of love, but one that doesn't last for very long.

51:48 Yeah, whenever I think about building software, there's 80% of it that's super exciting and fun, 10%.

51:53 And then there's that little sliver of like really fine polish that if it's not just your job to make that thing polished, you're just for the most part, just not going to polish that bit, right?

52:04 It's tough. UI, design, support.

52:07 There's all these things that go into making software like really extraordinary.

52:11 That's really, really tough to do.

52:13 And I think I really like the open source business model.

52:16 I think for me, being able to just try something, not having to talk to sales and being able to just deploy locally and test it out and see if this works.

52:24 And if I choose to do so, deploy it in production.

52:28 Or if I bought the cloud product and I don't like the direction that it's going, I can even go to open source as well.

52:34 That's pretty compelling to me.

52:35 Yeah, for sure it is.

52:36 And I think the more moving pieces of infrastructure, the more uptime you want and all those types of things, the more somebody who's maybe a programmer but not a DevOps infrastructure person but needs to have it there, right?

52:50 Like that's an opportunity as well, right?

52:52 For you to say, look, you can write the code.

52:54 We made it cool for you to write the code, but you don't have to like get notified when the server's down or whatever.

53:00 Like we'll just take care of that for you.

53:01 That's pretty awesome.

53:02 Yeah.

53:02 And there's efficiencies of scale as well, right?

53:04 Like we've learned the same mistakes over and over again, so you don't have to, which is nice.

53:08 I don't know how many people want to maintain servers, but some people do.

53:12 And they're more than welcome to if that's how they choose to do so.

53:15 Yeah, for sure.

53:16 All right.

53:17 Just about out of time.

53:18 Let's close up our conversation with where are things going for Dagster?

53:22 Like what's on the roadmap?

53:24 What are you excited about?

53:25 Oh, that's a good one.

53:26 I think we've actually published our roadmap online somewhere.

53:29 If you search Dagster roadmap, it's probably out there.

53:31 I think for the most part, that hasn't changed much going into 2024, though we may update it.

53:36 There it is.

53:37 We're really just doubling down on what we've built already.

53:40 I think there's a lot of work we can do on the product itself to make it easier to use, easier to understand.

53:45 My team specifically is really focused around the education piece.

53:49 And so we launched Dagster University's first module, which helps you really understand the core.

53:54 Our next module is coming up in a couple months, and that'll be around using Dagster with DBT, which is our most popular integration.

54:01 We're building up more integrations as well.

54:03 So I've built a little integration called Embedded ELT that makes it easy to ingest data.

54:09 But I want to actually build an integration with DLT as well, DLT Hub.

54:12 So we'll be doing that.

54:13 And there's more coming down the pipe, but I don't know how much I can say.

54:17 Look forward to an event in April where we'll have a launch event on all that's coming.

54:23 Nice.

54:23 Is that an online thing people can attend or something?

54:26 Exactly.

54:27 Yeah, there'll be some announcement there on the Dagster website on that.

54:30 Maybe I will call out one thing that's actually really fun.

54:32 It's called Dagster Open Platform.

54:34 It's a GitHub repo that we launched a couple months ago, I want to say.

54:39 We took our internal...

54:41 I should go back one more.

54:42 Sorry.

54:42 It's like GitHub, Dagster Open Platform and GitHub.

54:45 I have it somewhere.

54:47 Yeah.

54:48 It's here under the organization?

54:51 Yes, it should be somewhere in here.

54:54 There it is.

54:54 Dagster Open Platform on GitHub.

54:56 And it's really a clone of our production pipelines for the most part.

55:00 There's some things we've chosen to ignore because they're sensitive.

55:03 But as much as possible, we've defaulted to making it public and open.

55:06 And the whole reason behind this was because, you know, as data engineers, it's often hard to see how other data engineers write code.

55:12 We get to see how software engineers write code quite often, but most people don't want to share their platforms for various good reasons.

55:19 Also, there's like smaller teams or maybe just one person.

55:23 And then like those pipelines are so integrated into your specific infrastructure, right?

55:29 So it's not like, well, here's a web framework to share, right?

55:32 Like, here's how we integrate into that one weird API that we have that no one else has.

55:36 So there's no point in publishing it to you, right?

55:38 That's typically how it goes.

55:39 Or they're so large that they're afraid that there's like some, you know,

55:43 important information that they just don't want to take the risk on.

55:46 Yep.

55:46 And so we built like something that's in the middle where we've taken as much as we can

55:49 and we've publicized it.

55:50 And you can't run this on your own.

55:52 Like, that's not the point.

55:53 The point is to look at the code and see, you know, how does Dagster use Dagster?

55:56 And what does that kind of look like?

55:57 Nice.

55:57 Okay.

55:58 All right.

55:58 Well, I'll put a link to that in the show notes and people can check it out.

56:01 Appreciate it.

56:02 Yeah.

56:02 I guess let's wrap it up with the final call to action.

56:05 People are interested in Dagster.

56:06 How do they get started?

56:07 What do you tell them?

56:07 Oh, yeah.

56:08 Dagster.io is probably the greatest place to start.

56:11 You can try the cloud product.

56:13 We have free self-serve or you can try the local install as well.

56:17 If you get stuck, a great place to join is our Slack channel, which is up on our website.

56:22 There's even an Ask AI channel where you can just talk to a Slack bot that's been trained

56:27 on all our GitHub issues and discussions.

56:29 And it's surprisingly good at walking you through, you know, any debugging, any issues or even advice.

56:34 That's pretty excellent, actually.

56:36 Yeah.

56:36 It's real fun.

56:37 It's really fun.

56:37 And if that doesn't work, we're also there in the community where you can just chat to

56:40 us as well.

56:41 Cool.

56:42 All right.

56:43 Well, Pedram, thank you for being on the show.

56:45 Thanks for all the work on Dagster and sharing it with us.

56:47 Thank you, Michael.

56:47 You bet.

56:48 See you later.

56:48 This has been another episode of Talk Python to Me.

56:52 Thank you to our sponsors.

56:53 Be sure to check out what they're offering.

56:55 It really helps support the show.

56:56 This episode is sponsored by Posit Connect from the makers of Shiny.

57:01 Publish, share, and deploy all of your data projects that you're creating using Python.

57:06 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards, and APIs.

57:12 Posit Connect supports all of them.

57:15 Try Posit Connect for free by going to talkpython.fm/posit, P-O-S-I-T.

57:22 Want to level up your Python?

57:23 We have one of the largest catalogs of Python video courses over at Talk Python.

57:27 Our content ranges from true beginners to deeply advanced topics like memory and async.

57:32 And best of all, there's not a subscription in sight.

57:35 Check it out for yourself at training.talkpython.fm.

57:38 Be sure to subscribe to the show.

57:40 Open your favorite podcast app and search for Python.

57:43 We should be right at the top.

57:44 You can also find the iTunes feed at /itunes, the Google Play feed at /play,

57:49 and the direct RSS feed at /rss on talkpython.fm.

57:53 We're live streaming most of our recordings these days.

57:56 If you want to be part of the show and have your comments featured on the air,

58:00 be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

58:04 This is your host, Michael Kennedy.

58:06 Thanks so much for listening.

58:08 I really appreciate it.

58:09 Now get out there and write some Python code.

58:11 I'll see you next time.
