
#454: Data Pipelines with Dagster Transcript

Recorded on Thursday, Jan 11, 2024.

00:00 Do you have data that you pull from external sources or that is generated and appears at

00:05 your digital doorstep?

00:06 I bet that data needs to be processed, filtered, transformed, distributed, and much more.

00:11 One of the biggest tools to create these data pipelines with Python is Dagster.

00:16 And we're fortunate to have Pedram Navid on the show to tell us about it.

00:20 Pedram is the head of data engineering and dev rel at Dagster Labs.

00:24 And we're talking data pipelines this week here at Talk Python.

00:28 This is Talk Python to Me, episode 454, recorded January 11th, 2024.

00:48 Welcome to Talk Python to Me, a weekly podcast on Python.

00:51 This is your host, Michael Kennedy.

00:53 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,

00:58 both on fosstodon.org.

01:01 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

01:06 We've started streaming most of our episodes live on YouTube.

01:10 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be

01:16 part of that episode.

01:19 This episode is sponsored by Posit Connect from the makers of Shiny.

01:23 Posit Connect lets you share and deploy all of your data projects that you're creating

01:26 using Python.

01:27 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards, and APIs.

01:34 Posit Connect supports all of them.

01:36 Try Posit Connect for free by going to talkpython.fm/posit, P-O-S-I-T.

01:42 And it's also brought to you by us over at Talk Python Training.

01:46 Did you know that we have over 250 hours of Python courses?

01:50 Yeah, that's right.

01:52 Check them out at talkpython.fm/courses.

01:56 Last week, I told you about our new course, Build an AI Audio App with Python.

02:01 Well, I have another brand new and amazing course to tell you about.

02:06 This time, it's all about Python's typing system and how to take the most advantage

02:10 of it.

02:11 It's a really awesome course called Rock Solid Python with Python Typing.

02:16 This is one of my favorite courses that I've created in the last couple of years.

02:20 Python type hints are really starting to transform Python, especially from the ecosystem's perspective.

02:26 Think FastAPI, Pydantic, BearType, et cetera.

02:30 This course shows you the ins and outs of Python typing syntax, of course, but it also

02:34 gives you guidance on when and how to use type hints.

02:38 Check out this four and a half hour in-depth course at talkpython.fm/courses.

02:44 Now onto those data pipelines.

02:47 Pedram, welcome to Talk Python to Me.

02:50 It's amazing to have you here.

02:51 >> Michael, thank you.

02:52 Good to be here.

02:53 >> Yeah.

02:54 We're going to talk about data, data pipelines, automation, and boy, oh boy, let me tell you,

02:59 have I been in the DevOps side of things this week.

03:03 And I'm going to have a special, special appreciation of it.

03:07 I can tell already.

03:08 So excited to talk.

03:10 >> My condolences.

03:12 >> Indeed.

03:13 >> So before we get to that, though, before we talk about Dagster and data pipelines and

03:18 orchestration more broadly, let's just get a little bit of background on you.

03:22 Introduce yourself for people.

03:23 How'd you get into Python and data orchestration and all those things?

03:27 >> Of course, yeah.

03:28 So my name is Pedram Navid.

03:29 I'm the head of data engineering and dev rel at Dagster.

03:33 That's a mouthful.

03:34 And I've been a longtime Python user since 2.7.

03:38 And I got started with Python like I do with many things just out of sheer laziness.

03:42 I was working at a bank and there was this rote task, something involving going into

03:47 servers, opening up a text file and seeing if a patch was applied to a server.

03:52 A nightmare scenario when there's 100 servers to check and 15 different patches to confirm.

03:57 >> Yeah, so this kind of predates like the cloud and all that automation and stuff, right?

04:02 >> This was definitely before cloud.

04:04 This was like right between Python 2 and Python 3, and we were trying to figure out how to

04:07 use print statements correctly.

04:09 That's when I learned Python.

04:10 I was like, there's got to be a better way.

04:12 And honestly, I've not looked back.

04:13 I think if you look at my entire career trajectory, you'll see it's just punctuated by finding

04:18 ways to be more lazy in many ways.

04:22 >> Yeah.

04:23 Who was it?

04:24 I think it was Matthew Rocklin that had the phrase something like productive laziness

04:29 or something like that.

04:31 Like, I'm going to find a way to leverage my laziness to force me to build automation

04:36 so I never ever have to do this sort of thing again.

04:39 I got that sort of print.

04:40 >> It's very motivating to not have to do something.

04:43 And I'll do anything to not do something.

04:44 >> Yeah, yeah, yeah.

04:45 It's incredible.

04:46 And like that DevOps stuff I was talking about, just, you know, one command and there's maybe

04:51 eight or nine new apps with all their tiers redeployed, updated, resynced.

04:55 And it took me a lot of work to get there.

04:59 But now I never have to think about it again, at least not for a few years.

05:03 And it's amazing.

05:04 I can just be productive.

05:05 It's like right in line with that.

05:07 >> So what are some of the Python projects you've been, you've worked on, talked about

05:11 different ways to apply this over the years?

05:13 >> Oh, yeah.

05:14 So it started with internal, just like Python projects, trying to automate, like I said,

05:18 some rote tasks that I had.

05:20 And that accidentally becomes, you know, a bigger project.

05:23 People see it and they're like, oh, I want that too.

05:25 And so, well, now I have to build like a GUI interface because most people don't speak

05:29 Python.

05:30 And so that got me into PyGUI, I think it was called, way back when.

05:35 That was a fun journey.

05:36 And then from there, it's really taken off.

05:38 A lot of it has been mostly personal projects.

05:41 Trying to understand open source was a really big learning path for me as well.

05:46 Really being absorbed by things like SQLAlchemy and requests back when they were coming out.

05:50 Eventually, it led to more of a data engineering type of role, where I got involved with tools

05:55 like Airflow and tried to automate data pipelines instead of patches on a server.

06:01 That one day led to, I guess, making a long story short, a role at Dagster, where now I

06:06 contribute a little bit to Dagster.

06:08 I work on Dagster, the core project itself, but I also use Dagster internally to build

06:11 our own data pipelines.

06:13 I'm sure it's interesting to see how you all both build Dagster and then consume Dagster.

06:19 Yeah, it's been wonderful.

06:21 I think there's a lot of great things about it.

06:23 One is like getting access to Dagster before it's fully released, right?

06:27 Internally, we dog food, new features, new concepts, and we work with the product team,

06:33 the engineering team, and say, "Hey, this makes sense.

06:35 This doesn't.

06:36 This works really well.

06:37 That doesn't." That feedback loop is so fast and so iterative that for me personally, being able to see

06:43 that come to fruition is really, really compelling.

06:45 But at the same time, I get to work at a place that's building a tool for me.

06:50 You don't often get that luxury.

06:51 I've worked in ads.

06:53 I've worked in insurance.

06:54 It's like banking.

06:55 These are nice things, but it's not built for me, right?

06:59 And so for me, that's probably been the biggest benefit, I would say.

07:01 Right.

07:02 If you work in some marketing thing, you're like, "You know, I retargeted myself so well

07:06 today.

07:07 You wouldn't believe it.

07:08 I really enjoyed it." I've seen the ads that I've created before, so it's a little fun, but it's not the same.

07:15 Yeah.

07:16 I've heard of people who are really, really good at ad targeting and finding groups where

07:21 they've pranked their wife or something, or just had an ad that would only show up for

07:26 their wife by running it.

07:27 It was so specific and freaked them out a little bit.

07:30 That's pretty clever.

07:31 Yeah.

07:32 Maybe it wasn't appreciated, but it is clever.

07:34 Who knows?

07:35 All right.

07:37 Well, before we jump in, you said that, of course, you built GUIs with PyGUI and those

07:43 sorts of things because people don't speak Python, back in the 2.7 days and whatever.

07:48 Is that different now?

07:49 Not that people speak Python, but is it different in the sense that, "Hey, I could give them

07:53 a Jupyter notebook," or, "I could give them Streamlit," or one of these things, right?

07:58 A little more or less you building and just plug it in?

08:01 I think so.

08:02 Like you said, it's not different in that most people probably still to this day don't

08:06 speak Python.

08:07 I know we had this movement a little bit back where everyone was going to learn SQL and

08:11 everyone was going to learn to code.

08:13 I was never that bullish on that trend because if I'm a marketing person, I've got 10,000

08:18 things to do and learning to code isn't going to be a priority ever.

08:22 So I think building interfaces for people that are easy to use and speak well to them

08:27 is always useful.

08:29 That never has gone away, but I think the tooling around it has been better, right?

08:32 I don't think I'll ever want to use PyGUI again.

08:34 Nothing wrong with the platform.

08:35 It's just not fun to write.

08:37 Streamlit makes it so easy to do that.

08:39 And there's something like Retool, and there's a thousand other ways now that you can bring

08:43 these tools in front of your stakeholders and your users that just wasn't possible before.

08:47 I think it's a pretty exciting time.

08:49 There are a lot of pretty polished tools.

08:51 Yeah, it's gotten so good.

08:52 Yeah.

08:53 There's some interesting ones like OpenBB.

08:54 Do you know that?

08:55 The financial dashboard thing.

08:58 I've heard of this.

08:59 I haven't seen it.

09:00 Yeah.

09:01 It's basically for traders, but it's like a terminal type thing that has a bunch of

09:05 Matplotlib and other interactive stuff that pops up kind of compared to say Bloomberg

09:10 dashboard type things.

09:13 But yeah, that's one sense where like maybe like traders go and learn Python because it's

09:18 like, all right, there's enough value here.

09:19 But in general, I don't think people are going to stop what they're doing and learn to

09:24 code.

09:25 So these new UI things aren't going away.

09:26 All right, let's dive in and talk about this general category first of data pipelines,

09:32 data orchestration, all those things.

09:33 We'll talk about Dagster and some of the trends and that.

09:36 So let's grab some random internet search for what does a data pipeline maybe look like?

09:41 But people out there listening who don't necessarily live in that space, which I think is honestly

09:47 many of us, maybe we should, but maybe in our minds, we don't think we live in data

09:51 pipeline land.

09:52 Tell them about it.

09:53 Yeah, for sure.

09:54 It is hard to think about if you haven't done or built one before.

09:57 In many ways, a data pipeline is just a series of steps that you apply to some dataset that

10:02 you have in order to transform it to something a little bit more valuable at the very end.

10:08 It's a simplified version, the devil's in the details, but really like at the end of

10:12 the day, you're in a business, the production of data sort of happens by the very nature

10:16 of operating that business.

10:17 It tends to be the core thing that all businesses have in common.

10:21 And then the other sort of output is you have people within that business who are trying

10:25 to understand how the business is operating.

10:27 And this used to be easy when all we had was a single spreadsheet that we could look at

10:31 once a month.

10:32 Yeah, I think businesses have gotten a little bit more complex these days.

10:36 Computers and automation.

10:37 And the expectations.

10:38 I expect to be able to see almost real time, not I'll see it at the end of the month sort

10:42 of.

10:43 That's right.

10:44 Yeah.

10:45 I think people have gotten used to getting data too, which is both good and bad.

10:47 Good in the sense that now people are making better decisions.

10:49 Bad, and then there's more work for us to do.

10:51 And we can't just sit on our feet for half a day, half a month waiting for the next request

10:55 to come in.

10:56 There's just an endless stream that seems to never end.

10:59 So that's what really a pipeline is all about.

11:00 It's like taking these data and making it consumable in a way that users, tools will

11:06 understand that helps people make decisions at the very end of the day.

11:09 That's sort of the nuts and bolts of it.

11:11 In your mind, does data acquisition live in this land?

11:14 So for example, maybe we have a scheduled job that goes and does web scraping, calls

11:19 an API once an hour, and that might kick off a whole pipeline of processing.

11:24 Or we watch a folder for people to upload over FTP, like a CSV file or something horrible

11:32 like that.

11:33 You don't even, it's unspeakable.

11:34 But something like that where you say, oh, a new CSV has arrived for me to get, right?

11:38 Yeah, I think that's the beginning of all data pipeline journeys in my mind, very much,

11:43 right?

11:44 Like an FTP, as much as we hate it, it's not terrible.

11:46 I mean, the worst, there are worse ways to transfer files, but it's, I think still very

11:52 much in use today.

11:53 And every data pipeline journey at some point has to begin with that consumption of data

11:57 from somewhere.

11:58 Yeah.

11:59 Hopefully it's SFTP, not just straight FTP.

12:02 Like the encrypted, don't just send your password in the plain text.

12:06 Oh, well, I've seen that go wrong.

12:09 That's a story for another day, honestly.

12:11 All right.

12:12 Well, let's talk about the project that you work on.

12:14 We've been talking about it in general, but let's talk about Dagster.

12:18 Like, where does it fit in this world?

12:20 Yes.

12:21 Dagster to me is a way to build a data platform.

12:24 It's also a different way of thinking about how you build data pipelines.

12:28 Maybe it's good to compare it with kind of what the world was like, I think, before Dagster

12:32 and how it came about to be.

12:34 So if you think of Airflow, I think it's probably the most canonical orchestrator out there,

12:39 but there are other ways which people used to orchestrate these data pipelines.

12:43 They were often task-based, right?

12:45 Like I would download file, I would unzip file, I would upload file.

12:50 These are sort of the words we use to describe the various steps within a pipeline.

12:55 Some of those little steps might be Python functions that you write.

12:59 Maybe there's some pre-built other ones.

13:00 Yeah, they might be Python.

13:02 Could be a bash script.

13:03 It'd be logging into a server and downloading a file.

13:05 Could be hitting request to download something from the internet, unzipping it.

13:09 Just a various, you know, hodgepodge of commands that would run.

13:12 That's typically how we thought about it.

13:13 For more complex scenarios where your data is bigger, maybe it's running against a Hadoop

13:17 cluster or a Spark cluster.

13:19 The compute's been offloaded somewhere else.

13:21 But the sort of conceptual way you tended to think about these things is in terms of

13:25 tasks, right?

13:26 Process this thing, do this massive data dump, run a bunch of things, and then your job is

13:31 complete.

13:32 With Airflow, or sorry, with Dagster, we kind of flip it around a little bit on our

13:35 heads and we say, instead of thinking about tasks, what if we flipped that around and

13:40 thought about the actual underlying assets that you're creating?

13:43 What if you told us not, you know, the step that you're going to take, but the thing that

13:47 you produce?

13:48 Because it turns out as people and as data people and stakeholders, really, we don't

13:52 care about the task, like we just assume that you're going to do it.

13:56 What we care about is, you know, that table, that model, that file, that Jupyter notebook.

14:01 And if we model our pipeline through that, then we get a whole bunch of other benefits.

14:06 And that's sort of the Dagster pitch, right?

14:09 Like if you want to understand the things that are being produced by these tasks, tell

14:13 us about the underlying assets.

14:15 And then when a stakeholder says and comes to you and says, you know, how old is this

14:19 table?

14:20 Has it been refreshed lately?

14:21 You don't have to go look at a specific task.

14:22 And remember that task ABC had model XYZ.

14:26 You just go and look up model XYZ directly there, and it's there for you.

14:29 And because you've defined things in this way, you get other nice things like a lineage

14:33 graph.

14:34 You get to understand how fresh your data is.

14:36 You can do event-based orchestration and all kinds of nice things that are a lot harder

14:39 to do in a task world.

14:41 Yeah, more declarative, less imperative, I suppose.

14:45 Yeah, it's been the trend, I think, in lots of tooling.

14:48 React, I think was famous for this as well, right?

14:51 In many ways.

14:52 It was a hard framework, I think, for people to sort of get their heads around initially

14:55 because we were so used to, like, the imperative jQuery style of doing things.

15:01 Yeah.

15:02 How do I hook the event that makes the thing happen?

15:03 And React said, let's think about it a little bit differently.

15:06 Let's do this event-based orchestration.

15:08 And I think the proof's in the pudding.

15:10 React's everywhere now and jQuery, maybe not so much.

15:13 Yeah.

15:14 There's still a lot of jQuery out there, but there's not a lot of active jQuery.

15:18 But I imagine there's some.

15:19 There is.

15:20 Yeah.

15:21 Just because people are like, you know what?

15:22 Don't touch that.

15:23 That works.

15:24 Which is probably the smartest thing people can do, I think.

15:27 Yeah, honestly.

15:28 Even though new frameworks are shiny.

15:30 And if there's any ecosystem that loves to chase the shiny new idea, it's the JavaScript

15:35 web world.

15:36 Oh, yeah.

15:37 There's no shortage of new frameworks coming out every time.

15:40 We do too, but not as much as like, that's six months old.

15:44 That's so old, we can't possibly do that anymore.

15:46 We're rewriting it.

15:47 We're going to do the big rewrite again.

15:48 Yep.

15:49 Fun.

15:50 So, Dagster is the company, but also is open source.

15:54 What's the story around like, can I use it for free?

15:57 Is it open source?

15:58 Do I pay for it?

15:59 100%.

16:00 Okay.

16:01 So, Dagster Labs is the company.

16:02 Dagster open source is the product.

16:03 It's 100% free.

16:04 We're very committed to the open source model.

16:06 I would say 95% of the things you can get out of Dagster are available through open source.

16:11 And we tend to try to release everything through that model.

16:14 You can run very complex pipelines, and you can deploy it all on your own if you wish.

16:19 There is a Dagster cloud product, which is really the hosted version of Dagster.

16:23 If you want a hosted plan, we can do that for you through Dagster cloud, but it all runs

16:27 on the same code base and the modeling and the files all essentially look the same.

16:32 Okay.

16:33 So obviously you could get, like I talked about at the beginning, you could go down

16:36 the DevOps side, get your own open source Dagster set up, schedule it, run it on servers,

16:41 all those things.

16:42 But if we just wanted something real simple, we could just go to you guys and say, "Hey,

16:47 I built this with Dagster.

16:48 Will you run it for me?" Pretty much.

16:50 Yeah.

16:51 Right.

16:52 So there's two options there.

16:53 You can do the serverless model, which says, "Dagster, just run it.

16:55 We take care of the compute, we take care of the execution for you, and you just write

16:58 the code and upload it to GitHub or any repository of your choice, and we'll sync to that and

17:04 then run it." The other option is to do the hybrid model.

17:06 So you basically do the CI/CD aspect.

17:09 You just say, you push to name your branch.

17:11 If you push to that branch, that means we're just going to deploy a new version and whatever

17:15 happens after that, it'll be in production, right?

17:18 Exactly.

17:19 Yeah.

17:20 And we offer some templates that you can use in GitHub for workflows in order to accommodate

17:23 that.

17:24 Excellent.

17:25 Then I cut you off.

17:26 You're saying something about hybrid.

17:27 Hybrid is the other option for those of you who want to run your own compute.

17:30 You don't want the data leaving your ecosystem.

17:32 You can say, "We've got this Kubernetes cluster, this ECS cluster, but we still want to use

17:37 a Dagster Cloud product to sort of manage the control plane.

17:40 Dagster Cloud will do that." And then you can go off and execute things on your own environment if that's something

17:44 you wish to do.

17:45 Oh, yeah.

17:46 Because running stuff in containers isn't too bad, but running container clusters, all

17:51 of a sudden you're back doing a lot of work, right?

17:55 Exactly.

17:56 Yeah.

17:57 Okay.

17:58 Well, let's maybe talk about Dagster for a bit.

17:59 I want to talk about some of the trends as well, but let's just talk through maybe setting

18:02 up a pipeline.

18:04 What does it look like?

18:05 You talked about in general, less imperative, more declarative, but what does it look like?

18:10 Be careful about talking about code on audio, but just give us a sense of what the programming

18:15 model feels like for us.

18:16 As much as possible, it really feels like just writing Python.

18:20 It's pretty easy.

18:21 You add a decorator on top of your existing Python function that does something.

18:25 That's a simple decorator called asset.

18:28 And then your pipeline, that function becomes a data asset.

18:31 That's how it's represented in the Dagster UI.

18:33 So you could imagine you've got a pipeline that gets like maybe Slack analytics and uploads

18:39 that to some dashboard, right?

18:41 Your first pipeline, your function will be called something like Slack data, and that

18:45 would be your asset.

18:46 In that function is where you do all the transform, the downloading of the data until you've really

18:51 created that fundamental data asset that you care about.

18:53 And that could be stored either in a data warehouse to S3, however you sort of want

18:58 to persist it, that's really up to you.
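
For readers following along, here's a minimal sketch of what that looks like; the asset decorator is the one Pedram describes, while the function name and the data it returns are invented for illustration:

```python
from dagster import asset


@asset
def slack_data():
    # Pretend this calls the Slack API; the fetching itself is ordinary Python.
    return [{"channel": "#general", "messages": 42}]
```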

19:00 And then the resources is sort of where the power, I think, of a lot of Dagster comes in.

19:04 So the asset is sort of like declaration of the thing I'm going to create.

19:08 The resource is how I'm going to operate on that, right?

19:12 Because sometimes you might want to have a, let's say a DuckDB instance locally, because

19:17 it's easier and faster to operate.

19:18 But when you're moving to the cloud, you want to have a Databricks or a Snowflake.

19:23 You can swap out resources based on environments and your asset can reference that resource.

19:28 And as long as it has that same sort of API, you can really flexibly change between where

19:32 that data is going to be persisted.
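
As a rough sketch of that resource idea, under the assumption of Dagster's Pythonic resources and with all class, method, and connection-string names invented: the asset depends only on the resource's interface, and the concrete backend can be swapped out when you deploy.

```python
from dagster import ConfigurableResource, Definitions, asset


class WarehouseResource(ConfigurableResource):
    # Hypothetical resource: wrap DuckDB locally, Snowflake or Databricks in prod.
    connection_string: str

    def run_query(self, sql: str) -> None:
        print(f"running {sql!r} against {self.connection_string}")


@asset
def slack_summary(warehouse: WarehouseResource):
    # The asset only knows the resource's interface, not which backend it is.
    warehouse.run_query("select count(*) from slack_data")


defs = Definitions(
    assets=[slack_summary],
    resources={"warehouse": WarehouseResource(connection_string="duckdb://local.duckdb")},
)
```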

19:34 Does Dagster know how to talk to those different platforms?

19:37 Does it like natively understand DuckDB and Snowflake?

19:40 Yeah.

19:41 Interesting.

19:42 People often look at Dagster and ask, "Oh, does it do X?" And the answer is, "Dagster does anything you can do with Python."

19:48 Which is most things, yeah.

19:49 Which is most things.

19:50 So I think if you come from the Airflow world, you're very much used to like these Airflow

19:54 providers and if you want to use...

19:55 That's kind of what I was thinking, yeah.

19:56 Yeah.

19:57 You want to use a Postgres, you need to find the Postgres provider.

19:59 You want to use S3, you need to find the S3 provider.

20:01 With Dagster, you kind of say you don't have to do any of that.

20:04 If you want to use Snowflake, for example, install the Snowflake connector package from

20:08 Snowflake and you use that as a resource directly.

20:11 And then you just run your SQL that way.

20:13 There are some places where we do have integrations that help if you want to get into the weeds

20:18 with the I/O manager, which is where we persist the data on your behalf.

20:21 And so for S3, for Snowflake, for example, there's other ways where we can persist that

20:26 data for you.

20:27 But if you're just trying to run a query, just trying to execute something, just trying

20:30 to save something somewhere, you don't have to use that system at all.

20:33 You can just use whatever Python package you would use anyway to do that.

20:38 So maybe some data is expensive for us to get as a company, like maybe we're charged

20:43 on a usage basis or super slow or something.

20:46 I could write just Python code that goes and say, well, look in my local database.

20:50 If it's already there, use that and it's not too stale.

20:53 Otherwise, then do actually go get it, put it there and then get it back.

20:57 And like that kind of stuff would be up to me to put together.

21:00 Yeah.

21:01 And that's the nice thing is you're not really limited by like anyone's data model or worldview

21:06 on how data should be retrieved or saved or augmented.

21:09 You could do it a couple of ways.

21:10 You could say whenever I'm working locally, use this persistent data store that we're

21:15 just going to use for development purposes.

21:18 Fancy database called SQLite, something like that.

21:20 Exactly.

21:21 Yes.

21:22 A wonderful database.

21:23 Actually, it is.

21:24 Yeah.

21:25 It'll work really, really well.

21:26 And then you just say when I'm in a different environment, when I'm in production, swap

21:28 out my SQLite resource for, name your favorite cloud warehouse resource, and go fetch that

21:33 data from there.

21:34 Or I want to use MinIO locally.

21:36 I want to use S3 on prod.

21:39 It's very simple to swap these things out.
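
A hedged sketch of that swap, with the resource class, bucket names, endpoints, and the environment variable all invented: keep one definition per environment and pick at load time.

```python
import os

from dagster import ConfigurableResource, Definitions, asset


class BlobStore(ConfigurableResource):
    # Hypothetical storage resource: MinIO locally, S3 in production.
    endpoint_url: str
    bucket: str


@asset
def exported_report(blob_store: BlobStore):
    print(f"writing report to {blob_store.endpoint_url}/{blob_store.bucket}")


# Pick the backend from an environment variable (the variable name is made up).
resources_by_env = {
    "local": BlobStore(endpoint_url="http://localhost:9000", bucket="dev-reports"),
    "prod": BlobStore(endpoint_url="https://s3.amazonaws.com", bucket="prod-reports"),
}

defs = Definitions(
    assets=[exported_report],
    resources={"blob_store": resources_by_env[os.getenv("DAGSTER_ENV", "local")]},
)
```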

21:40 Okay.

21:41 Yeah.

21:42 So it looks like you build up these assets as y'all call them, these pieces of data,

21:46 Python code that accesses them.

21:48 And then you have a nice UI that lets you go and build those out kind of workflow style,

21:54 right?

21:55 Yeah, exactly.

21:56 This is where we get into the wonderful world of DAGs, which stands for directed acyclic

22:00 graph.

22:01 So basically it means a bunch of things that are not connected in a circle, but are

22:04 connected in some way.

22:05 So there can't be any loops, right?

22:07 Because then you never know where to start or where to end.

22:09 Could be an assignment, but not a circle.

22:11 Not a circle.

22:12 As long as there's like a path through this dataset, with a beginning and an end, then

22:18 we can kind of start to model this connected graph of things.

22:21 And then we know how to execute them, right?

22:23 We can say, well, this is the first thing we have to run because that's where all dependencies

22:26 start.

22:27 And then we can branch off in parallel or we can continue linearly until everything

22:31 is complete.

22:32 And if something breaks in the middle, we can resume from that broken spot.
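
For a concrete sense of how that graph falls out of the code, here's a tiny sketch with invented asset names: a downstream asset declares its upstream dependency simply by naming it as a function parameter, and the DAG (and any parallelism) follows from that.

```python
from dagster import asset


@asset
def raw_orders():
    # Pretend this downloads the raw export.
    return [{"id": 1, "total": 19.99}, {"id": 2, "total": 0.0}]


@asset
def cleaned_orders(raw_orders):
    # Depends on raw_orders simply by naming it as a parameter.
    return [o for o in raw_orders if o["total"] > 0]


@asset
def order_summary(cleaned_orders):
    # Third node in the graph; independent branches could run in parallel.
    return {"count": len(cleaned_orders)}
```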

22:35 Okay, excellent.

22:36 And is that the recommended way?

22:38 Like if I write all this Python code that works on the pieces, then the next recommendation

22:42 would be to fire up the UI and start building it?

22:45 Or do you say, ah, you should really write it in code and then you can just visualize

22:49 it or monitor it?

22:50 Everything in Dagster is written as code.

22:52 The UI reads that code and it interprets it as a DAG and then it displays that for you.

22:57 There are some things you do with the UI, like you can materialize assets, you can make

23:00 them run, you can do backfills, you can view metadata, you can sort of enable and disable

23:06 schedules.

23:07 But the core, we really believe this is Dagster, like the core declaration of how things are

23:11 done, it's always done through code.

23:13 Okay, excellent.

23:14 So when you say materialize, maybe I have an asset, which is really a Python function

23:19 I wrote that goes and pulls down a CSV file.

23:22 The materialize would be, I want to see kind of representative data in this, in the UI.

23:28 And so I could go, all right, I think this is right.

23:30 Let's keep passing it down.

23:31 Is that what that means?

23:33 Materialize really means just run this particular asset, make this asset new again, fresh again,

23:37 right?

23:38 As part of that materialization, we sometimes output metadata.

23:41 And you can kind of see this on the right, if you're looking at the screen here, where

23:44 we talk about what the timestamp was, the URL, there's a nice little graph of like number

23:49 of rows over time.

23:51 All that metadata is something you can emit, and we emit some ourselves by default with

23:56 the framework.

23:57 And then as you materialize these assets, as you run that asset over and over again,

24:00 over time, we capture all that.

24:01 And then you can really get a nice overview of, you know, this asset's lifetime, essentially.
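
A rough sketch of emitting that metadata yourself, assuming a recent Dagster version with the MaterializeResult API; the asset name, the numbers, and the URL are invented:

```python
from dagster import MaterializeResult, MetadataValue, asset


@asset
def slack_data():
    rows = [{"channel": "#general", "messages": 42}]  # pretend fetch
    return MaterializeResult(
        metadata={
            "num_rows": len(rows),
            "source": MetadataValue.url("https://api.slack.com/"),
        }
    )
```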

24:06 Nice.

24:07 I think the asset, the metadata is really pretty excellent, right?

24:10 Over time, you can see how the data's grown and changed.

24:13 And yeah, the metadata is really powerful.

24:16 And it's one of the nice benefits of being in this asset world, right?

24:19 Because you don't really want metadata on, like, this task that ran, you want to know

24:23 like this table that I created, how many rows has it had every single time it's run?

24:27 If that number drops by like 50%, that's a big problem.

24:31 Conversely, if the runtime is slowly increasing every single day, you might not notice it,

24:35 but over a month or two, it went from a 30 second pipeline to 30 minutes.

24:40 Maybe there's like a great place to start optimizing that one specific asset.

24:43 Right.

24:44 And what's cool, if it's just Python code, you know how to optimize that probably, right?

24:48 Hopefully, yeah.

24:49 Well, as much as you're going to, yeah, you got, you have all the power of Python and

24:54 you should be able to, as opposed to it's deep down inside some framework that you don't

24:57 really control.

24:58 Exactly.

24:59 Yeah.

25:00 You use Python, you can benchmark it.

25:01 There's probably, you probably knew you didn't write it that well when you first started

25:04 and you can always find ways to improve it.

25:07 So this UI is something that you can just run locally, kind of like Jupyter.

25:11 100%.

25:12 Just type dagster dev and then you get the full UI experience.

25:16 You get to see the runs, all your assets.

25:17 Is it a web app?

25:18 It is.

25:19 Yeah.

25:20 It's a web app.

25:21 There's a Postgres backend.

25:22 And then there's a couple of services that run the web server, the GraphQL, and then

25:25 the workers.

25:26 Nice.

25:27 Yeah.

25:28 So pretty serious web app, it sounds like, but you probably just run it all.

25:31 Yeah.

25:33 Just something you run, probably all containers or something, that you just fire up when you download

25:37 it, right?

25:38 Locally, it doesn't even use containers.

25:39 It's just all pure Python for that.

25:43 But once you deploy, yeah, I think you might want to go down the container route, but it's

25:46 nice not having to have Docker just to run a simple test deployment.

25:50 Yeah.

25:51 I guess not everyone's machine has that, for sure.

25:53 So question from the audience here.

25:56 Jazzy asks, does it hook into AWS in particular?

25:59 Is it compatible with existing pipelines like ingestion lambdas or transform lambdas?

26:04 Yeah, you can hook into AWS.

26:06 So we have some AWS integrations built in.

26:09 Like I mentioned before, there's nothing stopping you from importing Boto3 and doing anything

26:13 really you want.

26:14 So a very simple use case.

26:15 Like let's say you already have an existing transformation being triggered in AWS through

26:20 some lambda.

26:21 You could just model that within Dagster and say, you know, trigger that lambda with Boto3.

26:25 Okay.

26:26 Then the asset itself is really that representation of that pipeline, but you're not actually running

26:31 that code within Dagster itself.

26:32 That's still occurring on the AWS framework.

26:34 And that's a really simple way to start adding a little bit of observability and orchestration

26:38 to existing pipelines.
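
A hedged sketch of that pattern: the asset is just a wrapper around a plain boto3 call, with the Lambda function name and payload invented for the example.

```python
import json

import boto3
from dagster import asset


@asset
def transformed_events():
    # The heavy lifting still happens in AWS; Dagster just triggers the Lambda
    # and records the materialization. The function name is hypothetical.
    client = boto3.client("lambda")
    response = client.invoke(
        FunctionName="transform-events",
        Payload=json.dumps({"run_date": "2024-01-11"}),
    )
    return json.loads(response["Payload"].read())
```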

26:40 Okay.

26:41 That's pretty cool because now you have this nice UI and these metadata in this history,

26:45 but it's someone else's cloud.

26:47 Exactly.

26:48 Yeah.

26:49 And you can start to pull more information in there.

26:50 And over time you might decide, you know, this, you know, lambda that I had, it's starting

26:54 to get out of hand.

26:55 I want to kind of break it apart into multiple assets where I want to sort of optimize it

26:59 a little way, and Dagster can help you along with that.

27:01 Yeah.

27:02 Excellent.

27:03 How do you set up like triggers or observability inside Dagster?

27:08 Like Jazzy's asking about S3, but like in general, right?

27:11 If a row is entered into a database, something is dropped in a blob storage or the date changes.

27:16 I don't know.

27:17 Yeah.

27:18 Those are great questions.

27:19 You have a lot of options.

27:20 In Dagster, we do model every asset with a couple little flags, I think, that are really

27:24 useful to think about.

27:25 One is whether the code of that particular asset has changed, right?

27:28 And then the other one is whether anything upstream of that asset has changed.

27:32 And those two things really power a lot of automation functionality that we can get downstream.

27:38 So let's start with the S3 example, it's the easiest to understand.

27:41 You have a bucket and there is a file that gets uploaded every day.

27:46 You don't know what time that file gets uploaded.

27:48 You don't know when it'll be uploaded, but you know at some point it will be.

27:51 In Dagster, we have a thing called a sensor, which you can just connect to an S3 location.

27:55 You can define how it looks into that file or into that folder.

27:59 And then you would just poll every 30 seconds until something happens.

28:02 When that something happens, that triggers sort of an event.

28:06 And that event can trickle at your will downstream to everything that depends on it as you sort

28:10 of connect to these things.

28:12 So it gets you away from this, like, "Oh, I'm going to schedule something to run every

28:15 hour.

28:16 Maybe the data will be there, but maybe it won't." And you can have a much more event-based workflow.

28:20 When this file runs, I want everything downstream to know that this data has changed.

28:25 And as sort of data flows through these systems, everything will sort of work its way down.
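
Here's a rough sketch of that sensor idea; the bucket, prefix, job name, and cursor logic are all invented for the example, and the asset selection is just a placeholder.

```python
import boto3
from dagster import RunRequest, SkipReason, define_asset_job, sensor

# Hypothetical job that materializes everything downstream of the new file.
daily_file_job = define_asset_job("daily_file_job", selection="*")


@sensor(job=daily_file_job, minimum_interval_seconds=30)
def new_s3_file_sensor(context):
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="incoming-data", Prefix="daily/")
    keys = sorted(obj["Key"] for obj in resp.get("Contents", []))

    # Remember the last key we handled so each file only triggers once.
    cursor = context.cursor or ""
    new_keys = [k for k in keys if k > cursor]
    if not new_keys:
        yield SkipReason("no new files yet")
        return

    context.update_cursor(new_keys[-1])
    for key in new_keys:
        yield RunRequest(run_key=key)
```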

28:28 Yeah, I like it.

28:32 This portion of Talk Python to Me is brought to you by Posit, the makers of Shiny, formerly

28:36 RStudio, and especially Shiny for Python.

28:40 Let me ask you a question.

28:41 Are you building awesome things?

28:43 Of course you are.

28:44 You're a developer or a data scientist.

28:46 That's what we do.

28:47 And you should check out Posit Connect.

28:49 Posit Connect is a way for you to publish, share, and deploy all the data products that

28:54 you're building using Python.

28:56 People ask me the same question all the time.

28:59 "Michael, I have some cool data science project or notebook that I built.

29:02 How do I share it with my users, stakeholders, teammates?

29:05 Do I need to learn FastAPI or Flask or maybe Vue or ReactJS?" Hold on now.

29:11 Those are cool technologies, and I'm sure you'd benefit from them, but maybe stay focused

29:15 on the data project?

29:16 Let Posit Connect handle that side of things.

29:19 With Posit Connect, you can rapidly and securely deploy the things you build in Python.

29:23 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards, and APIs.

29:30 Posit Connect supports all of them.

29:32 And Posit Connect comes with all the bells and whistles to satisfy IT and other enterprise

29:37 requirements.

29:38 Make deployment the easiest step in your workflow with Posit Connect.

29:42 For a limited time, you can try Posit Connect for free for three months by going to talkpython.fm/posit.

29:49 That's talkpython.fm/POSIT.

29:52 The link is in your podcast player show notes.

29:54 Thank you to the team at Posit for supporting Talk Python.

29:59 The sensor concept is really cool because I'm sure that there's a ton of cloud machines

30:05 people provisioned just because this thing runs every 15 minutes, that runs every 30

30:10 minutes, and you add them up and in aggregate, we need eight machines just to handle the

30:14 automation, rather than, you know, because they're hoping to catch something without

30:18 too much latency, but maybe like that actually only changes once a week.

30:22 Exactly.

30:23 And I think that's where we have to like sometimes step away from the way we're so used to thinking

30:27 about things, and I'm guilty of this.

30:30 When I create a data pipeline, my natural inclination is to create a schedule where

30:33 it's a, is this a daily one?

30:34 Is this weekly?

30:35 Is this monthly?

30:36 But what I'm finding more and more is when I'm creating my pipelines, I'm not adding

30:39 a schedule.

30:40 I'm using Dagster's auto-materialize policies, and I'm just telling it, you figure it out.

30:45 I don't have to think about schedules.

30:46 Just figure out when the things should be updated.

30:49 When it's, you know, parents have been updated, you run.

30:51 When the data has changed, you run.

30:53 And then just like figure it out and leave me alone.
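
A minimal sketch of what that looks like, assuming the AutoMaterializePolicy API Dagster had around the time of this recording; the asset names and data are invented.

```python
from dagster import AutoMaterializePolicy, asset


@asset
def podcast_feed():
    return ["episode-454"]  # pretend download


@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def search_index(podcast_feed):
    # No schedule: Dagster refreshes this whenever podcast_feed has newer data,
    # instead of blindly re-running every hour.
    return {"documents": len(podcast_feed)}
```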

30:55 Yeah.

30:56 And it's worked pretty well for me so far.

30:57 I think it's great.

30:58 I have a job that refreshes the search index on the various podcast pages, and

31:04 it runs every hour, but the podcast ships weekly, right?

31:08 But I don't know which hour it is.

31:10 And so it seems like that's enough latency, but it would be way better to put just a little

31:14 bit of smarts.

31:15 Like what was the last date that anything changed?

31:18 Was that since the last time you saw it?

31:20 Maybe we'll just leave that alone, you know?

31:21 But yeah, you're starting to inspire me to go write more code, but pretty cool.

31:26 All right.

31:27 So on the homepage at Dagster.io, you've got a nice graphic that shows you both how to

31:33 write the code, like some examples of the code, as well as how that looks in the UI.

31:38 And one of them is called, says to launch backfills.

31:41 What is this backfill thing?

31:43 Oh, this is my favorite thing.

31:44 Okay.

31:45 So when you first start your data journey as a data engineer, you sort of have a pipeline

31:50 and you build it and it just runs on a schedule and that's fine.

31:54 What you soon find is, you know, you might have to go back in time.

31:58 You might say, I've got this data set that updates monthly.

32:01 Here's a great example, AWS cost reporting, right?

32:05 AWS will send you some data around, you know, all your instances and your S3 bucket, all

32:10 that.

32:11 And it'll update that data every day or every month or whatever have you.

32:14 Due to some reason, you've got to go back in time and refresh data that AWS updated

32:18 due to some like discrepancy.

32:20 Backfill is sort of how you do that.

32:21 And it works hand in hand with this idea of a partition.

32:24 A partition is sort of how your data is naturally organized.

32:28 And it's like a nice way to represent that natural organization.

32:31 Has nothing to do with like the fundamental way, how often you want to run it.

32:35 It's more around like, I've got a data set that comes in once a month, it's represented

32:39 monthly.

32:40 It might be updated daily, but the representation of the data is monthly.

32:43 So I will partition it by month.

32:44 It doesn't have to be dates.

32:46 It could be strings.

32:47 It could be a list.

32:48 You could have a partition for every company or every client or every domain you have.

32:55 Whatever you sort of think is a natural way to think about breaking apart that pipeline.

33:00 And once you do that partition, you can do these nice things called backfills, which

33:03 says instead of running this entire pipeline and all my data, I want you to pick that one

33:08 month where data went wrong or that one month where data was missing and just run the partition

33:13 on that range.

33:14 And so you limit compute, you save resources and get a little bit more efficient.

33:18 And it's just easier to like, think about your pipeline because you've got this natural

33:22 built in partitioning system.
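
Roughly, a partitioned asset might look like the sketch below, with the partition scheme, asset name, and returned values invented; a backfill then just runs this same function over whichever range of partition keys you select.

```python
from dagster import AssetExecutionContext, MonthlyPartitionsDefinition, asset


@asset(partitions_def=MonthlyPartitionsDefinition(start_date="2023-01-01"))
def aws_cost_report(context: AssetExecutionContext):
    # Each run handles exactly one month, e.g. "2023-06-01", so a backfill can
    # re-run just the months the vendor corrected.
    month = context.partition_key
    return {"month": month, "total_usd": 0.0}  # pretend cost pull
```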

33:23 Excellent.

33:24 So maybe you missed some important event.

33:27 Maybe your automation went down for a little bit, came back up.

33:31 You're like, Oh no, we've, we've missed it.

33:33 Right.

33:34 But you don't want to start over for three years.

33:37 So maybe we could just go and run the last day's

33:39 worth of data.

33:40 Exactly.

33:41 Or another one would be your vendor says, Hey, by the way, we actually screwed up.

33:45 We uploaded this file from two months ago, but the numbers were all wrong.

33:48 Yeah.

33:49 We've uploaded a new version to that destination.

33:51 Can you update your data set?

33:53 One way is to recompute the entire universe from scratch.

33:56 But if you've partitioned things, you can say, no, limit that to just this one partition

34:00 for that month and that one partition can trickle down all the way to all your other

34:03 assets that depend on that.

34:05 Do you have to pre decide, do you have to think about this partitioning beforehand or

34:10 can you do it retroactively?

34:11 You could do it retroactively.

34:12 And I have done that before as well.

34:14 It really depends on, on where you're at.

34:16 If it's your first asset ever,

34:19 probably don't bother with partitions, but it really isn't a lot of work to get them

34:23 started.

34:24 Okay.

34:25 Yeah.

34:26 Really neat.

34:27 I like a lot of the ideas here.

34:28 I like that.

34:29 It's got this visual component that you can see what's going on, inspect it.

34:33 Just so you can debug runs or what happens there.

34:36 Like obviously when you're pulling data from many different sources, maybe it's not your

34:40 data you're taking in.

34:41 Fields could vanish.

34:42 It can be the wrong type.

34:43 Systems can go down.

34:44 I'm sure, sure the debugging is interesting.

34:47 So what's, it looks a little bit kind of like a web browser debug dev tools thing.

34:52 So for the record, my code never fails.

34:54 I've never had a bug in my life, but for the one that you have.

34:57 Yeah.

34:58 Well, mine doesn't either.

34:59 I only do it to make an example for others, yes.

35:03 If I do, it's intentional, of course.

35:05 Yeah.

35:06 To humble myself a little bit.

35:08 Exactly.

35:09 This view is one of my favorites, I mean, there are so many favorite views, but it's actually

35:13 really fun to watch this run when you execute this pipeline.

35:16 But really like, let's go back to, you know, the world before orchestrators, we use cron,

35:22 right?

35:23 We'd have a bash script that would do something and we'd have a cron job that said, make sure

35:26 this thing runs.

35:27 And then hopefully it was successful, but sometimes it wasn't.

35:31 And it's a, sometimes it wasn't, that's always been the problem, right?

35:34 It's like, well, what do I do now?

35:35 How do I know why it failed?

35:36 What was, when did it fail?

35:38 You know, what, at what point or what steps did it fail?

35:41 That's really hard to do.

35:42 But this debugger really is a structured log of every step that's been going on through

35:47 your pipeline, right?

35:48 So in this view, there's three assets that we can kind of see here.

35:52 One is called users.

35:53 One is called orders and one is to run dbt.

35:56 So presumably there's these two, you know, tables that are being updated and then a dbt

36:00 job.

36:01 It looks like that's being updated at the very end.

36:03 Once you execute this pipeline, all the logs are captured from each of those assets.

36:07 So you can manually write your own logs.

36:10 You have access to a Python logger and you can use your info, your error, whatever have

36:14 you in log output that way.

36:16 And it'll be captured in a structured way, but it also captures logs from your integrations.

36:21 So if you're using dbt, we capture those logs as well.

36:24 You can see it processing every single asset.

36:26 So if anything does go wrong, you can filter down and understand at what step, at what

36:31 point did something go wrong.
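
For example, a small sketch of that logging from inside an asset, with the asset name and messages invented; these calls land in the structured run log alongside whatever the integrations emit.

```python
from dagster import AssetExecutionContext, asset


@asset
def orders(context: AssetExecutionContext):
    context.log.info("downloading the orders export")
    rows = [{"id": 1}, {"id": 2}]  # pretend download
    if len(rows) < 10:
        context.log.warning(f"only {len(rows)} rows in this export")
    return rows
```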

36:33 - That's awesome.

36:34 And just the historical aspect, cause just going through logs, especially multiple systems

36:40 can be really, really tricky to figure out what's the problem, what actually caused this

36:44 to go wrong, but come back and say, oh, it crashed, pull up the UI and see, all right,

36:49 well show me, show me what this run did, show me what this job did.

36:53 And it seems like it's a lot easier to debug than your standard web API or something like

36:57 that.

36:58 - Exactly.

36:59 You can click on any of these assets and get that metadata that we had earlier as well.

37:02 If you know, one step failed and it's kind of flaky, you can just click on that one step

37:06 and say, just rerun this.

37:08 Everything else is fine.

37:09 Instead of a restart from scratch.

37:10 - Okay.

37:11 And it'll keep the data from before, so you don't have to rerun that.

37:15 - Yeah.

37:16 I mean, it depends on how you built the pipeline.

37:18 We like to build idempotent pipelines, is how we sort of talk about it in the data engineering

37:21 landscape, right?

37:23 So you should be able to run something multiple times and not break anything in a perfect

37:27 world.

37:28 That's not always possible, but ideally, yes.

37:30 And so we can presume that if users completed successfully, then we don't have to run that

37:34 again because that data was persisted, you know, to a database or S3 somewhere.

37:38 And if orders was the one that was broken, we can just only run orders and not have to

37:43 worry about rewriting the whole thing from scratch.

37:45 - Excellent.

37:46 So idempotent, for people who maybe don't know: you run it once or you perform the operation

37:50 once or you perform it 20 times, same outcome, no extra side effects, right?

37:55 - That's the idea.

37:57 - Easier said than done sometimes.

37:58 - It sure is.

37:59 - Sometimes it's easy, sometimes it's very hard, but the more you can build pipelines

38:03 that way, the easier your life becomes in many ways.
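
One common way to get there, sketched with DuckDB and invented table and file names: have each run replace the slice it owns instead of appending, so running the same day twice leaves the table in the same state.

```python
import duckdb


def write_orders_for_day(day: str, rows) -> None:
    # Delete-then-insert for this day's slice: re-running the same day leaves
    # the table in the same state instead of duplicating rows.
    con = duckdb.connect("analytics.duckdb")
    con.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_date TEXT, order_id INTEGER, total DOUBLE)"
    )
    con.execute("DELETE FROM orders WHERE order_date = ?", [day])
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    con.close()


write_orders_for_day("2024-01-11", [("2024-01-11", 1, 19.99), ("2024-01-11", 2, 5.00)])
```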

38:07 - Exactly.

38:08 Not always, but generally true for programming as well, right?

38:10 If you talk to functional programming people, they'll say like, it's an absolute, but.

38:14 - Yes, functional programmers love this kind of stuff.

38:17 And it actually does lend itself really well to data pipelines.

38:21 Data pipelines, unlike maybe some of the software engineering stuff, it's a little bit different

38:24 in that the data changing is what causes often most of the headaches, right?

38:30 It's less so the actual code you write, but more the expectation tends to change so frequently

38:36 and so often in new and novel, interesting ways that you would often never expect.

38:41 And so the more you can sort of make that function so pure that you can provide any

38:46 sort of dataset and really test really easily these expectations when they get broken, the

38:51 easier it is to sort of debug these things and build on them in the future.

38:55 - Yeah.

38:56 And cache them as well.

38:57 - Yes, it's always nice.

38:58 - Yeah.

38:59 So speaking of that kind of stuff, like what's the scalability story?

39:02 If I've got some big, huge, complicated data pipeline, can I parallelize them and have

39:08 them run multiple pieces?

39:10 Like if there's different branches or something like that?

39:12 - Yeah, exactly.

39:13 That's one of the key benefits I think in writing your assets in this DAG way, right?

39:20 Anything that is parallelizable will be parallelized.

39:22 Now sometimes you might want to put limits on that.

39:24 Sometimes too much parallelization is bad.

39:26 Your poor little database can't handle it.

39:28 And you can say, you know, maybe a concurrency limit on this one just for today is worth

39:32 putting, or if you're hitting an API for an external vendor, they might not appreciate

39:36 10,000 requests a second on that one.

39:39 So maybe you would slow it down.

39:40 But in this case-

39:41 - Or rate limiting, right?

39:42 You can run into too many requests and then your stuff crashes, then you got to start over.

39:45 It can be a whole thing.

39:46 - It can be a whole thing.

39:47 There's memory concerns, but let's pretend the world is simple.

39:51 Anything that can be parallelized will be through Dagster.

39:54 And that's really the benefit of writing these DAGs is that there's a nice algorithm for

39:57 determining what that actually looks like.

39:59 - Yeah.

40:00 I guess if you have a diamond shape or any sort of split, right?

40:03 Those two things now become, 'cause it's acyclic, they can't turn around and then

40:07 eventually depend on each other again.

40:09 So that's a perfect chance to just go fork it out.

40:12 - Exactly.

40:13 And that's kind of where partitions are also kind of interesting.

40:15 If you have a partitioned asset, you could take your dataset partitioned to five, you

40:19 know, buckets and run all five partitions at once, knowing full well that because you've

40:23 written this in an idempotent and partitioned way, that the first pipeline will only operate

40:28 on apples and the second one only operates on bananas.

40:32 And there is no commingling of apples and bananas anywhere in the pipeline.

40:35 - Oh, that's interesting.

40:36 I hadn't really thought about using the partitions for parallelism, but of course.

40:39 - Yeah.

40:40 It's a fun little way to break things apart.

40:43 - So if we run this on the Dagster cloud or even on our own, is this pretty much automatic?

40:49 We don't have to do anything?

40:51 I think Dagster just looks at it and says, this looks parallelizable and it'll go or?

40:55 - That's right.

40:56 Yeah.

40:57 As long as you've got the full deployment, whether it's OSS or cloud, Dagster will basically

41:00 parallelize it for you, where it's possible.

41:02 You can set global concurrency limits.

41:04 So you might say, you know, 64 is more than enough, you know, parallelization that I need,

41:09 or maybe I want less because I'm worried about overloading systems, but it's really up to

41:13 you.
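
As one hedged example of dialing that down, a job's run config can cap how many steps the standard multiprocess executor runs at once; the job name, selection, and the value 4 are invented, and the config shape is assumed from the stock executor.

```python
from dagster import define_asset_job

# Cap this job at four assets running at once so a small database or a
# rate-limited API is not overwhelmed.
nightly_job = define_asset_job(
    "nightly_job",
    selection="*",
    config={"execution": {"config": {"multiprocess": {"max_concurrent": 4}}}},
)
```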

41:14 - Putting this on a $10 server, please don't kill it.

41:19 First respect that it's somewhat wimpy, but that's okay.

41:21 - It'll get the job done.

41:23 All right.

41:24 I want to talk about some of the tools and some of the tools that are maybe at play here

41:29 when working with Dagster and some of the trends and stuff.

41:31 But before that, maybe speak to where you could see people adopt a tool like Dagster,

41:37 but they generally don't.

41:39 They don't realize like, oh, actually there's a whole framework for this, right?

41:43 Like I could, sure, I could go and build just an HTTP server and hook into the request and

41:49 start writing to it.

41:50 But like, maybe I should use Flask or FastAPI.

41:53 Like there's these frameworks that we really naturally adopt for certain situations like

41:58 APIs and others, background jobs, data pipelines, where I think there's probably a good chunk

42:04 of people who could benefit from stuff like this, but they just don't think they need

42:07 a framework for it.

42:09 Like cron is enough.

42:10 - Yeah, it's funny because sometimes cron is enough.

42:13 And I don't want to encourage people not to use cron, but think twice at least is what

42:18 I would say.

42:19 So probably the first like trigger for me of thinking of, you know, is Dagster a good

42:24 choice is like, am I trying to ingest data from somewhere?

42:28 That's something that fails.

42:28 Like I think we just can accept that, you know, if you're moving data around, the data

42:32 source will break, the expectations will change.

42:35 You'll need to debug it.

42:36 You'll need to rerun it.

42:37 And doing that in cron is a nightmare.

42:39 So I would say definitely start to think about an orchestration system.

42:43 If you're ingesting data, if you have a simple cron job that sends one email, like you're

42:46 probably fine.

42:47 I don't think you need to implement all of Dagster just to do that.

42:51 But the more closer you get to data pipelining, I think the better your life will be if you're

42:58 not trying to debug a obtuse process that no one really understands six months from

43:03 now.

43:04 - Excellent.

43:05 All right, maybe we could touch on some of the tools that are interesting to see people

43:08 using.

43:09 You talked about DuckDB and DBT, a lot of Ds starting here, but give us a sense of like

43:15 some of the supporting tools you see a lot of folks using that are interesting.

43:19 - Yeah, for sure.

43:20 I think in the data space, probably DBT is one of the most popular choices and DBT in

43:27 many ways, it's nothing more than a command line tool that runs a bunch of SQL in a DAG

43:33 as well.

43:34 So there's actually a nice fit with Dagster and DBT together.

43:37 DBT is really used by people who are trying to model that business process using SQL against

43:44 typically a data warehouse.

43:45 So if you have your data in, for example, a Postgres, a Snowflake, Databricks, Microsoft

43:51 SQL, these types of data warehouses, generally you're trying to model some type of business

43:56 process and typically people use SQL to do that.

44:00 Now you can do this without DBT, but DBT has provided a nice clean interface to doing so.

44:06 It makes it very easy to connect these models together, to run them, to have a development

44:10 workflow that works really well.

44:12 And then you can push it to prod and have things run again in production.

44:15 So that's DBT.

44:17 We find it works really well.

44:18 And a lot of our customers are actually using DBT as well.
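
For the curious, a rough sketch of that pairing using the dagster-dbt package; the manifest path and the function name are assumptions about your dbt project layout.

```python
from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets


@dbt_assets(manifest="target/manifest.json")
def my_dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # Every dbt model becomes its own asset in the lineage graph; this runs
    # `dbt build` and streams events back into Dagster's structured logs.
    yield from dbt.cli(["build"], context=context).stream()
```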

44:21 There's DuckDB, which is a great, it's like the SQLite for columnar databases, right?

44:27 It's in process, it's fast, it's written by the Dutch.

44:30 There's nothing you can't like about it.

44:32 It's free.

44:33 We love that.

44:34 It's a little bit more simple in Python itself.

44:37 It does.

44:38 It's so easy.

44:39 Yes, exactly.

44:40 The Dutch have given us so much and they've asked nothing of us.

44:42 So I'm always very thankful for them.

44:44 It's fast.

44:45 It's so fast.

44:46 It's like if you've ever used pandas for processing large volumes of data, you've occasionally

44:51 hit memory limits or inefficiencies in doing these large aggregates.

44:56 I won't go into all the reasons of why that is, but DuckDB sort of changes that because

45:01 it's a fast serverless sort of C++ written tooling to do really fast vectorized work.

45:09 And by that, I mean, like it works on columns.

45:11 So typically in like SQLite, you're doing transactions, you're doing single row updates,

45:15 writes, inserts, and SQLite is great at that.

45:18 Where typical transactional databases fail or aren't as powerful is when you're doing

45:23 aggregates, when you're looking at an entire column, right?

45:26 Just the way they're architected.

45:27 If you want to know the average, the median, the sum of some large number of columns, and

45:32 you want to group that by a whole bunch of things, you want to know the first date someone

45:36 did something and the last one, those types of vectorized operations, DuckDB is really,

45:41 really fast at doing.

45:42 And it's a great alternative to, for example, pandas, which can often hit memory limits

45:48 and be a little bit slow in that regard.
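As a small illustration of that point, with a toy DataFrame standing in for something far larger, DuckDB can run a vectorized aggregate directly over a pandas DataFrame it finds in scope:

    import duckdb
    import pandas as pd

    # A tiny DataFrame standing in for something much bigger.
    trips = pd.DataFrame({
        "city": ["Lisbon", "Lisbon", "Porto", "Porto", "Porto"],
        "fare": [7.5, 9.0, 4.2, 5.1, 6.3],
    })

    # DuckDB can reference local DataFrames by name and aggregate them column-wise.
    result = duckdb.sql("""
        SELECT city, count(*) AS rides, avg(fare) AS avg_fare
        FROM trips
        GROUP BY city
        ORDER BY avg_fare DESC
    """).df()

    print(result)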

45:50 Yeah, it looks like it has some pretty cool aspects, transactions, of course, but it also

45:54 says direct Parquet, CSV, and JSON querying.

45:59 So if you've got a CSV file hanging around and you want to ask questions about it or

46:04 JSON or some of the data science stuff through Parquet, you know, turn a proper indexed query

46:09 engine against it.

46:10 Don't just use a dictionary or something, right?

46:12 Yeah, it's great for reading a CSV, zip files, tar files, Parquet, partitioned Parquet files,

46:19 all that stuff that usually was really annoying to do and operate on, you can now install

46:23 DuckDB.

46:24 It's got a great CLI too.

46:25 So before you go out and like program your entire pipeline, you just run DuckDB and you

46:30 can start writing SQL against CSV files and all this stuff to really understand your data

46:35 and just really see how quick it is.
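A quick sketch of that direct-file querying, with a made-up filename; the same SQL works in the DuckDB CLI or from Python:

    import duckdb

    # Query a CSV in place; DuckDB infers the schema. Parquet and JSON work the same way.
    top_species = duckdb.sql("""
        SELECT species, count(*) AS sightings
        FROM 'birds.csv'
        GROUP BY species
        ORDER BY sightings DESC
        LIMIT 10
    """).df()

    print(top_species)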

46:37 I used it on a bird dataset that I had as an example project and there was, you know,

46:43 millions of rows and I was joining them together and doing massive group bys and it was done

46:47 in like seconds.

46:48 And it's just hard for me to believe that it was even correct because it was so quick.

46:52 So it's wonderful.

46:53 I must have done that wrong somehow.

46:55 Because it's already done, and it shouldn't be done yet.

46:58 Yeah.

46:59 And the fact it's in-process means there's not a server for you to babysit,

47:03 patch, and make sure it's still running.

47:06 It's accessible, but not too accessible.

47:08 All that, right?

47:09 It's a pip install away, which is always, we love that, right?

47:12 Yeah, absolutely.

47:13 You mentioned, I guess I mentioned Parquet, but also Apache Arrow seems like it's making

47:18 its way into a lot of different tools as a sort of foundational, high-performance

47:22 in-memory processing layer.

47:25 Have you used this at all?

47:26 I've used it, especially through like working through different languages.

47:30 So moving data between Python and R is where I last used this.

47:34 I think Arrow's great at that.

47:35 I believe Arrow is underneath some of the Rust-to-Python tooling as well.

47:41 It's working there.

47:42 So typically I don't use Arrow directly myself, but it's in a lot of the tooling I

47:46 use.

47:47 It's a great product.

47:48 And like so much of the ecosystem is now built on Arrow.
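As a rough illustration of Arrow as that shared layer (the file and column names are invented), pyarrow can build a table in memory and write it to Parquet that pandas, Polars, DuckDB, or R's arrow package can all read:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build an Arrow table in memory and round-trip it through Parquet.
    table = pa.table({"id": [1, 2, 3], "score": [0.4, 0.9, 0.7]})
    pq.write_table(table, "scores.parquet")

    round_tripped = pq.read_table("scores.parquet")
    print(round_tripped.schema)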

47:52 Yeah.

47:53 I think a lot of it is, I feel like the first time I heard about it was through Polars.

47:55 That's right.

47:56 Yeah.

47:57 I'm pretty sure, which is another Rust story, kind of like pandas, but with a little bit

48:03 more fluent, lazy API.

48:05 Yes.

48:06 We live in such great times to be honest.

48:07 So Polars is Python bindings for Rust, I believe, is kind of how I think about it.

48:13 It does all the transformation in Rust, but you have this Python interface to it, and

48:17 it makes things again, incredibly fast.

48:20 I would say similar in speed to DuckDB.

48:22 They both are quite comparable sometimes.

48:24 Yeah.

48:25 It also comes with vectorized and columnar processing and all that kind of stuff.

48:29 It's pretty incredible.

48:31 So not a drop in replacement for pandas, but if you have the opportunity to use it and

48:36 you don't need to use the full breadth of what pandas offers, because pandas is quite

48:39 a huge package.

48:40 There's a lot it does, but if you're just doing simple transforms, I think Polars is

48:44 a great option to explore.

48:45 Yeah, I talked to Ritchie Vink, who is part of that.

48:49 And I think they explicitly chose to not try to make it a drop in replacement for pandas,

48:54 but try to choose an API that would allow the engine to be smarter and go like, I see

48:58 you're asking for this, but the step before you wanted this other thing.

49:02 So let me do that transformation all in one shot.

49:04 And a little bit like a query optimization engine.
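That lazy, optimizer-driven style looks roughly like this sketch, assuming a recent Polars; the file and column names are made up for the example:

    import polars as pl

    # Nothing is read until .collect(); the optimizer can push the filter into the
    # scan and only load the columns the query actually touches.
    lazy = (
        pl.scan_csv("birds.csv")
        .filter(pl.col("year") >= 2020)
        .group_by("species")
        .agg(
            pl.col("wingspan").mean().alias("avg_wingspan"),
            pl.col("wingspan").count().alias("sightings"),
        )
    )

    print(lazy.collect().head())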

49:07 What else is out there?

49:08 We've got time for just a couple more.

49:10 If there's anything that you're like, Oh yeah, people use this all the time.

49:12 Especially the databases you've said, Postgres, Snowflake, et cetera.

49:16 Yeah, there's so much.

49:17 So another little one I like, it's called DLT, DLT hub.

49:21 It's getting a lot of traction as well.

49:23 And what I like about it is how lightweight it is.

49:25 I'm such a big fan of lightweight tooling.

49:28 That's not, you know, massive frameworks.

49:29 Loading data is I think still kind of yucky in many ways.

49:32 It's not fun.

49:33 And DLT makes it a little bit simpler and easier to do so.

49:36 So that's what I would recommend people look into if you've got to ingest data

49:41 from some API, some website, some CSV file.

49:45 It's a great way to do that.
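A minimal sketch of what dlt looks like in practice, with the records, pipeline name, and destination all invented for the example:

    import dlt

    # A handful of records standing in for an API response or a scraped page.
    rows = [
        {"id": 1, "name": "heron"},
        {"id": 2, "name": "kingfisher"},
    ]

    pipeline = dlt.pipeline(
        pipeline_name="bird_ingest",
        destination="duckdb",      # could be Postgres, BigQuery, Snowflake, etc.
        dataset_name="raw_birds",
    )

    # dlt infers the schema and creates the table in the destination.
    load_info = pipeline.run(rows, table_name="species")
    print(load_info)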

49:47 It claims it's the Python library for data teams loading data into unexpected places.

49:52 Very interesting.

49:53 Yes, that's great.

49:54 Yeah, this is, this looks cool.

49:56 All right.

49:57 Well, I guess maybe let's talk about, let's talk business and then we can talk about what's

50:01 next and then we'll probably be out of time.

50:03 I'm always fascinated.

50:04 I think there's starting to be a bit of a blueprint for this, but companies that take

50:09 a thing, they make it and they give it away and then they have a company around it.

50:12 And congratulations to you all for doing that.

50:14 Right.

50:15 And a lot of it seems to kind of center around the open core model, which I don't know if

50:20 that's exactly how you would characterize yourself, but yeah, maybe you should talk

50:24 about the business side.

50:25 Because I know there are many successful open source projects that don't necessarily result

50:29 in full-time jobs or companies if people were to want that.

50:32 It's a really interesting place.

50:34 And I don't think it's one that anyone has truly figured out, where I could say this is

50:39 the way forward for everyone, but it is something we're trying.

50:42 And I think for Dagster, I think it's working pretty well.

50:44 And what I think is really powerful about Dagster is like the open source product is

50:48 really, really good.

50:49 And it hasn't really been limited in many ways in order to drive like cloud product

50:54 consumption.

50:55 We really believe that there's actual value in that separation of these things.

50:58 There are some things that we just can't do in the open source platform.

51:01 For example, there are pipelines on cloud that involve ingesting data through our own systems

51:06 in order to do reporting, which just doesn't make sense to do on the open source system.

51:11 It makes the product way too complex.

51:13 But for the most part, I think Dagster open source, we really believe that just getting

51:16 it in the hands of developers is the best way to prove the value of it.

51:19 And if we can build a business on top of that, I think we're all super happy to do so.

51:23 It's nice that we get to sort of drive both sides of it.

51:27 To me, that's one of the more exciting parts, right?

51:29 A lot of the development that we do in Dagster open source is driven by people who are paid

51:35 through what happens on Dagster cloud.

51:37 And I think from what I can tell, there's no better way to build open source product

51:41 than to have people who are adequately paid to develop that product.

51:45 Otherwise it can be a labor of love, but one that doesn't last for very long.

51:48 Yeah.

51:49 Whenever I think about building software, there's 80% of it that's super exciting and

51:52 fun, then 10%, and then there's that little sliver of really fine polish that if it's not

51:58 just your job to make that thing polished, you're for the most part just not going

52:03 to polish that bit, right?

52:04 It's tough.

52:05 UI, design, support.

52:08 There's all these things that go into making a software like really extraordinary.

52:12 That's really, really tough to do.

52:14 And I think I really like the open source business model.

52:17 I think for me being able to just try something, not having talked to sales and being able

52:21 to just deploy locally and test it out and see if this works.

52:24 And if I choose to do so, deploy it in production, or if I bought the cloud product and I don't

52:30 like the direction that it's going, I can even go to open source as well.

52:34 That's pretty compelling to me.

52:35 Yeah, for sure it is.

52:37 And I think the more moving pieces of infrastructure you have, the more uptime you want and all those types of

52:43 things, the more it matters for somebody who's maybe a programmer, but not a DevOps infrastructure person, and

52:49 still needs to have it there, right?

52:50 Like that's an opportunity as well, right?

52:52 For you to say, look, you can write the code.

52:55 We made it cool for you to write the code, but you don't have to like get notified when

52:59 the server's down or whatever.

53:00 Like, we'll just take care of that for you.

53:01 That's pretty awesome.

53:02 Yeah.

53:03 And it's efficiencies of scale as well, right?

53:04 Like we've learned from the same mistakes over and over again, so you don't have to, which

53:08 is nice.

53:09 I don't know how many people want to maintain servers, but people do.

53:13 And they're more than welcome to if that's how they choose to do so.

53:15 Yeah, for sure.

53:16 All right.

53:17 Just about out of time.

53:18 Let's wrap up our conversation with where are things going for Dagster?

53:23 What's on the roadmap?

53:24 What are you excited about?

53:25 Oh, that's a good one.

53:26 I think we've actually published our roadmap online somewhere.

53:29 If you search Dagster roadmap, it's probably out there.

53:31 I think for the most part that hasn't changed much going into 2024, though we may update

53:36 it.

53:37 Ah, there it is.

53:38 We're really just doubling down on what we've built already.

53:40 I think there's a lot of work we can do on the product itself to make it easier to use,

53:45 easier to understand.

53:46 My team specifically is really focused around the education piece.

53:49 And so we launched Dagster University's first module, which helps you really understand

53:53 the core concepts around Dagster.

53:56 Our next module is coming up in a couple months, and that'll be around using Dagster with dbt,

54:00 which is our most popular integration.

54:02 We're building up more integrations as well.

54:04 So I built a little integration called embedded ELT that makes it easy to ingest data.

54:09 But I want to actually build an integration with DLT as well, DLT hub.

54:12 So we'll be doing that.

54:14 And there's more coming down the pipe, but I don't know how much I can say.

54:18 Look forward to an event in April where we'll have a launch event on all that's coming.

54:23 Nice.

54:24 Is that an online thing people can attend or something?

54:26 Exactly.

54:27 Yeah, there'll be some announcement there on the Dagster website on that.

54:31 Maybe I will call out one thing that's actually really fun.

54:33 It's called Dagster Open Platform.

54:35 It's a GitHub repo that we launched a couple months ago, I want to say.

54:39 We took our internal...

54:40 I should go back one more.

54:42 Sorry.

54:43 I should go back to GitHub, Dagster Open Platform on GitHub.

54:45 I have it somewhere.

54:47 Yeah.

54:48 It's here under the organization.

54:51 Yes, it should be somewhere here.

54:54 There it is.

54:55 Dagster Open Platform on GitHub.

54:57 And it's really a clone of our production pipelines.

54:59 For the most part, there's some things we've chosen to ignore because they're sensitive.

55:03 But as much as possible, we've defaulted to making it public and open.

55:06 And the whole reason behind this was because, you know, as data engineers, it's often hard

55:10 to see how other data engineers write code.

55:12 We get to see how software engineers write code quite often, but most people don't want

55:16 to share their platforms for various good reasons.

55:19 Right.

55:20 Also, there's like smaller teams or maybe just one person.

55:23 And then like those pipelines are so integrated into your specific infrastructure, right?

55:29 So it's not like, well, here's a web framework to share, right?

55:32 Like, here's how we integrate into that one weird API that we have that no one else has.

55:36 So there's no point in publishing it, right?

55:39 That's typically how it goes.

55:40 Or they're so large that they're afraid that there's like some, you know, important information

55:44 that they just don't want to take the risk on.

55:46 And so we built like something that's in the middle where we've taken as much as we can

55:49 and we've publicized it.

55:51 And you can't run this on your own.

55:52 Like it's not, that's not the point.

55:53 The point is to look at the code and see, you know, how does Dagster use Dagster and what

55:56 does that kind of look like?

55:57 Nice.

55:58 Okay.

55:59 All right.

56:00 Well, I'll put a link to that in the show notes and people can check it out.

56:01 Yeah, I guess let's wrap it up with the final call to action.

56:05 People are interested in Dagster.

56:06 How do they get started?

56:07 What do you tell them?

56:08 Oh, yeah.

56:09 Well, Dagster is probably the greatest place to start.

56:11 You can try the cloud product.

56:13 We have a free self-serve tier, or you can try the local install as well.

56:18 If you get stuck, a great place to join is our Slack channel, which is up on our website.

56:22 There's even an Ask AI channel where you can just talk to a Slack bot that's been trained

56:27 on all our GitHub issues and discussions.

56:29 And it's surprisingly good at walking you through, you know, any debugging, any issues

56:33 or even advice.

56:34 And that's pretty excellent, actually.

56:36 Yeah.

56:37 It's real fun.

56:38 It's really fun.

56:39 It's a great experience, a community where you can just chat with us as well.

56:41 Cool.

56:42 All right.

56:43 Well, Pedram, thank you for being on the show.

56:44 Thanks for the work on Dagster and for sharing it with us.

56:47 Thank you, Michael.

56:48 You bet.

56:49 See you later.

56:50 This has been another episode of Talk Python to Me.

56:52 Thank you to our sponsors.

56:54 Be sure to check out what they're offering.

56:55 It really helps support the show.

56:58 This episode is sponsored by Posit Connect from the makers of Shiny.

57:01 Publish, share, and deploy all of your data projects that you're creating using Python.

57:06 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards, and APIs.

57:13 Posit Connect supports all of them.

57:15 Try Posit Connect for free by going to talkpython.fm/posit.

57:18 P-O-S-I-T.

57:19 Want to level up your Python?

57:24 We have one of the largest catalogs of Python video courses over at Talk Python.

57:28 Our content ranges from true beginners to deeply advanced topics like memory and async.

57:33 And best of all, there's not a subscription in sight.

57:35 Check it out for yourself at training.talkpython.fm.

57:39 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.

57:43 We should be right at the top.

57:45 You can also find the iTunes feed at /iTunes, the Google Play feed at /play, and the Direct

57:50 RSS feed at /rss on talkpython.fm.

57:54 We're live streaming most of our recordings these days.

57:57 If you want to be part of the show and have your comments featured on the air, be sure

58:00 to subscribe to our YouTube channel at talkpython.fm/youtube.

58:05 This is your host, Michael Kennedy.

58:07 Thanks so much for listening.

58:08 I really appreciate it.

58:09 Now get out there and write some Python code.

58:12 [MUSIC PLAYING]

58:15 [MUSIC ENDS]


58:30 We just recorded it.
