#454: Data Pipelines with Dagster Transcript

Recorded on Thursday, Jan 11, 2024.

00:00 Do you have data that you pull from external sources or that is generated and appears at

00:05 your digital doorstep?

00:06 I bet that data needs processed, filtered, transformed, distributed, and much more.

00:11 One of the biggest tools to create these data pipelines with Python is Dagster.

00:16 And we're fortunate to have Pedram Navid on the show to tell us about it.

00:20 Pedram is the head of data engineering and dev rel at Dagster Labs.

00:24 And we're talking data pipelines this week here at Talk Python.

This is Talk Python to Me, episode 454, recorded January 11th, 2024.

Welcome to Talk Python to Me, a weekly podcast on Python.

This is your host, Michael Kennedy.

Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,

both on fosstodon.org.

Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

We've started streaming most of our episodes live on YouTube.

Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be
part of that episode.

01:56 Last week, I told you about our new course, Build an AI Audio App with Python.

02:01 Well, I have another brand new and amazing course to tell you about.

02:06 This time, it's all about Python's typing system and how to take the most advantage

02:10 of it.

02:11 It's a really awesome course called Rock Solid Python with Python Typing.

02:16 This is one of my favorite courses that I've created in the last couple of years.

02:20 Python type hints are really starting to transform Python, especially from the ecosystem's perspective.

02:26 Think FastAPI, Pydantic, BearType, et cetera.

02:30 This course shows you the ins and outs of Python typing syntax, of course, but it also

02:34 gives you guidance on when and how to use type hints.

02:38 Check out this four and a half hour in-depth course at talkpython.fm/courses.

02:44 Now onto those data pipelines.

02:47 Pedram, welcome to Talk Python to Me.

02:50 It's amazing to have you here.

02:51 >> Michael, great to have you.

02:52 Good to be here.

02:53 >> Yeah.

02:54 We're going to talk about data, data pipelines, automation, and boy, oh boy, let me tell you,

02:59 have I been in the DevOps side of things this week.

03:03 And I'm going to have a special, special appreciation of it.

03:07 I can tell already.

03:08 So excited to talk.

03:10 >> My condolences.

03:12 >> Indeed.

03:13 >> So before we get to that, though, before we talk about Dagster and data pipelines and

03:18 orchestration more broadly, let's just get a little bit of background on you.

03:22 Introduce yourself for people.

03:23 How'd you get into Python and data orchestration and all those things?

03:27 >> Of course, yeah.

03:28 So my name is Pedram Navid.

03:29 I'm the head of data engineering and dev rel at Dagster.

03:33 That's a mouthful.

03:34 And I've been a longtime Python user since 2.7.

03:38 And I got started with Python like I do with many things just out of sheer laziness.

03:42 I was working at a bank and there was this rote task, something involving going into

03:47 servers, opening up a text file and seeing if a patch was applied to a server.

03:52 A nightmare scenario when there's 100 servers to check and 15 different patches to confirm.

03:57 >> Yeah, so this kind of predates like the cloud and all that automation and stuff, right?

04:02 >> This was definitely before cloud.

04:04 This was like right between Python 2 and Python 3, and we were trying to figure out how to

04:07 use print statements correctly.

04:09 That's when I learned Python.

04:10 I was like, there's got to be a better way.

04:12 And honestly, I've not looked back.

04:13 I think if you look at my entire career trajectory, you'll see it's just punctuated by finding

04:18 ways to be more lazy in many ways.

04:22 >> Yeah.

04:23 Who was it?

04:24 I think it was Matthew Rocklin that had the phrase something like productive laziness

04:29 or something like that.

04:31 Like, I'm going to find a way to leverage my laziness to force me to build automation

04:36 so I never ever have to do this sort of thing again.

04:39 I got that sort of print.

04:40 >> It's very motivating to not have to do something.

04:43 And I'll do anything to not do something.

04:44 >> Yeah, yeah, yeah.

04:45 It's incredible.

04:46 And like that DevOps stuff I was talking about, just, you know, one command and there's maybe

04:51 eight or nine new apps with all their tiers redeployed, updated, resynced.

04:55 And it took me a lot of work to get there.

04:59 But now I never have to think about it again, at least not for a few years.

05:03 And it's amazing.

05:04 I can just be productive.

05:05 It's like right in line with that.

05:07 >> So what are some of the Python projects you've been, you've worked on, talked about

05:11 different ways to apply this over the years?

05:13 >> Oh, yeah.

05:14 So it started with internal, just like Python projects, trying to automate, like I said,

05:18 some rote tasks that I had.

05:20 And that accidentally becomes, you know, a bigger project.

05:23 People see it and they're like, oh, I want that too.

05:25 And so, well, now I have to build like a GUI interface because most people don't speak

05:29 Python.

05:30 And so that got me into PyGUI, I think it was called, way back when.

05:35 That was a fun journey.

05:36 And then from there, it's really taken off.

05:38 A lot of it has been mostly personal projects.

05:41 Trying to understand open source was a really big learning path for me as well.

05:46 Really being absorbed by things like SQLAlchemy and requests back when they were coming out.

05:50 Eventually, it led to more of a data engineering type of role, where I got involved with tools

05:55 like Airflow and tried to automate data pipelines instead of patches on a server.

06:01 That one day led to, I guess, making a long story short, a role at Dagster, where now I

06:06 contribute a little bit to Dagster.

06:08 I work on Dagster, the core project itself, but I also use Dagster internally to build

06:11 our own data pipelines.

06:13 I'm sure it's interesting to see how you all both build Dagster and then consume Dagster.

06:19 Yeah, it's been wonderful.

06:21 I think there's a lot of great things about it.

06:23 One is like getting access to Dagster before it's fully released, right?

06:27 Internally, we dog food, new features, new concepts, and we work with the product team,

06:33 the engineering team, and say, "Hey, this makes sense.

06:35 This doesn't.

06:36 This works really well.

06:37 That doesn't." That feedback loop is so fast and so iterative that for me personally, being able to see

06:43 that come to fruition is really, really compelling.

06:45 But at the same time, I get to work at a place that's building a tool for me.

06:50 You don't often get that luxury.

06:51 I've worked in ads.

06:53 I've worked in insurance.

06:54 It's like banking.

06:55 These are nice things, but it's not built for me, right?

06:59 And so for me, that's probably been the biggest benefit, I would say.

07:01 Right.

07:02 If you work in some marketing thing, you're like, "You know, I retargeted myself so well

07:06 today.

07:07 You wouldn't believe it.

07:08 I really enjoyed it." I've seen the ads that I've created before, so it's a little fun, but it's not the same.

07:15 Yeah.

07:16 I've heard of people who are really, really good at ad targeting and finding groups where

07:21 they've pranked their wife or something, or just had an ad that would only show up for

07:26 their wife by running it.

07:27 It was so specific and freaked them out a little bit.

07:30 That's pretty clever.

07:31 Yeah.

07:32 Maybe it wasn't appreciated, but it is clever.

07:34 Who knows?

07:35 All right.

07:37 Well, before we jump in, you said that, of course, you built GUIs with PyGUI and those

07:43 sorts of things because people didn't speak Python back then, in the 2.7 days and whatever.

07:48 Is that different now?

07:49 Not that people speak Python, but is it different in the sense that, "Hey, I could give them

07:53 a Jupyter notebook," or, "I could give them Streamlit," or one of these things, right?

07:58 A little more or less you building and just plug it in?

08:01 I think so.

08:02 Like you said, it's not different in that most people probably still to this day don't

08:06 speak Python.

08:07 I know we had this movement a little bit back where everyone was going to learn SQL and

08:11 everyone was going to learn to code.

08:13 I was never that bullish on that trend because if I'm a marketing person, I've got 10,000

08:18 things to do and learning to code isn't going to be a priority ever.

08:22 So I think building interfaces for people that are easy to use and speak well to them

08:27 is always useful.

08:29 That never has gone away, but I think the tooling around it has been better, right?

08:32 I don't think I'll ever want to use PyGUI again.

08:34 Nothing wrong with the platform.

08:35 It's just not fun to write.

08:37 Streamlit makes it so easy to do that.

08:39 So does something like Retool, and there's a thousand other ways now that you can bring

08:43 these tools in front of your stakeholders and your users that just wasn't possible before.

08:47 I think it's a pretty exciting time.

08:49 There are a lot of pretty polished tools.

08:51 Yeah, it's gotten so good.

08:52 Yeah.

08:53 There's some interesting ones like OpenBB.

08:54 Do you know that?

08:55 The financial dashboard thing.

08:58 I've heard of this.

08:59 I haven't seen it.

09:00 Yeah.

09:01 It's basically for traders, but it's like a terminal type thing that has a bunch of

09:05 Matplotlib and other interactive stuff that pops up kind of compared to say Bloomberg

09:10 dashboard type things.

09:13 But yeah, that's one sense where like maybe like traders go and learn Python because it's

09:18 like, all right, there's enough value here.

09:19 But in general, I don't think people are going to stop what they're doing and learn to

09:24 code.

09:25 So these new UI things are not.

09:26 All right, let's dive in and talk about this general category first of data pipelines,

09:32 data orchestration, all those things.

09:33 We'll talk about Dagster and some of the trends and that.

09:36 So let's grab some random internet search for what does a data pipeline maybe look like?

09:41 But people out there listening who don't necessarily live in that space, which I think is honestly

09:47 many of us, maybe we should, but maybe in our minds, we don't think we live in data

09:51 pipeline land.

09:52 Tell them about it.

09:53 Yeah, for sure.

09:54 It is hard to think about if you haven't done or built one before.

09:57 In many ways, a data pipeline is just a series of steps that you apply to some dataset that

10:02 you have in order to transform it to something a little bit more valuable at the very end.
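
To make the "series of steps" idea concrete, here is a minimal sketch in plain Python. It is not tied to Dagster or any other tool; the step names and the sample order data are made up for illustration.

```python
from functools import reduce


def drop_incomplete(rows):
    """Filter step: discard rows that have no amount."""
    return [r for r in rows if r.get("amount") is not None]


def to_cents(rows):
    """Transform step: add an integer amount in cents to every row."""
    return [{**r, "amount_cents": round(r["amount"] * 100)} for r in rows]


def total_by_customer(rows):
    """Aggregate step: one output row per customer, sorted by name."""
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount_cents"]
    return [{"customer": c, "total_cents": t} for c, t in sorted(totals.items())]


def run_pipeline(rows, steps):
    """Apply each step in order, feeding the output of one into the next."""
    return reduce(lambda data, step: step(data), steps, rows)


# Hypothetical raw data, as it might arrive from the business.
raw_orders = [
    {"customer": "ada", "amount": 12.50},
    {"customer": "bob", "amount": None},
    {"customer": "ada", "amount": 7.25},
]

report = run_pipeline(raw_orders, [drop_incomplete, to_cents, total_by_customer])
```

The raw rows go in at one end and a smaller, more useful table comes out at the other, which is the whole shape of a pipeline in miniature.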

10:08 It's a simplified version, the devil's in the details, but really like at the end of

10:12 the day, you're in a business, the production of data sort of happens by the very nature

10:16 of operating that business.

10:17 It tends to be the core thing that all businesses have in common.

10:21 And then the other sort of output is you have people within that business who are trying

10:25 to understand how the business is operating.

10:27 And this used to be easy when all we had was a single spreadsheet that we could look at

10:31 once a month.

10:32 Yeah, I think businesses have gotten a little bit more complex these days.

10:36 Computers and automation.

10:37 And the expectations.

10:38 I expect to be able to see almost real time, not I'll see it at the end of the month sort

10:42 of.

10:43 That's right.

10:44 Yeah.

10:45 I think people have gotten used to getting data too, which is both good and bad.

10:47 Good in the sense that now people are making better decisions.

10:49 Bad in that there's more work for us to do.

10:51 And we can't just sit on our feet for half a day, half a month waiting for the next request

10:55 to come in.

10:56 There's just an endless stream that seems to never end.

10:59 So that's what really a pipeline is all about.

11:00 It's like taking these data and making it consumable in a way that users, tools will

11:06 understand that helps people make decisions at the very end of the day.

11:09 That's sort of the nuts and bolts of it.

11:11 In your mind, does data acquisition live in this land?

11:14 So for example, maybe we have a scheduled job that goes and does web scraping, calls

11:19 an API once an hour, and that might kick off a whole pipeline of processing.

11:24 Or we watch a folder for people to upload over FTP, like a CSV file or something horrible

11:32 like that.

11:33 You don't even, it's unspeakable.

11:34 But something like that where you say, oh, a new CSV has arrived for me to get, right?
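
The "a new CSV has arrived" case can be sketched as a simple polling check. This is a plain-Python illustration under stated assumptions, not how any particular orchestrator does it: the function name `find_new_csvs` and the in-memory `seen` set are hypothetical, and a real system would persist what it has already processed.

```python
import tempfile
from pathlib import Path


def find_new_csvs(folder, seen):
    """Return CSV files in `folder` that were not seen before, and remember them."""
    new_files = []
    for path in sorted(folder.glob("*.csv")):
        if path.name not in seen:
            seen.add(path.name)
            new_files.append(path)
    return new_files


# Demo with a temporary directory standing in for the upload folder.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp)
    seen = set()
    (inbox / "orders.csv").write_text("id,amount\n1,9.99\n")
    first_poll = [p.name for p in find_new_csvs(inbox, seen)]
    second_poll = [p.name for p in find_new_csvs(inbox, seen)]
```

The first poll reports the new file and the second reports nothing, so each arrival kicks off downstream processing only once.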

11:38 Yeah, I think that's the beginning of all data pipeline journeys in my mind, very much,

11:43 right?

11:44 Like an FTP, as much as we hate it, it's not terrible.

11:46 I mean, the worst, there are worse ways to transfer files, but it's, I think still very

11:52 much in use today.

11:53 And every data pipeline journey at some point has to begin with that consumption of data

11:57 from somewhere.

11:58 Yeah.

11:59 Hopefully it's SFTP, not just straight FTP.

12:02 Like the encrypted, don't just send your password in the plain text.

12:06 Oh, well, I've seen that go wrong.

12:09 That's a story for another day, honestly.

12:11 All right.

12:12 Well, let's talk about the project that you work on.

12:14 We've been talking about it in general, but let's talk about Dagster.

12:18 Like, where does it fit in this world?

12:20 Yes.

12:21 Dagster to me is a way to build a data platform.

12:24 It's also a different way of thinking about how you build data pipelines.

12:28 Maybe it's good to compare it with kind of what the world was like, I think, before Dagster

12:32 and how it came about to be.

12:34 So if you think of Airflow, I think it's probably the most canonical orchestrator out there,

12:39 but there are other ways which people used to orchestrate these data pipelines.

12:43 They were often task-based, right?

12:45 Like I would download file, I would unzip file, I would upload file.

12:50 These are sort of the words we use to describe the various steps within a pipeline.

12:55 Some of those little steps might be Python functions that you write.

12:59 Maybe there's some pre-built other ones.

13:00 Yeah, they might be Python.

13:02 Could be a bash script.

13:03 It'd be logging into a server and downloading a file.

13:05 Could be hitting request to download something from the internet, unzipping it.

13:09 Just a various, you know, hodgepodge of commands that would run.

13:12 That's typically how we thought about it.

13:13 For more complex scenarios where your data is bigger, maybe it's running against a Hadoop

13:17 cluster or a Spark cluster.

13:19 The compute's been offloaded somewhere else.

13:21 But the sort of conceptual way you tended to think about these things is in terms of

13:25 tasks, right?

13:26 Process this thing, do this massive data dump, run a bunch of things, and then your job is

13:31 complete.

13:32 With Airflow, or sorry, with Dagster, we kind of flip it around a little bit on our

13:35 heads and we say, instead of thinking about tasks, what if we flipped that around and

13:40 thought about the actual underlying assets that you're creating?

13:43 What if you told us not, you know, the step that you're going to take, but the thing that

13:47 you produce?

13:48 Because it turns out as people and as data people and stakeholders, really, we don't

13:52 care about the task, like we just assume that you're going to do it.

13:56 What we care about is, you know, that table, that model, that file, that Jupyter notebook.

14:01 And if we model our pipeline through that, then we get a whole bunch of other benefits.

14:06 And that's sort of the Dagster sort of pitch, right?

14:09 Like if you want to understand the things that are being produced by these tasks, tell

14:13 us about the underlying assets.

14:15 And then when a stakeholder says and comes to you and says, you know, how old is this

14:19 table?

14:20 Has it been refreshed lately?

14:21 You don't have to go look at a specific task.

14:22 And remember that task ABC had model XYZ.

14:26 You just go and look up model XYZ directly there, and it's there for you.

14:29 And because you've defined things in this way, you get other nice things like a lineage

14:33 graph.

14:34 You get to understand how fresh your data is.

14:36 You can do event-based orchestration and all kinds of nice things that are a lot harder

14:39 to do in a task world.

14:41 Yeah, more declarative, less imperative, I suppose.

14:45 Yeah, it's been the trend, I think, in lots of tooling.

14:48 React, I think was famous for this as well, right?

14:51 In many ways.

14:52 It was a hard framework, I think, for people to sort of get their heads around initially

14:55 because we were so used to like the jQuery declarative or jQuery style of doing things.

15:01 Yeah.

15:02 How do I hook the event that makes the thing happen?

15:03 And React said, let's think about it a little bit differently.

15:06 Let's do this event-based orchestration.

15:08 And I think the proof's in the pudding.

15:10 React's everywhere now and jQuery, maybe not so much.

15:13 Yeah.

15:14 There's still a lot of jQuery out there, but there's not a lot of active jQuery.

15:18 But I imagine there's some.

15:19 There is.

15:20 Yeah.

15:21 Just because people are like, you know what?

15:22 Don't touch that.

15:23 That works.

15:24 Which is probably the smartest thing people can do, I think.

15:27 Yeah, honestly.

15:28 Even though new frameworks are shiny.

15:30 And if there's any ecosystem that loves to chase the shiny new idea, it's the JavaScript

15:35 web world.

15:36 Oh, yeah.

15:37 There's no shortage of new frameworks coming out every time.

15:40 We do too, but not as much as like, that's six months old.

15:44 That's so old, we can't possibly do that anymore.

15:46 We're rewriting it.

15:47 We're going to do the big rewrite again.

15:48 Yep.

15:49 Fun.

15:50 So, Dagster is the company, but also is open source.

15:54 What's the story around like, can I use it for free?

15:57 Is it open source?

15:58 Do I pay for it?

15:59 100%.

16:00 Okay.

16:01 So, Dagster Labs is the company.

16:02 Dagster open source is the product.

16:03 It's 100% free.

16:04 We're very committed to the open source model.

16:06 I would say 95% of the things you can get out of Dagster are available through open source.

16:11 And we tend to try to release everything through that model.

16:14 You can run very complex pipelines, and you can deploy it all on your own if you wish.

16:19 There is a Dagster cloud product, which is really the hosted version of Dagster.

16:23 If you want a hosted plane, we can do that for you through Dagster cloud, but it all runs

16:27 on the same code base and the modeling and the files all essentially look the same.

16:32 Okay.

16:33 So obviously you could get, like I talked about at the beginning, you could go down

16:36 the DevOps side, get your own open source Dagster set up, schedule it, run it on servers,

16:41 all those things.

16:42 But if we just wanted something real simple, we could just go to you guys and say, "Hey,

16:47 I built this with Dagster.

16:48 Will you run it for me?" Pretty much.

16:50 Yeah.

16:51 Right.

16:52 So there's two options there.

16:53 You can do the serverless model, which says, "Dagster, just run it.

16:55 We take care of the compute, we take care of the execution for you, and you just write

16:58 the code and upload it to GitHub or any repository of your choice, and we'll sync to that and

17:04 then run it." The other option is to do the hybrid model.

17:06 So you basically do the CI/CD aspect.

17:09 You just say, you push to name your branch.

17:11 If you push to that branch, that means we're just going to deploy a new version and whatever

17:15 happens after that, it'll be in production, right?

17:18 Exactly.

17:19 Yeah.

17:20 And we offer some templates that you can use in GitHub for workflows in order to accommodate

17:23 that.

17:24 Excellent.

17:25 Then I cut you off.

17:26 You're saying something about hybrid.

17:27 Hybrid is the other option for those of you who want to run your own compute.

17:30 You don't want the data leaving your ecosystem.

17:32 You can say, "We've got this Kubernetes cluster, this ECS cluster, but we still want to use

17:37 a Dagster Cloud product to sort of manage the control plane.

17:40 Dagster Cloud will do that." And then you can go off and execute things on your own environment if that's something

17:44 you wish to do.

17:45 Oh, yeah.

17:46 Because running stuff in containers isn't too bad, but running container clusters, all

17:51 of a sudden you're back doing a lot of work, right?

17:55 Exactly.

17:56 Yeah.

17:57 Okay.

17:58 Well, let's maybe talk about Dagster for a bit.

17:59 I want to talk about some of the trends as well, but let's just talk through maybe setting

18:02 up a pipeline.

18:04 What does it look like?

18:05 You talked about in general, less imperative, more declarative, but what does it look like?

18:10 Be careful about talking about code on audio, but just give us a sense of what the programming

18:15 model feels like for us.

18:16 As much as possible, it really feels like just writing Python.

18:20 It's pretty easy.

18:21 You add a decorator on top of your existing Python function that does something.

18:25 That's a simple decorator called asset.

18:28 And then your pipeline, that function becomes a data asset.

18:31 That's how it's represented in the Dagster UI.

18:33 So you could imagine you've got a pipeline that gets like maybe Slack analytics and uploads

18:39 that to some dashboard, right?

18:41 Your first pipeline, your function will be called something like Slack data, and that

18:45 would be your asset.

18:46 In that function is where you do all the transform, the downloading of the data until you've really

18:51 created that fundamental data asset that you care about.

18:53 And that could be stored either in a data warehouse or S3, however you sort of want

18:58 to persist it, that's really up to you.
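
To give a feel for the programming model without requiring the library, here is a toy stand-in written in plain Python. In Dagster itself the decorator is `asset`, imported from the `dagster` package; the `ASSETS` registry and the `materialize` helper below are hypothetical and exist only for this sketch. It borrows the convention that a function's parameter names refer to upstream assets.

```python
import inspect

ASSETS = {}  # hypothetical registry for this sketch: asset name -> function


def asset(fn):
    """Toy decorator: register the function under its own name."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def slack_data():
    # A real asset would call an API here; the rows are hard-coded for the sketch.
    return [
        {"channel": "general", "messages": 120},
        {"channel": "random", "messages": 45},
    ]


@asset
def slack_summary(slack_data):
    # The parameter name declares which upstream asset this one depends on.
    return {"total_messages": sum(row["messages"] for row in slack_data)}


def materialize(name, cache=None):
    """Compute an asset, first computing every asset its parameters name."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = ASSETS[name]
        upstream = [materialize(dep, cache) for dep in inspect.signature(fn).parameters]
        cache[name] = fn(*upstream)
    return cache[name]


summary = materialize("slack_summary")
```

The point of the sketch is that the code describes the things being produced (`slack_data`, `slack_summary`) rather than the steps; the order of execution falls out of the declared dependencies.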

19:00 And then the resources is sort of where the power, I think, of a lot of Dagster comes in.

19:04 So the asset is sort of like declaration of the thing I'm going to create.

19:08 The resource is how I'm going to operate on that, right?

19:12 Because sometimes you might want to have a, let's say a DuckDB instance locally, because

19:17 it's easier and faster to operate.

19:18 But when you're moving to the cloud, you want to have a Databricks or a Snowflake.

19:23 You can swap out resources based on environments and your asset can reference that resource.

19:28 And as long as it has that same sort of API, you can really flexibly change between where

19:32 that data is going to be persisted.
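
The resource-swapping idea can be sketched in plain Python as two storage classes that share one small interface. This is an illustration of the pattern only, not Dagster's resource API: the `Storage` protocol, both classes, and the `PIPELINE_ENV` variable are hypothetical, and the in-memory "warehouse" merely stands in for a cloud client.

```python
import os
import sqlite3
from typing import Protocol


class Storage(Protocol):
    """The shared API both environments must offer."""

    def save(self, table: str, rows: list) -> None: ...
    def count(self, table: str) -> int: ...


class SqliteStorage:
    """Local storage for development, backed by SQLite."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)

    def save(self, table, rows):
        # Table names are trusted constants in this sketch, so an f-string is acceptable.
        self.conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (channel TEXT, messages INTEGER)"
        )
        self.conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
        self.conn.commit()

    def count(self, table):
        return self.conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


class InMemoryWarehouse:
    """Placeholder for a production warehouse client."""

    def __init__(self):
        self.tables = {}

    def save(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)

    def count(self, table):
        return len(self.tables.get(table, []))


def storage_for(env):
    """Pick the storage implementation for the given environment name."""
    return InMemoryWarehouse() if env == "prod" else SqliteStorage()


def slack_data(storage):
    """The asset logic only sees the shared API, not the concrete backend."""
    storage.save("slack_data", [("general", 120), ("random", 45)])
    return storage.count("slack_data")


env = os.environ.get("PIPELINE_ENV", "dev")  # hypothetical variable name
rows_written = slack_data(storage_for(env))
```

Because `slack_data` depends only on `save` and `count`, moving from a laptop to production changes which object is passed in, not the asset code.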

19:34 Does Dagster know how to talk to those different platforms?

19:37 Does it like natively understand DuckDB and Snowflake?

19:40 Yeah.

19:41 Interesting.

19:42 People often look to Dagster and like, "Oh, does it do X?" And the answer is like, "Dagster does anything you can do with Python."

19:48 Which is most things, yeah.

19:49 Which is most things.

19:50 So I think if you come from the Airflow world, you're very much used to like these Airflow

19:54 providers and if you want to use...

19:55 That's kind of what I was thinking, yeah.

19:56 Yeah.

19:57 You want to use a Postgres, you need to find the Postgres provider.

19:59 You want to use S3, you need to find the S3 provider.

20:01 With Dagster, you kind of say you don't have to do any of that.

20:04 If you want to use Snowflake, for example, install the Snowflake connector package from

20:08 Snowflake and you use that as a resource directly.

20:11 And then you just run your SQL that way.

20:13 There are some places where we do have integrations that help if you want to get into the weeds

20:18 with I/O Manager, it's where we persist the data on your behalf.

20:21 And so for S3, for Snowflake, for example, there's other ways where we can persist that

20:26 data for you.

20:27 But if you're just trying to run a query, just trying to execute something, just trying

20:30 to save something somewhere, you don't have to use that system at all.

20:33 You can just use whatever Python package you would use anyway to do that.

20:38 So maybe some data is expensive for us to get as a company, like maybe we're charged

20:43 on a usage basis or super slow or something.

20:46 I could write just Python code that goes and say, well, look in my local database.

20:50 If it's already there, use that and it's not too stale.

20:53 Otherwise, then do actually go get it, put it there and then get it back.

20:57 And like that kind of stuff would be up to me to put together.
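
The "use the local copy unless it is too stale" logic described here might look roughly like this. It is a sketch in plain Python with SQLite; the `cache` table layout, the function name, and the `now` parameter (which exists so the example can control the clock) are all assumptions made for illustration.

```python
import sqlite3
import time


def get_with_cache(conn, key, fetch, max_age_seconds, now=None):
    """Return a cached value if it is fresh enough, otherwise fetch and store it."""
    now = time.time() if now is None else now
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache "
        "(key TEXT PRIMARY KEY, value TEXT, fetched_at REAL)"
    )
    row = conn.execute(
        "SELECT value, fetched_at FROM cache WHERE key = ?", (key,)
    ).fetchone()
    if row is not None and now - row[1] <= max_age_seconds:
        return row[0]  # fresh enough: skip the expensive call
    value = fetch()
    conn.execute(
        "INSERT OR REPLACE INTO cache (key, value, fetched_at) VALUES (?, ?, ?)",
        (key, value, now),
    )
    conn.commit()
    return value


calls = []


def expensive_fetch():
    """Stand-in for a slow or metered API call; counts how often it runs."""
    calls.append(1)
    return f"payload-{len(calls)}"


conn = sqlite3.connect(":memory:")
first = get_with_cache(conn, "rates", expensive_fetch, max_age_seconds=3600, now=0)
second = get_with_cache(conn, "rates", expensive_fetch, max_age_seconds=3600, now=60)
third = get_with_cache(conn, "rates", expensive_fetch, max_age_seconds=3600, now=7200)
```

The second call is served from the local database, and the third refetches because the stored copy is older than the allowed hour.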

21:00 Yeah.

21:01 And that's the nice thing is you're not really limited by like anyone's data model or worldview

21:06 on how data should be retrieved or saved or augmented.

21:09 You could do it a couple of ways.

21:10 You could say whenever I'm working locally, use this persistent data store that we're

21:15 just going to use for development purposes.

21:18 Fancy database called SQLite, something like that.

21:20 Exactly.

21:21 Yes.

21:22 A wonderful database.

21:23 Actually, it is.

21:24 Yeah.

21:25 It'll work really, really well.

21:26 And then you just say when I'm in a different environment, when I'm in production, swap

21:28 out my SQLite resource for a name, your favorite cloud warehouse resource, and go fetch that

21:33 data from there.

21:34 Or I want to use MinIO locally.

21:36 I want to use S3 on prod.

21:39 It's very simple to swap these things out.

21:40 Okay.

21:41 Yeah.

21:42 So it looks like you build up these assets as y'all call them, these pieces of data,

21:46 Python code that accesses them.

21:48 And then you have a nice UI that lets you go and build those out kind of workflow style,

21:54 right?

21:55 Yeah, exactly.

21:56 This is where we get into the wonderful world of DAGs, which stands for directed acyclic

22:00 graph.

22:01 So basically it stands for a bunch of things that are not connected in a circle, but are

22:04 connected in some way.

22:05 So there can't be any loops, right?

22:07 Because then you never know where to start or where to end.

22:09 Could be a diamond, but not a circle.

22:11 Not a circle.

22:12 As long as there's like a path through this dataset, with a beginning and an end, then

22:18 we can kind of start to model this connected graph of things.

22:21 And then we know how to execute them, right?

22:23 We can say, well, this is the first thing we have to run because that's where all dependencies

22:26 start.

22:27 And then we can branch off in parallel or we can continue linearly until everything

22:31 is complete.

22:32 And if something breaks in the middle, we can resume from that broken spot.
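
Running a DAG in dependency order, and resuming after a failure, can be sketched with the standard library's `graphlib.TopologicalSorter`. The asset names and the `run` helper are hypothetical; a real orchestrator would also run independent branches in parallel, which this sketch does not attempt.

```python
from graphlib import TopologicalSorter

# Each key depends on the assets in its set (hypothetical names).
deps = {
    "raw_orders": set(),
    "raw_customers": set(),
    "clean_orders": {"raw_orders"},
    "orders_by_customer": {"clean_orders", "raw_customers"},
    "dashboard": {"orders_by_customer"},
}


def run(deps, already_done=(), fail_on=None):
    """Run nodes in dependency order and return the names executed this time.

    Nodes in `already_done` are skipped. If `fail_on` is reached, the run
    stops there, simulating a step that breaks in the middle.
    """
    executed = []
    done = set(already_done)
    for name in TopologicalSorter(deps).static_order():
        if name in done:
            continue
        if name == fail_on:
            break  # simulated failure: nothing downstream runs
        executed.append(name)
        done.add(name)
    return executed


first_attempt = run(deps, fail_on="clean_orders")
resumed = run(deps, already_done=first_attempt)
```

The first attempt gets through the two raw inputs before the simulated failure, and the resumed run picks up at `clean_orders` without redoing them. A cycle in `deps` would make `TopologicalSorter` raise `CycleError`, which is why the graph has to be acyclic.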

22:35 Okay, excellent.

22:36 And is that the recommended way?

22:38 Like if I write all this Python code that works on the pieces, then the next recommendation

22:42 would be to fire up the UI and start building it?

22:45 Or do you say, ah, you should really write it in code and then you can just visualize

22:49 it or monitor it?

22:50 Everything in Dagster is written as code.

22:52 The UI reads that code and it interprets it as a DAG and then it displays that for you.

22:57 There are some things you do with the UI, like you can materialize assets, you can make

23:00 them run, you can do backfills, you can view metadata, you can sort of enable and disable

23:06 schedules.

23:07 But the core, we really believe this is Dagster, like the core declaration of how things are

23:11 done, it's always done through code.

23:13 Okay, excellent.

23:14 So when you say materialize, maybe I have an asset, which is really a Python function

23:19 I wrote that goes and pulls down a CSV file.

23:22 The materialize would be, I want to see kind of representative data in this, in the UI.

23:28 And so I could go, all right, I think this is right.

23:30 Let's keep passing it down.

23:31 Is that what that means?

23:33 Materialize really means just run this particular asset, make this asset new again, fresh again,

23:37 right?

23:38 As part of that materialization, we sometimes output metadata.

23:41 And you can kind of see this on the right, if you're looking at the screen here, where

23:44 we talk about what the timestamp was, the URL, there's a nice little graph of like number

23:49 of rows over time.

23:51 All that metadata is something you can emit, and we emit some ourselves by default with

23:56 the framework.

23:57 And then as you materialize these assets, as you run that asset over and over again,

24:00 over time, we capture all that.

24:01 And then you can really get a nice overview of, you know, this asset's lifetime, essentially.

24:06 Nice.

24:07 I think the asset, the metadata is really pretty excellent, right?

24:10 Over time, you can see how the data's grown and changed.

24:13 And yeah, the metadata is really powerful.

24:16 And it's one of the nice benefits of being in this asset world, right?

24:19 Because you don't really want metadata on, like, this task that ran, you want to know

24:23 like this table that I created, how many rows has it had every single time it's run?

24:27 If that number drops by like 50%, that's a big problem.

24:31 Conversely, if the runtime is slowly increasing every single day, you might not notice it,

24:35 but over a month or two, it went from a 30 second pipeline to 30 minutes.

24:40 Maybe there's like a great place to start optimizing that one specific asset.
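
The two checks mentioned here, a sudden drop in row count and a slow creep in runtime, can be sketched over a list of per-run records. This is plain Python, not Dagster's metadata API; the `Materialization` record and both helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Materialization:
    """Metadata captured for one run of an asset (hypothetical shape)."""

    row_count: int
    runtime_seconds: float


def row_count_dropped(history, threshold=0.5):
    """True if the latest run lost at least `threshold` of the previous run's rows."""
    if len(history) < 2 or history[-2].row_count == 0:
        return False
    previous, latest = history[-2], history[-1]
    return latest.row_count <= previous.row_count * (1 - threshold)


def runtime_growth(history):
    """Ratio of the latest runtime to the first recorded runtime."""
    if not history or history[0].runtime_seconds == 0:
        return 1.0
    return history[-1].runtime_seconds / history[0].runtime_seconds


# Made-up history: the table shrinks sharply and the run goes from 30 s to 30 min.
history = [
    Materialization(row_count=1000, runtime_seconds=30),
    Materialization(row_count=1010, runtime_seconds=45),
    Materialization(row_count=480, runtime_seconds=1800),
]

dropped = row_count_dropped(history)
growth = runtime_growth(history)
```

Both signals are only available because the history is attached to the table itself rather than to whichever task happened to produce it.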

24:43 Right.

24:44 And what's cool, if it's just Python code, you know how to optimize that probably, right?

24:48 Hopefully, yeah.

24:49 Well, as much as you're going to, yeah, you got, you have all the power of Python and

24:54 you should be able to, as opposed to it's deep down inside some framework that you don't

24:57 really control.

24:58 Exactly.

24:59 Yeah.

25:00 You use Python, you can benchmark it.

25:01 There's probably, you probably knew you didn't write it that well when you first started

25:04 and you can always find ways to improve it.

25:07 So this UI is something that you can just run locally, kind of like Jupyter.

25:11 100%.

25:12 Just type dagster dev and then you get the full UI experience.

25:16 You get to see the runs, all your assets.

25:17 Is it a web app?

25:18 It is.

25:19 Yeah.

25:20 It's a web app.

25:21 There's a Postgres backend.

25:22 And then there's a couple of services that run the web server, the GraphQL, and then

25:25 the workers.

25:26 Nice.

25:27 Yeah.

25:28 So pretty serious web app, it sounds like, but you probably just run it all.

25:31 Yeah.

25:33 Just something you run all probably containers or something you just fire up when you download

25:37 it, right?

25:38 Locally, it doesn't even use containers.

25:39 It's just all pure Python for that.

25:43 But once you deploy, yeah, I think you might want to go down the container route, but it's

25:46 nice not having to have Docker just to run a simple test deployment.

25:50 Yeah.

25:51 I guess not everyone's machine has that, for sure.

25:53 So question from the audience here.

25:56 Jazzy asks, does it hook into AWS in particular?

25:59 Is it compatible with existing pipelines like ingestion lambdas or transform lambdas?

26:04 Yeah, you can hook into AWS.

26:06 So we have some AWS integrations built in.

26:09 Like I mentioned before, there's nothing stopping you from importing Boto3 and doing anything

26:13 really you want.

26:14 So a very simple use case.

26:15 Like let's say you already have an existing transformation being triggered in AWS through

26:20 some lambda.

26:21 You could just model that within Dagster and say, you know, trigger that Lambda with Boto3.

26:25 Okay.

26:26 Then the asset itself is really that representation of that pipeline, but you're not actually running

26:31 that code within Dagster itself.

26:32 That's still occurring on the AWS framework.

26:34 And that's a really simple way to start adding a little bit of observability and orchestration

26:38 to existing pipelines.

26:40 Okay.

26:41 That's pretty cool because now you have this nice UI and this metadata and this history,

26:45 but it's someone else's cloud.

26:47 Exactly.

26:48 Yeah.

26:49 And you can start to pull more information in there.

26:50 And over time you might decide, you know, this, you know, lambda that I had, it's starting

26:54 to get out of hand.

26:55 I want to kind of break it apart into multiple assets where I want to sort of optimize it

26:59 a little way and Dagster can help you along that.

27:01 Yeah.

27:02 Excellent.

27:03 How do you set up like triggers or observability inside Dagster?

27:08 Like Jazzy's asking about S3, but like in general, right?

27:11 If a row is entered into a database, something is dropped in a blob storage or the date changes.

27:16 I don't know.

27:17 Yeah.

27:18 Those are great questions.

27:19 You have a lot of options.

27:20 In Dagster, we do model every asset with a couple little flags, I think, that are really

27:24 useful to think about.

27:25 One is whether the code of that particular asset has changed, right?

27:28 And then the other one is whether anything upstream of that asset has changed.

27:32 And those two things really power a lot of automation functionality that we can get downstream.

27:38 So let's start with the S3 example, it's the easiest to understand.

27:41 You have a bucket and there is a file that gets uploaded every day.

27:46 You don't know what time that file gets uploaded.

27:48 You don't know when it'll be uploaded, but you know at some point it will be.

27:51 In Dagster, we have a thing called a sensor, which you can just connect to an S3 location.

27:55 You can define how it looks into that file or into that folder.

27:59 And then you would just poll every 30 seconds until something happens.

28:02 When that something happens, that triggers sort of an event.

28:06 And that event can trickle at your will downstream to everything that depends on it as you sort

28:10 of connect to these things.

28:12 So it gets you away from this like, "Oh, I'm going to schedule something to run every

28:15 hour.

28:16 Maybe the data will be there, but maybe it won't." And you can have a much more event-based workflow.

28:20 When this file runs, I want everything downstream to know that this data has changed.

28:25 And as sort of data flows through these systems, everything will sort of work its way down.
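
The polling idea behind a sensor can be sketched in plain Python. This is only an illustration under stated assumptions: a local folder stands in for the S3 location, and `check_for_new_files` is a hypothetical helper, not Dagster's sensor API. Each call represents one "tick"; an orchestrator would call it on an interval (the episode mentions every 30 seconds) and launch downstream work for whatever it returns.

```python
import tempfile
from pathlib import Path


def check_for_new_files(folder, seen):
    """One sensor tick: return names of files that appeared since the last tick.

    `seen` is a set that plays the role of the sensor's cursor; it is updated in place.
    """
    current = {p.name for p in Path(folder).iterdir() if p.is_file()}
    new_files = sorted(current - seen)
    seen.update(new_files)
    return new_files


# Small demonstration with a temporary folder standing in for the bucket.
with tempfile.TemporaryDirectory() as bucket:
    cursor = set()
    print(check_for_new_files(bucket, cursor))  # nothing uploaded yet
    (Path(bucket) / "daily_export.csv").write_text("id,amount\n1,9.99\n")
    print(check_for_new_files(bucket, cursor))  # the new file is reported once
    print(check_for_new_files(bucket, cursor))  # no change, so nothing to trigger
```

Because the cursor remembers what has already been seen, the downstream work fires once per new file rather than once per tick.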

28:28 Yeah, I like it.

29:59 The sensor concept is really cool because I'm sure that there's a ton of cloud machines

30:05 people provisioned just because this thing runs every 15 minutes, that runs every 30

30:10 minutes, and you add them up and in aggregate, we need eight machines just to handle the

30:14 automation, rather than, you know, because they're hoping to catch something without

30:18 too much latency, but maybe like that actually only changes once a week.

30:22 Exactly.

30:23 And I think that's where we have to like sometimes step away from the way we're so used to thinking

30:27 about things, and I'm guilty of this.

30:30 When I create a data pipeline, my natural inclination is to create a schedule where

30:33 it's a, is this a daily one?

30:34 Is this weekly?

30:35 Is this monthly?

30:36 But what I'm finding more and more is when I'm creating my pipelines, I'm not adding

30:39 a schedule.

30:40 I'm using Dagster's auto-materialize policies, and I'm just telling it, you figure it out.

30:45 I don't have to think about schedules.

30:46 Just figure out when the things should be updated.

30:49 When its, you know, parents have been updated, you run.

30:51 When the data has changed, you run.

30:53 And then just like figure it out and leave me alone.

30:55 Yeah.

30:56 And it's worked pretty well for me so far.

30:57 I think it's great.

30:58 I have a search, refresh the search index on the various podcast pages that runs and

31:04 it runs every hour, but the podcast ships weekly, right?

31:08 But I don't know which hour it is.

31:10 And so it seems like that's enough latency, but it would be way better to put just a little

31:14 bit of smarts.

31:15 Like what was the last date that anything changed?

31:18 Was that since the last time you saw it?

31:20 Maybe we'll just leave that alone, you know?

31:21 But yeah, you're starting to inspire me to go write more code, but pretty cool.

31:26 All right.

31:27 So on the homepage at Dagster.io, you've got a nice graphic that shows you both how to

31:33 write the code, like some examples of the code, as well as how that looks in the UI.

31:38 And one of them is called, says to launch backfills.

31:41 What is this backfill thing?

31:43 Oh, this is my favorite thing.

31:44 Okay.

31:45 So when you first start your data journey as a data engineer, you sort of have a pipeline

31:50 and you build it and it just runs on a schedule and that's fine.

31:54 What you soon find is, you know, you might have to go back in time.

31:58 You might say, I've got this data set that updates monthly.

32:01 Here's a great example, AWS cost reporting, right?

32:05 AWS will send you some data around, you know, all your instances and your S3 bucket, all

32:10 that.

32:11 And it'll update that data every day or every month or whatever have you.

32:14 Due to some reason, you've got to go back in time and refresh data that AWS updated

32:18 due to some like discrepancy.

32:20 Backfill is sort of how you do that.

32:21 And it works hand in hand with this idea of a partition.

32:24 A partition is sort of how your data is naturally organized.

32:28 And it's like a nice way to represent that natural organization.

32:31 Has nothing to do with like the fundamental way, how often you want to run it.

32:35 It's more around like, I've got a data set that comes in once a month, it's represented

32:39 monthly.

32:40 It might be updated daily, but it's the representation of the data is monthly.

32:43 So I will partition it by month.

32:44 It doesn't have to be dates.

32:46 It could be strings.

32:47 It could be a list.

32:48 You could have a partition for every company or every client or every domain you have.

32:55 Whatever you sort of think is a natural way to think about breaking apart that pipeline.

33:00 And once you do that partition, you can do these nice things called backfills, which

33:03 says instead of running this entire pipeline on all my data, I want you to pick that one

33:08 month where data went wrong or that one month where data was missing and just run the partition

33:13 on that range.

33:14 And so you limit compute, you save resources and get a little bit more efficient.

33:18 And it's just easier to like, think about your pipeline because you've got this natural

33:22 built in partitioning system.
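
To make the partition and backfill idea concrete, here is a minimal sketch in plain Python. It does not use Dagster's partition classes; `monthly_partitions`, `backfill`, and the dictionary used as storage are all hypothetical stand-ins. The point it shows is that a backfill recomputes only the named partitions and leaves the rest untouched.

```python
def monthly_partitions(start, end):
    """Return partition keys like '2023-11' from `start` to `end`, inclusive.

    `start` and `end` are (year, month) tuples.
    """
    keys = []
    year, month = start
    while (year, month) <= end:
        keys.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys


def backfill(compute, store, partition_keys):
    """Recompute only the given partitions, overwriting their stored results."""
    for key in partition_keys:
        store[key] = compute(key)
    return store


# Initial load: every month is computed once.
all_months = monthly_partitions((2023, 11), (2024, 2))
store = backfill(lambda key: f"report for {key} (v1)", {}, all_months)

# The vendor re-uploads November, so only that partition is rerun.
backfill(lambda key: f"report for {key} (corrected)", store, ["2023-11"])
print(store["2023-11"])
print(store["2023-12"])
```

Keys do not have to be dates; the same `backfill` function works if the keys are client names or domains.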

33:23 Excellent.

33:24 So maybe you missed some important event.

33:27 Maybe your automation went down for a little bit, came back up.

33:31 You're like, Oh no, we've, we've missed it.

33:33 Right.

33:34 But you don't want to start over for three years.

33:37 So maybe we could just go and run the last day's

33:39 worth of.

33:40 Exactly.

33:41 Or another one would be your vendor says, Hey, by the way, we actually screwed up.

33:45 We uploaded this file from two months ago, but the numbers were all wrong.

33:48 Yeah.

33:49 We've uploaded a new version to that destination.

33:51 Can you update your data set?

33:53 One way is to recompute the entire universe from scratch.

33:56 But if you've partitioned things, you can say, no, limit that to just this one partition

34:00 for that month and that one partition can trickle down all the way to all your other

34:03 assets that depend on that.

34:05 Do you have to pre decide, do you have to think about this partitioning beforehand or

34:10 can you do it retroactively?

34:11 You could do it retroactively.

34:12 And I have done that before as well.

34:14 It really depends on, on where you're at.

34:16 I think if it's your first asset ever,

34:19 Probably don't bother with partitions, but it really isn't a lot of work to get them

34:23 to get them started.

34:24 Okay.

34:25 Yeah.

34:26 Really neat.

34:27 I like a lot of the ideas here.

34:28 I like that.

34:29 It's got this visual component that you can see what's going on, inspect it.

34:33 Just so you can debug runs or what happens there.

34:36 Like obviously when you're pulling data from many different sources, maybe it's not your

34:40 data you're taking in.

34:41 Fields could vanish.

34:42 It can be the wrong type.

34:43 Systems can go down.

34:44 I'm sure, sure the debugging is interesting.

34:47 So what's, it looks a little bit kind of like a web browser debug dev tools thing.

34:52 So for the record, my code never fails.

34:54 I've never had a bug in my life, but for the one that you have.

34:57 Yeah.

34:58 Well, mine doesn't either.

34:59 I only do it to make an example for others, yes.

35:03 If I do it's intentional, of course.

35:05 Yeah.

35:06 To humble myself a little bit.

35:08 Exactly.

35:09 This view is one of my favorite, I mean, so many favorite views, but this is, it's actually

35:13 really fun to watch, watch this actually run when you execute this pipeline.

35:16 But really like, let's go back to, you know, the world before orchestrators, we use cron,

35:22 right?

35:23 We'd have a bash script that would do something and we'd have a cron job that said, make sure

35:26 this thing runs.

35:27 And then hopefully it was successful, but sometimes it wasn't.

35:31 And that "sometimes it wasn't", that's always been the problem, right?

35:34 It's like, well, what do I do now?

35:35 How do I know why it failed?

35:36 What was, when did it fail?

35:38 You know, what, at what point or what steps did it fail?

35:41 That's really hard to do.

35:42 But this debugger really is a structured log of every step that's been going on through

35:47 your pipeline, right?

35:48 So in this view, there's three assets that we can kind of see here.

35:52 One is called users.

35:53 One is called orders and one is to run dbt.

35:56 So presumably there's these two, you know, tables that are being updated and then a dbt

36:00 job.

36:01 It looks like that's being updated at the very end.

36:03 Once you execute this pipeline, all the logs are captured from each of those assets.

36:07 So you can manually write your own logs.

36:10 You have access to a Python logger and you can use your info, your error, whatever have

36:14 you in log output that way.

36:16 And it'll be captured in a structured way, but it'll also capture logs from your integrations.

36:21 So if you're using dbt, we capture those logs as well.

36:24 You can see it processing every single asset.

36:26 So if anything does go wrong, you can filter down and understand at what step, at what

36:31 point did something go wrong.

36:33 - That's awesome.

36:34 And just the historical aspect, cause just going through logs, especially multiple systems

36:40 can be really, really tricky to figure out what's the problem, what actually caused this

36:44 to go wrong, but come back and say, oh, it crashed, pull up the UI and see, all right,

36:49 well show me, show me what this run did, show me what this job did.

36:53 And it seems like it's a lot easier to debug than your standard web API or something like

36:57 that.

36:58 - Exactly.

36:59 You can click on any of these assets and get that metadata that we had earlier as well.

37:02 If you know, one step failed and it's kind of flaky, you can just click on that one step

37:06 and say, just rerun this.

37:08 Everything else is fine.

37:09 No need to restart from scratch.

37:10 - Okay.

37:11 And it'll keep the data from before, so you don't have to rerun that.

37:15 - Yeah.

37:16 I mean, it depends on how you built the pipeline.

37:18 We like to build idempotent pipelines, is how we sort of talk about it in the data engineering

37:21 landscape, right?

37:23 So you should be able to run something multiple times and not break anything in a perfect

37:27 world.

37:28 That's not always possible, but ideally, yes.

37:30 And so we can presume that if users completed successfully, then we don't have to run that

37:34 again because that data was persisted, you know, database, S3, somewhere.

37:38 And if orders was the one that was broken, we can just only run orders and not have to

37:43 worry about rewriting the whole thing from scratch.

37:45 - Excellent.

37:46 So idempotent, for people who maybe don't know: you run it once or you perform the operation

37:50 once or you perform it 20 times, same outcome, shouldn't have side effects, right?

37:55 - That's the idea.

37:57 - Easier said than done sometimes.

37:58 - It sure is.

37:59 - Sometimes it's easy, sometimes it's very hard, but the more you can build pipelines

38:03 that way, the easier your life becomes in many ways.
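
One common way to get the "run it once or 20 times, same outcome" property is to write rows keyed by a primary key so that a rerun overwrites instead of appending. The sketch below uses Python's built-in sqlite3 module; the `orders` table and `load_orders` function are hypothetical examples, not something from the episode.

```python
import sqlite3


def load_orders(conn, rows):
    """Idempotent load: rerunning with the same rows leaves the table unchanged."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, amount REAL)"
    )
    # INSERT OR REPLACE overwrites a row with the same primary key instead of duplicating it.
    conn.executemany(
        "INSERT OR REPLACE INTO orders (order_id, amount) VALUES (?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
batch = [(1, 9.99), (2, 24.50)]

load_orders(conn, batch)
load_orders(conn, batch)  # a retry after a flaky failure does no harm

print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # still 2 rows
```

A plain `INSERT` in the same function would double the row count on every retry, which is the kind of side effect the conversation warns about.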

38:07 - Exactly.

38:08 Not always, but generally true for programming as well, right?

38:10 If you talk to functional programming people, they'll say like, it's an absolute, but.

38:14 - Yes, functional programmers love this kind of stuff.

38:17 And it actually does lend itself really well to data pipelines.

38:21 Data pipelines, unlike maybe some of the software engineering stuff, it's a little bit different

38:24 in that the data changing is what causes often most of the headaches, right?

38:30 It's less so the actual code you write, but more the expectation tends to change so frequently

38:36 and so often in new and novel, interesting ways that you would often never expect.

38:41 And so the more you can sort of make that function so pure that you can provide any

38:46 sort of dataset and really test really easily these expectations when they get broken, the

38:51 easier it is to sort of debug these things and build on them in the future.

38:55 - Yeah.

38:56 And cache them as well.

38:57 - Yes, it's always nice.

38:58 - Yeah.

38:59 So speaking of that kind of stuff, like what's the scalability story?

39:02 If I've got some big, huge, complicated data pipeline, can I parallelize them and have

39:08 them run multiple pieces?

39:10 Like if there's different branches or something like that?

39:12 - Yeah, exactly.

39:13 That's one of the key benefits I think in writing your assets in this DAG way, right?

39:20 Anything that is parallelizable will be parallelized.

39:22 Now sometimes you might want to put limits on that.

39:24 Sometimes too much parallelization is bad.

39:26 Your poor little database can't handle it.

39:28 And you can say, you know, maybe a concurrency limit on this one just for today is worth

39:32 putting, or if you're hitting an API for an external vendor, they might not appreciate

39:36 10,000 requests a second on that one.

39:39 So maybe you would slow it down.

39:40 But in this case-

39:41 - Or rate limiting, right?

39:42 You can run into too many requests and then your stuff crashes, then you got to start.

39:45 It can be a whole thing.

39:46 - It can be a whole thing.

39:47 There's memory concerns, but let's pretend the world is simple.

39:51 Anything that can be parallelized will be through Dagster.

39:54 And that's really the benefit of writing these DAGs is that there's a nice algorithm for

39:57 determining what that actually looks like.
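
The episode does not say which algorithm Dagster uses, but a standard way to find what can run in parallel in a DAG is a topological sort that yields "ready" nodes in batches. Here is a sketch using Python's standard-library graphlib (Python 3.9+), with the three assets mentioned earlier (users, orders, and a dbt run); the name `run_dbt` and the helper `parallel_batches` are assumptions for illustration.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies):
    """Group DAG nodes into batches; everything in one batch can run at the same time.

    `dependencies` maps each node to the set of nodes it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes whose upstream work is finished
        batches.append(ready)
        sorter.done(*ready)
    return batches


assets = {
    "users": set(),
    "orders": set(),
    "run_dbt": {"users", "orders"},
}
print(parallel_batches(assets))  # [['orders', 'users'], ['run_dbt']]
```

The first batch holds the two independent tables, and the dbt step waits until both are done.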

39:59 - Yeah.

40:00 I guess if you have a diamond shape or any sort of split, right?

40:03 Those two things now become, 'cause it's acyclic, they can't turn around and then

40:07 eventually depend on each other again.

40:09 So that's a perfect chance to just go fork it out.

40:12 - Exactly.

40:13 And that's kind of where partitions are also kind of interesting.

40:15 If you have a partitioned asset, you could take your dataset partitioned into five, you

40:19 know, buckets and run all five partitions at once, knowing full well that because you've

40:23 written this in an idempotent and partitioned way, that the first pipeline will only operate

40:28 on apples and the second one only operates on bananas.

40:32 And there is no commingling of apples and bananas anywhere in the pipeline.

40:35 - Oh, that's interesting.

40:36 I hadn't really thought about using the partitions for parallelism, but of course.

40:39 - Yeah.

40:40 It's a fun little way to break things apart.

40:43 - So if we run this on the Dagster cloud or even on our own, is this pretty much automatic?

40:49 We don't have to do anything?

40:51 I think Dagster just looks at it and says, this looks parallelizable and it'll go or?

40:55 - That's right.

40:56 Yeah.

40:57 As long as you've got the full deployment, whether it's OSS or cloud, Dagster will basically

41:00 parallelize it for you, as much as possible.

41:02 You can set global concurrency limits.

41:04 So you might say, you know, 64 is more than enough, you know, parallelization that I need,

41:09 or maybe I want less because I'm worried about overloading systems, but it's really up to

41:13 you.
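
The idea of a concurrency limit can be sketched with the standard library: a thread pool with a fixed number of workers never runs more than that many tasks at once. This is not how Dagster is configured; `run_with_limit` is a hypothetical helper that also records the peak number of simultaneous tasks so the cap can be checked.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_limit(tasks, limit):
    """Run callables in parallel, at most `limit` at a time.

    Returns (results in input order, peak number of tasks running at once).
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def tracked(task):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            return task()
        finally:
            with lock:
                running -= 1

    # max_workers is the concurrency limit: extra tasks wait in the queue.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        results = list(pool.map(tracked, tasks))
    return results, peak


def make_task(n):
    def task():
        time.sleep(0.01)  # stand-in for a database query or API call
        return n * n
    return task


results, peak = run_with_limit([make_task(n) for n in range(8)], limit=2)
print(results)
print("peak concurrency:", peak)
```

Lowering `limit` trades total runtime for a gentler load on the database or external API being called.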

41:14 - Putting this on a $10 server, please don't kill it.

41:19 First respect that it's somewhat wimpy, but that's okay.

41:21 - It'll get the job done.

41:23 All right.

41:24 I want to talk about some of the tools and some of the tools that are maybe at play here

41:29 when working with Dagster and some of the trends and stuff.

41:31 But before that, maybe speak to where you could see people adopt a tool like Dagster,

41:37 but they generally don't.

41:39 They don't realize like, oh, actually there's a whole framework for this, right?

41:43 Like I could, sure I could go and build just an HTTP server and hook into the request and

41:49 start writing to it.

41:50 But like, maybe I should use Flask or FastAPI.

41:53 Like there's these frameworks that we really naturally adopt for certain situations like

41:58 APIs and others, background jobs, data pipelines, where I think there's probably a good chunk

42:04 of people who could benefit from stuff like this, but they just don't think they need

42:07 a framework for it.

42:09 Like cron is enough.

42:10 - Yeah, it's funny because sometimes cron is enough.

42:13 And I don't want to encourage people not to use cron, but think twice at least is what

42:18 I would say.

42:19 So probably the first like trigger for me of thinking of, you know, is Dagster a good

42:24 choice is like, am I trying to ingest data from somewhere?

42:26 That's something that fails.

42:28 Like I think we just can accept that, you know, if you're moving data around, the data

42:32 source will break, the expectations will change.

42:35 You'll need to debug it.

42:36 You'll need to rerun it.

42:37 And doing that in cron is a nightmare.

42:39 So I would say definitely start to think about an orchestration system.

42:43 If you're ingesting data, if you have a simple cron job that sends one email, like you're

42:46 probably fine.

42:47 I don't think you need to implement all of Dagster just to do that.

42:51 But the closer you get to data pipelining, I think the better your life will be if you're

42:58 not trying to debug an obtuse process that no one really understands six months from

43:03 now.

43:04 - Excellent.

43:05 All right, maybe we could touch on some of the tools that are interesting to see people

43:08 using.

43:09 You talked about DuckDB and dbt, a lot of Ds starting here, but give us a sense of like

43:15 some of the supporting tools you see a lot of folks using that are interesting.

43:19 - Yeah, for sure.

43:20 I think in the data space, probably dbt is one of the most popular choices and dbt in

43:27 many ways, it's nothing more than a command line tool that runs a bunch of SQL in a DAG

43:33 as well.

43:34 So there's actually a nice fit with Dagster and dbt together.

43:37 dbt is really used by people who are trying to model that business process using SQL against

43:44 typically a data warehouse.

43:45 So if you have your data in, for example, a Postgres, a Snowflake, Databricks, Microsoft

43:51 SQL, these types of data warehouses, generally you're trying to model some type of business

43:56 process and typically people use SQL to do that.

44:00 Now you can do this without dbt, but dbt has provided a nice clean interface to doing so.

44:06 It makes it very easy to connect these models together, to run them, to have a development

44:10 workflow that works really well.

44:12 And then you can push it to prod and have things run again in production.

44:15 So that's dbt.

44:17 We find it works really well.

44:18 And a lot of our customers are actually using dbt as well.

44:21 There's DuckDB, which is a great, it's like the SQLite for columnar databases, right?

44:27 It's in process, it's fast, it's written by the Dutch.

44:30 There's nothing you can't like about it.

44:32 It's free.

44:33 We love that.

44:34 It's a little bit more simple in Python itself.

44:37 It does.

44:38 It's so easy.

44:39 Yes, exactly.

44:40 The Dutch have given us so much and they've asked nothing of us.

44:42 So I'm always very thankful for them.

44:44 It's fast.

44:45 It's so fast.

44:46 It's like if you've ever used pandas for processing large volumes of data, you've occasionally

44:51 hit memory limits or inefficiencies in doing these large aggregates.

44:56 I won't go into all the reasons of why that is, but DuckDB sort of changes that because

45:01 it's a fast serverless sort of C++ written tooling to do really fast vectorized work.

45:09 And by that, I mean, like it works on columns.

45:11 So typically in like SQLite, you're doing transactions, you're doing single row updates,

45:15 writes, inserts, and SQLite is great at that.

45:18 Where typical transactional databases fail or aren't as powerful is when you're doing

45:23 aggregates, when you're looking at an entire column, right?

45:26 Just the way they're architected.

45:27 If you want to know the average, the median, the sum of some large number of columns, and

45:32 you want to group that by a whole bunch of things, you want to know the first date someone

45:36 did something and the last one, those types of vectorized operations, DuckDB is really,

45:41 really fast at doing.

45:42 And it's a great alternative to, for example, pandas, which can often hit memory limits

45:48 and be a little bit slow in that regard.

45:50 Yeah, it looks like it has some pretty cool aspects, transactions, of course, but it also

45:54 says direct Parquet, CSV, and JSON querying.

45:59 So if you've got a CSV file hanging around and you want to ask questions about it or

46:04 JSON or some of the data science stuff through Parquet, you know, turn an indexed, proper query

46:09 engine against it.

46:10 Don't just use a dictionary or something, right?

46:12 Yeah, it's great for reading CSV, zip files, tar files, Parquet, partitioned Parquet files,

46:19 all that stuff that usually was really annoying to do and operate on, you can now install

46:23 DuckDB.

46:24 It's got a great CLI too.

46:25 So before you go out and like program your entire pipeline, you just run DuckDB and you

46:30 can start writing SQL against CSV files and all this stuff to really understand your data

46:35 and just really see how quick it is.

46:37 I used it on a bird dataset that I had as an example project and there was, you know,

46:43 millions of rows and I was joining them together and doing massive group bys and it was done

46:47 in like seconds.

46:48 And it's just hard for me to believe that it was even correct because it was so quick.

46:52 So it's wonderful.

46:53 I must have done that wrong somehow.

46:55 Because it's done, it shouldn't be done.

46:58 Yeah.

46:59 And the fact it's in process means there's not a server for you to babysit,

47:03 patch, make sure it's still running.

47:06 It's accessible, but not too accessible.

47:08 All that, right?

47:09 It's a pip install away, which is always, we love that, right?

47:12 Yeah, absolutely.

47:13 You mentioned, I guess I mentioned Parquet, but also Apache Arrow seems like it's making

47:18 its way into a lot of, a lot of different tools and sort of foundational, high-performance

47:22 in-memory processing.

47:25 Have you used this any?

47:26 I've used it, especially through like working through different languages.

47:30 So moving data between Python and R is where I last used this.

47:34 I think Arrow's great at that.

47:35 I believe Arrow is like the, underneath some of the Rust-to-Python as well.

47:41 It's working there.

47:42 So typically I don't use Arrow like directly myself, but it's in many of the tooling I

47:46 use.

47:47 It's a great product.

47:48 And like so much of the ecosystem is now built on, on Arrow.

47:52 Yeah.

47:53 I think a lot of it is, I feel like the first time I heard about it was through Polars.

47:55 That's right.

47:56 Yeah.

47:57 I'm pretty sure, which is another Rust story for kind of like pandas, but a little bit

48:03 more fluent, lazy API.

48:05 Yes.

48:06 We live in such great times to be honest.

48:07 So Polars is Python bindings for Rust, I believe is kind of how I think about it.

48:13 It does all the transformation in Rust, but you have this Python interface to it and

48:17 it makes things again, incredibly fast.

48:20 I would say similar in speed to DuckDB.

48:22 They both are quite comparable sometimes.

48:24 Yeah.

48:25 It also seems to have vectorized and columnar processing and all that kind of stuff.

48:29 It's pretty incredible.

48:31 So not a drop in replacement for pandas, but if you have the opportunity to use it and

48:36 you don't need to use the full breadth of what pandas offers, because pandas is quite

48:39 a huge package.

48:40 There's a lot it does, but if you're just doing simple transforms, I think Polars is

48:44 a great option to explore.

48:45 Yeah, I talked to Ritchie Vink, who is part of that.

48:49 And I think they explicitly chose to not try to make it a drop in replacement for pandas,

48:54 but try to choose an API that would allow the engine to be smarter and go like, I see

48:58 you're asking for this, but the step before you wanted this other thing.

49:02 So let me do that transformation all in one shot.

49:04 And a little bit like a query optimization engine.

49:07 What else is out there?

49:08 We've got time for just a couple more.

49:10 If there's anything that you're like, Oh yeah, people use this all the time.

49:12 Especially the databases you've said, Postgres, Snowflake, et cetera.

49:16 Yeah, there's so much.

49:17 So another little one I like, it's called DLT, DLT hub.

49:21 It's getting a lot of traction as well.

49:23 And what I like about it is how lightweight it is.

49:25 I'm such a big fan of lightweight tooling.

49:28 That's not, you know, massive frameworks.

49:29 Loading data is I think still kind of yucky in many ways.

49:32 It's not fun.

49:33 And DLT makes it a little bit simpler and easier to do so.

49:36 So that's what I would recommend people just to look into if you got to either ingest data

49:41 from some API, some website, some CSV file.

49:45 It's a great way to do that.

49:47 It claims it's the Python library for data teams loading data into unexpected places.

49:52 Very interesting.

49:53 Yes, that's great.

49:54 Yeah, this is, this looks cool.

49:56 All right.

49:57 Well, I guess maybe let's talk about, let's talk business and then we can talk about what's

50:01 next and then we'll probably be out of time.

50:03 I'm always fascinated.

50:04 I think there's starting to be a bit of a blueprint for this, but companies that take

50:09 a thing, they make it and they give it away and then they have a company around it.

50:12 And congratulations to you all for doing that.

50:14 Right.

50:15 And a lot of it seems to kind of center around the open core model, which I don't know if

50:20 that's exactly how you would characterize yourself, but yeah, maybe you should talk

50:24 about the business side.

50:25 Because I know there's many successful open source projects that don't necessarily result

50:29 in full-time jobs or companies if people were to want that.

50:32 It's a really interesting place.

50:34 And I don't think it's one that anyone has truly figured out well, where I can say this is

50:39 the way forward for everyone, but it is something we're trying.

50:42 And I think for Dagster, I think it's working pretty well.

50:44 And what I think is really powerful about Dagster is like the open source product is

50:48 really, really good.

50:49 And it hasn't really been limited in many ways in order to drive like cloud product

50:54 consumption.

50:55 We really believe that there's actual value in that separation of these things.

50:58 There are some things that we just can't do in the open source platform.

51:01 For example, there's pipelines on cloud that involve ingesting data through our own systems

51:06 in order to do reporting, which just doesn't make sense to do on the open source system.

51:11 It makes the product way too complex.

51:13 But for the most part, I think Dagster open source, we really believe that just getting

51:16 it in the hands of developers is the best way to prove the value of it.

51:19 And if we can build a business on top of that, I think we're all super happy to do so.

51:23 It's nice that we get to sort of drive both sides of it.

51:27 To me, that's one of the more exciting parts, right?

51:29 A lot of the development that we do in Dagster open source is driven by people who are paid

51:35 through what happens on Dagster cloud.

51:37 And I think from what I can tell, there's no better way to build open source product

51:41 than to have people who are adequately paid to develop that product.

51:45 Otherwise it can be a labor of love, but one that doesn't last for very long.

51:48 Yeah.

51:49 Whenever I think about building software, there's 80% of it that's super exciting and

51:52 fun, 10% and then there's that little sliver of like really fine polish that if it's not

51:58 just your job to make that thing polished, you're just for the most part, just not going

52:03 to polish that bit, right?

52:04 It's tough.

52:05 UI, design, support.

52:08 There's all these things that go into making a software like really extraordinary.

52:12 That's really, really tough to do.

52:14 And I think I really like the open source business model.

52:17 I think for me being able to just try something, not having to talk to sales and being able

52:21 to just deploy locally and test it out and see if this works.

52:24 And if I choose to do so, deploy it in production, or if I bought the cloud product and I don't

52:30 like the direction that it's going, I can even go to open source as well.

52:34 That's pretty compelling to me.

52:35 Yeah, for sure it is.

52:37 And I think the more moving pieces of infrastructure, more uptime you want and all those types of

52:43 things, the more somebody who's maybe a programmer, but not a DevOps infrastructure person, but

52:49 needs to have it there, right?

52:50 Like that's an opportunity as well, right?

52:52 For you to say, look, you can write the code.

52:55 We made it cool for you to write the code, but you don't have to like get notified when

52:59 the server's down or whatever.

53:00 Like, we'll just take care of that for you.

53:01 That's pretty awesome.

53:02 Yeah.

53:03 And it's efficiencies of scale as well, right?

53:04 Like we've learned from the same mistakes over and over again, so you don't have to, which

53:08 is nice.

53:09 I don't know how many people want to maintain servers, but people do.

53:13 And they're more than welcome to if that's how they choose to do so.

53:15 Yeah, for sure.

53:16 All right.

53:17 Just about out of time.

53:18 Let's wrap up our conversation with where are things going for Dagster?

53:23 What's on the roadmap?

53:24 What are you excited about?

53:25 Oh, that's a good one.

53:26 I think we've actually published our roadmap online somewhere.

53:29 If you search Dagster roadmap, it's probably out there.

53:31 I think for the most part that hasn't changed much going into 2024, though we may update

53:36 it.

53:37 Ah, there it is.

53:38 We're really just doubling down on what we've built already.

53:40 I think there's a lot of work we can do on the product itself to make it easier to use,

53:45 easier to understand.

53:46 My team specifically is really focused around the education piece.

53:49 And so we launched Dagster University's first module, which helps you really understand

53:53 the core concepts around Dagster.

53:56 Our next module is coming up in a couple months, and that'll be around using Dagster with dbt,

54:00 which is our most popular integration.

54:02 We're building up more integrations as well.

54:04 So I built a little integration called Embedded ELT that makes it easy to ingest data.

54:09 But I want to actually build an integration with dlt as well, dltHub.

54:12 So we'll be doing that.

54:14 And there's more coming down the pipe, but I don't know how much I can say.

54:18 Look forward to an event in April where we'll have a launch event on all that's coming.

54:23 Nice.

54:24 Is that an online thing people can attend or something?

54:26 Exactly.

54:27 Yeah, there'll be some announcement there on the Dagster website on that.

54:31 Maybe I will call out one thing that's actually really fun.

54:33 It's called Dagster Open Platform.

54:35 It's a GitHub repo that we launched a couple months ago, I want to say.

54:39 We took our internal...

54:40 I should go back one more.

54:42 Sorry.

54:43 I should go back to GitHub, Dagster Open Platform on GitHub.

54:45 I have it somewhere.

54:47 Yeah.

54:48 It's here under the organization.

54:51 Yes, it should be somewhere here.

54:54 There it is.

54:55 Dagster Open Platform on GitHub.

54:57 And it's really a clone of our production pipelines.

54:59 For the most part, there's some things we've chosen to ignore because they're sensitive.

55:03 But as much as possible, we've defaulted to making it public and open.

55:06 And the whole reason behind this was because, you know, as data engineers, it's often hard

55:10 to see how other data engineers write code.

55:12 We get to see how software engineers write code quite often, but most people don't want

55:16 to share their platforms for various good reasons.

55:19 Right.

55:20 Also, there's like smaller teams or maybe just one person.

55:23 And then like those pipelines are so integrated into your specific infrastructure, right?

55:29 So it's not like, well, here's a web framework to share, right?

55:32 Like, here's how we integrate into that one weird API that we have that no one else has.

55:36 So there's no point in publishing it, right?

55:39 That's typically how it goes.

55:40 Or they're so large that they're afraid that there's like some, you know, important information

55:44 that they just don't want to take the risk on.

55:46 And so we built like something that's in the middle where we've taken as much as we can

55:49 and we've publicized it.

55:51 And you can't run this on your own.

55:52 Like it's not, that's not the point.

55:53 The point is to look at the code and see, you know, how does Dagster use Dagster and what

55:56 does that kind of look like?

55:57 Nice.

55:58 Okay.

55:59 All right.

56:00 Well, I'll put a link to that in the show notes and people can check it out.

56:01 Yeah, I guess let's wrap it up with the final call to action.

56:05 People are interested in Dagster.

56:06 How do they get started?

56:07 What do you tell them?

56:08 Oh, yeah.

56:09 Well, dagster.io is probably the greatest place to start.

56:11 You can try the cloud product.

56:13 We have free self-serve or you can try the local install as well.

56:18 If you get stuck, a great place to join is our Slack channel, which is up on our website.

56:22 There's even an Ask AI channel where you can just talk to a Slack bot that's been trained

56:27 on all our GitHub issues and discussions.

56:29 And it's surprisingly good at walking you through, you know, any debugging, any issues

56:33 or even advice.

56:34 And that's pretty excellent, actually.

56:36 Yeah.

56:37 It's real fun.

56:38 It's really fun.

56:39 It's a great community where you can just chat with us as well.

56:41 Cool.

56:42 All right.

56:43 Well, Pedram, thank you for being on the show.

56:44 Thanks for all the work on Dagster and sharing it with us.

56:47 Thank you, Michael.

56:48 You bet.

56:49 See you later.

