#454: Data Pipelines with Dagster Transcript
00:00 Do you have data that you pull from external sources or that is generated and appears at
00:05 your digital doorstep?
00:06 I bet that data needs to be processed, filtered, transformed, distributed, and much more.
00:11 One of the biggest tools to create these data pipelines with Python is Dagster.
00:16 And we're fortunate to have Pedram Navid on the show to tell us about it.
00:20 Pedram is the head of data engineering and dev rel at Dagster Labs.
00:24 And we're talking data pipelines this week here at Talk Python.
00:28 This is Talk Python to Me, episode 454, recorded January 11th, 2024.
00:48 Welcome to Talk Python to Me, a weekly podcast on Python.
00:51 This is your host, Michael Kennedy.
00:53 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,
00:58 both on fosstodon.org.
01:01 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.
01:06 We've started streaming most of our episodes live on YouTube.
01:10 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be
01:16 part of that episode.
01:19 This episode is sponsored by Posit Connect from the makers of Shiny.
01:23 Posit Connect lets you share and deploy all of your data projects that you're creating
01:26 using Python.
01:27 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards, and APIs.
01:34 Posit Connect supports all of them.
01:36 Try Posit Connect for free by going to talkpython.fm/posit, P-O-S-I-T.
01:42 And it's also brought to you by us over at Talk Python Training.
01:46 Did you know that we have over 250 hours of Python courses?
01:50 Yeah, that's right.
01:52 Check them out at talkpython.fm/courses.
01:56 Last week, I told you about our new course, Build an AI Audio App with Python.
02:01 Well, I have another brand new and amazing course to tell you about.
02:06 This time, it's all about Python's typing system and how to take the most advantage
02:10 of it.
02:11 It's a really awesome course called Rock Solid Python with Python Typing.
02:16 This is one of my favorite courses that I've created in the last couple of years.
02:20 Python type hints are really starting to transform Python, especially from the ecosystem's perspective.
02:26 Think FastAPI, Pydantic, BearType, et cetera.
02:30 This course shows you the ins and outs of Python typing syntax, of course, but it also
02:34 gives you guidance on when and how to use type hints.
02:38 Check out this four and a half hour in-depth course at talkpython.fm/courses.
02:44 Now onto those data pipelines.
02:47 Pedram, welcome to Talk Python to Me.
02:50 It's amazing to have you here.
02:51 >> Michael, great to have you.
02:52 Good to be here.
02:53 >> Yeah.
02:54 We're going to talk about data, data pipelines, automation, and boy, oh boy, let me tell you,
02:59 have I been in the DevOps side of things this week.
03:03 And I'm going to have a special, special appreciation of it.
03:07 I can tell already.
03:08 So excited to talk.
03:10 >> My condolences.
03:12 >> Indeed.
03:13 >> So before we get to that, though, before we talk about Dagster and data pipelines and
03:18 orchestration more broadly, let's just get a little bit of background on you.
03:22 Introduce yourself for people.
03:23 How'd you get into Python and data orchestration and all those things?
03:27 >> Of course, yeah.
03:28 So my name is Pedram Navid.
03:29 I'm the head of data engineering and dev rel at Dagster.
03:33 That's a mouthful.
03:34 And I've been a longtime Python user since 2.7.
03:38 And I got started with Python like I do with many things just out of sheer laziness.
03:42 I was working at a bank and there was this rote task, something involving going into
03:47 servers, opening up a text file and seeing if a patch was applied to a server.
03:52 A nightmare scenario when there's 100 servers to check and 15 different patches to confirm.
03:57 >> Yeah, so this kind of predates like the cloud and all that automation and stuff, right?
04:02 >> This was definitely before cloud.
04:04 This was like right between Python 2 and Python 3, and we were trying to figure out how to
04:07 use print statements correctly.
04:09 That's when I learned Python.
04:10 I was like, there's got to be a better way.
04:12 And honestly, I've not looked back.
04:13 I think if you look at my entire career trajectory, you'll see it's just punctuated by finding
04:18 ways to be more lazy in many ways.
04:22 >> Yeah.
04:23 Who was it?
04:24 I think it was Matthew Rocklin that had the phrase something like productive laziness
04:29 or something like that.
04:31 Like, I'm going to find a way to leverage my laziness to force me to build automation
04:36 so I never ever have to do this sort of thing again.
04:39 I got that sort of print.
04:40 >> It's very motivating to not have to do something.
04:43 And I'll do anything to not do something.
04:44 >> Yeah, yeah, yeah.
04:45 It's incredible.
04:46 And like that DevOps stuff I was talking about, just, you know, one command and there's maybe
04:51 eight or nine new apps with all their tiers redeployed, updated, resynced.
04:55 And it took me a lot of work to get there.
04:59 But now I never have to think about it again, at least not for a few years.
05:03 And it's amazing.
05:04 I can just be productive.
05:05 It's like right in line with that.
05:07 >> So what are some of the Python projects you've been, you've worked on, talked about
05:11 different ways to apply this over the years?
05:13 >> Oh, yeah.
05:14 So it started with internal, just like Python projects, trying to automate, like I said,
05:18 some rote tasks that I had.
05:20 And that accidentally becomes, you know, a bigger project.
05:23 People see it and they're like, oh, I want that too.
05:25 And so, well, now I have to build like a GUI interface because most people don't speak
05:29 Python.
05:30 And so that got me into PyGUI, I think it was called, way back when.
05:35 That was a fun journey.
05:36 And then from there, it's really taken off.
05:38 A lot of it has been mostly personal projects.
05:41 Trying to understand open source was a really big learning path for me as well.
05:46 Really being absorbed by things like SQLAlchemy and requests back when they were coming out.
05:50 Eventually, it led to more of a data engineering type of role, where I got involved with tools
05:55 like Airflow and tried to automate data pipelines instead of patches on a server.
06:01 That one day led to, I guess, making a long story short, a role at Dagster, where now I
06:06 contribute a little bit to Dagster.
06:08 I work on Dagster, the core project itself, but I also use Dagster internally to build
06:11 our own data pipelines.
06:13 I'm sure it's interesting to see how you all both build Dagster and then consume Dagster.
06:19 Yeah, it's been wonderful.
06:21 I think there's a lot of great things about it.
06:23 One is like getting access to Dagster before it's fully released, right?
06:27 Internally, we dog food, new features, new concepts, and we work with the product team,
06:33 the engineering team, and say, "Hey, this makes sense.
06:35 This doesn't.
06:36 This works really well.
06:37 That doesn't." That feedback loop is so fast and so iterative that for me personally, being able to see
06:43 that come to fruition is really, really compelling.
06:45 But at the same time, I get to work at a place that's building a tool for me.
06:50 You don't often get that luxury.
06:51 I've worked in ads.
06:53 I've worked in insurance.
06:54 It's like banking.
06:55 These are nice things, but it's not built for me, right?
06:59 And so for me, that's probably been the biggest benefit, I would say.
07:01 Right.
07:02 If you work in some marketing thing, you're like, "You know, I retargeted myself so well
07:06 today.
07:07 You wouldn't believe it.
07:08 I really enjoyed it." I've seen the ads that I've created before, so it's a little fun, but it's not the same.
07:15 Yeah.
07:16 I've heard of people who are really, really good at ad targeting and finding groups where
07:21 they've pranked their wife or something, or just had an ad that would only show up for
07:26 their wife by running it.
07:27 It was so specific and freaked them out a little bit.
07:30 That's pretty clever.
07:31 Yeah.
07:32 Maybe it wasn't appreciated, but it is clever.
07:34 Who knows?
07:35 All right.
07:37 Well, before we jump in, you said that, of course, you built GUIs with PyGUI and those
07:43 sorts of things because people didn't speak Python back then, in the 2.7 days and whatever.
07:48 Is that different now?
07:49 Not that people speak Python, but is it different in the sense that, "Hey, I could give them
07:53 a Jupyter notebook," or, "I could give them Streamlit," or one of these things, right?
07:58 A little less of you building it and more just plugging something in?
08:01 I think so.
08:02 Like you said, it's not different in that most people probably still to this day don't
08:06 speak Python.
08:07 I know we had this movement a little bit back where everyone was going to learn SQL and
08:11 everyone was going to learn to code.
08:13 I was never that bullish on that trend because if I'm a marketing person, I've got 10,000
08:18 things to do and learning to code isn't going to be a priority ever.
08:22 So I think building interfaces for people that are easy to use and speak well to them
08:27 is always useful.
08:29 That never has gone away, but I think the tooling around it has been better, right?
08:32 I don't think I'll ever want to use PyGUI again.
08:34 Nothing wrong with the platform.
08:35 It's just not fun to write.
08:37 Streamlit makes it so easy to do that.
08:39 So it's something like retool and there's a thousand other ways now that you can bring
08:43 these tools in front of your stakeholders and your users that just wasn't possible before.
08:47 I think it's a pretty exciting time.
08:49 There are a lot of pretty polished tools.
08:51 Yeah, it's gotten so good.
08:52 Yeah.
08:53 There's some interesting ones like OpenBB.
08:54 Do you know that?
08:55 The financial dashboard thing.
08:58 I've heard of this.
08:59 I haven't seen it.
09:00 Yeah.
09:01 It's basically for traders, but it's like a terminal type thing that has a bunch of
09:05 Matplotlib and other interactive stuff that pops up kind of compared to say Bloomberg
09:10 dashboard type things.
09:13 But yeah, that's one sense where like maybe like traders go and learn Python because it's
09:18 like, all right, there's enough value here.
09:19 But in general, I don't think people are going to stop what they're doing and learn to
09:24 code.
09:25 So these new UI things are not.
09:26 All right, let's dive in and talk about this general category first of data pipelines,
09:32 data orchestration, all those things.
09:33 We'll talk about Dagster and some of the trends and that.
09:36 So let's grab some random internet search for what does a data pipeline maybe look like?
09:41 But people out there listening who don't necessarily live in that space, which I think is honestly
09:47 many of us, maybe we should, but maybe in our minds, we don't think we live in data
09:51 pipeline land.
09:52 Tell them about it.
09:53 Yeah, for sure.
09:54 It is hard to think about if you haven't done or built one before.
09:57 In many ways, a data pipeline is just a series of steps that you apply to some dataset that
10:02 you have in order to transform it to something a little bit more valuable at the very end.
10:08 It's a simplified version, the devil's in the details, but really like at the end of
10:12 the day, you're in a business, the production of data sort of happens by the very nature
10:16 of operating that business.
10:17 It tends to be the core thing that all businesses have in common.
10:21 And then the other sort of output is you have people within that business who are trying
10:25 to understand how the business is operating.
10:27 And this used to be easy when all we had was a single spreadsheet that we could look at
10:31 once a month.
10:32 Yeah, I think businesses have gone a little bit more complex than these days.
10:36 Computers and automation.
10:37 And the expectations.
10:38 I expect to be able to see almost real time, not I'll see it at the end of the month sort
10:42 of.
10:43 That's right.
10:44 Yeah.
10:45 I think people have gotten used to getting data too, which is both good and bad.
10:47 Good in the sense that now people are making better decisions.
10:49 Bad, and then there's more work for us to do.
10:51 And we can't just sit on our hands for half a day, half a month waiting for the next request
10:55 to come in.
10:56 There's just a stream that never seems to end.
10:59 So that's what really a pipeline is all about.
11:00 It's like taking this data and making it consumable in a way that users and tools will
11:06 understand, which helps people make decisions at the end of the day.
11:09 That's sort of the nuts and bolts of it.
11:11 In your mind, does data acquisition live in this land?
11:14 So for example, maybe we have a scheduled job that goes and does web scraping, calls
11:19 an API once an hour, and that might kick off a whole pipeline of processing.
11:24 Or we watch a folder for people to upload over FTP, like a CSV file or something horrible
11:32 like that.
11:33 You don't even, it's unspeakable.
11:34 But something like that where you say, oh, a new CSV has arrived for me to get, right?
11:38 Yeah, I think that's the beginning of all data pipeline journeys in my mind, very much,
11:43 right?
11:44 Like an FTP, as much as we hate it, it's not terrible.
11:46 I mean, the worst, there are worse ways to transfer files, but it's, I think still very
11:52 much in use today.
11:53 And every data pipeline journey at some point has to begin with that consumption of data
11:57 from somewhere.
11:58 Yeah.
11:59 Hopefully it's SFTP, not just straight FTP.
12:02 Like the encrypted, don't just send your password in the plain text.
12:06 Oh, well, I've seen that go wrong.
12:09 That's a story for another day, honestly.
12:11 All right.
12:12 Well, let's talk about the project that you work on.
12:14 We've been talking about it in general, but let's talk about Dagster.
12:18 Like, where does it fit in this world?
12:20 Yes.
12:21 Dagster to me is a way to build a data platform.
12:24 It's also a different way of thinking about how you build data pipelines.
12:28 Maybe it's good to compare it with kind of what the world was like, I think, before Dagster
12:32 and how it came about to be.
12:34 So if you think of Airflow, I think it's probably the most canonical orchestrator out there,
12:39 but there are other ways which people used to orchestrate these data pipelines.
12:43 They were often task-based, right?
12:45 Like I would download file, I would unzip file, I would upload file.
12:50 These are sort of the words we use to describe the various steps within a pipeline.
12:55 Some of those little steps might be Python functions that you write.
12:59 Maybe there's some pre-built other ones.
13:00 Yeah, they might be Python.
13:02 Could be a bash script.
13:03 It'd be logging into a server and downloading a file.
13:05 Could be hitting request to download something from the internet, unzipping it.
13:09 Just a various, you know, hodgepodge of commands that would run.
13:12 That's typically how we thought about it.
13:13 For more complex scenarios where your data is bigger, maybe it's running against a Hadoop
13:17 cluster or a Spark cluster.
13:19 The compute's been offloaded somewhere else.
13:21 But the sort of conceptual way you tended to think about these things is in terms of
13:25 tasks, right?
13:26 Process this thing, do this massive data dump, run a bunch of things, and then your job is
13:31 complete.
13:32 With Airflow, or sorry, with Dagster, we kind of flip it around a little bit on our
13:35 heads and we say, instead of thinking about tasks, what if we flipped that around and
13:40 thought about the actual underlying assets that you're creating?
13:43 What if you told us not, you know, the step that you're going to take, but the thing that
13:47 you produce?
13:48 Because it turns out as people and as data people and stakeholders, really, we don't
13:52 care about the task, like we just assume that you're going to do it.
13:56 What we care about is, you know, that table, that model, that file, that Jupyter notebook.
14:01 And if we model our pipeline through that, then we get a whole bunch of other benefits.
14:06 And that's sort of the Dagster pitch, right?
14:09 Like if you want to understand the things that are being produced by these tasks, tell
14:13 us about the underlying assets.
14:15 And then when a stakeholder says and comes to you and says, you know, how old is this
14:19 table?
14:20 Has it been refreshed lately?
14:21 You don't have to go look at a specific task.
14:22 And remember that task ABC had model XYZ.
14:26 You just go and look up model XYZ directly there, and it's there for you.
14:29 And because you've defined things in this way, you get other nice things like a lineage
14:33 graph.
14:34 You get to understand how fresh your data is.
14:36 You can do event-based orchestration and all kinds of nice things that are a lot harder
14:39 to do in a task world.
14:41 Yeah, more declarative, less imperative, I suppose.
14:45 Yeah, it's been the trend, I think, in lots of tooling.
14:48 React, I think was famous for this as well, right?
14:51 In many ways.
14:52 It was a hard framework, I think, for people to sort of get their heads around initially
14:55 because we were so used to the imperative, jQuery style of doing things.
15:01 Yeah.
15:02 How do I hook the event that makes the thing happen?
15:03 And React said, let's think about it a little bit differently.
15:06 Let's do this event-based orchestration.
15:08 And I think the proof's in the pudding.
15:10 React's everywhere now and jQuery, maybe not so much.
15:13 Yeah.
15:14 There's still a lot of jQuery out there, but there's not a lot of active jQuery.
15:18 But I imagine there's some.
15:19 There is.
15:20 Yeah.
15:21 Just because people are like, you know what?
15:22 Don't touch that.
15:23 That works.
15:24 Which is probably the smartest thing people can do, I think.
15:27 Yeah, honestly.
15:28 Even though new frameworks are shiny.
15:30 And if there's any ecosystem that loves to chase the shiny new idea, it's the JavaScript
15:35 web world.
15:36 Oh, yeah.
15:37 There's no shortage of new frameworks coming out every time.
15:40 We do too, but not as much as like, that's six months old.
15:44 That's so old, we can't possibly do that anymore.
15:46 We're rewriting it.
15:47 We're going to do the big rewrite again.
15:48 Yep.
15:49 Fun.
15:50 So, Dagster is the company, but also is open source.
15:54 What's the story around like, can I use it for free?
15:57 Is it open source?
15:58 Do I pay for it?
15:59 100%.
16:00 Okay.
16:01 So, Dagster Labs is the company.
16:02 Dagster open source is the product.
16:03 It's 100% free.
16:04 We're very committed to the open source model.
16:06 I would say 95% of the things you can get out of Dagster are available through open source.
16:11 And we tend to try to release everything through that model.
16:14 You can run very complex pipelines, and you can deploy it all on your own if you wish.
16:19 There is a Dagster cloud product, which is really the hosted version of Dagster.
16:23 If you want it fully hosted, we can do that for you through Dagster Cloud, but it all runs
16:27 on the same code base and the modeling and the files all essentially look the same.
16:32 Okay.
16:33 So obviously you could get, like I talked about at the beginning, you could go down
16:36 the DevOps side, get your own open source Dagster set up, schedule it, run it on servers,
16:41 all those things.
16:42 But if we just wanted something real simple, we could just go to you guys and say, "Hey,
16:47 I built this with Dagster.
16:48 Will you run it for me?" Pretty much.
16:50 Yeah.
16:51 Right.
16:52 So there's two options there.
16:53 You can do the serverless model, which says, "Dagster, just run it.
16:55 We take care of the compute, we take care of the execution for you, and you just write
16:58 the code and upload it to GitHub or any repository of your choice, and we'll sync to that and
17:04 then run it." The other option is to do the hybrid model.
17:06 So you basically do the CI/CD aspect.
17:09 You just say, you push to name your branch.
17:11 If you push to that branch, that means we're just going to deploy a new version and whatever
17:15 happens after that, it'll be in production, right?
17:18 Exactly.
17:19 Yeah.
17:20 And we offer some templates that you can use in GitHub for workflows in order to accommodate
17:23 that.
17:24 Excellent.
17:25 Then I cut you off.
17:26 You're saying something about hybrid.
17:27 Hybrid is the other option for those of you who want to run your own compute.
17:30 You don't want the data leaving your ecosystem.
17:32 You can say, "We've got this Kubernetes cluster, this ECS cluster, but we still want to use
17:37 a Dagster Cloud product to sort of manage the control plane.
17:40 Dagster Cloud will do that." And then you can go off and execute things on your own environment if that's something
17:44 you wish to do.
17:45 Oh, yeah.
17:46 Because running stuff in containers isn't too bad, but running container clusters, all
17:51 of a sudden you're back doing a lot of work, right?
17:55 Exactly.
17:56 Yeah.
17:57 Okay.
17:58 Well, let's maybe talk about Dagster for a bit.
17:59 I want to talk about some of the trends as well, but let's just talk through maybe setting
18:02 up a pipeline.
18:04 What does it look like?
18:05 You talked about in general, less imperative, more declarative, but what does it look like?
18:10 Be careful about talking about code on audio, but just give us a sense of what the programming
18:15 model feels like for us.
18:16 As much as possible, it really feels like just writing Python.
18:20 It's pretty easy.
18:21 You add a decorator on top of your existing Python function that does something.
18:25 That's a simple decorator called asset.
18:28 And then your pipeline, that function becomes a data asset.
18:31 That's how it's represented in the Dagster UI.
18:33 So you could imagine you've got a pipeline that gets like maybe Slack analytics and uploads
18:39 that to some dashboard, right?
18:41 Your first pipeline, your function will be called something like Slack data, and that
18:45 would be your asset.
18:46 In that function is where you do all the transform, the downloading of the data until you've really
18:51 created that fundamental data asset that you care about.
18:53 And that could be stored either in a data warehouse to S3, however you sort of want
18:58 to persist it, that's really up to you.
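To make that concrete, here is a minimal sketch of what an asset like that might look like. The Slack-ish URL and the persistence step are assumptions for illustration; the only Dagster-specific piece is the asset decorator described above.

```python
import requests
from dagster import asset


@asset
def slack_data():
    """Pull Slack analytics and return them as this asset's value.

    The URL below is a hypothetical stand-in for whatever API or export
    you actually read from.
    """
    resp = requests.get("https://slack.example.com/api/analytics")
    resp.raise_for_status()
    data = resp.json()
    # Persist however you like here (warehouse, S3, local file), or let an
    # I/O manager handle it; returning the value is enough for Dagster to
    # treat this function as a data asset.
    return data
```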
19:00 And then the resources is sort of where the power, I think, of a lot of Dagster comes in.
19:04 So the asset is sort of like declaration of the thing I'm going to create.
19:08 The resource is how I'm going to operate on that, right?
19:12 Because sometimes you might want to have a, let's say a DuckDB instance locally, because
19:17 it's easier and faster to operate.
19:18 But when you're moving to the cloud, you want to have a Databricks or a Snowflake.
19:23 You can swap out resources based on environments and your asset can reference that resource.
19:28 And as long as it has that same sort of API, you can really flexibly change between where
19:32 that data is going to be persistent.
19:34 Does Dagster know how to talk to those different platforms?
19:37 Does it like natively understand DuckDB and Snowflake?
19:40 Yeah.
19:41 Interesting.
19:42 People often look at Dagster and ask, "Oh, does it do X?" And the answer is, "Dagster does anything you can do with Python."
19:48 Which is most things, yeah.
19:49 Which is most things.
19:50 So I think if you come from the Airflow world, you're very much used to like these Airflow
19:54 providers and if you want to use...
19:55 That's kind of what I was thinking, yeah.
19:56 Yeah.
19:57 You want to use a Postgres, you need to find the Postgres provider.
19:59 You want to use S3, you need to find the S3 provider.
20:01 With Dagster, you kind of say you don't have to do any of that.
20:04 If you want to use Snowflake, for example, install the Snowflake connector package from
20:08 Snowflake and you use that as a resource directly.
20:11 And then you just run your SQL that way.
20:13 There are some places where we do have integrations that help if you want to get into the weeds
20:18 with I/O Manager, it's where we persist the data on your behalf.
20:21 And so for S3, for Snowflake, for example, there's other ways where we can persist that
20:26 data for you.
20:27 But if you're just trying to run a query, just trying to execute something, just trying
20:30 to save something somewhere, you don't have to use that system at all.
20:33 You can just use whatever Python package you would use anyway to do that.
20:38 So maybe some data is expensive for us to get as a company, like maybe we're charged
20:43 on a usage basis or super slow or something.
20:46 I could write just Python code that goes and say, well, look in my local database.
20:50 If it's already there, use that and it's not too stale.
20:53 Otherwise, then do actually go get it, put it there and then get it back.
20:57 And like that kind of stuff would be up to me to put together.
21:00 Yeah.
21:01 And that's the nice thing is you're not really limited by like anyone's data model or worldview
21:06 on how data should be retrieved or saved or augmented.
21:09 You could do it a couple of ways.
21:10 You could say whenever I'm working locally, use this persistent data store that we're
21:15 just going to use for development purposes.
21:18 Fancy database called SQLite, something like that.
21:20 Exactly.
21:21 Yes.
21:22 A wonderful database.
21:23 Actually, it is.
21:24 Yeah.
21:25 It'll work really, really well.
21:26 And then you just say when I'm in a different environment, when I'm in production, swap
21:28 out my SQLite resource for a name, your favorite cloud warehouse resource, and go fetch that
21:33 data from there.
21:34 Or I want to use MinIO locally.
21:36 I want to use S3 on prod.
21:39 It's very simple to swap these things out.
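As a rough sketch of that swap, assuming Dagster's Pythonic resources API: the asset only asks for "a warehouse," and the Definitions object decides which backend that is. The class and names here are invented for illustration; the dagster-duckdb and dagster-snowflake packages also ship ready-made resources, and in a real project you would type the asset against a shared interface so both environments satisfy it.

```python
import duckdb
from dagster import ConfigurableResource, Definitions, asset


class LocalDuckDB(ConfigurableResource):
    """Tiny illustrative resource wrapping a local DuckDB file."""

    path: str

    def query(self, sql: str):
        # fetch_df() needs pandas installed; returns the result as a DataFrame.
        with duckdb.connect(self.path) as conn:
            return conn.execute(sql).fetch_df()


@asset
def daily_summary(warehouse: LocalDuckDB):
    # The asset only knows it has a "warehouse" that can run SQL; which
    # engine backs it is decided by the wiring below.
    return warehouse.query("SELECT 1 AS ok")


# Local/dev wiring; in production you would bind the same "warehouse" key
# to a Snowflake- or Databricks-backed resource exposing the same query().
defs = Definitions(
    assets=[daily_summary],
    resources={"warehouse": LocalDuckDB(path="dev.duckdb")},
)
```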
21:40 Okay.
21:41 Yeah.
21:42 So it looks like you build up these assets as y'all call them, these pieces of data,
21:46 Python code that accesses them.
21:48 And then you have a nice UI that lets you go and build those out kind of workflow style,
21:54 right?
21:55 Yeah, exactly.
21:56 This is where we get into the wonderful world of DAGs, which stands for directed acyclic
22:00 graph.
22:01 So basically it means a bunch of things that are not connected in a circle, but are
22:04 connected in some way.
22:05 So there can't be any loops, right?
22:07 Because then you never know where to start or where to end.
22:09 Could be a line segment, but not a circle.
22:11 Not a circle.
22:12 As long as there's like a path through this dataset, where the beginning and an end, then
22:18 we can kind of start to model this connected graph of things.
22:21 And then we know how to execute them, right?
22:23 We can say, well, this is the first thing we have to run because that's where all dependencies
22:26 start.
22:27 And then we can branch off in parallel or we can continue linearly until everything
22:31 is complete.
22:32 And if something breaks in the middle, we can resume from that broken spot.
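For a sense of how that graph gets declared, here is a hedged sketch: in Dagster, naming an upstream asset as a function parameter is enough to draw the edge, and the framework works out ordering and parallelism from the resulting DAG. The asset names are just examples.

```python
from dagster import asset


@asset
def users():
    # A root node: no dependencies.
    return [{"id": 1, "name": "Ada"}]


@asset
def orders():
    # Another root node, independent of users, so it can run in parallel.
    return [{"order_id": 10, "user_id": 1}]


@asset
def orders_with_users(users, orders):
    # Parameter names match the upstream assets, so Dagster knows this node
    # runs only after both users and orders have materialized.
    lookup = {u["id"]: u["name"] for u in users}
    return [{**o, "user_name": lookup.get(o["user_id"])} for o in orders]
```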
22:35 Okay, excellent.
22:36 And is that the recommended way?
22:38 Like if I write all this Python code that works on the pieces, then the next recommendation
22:42 would be to fire up the UI and start building it?
22:45 Or do you say, ah, you should really write it in code and then you can just visualize
22:49 it or monitor it?
22:50 Everything in Dagster is written as code.
22:52 The UI reads that code and it interprets it as a DAG and then it displays that for you.
22:57 There are some things you do with the UI, like you can materialize assets, you can make
23:00 them run, you can do backfills, you can view metadata, you can sort of enable and disable
23:06 schedules.
23:07 But the core, we really believe this is Dagster, like the core declaration of how things are
23:11 done, it's always done through code.
23:13 Okay, excellent.
23:14 So when you say materialize, maybe I have an asset, which is really a Python function
23:19 I wrote that goes and pulls down a CSV file.
23:22 The materialize would be, I want to see kind of representative data in this, in the UI.
23:28 And so I could go, all right, I think this is right.
23:30 Let's keep passing it down.
23:31 Is that what that means?
23:33 Materialize really means just run this particular asset, make this asset new again, fresh again,
23:37 right?
23:38 As part of that materialization, we sometimes output metadata.
23:41 And you can kind of see this on the right, if you're looking at the screen here, where
23:44 we talk about what the timestamp was, the URL, there's a nice little graph of like number
23:49 of rows over time.
23:51 All that metadata is something you can emit, and we emit some ourselves by default with
23:56 the framework.
23:57 And then as you materialize these assets, as you run that asset over and over again,
24:00 over time, we capture all that.
24:01 And then you can really get a nice overview of, you know, this asset's lifetime, essentially.
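For what emitting that metadata can look like, here is a small sketch assuming a reasonably recent Dagster release where an asset can return a MaterializeResult; older versions attach metadata to an Output or via context.add_output_metadata instead. The row count is exactly the kind of value that shows up as the per-run chart described here.

```python
import pandas as pd
from dagster import MaterializeResult, asset


@asset
def users() -> MaterializeResult:
    # Stand-in for the real download/transform step.
    df = pd.DataFrame({"id": [1, 2, 3], "name": ["Ada", "Bo", "Cy"]})
    df.to_parquet("users.parquet")  # persist however you normally would

    # Metadata recorded with each materialization; Dagster tracks values
    # like num_rows across runs in the asset's history view.
    return MaterializeResult(
        metadata={"num_rows": len(df), "path": "users.parquet"}
    )
```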
24:06 Nice.
24:07 I think the asset, the metadata is really pretty excellent, right?
24:10 Over time, you can see how the data's grown and changed.
24:13 And yeah, the metadata is really powerful.
24:16 And it's one of the nice benefits of being in this asset world, right?
24:19 Because you don't really want metadata on like this task that ran, you want to know
24:23 like this table that I created, how many rows has it had every single time it's run?
24:27 If that number drops by like 50%, that's a big problem.
24:31 Conversely, if the runtime is slowly increasing every single day, you might not notice it,
24:35 but over a month or two, it went from a 30 second pipeline to 30 minutes.
24:40 Maybe there's like a great place to start optimizing that one specific asset.
24:43 Right.
24:44 And what's cool, if it's just Python code, you know how to optimize that probably, right?
24:48 Hopefully, yeah.
24:49 Well, as much as you're going to, yeah, you got, you have all the power of Python and
24:54 you should be able to, as opposed to it's deep down inside some framework that you don't
24:57 really control.
24:58 Exactly.
24:59 Yeah.
25:00 You use Python, you can benchmark it.
25:01 There's probably, you probably knew you didn't write it that well when you first started
25:04 and you can always find ways to improve it.
25:07 So this UI is something that you can just run locally, kind of like Jupyter.
25:11 100%.
25:12 Just type dagster dev and then you get the full UI experience.
25:16 You get to see the runs, all your assets.
25:17 Is it a web app?
25:18 It is.
25:19 Yeah.
25:20 It's a web app.
25:21 There's a Postgres backend.
25:22 And then there's a couple of services that run the web server, the GraphQL, and then
25:25 the workers.
25:26 Nice.
25:27 Yeah.
25:28 So pretty serious web app, it sounds like, but you probably just run it all.
25:31 Yeah.
25:33 Just something you run, probably all in containers, or something you just fire up when you download
25:37 it, right?
25:38 Locally, it doesn't even use containers.
25:39 It's just all pure Python for that.
25:43 But once you deploy, yeah, I think you might want to go down the container route, but it's
25:46 nice not having to have Docker just to run a simple test deployment.
25:50 Yeah.
25:51 I guess not everyone's machine has that, for sure.
25:53 So question from the audience here.
25:56 Jazzy asks, does it hook into AWS in particular?
25:59 Is it compatible with existing pipelines like ingestion lambdas or transform lambdas?
26:04 Yeah, you can hook into AWS.
26:06 So we have some AWS integrations built in.
26:09 Like I mentioned before, there's nothing stopping you from importing Boto3 and doing anything
26:13 really you want.
26:14 So a very simple use case.
26:15 Like let's say you already have an existing transformation being triggered in AWS through
26:20 some lambda.
26:21 You could just model that within Dagster and say, you know, trigger that lambda with Boto3.
26:25 Okay.
26:26 Then the asset itself is really that representation of that pipeline, but you're not actually running
26:31 that code within Dagster itself.
26:32 That's still occurring on the AWS side.
26:34 And that's a really simple way to start adding a little bit of observability and orchestration
26:38 to existing pipelines.
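A minimal sketch of that pattern, assuming plain boto3 and a hypothetical Lambda name: the asset merely represents and triggers the Lambda, while the actual transformation still runs on AWS.

```python
import json

import boto3
from dagster import asset


@asset
def raw_events():
    """Represents the dataset produced by an existing AWS Lambda.

    'ingest-raw-events' is a hypothetical function name; the Lambda owns the
    real work, Dagster owns scheduling, lineage, and observability.
    """
    client = boto3.client("lambda")
    resp = client.invoke(FunctionName="ingest-raw-events")
    # Assumes the Lambda returns a JSON payload (rows written, output path, ...).
    return json.loads(resp["Payload"].read())
```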
26:40 Okay.
26:41 That's pretty cool because now you have this nice UI and these metadata in this history,
26:45 but it's someone else's cloud.
26:47 Exactly.
26:48 Yeah.
26:49 And you can start to pull more information in there.
26:50 And over time you might decide, you know, this, you know, lambda that I had, it's starting
26:54 to get out of hand.
26:55 I want to kind of break it apart into multiple assets where I want to sort of optimize it
26:59 a little, and Dagster can help you along that path.
27:01 Yeah.
27:02 Excellent.
27:03 How do you set up like triggers or observability inside Dagster?
27:08 Like Jazzy's asking about S3, but like in general, right?
27:11 If a row is entered into a database, something is dropped in a blob storage or the date changes.
27:16 I don't know.
27:17 Yeah.
27:18 Those are great questions.
27:19 You have a lot of options.
27:20 In Dagster, we do model every asset with a couple little flags, I think, that are really
27:24 useful to think about.
27:25 One is whether the code of that particular asset has changed, right?
27:28 And then the other one is whether anything upstream of that asset has changed.
27:32 And those two things really power a lot of automation functionality that we can get downstream.
27:38 So let's start with the S3 example, it's the easiest to understand.
27:41 You have a bucket and there is a file that gets uploaded every day.
27:46 You don't know what time that file gets uploaded.
27:48 You don't know when it'll be uploaded, but you know at some point it will be.
27:51 In Dagster, we have a thing called a sensor, which you can just connect to an S3 location.
27:55 You can define how it looks into that file or into that folder.
27:59 And then you would just poll every 30 seconds until something happens.
28:02 When that something happens, that triggers sort of an event.
28:06 And that event can trickle at your will downstream to everything that depends on it as you sort
28:10 of connect to these things.
28:12 So it gets you away from this like, "Oh, I'm going to schedule something to run every
28:15 hour.
28:16 Maybe the data will be there, but maybe it won't." And you can have a much more event-based workflow.
28:20 When this file runs, I want everything downstream to know that this data has changed.
28:25 And as sort of data flows through these systems, everything will sort of work its way down.
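Here is roughly what such a sensor can look like, as a hedged sketch using boto3 and Dagster's sensor decorator; the bucket, prefix, and job name are placeholders. It keeps a cursor of the last key it saw and requests a run whenever a new file appears.

```python
import boto3
from dagster import RunRequest, SkipReason, sensor


@sensor(job_name="process_uploads", minimum_interval_seconds=30)
def new_file_sensor(context):
    # List the watched location; bucket and prefix are placeholders.
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="uploads/")
    keys = sorted(obj["Key"] for obj in resp.get("Contents", []))

    # Only react to keys we haven't seen before.
    new_keys = [k for k in keys if k > (context.cursor or "")]
    if not new_keys:
        yield SkipReason("No new files yet.")
        return

    context.update_cursor(new_keys[-1])
    for key in new_keys:
        # One run per new file; run_key de-duplicates repeated evaluations.
        yield RunRequest(run_key=key, run_config={})
```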
28:28 Yeah, I like it.
28:32 This portion of Talk Python to Me is brought to you by Posit, the makers of Shiny, formerly
28:36 RStudio, and especially Shiny for Python.
28:40 Let me ask you a question.
28:41 Are you building awesome things?
28:43 Of course you are.
28:44 You're a developer or a data scientist.
28:46 That's what we do.
28:47 And you should check out Posit Connect.
28:49 Posit Connect is a way for you to publish, share, and deploy all the data products that
28:54 you're building using Python.
28:56 People ask me the same question all the time.
28:59 "Michael, I have some cool data science project or notebook that I built.
29:02 How do I share it with my users, stakeholders, teammates?
29:05 Do I need to learn FastAPI or Flask or maybe Vue or ReactJS?" Hold on now.
29:11 Those are cool technologies, and I'm sure you'd benefit from them, but maybe stay focused
29:15 on the data project?
29:16 Let Posit Connect handle that side of things.
29:19 With Posit Connect, you can rapidly and securely deploy the things you build in Python.
29:23 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards, and APIs.
29:30 Posit Connect supports all of them.
29:32 And Posit Connect comes with all the bells and whistles to satisfy IT and other enterprise
29:37 requirements.
29:38 Make deployment the easiest step in your workflow with Posit Connect.
29:42 For a limited time, you can try Posit Connect for free for three months by going to talkpython.fm/posit.
29:49 That's talkpython.fm/POSIT.
29:52 The link is in your podcast player show notes.
29:54 Thank you to the team at Posit for supporting Talk Python.
29:59 The sensor concept is really cool because I'm sure that there's a ton of cloud machines
30:05 people provisioned just because this thing runs every 15 minutes, that runs every 30
30:10 minutes, and you add them up and in aggregate, we need eight machines just to handle the
30:14 automation, rather than, you know, because they're hoping to catch something without
30:18 too much latency, but maybe like that actually only changes once a week.
30:22 Exactly.
30:23 And I think that's where we have to like sometimes step away from the way we're so used to thinking
30:27 about things, and I'm guilty of this.
30:30 When I create a data pipeline, my natural inclination is to create a schedule where
30:33 it's a, is this a daily one?
30:34 Is this weekly?
30:35 Is this monthly?
30:36 But what I'm finding more and more is when I'm creating my pipelines, I'm not adding
30:39 a schedule.
30:40 I'm using Dagster's auto-materialize policies, and I'm just telling it, you figure it out.
30:45 I don't have to think about schedules.
30:46 Just figure out when the things should be updated.
30:49 When it's, you know, parents have been updated, you run.
30:51 When the data has changed, you run.
30:53 And then just like figure it out and leave me alone.
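In code, that "figure it out for me" declaration is roughly a one-liner, a sketch assuming the AutoMaterializePolicy API Dagster shipped around this period (later releases have been evolving this into automation conditions):

```python
from dagster import AutoMaterializePolicy, asset


@asset
def raw_orders():
    return [{"id": 1, "amount": 42}, {"id": 2, "amount": 0}]


@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def enriched_orders(raw_orders):
    # No schedule anywhere: Dagster re-materializes this asset whenever its
    # upstream changes, rather than on a fixed cron cadence.
    return [o for o in raw_orders if o["amount"] > 0]
```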
30:55 Yeah.
30:56 And it's worked pretty well for me so far.
30:57 I think it's great.
30:58 I have a job that refreshes the search index on the various podcast pages, and
31:04 it runs every hour, but the podcast ships weekly, right?
31:08 But I don't know which hour it is.
31:10 And so it seems like that's enough latency, but it would be way better to put just a little
31:14 bit of smarts.
31:15 Like what was the last date that anything changed?
31:18 Was that since the last time you saw it?
31:20 Maybe we'll just leave that alone, you know?
31:21 But yeah, you're starting to inspire me to go write more code, but pretty cool.
31:26 All right.
31:27 So on the homepage at Dagster.io, you've got a nice graphic that shows you both how to
31:33 write the code, like some examples of the code, as well as how that looks in the UI.
31:38 And one of them is called, says to launch backfills.
31:41 What is this backfill thing?
31:43 Oh, this is my favorite thing.
31:44 Okay.
31:45 So when you first start your data journey as a data engineer, you sort of have a pipeline
31:50 and you build it and it just runs on a schedule and that's fine.
31:54 What you soon find is, you know, you might have to go back in time.
31:58 You might say, I've got this data set that updates monthly.
32:01 Here's a great example, AWS cost reporting, right?
32:05 AWS will send you some data around, you know, all your instances and your S3 bucket, all
32:10 that.
32:11 And it'll update that data every day or every month or whatever have you.
32:14 Due to some reason, you've got to go back in time and refresh data that AWS updated
32:18 due to some like discrepancy.
32:20 Backfill is sort of how you do that.
32:21 And it works hand in hand with this idea of a partition.
32:24 A partition is sort of how your data is naturally organized.
32:28 And it's like a nice way to represent that natural organization.
32:31 Has nothing to do with like the fundamental way, how often you want to run it.
32:35 It's more around like, I've got a data set that comes in once a month, it's represented
32:39 monthly.
32:40 It might be updated daily, but it's the representation of the data is monthly.
32:43 So I will partition it by month.
32:44 It doesn't have to be dates.
32:46 It could be strings.
32:47 It could be a list.
32:48 You could have a partition for every company or every client or every domain you have.
32:55 Whatever you sort of think is a natural way to think about breaking apart that pipeline.
33:00 And once you do that partition, you can do these nice things called backfills, which
33:03 says instead of running this entire pipeline and all my data, I want you to pick that one
33:08 month where data went wrong or that one month where data was missing and just run the partition
33:13 on that range.
33:14 And so you limit compute, you save resources and get a little bit more efficient.
33:18 And it's just easier to like, think about your pipeline because you've got this natural
33:22 built in partitioning system.
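A small sketch of a monthly-partitioned asset along those lines, with an invented start date and helper; each run handles exactly one month, which is what makes a targeted backfill of "just that bad month" possible from the UI.

```python
from dagster import AssetExecutionContext, MonthlyPartitionsDefinition, asset

monthly = MonthlyPartitionsDefinition(start_date="2023-01-01")


def fetch_cost_report(month: str) -> dict:
    # Hypothetical helper standing in for downloading that month's billing export.
    return {"month": month, "total_usd": 0.0}


@asset(partitions_def=monthly)
def aws_cost_report(context: AssetExecutionContext) -> dict:
    # partition_key is the month this run is responsible for, e.g. "2023-06-01".
    # A backfill simply launches one of these runs per month you select.
    return fetch_cost_report(context.partition_key)
```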
33:23 Excellent.
33:24 So maybe you missed some important event.
33:27 Maybe your automation went down for a little bit, came back up.
33:31 You're like, Oh no, we've, we've missed it.
33:33 Right.
31:34 But you don't want to start over for three years.
31:37 So maybe we could just go and run the last
31:39 day's worth.
33:40 Exactly.
33:41 Or another one would be your vendor says, Hey, by the way, we actually screwed up.
33:45 We uploaded this file from two months ago, but the numbers were all wrong.
33:48 Yeah.
33:49 We've uploaded a new version to that destination.
33:51 Can you update your data set?
33:53 One way is to recompute the entire universe from scratch.
33:56 But if you've partitioned things and you can say no limit that to just this one partition
34:00 for that month and that one partition can trickle down all the way to all your other
34:03 assets that depend on that.
34:05 Do you have to pre decide, do you have to think about this partitioning beforehand or
34:10 can you do it retroactively?
34:11 You could do it retroactively.
34:12 And I have done that before as well.
32:14 It really depends on where you're at.
32:16 If it's your first asset ever,
32:19 probably don't bother with partitions, but it really isn't a lot of work to get them
32:23 started.
34:24 Okay.
34:25 Yeah.
34:26 Really neat.
34:27 I like a lot of the ideas here.
34:28 I like that.
34:29 It's got this visual component that you can see what's going on, inspect it.
34:33 Just so you can debug runs or what happens there.
34:36 Like obviously when you're pulling data from many different sources, maybe it's not your
34:40 data you're taking in.
34:41 Fields could vanish.
34:42 It can be the wrong type.
34:43 Systems can go down.
34:44 I'm sure, sure the debugging is interesting.
34:47 So what's this? It looks a little bit like a web browser dev tools debug view.
34:52 So for the record, my code never fails.
34:54 I've never had a bug in my life, but for the one that you have.
34:57 Yeah.
34:58 Well, mine doesn't either.
34:59 I only do it to make an example for me and for others, yes.
35:03 If I do it's intentional, of course.
35:05 Yeah.
35:06 To humble myself a little bit.
35:08 Exactly.
35:09 This view is one of my favorite, I mean, so many favorite views, but this is, it's actually
35:13 really fun to watch, watch this actually run when you execute this pipeline.
35:16 But really like, let's go back to, you know, the world before orchestrators, we use cron,
35:22 right?
35:23 We'd have a bash script that would do something and we'd have a cron job that said, make sure
35:26 this thing runs.
35:27 And then hopefully it was successful, but sometimes it wasn't.
35:31 And it's a, sometimes it wasn't, that's always been the problem, right?
35:34 It's like, well, what do I do now?
35:35 How do I know why it failed?
35:36 What was, when did it fail?
35:38 You know, what, at what point or what steps did it fail?
35:41 That's really hard to do.
35:42 But this debugger really is, a structured log of every step that's been going on through
35:47 your pipeline, right?
35:48 So in this view, there's three assets that we can kind of see here.
35:52 One is called users.
35:53 One is called orders and one is to run dbt.
35:56 So presumably there's these two, you know, tables that are being updated and then a dbt
36:00 job.
36:01 It looks like that's being updated at the very end.
36:03 Once you execute this pipeline, all the logs are captured from each of those assets.
36:07 So you can manually write your own logs.
36:10 You have access to a Python logger and you can use your info, your error, whatever have
36:14 you in log output that way.
36:16 And it'll be captured in a structured way, but it also captures logs from your integrations.
36:21 So if you're using dbt, we capture those logs as well.
36:24 You can see it processing every single asset.
36:26 So if anything does go wrong, you can filter down and understand at what step, at what
36:31 point did something go wrong.
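The logging piece is just the context logger, something like this hedged sketch; anything written through it lands in that structured, per-asset log view.

```python
from dagster import AssetExecutionContext, asset


@asset
def users(context: AssetExecutionContext):
    context.log.info("Fetching users...")
    rows = [{"id": 1}, {"id": 2}]  # stand-in for the real download
    context.log.info(f"Fetched {len(rows)} users")
    return rows
```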
36:33 - That's awesome.
36:34 And just the historical aspect, cause just going through logs, especially multiple systems
36:40 can be really, really tricky to figure out what's the problem, what actually caused this
36:44 to go wrong, but come back and say, oh, it crashed, pull up the UI and see, all right,
36:49 well show me, show me what this run did, show me what this job did.
36:53 And it seems like it's a lot easier to debug than your standard web API or something like
36:57 that.
36:58 - Exactly.
36:59 You can click on any of these assets and get that metadata that we had earlier as well.
37:02 If you know, one step failed and it's kind of flaky, you can just click on that one step
37:06 and say, just rerun this.
37:08 Everything else is fine.
37:09 Instead of restarting from scratch.
37:10 - Okay.
37:11 And it'll keep the data from before, so you don't have to rerun that.
37:15 - Yeah.
37:16 I mean, it depends on how you built the pipeline.
37:18 We like to build idempotent pipelines, is how we sort of talk about it in the data engineering
37:21 landscape, right?
37:23 So you should be able to run something multiple times and not break anything in a perfect
37:27 world.
37:28 That's not always possible, but ideally, yes.
37:30 And so we can presume that if users completed successfully, then we don't have to run that
37:34 again because that data was persisted, you know, database S3 somewhere.
37:38 And if orders was the one that was broken, we can just only run orders and not have to
37:43 worry about rewriting the whole thing from scratch.
37:45 - Excellent.
37:46 So idempotent, for people who maybe don't know: whether you perform the operation
37:50 once or you perform it 20 times, you get the same outcome, with no extra side effects, right?
37:55 - That's the idea.
37:57 - Easier said than done sometimes.
37:58 - It sure is.
37:59 - Sometimes it's easy, sometimes it's very hard, but the more you can build pipelines
38:03 that way, the easier your life becomes in many ways.
38:07 - Exactly.
38:08 Not always, but generally true for programming as well, right?
38:10 If you talk to functional programming people, they'll say like, it's an absolute, but.
38:14 - Yes, functional programmers love this kind of stuff.
38:17 And it actually does lend itself really well to data pipelines.
38:21 Data pipelines, unlike maybe some of the software engineering stuff, it's a little bit different
38:24 in that the data changing is what causes often most of the headaches, right?
38:30 It's less so the actual code you write, but more the expectation tends to change so frequently
38:36 and so often in new and novel, interesting ways that you would often never expect.
38:41 And so the more you can sort of make that function so pure that you can provide any
38:46 sort of dataset and really test really easily these expectations when they get broken, the
38:51 easier it is to sort of debug these things and build on them in the future.
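For a concrete flavor of that idempotency idea, here is a tiny, generic sketch (sqlite3-style DB-API, table and column names invented): the write is keyed to the partition and does delete-then-insert, so running it once or five times for the same day leaves the table in the same state.

```python
def write_orders_for_day(conn, day: str, rows: list[tuple]) -> None:
    """Replace exactly one day's slice of the orders table.

    Re-running this for the same day with the same input produces the same
    end state, which is what makes retries and backfills safe.
    """
    with conn:  # one transaction: either the whole swap happens or none of it
        conn.execute("DELETE FROM orders WHERE order_date = ?", (day,))
        conn.executemany(
            "INSERT INTO orders (order_date, order_id, amount) VALUES (?, ?, ?)",
            rows,
        )
```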
38:55 - Yeah.
38:56 And cache them as well.
38:57 - Yes, it's always nice.
38:58 - Yeah.
38:59 So speaking of that kind of stuff, like what's the scalability story?
39:02 If I've got some big, huge, complicated data pipeline, can I parallelize them and have
39:08 them run multiple pieces?
39:10 Like if there's different branches or something like that?
39:12 - Yeah, exactly.
39:13 That's one of the key benefits I think in writing your assets in this DAG way, right?
39:20 Anything that is parallelizable will be parallelized.
39:22 Now sometimes you might want to put limits on that.
39:24 Sometimes too much parallelization is bad.
39:26 Your poor little database can't handle it.
39:28 And you can say, you know, maybe a concurrency limit on this one just for today is worth
39:32 putting, or if you're hitting an API for an external vendor, they might not appreciate
39:36 10,000 requests a second on that one.
39:39 So maybe you would slow it down.
39:40 But in this case-
39:41 - Or rate limiting, right?
39:42 You can run into too many requests and then your stuff crashes, then you got to start.
39:45 It can be a whole thing.
39:46 - It can be a whole thing.
39:47 There's memory concerns, but let's pretend the world is simple.
39:51 Anything that can be parallelized will be through Dagster.
39:54 And that's really the benefit of writing these DAGs is that there's a nice algorithm for
39:57 determining what that actually looks like.
39:59 - Yeah.
40:00 I guess if you have a diamond shape or any sort of split, right?
40:03 Those two things now become parallelizable, because it's acyclic, they can't turn around and then
40:07 eventually depend on each other again.
40:09 So that's a perfect chance to just go fork it out.
40:12 - Exactly.
40:13 And that's kind of where partitions are also kind of interesting.
40:15 If you have a partitioned asset, you could take your dataset partitioned to five, you
40:19 know, buckets and run all five partitions at once, knowing full well that because you've
40:23 written this in a idempotent and partitioned way, that the first pipeline will only operate
40:28 on apples and the second one only operates on bananas.
40:32 And there is no commingling of apples and bananas anywhere in the pipeline.
40:35 - Oh, that's interesting.
40:36 I hadn't really thought about using the partitions for parallelism, but of course.
40:39 - Yeah.
40:40 It's a fun little way to break things apart.
40:43 - So if we run this on the Dagster cloud or even on our own, is this pretty much automatic?
40:49 We don't have to do anything?
40:51 I think Dagster just looks at it and says, this looks parallelizable and it'll go or?
40:55 - That's right.
40:56 Yeah.
40:57 As long as you've got the full deployment, whether it's OSS or cloud, Dagster will basically
41:00 parallelize it for you wherever possible.
41:02 You can set global concurrency limits.
41:04 So you might say, you know, 64 is more than enough, you know, parallelization that I need,
41:09 or maybe I want less because I'm worried about overloading systems, but it's really up to
41:13 you.
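If you do want to cap it, one place that limit can live is the job's execution config; a hedged sketch, assuming the default multiprocess executor (the job name and number are arbitrary):

```python
from dagster import define_asset_job

# Cap this job at 4 concurrent steps so it doesn't flood a small database
# or a rate-limited vendor API.
nightly_refresh = define_asset_job(
    "nightly_refresh",
    config={"execution": {"config": {"multiprocess": {"max_concurrent": 4}}}},
)
```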
41:14 - Putting this on a $10 server, please don't kill it.
41:19 Just respect that it's somewhat wimpy, but that's okay.
41:21 - It'll get the job done.
41:23 All right.
41:24 I want to talk about some of the tools and some of the tools that are maybe at play here
41:29 when working with Dagster and some of the trends and stuff.
41:31 But before that, maybe speak to where you could see people adopt a tool like Dagster,
41:37 but they generally don't.
41:39 They don't realize like, oh, actually there's a whole framework for this, right?
41:43 Like I could, sure I could go and build just on HTTP server and hook into the request and
41:49 start writing to it.
41:50 But like, maybe I should use Flask or FastAPI.
41:53 Like there's these frameworks that we really naturally adopt for certain situations like
41:58 APIs and others, background jobs, data pipelines, where I think there's probably a good chunk
42:04 of people who could benefit from stuff like this, but they just don't think they need
42:07 a framework for it.
42:09 Like cron is enough.
42:10 - Yeah, it's funny because sometimes cron is enough.
42:13 And I don't want to encourage people not to use cron, but think twice at least is what
42:18 I would say.
42:19 So probably the first like trigger for me of thinking of, you know, is Dagster a good
42:24 choice is like, am I trying to ingest data from somewhere?
42:26 Is that's something that fails.
42:28 Like I think we just can accept that, you know, if you're moving data around, the data
42:32 source will break, the expectations will change.
42:35 You'll need to debug it.
42:36 You'll need to rerun it.
42:37 And doing that in cron is a nightmare.
42:39 So I would say definitely start to think about an orchestration system.
42:43 If you're ingesting data, if you have a simple cron job that sends one email, like you're
42:46 probably fine.
42:47 I don't think you need to implement all of Dagster just to do that.
42:51 But the closer you get to data pipelining, I think the better your life will be if you're
42:58 not trying to debug an obtuse process that no one really understands six months from
43:03 now.
43:04 - Excellent.
43:05 All right, maybe we could touch on some of the tools that are interesting to see people
43:08 using.
43:09 You talked about DuckDB and DBT, a lot of Ds starting here, but give us a sense of like
43:15 some of the supporting tools you see a lot of folks using that are interesting.
43:19 - Yeah, for sure.
43:20 I think in the data space, probably DBT is one of the most popular choices and DBT in
43:27 many ways, it's nothing more than a command line tool that runs a bunch of SQL in a DAG
43:33 as well.
43:34 So there's actually a nice fit with Dagster and DBT together.
43:37 DBT is really used by people who are trying to model that business process using SQL against
43:44 typically a data warehouse.
43:45 So if you have your data in, for example, a Postgres, a Snowflake, Databricks, Microsoft
43:51 SQL, these types of data warehouses, generally you're trying to model some type of business
43:56 process and typically people use SQL to do that.
44:00 Now you can do this without DBT, but DBT has provided a nice clean interface to doing so.
44:06 It makes it very easy to connect these models together, to run them, to have a development
44:10 workflow that works really well.
44:12 And then you can push it to prod and have things run again in production.
44:15 So that's DBT.
44:17 We find it works really well.
44:18 And a lot of our customers are actually using DBT as well.
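For the Dagster-plus-dbt combination specifically, the glue can be roughly this small, a sketch assuming the dagster-dbt package as documented around this period; the manifest path is whatever your compiled dbt project produces.

```python
from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets


# Point Dagster at dbt's compiled manifest; each dbt model shows up as an asset.
@dbt_assets(manifest="target/manifest.json")
def my_dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # Stream dbt's own logs and events back into Dagster's structured logs.
    yield from dbt.cli(["build"], context=context).stream()
```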
44:21 There's DuckDB, which is great; it's like the SQLite for columnar databases, right?
44:27 It's in process, it's fast, it's written by the Dutch.
44:30 There's nothing you can't like about it.
44:32 It's free.
44:33 We love that.
44:34 It's a little bit more simple in Python itself.
44:37 It does.
44:38 It's so easy.
44:39 Yes, exactly.
44:40 The Duck have given us so much and they've asked nothing of us.
44:42 So I'm always very thankful for them.
44:44 It's fast.
44:45 It's so fast.
44:46 It's like if you've ever used pandas for processing large volumes of data, you've occasionally
44:51 hit memory limits or inefficiencies in doing these large aggregates.
44:56 I won't go into all the reasons of why that is, but DuckDB sort of changes that because
45:01 it's a fast serverless sort of C++ written tooling to do really fast vectorized work.
45:09 And by that, I mean, like it works on columns.
45:11 So typically in like SQLite, you're doing transactions, you're doing single row updates,
45:15 writes, inserts, and SQLite is great at that.
45:18 Where typical transactional databases fail or aren't as powerful is when you're doing
45:23 aggregates, when you're looking at an entire column, right?
45:26 Just the way they're architected.
45:27 If you want to know the average, the median, the sum of some large number of columns, and
45:32 you want to group that by a whole bunch of things, you want to know the first date someone
45:36 did something and the last one, those types of vectorized operations, DuckDB is really,
45:41 really fast at doing.
45:42 And it's a great alternative to, for example, pandas, which can often hit memory limits
45:48 and be a little bit slow in that regard.
45:50 Yeah, it looks like it has some pretty cool aspects, transactions, of course, but it also
45:54 says direct Parquet, CSV, and JSON querying.
45:59 So if you've got a CSV file hanging around and you want to ask questions about it or
46:04 JSON or some of the data science stuff through Parquet, you know, turn an indexed, proper query
46:09 engine against it.
46:10 Don't just use a dictionary or something, right?
46:12 Yeah, it's great for reading a CSV, zip files, tar files, Parquet, partitioned Parquet files,
46:19 all that stuff that usually was really annoying to do and operate on, you can now install
46:23 DuckDB.
46:24 It's got a great CLI too.
46:25 So before you go out and like program your entire pipeline, you just run DuckDB and you
46:30 can start writing SQL against CSV files and all this stuff to really understand your data
46:35 and just really see how quick it is.
46:37 I used it on a bird dataset that I had as an example project and there was, you know,
46:43 millions of rows and I was joining them together and doing massive group bys and it was done
46:47 in like seconds.
46:48 And it's just hard for me to believe that it was even correct because it was so quick.
46:52 So it's wonderful.
46:53 I must have done that wrong somehow.
46:55 Because it's done, it shouldn't be done.
46:58 Yeah.
46:59 And the fact it's in process means there's not a server for you to babysit,
47:03 patch, and make sure it's still running.
47:06 It's accessible, but not too accessible.
47:08 All that, right?
47:09 It's a pip install away, which is always, we love that, right?
47:12 Yeah, absolutely.
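For a concrete feel for the direct-file querying and vectorized aggregation described above, here is a minimal sketch in Python. The file name and column names (birds.csv, species, observed_at) are made-up placeholders, not the actual dataset from the episode.

```python
import duckdb

# Query a CSV on disk directly: DuckDB infers the schema and runs the aggregate
# with its vectorized, columnar engine -- no server and no separate loading step.
result = duckdb.sql("""
    SELECT species,
           count(*)         AS sightings,
           min(observed_at) AS first_seen,
           max(observed_at) AS last_seen
    FROM 'birds.csv'
    GROUP BY species
    ORDER BY sightings DESC
""")

print(result)      # pretty-printed relation
df = result.df()   # or pull the result into a pandas DataFrame if you need one
```

The same query works against Parquet or JSON files, and the DuckDB CLI lets you run it interactively before wiring anything into a pipeline.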
47:13 You mentioned, I guess I mentioned Parquet, but also Apache Arrow seems like it's making
47:18 its way into a lot of different tools as a sort of foundational layer for high-performance
47:22 in-memory processing.
47:25 Have you used this at all?
47:26 I've used it, especially through like working through different languages.
47:30 So moving data between Python and R is where I last used this.
47:34 I think Arrow's great at that.
47:35 I believe Arrow is underneath some of the Rust-to-Python tooling as well.
47:41 It's working there.
47:42 So typically I don't use Arrow like directly myself, but it's in many of the tooling I
47:46 use.
47:47 It's a great product.
47:48 And like so much of the ecosystem is now built on, on Arrow.
47:52 Yeah.
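As a rough illustration of Arrow as the shared in-memory format underneath many of these tools, here is a small sketch; the column names are invented for the example.

```python
import pyarrow as pa
import polars as pl

# Build an Arrow table: the same columnar layout that DuckDB, Polars, pandas,
# R, and much of the rest of the ecosystem can all understand.
table = pa.table({"species": ["robin", "wren", "robin"], "count": [3, 5, 2]})

# Hand it to Polars, or convert it to pandas, without a serialization
# round trip through CSV or pickles.
df_pl = pl.from_arrow(table)
df_pd = table.to_pandas()

print(df_pl.shape, df_pd.shape)
```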
47:53 I think a lot of it is, I feel like the first time I heard about it was through Polars.
47:55 That's right.
47:56 Yeah.
47:57 I'm pretty sure, which is another Rust-based take on pandas, but with a little bit
48:03 more fluent, lazy API.
48:05 Yes.
48:06 We live in such great times to be honest.
48:07 So Polars is Python bindings over a Rust engine, I believe, is kind of how I think about it.
48:13 It does all the transformation in Rust, but you have this Python interface to it, and
48:17 it makes things, again, incredibly fast.
48:20 I would say similar in speed to DuckDB.
48:22 They both are quite comparable sometimes.
48:24 Yeah.
48:25 It also has vectorized and columnar processing and all that kind of stuff.
48:29 It's pretty incredible.
48:31 So not a drop in replacement for pandas, but if you have the opportunity to use it and
48:36 you don't need to use the full breadth of what pandas offers, because pandas is quite
48:39 a huge package.
48:40 There's a lot it does, but if you're just doing simple transforms, I think Polars is
48:44 a great option to explore.
48:45 Yeah, I talked to Ritchie Vink, who is part of that.
48:49 And I think they explicitly chose to not try to make it a drop in replacement for pandas,
48:54 but try to choose an API that would allow the engine to be smarter and go like, I see
48:58 you're asking for this, but the step before you wanted this other thing.
49:02 So let me do that transformation all in one shot.
49:04 And a little bit like a query optimization engine.
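A minimal sketch of what that lazy, optimizer-friendly style looks like in Polars; the file and column names are placeholders, not from the episode.

```python
import polars as pl

# Nothing executes until collect(), so Polars can plan the whole pipeline at once
# (predicate pushdown, projection pushdown, fusing steps, and so on).
lazy = (
    pl.scan_csv("birds.csv")                  # hypothetical input file
      .filter(pl.col("count") > 0)
      .group_by("species")
      .agg(
          pl.col("count").sum().alias("total"),
          pl.col("observed_at").min().alias("first_seen"),
      )
)

df = lazy.collect()  # the optimized query runs here, in one shot
print(df)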
49:07 What else is out there?
49:08 Time for just a couple more.
49:10 If there's anything that you're like, Oh yeah, people use this all the time.
49:12 Especially the databases you've said, Postgres, Snowflake, et cetera.
49:16 Yeah, there's so much.
49:17 So another little one I like, it's called DLT, DLT hub.
49:21 It's getting a lot of traction as well.
49:23 And what I like about it is how lightweight it is.
49:25 I'm such a big fan of lightweight tooling.
49:28 That's not, you know, massive frameworks.
49:29 Loading data is I think still kind of yucky in many ways.
49:32 It's not fun.
49:33 And DLT makes it a little bit simpler and easier to do so.
49:36 So that's what I would recommend people just to look into if you got to either ingest data
49:41 from some API, some website, some CSV file.
49:45 It's a great way to do that.
49:47 It claims it's the Python library for data teams loading data into unexpected places.
49:52 Very interesting.
49:53 Yes, that's great.
49:54 Yeah, this is, this looks cool.
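Here is a small sketch of the kind of lightweight loading dlt is built for, assuming a hypothetical ingest of GitHub issues into a local DuckDB file; the endpoint and names are placeholders, not from the episode.

```python
import dlt
import requests

def fetch_issues():
    # Placeholder endpoint: swap in whatever API, website, or file you need to ingest.
    resp = requests.get("https://api.github.com/repos/dagster-io/dagster/issues")
    resp.raise_for_status()
    yield from resp.json()

# dlt infers and evolves the schema, normalizes the nested JSON, and loads it
# into the destination -- here a local DuckDB database.
pipeline = dlt.pipeline(
    pipeline_name="github_issues",
    destination="duckdb",
    dataset_name="github",
)

info = pipeline.run(fetch_issues(), table_name="issues")
print(info)
```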
49:56 All right.
49:57 Well, I guess maybe let's talk about, let's talk business and then we can talk about what's
50:01 next and then we'll probably be out of time.
50:03 I'm always fascinated.
50:04 I think there's starting to be a bit of a blueprint for this, but companies that take
50:09 a thing, they make it and they give it away and then they have a company around it.
50:12 And congratulations to you all for doing that.
50:14 Right.
50:15 And a lot of it seems to kind of center around the open core model, which I don't know if
50:20 that's exactly how you would characterize yourself, but yeah, maybe you should talk
50:24 about the business side.
50:25 Because I know there's many successful open source projects that don't necessarily result
50:29 in full-time jobs or companies if people were to want that.
50:32 It's a really interesting place.
50:34 And I don't think it's one that anyone has truly figured out well, I can say this is
50:39 the way forward for everyone, but it is something we're trying.
50:42 And I think for Dagster, I think it's working pretty well.
50:44 And what I think is really powerful about Dagster is like the open source product is
50:48 really, really good.
50:49 And it hasn't really been limited in many ways in order to drive like cloud product
50:54 consumption.
50:55 We really believe that there's actual value in that separation of these things.
50:58 There are some things that we just can't do in the open source platform.
51:01 For example, there's pipelines on cloud that involve ingesting data through our old systems
51:06 in order to do reporting, which just doesn't make sense to do on the open source system.
51:11 It makes the product way too complex.
51:13 But for the most part, I think Dagster open source, we really believe that just getting
51:16 it in the hands of developers is the best way to prove the value of it.
51:19 And if we can build a business on top of that, I think we're all super happy to do so.
51:23 It's nice that we get to sort of drive both sides of it.
51:27 To me, that's one of the more exciting parts, right?
51:29 A lot of the development that we do in Dagster open source is driven by people who are paid
51:35 through what happens on Dagster cloud.
51:37 And I think from what I can tell, there's no better way to build open source product
51:41 than to have people who are adequately paid to develop that product.
51:45 Otherwise it can be a labor of love, but one that doesn't last for very long.
51:48 Yeah.
51:49 Whenever I think about building software, there's the 80% of it that's super exciting and
51:52 fun, another 10%, and then there's that little sliver of really fine polish that, if it's not
51:58 just your job to make that thing polished, you're for the most part just not going
52:03 to polish that bit, right?
52:04 It's tough.
52:05 UI, design, support.
52:08 There's all these things that go into making a software like really extraordinary.
52:12 That's really, really tough to do.
52:14 And I think I really like the open source business model.
52:17 I think for me, being able to just try something without having to talk to sales, and being able
52:21 to just deploy locally and test it out and see if this works.
52:24 And if I choose to do so, deploy it in production, or if I bought the cloud product and I don't
52:30 like the direction that it's going, I can even go to open source as well.
52:34 That's pretty compelling to me.
52:35 Yeah, for sure it is.
52:37 And I think the more moving pieces of infrastructure, the more uptime you want, and all those types of
52:43 things, the more it matters for somebody who's maybe a programmer, not a DevOps infrastructure person, but
52:49 needs to have it there, right?
52:50 Like that's an opportunity as well, right?
52:52 For you to say, look, you can write the code.
52:55 We made it cool for you to write the code, but you don't have to like get notified when
52:59 the server's down or whatever.
53:00 Like, we'll just take care of that for you.
53:01 That's pretty awesome.
53:02 Yeah.
53:03 And it's efficiencies of scale as well, right?
53:04 Like we've learned the same mistakes over and over again, so you don't have to, which
53:08 is nice.
53:09 I don't know how many people want to maintain servers, but people do.
53:13 And they're more than welcome to if that's how they choose to do so.
53:15 Yeah, for sure.
53:16 All right.
53:17 Just about out of time.
53:18 Let's wrap up our conversation with where are things going for Dagster?
53:23 What's on the roadmap?
53:24 What are you excited about?
53:25 Oh, that's a good one.
53:26 I think we've actually published our roadmap online somewhere.
53:29 If you search Dagster roadmap, it's probably out there.
53:31 I think for the most part that hasn't changed much going into 2024, though we may update
53:36 it.
53:37 Ah, there it is.
53:38 We're really just doubling down on what we've built already.
53:40 I think there's a lot of work we can do on the product itself to make it easier to use,
53:45 easier to understand.
53:46 My team specifically is really focused around the education piece.
53:49 And so we launched Dagster University's first module, which helps you really understand
53:53 the core concepts around Dagster.
53:56 Our next module is coming up in a couple months, and that'll be around using Dagster with dbt,
54:00 which is our most popular integration.
54:02 We're building up more integrations as well.
54:04 So I built a little integration called embedded ELT that makes it easy to ingest data.
54:09 But I want to actually build an integration with DLT as well, DLT hub.
54:12 So we'll be doing that.
54:14 And there's more coming down the pipe, but I don't know how much I can say.
54:18 Look forward to an event in April where we'll have a launch event on all that's coming.
54:23 Nice.
54:24 Is that an online thing people can attend or something?
54:26 Exactly.
54:27 Yeah, there'll be some announcement there on the Dagster website on that.
54:31 Maybe I will call it one thing that's actually really fun.
54:33 It's called Dagster Open Platform.
54:35 It's a GitHub repo that we launched a couple months ago, I want to say.
54:39 We took our internal...
54:40 I should go back one more.
54:42 Sorry.
54:43 I should go back to GitHub, Dagster Open Platform on GitHub.
54:45 I have it somewhere.
54:47 Yeah.
54:48 It's here under the organization.
54:51 Yes, it should be somewhere here.
54:54 There it is.
54:55 Dagster Open Platform on GitHub.
54:57 And it's really a clone of our production pipelines.
54:59 For the most part, there's some things we've chosen to ignore because they're sensitive.
55:03 But as much as possible, we've defaulted to making it public and open.
55:06 And the whole reason behind this was because, you know, as data engineers, it's often hard
55:10 to see how other data engineers write code.
55:12 We get to see how software engineers write code quite often, but most people don't want
55:16 to share their platforms for various good reasons.
55:19 Right.
55:20 Also, there's like smaller teams or maybe just one person.
55:23 And then like those pipelines are so integrated into your specific infrastructure, right?
55:29 So it's not like, well, here's a web framework to share, right?
55:32 Like, here's how we integrate into that one weird API that we have that no one else has.
55:36 So it's no point in publishing it to you, right?
55:39 That's typically how it goes.
55:40 Or they're so large that they're afraid that there's like some, you know, important information
55:44 that they just don't want to take the risk on.
55:46 And so we built like something that's in the middle where we've taken as much as we can
55:49 and we've publicized it.
55:51 And you can't run this on your own.
55:52 Like it's not, that's not the point.
55:53 The point is to look at the code and see, you know, how does Dagster use Dagster and what
55:56 does that kind of look like?
55:57 Nice.
55:58 Okay.
55:59 All right.
56:00 Well, I'll put a link to that in the show notes and people can check it out.
56:01 Yeah, I guess let's wrap it up with the final call to action.
56:05 People are interested in Dagster.
56:06 How do they get started?
56:07 What do you tell them?
56:08 Oh, yeah.
56:09 Well, Dagster is probably the greatest place to start.
56:11 You can try the cloud product.
56:13 We have free self-serve or you can try the local install as well.
56:18 If you get stuck, a great place to join is our Slack channel, which is up on our website.
56:22 There's even an Ask AI channel where you can just talk to a Slack bot that's been trained
56:27 on all our GitHub issues and discussions.
56:29 And it's surprisingly good at walking you through, you know, any debugging, any issues
56:33 or even advice.
56:34 And that's pretty excellent, actually.
56:36 Yeah.
56:37 It's real fun.
56:38 It's really fun.
56:39 It's a great community where you can just chat to us as well.
56:41 Cool.
56:42 All right.
56:43 Well, Pedram, thank you for being on the show.
56:44 Thanks for all the work on Dagster and for sharing it with us.
56:47 Thank you, Michael.
56:48 You bet.
56:49 See you later.
56:50 This has been another episode of Talk Python to Me.
56:52 Thank you to our sponsors.
56:54 Be sure to check out what they're offering.
56:55 It really helps support the show.
56:58 This episode is sponsored by Posit Connect from the makers of Shiny.
57:01 Publish, share, and deploy all of your data projects that you're creating using Python.
57:06 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards, and APIs.
57:13 Posit Connect supports all of them.
57:15 Try Posit Connect for free by going to talkpython.fm/posit.
57:18 P-O-S-I-T.
57:19 Want to level up your Python?
57:24 We have one of the largest catalogs of Python video courses over at Talk Python.
57:28 Our content ranges from true beginners to deeply advanced topics like memory and async.
57:33 And best of all, there's not a subscription in sight.
57:35 Check it out for yourself at training.talkpython.fm.
57:39 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.
57:43 We should be right at the top.
57:45 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the Direct
57:50 RSS feed at /rss on talkpython.fm.
57:54 We're live streaming most of our recordings these days.
57:57 If you want to be part of the show and have your comments featured on the air, be sure
58:00 to subscribe to our YouTube channel at talkpython.fm/youtube.
58:05 This is your host, Michael Kennedy.
58:07 Thanks so much for listening.
58:08 I really appreciate it.
58:09 Now get out there and write some Python code.
58:12 [MUSIC PLAYING]
58:15 [MUSIC ENDS]