
#491: DuckDB and Python: Ducks and Snakes living together Transcript

Recorded on Tuesday, Dec 10, 2024.

00:00 Have you heard about DuckDB?

00:01 Well, join me for an insightful conversation with Alex Monahan, who works on documentation tutorials and training at DuckDB Labs.

00:09 We explore why DuckDB is gaining momentum among Python and data enthusiasts

00:14 from its in-process database design to its blazing fast columnar architecture.

00:19 We also dive into indexing strategies, concurrency considerations, and the fascinating way that MotherDuck, the cloud companion to DuckDB,

00:28 handles large-scale data seamlessly.

00:30 Don't miss this chance to learn how a single pip install could totally transform your Python data workflow.

00:36 This is Talk Python to Me, episode 491, recorded December 10th, 2024.

00:41 Are you ready for your host? Here he is!

00:43 You're listening to Michael Kennedy on Talk Python to Me.

00:47 Live from Portland, Oregon, and this segment was made with Python.

00:51 Welcome to Talk Python to Me, a weekly podcast on Python.

00:57 This is your host, Michael Kennedy.

00:59 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both accounts over at fosstodon.org,

01:07 and keep up with the show and listen to over nine years of episodes at talkpython.fm.

01:12 If you want to be part of our live episodes, you can find the live streams over on YouTube.

01:17 Subscribe to our YouTube channel over at talkpython.fm/youtube, and get notified about upcoming shows.

01:23 This episode is brought to you by Sentry.

01:25 Don't let those errors go unnoticed.

01:27 Use Sentry like we do here at Talk Python.

01:29 Sign up at talkpython.fm/sentry.

01:32 And it's brought to you by the Data Citizens Dialogues podcast from Collibra.

01:36 If you're ready for a deeper dive into the latest hot topics in data, listen to an episode at talkpython.fm/citizens.

01:43 Alex, welcome to Talk Python to Me.

01:46 Howdy, thanks so much for having me.

01:47 Yeah, it's fabulous to have you here.

01:49 I'm really excited to talk about databases.

01:51 Databases are so important for making your app work well.

01:55 Making them fast, it surprises me how many people have slow databases.

02:00 It doesn't make sense.

02:01 We're pretty big fans of databases around here, for sure.

02:03 And fast ones as well, right?

02:04 As best we can.

02:06 That's right.

02:07 Well, DuckDB is certainly on the rise in terms of me hearing people talk about like,

02:13 oh, have you heard DuckDB?

02:14 Have you seen DuckDB?

02:16 Have you worked with it?

02:17 And I think it continues an interesting tradition of lots of low stress near your application type

02:24 of databases.

02:25 So it's going to be fun to dive into that for sure.

02:27 Before we do, tell us about yourself.

02:29 Sure.

02:29 Well, thanks so much for having me.

02:30 Huge fan of the show.

02:31 Really excited to be here.

02:32 So I appreciate that.

02:34 I work at MotherDuck.

02:35 I'm a forward deployed software engineer, which is kind of a made up title.

02:39 A couple of companies use it, but it's really a customer facing software engineer.

02:42 So working closely with folks to see how MotherDuck fits in with all the rest of their stack.

02:47 So MotherDuck is a cloud data warehouse with DuckDB at its core.

02:50 So absolutely a huge part of what we do.

02:53 I also work part time for DuckDB Labs, which is our partner organization.

02:57 And I do blogging and some developer advocacy types of things for them.

03:02 And so it's a separate company based in Amsterdam, the Netherlands.

03:06 And they are database experts and researchers.

03:09 And they're kind of the stewards of the open source project.

03:11 Yeah.

03:11 Awesome.

03:12 What does a forward deployed engineer do?

03:15 Do you get called into big companies who want to adopt the technology?

03:19 And they say, hey, could you help us do this quicker?

03:21 Or are they running into trouble?

03:23 And they're like, you said it was fast.

03:25 It's not fast.

03:26 Help.

03:26 Well, don't copy this on every query.

03:28 Then it'll be better.

03:30 There's quite a lot of variety.

03:31 I think there's definitely some aspects of, you know, hey, can you help me with this?

03:36 There's also aspects of as folks are getting started, they're kind of picking out the rest of the tools around the database.

03:40 So what BI tool should I use?

03:43 What orchestrator should I use?

03:44 And then, you know, we do get into some SQL optimization, stuff like that.

03:48 So that's kind of the customer facing side.

03:49 That's the majority of the role.

03:50 There's also troubleshooting that goes along with it, as well as kind of feeding things back into the company, being the voice of the customer saying, hey, we've had three people hit this.

03:58 Is there a way we can build something different?

04:01 Yeah.

04:01 Yeah.

04:02 Like everyone thinks they should do it this way.

04:04 We need to change our docs or they want this particular feature and we don't have it.

04:08 Yeah.

04:09 Yep.

04:10 Yep.

04:10 It's a good cycle that we've got going.

04:12 It sounds really fun.

04:13 It's a lot of fun.

04:14 The variety is my favorite part.

04:16 And, you know, a lot of fun investigation, troubleshooting, you know, got to reframe it as a scavenger hunt.

04:21 You know, it's fun.

04:22 Yeah.

04:23 Every day is like a box of chocolates, huh?

04:25 Oh, yeah.

04:26 A little different all the time.

04:27 You know, you don't know what you're going to get.

04:29 I used to do in-person professional training.

04:33 You know, I'd travel to Chicago or New York or Beijing and I would show up Monday morning.

04:39 I mean, all right, folks, what are we, you know, what's the situation?

04:43 Obviously, I didn't know what we were going to talk about, but it was still, it was different every time.

04:46 Every week was different.

04:48 Sounds like it might be not so different from what you're doing.

04:50 Are a lot of these in-person things or are they mostly Zoom style?

04:55 It's Zoom style.

04:56 You know, just we work with a variety of folks, not, you know, like a couple of really giant customers.

05:01 So that variety means that I couldn't fly fast enough.

05:04 Yeah, for sure.

05:06 Or, you know, issues I've had as well as like three people on the team are in this city.

05:10 Three people are in that other city and one is here.

05:12 It's like even if you flew somewhere, you would only be, you'd still be on Zoom.

05:15 So you might as well just be on Zoom.

05:16 Definitely, definitely.

05:18 Absolutely.

05:18 The world we live in.

05:19 Well, I mean, that's a blessing, honestly, to not have to fly to all these places as well as just be able to pop in and out, right?

05:27 I mean, this work from home story.

05:29 It's amazing.

05:29 It's only getting better all the time.

05:31 So let's talk DuckDB and databases.

05:35 I think maybe setting a bit of a mental model.

05:38 A lot of folks are familiar with SQLite, right?

05:43 Yep.

05:43 Absolutely.

05:43 And so SQLite is a library, not a server, something you can use that then plugs into your application and works on a local file, rather than a thing you run in another VM or wherever. You know, there's not a managed SQLite.

05:58 Not that I know, I know maybe there's a managed SQLite, I don't know of one.

06:01 There are.

06:01 There are.

06:02 Are there?

06:02 Incredible.

06:03 Yeah, there's a bunch of cool stuff that you can do if you have a database that's just part of your application instead of somewhere else.

06:10 You know, another one that I just ran across is moka-py, which is a Python wrapper around, I think it's the Rust Moka caching thing.

06:21 But instead of being Redis or something like that, it's like an in-memory thing that you can run for sort of the same purpose.

06:27 So DuckDB sort of falls into that realm, right?

06:31 Yes.

06:31 I think it's an in-process database, just like you said.

06:34 So it's definitely a library.

06:35 You know, it's an import duckdb.

06:37 So it's not in the Python standard library, but it is on PyPI, just pip install DuckDB.

06:42 It's pre-compiled, zero dependencies.

06:45 So it is super easy.

06:46 Basically, anywhere you can get to pip, you can get DuckDB running.

06:49 So that could be your laptop.

06:50 That could be a big, beefy server.

06:52 It could be a Lambda function.

06:54 It could be inside of your orchestrator, you know, in a GitHub action.

06:58 Kind of slots in everywhere.

06:59 That's part of the magic.

07:00 Presumably, when you pip install DuckDB, it's just downloading a wheel for your architecture, for your platform?

07:06 Yep.

07:06 Pre-compiled all the combinations.

07:08 Yeah, that's really good.

07:09 Yeah, if you just sit on the DuckDB.org homepage, you can learn about it as it animates different messages to you.

07:16 It says DuckDB is a fast analytical database system, open source database system, in-memory database system.

07:22 We'll see what else.

07:23 Yeah.

07:23 Portable.

07:24 So, yeah, it works on most of the places, the main operating systems, right?

07:29 Yes.

07:29 Yep, it's all over.

07:30 Can I Raspberry Pi it?

07:32 I'd imagine the answer is yes.

07:33 Can I CircuitPython it?

07:35 Probably not, right?

07:36 I haven't checked that.

07:37 You can do iPhone and Android.

07:38 There's a Swift client, so you can build iPhone apps and Android works as well.

07:43 So, yeah, you could do that.

07:44 Works in your browser with WebAssembly.

07:46 It'll work with Pyodide in the browser as well.

07:49 So, it really goes all over.

07:50 Oh, that's really cool.

07:51 Yeah, one of the things that I said people are doing interesting stuff with, I had SQLite in mind, is I saw some folks suggesting using SQLite in the browser instead of local storage and, like, all the JavaScript-y things.

08:06 And could I do that with DuckDB?

08:07 You could definitely do that with DuckDB.

08:10 I think the use case is slightly different, but that's sort of the benefit is you can use both.

08:15 So, definitely could talk a bit more about the differences, but I'd say in general, a good mental model for DuckDB is kind of a from scratch separate project from SQLite, but taking a lot of the lessons and learnings about what makes people love SQLite and bringing that into the analytical realm.

08:30 And that's really, instead of inserting and updating tons of tiny transactions all the time really fast, it's big bulk operations.

08:37 Aggregate a billion rows, join a billion rows to a billion rows, that type of thing.

08:42 Yeah, doing that in memory in JavaScript is not that great.

08:45 Doing that in memory in Python is not that great.

08:47 But it sounds a little bit more like vector programming in the sense of kind of like you would do with Pandas or Polars, right?

08:56 You wouldn't loop over, you shouldn't loop over a pandas data frame processing each item.

09:02 You should issue vector operations that apply to every row, like multiply this column by two rather than for every C in column, you know, whatever, like looping it over.

09:11 Yes.

09:12 So, it is exactly what you said.

09:14 It's designed for that.

09:15 It's vectorized execution.

09:17 So, it's a little different than pandas.

09:18 It's, you know, Polars is a bit more similar.

09:21 Where instead, pandas will process a whole column at a time.

09:25 And that works a lot better than one row at a time until your columns get too big and then it crashes.

09:30 Yeah.

09:30 So, what DuckDB does is it does it in chunks of about 2,000 rows at a time.

09:34 And so, that way you get all the benefits of your modern CPU, you know, hierarchy, your caches, all that, where it's just pumping data through.

09:41 But you also conserve memory and you can not get into issues with that.

09:46 I will also say it's technically we are an in-process database, not an in-memory database.

09:51 Oh, sorry.

09:52 If I said in-memory, I meant in-process.

09:54 Yeah.

09:54 It is, you know, very much an interchangeable word like in English.

09:58 We just are very specific about it in database land.

10:00 No, you do.

10:01 And you do support in-memory databases.

10:04 Yes.

10:04 But that's not the same as in-process.

10:06 Yeah, sorry, in-process is what I meant.

10:07 Oh, no problem.

10:08 All I mean to say there is, you know, you don't need to limit yourself to your size of RAM.

10:12 You know, you can solve one terabyte problems on your laptop with DuckDB, no problem.

10:16 Really?

10:17 Yes, it handles larger than memory data, both streaming from disk.

10:21 So, it'll just read just the pieces it needs as it goes and even in the intermediates.

10:25 So, let's say you join two tables together and you, you know, multiply the number of rows.

10:30 It'll send things to disk as it gets memory pressure close to your memory limit.

10:34 So, it's a lot of effort has been put into that in the last year or so.

10:38 So, it's working pretty smooth now.

10:40 Yeah, I can imagine that that's super important.

10:42 So, I'm really looking forward to talking about schemas, indexes, those kinds of things,

10:48 which sound like they may play important roles here.

10:50 But you talked about these different ways in which you can use DuckDB.

10:55 Looking through the documentation, it looks like it works with not just Python, but quite a few platforms or languages.

11:00 Yes, somewhere north of 15.

11:03 So, Python is the most popular way to use DuckDB.

11:05 We've got some really tight integrations with data frame libraries, which we can definitely talk more about.

11:09 There's a lot of Java use out there.

11:12 So, JDBC, we've got a JDBC driver.

11:13 You can use it in JavaScript, both in the browser with Wasm or on the server with Node, popular Node package.

11:21 You can also do Go, Rust, C++, C.

11:25 I think there's like a community-driven Ruby client, you know, really all over the place.

11:30 So, .NET, C#, you can use it from .NET, all over.

11:34 Yeah.

11:34 Yeah, that's awesome.

11:35 And I'm guessing it's interchangeable, the file?

11:38 Yes.

11:39 So, just like SQLite files are designed to be read for a long time, they're very portable.

11:45 DuckDB just kind of reached the 1.0 milestone in June.

11:49 So, super excited about that.

11:50 And with that, you know, promised stability with the file format.

11:54 And that means that we expect for the next somewhere around five years to have your DuckDB files be totally compatible,

12:01 where you can read a five-year-old file five years from now type thing.

12:04 Yeah.

12:04 Which means you can, it's like SQLite in that you can fit all your tables and all the relationships between them all in one file.

12:11 So, that's one nice thing about SQLite.

12:14 You can email the whole database.

12:15 You can do the same with DuckDB.

12:17 But DuckDB is going to store it a lot more compactly if you've got a lot of data.

12:21 Oh, really?

12:22 Okay.

12:22 You guys have worked a lot on the file format.

12:24 It's, again, taking inspiration from SQLite, the format is very separate.

12:29 And in databases, you kind of have a pretty hard fork in the road pretty early on in your architecture,

12:34 which is you want to be row-based or do you want to be column-based?

12:37 Right.

12:38 And row-based is SQLite and column-based would be your Pandas, your Polars,

12:42 and most of your cloud data warehouses, your snowflakes, that type of thing.

12:46 And once you go column-based, suddenly your compression is amazing because you store data that's very similar right next to each other.

12:53 So, you can really compress it.

12:54 You know, your dates are back-to-back.

12:56 Instead of, you know, a date column, string column, integer column in a row-based store, they don't compress well.

13:01 So, we can get somewhere around 5x the compression typically.

13:04 Okay.

13:04 That's really great.

13:06 And you could take this file and you could put it into S3 or use it with some sort of Jupyter Notebook, right?

13:13 And just, they could just grab the file and run with it.

13:15 And it's all the data, yeah?

13:16 Yes.

13:17 Object stores are really the name of the game there.

13:20 You can even work with a file on Object Store and read it without having to download the whole file first.

13:26 So, DuckDB supports something like SQLite in that it's got this attach syntax.

13:30 So, you can say, hey, I want you to attach this file and then just read the pieces of it that I actually query, which is pretty slick.

13:37 Yeah, that is super slick.
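Roughly, the attach-and-query flow being described looks like this in Python; the file name and table here are made up for illustration:

    import duckdb

    # Attach an existing DuckDB file; nothing is scanned until you query it.
    duckdb.sql("ATTACH 'sales_2024.duckdb' AS sales (READ_ONLY)")

    # Only the pieces needed to answer this query are actually read.
    duckdb.sql("SELECT region, sum(total) FROM sales.orders GROUP BY region").show()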

13:38 All right.

13:38 Let's, let me scroll down here.

13:41 And let's talk about maybe using it from Python.

13:45 Give us a sense of, like, what working with DuckDB is like here.

13:49 You bet.

13:50 So, you've got your pip install DuckDB, which will download that pre-compiled binary.

13:54 It's like 20 or 30 megabytes.

13:55 So, pretty close to instant.

13:56 Then import DuckDB, and then you can do DuckDB.SQL.

14:00 And then within that function, you just pass in a string of your SQL statement and it'll execute.

14:05 It's really that simple.

14:06 And so, it follows the DB API if you want.

14:11 So, you can get back your typical tuples.

14:11 Or you can say .df at the end and get back a Pandas data frame.

14:17 Or .pl and get Polars.

14:17 All kinds of stuff.
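As a minimal sketch of what that end-to-end flow looks like (the query itself is just a stand-in):

    # pip install duckdb
    import duckdb

    rel = duckdb.sql("SELECT 42 AS answer, 'duck' AS animal")

    print(rel.fetchall())  # plain Python tuples, DB-API style
    print(rel.df())        # pandas DataFrame
    print(rel.pl())        # Polars DataFrame, if polars is installed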

14:18 Yeah.

14:19 I saw that there's a lot of to and from the popular data science libraries, which seems really nice.

14:24 You work in your data science library or maybe run a query and get a subset of your data.

14:29 And then maybe plot that with Plotly, Altair, or pass it along to other things that expect a data frame, right?

14:36 Yes, exactly.

14:37 A lot of the DuckDB ethos is don't ask folks to ship their data necessarily somewhere else.

14:42 Don't ask them to change their workflow dramatically to use a database.

14:45 Meet folks right where they are.

14:47 And that means you can have one line of code be in your data frame library of choice.

14:51 Pandas, Polars, or Arrow.

14:53 And then a line of code in DuckDB.

14:55 And then you're right back in your data frame in the next line.

14:58 So, it fits really wherever you want.

14:59 It's a great fit if you want to throw some SQL at a problem.

15:02 Some problems are easier to formulate in SQL, I think, than in data frame libraries.

15:06 And vice versa.

15:06 But also, some libraries, it's harder to scale larger than memory.

15:11 And DuckDB does have a lot of nice performance under the hood.

15:14 So, maybe the hardest part of your workflow, you can slide in DuckDB and make the minimal change to get it working.

15:23 This portion of Talk Python to me is brought to you by Sentry.

15:26 Code breaks.

15:27 It's a fact of life.

15:29 With Sentry, you can fix it faster.

15:31 As I've told you all before, we use Sentry on many of our apps and APIs here at Talk Python.

15:37 I recently used Sentry to help me track down one of the weirdest bugs I've run into in a long time.

15:42 Here's what happened.

15:43 When signing up for our mailing list, it would crash under an uncommon execution path.

15:49 Like, situations where someone was already subscribed or entered an invalid email address or something like this.

15:55 The bizarre part was that our logging of that unusual condition itself was crashing.

16:01 How is it possible for our log to crash?

16:04 It's basically a glorified print statement.

16:07 Well, Sentry to the rescue.

16:08 I'm looking at the crash report right now, and I see way more information than you'd expect to find in any log statement.

16:15 And because it's production, debuggers are out of the question.

16:19 I see the traceback, of course, but also the browser version, client OS, server OS, server OS version, whether it's production or QA, the email and name of the person signing up.

16:30 That's the person who actually experienced the crash.

16:32 Dictionaries of data on the call stack and so much more.

16:35 What was the problem?

16:36 I initialized the logger with the string info for the level rather than the enumeration dot info, which was an integer-based enum.

16:45 So the logging statement would crash, saying that I could not use less than or equal to between strings and ints.

16:52 Crazy town.

16:53 But with Sentry, I captured it, fixed it, and I even helped the user who experienced that crash.

17:00 Don't fly blind.

17:01 Fix code faster with Sentry.

17:03 Create your Sentry account now at talkpython.fm/sentry.

17:07 And if you sign up with the code TALKPYTHON, all capital, no spaces, it's good for two free months of Sentry's business plan, which will give you up to 20 times as many monthly events as well as other features.

17:20 It seems to me like maybe you put all of the data into DuckDB, and then you can ask questions which result in Python data science objects that then you can work on, where those questions result in, give me the stuff for this state or this time period or something like that where it's a reasonable amount of data.

17:37 But if you tried to just load it all, then it would say no.

17:41 I think that's a great way to go about it for sure.

17:43 You know, another common workflow is to do some, you know, similar things, but with Parquet files or CSVs or JSONs.

17:51 So in the same way you can store a lot of your data in DuckDB, you can periodically write chunks out to Parquet or read in from Parquet, all that type of stuff too.
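A small sketch of that Parquet round trip; the events table and file names here are hypothetical:

    import duckdb

    con = duckdb.connect("analytics.db")

    # Write a chunk of a table (or any query result) out to Parquet...
    con.sql("""
        COPY (SELECT * FROM events WHERE day = DATE '2024-12-01')
        TO 'events_2024-12-01.parquet' (FORMAT parquet)
    """)

    # ...and query Parquet files directly, no import step required.
    con.sql("SELECT count(*) FROM 'events_*.parquet'").show()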

17:59 Yeah.

17:59 Tell people what Parquet is.

18:00 I know it comes from PyArrow or the Arrow project, right?

18:04 Where does it come from?

18:05 I know they use it.

18:06 I don't know if they came up with it, but yeah.

18:08 They definitely collaborated and were a big part of it.

18:11 So it's a community developed open format.

18:14 It's Apache Parquet.

18:15 So it's a open source community driven and it's a columnar format.

18:21 So instead of storing things by rows, it's columnar storage and it's chunked columnar storage.

18:26 So it's sort of like somewhere around every million rows or so you kind of write out another chunk.

18:30 And it ends up being a really good format to store large amounts of data in the cloud.

18:36 And it's typically what's underneath a data lake or a data lake house.

18:40 The vast majority of those have Parquet under the hood.

18:43 I see.

18:44 We love the Parquet format at DuckDB.

18:46 We read it and write it.

18:46 We have a custom from scratch reader, custom from scratch writer for it.

18:49 We actually prefer our own format given the choice as well.

18:53 Parquet has been around for a decade or so at this point.

18:56 And we've added a few things over time into DuckDB that are new.

19:00 Yeah.

19:00 Parquet is one of those things that Pandas 2.0 supports, right?

19:04 Polars supports it.

19:05 It's pretty universal.

19:07 So it might be a good way to get data from that you've worked on before into DuckDB.

19:12 Yes, it's great.

19:13 I think, you know, most universal CSV, most difficult, also CSV.

19:17 Parquet is a really great compromise there.

19:20 Yeah.

19:21 Hold on.

19:24 My computer is lagging.

19:25 It's distracting me.

19:26 I apologize.

19:27 One sec.

19:29 Scroll, scroll.

19:31 So DuckDB works well, it sounds like for the data science story, like it's built for the data science story.

19:36 If I was wanting to store relational data, you know, maybe I've already got DuckDB working for something.

19:42 I'm like, you know, do I necessarily want to bring SQLite into the story as well and have these two things?

19:48 Is it possible to, say, do relational type things with DuckDB?

19:53 Yes, that's a great question.

19:55 So I think DuckDB is trying to bring the best of both.

19:58 It is a full relational database.

20:00 And as well as supporting some of the, you know, data science workflows that you typically in the past would have been doing in a data frame.

20:06 So DuckDB is an analytical system, not transactional.

20:10 So it's not going to be the fastest at doing individual row operations, but it supports them all.

20:15 And it supports them with full ACID transactions that you'd expect from a database, where you won't get database corruption, even if your computer cuts out.

20:23 You know, it's got that kind of robustness that you expect from a database.

20:27 So part of the story, though, is that with an analytical database, it's typically not your only database unless you're doing data science work.

20:34 If you're building an application, you're going to want a transactional database for, you know, saving your, you know, when a customer places an order, you know, all that type of transactional stuff is still critical.

20:44 So typically DuckDB is going to be layered in addition.

20:46 And the interoperability story there with DuckDB is pretty, pretty fantastic.

20:50 You can actually read SQLite files directly with DuckDB.

20:54 And it's going to be, if you're doing any sort of advanced processing on them, it's going to be a lot quicker in DuckDB.

21:00 You can even read Postgres databases.

21:02 You can read MySQL databases.

21:04 Oh, wow.

21:04 There's a community extension to read BigQuery.

21:06 So it's really a universal Swiss Army knife in a lot of ways to read from those operational data stores, but to do those analytical tasks.

21:14 You know, look at all my orders.

21:17 What was the most popular five products?

21:19 You know, that's going to be a tough query for a transactional system, but bread and butter for DuckDB.

21:23 So when you say read from, let's say, Postgres, does that import it into a table in DuckDB or does it translate the queries over to the underlying data system?

21:35 What's happening there?

21:36 It's a great question.

21:37 We are communicating with Postgres, so we're not importing everything.

21:42 We're sending the query to it and we're reading all the data from Postgres.

21:46 Typically, that's going to be scanning the table or forming a view from Postgres.

21:49 And then after that point, bring it into the DuckDB engine for subsequent processing.

21:54 But you don't have to migrate all your data.

21:56 It's kind of a data virtualization is one of the buzzwords.

22:00 It's been around a little bit.

22:01 But, you know, it's getting to get some of the benefits of analytical performance without having to move your data first.

22:08 It'll definitely be more performant if you put it in an analytical format like Parquet, like DuckDB, like an analytical database.

22:14 But you get a big fraction of that benefit with the engine part, which is the vectorized engine of doing things 2,000 rows at a time instead of one row at a time.

22:23 That's a lot of it.
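A hedged sketch of that attach-and-query pattern; it assumes the sqlite and postgres extensions are available (they load on demand in recent DuckDB releases), and the connection string, file, and table names are placeholders:

    import duckdb

    duckdb.sql("ATTACH 'app.sqlite' AS app (TYPE sqlite)")
    duckdb.sql("ATTACH 'dbname=shop host=localhost' AS pg (TYPE postgres)")

    # The rows come from the source systems, but the join and aggregation
    # run in DuckDB's vectorized engine, in chunks of roughly 2,000 rows.
    duckdb.sql("""
        SELECT p.name, count(*) AS order_count
        FROM pg.public.orders AS o
        JOIN app.products AS p ON p.id = o.product_id
        GROUP BY p.name
        ORDER BY order_count DESC
        LIMIT 5
    """).show()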

22:25 Yeah, that's a huge benefit.

22:25 It's almost like microservices for databases.

22:29 You can click the things together.

22:31 But the benefit would be, you know, that if you have an operational database, that is, it's getting read to and written to and read from in real time.

22:40 When you ask questions about it, it's using that real time data.

22:43 It's not, well, we did an export last night and here's what we know.

22:47 You can just get more live data, I guess.

22:50 Yes, that's definitely a use case for that.

22:52 You can kind of go either end of that spectrum.

22:54 You can use that as an easy way to import or you can pull it live or you can kind of do midway and say, for any data older than an hour, use what I extracted.

23:04 But grab the last hour right out of the operational database and get the best of both.

23:08 Oh, wow.

23:08 Okay.

23:09 Yeah, that'd be pretty wild, wouldn't it?

23:10 So you say that you all have the simplest or friendliest SQL syntax.

23:17 What's the story with that?

23:18 Sure.

23:18 I think a lot of that is around part of the origin story of DuckDB is reaching out to the data science community and hearing feedback that we've invented this entire other thing called data frames because we didn't love databases.

23:30 Yeah, exactly.

23:31 They didn't answer the questions we wanted or whatever.

23:35 Yeah.

23:35 Right.

23:36 It was, there was a lot of friction.

23:37 And, you know, if you've, you know, folks have installed Postgres and set it up, there's a lot more to it than, you know, import pandas as PD.

23:45 It's just a different level of hassle that you have to do.

23:49 The SQL syntax is no different.

23:51 You know, SQL is an old language, much older than Python.

23:54 And it's got a lot of very interesting behavior that, you know, by modern standards is very different.

24:00 So with DuckDB, we've really pushed the envelope in a lot of ways.

24:04 Nice.

24:14 Well, you know, many of the things that relational databases were built for, and when they were designed, this was the 70s and they had different constraints, you know, memory was expensive.

24:14 Disk was expensive.

24:15 So they made these trade-offs in that world, right?

24:18 Yes.

24:19 And it's absolutely.

24:21 There's definitely some aspects of SQL where it's been around a long time, so it'll be around forever.

24:25 But what DuckDB is trying to do is have it not be frozen in time forever and really push the whole language forward.

24:31 So there's a couple of cool ways.

24:33 The syntax itself is designed to be both easier to use and also push it into areas that are new and different for databases.

24:40 So easier to use, for example, in SQL, if you want to aggregate, you kind of typically need two ingredients.

24:46 You need an aggregation function like a sum or a max or an average.

24:49 And then you want to group by clause to say, what's the level of detail I want to group at?

24:55 Do I want to group by customer or customer and order?

24:58 You know, what level of detail am I looking at?

25:00 DuckDB can actually infer that level of detail with group by all and just say, whichever columns you're looking at, if you're looking at the customer ID and you're getting an average,

25:09 okay, I'll just aggregate it up at customer ID level for you.

25:11 And so it cuts out a whole clause of SQL.

25:13 You basically never have to worry about the group by clause ever again.

25:17 Just put the keyword all, you're good to go.
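For instance, with a hypothetical orders file, the shorthand looks something like this:

    import duckdb

    duckdb.sql("""
        SELECT customer_id, avg(order_total) AS avg_total
        FROM 'orders.parquet'
        GROUP BY ALL            -- inferred: group by customer_id
        ORDER BY avg_total DESC
    """).show()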

25:19 I love it because the group by clause is hard.

25:21 It's anytime you have to edit something in two places, just you're already in trouble, right?

25:25 I mean, just, you know, the don't repeat yourself is very valuable.

25:29 Yeah.

25:29 Yeah.

25:30 The joins and the group buys and the whole aggregation pipeline stuff, which is incredibly powerful,

25:35 especially in analytical databases like DuckDB, but also hard.

25:38 Yes, absolutely.

25:40 And I think obviously anything we can do to reduce the friction there is helpful.

25:43 So we are talking about the friendly SQL dialect.

25:46 I'll throw a teaser in there to say there are ways to use DuckDB without SQL as well in very Pythonic ways.

25:51 So you really can have it both ways if you want.

25:54 Okay.

25:54 What about ORMs?

25:55 Is there a concept of a ORM for DuckDB?

25:59 I'd have to check exactly which ones support it.

26:01 But I know at MotherDuck, we're working on a TypeScript ORM that'll support DuckDB, Sequelize.

26:10 I'd have to check which other ones are out there that support it.

26:12 But nothing in principle really stops it.

26:14 We support the fundamental things you need, like transactions, like, you know, all the other pieces you need.

26:21 Yeah.

26:21 Basically, if it runs on pure SQL, then, you know, you should...

26:25 ORMs are mostly about I'm querying against classes, and then I'm going to turn that into SQL for you.

26:31 And then the things that come back, I'm going to turn them into objects, right?

26:34 Like, I feel like both sides of that would be fine.

26:37 Yes.

26:38 And it is, you know, it's a relational database.

26:41 So it's not...

26:42 There are many...

26:43 It'll look very similar to the ORMs themselves, so it should be very compatible.

26:47 Yeah, it seems possible.

26:49 It seems totally possible.

26:50 All right.

26:50 So maybe let's talk through a little bit of the ways of working with DuckDB through, maybe connecting, running some queries, updating.

27:00 But let me first ask you, what does a schema look like?

27:05 If, for example, if I'm going to work with Postgres or SQLite, I need to run create scripts to create the tables and the indices and that kind of stuff before I can even talk to it, generally speaking.

27:16 Is that needed for DuckDB?

27:18 It doesn't look like it from the examples that I've seen, but how does that work?

27:21 I think in those other databases, there are some ways where you don't necessarily have to do that, and DuckDB follows that same approach where it can automatically create tables for you based on the query that you pipe into it.

27:33 So you create a table as the result of another SQL statement, and it will just automatically create that table with the data types that you pass through.

27:42 So one key note, if folks really know a lot about SQLite, one of the defining characteristics of it is that it is very much non-typed as much as possible.

27:50 DuckDB is very much more traditional in that it does have typing.

27:55 It does take full advantage of having nice numeric data types, date times that are true date times, not strings.

28:03 It'll do its best to auto-convert for you, but it really does make it very easy.

28:07 You can just create a table as, for example, create table as select star from this parquet file.

28:12 It can really be that easy.

28:14 It'll just auto-create the table based on the parquet file.

28:16 Okay.

28:16 Yeah, that's really neat.

28:18 And a lot of those have their columns have data types and stuff, right?

28:21 Yes.

28:21 And we have also invested quite a lot in our CSV reader as well.

28:26 So the CSV reader we consider to be up there with the best on the planet.

28:31 It auto-infers the data types by sniffing the CSV to really deduce it.

28:37 And it handles some seriously wacky CSV files.

28:41 Okay.

28:42 CSV files.

28:43 I mean, whatever you're imagining, it's worse.

28:46 It's worse than that.

28:49 But DuckDB is okay and ready to handle it, and it makes it easy.

28:54 Because it really aligns with the DuckDB ethos, which is a database shouldn't be extra work.

28:58 It should be saving you work.

28:59 Yeah.

29:00 And most of the time, step one of a database is import your data.

29:04 And a lot of databases, that's not that easy to just import your data.

29:08 Especially if it's not exactly the format of that database, right?

29:12 It's like, well, okay, I have a .SQL file.

29:15 It's a bunch of update statements.

29:16 Sure, that's easy to import.

29:18 But here's my CSV and actually you need to transform it or whatever.

29:21 Yeah.

29:22 Yes, there's a long tail of complexity there, but DuckDB is really trying to handle that

29:27 and just make that first experience using a database just straight out of the box.

29:31 Really easy.

29:32 This portion of Talk Python to Me is brought to you by the Data Citizens Dialogues podcast.

29:38 If you're ready for a deeper dive into the latest hot topics in data, you need to listen to the Data Citizens Dialogues podcast.

29:45 Brought to you by Collibra, the leader in data intelligence.

29:49 In every episode of Data Citizens Dialogues, industry leaders unpack data's impact on the world.

29:55 From big picture questions like AI governance and data sharing to more nuanced questions like,

30:01 how do we balance offense and defense in data management?

30:04 You'll hear firsthand insights about the data conversations affecting all kinds of industries.

30:10 With guests sharing unique stories from some of the world's largest companies,

30:14 such as Adobe, Fidelity, Deloitte, Hewlett-Packard, McDonald's, and even the United States Coast Guard,

30:19 you'll get an amazing look inside how these organizations handle their data.

30:22 My favorite episode is Solving Data Discovery with a self-service approach.

30:26 It's an interesting look inside creating a single source of truth at an online university.

30:32 Check them out and try an episode for yourself.

30:34 Find Data Citizens Dialogues at talkpython.fm/citizens.

30:39 That's talkpython.fm/citizens.

30:41 The link is in your podcast player's show notes.

30:43 Or just follow Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts.

30:51 Thank you to the Data Citizens Dialogues podcast for supporting the show.

30:54 So you have some interesting data ingestion.

30:59 This is a good thing, not a bad thing.

31:01 Data ingestion options.

31:04 You have a, it feels a little bit Pandas-like.

31:06 You have a read CSV, read Parquet, read JSON.

31:09 And then it's just, as you sort of hinted there, it's like part of the database now and you can ask it questions.

31:16 Yes, so you could absolutely do that just with the read CSV and it'll import there.

31:21 You can also even just read it without importing.

31:25 So for example, when we're running that read Parquet, it's lazily evaluated, much like Polars or some of the other more modern data frame libraries are,

31:34 where it's building up this relation object that says, once you look at the results, start by reading the Parquet file and then process the next few steps.

31:42 So you could do something like read Parquet, filter to these things, aggregate up on these columns.

31:47 And it will actually do all those operations together.

31:50 It'll optimize them to do them in the right order to pull the least amount of data.

31:54 And then only at that point will it, you know, materialize what it needs to.

31:57 So the examples that you're looking at on screen are, you know, DuckDB.SQL select star from a Parquet file.

32:04 You don't ever even have to, you know, write that Parquet file into DuckDB's format.

32:08 It's just going to read it and return the results to you straight as a pass through, as an engine.
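Here is a rough sketch of that lazy, relational style in Python; the file glob and column names are invented for the example:

    import duckdb

    # Nothing is read yet; this just builds up a relation (a lazy query plan).
    rel = (
        duckdb.read_parquet("trips_*.parquet")
              .filter("trip_distance > 2.0")
              .aggregate("pickup_zone, count(*) AS trips", "pickup_zone")
    )

    # Only now does DuckDB optimize the plan and scan just the columns it needs.
    df = rel.df()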

32:13 Oh, that's really cool.

32:15 So if it were a million lines of that, I could just iteratively pull through it.

32:21 Yes.

32:21 It'll just chunk through it at nice low memory usage and just, you know, pass those results out to you.

32:26 And then if you process them also in batches, you can keep your memory usage super low and take advantage of DuckDB being a streaming engine and not bring it all into memory.

32:35 Right.

32:35 Well, let's say I want to read a CSV file, but then put it into a persistent table in DuckDB.

32:41 What do I do for this?

32:42 Yes.

32:43 You could say create table my_improved_csv_table as select star from the path to my CSV file.

32:51 Will it copy over the schema?

32:53 It will infer the schema as best as it possibly can, and it will automatically define the schema for you.
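Something along these lines, with a made-up file name; DESCRIBE shows the inferred column types afterward:

    import duckdb

    con = duckdb.connect("warehouse.db")

    # Sniffs the CSV, infers the types, and creates a persistent table in one statement.
    con.sql("CREATE TABLE my_improved_csv AS SELECT * FROM read_csv('raw_export.csv')")

    con.sql("DESCRIBE my_improved_csv").show()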

32:59 Nice.

32:59 What's the nullability story in your schema?

33:02 I think...

33:03 Do you support nullability or are there things that cannot be null?

33:07 Yes.

33:07 It's more traditional in the database in that everything can have a null value in the column.

33:12 I think in Pandas, that was a little bit complicated by the fact that they were using NumPy in some cases, but extending beyond NumPy in others.

33:20 But in DuckDB, it's more of a traditional database story in that you can have a null value wherever you choose unless you define a table to say this column doesn't allow nulls.

33:29 Right.

33:29 You can do that if you want.

33:30 Yeah.

33:31 I remember that was one of the things that caught me out with SQLite.

33:33 I think it doesn't support...

33:35 I can't remember which way it went.

33:36 It's been a while.

33:37 It doesn't support non-nullable columns.

33:40 Like, everything is nullable or something like that.

33:42 And it was like, wait, why is this not failing when I put in this wrong data or something along those lines?

33:47 You know, SQLite is incredibly flexible and there's a lot of benefits to it.

33:51 At times, that can mean that things pop up later in the lifecycle than at the beginning, like you said.

33:56 So with SQLite, you can put a number in a string column and a string in a number column, you know, and it'll mostly work until it doesn't work.

34:04 DuckDB will be much more explicit about that, but it offers a lot of tools where you can auto-convert or it'll auto-convert for you.

34:11 So you kind of get the benefits without the cost.

34:13 And it actually has what's called a union type where you can actually have one column that can have strings and numbers in it if you know that you're going to have a little bit of both.

34:21 So really, best of all worlds, I'd say.

34:24 Yeah, I think that's more common in the data science, data lake space.

34:27 You've got a bunch of data, you pull it in, and then you're going to clean it up and work on it.

34:31 Whereas in maybe a relational database powering a web API, like, nope, users always have an email.

34:37 That's just how it is.

34:39 Right.

34:39 Yes.

34:40 Okay.

34:41 So I'm looking at some code on your documentation here about getting started with Python for pandas.

34:46 And I need some help.

34:48 I need some help.

34:49 It says import DuckDB, import pandas as pd.

34:52 This is all normal.

34:53 And then it says pandas df, it's a variable, equals pd.dataframes.

34:57 So it creates one.

34:57 And here's where I need the help.

34:59 The next line is duckdb.sql, select star, in the quotes, pandas df.

35:05 How do those two things connect?

35:08 That's a great question.

35:09 Because you never, like, take the actual object of the data frame and show it to DuckDB in any way.

35:14 Like, you don't pass it to a function or set a property or nothing.

35:17 Exactly.

35:18 And this is the benefit of being an in-process database.

35:20 We are in the same memory space as Python, which means we can actually read the Python locals for its local variables.

35:28 And DuckDB has been built from the very beginning with this concept of interchangeability of storage format.

35:34 That's why you can read parquet files or read CSV files, because we treat them as first-class citizens in our database.

35:40 And data frames are no different.

35:41 What we'll do first when we see a select statement come in and it's got a table in there.

35:45 In this case, it's pandas df is the name of the table.

35:47 First thing we'll do is, hey, do we have a physical DuckDB table like that?

35:50 Is there a table of that name?

35:51 And if we don't find one, we don't give up.

35:54 We keep looking.

35:54 We say, okay, is there another object we know how to read with that name?

35:58 And so we look in Python's local variables.

36:00 There's a variable named pandas df.

36:02 And we check the class name.

36:04 What class is this?

36:05 Is it pandas?

36:06 Is it polars?

36:06 Is it Apache arrow?

36:07 And if it's one of those, we can treat it just like a table and then run SQL right on top of it.
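The example being read on screen is essentially this; the DataFrame contents are just placeholders:

    import duckdb
    import pandas as pd

    pandas_df = pd.DataFrame({"city": ["Portland", "Amsterdam"], "temp_c": [12, 7]})

    # No registration step: DuckDB spots the local variable by name
    # and scans the DataFrame as if it were a table.
    duckdb.sql("SELECT city, temp_c FROM pandas_df WHERE temp_c > 10").show()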

36:13 Pretty wild.

36:14 It's pretty wild.

36:15 I think one of the real values there is that you don't have to do, presumably, I'm presuming, you don't have to do a data conversion from pandas into whatever in-memory thing DuckDB thinks is ideal and then back.

36:30 That's a great question.

36:31 So you definitely don't have to do it all at once.

36:33 So pandas, sometimes it is Python objects under the hood, whereas DuckDB is written in C++.

36:37 So it will change some things, but it'll do it in chunks like we're talking about.

36:41 So it won't be all up front.

36:43 But for things like polars and arrow, in most data types, it's zero copy where it's the same format.

36:51 DuckDB has slightly different data formats.

36:54 And Apache arrow is actually adjusted to align closer to DuckDB in some of their format, like how they store strings.

37:00 So DuckDB has really been one of the leading databases in terms of how it handles strings.

37:04 And Apache arrow is also making those changes.

37:08 Yeah, that's super cool.

37:09 You can't write back to these, can you?

37:11 You can't say, wait, you can't.

37:13 You can't.

37:13 Can you say update, like, the data frame variable, pandas_df, set whatever?

37:19 Yeah.

37:20 So if you're updating the data frame, you will overwrite the whole data frame.

37:24 But DuckDB table, you can run updates on that object for sure.

37:29 Yeah.

37:30 But you can write data frames for sure.

37:33 So you can replace the whole thing.

37:34 Right.

37:34 But I couldn't change the 172nd row of that data frame with a SQL query.

37:39 Or could I?

37:40 You would just be making a copy of the object.

37:44 Got it.

37:44 Got it.

37:45 Yeah, that's what I thought.

37:46 You would make the changes, but then you would return as data frame and make a copy of it.

37:50 Correct.

37:50 We would return a different object in memory.

37:52 But functionally, it would be the same.

37:54 And DuckDB is fast enough where it would probably be quite performant still.

37:57 Yeah, probably fine.

37:58 Yeah.

37:58 Okay.

37:59 I'm not saying that it should be able to do.

38:00 I'm just trying to figure out how far down this.

38:02 How far down?

38:03 Yeah.

38:03 Yeah.

38:05 How far down does it go?

38:07 Yeah.

38:07 So it does the same kind of magic with the locals, with PyArrow and Polars and so on.

38:13 Mm-hmm.

38:13 So there's different ways to get data back.

38:16 And it doesn't really matter where it came from, if it came from Pandas or DuckDB files itself,

38:21 or even just hard-coded SQL.

38:23 But you can say things like fetch all, which will return you Python objects, or .df will return

38:29 you a Pandas data frame or different things, right?

38:31 Yes.

38:32 Yep, we've got .pl for Polars, .arrow.

38:35 You could do NumPy arrays if you'd like.

38:37 We also support outputting to TensorFlow tensors and PyTorch tensors.

38:43 Wow.

38:43 Because those are NumPy under the hood in a lot of ways.

38:46 So we can output to that as well.
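A quick sketch of those result formats on one small query; the torch and tf calls assume those libraries are installed:

    import duckdb

    rel = duckdb.sql("SELECT range AS x, range * 2 AS y FROM range(5)")

    rows = rel.fetchall()      # list of plain Python tuples
    pdf = rel.df()             # pandas DataFrame
    tbl = rel.arrow()          # Apache Arrow table
    cols = rel.fetchnumpy()    # dict of NumPy arrays
    # rel.torch() / rel.tf() hand back PyTorch / TensorFlow tensors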

38:48 Okay.

38:48 That's super cool.

38:49 Do I need to be worried about little Bobby tables, SQL injections?

38:53 We could definitely pass in parameters that will get escaped properly and all that.

38:58 You know, with DuckDB, I think that in general, we take the approach of being

39:02 super flexible, but there's a lot of options to lock it down should you choose.

39:06 So the select star from any CSV file on my hard drive, you could turn that off if you want.

39:12 For example, likewise with the data frames, you can disable that also.

39:16 Yeah.

39:17 I guess what I'm asking more, is there a concept of a parameterized query or something like that?

39:22 Yes.

39:23 Yes.

39:23 So you can follow the same Python DB API 2.0 kind of spec.

39:27 You could do .execute and pass in a query with placeholder parameters and then pass in those as arguments.

39:33 So you can absolutely do that.

39:34 It tends to be not good for bulk ingestion.

39:38 The reason for that is Python objects just have a lot more.

39:40 Oh, well, I lost your audio.

39:43 Did you possibly hit something that hit a mute button?

39:46 I did.

39:47 Sorry.

39:47 Okay.

39:48 Sorry.

39:48 Give us the...

39:50 You said it's not good for bulk ingestion.

39:52 Let's go from there.

39:54 Yep.

39:55 Sorry.

39:55 Oh, all good.

39:56 That's on me.

39:57 So you can insert straight from Python objects with parameters.

40:02 It works really great.

40:03 It's very convenient, but it's not great for bulk ingestion.

40:06 Inserting a million rows that way versus a million rows in a Pandas data frame, it's going to be a couple orders of magnitude faster to do it through a data frame.

40:14 Yeah.

40:14 Yeah, absolutely.
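To make the placeholder-parameter point from a moment ago concrete, here is a minimal sketch; the orders table is made up:

    import duckdb

    con = duckdb.connect("warehouse.db")

    # DB-API style placeholders: values are bound safely, so no little Bobby Tables.
    con.execute(
        "SELECT * FROM orders WHERE order_total >= ? AND region = ?",
        [100, "EMEA"],
    )
    print(con.fetchall())

    # For bulk loads, hand DuckDB a whole DataFrame or Parquet/CSV file instead, e.g.:
    # con.sql("INSERT INTO orders SELECT * FROM new_orders_df")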

40:15 So by default, if I just do duckdb.sql, and I'm saying this correctly this time, it's an in-memory database.

40:25 But if you create a connection to a file, then, well, you just call it whatever you want, right?

40:31 Like in your examples, you got file.db.

40:33 That will then start persisting to that file, right?

40:37 But it's pretty simple to do either.

40:39 Yes.

40:40 You're just changing the connection path.

40:42 And if you pass in no path, it'll be in memory.

40:46 Otherwise, you can point to a file.

40:47 And it's really that straightforward.

40:50 It'll save all your tables into that same file, nice and compressed in columnar format.

40:54 And then now the format is all stabilized.

40:57 You can throw that on an object store, share it with anybody, put it in your data lake, anywhere you want it to go.
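In code, the difference is just the connection path, roughly like this:

    import duckdb

    mem = duckdb.connect()            # no path: in-memory, gone when the process ends
    db = duckdb.connect("file.db")    # a path: tables persist in this single file

    db.sql("CREATE TABLE IF NOT EXISTS notes (id INTEGER, body VARCHAR)")
    db.sql("INSERT INTO notes VALUES (1, 'hello ducks')")
    db.close()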

41:01 Can I read from objects or can I give it an S3 connection?

41:04 Yes, indeed.

41:05 You can read from any of the major clouds.

41:08 So AWS, Google, Azure, you can read individual files or actually it'll understand hierarchies of files.

41:15 So if you have like a top-level folder for a year and a folder for month and day, if you filter for, I only want three days' worth of data, DuckDB will only read three files.

41:24 So it's pretty slick.
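A hedged sketch of that partition-pruning read; it assumes the bucket uses year=/month=/day= folders and that S3 credentials are already configured (for example via DuckDB's secrets support), and the bucket name is made up:

    import duckdb

    duckdb.sql("""
        SELECT count(*) AS events
        FROM read_parquet('s3://my-bucket/events/*/*/*/*.parquet',
                          hive_partitioning = true)
        WHERE year = 2024 AND month = 12 AND day IN (1, 2, 3)
    """).show()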

41:25 Oh, wow.

41:25 That's wild.

41:25 Okay.

41:26 Yeah, that is really slick.

41:27 And you can even write out to those as well.

41:29 So you can actually write on object stores too.

41:31 That's amazing, actually.

41:33 And a lot of less hyperscale clouds, I don't know, I was almost calling them lesser clouds, but I think of almost as like even better clouds.

41:40 You know, DigitalOcean, Hetzner, Linode, some of these smaller places, a lot of them have their own object store.

41:47 And the object store API is just a different connection string for the S3 API, right?

41:54 So it sounds like you could probably get that to work as well.

41:57 Let's test that.

41:58 I believe I've seen MinIO used, which is even like self-hosted if you want.

42:01 So if it handles MinIO, you can kind of put it wherever you want it.

42:04 Yeah, yeah, that's for sure.

42:05 MinIO is awesome.

42:06 I haven't got a chance to put MinIO into any useful work, but it's like a self-hosted S3 basically, right?

42:16 Which is really cool.

42:16 Yes.

42:17 Yeah.

42:17 So I don't know.

42:18 It seems like a really neat system.

42:20 But it's also, MinIO is complicated.

42:22 You know, it's not easy to set up and run.

42:26 It seems like there's a lot going on there.

42:28 So that's why I haven't done it yet.

42:30 I'm like, this looks amazing.

42:30 Oh, this looks complicated.

42:32 What else am I going to do today?

42:33 I think S3 is pretty great.

42:36 Drop a file, read a file.

42:38 So it's not a bad starting point for quite a while.

42:41 Yeah, absolutely.

42:41 So with this connection object, like if you want to have a persistent database, right?

42:46 You create a connection object and then it has basically the same API as the DuckDB itself.

42:52 So you can run SQL queries.

42:53 You can do table stuff on it and so on.

42:54 What's the concurrency story with this thing?

42:58 Like, so for example, what if I open, what if I try to use DuckDB in Flask and I have Flask running in Gunicorn and I say, Gunicorn, please use four worker processes.

43:10 And I try to talk to it.

43:11 It's a great question.

43:12 So am I going to be having a bad time or having a good time?

43:15 It depends a little bit.

43:16 It depends.

43:16 It depends a little bit.

43:17 So, but yes, it's, there's some trade-offs.

43:19 There's reasons that there are databases that are in process and that there are databases that are client server.

43:24 And so whenever you want to write data to DuckDB, you can only write data from within a single process.

43:30 And you'll open up a read-write connection and it will actually disallow any access from any other process.

43:36 And it'll be single process access while you're doing reads and writes.

43:39 It does handle multi-threading just fine.

43:42 So you can have multiple Python threads.

43:43 You get nice parallel performance that way.

43:46 No problem.

43:46 But it's all within the same memory space because you're using threads.

43:49 So we're pretty excited about free threading coming soon in Python.

43:52 Yes.

43:52 Yeah.

43:53 Well, that's going to knock out the need for the web, the web garden, not the web farm, the web garden,

43:59 where you've got a bunch of these copies of the web server running in parallel because they're all trying to have separate GILs so they can be more concurrent.

44:07 That's not the only reason, but that's a big reason in Python.

44:10 That's definitely a big reason.

44:11 So that we're excited about.

44:12 If you want to be read-only, so since DuckDB is an analytical database and since we can read from files,

44:19 there are a lot of use cases that are read-only where if I'm reading from parquet files on an object store,

44:24 I might not be doing any write operations at all, at which point multi-process,

44:29 access is totally fine as long as you're in read-only mode for all of those processes.
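Here is a rough sketch of that read-only, multi-process pattern in a web worker; the analytics.db file and orders table are made up, and any writes would still need to happen from a single process:

    import duckdb

    # Safe to open in every Gunicorn worker, as long as all of them are read-only.
    con = duckdb.connect("analytics.db", read_only=True)

    def top_products(limit: int = 5):
        return con.execute(
            "SELECT product_id, count(*) AS orders FROM orders "
            "GROUP BY product_id ORDER BY orders DESC LIMIT ?",
            [limit],
        ).df()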

44:33 I will add one other big thing to look into, though, which is where I work, which is MotherDuck.

44:39 So we're a cloud serverless data warehouse built around DuckDB and we'll handle the concurrency side.

44:45 So it's really taking this amazing single-player DuckDB experience and bringing it into the cloud as a full multiplayer cloud data warehouse experience.

44:53 So you get concurrency, you get automatic managed storage, it's very efficient, and you also get access control and some of the things you associate with a large-scale database.

45:03 That's the story. If you need concurrency but you're using DuckDB, I think MotherDuck is a great option.

45:09 Right. So this is kind of DuckDB as a service. Is that right?

45:13 That's definitely a key piece of it. I think we have a... DuckDB is a huge part of what we do and it is the engine under the hood.

45:21 I think we are using that to really be meeting the use case of being not just sort of a cloud hosting for DuckDB, it's to really be a full data warehouse with DuckDB as the engine.

45:33 So a little bit of a distinction there, but just as a whole organization could kind of use it as opposed to just one person, you know, cloud hosting it, something like that.

45:41 Yeah. Well, it seems like a really, really great service. And I guess it's worth calling out that DuckDB, the thing that runs in your process, is available as an MIT-licensed thing, right? Which is, that's pretty amazing.

45:54 And you, I saw somewhere, I can't remember where I saw it, but it's something like, you'll promise to be MIT, to stay MIT. No rug pulling, the rugs will stay.

46:02 Yes. So it's a pretty cool kind of corporate structure. And if you can say a cool corporate structure, we'll see if we can get away with that. But as far as corporate structures go, it's pretty cool, right?

46:12 So there's actually three organizations at play. And one of them is a nonprofit foundation called the DuckDB Foundation. And they actually own the intellectual property for DuckDB.

46:22 And that is what ensures that it'll be MIT licensed forever, because it is not a company that owns DuckDB, which is amazing.

46:29 So there is like, it's not even possible to do a rug pull, even if somebody wanted, which nobody does, right? It's not even possible. It's against the law, which is pretty cool.

46:37 So yeah.

46:38 I guess maybe give people, I threw that out there. There might be people who don't know what that means.

46:43 Oh, what a rug pull is?

46:44 Yes, exactly. Yeah. What is that?

46:46 You know, certain products, once they're open source, they begin to, they get funded in certain ways to where they need to be, you know, selling it as a product.

46:56 And at times that means that either they don't always put everything back in open source, or there's different licensing models that get changed,

47:04 so that the license is a little bit more restrictive than it was in the past.

47:07 Kind of the typical maneuver there for a couple different companies has been around Amazon will just host your product.

47:14 And then it's very hard to have your own business if Amazon just hosts it.

47:18 So we're not in that situation with DuckDB because it's open source forever.

47:24 But also Mother Duck is not just wrapping DuckDB.

47:27 We're innovating at that service layer, where with MotherDuck, one of its secret sauce ingredients is that it's not just a cloud service.

47:36 When you connect to Mother Duck, you have DuckDB running locally and in the cloud.

47:40 And one query, one Python statement can actually run partially in the cloud and partially on your laptop.

47:45 Oh, wow. Okay.

47:46 And we can actually choose the optimal location for that to run based on your query plan.

47:51 We actually do that as a part of the optimization to say, where's the most efficient place for this to run?

47:55 And that's some serious secret sauce; it's not just, you know, wrapping an open source solution and hosting it.

48:02 That gives some really nice benefits for things like the local development experience.

48:05 You can develop locally at light speed and then push to the cloud at the final step.

48:10 Or alternatively, kind of reverse the polarity.

48:13 Maybe your first step is to bring everything down locally in a cache and then do all kinds of data science processing on it.

48:19 So it's very convenient to have both local and cloud.

48:23 I see.

48:23 Kind of like we talked about writing a query, turning it into a data science object, and then asking questions.

48:29 This is a little bit like distributed versus local, but staying within DuckDB.

48:33 Definitely. You get to use server-side compute and local compute however you'd like.

48:38 Yeah.

48:39 Super cool.

48:40 I love the idea of it.

48:42 It's a lot of fun.

48:42 We're doing a lot of fun things, and you gotta have fun with it as well.

48:47 You know, what do you call someone who works at mother duck?

48:49 A duckling.

48:50 That's the second most popular one.

48:53 We're actually mother duckers, Michael.

48:55 Oh, okay.

48:56 Mother duckers.

48:57 All right.

48:57 You gotta have fun with it.

48:59 Right.

48:59 Otherwise it's all just ones and zeros at the end of the day.

49:01 Right.

49:01 You have to have fun with it.

49:02 Yeah.

49:03 You have to have fun with it.

49:04 You should.

49:04 I don't know.

49:05 Ducklings.

49:05 But mother duckers is good.

49:07 Still.

49:07 I guess ducklings might send the wrong message.

49:09 They're all good.

49:10 Do you want, like, sort of a tough persona or facade, or do you want,

49:17 you know, a coddled one, whatever?

49:19 No, it's a great name.

49:21 It's, you know, like if you look at the branding on our website, you know, we're not going for,

49:25 like, a super muscly duck, you know.

49:29 We're definitely in the playful side.

49:30 And part of it is because we really believe, here at MotherDuck, that the big

49:36 data era is over.

49:38 And the reason for that is you no longer need 50 servers to answer your questions.

49:43 It turns out laptops are pretty daggone fast now.

49:45 Yeah.

49:46 And you can do a lot.

49:47 Even, even if you're not able to handle it on a laptop, how about one giant node on AWS?

49:52 You can rent a node with multiple terabytes of RAM.

49:57 I mean, you could do a lot, right?

49:58 So at that point, what is a single node?

50:01 You can kind of do almost anything that you need.

50:04 So as a result.

50:05 And single server versus distributed is way easier.

50:09 Yes.

50:09 It allows you to innovate a lot faster.

50:10 You can use better algorithms.

50:12 So in many cases, it's actually just plain old faster to do it that way.

50:15 A lot of benefits.

50:17 But what we realized fundamentally at MotherDuck, and in a lot of ways at DuckDB as well, is this:

50:21 The scale of data shouldn't be the most important thing anymore.

50:24 It should be how easy it is to use.

50:26 You should be able to do select star from a CSV.

50:29 You know, it doesn't take scale to parse a CSV file in an intelligent way.

50:33 It takes elbow grease.

50:34 It takes innovation in terms of the algorithm.

50:37 But at the end of the day, it's not about how big is your CSV file.

50:40 It's about how long did it take me to get my job done.
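That "select star from a CSV" really is a one-liner; the file name below is hypothetical, and DuckDB sniffs the delimiter, header, and column types for you:

```python
import duckdb

# Query a CSV file directly by name; no schema declaration, no import step.
duckdb.sql("SELECT * FROM 'events.csv'").limit(5).show()
```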

50:42 And in Mother Duck, it's a similar kind of pragmatism.

50:45 And our branding, I think, does a really good job of showing that.

50:47 I think so.

50:48 Well, I asked you about half of what I said I was looking forward to asking about, which

50:53 was the table schemas, how to do that.

50:55 However, what about indexes?

50:57 Sure.

50:58 I think indexes are a great question.

51:00 So DuckDB does support indexes.

51:02 But in many cases, it's not necessarily something that you need.

51:06 And the reason for that is because of the workloads that it's best for, but also because it's a columnar

51:11 database.

51:11 So when you have a columnar database, by default, it's going to create some rough indexes for you

51:18 automatically.

51:19 So about every 100,000 rows in DuckDB, we create an index there that says, what's the minimum value

51:25 in this chunk and then the maximum value in this chunk.

51:28 And then what we'll do is, if you run a query that says select data from this week, we're going to check,

51:32 hey, which blocks have any data from this week?

51:36 And we can skip 90% of your data because we know that it doesn't match that filter.

51:41 But that's not like an individual row index.

51:44 It's an index for every 100,000 rows.

51:47 And for the analytical queries where you're looking at trends, where you're analyzing

51:51 large amounts of data, it tends to be really the optimal way to go about it.

51:54 And that's the way most systems do it.

51:56 There is an external index you can add.

51:58 It's an ART index, an adaptive radix trie index, which is a pretty cool way of doing it.

52:04 It's a little different than the B-tree in SQLite, but it allows for those fast point lookups

52:09 where if you want to grab one row, you can get a very quick mapping to that individual row.

52:14 It tends to be just less necessary in DuckDB to do that, but we have it if you need it.
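A minimal sketch of both sides, with made-up table and column names: the per-chunk min/max metadata is automatic, while the ART index is something you opt into with CREATE INDEX.

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE events AS
    SELECT range AS id,
           DATE '2024-01-01' + CAST(range % 365 AS INTEGER) AS day
    FROM range(1000000)
""")

# Analytical filters benefit from the automatic per-chunk min/max metadata:
# chunks whose date range cannot match are skipped entirely.
con.execute("SELECT count(*) FROM events WHERE day >= DATE '2024-12-01'")

# For repeated single-row lookups, you can add an explicit (ART) index.
con.execute("CREATE INDEX id_idx ON events (id)")
print(con.execute("SELECT * FROM events WHERE id = 123456").fetchall())
```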

52:19 I'm realizing that I need to learn more about databases, especially columnar ones, because

52:25 they're not matching my mental model as much as, say, relational or document ones.

52:29 It's definitely a different world.

52:32 Different world.

52:32 And that's, I think, the fun of it is I think of databases as an infinitely deep rabbit hole

52:38 of really just fun optimizations, fun things to learn about.

52:43 So a lot of exciting stuff.

52:44 Yeah.

52:44 When you get that stuff right, it's like magic.

52:46 It's like, how could it possibly answer that so quickly?

52:49 At the same time, when it's wrong, it's really frustrating.

52:52 You know, you go to a website and you, you know, click this thing here.

52:57 And then you try it, and it just sits there and spins.

52:59 Like, I know this is happening without an index.

53:01 And I know this could be so easily fixed.

53:03 And it's just not.

53:04 Why, why is it so bad?

53:06 But, you know, you can make your stuff fast.

53:08 You can't make other people's stuff fast.

53:09 So I guess it's an advantage.

53:10 Makes you look good.

53:11 Yes.

53:12 So I think that the workloads are just so different, but the mental model, it does open doors in

53:18 a lot of ways.

53:18 So for example, if you want to get the average of like a billion rows in DuckDB, it's a couple

53:23 seconds.

53:24 Wow.

53:24 It's that fast.
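If you want to see that on your own machine, here's a tiny benchmark-style sketch; actual timings will vary with hardware.

```python
import duckdb

# Generate a billion integers on the fly and average them; on a recent
# laptop this typically takes only a few seconds.
print(duckdb.sql("SELECT avg(range) FROM range(1000000000)").fetchall())
```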

53:25 You know, so just the things that you can do with that kind of tool, an analytics-oriented

53:31 tool, really unlocks things you can do in an application sense.

53:34 So data apps, we think are a really interesting market for both DuckDB and MotherDuck, where,

53:40 you know, it's not just show me a little bit of data.

53:43 Maybe it's an event tracker, like a fitness tracker.

53:46 You know, I want to see the history of my bike rides for the last month.

53:50 And I want to see the trends, my moving averages, you know, highlight the outliers for me.

53:54 You know, there's a lot of heavy analytical work to be done in data apps that an analytical

54:00 database is really good for.

54:00 Yeah.

54:01 Give us some, I don't know how much you can share.

54:04 I know you interact with these customers quite a bit.

54:06 So, you know, but I don't know how much you can share.

54:08 Give us a sense of what people are doing.

54:10 Sure.

54:11 On the MotherDuck side or the DuckDB side?

54:13 Well, you know, give us a little example from each, maybe that you think is representative.

54:17 You bet.

54:18 I think on the DuckDB side, I think some of the very interesting things are when you use DuckDB

54:24 as a data processing engine, where each individual processing task is, you know, one node, but you're

54:30 doing a ton of them in parallel.

54:31 So some companies will actually take DuckDB and use it to transform one Parquet file into

54:36 an aggregated Parquet file.

54:38 And they'll just set it up on a trigger.

54:39 So when a Parquet file gets added, they'll automatically just create a little pipeline with DuckDB and

54:44 convert it into a much, you know, cleaner, more aggregated form.

54:47 And you could do that at huge scale.

54:49 So there's some companies that are doing huge amounts of data processing, you know, with

54:53 DuckDB.

54:54 Tons of tiny DuckDBs, a flock of tiny DuckDBs running behind the scenes.
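A hedged sketch of one of those tiny DuckDB jobs; the paths and column names are hypothetical, and the trigger (an S3 event, a cron job, etc.) would live outside this function.

```python
import duckdb

def aggregate_parquet(in_path: str, out_path: str) -> None:
    # Read one raw Parquet file, aggregate it, and write a smaller,
    # cleaner Parquet file back out.
    duckdb.sql(f"""
        COPY (
            SELECT user_id,
                   date_trunc('day', ts) AS day,
                   count(*) AS events
            FROM read_parquet('{in_path}')
            GROUP BY user_id, day
        ) TO '{out_path}' (FORMAT PARQUET)
    """)

aggregate_parquet("raw/events-2024-12-10.parquet", "agg/events-2024-12-10.parquet")
```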

54:59 So that's an interesting use case that, you know, DuckDB originated as a data science tool,

55:04 but I think the data engineering side of DuckDB is also very exciting, where it fits really well

55:09 in those pipelines.

55:10 On the MotherDuck side, I think what we see is a lot of cases where the transactional database,

55:15 you know, in a lot of cases it's Postgres,

55:18 which has a really friendly SQL syntax that aligns very closely with DuckDB's SQL syntax,

55:23 is just too slow for those kinds of trend-oriented aggregates or join queries

55:28 that you want to run.

55:30 And MotherDuck ends up being a really great way to very easily get a lot more speed in those cases.

55:35 And that could be speed for data engineering tasks.

55:37 Like some folks are running data processing jobs with this tool called DBT, which stands for Data Build

55:43 Tool, and it's a Python library.

55:46 And they were running these, you know, sets of computation and aggregations, and they would

55:51 take, you know, eight hours.

55:52 And if it starts running at, you know, 2 a.m., by the time it errors out, you've lost your whole

55:57 day of your data analysis being up to date.

56:00 Yeah, you better get it right, right?

56:02 Better get it right.

56:03 Unfortunately, DBT does also stand for, you know, behavioral therapy of a certain kind.

56:07 So you've got to search for DBT data.

56:09 It's a tough acronym.

56:11 Yeah, it's a little short.

56:13 Yeah, DBT Python, that'll get you there too, yeah.

56:16 With MotherDuck and with DuckDB, you can take that, and instead of being eight hours, you can

56:21 run it in, you know, 15 or 30 minutes.

56:23 And just that scale of change means that, you know, it's just a far more delightful experience

56:29 to do those kinds of data pipelines on an analytical tool as opposed to a transactional one.

56:33 So that's one.

56:34 The other one is business intelligence tools.

56:36 In a lot of cases, those start to get really slow on a transactional database because if

56:40 you're looking at a graph, most of the time it's a trend.

56:43 And most of the time, it's looking at a lot of data to plot.

56:46 Even if it's just plotting a few points, it's looking at a lot of data to plot it.

56:49 And DuckDB is really excellent for that, and MotherDuck as well.

56:52 There's actually a lot of business intelligence tools where they are powered by DuckDB.

56:56 So Mode was acquired by ThoughtSpot.

56:59 They're powered by DuckDB.

57:00 Hex is a data science notebook.

57:02 They have DuckDB as a part of their architecture.

57:05 There's a couple others.

57:06 We talked about some of the data types.

57:08 What about JSON or document type of things, right, where you've got a little hierarchical

57:14 data?

57:14 You know, there's on one end of that spectrum, we have MongoDB.

57:18 That's all it does.

57:18 And on the other, you've got relational databases like Postgres that say, well, this column can

57:23 just be JSON arbitrary stuff.

57:26 Yes.

57:27 And we're in that space where we absolutely believe that's how most things are, a little

57:31 bit of both.

57:32 So there is a full JSON data type in DuckDB.

57:35 And so you can store any arbitrary document in there and then process it with some really

57:40 fast JSON processing.

57:41 We can also take JSON data and automatically split it out into columns if you want to unnest

57:47 it.

57:48 So you can kind of go back and forth there.

57:49 Oh, nice.

57:49 There are also specific DuckDB nested types if you want to enforce the typing at the column

57:56 level, but still want to be nested.

57:57 So you could say, I know these are going to be my keys.

58:00 I know this will be my structure.

58:01 You can store it all in one column and it'll be very, very quick to work with.

58:06 So you kind of have both.

58:07 You have the full flexibility with JSON and then you also can have nested types.
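As a small sketch of both options, with made-up table and field names: a fully flexible JSON column on one hand, and a typed nested STRUCT column on the other.

```python
import duckdb

con = duckdb.connect()

# Fully flexible: any document shape can go into a JSON column.
con.execute("CREATE TABLE raw_events (payload JSON)")
con.execute("""INSERT INTO raw_events VALUES ('{"user": "alex", "amount": 42.5}')""")

# Or enforce the shape at the column level with a nested STRUCT type.
con.execute("""
    CREATE TABLE typed_events (
        customer STRUCT(name VARCHAR, city VARCHAR),
        amount DOUBLE
    )
""")
con.execute(
    "INSERT INTO typed_events VALUES ({'name': 'Alex', 'city': 'Portland'}, 42.5)"
)
print(con.execute("SELECT customer.name, amount FROM typed_events").fetchall())
```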

58:11 Can you query into them?

58:12 So if I have a JSON thing that's got like purchases and then it's got a bunch of purchase objects

58:17 and then like values, can I say, give me the ones that have over $100 purchases?

58:21 Much like Postgres and SQLite, which have the ability to extract pieces of JSON with

58:27 a path syntax, like go to this path and pull that value out,

58:30 you absolutely can do that in DuckDB as well and say, okay, navigate to the customer object,

58:37 and then inside the customer object, go to this one. Absolutely.
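Roughly what that purchases example could look like, with a made-up document shape:

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE orders (doc JSON)")
con.execute("""INSERT INTO orders VALUES
    ('{"customer": {"name": "Alex"}, "purchase": {"amount": 150.0}}'),
    ('{"customer": {"name": "Sam"},  "purchase": {"amount": 40.0}}')
""")

# Navigate into the document with a JSON path and filter on the value.
print(con.execute("""
    SELECT doc->>'$.customer.name' AS name
    FROM orders
    WHERE CAST(doc->>'$.purchase.amount' AS DOUBLE) > 100
""").fetchall())
```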

58:41 Yeah.

58:41 Super cool.

58:42 All right.

58:42 Well, I think there's probably more to dive into, honestly, but we've covered it.

58:47 I think we've covered it pretty well.

58:48 I think DuckDB is super exciting for the possibilities that it opens up.

58:52 You know, there might be a ton of people out there who haven't heard of it or really just

58:56 heard of it in passing and didn't really know.

58:58 This in process stuff that you can do, it really makes it quite accessible, quite easy,

59:03 lowers the bar to getting started, right?

59:05 You don't have to understand connections, security, servers, daemons, all of that.

59:10 So I'm excited to see where this goes.

59:12 It's already got a ton of traction.

59:13 So final call to action, maybe tell people who are in that group, what do they do?

59:18 You bet.

59:18 I think, you know, obviously, pip install duckdb, give it a try.

59:21 You can use it anywhere.

59:23 It's MIT licensed, so it really can fit anywhere that you're running Python.
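The two-minute version of that looks something like this; it assumes you also have pandas installed for the DataFrame part.

```python
# pip install duckdb
import duckdb
import pandas as pd

# In-process: no server, no daemon, no connection string required.
duckdb.sql("SELECT 42 AS answer").show()

# DuckDB can also query in-memory pandas DataFrames by variable name.
rides = pd.DataFrame({"distance_km": [12.5, 30.2, 8.9]})
duckdb.sql("SELECT avg(distance_km) FROM rides").show()
```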

59:27 A whole lot of other fun things you could do on top of it.

59:30 There's a new extension ecosystem.

59:31 So if it doesn't exist in DuckDB, but you'd like it to, you can actually build an extension

59:36 for DuckDB in a variety of languages.

59:38 And so we can all together make DuckDB into whatever we'd like.

59:42 So it's pretty neat.

59:43 Yeah.

59:44 Well, good work.

59:45 And I love to see the open source side and then maybe a strong company that's built around

59:53 it in a way that doesn't really undermine the open source value.

59:56 So MotherDuck, DuckDB, it looks like a really good relationship there.

59:59 So that's nice.

01:00:00 Yeah.

01:00:01 We even have two companies.

01:00:02 We've got DuckDB Labs and MotherDuck.

01:00:04 Yeah, that's right.

01:00:05 Got the foundational side as well.

01:00:07 Yeah.

01:00:07 Awesome.

01:00:08 Well, Alex, thanks for being on the show.

01:00:09 And yeah, thanks for sharing all this stuff with us.

01:00:12 It's great.

01:00:12 Thanks for having me.

01:00:13 Cheers, folks.

01:00:14 Happy analyzing.

01:00:14 Yep.

01:00:15 Bye.

01:00:15 Bye.

01:00:16 This has been another episode of Talk Python to Me.

01:00:19 Thank you to our sponsors.

01:00:21 Be sure to check out what they're offering.

01:00:22 It really helps support the show.

01:00:24 Take some stress out of your life.

01:00:26 Get notified immediately about errors and performance issues in your web or mobile applications with

01:00:31 Sentry.

01:00:32 Just visit talkpython.fm/sentry and get started for free.

01:00:37 And be sure to use the promo code talkpython, all one word.

01:00:40 And it's brought to you by the Data Citizens Dialogues podcast from Colibra.

01:00:44 If you're ready for a deeper dive into the latest hot topics in data, listen to an episode

01:00:49 at talkpython.fm/citizens.

01:00:51 Want to level up your Python?

01:00:53 We have one of the largest catalogs of Python video courses over at Talk Python.

01:00:57 Our content ranges from true beginners to deeply advanced topics like memory and async.

01:01:02 And best of all, there's not a subscription in sight.

01:01:05 Check it out for yourself at training.talkpython.fm.

01:01:08 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.

01:01:13 We should be right at the top.

01:01:14 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the

01:01:20 direct RSS feed at /rss on talkpython.fm.

01:01:24 We're live streaming most of our recordings these days.

01:01:26 If you want to be part of the show and have your comments featured on the air, be sure to

01:01:30 subscribe to our YouTube channel at talkpython.fm/youtube.

01:01:35 This is your host, Michael Kennedy.

01:01:36 Thanks so much for listening.

01:01:38 I really appreciate it.

01:01:39 Now get out there and write some Python code.

