#480: Ahoy, Narwhals are bridging the data science APIs Transcript
00:00 If you work in data science, you definitely know about data frame libraries.
00:03 Pandas is certainly the most popular, but there are others such as cuDF, Modin, Polars,
00:09 Dask, and more.
00:10 They all have similar, but definitely not identical, APIs, and Polars is quite different.
00:15 But here's the problem.
00:16 If you want to write a library that is for users of more than one of these data frame
00:20 frameworks, how do you do that?
00:22 Or if you want to leave open the possibility of changing yours after the app is built,
00:27 you've got the same problem.
00:28 Well, that's what narwhals solves.
00:30 We have Marco Gorelli on the show to tell us all about Narwhals.
00:34 This is Talk Python to Me, episode 480, recorded September 10th, 2024.
00:40 Are you ready for your host, please?
00:43 You're listening to Michael Kennedy on Talk Python to Me.
00:46 Live from Portland, Oregon, and this segment was made with Python.
00:50 Welcome to Talk Python to Me, a weekly podcast on Python.
00:56 This is your host, Michael Kennedy.
00:58 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,
01:04 both accounts over at fosstodon.org.
01:07 And keep up with the show and listen to over nine years of episodes at talkpython.fm.
01:12 If you want to be part of our live episodes, you can find the live streams over on YouTube.
01:16 Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified about upcoming shows.
01:23 This episode is brought to you by WorkOS.
01:25 If you're building a B2B SaaS app, at some point your customers will start asking for enterprise features like SAML authentication,
01:32 SCIM provisioning, audit logs, and fine-grained authorization.
01:36 WorkOS helps ship enterprise features on day one without slowing down your core product development.
01:42 Find out more at talkpython.fm/workos.
01:46 Marco, welcome to Talk Python to Me.
01:49 Hey, thanks for having me.
01:50 Hey, it's fantastic to have you here.
01:51 We talked a little bit on the socials and other places, but, you know, nice to talk to you in person and about some of your projects.
01:59 Yeah, nice to finally do it.
02:00 Been listening to your shows for years, so it's a pleasure to be here.
02:04 Yeah, that's really cool.
02:05 It's awesome when people who are listeners for a long time get to come on the show.
02:09 I love it.
02:10 So we're going to talk about Narwhals and data science data frame libraries, and basically coming up with a way to write
02:20 consistent code against all these different libraries, which I think is an awesome goal, which is why I'm having you on the show, of course.
02:27 Before we get to all that, as you know, let's hear a little bit about yourself.
02:30 Sure.
02:30 So, yeah, my name is Marco.
02:32 I work at a company called Quansight Labs, which supports several open source projects and also offers some training and consulting services.
02:41 I live in Cardiff in Wales and have been at Quansight for about two years now.
02:47 Originally hired as a Pandas maintainer, but then shifted considerably towards some other projects such as Polars.
02:55 And then in February of this year, I tried releasing this narwhals library as a bit of an experiment.
03:01 It's growing a bit faster than expected.
03:04 Yeah, it's a really interesting project.
03:07 Quansight is quite the place.
03:09 You know, I didn't really know about y'all before having people on the show from there.
03:14 But here's my experience.
03:16 I reached out to, say, some of the Jupyter folks, whatever.
03:19 Let's have some of the Jupyter people on.
03:21 There's three or four people from Quansight who show up.
03:23 And then, oh, let's talk about this other project.
03:26 Another person from Quansight two weeks later.
03:28 And then you're from Quansight.
03:30 And none of those connections were like, let me try to find people from Quansight.
03:34 I think you all are having a pretty big impact in the data science space.
03:37 That's cool.
03:37 Yeah, it is a bit amusing in the internal Slack channel.
03:41 If you ask a question, does anyone know how to do this?
03:44 Someone will reply, oh, yeah, let me ping this person who's a maintainer of that library.
03:48 And you're like, okay, well.
03:49 Exactly.
03:49 I think we know how it works.
03:51 Let's ask them.
03:52 Yeah, exactly.
03:53 Yeah, it's a big world, but also a small world in interesting ways.
03:57 Yeah.
03:58 How did you get into programming in the first place?
04:00 I think the first experience with programming I had was at university.
04:04 So I studied maths.
04:05 I think like you as well.
04:07 Yeah, yeah, yeah.
04:08 That sounds really.
04:08 Yeah, keep going.
04:09 So far, you're telling my story.
04:10 Yeah, sure.
04:12 Although with my initial encounter with it, I didn't particularly enjoy it.
04:16 It was just having to solve some problems in MATLAB.
04:19 I did find it kind of satisfying that if you gave it instructions, it did exactly that.
04:24 But I wouldn't say that I felt like naturally talented or anything.
04:29 I then really took to programming, though, after I started a maths PhD and dropped out because it wasn't really going anywhere.
04:36 And once I went deeper into programming, then I realized, okay, actually, I do have some affinity for this topic.
04:42 I do quite enjoy it.
04:43 Yeah.
04:44 I think that's often the case.
04:45 What was your PhD focus before you dropped out?
04:47 It was meant to be, well, some applied mathematics, like stochastic partial differential equations.
04:52 But, you know, in academia, you publish or perish.
04:55 And I wasn't publishing and didn't really see that changing.
04:59 So I had to make a bit of a pivot.
05:02 I imagine you made a pretty good choice, just guessing.
05:05 I mean, I love math, but the options are just so much broader outside of academia.
05:10 In hindsight, yeah, I kind of wish that somebody at the time had told me that I could have still had a really interesting and rewarding career outside of academia.
05:17 And I shouldn't have stressed myself out so much about trying to find a PhD or about having to complete it when I had already started it.
05:26 The secret, I think, about being a good programmer is it's kind of like grad school anyway.
05:31 You're constantly studying and learning.
05:34 You feel like you've figured something out.
05:35 It's like, well, that's changed.
05:36 Now on to the next thing.
05:37 You're kind of like, well, we figured out pandas.
05:39 Now we got polars.
05:40 Okay, well, we're going to start over and figure out how to use that one.
05:42 All right.
05:42 So that is true.
05:44 You need to do a lot of learning, a lot of self-directed learning in particular.
05:48 It's really stimulating, I must say.
05:50 It is.
05:51 It's great if you want that.
05:52 If you just want to work nine to five, you don't need to stress about it.
05:55 Look, I think there's actually options there.
05:58 We'll get to narwhals in a second.
05:59 But I think there are options there.
06:00 I think if you want to do COBOL, FORTRAN, some of these older programming languages where so much of the world depends on them, but nobody wants to do them.
06:10 You could totally own that space and make really good money if you didn't want to learn anything.
06:15 But where's the fun in that, right?
06:17 Yeah.
06:18 It would be nice if we could all get rid of our legacy systems.
06:21 But, you know, this stuff does power the world.
06:24 They're there for a reason, right?
06:26 That works.
06:27 I would really like it if you don't touch it, please.
06:29 But that's not the way it is with narwhals.
06:32 Let's start with an overview of what narwhals is and why you created it.
06:38 And then I want to talk a bit about some of the data science libraries before we get too much deeper.
06:42 What is narwhals?
06:43 A narwhal is a cool whale as far as I know.
06:46 It's like the unicorn of the sea, basically.
06:48 What is this library?
06:51 So it's intended as a compatibility layer between different data frame libraries.
06:56 So narwhals does not do any computation itself.
07:00 It's more of just a wrapper around different data frame APIs.
07:04 And I like the Polars API, so I figured that I should keep it fairly close to the Polars API,
07:09 and in particular, to Polars expressions.
07:12 As to why Narwhals: I was just getting frustrated with the fact that,
07:19 let's say about a year ago, there were relatively few libraries that supported Polars.
07:23 And if libraries did support Polars, it was often just done by converting to pandas or converting to PyArrow.
07:30 Yet a lot of these libraries, they weren't doing anything that complicated with data frames.
07:34 A lot of data frame consuming libraries, they don't really want to do that much.
07:38 They want to select columns.
07:40 They want to select rows.
07:42 Maybe they want to do some aggregations.
07:43 Like they're not doing stuff that's completely wild.
07:46 And so trying to design some minimal compatibility layer, I think, is a lot easier than trying to make a full-blown data frame API that end users are meant to use.
07:57 So the idea with narwhals is this is a tool for tool builders.
08:01 If library maintainers want to support different data frame libraries as inputs with minimal overhead and with minimal maintenance required on their side, this is the problem we're trying to solve.
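To make that concrete, here's a toy sketch in Python of what a compatibility layer for tool builders does. This is illustrative only and is not the actual Narwhals API; the class and method names are invented stand-ins for pandas-style and Polars-style frames.

```python
# Toy sketch of the compatibility-layer idea: a thin wrapper that does no
# computation itself and dispatches to whichever "backend" the user's frame
# came from. All names here are invented for illustration, not Narwhals' API.

class PandasLikeFrame:
    """Stand-in for a pandas-style frame: a dict of column name -> values."""
    def __init__(self, data):
        self.data = data
    def getitem_cols(self, cols):
        return PandasLikeFrame({c: self.data[c] for c in cols})

class PolarsLikeFrame:
    """Stand-in for a Polars-style frame with a select method."""
    def __init__(self, data):
        self.data = data
    def select(self, cols):
        return PolarsLikeFrame({c: self.data[c] for c in cols})

class Wrapper:
    """The uniform API a library author codes against."""
    def __init__(self, native):
        self.native = native
    def select(self, *cols):
        # Dispatch to the backend's own way of selecting columns.
        if isinstance(self.native, PolarsLikeFrame):
            return Wrapper(self.native.select(list(cols)))
        return Wrapper(self.native.getitem_cols(list(cols)))
    def to_native(self):
        return self.native

def summarize(any_frame):
    """A 'tool builder' function: works on either backend, returns the same type it got."""
    return Wrapper(any_frame).select("a").to_native()

pdf = PandasLikeFrame({"a": [1, 2], "b": [3, 4]})
plf = PolarsLikeFrame({"a": [1, 2], "b": [3, 4]})
print(type(summarize(pdf)).__name__)  # PandasLikeFrame
print(type(summarize(plf)).__name__)  # PolarsLikeFrame
```

The key property, which the real library also has, is that users get back the same kind of frame they passed in, and the layer itself never computes anything.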
08:12 It's a great problem to solve because maybe you want to have a library that works with an abstract concept of a data frame.
08:20 But usually I would imagine you have to start out and say, are we going to go and support Polars?
08:25 Are we going to support pandas?
08:27 And the APIs are different.
08:28 Not just the APIs, but the behaviors.
08:31 For example, the lazy execution of Polars versus the eager execution of pandas.
08:37 And so being able to just write your library so it takes either one is probably a big hassle, right?
08:42 Because you just kind of have to have almost two versions of each step, right?
08:46 Yeah, exactly.
08:47 Well, I actually heard from a maintainer recently who was saying that he was interested in using Narwhals even while just having pandas as a dependency.
08:56 Because pandas, the API changes a bit between versions.
09:00 And he was getting a bit tired of pandas' API changes.
09:05 And was like, okay, well, if we can just defer all of the version checks and API differences to an abstraction layer, that might even simplify our life, even if we're just interested in supporting pandas.
09:16 Yeah, that's cool.
09:17 Yeah, actually, that's an interesting idea.
09:20 It's just like, we'll have a compatibility layer, just in case.
09:22 And pandas went from one to two on major version recently, which is a big deal, and switched to PyArrow and all that, right?
09:30 Yeah, so version two, it was sometime last year, I think, 2023.
09:36 So yeah, the PyArrow, I think there were some misconceptions around that.
09:40 So as we're live on air, let's take the chance to address some PyArrow misconceptions.
09:46 PyArrow in pandas currently is optional, and it'll probably stay optional for quite a while.
09:52 So there is some talk about in version three using PyArrow strings instead of the classical NumPy object strings.
09:59 By default, if people have PyArrow installed, it's not totally decided.
10:04 It's not totally set in stone whether PyArrow will be a required dependency.
10:08 And maybe pandas version four will have it as a required dependency, and it'll be the default everywhere.
10:14 But that's a few years away.
10:16 Yeah, maybe.
10:17 Maybe we'll get Python four as well.
10:19 You never know.
10:20 You know, it's interesting.
10:22 I think the data science space, more than many other areas, has this ability to run Python in more places, right?
10:29 For example, there's Pyodide, there's JupyterLite.
10:33 There's a lot of more constrained environments that it might go in.
10:37 And I don't know what the story is with PyArrow and WASM and all these different things.
10:41 You tell me you still get benefits there.
10:43 But there's a lot to consider.
10:44 Yeah, totally.
10:45 And I think that's one reason why some library maintainers are really drawn to a lightweight compatibility layer like narwhals.
10:52 With narwhals, you say you don't need any dependencies.
10:55 You need narwhals, but that's just a bunch of Python files.
10:58 Like if you wanted to, you could even just vendor narwhals.
11:01 Like it's not that big of a deal, but there's no extra dependencies required.
11:05 Like pandas users don't need Polars installed, and Polars users don't need pandas installed.
11:10 So if you're trying to deploy to a constrained environment where package size is limited, a library can have Narwhals as a required dependency, as opposed to any big data frame library, and then the user can just bring their own data frame.
11:26 Then, like this, we're really minimizing the number of installation and dependency hell issues that people might run into.
11:32 I think you've covered dependency hell on the show a few times before.
11:35 Yeah, indeed. I think one thing that's interesting for people out there listening, we'll talk about the different libraries that it works with right now.
11:44 But if you have a library out there and you're listening, there's not too much work to integrate it or make it Narwhals compatible, the narwhalification of a library, to let it do this interchange, right?
11:56 So I think I can interpret your question in a couple of ways.
12:00 So I'll just play them back and let's just see.
12:02 Let's do it.
12:03 One is if you're a library that consumes data frames.
12:06 So yeah, there's some examples there on the readme of who's adopted narwhals.
12:10 So like Altair is the most recent, probably the most famous one.
12:14 I think that's where I heard of it. Some news about Altair and Narwhals together is actually how I heard of Narwhals.
12:21 Okay, yeah. So yeah. And how complicated that is really depends on the complexity of the data frame operations the library is doing.
12:30 In the case of Altair, they weren't doing anything that was that crazy.
12:34 They needed to inspect the data types, select some columns, convert date times to strings, get the unique categories out of categoricals.
12:44 It wasn't that bad. So I think within a few weeks we were able to do it. Same story with scikit-lego.
12:51 There's some other libraries that have reached out that have shown interest where it's going to be a bit of a heavier lift,
12:57 but it's generally not as bad as I thought it was going to be when I started the project.
13:03 The other way that I think I might have interpreted your question is how difficult is it for a new data frame library to become narwhals compatible?
13:11 Yes.
13:12 And there's a couple of ways that they can go about doing that.
13:14 The preferred way is if they either write to us or open a pull request, adding their library as a backend in narwhals.
13:21 However, we love open source, but I don't consider myself an open source absolutist.
13:26 I understand that not everything can be open sourced.
13:29 So if somebody has a closed source solution, we do have an extensibility mechanism within narwhals such that somebody just needs to implement some dunder methods.
13:38 And then if they pass the data frame into a library that's been narwhalified, then narwhals will know how to glue things together.
13:45 And they'll be able to still support this closed source solution without it needing to go out into the open.
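A toy sketch of that dunder-method extensibility idea: a closed-source frame implements an agreed-upon dunder, and the compatibility layer duck-types on it. The method name `__toy_dataframe__` below is made up purely for illustration; the real Narwhals protocol defines its own dunder names, so check its documentation.

```python
# Toy sketch of dunder-based extensibility: the layer never needs to import
# or know about the closed-source library, it only checks for the dunder.
# __toy_dataframe__ is a hypothetical name invented for this example.

class ClosedSourceFrame:
    """Pretend proprietary data frame that opts in to the protocol."""
    def __init__(self, data):
        self._data = data
    def __toy_dataframe__(self):
        # Return a minimal object the compatibility layer knows how to use.
        return {
            "columns": list(self._data),
            "select": lambda cols: {c: self._data[c] for c in cols},
        }

def from_native(obj):
    """Glue function: accept anything implementing the agreed dunder."""
    if hasattr(obj, "__toy_dataframe__"):
        return obj.__toy_dataframe__()
    raise TypeError("not a supported frame")

frame = ClosedSourceFrame({"a": [1, 2], "b": [3, 4]})
compat = from_native(frame)
print(compat["columns"])        # ['a', 'b']
print(compat["select"](["a"]))  # {'a': [1, 2]}
```

The closed-source code never has to be published; it only has to implement the protocol, and the glue layer handles the rest.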
13:51 Right. It's kind of something like inheriting from a class and implementing some functions and then it knows, right?
13:57 Yeah, exactly.
13:58 Yeah. Yeah. Cool.
13:59 So right now it has full API support for cuDF, C-U-D-F. I'm guessing that's the CUDA data frame library?
14:07 Yeah, I'm not totally sure how we're supposed to pronounce it. I call it cuDF.
14:11 Yeah, that came out of the Rapids team at NVIDIA. It's like an accelerated version of Pandas on GPU.
14:17 Yeah, that's been quite a fun one.
14:18 Nice. Yeah, I bet. That's pretty wild.
14:22 The API is quite similar to Pandas, but it's not exactly the same. So we have to do a bit of working around.
14:27 Right, right. Because graphics cards are not just regular memory and regular programs. They're weird, right?
14:34 Yeah, that's part of it. So there's some parts of the Pandas API which they intentionally don't support.
14:39 And there's another part of it is just that the Pandas API is so extensive that it's just a question of resources.
14:46 It's pretty difficult to reimplement 100% of the Pandas API. But Modin does attempt to do that. Modin does bill itself as a drop-in replacement for Pandas.
14:57 In practice, I think they do have a section in their docs where they do mention some gotchas, some slight differences.
15:05 But that's the idea. They've kind of got their own intermediate representation, and they've got their algebra, which they've published a paper about, which they then map onto the Pandas API.
15:17 A pretty interesting project that was a lot easier to support. The way they mimic the Pandas API is a lot closer.
15:23 But it's been interesting. Like with Narwhals, we did find a couple of minor bugs in Modin just by running our test suite through the different libraries, which we then reported to them and they fixed very quickly.
15:33 That's pretty awesome. Yeah, yeah. That's super awesome.
15:34 So Modin lets you use Ray, Dask, or unidist. Unidist.
15:40 One of which I know, two of which I've heard of.
15:45 So I was going to ask about things like Dask and others, which are sort of themselves extensions of pandas.
15:52 But if you support Moden, you're kind of through one more layer supporting Dask.
15:57 Oh, but it's better. We don't have this on the readme yet, but we do have a level of support for Dask.
16:04 We've not quite put it on the readme yet because we're still kind of defining exactly where the boundaries are.
16:10 But it's going to be some kind of partial, lazy-only layer of support.
16:17 And it's actually quite a nice way to run Dask.
16:19 Like when you're running Dask, there are some things which do trigger compute for you.
16:23 There are some things which may trigger index repartitioning.
16:27 I think that's what it's called.
16:28 And in Narwhals, we've just been extremely careful that if you're able to stick to the Narwhals API,
16:33 then what you're doing is going to be performant.
16:36 Awesome. Yeah, that's super cool.
16:38 So one thing I think worth maybe pointing out here is you talked about Pandas, Pandas 1, Pandas 2,
16:44 and it being an extensive API.
16:46 I mentioned the eager versus lazy computation.
16:49 But these two libraries are maybe some of the most popular ones, but they're pretty different in their philosophy.
16:56 So maybe just could you just quick compare and contrast Polars versus Pandas?
17:01 Yeah, sure.
17:02 So Pandas started a lot earlier, I think, in 2008, maybe first released in 2009,
17:11 and originally really written heavily around NumPy.
17:16 And you can see this in the classical Pandas NumPy data types.
17:20 So the support for missing values is fairly inconsistent across types.
17:25 So you brought up PyArrow before.
17:27 So with the PyArrow data types, then we do get consistent missing value handling in Pandas.
17:32 But for the classical NumPy ones, we don't.
17:34 Polars started a lot later.
17:36 It didn't have a lot of backwards compatibility concerns to have to worry about.
17:42 So it could make a lot of good decisions up front.
17:44 It's generally a lot stricter than Pandas.
17:47 And in particular, there's a lot of strictness in the kinds of ways it lets you interact with its objects.
17:55 So in Pandas, the way we interact with data frames is we typically extract series as one-dimensional objects.
18:02 We then manipulate those series.
18:04 Maybe we put them back into the original data frame.
18:07 But we're doing everything one step at a time.
18:09 In Polars, the primary way that we interact with data frames is with what you've got there on the screen.
18:16 pl.col('a', 'b').
18:17 These are called expressions.
18:19 And that expression, my mental model for it is just a function.
18:22 It's a function from a data frame to a series.
18:24 And being a function...
18:25 It's like a generator or something, huh?
18:27 Yeah, kind of.
18:27 Yeah.
18:27 So although I think when you say generator, like in Python, a generator, it's at some point you can consume it.
18:33 Like you can type next on the generator and it produces a value.
18:36 But an expression doesn't produce a value.
18:38 It's like if you've got lambda x, x times two.
18:41 Yeah.
18:41 It doesn't produce a value until you give it an input.
18:44 And similarly, an expression like pl.col('a', 'b').
18:47 By itself, it doesn't do anything.
18:49 The interpretation is given some data frame DF, I'll return you the columns A and B.
18:54 So it only produces those columns once you give it some input data frame.
18:57 And functions, just by their very definition, are lazy, kind of.
19:02 Like you don't need to evaluate them straight away.
19:05 And so Polers can take a look at all of the things you want to do.
19:07 It can recognize some optimization patterns.
19:10 It can recognize that maybe between some of your expressions, there are some parts that are repeated.
19:15 And so instead of having to recompute the same thing multiple times, it can just compute it once and then reuse that between the different expressions.
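That mental model of an expression as a deferred function can be sketched in a few lines of plain Python. This is a toy model of the concept only, not the real Polars API: actual Polars expressions build a query plan rather than closures, but the "nothing runs until you supply a frame" behavior is the same.

```python
# Toy model of a Polars-style expression: a deferred function from a
# "data frame" (here just a dict of columns) to columns. Nothing is
# computed until the frame is supplied, exactly like lambda x: x * 2.

def col(*names):
    """Like pl.col('a'): returns a function that waits for a frame."""
    return lambda df: {n: df[n] for n in names}

def times_two(expr):
    """Compose a new expression on top of an existing one, still deferred."""
    return lambda df: {n: [v * 2 for v in vals] for n, vals in expr(df).items()}

expr = times_two(col("a"))          # no data has been touched yet
df = {"a": [1, 2, 3], "b": [4, 5, 6]}
print(expr(df))  # {'a': [2, 4, 6]}
```

Because the whole pipeline is just a description of work until `expr(df)` is called, an engine that receives such descriptions can inspect, reorder, and deduplicate them before executing anything.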
19:23 Yeah, that's one of the big capabilities of Polars: it has kind of a query engine optimizer in there.
19:32 Whereas Pandas, because it's not lazy, it just does one thing, then the next, the next.
19:36 But maybe if you switch the order, like first filter and then compute versus compute and then filter,
19:41 you might get a way better outcome, right?
19:44 That's a massive one, yeah.
19:45 So when I was doing some benchmarking, we brought up cuDF earlier.
19:49 So that's the GPU accelerated version of Pandas.
19:51 And that is super fast if you're just doing single operations one at a time in a given order.
19:57 However, there are some benchmarks where maybe you're having to join together multiple data frames,
20:02 and then you're only selecting certain rows.
20:07 At that point, it's actually faster to just do it on a CPU using a lazy library like Polars,
20:12 because Polars can do the query optimization.
20:14 It can figure out that it needs to do the filter and only keep certain rows before doing five gigantic joins.
20:20 Whereas cuDF, it's super fast on GPU, but it is all eagerly executed.
20:25 It did way more work, but it did it really fast, so it was about the same in the end.
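The filter-before-join rewrite being described can be illustrated with a small, self-contained Python experiment. The frames, sizes, and work counter here are invented for the example; the point is only to show how much less work the reordered plan does.

```python
# Toy illustration of why operation order matters: join-then-filter touches
# far more rows than filter-then-join, which is exactly the kind of rewrite
# a lazy query optimizer can perform automatically. Sizes are arbitrary.

left = [{"id": i, "x": i} for i in range(1000)]
right = [{"id": i, "y": i} for i in range(1000)]
keep = {1, 2, 3}  # the rows we actually want at the end

def join(a, b):
    """Inner join on 'id' using a lookup index, counting rows examined."""
    join.work = 0
    index = {row["id"]: row for row in b}
    out = []
    for row in a:
        join.work += 1
        if row["id"] in index:
            out.append({**row, **index[row["id"]]})
    return out

# Eager order: join everything first, then filter.
joined = join(left, right)
eager_work = join.work
eager = [r for r in joined if r["id"] in keep]

# Optimized order: filter first, then join only the survivors.
small = [r for r in left if r["id"] in keep]
optimized = join(small, right)
lazy_work = join.work

print(eager_work, lazy_work)  # prints: 1000 3
assert eager == optimized     # same answer, far less work
```

Both orders produce identical results, but the reordered plan examined three rows instead of a thousand, which is the effect being described, just at toy scale.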
20:29 Yeah, but now in Polars, there's going to be GPU support.
20:33 And it's going to be query optimized GPU support.
20:37 I don't know if the world is ready for this level of speed.
20:39 Yeah, that's going to be interesting.
20:41 This portion of Talk Python is brought to you by WorkOS.
20:46 If you're building a B2B SaaS app, at some point, your customers will start asking for enterprise features like SAML authentication,
20:53 SCIM provisioning, audit logs, and fine-grained authorization.
20:57 That's where WorkOS comes in, with easy-to-use APIs that'll help you ship enterprise features
21:02 on day one without slowing down your core product development.
21:06 Today, some of the fastest-growing startups in the world are powered by WorkOS,
21:10 including ones you probably know, like Perplexity, Vercel, and Webflow.
21:14 WorkOS also provides a generous free tier of up to 1 million monthly active users for AuthKit,
21:21 making it the perfect authentication layer for growing companies.
21:24 It comes standard with useful features like RBAC, MFA, and bot protection.
21:29 If you're currently looking to build SSO for your first enterprise customer,
21:34 you should consider using WorkOS.
21:35 Integrate in minutes and start shipping enterprise plans today.
21:40 Just visit talkpython.fm/workos.
21:43 The link is in your podcast player's show notes.
21:45 Thank you to WorkOS for supporting the show.
21:48 I guess another difference, it's not a massive, you know, in some ways it matters,
21:53 some ways it doesn't, is Pandas is based on C extensions, right?
21:58 I'm guessing, if I remember right.
21:59 And then Polars is Rust, and they even took the .rs extension for their domain,
22:05 which is really embracing it.
22:08 Not that it really matters, you know, what your native layer is, if you're not working in that, right?
22:14 Like, most Python people don't work in C or Rust, but it's still interesting.
22:17 Well, I think it, yeah, it is interesting, but also it can be useful for users to know this,
22:23 because Polars has a really nice plugin system.
22:26 So you can extend Polars with your own little expressions, which you can write in Rust.
22:31 And the amount of Rust that you need to do this is really quite minimal.
22:34 Like, if you try to write these Polars plugins as if you were writing Python,
22:38 and then just use some LLM or something to guide you, I think, realistically, most data scientists can solve 98% of their inefficient data frame usage
22:48 by using Polars plugins.
22:49 So having a nice, safe language that you can do this, it really makes a difference.
22:54 I'm going to write it in Python, and then I'm going to ask some LLM.
22:58 Right now I'm using LM Studio and I think Llama 3.
23:01 Anyway, ask it, say, okay, write this in Rust for me.
23:06 Write it as a Polars plugin.
23:07 Here we go.
23:08 All right.
23:08 Yeah, exactly.
23:09 It's crazy.
23:11 It's this new world we live in.
23:12 Yeah, yeah, totally.
23:13 I mean, like, the amount of Rust knowledge you need to take care of some of the complicated
23:18 parts in Polars is really advanced.
23:20 You really need to study for that, and an LLM isn't going to solve it for you.
23:23 But the amount of Rust that you need to just make a plugin to solve some inefficient function,
23:29 I think that's doable.
23:30 Right.
23:31 Yeah, exactly.
23:32 It's very different to say, we're going to just do this loop in this function call here
23:35 versus there, rather than I'm going to write a whole library in Rust or C or whatever.
23:39 Exactly.
23:41 Yeah, so there's a pretty different API between these.
23:45 And in Narwhals, it looks like you've adopted the Rust API, right?
23:50 A subset of it.
23:51 Is that right?
23:51 The Polars one, yes, exactly.
23:53 So I kind of figured we've got a few choices.
23:55 That's what I mean.
23:56 Yeah, the Polars one.
23:57 Yeah, yeah.
23:58 We can either, like, just choose the Pandas API.
24:00 But to be honest, I found that trying to translate the Pandas API to Polars was fairly painful.
24:06 Like, Pandas has a bunch of extra things, like the index, multi-index, and it does index alignment on all the operations.
24:12 I just found it not a particularly pleasant experience to try to map this onto Polars.
24:18 However, when I tried to do the reverse of translating the Polars API to Pandas,
24:22 it kind of just worked without that much effort.
24:25 And I was like, oh, wow, this is magic.
24:26 Okay, let's just take this a bit further, publish it on GitHub.
24:29 Maybe somebody would find a use case for it.
24:31 I don't know.
24:32 Yeah, that's great.
24:33 Out of the audience.
24:34 ZigZackJack asks, how is Narwhals different from Ibis?
24:38 All right.
24:39 The number one most common question.
24:41 Love this.
24:41 Is it?
24:42 Okay, great.
24:43 Yeah.
24:43 Maybe we should provide a bit of context for the listeners on what Ibis is.
24:47 So Ibis, yes, you can see there on the screen, they describe themselves as the portable data frame library.
24:52 So Ibis is really aiming to be a data frame library, just like Pandas, just like Polars.
24:57 But it's got this API, which can then dispatch to different backends.
25:02 The default one is DuckDB, which is a really powerful embedded analytics database.
25:08 I think you covered it on the show.
25:09 In fact, I think I might have first heard about DuckDB on Python Bytes.
25:12 So listeners, if you want to stay up to date, subscribe to Python Bytes.
25:17 Thank you.
25:18 Yeah.
25:18 One of the shows I almost never miss.
25:20 So yeah, I think the primary difference between Narwhals and Ibis is the target audience.
25:26 So with Ibis, they're really trying to be this full-blown data frame library that people can use to do their analyses.
25:33 Whereas with Narwhals, I'm openly saying to end users, like if you're an end user, if you're a data scientist, if you're an ML engineer, if you're a data analyst, don't use Narwhals.
25:43 It's a tool for tool builders. Like, learn Polars, learn DuckDB, learn whatever the best tool is for your particular task, learn it well, master that, and do your analyses.
25:55 On the other hand, if you're a tool builder and you just need to do some simple operations with data frames and you want to empower your...
26:03 If you want to enable your users to use your tool, regardless of which library they're starting with, then Narwhals can provide a nice bridge between them.
26:11 Interesting.
26:12 Is there any interoperability between Ibis and Narwhals?
26:16 We do have some level of support for Ibis.
26:18 And at the moment, this is just interchange-level support, in the sense that if you pass an Ibis data frame, then you can inspect the schema, not do much else.
26:30 But for the Altair use case, that's all they needed.
26:33 Like they just wanted to inspect the schema, make some decisions on how to encode some different columns.
26:39 And then depending on how long your data frame is, they might convert to PyArrow and dispatch to a different library called VegaFusion, or they might just do everything within Altair.
26:49 But we found that even just having this relatively minimal level of support for Ibis, Vaex, DuckDB, and anything else that implements the data frame interchange protocol was enough to already solve some problems for users of these libraries.
27:04 Yeah.
27:04 Okay.
27:05 Very interesting.
27:06 Let's see.
27:07 We'll hit a few more of the highlights here.
27:10 100% test coverage.
27:12 You already mentioned that you found some bugs in...
27:15 I think it's...
27:16 Which library was it?
27:17 Modin.
27:17 Yeah, yeah, that's right.
27:18 I think all of them.
27:19 I think it's helpful to uncover some rough edge cases in all of the libraries that we have some support for.
27:25 You write a library and you're going to say, I'm going to try to behave like you do.
27:29 And I'll write some tests around that.
27:30 And then when you find the differences, you're like, wait a minute, right?
27:34 Yeah, exactly.
27:35 Also really love to see the let your IDE help you thanks to static typing.
27:39 We'll definitely have to dive into that in a bit as well.
27:42 That looks awesome.
27:43 Cheers.
27:44 Yeah, a huge fan of static typing.
27:45 You know, it's a bit of a controversial topic in some Python circles.
27:48 Some people say that it's not really what Python is meant for and that it doesn't help you prevent bugs and all of that.
27:53 And I can see where these people are coming from.
27:55 But when I've got a statically typed library and my IDE is just always popping up with helpful suggestions and doc strings and all of that, then that's when I really appreciate it.
28:06 Exactly.
28:06 Like, forget the bugs.
28:08 If I don't have to go to the documentation because I hit dot and it's immediately obvious what I'm supposed to do, that's already a win, right?
28:15 And typing gives you that.
28:16 Plus it gives you checking.
28:17 Plus it gives you lots of other things.
28:19 I think it's great.
28:20 And especially with your focus on tool builders, tool builders can build tools which have typing.
28:25 They can build better tools using your typing, but they don't have to, because it's optional.
28:29 It's not really forced upon any of the users.
28:32 The only libraries that I can think of that really force typing on their users are Pydantic and FastAPI and a couple of others, like Typer, that have behavior driven by the types you put on.
28:43 But if you're using that library, you're choosing that as a feature, not a bug, right?
28:49 Yeah, exactly.
28:50 Yeah.
28:50 So awesome.
28:51 And then finally, sticking with the focus on tool builders, perfect backwards compatibility policy.
28:57 What does this mean?
28:57 This is a bit of an ambitious thing.
29:00 So when I was learning Rust, I read about Rust editions.
29:04 So the idea is that when you start a Rust project, you specify the edition of Rust that you want to use.
29:11 And even as Rust gets updated, if you write some project using the 2015 edition of Rust, then it should keep working essentially forever.
29:20 So they keep this edition around.
29:22 And if they have to make backwards incompatible changes, there's new editions like 2018, 2021 editions.
29:29 So this is kind of what we're trying to do.
29:30 Like the idea was, well, we're kind of mimicking the Polars API.
29:34 I think there was a bracket I opened earlier, which I might not have finished, which was that the third choice we had was to make an entirely new API.
29:40 But I thought, well, better to do something that people are somewhat familiar with.
29:44 Yeah, I think that's a great choice.
29:46 Yeah.
29:46 Because when you go and write the code, half of the people will already know Polars.
29:50 And so they just keep doing that.
29:51 You don't have to go, well, here's a third thing you have to learn, right?
29:54 Yeah.
29:55 I'd like to think that by now half of people know Polars.
29:59 Unfortunately, I think we might not quite be there yet, but it is growing.
30:03 No, I think so too.
30:03 Yeah.
30:04 Yeah, yeah.
30:04 I think we'll get there.
30:05 So yeah, it's okay.
30:07 We're kind of mimicking a subset of the Polars API and we're just sticking to the fundamentals.
30:13 So that part should be relatively stable.
30:16 But at some point, presumably Polars is going to make a backwards incompatible change.
30:19 And at that point, what do we do in Narwhals?
30:21 What do we do about the top level Narwhals API?
30:24 And coordinating changes between different libraries, it's going to get tricky.
30:30 And the last thing that I want to do is see people put upper bound constraints on the Narwhals library.
30:35 I think upper bound constraints on something like this should never really be necessary.
30:40 So we've tried to replicate what Rust does with its editions.
30:44 The idea is that we've got a stable V1 API.
30:47 We will have a stable V2 API at some point if we need to make backwards incompatible changes.
30:52 But if you write your code using the V1 stable Narwhals API, then even as new Narwhals versions come out, even as the main Narwhals namespace changes, even as we might introduce V2, then your code should, in theory, keep working.
31:09 Like V1 should stay supported indefinitely.
31:12 This is the intention.
31:13 Yeah.
31:14 And you said see the stable API for how to opt in.
31:17 So how do you, I'm just curious what the mechanism is.
31:21 So for example, import narwhals.stable.v1 as nw, instead of the standard narwhals.
31:26 Yeah, exactly.
31:27 So instead of.
31:28 I got you.
31:29 That's cool.
31:29 Yeah, instead of import narwhals as nw, you'll do import narwhals.stable.v1 as nw.
31:34 And yeah, I encourage people, when they're just trying it out, prototyping, use import narwhals as nw.
31:40 If you want to make a release and future-proof yourself, then switch over to the stable.v1.
31:47 This is a little similar to api.talkpython.fm/v1/whatever, you know, where people encode a different version in their API endpoints, basically.
32:01 Yeah, yeah.
32:02 In import statements.
32:02 I like it.
32:03 I like it a lot.
32:03 It's great.
32:04 Yeah, let's see how this goes.
32:05 Yeah, exactly.
32:06 Now it's good.
32:07 So just back to typing real quick.
32:09 Pamphile Roy out there says, a lot of open source maintainers complain about typing because if you want to make it really correct, it's painful to add.
32:18 That can be true.
32:19 And so, you know, the last 1% is sometimes insane.
32:24 But it's so helpful for end users.
32:26 True, yeah.
32:27 You mentioned earlier that everyone seems to be at QuantSight.
32:29 Do you know where I met Pamphile?
32:31 QuantSight?
32:32 He was an ex-colleague of mine at QuantSight, yes.
32:34 Amazing.
32:36 See, it continues to happen.
32:38 Yeah, exactly.
32:38 But yeah, I think that totally sums it up for me as well.
32:43 You know, it's really great to be using libraries that give you those options.
32:46 You know, we do have the PYI files and we have typeshed and all of that, where people can kind of put typing on things from the outside for projects that didn't want to support it.
32:56 But if it's built in and part of the project, it's just better, you know?
33:00 Yeah.
33:00 If you have it from day one, it works well.
33:02 I mean, trying to add types to a library that started without types like Pandas, it's fairly painful to be honest.
33:08 I bet it is.
33:09 I bet it is.
33:10 Yeah, really cool.
33:11 All right.
33:11 Let's go and talk through, I guess, a quick shout out.
33:16 Just last year, I had Ritchie Vink, who's the creator of Polars, on Talk Python.
33:21 If people want to check that out, they can certainly have a listen to that.
33:26 And I also just recently had Wes McKinney, creator of Pandas, on.
33:30 And I'll link to those shows if people want to, like, dive into those.
33:33 But let's talk a little bit through your documentation.
33:35 It tells a really good story.
33:37 I like what you put down here as, you know, it's not just here's your API and stuff, but it walks you through.
33:43 So we talked about why, obviously, install, pip install.
33:46 It's pure Python with a pure Python wheel, right?
33:49 Yeah, exactly.
33:50 Shouldn't be any issues with installation.
33:52 Is it WASM compatible?
33:54 Do you know?
33:55 Could I use it on PyScript, Pyodide?
33:57 I don't know.
33:58 Are there any restrictions that they need?
34:01 There's some restrictions.
34:03 For example, I don't think you can do threading.
34:05 I don't think you can use some of the common third-party HTTP clients (not that you have any dependencies), because requests have to go through the browser's AJAX layer.
34:14 There's some, but not terribly many restrictions.
34:16 I'd imagine then that we would only be limited by whichever data frame people are passing in.
34:21 Yeah, yeah.
34:22 Awesome.
34:22 Okay.
34:23 That's super nice.
34:24 And maybe let's just talk through a quick example here, keeping in mind that most people
34:30 can't see any of the code, but let's just give them a sense still of what does it look
34:35 like to write code that is interoperable with both or all these different libraries or these
34:40 data frame libraries using narwhals.
34:42 So maybe give us just an example.
34:44 Sure.
34:45 So the idea is what we can see on the screen is just a very simple example of a data frame
34:51 agnostic function.
34:52 We've got a function called my function.
34:54 And this is something that users could maybe just use.
34:57 Maybe it's something your library exposes, but the user doesn't need to know about narwhals.
35:02 The narwhals only happens once you get inside the function.
35:05 So the user passes in some data frame.
35:07 We then call narwhals.from_native on that data frame object.
35:12 We do some operation and then we return some native object back to the user.
35:16 Now, narwhals.from_native, it's a practically free operation.
35:19 It's not doing any data conversion.
35:21 It's just instantiating some narwhals class that's backed by your original data frame.
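A hedged sketch of that zero-copy wrapping idea. The class and function names here are hypothetical stand-ins, not the real narwhals internals; the point is only that construction holds a reference and does no data conversion:

```python
class AgnosticFrame:
    """Toy stand-in for the wrapper narwhals returns: it only holds a
    reference to the native frame, so wrapping copies no data."""

    def __init__(self, native):
        self._native = native  # just a reference, not a copy

    def to_native(self):
        # hand the exact same object back to the caller
        return self._native


def from_native(df):
    # practically free: O(1), only instantiates the wrapper
    return AgnosticFrame(df)
```

The real library then dispatches each wrapper method to a pandas-like or Polars backend behind that thin shell.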
35:26 Right.
35:27 Right.
35:27 And I imagine if it's a Polars data frame that gets passed in, it's probably a more direct
35:33 pass-through to the API than if you're doing operations on a pandas data frame.
35:38 Right.
35:38 Is there a difference of sort of runtime depending on the backend?
35:41 The overhead is really low even for the pandas case.
35:45 In fact, sometimes things do get a little bit faster because of how careful we've been about
35:50 avoiding index operations and unnecessary copies.
35:55 To be honest, some of this will be alleviated in pandas version 3, when copy-on-write becomes
36:00 the default.
36:02 Oh, that's interesting.
36:03 Yeah.
36:03 Yeah.
36:03 In terms of the mapping on the implementation side, it's a bit easier to do the Polars
36:07 backend.
36:08 But even then, we do need to do some version checks.
36:10 Like in 0.20.4, they renamed with_row_count to with_row_index, I think.
36:17 And so, yeah, even there, we do need some if-then statements.
36:20 But like the end of the day, what the library does is there's a few extra function calls,
36:25 a few checks on versions.
36:28 It's not really doing that much.
36:29 Yeah.
36:30 Like you might experience an extra millisecond compared to running something natively at most.
36:34 And usually you're using a data frame because you have some amount of data, even hundreds
36:40 of rows.
36:40 Still, most of the computation is going to end up there rather than in the checks of, if it's
36:45 this, call that, otherwise call this.
36:46 Right.
36:47 That's not a lot of overhead, relatively speaking.
36:49 I agree.
36:50 Yeah.
36:50 So, yeah, we see an example here of a data frame agnostic function, which just calculates
36:55 some descriptive statistics from an input data frame.
36:58 Using the expressions API, which we talked about earlier.
37:01 Yeah.
37:02 And here's something that I quite like about MkDocs.
37:04 So you see where it says, let's try it out.
37:06 We've got these different tabs and you can click on like polars, pandas, polars lazy.
37:13 And then you can see in each case what it looks like from the user's point of view.
37:17 And you can see, you can compare the outputs.
37:20 So from the user's point of view, they're just passing their object to funk.
37:24 What they're not seeing is that under the hood, funk is using narwhals.
37:27 But from their perspective, they put pandas in, they get pandas out.
37:31 They put polars in, they get polars out.
37:33 That's awesome.
37:34 So we talked about the typing.
37:36 And in this one, we have a df typed as a FrameT.
37:41 Is that some sort of generic?
37:43 And does it have restrictions on it?
37:45 What is this FrameT?
37:46 I didn't dive into the source and check it out before.
37:49 Sure.
37:49 Yeah.
37:49 It's a TypeVar.
37:50 So it's just the idea that you start with a data frame of some kind and you get back some
37:56 data frame of the same kind.
37:57 Start with polars, get back polars, start with pandas, get back pandas, and so on.
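That "same kind in, same kind out" promise can be written with a plain TypeVar. In narwhals itself, FrameT is tied to the library's own frame types; this toy version just shows the general shape:

```python
from typing import TypeVar

# Whatever frame type binds to FrameT on the way in is the type
# the checker expects on the way out.
FrameT = TypeVar("FrameT")


def passthrough(df: FrameT) -> FrameT:
    """Toy frame-in, frame-out function: pandas in, pandas out;
    Polars in, Polars out, as far as the type checker is concerned."""
    return df
```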
38:01 And yeah, this version of the function is using the decorator nw.narwhalify.
38:07 Narwhalify.
38:08 It's a fantastic verb.
38:10 Yeah.
38:10 So there's two ways in which you can implement your function.
38:17 You can do it the explicit way where that's in the quick start in the docs where you write
38:23 your function that takes some native frame and then you convert that to this narwhals one.
38:30 You say from_native, then you do your work.
38:32 And then depending on, you could convert it back or in this case, it returns a list of strings
38:37 in that example.
38:38 Or you can skip the first and the last step and just put this decorator on it, and it'll
38:42 convert it from and then convert it to, on the way in and out, right?
38:46 Yeah, exactly.
38:47 So if you're really strict about type annotations, then using from_native and to_native gives you
38:54 a little bit of extra information.
38:56 But I think narwhalify looks a little bit neater.
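The decorator's in-and-out conversion can be sketched with a toy wrapper. All names here are hypothetical illustrations, not the real narwhalify implementation:

```python
import functools


class Wrapped:
    """Toy wrapper standing in for a narwhals frame."""

    def __init__(self, native):
        self.native = native


def narwhalify_sketch(func):
    """Wrap the first argument on the way in, unwrap the result
    on the way out, so callers only ever see native objects."""

    @functools.wraps(func)
    def inner(df, *args, **kwargs):
        result = func(Wrapped(df), *args, **kwargs)  # the from_native step
        # the to_native step: only unwrap when a frame came back
        return result.native if isinstance(result, Wrapped) else result

    return inner


@narwhalify_sketch
def add_marker(df):
    # frame-agnostic body: works on the wrapper, returns a wrapper
    return Wrapped(df.native + ["marker"])
```

Functions that return something other than a frame (like a list of column names) pass through unchanged, matching the quick-start example mentioned above.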
38:59 Yeah, that's true.
39:00 So for example, in the first one, you could say that this is actually a pandas data frame because
39:06 you're writing the code or something like that.
39:09 I don't know.
39:09 What is this IntoFrame?
39:10 This is the type on this first example.
39:12 Yeah.
39:13 By IntoFrame, we mean something that can be converted into a narwhals data frame or lazy frame.
39:19 How do you implement that in the type system?
39:21 Is it a protocol or what is this?
39:24 It's, yeah, we've got a protocol.
39:27 So I just found some methods that these libraries have in common.
39:30 Exactly.
39:31 Which wasn't too much.
39:32 You can find that.
39:33 That's what I was thinking.
39:34 Yeah.
39:35 Okay.
39:35 Yeah.
39:36 But if it has enough of the functions of pandas or Polars, you're like, all right, this is
39:41 probably good.
39:42 All right.
39:42 And you can say it's one of these.
39:43 That's pretty cool.
39:45 Yeah, exactly.
39:45 I mean, if any of this is confusing to listeners, we do have a page there in the documentation
39:49 that's all about typing.
39:51 So people can read through that at their own leisure.
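The structural check Marco describes can be sketched with a runtime-checkable Protocol. The two methods below are purely illustrative; the actual set of members narwhals looks for is different:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class FrameLike(Protocol):
    """Hypothetical minimal surface a frame-ish object shares."""

    def __len__(self) -> int: ...

    def to_dict(self): ...


class TinyFrame:
    """Minimal object that structurally satisfies FrameLike,
    without inheriting from it."""

    def __init__(self, data):
        self._data = data

    def __len__(self):
        first = next(iter(self._data.values()), [])
        return len(first)

    def to_dict(self):
        return dict(self._data)
```

Because Protocol matching is structural, any library whose frames happen to expose those methods is accepted without registering anything.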
39:53 Yeah, for sure.
39:55 All right.
39:55 Let's see.
39:56 I do like the MkDocs where you can have these different examples.
40:00 One thing I noticed is you've got the Polars eager evaluation and you've got the Polars
40:06 lazy evaluation.
40:08 And when you have the Polars lazy, this function decorated with the decorator, the Narwhalify
40:14 decorator, it itself returns something that is lazy and you've got to call collect on, right?
40:19 So it kind of preserves the laziness, I guess.
40:21 Is that right?
40:22 Yes, exactly.
40:23 This was something that was quite important to me, like not be something that only works
40:28 well with eager execution.
40:30 I want to have some level of support such that lazy in can mean lazy out.
40:36 Yeah.
40:36 Eager in, eager out.
40:37 Lazy in, lazy out.
40:38 Okay.
40:39 Exactly.
40:39 Yeah.
40:40 So the way you do that in Polars is you create a lazy frame versus a data frame, right?
40:45 But then you've got to call collect on it, kind of like awaiting it if it were async, which
40:50 is cool.
40:50 Yeah.
40:51 Or don't call collect or just wait until you really need to call collect.
40:55 Right.
40:56 Or pass it on to the next one and on to the next.
40:58 Yeah, exactly.
40:59 Exactly.
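Lazy-in, lazy-out can be illustrated with a toy lazy frame that only records operations and runs them on collect(). This is nothing like the real Polars engine, just the shape of the idea, with invented names:

```python
class ToyLazyFrame:
    """Toy laziness: operations are recorded, nothing executes
    until collect() is called."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = tuple(ops)

    def double(self, col):
        # queue the operation and stay lazy: return a new lazy frame
        return ToyLazyFrame(self._data, self._ops + (col,))

    def collect(self):
        # only now is the recorded plan actually executed
        out = {k: list(v) for k, v in self._data.items()}
        for col in self._ops:
            out[col] = [x * 2 for x in out[col]]
        return out
```

Chaining stays cheap because each method call just extends the plan; the cost is paid once, when somebody finally collects.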
41:00 So one of the things that you talk about here is the pandas index, which is one of the
41:06 key differences between Polars and pandas.
41:08 And you've classified pandas people into two categories.
41:12 Those who love the index and those who try to get rid of it and ignore it.
41:17 Yeah, exactly.
41:17 So if James Powell is listening, I think we can put him in the first category.
41:22 I think most realistically, most pandas users that I've seen call .reset_index(drop=True)
41:28 every other line of code.
41:30 They just find that the index gets in the way more than helps them most of the time.
41:34 And with narwhals, we're trying to accommodate both.
41:37 So we don't do automated index alignment.
41:41 So this isn't something that you have to worry about.
41:43 But if you are really bothered about index alignment, say, due to backwards compatibility concerns,
41:50 then we do have some functions which allow you to do that, which would be no-ops for other libraries.
41:56 There's an example in scikit-lego of where they were relying on pandas index alignment.
42:00 So we've got a function here,
42:01 narwhals.maybe_align_index.
42:04 So for pandas-like libraries, the index will do its thing.
42:08 And for other libraries, the data will just be passed through.
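That "do the index work for pandas-like, pass through for everyone else" pattern can be sketched like this. It's a toy version with duck-typed checks; narwhals' real maybe_align_index handles the actual details:

```python
def maybe_align_index_sketch(df, other):
    """Align df to other's index when both look pandas-like;
    otherwise return df unchanged (a no-op for index-free
    libraries like Polars)."""
    if hasattr(df, "reindex") and hasattr(other, "index"):
        # hypothetical pandas-like branch: reorder rows to match
        return df.reindex(other.index)
    return df  # pass-through: nothing to align
```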
42:11 Right.
42:12 So you said they're pandas-like.
42:15 And pandas-like is actually a type in your type system, right?
42:17 Did I see that?
42:18 Yeah, yeah.
42:20 So we've got an is-pandas-like-dataframe function to tell.
42:24 So by pandas-like, we mean pandas, cuDF, Modin.
42:28 So the libraries that have an index and follow those kinds of rules.
42:32 Yeah, yeah, that's really cool.
42:33 Yeah, because at the end of the day, like the idea of writing completely data frame agnostic code
42:37 is a lot easier for new libraries than for existing libraries that have backwards compatibility concerns.
42:42 And we recognize that it might not be completely achievable.
42:46 I think in all of the use cases where we've seen Narwhals adopted, they're doing most of it in a data frame agnostic way,
42:52 but they do have some parts of it where they're saying, okay, if this is a pandas data frame, we've got some pandas-specific logic.
42:58 And otherwise, let's go down the data frame agnostic route.
43:01 Yeah, you also have here levels of support.
43:05 You have full and interchange.
43:06 I think we talked about that a little bit.
43:08 So maybe just point people here.
43:10 If you want to be cuDF or Modin, you can fully integrate.
43:15 Or if you just want to have enough of an implementation that they can kind of work together, right?
43:21 You can do this data frame interchange protocol.
43:24 Yeah, exactly.
43:24 Or just write to us and we'd be happy to accommodate you without you having to go through the data frame interchange protocol.
43:31 Oh yeah, very nice.
43:32 Okay.
43:32 You mentioned the overhead before, but you do have a picture.
43:35 Picture's always fun.
43:36 And in the picture, you've got little different operations, different times for each of the operations.
43:42 And there's a quite small overhead for pandas versus pandas with Narwhals.
43:47 Yeah, exactly.
43:47 Like in some of them, you can see it becoming a little bit faster.
43:51 In some of them, you can see it becoming a little bit slower.
43:53 And these are queries that I think are the size that you can expect most data scientists to be working with a lot of the time.
43:58 You've got queries that take between a couple of seconds to 30 seconds.
44:02 And there, it's pretty hard to distinguish reliably between like the blue and red dots.
44:08 Sometimes one's higher, sometimes the other one's higher.
44:11 There's a bit of statistical variance just between running the same benchmark multiple times.
44:16 But overall, yeah, we were pretty happy with these results.
44:18 Yeah, that's great.
44:20 So how well have we covered how it works?
44:23 We talked about the API, but I don't know that we've talked about the implementation of how you actually do it.
44:29 Why is it basically almost the same speed?
44:33 Why is it not slower?
44:34 Yeah, well maybe.
44:35 Are you using underwater unicorn magic?
44:37 Is that what it is?
44:38 That's the secret, yes.
44:40 Underwater unicorn magic.
44:41 Well, perhaps first I should just say why we wrote this how-it-works page.
44:45 And it's because really I want this to be a community driven project.
44:48 And this is one of those cases where open source is more of a social game than a technical one.
44:54 I'm not saying that's always the case.
44:55 There are many problems that are purely technical.
44:57 Narwhals is a social game in the end.
44:59 Like what we're doing isn't that complicated.
45:01 But if we want it to work, then it needs to be accessible to the community.
45:06 People do need to be able to trust us.
45:08 And that typically does not happen if it's a one person project.
45:11 So it was really important to me that different people would be able to contribute to it, that it all be as simple and clear as possible.
45:19 So we made this page trying to explain how it works.
45:21 It's not quite as clear and quite as extensive as I'd like it to be.
45:25 But a few contributors did say that it really helped them.
45:28 So in terms of how do we get this low overhead?
45:31 So we're just defining an expression as being a function from a data frame to a sequence of series.
45:36 And then we're just repeatedly and strictly applying that definition.
45:39 So there's nothing too fancy going on.
45:41 In the end, that's just evaluating lambda functions in Python, going down the call stack.
45:48 Like it's pretty fast.
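Marco's definition, an expression as a function from a data frame to a sequence of series, can be sketched in a few lines. This is toy code with plain dicts of lists standing in for frames and columns; the names are invented for illustration:

```python
def col(name):
    """An 'expression': just a function that takes a frame and
    returns a list of columns."""
    return lambda frame: [frame[name]]


def select(frame, *exprs):
    """Strictly apply each expression to the frame and gather
    the resulting columns, as the definition says."""
    out = []
    for expr in exprs:
        out.extend(expr(frame))
    return out
```

There's no query optimizer or fancy machinery involved, which is why the overhead stays near zero.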
45:49 Yeah, that's really cool.
45:50 Yeah, so people can check this out.
45:51 They want to know.
45:52 I think this might be where I saw the pandas-like expression.
45:55 Ah, right, yeah.
45:56 Pandas-like.
45:56 Yeah, pandas-like is this class that encompasses pandas, Modin, cuDF.
46:00 The ones that kind of follow the pandas API.
46:02 Yeah, close enough for what you need to do.
46:05 Yeah, exactly.
46:06 All right.
46:07 Well, I saw a question out there in the audience somewhere from Francesco, which was basically asking about the roadmap.
46:14 Like, where are you?
46:15 Where are you going?
46:16 Yeah, I should probably introduce Francesco.
46:18 He's one of the most active contributors to the project.
46:22 So thanks, Francesco, for helping to make it a success.
46:26 He was actually also the first person to adopt Narwhals in one of his libraries.
46:31 Yeah, I spoke to him about it at a conference.
46:33 And he was like, I've got this tiny little time-based CV library.
46:37 Let's try Narwhalifying it as an experiment.
46:39 Sure, we did that.
46:40 Then scikit-learn.
46:42 Not scikit-learn, sorry.
46:43 Not scikit-lego.
46:45 It was this kind of experimental building blocks for scikit-learn pipelines that he maintains.
46:50 And then we've just been taking it from there.
46:52 So in terms of roadmap, my top priority is helping out libraries that have shown interest in Narwhals.
47:00 So at the moment, Formulaic, that opened a draft pull request in which they were trying out Narwhals.
47:06 And they tagged me just about some things they were missing.
47:08 So I'd like to see if I can take that to completion.
47:11 I've got, I think I've got most of it working, but just been a bit busy with conferences recently.
47:16 So maybe next month I'll be able to get something ready for review and show that to them.
47:21 That would be pretty cool.
47:23 Summer is passing.
47:25 The conferences are ending.
47:26 It's going to get dark and cold.
47:28 Perfect time to program.
47:29 Yeah, we'll get back to the situation that I was in when I started Narwhals, which was that it was a rainy Friday.
47:36 Not Friday, sorry.
47:37 It was a rainy February weekend in Wales.
47:40 The rainiest part of the UK.
47:42 Yeah, that's exactly the same in Oregon here.
47:45 So it's a good time to get stuff done.
47:47 Yeah, exactly.
47:48 So yeah.
47:48 And then I've been speaking to people from Shiny and Plotly about potentially looking into Narwhals.
47:55 There's no contract set in stone or anything.
47:59 These people may well change their mind if it doesn't work for them.
48:02 But my idea is, okay, they've shown interest.
48:04 Let's go headfirst into seeing whether we can help them and whether they'd be able to use Narwhals.
48:10 If it doesn't work out, we'll just have strengthened the Narwhals API and learn some new things.
48:15 If it does work, then great, exciting.
48:17 So that's my top priority.
48:20 And it's been really pleasing to see the contributor community develop around Narwhals.
48:25 I really thought it would be a one-person project for a long time.
48:28 But so many people have been contributing really high-quality pull requests.
48:31 It's really been, yeah... I see 42.
48:33 Okay, one of them is this.
48:35 Okay, maybe a couple of them here are like GitHub bots.
48:39 This pre-commit CI bot.
48:40 Yeah.
48:41 Maybe 40, 30, but still, that's a lot.
48:44 While we're talking numbers on the homepage, I also want to point out 10 million downloads a month is a lot of downloads.
48:49 That's awesome.
48:50 Yeah, that's maybe slightly misleading because they pretty much just come from the fact that it's now a required dependency of Altair.
48:57 And Altair gets millions of downloads.
49:00 Yeah, yeah, yeah, exactly.
49:01 But that's how, that's the place of some libraries.
49:04 Like Werkzeug, I don't think many people go, oh, let me go get this HTTP library, or ItsDangerous.
49:09 They just go, I'm going to use Flask.
49:11 Right?
49:11 But it's still a really important building block of the community, even if people don't seek it out as a top-level thing they use, right?
49:19 Sure, cheers.
49:20 Thanks, yeah.
49:20 In fact, if we do our job well, then most people should never know about narwhals.
49:25 Exactly.
49:26 They just use it, it just works.
49:28 Yeah, exactly.
49:29 They just look in their pip list like, what is this whale thing in here?
49:34 Yeah, exactly.
49:35 Yeah.
49:36 So, yeah, it's been really encouraging, really pleasing to see this contributor community emerge around the project.
49:42 And I think a lot of the contributors are really interested in adding extra methods and adding extra backends and things.
49:51 So I'm trying to leave a lot of that to the community.
49:55 So, like with Dask, I just got the rough building blocks together.
49:59 And then it was just so nice, so many really high-quality contributions coming in that brought the Dask support pretty much to completion.
50:07 We should see now if we're able to execute all of the TPC-H queries with the Dask backend.
50:11 We might actually be there or be pretty close to getting there.
50:15 Nice.
50:15 What does TPC-H stand for?
50:19 I don't remember what it stands for, but it's a set of database queries.
50:24 I see.
50:25 They were originally written for testing out different databases.
50:28 So it's a bunch of SQL queries.
50:31 But I'm not sure if it was Kaggle that popularized the idea of translating these SQL queries to data frame-like APIs and then running different data frames on them to see who wins the speed test.
50:48 But we just figured they do a bunch of things like joins, concatenations, filtering comparisons with dates, string operations.
50:57 And we're like, okay, if the Narwhals API is able to do all of this, then maybe it's extensive enough to be useful.
51:04 Yeah, yeah, yeah.
51:04 That's super cool.
51:06 It sounds a little bit like the TIOBE index plus other stuff maybe, but for databases.
51:11 I'm not familiar with that.
51:13 It's like a language ranking type of thing.
51:15 And, you know, one aspect is maybe ranking the databases.
51:18 But yeah, no, this is very cool.
51:20 Okay.
51:20 Got it.
51:21 In the end, we're not trying to be fast in Narwhals, but we just want to make sure that there's no extra overhead compared to running things natively.
51:29 As long as you're not much slower than the stuff that you're operating with, like, that's all you should ask for.
51:34 You can't make it go faster in the extreme.
51:37 Like, you did talk about some optimizations, but you can't fundamentally change what's happening.
51:40 Yeah, we could do some optimizations on the Nowels side.
51:43 But to be honest, I'm not sure I want to.
51:45 And part of the reason is because I want this to be a pretty simple project that's easy to maintain.
51:49 Yeah, sure.
51:50 And that's really just low overhead.
51:52 And extra docs and tutorials are coming.
51:54 That's fun.
51:55 Yeah.
51:56 Are you looking for contributors and maybe want to write some tutorials or docs?
51:59 I would love this, yeah.
52:00 I mean, it drives me crazy when I see so many projects where people have put so much effort into making a really good product, but then the documentation is really scant.
52:08 Like, if you don't prioritize writing good docs, nobody's going to use your product.
52:13 So I was really grateful to my company.
52:17 They had four interns come on who really helped out with making the docs look amazing.
52:22 Oh, that's cool.
52:23 Like, if you look at the API reference, I think every single function now has got a, like a doc string with an example.
52:31 At the bottom, I think there's API reference.
52:33 Yeah, if you search for any function in here, yeah, in the search box at the top, some, I don't know, series dot something.
52:41 Yeah, see, for any of these, we've got like an example of, okay, here's how you could write a data frame agnostic function, which uses this method.
52:49 And let's show that if you pass pandas or polars, you get the same result.
52:53 And if there's some slight differences that we just cannot get around, like in the way that they handle missing values, then we've got a very clear note about in the docs.
53:00 Yeah, that's great.
53:01 Maybe someday support for DuckDB?
53:03 I would like that.
53:06 I don't think we have much of a choice about whether or not we support DuckDB.
53:09 Like DuckDB is really on fire now.
53:13 It really is.
53:14 Yeah, I think it might be a question of either we have some level of support for DuckDB, or somebody else is going to make something like narwhals that supports DuckDB, and then we become extinct.
53:25 But besides, to be honest, DuckDB is amazing.
53:31 I just find it a bit painful to write SQL strings.
53:34 And so if I could use DuckDB, but with the Polars API that I prefer and I'm more familiar with, then...
53:41 Yeah, I 100% agree.
53:44 It looks super nice.
53:45 But if you look, it has a SQL example, and then the Python example is just SQL, quote, quote, quote.
53:50 Yeah, exactly.
53:51 Here's the SQL embedded in Python, you know what I mean?
53:53 So you're kind of writing SQL no matter what.
53:56 Yeah, and then the error messages that you get sometimes are like, oh, there's a parse error near this keyword.
54:01 And you're like, what on earth is going on?
54:03 And then you're like, oh, yeah, I forgot.
54:05 I've got an extra comma at the end of my select or something.
54:08 I don't know.
54:08 Yeah.
54:09 So DuckDB is a little bit like SQLite, but for analytics rather than transactional workloads, maybe.
54:16 I'm not sure if that's...
54:17 I think it's primarily aimed at analysts, yeah, analytical kinds of things, yeah, data scientists and people.
54:24 What we are going to struggle with is that in DuckDB, there's no guarantees about row order of operations.
54:32 But on the plus side, when I look at what Altair are doing with data frames, when I look at some of the other libraries that have shown interest in narwhals,
54:40 they're often just doing very simple things.
54:43 They're not doing things that depend on row order.
54:45 So if we could just initially just support DuckDB for the operations that don't require row order.
54:51 So for something like a cumulative sum, maybe initially we just don't support that for DuckDB.
54:56 Like in the end, if you want to do advanced SQL, just use DuckDB directly.
55:01 Like, as I said earlier, I don't recommend that end users use narwhals directly.
55:06 But even just having some common operations, ones that aren't row order dependent, I'd like to think that this is already enough to solve some real problems for real people.
55:16 Yeah, I know you said it's mostly for library builders.
55:19 But if you were building an app and you were not committed to your data frame library, or you really wanted to leave open the possibility of choosing a different data frame library,
55:28 using narwhals to isolate that a little bit might be nice, right?
55:32 Yeah.
55:33 So yeah.
55:34 Yeah.
55:35 If anyone tries to do this, and I'd love to hear your story.
55:38 I did hear from somebody in our community call.
55:42 We've got a community call every two weeks, by the way, if anyone wants to come and chat with us.
55:46 I did hear from somebody that, like, at work has got some teams that are primarily using pandas, some teams that are primarily using polars,
55:53 and he just wanted to build some common logic that both teams could use.
55:57 And he was using narwhals for that.
55:59 So I think there are some use cases beyond just library maintainers.
56:03 Yeah, absolutely.
56:04 Maybe you're building just an internal library, and it needs to work with some code you've written in pandas,
56:09 but maybe you want to try your new project in Polars while still using that library, right?
56:13 That would be a cool use case as well.
56:16 Yeah, yeah.
56:16 To not lock yourself in.
56:17 Yeah, I'm pretty sure you've brought up on the show before that XKCD about, like, the spacebar overheating.
56:23 I don't remember which number that one is, but in the end, with a lot of open source projects,
56:27 you put it out with some intention of how it's meant to be used.
56:31 Yes.
56:31 But then people find their own way of using it.
56:33 I believe it was spacebar heating.
56:35 Workflow, this is it.
56:37 Yes.
56:37 Love this one.
56:38 Yeah, it looks like something out of a changelog with feedback or something that says,
56:46 changes in version 10.17.
56:46 The CPU no longer overheats when you hold down the spacebar.
56:49 Comments.
56:49 Longtime user 4 writes, this update broke my workflow.
56:53 My control key is hard to reach, so I hold the spacebar instead.
56:56 And I configured Emacs to interpret a rapid temperature rise as control.
57:00 That's horrifying.
57:00 Look, my setup works for me.
57:02 Just add an option to re-enable spacebar heating.
57:06 I've seen it so many times, but it still makes me laugh each time.
57:09 It's incredible.
57:11 All right, Marco.
57:11 Well, congrats on the cool library.
57:13 Congrats on the traction.
57:14 Final call to action.
57:15 Maybe people want to start using narwhals.
57:17 What do you tell them?
57:17 Yeah, give it a go.
57:18 And please join our Discord and or our community calls.
57:22 We're very friendly and open.
57:24 And we'd love to hear from you and see what we can do to address whatever limitations you
57:30 might come up against.
57:31 This has been another episode of Talk Python to Me.
57:35 Thank you to our sponsors.
57:37 Be sure to check out what they're offering.
57:39 It really helps support the show.
57:40 This episode is brought to you by WorkOS.
57:43 If you're building a B2B SaaS app, at some point, your customers will start asking for enterprise
57:49 features like SAML authentication, SCIM provisioning, audit logs, and fine-grained authorization.
57:55 WorkOS helps ship enterprise features on day one without slowing down your
58:00 core product development.
58:00 Find out more at talkpython.fm/workos.
58:04 Want to level up your Python?
58:06 We have one of the largest catalogs of Python video courses over at Talk Python.
58:10 Our content ranges from true beginners to deeply advanced topics like memory and async.
58:15 And best of all, there's not a subscription in sight.
58:18 Check it out for yourself at training.talkpython.fm.
58:21 Be sure to subscribe to the show.
58:23 Open your favorite podcast app and search for Python.
58:26 We should be right at the top.
58:27 You can also find the iTunes feed at /itunes, the Google Play feed at /play,
58:32 and the direct RSS feed at /rss on talkpython.fm.
58:36 We're live streaming most of our recordings these days.
58:39 If you want to be part of the show and have your comments featured on the air, be sure to
58:43 subscribe to our YouTube channel at talkpython.fm/youtube.
58:48 This is your host, Michael Kennedy.
58:49 Thanks so much for listening.
58:50 I really appreciate it.
58:51 Now get out there and write some Python code.
58:53 Bye.
58:54 Bye.