#480: Ahoy, Narwhals are bridging the data science APIs Transcript

Recorded on Tuesday, Sep 10, 2024.

00:00 If you work in data science, you definitely know about data frame libraries.

00:03 Pandas is certainly the most popular, but there are others such as cuDF, Modin, Polars,

00:09 Dask, and more. They're all similar, but definitely not the same APIs, and Polars is

00:14 quite different. But here's the problem. If you want to write a library that is for users of more

00:19 than one of these data frame frameworks, how do you do that? Or if you want to leave open the

00:24 possibility of changing yours after the app is built, you got the same problem. Well, that's what

00:29 Narwhals solves. We have Marco Gorelli on the show to tell us all about Narwhals. This is Talk

00:36 Python to Me, episode 480, recorded September 10th, 2024.

00:40 Are you ready for your host, please?

00:43 You're listening to Michael Kennedy on Talk Python to Me. Live from Portland, Oregon,

00:48 and this segment was made with Python.

00:50 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy.

00:58 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,

01:03 both accounts over at Fosstodon.org, and keep up with the show and listen to over nine years of

01:10 episodes at talkpython.fm. If you want to be part of our live episodes, you can find the live streams

01:15 over on YouTube. Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified

01:21 about upcoming shows. This episode is brought to you by WorkOS. If you're building a B2B SaaS app,

01:28 at some point your customers will start asking for enterprise features like SAML authentication,

01:32 SCIM provisioning, audit logs, and fine-grained authorization. WorkOS helps ship enterprise features

01:39 on day one without slowing down your core product development. Find out more at talkpython.fm/

01:45 workos. Marco, welcome to Talk Python to Me. Hi, thanks for having me. Hey, it's fantastic to

01:51 have you here. We talked a little bit on the socials and other places, but, you know, nice to

01:57 talk to you in person and about some of your projects. Yeah, nice to finally do it. I've been

02:00 listening to your shows for years, so it's a pleasure to be here. Yeah, that's really cool. It's

02:05 awesome when people who are listeners for a long time get to come on the show. I love it. So,

02:10 we're going to talk about narwhals and data science, data frame libraries, and basically

02:17 coming up with a way to write consistent code against all these different libraries, which I think

02:23 is an awesome goal, which is why I'm having you on the show, of course. Before we get to all that,

02:28 as you know, let's hear a little bit about yourself. Sure. So, yeah, my name is Marco. I work at a company

02:33 called Quansight Labs, which supports several open source projects and also offers some training and

02:40 consulting services. I live in Cardiff in Wales and have been at Quansight for about two years now.

02:47 Originally hired as a Pandas maintainer, but then shifted considerably towards some other projects,

02:54 such as Polars. And then in February of this year, I tried releasing this narwhals library as a bit of an

03:00 experiment. It's growing a bit faster than expected. Yeah, it's a really interesting project. Quansight is

03:08 quite the place. You know, I didn't really know about y'all before having people on the show from

03:14 there, but here's my experience. I reach out to, say, some of the Jupyter folks, whatever.

03:19 Let's have some of the Jupyter people on. Three or four people from Quansight

03:23 show up, and then, oh, let's talk about this other project. Another person from Quansight two weeks

03:28 later, and then you're from Quansight. And none of those connections were like, let me try to find

03:33 people from Quansight. I think you all are having a pretty big impact in the data science space.

03:37 That's cool.

03:37 That's true. Yeah, it is a bit amusing in the internal Slack channel. If you ask a question,

03:42 like, does anyone know how to do this? Someone will reply, oh yeah, let me ping this person

03:46 who's a maintainer of that library. And you're like, okay, well.

03:48 Exactly. I think we know how it works. Let's ask them.

03:53 Yeah. It's a big world, but also a small world in interesting ways.

03:57 Yeah.

03:58 How do you get into programming in the first place?

04:00 I think the first experience with programming I had was at university. So I studied math. I think

04:05 like you as well.

04:07 Yeah. Yeah. Yeah. That sounds really, yeah. Keep going. So far you're telling my story.

04:10 Yeah, sure. Although with my initial encounter, I can't say I particularly enjoyed it.

04:16 It was just having to solve some problems in MATLAB. I did find it kind of satisfying that if you

04:21 gave it instructions, it did exactly that. But I wouldn't say that I felt like naturally

04:27 talented or anything. I then really took to programming though, after I started a maths PhD and

04:34 dropped out because it wasn't really going anywhere. And once I went deeper into programming, then I

04:39 realized, okay, actually I do have some affinity for this topic. I do quite enjoy it. Yeah. I think

04:44 that's often the case. What was your PhD focus before you dropped out? It was meant to be,

04:48 well, some applied mathematics like stochastic partial differential equations. But you know,

04:53 in academia, you publish or perish. And I wasn't publishing and didn't really see that changing.

04:59 So I had to make a bit of a pivot.

05:01 I imagine you made a pretty good choice. Just guessing. I mean, I love math, but the options are

05:06 just so much broader outside of academia. In hindsight, yeah, I kind of wish that somebody

05:12 at the time had told me that I could have still had a really interesting and rewarding career outside

05:17 of academia. And I shouldn't have stressed myself out so much about trying to find a PhD or about

05:22 having to complete it when I had already started it.

05:25 The secret, I think, about being a good programmer is it's kind of like grad school. Anyway,

05:31 you're constantly studying and learning. You feel like you've figured something out. It's like,

05:35 well, that's changed. Now onto the next thing. You're kind of like, well, we figured out pandas.

05:39 Now we've got Polars. Okay, well, we're going to start over and figure out how to use that well.

05:42 Right. So that is true. You need to do a lot of learning, a lot of self-directed learning

05:47 in particular. It's really stimulating, I must say. It is. It's great if you want that. If you want to

05:53 just do the nine to five, you don't need to stress about... Well, look, I think there are actually options there.

05:57 We'll get to Narwhals in a second. But I think there are options there. I think if you want to do

06:01 COBOL, FORTRAN, some of these older programming languages where so much of the world depends on

06:08 them, but nobody wants to do them. You could totally own that space and make really good money if you

06:14 didn't want to learn anything. But where's the fun in that, right?

06:16 Yeah. Yeah.

06:19 It'd be nice if we could all get rid of our legacy systems, but you know,

06:22 this stuff does power the world.

06:23 They're there for a reason, right? That works. I would really like it if you don't touch it, please.

06:29 But that's not the way it is with Narwhals. Let's start with an overview of what Narwhals is and

06:37 why you created it. And then I want to talk a bit about some of the data science libraries

06:41 before we get too much deeper. What is Narwhals? A Narwhal is a cool whale as far as I know. It's

06:46 like the unicorn of the sea, basically. What is this library?

06:50 Yeah, exactly.

06:51 So it's intended as a compatibility layer between different data frame libraries. So Narwhals does

06:57 not do any computation itself. It's more of just a wrapper around different data frame APIs. And I like

07:05 the Polar's API. So I figured that I should keep it fairly close to the Polar's API and in particular,

07:10 to Polar's expressions. As to why Narwhals? So I was just getting frustrated with the fact that

07:17 let's say about a year ago, there were relatively few libraries that supported Polar's. And if libraries

07:24 did support Polar's, it was often just done by converting to Pandas or converting to PyArrow.

07:30 Yes, a lot of these libraries, they weren't doing anything that complicated with data frames.

07:34 A lot of data frame consuming libraries, they don't really want to do that much. They want to select

07:39 columns. They want to select rows. Maybe they want to do some aggregations. Like they're not doing

07:44 stuff that's completely wild. And so trying to design some minimal compatibility layer, I think,

07:51 is a lot easier than trying to make a full blown data frame API that end users are meant to use.

07:57 So the idea with Narwhals is this is a tool for tool builders. If library maintainers want to support

08:03 different data frame libraries as inputs with minimal overhead and with minimal maintenance required

08:09 on their side, this is the problem we're trying to solve. It's a great problem to solve because

08:15 maybe you want to have a library that works with an abstract concept of a data frame. But usually I

08:21 would imagine you have to start out and say, are we going to go and support Polar's? Are we going to

08:26 support Pandas? And the APIs are different, not just the APIs, but the behaviors, for example, the

08:32 the lazy execution of Polar's versus the eager execution of Pandas. And so being able to just

08:38 write your library so it takes you there is probably a big hassle, right? Because it's just

08:43 you kind of have to have almost two versions each step, right?

08:46 Yeah, exactly. Well, I actually heard from a maintainer recently who was saying that he was

08:51 interested in using Narwhals even just to have Pandas as a dependency because Pandas,

08:57 the API changes a bit between versions. And he was getting a bit tired of Pandas API changes and was

09:05 like, okay, well, if we can just defer all of the version checks and API differences to an abstraction

09:11 there that might even simplify our life, even if we're just interested in supporting Pandas.

09:16 Yeah, that's cool. Yeah, actually, that's an interesting idea. It's just like, we'll have

09:20 a compatibility layer just in case. And Pandas went from one to two on major version recently,

09:27 which is a big deal and switched to PyArrow and all that, right?

09:30 Yeah, so version two, it was sometime last year, I think, 2023. So yeah, the PyArrow,

09:38 I think there were some misconceptions around that. So as we're live on air, let's take the chance to

09:43 address some PyArrow misconceptions. PyArrow in Pandas currently is optional and it'll probably stay

09:50 optional for quite a while. So there is some talk about in version three using PyArrow strings

09:56 instead of the classical NumPy object strings. By default, if people have PyArrow installed,

10:02 it's not totally decided. It's not totally set in stone whether PyArrow will be a required

10:07 dependency and maybe Pandas version four will have it as a required dependency and it will be the

10:13 default everywhere. But that's a few years away.

10:15 Yeah, maybe. Maybe we'll get Python four as well. You never know. You know, it's interesting. I think

10:22 the data science space more than many other areas has this ability to run Python in more places,

10:28 right? For example, there's Pyodide, there's Jupyter Lite. There's a lot of more constrained

10:35 environments that it might go in. And I don't know what the story with PyArrow and WASM and all

10:40 these different things. You still get benefits there. But there's a lot to consider.

10:44 Yeah, totally. And I think that's one reason why some library maintainers are really drawn to a

10:50 lightweight compatibility layer like narwhals. With narwhals, you say you don't need any dependencies.

10:56 You need narwhals, but that's just a bunch of Python files. Like if you wanted to, you could even

10:59 just vendor narwhals. Like it's not that big of a deal, but there's no extra dependencies required.

11:05 Like pandas users don't need polas installed and polas users don't need pandas installed. So if

11:10 you're trying to deploy to a constrained environment where package size is limited, like if a library

11:18 has narwhals as a required dependency, as opposed to any big data frame library, and then the user can

11:24 just bring their own data frame, then like this, we're really minimizing the number of installation and

11:29 dependency hell issues that people might run into. I think you've covered dependency hell on the show

11:34 a few times before.

11:35 Yeah, indeed. I think one thing that's interesting for people out there listening,

11:41 we'll talk about the different libraries that it works with right now. But if you have a library

11:45 out there and you're listening, there's not too much work to integrate it into or make it

11:50 narwhal compatible, narwhalification of a library to let it do this interchange, right?

11:56 So I think I can interpret your question in a couple of ways. So I'll just play them back and

12:01 let's just see.

12:02 Let's do it.

12:04 So one is if you're a library that consumes data frames.

12:06 So yeah, there's some examples there on the readme of who's adopted narwhals, so like Altair is the most recent.

12:12 - Right.

12:13 - Probably the most famous one.

12:14 - I think some news about Altair and Narwhals together

12:19 is actually how I heard of Narwhals.

12:21 - Oh, okay, yeah.

12:22 So yeah, and how complicated that is really depends on how complicated the data frame operations

12:29 this library is doing.

12:30 In the case of Altair, they weren't doing anything that was that crazy.

12:34 They needed to inspect the data types, select some columns, convert date times to strings,

12:41 get the unique categories out of categoricals.

12:44 Like it wasn't that bad.

12:46 So I think within a few weeks we were able to do it.

12:50 Same story with scikit-lego.

12:52 There's some other libraries that have reached out that have shown interest where it's going to be

12:56 a bit of a heavier lift, but it's generally not as bad as I thought it was going to be when I started the project.

13:04 - The other side of, yeah, the other way that I think I might have interpreted your question is

13:07 how difficult is it for a new data frame library to become narwhals compatible?

13:11 - Yes.

13:11 - And there's a couple of ways that they can go about doing that.

13:14 One of those is that they can go about

13:18 adding their library as a backend in narwhals.

13:22 - We love open source, but I don't consider myself an open source absolutist.

13:27 I understand that not everything can be open sourced.

13:29 And so if somebody has a closed source solution, we do have an extensibility mechanism within

13:34 narwhals such that somebody just needs to implement some Dunder methods.

13:38 - And then if they pass the data frame into a library that's been narwhalified, then narwhals

13:43 will know how to glue things together and they'll be able to still support this closed source

13:47 solution without it needing to go out into the open.

13:50 - Right.

13:51 It's kind of something like inheriting from a class and implementing some functions and

13:56 then it, it knows, right?

13:57 - Yeah, exactly.

13:58 - Yeah.

13:59 Yeah.

14:00 Cool.
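As a rough sketch of that dunder-based opt-in (the method name below follows what Narwhals' extension docs describe, but treat the exact protocol details as an assumption), a closed-source data frame can advertise compatibility without publishing its internals:

```python
# Hypothetical closed-source data frame opting in to Narwhals support.
# Narwhals-compatible tools probe for a dunder such as
# __narwhals_dataframe__ and call it to get an object implementing the
# expected interface.
class ClosedSourceFrame:
    def __init__(self, data):
        self._data = data  # e.g. {"a": [1, 2, 3]}

    def __narwhals_dataframe__(self):
        # Return the compliant wrapper; here we return self for brevity,
        # pretending this class implements the required methods.
        return self

    @property
    def columns(self):
        return list(self._data)


def narwhalified_tool(df):
    # What a narwhalified library does, conceptually: duck-type on the
    # dunder rather than on any concrete class.
    if hasattr(df, "__narwhals_dataframe__"):
        return df.__narwhals_dataframe__().columns
    raise TypeError("unsupported data frame type")
```

This is Michael's "implement some functions and then it knows" analogy: no inheritance from Narwhals itself is needed, only the agreed-upon methods.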

14:01 So right now it has full API support for cuDF, C-U-D-F.

14:04 I'm guessing that's CUDA data frame library.

14:07 - Yeah.

14:08 I'm not totally sure how we're supposed to pronounce it.

14:10 I call it.

14:11 - I don't know.

14:12 - Yeah.

14:13 That came out of the RAPIDS team at NVIDIA.

14:14 It's like an accelerated version of pandas on GPU.

14:17 Yeah.

14:18 That's been quite a fun one.

14:19 - Nice.

14:20 - Yeah.

14:21 That's pretty wild.

14:22 - Yeah.

14:23 - Then.

14:24 - The API is quite similar to pandas, but it's not exactly the same.

14:26 So we have to do a bit of working around.

14:28 - Right, right.

14:29 Because graphics cards are not just regular memory and regular programs.

14:33 They're, they're weird, right?

14:34 - That, yeah, that's part of it.

14:36 So there's some parts of the pandas API, which they intentionally don't support.

14:39 And there's another part of it is just that the pandas API is so extensive.

14:44 That's, it's just a question of resources.

14:46 Like it's pretty difficult to reimplement 100% of the pandas API.

14:49 but Modin does attempt to do that.

14:52 Modin bills itself as a drop-in replacement for pandas.

14:56 In practice, I think they do have a section in their docs where they

15:01 do mention some gotchas, some slight differences, but that's the idea.

15:06 They've kind of got their own intermediate representation, their algebra,

15:11 which they've published a paper about, which they then map onto the pandas API.

15:17 A pretty interesting project.

15:18 That was a lot easier to support.

15:20 The way they mimic the pandas API is a lot closer,

15:23 but it's been interesting.

15:24 Like with Narwhals, we did find a couple of minor bugs in Modin just by running our test

15:28 suite through the different libraries, which we then reported to them and they fixed very

15:32 quickly.

15:33 That's pretty awesome.

15:34 Yeah.

15:36 That's super awesome.

15:37 So I was going to ask about things like Dask and others, which are sort of themselves

15:50 extensions of pandas.

15:52 But if you support Modin, you're kind of, through one more layer, supporting Dask and...

15:57 Oh, but it's, it's better.

15:58 We don't, yeah, it's better.

16:00 We don't have this on the readme yet, but we do have a level of support for Dask.

16:04 We've not quite, I've not quite put it on the readme yet because we're still kind of

16:09 defining exactly where the boundaries are.

16:11 but it's, it's going to be some kind of a partial lazy only layer of support.

16:17 And it's actually quite a nice way to run Dask.

16:19 Like when you're running Dask, there are some things which do trigger compute for you.

16:23 There are some things which may trigger index repartitioning.

16:26 I think that's what it's called.

16:28 And in Narwhals, we've just been extremely careful that if you're able to stick to the

16:32 Narwhals API, then what you're doing is going to be performant.

16:36 Awesome.

16:37 Yeah, that's super cool.

16:38 So one thing I think worth maybe pointing out here is you talked about Pandas, Pandas 1, Pandas 2,

16:44 and it being an extensive API.

16:46 I mentioned the eager versus lazy computation, but these two, these two libraries are maybe some

16:53 of the most popular ones, but they're pretty different in their philosophy.

16:57 So maybe just, could you just quick compare and contrast Polars versus Pandas?

17:01 Yeah, sure.

17:02 Indirectly, Dask and so on.

17:04 Yeah, sure.

17:05 So, well, Pandas started a lot earlier, I think in 2008, maybe first released in 2009 and originally

17:14 really written heavily around NumPy.

17:16 You can see this in the classical Pandas NumPy data types.

17:21 So the support for missing values is fairly inconsistent across types.

17:25 So you brought up PyArrow before.

17:27 So with the PyArrow data types, then we do get consistent missing value handling in Pandas.

17:32 But for the classical NumPy ones, we don't.

17:34 Polars started a lot later.

17:36 It didn't have a lot of backwards compatibility concerns to worry about.

17:42 So it could make a lot of good decisions upfront.

17:45 It's generally a lot stricter than Pandas.

17:47 And in particular, there's a lot of strictness and the kinds of ways it lets you interact with its objects.

17:55 So in Pandas, the way we interact with data frames is we typically extract a series as one dimensional objects.

18:02 We then manipulate those series.

18:04 Maybe we put them back into the original data frame, but we're doing everything one step at a time.

18:09 In Polars, the primary way that we interact with data frames is with what you've got there on the screen.

18:16 A pl.col, a, b, these are called expressions.

18:19 And that expression, my mental model for it is just a function.

18:22 It's a function from a data frame to a series.

18:25 And being a function...

18:26 Almost like a generator or something, huh?

18:27 Yeah, kind of, yeah.

18:28 So although I think when you say generator, like in Python, a generator, it's at some point you can consume it.

18:33 Like you can type next on the generator and it produces a value.

18:36 But an expression doesn't produce a value.

18:38 It's like if you've got lambda x, x times two.

18:40 Yeah.

18:41 It doesn't produce a value until you give it an input.

18:44 And similarly, an expression like pl.col, a, b, by itself, it doesn't do anything.

18:49 The interpretation is, given some data frame df, I'll return you the columns a and b.

18:54 So it only produces those columns once you give it some input data frame.

18:58 And functions, just by their very definition, are lazy, kind of.

19:03 Like you don't need to evaluate them straight away.

19:05 And so Polars can take a look at all of the things you want to do.

19:08 It can recognize some optimization patterns.

19:11 It can recognize that maybe between some of your expressions, there are some parts that are repeated.

19:16 And so instead of having to recompute the same thing multiple times, it can just compute it once and then reuse that between the different expressions.

19:23 Yeah, that's one of the big capabilities of Polars: it has kind of a query engine optimizer in there.

19:32 Whereas pandas, because it's not lazy, it just does one thing, then the next, then the next.

19:36 But maybe if you switch the order, like first filter and then compute versus compute and then filter, you might get a way better outcome, right?

19:44 That's a massive one.

19:45 Yeah.

19:46 So when I was doing some benchmarking, we brought up cuDF earlier.

19:49 So that's the GPU accelerated version of pandas.

19:52 And that is super fast if you're just doing single operations on a one at a time in a given order.

19:57 However, there are some benchmarks where maybe you're having to join together multiple data frame, sorry, multiple data frames.

20:05 And then you're only selecting certain rows.

20:06 At that point, it's actually faster to just do it on a CPU using a lazy library like Polars, because Polars can do the query optimization.

20:14 It can figure out that it needs to do the filter and only keep certain rows before doing five gigantic joins.

20:20 Whereas cuDF, it's super fast on GPU, but it is all eagerly executed.

20:25 It did way more work, but it did it really fast.

20:27 So it was about the same in the end.

20:28 Yeah.

20:29 But now in Polars, there's going to be GPU support and it's going to be query optimized GPU support.

20:36 I don't know if the world is ready for this level of speed.

20:39 Yeah, that's going to be interesting.

20:42 This portion of Talk Python to Me is brought to you by WorkOS.

20:46 If you're building a B2B SaaS app, at some point, your customers will start asking for enterprise features like SAML authentication, SCIM provisioning, audit logs, and fine-grained authorization.

20:57 That's where WorkOS comes in with easy to use APIs that will help you ship enterprise features on day one without slowing down your core product development.

21:06 Today, some of the fastest growing startups in the world are powered by WorkOS, including ones you probably know like Perplexity, Vercel, and Webflow.

21:15 WorkOS also provides a generous free tier of up to 1 million monthly active users for AuthKit, making it the perfect authentication layer for growing companies.

21:24 It comes standard with useful features like RBAC, MFA, and bot protection.

21:30 If you're currently looking to build SSO for your first enterprise customer, you should consider using WorkOS.

21:36 Integrate in minutes and start shipping enterprise plans today.

21:40 Just visit talkpython.fm/workos.

21:43 The link is in your podcast player show notes.

21:45 Thank you to WorkOS for supporting the show.

21:47 I guess another difference, not a massive one, you know, in some ways it matters, some ways it doesn't, is Pandas is based on C extensions, right?

21:57 I'm guessing if I remember right.

22:00 And then Polars is Rust and even took the .rs extension for their domain, which is really embracing it.

22:07 But not that it really matters, you know, what your native layer is, if you're not working in that, right?

22:14 Like most Python people don't work in C or Rust, but it's still interesting.

22:18 Well, I think it, yeah, it is interesting, but also it can be useful for users to know this because Perlers has a really nice plugin system.

22:26 So you can extend Polars with your own little expressions, which you can write in Rust.

22:31 And the amount of Rust that you need to do this is really quite minimal.

22:35 Like if you try to write these Polars plugins as if you were writing Python and then just use some LLM or something to guide you.

22:42 I think realistically, most data scientists can solve 98% of their inefficient data frame usage by using Polars plugins.

22:49 So having a nice, safe language that you can do this, it really makes a difference.

22:54 I'm going to write it in Python and then I'm going to ask some LLM.

22:58 Right now I'm using LM Studio and I think Llama 3.

23:01 Anyway, ask it, say, okay, write this in Rust for me.

23:05 Write it as a Polars plugin.

23:07 Here we go.

23:08 All right.

23:09 Yeah, exactly.

23:10 I mean, it's crazy.

23:11 It's this new world we live in.

23:12 Yeah, yeah, totally.

23:13 I mean, like the amount of Rust knowledge you need to take care of some of the complicated parts in Polars is really advanced.

23:20 Really need to study for that and LLM isn't going to solve it for you.

23:23 But the amount of Rust that you need to just make a plugin to solve some inefficient function, I think that's doable.

23:31 Right. Yeah, exactly.

23:32 It's very different to say we're going to just do this loop in this function call here versus there rather than I'm going to write a whole library in Rust or C or whatever.

23:40 Exactly.

23:41 Yeah.

23:42 So there's a pretty different API between these.

23:45 And in Narwhals, it looks like you've adopted the Rust API, right?

23:50 A subset of it.

23:51 Is that right?

23:52 The Polars one.

23:53 Yes, exactly.

23:54 So I kind of figured.

23:55 Yeah, yeah.

23:56 That's what I mean.

23:57 Yeah.

23:58 The Polars one.

23:59 Yeah.

24:00 Rather than the Pandas API.

24:01 But to be honest, I found that trying to translate the Pandas API to Polars was fairly painful.

24:06 Like, Pandas has a bunch of extra things like the index, the multi-index, and it does index alignment on all the operations.

24:12 I just found it not a particularly pleasant experience to try to map all of this onto Polars.

24:18 However, when I tried to do the reverse of translating the Polars API to Pandas, it kind of just worked without that much effort.

24:25 And I was like, oh, wow, this is magic.

24:26 Okay, let's just take this a bit further.

24:28 Publish it on GitHub.

24:29 Maybe somebody would find a use case for it.

24:31 I don't know.

24:32 Yeah, that's great.

24:33 Out in the audience.

24:34 ZigZackJack asks, how is Narwhals different from Ibis?

24:38 All right.

24:39 The number one most common question.

24:41 Love this.

24:42 Is it?

24:43 Okay, great.

24:44 Yeah.

24:45 I'll give you a little bit of context for the listeners on what Ibis is.

24:48 So Ibis, yes, you can see there on the screen, they describe themselves as the portable data frame library.

24:52 So Ibis is really aiming to be a data frame library, just like Pandas, just like Polars.

24:58 But it's got this API, which can then dispatch to different backends.

25:02 The default one is DuckDB, which is a really powerful embedded analytics database.

25:08 I think you covered it on the show.

25:09 In fact, I think I might have first heard about DuckDB on Python Bytes.

25:13 So listeners, if you want to stay up to date, subscribe to Python Bytes.

25:18 Thank you.

25:19 Yeah.

25:20 One of the shows I almost never miss.

25:21 So yeah, I think the primary difference between Narwhals and Ibis is the target audience.

25:27 So with Ibis, they're really trying to be this full-blown data frame library that people can use to do their analyses.

25:33 Whereas with Narwhals, I'm openly saying to end users, like if you're an end user, if you're a data scientist, if you're an ML engineer, if you're a data analyst, don't use Narwhals.

25:42 Like, it's a tool for tool builders.

25:45 Like, learn Polars, learn DuckDB, learn whatever the best tool is for your particular task, and learn it well, and master that, and do your analyses.

25:55 On the other hand, if you're a tool builder, and you just need to do some simple operations with data frames, and you want to empower your, if you want to enable your users to use your tool, regardless of which library they're starting with, then Narwhals can provide a nice bridge between them.

26:11 Interesting.

26:12 Is there any interoperability between Ibis and Narwhals?

26:17 We do have some level of support for Ibis.

26:19 And at the moment, this is just interchange-level support, in the sense that if you pass an Ibis data frame, then you can inspect the schema, not do much else.

26:30 But for the Altair use case, that's all they needed.

26:33 Like, they just wanted to inspect the schema, make some decisions on how to encode some different columns.

26:39 And then, depending on how long your data frame is, they might convert to PyArrow and dispatch to a different library called VegaFusion.

26:46 Or they might just do everything within Altair.

26:49 But we found that even just having this relatively minimal level of support for Ibis, Vaex, DuckDB, and anything else, anything that implements the data frame interchange protocol, was enough to already solve some problems for users of these libraries.

27:04 Yeah.

27:05 Okay.

27:06 Very interesting.

27:07 Let's see.

27:08 We'll hit a few more of the highlights here.

27:10 100% test coverage.

27:12 You already mentioned that you found some bugs in which library was it?

27:17 Modin.

27:18 Yeah, yeah, that's right.

27:19 I think all of them.

27:20 I think it helps uncover some rough edge cases in all of the libraries that we have some support for.

27:26 You write a library and you're going to say, I'm going to try to behave like you do.

27:29 And I'll write some tests around that.

27:31 And then when you find the differences, you're like, wait a minute, right?

27:34 Yeah, exactly.

27:35 Also really love to see the let your IDE help you thanks to static typing.

27:40 We'll definitely have to dive into that in a bit as well.

27:43 That looks awesome.

27:44 Cheers.

27:44 Yeah, huge fan of static typing.

27:46 You know, it's a bit of a controversial topic in some Python circles.

27:48 Some people say that it's not really what Python is meant for and that it doesn't help you prevent bugs and all of that.

27:54 And I can see where these people are coming from.

27:56 But when I've got a statically typed library and my IDE is just always popping up with helpful suggestions and doc strings and all of that, then that's when I really appreciate it.

28:06 Exactly.

28:07 Like, forget the bugs.

28:08 If I don't have to go to the documentation because I hit dot and it's immediately obvious what I'm supposed to do, that's already a win.

28:15 Right.

28:16 And typing gives you that.

28:17 Plus it gives you checking.

28:18 Plus it gives you lots of other things.

28:19 I think it's great.

28:20 And especially with your focus on tool builders, tool builders can build tools which have typing.

28:25 They can build better tools using your typing, but they don't have to, because it's optional.

28:29 It's not really forced upon any of the users.

28:32 The only libraries that I can think of that really force typing on their users are Pydantic and FastAPI and a couple of others like Typer that have behavior driven by the types you put.

28:44 But if you're using that library, you're choosing that as a feature, not a bug.

28:48 Right.

28:49 Yeah, exactly.

28:50 Yeah.

28:50 So awesome.

28:51 Awesome.

28:52 And then finally, sticking with the focus on tool builders, perfect backwards compatibility policy.

28:57 What does this mean?

28:58 This is a bit of an ambitious thing.

29:00 So when I was learning Rust, I read about Rust editions.

29:04 So the idea is that when you start a Rust project, you specify the edition of Rust that you want to use.

29:11 And even as Rust gets updated, if you write some project using the 2015 edition of Rust, then it should keep working essentially forever.

29:21 So they keep this edition around.

29:22 And if they have to make backwards incompatible changes, there's new editions like 2018, 2021 editions.

29:28 So this is kind of what we're trying to do.

29:30 Like, the idea was, well, we're kind of mimicking the Polars API.

29:34 I think there was a bracket I opened earlier, which I might not have finished, which was that the third choice we had was to make an entirely new API.

29:40 But I thought, well, better to give, to do something that people are somewhat familiar with.

29:44 Yeah, I think that's a great choice.

29:46 Yeah.

29:47 When you go and write the code, half of the people will already know Polars.

29:50 And so they just keep doing that.

29:52 You don't have to go, well, here's a third thing you have to learn, right?

29:54 Yeah.

29:55 I'd like to think that by now half the people know Polars.

29:59 Unfortunately, I think we might not quite be there yet, but it is growing.

30:03 No, I think so too.

30:04 Yeah.

30:05 I think we'll get there.

30:06 So yeah, it's okay.

30:07 We're kind of mimicking a subset of the Polars API and we're just sticking to the fundamentals.

30:13 So that part should be relatively stable, but at some point, presumably Polars is going to make a backwards incompatible change.

30:19 And at that point, what do we do in Narwhals?

30:22 What do we do about the top level Narwhals API?

30:24 And coordinating changes between different libraries, it's going to get tricky.

30:30 And the last thing that I want to do is see people put upper bound constraints on the Narwhals library.

30:35 I think upper bound constraints on something like this should never really be necessary.

30:40 So we've tried to replicate what Rust does with its editions.

30:44 The idea is that we've got a stable V1 API.

30:47 We will have a stable V2 API at some point if we need to make backwards incompatible changes.

30:52 But if you write your code using the V1 stable Narwhals API, then even as new Narwhals versions come out, even as the main Narwhals namespace changes, even as we might introduce V2, then your code should in theory keep working.

31:09 Like V1 should stay supported indefinitely.

31:12 This is the intention.

31:13 Yeah, you said see the stable API for how to opt in.

31:17 So how do you, what, what, I'm just curious what the mechanism is.

31:21 So for example, import narwhals.stable.v1 as nw, instead of the standard Narwhals import.

31:27 Yeah, exactly.

31:28 So instead of...

31:29 Yeah, instead of import narwhals as nw, you'll do import narwhals.stable.v1 as nw.

31:35 Yeah, I encourage people, when they're just trying it out, prototyping, use import narwhals as nw.

31:41 If you want to make a release and future-proof yourself, then switch over to the stable.v1.

31:47 This is a little similar to the API.talkpython.fm/v1/whatever versus, you know, where people encode a different version in their API endpoints, basically.

32:01 Yeah, yeah.

32:02 In import statements.

32:03 I like it.

32:04 Great.

32:05 Yeah, let's see how this goes.

32:06 Yeah, exactly.

32:07 Now it's good.
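The versioned-namespace idea can be sketched in plain Python. This is a toy illustration, not Narwhals' actual code; the `concat` functions and their behaviors are invented to show how a frozen v1 namespace shields old callers from a later change:

```python
from types import SimpleNamespace

# Frozen v1 behavior: stays the same forever.
def _concat_v1(seqs):
    return [x for seq in seqs for x in seq]

# Hypothetical later change in the main namespace: returns a tuple instead.
def _concat_main(seqs):
    return tuple(x for seq in seqs for x in seq)

stable_v1 = SimpleNamespace(concat=_concat_v1)  # import narwhals.stable.v1 as nw
main = SimpleNamespace(concat=_concat_main)     # import narwhals as nw

# Code written against the v1 namespace keeps working unchanged,
# even though the main namespace has moved on.
print(stable_v1.concat([[1], [2]]))  # [1, 2]
print(main.concat([[1], [2]]))       # (1, 2)
```

The real mechanism is a versioned module rather than a namespace object, but the effect is the same: opting into v1 pins the behavior your code was written against.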

32:08 So just back to typing real quick.

32:10 Pamphile Roy out there says, "A lot of open source maintainers complain about typing because if you want to make it really correct, it's painful to add."

32:19 That can be true.

32:20 And sometimes, you know, the last 1% is some insane type statement.

32:24 But it's so helpful for end users.

32:26 Yeah.

32:27 True, yeah.

32:28 You mentioned earlier that everyone seems to be at Quansight.

32:30 Do you know where I met Pamphile?

32:32 At Quansight?

32:33 At Quansight, yes.

32:34 Amazing.

32:35 See?

32:36 It continues to happen.

32:37 Yeah, exactly.

32:38 But yeah, I think that totally sums it up for me as well.

32:43 You know, it's really great to be using libraries that give you those options.

32:47 You know, we do have the .pyi files and we have typeshed and all of that, where people can kind of put typing on things from the outside that didn't want to support it.

32:56 But if it's built in and part of the project, it's just better, you know?

33:00 Yeah.

33:01 If you have it from day one, it works well.

33:02 I mean, trying to add types to a library that started without types like Pandas, it's fairly painful to be honest.

33:08 I bet it is.

33:11 Yeah, really cool.

33:12 All right.

33:13 Let's go and talk through, I guess, a quick shout out.

33:16 Just last year I had Ritchie Vink, who's the creator of Polars, on Talk Python.

33:21 If people want to check that out, they can certainly have a listen to that.

33:26 I also just recently had Wes McKinney, the creator of Pandas on, and I'll link to those shows if people want to like dive into those.

33:33 But let's talk a little bit through your documentation.

33:36 It tells a really good story.

33:37 I like what you put down here as, you know, it's not just here's your API and stuff, but it walks you through.

33:43 So we talked about why obviously install, pip install, it's pure Python with a pure Python wheel, right?

33:50 Yeah, exactly.

33:51 Shouldn't be any issues with installation.

33:53 Is it WASM compatible?

33:54 Do you know?

33:55 Like, could I use it on PyScript, Pyodide?

33:57 I don't know.

33:58 Are there any restrictions that they need?

34:01 There's some restrictions.

34:03 For example, I don't think you can do threading.

34:05 I don't think you can use some of the common third-party HTTP clients, not that you have any dependencies, because requests have to go through the browser's Ajax layer.

34:14 There's some, but not terribly many restrictions.

34:17 I'd imagine then that we would only be limited by whichever data frame people are passing in.

34:21 Yeah.

34:22 Yeah.

34:22 Awesome.

34:23 Okay.

34:24 That's super nice.

34:25 And maybe let's just talk through a quick example here.

34:28 Keep it in mind that most people can't see any of the code, but let's just give them a sense still of what does it look like to write code that is interoperable with both or all these different libraries, these data frame libraries using narwhals.

34:42 So maybe give us just an example.

34:45 Sure.

34:46 So the idea is what we can see on the screen is just a very simple example of a data frame agnostic function.

34:52 We've got a function called my function, and this is something that users could maybe just use.

34:57 Maybe it's something your library exposes, but the user doesn't need to know about narwhals.

35:01 The narwhals only happens once you get inside the function.

35:05 So the user passes in some data frame.

35:07 We then call narwhals.from_native on that data frame object.

35:11 We do some operation and then we return some native object back to the user.

35:16 Now the narwhals.from_native, it's a practically free operation.

35:19 It's not doing any data conversion.

35:21 It's just instantiating some narwhals class that's backed by your original data frame.
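The "practically free" wrapping can be sketched like this. The classes below are a toy stand-in that uses dicts of lists as the "native" backend, not Narwhals' real implementation:

```python
class AgnosticFrame:
    def __init__(self, native):
        self._native = native  # zero-copy: just hold the native object

    def select(self, *cols):
        # a real implementation would dispatch to the backend's own API;
        # here the "backend" is a plain dict of lists
        return AgnosticFrame({c: self._native[c] for c in cols})

    def to_native(self):
        return self._native

def from_native(obj):
    # no data conversion, just instantiate a wrapper backed by the original
    return AgnosticFrame(obj)

df = from_native({"a": [1, 2], "b": [3, 4]})
print(df.select("a").to_native())  # {'a': [1, 2]}
```

Because the wrapper only holds a reference, wrapping and unwrapping cost a couple of attribute accesses, regardless of how big the data is.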

35:26 Right, right.

35:27 And I imagine if it's a Polars data frame that gets passed in, it's probably a more direct pass-through to the API than if you're doing operations on a pandas data frame, right?

35:38 Is there a difference of sort of runtime depending on the back end?

35:42 The overhead is really low even for the Pandas case.

35:45 In fact, sometimes things do get a little bit faster because of how careful we've been about avoiding index operations and unnecessary copies.

35:55 To be honest, some of this will be alleviated in Pandas version 3 when copy on write becomes the default.

36:01 Oh, that's interesting. Yeah.

36:03 Yeah.

36:04 In terms of the mapping on the implementation side, it's a bit easier to do the Polars backend.

36:08 But even then we do need to do some version checks. Like in 0.20.4, they renamed with_row_count to with_row_index, I think.

36:17 And so, yeah, even there we do need some if-then statements.
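The version-check pattern being described can be sketched like so. The frame classes are dummies, and the cut-off version is taken from the with_row_count/with_row_index example above:

```python
# Two dummy backends: an old one with with_row_count, a new one with
# with_row_index (Polars renamed the method around 0.20.4).
class OldBackendFrame:
    def with_row_count(self):
        return "used with_row_count"

class NewBackendFrame:
    def with_row_index(self):
        return "used with_row_index"

def add_row_index(df, backend_version):
    # dispatch on the backend's version, as described above
    if backend_version >= (0, 20, 4):
        return df.with_row_index()
    return df.with_row_count()

print(add_row_index(OldBackendFrame(), (0, 19, 0)))  # used with_row_count
print(add_row_index(NewBackendFrame(), (0, 20, 4)))  # used with_row_index
```

Comparing version tuples keeps the branch cheap: a couple of comparisons per call, which is the "few extra function calls, a few checks on versions" overhead mentioned next.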

36:20 But like the end of the day, what the library does is there's a few extra function calls, a few checks on versions.

36:28 It's not really doing that much.

36:30 Yeah.

36:31 Like you might experience an extra millisecond compared to running something natively at most.

36:35 And usually you're using a data frame because you have some amount of data, even hundreds of rows.

36:41 It's still most of the computation is going to end up there rather than if it's version this, call that, otherwise call this, right?

36:47 That's not a lot of overhead, relatively speaking.

36:50 I agree.

36:51 Yeah.

36:52 So, yeah, we see an example here of a data frame agnostic function, which just calculates some descriptive statistics from an input data frame using the expressions API, which we talked about earlier.

37:02 Yeah.

37:03 And here's something that I quite like about mkdocs.

37:05 So you see where it says, let's try it out.

37:07 We've got these different tabs and you can click on like polars, pandas, polars lazy.

37:13 And then you can see in each case what it looks like from the user's point of view.

37:17 And you can see, you can compare the outputs.

37:20 So from the user's point of view, they're just passing their object to funk.

37:24 What they're not seeing is that under the hood funk is using narwhals.

37:28 But from their perspective, they put pandas in, they get pandas out.

37:31 They put polars in, they get polars out.

37:33 That's awesome.

37:34 So we talked about the typing.

37:37 And in this one, we have a df typed as FrameT.

37:41 Is that some sort of generic, and does it have restrictions on it?

37:45 What is this FrameT?

37:46 I didn't dive into the source and check it out before.

37:49 Sure.

37:50 Yeah, it's a type.

37:51 So it's just the idea that you start with a data frame of some kind and you get back some data frame of the same kind.

37:57 Start with polars, get back polars, start with pandas, get back pandas and so on.

38:02 And yeah, this version of the function is using the decorator nw.narwhalify.

38:08 Narwhalify, it's a fantastic verb.

38:10 So yeah, so there's two ways in which you can implement your function.

38:17 You can do it the explicit way where that's in the quick start and the docs where you write your function that takes some frame, some native frame.

38:26 And then you convert that to this narwhals one.

38:30 You say from_native, then you do your work.

38:32 And then depending on, you could convert it back.

38:35 Or in this case, it returns a list of strings in that example.

38:38 Or you can skip the first and the last step and just put this decorator on it and it'll convert it to or wait, convert it from and then convert it to on the way in and out, right?

38:47 Yeah, exactly.

38:48 So if you're really strict about type annotations, then using from_native and to_native gives you a little bit of extra information.

38:57 But I think narwhalify looks a little bit neater.
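What a narwhalify-style decorator does can be sketched in a few lines of plain Python. This is a toy version, not Narwhals' actual decorator: it wraps a dict-of-lists "frame" on the way in and unwraps on the way out:

```python
import functools

class Frame:
    def __init__(self, native):
        self.native = native

def toy_narwhalify(func):
    @functools.wraps(func)
    def wrapper(native_df):
        result = func(Frame(native_df))  # convert on the way in
        if isinstance(result, Frame):
            return result.native         # convert back on the way out
        return result                    # e.g. a list of strings passes through
    return wrapper

@toy_narwhalify
def double_a(df):
    return Frame({"a": [v * 2 for v in df.native["a"]]})

print(double_a({"a": [1, 2]}))  # {'a': [2, 4]}
```

The function body only ever sees the wrapped frame, which is exactly why the explicit from_native/to_native pair can be folded into a single decorator.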

38:59 Yeah, that's true.

39:00 So for example, in the first one, you could say that this is actually a pandas data frame because you're writing the code or something like that.

39:09 I don't know.

39:10 So IntoFrame, this is the type on this first example.

39:13 Yeah, by IntoFrame, we mean something that can be converted into a narwhals data frame or lazy frame.

39:20 How do you implement that in the type system?

39:22 Is it a protocol or what is this?

39:25 Yeah, we've got a protocol.

39:27 So I just found some methods that these libraries have in common.

39:31 Exactly.

39:32 Yeah, if you can find that.

39:34 That's what I was thinking.

39:35 Yeah.

39:36 Yeah, but if it has enough of the functions of pandas or pollers, you're like, all right, this is probably good.

39:42 All right.

39:43 And you can say it's one of these.

39:44 That's pretty cool.

39:45 Yeah, exactly.
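A structural check like the one being discussed can be expressed with typing.Protocol. As a simplification, the sketch below uses a single method, __dataframe__, which is the data frame interchange protocol's entry point; the real protocol covers more methods:

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class SupportsInterchange(Protocol):
    def __dataframe__(self, *, allow_copy: bool = True) -> Any: ...

class FakeFrame:
    # any object with this method structurally satisfies the protocol
    def __dataframe__(self, *, allow_copy: bool = True):
        return self

# runtime_checkable lets isinstance() check for the method's presence
print(isinstance(FakeFrame(), SupportsInterchange))  # True
print(isinstance([1, 2, 3], SupportsInterchange))    # False
```

This is the "if it has enough of the functions" idea from the conversation: no registration or inheritance, just the right methods.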

39:46 I mean, if any of this is confusing to listeners, we do have a page there in the documentation that's all about typing.

39:51 So people can read through that at their own leisure.

39:54 Yeah, for sure.

39:55 All right.

39:56 Let's see.

39:57 I do like the mkdocs where you can have these different examples.

40:00 One thing I noticed is you've got the Polars eager evaluation and you've got the Polars lazy evaluation.

40:08 And when you have the Polars lazy, this function decorated with the narwhalify decorator, it itself returns something that is lazy and you've got to call collect on.

40:19 Right.

40:19 So it kind of preserves the laziness, I guess.

40:21 Is that right?

40:22 Yes, exactly.

40:23 This was something that was quite important to me, like, to not be something that only works well with eager execution.

40:30 I want to have some level of support such that lazy in means lazy out.

40:36 Yeah.

40:37 Eager in, eager out.

40:38 Lazy in, lazy out.

40:39 Okay.

40:40 Exactly.

40:41 Yeah.

40:42 So the way you do that in Polars, you create a lazy frame versus a data frame.

40:45 Right.

40:46 But then you've got to call collect on it, kind of like awaiting it, a bit more async, which is cool.

40:51 Yeah.

40:52 Or don't call collect or just wait until you really need to call collect.

40:55 Right.

40:56 Or pass it on to the next one and on to the next.

40:58 Yeah, exactly.

40:59 Exactly.

41:00 So one of the things that you talk about here is the pandas index, which is one of the key differences between Polars and pandas.

41:08 And you've classified pandas people into two categories.

41:12 Those who love the index and those who try to get rid of it and ignore it.

41:17 Yeah, exactly.

41:18 So if James Powell is listening, I think we can put him in the first category.

41:23 I think most, realistically, most pandas users that I've seen call .reset_index(drop=True) every other line of code.

41:30 They just find that the index gets in the way more than helps them most of the time.

41:35 And with Narwhals, we're trying to accommodate both.

41:38 So we don't do automated index alignment.

41:41 So this isn't something that you have to worry about.

41:43 But if you are really bothered about index alignment, say, due to backwards compatibility concerns, then we do have some functions which allow you to do that, which would be no-ops for other libraries.
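The "no-op for other libraries" pattern can be sketched like this. The function name and the hasattr check are illustrative, not the real code:

```python
def maybe_reset_index(df):
    # pandas-like inputs get their index reset; anything else passes
    # through untouched, so callers never need to branch themselves
    if hasattr(df, "reset_index"):
        return df.reset_index(drop=True)
    return df

# A non-pandas input is returned unchanged (a no-op):
print(maybe_reset_index([1, 2, 3]))  # [1, 2, 3]
```

Callers who care about the index get the pandas behavior; everyone else pays nothing and writes no special cases.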

41:56 So that's a good thing, I would say.

42:04 So this is an example in scikit-lego of where they were relying on pandas index alignment.

42:09 And if you're using other libraries, the data will just be passed through.

42:12 Right.

42:13 So you said they're pandas-like.

42:14 And pandas-like is actually a type in your type system, right?

42:18 Did I see that?

42:19 We've got, yeah, yeah.

42:20 So we've got an is_pandas_like_dataframe function to tell.

42:24 So by pandas-like, we mean pandas, cuDF, Modin.

42:28 So the libraries that have an index and follow those kinds of rules.

42:32 Yeah, yeah, that's really cool.

42:33 Yeah, because at the end of the day, like the idea of writing completely dataframe agnostic code is a lot easier for new libraries.

42:39 than for existing libraries that have backwards compatibility concerns.

42:42 And we recognize that it might not be completely achievable.

42:46 I think in all of the use cases where we've seen now it's adopted, they're doing most of it in a dataframe agnostic way.

42:52 But they do have some parts of it where they're saying, okay, if this is a pandas dataframe, we've got some pandas-specific logic.

42:58 And otherwise, let's go down the dataframe agnostic route.

43:01 Yeah, you also have here levels of support.

43:05 You have full and interchange.

43:06 I think we talked about that a little bit.

43:08 So maybe just point people here.

43:10 But this is if you want to be qdf or modin, you can fully integrate.

43:15 Or if you just want to have enough of an implementation that they can kind of work together, right?

43:21 You can do this dataframe interchange protocol.

43:23 Yeah, exactly.

43:24 Or just write to us and we'd be happy to accommodate you without you having to go through the dataframe interchange protocol.

43:31 Oh, yeah.

43:32 Very nice.

43:33 Okay.

43:34 You mentioned the overhead before, but you do have a picture.

43:36 Pictures are always fun.

43:37 And in the picture, you've got little different operations, different times for each of the operations.

43:42 And there's a quite small overhead for pandas versus pandas with narwhals.

43:47 Yeah, exactly.

43:48 Like in some of them, you can see it becoming a little bit faster.

43:51 In some of them, you can see it becoming a little bit slower.

43:53 Yeah.

43:54 And these are queries that I think are the size that you can expect most data scientists to be working with a lot of the time.

43:59 You've got queries that take between a couple of seconds to 30 seconds.

44:03 And there, it's pretty hard to distinguish reliably between like the blue and red dots.

44:08 Sometimes one's higher, sometimes the other one's higher.

44:11 There's a bit of statistical variance just between running the same benchmark multiple times.

44:15 But overall, yeah, we were pretty happy with these results.

44:18 Yeah, that's great.

44:20 So how well have we covered how it works?

44:23 We talked about the API, but I don't know.

44:25 Have we talked about the implementation, how you actually do this, why it's basically almost the same speed?

44:32 What are you doing, or not doing, to make that work?

44:34 Yeah.

44:35 Well, maybe.

44:36 Are you using underwater unicorn magic?

44:37 Is that what it is?

44:38 Yes.

44:39 That's the secret, yes.

44:40 Underwater unicorn magic.

44:41 Well, perhaps first I should just say why we wrote this how-it-works page.

44:45 And it's because really I want this to be a community driven project.

44:49 And this is one of those cases where open source is more of a social game than a technical one.

44:54 I'm not saying that's always the case.

44:55 There are many problems that are purely technical.

44:58 Narwhals is a social game in the end.

45:00 Like what we're doing isn't that complicated, but if we want it to work, then it needs to be accessible to the community.

45:06 People do need to be able to trust us.

45:08 And that typically does not happen if it's a one person project.

45:11 So it was really important to me that different people would be able to contribute to it, that it all be as simple and clear as possible.

45:19 So I made this page trying to explain how it works.

45:21 It's not quite as clear and as extensive as I'd like it to be.

45:25 But a few contributors did say that it really helped them.

45:28 So in terms of how do we get this low overhead?

45:31 So we're just defining an expression as being a function from a data frame to a sequence of series, and then we're just repeatedly and strictly applying that definition.

45:39 So there's nothing too fancy going on.

45:41 That's, like, in the end, just evaluating lambda functions in Python, going down the call stack; it's pretty fast.
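The definition just given, an expression as a function from a data frame to a sequence of series, can be modeled in a few lines of plain Python. This is a toy model: dicts of lists stand in for data frames, and (name, values) pairs stand in for series:

```python
# An "expression" is just a function: dataframe -> list of (name, values).
def col(name):
    return lambda df: [(name, df[name])]

# Wrapping an expression composes functions; nothing is evaluated until
# a dataframe is actually supplied.
def sum_(expr):
    def evaluate(df):
        return [(name, [sum(values)]) for name, values in expr(df)]
    return evaluate

df = {"a": [1, 2, 3], "b": [10, 20, 30]}
print(col("a")(df))        # [('a', [1, 2, 3])]
print(sum_(col("b"))(df))  # [('b', [60])]
```

Each layer is one more ordinary function call, which is why applying the definition strictly adds so little overhead.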

45:49 Yeah, that's really cool.

45:50 Yeah.

45:51 So people can check this out.

45:52 If they want to know. And I think this might be where I saw the pandas-like expression.

45:55 Right.

45:56 Yeah.

45:57 Pandas-like, it's this class that encompasses pandas, Modin, cuDF, the ones that kind of follow the pandas API.

46:02 Mm-hmm.

46:03 Yeah.

46:04 Close enough for what you need to do.

46:05 Yeah, exactly.

46:06 All right.

46:07 Well, I saw a question out there in the audience somewhere from Francesco, which was basically asking about the roadmap.

46:14 Like, where are you?

46:15 Where are you going?

46:16 Yeah, I should probably introduce Francesco.

46:18 He's one of the most active contributors to the project.

46:22 So thanks, Francesco, for helping to make it a success.

46:26 He was actually also the first person to adopt in one of his libraries.

46:31 Yeah, I spoke to him about it at a conference and he was like, I've got this tiny little time-based cross-validation library.

46:37 Let's try narwhalifying it as an experiment.

46:39 Sure, we did that then.

46:41 That's right.

46:42 Not scikit-learn, sorry.

46:44 Not scikit-lego.

46:45 It was this kind of experimental building blocks for scikit-learn pipelines that he maintains.

46:50 And then we've just been taking it from there.

46:52 So in terms of roadmap, my top priority is helping out libraries that have shown interest in narwhals.

47:00 So at the moment, Formulaic, that opened a draft pull request in which they were trying out narwhals and they tagged me just about some things they were missing.

47:08 So I'd like to see if I can take that to completion.

47:12 I think I've got most of it working, but just been a bit busy with conferences recently.

47:16 So maybe next month I'll be able to get something ready for review and show that to them.

47:21 That would be pretty cool.

47:23 It's summer is passing.

47:25 The conferences are ending.

47:26 It's going to get dark and cold.

47:28 Perfect time to program.

47:29 Yeah, we'll get back to the situation that I was in when I started narwhals, which was that it was a rainy Friday.

47:36 Not Friday, sorry.

47:37 It was a rainy February weekend in Wales, the rainiest part of the UK.

47:42 So, you know.

47:43 Yeah, that's exactly the same in Oregon here.

47:45 So it's a good time to get stuff done.

47:47 Yeah, exactly.

47:48 So, yeah.

47:49 And then I've been speaking to people from Shiny and Plotly about potentially looking into narwhals.

47:55 There's no contract set in stone or anything.

47:59 These people may well change their mind if it doesn't work for them.

48:02 But my idea is, okay, they've shown interest.

48:04 Let's go head first into seeing whether we can help them and whether they'd be able to use narwhals.

48:10 If it doesn't work out, we'll just have strengthened the narwhals API and learn some new things.

48:15 If it does work, then great, exciting.

48:18 So that's my top priority.

48:20 And it's been really pleasing to see the contributor community develop around narwhals.

48:25 I really thought it would be a one-person project for a long time.

48:28 But so many people have been contributing really high-quality pull requests.

48:31 It's really been, yeah, you see 42.

48:34 Okay, one of them is this.

48:35 Okay, maybe a couple of them here are like a GitHub bot.

48:38 This pre-commit CI bot.

48:40 Yeah.

48:41 Maybe that's not counted as those.

48:43 Maybe 40, 30, but still, that's a lot.

48:44 While we're talking numbers on the homepage, I also want to point out, 10 million downloads a month is a lot of downloads.

48:50 That's awesome.

48:51 Yeah, that's maybe slightly misleading because they pretty much just come from the fact that it's now a required dependency of Altair.

48:58 And Altair gets millions of downloads.

49:00 Yeah, yeah, yeah, exactly.

49:01 But that's how, that's the place of some libraries.

49:04 Like Werkzeug. I don't think many people go, oh, let me go get this HTTP library, or ItsDangerous.

49:09 They just go, I'll use Flask, right?

49:11 But it's still a really important building block of the community, even if people don't seek it out as a top-level thing they use, right?

49:19 Sure, cheers, thanks.

49:20 Yeah.

49:20 In fact, if we do our job well, then most people should never know about Narwhals.

49:25 Exactly.

49:26 The tools should just work.

49:28 Sorry.

49:29 Yeah, exactly.

49:30 Yeah.

49:31 So yeah, it's been really encouraging, really pleasing to see this contributor community emerge around the project.

49:42 And I think a lot of the contributors are really interested in adding extra methods and adding extra backends and things.

49:51 So I'm trying to leave a lot of that to the community.

49:55 So like with Dask, I just got the rough building blocks together.

49:59 And then it was just so nice, like so many really high quality contributions coming up that brought the Dask support pretty much complete.

50:07 We should see now if we're able to execute all of the TPC-H queries with the Dask backend, we might actually be there or be pretty close to getting there.

50:15 Nice.

50:16 What does TPC-H stand for?

50:19 I don't remember what it stands for, but it's a set of database queries that they were originally written for testing out different databases.

50:29 So it's a bunch of SQL queries, but I'm not sure if it was Kaggle that popularized the idea of translating these SQL queries to data frame like APIs,

50:44 and then running different data frames on them to see who wins the speed test.

50:49 But we just figured they do a bunch of things like joins, concatenations, filtering, comparisons with dates, string operations.

50:58 And we're like, okay, if the Narwhals API is able to do all of this, then maybe it's extensive enough to be useful.

51:04 Right. Yeah, yeah. That's super cool. It sounds a little bit like the TIOBE index plus other stuff maybe, but for databases.

51:11 I'm not familiar with that.

51:13 It's like a language ranking type of thing. And, you know, one aspect is maybe ranking the databases. But yeah, this is very cool. Okay, got it.

51:21 Yeah.

51:22 I mean, in the end, we're not trying to be fast in Narwhals, but we just want to make sure that there's no extra overhead compared to running things natively.

51:29 As long as you're not much slower than the stuff that you're operating with. Like, that's all you should ask for. You can't make it go faster in the extreme.

51:37 Like you did talk about some optimizations, but you can't fundamentally change what's happening.

51:41 Yeah, we could do some optimizations on the Narwhals side. But to be honest, I'm not sure I want to. And part of the reason is because I want this to be a pretty simple project that's easy to maintain.

51:50 Yeah, sure.

51:51 That's really just low overhead.

51:53 And extra docs and tutorials coming. That's fun.

51:56 Yeah.

51:56 Looking for contributors who maybe want to write some tutorials or docs.

52:00 I would love this. Yeah. I mean, it drives me crazy when I see so many projects where people have put so much effort into making a really good product, but then the documentation is really scant.

52:08 Like if you don't prioritize writing good docs, nobody's going to use your product. So I was really grateful to my company.

52:17 They had four interns come on who really helped out with making the docs look amazing.

52:22 Oh, that's cool.

52:23 Like if you look at the API reference, I think every single function now has got a, like a doc string with an example at the bottom.

52:32 I think there's API reference. Yeah. If you search for any function in here, yeah.

52:36 Yeah. In the search box at the top.

52:39 I don't know. Series dot.

52:40 Sure.

52:41 Something easy for any of these.

52:43 We've got like an example of, okay, here's how you could write a data frame agnostic function, which uses this, this method.

52:49 And let's show that if you pass pandas or polars, you get the same result.

52:53 And if there's some slight differences that we just cannot get around, like in the way that they handle missing values, then we've got a very clear note about in the docs.

53:01 Yeah, that's great. Maybe, maybe someday support for DuckDB.

53:04 I would like that. I don't think we have much of a choice about whether or not we support DuckDB.

53:10 Like, DuckDB is really on fire now.

53:12 So it really is.

53:14 Yeah, I think it's, it might be a question of either we have some level of support for DuckDB or somebody else is going to make something like novels that supports DuckDB and then we become extinct.

53:26 But besides, to be honest, DuckDB is amazing, but I just find it a bit painful to write SQL strings.

53:35 And so if I could use DuckDB, but with the Polars API that I prefer and I'm more familiar with, then.

53:42 Yeah, I 100% agree. It looks super nice. But if you look at it has a SQL example, and then the Python example is just SQL quote, quote, quote.

53:51 Yeah, exactly.

53:52 Here's the SQL embedded in Python, you know what I mean? So it's, it's, you're kind of writing SQL no matter what.

53:56 Yeah, and then the error messages that you get sometimes are like, oh, there's a pass error near this keyword and you're like, what on earth is going on? Like, it's, and then you're like, oh, yeah, I forgot.

54:05 I've got an extra comma at the end of my select or something. I don't know.

54:08 Yeah. So this kind of thing.

54:09 Yeah. So DuckDB. Yeah. So DuckDB is a little bit like SQLite, but for analytics rather than relational, maybe. I'm not sure if that's a good characterization.

54:17 I think it's primarily aimed at analysts. Yeah. Analytical kinds of things. Yeah. Data scientists and people. What I kind of, what we are going to struggle with is that in DuckDB, there's no guarantees about row order or operations.

54:32 So, but on the plus side, when I look at what Altair are doing with data frames, when I look at some of the other libraries that I've shown interest in now, they're often just doing very simple things.

54:43 They're not doing things that depend on row order. So if we could just initially just support DuckDB for the operations that don't require row order.

54:51 So for something like a cumulative sum, maybe initially we just don't support that for DuckDB. Yeah.

54:57 Like in the end, if you want to do advanced SQL, just use DuckDB directly. Like, as I said earlier, I don't recommend that end users use novels directly, but even just having some common operations, ones that aren't row order dependent.

55:12 I'd like to think that this is already enough to solve some real problems for real people. Yeah. I know you said it's mostly for library builders, but if you were building an app and you

55:20 were not committed to your data frame library or you really wanted to leave open the possibility of choosing a different data frame library, you know, sort of using narwhals to isolate that a little bit might be nice, right?

55:31 Yeah. If anyone tries to do this, I'd love to hear your story. I did hear from somebody on our community call. We've got a community call every two weeks, by the way, if anyone wants to come and chat with us. I did hear from somebody that at work has got some teams that are primarily using pandas.

55:50 There's some teams that are primarily using polars and he just wanted to build some common logic that both teams could use. And he was using narwhals for that. So I think there are some use cases beyond just library maintainers.

56:02 Yeah, absolutely. Maybe you're building just an internal library and it needs to work with some code you've written in pandas, but you maybe want to try your new project in polars, but you want to still use that library, right? That would be a cool use case as well.
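The use case described here, one piece of shared logic running against more than one dataframe backend, boils down to a wrap-operate-unwrap pattern. This is a hypothetical pure-Python sketch of that idea, not Narwhals' actual code (Narwhals exposes it through things like nw.from_native and a narwhalify decorator); the two toy "backends" stand in for pandas and Polars:

```python
class BackendA:
    """Stores columns as a dict of lists (toy stand-in for one library)."""
    def __init__(self, data):
        self.data = data
    def column(self, name):
        return self.data[name]

class BackendB:
    """Stores rows as a list of dicts (toy stand-in for another library)."""
    def __init__(self, rows):
        self.rows = rows
    def column(self, name):
        return [r[name] for r in self.rows]

def mean_of(df, name):
    # "Library code" written once against the common .column() interface,
    # so both teams can call it with their own native objects.
    col = df.column(name)
    return sum(col) / len(col)

a = BackendA({"x": [1.0, 2.0, 3.0]})
b = BackendB([{"x": 1.0}, {"x": 2.0}, {"x": 3.0}])
print(mean_of(a, "x"), mean_of(b, "x"))  # 2.0 2.0
```

The real library does the hard part: mapping one expression API onto each backend's native operations without copying the data.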

56:14 Yeah. I'm pretty sure you've brought up on the show before that XKCD about the spacebar overheating. I can't remember which number that one is, but in the end, with a lot of open source projects, you put it out with some intention of how it's meant to be used.

56:30 Yes.

56:34 Yeah. It looks like something out of a changelog with feedback or something. It says: changes in version 10.17. The CPU no longer overheats when you hold down the spacebar. Comments:

56:49 LongtimeUser4 writes: this update broke my workflow. My control key is hard to reach, so I hold the spacebar instead, and I configured Emacs to interpret a rapid temperature rise as control. That's horrifying. Look, my setup works for me. Just add an option to re-enable spacebar heating.

57:05 I've seen it so many times, but it still makes me laugh each time.

57:09 It's incredible. It's incredible. All right, Marco. Well, congrats on the cool library. Congrats on the traction. Final call to action. Maybe people want to start using narwhals. What do you tell them?

57:17 Yeah. Give it a go and please join our discord and or our community calls. We're very friendly and open and would love to hear from you and see what we can do to address whatever limitations you might come up against.

57:44 If you're building a B2B SaaS app, at some point your customers will start asking for enterprise features like SAML authentication, SCIM provisioning, audit logs, and fine-grained authorization.

57:54 WorkOS helps ship enterprise features on day one without slowing down your core product development. Find out more at talkpython.fm/workOS.

58:04 Want to level up your Python? We have one of the largest catalogs of Python video courses over at Talk Python.

58:10 Our content ranges from true beginners to deeply advanced topics like memory and async. And best of all, there's not a subscription in sight. Check it out for yourself at training.talkpython.fm.

58:20 Be sure to subscribe to the show. Open your favorite podcast app and search for Python. We should be right at the top.

58:27 You can also find the iTunes feed at slash iTunes, the Google Play feed at /play, and the direct RSS feed at slash RSS on talkpython.fm.

58:36 We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

58:47 This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.

58:53 We'll see you next time.
