#402: Polars: A Lightning-fast DataFrame for Python [updated audio] Transcript
00:00 When you think about processing tabular data in Python, what library comes to mind?
00:04 Pandas, I'd guess.
00:05 But there are other libraries out there, and Polars is one of the more exciting new ones.
00:10 It's built in Rust, embraces parallelism, and can be 10 to 20 times faster than Pandas out of the box.
00:17 We have Polars' creator, Ritchie Vink, here to give us a look at this exciting new data frame library.
00:23 This is Talk Python to Me, episode 402, recorded January 29, 2023.
00:29 Welcome to Talk Python to Me, a weekly podcast on Python.
00:46 This is your host, Michael Kennedy.
00:48 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both on fosstodon.org.
00:55 Be careful with impersonating accounts on other instances.
00:58 There are many.
00:59 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.
01:04 We've started streaming most of our episodes live on YouTube.
01:08 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.
01:17 This episode is brought to you by Taipy.
01:19 Taipy is here to take on the challenge of rapidly transforming a bare algorithm in Python into a full-fledged decision support system for end users.
01:26 Check them out at talkpython.fm/taipy, T-A-I-P-Y.
01:32 And it's also brought to you by User Interviews.
01:34 Earn extra income for sharing your software developer opinion.
01:38 Head over to talkpython.fm/userinterviews to participate today.
01:43 Hey, Richie.
01:44 Welcome to Talk Python to me.
01:46 Hello.
01:46 Thanks for having me.
01:47 Great to be here.
01:48 I feel like maybe I should rename my podcast Talk Rust to Me or something.
01:52 I don't know.
01:52 Rust is taking over.
01:54 As the low-level part of how do we make Python go fast?
01:59 There's some kind of synergy with Rust.
02:01 What's going on there?
02:02 Yeah, there is.
02:02 I'd say it was low-level languages that made Python a success.
02:09 I mean, like NumPy, Pandas, everything that was reasonably fast was so because of C or Cython, which is also C.
02:16 But Rust, different from C, has made low-level programming a lot more fun to use and a lot more safe.
02:23 And especially if you regard multi-threaded programming, parallel programming, concurrent programming, it is a lot easier in Rust.
02:31 It opens a lot of possibilities.
02:33 Yeah, my understanding.
02:34 I've only given a cursory look to Rust, just sort of scan some examples.
02:39 And we're going to see some examples of code in a little bit, actually, related to pullers.
02:43 But it's kind of a low-level language.
02:45 It's not as simple as Python.
02:47 No.
02:48 Maybe not as simple as JavaScript, but it is easier than C and C++, not just in the syntax; it does better memory tracking for you, and the concurrency especially.
02:59 Yeah, well, so Rust brings a whole new thing to the table, which is called ownership and a borrow checker.
03:05 And Rust is really strict.
03:07 There are things you can do in C or C++ that you cannot do in Rust.
03:10 Because at any time, there can only be one owner of a piece of memory.
03:14 You can lend out this piece of memory to other users, but then they cannot mutate it.
03:19 So there can be only one owner, which is able to mutate something.
03:22 And this restriction makes Rust a really hard language to learn.
03:27 But once it clicks, once you get over that steep learning curve, it becomes a lot easier, because it doesn't allow you things that you could do in C and C++.
03:37 But those were also things you shouldn't do in C and C++, because they probably led to segfaults and all sorts of memory issues.
03:44 And this borrow checker also makes writing concurrent programs safe.
03:49 You can have many threads reading a variable all they want.
03:53 They can read concurrently.
03:54 It's when you have writers and readers that all the thread safety, critical sections, taking your locks, re-entrant locks,
04:01 all of that really difficult stuff comes in.
04:04 And so, yeah, it sounds like an important key to making that safe.
04:08 And the same borrow checker also knows when memory has to be freed and when not.
04:13 But unlike in Go or Java, where you have a garbage collector, it doesn't have to do garbage collection, and it doesn't have to do reference counting like Python.
04:22 It does so statically.
04:24 So at compile time, it knows when something is out of scope and not used anymore.
04:28 And this is real power.
04:29 I guess the takeaway for listeners who are wondering, you know, why is Rust seemingly taking over so much of the job that C and variations of C, right?
04:38 Like you said, Cython have traditionally played in Python.
04:41 It's easier to write modern, faster, safer code.
04:45 Yeah.
04:45 Yeah.
04:45 Probably more fun too, right?
04:47 Yeah, definitely.
04:47 And it's a language which has got its tools, right?
04:51 So it's got a package manager, which is really great to use.
04:54 It's got crates.io, which is similar to the PyPI index.
04:58 It feels like a modern language.
04:59 Yeah.
05:00 It builds more low-level code.
05:02 But you can also write high-level stuff like REST APIs, and I will say, also for high-level stuff, I like to write it in Rust because of the safety guarantees.
05:13 And also the correctness guarantees: if my program compiles in Rust, I'm much more certain it is correct than when I write my Python program, which is dynamic and where types are not enforced.
05:23 So it's always been paying off on that side.
05:26 Yeah.
05:26 Python is great to use, but it's harder to write correct code in Python.
05:30 Yeah, and you can optionally write very loose code, or you could opt in to things like type hints and even mypy, and then you get closer to the static languages, right?
05:41 Are you a fan of Python typing?
05:43 Definitely.
05:44 But because they're optional, they are as strong as the weakest link.
05:47 So if one library which you use doesn't do the typing correctly, or doesn't do it at all, it breaks.
05:54 It's quite brittle because it's optional.
05:56 I hope we get something that's really enforced and that we really can check.
06:00 I'd love to know if it's possible because of the dynamic nature of Python.
06:04 Python can do so many things, jump around dynamically.
06:07 And technically, we just cannot know; I don't know how far it can go.
06:12 But yeah.
06:13 In Polars as well, we use mypy type hints, which prevent us from having a lot of bugs.
06:19 Plus, they make the IDE experience much nicer.
06:22 Yeah.
06:23 Type hints are great.
06:24 They really help you also think about your library.
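To make the point concrete, here's a minimal, hypothetical sketch of the optional typing being discussed; the annotation only helps if a checker like mypy actually runs:

```python
# Hypothetical sketch: type hints are optional, so they are only as strong
# as the checker you run against them.
def mean(values: list[float]) -> float:
    return sum(values) / len(values)

mean([1.0, 2.0, 3.0])   # fine everywhere
mean("oops")            # mypy flags this; plain Python fails only at runtime
```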
06:26 I think you really see a shift between modern Python and Python 10 years ago, where it was more dynamic.
06:33 And the dynamic nature, I remember, was seen as more of a strength then than it is currently.
06:39 Yeah, I totally agree.
06:41 And I feel like when type hints first came out, you know, this was, yes, wow, at this point, kind of early Python 3.
06:47 But it didn't feel like it at the time, you know.
06:49 Python 3 had been out for quite a while.
06:51 When type hints were introduced, I feel like that was Python 3.4.
06:55 But anyway, that was maybe six years into the life cycle of Python 3.
06:59 But still, I feel like a lot of people were suspicious of that at the moment.
07:03 Yeah.
07:03 You know, they're like, oh, what is this weird thing?
07:06 We're not really sure we want to put these types into our Python.
07:09 And now a lot less.
07:11 There's a lot less of those reactions.
07:13 Yeah.
07:13 I see.
07:13 Yeah.
07:14 Yeah.
07:14 I see Python as having two roles, probably more.
07:17 But I often see Python as the really fun, nice duct tape language where, for instance, in a Jupyter notebook, I can just hack away and try interactively what happens.
07:28 And for such code, type hints don't matter.
07:30 But once I write more of a library or product or tool, then type hints are really great.
07:36 I believe they came about because Dropbox really needed them.
07:39 They have a huge Python project.
07:40 It's a really tough one.
07:42 I'm not really sure.
07:44 Yeah.
07:44 And I heard some guy who has something to do with Python used to work there.
07:47 Yeah.
07:47 Yeah.
07:47 Yeah.
07:48 Guido used to work there.
07:48 I think even at that time.
07:50 All right.
07:51 So a bit of a diversion from how I often start the show.
07:54 So let's just circle back real quick and get your story.
07:56 How did you get into programming and Python and Rust as well, I suppose?
08:00 I got into programming.
08:01 I just wanted to learn programming.
08:02 A friend of mine who did program a lot of PHP said, learn Python.
08:07 Like that.
08:07 He gave me an interactive website where I could do some puzzles.
08:12 And I really got hooked to it.
08:14 It was a fun summer.
08:15 Yeah.
08:16 I was programming a lot.
08:18 I started automating.
08:19 My job at the time was as a civil engineer.
08:22 And I started.
08:22 It was a lot of mundane tasks.
08:24 It's been repetitive.
08:25 And I just found ways to automate my job.
08:27 And eventually I was doing that for three or four years.
08:31 And then I got into data science and I switched jobs.
08:34 I became a data scientist and later a data engineer.
08:37 Yeah.
08:38 So that was Python mostly.
08:39 I've always been looking for more languages, playing with Hasbro, playing with Go, playing with Dubstrip, or just playing with Scala.
08:50 And then I found Rust.
08:51 Rust really, really clicked.
08:53 And, like, you learn a lot about how computers work.
08:56 Yeah.
08:57 So I had a new renaissance of that first experience with Python.
09:00 Another summer of programming a lot.
09:04 Doing a lot of hobby projects.
09:08 And Polars came out of one of those hobby projects.
09:12 Now it's got quite the following.
09:14 And we're going to definitely dive into that.
09:16 But let me pull it up.
09:18 There it is right here.
09:19 13,000 GitHub stars.
09:21 That's a good number of people using that project.
09:25 Yeah.
09:25 Yeah.
09:25 Crazy, isn't it?
09:26 Yeah, it is.
09:27 And it's not just the GitHub stars.
09:29 It's the pace at which it got there.
09:31 Wow.
09:32 Incredible.
09:32 You must be really proud of that.
09:35 Yeah.
09:35 Yeah.
09:36 If you would have told me this two years ago, I would never have believed it.
09:39 But it happens slow enough so you can get accustomed to that.
09:43 Yeah.
09:43 That's cool.
09:44 Kind of like being a parent.
09:46 The challenges of the kids are small.
09:48 They're intense, but there are only a few things they need when they're small.
09:52 And you kind of grow with it.
09:53 Yeah.
09:53 So a couple of thoughts.
09:54 One, you had the inverse style of learning to program that I think a lot of computer science
10:00 people do and certainly that I did.
10:02 It could also just be that I learned it a long time ago.
10:05 But when I learned programming, it was, I'm going to learn C and C++.
10:09 And then you're kind of allowed to learn the easier languages.
10:13 But you will learn your pointers.
10:15 You'll have your void star, star, and you're going to like it.
10:18 You're going to understand what a pointer to a pointer means.
10:20 And we're going to get, I mean, you start inside of the most complex, closest to the machine.
10:27 You work your way out.
10:27 You kind of took this opposite.
10:29 Like, let me learn Python where it's much more high level.
10:32 It's much more, you know, as people often say, much further away from the hardware and
10:37 the ideas of memory and threads and all that.
10:40 And then you went to Rust.
10:41 So was it kind of an intense experience?
10:43 You're like, oh my gosh, this is intense.
10:46 Or had you studied enough languages by then to become comfortable?
10:49 Well, yeah, yeah, no.
10:50 So going from a high-level to a low-level language, I think it makes natural sense.
10:55 You're learning it yourself.
10:56 There's no professor telling me you learn your pointers.
10:59 Yeah.
11:00 I think this also helped a lot, because at that point you're really accustomed to programming algorithms.
11:06 Yeah, but I believe you should learn one new thing at a time,
11:10 so that you can really own that knowledge.
11:13 But for Rust, I wouldn't say you should learn it as a first language.
11:17 It'd be really terrible.
11:21 But other languages also don't help you much, because the borrow checker is quite unique.
11:26 It doesn't let you do things you can do in other languages.
11:29 So the languages that allow you to do that, they just encourage
11:34 the wrong behavior, right?
11:38 Well, yeah.
11:39 So nine out of ten times, it turns out, with the compiler not letting you do that one
11:45 thing, that one thing you wanted was probably really bad to begin with.
11:48 So in Rust, your code is always a lot flatter.
11:52 It's always really clear who owns the memory, how deep your nesting is.
11:56 It's always one degree deep, or most of the time it's not that complicated.
12:01 You, you make things really flat and really easy to reason about.
12:05 And in the beginning of a project, it seems a bit over-constraining, but, I mean,
12:11 software will become complex and complicated, and then you're happy the compiler nudged you to this.
12:16 Yeah, absolutely.
12:17 In this direction.
12:18 It seems like a better way.
12:19 Honestly, you know, you get a sense of programming in a simpler language that doesn't ask so many
12:26 low-level concepts of you.
12:27 And then, when you're ready,
12:28 you can add on these new ones.
12:30 so I, I feel like a lot of how we teach programming and how people learn programming is
12:34 a little bit backwards, to be honest.
12:36 Yeah.
12:36 Anyway, enough on that.
12:37 So you were a civil engineer for a while and then you became a data scientist and now you've
12:42 created this library.
12:43 Still working as a data scientist now?
12:44 No, no.
12:45 I got sponsored two years ago for two days a week.
12:49 And yeah, just to use the time to work on Polars.
12:53 And currently I've stopped all my day jobs.
12:56 I'm going full time on Polars and trying to live on sponsorships, which is not
13:01 really working.
13:02 It's not enough at this time.
13:03 I hope to start a foundation and get some proper sponsors.
13:06 Yeah.
13:07 That'd be great.
13:08 Yeah.
13:08 Yeah.
13:08 That's awesome.
13:10 Yeah.
13:10 It's still awesome that you're able to do that, even if, you know, you still needed to grow
13:14 a little bit.
13:14 Yeah.
13:15 We'll have you on a podcast and let other people know out there who, who may be using your
13:19 library.
13:20 Yeah.
13:20 Maybe they can, you know, put in a little sponsorship on GitHub Sponsors.
13:23 I feel like GitHub Sponsors really made it a lot easier for people to support.
13:29 Cause there used to be like PayPal donate buttons and other, other things like that.
13:35 And one, those are not really recurring.
13:37 And two, you've got to go find someplace and put your credit card.
13:40 You know, many of us already have a credit card registered at GitHub.
13:44 It's just a matter of checking a box and monthly it'll just go, you know, it's kind of like the
13:48 app store versus buying independent apps.
13:49 It just cuts down a lot of the friction.
13:51 I feel like it's been really positive mostly for open source.
13:55 Yeah.
13:55 I think it's good to, as a way to say, thank you.
13:58 Though it isn't enough to pay the bills.
14:00 I think for most developers it isn't, but I hope we get there,
14:04 with the companies who use it giving a bit more back.
14:08 I mean, they have a lot of money.
14:09 I agree.
14:10 It's really, really ridiculous that there are banks and VC funded companies and things like
14:16 that, that have not necessarily in terms of the VC ones, but definitely in terms of
14:21 financial and other large companies that make billions and billions of dollars in profit on top of
14:26 open source technology.
14:27 Yeah.
14:27 And many of them don't give anything back, which is, it's not criminal because the licenses allow it,
14:33 but it certainly borders on immoral to say, we're making all this money and not at all supporting
14:39 the people who are really building the foundations that we build upon.
14:42 Most of my sponsors are developers.
14:44 Yeah.
14:44 Yeah.
14:45 Yeah.
14:45 So, yeah, we'll let's hope it changes.
14:48 I don't know.
14:49 Yeah.
14:49 Well, I'll continue to beat that drum.
14:52 This portion of Talk Python to Me is brought to you by Taipy.
14:58 Taipy is the next generation open source Python application builder.
15:02 With Taipy, you can turn data and AI algorithms into full web apps in no time.
15:06 Here's how it works.
15:07 You start with a bare algorithm written in Python.
15:10 You then use Taipy's innovative tool set that enables Python developers to build interactive end user applications quickly.
15:17 There's a visual designer to develop highly interactive GUIs ready for production.
15:22 And for inbound data streams, you can program against the Taipy Core layer as well.
15:27 Taipy Core provides intelligent pipeline management, data caching, and scenario and cycle management facilities.
15:33 That's it.
15:34 You'll have transformed a bare algorithm into a full-fledged decision support system for end users.
15:39 Taipy is pure Python and open source, and you install it with a simple pip install taipy.
15:44 For large organizations that need fine-grained control and authorization around their data,
15:49 there is a paid Taipy Enterprise edition, but the Taipy Core and GUI described above are completely free to use.
15:56 Learn more and get started by visiting talkpython.fm/taipy, that's T-A-I-P-Y.
16:03 The link's in your show notes.
16:04 Thank you to Taipy for sponsoring the show.
16:06 Let's talk about your project.
16:10 So Polars and the RS is for rust, I imagine at the end.
16:15 Yeah.
16:15 But tell us about the name Polars, like polar bear, but Polars.
16:19 Yeah.
16:19 So I started writing a data frame library, and initially it was only for Rust.
16:26 And you thought, all the people doing data science in Python, you're like, wow.
16:29 Yeah.
16:29 Yeah.
16:29 Yeah.
16:30 What can I do for these people?
16:31 Right?
16:31 Yeah.
16:31 Yeah.
16:32 And I wanted to give a wink to the pandas project, but I wanted a bear that was better,
16:37 faster, I don't know, stronger.
16:38 And luckily a panda bear isn't the most practical bear.
16:43 So I had a few to choose from, the grizzly, yeah, but the polar bear has the RS.
16:49 So that's a lot of where the name came from.
16:51 Yeah.
16:51 So, yeah.
16:52 So the subtitle here is lightning fast data frame library for rust and Python.
16:57 And you have two APIs that people can use.
17:00 We'll get to dive into those.
17:01 Yeah.
17:01 Because we wrote the engine in Rust.
17:04 It's a complete data frame library in Rust.
17:06 You can expose it to many front ends, and there already are front ends.
17:10 Well, Python, Node.js; R is coming up and normal JavaScript is coming up.
17:15 And Ruby, there is also a Polars Ruby project.
17:18 So, interesting.
17:20 So for the JavaScript one, are you going to use WebAssembly?
17:23 Yeah.
17:23 Right.
17:23 Which is pretty straightforward because rust comes from Mozilla.
17:27 WebAssembly, I believe, also originated there; they kind of originated as a somewhat tied-together
17:31 story.
17:32 Yeah.
17:32 So Rust, C++, and C can compile to WebAssembly.
17:36 It's not really straightforward because the WebAssembly virtual machine isn't like your normal
17:41 OS.
17:41 So a lot of things are harder, but we are working on the challenges.
17:45 Okay.
17:45 Well, that's pretty interesting.
17:47 But for now you've got Python and you've got Rust and that's great.
17:50 Let's, I think a lot of people listening, myself included, when I started looking into this,
17:55 immediately go to, it's like pandas, but rust, you know, it's like pandas, but instead of C at
18:02 the bottom, it's, it's rust at the bottom.
18:04 And that's somewhat true, but mostly not true.
18:07 So let's start with you telling us, you know, how is this like pandas and how is it different
18:12 from pandas?
18:13 Yeah.
18:13 So it's not like pandas.
18:15 I think it's different on two ways.
18:18 So we have the API and we have the application and which one should I start with?
18:23 I think.
18:24 Yeah.
18:25 Yeah.
18:25 Bottom up.
18:25 Sure.
18:26 All right.
18:26 So that was my critique from pandas.
18:29 And they didn't start by the library.
18:30 And they didn't start by the library.
18:32 They do whatever was there already with work, do it for that purpose.
18:36 And pandas built on NumPy.
18:39 And NumPy is a great library.
18:40 It's built from numerical process and not from relational.
18:44 Versus in relational data is completely different.
18:47 You have string data, nested data.
18:49 And this data has probably put as Python object in those NumPy lines.
18:54 And if you know anything about memory, then in this array, you have pointer with where each
19:00 Python object is somewhere else.
19:01 So if you traverse this memory, every point you must look it up somewhere else.
19:05 But memory is not a cache where the cache means, which is a 200x slowdown per element to traverse.
19:11 Yeah.
19:12 So for people listening, what you're saying, the 200x slowdown is the L1, L2, L3 caches,
19:18 which all have different speeds and stuff, but the caches that are near the CPU versus
19:23 main memory, it's like two to 400 times slower, not aging off a disk or something.
19:28 It's really different, right?
19:29 It's really a big deal.
19:30 It's a big deal.
19:31 It's terribly slow.
19:32 It also, Python has a GIL.
19:34 It also blocks multi-treading.
19:36 If you want to read the string, you cannot do this from different threads.
19:40 If you want to modify the string, there's only one thread that can access Python.
19:43 They also didn't take into account anything from databases.
19:49 So databases are based in from the 1950s.
19:53 There's been a lot of research in databases, how we do things fast, write a query, and then
19:59 optimize this query because the user that uses your library is not the expert.
20:03 It doesn't write optimized query.
20:04 No, but we have a lot of information, so we can optimize this query and execute this in
20:09 the most, in a very efficient way.
20:11 Well, that's an interesting idea.
20:12 Yeah.
20:13 And Pandas just executes it and gives you what you ask.
20:16 And what you ask is probably not.
20:18 Yeah, that's interesting because as programmers, when I have my Python hat on, I want my code
20:24 to run exactly as I wrote it.
20:26 I don't want it to get clever and change it.
20:29 I, you know, if I said do a loop, do a loop.
20:31 If I, if I said put it in a dictionary, put it in a dictionary.
20:34 But when I write a database query, be that against Postgres with relational or MongoDB,
20:41 there's a query planner and the query planner looks at all the different steps.
20:45 Should we do the filter first?
20:47 Can we use an index?
20:48 Can we use a compo?
20:49 Which index should we choose?
20:51 All of those things, right?
20:52 And so what you tell it and what happens, you don't tell it how to do finding the data,
20:58 the database.
20:58 You just give it, here's kind of the expressions that I need, the predicates that I need you
21:04 to work with.
21:04 And then you figure it out.
21:06 You're smart.
21:06 You're the database.
21:07 So one of the differences I got from reading what, what you've got here so far is it looks
21:12 like, I don't know if it goes as far as this database stuff that we're talking about,
21:16 but there's a way for it to build up the code it's supposed to run.
21:20 And it can decide things like, you know, these two things could go in parallel or things along
21:25 those lines.
21:25 Right?
21:26 Yeah.
21:26 Yeah.
21:26 Well, it is actually very similar.
21:29 It is a vectorized query engine.
21:30 The only thing that doesn't make us a database is that
21:35 we don't bother with file structures that we write, like the persistence and transactions
21:40 and all that.
21:41 Yeah.
21:41 So we have different kinds of databases: you have OLAP and OLTP, transactional
21:46 modeling, which often works on one row.
21:48 So if you do a REST API query and you modify one user by ID, then you're transactional.
21:53 And if you're doing OLAP, that's more analytical.
21:56 And then you do large aggregations over whole tables.
21:59 And then you need to process all the data, and those different database
22:03 designs lead to different query optimizers.
22:05 And Polars is focused on OLAP.
22:07 But yeah, we, so as you described, you've got two ways of programming things.
22:11 One is procedural, which Python mostly is.
22:14 So you tell it exactly, if you want to get a cup of coffee, how many steps to take forward.
22:19 Then rotate 90 degrees, take three steps.
22:21 Then rotate 90 degrees.
22:23 You write down the whole algorithm for how to get a coffee.
22:25 Or you could just say, get me a coffee.
22:27 I'd like it with sugar, and then let the query engine decide how to best get it.
22:34 Right.
22:34 And that's more declarative.
22:35 You describe the end result.
22:37 And as it turns out, this is also very readable because you declare what you want and the intent
22:42 is readable in the query.
22:44 And if you're doing more procedural programming, you describe what you're doing.
22:48 And the intent often needs to come from comments.
22:51 Like what are we trying to do when we follow this?
22:54 Right.
22:54 Yeah.
22:54 That makes a lot of sense.
22:55 And that's very interesting.
22:57 Yeah.
22:57 Sorry.
22:57 And that's why, so the first thing is, we wrote a query engine
23:02 from scratch and really thought about multiprocessing, about caches, and also about out
23:09 of core, processing data that doesn't fit into memory.
23:12 So we really built this up with all those things in mind.
23:15 And at first we wanted to expose the pandas API, and then we noticed how bad it was for
23:22 writing a fast query engine.
23:23 The pandas API just isn't really suited for this declarative analyzing of what the user wants to
23:29 do.
23:29 So we just cut it off and took the freedom to design an API that makes the most sense.
23:34 Oh, that's interesting.
23:35 I didn't realize that you had started trying to be closer to pandas than you ended up.
23:40 Yeah.
23:40 Well, it was very short-lived, I must say.
23:42 That was it.
23:43 A month at most.
23:44 Yeah.
23:44 And that's not necessarily saying pandas is bad, I don't think.
23:48 It's approaching the problem differently and it has different goals, right?
23:51 Yeah.
23:52 So maybe we could look at an example of some of the code that we're talking about.
23:56 I guess also one of the other differences there is much of this has to do with what you would call,
24:03 I guess you refer to them as lazy APIs or streaming APIs, kind of like a generator.
24:08 Yeah.
24:09 So think about a join, for instance, in pandas: if you would write a join and then take
24:14 only the first 100 rows of that result, then it would first do the join.
24:20 And then that might produce 1 million or 10 million rows.
24:24 And then you take only 100 of them.
24:25 And then you have materialized a million, but you take only a fraction of that.
24:29 And by having that lazy, you can optimize for the whole query at a time and just see,
24:34 yeah, we do this join, but we only need 100 rows.
24:36 So that's all we materialize, normally.
24:38 It gets you a lot more efficiency.
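A rough sketch of that join example (file and column names are made up):

```python
import polars as pl

users = pl.scan_parquet("users.parquet")    # lazy: nothing is read yet
orders = pl.scan_parquet("orders.parquet")

# Because the whole plan is visible before execution, the engine can avoid
# materializing millions of intermediate join rows just to return 100.
first_100 = users.join(orders, on="user_id").head(100).collect()
```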
24:40 That's really cool.
24:41 I didn't realize it had so many similarities to databases, but yeah, it makes a lot of sense.
24:46 All right.
24:46 Let's look at maybe a super simple example you've got on pola.rs.
24:53 What country is RS?
24:54 I always love how different countries that often have nothing to do with domain names get grabbed
25:00 because they have a cool ending like Libya that was .ly for a while.
25:04 You know, it still is, but it was used frequently like bit.ly and stuff.
25:07 Do you know what RS is?
25:08 I believe it's Serbia.
25:10 Serbia.
25:11 Okay, cool.
25:11 I'm not sure.
25:12 Yeah.
25:12 Yeah.
25:12 Very cool.
25:13 All right.
25:13 So pola.rs, like Polars.
25:17 Over here, you've got on the homepage here, the landing page, and then through the documentation
25:22 as well, you've got a lot of places where you're like, show me the Rust API or show me the Python API.
25:26 People can come and check out the Rust code.
25:29 It's a little bit longer because it's, you know, that kind of language, but it's not terribly more complex.
25:35 But maybe talk us through this little example here on the homepage in Python,
25:40 just to give people a sense of what the API looks like.
25:42 Yeah.
25:42 So we start with a scan_csv, which is a lazy read, versus a read_csv.
25:49 A read_csv does what it tells you:
25:50 it reads the CSV and you get the data frame.
25:53 With a scan_csv, we start a computation graph,
25:56 called a lazy frame.
25:57 And the lazy frame just remembers the steps of the operations you want to
26:02 do.
26:02 Then it hands them to Polars, which looks at this query plan and will optimize it.
26:06 And it will think of how to execute it.
26:09 And we have different engines.
26:10 So you can have an engine that's more specialized for data that doesn't fit into memory and an engine
26:14 that's more specialized for data that does fit.
26:17 So we start with a scan and then we do a dot filter and we want to use verbs.
26:24 Verbs, that's the declarative part.
26:26 In pandas, we often do indexing, and those indexes are a big mess in my opinion, because
26:32 you can pass in a NumPy array with booleans, but you can also pass in a NumPy array
26:37 with integers.
26:38 So you can do slicing.
26:40 You can also pass in a list of strings, and then you do column selection.
26:44 So it has three functions.
26:46 One thing that I find really interesting about pandas is it's so incredible.
26:51 And people who are very good with pandas, they can just make it fly.
26:55 They can write expressions that are super powerful, but it's not obvious that
27:00 you should have been able to do that before you see it.
27:02 You know, there's a lot of not quite magic, but stuff that that doesn't seem to come really
27:07 straight out of the API directly.
27:10 You know, you pass in like some sort of like a boolean expression that involves a
27:16 a vector and some other test into the brackets.
27:19 Like, wait, how, how did I know I could do that?
27:21 Whereas this, your API is a lot more of a fluent API where, instead of pd, you'd say pl:
27:29 pl.scan_csv().filter().groupby().agg().collect().
27:33 And it kind of just flows together.
27:35 Does that mean that the editors and IDEs can be more helpful suggesting what happens at each step?
27:41 Yes, we are really strict on types.
27:43 So we also only return a single type coming from a method.
27:47 And a .filter just expects an expression that produces a boolean, not an integer, not a string.
27:54 So we want that from reading our code,
27:57 you should be able to understand what should go in there.
28:01 That's really important to me.
28:02 It should be unambiguous.
28:03 It should be consistent.
28:04 And you, your knowledge of the API should expand to different parts of the API.
28:08 And we're going to talk about this later, but that's where expressions
28:12 come in.
28:14 So, this portion of Talk Python to me is brought to you by User Interviews.
28:20 As a developer, how often do you find yourself talking back to products and services that you use?
28:25 Sometimes it may be frustration over how it's working poorly.
28:29 And if they just did such and such, it would work better.
28:33 And it's easy to do.
28:34 Other times it might be delight.
28:36 Wow.
28:37 They auto-filled that section for me.
28:38 How did they even do that?
28:40 Wonderful.
28:40 Thanks.
28:41 While this verbalization might be great to get the thoughts out of your head, did you
28:45 know that you can earn money for your feedback on real products?
28:49 User Interviews connects researchers with professionals that want to participate in research studies.
28:54 There is a high demand for developers to share their opinions on products being created for
28:59 developers.
29:00 Aside from the extra cash, you'll talk to people building products in your space.
29:04 You will not only learn about new tools being created, but you'll also shape the future of
29:09 the products that we all use.
29:11 It's completely free to sign up and you can apply to your first study in under five minutes.
29:16 The average study pays over $60.
29:18 However, many studies specifically interested in developers pay several hundreds of dollars
29:23 for a one-on-one interview.
29:25 Are you ready to earn extra income from sharing your expert opinion?
29:29 Head over to talkpython.fm/userinterviews to participate today.
29:34 The link is in your podcast player show notes.
29:36 Thank you to User Interviews for supporting the show.
29:39 I just derailed you a little bit here as you were describing this.
29:45 So you start out with scanning a CSV, which is sort of creating and kicking off a data frame
29:52 equivalent here.
29:53 Lazy frame.
29:53 And then you say a dot filter and you give it an expression like this column is greater than five.
30:01 Right.
30:01 Right.
30:01 Or some expression that we would understand in Python.
30:04 And that's the filter statement, right?
30:05 Yeah.
30:06 And then we follow a group by argument and then an aggregation where we say, okay, take all columns
30:12 and sum them.
30:13 And this again is an expression.
30:14 And these are really easy expressions.
30:16 And then we take this lazy frame and we materialize it into a data frame by calling
30:21 collect.
30:22 And collect means, okay, all those steps you recorded.
30:25 Now you can do your magic, query optimizer, get all the stuff.
30:28 And what this will do here, it will recognize that, okay, we've taken the iris.csv, which has got
30:34 different columns.
30:35 Now in this case it won't, but
30:37 if you would have finished with a select where we only selected two columns, it would
30:41 have recognized, oh, we don't need all those columns in the CSV file.
30:45 We only take the ones we need.
30:47 What it will do here, it will push the filter, the predicate, down to the scan.
30:51 So during the reading of the CSV, we will apply this predicate.
30:55 We say, okay, where the sepal length is larger than five, the rows that don't match the predicate
31:00 will not be materialized.
31:01 So if you have a really large CSV file, let's say a CSV file of
31:06 tens of gigabytes, but your predicate only selects 5% of that.
31:10 Then you only materialize 5% of the 10 gigabytes.
31:14 Yeah.
31:14 So 500 megs instead of 10 gigabytes, or 200 megs, whatever it is,
31:19 quite a bit less.
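Putting the walkthrough together, the query being discussed looks roughly like the homepage example (column names assumed from the iris dataset; newer Polars versions spell some methods differently, e.g. group_by):

```python
import polars as pl

query = (
    pl.scan_csv("iris.csv")                  # lazy scan: starts the computation graph
      .filter(pl.col("sepal_length") > 5)    # predicate, pushed down into the scan
      .groupby("species")                    # declarative group by
      .agg(pl.all().sum())                   # expression: sum every column
)

# collect() hands the recorded plan to the query optimizer and materializes a DataFrame.
df = query.collect()
```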
31:20 That's really interesting.
31:22 And this is all part of the benefits of what we were talking about with the lazy, lazy frames,
31:27 lazy APIs, and, and building up all of the steps before you say go, because in pandas,
31:33 you would say, read CSV.
31:34 So, okay, it's going to read the CSV.
31:36 Now what?
31:36 Yes.
31:37 Right.
31:37 And then you apply your filter if that's the order you want to do it in, and then you group
31:41 and then, and so on and so on.
31:42 Right.
31:42 Right.
31:43 It's interesting in that it does allow more database like behavior behind the scenes.
31:48 Yeah.
31:48 Yeah.
31:49 And yet, in my opinion, the data frame should be seen as a table in a database.
31:54 It's the final view of a computation.
31:57 Like you can see it as a materialized view.
31:59 We have some data, and we want to get it into another table, which we would feed into
32:06 our machine learning models or whatever.
32:08 And we do a lot of operations on them before we get there.
32:12 So I wouldn't see a data frame as just data.
32:15 It's not only a data structure.
32:17 It's not only a list or a dictionary.
32:20 There are lots of steps before we get into those tables.
32:23 And eventually.
32:24 Right.
32:25 So here's an interesting challenge.
32:28 There's a lot of visualization libraries.
32:32 There are a lot of other data science libraries that know and expect pandas data frames.
32:38 So like, okay, what you do is you send me the pandas data frame here, or we're going to patch pandas so that if you call this function on the data frame,
32:46 it's going to do this thing.
32:47 And they may say, Richie, fantastic job you've done here in Polars, but my stuff is already all built around pandas.
32:53 So I'm not going to use this.
32:55 Right.
32:55 But it's worth pointing out.
32:56 There's some cool pandas integration.
32:58 Right.
32:58 Yeah.
32:59 Yeah.
32:59 So, Polars doesn't want to do plotting.
33:02 I don't think it should be in a data frame library.
33:05 Maybe another library can do it on top of Polars.
33:08 It just shouldn't be in Polars, in my opinion.
33:11 But often when you do plotting, the number of rows will not be billions.
33:16 I mean, there's no plotting engine that can deal with that.
33:19 So you will be reducing your, your big data set to something small.
33:22 And then you can send it to the plot.
33:24 Yeah.
33:24 There's hardly a monitor that has enough pixels to show you that anyway.
33:29 Right.
33:29 So yeah.
33:30 Yeah.
33:30 Yeah.
33:30 Yeah.
33:31 You can call to_pandas, and then we transform our Polars data frame to pandas.
33:34 And then you can integrate with scikit-learn and such. And we often find that progressively rewriting
33:41 things from pandas to Polars is already cheaper than keeping it in pandas.
33:45 If you go from pandas, do a join in Polars, and then back to pandas,
33:50 we probably made up for those double copies.
33:52 Pandas does a lot of internal copies.
33:54 If you do a reset_index, it copies all data.
33:57 There are a lot of internal copies in pandas which aren't listed.
34:00 So I wouldn't worry about an explicit copy at the end of your ETL to go to plotting, when the data is
34:06 already small.
34:06 Right.
34:07 Right.
34:07 Right.
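A sketch of that workflow, with one explicit copy at the boundary (hypothetical input file):

```python
import polars as pl

# Do the heavy lifting in Polars...
df = (
    pl.scan_csv("iris.csv")
      .groupby("species")
      .agg(pl.col("sepal_length").mean())
      .collect()
)

# ...then convert once at the end for pandas-only tools (plotting, scikit-learn).
pdf = df.to_pandas()
```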
34:07 So let's look at the benchmarks because it sounds like to a large degree, even if you do have to do this
34:12 conversion in the end, many times, it still might even be quicker.
34:17 So you've got some benchmarks over here and you compared, I'm going to need some good vision for
34:22 this one.
34:22 You compared polars, pandas, Dask, and then two things which are too small for me to read.
34:28 Tell us what you compared.
34:29 Modin and Vaex.
34:30 Modin and Vaex.
34:30 Okay.
34:31 And for people listening, you go out here and look at these benchmarks, like linked right off the homepage.
34:38 There's like a little tiny purple thing and a whole bunch of really tall bar graphs.
34:42 Compared to the rest.
34:43 Yes.
34:44 And the little tiny thing that you can kind of miss if you don't look carefully,
34:47 that's the time it takes for polars.
34:50 And then all the others are up there in like 60 seconds, a hundred seconds.
34:55 And then polars is like a quarter of a second.
34:57 So, you know, it's easy to miss it in the graph.
34:59 But the quick takeaway here, I think, is there's some fast stuff.
35:03 Yeah.
35:03 Yeah.
35:03 We're often orders of magnitude faster than pandas.
35:06 So it's not uncommon to hear it's 10 or 20x faster, especially if you do write proper
35:12 pandas; for improper pandas,
35:14 it's probably more than 20x, except if we deal with IO as well.
35:17 So what we see here are the TPC-H benchmarks.
35:20 And TPC-H is a database query benchmark standard, which is used by every query engine to show
35:28 how fast it is.
35:29 And those are really queries that flex some muscles of the query engine.
35:34 So you have joins on several tables, different group bys, different nested group bys, etc.
35:40 And yeah, I really tried to make those other tools faster.
35:44 In memory, that is.
35:47 But it was really hard to make stuff faster than pandas.
35:49 Except for polars.
36:50 And then you have problems.
36:52 Either we need to do multiprocessing,
36:53 and we need to send those Python objects to another process,
36:57 and we copy data, which is slow.
36:58 Or we need to do multi-threading, and we're bound by the GIL and we're single-threaded.
37:02 And those are key differences.
37:03 Yeah, I think there's some interesting parallels for Dask and Polars.
37:09 On these benchmarks, at least, you're showing much better than performance than Dask.
37:13 I've had Matthew Rocklin on a couple of times to talk about Dask and some of the work they're doing
37:18 there at Coiled.
37:19 and it's very cool.
37:20 And one of the things that I think Dask is interesting for is allowing you to scale your code out
37:26 to multi cores on your machine or to even distributed grid computing or process data that doesn't fit in memory
37:33 and they can behind the scenes juggle all that for you.
37:36 I feel like Polars kind of has a different way, but attempts to solve some of those problems as well.
37:42 Yeah, but Polars has full control over everything.
37:45 So it's built from the ground up.
37:47 It controls I/O, it controls its own memory, it controls which thread gets which data.
37:52 And Dask takes this other tool and then parallelizes that.
37:57 But it is limited by what this other tool is limited by.
38:01 So on a single machine, it has those challenges.
38:04 I think Dask distributed sidesteps these challenges.
38:07 And I think for distributed, it works really well.
38:10 Yeah, the interesting part with Dask, I think, is that it's kind of like Pandas,
38:14 but it scales in all these interesting ways.
38:16 Across cores, bigger memory, but also across machines and then, you know,
38:20 across cores, across machines, like all that stuff.
38:23 I feel like Dask is a little bit, maybe it's trying to solve like a little bit bigger computer problem.
38:28 Like how can we use a cluster of computers to answer these questions?
38:32 Their documentation also says it themselves:
38:34 they say that they're probably not faster than pandas on a single machine.
38:38 So they're more for the large, big data.
38:41 Yeah.
38:41 But Polars wants to be, and is, a lot faster on a single machine, but also wants to be able to do
38:46 out-of-core processing on the single machine.
38:48 So we don't support all queries yet, but we want to; we already do the basics: group bys, sorts,
38:55 predicates, element-wise operations.
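A sketch of what that looks like from Python, assuming the opt-in streaming flag Polars had around this time (the file name is made up):

```python
import polars as pl

# Out-of-core execution: the streaming engine works through the file in chunks,
# so the dataset never has to fit in RAM all at once.
result = (
    pl.scan_csv("huge.csv")
      .filter(pl.col("amount") > 0)          # predicate
      .groupby("customer_id")                # group by
      .agg(pl.col("amount").sum())           # aggregation
      .collect(streaming=True)
)
```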
38:58 And then we can process; I've processed 500 gigabytes on my laptop.
39:02 Man, that's pretty good.
39:04 Your laptop probably doesn't have 500.
39:06 No, no, no, no, no.
39:06 It's 16 gigs.
39:07 Yeah.
39:09 Nice.
39:09 It's probably actually a value to, as you develop this product, to not have too massive of a computer to work on.
39:16 If you had a $5,000 workstation, you know, you might be a little out of touch
39:22 with many people using your code.
39:24 And, you know, it's so awesome.
39:25 Although I think scaling on a single machine makes sense for different reasons as well.
39:32 I think a lot of people talk about distributed, but if you think about complexity of distributed,
39:38 you need to send data, shuffle data over the network to other machines.
39:41 So there are a lot of people using Polars in our Discord who have one terabyte of RAM and say
39:47 it's cheaper and a lot faster than Spark, because, well, all this is faster on a single machine.
39:52 And one too, they have a beefy machine with like 120 cores and they don't have to go over the network to parallelize.
40:01 And yeah, so I think times are changing.
40:04 I think also scaling up data on a single machine is getting more popular.
40:08 It is.
40:09 One of the areas in which it's interesting is GPUs.
40:12 Do you have any integration with GPUs or any of those sorts of things?
40:15 No.
40:15 No.
40:15 Not that I'm suggesting that's necessarily even a good idea.
40:18 I'm just wondering if it does.
40:19 No, I get this question a lot, but I'm not really convinced.
40:23 Can I get the data fast enough into GPU memory?
40:25 We want to process gigabytes of data.
40:28 The challenge already on the CPU is getting the data from cache or memory fast enough
40:34 to the CPU.
40:36 So, I don't know.
40:37 Yeah.
40:38 Yeah.
40:38 So maybe we could talk really quickly about platforms that it runs on.
40:42 You know, I just, this is the very first show that I'm doing on my M2 Pro processor,
40:48 which is fun.
40:49 I literally been using for like an hour and a half, so I don't really have much to say, but it looks neat.
40:53 Anyway, you know, that's very different than an Intel machine, which is different than a Raspberry Pi,
40:58 which is different than, you know, some version of Linux running on ARM or on AMD.
41:04 So where, where do these, what's the, the reach?
41:07 Well, some of it we support,
41:09 and some we don't.
41:10 So Polars also has a lot of SIMD optimizations.
41:13 SIMD stands for single instruction, multiple data: where, for instance, a normal floating point operation
41:18 does a single floating point at a time, you can fill those vector lanes in your CPU,
41:24 which can fit eight floating points.
41:26 And in a single operation, it can compute eight of them.
41:29 So you have eight times the parallelism on a single core.
41:32 Those instructions are only activated for Intel.
41:36 So we don't have these instructions activated for ARM, but we do compile to ARM.
41:41 How does it perform?
41:42 I think it performs fine.
41:43 Yeah.
41:44 Yeah.
41:45 But the standard machines, right?
41:47 macOS, Windows, Linux, we're all good to go.
41:50 Yeah.
41:50 And it ships as a wheel.
41:52 So you don't have to have Rust or anything like that
41:55 hanging around.
41:55 Yeah.
41:56 Okay.
41:56 We also have conda, but conda is always lagging a bit behind.
42:00 So I'd advise pip, because we can control this.
42:05 Yeah, exactly.
42:06 You push it out to PyPI and that's what pip sees.
42:10 And it's good to go, right?
42:10 Pretty much instantly.
42:12 I guess it's worth pointing out while we're sitting here, now that I've highlighted this.
42:16 You do have a whole section in your user guide, the Polars book, called Coming from Pandas, that actually talks about the differences,
42:24 not just how do I do this operation in pandas versus in Polars,
42:29 but it also talks about some of the philosophy, like this lazy concepts that we've spoken about and a query optimization.
42:36 I feel like we covered it pretty well.
42:38 Yeah.
42:39 Unless there's maybe some other stuff that you want to throw in here really quick,
42:42 but I mostly just want to throw this out as resource.
42:44 Cause I know many people are coming from pandas and they may be interested in this,
42:48 and this is probably a good place to start.
42:50 I'll link to it in the show notes.
42:51 I think the most controversial one is that we don't have the multi-index.
42:55 You don't have anything other than a zero-based zero, one, two, three, where-is-it-in-the-array type of index.
43:00 Yeah.
43:00 Well, we will support data structures that make lookups faster,
43:05 like an index in a database sense, but it will not change the semantics.
43:10 Great.
43:11 That's important.
43:12 Okay.
43:13 Yeah.
43:14 So I encourage people who are mostly pandas people that come down here and,
43:18 you know, look through this.
43:19 It's, it's pretty straightforward.
43:20 Another thing that I think is interesting and we're talking about maybe is we could touch
43:26 a little bit on some of the, how can I, and your user guide, you've got,
43:30 how can I work with IO?
43:32 How can I work with time series?
43:33 How can I work with multiprocessing and so on?
43:36 What do you think is good to highlight out of here?
43:38 Yeah.
43:38 How do you regard it?
43:39 It's a bit outdated.
43:40 As you can see on your own.
43:42 So, for instance, the IO is changing.
43:45 Polars writes all of its own IO readers.
43:49 So we've written our own CSV reader, JSON reader, Parquet, IPC, Arrow.
43:55 And that's all in our control, but interaction with databases is often a bit more complicated.
44:01 You deal with different drivers, different ways.
44:04 And currently we do this with connector X, which is really great and allows us to read from a lot
44:09 of different databases, but it doesn't allow us to write to databases yet.
44:13 And this is now changing.
44:15 I want to explain a bit why.
44:17 So Polars is built upon the Arrow memory specification, and the Arrow memory specification
44:23 is sort of the standard of
44:28 how columnar data should be represented in memory.
44:32 And this is becoming a new standard: Spark is using it, Dremio, pandas itself.
44:38 For instance, if you read a parquet file in pandas, it first reads it into Arrow memory and then
44:43 copies that into pandas memory.
44:45 So with the Arrow memory specification becoming a standard, this is a way to share data with other processes,
44:52 and also with other libraries within a process, without copying data.
44:57 We can just swap our pointers if we know that we both support Arrow.
45:00 Oh, so Arrow defines, basically, in memory it looks like this.
45:05 Yes.
45:05 And if you both agree on that, we can just swap our pointers.
45:08 Right.
45:09 Because a .NET object, a C++ object and a Python object, those don't look like anything similar
45:15 to any of them, right?
45:16 In memory.
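A small sketch of that handoff using PyArrow; because both sides agree on Arrow's layout, the conversion is mostly pointer swapping rather than copying:

```python
import polars as pl
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "name": ["x", "y", "z"]})

df = pl.from_arrow(table)   # wraps the same Arrow buffers, no per-value copy
back = df.to_arrow()        # and back out again as an Arrow table
```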
45:18 And yeah.
45:19 So, so this is from the Apache Arrow project.
45:22 Yeah.
45:22 And this is really, really used by a lot of different tools already.
45:28 And currently ADBC is coming, the Arrow Database Connectivity,
45:32 which will solve all those problems, because then we can read and write Arrow from a lot of databases, and it will be really fast and very easy for us to do.
45:41 So luckily, that's one of those foundations of Polars I'm really happy about, because supporting Arrow and using Arrow memory gives us a lot of interoperability
45:53 with other libraries.
45:54 Yeah.
45:54 That's interesting.
45:56 And when you think of Pandas, you know, it's kind of built on top of NumPy as its core foundation,
46:01 and it can exchange NumPy arrays with other things that do that.
46:06 So Apache Arrow is kind of, kind of your, your base.
46:09 Yeah.
46:09 Well, it's kind of full circle because Apache Arrow is started by Wes McKinney.
46:13 Wes McKinney being known as the creator of Pandas.
46:17 And when he got out of Pandas, he thought, okay, the memory representation of NumPy is just not, we should not use it.
46:24 And then he was inspired to build Apache Arrow, and that's where it came from.
46:29 Yeah.
46:30 So that's how you learn about these projects, right?
46:32 This is how you realize, oh, we, we had put this thing in place.
46:36 Maybe we'll work better, right?
46:37 You, you work on a project for five years and you're like, if I got a chance to start over, but it's too late now.
46:43 But every now and then you do actually get a chance to start over.
46:46 Yeah.
46:47 Interesting.
46:47 I didn't realize that Wes was involved with both.
46:50 I mean, I knew him from Pandas, but I didn't realize the Arrow part.
46:52 Yeah.
46:52 He's the CEO of Voltron Data, which works on Arrow.
46:55 He started PyArrow, and PyArrow is sort of super big, like used everywhere, but sort of middleware.
47:03 Like, its end users are developers who build tools, and not developers who use it directly,
47:09 like that.
47:11 Right.
47:11 You might not even know that you're using it.
47:14 You just use, I just use Polars.
47:16 And oh, by the way, it happens to internally be better because of this.
47:20 Yeah.
47:20 Yeah.
47:20 Very cool.
47:21 Okay.
47:22 Let's see.
47:22 We've got a little bit of time left to talk about it.
47:25 So for example, this, some of these, how can I let me just touch on a couple that are nice
47:29 here.
47:29 So you talked about ConnectorX, you talked about the databases, but it's like three lines of code to
47:34 define a connection string, define a SQL query, and then you can just say pl.read_sql.
47:40 Yeah.
47:41 And there you go.
47:42 You call it data frame or what do you call the thing you get back here?
47:45 So reading is always a data frame.
47:47 Okay.
47:48 Scanning will be a lazy frame.
47:49 Got it.
47:49 Okay.
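Those three lines look roughly like this (connection string and query are placeholders; newer Polars versions expose this as read_database, so check your version):

```python
import polars as pl

conn = "postgresql://user:password@server:5432/database"   # placeholder
query = "SELECT * FROM orders WHERE amount > 100"          # placeholder

df = pl.read_sql(query, conn)   # eager read via ConnectorX, returns a DataFrame
```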
47:49 Is there a scan SQL as well?
47:55 This might happen in the future.
47:59 The challenge is, are we going to push down our optimizations?
48:05 Say we write out a Polars query, and then we must translate that into the SQL we send
48:05 to the database.
48:07 But that needs to be consistent over different databases.
48:12 That's a rabbit hole we might get into.
48:12 I'm not sure it's worth it because you can already do many of these operations in the SQL
48:18 query that you're sending over, right?
48:20 You have sort of two layers of query engines and optimizers and query plans.
48:25 And it's not like you can't add on additional filters, joins, sorts, and so on before it ever
48:32 gets back to you.
48:32 It would be terrible if someone writes select star from table and then writes the filters in
48:42 polars, and then the database has to send all that data over the network.
48:42 So yeah, ideally we'd be able to push those predicates down into the SQL.
48:47 Yeah.
48:48 But you know somebody's going to do it, because they're more comfortable writing
48:51 the Polars API in Python than they are writing T-SQL.
48:54 Yeah.
48:55 No doubt.
48:56 Yeah.
48:56 If it's possible, someone will write it,
48:58 even if it's not optimal.
48:59 That's right.
49:00 That is right.
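To make the trade-off concrete, here is a hedged sketch with hypothetical table and column names: filtering in Polars after a select star ships the whole table over the network, while putting the predicate in the SQL lets the database do that work:

```python
import polars as pl

uri = "postgresql://user:password@localhost:5432/mydb"

# Anti-pattern: the database sends every row; Polars filters only afterwards.
df_slow = pl.read_sql("SELECT * FROM orders", uri).filter(pl.col("amount") > 100)

# Better: push the predicate into the SQL query itself.
df_fast = pl.read_sql("SELECT * FROM orders WHERE amount > 100", uri)
```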
49:01 Let's see what else can you do here.
49:03 So, we've already talked about the CSV files, and this is the part I was talking
49:08 about where you've got the toggle to see the Rust code and the Python code.
49:12 So I think people might appreciate that. Parquet files.
49:15 So Parquet files are a more efficient format.
49:19 Maybe talk about using Parquet files versus CSV, and why you might want to get rid of your
49:24 CSV, store these intermediate files in Parquet, and then load them.
49:28 But this is really Polars' own Parquet reader.
49:31 I really did my best on that.
49:32 But you can use Parquet or Arrow IPC, because your data is typed.
49:39 There's no ambiguity on reading.
49:41 We know what type it is.
49:43 Right.
49:43 Because CSV files, even though it might be representing a date, it's still a string.
49:47 Yeah, we need to parse it.
49:49 Yeah, it's slow to parse it.
49:51 Yeah.
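A small sketch of the one-time conversion being described (file names are placeholders, and the date-parsing flag has been spelled both parse_dates and try_parse_dates across Polars versions):

```python
import polars as pl

# Pay the parsing cost once: read the untyped CSV and persist it as Parquet.
pl.read_csv("events.csv", try_parse_dates=True).write_parquet("events.parquet")

# From here on, reads are unambiguous: the dtypes (dates, ints, floats, ...)
# are stored in the file, so nothing has to be re-parsed.
df = pl.read_parquet("events.parquet")
```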
49:52 But also, Parquet interacts really nicely with query optimization.
49:58 So we can select just a single column from the file without touching any of the other
50:02 columns.
50:03 We can read statistics.
50:04 So a Parquet file can store statistics, which say, okay, this page has got this maximum
50:10 value, this minimum value.
50:11 And if you have written a Polars query which says, give me the result where the
50:16 value is larger than this,
50:18 and we see that the statistics say it cannot be in this file,
50:22 we can just skip the whole file.
50:24 We don't have to read it.
50:25 Yeah.
50:25 Oh, interesting.
50:26 Wow.
50:26 Okay.
50:26 So there are a lot of optimizations, because the best work is work you don't have to do,
50:32 and Parquet allows a lot of that.
50:33 Exactly.
50:34 Or you've done it when you created the file and you never do it again or something like that.
50:39 Yeah.
50:39 Yeah.
50:39 So you've got a read parquet, a scan parquet, I suppose that's the data frame versus lazy frame.
50:46 And then you also have the ability to write them.
50:47 That's pretty interesting.
50:48 JSON, multiple files.
50:50 Yeah.
50:50 Yeah.
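Roughly how those Parquet optimizations look in code, sketched with a hypothetical file and columns: scanning returns a lazy frame, so both the filter and the column selection can be pushed down into the reader, which uses the file's statistics to skip data:

```python
import polars as pl

# scan_parquet returns a LazyFrame; nothing is read from disk yet.
lf = (
    pl.scan_parquet("events.parquet")
    # Predicate pushdown: row groups whose min/max statistics rule this
    # value range out are skipped entirely.
    .filter(pl.col("value") > 1_000)
    # Projection pushdown: only these two columns are ever read.
    .select(["timestamp", "value"])
)

df = lf.collect()  # the optimized plan runs here
```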
50:51 There's just a whole bunch of how-do-I's, how-can-I's rather, a bunch of neat things.
50:55 What else would you like to highlight here in the next couple minutes?
50:57 The most important thing I want to touch on is the expression API.
51:00 So that's a bit, if you go a bit higher.
51:02 So if you scroll up: Polars expressions.
51:05 We've got our own chapter.
51:06 There you go.
51:07 One of the goals of the Polars API is to keep the API small, but give you a lot of things
51:13 you can do.
51:14 And this is where the Polars expressions come in.
51:16 So Polars expressions describe what you want to do, are run and parallelized on the query engine, and can be combined in depth.
51:25 So an expression takes a series and produces a series, and the input
51:29 is the same kind as the output.
51:30 You can combine them.
51:31 And as you can see, we can do pretty complicated stuff and you can keep chaining them.
51:36 And this is the same idea as in a programming language.
51:38 Like, the way I'd like to see it:
51:40 the Python vocabulary is quite small.
51:43 So we have a while loop, we have a for loop, we have variable assignment.
51:46 I think it fits into maybe two pieces of paper, but with this, you can write
51:52 any program you want with the combination of all of that vocabulary.
51:57 Yeah.
51:58 And that's what we want to do with the Polars expressions as well.
52:00 So you've got a lot of small building blocks, which can be combined into anything you need.
52:06 Yeah.
52:06 So somebody could say, I want to select a column back, but then I don't want the actual
52:12 values.
52:12 I want the unique ones, a uniqueness.
52:15 So if there's a duplicate, remove those, and then you can do a .count().
52:18 Then you can add an alias, which basically defines the new column name.
52:23 You could read it as an 'as', well, it's not named that.
52:25 It's, it's...
52:27 We would take the column name 'as', you see, but 'as' is the keyword in Python,
52:32 so we're not allowed to use it.
52:33 Right.
52:34 It means something else.
52:35 Yeah.
52:36 That's, that's interesting.
52:37 Okay.
52:38 Yeah.
52:38 So people use these expressions to do lots of transformations and filterings and things
52:45 like that.
52:46 Yeah.
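The exact chain described above, sketched with a hypothetical column name:

```python
import polars as pl

df = pl.DataFrame({"names": ["foo", "ham", "spam", "ham"]})

# Take the column, drop duplicates, count what's left,
# and name the resulting column via alias.
out = df.select(
    pl.col("names").unique().count().alias("unique_names")
)
```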
52:46 So these expressions can be used in a select and in different places, and the knowledge of expressions
52:52 carries over to those different locations.
52:54 So you can do it in a select statement: you select a column,
52:57 then you select this expression, and you get a result.
53:00 But you can also do this in a group by aggregation.
53:02 And then the same logic applies.
53:04 It runs on the same engine and we make sure everything is possible.
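And the same expression reused inside an aggregation, as a sketch (older Polars releases spell the method .groupby(), newer ones .group_by()):

```python
import polars as pl

df = pl.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "names": ["foo", "foo", "ham", "spam", "ham"],
})

# The identical expression, now evaluated per group on the same engine.
out = df.group_by("group").agg(
    pl.col("names").unique().count().alias("unique_names")
)
```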
53:07 And this is really powerful, because it's so expressive that people don't have to use
53:14 a custom apply with a lambda, because when you use a lambda, it's like a black box to us.
53:18 It will be slow because it's Python, and we don't know what happens inside.
53:21 So a lambda will be slow,
53:23 and it will kill parallelization because of the GIL.
53:25 But yeah.
53:26 So a lambda is kind of a last resort.
53:28 Right.
53:29 It gets in the way of a lot of your optimizations and a lot of your speed-ups there.
53:34 That's why we want to make this expression API very complete,
53:37 so you don't need that as much.
53:39 Yeah.
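A hedged before-and-after sketch of the lambda point (the apply method has since been renamed map_elements in newer Polars releases):

```python
import polars as pl

df = pl.DataFrame({"x": [1, 2, 3, 4]})

# Opaque to the engine: runs row by row in Python and blocks parallelism.
slow = df.select(pl.col("x").apply(lambda v: v * 2 + 1))

# Expression form: the engine sees the whole computation and can
# vectorize and parallelize it.
fast = df.select(pl.col("x") * 2 + 1)
```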
53:40 So, people wanting to get seriously into this,
53:42 they should check out chapter three, Expressions, right?
53:45 And just go through there.
53:46 And probably, especially, you know, sort of browse through the Python examples so they
53:50 can see what they need to learn more about.
53:54 But it's a very interesting API.
53:56 The speed is a very compelling thing.
53:59 Yeah.
53:59 I think it's a cool project.
54:00 And like I said, how many people we got here?
54:02 13,000 people using it already.
54:03 So that's, that's a big community.
54:06 Yeah.
54:06 So if you're interested in the project, we have a Discord where you can chat with us
54:10 and ask questions and see how you can best do things.
54:14 Pretty active there.
54:15 Cool.
54:15 The discord's linked right off the homepage.
54:17 So that's awesome.
54:18 People can find it there.
54:19 Contributions.
54:20 People want to make contributions.
54:22 I'm sure you're willing to accept PRs and other feedback.
54:25 But if you put in a really large PR, please first open an issue
54:31 to start a discussion of whether this contribution is welcome.
54:34 And we also have a few getting-started issues.
54:37 Good for the contributors.
54:39 Okay.
54:39 Yes.
54:40 You've tagged or labeled some of the issues as, look here if you want to
54:45 get into this.
54:46 Yeah.
54:46 I must say, I think we're an interesting project to contribute to, because
54:50 not everything is set in stone.
54:53 So there are still places where you can play,
54:56 where we're not sure yet,
54:57 and there's still interesting work to be done.
54:59 It's not completely 100% polished and finalized.
55:04 Yeah.
55:04 On the periphery.
55:05 Yeah.
55:07 Very cool.
55:07 Let's wrap it up with a comment from the audience here.
55:09 Ajit says, excellent content guys.
55:12 It certainly helps me kickstart my journey from Pandas to Polars.
55:16 Awesome.
55:16 Awesome.
55:17 Glad, glad to help.
55:18 I'm sure it will.
55:19 Many people do that.
55:20 So Richie, let's close it out with a final call to action.
55:23 People are interested in this project.
55:25 They want to start playing with and learning Polars,
55:27 maybe try it out on some code they have at the moment.
55:30 What do they do?
55:31 I'd recommend, if you have a new project, just start in Polars.
55:34 You can also rewrite some components, but the most fun experience will be to just start a new
55:41 project in Polars,
55:42 because then you can really enjoy what Polars offers.
55:46 Learn the expression API, learn how to use it declaratively.
55:49 And yeah, then it will be most fun.
55:52 Absolutely.
55:53 Sounds great.
55:53 And like we did point out, it has the to- and from-Pandas data frame conversions.
55:58 So you can work on a section of your code and still have it consistent, right,
56:02 with the other parts that have to stay in Pandas.
56:04 Yeah.
56:04 You can progressively rewrite some performance heavy parts.
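A minimal sketch of that progressive-rewrite workflow (the frames here are toy data):

```python
import pandas as pd
import polars as pl

pdf = pd.DataFrame({"x": [1, 2, 3]})

# Bring the pandas DataFrame into Polars, do the heavy work there...
result = pl.from_pandas(pdf).with_columns((pl.col("x") * 10).alias("x10"))

# ...then hand it back to the pandas-based parts of the codebase.
pdf_out = result.to_pandas()
```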
56:08 Or, I also think, Polars is really strict on the schema and the types.
56:13 So if you write any ETL, you will be really happy with that,
56:17 also because you can check the schema of a lazy frame before executing it.
56:21 So you know it upfront, before running the query, and if the data comes in and it doesn't
56:26 conform to this schema, you can fail fast,
56:30 instead of having strange outputs.
56:31 Oh, that's interesting, because you definitely don't want a zero when you expected something else
56:37 because it couldn't parse, or other weird things, you know, whatever, right?
56:40 Yeah.
56:40 Yeah.
56:40 And the same goes for missing data; Polars doesn't change the schema.
56:45 Yeah.
56:45 So Polars is strict there.
56:47 The schema is defined by the operations, and not by the values in the data.
56:53 So you can definitely check.
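A sketch of that fail-fast schema check (the file, columns, and expected schema are hypothetical; newer Polars versions prefer lf.collect_schema() over the lf.schema property):

```python
import polars as pl

lf = (
    pl.scan_csv("incoming.csv")  # hypothetical ETL input
    .with_columns(pl.col("amount").cast(pl.Float64))
)

# The schema is resolved from the operations, without executing the query.
expected = {"id": pl.Int64, "amount": pl.Float64}
for name, dtype in expected.items():
    assert lf.schema.get(name) == dtype, f"schema drift on {name!r}"

df = lf.collect()  # only runs once the check has passed
```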
56:55 Got it.
56:55 Excellent.
56:56 All right.
56:57 Well, congratulations on a cool project.
56:59 I'm glad we got to share with everybody.
57:00 Thanks for coming on the show.
57:01 Bye.
57:02 You bet.
57:02 Bye.
57:03 Bye.
57:04 This has been another episode of Talk Python to me.
57:06 Thank you to our sponsors.
57:08 Be sure to check out what they're offering.
57:10 It really helps support the show.
57:11 Typy is here to take on the challenge of rapidly transforming a bare algorithm in Python
57:16 into a full-fledged decision support system for end users.
57:20 Get started with Typy Core and GUI for free at talkpython.fm/Typy.
57:25 T-A-I-P-Y.
57:27 Earn extra income from sharing your software development opinion at User Interviews.
57:32 Head over to talkpython.fm/userinterviews to participate today.
57:36 Want to level up your Python?
57:39 We have one of the largest catalogs of Python video courses over at Talk Python.
57:43 Our content ranges from true beginners to deeply advanced topics like memory and async.
57:48 And best of all, there's not a subscription in sight.
57:50 Check it out for yourself at training.talkpython.fm.
57:53 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.
57:58 We should be right at the top.
57:59 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct
58:05 RSS feed at /rss on talkpython.fm.
58:08 We're live streaming most of our recordings these days.
58:12 If you want to be part of the show and have your comments featured on the air,
58:15 be sure to subscribe to our YouTube channel at talkpython.fm/youtube.
58:20 This is your host, Michael Kennedy.
58:22 Thanks so much for listening.
58:23 I really appreciate it.
58:24 Now get out there and write some Python code.
58:26 I'll see you next time.