#402: Polars: A Lightning-fast DataFrame for Python [updated audio] Transcript

Recorded on Sunday, Jan 29, 2023.

00:00 When you think about processing tabular data in Python, what library comes to mind?

00:05 Pandas, I'd guess.

00:06 But there are other libraries out there, and Polars is one of the more exciting new ones.

00:11 It's built in Rust, embraces parallelism, and can be 10 to 20 times faster than Pandas out of the box.

00:17 We have Polars creator, Ritchie Vink, here to give us a look at this exciting new DataFrame library.

00:23 This is Talk Python to Me, episode 402, recorded January 29th, 2023.

00:30 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy.

00:48 Follow me on Mastodon where I'm @mkennedy and follow the podcast using @talkpython, both on fosstodon.org.

00:56 Be careful with impersonating accounts on other instances, there are many.

00:59 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

01:05 We've started streaming most of our episodes live on YouTube.

01:09 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.

01:17 This episode is brought to you by Taipy. Taipy is here to take on the challenge of rapidly transforming a bare algorithm in Python into a full-fledged decision support system for end users.

01:27 Check them out at talkpython.fm/taipy, T-A-I-P-Y.

01:32 And it's also brought to you by User Interviews. Earn extra income for sharing your software developer opinion.

01:38 Head over to talkpython.fm/userinterviews to participate today.

01:43 Hey, Ritchie, welcome to Talk Python to Me.

01:46 I feel like maybe I should rename my podcast Talk Rust to Me or something.

01:52 I don't know.

01:52 Rust has taken over as the low-level part of how we make Python go fast.

01:59 There's some kind of synergy with rust.

02:01 What's going on there?

02:02 Yeah, there is.

02:03 I'd say it was always low-level languages that made Python a success.

02:09 I mean, NumPy, Pandas, everything that was reasonably fast was so because of C or Cython, which is also C. But unlike C, Rust has made low-level programming a lot more fun and a lot safer.

02:23 And especially if you regard multi-threaded programming, parallel programming, concurrent programming, it is a lot easier in Rust.

02:32 It opens a lot of possibilities.

02:34 Yeah.

02:34 My understanding, and I've only given Rust a cursory look, just scanned some examples (and we're going to see some examples of code in a little bit, actually related to Polars), is that it's kind of a low-level language.

02:47 It's not as simple as Python.

02:49 Nor as JavaScript, but it is easier than C or C++, not just in the syntax: it does better memory tracking for you, and the concurrency especially, right?

03:00 Yeah.

03:01 Well, Rust brings a whole new thing to the table, which is called ownership and the borrow checker.

03:07 And Rust is really strict.

03:08 There are things Rust does not let you do that C or C++ allows. At any one time,

03:13 there can only be one owner of a piece of memory. You can lend out this piece of memory to other users, but then they cannot mutate it, so there can be only one owner that is able to mutate something. This restriction makes Rust a really hard language to learn, but once it has clicked, once you've gotten over that steep learning curve, it becomes a lot easier, because it doesn't allow things that you could do in C and C++ but also shouldn't do, because they probably lead to segfaults and memory issues.

03:47 And this borrow checker also makes writing concurrent programs safe.

03:52 You can have many threads reading a variable all they want.

03:55 They can read concurrently.

03:57 It's when you have writers and readers that this whole thread safety, critical sections, take your locks, are the locks re-entrant, all of that really difficult stuff comes in.

04:06 And so.

04:07 Yes.

04:08 Sounds like an important key to me.

04:10 Yeah, and the same borrow checker also knows when memory has to be freed and when not.

04:16 But unlike Go or Java, where you have a garbage collector, it doesn't have to do garbage collection, and it doesn't have to do reference counting like Python does.

04:25 It does so statically, at compile time: it knows when something is out of scope and not used anymore.

04:31 And this is real power.

04:32 I guess the takeaway for listeners who are wondering why Rust is seemingly taking over so much of the job that C and variations of C, like you said, Cython, have traditionally played in Python: it's easier to write modern, faster, safer code. Yeah. Yeah. And it's more fun too, right? Yeah, definitely. And it's a language which has got its tools right. It's got a package manager, which is really great to use. It's got crates.io, which is similar to the PyPI index. It feels like a modern language.

05:03 It's built for more low-level code, but you can also write high-level stuff like REST APIs. I must say, also for high-level stuff I like to write it in Rust because of the safety guarantees and also the correctness guarantees. If my program compiles in Rust, I'm much more certain it is correct than when I write my Python program, which is dynamic and where types are not enforced. So it's always a bit of a gray area on that side.

05:30 Python is great to use, but it's harder to write correct code.

05:33 Yeah.

05:34 And you can optionally write very loose code, or you could opt into things like type hints and even mypy, and then you get closer to the static languages, right?
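To make the opt-in nature of type hints concrete, here is a minimal sketch. The function name is made up for illustration. Python ignores the annotations at runtime; only a checker such as mypy would flag the bad call before the program runs.

```python
def total_length(names: list[str]) -> int:
    """Return the combined length of all names."""
    return sum(len(name) for name in names)


# Correct usage: the argument matches the annotation.
print(total_length(["polars", "pandas"]))

# A static checker such as mypy would reject the call below, because a
# list of ints is not a list of strs. Python itself does not enforce the
# hint, so the mistake only surfaces at runtime, as a TypeError from len().
try:
    total_length([1, 2, 3])  # type: ignore[arg-type]
    failed_at_runtime = False
except TypeError:
    failed_at_runtime = True

print("failed at runtime:", failed_at_runtime)
```

This is the "loose by default, strict if you opt in" trade-off discussed here: the same file runs either way, and the checker is a separate step.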

05:45 Are you a fan of typing?

05:47 Definitely.

05:48 But because they're optional, they are as strong as the weakest link.

05:52 So if one library you use doesn't get its types correct, or doesn't do typing at all, it breaks. It's quite brittle because it's optional.

06:01 I hope we get something that really enforces it and really can check it.

06:05 I don't know if it's possible because of the dynamic nature of Python.

06:09 Python can do so many things just dynamically.

06:13 Statically, we just cannot know, probably.

06:15 I don't know how far it can go.

06:19 In Polars as well, we use mypy type hints, which prevent us from having a lot of bugs and also make the IDE experience much nicer.

06:28 Yeah, type hints are great.

06:30 They really help you also think about your library.

06:32 I think you really see a shift between modern Python and the Python of 10 years ago, when the dynamic nature of Python was seen as more of a strength than it is currently, I believe.

06:46 - Yeah, I totally agree.

06:47 And I feel like when type hints first came out, this was, wow, at this point, kind of early Python 3. But it didn't feel like it at the time; Python 3 had been out for quite a while.

06:58 When type hints were introduced, I feel like that was Python 3.4. But anyway, that would put it maybe six years into the life cycle of Python 3. But still, I feel like a lot of people were suspicious of that at the moment. You know, they're like, oh, what is this weird thing? We're not really sure we want to put these types into our Python. And now there are a lot fewer of those reactions. I see, yeah, I often see Python as the really fun, nice to use, duct-tape language where, for instance in a Jupyter Notebook, I can just hack away and try interactively what happens, and for such code, type hints don't matter.

07:38 But once I write more of a library or product or tool, then type hints are really great. I believe they came about because Dropbox really needed them.

07:47 They had a huge Python program.

07:49 They had real trouble maintaining it without types.

07:52 But I'm not really sure.

07:53 - Yeah, and I heard some guy who has something to do with Python used to work there.

07:56 - Yeah, yeah, yeah.

07:57 - Guido used to work there, I think even at that time.

07:59 All right, so a bit of a diversion from how I often start the show.

08:02 So let's just circle back real quick and get your story.

08:05 How'd you get into programming and Python and Rust as well, I suppose?

08:08 - I got into programming, I just wanted to learn programming.

08:11 A friend of mine who did program, a lover of PHP, said to learn Python, just like that.

08:16 I found an interactive website where I could do some puzzles, and I really got hooked on it.

08:23 It was a fun summer and I was programming a lot. I started automating my job. I was a civil engineer at the time I started. There were a lot of mundane, repetitive tasks, and I just found ways to automate my job. I was doing that for a year or three, and then I got into data science and switched jobs, became a data scientist and later a data engineer.

08:47 Yeah, so that was Python mostly.

08:49 I've always been looking for more languages.

08:52 I've been playing with Haskell, I've been playing with Go, I've been playing with JavaScript, not just with the JavaScript, but playing with Scala.

09:00 And then I found Rust, and Rust really, really made me happy.

09:04 Like, you learn a lot about how computers work.

09:07 So I had a new renaissance of the first experience with Python, another summer at Rust, and been doing a lot of toy projects, like writing an interpreter, I don't know, a lot of projects, and Polars became one of those hobby projects just to use Rust more.

09:23 - Now it's got quite the following, and we're gonna definitely dive into that, but let me pull it up.

09:29 It says right here, 13,000 GitHub stars.

09:31 That's a good number of people using that project.

09:36 - Yeah, yeah. - It's pretty crazy, isn't it? - Yeah, it is.

09:37 It's on GitHub stars.

09:40 It's the fastest growing data tool, I believe.

09:42 - Wow, incredible.

09:43 You must be really proud of that.

09:46 - Yeah, yeah.

09:47 If you had told me this two years ago, I would never have believed it.

09:50 But it happens slowly enough that you can get accustomed to it.

09:54 - Yeah, that's cool.

09:56 Kind of like being a parent.

09:57 The challenges of the kids are small.

10:00 They're intense, but there are only a few things they need when they're small and you kind of grow with it.

10:04 So a couple of thoughts.

10:06 One, you had the inverse style of learning to program that I think a lot of computer science people do, and certainly that I did.

10:13 It could also just be that I learned it a long time ago.

10:16 But when I learned programming, it was I'm gonna learn C and C++, and then you're kind of allowed to learn the easier languages, but you will learn your pointers.

10:26 You'll have your void star star, and you're gonna like it.

10:29 You're gonna understand what a pointer to a pointer means. I mean, you start inside, at the most complex, closest to the machine, and you work your way out. You kind of took the opposite path: let me learn Python, where it's much more high level, and, if you choose to, you can stay much further away from the hardware and the ideas of memory and threads and all that.

10:51 And then you went to Rust.

10:52 So was it kind of an intense experience?

10:54 You were like, oh my gosh, this is intense.

10:57 Or had you studied enough languages by then to become comfortable?

11:00 - Well, yeah, yeah, no.

11:01 So the going from high level to low level, I think it makes natural sense if you've learned it yourself.

11:08 There's no professor telling me you learn your pointers.

11:12 So I think this also helped a lot because at that point, you're really accustomed to programming, to algorithms.

11:18 So you can, I believe you should learn one thing, one new thing at a time, and then you can really own that knowledge later on.

11:26 But Rust, I wouldn't say you should learn Rust as a first language.

11:29 It would be really terrible because you need, that would be terrible.

11:34 But other languages also don't help you much, because the borrow checker is quite unique.

11:39 It doesn't let you do things you can do in other languages.

11:42 So what you learn there, the languages that allow you to do that, they just hurt you because you were--

11:49 - They encourage the wrong behavior, right?

11:52 - Well, yeah, so nine out of 10 times, it turns out that with the compiler not letting you do that one thing, that one thing you wanted was probably really bad to begin with, led to really...

12:03 So in Rust, your code is always a lot flatter.

12:06 It's always really clear who owns the memory, how deep your nesting is.

12:10 It's always one D deeper.

12:13 Most of the times, it's not that complicated.

12:16 You make things really flat and really easy to reason about.

12:20 And in the beginning of a project, it seems okay, a bit over-constraining.

12:24 But software will become complex and complicated, and then you're happy that the compiler nudged you in this direction.

12:32 - It seems like a better way, honestly.

12:34 You know, you get a sense of programming in a more simple language that doesn't ask so many low-level concepts of you.

12:42 And then you're ready, you can add on these new ones.

12:45 So I feel like a lot of how we teach programming and how people learn programming is a little bit backwards, to be honest.

12:50 - Yeah. - Anyway, enough on that.

12:52 So you were a civil engineer for a while, and then you became a data scientist, and now you've created this library.

12:57 Still working as a data scientist now?

12:59 - No, no.

13:00 I got sponsored two years ago for two days a week.

13:03 And yeah, I just used that time to develop Polars.

13:07 And currently I've stopped all my day jobs and am going full time on Polars.

13:13 I'm trying to live from sponsorships, which is not really working.

13:17 It's not enough at this time, but I hope to start a foundation and get some proper sponsors in.

13:22 - Yeah, that'd be great.

13:24 - Yeah, it's useful.

13:25 - That's awesome.

13:26 It's still awesome that you're able to do that, even if you still needed to grow a little bit.

13:30 - Yeah.

13:31 - We'll have you on a podcast and let other people know out there who maybe are using your library.

13:35 Maybe they can put a little sponsorship in GitHub sponsors.

13:39 I feel like GitHub sponsors really made it a lot easier for people to support.

13:44 'Cause there used to be like PayPal donate buttons and other things like that.

13:50 And one, those are not really recurring.

13:52 And two, you've got to go find some place and put your credit card.

13:56 Many of us already have a credit card registered at GitHub.

13:59 It's just a matter of checking the box and monthly it'll just go.

14:02 You know, it's kind of like the app store versus buying independent apps.

14:05 It just cuts down a lot of the friction.

14:07 I feel like it's been really positive, mostly for open source.

14:10 - Yeah, I think it's good to, as a way to say thank you.

14:14 It isn't enough to pay the bills.

14:16 I think for most developers it isn't, but I hope we get there.

14:19 I think companies who use it should give a bit more back.

14:24 I mean, they have a lot of money.

14:25 - I agree.

14:26 It's really, really ridiculous that there are banks and VC-funded companies and things like that, not necessarily in terms of the VC ones, but definitely in terms of financial and other large companies, that make billions and billions of dollars in profit on top of open source technology. And many of them don't give anything back. It's not criminal, because the licenses allow it, but it certainly borders on immoral to make all this money and not at all support the people who are really building the foundations that we build upon.

14:58 Most of my sponsors are developers. Yeah. So, yeah, let's hope it changes. I don't know.

15:05 Yeah. Well, I'll continue to beat that drum.

15:07 This portion of Talk Python to Me is brought to you by Taipy.

15:13 Taipy is the next generation open source Python application builder. With Taipy, you can turn data and AI algorithms into full web apps in no time. Here's how it works.

15:23 You start with a bare algorithm written in Python.

15:26 You then use Taipy's innovative tool set that enables Python developers to build interactive end-user applications quickly.

15:33 There's a visual designer to develop highly interactive GUIs ready for production.

15:37 And for inbound data streams, you can program against the Taipy core layer as well.

15:42 Taipy core provides intelligent pipeline management, data caching, and scenario and cycle management facilities.

15:48 That's it.

15:49 You'll have transformed a bare algorithm into a full-fledged decision support system for end users.

15:55 Taipy is pure Python and open source, and you install it with a simple pip install taipy.

16:00 For large organizations that need fine-grained control and authorization around their data, there is a paid Taipy Enterprise Edition, but the Taipy core and GUI described above are completely free to use.

16:11 Learn more and get started by visiting talkpython.fm/taipy, that's T-A-I-P-Y. The link is in your show notes.

16:19 Thank you to Taipy for sponsoring the show.

16:21 Let's talk about your project.

16:23 So Polars and the RS is for Rust.

16:27 I imagine at the end, but tell us about the name Polars, like Polar Bear, but Polars.

16:32 Yeah.

16:32 So I started writing a DataFrame library, and initially it was only for Rust.

16:37 It was my idea. Until you saw all the people doing data science in Python and you're like, oh, what can I do for these people?

16:45 Right.

16:45 Yeah.

16:46 Yeah.

16:46 And I wanted to give a wink to the Pandas project, but I wanted a bear that was better, faster, I don't know, stronger. Luckily, a panda bear isn't the most frightful bear, so I had a few to choose from: the grizzly, and, yeah, the polar bear has the "rs", so that's a lucky coincidence.

17:06 Yeah. So the subtitle here is Lightning Fast DataFrame Library for Rust and Python. And you have two APIs that people can use.

17:15 We'll get to dive into those.

17:16 - Yeah, because it's written in Rust, it's a complete library of its own in Rust, and you can expose it to many front-ends.

17:23 So there are already front-ends in Rust, Python, and Node.js; R is coming up, normal JavaScript is coming up, and Ruby, there is also a Polars Ruby.

17:34 So-- - How interesting.

17:35 So for the JavaScript one, are you gonna use WebAssembly?

17:38 - Yeah. - Right, which is pretty straightforward: Rust comes from Mozilla, and WebAssembly, I believe, also originated there, so they're somewhat tied together.

17:47 Yeah.

17:48 So Rust, C++, and C can compile to WebAssembly.

17:52 It's not really straightforward because the WebAssembly virtual machine isn't like your normal OS.

17:57 So a lot of things are harder, but we are working on the challenges.

18:02 Okay.

18:03 Well, that's pretty interesting, but for now you've got Python and you've got Rust and that's great.

18:07 I think a lot of people listening, myself included when I started looking into this, immediately go to, it's like pandas, but Rust.

18:15 (laughs)

18:16 You know, it's like pandas, but instead of C at the bottom, it's Rust at the bottom.

18:20 And that's somewhat true, but mostly not true.

18:23 So let's start with you telling us, you know, how is this like pandas and how is it different from pandas?

18:29 - Yeah, so it's not like pandas.

18:32 I think it's different on two ways.

18:34 So we have the API and we have the implementation.

18:37 And which one should I start with?

18:39 Bottom up?

18:40 That's, I think, bottom up.

18:41 Yeah, bottom up. Sure.

18:42 Yeah.

18:43 So that was my critique of pandas.

18:46 They didn't start bottom up.

18:48 They took whatever was there already, which was good for its own purpose.

18:53 And Pandas built on NumPy.

18:55 And NumPy is a great library.

18:58 But it's built for numerical processing and not for relational processing.

19:01 Relational data is completely different.

19:04 You have string data, you have nested data, and this data is just put as Python objects in those NumPy arrays.

19:11 And if you know anything about memory, then in this array you have pointers, where each Python object is somewhere else.

19:18 So if you traverse this memory, for every pointer you hit you must look it up somewhere else.

19:23 That memory is not in cache, so we have a cache miss, which is a 200x slowdown per element you traverse.

19:29 - Yeah.

19:29 So for people listening, the 200x slowdown you're talking about: there are the L1, L2, and L3 caches, which all have different speeds, but compared with the caches near the CPU, main memory is like 200 to 400 times slower, and that's not paging off a disk or something.
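The difference between a packed buffer and an array of pointers can be seen with the standard library alone. This is only an analogy, assuming CPython: `array.array` stands in for a numeric NumPy array, and a plain list stands in for a NumPy array of Python objects. It shows the layout, not the cache timings mentioned above.

```python
import sys
from array import array

values = [1.5, 2.5, 3.5, 4.5]

# Packed layout: raw 8-byte doubles stored back to back in one buffer,
# similar in spirit to a numeric NumPy array.
packed = array("d", values)
payload_bytes = packed.itemsize * len(packed)
print("bytes per packed element:", packed.itemsize)
print("packed payload bytes:", payload_bytes)

# Boxed layout: the list holds pointers, and each float is its own
# object elsewhere on the heap, similar in spirit to dtype=object.
# Reading an element means following a pointer to that object first.
boxed = list(values)
boxed_object_bytes = sys.getsizeof(boxed[0])
print("bytes for one boxed float object:", boxed_object_bytes)
```

Traversing the packed buffer touches neighbouring memory, which caches handle well. Traversing the boxed list jumps to wherever each object happens to live.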

19:45 It's, it's really different, right?

19:47 It's really a big deal.

19:47 It's a big deal.

19:48 It's terribly slow.

19:49 Also, Python has a GIL.

19:51 It also blocks multi-threading.

19:54 If you want to read the string, you cannot do this from different threads.

19:57 If you want to modify the string, there's only one thread that can access this Python GIL.

20:02 So they also didn't take into account anything from databases.

20:07 Databases are basically from the 1950s.

20:11 There's been a lot of research in databases and how we do things fast: write a query and then optimize this query, because the user that uses your library is not the expert.

20:21 They don't write an optimized query.

20:23 No, but we have a lot of information so we can optimize this query and execute this in the most, in a very efficient way.

20:30 That's an interesting idea.

20:31 Yeah, Pandas just executes it and gives you what you ask.

20:35 What you ask is probably not...

20:37 Yeah, that's interesting because as programmers, when I have my Python hat on, I want my code to run exactly as I wrote it.

20:45 I don't want it to get clever and change it.

20:48 If I said do a loop, do a loop.

20:50 If I said put it in a dictionary, put it in a dictionary.

20:53 But when I write a database query, be that against Postgres with Relational or MongoDB, There's a query planner and the query planner looks at all the different steps.

21:04 Should we do the filter first?

21:06 Can we use an index?

21:07 Can we use a compound index? Which index should we choose?

21:09 All of those things, right?

21:11 And so, what you tell it and what happens: you don't tell the database how to find the data, you just give it the expressions that you need, the predicates that you need it to work with.

21:23 And then you figure it out.

21:24 You're smart.

21:25 You're the database.

21:26 So one of the differences I got from reading what, what you've got here so far is it looks like, I don't know if it goes as far as this database stuff that we're talking about, but there's a way for it to build up the code it's supposed to run and it can decide things like, you know, these two things could go in parallel or things along those lines.

21:44 Right.

21:44 Yeah.

21:45 Well, it is actually very similar.

21:47 It is a vectorized query engine, and the only thing that doesn't make us a database is that we don't bother with file structures, right? Like the persistence and transactions. Yeah. So you have different types of databases: you have OLAP and OLTP. Transactional modeling often works on one record: if you do a REST API query and you modify one user ID, then you're transactional. If you're doing OLAP, that's more analytical: you do large aggregations over whole tables, and you need to process all the data.

22:20 And those different database designs lead to different query optimizers.

22:24 And Polars is focused on OLAP.

22:26 But yeah, so as you described, you've got two ways of programming things.

22:30 One is procedural, which Python mostly is.

22:33 So you tell exactly, if you want to get a cup of coffee, how many steps it should take forward, then rotate 90 degrees, take three steps, then rotate 90 degrees.

22:42 You can write down the whole algorithm how to get a coffee.

22:46 Or you could just say, get me a coffee.

22:48 I'd like some sugar and then let the query engine decide how to best get it.

22:54 Right.

22:54 And that's more declarative.

22:56 You describe the end result.

22:57 And as it turns out, this is also very readable because you declare what you want and the intent is readable in the query.

23:05 And if you're doing more procedural programming, you describe what you're doing.

23:09 And the intent often needs to come from comments, like what are we trying to do when we follow this out?

23:15 Right. Yeah, that makes a lot of sense.

23:16 And that's why...

23:17 Yeah, sorry. And that's why the...

23:19 So the first thing is we wrote a query engine from scratch and really thought about multiprocessing, about caches, and also about out-of-core processing, so we can process data that doesn't fit into memory.

23:33 So we really built this from scratch with all those things in mind.

23:37 And then at first we wanted to expose the pandas API, and then we noticed how bad it was for fast data processing.

23:46 The pandas API just isn't really good for this declarative analysis of what the user wants to do.

23:51 So we just cut it off and took the freedom to design an API that makes most sense.

23:56 That's interesting.

23:58 I didn't realize that you had started trying to be closer to pandas than you ended up.

24:02 Yeah.

24:03 Well, it was very short lived, I must say.

24:05 It was painful.

24:07 Yeah.

24:07 And that's not necessarily saying pandas is bad, I don't think.

24:11 It's approaching the problem differently and it has different goals, right?

24:14 Yeah.

24:14 So maybe we could look at an example of some of the code that we're talking about.

24:19 I guess also one of the other differences there is much of this has to do with what you would call, I guess you refer to them as lazy APIs or streaming APIs, kind of like a generator.

24:31 Yeah.

24:31 So think about a join, for instance. In pandas, if you write a join and then only want the first 100 rows of the result, it would first do the join, which might produce 1 million or 10 million rows, and then you take only 100 of them. You have materialized a million rows but use only a fraction of that.

24:52 And by being lazy, you can optimize the whole query at once and see: oh, we do this join, but we only need 100 rows, so we only materialize 100 rows.
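The join-then-take-100 case can be sketched with plain Python generators. This is not how Polars does it (Polars rewrites the query plan); the generator is just a stand-in that makes the cost difference visible. The `cross_join` helper and the counter are hypothetical names for this illustration.

```python
from itertools import islice

left = range(1_000)
right = range(1_000)

counter = {"rows": 0}  # how many joined rows were actually built


def cross_join(a, b):
    """Yield joined rows one at a time instead of building a full list."""
    for x in a:
        for y in b:
            counter["rows"] += 1
            yield (x, y)


# Eager: build every joined row, then keep only the first 100.
eager_rows = list(cross_join(left, right))[:100]
eager_cost = counter["rows"]

# Lazy: the consumer only asks for 100 rows, so only 100 are built.
counter["rows"] = 0
lazy_rows = list(islice(cross_join(left, right), 100))
lazy_cost = counter["rows"]

print("eager built:", eager_cost, "rows")
print("lazy built:", lazy_cost, "rows")
```

Both paths return the same 100 rows; only the amount of work differs.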

25:02 So it gets you more of a holistic approach.

25:04 - That's really cool.

25:05 I didn't realize it had so many similarities to databases, but yeah, it makes a lot of sense.

25:09 All right, let's look at maybe a super simple example.

25:14 You've got on pola.rs.

25:17 What country is rs?

25:18 I always love how different countries that often have nothing to do with domain names get grabbed because they have a cool ending like Libya that was .ly for a while.

25:28 You know, it still is, but like it was used frequently, like bit.ly and stuff.

25:31 Do you know what rs is?

25:32 - I believe it's Serbia.

25:34 - Serbia, okay.

25:35 - I'm not sure.

25:36 - Yeah, yeah, very cool.

25:37 All right, so pola.rs.

25:39 Like pola, dot rs.

25:41 Over here, you've got on the homepage here, the landing page, and then through the documentation as well, you've got a lot of places where you're like, show me the Rust API or show me the Python API.

25:50 People can come and check out the Rust code.

25:52 It's a little bit longer because it's, you know, that kind of language, but it's not terribly more complex.

25:58 But maybe talk us through this little example here on the homepage in Python, just to give people a sense of what the API looks like.

26:06 - Yeah, so we start with scan_csv, which is a lazy read. read_csv does what it says: it reads the CSV and you get the DataFrame.

26:16 With scan_csv, we start a computation graph.

26:20 We call this a lazy frame.

26:21 A lazy frame just remembers the steps of the operations you want to do.

26:26 Then you send it to Polars, and it looks at this query plan, optimizes it, and thinks of how to execute it.
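A toy version of "remember the steps, run them later" might look like the sketch below. `ToyLazyFrame` is a hypothetical class written for this illustration, not part of Polars, and it does no optimization: it only shows that building the query and executing it are separate moments.

```python
class ToyLazyFrame:
    """Hypothetical stand-in for a lazy frame: it stores steps, not results."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self.plan = tuple(plan)

    def filter(self, predicate):
        # Record the step and return a new frame; nothing runs yet.
        return ToyLazyFrame(self._rows, self.plan + (("filter", predicate),))

    def select(self, *columns):
        return ToyLazyFrame(self._rows, self.plan + (("select", columns),))

    def collect(self):
        # Only now are the recorded steps executed, in order.
        rows = self._rows
        for op, arg in self.plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows


data = [
    {"species": "setosa", "sepal_length": 5.1},
    {"species": "setosa", "sepal_length": 4.9},
    {"species": "virginica", "sepal_length": 6.3},
]

query = ToyLazyFrame(data).filter(lambda r: r["sepal_length"] > 5).select("species")
step_names = [op for op, _ in query.plan]
result = query.collect()

print(step_names)
print(result)
```

Because the whole plan exists before anything runs, a real engine has the chance to inspect and rewrite it first, which is the step this sketch leaves out.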

26:33 And we have different engines.

26:34 So you can have an engine that's more specialized for data that doesn't fit into memory, and an engine more specialized for data that does fit into memory.

26:42 So we start with a scan, and then we do a .filter, and we want to use verbs. Verbs, that's the declarative part. In pandas we often do indexing, and those indexes are ambiguous in my opinion, because you can pass in a NumPy array with booleans, but you can also pass in a NumPy array with integers, so you can do slicing, and you can also pass in a list of strings, and then you do column selection.

27:09 So it has three functions.

27:11 >> One thing that I find really interesting about Pandas is it's so incredible and people who are very good with Pandas, they can just make it fly.

27:19 They can really write expressions that are super powerful.

27:23 But it's not obvious that you should have been able to do that before you see it.

27:27 There's a lot of not quite magic, but stuff that doesn't seem to come really straight out of the API directly.

27:34 you pass in some sort of a Boolean expression that involves a vector and some other test into the brackets.

27:44 I think, wait, how do I know I can do that?

27:46 Whereas your API is a lot more of a fluent API, where instead of pd you'd say pl: pl.scan_csv, then .filter, .groupby, .agg, .collect, and it just flows together.

28:00 Does that mean that the editors and IDEs can be more helpful suggesting what happens at each step?

28:06 Yes, we are really strict on types.

28:08 So we also only return a single type from a method.

28:12 And a .filter just expects an expression that produces a Boolean, not an integer, not a string.

28:18 So from reading our code, you should be able to understand what should go into our methods.

28:26 That's really important to me.

28:27 It should be unambiguous, it should be consistent, and your knowledge of the API should expand to different parts of the API.

28:34 And that's where I think we're going to talk about this later, but that's where expressions really come in.

28:42 This portion of Talk Python to Me is brought to you by User Interviews.

28:46 As a developer, how often do you find yourself talking back to products and services that you use?

28:51 Sometimes it may be frustration over how it's working poorly.

28:55 And if they just did such and such, it would work better and it's easy to do.

29:00 Other times it might be delight.

29:02 Wow, they auto-filled that section for me.

29:04 How did they even do that?

29:06 Wonderful, thanks.

29:07 While this verbalization might be great to get the thoughts out of your head, did you know that you can earn money for your feedback on real products?

29:15 User Interviews connects researchers with professionals that want to participate in research studies.

29:20 There is a high demand for developers to share their opinions on products being created for developers.

29:26 Aside from the extra cash, you'll talk to people building products in your space.

29:30 You will not only learn about new tools being created, but you'll also shape the future of the products that we all use.

29:37 It's completely free to sign up and you can apply to your first study in under five minutes.

29:42 The average study pays over $60.

29:44 However, many studies specifically interested in developers pay several hundreds of dollars for a one-on-one interview.

29:51 Are you ready to earn extra income from sharing your expert opinion?

29:55 Head over to talkpython.fm/userinterviews to participate today.

30:00 The link is in your podcast player show notes.

30:02 Thank you to User Interviews for supporting the show.

30:05 - I just derailed you a little bit here as you were describing this.

30:10 So you start out with scanning a CSV, which is sort of creating and kicking off a data frame equivalent here.

30:18 - A lazy frame.

30:19 - And then you, a lazy frame, okay.

30:20 And then you call .filter and you give it an expression like this column is greater than five, right?

30:26 Or some expression that we would understand in Python.

30:29 That's the filter statement, right?

30:30 Yeah, and then we follow with a group by argument, and then an aggregation where we say, okay, take all columns and sum them.

30:37 And this again is an expression.

30:39 These are really easy expressions.

30:41 And then we take this lazy frame, and we materialize it into a data frame by calling collect on it.

30:47 And collect means, okay, all those steps you recorded, now you can do your magic, where we optimize the query and get all the data.

30:54 And what this will do here, it will recognize that, okay, we've taken the iris.csv, which has got different columns.

31:01 And now in this case, it won't.

31:03 So if you would have finished with a select, where we only select a few columns, it would have recognized, oh, we don't need all those columns in the CSV file, we only take the ones we need.

31:13 What it will do, it will push the filter, the predicate, down to the scan.

31:17 So during the reading of the CSV, we will take this predicate.

31:21 We say, okay, the sepal length is larger than five.

31:24 The rows that don't match this predicate will not be materialized.

31:27 So if you have a really large CSV file, this will really, let's say you have a CSV file with tens of gigabytes, but your predicate only selects 5 percent of that, then you only materialize 5 percent of the 10 gigabytes.

31:40 >> Yeah. So 500 megs instead of 10 gigabytes or something like that, or 200 megs, whatever it is, quite a bit less. That's really interesting.
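
The lazy scan and predicate pushdown described above can be sketched in plain Python. This is a toy model only, not how Polars is implemented: the `LazyCsv` class, the inline CSV text, and the column names are all assumptions made for illustration. It shows the two ideas from the conversation: `filter` only records a step, and `collect` applies the recorded predicate while reading, so rows that fail it are never stored.

```python
import csv
import io

# Hypothetical stand-in for a large CSV file (column names are illustrative).
CSV_TEXT = """sepal_length,species
5.1,setosa
4.9,setosa
6.3,virginica
5.8,virginica
4.7,setosa
"""


class LazyCsv:
    """Toy lazy frame: records steps, runs nothing until collect()."""

    def __init__(self, text):
        self._text = text
        self._predicates = []  # recorded filters, not yet applied

    def filter(self, predicate):
        self._predicates.append(predicate)
        return self  # returning self allows fluent chaining

    def collect(self):
        # "Pushdown": the predicates run while rows are being read,
        # so rows that fail them are never stored in the result.
        rows = []
        for row in csv.DictReader(io.StringIO(self._text)):
            row["sepal_length"] = float(row["sepal_length"])
            if all(p(row) for p in self._predicates):
                rows.append(row)
        return rows


plan = LazyCsv(CSV_TEXT).filter(lambda r: r["sepal_length"] > 5)
result = plan.collect()  # only now is the CSV text actually read
```

The toy uses a Python callable as the predicate for brevity; a real query engine needs a predicate it can inspect, which is where expressions come in later in the conversation.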

31:49 This is all part of the benefits of what we're talking about with the lazy frames, lazy APIs, and building up all of the steps before you say go, because in pandas, you would say read CSV.

32:01 So, okay, it's gonna read the CSV, now what?

32:03 - Yes. - Right?

32:04 And then you apply your filter if that's the order you wanna do it in, and then you group and then, and so on and so on, right?

32:10 It's interesting in that it does allow more database-like behavior behind the scenes.

32:15 - Yeah, yeah.

32:16 In the end, in my opinion, the data frame should be seen as a table in a database.

32:21 It's the final view of computation.

32:25 Like you can see it as a materialized view.

32:28 We have some data on this, and we want to get it into another table, which we would feed into our machine learning models or whatever.

32:37 And we do a lot of operations on them before we get there.

32:41 So I wouldn't see a data frame as a data.

32:45 It's not only a data structure.

32:46 It's not only a list or a dictionary.

32:48 There are lots of steps before we get into those tables.

32:52 We eventually will.

32:53 - Right.

32:54 So here's an interesting challenge.

32:57 There's a lot of visualization libraries.

33:00 There are a lot of other data science libraries that know and expect Pandas DataFrame.

33:07 So like, okay, what you do is you send me the Pandas DataFrame here, or we're going to patch Pandas so that if you call this function on the DataFrame, it's gonna do this thing.

33:15 And they may say, "Ritchie, fantastic job

33:18 you've done here in Polars, but my stuff is already all built around pandas.

33:21 So I'm not going to use this."

33:23 Right.

33:23 But it's worth pointing out.

33:25 There's some cool pandas integration, right?

33:27 Yeah.

33:27 Yeah.

33:28 So this, so Polars doesn't want to do plotting.

33:30 I don't think it should be in a data frame library.

33:33 Maybe another library can do it on top of Polars.

33:36 If they feel like it. It shouldn't be in Polars, in my opinion.

33:39 But often when you do plotting, the number of rows you're plotting will not be billions.

33:44 I mean, there's no plotting engine that can deal with that.

33:47 So you will be reducing your big data set to something small, and then you can send it to the plotter.

33:54 - There's hardly a monitor that has enough pixels to show you that, right?

33:58 So, yeah.

33:59 - You can call to_pandas, and then we transform our Polars data frame to pandas, and then you can integrate with scikit-learn.

34:06 And we often find that progressively rewriting some pandas code into Polars already is cheaper than keeping it in Pandas.

34:14 So if you do, if you go from pandas to polars, do a join in polars and then back to pandas, we probably made up for those double copies.

34:22 Pandas does a lot of internal copies.

34:23 If you do a reset index, it copies all data.

34:26 If you do, there are a lot of internal copies in pandas which are implicit.

34:30 So I wouldn't worry about an explicit copy at the end of your ETL to go to plotting when the data is already small.

34:37 - Right, right.

34:38 So let's look at the benchmarks 'cause it sounds like to a large degree, even if you do have to do this conversion in the end, many times, it still might even be quicker.

34:47 So you've got some benchmarks over here, and you compared, I'm gonna need some good vision for this one.

34:52 You compared Polars, pandas, Dask, and then two things which are too small for me to read.

34:58 Tell us what you compared.

34:59 - Modin and Vaex.

35:00 - Modin and Vaex, okay.

35:01 And for people listening, you go out here and look at these benchmarks, link right off the homepage.

35:08 There's like a little tiny purple thing and a whole bunch of really tall bar graphs for the rest.

35:13 - Yeah.

35:14 - And the little tiny thing that you can kind of miss if you don't look carefully, that's the time it takes for Polars.

35:20 And then all the others are up there in like 60 seconds, a hundred seconds, and then Polars is like quarter of a second.

35:27 So, you know, it's easy to miss it in the graph, but the quick takeaway here, I think is, there's some fast stuff.

35:32 - Yeah, yeah.

35:33 We're often orders of magnitudes faster than pandas.

35:36 So it's not uncommon to hear it's 10 to 20 times faster. Especially if you write proper pandas and proper Polars, it's probably 20x if we deal with I/O as well.

35:47 So what we see here are the TPC-H benchmarks.

35:50 TPC-H is a database query benchmark standard, which is used by every query engine to show how fast it is.

35:59 And those are really hard questions that really flex the muscles of a query engine.

36:05 So you have joins on several tables, different group-bys, nested group-bys, etc. And yeah, I really tried to make those other tools faster. But in memory, with Dask and Modin, it was really hard to make stuff faster than pandas, apart from a few occasions. Once we include IO, all those tools first needed to go via pandas. And what this sort of shows is that we have pandas, which is a single-threaded data frame engine, and then we have tools that parallelize pandas, and just parallelizing pandas doesn't always make it faster.

36:45 So if we have a filter or a element-wise multiplication, parallelization is easy.

36:50 You just split it up in chunks and do your parallelization, and then those tools win.

36:55 - You got 10 cores, you can start 10 threads, and I can take 1/10th the data and start to answer yes or no for the filter question, for example.

37:02 Most people don't realize that a lot of data frame operations are not embarrassingly parallel.

37:08 A group by is definitely not embarrassingly parallel.

37:11 A join needs a shuffle.

37:15 It's not embarrassingly parallel.
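
A small standard-library sketch of the distinction being drawn here. The data, the chunking, and the function names are assumptions for illustration; Python threads are also limited by the GIL, so this shows the structure of the work rather than a real speedup. A filter can be answered per chunk and the results simply concatenated, while a group-by yields only partial results per chunk and needs an extra merge step, because the same key can appear in several chunks.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

data = [("a", 1), ("b", 7), ("a", 9), ("c", 3), ("b", 6), ("a", 8)]
chunks = [data[0:2], data[2:4], data[4:6]]  # one chunk per worker


def filter_chunk(chunk):
    # Each chunk can be filtered without looking at any other chunk.
    return [row for row in chunk if row[1] > 5]


def partial_sums(chunk):
    # A group-by can only produce *partial* results per chunk,
    # because the same key may appear in several chunks.
    sums = Counter()
    for key, value in chunk:
        sums[key] += value
    return sums


with ThreadPoolExecutor(max_workers=3) as pool:
    filtered_parts = list(pool.map(filter_chunk, chunks))
    partial_results = list(pool.map(partial_sums, chunks))

# Filter: just concatenate the per-chunk results.
filtered = [row for part in filtered_parts for row in part]

# Group-by: an extra merge step has to combine keys across chunks.
grouped = Counter()
for part in partial_results:
    grouped.update(part)
```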

37:17 And that's why you see those tools being slower than pandas, because there's string data and then you have a problem.

37:24 Either we need to do multiprocessing, and we need to send those Python objects to another process and copy data, which is slow, or we need to do multi-threading, and we're bound by the GIL and we're single-threaded.

37:35 And then there is the expensive structure.

37:36 - Yeah, I think there's some interesting parallels for Dask and Polars.

37:42 On these benchmarks, at least, you're showing much better performance than Dask.

37:46 I've had Matthew Rocklin on a couple times to talk about Dask and some of the work they're doing there at Coiled, and it's very cool.

37:53 And one of the things that I think Dask is interesting for is allowing you to scale your code out to multi-cores on your machine or even distributed grid computing or process data that doesn't fit in memory and they can behind the scenes, juggle all that.

38:08 Yeah.

38:08 I feel like Polars kind of has a different way, but attempts to solve some of those problems.

38:14 Yeah.

38:15 Polars has full control over everything.

38:18 So it's built from the ground up and it controls the IO, it controls its own memory, it controls which thread gets which data. And Dask takes this other tool and then parallelizes that, but it is limited by what this other tool is limited by.

38:34 But I think, so on a single machine, it has those challenges.

38:38 I think Dask Distributed does have these challenges.

38:41 And I think for Distributed, it can work really well.

38:44 - Yeah, the interesting part with Dask, I think, is that it's kind of like Pandas, but it scales in all these interesting ways, across cores, bigger memory, but also across machines, and then, you know, across cores, across machines, like all that stuff.

38:56 - Yeah, and that's--

38:57 I feel like Dask is a little bit, maybe it's trying to solve like a little bit bigger compute problem.

39:02 Like how can we use a cluster of computers to answer these questions?

39:06 - Their documentation also says it themselves.

39:08 They say that they're probably not faster than Pandas on a single machine.

39:12 So they're more for the large, big data.

39:16 But Polars wants to be a lot faster on a single machine, but also wants to be able to do out-of-core processing on a single machine.

39:23 So we don't support all queries yet, but we already do basic joins, group-bys, sorts, predicates, element-wise operations.

39:33 And then we can process... I processed 500 gigabytes on my laptop.

39:38 - That's pretty good.

39:39 Your laptop probably doesn't have 500--

39:41 - No, no, no, no, it's 16 gigs.
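
One way to picture out-of-core processing is a streaming aggregation. This is a hedged standard-library sketch, not Polars' engine, and `read_batches` is a hypothetical data source. Only one batch is held at a time, and the running state grows with the number of groups rather than the number of rows, which is why the total data size can exceed available memory.

```python
def read_batches(n_rows, batch_size):
    """Hypothetical source: yields one batch of (key, value) rows at a time."""
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        yield [(i % 3, i) for i in range(start, stop)]


def streaming_group_sum(batches):
    totals = {}        # state grows with the number of groups, not rows
    largest_batch = 0  # tracks the most rows ever held at once
    for batch in batches:
        largest_batch = max(largest_batch, len(batch))
        for key, value in batch:
            totals[key] = totals.get(key, 0) + value
    return totals, largest_batch


totals, largest_batch = streaming_group_sum(
    read_batches(n_rows=1000, batch_size=100)
)
```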

39:43 Yeah.

39:44 - Nice.

39:45 It's probably actually of value, as you develop this product, to not have too massive of a computer to work on.

39:52 if you had a $5,000 workstation, you might be a little out of touch with many people using your code.

39:59 - Yeah. - And so on.

40:00 - Although I think Polars scaling on a single machine makes sense for different reasons as well.

40:08 I think a lot of people talk about distributed, but if you think about the complexity of distributed, you need to send data, shuffle data over the network to other machines.

40:17 So there are a lot of people using Polars in our Discord who have one terabyte of RAM and say it's cheaper and a lot faster than Spark because, one, they can run Polars faster on a single machine, and two, they have a beefy machine with like 120 cores and they don't have to go over the network to parallelize.

40:37 And yeah, so I think times are changing.

40:40 I think also scaling out data on a single machine is getting more and more.

40:45 - It is.

40:45 One of the areas in which it's interesting is GPUs.

40:48 Do you have any integration with GPUs or any of those sorts of things?

40:51 Not suggesting that necessarily is even a good idea.

40:54 I'm just wondering if it does.

40:55 No, I get this question, but I'm not really convinced I can get the memory.

40:59 I can get the data fast enough into the memory.

41:02 Like we want to process gigabytes of data and the challenge already on the CPU is, is getting the data from cache or memory fast enough on a CPU.

41:12 This is, I don't know.

41:14 I don't know.

41:14 Yeah.

41:15 So maybe we could talk really quickly about platforms that it runs on.

41:19 You know, I just, this is the very first show that I'm doing on my M2 Pro processor, which is fun.

41:25 I've literally been using it for like an hour and a half, so I don't really have much to say, but it looks neat.

41:30 Anyway, you know, that's very different than an Intel machine, which is different than a Raspberry Pi, which is different than, you know, some version of Linux running on ARM or on AMD.

41:40 So where, where do these, what's the, the reach?

41:44 Well, we support it.

41:45 We support it.

41:46 So, Polars also has a lot of SIMD optimizations. SIMD stands for Single Instruction, Multiple Data, where, for instance, if you do a floating point operation, instead of doing a single floating point at a time, you can fill the vector lanes in your CPU, which can fit eight floating points, and compute eight at a time in a single operation.

42:06 And then you have eight times the parallelism on a single core.

42:09 Those instructions are only activated for Intel.

42:13 So we don't have these instructions activated for ARM, but we do compile to ARM.

42:18 How it performs.

42:19 I think it performs fast.

42:21 Yeah.

42:23 But so for the standard machines, right,

42:25 macOS, Windows, Linux are all good to go, and it ships as a wheel.

42:30 So you don't have to have any, you don't have to have Rust or anything like that.

42:33 We also have Conda, but the Conda is always a bit lagging behind.

42:38 So I'd advise to install from pip because we can, we control this deployment.

42:43 Yeah, exactly.

42:44 You push it out to the PyPI and that's what pip sees and it's going to go right.

42:49 Pretty much instantly.

42:50 I guess it's worth pointing out while we're sitting here is, not that thing I highlighted this.

42:55 You do have a whole section in your user guide, the Polars book, called "Coming from pandas" that actually talks about the differences, not just how do I do this operation in pandas versus Polars, but it also talks about some of the philosophy, like the lazy concepts that we've spoken about and query optimization.

43:14 I feel like we covered it pretty well.

43:16 Unless there's maybe some other stuff that you want to throw in here really quick, but I mostly just want to throw this out as resource, because I know many people that are coming from pandas and they may be interested in this.

43:26 And this is probably a good place to start.

43:28 I'll link to it in the show notes.

43:29 I think the most controversial one is that we don't have the multi-index.

43:33 You don't have anything other than zero based, zero, one, two, three.

43:37 - Where is it in the array type of-- - Yeah.

43:39 Well, we can, we will support data structures that make lookups faster, like index in a database sense.

43:45 But it will not involve the, it will not change the semantics of the query.

43:50 That's an important thing.

43:52 Okay. Yeah, so I encourage people who are mostly pandas people to come down here and, you know, look through this.

43:57 It's pretty straightforward.

43:59 Another thing that I think is interesting and worth talking about maybe is we could touch a little bit on some of the "How can I" sections in your user guide. You've got: how can I work with IO?

44:10 How can I work with time series?

44:12 How can I work with multiprocessing and so on?

44:14 What do you think is good to highlight out of here?

44:16 Well, the user guide is a bit outdated.

44:19 So I think it's a year old.

44:21 So, for instance, IO is changing.

44:24 Polars just writes its own IO readers.

44:28 So we've written our own CSV reader, JSON reader, IPC, Arrow, and that's all in our control, but interaction with databases is often a bit more complicated.

44:41 You deal with different drivers, different ways. And currently we do this with ConnectorX, which is really great and allows us to read from a lot of different databases, but it doesn't allow us to write to databases yet. And this is luckily changing. I want to explain a bit why. So Polars builds upon the Arrow memory specification. And the Arrow memory specification is sort of the standard for how columnar data should be represented in memory. And this is becoming a new standard. Spark is using it, Dremio, pandas itself. For instance, if you read a Parquet file in pandas, it first reads into Arrow memory and then copies that into pandas memory. So the Arrow memory specification is becoming a standard, and this is a way to share data with other processes, or with other libraries within the process, without copying data.

45:38 We can just swap our pointers if we know that we both support Arrow.

45:42 >> Arrow defines basically, in memory, it looks like this.

45:47 >> If you both agree on that, we can just swap our pointers.
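
The "agree on the layout, then swap pointers" idea can be illustrated with Python's own buffer protocol. This is only an analogy using the standard library, not the Arrow format or any Arrow API. Here an `array` of C doubles stands in for one column, and a `memoryview` is a consumer that shares the same memory instead of copying it.

```python
from array import array

# Producer: a contiguous buffer of 64-bit floats (a stand-in for one column).
column = array("d", [1.0, 2.0, 3.0, 4.0])

# Consumer: agrees on the layout ("d" = C double) and takes a view, not a copy.
view = memoryview(column)

# A slice of a memoryview is still a view onto the same memory.
tail = view[2:]

column[3] = 40.0          # the producer writes ...
shared = tail[1]          # ... and the consumer sees it: nothing was copied

copied = column.tolist()  # by contrast, this makes an independent copy
column[0] = -1.0          # visible through the view, not through the copy
```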

45:50 >> Right. Because a .NET object, a C++ object, and a Python object, those don't look anything alike in memory, right?

45:59 And, you know, so, so this is from the Apache Arrow project. Yeah.

46:04 And this is really, really used a lot by a lot of different tools already.

46:09 And currently, ADBC is coming, which is Arrow Database Connectivity, which will solve all those problems, because then we can read from and write to a lot of databases in Arrow, and then it will be really fast and really easy for us to do.

46:23 So luckily, that's one of those foundations of Polars.

46:28 I'm really happy about it because supporting arrow and using arrow memory gives us a lot of interaction, interop with other libraries.

46:37 Yeah.

46:37 That's interesting.

46:38 And when you think of pandas, you know, it's kind of built on top of numpy as its core foundation and it can exchange numpy arrays with other things.

46:47 Yeah.

46:47 Do that.

46:47 So Apache arrow is kind of, kind of your, your base.

46:52 Yeah.

46:52 - Yeah, well, it's kind of full circle because Apache Arrow was started by Wes McKinney.

46:56 Wes McKinney being known as the creator of Pandas.

47:00 And when he got out of Pandas, he thought, okay, the memory representation of NumPy is just not, we should not use it.

47:07 And then he was inspired to build Apache Arrow, which learned from Pandas and yeah.

47:13 - So that's how you learn about these projects, right?

47:16 This is how you realize, oh, we had put this thing in place.

47:19 Maybe we work better, right?

47:21 work on a project for five years.

47:22 And you're like, if I got a chance to start over, but it's too late now, but every now and then you do actually get a chance to start over.

47:30 I didn't realize that Wes was involved with both.

47:33 I mean, I knew from pandas, but I didn't realize it.

47:35 Yeah.

47:36 He is the CTO of Voltron Data, which, no, he started Apache Arrow and Apache Arrow is sort of super big, like used everywhere, but sort of middleware. Like its end users are developers who build tools, not developers who use libraries. Something like that.

47:54 Right. You might not even know that you're using it. You just use, say, Polars. And oh, by the way, it happens to internally be better because of this. Yeah. Very cool. Okay, let's see. We've got a little bit of time left to talk about it. So for example, some of these "How can I" sections, let me just touch on a couple that are nice here. So you talked about ConnectorX, you talked about the database, but it's like three lines of code to define a connection string, define a SQL query, and then you can just say pl.read_sql.

48:25 And there you go.

48:26 You've, you call it data frame or what do you call the thing you get back here?

48:29 So reading is always a data frame.

48:31 Scanning will be a lazy frame.

48:32 Got it.

48:33 Okay.

48:33 Is there a scan_sql as well?

48:35 No, this might happen in the future.

48:38 The challenge is, are we going to push back our optimizations?

48:43 So you write a polars query and then we must translate that into SQL, into the SQL we send to the database.

48:51 But that needs to be consistent over different databases.

48:54 That's a whole other rabbit hole we might get into.

48:57 >> I'm not sure it's worth it because you can already do many of these operations in the SQL query that you're sending over.

49:04 You have two layers of query engines and optimizers and query plans.

49:09 It's not like you can't add on additional filters, joins, sorts, and so on before it ever gets.

49:17 It would be terrible if someone writes select star from table and then writes the filters in polars.

49:23 And then the database has sent all those data over the network.

49:26 So yeah, ideally, we'd be able to push those predicates down into the SQL.
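
The cost difference between "select star, then filter on the client" and a predicate pushed into the SQL can be shown with the standard-library `sqlite3` module. The table name, columns, and row counts below are made up for the example, and an in-memory database stands in for a remote one, so the row counts are a proxy for what would cross the network.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.execute("CREATE TABLE measurements (id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(i, float(i)) for i in range(1000)],
)

# Anti-pattern: fetch everything, then filter on the client.
all_rows = conn.execute("SELECT * FROM measurements").fetchall()
client_filtered = [row for row in all_rows if row[1] > 990]

# Pushed down: the database applies the predicate and sends only matches.
pushed_down = conn.execute(
    "SELECT * FROM measurements WHERE value > ?", (990,)
).fetchall()

# (rows sent without pushdown, rows sent with pushdown)
rows_transferred = (len(all_rows), len(pushed_down))
conn.close()
```

Both approaches give the same answer; the difference is how many rows the database has to hand over first.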

49:32 Yeah, but you know, somebody is going to do it because they're more comfortable writing the Polars API in Python than they are writing T-SQL.

49:39 >> Yeah. If it's possible, someone will write it.

49:43 >> It's not optimal. That is right.

49:46 Let's see what else can you do here.

49:48 We've already talked about the CSV files.

49:51 This is the part that I was talking about where you've got the toggle to see the Rust code and the Python code.

49:57 I think people might appreciate that.

49:59 Parquet files. Parquet files is a more efficient format.

50:04 Maybe talk about using Parquet files versus CSV and why you might want to get rid of your CSV and like store these intermediate files and then load them.

50:13 Polars has a really fast CSV reader.

50:16 I really did my best on that one.

50:17 But if you can, use Parquet or Arrow IPC, because your data is typed, so there's no ambiguity on reading.

50:27 We know which type it is.

50:28 Right.

50:29 Because CSV files, even though it might be representing a date, it's still a string.

50:33 We need to parse it.

50:34 And it's all over.

50:35 Yeah, it's slow to parse it.
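
A short standard-library illustration of the point about CSV being untyped. The column names and values are invented for the example. Every field comes back as a string, so the reader has to parse, and in practice guess, the types value by value.

```python
import csv
import io
from datetime import date

raw = "day,amount\n2023-01-29,12.5\n2023-01-30,7\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Every field arrives as text; the file itself carries no type information.
field_types = {type(v) for row in rows for v in row.values()}

# The consumer has to parse each value into the type it expects.
parsed = [
    {"day": date.fromisoformat(r["day"]), "amount": float(r["amount"])}
    for r in rows
]
```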

50:37 There's also we can just--

50:39 so Parquet interacts really nicely with query optimization.

50:44 So we can select just a single column from the file without touching any of the other columns.

50:49 We can read statistics.

50:50 And so a Parquet file can write statistics, which knows, OK, this page has got this maximum value, this minimum value.

50:57 And if you have written a Polars query which says, only give me the results where the value is larger than this,

51:04 And we see that the statistics say it cannot be in this file.

51:09 We can just skip the whole column.

51:11 We don't have to read.
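
The statistics-based skipping described here can be sketched as follows. This is a simplified model with invented names (`Chunk`, `scan_greater_than`), not the Parquet format or a Parquet reader. Each chunk of values carries the minimum and maximum written alongside it, and a "greater than" predicate lets the scan skip any chunk whose maximum is too small.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk of one column plus the statistics stored with it."""
    values: list
    min_value: float
    max_value: float


def make_chunk(values):
    # Statistics are computed once, at write time.
    return Chunk(values, min(values), max(values))


def scan_greater_than(chunks, threshold):
    matches, chunks_read = [], 0
    for chunk in chunks:
        if chunk.max_value <= threshold:
            continue  # statistics prove no value can match: skip the read
        chunks_read += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, chunks_read


chunks = [make_chunk([1, 4, 2]), make_chunk([8, 3, 9]), make_chunk([5, 5, 6])]
matches, chunks_read = scan_greater_than(chunks, threshold=6)
```

In this toy run only the middle chunk is read; the other two are ruled out by their stored maximum alone.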

51:12 - Oh, interesting.

51:13 - So there are a lot of optimizations, which, so the best work is work you don't have to do and Parquet allows it.

51:20 - Exactly.

51:21 Or you've done it when you created the file and you never do it again or something like that.

51:26 Yeah, so you've got a read Parquet, scan parquet, I suppose that's the data frame versus lazy frame.

51:32 And then you also have the ability to write them.

51:33 That's pretty interesting.

51:34 JSON, multiple files.

51:36 Yeah.

51:37 Yeah.

51:37 There's just a whole bunch of how do I, how can I rather, but bunch of neat things.

51:41 What else would you like?

51:42 I think the most important thing I want to touch on is the expression API.

51:46 So that's a bit, you go a bit higher.

51:49 So just follow up.

51:50 They've got their own chapter.

51:53 One of the goals of the Polars API is to keep the API surface small, but give you a lot of things you can do.

52:01 And this is where the Polars expressions come in.

52:03 So Polars expressions are expressions of what you want to do, which are run and parallelized on a query engine.

52:10 And you can combine them indefinitely.

52:12 So an expression takes a series and produces a series.

52:15 And because the input is the same as the output, you can combine them.

52:19 And as you can see, we can do pretty complicated stuff.

52:22 And you can keep chaining them. I'd like to compare it with, for instance, the Python vocabulary, which is quite small. So we have a while loop, we have a variable assignment. I think it fits onto maybe two pieces of paper. But with this, you can write any program you want with the combination of all this vocabulary. Yeah. And that's what we want to do with the Polars expressions as well.

52:49 So you've got a lot of small building blocks which can be combined into...

52:55 Yeah, so somebody could say I want to select a column back, but then I don't want the actual values.

53:01 I want the unique ones, a uniqueness. So if there are duplicates, remove those, and you can do a .count.

53:07 Then you can add an alias which gives it a new, which basically defines the column name.

53:12 Yeah, you could read it as...

53:13 It's not names, it's...

53:14 - You could read it as an "as".

53:16 So take column names as unique_names, like in SQL, but "as" is a keyword in Python, so you're not allowed to use it.

53:23 - Right.

53:23 (laughs)

53:24 It means something else, yeah.

53:25 - That's interesting.
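
The chain described above (select a column, keep the unique values, count them, give the result a name) can be modeled with a toy expression class. This is a plain-Python sketch written for this transcript, not the Polars implementation; `Expr`, `col`, and `select` here are stand-ins defined in the block itself. The point it illustrates is the one made earlier: because each step takes a column and returns a column, the steps compose freely.

```python
class Expr:
    """Toy expression: a named, composable function from list to list."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def _then(self, step):
        # Input and output are both lists, so steps chain freely.
        return Expr(self.name, lambda columns: step(self.func(columns)))

    def unique(self):
        # Drop duplicates while keeping first-seen order.
        return self._then(lambda values: list(dict.fromkeys(values)))

    def count(self):
        return self._then(lambda values: [len(values)])

    def alias(self, new_name):
        # Only the output name changes, like AS in SQL.
        return Expr(new_name, self.func)


def col(name):
    return Expr(name, lambda columns: columns[name])


def select(columns, *exprs):
    return {e.name: e.func(columns) for e in exprs}


table = {"names": ["ham", "spam", "ham", "eggs", "spam"]}
out = select(table, col("names").unique().count().alias("unique_names"))
```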

53:27 - Okay, yeah, so people, they use these expressions to do lots of transformations and filtering and things like that.

53:35 - Yeah, so these expressions can be used in a select on different places, but the knowledge of expressions extrapolates to different locations.

53:43 So you can do it in a select statement, and then you select a column name, you select this expression, and you get a result.

53:49 You can also do this in a group by aggregation.

53:51 And then the same logic applies.

53:53 It runs on the same engine, and we make sure everything is consistent.

53:57 And this is really powerful because it's so expressive, people don't have to use custom apply with lambda.

54:05 Because when you use lambda, it's a black box to us.

54:07 It will be slow because it's Python, and we don't know what happens.

54:11 So a lambda will be slow, it will kill parallelization because of the GIL, so yeah, a lambda is three times bad.

54:18 Right. It gets in the way of a lot of your optimizations and a lot of your speed there.

54:24 That's why we want to make this expression API very complete. So you don't need them as much.

54:30 Yeah. So people are wanting to get this, get seriously into this, they should check out chapter three expressions, right? And just go through there. Probably, especially, you know, you know, sort of browse through the Python examples that they can see where, go back and see what they need to learn more about.

54:44 But it's a very interesting API.

54:47 The speed is a very compelling thing.

54:49 I think it's a cool project.

54:50 Like I said, how many people we got here?

54:52 13,000 people using it already.

54:54 So that's a pretty big community.

54:56 - Yeah, so if you're interested in the project, we have a Discord where you can chat with us and ask questions and see how you can best do things.

55:04 It's pretty active there.

55:05 - Cool, the Discord's linked right off the homepage.

55:07 So that's awesome.

55:09 People can find it there.

55:10 Contributions, people want to make contributions.

55:12 I'm sure you're willing to accept PRs and other feedback.

55:15 - Before you put in a really large PR, please first open an issue to start the discussion.

55:24 Contributions are welcome.

55:26 And we also have a few getting-started issues, good for new contributors.

55:31 - Okay, yes, you've tagged or labeled some of the issues as look here if you want to get into this.

55:37 I must say, I think we're an interesting project to contribute to because not everything is set in stone.

55:46 So there are still places where you can play.

55:49 I'm not sure.

55:50 - There's still interesting work to be done.

55:52 It's not completely 100% polished and finalized.

55:56 - Yeah, on the periphery, yeah.

55:59 - Yeah, very cool.

56:00 Let's wrap it up with a comment from the audience here.

56:02 Ajit says, "Excellent content, guys.

56:04 It certainly helps me kickstart my journey from pandas to Polars."

56:08 Awesome, awesome.

56:09 Glad to help.

56:11 I'm sure it will help many people do that.

56:13 So Ritchie, let's close it out with a final call to action.

56:16 People are interested in this project.

56:17 They wanna start playing and learning polars.

56:20 Maybe try it out on some of their code that is pandas at the moment.

56:23 What do they do?

56:24 - I'd recommend if you have a new project, just start in polars.

56:28 Because you can also rewrite some pandas, but the most fun experience will be to just start a new project in Polars, because then you can really enjoy what Polars offers.

56:38 Learn the expression API, learn how you use it declaratively, and yeah, it will be, then it will be most fun.

56:45 Absolutely.

56:46 Sounds great.

56:46 And like we did point out, it has the to and from Pandas data frame.

56:51 So you can work on a section of your code and still have it consistent, right?

56:55 With other parts that have to be with Pandas.

56:57 You can progressively rewrite some performance heavy parts.

57:01 I also think, so Polars is really strict on the schema, on the types.

57:07 It's also, if you write any ETL, you will be really happy to do that in Polars because you can check the schema of a lazy frame before executing it.

57:15 Then you know the types before running the query.

57:18 And if the data comes in and it doesn't match this schema, you can fail fast instead of having strange outputs.

57:25 Oh, that's interesting because you definitely don't want zero when you expected something else because it couldn't parse or some other weird whatever.

57:33 Right.

57:33 Yeah.

57:34 So this was my... so missing data in Polars doesn't change the schema.

57:39 Yes.

57:39 So in Polars, the schema is defined by the operations and the data types, not by the values in the data.

57:47 So you can statically check.
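
Checking a schema before running a query can be sketched like this. It is a hedged toy model in plain Python, not the Polars API; `select`, `cast`, `plan_schema`, and `check_batch` are hypothetical helpers defined in the block. Each step in the plan maps an input schema to an output schema, so the final schema can be worked out, and errors caught, without touching any rows.

```python
def select(*names):
    def step(schema):
        missing = [n for n in names if n not in schema]
        if missing:
            raise KeyError(f"unknown columns: {missing}")
        return {n: schema[n] for n in names}
    return step


def cast(name, dtype):
    def step(schema):
        if name not in schema:
            raise KeyError(f"unknown column: {name}")
        return {**schema, name: dtype}
    return step


def plan_schema(input_schema, steps):
    """Resolve the output schema by walking the plan; no rows are touched."""
    schema = dict(input_schema)
    for step in steps:
        schema = step(schema)
    return schema


input_schema = {"id": int, "price": str, "note": str}
steps = [select("id", "price"), cast("price", float)]
output_schema = plan_schema(input_schema, steps)


def check_batch(rows, schema):
    # Fail fast: reject incoming data that does not match the declared schema.
    for row in rows:
        for name, dtype in schema.items():
            if not isinstance(row[name], dtype):
                raise TypeError(f"{name!r} expected {dtype.__name__}")
```

A plan that selects a column that does not exist fails in `plan_schema`, before any data is read, and a batch with the wrong type fails in `check_batch` rather than producing odd output later.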

57:49 Excellent.

57:50 All right.

57:51 Well, congratulations on a cool project.

57:53 I'm glad we got to share with everybody.

57:55 Thanks for coming on the show.

57:56 Bye.

57:56 You bet.

57:56 Bye.

57:58 This has been another episode of Talk Python to Me.

58:01 Thank you to our sponsors.

58:02 Be sure to check out what they're offering.

58:04 It really helps support the show.

58:06 Taipy is here to take on the challenge of rapidly transforming a bare algorithm in Python into a full-fledged decision support system for end users.

58:14 Get started with Taipy Core and GUI for free at talkpython.fm/taipy, T-A-I-P-Y.

58:21 Earn extra income from sharing your software development opinion at user interviews.

58:26 head over to talkpython.fm/userinterviews to participate today.

58:31 Want to level up your Python?

58:33 We have one of the largest catalogs of Python video courses over at Talk Python.

58:37 Our content ranges from true beginners to deeply advanced topics like memory and async.

58:42 And best of all, there's not a subscription in sight.

58:45 Check it out for yourself at training.talkpython.fm.

58:48 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.

58:52 We should be right at the top.

58:54 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

59:03 We're live streaming most of our recordings these days.

59:06 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

59:14 This is your host, Michael Kennedy.

59:16 Thanks so much for listening.

59:17 I really appreciate it.

59:18 Now get out there and write some Python code.
