WEBVTT

00:00:00.020 --> 00:00:03.220
Pandas is at the core of virtually all data science done in Python.

00:00:03.660 --> 00:00:05.740
That is, virtually all data science.

00:00:06.520 --> 00:00:09.420
Since its beginning, Pandas has been based upon NumPy.

00:00:09.800 --> 00:00:15.400
But changes are afoot to update those internals, and you can now optionally use PyArrow.

00:00:15.760 --> 00:00:28.900
PyArrow comes with a ton of benefits, including its columnar format, which makes answering analytical questions faster, support for a range of high-performance file formats, inter-machine data streaming, faster file I/O, and more.

00:00:29.400 --> 00:00:32.840
Reuven Lerner is here to give us the lowdown on the PyArrow revolution.

00:00:33.560 --> 00:00:38.420
This is Talk Python to Me, episode 503, recorded April 8th, 2025.

00:00:39.900 --> 00:00:41.560
Are you ready for your host, please?

00:00:42.420 --> 00:00:45.160
You're listening to Michael Kennedy on Talk Python to Me.

00:00:45.800 --> 00:00:48.880
Live from Portland, Oregon, and this segment was made with Python.

00:00:52.140 --> 00:00:55.000
Welcome to Talk Python to Me, a weekly podcast on Python.

00:00:55.500 --> 00:00:57.260
This is your host, Michael Kennedy.

00:00:57.440 --> 00:02:08.220
Follow me on Mastodon where I'm @mkennedy and follow the podcast using @talkpython, both accounts over at fosstodon.org and keep up with the show and listen to over nine years of episodes at talkpython.fm. If you want to be part of our live episodes, you can find the live streams over on YouTube. Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified about upcoming shows. This episode is brought to you by NordLayer. NordLayer is a toggle-ready network security platform built for modern businesses. It combines VPN, access control, and threat protection in one easy-to-use platform. Visit talkpython.fm/nordlayer and remember to use the code talkpython-10. And it's brought to you by Auth0. Auth0 is an easy-to-implement adaptable authentication and authorization platform. Think easy user login, social sign-on, multi-factor authentication and robust role-based access control. With over 30 SDKs and quick starts, Auth0 scales with your product at every stage. Get 25,000 monthly active users for free at talkpython.fm/Auth0. Reuven, welcome back to Talk Python to Me. Awesome to have you here.

00:02:08.580 --> 00:02:10.240
Thank you so much. Delightful to be here with you.

00:02:10.600 --> 00:02:22.420
Yes, we're coming up on conference season and I saw you doing conference things. So bit of a conversation about what you're going to be covering at PyCon.

00:02:22.880 --> 00:02:23.040
Absolutely.

00:02:23.500 --> 00:02:39.800
Yeah. I'm really, I mean, I love conferences. I love seeing people. I've definitely gotten to the point where there are, like, conference friends who I see every year and we can sort of catch up and hang out. It's just a fun, fun experience. I always tell anyone who can to go to conferences. It's a great place to learn, but it's also just a great place to have fun.

00:02:40.040 --> 00:03:04.480
I agree with that. I also think it's a great way to connect more deeply with programming and technology and libraries and all that kind of stuff. It's real easy for, I think, for a lot of folks for this to feel like a set of tutorials and documentation, right? And then you get there and you're like, oh, all these people are doing it and they're excited. And there's the person that made that one. And, you know, like to swim in those waters, it's different.

00:03:04.800 --> 00:03:34.140
I also feel like it's kind of sad, but I mean, I go to all these companies and I get a feeling that for many people who are in programming nowadays, it's kind of lost its fun and its creativity. And so it's very nice to be in a community where, because it's open source, everyone's there because they want to be there and because they are excited about it. And you can sort of, you know, recharge your excitement batteries, as it were, and realize, oh, there's more to this than just the drudgery of day-to-day and meetings and filling my corporate goals. It's nice. It's fun.

00:03:34.380 --> 00:03:45.320
It is. And, you know, speaking of just swimming in the waters and what is water, right, that famous quote. You talked about how it's so much fun because of open source and things like that. Like, I hadn't thought about that for a little while.

00:03:45.410 --> 00:03:48.820
Like, you know, when I work on stuff, I can just do whatever I want.

00:03:48.850 --> 00:03:50.840
If I want to share it, I can share it.

00:03:50.840 --> 00:03:51.900
I don't have to share it.

00:03:52.100 --> 00:03:55.760
Use whatever libraries that might be coming along that look promising.

00:03:56.440 --> 00:03:56.880
There's not

00:03:56.880 --> 00:04:02.680
a corporate mandate like we're going to have these features for our library in seven months.

00:04:03.140 --> 00:04:12.060
And because the customer demand asked for this and we're going to put this thing in to promote our cloud or our other thing or whatever, right?

00:04:12.140 --> 00:04:14.720
There's a lot of people out there writing code with a lot less flexibility.

00:04:15.340 --> 00:04:15.900
Oh, my God.

00:04:16.190 --> 00:04:16.280
Yes.

00:04:16.609 --> 00:04:18.920
And I seem to see more such people each year.

00:04:19.680 --> 00:04:29.520
And I also feel like I always say, like, I have this sort of dual flexibility that I feel very privileged to have that, A, I'm a freelancer, I'm independent, and B, I work in open source.

00:04:30.060 --> 00:04:33.220
So I can say such and such is dumb or such and such is bad.

00:04:34.320 --> 00:04:40.100
And people who have, like, a normal job, as it were, have to sort of say, well, like, this is our product and it's great.

00:04:40.320 --> 00:04:41.600
Or at least they have to say it to the outside world.

00:04:41.900 --> 00:04:45.740
Whereas day to day, they're just going to meetings saying, how can we convince people that this is great?

00:04:47.020 --> 00:04:54.760
So, yeah, yeah, it's a nice way to escape those corporate golden handcuffs.

00:04:54.790 --> 00:04:55.900
Oh, I'm making it sound really terrible.

00:04:56.460 --> 00:04:59.440
For all of you listening, I'm happy you have good jobs.

00:04:59.510 --> 00:05:00.060
I really am.

00:05:02.840 --> 00:05:06.660
Unfortunately, I'm afraid people might start having to appreciate their jobs a little bit more.

00:05:07.120 --> 00:05:08.780
Things are looking a little hectic out there.

00:05:08.840 --> 00:05:13.560
I don't want to go into that, but one little side diversion before we dive into the main topic.

00:05:13.960 --> 00:05:15.280
You go into all these big companies.

00:05:16.580 --> 00:05:19.960
What's the LLM AI story for those?

00:05:20.080 --> 00:05:20.440
Is

00:05:20.440 --> 00:05:20.880
it different

00:05:20.880 --> 00:05:25.860
than people on the outside who can just YOLO around the tools however they want?

00:05:25.860 --> 00:05:26.800
Or what's it like?

00:05:27.240 --> 00:05:28.360
So every company is different.

00:05:28.640 --> 00:05:30.180
Every company is asking that question, right?

00:05:30.320 --> 00:05:31.700
And no one has an answer.

00:05:31.980 --> 00:05:37.980
I think a growing number of companies are assuming that their people will use LLMs of some sort.

00:05:38.560 --> 00:05:39.660
Copilot's been around for a while.

00:05:39.900 --> 00:05:40.580
People are using that.

00:05:41.080 --> 00:05:45.540
A year ago, I have one big client where they said, no, of course we would never use ChatGPT.

00:05:45.920 --> 00:05:48.380
And just a few weeks ago, I said, so, like, what's the story?

00:05:48.780 --> 00:05:50.260
Oh, yeah, we're definitely using some things.

00:05:50.360 --> 00:05:51.300
We'll get back to you on which one.

00:05:51.620 --> 00:05:51.700
So

00:05:51.700 --> 00:05:52.420
it's been increasingly

00:05:52.420 --> 00:05:58.760
integrated just because, especially, you know, if you're a senior developer, these LLMs really help you just zoom along.

00:05:59.260 --> 00:06:05.700
The junior ones, it's sort of like a little iffier, but everyone at least has to answer the question, what are you doing with these?

00:06:06.040 --> 00:06:08.060
And I think it's increasingly integrated into their workflow.

00:06:08.140 --> 00:06:11.240
I don't think anyone knows what is the right way to do it.

00:06:11.510 --> 00:06:18.980
I do think that these companies that are talking about, well, we're not going to hire any developers this coming year because instead we're just going to use LLMs, that's just nuts.

00:06:19.580 --> 00:06:21.300
I think they're asking for trouble there.

00:06:22.100 --> 00:06:24.720
And in general, I tell people, don't have the LLMs write code for you.

00:06:25.100 --> 00:06:26.560
Have it help you strategize.

00:06:26.920 --> 00:06:27.920
Have it go over your code.

00:06:28.120 --> 00:06:29.000
Have it help you learn things.

00:06:29.540 --> 00:06:33.880
But somewhere, somehow, they're going to have LLM-generated code that no one's going to look at.

00:06:34.000 --> 00:06:36.920
And I don't like that idea so much, at least for now.

00:06:37.120 --> 00:06:37.320
Interesting.

00:06:38.060 --> 00:06:42.080
I would not trust 100% LLM written code.

00:06:42.560 --> 00:06:45.720
Not even necessarily because I think LLMs are bad at writing code.

00:06:45.860 --> 00:06:48.960
I'm stunned at how good they are at it.

00:06:49.440 --> 00:06:51.640
But they write the code that you ask them to write.

00:06:51.980 --> 00:06:54.600
Even if they get it 100% right, they write what you ask them to write.

00:06:54.780 --> 00:06:59.040
And it's like, I'm just seeing the Office Space guy: I'm good with customers.

00:06:59.270 --> 00:07:00.360
I talk to the customers.

00:07:02.340 --> 00:07:05.000
What would you say you do here, Bob?

00:07:05.320 --> 00:07:05.440
Like

00:07:05.440 --> 00:07:05.900
that guy?

00:07:07.580 --> 00:07:11.360
I mean, you know, and you've got to give the specifications to the AI really, really well.

00:07:11.600 --> 00:07:11.840
There

00:07:11.840 --> 00:07:12.080
you go.

00:07:12.200 --> 00:07:12.500
There you go.

00:07:12.620 --> 00:07:19.880
So when ChatGPT first came out, right, it was this whole meme of the programming language everyone needs to learn now is English.

00:07:20.490 --> 00:07:26.240
Because all you have to do is tell the LLM what you want to do, and it will come out with code, voila, problem solved.

00:07:26.700 --> 00:07:50.000
And anyone who has worked on a project before, and especially anyone who's worked with clients before, non-technical clients, knows that the gap between specifying what you want in clear, precise language and getting code that does it can be vast. And it's the difference between success and failure. I often tell my students one of my favorite lines that I heard years ago, which is: computers don't do what you want them to do. They do what you tell them to do.

00:07:51.720 --> 00:07:54.100
And like, we've all been bitten by that so many times.

00:07:54.280 --> 00:08:06.240
Yeah, we definitely have. We definitely have. All right. Well, I think this is a story that's going to continue to get just more insane. It's going to be interesting to see where things go.

00:08:06.320 --> 00:08:14.960
I think it's both going to supercharge open source, but also cause some real turbulence for programmers, right? So we're going to see.

00:08:15.180 --> 00:08:17.320
No, no question. I 100% agree.

00:08:17.760 --> 00:08:23.260
Yeah. So, you know, with all this, we haven't gotten to, I haven't given you a chance to introduce yourself to

00:08:23.260 --> 00:08:24.480
everyone because everyone

00:08:24.480 --> 00:08:28.680
knows Reuven, but maybe for the couple of people who don't, a real quick introduction. Fantastic.

00:08:29.160 --> 00:08:44.039
So yeah, so I'm Reuven Lerner and I teach Python and Pandas and Git for a living. I've been doing it for a long time, since like 1995 or so. And so half of my work is going to companies and doing training there. And the other half is doing online learning.

00:08:44.380 --> 00:08:56.100
I've got my own platform, I've got books, I've got newsletters, YouTube channel, online bootcamp that I do. And my goal is just like to help people wherever they are with their Python Pandas knowledge to advance, get better, get more fluent.

00:08:56.440 --> 00:09:14.000
Awesome. Well, that's pretty much the story of this episode as well. But I also, I mean, that's a great introduction, but I didn't know you were such an athlete. I mean, you didn't even talk about all these workout books that you're writing, and you're a workout influencer.

00:09:14.720 --> 00:09:24.460
Yeah. Well, I wish it were more physical than virtual. But yeah, yeah. So I've got my two books published with Manning, Python Workout and Pandas Workout.

00:09:25.880 --> 00:09:30.460
And actually, Python Workout is now in its second edition in early release form.

00:09:31.799 --> 00:09:38.800
So it's a relatively minor update to take advantage of all the new stuff that's come out of Python in the last, what, three, four, five years.

00:09:39.060 --> 00:09:41.300
So it's not like a huge overhaul, but like something there.

00:09:41.580 --> 00:09:45.200
Some of the exercises that everyone kind of said, really, you want to do that?

00:09:45.600 --> 00:09:47.700
And so like, yeah, there are always some stinkers in there.

00:09:48.060 --> 00:09:49.900
But overall, it's great fun.

00:09:50.300 --> 00:09:58.940
And the idea is, you know, Manning and I came up with the title, but the idea is you're only going to get better if you do lots of little practice every day.

00:09:59.180 --> 00:10:04.880
I often say that it's similar to learning a language, but actually, like I have started running in the last few months.

00:10:05.280 --> 00:10:06.340
And like, what do you know?

00:10:06.640 --> 00:10:08.800
You do a little more each day, a little more each day.

00:10:09.100 --> 00:10:12.640
And then like, you know, every so often you'll get injured, but like, then you go back to it.

00:10:12.960 --> 00:10:21.900
And so over time, you build up the strength and the stamina and the fluency so you can really like get into a project and do what you need and not be looking everything up all the time.

00:10:22.140 --> 00:11:27.400
You know, one of the things I noticed, I've done a lot of in-person corporate training type stuff, not for a while, but, you know, over my career, and I think it's really interesting. You go interact with all these folks, some of whom are brand new on a team or a project, but others have been, you know, at the company for 20 years, and that's usually really awesome. However, there were certainly plenty of times that I saw people who had 20 years of experience, but it didn't feel like they had 20 years of knowledge, in the sense that they kind of did the same thing they did in the first couple of years and just kept doing that for 18 more rather than having a wide-ranging set of experiences. And it's like a lot of things, like exercise, like sports, like other skills: without focused practice, you can get sort of into a rut or get really good at a few things, but you're like, well, I've never really created a website. I always just work on this database layer. It's like, 20 years? Okay, spread out a bit. Come on. And I feel like this is the kind of stuff you're talking about, maybe.

00:11:27.600 --> 00:12:17.380
That's exactly it. You want to get this wide variety of practice. So the books, I don't think I said this explicitly, but the books are all exercises. There are some comments in there and some sort of, shall we call them, mini or micro tutorials to get you up to speed. But the idea is you've already learned Python, you've already learned pandas, and now you just need to practice. And it's better that you practice in what I call controlled frustration, in this sort of, you know, environment where it's not going to matter to your job, than when you get to work and your boss is breathing down your neck and you've got deadlines. And I try to make it as varied as possible so that you'll be exposed to as many different ideas as possible. So even if you don't remember it 100%, you'd be like, oh wait, here I probably could have used a dictionary comprehension. I don't quite remember exactly what the syntax is, but that's probably the right direction. And that's way better than, uh, what do I do now? Or just doing it the wrong way.

00:12:17.380 --> 00:12:31.460
Yeah, absolutely. And circling back a bit to our LLM conversation, getting exposure to all these things, you're like, well, the LLM wrote this and it looked weird. I didn't understand it, but now I see what it was doing. It was using this thing that I hadn't really played with, this aspect of the language I hadn't played with. Yeah.

00:12:31.460 --> 00:12:59.220
One of the things I like to do with LLMs is what I call the reverse Socratic method, where I ask it lots of questions about either my code or, like, oh, you're saying I should do it this way, but why? And what if I do it this way? So instead of the teacher asking the student lots of questions, the student asks the teacher lots of questions. That's when we think of the LLM as a teacher. And I've found that I learn a lot of things that way. By probing, I both learn the nuances and see where its limitations are or where it's just bluffing.

00:13:00.320 --> 00:13:04.680
And so I find that to be sort of a useful technique to play with.

00:13:04.920 --> 00:13:05.620
Very interesting.

00:13:06.480 --> 00:13:11.020
Well, I would propose that Pandas has certainly proven itself to be a tad useful.

00:13:11.620 --> 00:13:14.140
You know, here and there, there are like a handful of people using it nowadays.

00:13:14.700 --> 00:13:15.100
It's astonishing.

00:13:15.940 --> 00:13:16.300
I know.

00:13:16.410 --> 00:13:17.640
Does it even need an introduction?

00:13:17.880 --> 00:13:19.200
I'm not sure that it needs an introduction.

00:13:19.600 --> 00:13:19.720
I

00:13:19.720 --> 00:13:21.120
mean, if you're listening to

00:13:21.120 --> 00:13:21.500
this podcast.

00:13:21.670 --> 00:13:21.920
15 seconds.

00:13:22.330 --> 00:13:22.820
Yeah, if you're listening.

00:13:23.240 --> 00:13:50.760
No, I will. Here's what I'll tell listeners out there. There's a very interesting group of people who do listen to this podcast. I'd be interested to hear your thoughts on this. I've had people write to me and they'll say, I really love your show. Thanks so much for doing it. However, you know, I'm starting to understand a lot of the words that you guys are using or what you're talking about. I've been listening for, you know, two months or something. That's serious persistence, to listen for two months and, like, start out not even knowing what's going on.

00:13:50.940 --> 00:13:55.160
But a lot of people use this show like language immersion, you know?

00:13:55.819 --> 00:13:56.220
You

00:13:56.220 --> 00:13:59.380
want to learn Portuguese, you move to Brazil, and then you start learning it, right?

00:13:59.580 --> 00:14:00.380
Not the other way around.

00:14:00.920 --> 00:14:06.720
So I'm always cognizant of those folks who are using this to kind of as their first step into the industry.

00:14:07.280 --> 00:14:08.480
What do you tell those folks Pandas is?

00:14:08.800 --> 00:14:16.080
So Pandas is a library, like a module or package in Python that lets you do data analysis.

00:14:16.560 --> 00:14:21.660
And the way I describe it to people who are not programmers is it's basically Excel inside of Python.

00:14:22.070 --> 00:14:24.340
So you can read in data from a lot of different sources.

00:14:24.880 --> 00:14:27.520
You can analyze it in two-dimensional tables, right?

00:14:27.660 --> 00:14:31.700
So you've got rows, you've got columns, and then you can perform a ton of different calculations.

00:14:32.460 --> 00:14:37.840
You can use dates, you can use text, but you also have all the flexibility of Python as a programming language.

00:14:38.410 --> 00:14:39.940
So you can extract different parts of it.

00:14:40.110 --> 00:14:41.640
You can mix and match different parts of it.

00:14:41.960 --> 00:14:49.780
And Pandas is especially really good at importing from a ton of different sources and exporting back to those sources or those destinations, I guess.

00:14:50.320 --> 00:15:02.820
And it's become this, well, to extend the language thing, like this lingua franca, like a lot of people use pandas for even a tiny subset of what it can do just because it's so ridiculously flexible and because it's everywhere.

00:15:03.180 --> 00:15:04.340
Yeah, good description.

00:15:06.720 --> 00:15:09.580
This portion of Talk Python to Me is brought to you by NordLayer.

00:15:10.200 --> 00:15:14.500
NordLayer is a toggle-ready network security platform for modern businesses.

00:15:14.940 --> 00:15:19.520
It combines VPN, access control, and threat protection in one easy-to-use platform.

00:15:20.420 --> 00:15:25.720
There's no hardware or complex setup, just secure connections and full control in less than 10 minutes.

00:15:26.160 --> 00:15:30.840
It's easy to start with quick deployment, step-by-step onboarding, and 24-7 support.

00:15:31.660 --> 00:15:32.540
It's easy to combine.

00:15:33.080 --> 00:15:35.740
It works with existing setups in all major platforms.

00:15:36.500 --> 00:15:37.980
And it's easy to scale.

00:15:38.660 --> 00:15:41.760
Add users, features, and servers in just a few clicks.

00:15:42.240 --> 00:15:43.960
Single sign-on and provisioning included.

00:15:44.780 --> 00:15:48.260
NordLayer provides zero-trust network access-based solutions.

00:15:48.960 --> 00:15:53.920
It adds threat protection to keep malware, ransomware, and phishing from reaching your endpoints.

00:15:54.580 --> 00:16:00.580
It increases your threat intelligence to spot threats before they escalate, and it helps businesses achieve compliance.

00:16:01.300 --> 00:16:06.960
So if you're responsible for the security of your software or data science team, you should definitely give NordLayer a look.

00:16:07.540 --> 00:16:17.260
As Talk Python listeners, you'll get an exclusive offer, up to 22% off NordLayer's yearly plans, plus an additional 10% off the top with our coupon.

00:16:17.780 --> 00:16:20.460
Just use the code talkpython-10.

00:16:20.860 --> 00:16:23.340
That's talkpython-10, all lowercase.

00:16:23.880 --> 00:16:27.740
Try NordLayer risk-free with their 14-day money-back guarantee.

00:16:28.200 --> 00:16:30.820
Visit talkpython.fm/nordlayer to get started.

00:16:31.260 --> 00:16:33.180
That's talkpython.fm/nordlayer.

00:16:33.440 --> 00:16:35.260
The link is in your podcast player's show notes.

00:16:35.920 --> 00:16:38.160
Thank you to NordLayer for supporting Talk Python to Me.

00:16:39.460 --> 00:16:43.020
This loading data from multiple data sources, it's nuts.

00:16:43.480 --> 00:16:45.240
It's crazy how good it is.

00:16:45.620 --> 00:16:50.720
So let me just throw an example out there for people who maybe haven't done a lot with importing data with pandas.

00:16:50.990 --> 00:17:00.500
You could do things like, say there is an HTML table on a state government website that talks about some bit of data that you need.

00:17:00.530 --> 00:17:03.360
And it's just embedded in some web page.

00:17:03.520 --> 00:17:09.839
It's the third table, HTML table, you know, bracket table slash table sort of thing on there.

00:17:09.939 --> 00:17:18.220
And you can say, load table, give it that URL and say, or, you know, load that HTML, give me the tables, go to the third one.

00:17:18.319 --> 00:17:19.360
And it's a Pandas data frame.

00:17:19.699 --> 00:17:22.339
Like that level of just grab it, right?
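
NOTE
A minimal sketch of the "give me the tables, go to the third one" idea, using pandas.read_html.
The helper name nth_table and the variable page_html are hypothetical, not from the episode.
read_html also accepts a URL directly, and it needs an HTML parser such as lxml installed.
```python
from io import StringIO
import pandas as pd
def nth_table(html: str, n: int) -> pd.DataFrame:
    """Return the n-th (0-based) <table> of an HTML document as a DataFrame."""
    # read_html returns a list with one DataFrame per <table> it can parse.
    tables = pd.read_html(StringIO(html))
    return tables[n]
# Usage (not run here): third = nth_table(page_html, 2)
```
Index 2 is the third table because Python lists count from zero.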

00:17:22.660 --> 00:17:29.680
So I have my, like one of the newsletters I produce is Bamboo Weekly, where I have like challenges each week to use Pandas.

00:17:30.120 --> 00:17:35.800
And I'm always trying to retrieve stuff from different sources because A, people have lots of different needs.

00:17:35.840 --> 00:17:36.940
B, there's lots of data out there.

00:17:37.200 --> 00:17:40.240
And C, then you have to clean it up and you need different techniques for cleaning it.

00:17:40.480 --> 00:17:44.820
And so like, right, you can retrieve it from a PDF file if there are tables there.

00:17:44.920 --> 00:17:51.920
You can retrieve, as you said, from HTML, you can retrieve it from Excel, from JSON, from other statistics programs, their binary formats.

00:17:52.320 --> 00:17:53.760
It basically is infinite.

00:17:53.860 --> 00:17:59.660
And then you've got CSV, which is like every possible almost kind of standard under the sun.

00:18:00.220 --> 00:18:01.600
And Pandas is like, oh, that's okay.

00:18:01.940 --> 00:18:04.460
We'll just give you 100,000 different options and then you can read any of them.
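
NOTE
A small sketch of how read_csv options cope with a non-standard CSV file.
The sample text is made up for illustration: semicolon separators, decimal commas, and a junk first line.
```python
from io import StringIO
import pandas as pd
# Hypothetical export: one line of junk, then semicolon-separated data.
raw = "exported by some tool\ncity;temp\nOslo;3,5\nRome;18,25\n"
# sep, decimal and skiprows are three of read_csv's many keyword options.
df = pd.read_csv(StringIO(raw), sep=";", decimal=",", skiprows=1)
print(df)
print(df.dtypes)
```
With these options the temp column is parsed as floating-point numbers rather than strings.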

00:18:04.840 --> 00:18:05.420
Yeah, that's amazing.

00:18:06.340 --> 00:18:07.680
I'm just blown away with it.

00:18:07.940 --> 00:18:09.900
So super, super interesting.

00:18:10.800 --> 00:18:17.400
One of the things that I think kind of goes hand in hand with Pandas is, of course, NumPy.

00:18:17.940 --> 00:18:21.800
And that's a little bit at the heart of what we're getting at, right?

00:18:21.960 --> 00:18:27.080
Like traditionally, Pandas has sort of internally used NumPy to manage its data structures and so on.

00:18:27.440 --> 00:18:30.300
And there's some new libraries and formats coming along.

00:18:30.470 --> 00:18:35.220
And you might be able to mix and match or even have to mix and match eventually.

00:18:35.580 --> 00:18:35.720
Right.

00:18:36.140 --> 00:18:56.340
Right. So it all started like this. When people hear that Python is the number one language for data science and machine learning and data analytics, if they know anything about Python, they're a little confused. Like, wait a second. Python's a great language, but its data structures are big and slow. Why would I possibly want to use this?

00:18:56.440 --> 00:18:57.460
Yeah, especially numbers.

00:18:58.200 --> 00:19:06.340
I think if you look at like what is most out of sync or out of space with like C or other native languages, right?

00:19:06.480 --> 00:19:10.320
Like how big is a Python float versus a regular one and like

00:19:10.320 --> 00:19:11.820
locality of

00:19:11.820 --> 00:19:12.640
data, all that kind of stuff.

00:19:12.980 --> 00:19:19.040
I think the last time I checked, it was like 24 bytes for a zero integer in Python.

00:19:19.900 --> 00:19:23.460
Yes, and there are pointer dereferences, and they're on the heap, so they might be in different places.

00:19:23.560 --> 00:19:24.780
Like there's a lot going on here.
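
NOTE
A rough, standard-library-only way to see the size gap being described, assuming CPython.
The exact size of an int object varies between Python versions, so no specific number is claimed here.
```python
import sys
from array import array
# A Python int is a full heap object with a header, not just the raw number.
int_object_size = sys.getsizeof(0)
print(int_object_size)
# array("q") stores raw signed 64-bit values side by side in one buffer.
packed = array("q", [0, 1, 2, 3])
print(packed.itemsize)
```
The packed array is the same storage idea that NumPy builds on: fixed-size values in contiguous memory.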

00:19:24.840 --> 00:19:25.260
so

00:19:25.260 --> 00:20:25.500
many years ago, I don't know, 20 years ago or something, we got NumPy. And NumPy is basically the best of both worlds. It's C storage and speed, so all that efficiency, but with this really thin layer of Python so you can work with it. You sort of get the ease of Python and, as I said, the efficiency of C. And so NumPy is fantastic at doing that. It's very, very widely used in science, engineering, math, statistics, all that stuff. And you could do a ton of stuff with NumPy and be very happy. But most data analysis that we're going to do is going to be in two-dimensional tables, and it's going to use a lot of strings, and with the import and export that we were talking about. And so I always describe Pandas as like an automatic transmission for NumPy's manual transmission, that you get a lot of sort of convenience functionality that just makes it smoother, cleaner, easier to do your day-to-day stuff. And so Pandas has been reliant on NumPy for, well, since it started, like no doubt about it.

00:20:25.960 --> 00:20:30.560
And if you sort of chip away or like scrape away the outer layer there, you very quickly see NumPy stuff.

00:20:30.580 --> 00:20:38.780
For example, the dtypes, the data types that we use in each Pandas column, those are defined almost exclusively by NumPy types.

00:20:39.540 --> 00:20:46.140
And so I actually, when I teach Pandas, I first teach NumPy because I feel it's like an easier sort of lower level way to get used to it.

00:20:46.140 --> 00:20:49.320
And then they see all the techniques applied to Pandas as well.
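
NOTE
A quick sketch showing NumPy dtypes underneath Pandas columns.
The column names and values are invented for illustration.
```python
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "age": np.array([31, 47, 58], dtype=np.int8),  # explicit 8-bit integers
    "score": [0.5, 0.75, 1.0],                     # pandas picks a float dtype
})
print(df.dtypes)
age_dtype = df["age"].dtype
# The dtype object on the column is a NumPy dtype instance.
print(isinstance(age_dtype, np.dtype))
```
This applies to the traditional NumPy-backed columns that the conversation is describing at this point.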

00:20:49.560 --> 00:20:53.420
Yeah, the data types are interesting because NumPy is written in C.

00:20:53.900 --> 00:20:56.760
C operates on structured, well-defined.

00:20:56.940 --> 00:20:59.200
You know, this thing is four bytes, that's eight bytes and so on.

00:20:59.520 --> 00:20:59.900
Data types.

00:21:00.040 --> 00:21:02.020
And so those have really interesting limitations.

00:21:02.680 --> 00:21:08.640
I have a joke for you that I ran across recently and I think it highlights this.

00:21:08.840 --> 00:21:10.940
So I wish I had a picture of it I could share, but I don't.

00:21:11.180 --> 00:21:18.200
So there was this programmer that finds one of these genie in a bottle, sort of genie things, rubs it.

00:21:18.240 --> 00:21:20.680
The genie pops out and says, hello, lucky one.

00:21:21.060 --> 00:21:21.720
You have three wishes.

00:21:22.100 --> 00:21:23.700
But before you can wish, there are some rules.

00:21:23.770 --> 00:21:26.680
You can't wish to kill someone or make someone fall in love with you.

00:21:27.000 --> 00:21:28.920
Most importantly, you can't wish for more wishes.

00:21:29.300 --> 00:21:31.420
The programmer says, well, can I wish for fewer wishes?

00:21:32.360 --> 00:21:33.940
Why would you wish for fewer wishes?

00:21:34.050 --> 00:21:34.840
I just don't understand.

00:21:35.660 --> 00:21:37.280
He goes, well, I want negative one wishes.

00:21:37.530 --> 00:21:43.540
Fine, you have 4,294,967,295 wishes.

00:21:45.480 --> 00:21:45.980
Oh, that's pretty

00:21:45.980 --> 00:21:46.100
good.

00:21:46.120 --> 00:21:46.700
So what

00:21:46.700 --> 00:21:47.000
happened?

00:21:47.320 --> 00:21:48.280
Like, why did that go wrong?

00:21:48.720 --> 00:21:49.800
I mean, that's the dtypes, right?

00:21:50.380 --> 00:22:06.300
That's exactly right. That's exactly right. So I never thought of it that way. I always think, maybe because I'm old enough, I think you and I are about the same age, that when you played video games when you were little, if you're really, really good, you would get the maximum score, it would sort of wrap around back to zero. Or

00:22:06.300 --> 00:22:07.080
if you had a really

00:22:07.080 --> 00:22:12.240
old car, and if you drove it a long time, eventually the odometer would go past zero. There's a

00:22:12.240 --> 00:22:13.060
limited number

00:22:13.060 --> 00:23:08.480
of digits, and after 9999, it has to go back to 0000. And that's basically what's happening bitwise in NumPy. Unlike Python data types, like Python integers, which will get as big or as small as you have memory for, with no limit, when you're working with NumPy or with pandas with these dtypes, you have to say, is it 8 bits or 16 or 32 or 64? And that's it. Once you reach that ceiling, it wraps around, and it will not warn you about this either. So you need to keep enough of a buffer there between what you think will be your maximum number and what could possibly ever be your maximum number. Like if you want to do, I don't know, eight bits for ages, that's probably fine. Right. But if you want to do eight bits for, I don't know, how long is the project going in number of days? Oh, better hope that your project is going to be done soon because you could be in trouble and you could be into negative territory.
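
NOTE
A minimal sketch of the wrap-around described here, using a signed 8-bit NumPy array.
The day counts are hypothetical values chosen to sit right at the int8 ceiling of 127.
```python
import numpy as np
# Hypothetical data: project length in days, stored as signed 8-bit integers.
days = np.array([100, 126, 127], dtype=np.int8)
# Adding one day to the array wraps past the ceiling instead of raising an error.
next_day = days + np.int8(1)
print(next_day)
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)
```
The last element goes from 127 to -128, which is the negative territory mentioned above.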

00:23:09.020 --> 00:23:15.440
Yeah. They should have made a little bit bigger choice for the epoch since 1970, you know, that's right. That's right.

00:23:15.520 --> 00:23:17.580
We'll find out in 2038 about that one.

00:23:18.340 --> 00:23:18.540
I think

00:23:18.540 --> 00:23:19.060
that's the year.

00:23:19.370 --> 00:23:20.100
Anyway, it's going to be bad.

00:23:20.470 --> 00:23:25.440
Yeah, so the genie was basically storing the wishes count in an unsigned 32-bit integer.
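
NOTE
Editor's sketch of the genie's arithmetic using only the standard library.
It assumes, as the joke does, that the counter is an unsigned 32-bit integer
that starts at zero and has one more wish subtracted from it.
```python
import ctypes
# Assumption for illustration: the wish counter is an unsigned 32-bit integer.
wishes = ctypes.c_uint32(0)
# Using one more wish subtracts 1; an unsigned field cannot hold -1,
# so the bits wrap around to the largest value 32 bits can represent.
wishes = ctypes.c_uint32(wishes.value - 1)
remaining = wishes.value
print(remaining)
```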

00:23:26.420 --> 00:23:27.000
I love that joke.

00:23:27.820 --> 00:23:29.980
And I have no one to tell it to, so I'm very glad you told it to me.

00:23:31.860 --> 00:23:32.160
Exactly.

00:23:32.320 --> 00:23:33.660
What? That's a stupid...

00:23:33.790 --> 00:23:34.440
I don't even understand.

00:23:34.710 --> 00:23:35.680
What a stupid genie.

00:23:35.820 --> 00:23:37.180
Of course, they have no more wishes.

00:23:37.660 --> 00:23:38.540
They got some more wishes.

00:23:39.780 --> 00:23:42.900
Another thing, you talked about this buffer sort of deal.

00:23:43.760 --> 00:23:52.260
If you think maybe you need 16 bits or whatever, or 32, maybe you want to be safe so you're going to double that.

00:23:53.000 --> 00:23:55.780
That also adds to a bunch of memory usage.

00:23:56.740 --> 00:24:04.060
When you allocate 64-bit integers instead of 32, even if you don't use that space, you consume that much memory.

00:24:04.580 --> 00:24:04.880
That's

00:24:04.880 --> 00:24:05.620
right. That's the thing.

00:24:06.370 --> 00:24:13.500
Again, even if you're an experienced Python programmer, you're like, well, whatever the integers need, they need.

00:24:13.840 --> 00:24:18.080
But then comes along NumPy and Pandas and they say, no, you have to choose how big it's going to be.

00:24:18.300 --> 00:24:21.220
And I'm like, well, okay, let's just make everything 64 bits, right?

00:24:21.390 --> 00:24:21.900
What could be the cost?

00:24:21.900 --> 00:24:22.340
Just be safe.

00:24:22.840 --> 00:24:22.940
Right.

00:24:23.400 --> 00:24:29.560
And basically, let's say you have a billion rows that, you know, let's say you just have a billion elements.

00:24:30.140 --> 00:24:34.380
Well, 64 bits is going to be like literally twice as much as 32.

00:24:35.060 --> 00:24:40.900
And like that could mark the difference between running out of memory and not running out of memory or having to swap.
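
NOTE
Editor's back-of-the-envelope version of the "twice as much" point, standard
library only. It counts raw storage for one column of a billion values and
ignores any overhead from the container; the helper name is made up for this sketch.
```python
# Back-of-the-envelope memory cost of one column with a billion elements.
N_ELEMENTS = 1_000_000_000
def column_gigabytes(bits_per_value: int, n: int = N_ELEMENTS) -> float:
    """Raw storage in GB (10**9 bytes), ignoring any container overhead."""
    return n * (bits_per_value // 8) / 1e9
sizes = {bits: column_gigabytes(bits) for bits in (8, 16, 32, 64)}
for bits, gb in sizes.items():
    print(f"{bits:2d}-bit: {gb:.0f} GB")
```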

00:24:41.010 --> 00:24:42.540
Like it can get very bad.

00:24:42.960 --> 00:24:48.960
And so, especially since Pandas is constrained by what can fit into available RAM.

00:24:49.460 --> 00:25:03.060
So you're always stuck with this tension with these dtypes: you have to keep it big enough to fit all the data you want, and small enough that you'll be able to fit everything into memory.

00:25:03.560 --> 00:25:04.620
And it's a bit of a game.

00:25:04.900 --> 00:25:10.200
And there's no formula you can use because you can't know in advance what all your data is going to be usually.

00:25:10.460 --> 00:25:11.500
Yeah, for sure.

00:25:11.820 --> 00:25:15.360
I mean, honestly, it really freaked me out a little bit when I first started doing Python.

00:25:15.600 --> 00:25:17.880
And it didn't matter what integer type I created.

00:25:17.940 --> 00:25:20.720
I'm like, well, I give it this number, but what if it gets too big?

00:25:21.080 --> 00:25:22.040
How do I control that?

00:25:22.120 --> 00:25:23.620
You know, you don't.

00:25:23.720 --> 00:25:24.620
It's just, it's magic.

00:25:24.900 --> 00:25:25.320
Right, right.

00:25:25.420 --> 00:25:27.500
And I come from like a dynamic language background.

00:25:27.640 --> 00:25:30.820
Like I was always sort of brainwashed to think this is the way normal things are.

00:25:31.220 --> 00:25:38.720
And so when I was like told that there are languages where you have to say how many bits it's going to be in advance, I was like, wait, what kind of crazy stuff is this?

00:25:39.460 --> 00:25:43.040
But it turns out a very large number of people see that as totally normal.

00:25:43.360 --> 00:25:44.320
Yeah, it's interesting.

00:25:44.490 --> 00:25:49.780
I was just looking at some C-sharp stuff last night and all the symbols and all the stuff there.

00:25:49.950 --> 00:25:51.680
Like it seems normal when you're in that.

00:25:51.710 --> 00:25:52.680
Then you step out of it.

00:25:52.690 --> 00:25:56.560
Like, wait, I don't have to be constrained by this or I don't have to worry about that particular thing.

00:25:56.720 --> 00:25:57.060
That's weird.

00:25:57.300 --> 00:26:01.600
But wait, if I don't have to worry about it, why have I been spending all my time and energy thinking about it?

00:26:01.680 --> 00:26:09.720
Right. But I mean, I would say most languages, actually, you probably do have to worry about your numerical sizes.

00:26:09.940 --> 00:26:15.100
Right. Anything that's sort of compiled and allocates things like that, you work with memory.

00:26:15.520 --> 00:26:19.860
Look, it's like a statically typed versus dynamically typed language sort of thing.

00:26:20.180 --> 00:26:20.560
Right. Do

00:26:20.560 --> 00:26:24.600
you want to have that extra safety? Do you want to know in advance how much memory you're going to use?

00:26:25.240 --> 00:26:34.180
Or do you want it to be more expressive and flexible, but then potentially have problems if you don't think about it enough in advance?

00:26:35.440 --> 00:26:35.640
Yeah,

00:26:35.740 --> 00:26:36.120
yeah, yeah.

00:26:36.460 --> 00:26:37.300
Yeah, for sure.

00:26:38.020 --> 00:26:38.280
All right.

00:26:38.840 --> 00:26:43.100
So that's, I guess, one more thing before we move on to talk about Arrow.

00:26:43.560 --> 00:26:48.380
You can ask Pandas how much data a data frame is consuming, right?

00:26:48.680 --> 00:26:51.540
So the answer is, as always with me, yes and no.

00:26:52.000 --> 00:26:55.560
You can ask it how much memory it's using, and it will give you an answer.

00:26:56.100 --> 00:26:57.920
And sometimes that answer is even accurate.

00:26:58.640 --> 00:27:03.300
And the problem is basically that it will tell you how much memory is being used in NumPy.

00:27:03.680 --> 00:27:07.780
And so if you've got integers, if you've got floats, if you've got date times, it will be 100% accurate.

00:27:08.080 --> 00:27:14.740
The moment that you have strings or other objects, but let's just concentrate on strings, NumPy does have strings, and they're terrible.

00:27:15.160 --> 00:27:17.520
And so basically, Pandas is like, we're not going to use those.

00:27:17.860 --> 00:27:34.660
We're going to use Python strings, and we'll just store a pointer in NumPy, a 64-bit pointer that points to the Python string, which means that if it calculates how much memory is being used by NumPy, it's showing you how big the pointer is, which is potentially, I mean, there is no connection.

00:27:34.800 --> 00:27:37.060
There's no correlation between that and the size of the string.

00:27:37.660 --> 00:27:38.840
All strings are eight bytes.

00:27:39.420 --> 00:27:39.720
That's right.

00:27:40.020 --> 00:27:40.120
That's

00:27:40.120 --> 00:27:40.220
right.

00:27:41.880 --> 00:27:43.140
Well, we washed our hands of that one.

00:27:43.540 --> 00:27:57.360
And so when you use df.info, use the info method on a data frame, it will report back. And then sometimes it'll put a plus after that number. And the plus means, hey, I've got some strings here in Python memory.

00:27:57.780 --> 00:28:23.860
I'm going to just give you a fast answer. And I'm not going to go explore that. If you really want a real answer, tell me basically deep equals true. And then I'll go off and explore Python memory. It'll take longer, but you'll get an accurate count. And that's surprising to a lot of people, in part because the index and the column names are also typically strings. And so the moment you have an index or column names assigned, it'll also give you that plus, and you can't depend on it.
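
NOTE
Editor's sketch of the shallow versus deep count, assuming pandas is installed.
The column is hypothetical and is forced to the classic object dtype so that it
holds pointers to Python strings, which is the case being described. The same
deep count is available from df.info(memory_usage="deep").
```python
import pandas as pd
# Hypothetical column: 1,000 Python strings of 100 characters each,
# forced to the classic object dtype (pointers to Python str objects).
names = pd.Series(["x" * 100] * 1000, dtype=object)
# Shallow count: only the 8-byte pointers held in the NumPy array.
shallow = names.memory_usage(index=False)
# Deep count: also follows each pointer and measures the string it refers to.
deep = names.memory_usage(index=False, deep=True)
print(shallow, deep)
```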

00:28:23.960 --> 00:28:30.160
Yeah. Interesting. And you know, Python objects have the same issue. If you ask in Python, I can't remember exactly what the

00:28:30.160 --> 00:28:31.640
size of.

00:28:32.000 --> 00:29:02.160
Yeah. sys.getsizeof, that's it. Yeah. And that does the exact same thing. So if you've got a list, for example, or a dictionary, and you ask how big is it, it's like, well, it's basically how many pointers are stored in its structure that point out to the things. And for the memory class I did at Talk Python, I had to write some code that would basically traverse the object graph of all the structures. And like, no, this is actually how big it is. And this is why it's doing this in memory and so on. And yeah, it's not unique to NumPy, but it's,

00:29:02.480 --> 00:29:02.600
yeah,

00:29:02.900 --> 00:29:32.480
it's just, you've got pointers. It's a lot more work to traverse them and figure that stuff all out. Okay. So there's a lot of energy around a specification, a library called Arrow from Apache, Apache Arrow. It's the universal columnar, that word always catches me up, columnar format and multi-language toolbox for fast data interchange and in-memory analytics. And so this is super interesting, this project. Let me go to its homepage or whatever.
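
NOTE
Editor's sketch for the earlier point about sys.getsizeof and traversing the
object graph, standard library only. This is not the code from the course
mentioned; deep_sizeof is a hypothetical helper that handles only the built-in
containers and counts shared objects once.
```python
import sys
def deep_sizeof(obj, seen=None):
    """Approximate total size of obj plus everything it references."""
    seen = set() if seen is None else seen
    if id(obj) in seen:  # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size
words = ["alpha" * 50, "beta" * 50, "gamma" * 50]
shallow = sys.getsizeof(words)   # the list header plus three pointers
deep = deep_sizeof(words)        # also counts the three strings
print(shallow, deep)
```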

00:29:32.890 --> 00:29:38.020
But yeah, you have this for many different languages, right?

00:29:38.100 --> 00:29:38.980
It's not just Python.

00:29:39.220 --> 00:29:42.340
In fact, it's like NumPy is written in C, and this one's written in C++.

00:29:42.740 --> 00:29:44.840
Yeah, I mean, so think about it this way.

00:29:45.100 --> 00:29:50.240
Like, I mean, I always sort of think about my evolution of seeing amazing stuff in programming languages.

00:29:50.720 --> 00:29:53.200
So it used to be really amazing that you could get strings, right?

00:29:53.300 --> 00:29:54.840
Back, let's say, 30 years ago.

00:29:55.040 --> 00:29:57.560
Wow, you don't think about like, you know, arrays of characters.

00:29:58.020 --> 00:29:58.700
It's just a string.

00:29:59.180 --> 00:29:59.380
Amazing.

00:29:59.980 --> 00:30:02.280
Fast forward a number of years and I was amazed by dates and times.

00:30:02.800 --> 00:30:05.900
And nowadays, like everyone wants to do data frames.

00:30:06.480 --> 00:30:19.340
And so Wes McKinney, who invented pandas, and a bunch of other people said, well, instead of everyone inventing our own thing, why don't we create a backend data storage system that everyone can use that does all the data frame stuff?

00:30:19.400 --> 00:30:20.820
Because we all want them in our languages.

00:30:21.380 --> 00:30:28.440
And then we can make it really fast and universal and do lots of inputs and outputs and even have interchange among these different languages.

00:30:29.000 --> 00:30:30.560
And so that's what Arrow is basically trying to do.

00:30:30.560 --> 00:30:35.900
It's trying to be like the universal, super fast, super efficient data frame implementation.

00:30:36.700 --> 00:30:44.760
So your pandas library just needs to be a layer on top of that, which might have some echoes of just being a layer on top of NumPy.

00:30:45.040 --> 00:30:46.120
Yeah, it sounds similar.

00:30:46.320 --> 00:30:46.780
It sounds familiar.

00:30:47.340 --> 00:30:48.260
Yeah, very cool.

00:30:48.400 --> 00:30:53.960
So let's talk about this columnar thing, columnar aspect of it.

00:30:54.260 --> 00:30:58.420
So I guess pandas and NumPy operate on the concept of rows.

00:30:58.660 --> 00:31:06.860
I've got rows of data, and Arrow is more about, I have columns of data that could somehow have row definitions into them.

00:31:07.130 --> 00:31:12.980
And it lets you ask different questions more or less easily, right?

00:31:13.120 --> 00:31:17.860
Depending on what you're trying to ask, like what is the average of the miles per hour?

00:31:18.060 --> 00:31:20.100
Like, oh, well, that's just this thing.

00:31:20.330 --> 00:31:23.560
I go right down the column and boom, here's the answer, you know?
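
NOTE
Editor's sketch of "go right down the column", standard library only. It is a
toy picture of the two layouts using plain Python containers, not how Arrow or
NumPy store data internally; the records are invented.
```python
from statistics import mean
# Hypothetical records, shown in two layouts.
# Row-oriented: one dict per record; a column's values are scattered.
rows = [
    {"car": "a", "mph": 55.0},
    {"car": "b", "mph": 65.0},
    {"car": "c", "mph": 72.0},
]
# Column-oriented: one contiguous list per column.
columns = {
    "car": ["a", "b", "c"],
    "mph": [55.0, 65.0, 72.0],
}
# Row layout: visit every record and pick out one field from each.
mean_from_rows = mean(row["mph"] for row in rows)
# Column layout: the values are already side by side.
mean_from_columns = mean(columns["mph"])
print(mean_from_rows, mean_from_columns)
```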

00:31:23.820 --> 00:31:28.520
Or you start asking that by rows in Arrow, then it's got to do a lot of work

00:31:28.600 --> 00:31:30.800
to kind of piece that together, and quite the opposite for

00:31:30.800 --> 00:31:31.720
pandas, right?

00:31:32.060 --> 00:31:36.100
So I'm still digging into exactly like what's going on there.

00:31:36.480 --> 00:31:46.000
What you said is I think true, but it's also true that pandas data frames, you can think of them, I think of them as like a dictionary of series where each column is actually a series.

00:31:46.620 --> 00:31:50.600
So it's not really row by row, but the NumPy implementation is row by row.

00:31:50.980 --> 00:32:15.780
So I think something in the backend there is being translated differently, but it is 100% true that Arrow is just way faster at doing analysis of a column. NumPy might be faster at adding elements or doing that sort of thing, but the moment that you want to, as you said, get the mean or get the min or the max or sum them up or whatever, Arrow is just blazingly fast, because that's what it was very specifically designed to do. Yeah,

00:32:15.780 --> 00:32:25.880
you just, you preload the data. You've got to load it into some kind of data structure, right? And you can do that in ways that optimize some things at the cost of others. I think there's a

00:32:25.880 --> 00:32:27.200
little

00:32:27.200 --> 00:32:33.780
bit of a similarity between relational databases and document NoSQL ones in

00:32:33.780 --> 00:32:34.480
the sense like

00:32:34.480 --> 00:33:11.000
their data is structured in one way for really good operations, right? Like I want to go through this table really well and I want to maybe follow a relationship that's set up really well, but you're still computing all those things because they're in like different places in relational ones. And then like say a document database, if you know there's always this relationship you follow. You can just put them together and like kind of pre-compute them. But that makes other questions that don't follow that relationship, but like use the nested data super, not super hard, but much harder than it otherwise would be, right? So it's all about these trade-offs and how you store stuff. You know, what kind of questions are you going to ask it?

00:33:11.100 --> 00:33:41.420
That's right. Arrow goes even further than that. Like, it does compression, because it says, well, if I've got all this stuff in the column and I see a lot of the same things, I'll compress it. Also, it has strings. Like, Arrow has its own implementation of strings in its binary format right there, which again, you know, we were talking a few minutes ago about how currently Pandas ignores NumPy strings and uses Python strings. And so Arrow offers the opportunity of having them right there in memory, nicely and efficiently.

00:33:43.020 --> 00:33:47.880
This portion of Talk Python is brought to you by Auth0. Do you struggle with authentication?

00:33:48.920 --> 00:33:55.320
Sure, you can start with usernames and passwords, but what about single sign-on, social auth, integration with AI agents?

00:33:55.940 --> 00:34:01.260
It can quickly become a major time sink, and rarely is authentication your core business.

00:34:01.920 --> 00:34:06.480
It's just table stakes that you've got to get right before you can move on to building your actual product.

00:34:07.100 --> 00:34:08.940
That's why you should consider Auth0.

00:34:09.500 --> 00:34:13.760
Auth0 is an easy-to-implement, adaptable authentication and authorization platform.

00:34:14.480 --> 00:34:20.879
Think easy user logins, social sign-on, multi-factor authentication, and robust role-based access control.

00:34:21.530 --> 00:34:28.020
With over 30 different SDKs and quick starts, Auth0 scales with your product at every stage.

00:34:28.679 --> 00:34:33.820
Auth0 lets you implement secure authentication and authorization for your preferred deployment environment.

00:34:34.240 --> 00:34:42.200
You can use all of your favorite tools and frameworks, whether it's Flask, Django, FastAPI, or something else, to manage user logins, roles, and permissions.

00:34:43.020 --> 00:34:47.860
Leave authentication to Auth0 so that you can start focusing on the features your users will love.

00:34:48.720 --> 00:34:53.159
Auth0's latest innovation, Auth for GenAI, is now available in developer preview.

00:34:53.700 --> 00:35:04.540
Secure your agentic apps and integrate with the Gen AI ecosystem using features like user authentication for AI agents, token vault, async authorization, and FGA for RAG.

00:35:05.260 --> 00:35:18.320
So if you're a Python developer or data scientist looking for an easy and powerful way to secure your applications, Get started now with up to 25,000 monthly active users for free at talkpython.fm/Auth0.

00:35:18.940 --> 00:35:21.100
That's talkpython.fm/Auth0.

00:35:21.210 --> 00:35:22.960
The link is in your podcast player's show notes.

00:35:23.640 --> 00:35:25.620
Thank you to Auth0 for supporting the show.

00:35:26.800 --> 00:35:28.700
Now we have PyArrow.

00:35:29.160 --> 00:35:30.680
What's the relationship between

00:35:30.680 --> 00:35:32.040
Arrow and PyArrow?

00:35:32.220 --> 00:35:37.960
So that's actually simple to explain, which is PyArrow is just the Python client for Arrow.

00:35:38.460 --> 00:35:39.500
So you want to use Arrow.

00:35:40.000 --> 00:35:44.320
You're a Python developer, you do import PyArrow, and you now have these data structures available.

00:35:44.720 --> 00:35:46.260
By the way, you can do that without pandas.

00:35:46.420 --> 00:35:54.780
If you are like a pandas hater, or you just have no interest in using it, but you want really fast data storage, use PyArrow.

00:35:55.020 --> 00:35:56.520
And there's nothing wrong with that.

00:35:57.320 --> 00:36:02.480
I'll even say that my interest in PyArrow and pandas started a few years ago.

00:36:02.480 --> 00:36:06.540
I saw a talk at a conference somewhere, and I was so incredibly confused.

00:36:06.940 --> 00:36:11.600
I was like, okay, so there's PyArrow and there's pandas, And they say there's a relationship, but what is that relationship?

00:36:12.000 --> 00:36:12.620
I have no idea.

00:36:13.060 --> 00:36:13.420
And that's

00:36:13.420 --> 00:36:14.040
what like...

00:36:14.040 --> 00:36:14.900
Because it's all NumPy.

00:36:15.040 --> 00:36:16.400
NumPy has nothing to do with it.

00:36:16.420 --> 00:36:17.100
What's going on here?

00:36:17.900 --> 00:36:18.780
Right, right, right.

00:36:19.460 --> 00:36:22.000
So you can use PyArrow and there's nothing wrong with it.

00:36:22.100 --> 00:36:25.960
And it has a rich set of data types and all sorts of really amazing functionality.

00:36:26.220 --> 00:36:27.340
And of course, it's super fast.

00:36:27.720 --> 00:36:31.820
Yeah, somewhere in here, I was looking around for which...

00:36:31.820 --> 00:36:35.620
There's a list that says, here's all the different languages that are supported on the Arrow project.

00:36:36.600 --> 00:36:39.560
And yeah, it's the implementation status, I believe.

00:36:40.120 --> 00:36:44.260
And so it says, well, what data types are supported per language?

00:36:44.500 --> 00:36:51.740
So there's like a Java implementation and a C# implementation and a Julia and a Swift and a nanoarrow and a Rust and so on.

00:36:52.000 --> 00:36:56.800
And I'm looking through here and like, obviously, the C++ one has pretty much everything supported.

00:36:57.220 --> 00:37:05.559
Whereas, say, the Java one doesn't do decimal 32 or 64, but it does floats or does the big, the really big decimals.

00:37:06.220 --> 00:37:08.000
You know, things like that, 128-bit and so on.

00:37:08.280 --> 00:37:09.680
And I'm like, something is wrong.

00:37:09.900 --> 00:37:16.100
Something is throwing me off here, because I know Python is really popular, and Python is not listed as a language.

00:37:16.640 --> 00:37:17.760
So that's throwing me off.

00:37:17.940 --> 00:37:18.680
Like, why is this?

00:37:19.620 --> 00:37:21.740
So I'm like, oh, under the

00:37:21.740 --> 00:37:22.340
details,

00:37:22.740 --> 00:37:34.900
it says, unless otherwise stated, the Python, R, Ruby, and C GLib libraries follow the C++ Arrow library, because there's like a really native tie to the original C++ version.

00:37:35.180 --> 00:37:35.680
Isn't that interesting?

00:37:36.120 --> 00:37:39.500
So I think it also means those languages, like those are all dynamic languages.

00:37:39.620 --> 00:37:45.860
Well, not C GLib, but like the dynamic languages there are, I think, these thin layers that just talk directly to the C++

00:37:45.860 --> 00:37:46.760
implementation.

00:37:46.760 --> 00:37:46.780
Yeah,

00:37:47.080 --> 00:37:47.180
exactly.

00:37:47.820 --> 00:37:50.520
And so it's like, whatever that can do, we can do too.

00:37:51.340 --> 00:37:51.540
Zoom.

00:37:52.340 --> 00:37:53.500
Yeah, exactly.

00:37:54.160 --> 00:37:54.260
Exactly.

00:37:54.540 --> 00:38:01.600
So I think when you think about PyArrow, I feel like you almost should just think about the C++ layer.

00:38:01.840 --> 00:38:08.920
Or if you hear features of Arrow, look at the C++ stuff because PyArrow is just, like you say, a very thin wrapper on top of that.

00:38:09.020 --> 00:38:10.780
But at first when I look, it's like, what?

00:38:11.480 --> 00:38:13.180
They're talking about C# and Java?

00:38:14.040 --> 00:38:15.260
No Python in this?

00:38:15.280 --> 00:38:20.160
I mean, surely there's enough data science in Python to warrant a checkbox or a check column.

00:38:20.580 --> 00:38:22.540
They're like, we're so great, we don't even need a column.

00:38:23.720 --> 00:38:24.040
Exactly.

00:38:24.620 --> 00:38:24.780
Yeah.

00:38:25.220 --> 00:38:26.040
We're the native column.

00:38:26.380 --> 00:38:27.880
Anyway, I think that's really interesting.

00:38:28.460 --> 00:38:35.460
So I do want to go, I think this little data types thing gives us a bit of a jumping off point for circling back a little bit.

00:38:35.720 --> 00:38:38.880
Why did I bring up that genie joke, right?

00:38:39.180 --> 00:38:40.100
Other than I really like it.

00:38:40.100 --> 00:38:40.600
I think it's funny.

00:38:41.140 --> 00:38:50.320
But we also have these dtype concepts down in the C++ layer, which is really no different in terms of data types than C, right?

00:38:50.460 --> 00:38:54.260
There's still 4-byte or 8-byte numbers and so on, signed or unsigned.

00:38:54.280 --> 00:38:55.180
and

00:38:55.180 --> 00:39:09.180
you have this here, but PyArrow, and more generally Arrow, deals with that differently, right? If you have overflows or missing numbers, it's not exactly the same as, you know, negative or some positive 2.4 billion or whatever it is, right?

00:39:09.180 --> 00:39:45.960
I mean, so, yeah. Well, the whole missing data thing is a whole problem in and of itself. I mean, there's missing data in every data set we have, right? People forget to enter stuff, and sensors go dead, and networks are down, all sorts of stuff. So what do you do if the data is missing? Because you can't just have a blank space there. And so for many people, if they're new to this, their natural assumption is, oh, well, I'll just do like a minus one, or I'll use zero, and then it'll be great. And then you think about, well, what happens if the temperature sensors are dead? Okay, fine. So maybe we'll use minus 999. Well, wait, that's probably not so good either.

00:39:46.460 --> 00:40:08.960
And so after, like, it's been a number of years that people have realized, okay, we need a totally separate thing to indicate that data is missing. And so that's where NaN comes in, not a number, or in modern pandas, it would be NA. But then you get into other issues of, well, wait, what type is NA or what type is NaN? And it turns out that NaN in traditional NumPy is a float.

00:40:09.350 --> 00:40:45.480
And so if you have a bunch of strings and you want to say there's a missing value, oh, wait a second. So now we've got strings and we've got this float. Oh, no. And it just, like, goes downhill from there. And so one of the amazing things that Arrow did from the get-go was to say all these types, all these values we have are nullable, meaning that there is a specific value of NaN or NA or whatnot that fits with all these things. So you can have integers and NA, you can have strings and NA. You can even have, the first row of that table you're showing there is null. It's kind of wild. But if your column contains only null values, then it will be defined to have a null dtype.

00:40:45.820 --> 00:40:47.420
And then it's just like, oh yeah, we got 10 nulls.

00:40:47.680 --> 00:40:49.120
And then it's like almost zero storage.

00:40:50.100 --> 00:41:00.700
And so PyArrow took this into account, and it means then that your data is no less accurate, but it's also tighter, easier to work with, and more predictable.

00:41:01.080 --> 00:41:10.200
Yeah, if you use a sentinel number or something for missing data, like you're seeing like negative 999, that may or may not work, but you better not ask what the average temperature is.

00:41:11.660 --> 00:41:12.780
It's just really cold there.

00:41:12.780 --> 00:41:15.380
I thought Hawaii was nice, but no, it's cold.
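
NOTE
Editor's sketch of the sentinel problem behind the joke, standard library only.
The readings are invented; the comparison is between a sentinel that gets
averaged as if it were data and an explicit missing marker that can be skipped.
```python
from statistics import mean
# Hypothetical temperature readings in Celsius; the third reading is missing.
with_sentinel = [27.0, 28.0, -999.0, 29.0]
with_none = [27.0, 28.0, None, 29.0]
# A sentinel looks like real data, so it drags the average far below zero.
bad_avg = mean(with_sentinel)
# An explicit missing marker can be skipped before aggregating.
good_avg = mean(t for t in with_none if t is not None)
print(bad_avg, good_avg)
```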

00:41:16.960 --> 00:41:17.580
That's right.

00:41:18.100 --> 00:41:18.420
That's right.

00:41:19.240 --> 00:41:19.680
Yeah, yeah.

00:41:20.040 --> 00:41:28.760
Another interesting aspect of Arrow, C++ Arrow, PyArrow, same thing, is the copy on write aspect to save memory, right?

00:41:29.060 --> 00:41:39.220
So maybe you've got a string that appears a lot of times like Kansas or Oregon or New Jersey or wherever, and you've got a million rows of those.

00:41:39.760 --> 00:41:42.540
Do you need that string repeated a million times, right?

00:41:42.760 --> 00:41:43.320
That's right.

00:41:43.620 --> 00:41:43.880
That's right.

00:41:44.000 --> 00:41:45.800
So it's much smarter about that sort of stuff.

00:41:46.180 --> 00:41:52.000
Like, you know, it's always easier to design a software system the second time around, when you see where all the issues were.

00:41:52.340 --> 00:41:58.300
And I think that they took a lot of the lessons from be it Pandas, be it R, be it Apache Spark, all these things.

00:41:58.420 --> 00:42:03.240
They're like, okay, where are there inefficiencies for the program or where are there inefficiencies in the system?

00:42:03.780 --> 00:42:10.340
And let's try to just like solve those problems as best as we can for the general public so they don't have to think about this.

00:42:10.780 --> 00:42:13.760
And yeah, that's part of like, so it gets way, way faster, way, way smaller.

00:42:14.020 --> 00:42:16.260
Yeah, we opened our pandas discussion.

00:42:16.410 --> 00:42:18.880
We're talking about importing data from lots of different sources.

00:42:19.500 --> 00:42:30.840
And it seems like Arrow might be slower because if it's doing compression, if it's doing deduping and all these types of things, it seems like it would be slower, but it's actually not.

00:42:31.310 --> 00:42:33.180
Like loading CSVs is way faster

00:42:33.640 --> 00:42:34.640
and these types of things, right?

00:42:34.920 --> 00:42:36.660
Oh my God, there's no comparison.

00:42:37.260 --> 00:42:44.320
So loading CSVs, and this is one of those things where PyArrow, we'll get to this in a bit, but like PyArrow will eventually like replace NumPy.

00:42:44.380 --> 00:42:53.300
But even today, like when we're recording, you can very confidently use PyArrow to read in your CSV files in Pandas.

00:42:53.620 --> 00:42:54.460
It won't change how it's stored.

00:42:54.480 --> 00:42:55.780
It'll still be stored in NumPy.

00:42:56.120 --> 00:43:02.720
I think, I'm not 100% sure, but I think that it does multi-threading and splitting up the file and all that stuff that we would sort of want it to do.

00:43:03.080 --> 00:43:04.640
So it's like blazingly fast.

00:43:05.060 --> 00:43:11.760
I'll even say, a few days ago I was talking with people, I even put up a YouTube video about this, I'm just so floored.

00:43:12.040 --> 00:43:15.760
So reading in an Excel file, I always thought, okay, Excel's a binary format.

00:43:15.940 --> 00:43:17.280
So I'll read it and it'll be nice and fast.

00:43:17.680 --> 00:43:20.460
And it took over a minute for me to read it in Excel.

00:43:21.120 --> 00:43:27.540
And then I tried it basically using one of the arrow binary formats that it has defined.

00:43:27.780 --> 00:43:29.940
I guess we'll talk about it a little bit because I'm jumping ahead a bit.

00:43:30.340 --> 00:43:34.100
And it was, I'm not exaggerating here, 2,000 times faster.

00:43:35.200 --> 00:43:40.600
It was so ridiculously, ridiculously fast because

00:43:40.600 --> 00:43:41.540
it

00:43:41.540 --> 00:43:45.920
is like so optimized for doing like one job and just that one job.

00:43:46.420 --> 00:44:04.960
Incredible. Yeah, you'd think Excel would be optimized for loading data, and I mean, Excel the app is, but I'm pretty sure that XLSX, the format, is a zip file that internally contains a probably namespace-laden XML

00:44:05.060 --> 00:44:05.520
document.

00:44:06.760 --> 00:44:15.580
You are good. So someone, one of my subscribers to Bamboo Weekly, emailed me and he said, okay, I get it. You're an open source kind of guy. You're not up on all the Excel formats.

00:44:15.980 --> 00:44:23.060
Let me explain to you. Just last night we had office hours and he, like, went into it in more detail. So you're spot on. XLSX is

00:44:23.060 --> 00:44:23.800
a zip file.

00:44:24.220 --> 00:44:34.600
You can actually unzip it and you can see all the XML files inside. And so that unzipping and that XML deserialization and so forth, that is where it's taking a ridiculously long time.
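
NOTE
Editor's sketch of looking inside a ZIP container, standard library only.
It builds a stand-in archive in memory rather than a real workbook; the member
names imitate a spreadsheet's layout and list_archive_parts is a hypothetical
helper. Passing the path of an actual .xlsx file should list its XML parts the same way.
```python
import io
import zipfile
def list_archive_parts(file_obj) -> list[str]:
    """Return the member names stored inside a ZIP container."""
    with zipfile.ZipFile(file_obj) as zf:
        return zf.namelist()
# Stand-in archive built in memory; the member names imitate the layout
# of a workbook and are illustrative, not a complete or valid XLSX file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("xl/workbook.xml", "<workbook/>")
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
buffer.seek(0)
parts = list_archive_parts(buffer)
print(parts)
```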

00:44:34.880 --> 00:44:34.980
Right.

00:44:35.140 --> 00:44:38.780
And apparently they didn't optimize for load speed, optimize for other stuff.

00:44:39.200 --> 00:44:40.640
And maybe that's the right choice for Excel.

00:44:41.080 --> 00:44:44.800
But this is like coming back to like, okay, we need to fix some of these problems.

00:44:45.040 --> 00:44:46.740
And what is the most common thing we do?

00:44:46.920 --> 00:44:48.180
Let's optimize for that, right?

00:44:48.480 --> 00:44:50.060
Kind of like columnar versus rows.

00:44:50.600 --> 00:44:54.620
So that brings us to a couple of file formats that are pretty interesting.

00:44:55.600 --> 00:44:57.460
Talk about Parquet first.

00:44:57.900 --> 00:45:01.460
By the way, I will admit, I have no idea if you're supposed to say Parquet or Parquet.

00:45:01.760 --> 00:45:02.920
So I'll go with you and parquet.

00:45:03.360 --> 00:45:06.300
I know that like the flooring is parquet, but whatever.

00:45:07.460 --> 00:45:08.020
You know what?

00:45:08.120 --> 00:45:09.200
I'm going to ask ChatGPT.

00:45:12.140 --> 00:45:13.460
It's always good at pronouncing things.

00:45:13.860 --> 00:45:20.900
So the basic idea is, okay, like the Arrow people came up with a great way of representing things efficiently in memory.

00:45:21.320 --> 00:45:25.100
So they said, well, what about representing that on disk?

00:45:25.560 --> 00:45:31.020
And they actually came up with two file formats because, you know, there are different tradeoffs we want to make.

00:45:31.320 --> 00:45:40.860
And Parquet format is like a sort of very, I don't know, verbatim version of, no, yeah, it's actually compressed.

00:45:41.260 --> 00:45:43.960
It's taking the binary data that we have and compressing it.

00:45:44.200 --> 00:45:44.920
What's the good news?

00:45:45.200 --> 00:45:46.580
Takes very little space on disk.

00:45:46.680 --> 00:45:50.940
The bad news is it takes a little bit of extra time to do the compression and decompression when you're saving and loading.

00:45:51.620 --> 00:45:53.140
Feather is the same idea.

00:45:53.360 --> 00:45:54.400
It just doesn't get compressed.

00:45:54.960 --> 00:45:57.780
So it takes up more disk space, but it's faster to load and save.

00:45:58.300 --> 00:46:03.380
In either one of these cases, you will be completely and utterly blown away by how fast they are.

00:46:03.680 --> 00:46:15.040
And the fact that they are binary formats that are exactly the same dtypes as you have in Arrow means there's no more guessing, there's no more playing around with CSV and having them nudged in the right direction.

00:46:15.600 --> 00:46:19.020
There's no more of this really long loading with Excel that we were just talking about.

00:46:19.380 --> 00:46:25.060
It just like screamingly fast pulls it into memory with exactly the dtypes that you wanted.

00:46:25.460 --> 00:46:27.480
Yeah, super interesting there, I think.

00:46:27.520 --> 00:46:28.860
Like, it's

00:46:28.860 --> 00:46:31.720
still, I see too many, like, because I deal with a lot of public data sets.

00:46:32.620 --> 00:46:35.780
And I see overwhelmingly they're still using CSV and Arrow.

00:46:36.300 --> 00:46:37.260
I'm sorry, CSV and Excel.

00:46:38.099 --> 00:46:43.480
Here and there, here and there, I'm starting to see people make things available in parquet format and feather format.

00:46:43.740 --> 00:46:46.420
So, like, it's making some inroads among the, like, data savvy.

00:46:47.120 --> 00:46:50.460
Yeah, well, what I was going to ask is, what do you think about the workflow?

00:46:50.920 --> 00:46:53.000
So, I'm going to work on a data science project.

00:46:53.480 --> 00:46:57.360
I've got a 200 meg CSV file that takes forever to load.

00:46:58.060 --> 00:47:03.200
Maybe the first thing I do is convert it to either, probably I convert it to Parquet.

00:47:03.660 --> 00:47:07.320
I'm going with that French-ish pronunciation as well.

00:47:07.840 --> 00:47:09.100
I convert it to Parquet files.

00:47:09.640 --> 00:47:12.480
And then from then on, my program just works with it.

00:47:12.760 --> 00:47:19.700
Maybe even at the start of your notebook or start of your code, you say, what is the last change of the Parquet file and the CSV?

00:47:19.860 --> 00:47:21.960
And if the CSV is newer, then regenerate.

00:47:22.160 --> 00:47:23.720
Some little guard like that.
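
The guard described here could be sketched roughly as follows, assuming pandas with PyArrow installed for the load step. The names `needs_refresh` and `load_table` are hypothetical helpers invented for this example, not part of pandas.

```python
import os
from pathlib import Path


def needs_refresh(source: Path, cache: Path) -> bool:
    """Return True when the cache is missing or older than the source file."""
    if not cache.exists():
        return True
    return source.stat().st_mtime > cache.stat().st_mtime


def load_table(csv_path: Path, parquet_path: Path):
    """Load from Parquet, regenerating it from the CSV only when needed."""
    import pandas as pd  # imported here so the guard itself needs only the stdlib

    if needs_refresh(csv_path, parquet_path):
        pd.read_csv(csv_path).to_parquet(parquet_path)
    return pd.read_parquet(parquet_path)
```

The CSV stays in the project as the source of truth, and the Parquet file is treated as a disposable cache that can be rebuilt at any time.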

00:47:23.880 --> 00:47:26.920
But just keep your CSV file as part of your project.

00:47:27.380 --> 00:47:31.500
but operationally swap it over to one of these new formats and just work with that.

00:47:31.640 --> 00:47:33.420
I would a hundred percent go in that direction.

00:47:33.860 --> 00:47:38.480
It's, it's like, you know, if you start using uv, you're like, Oh my God, I can't, I can't believe

00:47:38.480 --> 00:47:38.580
it.

00:47:38.580 --> 00:47:38.920
I'm not going back.

00:47:38.990 --> 00:47:39.900
I wasted so many

00:47:39.900 --> 00:47:42.680
days of my life waiting for pip to do its thing.

00:47:43.800 --> 00:47:49.640
And in the same way, when you start reading in files from Parquet, like as opposed to CSV files or

00:47:49.640 --> 00:47:50.100
even Excel,

00:47:50.110 --> 00:47:55.500
you're like, Oh my God, this is like, it just happens so fast that you can't even believe it.

00:47:56.340 --> 00:47:56.980
Again, like in

00:47:56.980 --> 00:48:00.700
my YouTube video, like I show, I use timeit, like, no, I didn't use timeit.

00:48:00.700 --> 00:48:05.280
I just like ran, like loaded in Excel once and took like, again, a minute, 20 seconds.

00:48:05.770 --> 00:48:09.660
And then I actually used %timeit in Jupyter to load it from Parquet format.

00:48:10.030 --> 00:48:16.720
And it was very happy to do a whole bunch of different loops and still ended up like way, way, way faster because it was so much, so ridiculously fast.

00:48:17.020 --> 00:48:17.200
Yeah.

00:48:17.830 --> 00:48:18.320
Super interesting.

00:48:18.410 --> 00:48:35.720
And I think this is a big opportunity here for people to really, you could probably even for the sufficiently large project, maybe you're not even wanting to use Arrow, but you could still probably load up a data frame in PyArrow and then you can call like to_pandas or something like that on it, right?

00:48:36.020 --> 00:48:38.420
Right, although, I mean, you could, you definitely could.

00:48:38.420 --> 00:48:41.200
And that's how I was like sort of first introduced to Arrow.

00:48:41.500 --> 00:48:41.840
It's like

00:48:41.900 --> 00:48:42.840
That's a gentle introduction.

00:48:43.440 --> 00:48:52.960
Right, like, right. It was like, well, here's Arrow and here's pandas. And look, you can convert between the two. But I mean, you can. And maybe there are a lot of people doing that.

00:48:53.320 --> 00:49:02.740
I just feel like, you know what, if I'm going to use Arrow, if I'm going to use PyArrow now, I'm just going to do it like directly inside of Pandas, inside my data frame and get like the best of both worlds.

00:49:03.360 --> 00:49:04.300
Right. Super interesting.

00:49:04.880 --> 00:49:18.920
So that's one of the big aspects or areas you focus on in your upcoming PyCon talk is that increasingly there's a way to say, I want to use Pandas, but Pandas don't use NumPy as your underlying storage engine.

00:49:19.260 --> 00:49:20.220
Use Arrow instead.

00:49:20.740 --> 00:49:21.600
That's right. That's right.

00:49:21.860 --> 00:49:22.560
So at

00:49:22.560 --> 00:49:23.460
some point,

00:49:23.870 --> 00:49:32.660
and it's not clear when, it's like they're going to make the switch where PyArrow will be the default storage and NumPy will be like an optional way to do it.

00:49:32.920 --> 00:49:35.640
Right now, it's not even the opposite of that.

00:49:35.880 --> 00:49:43.720
Right now, you can specify when you do a read_csv or read_excel or whatever, you can say dtype_backend equals "pyarrow".

00:49:43.950 --> 00:49:51.160
And then they have in like big bold letters, this is experimental, do not use in production, here be dragons, that sort of thing.

00:49:51.420 --> 00:49:58.920
But if you do do it, if you're a little like, you know, if you're willing to experiment, then the dtypes you see are not NumPy dtypes.

00:49:59.120 --> 00:50:00.140
They are PyArrow dtypes.

00:50:00.160 --> 00:50:03.200
And you can see the difference very clearly because it won't say int64.

00:50:03.300 --> 00:50:05.580
It'll say int64[pyarrow], with PyArrow in square brackets.

00:50:05.900 --> 00:50:08.760
So it's very obvious to your eyes when you look at the dtypes.

00:50:09.040 --> 00:50:15.180
And it is blazingly, blazingly fast at anything you want to do on a column.

00:50:15.580 --> 00:50:16.700
So you want to do mean.

00:50:16.920 --> 00:50:17.840
You want to do max.

00:50:18.120 --> 00:50:24.680
You want to like, even when you do like groupbys, I have in my talk, I have a whole bunch of graphs that I do.

00:50:25.060 --> 00:50:37.680
And there are a few graphs where the bar for NumPy pandas and the bar for PyArrow pandas, you only see one bar because a PyArrow data frame was so fast that like, it's just like basically

00:50:37.700 --> 00:50:39.160
Might as well be zero. Right.

00:50:40.580 --> 00:50:48.180
So I wouldn't say people should run out and put this in production just yet, but with every passing month or two, it's getting better and faster and more stable.

00:50:48.800 --> 00:50:50.660
And this is definitely the direction in which we're going.

00:50:50.900 --> 00:50:51.520
Very interesting.

00:50:51.820 --> 00:51:03.760
And so when you make that recommendation, like stable versus non-stable production, not production, I feel like that probably is a statement on Pandas plus PyArrow integrated, not a statement on Arrow itself.

00:51:04.700 --> 00:51:05.660
That's exactly right.

00:51:05.920 --> 00:51:06.140
That's right.

00:51:06.320 --> 00:51:11.500
The core developers are still like cautioning us because there are still like issues.

00:51:11.870 --> 00:51:20.480
And I don't even know, when I first started using PyArrow inside Pandas, I guess about two years ago, and I tried to do a bunch of string methods and it said, hey, this string method is not even implemented.

00:51:20.920 --> 00:51:23.260
And now, as far as I can tell, all the string methods are.

00:51:23.980 --> 00:51:27.400
But there are all sorts of holes that I have not encountered that I'm sure exist.

00:51:28.560 --> 00:51:35.940
And there's also one big sort of downside of using PyArrow, which is if you try to retrieve things by row.

00:51:36.370 --> 00:51:43.420
So if you're doing like .iloc to retrieve by row location, it is way slower than NumPy.

00:51:44.100 --> 00:51:47.360
Because suddenly it's like, oh, wait, you want to do by row?

00:51:47.600 --> 00:51:49.420
Oh, we're not so good at that.

00:51:49.840 --> 00:51:50.300
Hold your horses.

00:51:50.780 --> 00:51:51.920
Now, how often do you do that?

00:51:52.180 --> 00:51:52.820
Maybe not that often.

00:51:53.140 --> 00:51:54.860
Maybe it's not that big of a game changer.

00:51:56.400 --> 00:51:58.480
But you do need to take that into consideration.

00:51:58.530 --> 00:52:00.100
It's not a 100% win.

00:52:00.260 --> 00:52:05.440
It's also, you can convert from the NumPy version to the PyArrow version, right?

00:52:05.920 --> 00:52:06.640
Yeah, yeah.

00:52:06.840 --> 00:52:08.680
So there are two different things there.

00:52:08.690 --> 00:52:11.960
So one is if you have like a data frame.

00:52:12.460 --> 00:52:18.740
And so you can always use the astype method to take a series and get a new series back from it with a new dtype.

00:52:19.070 --> 00:52:24.880
So if I have int64 and I want to make it int32 or vice versa, I say .astype, the destination dtype, I get back a series.

00:52:24.950 --> 00:52:28.120
I can assign it back to that original column and it'll work just fine.

00:52:28.460 --> 00:52:33.820
So instead I can say astype("int64[pyarrow]") and then assign it back and you can mix and match the dtypes.

00:52:34.150 --> 00:52:39.280
So you can have a data frame in which some dtypes are PyArrow, some dtypes are NumPy.

00:52:39.660 --> 00:52:47.340
Now, I just discovered literally in the last few days in preparation for updating my talk that there is a pandas option.

00:52:48.300 --> 00:52:49.040
Let's see, what is it?

00:52:49.120 --> 00:52:52.800
It is, I wrote this down here, future.infer_string.

00:52:53.240 --> 00:52:53.960
So if you said

00:52:53.960 --> 00:52:54.840
future.infer_string

00:52:54.840 --> 00:53:04.720
to be true, and then you load a CSV, all of your strings will be PyArrow strings as opposed to Python strings as opposed to NumPy strings.

00:53:05.250 --> 00:53:09.560
And they are marked as, this has got to be like someone came up with this, NumPy.

00:53:09.840 --> 00:53:36.120
pyarrow. Now, what does that mean? It means that it's stored in pyarrow, but it uses some sort of numpy API accessor so that like pandas doesn't freak out, something like that. But it still uses like the pyarrow storage. So you're not going out to Python memory. It uses dramatically less memory than before and it's dramatically faster. And that seems like an in-betweeny step that people might want to adopt if they have a lot of string data.

00:53:36.360 --> 00:53:49.620
And that is an interesting step. Do you have, when you uv pip install pandas, do you have to also include PyArrow in order to get these features or does it come along?

00:53:50.900 --> 00:54:00.660
So the official statement that I've seen is that pandas 3 will, and I don't think there's a release date for that, pandas 3 will require PyArrow as a dependency.

00:54:01.340 --> 00:54:07.000
So even though they're not going to change the default, it'll still be default using numpy, we'll have to have it around.

00:54:07.040 --> 00:54:10.960
I don't believe that it's automatically installed when you install pandas now.

00:54:11.310 --> 00:54:13.980
So I believe that you have to like pip install like

00:54:13.980 --> 00:54:14.620
both of them.

00:54:14.860 --> 00:54:21.120
It would probably raise an exception if you said the dtype was string[pyarrow], but it didn't have PyArrow.

00:54:21.420 --> 00:54:22.560
Yes, yes, for sure.

00:54:23.600 --> 00:54:29.320
And I haven't done a lot of investigation into this, but it seems that like PyArrow has a lot of rich data types.

00:54:29.330 --> 00:54:31.860
And it even has like lists and structs.

00:54:32.100 --> 00:54:40.420
And it seems that pandas now has, just as it has .str and .dt to get to strings and date times, it has like a .list and a .struct.

00:54:40.760 --> 00:54:45.780
I've literally like written that down as something to investigate before I like do my talk next month.

00:54:46.100 --> 00:54:52.340
But it seems that they're trying to expose these complex Arrow data structures from within pandas as well.

00:54:52.640 --> 00:54:53.720
How many people are really gonna use it?

00:54:53.780 --> 00:54:55.340
I'm not sure, but it seems kind of interesting.

00:54:55.580 --> 00:55:01.360
It's gonna be interesting to see what the row-based operation performance, what happens to that, you know?

00:55:01.600 --> 00:55:13.380
I'm just thinking, is it almost at the point currently, if it's slow enough, that if you know you're about to enter into a whole bunch of asking a bunch of row oriented questions, do you convert it to a NumPy based data frame?

00:55:14.000 --> 00:55:17.060
Then ask a bunch of questions and then like throw that away and carry on?

00:55:17.120 --> 00:55:17.740
Or I don't know.

00:55:18.140 --> 00:55:23.740
No, like, you know, I've been thinking about, well, OK, how many row operations do I really do?

00:55:24.160 --> 00:55:25.580
And it turns out not to be that many.

00:55:25.680 --> 00:55:27.680
Like, I think they're making the right call here.

00:55:28.100 --> 00:55:28.200
Oh,

00:55:28.260 --> 00:55:28.560
of course.

00:55:28.900 --> 00:55:55.320
It can't be, like, I just don't think that they're going to leave it this slow. And I see, I can't remember exactly what it was, but when I started playing with PyArrow, I remember, I think it was grouping, I think it was grouping or maybe joining, one of those two was really, really slow. And I was in touch with one of the core developers and they were like, don't worry, we know, we're working on it. That's why it's still not ready for prime time. And it's definitely improved a ton since then. So there are definitely people, like, working hard on this stuff. Yeah,

00:55:55.320 --> 00:56:20.420
there's probably some data structures they can compute at load time that allow them more efficient iteration of row-oriented data if it turns out to be a problem. Maybe, I don't know, maybe you set a flag like, you know, include optimizations for rows or whatever, right? And so it like does a little extra work to pre-compute, like, let me ask questions of that data structure, and then that maps into the real columnar structure or whatever. I don't know. Who knows? It'll be interesting to see where it goes, though.

00:56:20.420 --> 00:56:21.180
Yeah, for

00:56:21.180 --> 00:56:42.920
sure, for sure. So once you have columnar data or you just have PyArrow underneath in general, it leads into the possibility of more direct interaction with other libraries, like thinking of things like DuckDB, right? DuckDB is really focused on analytics more than rows, like kind of SQLite versus

00:56:42.920 --> 00:56:43.540
DuckDB is kind

00:56:43.540 --> 00:56:49.380
of the same thing as, you know, pandas versus PyArrow type thing. What do you think about the interop there?

00:56:49.430 --> 00:56:50.380
Is that making differences?

00:56:51.390 --> 00:56:52.160
I haven't played with DuckDB.

00:56:52.290 --> 00:56:52.800
What are your thoughts?

00:56:53.160 --> 00:56:56.720
So first of all, I played with DuckDB and it is just like astonishingly fast.

00:56:56.750 --> 00:57:03.540
Like it amazes me that something that queries Pandas data frames can be faster than Pandas itself.

00:57:04.040 --> 00:57:10.240
Right, you think that would be the, how could it possibly outrun the thing that is its foundation as part of that conversation, right?

00:57:10.600 --> 00:57:11.360
Like how could that be?

00:57:11.620 --> 00:57:12.460
And yet it is.

00:57:12.760 --> 00:57:34.900
So I increasingly see it this way, that pandas, as much as people love to hate it, and they say, oh, it's got this problem and that problem and so on and so forth, it's becoming, as much as it is a package, it's becoming like a pluggable infrastructure that you'll be able to have different backend storage facilities like NumPy, PyArrow, and then those will talk to databases and so forth.

00:57:35.200 --> 00:57:45.040
And then the query structure is also looking pluggable in some ways, whether it's DuckDB or FireDucks, or like, who knows, people will come up with more stuff.

00:57:45.440 --> 00:57:49.020
And so you'll be able to sort of use Pandas without using Pandas almost.

00:57:50.500 --> 00:57:52.980
And like choose your weapon.

00:57:54.140 --> 00:57:54.500
I don't know.

00:57:54.570 --> 00:57:55.500
I don't know where this is heading.

00:57:55.610 --> 00:58:00.840
But I think it just cements Pandas as like not just the default we're stuck with.

00:58:01.180 --> 00:58:07.360
But it's like the sort of meeting place for all these data manipulation libraries in the Python world.

00:58:07.660 --> 00:58:07.840
Yeah.

00:58:08.130 --> 00:58:08.260
Yeah.

00:58:08.500 --> 00:58:08.860
Very interesting.

00:58:09.440 --> 00:58:13.960
You know, the other big contender, I suppose, is probably Polars, right?

00:58:14.320 --> 00:58:14.480
Right.

00:58:14.820 --> 00:58:17.660
for solving these types of problems and so on.

00:58:17.660 --> 00:58:20.060
And I believe it's also based on PyArrow, right?

00:58:20.280 --> 00:58:21.580
I believe so, right.

00:58:21.940 --> 00:58:22.860
And so a lot of it's speed.

00:58:23.120 --> 00:58:29.180
I mean, look, I have only the most positive things to say about the developers and the people working on it and people using it.

00:58:29.960 --> 00:58:31.960
It is indeed astonishingly fast.

00:58:32.210 --> 00:58:36.860
And I think that's partly due to Arrow and partly due to like very hard work by Ritchie and so forth.

00:58:37.160 --> 00:58:41.360
I just don't see it like sort of pushing pandas aside

00:58:42.100 --> 00:58:42.620
simply because

00:58:42.620 --> 00:58:43.660
it's too entrenched.

00:58:43.720 --> 00:59:17.100
Um, I don't know if you remember, again, like I'm dating myself, but like years ago the Lisp people were furious that C was like the main language. And there was this famous article called Worse Is Better that basically said, how can it be that Lisp is not the number one language when we all know it's fantastic? How can this terrible language C be taking over the world? And the answer was, well, it's everywhere and they've made like a good run of getting it everywhere. So tough luck. And I think in some ways

00:59:17.100 --> 00:59:18.140
even

00:59:18.140 --> 00:59:30.820
if Polars is better, like, pandas is there and people are using it. And you go try telling all these banks, nah, we're gonna like, we're gonna throw out all the pandas work we've done in the last few years and put in Polars. Just not gonna happen.

00:59:30.820 --> 00:59:40.360
No, it's not gonna happen. I do think there's interesting libraries, like, I interviewed Marco from Narwhals, which is like an interoperability story between those two.

00:59:41.200 --> 00:59:42.060
I've heard about it.

00:59:42.360 --> 00:59:43.260
I've heard you talk about it.

00:59:43.730 --> 00:59:48.760
I've played with it a tiny, tiny bit, but not enough to really have a real opinion.

00:59:49.180 --> 00:59:53.260
But as far as I'm concerned, anything that does interoperability, like fantastic.

00:59:53.560 --> 01:00:08.260
It's pretty interesting in that it basically, it knows if you pass it a Pandas data frame or a Polars data frame, and then it kind of adapts what it does to allow you to sort of operate on either kind of with the same operations, which is pretty interesting.

01:00:08.340 --> 01:00:09.820
But you do have to use the Polars API.

01:00:09.990 --> 01:00:11.780
So that's something there, I suppose.

01:00:12.360 --> 01:00:18.040
Yeah, and I think this PyArrow change that's coming along, it's going to be powerful, right?

01:00:18.140 --> 01:00:21.020
Certainly the speed is going to be well appreciated.

01:00:21.200 --> 01:00:27.140
The ability to load larger amounts of data rather than duplicating a bunch of strings.

01:00:27.360 --> 01:00:27.680
It's great.

01:00:28.060 --> 01:00:31.360
But what do you see as the pitfalls or the challenges?

01:00:31.680 --> 01:00:32.760
We're getting short on time here.

01:00:32.860 --> 01:00:41.000
Maybe we could wrap it up with both a statement of encouragement and steps to take, but also maybe warnings to be looking out for?

01:00:41.280 --> 01:00:42.580
I don't think I have too many warnings.

01:00:42.770 --> 01:00:49.380
Like so far, I think the Pandas core developers have been very cautious and slow.

01:00:49.920 --> 01:00:52.480
Probably some people would argue too slow, but I think it's good.

01:00:52.620 --> 01:00:53.560
Like this is people's data.

01:00:53.860 --> 01:00:55.620
This is like a serious thing.

01:00:56.640 --> 01:00:57.180
Take it slowly.

01:00:57.500 --> 01:00:57.920
Be careful.

01:00:58.460 --> 01:01:00.640
Make sure everything is really working the right way and working quickly.

01:01:01.840 --> 01:01:04.780
But I think like it's very encouraging.

01:01:04.980 --> 01:01:10.780
And I would say if you're using pandas right now, it's worth doing like taking a little detour for a little bit of time.

01:01:11.420 --> 01:01:13.560
Try out PyArrow. Try out these other details.

01:01:13.700 --> 01:01:18.080
At the very least, you should certainly be using PyArrow to be loading your CSVs.

01:01:18.420 --> 01:01:22.700
And you should even try out this loading of strings that I just discovered recently.

01:01:23.160 --> 01:01:33.000
I think just those things alone might speed up your pipeline to give you faster iterations and feel better about it.

01:01:33.580 --> 01:01:35.960
And just be ready at some point, right?

01:01:36.040 --> 01:01:41.360
At some point in the next few years, I don't know exactly when, they're going to flip that switch and say PyArrow is now the default.

01:01:41.880 --> 01:01:45.660
And you will be able to, I find it impossible to believe that they're going to say, and we're chucking NumPy.

01:01:45.760 --> 01:01:46.360
That's not going to happen.

01:01:46.720 --> 01:01:48.880
But you will need to say explicitly, I want to stick with it.

01:01:49.040 --> 01:01:54.060
And some people, I think a lot of people are going to find it advantageous to make that change along with Pandas.

01:01:54.900 --> 01:01:55.560
It's going to be exciting.

01:01:55.780 --> 01:01:56.600
It is going to be exciting.

01:01:57.440 --> 01:02:01.500
So one area maybe I could ask you about is reproducibility.

01:02:01.900 --> 01:02:03.220
That matters for businesses.

01:02:03.860 --> 01:02:37.800
Like you want to go, like, well, we ran this report and we made this important decision to spend a billion dollars on this thing based on this analysis. It's still good, we make a mistake, but certainly in the sciences, right? Like people build upon papers and theories as if they are perfectly solid building blocks. And if those things were to have trouble, that would be a real big problem. You want to be able to rerun your code 10, 15 years later. Changes like this could make it, not tomorrow or the next day, but eventually you could see it drifting far enough where it's like, oh, we're kind of done with NumPy and we're moving on to this thing.

01:02:37.830 --> 01:02:42.020
And eventually it might be tricky to get exact reproducibility.

01:02:42.260 --> 01:02:50.560
Right, it's sort of like, I remember I saw a talk about porting, if I remember correctly, like NumPy to Wasm.

01:02:50.880 --> 01:02:53.300
And they were like, did you know that NumPy requires Fortran?

01:02:54.320 --> 01:02:56.380
And so we had to like, like, I think it's NumPy.

01:02:56.450 --> 01:02:57.980
Like there was some part of this whole

01:02:57.980 --> 01:02:58.460
input, like

01:02:58.460 --> 01:03:02.880
the PyData stack. And none of us would have expected this because we're all like Fortran, right?

01:03:03.060 --> 01:03:03.640
Who uses that?

01:03:03.710 --> 01:03:05.560
But it turns out, right, people use these things.

01:03:05.980 --> 01:03:07.700
So people are going to have to take this into account.

01:03:07.830 --> 01:03:10.020
I think NumPy will still be around.

01:03:10.160 --> 01:03:12.280
Look, it's still a very actively used package.

01:03:12.820 --> 01:03:16.760
It's just not a good match for a lot of things that Pandas is doing.

01:03:17.160 --> 01:03:24.040
So you might need to, I don't know, put in your package specification what versions you want, that you do want NumPy to be included.

01:03:24.320 --> 01:03:25.920
Like it might be a little harder in the future.

01:03:26.340 --> 01:03:28.880
I don't think, like there's enough of an installed base.

01:03:29.320 --> 01:03:31.840
I don't think they're going to just like throw people to the wolves.

01:03:32.180 --> 01:03:35.820
I think it's going to be, it's not going to be a Python two to three situation.

01:03:36.240 --> 01:03:36.380
I

01:03:36.380 --> 01:03:38.020
think all of us have enough like

01:03:38.020 --> 01:03:42.540
emotional scarring that it's not going to happen.

01:03:44.400 --> 01:03:45.280
Yeah, I agree.

01:03:45.340 --> 01:03:46.060
I don't think it will happen.

01:03:46.160 --> 01:03:56.660
I'm just thinking, you know, over the long term, you can see sort of a slight eroding to the point where maybe, I mean, do we really think about running the same code 20 years later?

01:03:57.140 --> 01:03:58.800
Sometimes, but not that often.

01:03:58.880 --> 01:03:59.340
I

01:03:59.340 --> 01:04:01.280
mean, Python's only 30

01:04:01.280 --> 01:04:01.720
years old.

01:04:02.160 --> 01:04:03.180
NumPy's only 20, right?

01:04:03.780 --> 01:04:05.040
That's double its life, right?

01:04:05.120 --> 01:04:06.400
That's a long ways out.

01:04:06.760 --> 01:04:07.960
Pandas is less old.

01:04:08.240 --> 01:04:08.820
Right, right.

01:04:09.180 --> 01:04:17.040
I'm not too worried about that, but someone somewhere is going to get the short end of the stick a number of years from now, and that's okay.

01:04:17.220 --> 01:04:18.620
That's what their grad students are for.

01:04:20.160 --> 01:04:20.700
Rewrite it.

01:04:21.300 --> 01:04:24.660
No, more seriously, maybe pin your versions, right?

01:04:24.820 --> 01:04:33.520
If you're doing any sort of reproducibility, definitely pin your versions, but maybe even, you know, download some wheels and just hang on to some wheels for

01:04:33.520 --> 01:04:34.060
Linux or

01:04:34.060 --> 01:04:35.840
do a Docker sort of thing or something like that.
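
One way to start on the pinning advice is to record the exact versions an analysis ran with. This is a rough sketch using only the standard library; `pin_lines` is a hypothetical helper name, and the package list is just an example.

```python
from importlib import metadata


def pin_lines(packages):
    """Build requirements-style 'name==version' lines for installed packages."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Keep a visible marker rather than silently skipping the package.
            lines.append(f"# {name} is not installed")
    return lines


# Example package list; adjust to whatever your project actually imports.
for line in pin_lines(["pandas", "numpy", "pyarrow"]):
    print(line)
```

Saving that output next to the analysis gives a record to rebuild from later, alongside any archived wheels or container images.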

01:04:35.940 --> 01:04:36.220
Who knows?

01:04:36.480 --> 01:04:36.740
That's right.

01:04:37.000 --> 01:04:37.300
That's right.

01:04:37.520 --> 01:04:37.780
All right.

01:04:38.040 --> 01:04:41.860
And all these problems are obviously a sign of it being so successful, right?

01:04:41.980 --> 01:04:42.980
Pandas being so successful.

01:04:43.240 --> 01:04:43.800
Oh, for sure.

01:04:44.220 --> 01:04:44.680
What was it?

01:04:44.700 --> 01:04:46.040
Like the numbers are just astonishing.

01:04:46.140 --> 01:04:50.920
I think the last estimates were like there are between 5 and 10 million people using pandas nowadays.

01:04:51.760 --> 01:04:54.540
And let's assume that's like off by a factor of 10.

01:04:55.000 --> 01:04:56.260
It's still an astonishing number.

01:04:56.660 --> 01:04:57.480
It is astonishing.

01:04:57.820 --> 01:05:01.780
It's amazing. Well, we're going to be at PyCon. I got to book some stuff.

01:05:04.360 --> 01:05:18.340
In like five weeks from the time of recording, even less time from the time of release, maybe two weeks. Tell people about your talk. They can come see your dive into this, which I think will be fairly different. We didn't just go right down the slides of your talk or nothing like that. So

01:05:18.340 --> 01:05:19.280
there's a lot to learn from

01:05:19.280 --> 01:05:19.880
going to your talk.

01:05:20.040 --> 01:05:31.860
Yeah. The talk is much more like code oriented. Like here are like, here's how it looks. Here's how it works. Here's the like speed comparison. Here's where it's better. Here's where it's worse. So yeah.

01:05:31.860 --> 01:05:34.140
I never even told people, I haven't told people about the title.

01:05:34.640 --> 01:05:39.280
Oh yes. So it's called the, the PyArrow Revolution in Pandas.

01:05:39.740 --> 01:05:40.720
Yeah. So it's going

01:05:40.720 --> 01:05:44.920
to be Friday morning. I think I'm telling the truth there. And get

01:05:44.920 --> 01:05:46.360
people, people while they're fresh.

01:05:47.080 --> 01:05:56.300
Yeah, exactly. I will not be standing between them and lunch, which has often been the case in previous talks, and strangely I don't get a lot of questions then. That's

01:05:56.300 --> 01:06:20.900
interesting. I wonder how that works. No, you don't want that. And you don't want the last talk of the day, the last talk of the conference. But I mean, it's still good. People still appreciate it. But it's just, it's the reality of travel and airplanes and hunger and all these things. So really good. I encourage people to go check out your talk and it should be fun. It should probably be up on YouTube. I don't know what the time frame this year for talks being converted to YouTube videos will be, but eventually.

01:06:20.900 --> 01:06:23.380
Yeah, usually it's like, well, like two months or so after the

01:06:23.360 --> 01:06:24.160
conference. Yeah, something like that.

01:06:24.240 --> 01:06:25.120
I'm pretty confident.

01:06:25.740 --> 01:06:26.100
Yeah, absolutely.

01:06:26.460 --> 01:06:26.760
Indeed.

01:06:27.220 --> 01:06:29.120
Reuven, always great to catch up with you.

01:06:29.250 --> 01:06:30.980
Thanks for being on the show. My

01:06:30.980 --> 01:06:32.100
great pleasure. I'll see you in a moment.

01:06:32.260 --> 01:06:32.780
Yep. Bye.

01:06:33.820 --> 01:06:41.060
This has been another episode of Talk Python to Me. Thank you to our sponsors. Be sure to check out what they're offering. It really helps support the show.

01:06:41.680 --> 01:07:18.580
This episode is brought to you by NordLayer. NordLayer is a toggle-ready network security platform built for modern businesses. It combines VPN, access control, and threat protection in one easy-to-use platform. Visit talkpython.fm/nordlayer and remember to use the code talkpython-10. And it's brought to you by Auth0. Auth0 is an easy-to-implement adaptable authentication and authorization platform. Think easy user logins, social sign-on, multi-factor authentication, and robust role-based access control. With over 30 SDKs and quick starts, Auth0 scales with your product at every stage.

01:07:19.200 --> 01:07:24.720
Get 25,000 monthly active users for free at talkpython.fm/auth0.

01:07:25.480 --> 01:07:26.340
Want to level up your Python?

01:07:26.800 --> 01:07:30.460
We have one of the largest catalogs of Python video courses over at Talk Python.

01:07:30.920 --> 01:07:35.600
Our content ranges from true beginners to deeply advanced topics like memory and async.

01:07:36.000 --> 01:07:38.140
And best of all, there's not a subscription in sight.

01:07:38.640 --> 01:07:41.160
Check it out for yourself at training.talkpython.fm.

01:07:41.880 --> 01:07:46.020
Be sure to subscribe to the show, open your favorite podcast app, and search for Python.

01:07:46.480 --> 01:07:47.360
We should be right at the top.

01:07:47.880 --> 01:07:56.720
You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

01:07:57.380 --> 01:07:59.600
We're live streaming most of our recordings these days.

01:07:59.990 --> 01:08:07.460
If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:08:08.480 --> 01:08:09.600
This is your host, Michael Kennedy.

01:08:10.020 --> 01:08:10.860
Thanks so much for listening.

01:08:11.030 --> 01:08:12.000
I really appreciate it.

01:08:12.320 --> 01:08:13.960
Now get out there and write some Python code.