The PyArrow Revolution
Episode Deep Dive
Guest Introduction and Background
Reuven Lerner joined the show to talk about PyArrow's growing role in the Python data science ecosystem. Reuven is a longtime Python trainer and consultant who teaches at companies around the world, focusing especially on Python, Pandas, and data science. He is also the author of several "workout" books on Python and Pandas, and speaks regularly at conferences like PyCon. In this conversation, he shares deep insights into why PyArrow is poised to change how we do data analysis with Python.
What to Know If You're New to Python
If you're just getting started with Python, this episode may feel advanced. Here are some basics to review first:
- Basic Python syntax (functions, lists, dictionaries) will help you follow the data analysis discussions.
- Working with packages (i.e., using `pip install`) is crucial for installing libraries like Pandas and PyArrow.
- The idea of data frames: Pandas data frames hold tabular data (rows and columns) in memory, much like a spreadsheet.
- Array vs. object data: Understanding how Python stores different data types can help with memory and performance discussions.
Key Points and Takeaways
- Why PyArrow Matters for Pandas
PyArrow is a high-performance, columnar data framework that integrates with many languages, including Python. Pandas can optionally use PyArrow as its backend instead of NumPy, offering significant performance gains for column-based operations. This promises faster data loading, reduced memory use, and potentially simpler data interoperability across languages. A short sketch follows the links below.
- Links and Tools:
- PyArrow project: arrow.apache.org
- Pandas: pandas.pydata.org
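A minimal sketch of opting into the Arrow backend, assuming pandas 2.x with pyarrow installed (`data.csv` is a placeholder filename):

```python
import pandas as pd

# Parse the file with PyArrow's CSV engine and store the columns
# as Arrow-backed dtypes instead of NumPy arrays.
df = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")

print(df.dtypes)  # e.g., int64[pyarrow], string[pyarrow]
```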
- PyArrow's Columnar Advantage
Traditional NumPy-based Pandas works row by row under the hood, which can be less efficient for many analytical queries. PyArrow's columnar format is much faster at column-oriented aggregations, like computing means, sums, or group-bys. This columnar approach is already prevalent in modern data warehouses and drives significant speedups for large datasets. See the sketch after the links.
- Links and Tools:
- Parquet format: parquet.apache.org
- Feather format: arrow.apache.org/docs/python/feather.html
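A minimal sketch of a column-level aggregation done directly in PyArrow, assuming pyarrow is installed (the table and column names are made up):

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"city": ["Portland", "Salem", "Portland"],
                  "mph": [65, 55, 70]})

# Columnar aggregations scan one contiguous buffer, which is the fast path
print(pc.mean(table["mph"]))
print(table.group_by("city").aggregate([("mph", "mean")]))
```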
- Blazing-Fast I/O with PyArrow
Reading CSV files via PyArrow can be dramatically faster than using the default Pandas CSV reader. PyArrow can also handle compression, parse data types accurately, and even split large files for parallel processing. This is a major win for data scientists repeatedly loading big datasets. An example follows the links below.
- Links and Tools:
- PyArrow CSV docs: arrow.apache.org/docs/python/csv.html
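A minimal sketch, assuming pyarrow is installed (`big.csv` is a placeholder filename); PyArrow's CSV reader is multithreaded by default:

```python
from pyarrow import csv

table = csv.read_csv("big.csv")  # returns a pyarrow.Table
df = table.to_pandas()           # hand the result to Pandas if you need it there
```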
- Efficient String Handling
String data in NumPy-backed Pandas is stored as Python objects, often leading to huge memory overhead. By switching to the PyArrow backend for strings, Pandas can store them as actual columnar data. This compression and deduplication can massively cut memory usage, especially in large datasets with repeating values. A before-and-after sketch follows the links.
- Links and Tools:
- Future string inference in Pandas: pandas.pydata.org/docs
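A minimal sketch of the difference, assuming pandas 2.x with pyarrow installed (the values are made up):

```python
import pandas as pd

s_obj = pd.Series(["Oregon", "Kansas"] * 500_000)  # object dtype: Python strings
s_arrow = s_obj.astype("string[pyarrow]")          # Arrow-backed string column

print(s_obj.memory_usage(deep=True))    # counts every Python string object
print(s_arrow.memory_usage(deep=True))  # typically far smaller
```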
- Nullable Data Types
Handling missing values (NA / NaN) can be messy with NumPy, often forcing everything to float or special placeholder values. PyArrow has a built-in notion of nullable columns, so each data type (integer, string, etc.) can cleanly store missing information. This leads to less confusion and more consistent analytics. A short comparison follows the links.
- Links and Tools:
- Pandas NA/nullable dtypes: pandas.pydata.org/docs/user_guide/integer_na.html
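A minimal sketch of the contrast, assuming pandas 2.x with pyarrow installed:

```python
import pandas as pd

# NumPy-backed: one missing value silently upcasts the whole column to float64
print(pd.Series([1, None, 3]).dtype)  # float64

# Arrow-backed: integers stay integers, and the gap is a real <NA>
print(pd.Series([1, None, 3], dtype="int64[pyarrow]"))
```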
- When Row-Based Access Slows Down
One drawback of a columnar layout is that row-based operations may take a performance hit. If your workflow depends heavily on pulling data by row index, you might see slower speeds under PyArrow. That said, many data science workflows focus on column-level queries (aggregations, group-bys), where PyArrow excels; see the sketch after the links below.
- Links and Tools:
- Pandas `.iloc` indexing: pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
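A rough sketch for profiling the trade-off yourself, assuming pandas 2.x with pyarrow installed; results will vary by machine and data shape:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.default_rng(0).random(1_000_000)})
df_arrow = df.convert_dtypes(dtype_backend="pyarrow")

col_mean = df_arrow["x"].mean()  # column-oriented work: Arrow's sweet spot
row = df_arrow.iloc[500_000]     # row-oriented access: may be slower under Arrow
```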
- Hybrid Approaches & DuckDB
DuckDB is an in-process database designed for analytical queries and can integrate directly with Pandas or Arrow. Surprisingly, it can sometimes query Pandas data faster than Pandas itself. This highlights an evolving ecosystem of composable data tools around the Arrow format. A quick example follows the links.
- Links and Tools:
- DuckDB: duckdb.org
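A minimal sketch, assuming the duckdb package is installed; the DataFrame and column names are made up. DuckDB can find the local variable `df` by name:

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"state": ["OR", "OR", "KS"], "mph": [65, 55, 70]})

# SQL over the in-memory DataFrame, with no separate loading step
result = duckdb.sql("SELECT state, AVG(mph) AS avg_mph FROM df GROUP BY state").df()
print(result)
```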
- Arrow-Backed File Formats
PyArrow underpins file formats like Parquet and Feather, which store data efficiently on disk. Parquet offers compression for smaller file sizes, while Feather focuses on raw speed for both reads and writes. Both formats preserve data types accurately, which eliminates guesswork on load. A round-trip sketch follows the links.
- Links and Tools:
- Parquet format: parquet.apache.org
- Feather format: arrow.apache.org/docs/python/feather.html
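A minimal sketch of round-tripping both formats, assuming pandas 2.x with pyarrow installed (filenames are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Portland", "Salem"], "mph": [65, 55]})

df.to_parquet("data.parquet")  # compressed: smaller on disk, slight CPU cost
df.to_feather("data.feather")  # uncompressed: bigger on disk, fastest I/O

back = pd.read_parquet("data.parquet")  # dtypes come back exactly as saved
```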
- Transition Timeline & Pandas 3
PyArrow is not yet the default in Pandas, but the maintainers have laid a path for that switch. Future Pandas releases, likely starting with Pandas 3, will require PyArrow and make it simpler to adopt. NumPy storage will remain an option, but Arrow-based data frames will increasingly be the norm.
- Links and Tools:
- Pandas GitHub issues: github.com/pandas-dev/pandas/issues
- Practical Workflow Tips
Many teams import raw CSV/Excel data into an Arrow-based format first and then operate on it in Pandas (see the convert-once sketch after the links below). This can eliminate repeated conversion overhead, supercharging your iterative analysis cycle. If you do rely on row-based operations, weigh the trade-offs or look into partial conversions.
- Links and Tools:
- Data cleaning guides in Pandas: pandas.pydata.org/docs/user_guide/
- "Move from Excel to Python with Pandas" (See Learning Resources)
Interesting Quotes and Stories
"I always tell anyone who can, go to conferences. It's a great place to learn, but it's also just a great place to have fun." -- Reuven
"I don't think it's going to be a Python 2 to 3 situation. Enough of us have enough emotional scarring that it's not going to happen." -- Reuven
"Computers don't do what you want them to do, they do what you tell them to do." -- Reuven
Key Definitions and Terms
- Arrow / PyArrow: A columnar in-memory data format and its Python library. Designed for fast data interchange and analytics.
- Columnar Data: Storing data by column instead of by row, speeding up aggregations and compression in analytical workflows.
- Parquet and Feather: Binary file formats closely tied to the Arrow ecosystem. They preserve types and enable rapid loading; Parquet additionally compresses the data.
- NumPy: A foundational library for numerical operations in Python. Pandas historically relies on it for its data storage.
- Nullable Dtypes: A concept allowing columns of integers, strings, etc., to handle missing values consistently (e.g., NA) without forcing float or sentinels.
Learning Resources
- Python for Absolute Beginners: Ideal for those just starting their Python journey and needing a strong programming foundation.
- Move from Excel to Python with Pandas: Learn how to transform your Excel-based workflow into a Pandas-powered powerhouse.
- Data Science Jumpstart with 10 Projects: For hands-on, project-based learning in the broader Python data science stack.
Overall Takeaway
PyArrow represents a huge leap forward for Python data science workflows. While NumPy remains foundational for numerical arrays, PyArrow's columnar storage and rich data types deliver blazing speed, smaller memory footprints, and simpler interchange with other data processing systems. Pandas is on the cusp of making PyArrow a first-class citizen, meaning we can expect the "PyArrow revolution" to keep gaining steam, helping data scientists, analysts, and developers handle larger data sets and more demanding tasks with Python.
Links from the show
Apache Arrow: github.com
Parquet: parquet.apache.org
Feather format: arrow.apache.org
Python Workout Book: manning.com
Pandas Workout Book: manning.com
Pandas: pandas.pydata.org
PyArrow CSV docs: arrow.apache.org
Future string inference in Pandas: pandas.pydata.org
Pandas NA/nullable dtypes: pandas.pydata.org
Pandas `.iloc` indexing: pandas.pydata.org
DuckDB: duckdb.org
Pandas user guide: pandas.pydata.org
Pandas GitHub issues: github.com
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm
--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
Episode Transcript
00:00 Pandas is at the core of virtually all data science done in Python.
00:03 That is, virtually all data science.
00:06 Since its beginning, Pandas has been based upon NumPy.
00:09 But changes are afoot to update those internals, and you can now optionally use PyArrow.
00:15 PyArrow comes with a ton of benefits, including its columnar format, which makes answering analytical questions faster, support for a range of high-performance file formats, inter-machine data streaming, faster file I/O, and more.
00:29 Reuven Lerner is here to give us the lowdown on the PyArrow revolution.
00:33 This is Talk Python to Me, episode 503, recorded April 8th, 2025.
00:39 Are you ready for your host? Here he is!
00:42 You're listening to Michael Kennedy on Talk Python to Me.
00:45 Live from Portland, Oregon, and this segment was made with Python.
00:52 Welcome to Talk Python to Me, a weekly podcast on Python.
00:55 This is your host, Michael Kennedy.
00:57 Follow me on Mastodon where I'm @mkennedy and follow the podcast using @talkpython, both accounts over at fosstodon.org and keep up with the show and listen to over nine years of episodes at talkpython.fm. If you want to be part of our live episodes, you can find the live streams over on YouTube. Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified about upcoming shows. This episode is brought to you by NordLayer. NordLayer is a toggle-ready network security platform built for modern businesses. It combines VPN, access control, and threat protection in one easy-to-use platform. Visit talkpython.fm/nordlayer and remember to use the code talkpython-10. And it's brought to you by Auth0. Auth0 is an easy-to-implement, adaptable authentication and authorization platform. Think easy user login, social sign-on, multi-factor authentication, and robust role-based access control. With over 30 SDKs and quick starts, Auth0 scales with your product at every stage. Get 25,000 monthly active users for free at talkpython.fm/Auth0. Reuven, welcome back to Talk Python to Me. Awesome to have you here.
02:08 Thank you so much. Delightful to be here with you.
02:10 Yes, we're coming up on conference season and I saw you doing conference things. So bit of a conversation about what you're going to be covering at PyCon.
02:22 Absolutely.
02:23 Yeah. I'm really, I mean, I love conferences. I love seeing people. I definitely got to the point where they're like conference friends who I see every year and we can sort of catch up and hang out. It's just like a fun, fun experience. I always tell anyone who can like go to conferences. It's a great place to learn, but it's also just a great place to have fun.
02:40 I agree with that. I also think it's a great way to connect more deeply with programming and technology and libraries and all that kind of stuff. It's real easy for, I think, for a lot of folks for this to feel like a set of tutorials and documentation, right? And then you get there and you're like, oh, all these people are doing it and they're excited. And there's the person that made that one. And, you know, like to swim in those waters, it's different.
03:04 I also feel like it's kind of sad, but I mean, I go to all these companies and I get a feeling that for many people who are in programming nowadays, it's kind of lost its fun and its creativity. And so it's very nice to be in a community where, because it's open source, everyone's there because they want to be there and because they are excited about it. And you can sort of, you know, recharge your excitement batteries, as it were, and realize, oh, there's more to this than just the drudgery of day-to-day and meetings and filling my corporate goals. It's nice. It's fun.
03:34 It is. And, you know, speaking of just swimming in the waters, and what is water, right, that famous quote: you talked about how it's so much fun because of open source and things like that. Like, I hadn't thought about that for a little while.
03:45 Like, you know, when I work on stuff, I can just do whatever I want.
03:48 If I want to share it, I can share it.
03:50 I don't have to share it.
03:52 Use whatever libraries that might be coming along that look promising.
03:56 There's not a corporate mandate like we're going to have these features for our library in seven months.
04:03 And because the customer demand asked for this and we're going to put this thing in to promote our cloud or our other thing or whatever, right?
04:12 There's a lot of people out there writing code with a lot less flexibility.
04:15 Oh, my God.
04:16 Yes.
04:16 And I seem to see more such people each year.
04:19 And I also feel like I always say, like, I have this sort of dual flexibility that I feel very privileged to have that, A, I'm a freelancer, I'm independent, and B, I work in open source.
04:30 So I can say such and such is dumb or such and such is bad.
04:34 And people have like a normal job, as it were, have to sort of say, well, like, this is our product and it's great.
04:40 Or at least they have to say it's the outside world.
04:41 Whereas day to day, they're just going to meetings saying, how can we convince people that this is great?
04:47 So, yeah, yeah, it's a nice way to escape that corporate golden handcuffs.
04:54 Oh, I'm making it sound really terrible.
04:56 For all of you listening, I'm happy you have good jobs.
04:59 I really am.
05:02 Unfortunately, I'm afraid people might start having to appreciate their jobs a little bit more.
05:07 Things are looking a little hectic out there.
05:08 I don't want to go into that, but one little side diversion before we dive into the main topic.
05:13 You go into all these big companies.
05:16 What's the LLM AI story for those?
05:20 Is it different than people on the outside who can just YOLO around the tools however they want?
05:25 Or what's it like?
05:27 So every company is different.
05:28 Every company is asking that question, right?
05:30 And no one has an answer.
05:31 I think a growing number of companies are assuming that their people will use LLMs of some sort.
05:38 Copilot's been around for a while.
05:39 People are using that.
05:41 A year ago, I have one big client where they said, no, of course we would never use ChatGPT.
05:45 And just a few weeks ago, I said, so, like, what's the story?
05:48 Oh, yeah, we're definitely using some things.
05:50 We'll get back to you on one.
05:51 So it's been increasingly integrated just because, especially, you know, if you're a senior developer, these LLMs really help you just zoom along.
05:59 The junior ones, it's sort of like a little iffier, but everyone at least has to answer the question, what are you doing with these?
06:06 And I think it's increasingly integrated into their workflow.
06:08 I don't think anyone knows what is the right way to do it.
06:11 I do think that these companies that are talking about, well, we're not going to hire any developers this coming year because instead we're just going to use LLMs, that's just nuts.
06:19 I think they're asking for trouble there.
06:22 And in general, I tell people, don't have the LLMs write code for you.
06:25 Have it help you strategize.
06:26 Have it go over your code.
06:28 Have it help you learn things.
06:29 But somewhere, somehow, they're going to have LLM-generated code that no one's going to look at.
06:34 And I don't like that idea so much, at least for now.
06:37 Interesting.
06:38 I would not trust 100% LLM written code.
06:42 Not even necessarily because I think LLMs are bad at writing code.
06:45 I'm stunned at how good they are at it.
06:49 But they write the code that you ask them to write.
06:51 Even if they get it 100% right, they write what you ask them to write.
06:54 And it's like, I'm just seeing the office space guy like, I'm good with customers.
06:59 I talk to the customers.
07:02 What would you say you do here, Bob?
07:05 Like that guy?
07:07 I mean, you know, and you've got to give the specifications to the AI really, really well.
07:11 There you go.
07:12 There you go.
07:12 So when ChatGPT first came out, right, it was this whole meme of the programming language everyone needs to learn now is English.
07:20 Because all you have to do is tell the LLM what you want to do, and it will come out with code, voila, problem solved.
07:26 And anyone who has worked on a project before, and especially anyone who's worked with clients before, non-technical clients, knows that the gap between specifying what you want in clear, precise language and getting code that does it can be vast. And the difference between success and failure. I often tell my students one of my favorite lines that I heard years ago, which is computers don't do what you want them to do. They do what you tell them to do.
07:51 And like, we've all been bitten by that so many times.
07:54 Yeah, we definitely have. We definitely have. All right. Well, I think this is a story that's going to continue to get just more insane. It's going to be interesting to see where things go.
08:06 I think it's both going to supercharge open source, but also cause real turbulence for programmers, right? So we're going to see.
08:15 No, no question. I 100% agree.
08:17 Yeah. So, you know, with all this, we haven't gotten to, I haven't given you a chance to introduce yourself to
08:23 everyone because everyone knows Reuven, but maybe for the couple of people, real quick introduction. Fantastic.
08:29 So yeah, so I'm Reuven Lerner and I teach Python and Pandas and Git for a living. I've been doing it for a long time, since like 1995 or so. And so half of my work is going to companies and doing training there. And the other half is doing online learning.
08:44 I've got my own platform, I've got books, I've got newsletters, YouTube channel, online bootcamp that I do. And my goal is just like to help people wherever they are with their Python Pandas knowledge to advance, get better, get more fluent.
08:56 Awesome. Well, that's pretty much the story of this episode as well. But I also, I mean, that's a great introduction, but I didn't know you were such an athlete. I mean, you didn't even talk about all these, these workout books that you're writing and you're a workout influencer.
09:14 Yeah. Well, I wish it were more physical than virtual, but yeah. So I've got my two books published with Manning, Python Workout and Pandas Workout.
09:25 And actually, Python Workout is now in its second edition in early release form.
09:31 So it's a relatively minor update to take advantage of all the new stuff that's come out of Python in the last, what, three, four, five years.
09:39 So it's not like a huge overhaul, but like something there.
09:41 Some of the exercises that everyone kind of said, really, you want to do that?
09:45 And so like, yeah, there are always some stinkers in there.
09:48 But overall, it's great fun.
09:50 And the idea is, you know, Manning and I came with the title, but the idea is you're only going to get better if you do lots of little practice every day.
09:59 I often say that it's similar to learning a language, but actually, like I have started running in the last few months.
10:05 And like, what do you know?
10:06 You do a little more each day, a little more each day.
10:09 And then like, you know, every so often you'll get injured, but like, then you go back to it.
10:12 And so over time, you build up the strength and the stamina and the fluency so you can really like get into a project and do what you need and not be looking everything up all the time.
10:22 You know, one of the things I noticed when I, you know, as you, I've done a lot of in-person corporate training type stuff, not for a while, but, you know, over my career, I think it's really interesting. You go interact with all these folks, some of whom are brand new on a team or a project, but others have been, you know, like I've been at the company for 20 years and that's usually really awesome. However, there's certainly plenty of times that I saw people who had 20 years of experience, but it didn't feel like they had 20 years of experience knowledge in the sense that they kind of did the same thing you did in the first couple years and just kept doing that for 18 more rather than having a wide ranging set of experiences. And it's like a lot of things like exercise, like sports, like other skills without focused practice on something. You can get sort of into a rut or get really good at like a few things, but you're like, well, I've never really created a website. I always just work on this database layer. It's like 20 years. Okay. Spread out of it. Come on. And I feel like this is the kind of stuff you're talking about maybe.
11:27 That's exactly it. That you want to get this wide variety of practice. So the books, I don't think I said this explicitly, but the books are all exercises. Like, there's some comments in there and some sort of, shall we call it, mini or micro tutorials to get you up to speed. But the idea is you've already learned Python, you've already learned Pandas, and now you just need to practice. And better that you practice with what I call controlled frustration, in this sort of, you know, environment where it's not going to matter to your job, than when you get to work and your boss is breathing down your neck and you've got deadlines. And I try to make it as varied as possible so that you'll sort of be exposed to as many different ideas as possible. So even if you don't remember it 100%, you'd be like, oh wait, here I probably could have used a dictionary comprehension. I don't quite remember exactly what the syntax is, but that's probably the right direction. And that's way better than, what do I do now, or just doing it the wrong way.
12:17 Yeah, absolutely. And circling back a bit to our LLM conversation, getting exposure to all these things, you're like, well, the LLM wrote this and it looked weird. I didn't understand it, but now I see what it was doing. It was using this thing I hadn't really played with, this aspect of the language I hadn't played with. Yeah.
12:31 One of the things I like to do with LLMs is what I call the reverse Socratic method, where I ask it lots of questions about either my code or, like, oh, you're saying I should do it this way, but why? And what if I do it this way? So instead of the teacher asking the student lots of questions, the student asks the teacher lots of questions, if we think of the LLM as a teacher. And I found that I learn a lot of things that way, probing. I both learn the nuances and I see where its limitations are, or where it's just bluffing.
13:00 And so I find that to be sort of a useful technique to play with.
13:04 Very interesting.
13:06 Well, I would propose that Pandas has certainly proven itself to be a tad useful.
13:11 You know, here and there, there are like a handful of people using it nowadays.
13:14 It's astonishing.
13:15 I know.
13:16 Does it even need an introduction?
13:17 I'm not sure that it needs an introduction.
13:19 I mean, if you're listening to this podcast.
13:21 15 seconds.
13:22 Yeah, if you're listening.
13:23 No, I will. Here's, I'll tell listeners out there, there's a very interesting group of people who do listen to this podcast. I'd be interested to hear your thoughts on this. I've had people write to me and they'll say, I really love your show. Thanks so much for doing it. However, you know, I'm starting to understand a lot of the words that you guys are using and what you're talking about. I've been listening for, you know, two months or something. That's serious persistence, to listen for two months when you really, like, start out not even knowing what's going on.
13:50 But a lot of people use this show like language immersion, you know?
13:55 You want to learn Portuguese, you move to Brazil, and then you start learning it, right?
13:59 Not the other way around.
14:00 So I'm always cognizant of those folks who are using this to kind of as their first step into the industry.
14:07 What do you tell those folks Pandas is?
14:08 So Pandas is a library, like a module or package in Python that lets you do data analysis.
14:16 And the way I describe it to people who are not programmers is it's basically Excel inside of Python.
14:22 So you can read in data from a lot of different sources.
14:24 You can analyze it in two-dimensional tables, right?
14:27 So you've got rows, you've got columns, and then you can perform a ton of different calculations.
14:32 You can use dates, you can use text, but you also have all the flexibility of Python as a programming language.
14:38 So you can extract different parts of it.
14:40 You can mix and match different parts of it.
14:41 And Pandas is especially really good at importing from a ton of different sources and exporting back to those sources or those destinations, I guess.
14:50 And it's become this, well, to extend the language thing, like this lingua franca, like a lot of people use pandas for even a tiny subset of what it can do just because it's so ridiculously flexible and because it's everywhere.
15:03 Yeah, good description.
15:06 This portion of Talk Python to Me is brought to you by NordLayer.
15:10 NordLayer is a toggle-ready network security platform for modern businesses.
15:14 It combines VPN, access control, and threat protection in one easy-to-use platform.
15:20 There's no hardware or complex setup, just secure connections and full control in less than 10 minutes.
15:26 It's easy to start with quick deployment, step-by-step onboarding, and 24-7 support.
15:31 It's easy to combine.
15:33 It works with existing setups in all major platforms.
15:36 And it's easy to scale.
15:38 Add users, features, and servers in just a few clicks.
15:42 Single sign-on and provisioning included.
15:44 NordLayer provides zero-trust network access-based solutions.
15:48 It adds threat protection to keep malware, ransomware, and phishing from reaching your endpoints.
15:54 It increases your threat intelligence to spot threats before they escalate, and it helps businesses achieve compliance.
16:01 So if you're responsible for the security of your software or data science team, you should definitely give NordLayer a look.
16:07 As Talk Python listeners, you'll get an exclusive offer, up to 22% off NordLayer's yearly plans, plus an additional 10% off the top with our coupon.
16:17 Just use the code talkpython-10.
16:20 That's talkpython-10, all lowercase.
16:23 Try NordLayer risk-free with their 14-day money-back guarantee.
16:28 Visit talkpython.fm/nordlayer to get started.
16:31 That's talkpython.fm/nordlayer.
16:33 The link is in your podcast player's show notes.
16:35 Thank you to NordLayer for supporting Talk Python and me.
16:39 This loading data from multiple data sources, it's nuts.
16:43 It's crazy how good it is.
16:45 So let me just throw an example out there for people who maybe haven't done a lot with importing data with pandas.
16:50 There, you could do things like there is a HTML table on a state government website that talks about some bit of data that you need.
17:00 And it's just embedded in some web page.
17:03 It's the third table, HTML table, you know, bracket table slash table sort of thing on there.
17:09 And you can say, load table, give it that URL and say, or, you know, load that HTML, give me the tables, go to the third one.
17:18 And it's a Pandas data frame.
17:19 Like that level of just grab it, right?
17:22 So I have my, like, one of the newsletters I produce is Bamboo Weekly, where I have challenges each week to use Pandas.
17:30 And I'm always trying to retrieve stuff from different sources because A, people have lots of different needs.
17:35 B, there's lots of data out there.
17:37 And C, then you have to clean it up and you need different techniques for cleaning it.
17:40 And so like, right, you can retrieve it from a PDF file if there are tables there.
17:44 You can retrieve, as you said, from HTML, you can retrieve it from Excel, from JSON, from other statistics programs, their binary formats.
17:52 It basically is infinite.
17:53 And then you've got CSV, which is like every possible almost kind of standard under the sun.
18:00 And Pandas is like, oh, that's okay.
18:01 We'll just give you 100,000 different options and then you can read any of them.
18:04 Yeah, that's amazing.
18:06 I'm just blown away with it.
18:07 So super, super interesting.
18:10 One of the things that I think kind of goes hand in hand with Pandas is, of course, NumPy.
18:17 And that's a little bit at the heart of what we're getting at, right?
18:21 Like traditionally, Pandas is sort of internally used NumPy to manage its data structures and so on.
18:27 And there's some new libraries and formats coming along.
18:30 And you might be able to mix and match or even have to mix and match eventually.
18:35 Right.
18:36 Right. So, so it all started like, so when people hear that, when people hear that Python is the number one language for data science and machine learning and data analytics, if they know anything about Python, they're a little confused. Like, wait a second. Python's a great language, but its data structures are big and slow. Why would I possibly want to use this?
18:56 Yeah, especially yours.
18:58 I think if you look at like what is most out of sync or out of space with like C or other native languages, right?
19:06 Like how big is a Python float versus a regular one and like locality of
19:11 data, all that kind of stuff.
19:12 I think the last time I checked, it was like 24 bytes for a zero integer in Python.
19:19 Yes, and there are pointer dereferences, and they're on the heap, so they might be in different places.
19:23 Like there's a lot going on here.
19:24 So many years ago, I don't know, 20 years ago or something, we got NumPy. And NumPy is basically like the best of both worlds. It's C storage and speed, and so all that efficiency, but with this really thin layer of Python so you can work with it. And you sort of get the ease of Python and, as I said, the efficiency of C. And so NumPy is fantastic at doing that. It's very, very widely used in science, engineering, math, statistics, all that stuff. And you can do a ton of stuff with NumPy and be very happy. But most data analysis that we're going to do is going to be in two-dimensional tables, and it's going to use a lot of strings, and with the import and export that we were talking about. And so I always describe Pandas as like an automatic transmission for NumPy's manual transmission, that you get a lot of sort of convenience functionality that just makes it smoother, cleaner, easier to do your day-to-day stuff. And so Pandas has been reliant on NumPy for, well, since it started, like no doubt about it.
20:25 And if you sort of chip away or like scrape away the outer layer there, you very quickly see NumPy stuff.
20:30 For example, the D types, the data types that we use in each Pandas column, those are defined almost exclusively by NumPy types.
20:39 And so I actually, when I teach Pandas, I first teach NumPy because I feel it's like an easier sort of lower level way to get used to it.
20:46 And then they see all the techniques applied to Pandas as well.
20:49 Yeah, the data types are interesting because NumPy is written in C.
20:53 C operates on structured, well-defined.
20:56 You know, this thing is four bytes, that's eight bytes and so on.
20:59 Data types.
21:00 And so those have really interesting limitations.
21:02 I have a joke for you that I ran across recently and I think it highlights this.
21:08 So I wish I had a picture I could share of, but I don't.
21:11 So there was this programmer that finds one of these genie in a bottle, sort of genie things, rubs it.
21:18 The genie pops out and says, hello, lucky one.
21:21 You have three wishes.
21:22 But before you can wish, there are some rules.
21:23 You can't wish to kill someone or make someone fall in love with you.
21:27 Most importantly, you can't wish for more wishes.
21:29 The programmer says, well, can I wish for fewer wishes?
21:32 Why would you wish for fewer wishes?
21:34 I just don't understand.
21:35 He goes, well, I want negative one wishes.
21:37 Fine, you have 4,294,967,295 wishes.
21:45 Oh, that's pretty good.
21:46 So what happened?
21:47 Like, why did that go wrong?
21:48 I mean, that's the D types, right?
21:50 That's exactly right. That's exactly right. So I never thought of it that way. I always think, maybe because I'm old enough, I think you and I are about the same age, that when you played video games when you were little, if you're really, really good, you would get the maximum score, it would sort of wrap around back to zero. Or
22:06 if you had a really old car, and if you drove it a long time, eventually the odometer would wrap back around to zero. There's a
22:12 limited number of digits, and after 99999, it has to go back to 00000. And that's basically what's happening bitwise in NumPy. It has a fixed size, unlike Python data types; Python integers will get as big or as small as you have memory. There is no limit. But when you're working with NumPy or with Pandas with these dtypes, you have to say, is it 8 bits or 16 or 32 or 64? And that's it. Once you reach that ceiling, then it wraps around, and it will not warn you about this either. So you need to keep enough of a buffer there between what you think will be your maximum number and what could possibly ever be your maximum number. Like if you want to do, I don't know, eight bits for ages, that's probably fine. Right. But if you want to do eight bits for, I don't know, how long is the project going in number of days? Oh, better hope that your project is going to be done soon, because you could be in trouble and you could be into negative territory.
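That wraparound is easy to reproduce; a minimal sketch, assuming NumPy is installed (the values are made up):

```python
import numpy as np

# 8-bit unsigned integers can only hold 0..255
days = np.array([250, 251], dtype=np.uint8)

# Arithmetic past the ceiling silently wraps around; no warning is raised
print(days + 10)  # [4 5]
```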
23:09 Yeah. They should have made a little bit bigger choice for the epoch since 1970, you know, that's right. That's right.
23:15 We'll find out in 2038 about that one.
23:18 I think that's the year.
23:19 Anyway, it's going to be bad.
23:20 Yeah, so the genie was basically storing the wishes count in an unsigned 32-bit integer.
23:26 I love that joke.
23:27 And I have no one to tell it to, so I'm very glad you told it to me.
23:31 Exactly.
23:32 What? That's a stupid...
23:33 I don't even understand.
23:34 What a stupid genie.
23:35 Of course, they have no more wishes.
23:37 They got some more wishes.
23:39 Another thing, you talked about this buffer sort of deal.
23:43 If you think maybe you need 16 bits or whatever, or 32, maybe you want to be safe so you're going to double that.
23:53 That also adds to a bunch of memory usage.
23:56 When you allocate 64-bit integers instead of 32, even if you don't use that space, you consume that much memory.
24:04 That's right. That's the thing.
24:06 Again, if you're an even experienced Python programmer, you're like, well, I'll just like whatever the integers need, they need.
24:13 But then comes along NumPy and Pandas and they say, no, you have to choose how big it's going to be.
24:18 And I'm like, well, okay, let's just make everything 64 bits, right?
24:21 What could be the cost?
24:21 Just be safe.
24:22 Right.
24:23 And basically, let's say you have a billion rows that, you know, let's say you just have a billion elements.
24:30 Well, 64 bits is going to be like literally twice as much as 32.
24:35 And like that could mark the difference between running out of memory and not running out of memory or having to swap.
24:41 Like it can get very bad.
24:42 And so, especially since Pandas is constrained by what can fit into available RAM.
24:49 So you're always stuck with this tension with these D types between you have to keep it bigger, big enough to fit all the data you want, and small enough that you'll be able to fit everything into memory.
25:03 And it's a bit of a game.
25:04 And there's no formula you can use because you can't know in advance what all your data is going to be usually.
25:10 Yeah, for sure.
25:11 I mean, honestly, it really freaked me out a little bit when I first started doing Python.
25:15 And it didn't matter what integer type I created.
25:17 I'm like, well, I give it this number, but what if it gets too big?
25:21 How do I control that?
25:22 You know, you don't.
25:23 It's just, it's magic.
25:24 Right, right.
25:25 And I come from like a dynamic language background.
25:27 Like I was always sort of brainwashed to think this is the way normal things are.
25:31 And so when I was like told that there are languages where you have to say how many bits it's going to be in advance, I was like, wait, what kind of crazy stuff is this?
25:39 But it turns out a very large number of people see that as totally normal.
25:43 Yeah, it's interesting.
25:44 I was just looking at some C-sharp stuff last night and all the symbols and all the stuff there.
25:49 Like it seems normal when you're in that.
25:51 Then you step out of it.
25:52 Like, wait, I don't have to be constrained by this or I don't have to worry about that particular thing.
25:56 That's weird.
25:57 But wait, if I don't have to worry about it, why have I been spending all my time and energy thinking about it?
26:01 Right. But I mean, I would say most languages, actually, you probably do have to worry about your numerical sizes.
26:09 Right. Anything that's sort of compiled and allocates things like that, you work with memory.
26:15 Look, it's like a statically type versus dynamically type language sort of thing.
26:20 Right. Do you want to have that extra safety? Do you want to know in advance how much memory you're going to use?
26:25 Or do you want it to be more expressive and flexible, but then potentially have problems if you don't think about it enough in advance?
26:35 Yeah, yeah, yeah.
26:36 Yeah, for sure.
26:38 All right.
26:38 So that's, I guess, one more thing before we move on to talk about Arrow.
26:43 You can ask Pandas how much data a data frame is consuming, right?
26:48 So the answer is, as always with me, yes and no.
26:52 You can ask it how much memory it's using, and it will give you an answer.
26:56 And sometimes that answer is even accurate.
26:58 And the problem is basically that it will tell you how much memory is being used in NumPy.
27:03 And so if you've got integers, if you've got floats, if you've got date times, it will be 100% accurate.
27:08 The moment that you have strings or other objects, but let's just concentrate on strings, NumPy does have strings, and they're terrible.
27:15 And so basically, Pandas is like, we're not going to use those.
27:17 we're going to use Python strings, and we'll just store a pointer in NumPy, a 64-bit pointer that points to the Python string, which means that if it calculates how much memory is being used by NumPy, it's showing you how big the pointer is, which is potentially, I mean, there is no connection.
27:34 There's no correlation between that and the size of the string.
27:37 All strings are eight bytes.
27:39 That's right.
27:40 That's right.
27:41 Well, we washed our hands of that one.
27:43 And so when you use df.info, use the info method on a data frame, it will report back. And then sometimes it'll put a plus after that number. And the plus means, hey, I've got some strings here in Python memory.
27:57 I'm going to just give you a fast answer, and I'm not going to go explore that. If you really want a real answer, tell me, basically, deep equals true. And then I'll go off and explore the Python memory. It'll take longer, but you'll get an accurate count. And that's surprising to a lot of people, because the index and the column names are also typically strings. And so the moment you have an index or column names assigned, it'll also give you that plus, and you can't depend on it.
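A minimal sketch of that check, assuming pandas is installed (the column values are made up):

```python
import pandas as pd

df = pd.DataFrame({"state": ["Oregon", "Kansas"] * 500_000})

df.info()                     # fast estimate; the memory figure ends with a "+"
df.info(memory_usage="deep")  # walks the Python strings for an accurate count
```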
28:23 Yeah. Interesting. And you know, Python objects have the same issue. If you ask in Python, I can't remember exactly what the
28:30 size of.
28:32 Yeah, sys.getsizeof, that's it. Yeah. And that does the exact same thing. So if you've got a list, for example, or a dictionary, and you ask how big it is, it's like, well, it's basically how many pointers are stored in its structure that point out to the things. And for the memory class I did at Talk Python, I had to write some code that would basically traverse the object graph of all the structures. And like, no, this is actually how big it is. And this is why it's doing this in memory and so on. And yeah, it's not unique to NumPy, but it's,
29:02 yeah, it's just, you've got pointers. It's a lot more work to traverse them and figure that stuff all out. Okay. So there's a lot of energy around a specification, a library called Arrow from Apache, Apache Arrow. It's the universal columnar, where it always catches me up, columnar format and multi-language toolbox for fast data interchange and in-memory analytics. And so this is super interesting, this project. Let me go to its homepage.
29:32 But yeah, you have this for many different languages, right?
29:38 It's not just Python.
29:39 In fact, it's like NumPy in that way, and this one's written in C++.
29:42 Yeah, I mean, so think about it this way.
29:45 Like, I mean, I always sort of think about my evolution of seeing amazing stuff in programming languages.
29:50 So it used to be really amazing that you could get strings, right?
29:53 Back, let's get 30 years ago.
29:55 Wow, you don't think about like, you know, arrays of characters.
29:58 It's just a string.
29:59 Amazing.
29:59 Fast forward a number of years and I was amazed by dates and times.
30:02 And nowadays, like everyone wants to do data frames.
30:06 And so Wes McKinney, a bunch of other people, like he invented pandas, said, well, why don't we, instead of everyone inventing our own thing, why don't we create a backend data storage system that everyone can use that does all the data frame stuff?
30:19 Because we all want them in our languages.
30:21 And then we can make it really fast and universal and do lots of inputs and outputs and even have interchange among these different languages.
30:29 And so that's what Arrow is basically trying to do.
30:30 It's trying to be like the universal, super fast, super efficient data frame implementation.
30:36 So your pandas library just needs to be a layer on top of that, which might have some echoes of just being a layer on top of NumPy.
30:45 Yeah, it sounds similar.
30:46 It sounds familiar.
30:47 Yeah, very cool.
30:48 So let's talk about this columnar thing, columnar aspect of it.
30:54 So I guess pandas and NumPy operate on the concept of rows.
30:58 I've got rows of data and arrow is more about, I have columns of data that are, could somehow have row definitions into them.
31:07 And it lets you ask different questions more or less easy, right?
31:13 Depending on what you're trying to ask, like what is the average of the miles per hour?
31:18 Like, oh, well, that's just this thing.
31:20 I go right down the column and boom, here's the answer, you know?
31:23 Or you start asking that by rows and arrow, then it's got to do a lot of work.
31:28 to kind of piece that together and quite the opposite for pandas, right?
31:32 So I'm still digging into exactly like what's going on there.
31:36 What you said is I think true, but it's also true that pandas data frames, you can think of them, I think of them as like a dictionary of series where each column is actually a series.
31:46 So it's not really row by row, but the numply implementation is row by row.
31:50 So I think like something in the backend there is being translated differently, but it is 100% true that Arrow is just way faster doing analysis of a column um numpy might be faster at adding elements or like doing that sort of thing but the moment that you want to as you said like get the mean or get the min or the max or like sum them up or whatever arrow is just like blazingly fast because that's what it was very specifically designed to do yeah
32:15 You just, you preload the data. You've got to load it into some kind of data structure, right? And you can do that in ways that optimize some things at the cost of others. I think there's a
32:25 little bit of a similarity between relational databases and document NoSQL ones in
32:33 the sense like their data is structured in one way for really good operations, right? Like I want to go through this table really well and I want to maybe follow a relationship that's set up really well, but you're still computing all those things because they're in like different places in relational ones. And then like say a document database, if you know there's always this relationship you follow. You can just put them together and like kind of pre-compute them. But that makes other questions that don't follow that relationship, but like use the nested data super, not super hard, but much harder than it otherwise would be, right? So it's all about these trade-offs and how you store stuff. You know, what kind of questions are you going to ask it?
33:11 That's right. Arrow goes even further than that. Like, it does compression, because it says, well, if I've got all this stuff in the column and I see a lot of the same things, I'll compress it. Also, it has strings. So we don't have to, like, Arrow has its own implementation of strings in its binary format right there, which again, you know, we were talking a few minutes ago about how currently Pandas ignores NumPy strings and so it uses Python strings. And so Arrow offers the opportunity of having them right there in memory, nicely and efficiently.
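One concrete mechanism behind that kind of compression is Arrow's dictionary encoding; a minimal sketch, assuming pyarrow is installed (the values are made up):

```python
import pyarrow as pa

states = pa.array(["Oregon", "Oregon", "Kansas", "Oregon"])

# Each distinct string is stored once; rows become small integer indices
print(states.dictionary_encode())
```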
33:43 This portion of Talk Python is brought to you by Auth0. Do you struggle with authentication?
33:48 Sure, you can start with usernames and passwords, but what about single sign-on, social auth, integration with AI agents?
33:55 It can quickly become a major time sink, and rarely is authentication your core business.
34:01 It's just table stakes that you've got to get right before you can move on to building your actual product.
34:07 That's why you should consider Auth0.
34:09 Auth0 is an easy-to-implement, adaptable authentication and authorization platform.
34:14 Think easy user logins, social sign-on, multi-factor authentication, and robust role-based access control.
34:21 With over 30 different SDKs and quick starts, Auth0 scales with your product at every stage.
34:28 Auth0 lets you implement secure authentication and authorization for your preferred deployment environment.
34:34 You can use all of your favorite tools and frameworks, whether it's Flask, Django, FastAPI, or something else, to manage user logins, roles, and permissions.
34:43 Leave authentication to Auth0 so that you can start focusing on the features your users will love.
34:48 Auth0's latest innovation, Auth for GenAI, is now available in developer preview.
34:53 Secure your agentic apps and integrate with the Gen AI ecosystem using features like user authentication for AI agents, token vault, async authorization, and FGA for RAG.
35:05 So if you're a Python developer or data scientist looking for an easy and powerful way to secure your applications, Get started now with up to 25,000 monthly active users for free at talkpython.fm/Auth0.
35:18 That's talkpython.fm/Auth0.
35:21 The link is in your podcast player's show notes.
35:23 Thank you to Auth0 for supporting the show.
35:26 Now we have PyArrow.
35:29 What's the relationship between Arrow and PyArrow?
35:32 So that's actually simple to explain, which is PyArrow is just the Python client for Arrow.
35:38 So you want to use Arrow.
35:40 you're a Python developer, you do import PyArrow, and you now have these data structures available.
35:44 By the way, you can do that without pandas.
35:46 If you are like a pandas hater, or you just have no interest in using it, but you want really fast data storage, use PyArrow.
35:55 And there's nothing wrong with that.
35:57 I'll even say that my interest in PyArrow and pandas started a few years ago.
36:02 I saw a talk at a conference somewhere, and I was so incredibly confused.
36:06 I was like, okay, so there's PyArrow and there's pandas, And they say there's a relationship, but what is that relationship?
36:12 I have no idea.
36:13 And that's what like...
36:14 Because it's all NumPy.
36:15 NumPy has nothing to do with it.
36:16 What's going on here?
36:17 Right, right, right.
36:19 So you can use PyArrow and there's nothing wrong with it.
36:22 And it has a rich set of data types and all sorts of really amazing functionality.
36:26 And of course, it's super fast.
36:27 Yeah, somewhere in here, I was looking around for which...
36:31 There's a list that says, here's all the different languages that's supported on the Arrow project.
36:36 And yeah, it's the implementation status, I believe.
36:40 And so it says, well, what data types are supported per language?
36:44 So there's like a Java implementation and a C# implementation and a Julia and a Swift and a Nano and a Rust and so on.
36:52 And I'm looking through here and like, obviously, the C++ one has pretty much everything supported.
36:57 Whereas, say, the Java one doesn't do Decimal32 or Decimal64, but it does floats and the really big decimals,
37:06 you know, things like that, 128-bit and so on.
37:08 And I'm like, something is wrong.
37:09 Something is throwing me off here, because I know there's a real popular Python, and I don't... Python is not listed as a language.
37:16 So that's throwing me off.
37:17 Like, why is this?
37:19 So I'm like, oh, under the details, it says, unless otherwise stated, the Python, R, Ruby, and C/GLib libraries are following the C++ Arrow library, because there's like a really native tie to the original C++ version.
37:35 Isn't that interesting?
37:36 So I think it also means those languages, like those are all dynamic languages.
37:39 Well, not C/GLib, but like the dynamic languages there are, I think, these thin layers that just talk directly to the C++
37:45 implementation.
37:46 Yeah, exactly.
37:47 And so it's like, whatever that can do, we can do too.
37:51 Zoom.
37:52 Yeah, exactly.
37:54 Exactly.
37:54 So I think when you think about PyArrow, I feel like you almost should just think about the C++ layer.
38:01 Or if you hear features of Arrow, look at the C++ stuff because PyArrow is just, like you say, a very thin wrapper on top of that.
38:09 But at first when I look, it's like, what?
38:11 They're talking about C# and Java?
38:14 No Python in this?
38:15 I mean, surely there's enough data science in Python to warrant a checkbox or a check column.
38:20 They're like, we're so great, we don't even need a column.
38:23 Exactly.
38:24 Yeah.
38:25 We're the native column.
38:26 Anyway, I think that's really interesting.
38:28 So I do want to go, I think this little data types thing gives us a bit of a jumping off point for circling back a little bit.
38:35 Why did I bring up that genie joke, right?
38:39 Other than I really like it.
38:40 I think it's funny.
38:41 But we also have these D-type concepts down in the C++ layer, which is really no different in terms of data types than C, right?
38:50 There's still 4-bit or 8-bit numbers and so on, signed or unsigned.
38:54 and you have this here. But PyArrow, and more generally Arrow, deals with that differently, right? If you have overflows or missing numbers, it's not exactly the same as, you know, negative one or positive 4.2 billion or whatever it is, right?
39:09 I mean, so, yeah. Well, the whole missing data thing is a whole problem in and of itself. So, like, I mean, there's missing data in every data set we have, right? People forget to enter stuff, and sensors go dead, and networks are down, all sorts of stuff. So what do you do if the data is missing? Because you can't just have a blank space there. And so for many people, if they're new to this, their natural assumption is, oh, well, I'll just do, like, a minus one, or I'll use zero, and then it'll be great. And, like, you think about, well, what happens if the temperature sensors are dead? Okay, fine. So maybe we'll use minus 999. Well, wait, that's probably not so good either.
39:46 And so after like, it's been a number of years that people have realized, okay, we need a totally separate thing to indicate that data is missing. And so that's where NAN comes in, not a number, or in modern Python, it would be NA. But then you get into other issues of, well, wait, what type is NA or what type is NAN? And it turns out that NAN in traditional NumPy is a float.
40:09 And so if you have a bunch of strings and you want to say there's a missing value, oh, wait a second. So now we've got strings and we've got this float. Oh, no. And it just like goes downhill from there. And so one of the amazing things that Arrow did from the get-go was to say all these types, all these values we have are nullable, meaning that there is a specific value of nan or na or whatnot that fits with all these things. So you can have integers and na, you can have strings and na. You can even have the first row of that table you're showing there is null. It's kind of wild. But if your column contains only null values, then it will be defined to have a null D type.
40:45 And then it's just like, oh yeah, we got 10 nulls.
40:47 And then it's like almost zero storage.
40:50 And so PyArrow took this into account, and it means then that your data is, it's no less accurate, but it's also tighter, easier to work with, and more predictable.
41:01 Yeah, if you use a sentinel number or something for missing data, like you were saying, negative 999, that may or may not work, but you better not ask what the average temperature is.
41:11 It's just really cold there.
41:12 I thought Hawaii was nice. But no, it's cold.
41:16 That's right.
41:18 That's right.
41:19 Yeah, yeah.
41:20 Another interesting aspect of Arrow, C++ Arrow, PyArrow, same thing, is the copy-on-write aspect to save memory, right?
41:29 So maybe you've got a string that appears a lot of times like Kansas or Oregon or New Jersey or wherever, and you've got a million rows of those.
41:39 Do you need that string repeated a million times, right?
41:42 That's right.
41:43 That's right.
41:44 So it's much smarter about that sort of stuff.
41:46 Like, you know, it's always easier to design a software system second time around when you see where all the issues were.
41:52 And I think that they took a lot of the lessons from be it Pandas, be it R, be it Apache Spark, all these things.
41:58 They're like, okay, where are there inefficiencies for the program or where are there inefficiencies in the system?
42:03 And let's try to just like solve those problems as best as we can for the general public so they don't have to think about this.
42:10 And yeah, that's part of like, so it gets way, way faster, way, way smaller.
42:14 Yeah, we opened our pandas discussion.
42:16 We're talking about importing data from lots of different sources.
42:19 And it seems like Arrow might be slower because if it's doing compression, if it's doing deduping and all these types of things, it seems like it would be slower, but it's actually not.
42:31 Like loading CSVs is way faster and these types of things, right?
42:34 Oh my God, there's no comparison.
42:37 So loading CSVs, and this is one of those things where PyArrow, we'll get to this in a bit, but PyArrow will eventually, like, replace NumPy.
42:44 But even today, like when we're recording, you can with like very confidently use PyArrow to read in your CSV files in Pandas.
42:53 It won't change how it's stored.
42:54 It'll still be stored in NumPy.
42:56 I think, I'm not 100% sure, but I think that it does multi-threading and splitting up the file and all that stuff that we would sort of want it to do.
43:03 So it's like blazingly fast.
43:05 I'll even say like a few days ago, I was talking with people, I even put up a YouTube video about this I'm just so floored.
43:12 So reading in an Excel file, I always thought, okay, Excel's a binary format.
43:15 So I'll read it and it'll be nice and fast.
43:17 And it took over a minute for me to read it in Excel.
43:21 And then I tried it basically using one of the arrow binary formats that it has defined.
43:27 I guess we'll talk about it a little bit because I'm jumping on a bit.
43:30 And it was, I'm not exaggerating here, 2,000 times faster.
43:35 It was so ridiculously, ridiculously fast because it is like so optimized for doing like one job and just that one job.
43:46 Incredible. Yeah, you think Excel would be optimized for loading data, but I mean, Excel, the app is, but I'm pretty sure that the XLSX, whatever, like the format, I think that is a zip file that internally contains a probably namespace-laden XML document.
44:06 You are good. So someone, one of my subscribers to Bamboo Weekly, emailed me and he said, okay, I get it. You're an open source kind of guy. You're not up on all the Excel formats.
44:15 Let me explain to you. Just last night we had office hours and he went into it in more detail. So you're spot on. XLSX is a zip file.
44:24 You can actually unzip it and you can see all the XML files inside. And so that unzipping and that XML deserialization and so forth, that is where it's taking a ridiculously long time.
44:34 Right.
44:35 And apparently they didn't optimize for load speed; they optimized for other stuff.
44:39 And maybe that's the right choice for Excel.
44:41 But this is like coming back to like, okay, we need to fix some of these problems.
44:45 And what is the most common thing we do?
44:46 Let's optimize for that, right?
44:48 Kind of like columnar versus rows.
44:50 So that brings us to a couple of file formats that are pretty interesting.
44:55 Talk about Parquet first.
44:57 By the way, I will admit, I have no idea if you're supposed to say par-KAY or par-KET.
45:01 So I'll go with you and say Parquet.
45:03 I know that like the flooring is parquet, but whatever.
45:07 You know what?
45:08 I'm going to ask ChatGPT.
45:12 It's always good at pronouncing things.
45:13 So the basic idea is, okay, like the Arrow people came up with a great way of representing things efficiently in memory.
45:21 So they said, well, what about representing that on disk?
45:25 And they actually came up with two file formats because, you know, there are different tradeoffs we want to make.
45:31 And the Parquet format is, I was going to say, a sort of verbatim version of the in-memory data, but no, yeah, it's actually compressed.
45:41 It's taking the binary data that we have and compressing it.
45:44 What's the good news?
45:45 Takes very little space on disk.
45:46 The bad news is it takes a little bit of extra time to do the compression and decompression when you're saving and loading.
45:51 Feather is the same idea.
45:53 It just doesn't get compressed.
45:54 So it takes up more disk space, but it's faster to load and save.
45:58 In either one of these cases, you will be completely and utterly blown away by how fast they are.
46:03 And the fact that they are binary formats with exactly the same dtypes as you have in Arrow means there's no more guessing, no more playing around with CSVs and nudging them in the right direction.
46:15 There's no more of this really long loading with Excel that we were just talking about.
46:19 It just screamingly fast pulls it into memory with exactly the dtypes that you wanted.
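A short sketch of the round trip being described, with hypothetical file names and assuming PyArrow is installed:

```python
import pandas as pd

df = pd.read_csv("data.csv")

df.to_parquet("data.parquet")   # compressed: smallest on disk
df.to_feather("data.feather")   # Arrow on disk: very fast to save and load

# Loading back preserves the exact dtypes -- no guessing, no nudging.
df_pq = pd.read_parquet("data.parquet")
df_ft = pd.read_feather("data.feather")
```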
46:25 Yeah, super interesting there, I think.
46:27 Like, it's still, I see too many, because I deal with a lot of public data sets.
46:32 And I see overwhelmingly they're still using CSV and Arrow.
46:36 I'm sorry, CSV and Excel.
46:38 Here and there, here and there, I'm starting to see people make things available in parquet format and feather format.
46:43 So, like, it's making some inroads among the, like, data savvy.
46:47 Yeah, well, what I was going to ask is, what do you think about the workflow?
46:50 So, I'm going to work on a data science project.
46:53 I've got a 200 meg CSV file that takes forever to load.
46:58 maybe the first thing I do is convert it to one of these formats; probably I convert it to Parquet.
47:03 I'm going with that French-ish pronunciation as well.
47:07 I convert it to Parquet files.
47:09 And then from then on, my program just works with it.
47:12 Maybe even at the start of your notebook or start of your code, you say, what is the last change time of the Parquet file and the CSV?
47:19 And if the CSV is newer, then regenerate.
47:22 Some little guard like that.
47:23 But just keep your CSV file as part of your project.
47:27 but operationally swap it over to one of these new formats and just work with that.
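One way that guard might look, as a hedged sketch with hypothetical paths:

```python
from pathlib import Path
import pandas as pd

csv_path = Path("data.csv")
pq_path = Path("data.parquet")

# Keep the CSV as the source of truth; rebuild the Parquet
# cache only when the CSV has changed since the last conversion.
if not pq_path.exists() or csv_path.stat().st_mtime > pq_path.stat().st_mtime:
    pd.read_csv(csv_path).to_parquet(pq_path)

df = pd.read_parquet(pq_path)
```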
47:31 I would a hundred percent go in that direction.
47:33 It's like, you know, if you start using uv, you're like, oh my God, I can't believe it. I'm not going back. I wasted so many days of my life waiting for pip to do its thing.
47:43 And in the same way, when you start reading in files from Parquet, as opposed to CSV files or even Excel, you're like, oh my God, it just happens so fast that you can't even believe it.
47:56 Again, like in my YouTube video, I show, I used time, like, no, I didn't use timeit.
48:00 I just ran it, loaded the Excel file once, and it took, again, a minute 20 seconds.
48:05 And then I actually used %timeit in Jupyter to load it from Parquet format.
48:10 And it was very happy to do a whole bunch of different loops and still ended up way, way, way faster because it was just so ridiculously fast.
48:17 Yeah.
48:17 Super interesting.
48:18 And I think this is a big opportunity here for people to really, you could probably even, for a sufficiently large project, maybe you're not even wanting to use Arrow, but you could still probably load up a data frame in PyArrow and then you can call to_pandas or something like that on it, right?
48:36 Right, although, I mean, you could, you definitely could.
48:38 And that's how I was like sort of first introduced to Arrow.
48:41 It's like, that's a gentle introduction.
48:43 Right, like, right. It was like, well, here's Arrow and here's pandas. And look, you can convert between the two. But I mean, you can. And maybe there are a lot of people doing that.
48:53 I just feel like, you know what, if I'm going to use Arrow, if I'm going to use PyArrow now, I'm just going to do it like directly inside of Pandas, inside my data frame and get like the best of both worlds.
49:03 Right. Super interesting.
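For reference, the conversion path being described looks roughly like this (file name hypothetical):

```python
import pyarrow.csv as pv

table = pv.read_csv("data.csv")  # a pyarrow.Table
df = table.to_pandas()           # hand off to Pandas (NumPy-backed by default)
```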
49:04 So that's one of the big aspects or areas you focus on in your upcoming PyCon talk, is that increasingly there's a way to say, I want to use Pandas, but, Pandas, don't use NumPy as your underlying storage engine.
49:19 Use Arrow instead.
49:20 That's right. That's right.
49:21 So at some point, and it's not clear when, it's like they're going to make the switch where PyArrow will be the default storage and NumPy will be like an optional way to do it.
49:32 Right now, it's the opposite of that.
49:35 Right now, you can specify when you do a read_csv or read_excel or whatever, you can say dtype_backend equals pyarrow.
49:43 And then they have in like big bold letters, this is experimental, do not use in production, here be dragons, that sort of thing.
49:51 But if you do do it, if you're willing to experiment, then the dtypes you see are not NumPy dtypes.
49:59 They are PyArrow dtypes.
50:00 And you can see the difference very clearly because it won't just say int64.
50:03 It'll say int64, square brackets, pyarrow.
50:05 So it's very obvious to your eyes when you look at the dtypes.
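A minimal sketch of opting in, assuming pandas 2.x and a hypothetical file:

```python
import pandas as pd

# Experimental at the time of this conversation -- "here be dragons."
df = pd.read_csv("data.csv", dtype_backend="pyarrow")
print(df.dtypes)  # e.g. int64[pyarrow], double[pyarrow], string[pyarrow]
```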
50:09 And it is blazingly, blazingly fast at anything you want to do on a column.
50:15 So you want to do mean.
50:16 You want to do max.
50:18 You want to, even when you do group-bys, I have in my talk a whole bunch of graphs that I do.
50:25 And there are a few graphs where the bar for NumPy Pandas and the bar for PyArrow Pandas, you only see one bar, because the PyArrow data frame was so fast that it's just basically
50:37 Might as well be zero. Right.
50:40 So I wouldn't say people should run out and put this in production just yet, but with every passing month or two, it's getting better and faster and more stable.
50:48 And this is definitely the direction in which we're going.
50:50 Very interesting.
50:51 And so when you make that recommendation, like stable versus non-stable production, not production, I feel like that probably is a statement on Pandas plus PyArrow integrated, not a statement on Arrow itself.
51:04 That's exactly right.
51:05 That's right.
51:06 The core developers are still like cautioning us because they're still like issues.
51:11 And I don't even know, when I first started using PyArrow inside Pandas, I guess about two years ago, I tried to do a bunch of string methods and it said, hey, this string method is not even implemented.
51:20 And now, as far as I can tell, all the string methods are.
51:23 But there are all sorts of holes that I have not encountered that I'm sure exist.
51:28 And there's also one big sort of downside of using PyArrow, which is if you try to retrieve things by row.
51:36 So if you're doing like .iloc to retrieve by row location, it is way slower than NumPy.
51:44 Because suddenly it's like, oh, wait, you want to retrieve by row?
51:47 Oh, we're not so good at that.
51:49 Hold your horses.
51:50 Now, how often do you do that?
51:52 Maybe not that often.
51:53 Maybe it's not that big of a deal breaker.
51:56 But you do need to take that into consideration.
51:58 It's not a 100% win.
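The tradeoff in one hedged sketch (the column name and file are hypothetical; "typically" because results vary by workload and version):

```python
import pandas as pd

df_np = pd.read_csv("data.csv")
df_pa = pd.read_csv("data.csv", dtype_backend="pyarrow")

df_pa["temp"].mean()   # column-wise work: PyArrow typically much faster
df_np.iloc[123_456]    # row retrieval: NumPy typically faster
```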
52:00 Also, you can convert from the NumPy version to the PyArrow version, right?
52:05 Yeah, yeah.
52:06 So there are two different things there.
52:08 So one is if you have like a data frame.
52:12 And so you can always use the astype method to take a series and get a new series back from it with a new dtype.
52:19 So if I have int64 and I want to make it int32 or vice versa, I say .astype, the destination dtype, and I get back a series.
52:24 I can assign it back to that original column and it'll work just fine.
52:28 So instead I can say .astype int64 pyarrow and then assign it back, and you can mix and match the dtypes.
52:34 So you can have a data frame in which some dtypes are PyArrow, some dtypes are NumPy.
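A sketch of that mix-and-match conversion:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Convert one column to a PyArrow dtype and assign it back;
# the rest of the frame stays NumPy-backed.
df["a"] = df["a"].astype("int64[pyarrow]")
print(df.dtypes)  # a: int64[pyarrow], b: float64
```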
52:39 Now, I just discovered literally in the last few days in preparation for updating my talk that there is a pandas option.
52:48 Let's see, what is it?
52:49 It is, I wrote this down here, future.infer_string.
52:53 So if you set future.infer_string to true, and then you load a CSV, all of your strings will be PyArrow strings, as opposed to Python strings or NumPy strings.
53:05 And they are marked as, this has got to be like someone came up with this, pyarrow_numpy.
53:09 Now, what does that mean? It means that it's stored in PyArrow, but it uses some sort of NumPy API accessor so that Pandas doesn't freak out, something like that. But it still uses the PyArrow storage. So you're not going out to Python memory. It uses dramatically less memory than before and it's dramatically faster. And that seems like an in-betweeny step that people might want to adopt if they have a lot of string data.
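A hedged sketch of the option as described (the exact dtype label has varied across pandas releases):

```python
import pandas as pd

# Opt in before loading.
pd.set_option("future.infer_string", True)

df = pd.read_csv("data.csv")
print(df.dtypes)  # string columns now use PyArrow-backed storage
```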
53:36 And that is an interesting step. When you uv pip install pandas, do you have to also include pyarrow in order to get these features, or does it come along?
53:50 So the official statement that I've seen is that Pandas 3 will, and I don't think there's a release date for that, Pandas 3 will require PyArrow as a dependency.
54:01 So even though they're not going to change the default, it'll still default to using NumPy, we'll have to have it around.
54:07 I don't believe that it's automatically installed when you install pandas now.
54:11 So I believe that you have to pip install both of them.
54:14 It would probably raise an exception if you said the dtype was string bracket pyarrow, but you didn't have PyArrow installed.
54:21 Yes, yes, for sure.
54:23 And I haven't done a lot of investigation into this, but it seems that PyArrow has a lot of rich data types.
54:29 And it even has like lists and structs.
54:32 And it seems that Pandas now has, just as it has .str and .dt to get to strings and datetimes, it has a .list and a .struct.
54:40 I've literally like written that down as something to investigate before I like do my talk next month.
54:46 But it seems that they're trying to expose these complex arrow data structures from within pandas as well.
54:52 How many people are really gonna use it?
54:53 I'm not sure, but it seems kind of interesting.
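A small sketch of the accessors being described, assuming a recent pandas with ArrowDtype support:

```python
import pandas as pd
import pyarrow as pa

# A list-typed column, queried through the .list accessor.
s = pd.Series([[1, 2], [3], None], dtype=pd.ArrowDtype(pa.list_(pa.int64())))
print(s.list.len())  # per-row lengths: 2, 1, <NA>
```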
54:55 It's gonna be interesting to see what the row-based operation performance, what happens to that, you know?
55:01 I'm just thinking, is it almost at the point currently, if it's slow enough, that if you know you're about to ask a whole bunch of row-oriented questions, you convert it to a NumPy-based data frame?
55:14 Then ask a bunch of questions and then like throw that away and carry on?
55:17 Or I don't know.
55:18 No, like, you know, I've been thinking about, well, OK, how many row operations do I really do?
55:24 And it turns out not to be that many.
55:25 Like, I think they're making the right call here.
55:28 Oh, of course.
55:28 It can't be. I just don't think that they're going to leave it this slow. And I can't remember exactly what it was, but when I started playing with PyArrow, I remember, I think it was grouping, or maybe joining, one of those two was really, really slow. And I was in touch with one of the core developers, and they were like, don't worry, we know, we're working on it; that's why it's still not ready for prime time. And it's definitely improved a ton since then. So there are definitely people working hard on this stuff.
55:55 Yeah, there's probably some data structure they can compute at load time that allows more efficient iteration of row-oriented data, if it turns out to be a problem. Maybe, I don't know, maybe you set a flag, like, you know, include optimizations for rows or whatever, and so it does a little extra work to pre-compute, like, let me ask questions of that data structure, and then that maps into the real columnar structure, or whatever. I don't know, who knows. It'll be interesting to see where it goes, though.
56:20 Yeah, for sure, for sure. So once you have columnar data, or you just have PyArrow underneath in general, it leads into the possibility of more direct interaction with other libraries. I'm thinking of things like DuckDB, right? DuckDB is really focused on analytics more than rows; kind of like SQLite versus DuckDB
56:42 is the same thing as, you know, Pandas versus PyArrow. What do you think about the interop there?
56:49 Is that making differences?
56:51 I haven't played with DuckDB.
56:52 What are your thoughts?
56:53 So first of all, I played with DuckDB and it is just like astonishingly fast.
56:56 Like it amazes me that something that queries Pandas data frames can be faster than Pandas itself.
57:04 Right, you think, how could it possibly outrun the thing that is its foundation, right?
57:10 Like how could that be?
57:11 And yet it is.
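The pattern being marveled at, in a minimal sketch (data invented for illustration):

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"state": ["KS", "OR", "KS"], "temp": [10.0, 14.0, 12.0]})

# DuckDB sees the local DataFrame by name and queries it with SQL.
result = duckdb.sql("SELECT state, avg(temp) FROM df GROUP BY state").df()
print(result)
```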
57:12 So I increasingly see it this way, that pandas, as much as people love to hate it, and they say, oh, it's got this problem and that problem and so on and so forth, it's becoming, as much as it is a package, it's becoming like a pluggable infrastructure that you'll be able to have different backend storage facilities like NumPy, PyArrow, and then those will talk to databases and so forth.
57:35 And then the query structure is also looking pluggable in some ways, whether it's DuckDB or FireDucks, or like, who knows, people will come up with more stuff.
57:45 And so you'll be able to sort of use Pandas without using Pandas almost.
57:50 And like choose your weapon.
57:54 I don't know.
57:54 I don't know where this is heading.
57:55 But I think it just cements Pandas as not just the default we're stuck with.
58:01 But it's like the sort of meeting place for all these data manipulation libraries in the Python world.
58:07 Yeah.
58:08 Yeah.
58:08 Very interesting.
58:09 You know, the other big contender, I suppose, is probably Polars, right?
58:14 Right.
58:14 for solving these types of problems and so on.
58:17 And I believe it's also based on Arrow, right?
58:20 I believe so, right.
58:21 And so a lot of its speed.
58:23 I mean, look, I have only the most positive things to say about the developers and the people working on it and people using it.
58:29 It is indeed astonishingly fast.
58:32 And I think that's partly due to Arrow and partly due to very hard work by Ritchie and so forth.
58:37 I just don't see it like sort of pushing pandas aside simply because it's too entrenched.
58:43 I don't know if you remember, again, I'm dating myself, but years ago the Lisp people were furious that C was the main language, and there was this famous article called "Worse is Better" that basically said, how can it be that Lisp is not the number one language when we all know it's fantastic? How can this terrible language C be taking over the world? And the answer was, well, it's everywhere, and they've made a good run of getting it everywhere, so tough luck. And I think in some ways,
59:17 even if Polars is better, Pandas is there and people are using it, and you go try telling all these banks, nah, we're going to throw out all the Pandas work we've done in the last few years and put in Polars. Just not going to happen.
59:30 No, it's not going to happen. I do think there are interesting libraries, like, I interviewed Marco from Narwhals, which is like an interoperability story between those two.
59:41 I've heard about it.
59:42 I've heard you talk about it.
59:43 I've played with it a tiny, tiny bit, but not enough to really have a real opinion.
59:49 But as far as I'm concerned, anything that does interoperability, like fantastic.
59:53 It's pretty interesting in that it knows if you pass it a Pandas data frame or a Polars data frame, and then it kind of adapts what it does to allow you to operate on either kind with the same operations, which is pretty interesting,
01:00:08 but you do have to use the Polars API.
01:00:09 So that's something there, I suppose.
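A hedged sketch of the Narwhals pattern just described (column names hypothetical): you write against the Polars-style API once, and callers can pass either kind of frame.

```python
import narwhals as nw

def mean_temp_by_state(native_df):
    # Works whether native_df is a Pandas or a Polars DataFrame.
    df = nw.from_native(native_df)
    out = df.group_by("state").agg(nw.col("temp").mean())
    return out.to_native()
```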
01:00:12 Yeah, and I think this PyArrow change that's coming along, it's going to be powerful, right?
01:00:18 Certainly the speed is going to be well appreciated.
01:00:21 The ability to load larger amounts of data rather than duplicating a bunch of strings.
01:00:27 It's great.
01:00:28 But what do you see as the pitfalls or the challenges?
01:00:31 We're getting short on time here.
01:00:32 Maybe we could wrap it up with both a statement of encouragement and steps to take, but also maybe warnings to be looking out for?
01:00:41 I don't think I have too many warnings.
01:00:42 Like so far, I think the Pandas core developers have been very cautious and slow.
01:00:49 Probably some people would argue too slow, but I think it's good.
01:00:52 Like this is people's data.
01:00:53 This is like a serious thing.
01:00:56 Take it slowly.
01:00:57 Be careful.
01:00:58 Make sure everything is really working the right way and working quickly.
01:01:01 But I think like it's very encouraging.
01:01:04 And I would say if you're using pandas right now, it's worth doing like taking a little detour for a little bit of time.
01:01:11 Try out PyArrow. Try out these other features.
01:01:13 At the very least, you should certainly be using PyArrow to be loading your CSVs.
01:01:18 And you should even try out this loading of strings that I just discovered recently.
01:01:23 I think just those things alone might speed up your pipeline, give you faster iterations, and make you feel better about it.
01:01:33 And just be ready at some point, right?
01:01:36 At some point in the next few years, I don't know exactly when, they're going to flip that switch and say Pyro is now the default.
01:01:41 And you will be able to, I find it impossible to believe that they're going to say, and we're chucking NumPy.
01:01:45 That's not going to happen.
01:01:46 But you will need to say explicitly, I want to stick with it.
01:01:49 And some people, I think a lot of people are going to find it advantageous to make that change along with Pandas.
01:01:54 It's going to be exciting.
01:01:55 It is going to be exciting.
01:01:57 So one area maybe I could ask you about is reproducibility.
01:02:01 That matters for businesses.
01:02:03 Like, you want to go, well, we ran this report and we made this important decision to spend a billion dollars on this thing based on this analysis. Is it still good? Did we make a mistake? But certainly in the sciences, right, people build upon papers and theories as if they are perfectly solid building blocks, and if those things were to have trouble, that would be a real big problem. You want to be able to rerun your code 10, 15 years later. Changes like this could make it, not tomorrow or the next day, but eventually you could see it drifting far enough where it's like, oh, we're kind of done with NumPy and we're moving on to this thing.
01:02:37 And eventually it might be tricky to get exact reproducibility.
01:02:42 Right, it's sort of like, I remember I saw a talk about porting, if I remember correctly, like NumPy to Wasm.
01:02:50 And they were like, did you know that NumPy requires Fortran?
01:02:54 And so we had to, like, I think it was NumPy.
01:02:56 Like there was some part of this whole, like, the PyData stack, and none of us would have expected this because we're all like, Fortran, right?
01:03:03 Who uses that?
01:03:03 But it turns out, right, people use these things.
01:03:05 So people are going to have to take this into account.
01:03:07 I think NumPy will still be around.
01:03:10 Look, it's still a very actively used package.
01:03:12 It's just not a good match for a lot of things that Pandas is doing.
01:03:17 So you might need to, I don't know, put in your package specification what versions you want, that you do want NumPy to be included.
01:03:24 Like it might be a little harder in the future.
01:03:26 I don't think, like there's enough of an installed base.
01:03:29 I don't think they're going to just like throw people to the wolves.
01:03:32 I think it's going to be, it's not going to be a Python two to three situation.
01:03:36 I think enough of us have emotional scarring that it's not going to happen.
01:03:44 Yeah, I agree.
01:03:45 I don't think it will happen.
01:03:46 I'm just thinking, you know, over the long term, you can see sort of a slight eroding to the point where maybe, I mean, do we really think about running the same code 20 years later?
01:03:57 Sometimes, but not that often.
01:03:58 I mean, Python's only 30 years old.
01:04:02 NumPy's only 20, right?
01:04:03 That's double its life, right?
01:04:05 That's a long ways out.
01:04:06 Pandas is less old.
01:04:08 Right, right.
01:04:09 I'm not too worried about that, but someone somewhere is going to get the short end of the stick a number of years from now, and that's okay.
01:04:17 That's what their grad students are for.
01:04:20 Rewrite it.
01:04:21 No, more seriously, maybe pin your versions, right?
01:04:24 If you're doing any sort of reproducibility, definitely pin your versions, but maybe even, you know, download some wheels and just hang on to some wheels for
01:04:33 Linux or do a Docker sort of thing or something like that.
01:04:35 Who knows?
01:04:36 That's right.
01:04:37 That's right.
01:04:37 All right.
01:04:38 And all these problems are obviously a sign of it being so successful, right?
01:04:41 Pandas being so successful.
01:04:43 Oh, for sure.
01:04:44 What was it?
01:04:44 Like the numbers are just astonishing.
01:04:46 I think the last estimates were that there are between 5 and 10 million people using Pandas nowadays.
01:04:51 And let's assume that's like off by a factor of 10.
01:04:55 It's still an astonishing number.
01:04:56 It is astonishing.
01:04:57 It's amazing. Well, we're going to be at PyCon. I got to book some stuff.
01:05:04 In like five weeks from the time of recording, even less time from the time of release, maybe two weeks. Tell people about your talk. They can come see your dive into this, which I think will be fairly different. We didn't just go right down the slides of your talk or anything like that. So
01:05:18 there's a lot to learn from going to your talk.
01:05:20 Yeah. The talk is much more like code oriented. Like here are like, here's how it looks. Here's how it works. Here's the like speed comparison. Here's where it's better. Here's where it's worse. so yeah.
01:05:31 We never even told people, I haven't told people about the title.
01:05:34 Oh yes. So it's called The PyArrow Revolution in Pandas.
01:05:39 Yeah. So it's going to be Friday morning. I think I'm telling the truth there. And get
01:05:44 people while they're fresh.
01:05:47 Yeah, exactly. I will not be standing between them and lunch, which has often been the case in previous talks, and strangely you don't get a lot of questions then.
01:05:56 That's interesting, I wonder how that works. No, you don't want that, and you don't want the last talk of the day or the last talk of the conference. But I mean, it's still good, people still appreciate it, it's just the reality of travel and airplanes and hunger and all these things. So really good. I encourage people to go check out your talk, and it should be fun. It should probably be up on YouTube. I don't know what the time frame this year for talks being converted to YouTube videos will be, but eventually.
01:06:20 Yeah, usually it's like two months or so after the conference. Yeah, something like that.
01:06:24 I'm pretty confident.
01:06:25 Yeah, absolutely.
01:06:26 Indeed.
01:06:27 Reuven, always great to catch up with you.
01:06:29 Thanks for being on the show. My great pleasure. I'll see you in a moment.
01:06:32 Yep. Bye.
01:06:33 This has been another episode of Talk Python to Me. Thank you to our sponsors. Be sure to check out what they're offering. It really helps support the show.
01:06:41 This episode is brought to you by NordLayer. NordLayer is a toggle-ready network security platform built for modern businesses. It combines VPN, access control, and threat protection in one easy-to-use platform. Visit talkpython.fm/nordlayer and remember to use the code talkpython-10. And it's brought to you by Auth0. Auth0 is an easy-to-implement, adaptable authentication and authorization platform. Think easy user logins, social sign-on, multi-factor authentication, and robust role-based access control. With over 30 SDKs and quick starts, Auth0 scales with your product at every stage.
01:07:19 Get 25,000 monthly active users for free at talkpython.fm/auth0.
01:07:25 Want to level up your Python?
01:07:26 We have one of the largest catalogs of Python video courses over at Talk Python.
01:07:30 Our content ranges from true beginners to deeply advanced topics like memory and async.
01:07:36 And best of all, there's not a subscription in sight.
01:07:38 Check it out for yourself at training.talkpython.fm.
01:07:41 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.
01:07:46 We should be right at the top.
01:07:47 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.
01:07:57 We're live streaming most of our recordings these days.
01:07:59 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.
01:08:08 This is your host, Michael Kennedy.
01:08:10 Thanks so much for listening.
01:08:11 I really appreciate it.
01:08:12 Now get out there and write some Python code.