#462: Pandas and Beyond with Wes McKinney Transcript

Recorded on Thursday, Apr 11, 2024.

00:00 This episode dives into some of the most important data science libraries from the Python space

00:05 with one of its pioneers, Wes McKinney.

00:08 He's the creator or co-creator of the pandas, Apache Arrow, and Ibis projects, as well as

00:13 an entrepreneur in this space.

00:16 This is Talk Python to Me, episode 462, recorded April 11th, 2024.

00:21 Are you ready for your host? Here he is!

00:25 You're listening to Michael Kennedy on Talk Python to Me.

00:28 Live from Portland, Oregon, and this segment was made with Python.

00:33 Welcome to Talk Python to Me, a weekly podcast on Python.

00:38 This is your host, Michael Kennedy.

00:40 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,

00:45 both on fosstodon.org.

00:48 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

00:53 We've started streaming most of our episodes live on YouTube.

00:56 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and

01:02 be part of that episode.

01:04 This episode is sponsored by Neo4j.

01:07 It's time to stop asking relational databases to do more than they were made for and simplify

01:12 complex data models with graphs.

01:15 Check out the sample FastAPI project and see what Neo4j, a native graph database, can do

01:21 for you.

01:22 Find out more at talkpython.fm/neo4j.

01:24 And it's brought to you by Mailtrap, an email delivery platform that developers love.

01:32 Try for free at mailtrap.io.

01:34 Hey, Wes, welcome to Talk Python to Me.

01:37 Thanks for having me.

01:38 You know, honestly, I feel like it's been a long time coming having you on the show.

01:42 You've had such a big impact in the Python space, especially the data science side of

01:46 that space, and it's high time to have you on the show.

01:48 So welcome.

01:49 Good to have you.

01:50 Yeah, it's great to be here.

01:51 I've been heads down a lot the last, you know, several years.

01:56 And I actually haven't been, because I think a lot of my work has been more like data infrastructure

02:01 and working at an even lower level than Python.

02:05 So I haven't been engaging as much directly with the Python community.

02:10 But it's been great to kind of get back more involved and start catching up on all the

02:15 things that people have been building.

02:17 And being at Posit gives me the ability to, yeah, sort of have more exposure to what's

02:21 going on and people that are using Python in the real world.

02:24 There's a ton of stuff going on at Posit that's super interesting.

02:27 And we'll talk about some of that.

02:28 And, you know, sometimes it's just really fun to build, you know, and work with people

02:32 building things.

02:33 And I'm sure you're enjoying that aspect of it.

02:35 For sure.

02:36 Nice.

02:37 Well, before we dive into pandas and all the things that you've been working on after that,

02:41 you know, let's just hear a quick bit about yourself for folks who don't know you.

02:45 Sure.

02:46 Yeah.

02:47 So I grew up in Akron, Ohio, mostly.

02:49 I started getting involved in Python development around 2007, 2008.

02:55 I was working in quant finance at the time, and I started building a personal

03:00 data analysis toolkit that turned into the pandas project.

03:04 I open sourced that in 2009 and started getting involved in the Python community.

03:08 And I spent several years like writing my book, Python for Data Analysis, and then working

03:14 with the broader scientific Python, Python data science community to help enable Python

03:19 to become a mainstream programming language for doing data analysis and data science.

03:24 In the meantime, I've become an entrepreneur, I've started some companies and I've been

03:30 working to innovate and improve the computing infrastructure that powers data science tools

03:36 and libraries like pandas.

03:38 So that's led to some other projects like Apache Arrow and IBIS and some other things.

03:44 In recent years, I've worked on a startup, Voltron Data, which is still very much going

03:50 strong and has a big team and is off to the races.

03:54 And I've had a long relationship with Posit, formerly RStudio.

03:58 And they were my home for doing Arrow development from 2018 to 2020.

04:04 They helped me incubate the startup that became Voltron Data.

04:08 And so I've gone back to work full time there as a software architect to help them with

04:14 their Python strategy to make sort of their data science platform a delight to use for

04:19 the Python user base.

04:21 I'm pretty impressed with what they're doing.

04:22 I didn't realize the connection between Voltron and Posit, but I have had Joe Cheng on the

04:28 show before to talk about Shiny for Python.

04:32 And I've seen him demo a few really interesting things, how it integrates to notebooks these

04:37 days, some of the stuff that you all are doing.

04:40 And yeah, it's just it's fascinating.

04:41 Can you give people a quick elevator pitch on that while we're on that subject?

04:45 On Shiny or on Posit in general?

04:47 Yeah, whichever you feel like.

04:49 Yeah, so Posit started out 2009 as RStudio.

04:54 And so it didn't start out intending to be a company.

04:57 JJ Allaire and Joe Cheng built a new IDE, integrated development environment for R,

05:03 because what was available at the time wasn't great.

05:06 And so they made that into, I think, probably one of the best data science IDEs that's ever

05:11 been built.

05:12 It's really an amazing piece of tech.

05:13 So it started becoming a company with customers and revenue in the 2013 timeframe.

05:20 And they've built a whole suite of tools to support enterprise data science teams to make

05:24 open source data science work in the real world.

05:27 But the company itself, it's a certified B corporation, has no plans to go public or

05:30 IPO.

05:31 It is dedicated to the mission of open source software for data science and technical communication,

05:36 and is basically building itself to be a hundred year company that has a revenue generating

05:42 enterprise product side and an open source side so that the open source feeds the enterprise

05:49 part of the business.

05:50 The enterprise part of the business generates revenue to support the open source development.

05:53 And the goal is to be able to sustainably support the mission of open source data science

05:58 for, hopefully, the rest of our lives.

06:01 And it's an amazing company.

06:02 It's been one of the most successful companies that dedicates a large fraction of its engineering

06:06 time to open source software development.

06:09 So I'm very impressed with the company and JJ Allaire, its founder.

06:13 And I'm excited to be helping it grow and become a sustainable long-term fixture in

06:21 the ecosystem.

06:22 Yes.

06:23 Yeah, it's definitely doing cool stuff.

06:25 Incentives are aligned well, right?

06:27 It's not private equity or IPO.

06:31 Many people know JJ Allaire created ColdFusion, which is like the original dynamic web development

06:38 framework in the 1990s.

06:40 And so he and his brother, Jeremy, and some others built Allaire Corp to commercialize

06:45 ColdFusion.

06:46 And they built a successful software business that was acquired by Macromedia, which was

06:49 eventually acquired by Adobe.

06:51 But they did go public as Allaire Corp during the dot-com bubble.

06:55 And JJ went on to found a couple of other successful startups.

06:59 And so he found himself, 15 years ago, in his late 30s or around age 40, around the age

07:05 I am now, having been very successful as an entrepreneur, no need to make money, and looking

07:11 for a mission to spend the rest of his career on.

07:15 And identifying data science and statistical computing, in particular

07:21 making open source for data science work, was the mission that he aligned with and something

07:25 that he had been interested in earlier in his career.

07:28 But he had gotten busy with other things.

07:30 So it's really refreshing to work with people who are really mission focused and focused

07:34 on making impact in the world, creating great software, empowering people, increasing accessibility,

07:41 and making most of it available for free on the internet, and not being so focused on

07:45 empire building and producing great profits for venture investors and things like that.

07:50 So I think the goal of the company is to provide an amazing home for top-tier software developers.

07:58 To work on this software, to spend their careers, and to build families, and to be a happy and

08:05 healthy culture for working on this type of software.

08:08 That sounds excellent.

08:09 Very cool.

08:10 I didn't realize the history all the way back to ColdFusion.

08:12 Speaking of history, let's jump in.

08:14 Wes, there's a possibility that people out there listening don't know what Pandas is.

08:20 You would think it's pretty ubiquitous, and I certainly would say that it is, especially

08:24 in the data science space.

08:25 I got a bunch of listeners who listen and they say really surprising things.

08:29 They'll say stuff to me like, "Michael, I've been listening for six weeks now and I'm starting

08:33 to understand some of the stuff you all are talking about." I'm like, "Why did you listen for six weeks?

08:38 You didn't know what I was talking about.

08:40 That's crazy." And a lot of people use it as language immersion to get into the Python space.

08:45 So I'm sure there's plenty of people out there who are immersing themselves but are pretty

08:49 new.

08:50 So maybe for that crew, we could introduce what Pandas is to them.

08:54 Absolutely.

08:55 So this is the Data Manipulation and Analysis Toolkit for Python.

08:59 So it's a Python library that you install that enables you to read data files.

09:04 So read many different types of data files off of disk or off of remote storage, or read

09:09 data out of a database or some other remote data storage system.

09:13 This is tabular data.

09:14 So it's structured data like with columns.

09:16 You can think of it like a spreadsheet or some other tabular data set.

09:20 And then it provides you with this DataFrame object, pandas.DataFrame,

09:27 which is the main tabular data object.

09:29 And it has a ton of methods for accessing, slicing, grabbing subsets of the data, applying

09:36 functions on it that do filtering and subsetting and selection, as well as more analytical

09:42 operations like things that you might do with a database system or SQL.

09:46 So joins and lookups, as well as analytical functions like summary statistics, grouping

09:53 by some key and producing summary statistics.

09:57 So it's basically a Swiss Army knife for doing data manipulation, data cleaning, and supporting

10:03 the data analysis workflow.

10:05 But it doesn't actually include very much as far as actual statistics or models.

10:09 Or if you're doing something with LLMs or linear regression, or some type of machine

10:16 learning, you have to use another library.

10:18 But pandas is the on-ramp for all of the data into your environment in Python.

10:23 So when people are building some kind of application that touches data in Python, pandas is often

10:29 like the initial on-ramp for how data gets into Python, where you clean up the data,

10:34 you regularize it, you get it ready for analysis, and then you feed the clean data into the

10:41 downstream statistical library or data analysis library that you're using.
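To make that on-ramp concrete, here's a minimal sketch of the workflow Wes describes; the file name and column names are hypothetical:

```python
import pandas as pd

# Read a (hypothetical) CSV file into a DataFrame
df = pd.read_csv("sales.csv")

# Clean and regularize: parse dates, drop rows missing the amount
df["order_date"] = pd.to_datetime(df["order_date"])
df = df.dropna(subset=["amount"])

# Summarize: total amount per region, the kind of group-by
# aggregation you might otherwise do in SQL
totals = df.groupby("region")["amount"].sum()
print(totals)
```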

10:46 - That whole data wrangling side of things, right?

10:48 - Yeah, that's right, that's right.

10:50 And so, you know, in some history, Python had arrays like matrices and what we call

10:56 tensors now, multi-dimensional arrays going back all the way to 1995, which is pretty

11:02 early history for Python.

11:04 Like the Python programming language has only been around since like 1990 or 1991, if my

11:09 memory serves.

11:10 But what became NumPy in 2005, 2006, started out as Numeric in 1995, and it provided numerical

11:19 computing, multi-dimensional arrays, matrices, the kind of stuff that you might do in MATLAB,

11:24 but it was mainly focused on numerical computing and not with the type of business datasets

11:30 that you find in database systems, which contain a lot of strings or dates or non-numeric data.

11:36 And so my initial interest was I found Python to be a really productive programming language.

11:41 I really liked writing code in it, writing simple scripts, like, you know, doing random

11:45 things for my job.

11:47 But then you had this numerical computing library, NumPy, which enabled you to work

11:51 with large numeric arrays and large datasets with a single data type.

11:57 But working with this more tabular type data, stuff that you would do in Excel or stuff

12:00 that you do in a database, it wasn't very easy to do that with NumPy or it wasn't really

12:05 designed for that.

12:06 And so that's what led to building this, like, higher level library that deals with these

12:11 tabular datasets in the pandas library, which was originally built, you know,

12:16 with a really close relationship with NumPy.

12:19 So pandas itself was originally like a thin layer on top of NumPy.

12:24 This portion of Talk Python to Me is brought to you by Neo4j.

12:27 I have told you about Neo4j, the native graph database on previous AdSpots.

12:32 This time, I want to tell you about their relatively new podcast, Graph Stuff.

12:37 If you care about graph databases and modeling with graphs, you should definitely give it

12:41 a listen.

12:42 On their season finale last year, they talked about the intersection of LLMs and knowledge

12:47 graphs.

12:48 Remember when ChatGPT launched?

12:50 It felt like the LLM was a magical tool out of the toolbox.

12:54 However, the more you use it, the more you realize that's not the case.

12:57 The technology is brilliant, but it's prone to issues such as hallucinations.

13:03 But there's hope.

13:04 If you feed the LLM reliable current data, ground it in the right data and context, then

13:09 it can make the right connections and give the right answers.

13:12 On the episode, the team at Neo4j explores how to get the results by pairing LLMs with

13:18 knowledge graphs and vector search.

13:21 Check out their podcast episode on Graph Stuff.

13:24 They share tips for retrieval methods, prompt engineering, and more.

13:28 So just visit talkpython.fm/neo4j-graphstuff to listen to an episode.

13:35 That's talkpython.fm/neo4j-graphstuff.

13:39 The link is in your podcast player's show notes.

13:42 Thank you to Neo4j for supporting Talk Python to Me.

13:46 One thing I find interesting about Pandas is it's almost its own programming environment

13:52 these days in the sense that, you know, traditional Python, we do a lot of loops, we do a lot

13:59 of attribute dereferencing, function calling, and a lot of what happens in Pandas is more

14:05 functional.

14:07 It's more about applying operations.

14:09 It's almost like set operations, right?

14:11 And a lot of vector operations and so on.

14:14 Yeah, that was behavior that was inherited from NumPy.

14:17 So NumPy is very array oriented, vector oriented.

14:21 So rather than write a for loop, you would write an array expression, which would operate

14:26 on whole batches of data in a single function call, which is a lot faster because you can

14:32 drop down into C code and get good performance that way.

14:37 And so pandas adopted the NumPy way, the array expression or vector

14:43 operations.

14:44 And that's extended to the types of non-numeric data operations

14:49 that you can do in pandas, like vectorized set lookups, where you would say,

14:53 oh, I have this array of strings, and I have this

14:57 set of strings, and I want to compute a Boolean array, which says whether or not each

15:01 string is contained in this set of strings.

15:04 And so in pandas, that's the isin function.

15:06 So you would say column A isin some set of strings, and

15:12 that single function call produces a whole Boolean array that you can use for subsetting

15:17 later on.
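A minimal sketch of what Wes describes, with hypothetical column and value names:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Portland", "Nashville", "Akron", "Dublin"],
    "visits": [10, 7, 3, 5],
})

# isin produces a whole Boolean array in one vectorized call...
mask = df["city"].isin(["Portland", "Nashville"])

# ...which you can then use for subsetting, instead of a for loop
subset = df[mask]
print(subset)
```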

15:18 Yeah, there's a ton of things that are really interesting in there.

15:20 One of the challenges, maybe you could speak to this a little bit, then I want to come

15:23 back to your performance comment.

15:25 One of the challenges I think is that some of these operations are not super obvious

15:29 that they exist or that they're discoverable, right?

15:32 Like instead of just indexing into say a column, you can index on an expression that might

15:37 filter the rows or project the columns or things like that.
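For instance, a small sketch of that expression-based indexing (names are made up):

```python
import pandas as pd

df = pd.DataFrame({"region": ["east", "west", "east"],
                   "amount": [50, 120, 200]})

# Index on an expression: filter rows where amount is large,
# and project just the columns you want
big_orders = df.loc[df["amount"] > 100, ["region", "amount"]]
print(big_orders)
```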

15:40 How do you recommend people like kind of discover a little bigger breadth of what they can do?

15:45 There's plenty of great books written about Pandas.

15:48 So there's my book, Python for Data Analysis.

15:51 I think Matt Harrison has written an excellent book, Effective Pandas.

15:55 The pandas documentation, I think, provides really nitty-gritty detail about how all the

15:59 different things work.

16:01 But when I was writing this book, Python for Data Analysis, my goal was to, you know,

16:06 create a primer, like a tutorial on how to solve data problems with Pandas.

16:12 And so for that, you know, I had to introduce some basics of how NumPy works so people could

16:16 understand array oriented computing, basics of Python, so you know enough Python to be

16:21 able to understand what things that Pandas is doing.

16:25 It builds incrementally.

16:26 And so like, as you go through the book, the content gets more and more advanced.

16:30 You learn and master an initial set of techniques, and then you can start

16:35 learning about more advanced techniques.

16:37 So it's definitely a pedagogical resource.

16:41 And it is now, as you're showing there on the screen, freely available

16:44 on the internet.

16:46 So JJ Allaire helped me port the book to use Quarto, which is a new technical publishing

16:52 system for writing books and blogs and websites, you know, quarto.org.

16:57 And yes, that's how I was able to publish my book on the internet. Essentially,

17:03 you can use Quarto to write books using Jupyter notebooks, which is cool.

17:06 My book was written a long time ago in O'Reilly's DocBook XML.

17:10 So not particularly fun to edit.

17:12 But yeah, because Quarto is built on Pandoc, which is a sort of markup language transpilation

17:18 system.

17:19 So you can use Pandoc to convert documents from

17:23 one format to another.

17:25 And so that's the root framework that Quarto is built on:

17:30 starting with one document format and generating many different types of output formats.
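As a small illustration of that conversion workflow, a sketch that shells out to the pandoc CLI from Python (assuming pandoc is installed; file names are hypothetical):

```python
import subprocess

# Convert a Markdown source document to EPUB and to HTML;
# pandoc infers the formats from the file extensions
subprocess.run(["pandoc", "notes.md", "-o", "notes.epub"], check=True)
subprocess.run(["pandoc", "notes.md", "-o", "notes.html"], check=True)
```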

17:33 That's cool.

17:34 I didn't realize your book was available just to read on the internet.

17:38 If you navigate around.

17:40 In the third edition, I was able to negotiate with O'Reilly and add an

17:45 addendum, an amendment to my very old book contract from 2011, to let me release

17:52 the book for free on my website.

17:54 So yeah, it's just available there at wesmckinney.com/book.

17:58 I find that like a lot of people really like the print book.

18:02 And so I think that having the online book just available, like whenever you are somewhere

18:06 and you want to look something up is great.

18:08 Print books are hard to search.

18:09 Yeah, that's true.

18:10 That's true.

18:11 Yeah.

18:12 And if you go back to the book and just look at

18:14 the search bar, you know, just search for, like, group by, all one

18:18 word, and it comes up really fast.

18:21 You can go to that section.

18:23 And it's pretty cool.

18:25 I thought that releasing the book for free online would affect sales, but people

18:29 just really like having paper books, it seems, even in 2024.

18:33 Yeah, even digital books are nice.

18:34 You've got them with you all the time. I think it's about taking notes: where

18:38 do I put my highlights?

18:39 And how do I remember it?

18:40 And that's right.

18:41 Yeah, yeah, stuff like that.

18:42 This Quarto thing looks super interesting.

18:45 If you look at Pandoc, if people haven't looked at this before, the conversion matrix.

18:53 How would you describe this? Busy and complete?

18:57 What is this?

18:58 This is crazy.

18:59 It's very busy.

19:00 Yeah, it can convert from, looks like, about 30 or 40 input formats

19:05 to 50 or 60 output formats, maybe more than that,

19:10 just eyeballing it.

19:11 But yeah, it's pretty impressive.

19:12 And then if you took the combinatorial, like how many different ways could you combine

19:16 the 30 with the 50?

19:18 That's kind of what it looks like.

19:19 That's right.

19:20 It's truly amazing.

19:21 So you've got Markdown, you want to turn it into a PDF, or you've got a DokuWiki and

19:26 you want to turn it into an EPUB or whatever, right?

19:29 Or even like reveal.js, probably to PowerPoint, I would imagine.

19:33 I don't know.

19:34 Yeah.

19:36 Some history, like backstory, about Quarto.

19:38 So it helps to keep in mind that JJ created ColdFusion, which was

19:44 essentially an early publishing system for the internet, similar

19:48 to CGI and PHP and other dynamic web publishing systems.

19:53 And so early on at RStudio, they created R Markdown, which is basically

19:59 a set of extensions to Markdown that allow you to have code cells written in R, and then

20:05 eventually they added support for some other languages. It's kind of like a Jupyter

20:08 notebook in the sense that you could have some Markdown and some code and some plots

20:12 and output, and you would run the R Markdown renderer and it would

20:17 generate all the output and insert it into the document.

20:21 And so you could use that to write blogs and websites and everything.

20:24 But R Markdown was written in R, and so that limited it in a sense. It made it

20:29 harder to install because you would have to install R to use it.

20:32 And also, it had an association with R that perhaps was unmerited.

20:37 And in the meantime, you know, with everything that's happened with web technology,

20:42 it's now very easy to put a complete JavaScript engine in a small footprint, you know, on

20:47 a machine with no dependencies, and to be able to run

20:53 a system that's written in JavaScript.

20:55 And so Quarto is completely language agnostic. It's written in TypeScript, and uses Pandoc

21:01 as an underlying engine.

21:03 And it's very easy to install.

21:05 And so it addresses some of the portability and extensibility issues that were

21:09 present in R Markdown.

21:11 But as a result, I think the Posit team has more

21:16 than a decade, or if you include ColdFusion, more than 25 years,

21:20 of experience in building really developer-friendly technical publishing tools.

21:25 And so I think it's not data science, but it's something that is an important part

21:30 of the data science workflow, which is, how do you make your analysis and

21:35 your work available for consumption in different formats.

21:38 And so having this system that can publish outputs in many different

21:43 places is super valuable.

21:46 So a lot of people start out in Jupyter notebooks, but there's

21:49 many different possible input formats.

21:51 And so to be able to use the same source to publish to a website or to a Confluence

21:56 page or to a PDF is super valuable.

21:59 Yeah, it's super interesting.

22:00 Okay, so then I got to explore some more.

22:03 Let's go back to this for a minute.

22:05 Just how about some kind words from the audience for you?

22:08 Ravid says, Wes, your work has changed my life.

22:11 It's very, very nice.

22:12 I'm happy to hear it.

22:13 But yeah, yeah, I'm more than happy to talk.

22:16 Let's talk in depth about pandas.

22:18 And I think history of the project is interesting.

22:20 And I think also how the project has developed in the intervening 15, 16 years is pretty interesting

22:28 as well.

22:29 Yeah, let's talk derivatives for a minute.

22:30 So growth and speed of adoption and all those things.

22:34 When you first started working on this, and you first put it out, did you foresee a world

22:37 where this was so popular and so important?

22:42 Did you think, "Yeah, pretty soon black holes, I'm pretty sure I'll be part of that somehow"?

22:46 It was always the aspiration of making Python this mainstream language for statistical computing

22:52 and data analysis.

22:54 It didn't occur to me that it would become this popular or that it would

22:59 become one of the main tools that people use for working with data

23:04 in a business setting. If that was the aspiration, or if that

23:08 was, you know, what I needed to achieve to be satisfied, that would have been completely

23:12 unreasonable.

23:13 And in a certain sense, I don't know that its popularity

23:18 is deserved, or not deserved.

23:19 Like I think there's many other worthy efforts that have been created over the years,

23:23 really great work that others have done in this domain.

23:28 And so the fact that pandas caught on and became as popular as it is, I think it's

23:33 a combination of timing.

23:35 And you know, there was like a developer relations aspect that there was content available.

23:40 And like I wrote my book, and that made it easier for people to learn how to use the

23:43 project.

23:44 But also we had a serendipitous open source developer community that came

23:49 together and allowed the project to grow and expand really rapidly in the early

23:55 2010s.

23:56 And I definitely put a lot of work into recruiting people to work on the project and

24:01 encouraging others to work on it, because sometimes people create open source

24:04 projects, and then it's hard for others to get involved and get a seat at the

24:08 table, so to speak.

24:09 But I was very keen to bring on others and to give them responsibility and,

24:16 ultimately, hand over the reins of the project to others.

24:20 And I've spoken a lot about that over the years, how important it is

24:23 for open source project creators to make room for others in steering

24:28 and growing the project so that they can become owners of it as well.

24:32 It's tough to make space and tough to bring on folks.

24:36 Have you heard of the Djangonauts?

24:38 The Djangonauts?

24:39 I think it's djangonauts.space.

24:40 They have an awesome domain.

24:42 But it's basically kind of like a boot camp, but it's for taking people who just

24:46 like Django and turning them into actual contributors or core contributors.

24:50 What's your onboarding story for people who do want to participate?

24:53 I'm embarrassed to say that I don't have a comprehensive view of all of the

24:58 different community outreach channels that the pandas project has used to help grow

25:02 new contributors.

25:04 So one of the core team members, Marc Garcia, has done an amazing job organizing documentation

25:12 sprints and other like contributor sourcing events, essentially creating very friendly,

25:18 accessible events where people who are interested in getting involved in pandas can meet each

25:22 other and then assist each other in making their first pull request.

25:27 And it could be something as simple as, you know, making a small improvement to

25:31 the pandas documentation because it's such a large project.

25:34 The documentation is always something that could be better, you know, either adding

25:40 more examples or documenting things that aren't documented or, yeah,

25:44 just making the documentation better.

25:46 And so it's something that for new contributors is more accessible than working on the internals

25:52 of like one of the algorithms or something.

25:55 Working on some significant performance improvement might be a bit intimidating

26:00 if you've never worked on the pandas code base.

26:02 And it's a pretty large code base because it's been worked on continuously

26:05 for, you know, going on 20 years.

26:08 So yeah, it can take a while to really get to a place where you can be productive.

26:13 And that can be discouraging for new contributors, especially those who don't have a lot of open

26:18 source experience.

26:19 That's one of the ironies of these big projects: they're just so finely

26:24 polished.

26:25 So many people are using them.

26:27 Every edge case matters to somebody.

26:29 Right.

26:30 And so to become a contributor and make changes to that, it takes a while, I'm sure.

26:34 Yeah, yeah.

26:35 I mean, I think a big thing that helped is allowing people to get paid

26:39 to work on pandas, or to be able to contribute to pandas as a part of their job description,

26:46 like, you know, maybe part of their job is maintaining pandas.

26:50 So Anaconda was one of the earliest companies who had engineers

26:55 on staff, like Brock Mendel, Tom Augspurger, Jeff Reback, where part

27:01 of their job was maintaining and developing pandas.

27:03 And that was huge because prior to that, the project was purely based on volunteers.

27:09 I was a volunteer and everyone was working on the project as a passion

27:13 project in their free time.

27:15 And then Travis Oliphant, he and Peter Wang founded Anaconda.

27:21 Travis spun out from Anaconda to create Quansight and has continued to sponsor development in

27:26 and around pandas.

27:27 And that's enabled people like Marc to do these community building events and for

27:32 it to not be, you know, something that's totally uncompensated.

27:35 Yeah.

27:36 That's a lot of stuff going on.

27:37 And I think the interest is awesome, right?

27:40 I mean, it's just a different level of problems we could take on, you know:

27:45 you know what, I've got this entire week and it's my job to make this

27:48 work, rather than I've got two hours and can't really take on a huge project.

27:53 And so I'll work on the smaller improvements or whatever.

27:56 Yeah.

27:57 Many people know, but I haven't been involved day to day in pandas since 2013.

28:01 So that's getting on.

28:03 That's a lot of years.

28:03 I still talk to the pandas contributors.

28:06 We had a pandas core developer meetup here in Nashville pre-COVID.

28:10 I think it was in 2019, maybe.

28:13 So I'm still in active contact with the pandas developers, but it's been a different team

28:18 of people leading the project.

28:20 It's taken on a life of its own, which is, which is amazing.

28:23 That's exactly it, yeah.

28:24 As a project creator, that's exactly what you want: to not be beholden to the project

28:28 that you created and forced, you know, to be responsible for it

28:32 and take care of it for the rest of your life.

28:35 But if you look at the community, a lot of the most intensive community

28:39 development has happened since I moved on to work on other projects.

28:43 And so now the project has, I don't know the exact count, but thousands of contributors.

28:49 And so, you know, to have thousands of different unique individuals contributing to an open

28:53 source project is a big deal.

28:55 I don't know what it says on the bottom of GitHub; it says

29:00 3,200 contributors, but that's maybe not even the full story, because sometimes

29:07 people don't have their email address associated with their GitHub profile, and

29:11 there's how GitHub counts contributors.

29:13 I would say probably the true number is closer to 4,000.

29:16 That's a testament, you know, to the core team and all the outreach that they've

29:20 done, making the project accessible and easy to contribute to.

29:25 Because if you go and try to make a pull request to a project, there's many

29:29 different ways that you can fail.

29:31 Either the project has technical issues, like issues with the build system

29:36 or the developer tooling.

29:38 And so you struggle with the developer tooling.

29:40 And so if you aren't working on it every day and every night, you can't make heads or tails

29:43 of how the developer tools work.

29:45 But then there's also like the level of accessibility of the core development team.

29:50 If they aren't there to support you in getting involved in the project

29:54 and learning how it works and creating documentation about how to contribute and what's expected

29:59 of you, that can also be a source of frustration where people churn out of the

30:04 project, you know, because it's just too hard to find their sea legs.

30:09 And maybe also, you know, sometimes development teams are unfriendly or unhelpful, or

30:14 they make others feel like they're annoyed

30:18 with them or like they're wasting their time or something.

30:20 It's like, I don't want to look at this pull request and give you feedback

30:24 because, you know, I could do it more quickly by myself, or something.

30:28 Sometimes you see that in open source projects.

30:30 But they've created a very welcoming environment.

30:33 And yeah, I think the contribution numbers speak for themselves.

30:37 - They definitely do.

30:38 Maybe the last thing before we move on to the other stuff you're working on, but the

30:41 other interesting GitHub statistic here is the used by 1.6 million projects.

30:47 That's, I don't know if I've ever seen it used by that high.

30:50 There's probably some that are higher, but not many.

30:52 - Yeah, it's a lot of projects.

30:54 I think it's interesting.

30:55 I think like many projects, it's reached a point where it's an essential and assumed

30:59 part of many people's toolkit.

31:03 Like the first thing that they write at the top of a file that they're working

31:07 on is import pandas as pd or import numpy as np. I think in a sense,

31:12 one of the reasons why pandas has gotten so popular is that it is

31:16 beneficial to the Python community to have fewer solutions, you know,

31:21 kind of the Zen of Python: there should be one, and preferably only one, obvious

31:25 way to do it.

31:26 And so if there were 10 different pandas-like projects, that creates skill portability

31:32 problems, and it's just easier if everyone says, oh, pandas is the thing that

31:36 we use, and you change jobs and you can take all your skills, like how to use pandas, with

31:40 you.

31:41 And I think that's also one of the reasons why Python has become so successful in the

31:45 business world is because you can teach somebody even without a lot of programming experience,

31:50 how to use Python, how to use pandas and become productive doing basic work very, very quickly.

31:56 I remember back in the early 2010s, there were a lot of articles

32:02 and talks about how to address the data science shortage.

32:06 And I gave a talk at Web Summit in Dublin in, gosh, maybe 2017, I'd have

32:16 to look exactly.

32:17 But basically it was about the data scientist shortage.

32:21 And my thesis was always: we should make it easier to be a data scientist, lower

32:25 the bar for what sort of skills you have to master before you can do productive work

32:31 in a business setting.

32:32 And so I think the fact that there is just pandas, that's the one thing that

32:36 people have to learn how to use as their essential starting point for doing any data

32:41 work, has also led to this piling on of people being motivated to make this one thing

32:47 better, because you make improvements to pandas and they benefit millions of projects and

32:52 millions of people around the world.

32:53 And that's, yeah, so it's like a steady snowballing effect.

32:59 This portion of Talk Python to Me is brought to you by Mailtrap, an email delivery platform

33:04 that developers love.

33:06 An email sending solution with industry best analytics, SMTP and email API SDKs for major

33:13 programming languages and 24/7 human support.

33:17 Try for free at mailtrap.io.

33:22 I think doing data science is getting easier.

33:24 We've got a lot of interesting frameworks and tools.

33:27 Shiny for Python, one of them, right, that makes it easier to share and run your code,

33:32 you know?

33:33 Yeah, Shiny for Python, Streamlit, Dash, like these different interactive data application

33:38 publishing frameworks.

33:39 So you can go from a few lines of pandas code, loading some data and doing some analysis

33:45 and visualization to publishing that as an interactive website without having to know

33:51 how to use any web development frameworks or Node.js or anything like that.

33:56 And so to be able to get up and running and build a working interactive web application

34:01 that's powered by Python is, yeah, it's a game changer in terms of shortening end-to-end

34:07 development life cycles.
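As a hedged sketch of how small such an app can be, here is roughly what a minimal Shiny for Python app looks like (the input ID and labels are arbitrary):

```python
from shiny import App, render, ui

# UI: one slider input and one text output
app_ui = ui.page_fluid(
    ui.input_slider("n", "Number of points", 1, 100, 50),
    ui.output_text("summary"),
)

# Server: recompute the output whenever the input changes
def server(input, output, session):
    @output
    @render.text
    def summary():
        return f"You selected n = {input.n()}"

app = App(app_ui, server)
# Run with: shiny run app.py
```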

34:08 What do you think about JupyterLite and Pyodide and basically Jupyter-in-a-browser

34:17 type of things, WebAssembly and all that?

34:19 Yeah, so I'm definitely very excited about it, been following WebAssembly in general.

34:24 And so I guess some people listening will know about WebAssembly, but basically it's

34:28 a portable machine code that can be compiled and executed within your browser in a sandbox

34:35 environment.

34:36 So it protects against security issues and prevents the person who wrote

34:42 the WebAssembly code from doing something malicious on your machine, which is very important.

34:46 Won't necessarily stop them from like, you know, mining cryptocurrency while you have

34:49 the browser tab open.

34:50 That's a whole separate problem.

34:51 And it's enabled us to run the whole scientific Python stack, including Jupyter and NumPy

34:57 and Pandas totally in the browser without having a client and server and needing to

35:03 run a container someplace in the cloud.

35:05 And so I think in terms of creating application deployment, so like being able to deploy an

35:10 interactive data application like with Shiny, for example, without needing to have a server,

35:16 that's actually pretty amazing.

35:17 And so I think that simplifies things, opens up new use cases, new application architectures,

35:22 and makes things a lot easier, because setting up and running a server creates

35:27 brittleness, like it has cost.

35:29 And so if the browser is doubling as your server process, like that's, I think that's

35:34 really cool.

35:35 You also have like other projects like DuckDB, which is a high performance, embeddable analytic

35:41 SQL engine.

35:43 And so now with DuckDB compiled to Wasm, you can get a high performance database running

35:49 in your browser.

35:50 And so you can get low latency, interactive queries and interactive dashboards.

35:55 So yeah, WebAssembly has opened up this whole new world of

36:00 possibilities, and it's transformative, I think.

36:03 For Python in particular, you mentioned Pyodide, which is kind of a whole packaged stack.

36:08 So it's like a framework for building and packaging, basically building an application

36:14 and managing its dependencies.

36:16 So you could create a WebAssembly version of your application to be deployed like this.

36:21 But yeah, I think either the Pyodide main creator or a maintainer went

36:26 to Anaconda, and they created PyScript, which is another attempt to make it

36:29 even easier to use Python to create web applications,

36:36 interactive web applications.

36:37 There's so many cool things here, like in the R community, they have WebR, which is

36:41 similar to PyScript and Pyodide in some ways, like compiling the whole R stack to WebAssembly.

36:46 There was just an article I saw on Hacker News where they worked on figuring out how

36:50 to trick LLVM into compiling Fortran code, like legacy Fortran code, to WebAssembly.

36:57 Because when you're talking about all of this scientific computing stack, you need the linear

37:00 algebra and all of the 40 years of Fortran code that have been built to support scientific

37:06 applications.

37:07 And now you can compile that and run it in the browser.

37:09 So yeah, that's pretty wild to think of putting that in there, but very useful.

37:12 I didn't realize that you could use DuckDB as a WebAssembly component.

37:16 That's pretty cool.

37:17 Yeah, there's a company, I'm not an investor or plugging them or anything, but it's called

37:21 evidence.dev.

37:23 It's like a whole open source business intelligence application

37:28 that's powered by DuckDB.

37:30 And so if you have data that fits in the browser, you can have a whole interactive

37:34 dashboard, or be able to do business intelligence fully in the browser with

37:39 no need of a server.

37:41 It's yeah, it's very, very cool.

37:43 So I've been following DuckDB since the early days, and my

37:48 company Voltron Data became members of the DuckDB Foundation and actively

37:54 built a relationship with DuckDB Labs.

37:56 So we could help accelerate progress in this space, because I think the impact

38:02 is so immense, and it's hard to predict what people

38:07 are going to build with all this stuff.

38:09 And I guess going back, you know, 15 years to Python,

38:13 one of the reasons I became so passionate about building stuff for Python was,

38:19 I think the way that Peter Wang puts it, giving people superpowers.

38:24 So we want to enable people to build things with much less code and much less time.

38:29 And so by making things that much more accessible, that much easier to do. The

38:34 mantra in pandas was like, how do we make things one line of code?

38:37 Like, this must be easy.

38:39 One line of code, one line of code.

38:41 Make this as terse and simple and easy to do as possible so that

38:45 you can move on and focus on building the more interesting parts of your application

38:49 rather than struggling with how to read a CSV file or you know, how to do whichever

38:54 data munging technique that you need for your data set.

38:58 - That would be an interesting mental model for DuckDB.

39:00 It's kind of an equivalent to SQLite, but more of an analytics database, for folks, you know,

39:04 in process and that kind of thing, right?

39:07 What do you think?

39:08 - Yeah, so yeah, DuckDB is like SQLite.

39:09 And in fact, it can run the whole SQLite test suite, I believe.

39:14 So it's a full database, but it's for analytic processing.

39:17 So it's optimized for analytic processing, as compared, you know, with SQLite, which

39:21 is not optimized for analytic processing.
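A small hedged sketch of that in-process model using DuckDB's Python package (the data and query here are made up); note it can even query a local pandas DataFrame by name:

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"region": ["east", "west", "east"],
                   "amount": [50, 120, 200]})

# DuckDB runs in-process, like SQLite, but is optimized for analytics;
# it resolves "df" to the local DataFrame above
result = duckdb.sql("SELECT region, SUM(amount) AS total FROM df GROUP BY region")
print(result.df())
```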

39:23 - Yeah, cool.

39:24 All right, well, let's talk about some things that you're working on beyond pandas.

39:29 You talked about Apache Arrow earlier.

39:31 What are you doing with Arrow and how's it fit in your world?

39:34 - The backstory there was, I don't know if you can hear the sirens in downtown Nashville,

39:39 but.

39:40 - No, actually, I don't hear the sirens.

39:41 - It's good, the microphone filters it out, filters it out pretty well.

39:44 - Yay for dynamic microphones, they're amazing.

39:47 - Yeah, so around the mid-2010s, 2015, I started working at Cloudera,

39:53 which is a company that was one of the pioneers in the big data ecosystem.

39:58 And I had spent several years, five, six years, working

40:03 on pandas.

40:04 And so I'd gone through the experience of building pandas from top to bottom.

40:09 And it was this full stack system that had its own, you know, mini query engine,

40:15 all of its own algorithms and data structures and all this stuff that we had to build from

40:19 scratch.

40:20 And I started thinking about, you know, what if it was possible to build some of the underlying

40:24 computing technology, like data readers, like file readers, all the algorithms that power

40:31 the core components of pandas, like group operations, aggregations, filtering, selection,

40:37 all those things.

40:38 Like what if it were possible to have a general purpose library that isn't specific to Python,

40:43 isn't specific to pandas, but is really, really fast, really efficient, and has a large community

40:48 building it, so that you could take that code with you and use it to build many

40:52 different types of libraries, not just data frame libraries, but also database engines

40:57 and stream processing engines and all kinds of things.

41:00 That was kind of what was in my mind when I started getting interested in what turned

41:04 into Arrow.

41:06 And one of the problems we realized we needed to solve, this was a group of other open

41:10 source developers and me, was that we needed to create a way to represent data that was

41:15 not tied to a specific programming language.

41:19 And that could be used for a very efficient interchange between components.

41:24 And the idea is that you would have this immutable, this kind of constant data structure, which

41:29 is the same in every programming language.

41:32 And then you can use that as the basis for writing all of your algorithms.

41:35 So as long as it's arrow, you have these reusable algorithms that process arrow data.

41:41 So we started with building the arrow format and standardizing it.

41:44 And then we've built a whole ecosystem of components, library components in different

41:50 programming languages, for building applications that use the Arrow format.

41:54 So that includes not only tools for building and interacting with the data, but also file

42:00 readers.

42:01 So you can read CSV files and JSON data and parquet files, read data out of database systems,

42:07 you know, wherever the data comes from, we want to have an efficient way to get it into

42:10 the arrow format.

42:11 And then we moved on to building data processing engines that are native to the arrow format,

42:18 so that arrow goes in, the data is processed, arrow goes out.
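A hedged sketch of that flow with the pyarrow library, reading a file into the Arrow format and handing it to pandas (the file name is hypothetical):

```python
from pyarrow import csv

# Read a CSV file directly into an Arrow Table (columnar, language-agnostic)
table = csv.read_csv("events.csv")
print(table.schema)

# The same Table can be handed to other Arrow-native tools, or
# converted to a pandas DataFrame without copying the semantics
df = table.to_pandas()
```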

42:22 So DuckDB, for example, supports arrow as a preferred input format.

42:26 And DuckDB is more or less Arrow-like in its internals; it has kind of the Arrow format

42:32 plus a number of extensions that are DuckDB specific for better performance within the

42:37 context of DuckDB.

42:39 And that's happened in numerous communities. There's the Rust community, which has built DataFusion,

42:43 which is an execution engine, a SQL engine, for Arrow.

42:48 And so yeah, we've looked at the different layers of the stack, like data

42:51 access, computing, data transport, everything under the sun.

42:55 And then we've built libraries across many different programming languages so that

42:59 you can pick and choose the pieces that you need to build your system.

43:03 And the goal ultimately was that in the future, which is now, we don't want people

43:07 to have to reinvent the wheel whenever they're building something like pandas. They could

43:11 just pick up these off-the-shelf components, they can design the developer experience,

43:17 the user experience that they want to create, and, you know, if you

43:21 were building pandas now, you could build a pandas-like library based on the Arrow components

43:26 in much less time, and it would be fast and efficient and interoperable with the whole

43:31 ecosystem of other projects that use arrow.

43:33 It's very cool.

43:34 I mean, it was really ambitious, and in some ways obvious to people. They would

43:39 hear about Arrow and they'd say, that sounds obvious; clearly, we should have a universal

43:43 way of transporting data between systems and processing it in memory.

43:47 Why hasn't this been done in the past?

43:49 And it turns out that, as is true with many open source software problems that many of

43:54 these problems are, the social problems are harder than the technical problems.

43:58 And so if you can solve the people coordination and consensus problems, solving the technical

44:03 issues is much, much easier by comparison.

44:06 So I think we were lucky in that we found the right group of people, the right

44:10 personalities. I met Jacques Nadeau, who had been

44:15 at MapR and was working on his startup Dremio.

44:19 I knew instantly when I met Jacques Nadeau, I was like, I can work with him;

44:23 he's gonna help me make this happen.

44:25 And I met Julien Le Dem, who had also co-created Parquet.

44:29 I was like, yes, I found the right people; we're

44:33 gonna make this happen.

44:34 It's been a labor of love and much, much work and stress and everything.

44:38 But I've been working on things circling, you know, with Arrow as the sun.

44:43 I've been building kind of satellites and moons and planets circling the Arrow sun over

44:48 the last eight years or so.

44:49 And that's kept me pretty busy.

44:50 Yeah, it's only getting more exciting and interesting.

44:53 Over here, it says it uses efficient analytic operations on modern hardware like CPUs and

44:59 GPUs.

45:00 One of the big challenges of Python, also one of its big benefits, but

45:05 one of its challenges when you get to multi-core computational stuff, has been the GIL.

45:09 What's the story here?

45:10 Yeah.

45:11 So in Arrow land, when we're talking about analytic efficiency, it mainly has to do with

45:17 the underlying hardware: how a modern CPU works or how a GPU works.

45:24 And so when the data is arranged in column oriented format, that enables the data to

45:29 be moved efficiently through the CPU cache pipelines.

45:34 So the data is made made available efficiently to the to the CPU cores.

45:39 And so we spent a lot of energy in Arrow making decisions firstly, to enable very cache of

45:45 like CPU cache or GPU cache efficient analytics on the data.

45:49 So we were kind of always when we were deciding we would break ties and make decisions based

45:53 on like what's going to be more efficient for the for the computer chip.

45:57 The other thing is, and this is true with GPUs, which have a different parallelism

46:02 model, a different kind of multi-core parallelism model, than CPUs.

46:08 But in CPUs, they've focused on adding what are called single instruction, multiple data

46:13 (SIMD) intrinsics, like built-in operations in the processor, where now you can process

46:20 up to 512 bits of data in a single CPU instruction.

46:25 And so, if my brain's doing the math right, that's like 16 32-bit floats, or, you know,

46:31 eight 64-bit integers, in a single CPU cycle.

46:34 There's like intrinsic operations.

46:35 So multiply this number by that one, multiply that number to these eight things all at once,

46:41 something like that.

46:42 That's right.

46:43 Yeah.

46:44 Or you might say like, oh, I have a bit mask and I want to gather

46:48 the elements where the one bits are set in this bit mask from this array of integers.

46:52 And so there's like a gather instruction, which allows you to select a subset of a

46:58 SIMD vector of integers, you know, using a bit mask.

47:01 And so that turns out to be like a pretty critical operation in certain data analytic

47:05 workloads.
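At the Python level, the high-level analogue of that masked gather is NumPy boolean selection; a minimal sketch (the single vectorized call is what lets NumPy do the batch selection in C, where SIMD can apply, rather than looping in Python):

```python
import numpy as np

values = np.array([3, 14, 15, 92, 65, 35], dtype=np.int64)
mask = np.array([True, False, True, True, False, False])

# One call selects all masked elements at once
selected = values[mask]
print(selected)  # [ 3 15 92]
```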

47:06 So yeah, we wanted to have a data format that was essentially, you know,

47:11 future-proofed in the sense that it's ideal for the current

47:16 generation of CPUs, but also, given that a lot of processing is moving to GPUs and to

47:21 FPGAs and to custom silicon, we wanted Arrow to be usable there as well.

47:27 And Arrow has been successfully used as the foundation of GPU computing

47:32 libraries.

47:33 At Voltron Data, we've built a whole accelerator-native, GPU-native,

47:38 you know, scalable execution engine that's Arrow based.

47:42 That was our aspiration, and we've been able to prove

47:46 that out in real world workloads and show the kinds of efficiency gains that you can

47:50 get by using modern computing hardware correctly, or at least as well as it's intended to be

47:56 used.

47:57 That's a big deal in terms of like making applications faster, reducing the carbon footprint

48:01 of large scale data workloads, things like that.

48:03 Yeah.

48:04 Amazing.

48:05 All right.

48:06 Let's see what else have I got on deck here to talk to you about.

48:07 Want to talk about Ibis, or which one do you want to? We got a little time left.

48:11 We got a couple of things to cover.

48:13 Yeah.

48:14 Let's, we can talk about Ibis.

48:15 Yeah.

48:16 We could probably spend another hour talking.

48:17 Yes.

48:18 Easy.

48:19 So one of the more interesting areas in recent years has been new DataFrame libraries and

48:24 DataFrame APIs that transpile or compile to and execute on different backends.

48:29 And so around the time that I was helping start Arrow, I created this project called

48:33 Ibis, which is basically a portable DataFrame API that knows how to generate SQL queries

48:46 and compile to pandas and Polars and different DataFrame backends.

48:46 And the goal is to provide a really productive DataFrame API that gives you portability across

48:53 different execution backends with the goal of enabling what we call the multi-engine

48:57 data stack.

48:58 So you aren't stuck with using one particular system because all of the code that you've

49:03 written is specialized to that system.

49:05 You have this tool, so maybe you could work with, you know, DuckDB on your laptop,

49:10 or pandas or Polars, with Ibis on your laptop.

49:13 But if you need to run that workload someplace else, maybe with, you know, ClickHouse

49:17 or BigQuery, or maybe it's a large big data workload that's too big to fit on your laptop

49:22 and you need to use Spark SQL or something, you can just ask Ibis, say, "Hey, I want

49:28 to do the same thing on this larger dataset over here." And it has all the logic to generate the correct query representation and run that workload

49:35 for you.

49:36 So it's super useful.
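To give a flavor of that, here's a minimal sketch of the Ibis style; the database file, table, and column names here are hypothetical:

    import ibis

    # Build expressions against a local DuckDB database.
    con = ibis.duckdb.connect("local.ddb")
    events = con.table("events")

    # A portable, backend-agnostic DataFrame expression.
    expr = (
        events.filter(events.status == "ok")
              .group_by("user_id")
              .aggregate(n=events.count())
    )

    # Ibis compiles the expression to DuckDB SQL and executes it.
    print(expr.execute())

Build the same expression from, say, a BigQuery or Spark connection instead, and Ibis generates that engine's query for you.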

49:37 But there's a whole wave of work right now, you know, to enable people to work

49:42 in a pandas-like way but with big data, or, you know, to get better performance

49:47 than pandas, because pandas is a Swiss army knife, but it isn't a chainsaw.

49:51 So if you were rebuilding pandas from scratch, it would end up a lot different.

49:55 There are areas of the project that are, you know, more bloated or have performance overhead

49:59 that's hard to get rid of.

50:01 And so that's why Ritchie Vink started the Polars project, which is kind of a reimagining

50:06 of pandas data frames, written in Rust and exposed in Python.

50:11 And Polars, of course, is built on Apache Arrow at its core.

50:15 So it's an Arrow-native data frame library built in Rust, with, you know, all the benefits that

50:20 come with building Python extensions in Rust: you avoid the GIL and you can manage

50:25 the multithreading in a systems language, all that fun stuff.

50:28 Yeah.

50:29 When you're talking about Arrow and supporting different ways of using it and things being

50:32 built on it, certainly Polars came to mind for me.

50:35 You know, when you talk about Ibis, I think it's interesting that a lot of these data

50:39 frame libraries try to make their API pandas-like, but not identical, potentially,

50:46 you know, thinking of Dask and others.

50:49 But Ibis sort of has the ability to configure it and extend it and make it different, kind

50:54 of like, for example, Dask, which is one of the backends here.

50:57 But the API doesn't change, right?

50:59 It just talks to the different backends.

51:01 Yeah.

51:02 There are different schools of thought on this.

51:03 So there's another project called Modin, which is similar to Ibis in many ways, in the sense

51:07 of transpilation and sort of dynamically supporting different backends, but it sought

51:12 to closely emulate the exact details of the API: the function names and the function

51:19 arguments must be exactly the same as pandas, with the goal of being a drop-in replacement

51:25 for people's pandas code.

51:26 And that's one approach, kind of the pandas emulation route.
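As a sketch of that drop-in route, assuming Modin and one of its execution engines (Ray or Dask) are installed, and with a made-up file name:

    # The only change from regular pandas code is the import line.
    import modin.pandas as pd

    df = pd.read_csv("events.csv")       # hypothetical file
    print(df.groupby("user_id").size())  # same pandas API, parallel engine underneath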

51:29 And there's a library called Koalas for Spark, which is like a pandas API emulation layer on

51:35 top of PySpark.

51:36 And then there are other projects, like Polars, Ibis, and Dask DataFrame, that take design

51:41 cues from pandas in the sense of the general way in which the API works, but have

51:47 made meaningful departures in the interest of doing things better in many ways than pandas

51:52 did in certain parts of the API, making things simpler, and not being beholden to

51:57 decisions that were made in pandas, you know, 15 years ago.

51:59 Not to say there's anything bad about the pandas API, but like with any API, it's large,

52:04 it's very large, as evidenced by, you know, the 2,000 pages of documentation.

52:10 And so I understand the desire to make things simpler, but also to refine certain things,

52:14 making certain types of workloads easier to express.

52:18 And so Polars, for example, is very expression-based.

52:21 Everything is column expressions and is lazy, not eagerly computed, whereas

52:26 pandas is eager execution, just like NumPy is, which is how pandas became eagerly executed

52:33 in the first place.

52:34 And so I think the mantra with Polars was: we don't want to support the eager execution

52:40 by default that pandas provides; we want to be able to build expressions so that we can

52:44 do query optimization, and take inefficient code and, under the hood, rewrite it to be

52:50 more efficient, which is, you know, what you can do with a query optimizer.
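Here's what that expression-based, lazy style looks like in recent versions of Polars; a minimal sketch with a hypothetical CSV file:

    import polars as pl

    # scan_csv is lazy: nothing is read or computed yet.
    lazy = (
        pl.scan_csv("events.csv")
          .filter(pl.col("status") == "ok")
          .group_by("user_id")
          .agg(pl.len().alias("n"))
    )

    # The optimizer can, for example, push the filter into the scan;
    # collect() is what actually triggers execution.
    df = lazy.collect()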

52:53 And so ultimately, that matters a lot when you're executing code remotely, or in

52:57 a big data system: you want to have the freedom to take a lazy analytic

53:03 expression and rewrite it. It might even be that you need to seriously rewrite the

53:08 expression, in the case of Dask, for example, because Dask has to do planning across

53:13 a distributed cluster.

53:14 And so, you know, Dask DataFrame is very pandas-like, but it also includes some explicit controls

53:20 for how the data is partitioned, some knobs to turn,

53:25 in terms of having more control over what's happening on a distributed cluster.

53:29 And I think the goal there is to give the developer more control, as opposed to

53:32 trying to be intelligent and, you know, make all of the decisions on behalf of the developer.

53:37 So, you know, if you know a lot about your data set, then you can

53:41 make, you know, decisions about how to schedule and execute it.

53:45 Of course, Dask is building query optimization to start making more

53:49 of those decisions on behalf of the user.
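A small sketch of those Dask DataFrame knobs; the path and sizes here are made up:

    import dask.dataframe as dd

    # blocksize controls how the input files are split into partitions.
    ddf = dd.read_csv("data/events-*.csv", blocksize="64MB")

    # If you know your data, you can repartition it explicitly.
    ddf = ddf.repartition(npartitions=16)

    # Operations build a lazy task graph; compute() actually runs it.
    result = ddf.groupby("user_id").size().compute()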

53:51 But you know, Dask has become very popular and impactful in making distributed computing

53:56 easier in Python.

53:57 So they've gotten, I think, a long way without turning into a database.

54:01 And I think Dask never aspired to be a database engine. A lot of distributed

54:05 computing is, you know, not database-like: it could be distributed array computing

54:09 or distributed model training, or just being able to easily run distributed Python functions

54:14 on a cluster and do distributed computing that way.

54:16 It was amazing, like how many people were using PySpark in the early days, just for

54:21 the convenience of being able to run Python functions in parallel on a cluster.

54:25 Yeah, and that's pretty interesting.

54:27 Not exactly what it's designed for, right?

54:30 Dask, you know, you probably come across situations where you do a sequence of operations that are

54:35 kind of commutative in the end, in practice, but from a computational perspective, like,

54:39 how do I distribute this amongst different servers? Maybe one order matters a lot more

54:44 than the other for performance, you know?

54:47 Yeah, yeah.

54:48 Interesting.

54:49 All right.

54:50 One final thing.

54:51 SQLGlot.

54:52 Yeah.

54:53 So the SQLGlot project was started by Toby Mao.

54:54 He's a Netflix alum and a really, really talented developer who's created

55:00 this SQL query transpilation framework for Python, kind of an underlying core library.

55:08 And so the problem that's being solved there is that SQL, despite being a quote-unquote

55:12 standard, is not at all standardized across different database systems.

55:17 And so if you want to take your SQL queries written for one engine and use them someplace

55:21 else, without something like SQLGlot, you would have to manually rewrite them and make sure

55:25 you get the typecasting and coalescing rules correct.

55:30 And so SQLGlot understands the intricacies and the quirks of every SQL dialect,

55:36 and knows how to correctly translate from one dialect to another.

55:41 And so Ibis now uses SQLGlot as its underlying engine for query transpilation and generating

55:47 SQL outputs.

55:48 So originally, Ibis had its own kind of bad version of SQLGlot, a SQL transpilation

55:54 layer that was powered by, I think, SQLAlchemy and a bunch

56:00 of custom code.

56:01 And so I think they've been able to delete a lot of code in Ibis by moving to SQLGlot.

56:06 And I know that, you know, SQLGlot is also being

56:11 used by people building new products that are Python-powered and things like that.

56:16 So Toby's company, Tobiko Data, is building a product called SQLMesh that's

56:23 powered by SQLGlot.

56:24 So it's a very cool project, and maybe a bit in the weeds, but if you've ever needed to convert

56:28 a SQL query from one dialect to another, yeah, SQLGlot is here to save the day.

56:32 I would say, you know, even simple things, like how do you specify a parameter variable,

56:38 you know, for a parameterized query, right?

56:39 In Microsoft SQL Server, it's like @ and the parameter name, and in other databases

56:45 it's like a question mark or a colon and a name. You know, even those simple things.

56:49 It's a pain.

56:50 And without it, you end up with little Bobby tables, which is also not good.

56:53 So that's true.

56:54 That's true.

56:55 Nobody wants to talk to him.
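For reference, Python's DB-API (PEP 249) names several of these placeholder styles, qmark, named, pyformat, and so on, and letting the driver bind the values is exactly what keeps Bobby Tables out. A small runnable sketch with the standard library's sqlite3, which uses the question-mark style:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    con.execute("INSERT INTO users VALUES (?, ?)", (1, "Ada"))

    # The driver substitutes the value safely: no string formatting,
    # so no SQL injection. Other drivers spell this placeholder as
    # :id (named) or %(id)s (pyformat) instead of ?.
    row = con.execute("SELECT name FROM users WHERE id = ?", (1,)).fetchone()
    print(row)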

56:57 Yeah, this is really cool.

56:58 SQLGlot, like polyglot, but all the languages of SQL.

57:02 Nice.

57:03 And you can do things like say, read DuckDB and write to Hive, or read DuckDB and then

57:09 write to Spark, or whatever.

57:11 It's pretty cool.
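That really is the shape of the API; here's a small example along the lines of the project's own README:

    import sqlglot

    # Parse a DuckDB query and re-render it in other dialects.
    sql = "SELECT EPOCH_MS(1618088028295)"
    print(sqlglot.transpile(sql, read="duckdb", write="hive")[0])
    print(sqlglot.transpile(sql, read="duckdb", write="spark")[0])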

57:12 All right, Wes, I think we're getting short on time, but you know, I know everybody appreciated

57:17 hearing from you and hearing what you're up to these days.

57:20 Anything you want to add before we wrap up?

57:22 I don't think so.

57:23 Yeah.

57:24 I enjoyed the conversation and yeah, there's a lot of stuff going on and still plenty of

57:30 things to get excited about.

57:32 So I think often people feel like all the exciting problems in the Python ecosystem

57:36 have been solved, but there's still a lot to do.

57:39 And yeah, we've made a lot of progress in the last 15 plus years, but in some ways it

57:45 feels like we're just getting started.

57:47 So we are just excited to see where things go next.

57:49 Yeah.

57:50 Every time I think, oh, all the problems are solved, then you discover all these new things

57:53 that are so creative and you're like, oh, well, that was a big problem.

57:55 I didn't even know it was a problem.

57:57 It's great.

57:58 All right.

57:59 Well, thanks for being here and taking the time and keep us updated on what you're up

58:02 to.

58:03 All right.

58:04 Thanks for joining us.

58:05 Bye-bye.

58:07 This has been another episode of Talk Python to Me.

58:08 Thank you to our sponsors.

58:10 Be sure to check out what they're offering.

58:12 It really helps support the show.

58:14 It's time to stop asking relational databases to do more than they were made for and simplify

58:19 complex data models with graphs.

58:22 Check out the sample FastAPI project and see what Neo4j, a native graph database, can do

58:28 for you.

58:29 You can find out more at talkpython.fm/neo4j.

58:33 Mailtrap, an email delivery platform that developers love.

58:37 Try for free at mailtrap.io.

58:39 Want to level up your Python?

58:42 We have one of the largest catalogs of Python video courses over at Talk Python.

58:46 Our content ranges from true beginners to deeply advanced topics like memory and async.

58:51 And best of all, there's not a subscription in sight.

58:54 Check it out for yourself at training.talkpython.fm.

58:57 Be sure to subscribe to the show.

58:58 Open your favorite podcast app and search for Python.

59:01 We should be right at the top.

59:03 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the Direct

59:08 RSS feed at /rss on talkpython.fm.

59:12 We're live streaming most of our recordings these days.

59:15 If you want to be part of the show and have your comments featured on the air, be sure

59:19 to subscribe to our YouTube channel at talkpython.fm/youtube.

59:24 This is your host, Michael Kennedy.

59:25 Thanks so much for listening.

59:26 I really appreciate it.

59:27 Now get out there and write some Python code.

59:29 [MUSIC PLAYING]

59:32 [END PLAYBACK]
