#462: Pandas and Beyond with Wes McKinney Transcript
00:00 This episode dives into some of the most important data science libraries from the Python space
00:05 with one of its pioneers, Wes McKinney.
00:08 He's the creator or co-creator of the pandas, Apache Arrow, and Ibis projects, as well as
00:13 an entrepreneur in this space.
00:16 This is Talk Python to Me, episode 462, recorded April 11th, 2024.
00:21 Are you ready for your host? Here he is!
00:25 You're listening to Michael Kennedy on Talk Python to Me.
00:28 Live from Portland, Oregon, and this segment was made with Python.
00:33 Welcome to Talk Python to Me, a weekly podcast on Python.
00:38 This is your host, Michael Kennedy.
00:40 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,
00:45 both on fosstodon.org.
00:48 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.
00:53 We've started streaming most of our episodes live on YouTube.
00:56 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and
01:02 be part of that episode.
01:04 This episode is sponsored by Neo4j.
01:07 It's time to stop asking relational databases to do more than they were made for and simplify
01:12 complex data models with graphs.
01:15 Check out the sample FastAPI project and see what Neo4j, a native graph database, can do
01:21 for you.
01:22 Find out more at talkpython.fm/neo4j.
01:24 And it's brought to you by Mailtrap, an email delivery platform that developers love.
01:32 Try for free at mailtrap.io.
01:34 Hey, Wes, welcome to Talk Python to Me.
01:37 Thanks for having me.
01:38 You know, honestly, I feel like it's been a long time coming having you on the show.
01:42 You've had such a big impact in the Python space, especially the data science side of
01:46 that space, and it's high time to have you on the show.
01:48 So welcome.
01:49 Good to have you.
01:50 Yeah, it's great to be here.
01:51 I've been heads down a lot the last, you know, n years.
01:56 And because a lot of my work has been more like data infrastructure,
02:01 working at an even lower level than Python,
02:05 I haven't been engaging as much directly with the Python community.
02:10 But it's been great to kind of get back more involved and start catching up on all the
02:15 things that people have been building.
02:17 And being at Posit gives me the ability to, yeah, sort of have more exposure to what's
02:21 going on and people that are using Python in the real world.
02:24 There's a ton of stuff going on at Posit that's super interesting.
02:27 And we'll talk about some of that.
02:28 And, you know, sometimes it's just really fun to build, you know, and work with people
02:32 building things.
02:33 And I'm sure you're enjoying that aspect of it.
02:35 For sure.
02:36 Nice.
02:37 Well, before we dive into pandas and all the things that you've been working on after that,
02:41 you know, let's just hear a quick bit about yourself for folks who don't know you.
02:45 Sure.
02:46 Yeah.
02:47 So I grew up in Akron, Ohio, mostly.
02:49 And I started getting involved in Python development around 2007, 2008.
02:55 I was working in quant finance at the time, and I started building a personal
03:00 data analysis toolkit that turned into the pandas project.
03:04 I open sourced that in 2009 and started getting involved in the Python community.
03:08 And I spent several years like writing my book, Python for Data Analysis, and then working
03:14 with the broader scientific Python and data science community to help enable Python
03:19 to become a mainstream programming language for doing data analysis and data science.
03:24 In the meantime, I've become an entrepreneur, I've started some companies and I've been
03:30 working to innovate and improve the computing infrastructure that powers data science tools
03:36 and libraries like pandas.
03:38 So that's led to some other projects like Apache Arrow and Ibis and some other things.
03:44 In recent years, I've worked on a startup, Voltron Data, which is still very much going
03:50 strong and has a big team and is off to the races.
03:54 And I've had a long relationship with Posit, formerly RStudio.
03:58 And they were my home for doing Arrow development from 2018 to 2020.
04:04 They helped me incubate the startup that became Voltron Data.
04:08 And so I've gone back to work full time there as a software architect to help them with
04:14 their Python strategy to make sort of their data science platform a delight to use for
04:19 the Python user base.
04:21 I'm pretty impressed with what they're doing.
04:22 I didn't realize the connection between Voltron and Posit, but I have had Joe Cheng on the
04:28 show before to talk about Shiny for Python.
04:32 And I've seen him demo a few really interesting things, how it integrates to notebooks these
04:37 days, some of the stuff that you all are doing.
04:40 And yeah, it's just it's fascinating.
04:41 Can you give people a quick elevator pitch on that while we're on that subject?
04:45 On Shiny or on Posit in general?
04:47 Yeah, whichever you feel like.
04:49 Yeah, so Posit started out 2009 as RStudio.
04:54 And so it didn't start out intending to be a company.
04:57 JJ Allaire and Joe Cheng built a new IDE, integrated development environment for R,
05:03 because what was available at the time wasn't great.
05:06 And so they made that into, I think, probably one of the best data science IDEs that's ever
05:11 been built.
05:12 It's really an amazing piece of tech.
05:13 So it started becoming a company with customers and revenue in the 2013 timeframe.
05:20 And they've built a whole suite of tools to support enterprise data science teams to make
05:24 open source data science work in the real world.
05:27 But the company itself, it's a certified B corporation, has no plans to go public or
05:30 IPO.
05:31 It is dedicated to the mission of open source software for data science and technical communication,
05:36 and is basically building itself to be a hundred year company that has a revenue generating
05:42 enterprise product side and an open source side so that the open source feeds the enterprise
05:49 part of the business.
05:50 The enterprise part of the business generates revenue to support the open source development.
05:53 And the goal is to be able to sustainably support the mission of open source data science
05:58 for, hopefully, the rest of our lives.
06:01 And it's an amazing company.
06:02 It's been one of the most successful companies that dedicates a large fraction of its engineering
06:06 time to open source software development.
06:09 So I'm very impressed with the company and JJ Allaire, its founder.
06:13 And I'm excited to be helping it grow and become a sustainable long-term fixture in
06:21 the ecosystem.
06:22 Yes.
06:23 Yeah, it's definitely doing cool stuff.
06:25 Incentives are aligned well, right?
06:27 It's not private equity or IPO.
06:31 Many people know JJ Allaire created ColdFusion, which is like the original dynamic web development
06:38 framework in the 1990s.
06:40 And so he and his brother, Jeremy, and some others built Allaire Corp to commercialize
06:45 ColdFusion.
06:46 And they built a successful software business that was acquired by Macromedia, which was
06:49 eventually acquired by Adobe.
06:51 But they did go public as Allaire Corp during the dot-com bubble.
06:55 And JJ went on to found a couple of other successful startups.
06:59 And so 15 years ago he found himself in his late 30s, around age 40, around the age
07:05 I am now, having been very successful as an entrepreneur, no need to make money, and looking
07:11 for a mission to spend the rest of his career on.
07:15 And identifying data science and statistical computing, in particular
07:21 making open source for data science work, was the mission that he aligned with and something
07:25 that he had been interested in earlier in his career.
07:28 But he had gotten busy with other things.
07:30 So it's really refreshing to work with people who are really mission focused and focused
07:34 on making impact in the world, creating great software, empowering people, increasing accessibility,
07:41 and making most of it available for free on the internet, and not being so focused on
07:45 empire building and producing great profits for venture investors and things like that.
07:50 So I think the goal of the company is to provide an amazing home for top-tier software developers
07:58 to work on this software, to spend their careers, to build families, and to be a happy and
08:05 healthy culture for working on this type of software.
08:08 That sounds excellent.
08:09 Very cool.
08:10 I didn't realize the history all the way back to ColdFusion.
08:12 Speaking of history, let's jump in.
08:14 Wes, there's a possibility that people out there listening don't know what Pandas is.
08:20 You would think it's pretty ubiquitous, and I certainly would say that it is, especially
08:24 in the data science space.
08:25 I've got a bunch of listeners who say really surprising things.
08:29 They'll say stuff to me like, "Michael, I've been listening for six weeks now and I'm starting
08:33 to understand some of the stuff you all are talking about." I'm like, "Why did you listen for six weeks?
08:38 You didn't know what I was talking about.
08:40 That's crazy." And a lot of people use it as language immersion to get into the Python space.
08:45 So I'm sure there's plenty of people out there who are immersing themselves but are pretty
08:49 new.
08:50 So maybe for that crew, we could introduce what Pandas is to them.
08:54 Absolutely.
08:55 So this is the data manipulation and analysis toolkit for Python.
08:59 So it's a Python library that you install that enables you to read data files.
09:04 So read many different types of data files off of disk or off of remote storage, or read
09:09 data out of a database or some other remote data storage system.
09:13 This is tabular data.
09:14 So it's structured data like with columns.
09:16 You can think of it like a spreadsheet or some other tabular data set.
09:20 And then it provides you with this DataFrame object, pandas.DataFrame, which
09:27 is the main tabular data object.
09:29 And it has a ton of methods for accessing, slicing, grabbing subsets of the data, applying
09:36 functions on it that do filtering and subsetting and selection, as well as more analytical
09:42 operations like things that you might do with a database system or SQL.
09:46 So joins and lookups, as well as analytical functions like summary statistics, grouping
09:53 by some key and producing summary statistics.
09:57 So it's basically a Swiss Army knife for doing data manipulation, data cleaning, and supporting
10:03 the data analysis workflow.
10:05 But it doesn't actually include very much as far as actual statistics or models.
10:09 Or if you're doing something with LLMs or linear regression, or some type of machine
10:16 learning, you have to use another library.
10:18 But pandas is the on-ramp for all of the data into your environment in Python.
10:23 So when people are building some kind of application that touches data in Python, pandas is often
10:29 like the initial on-ramp for how data gets into Python, where you clean up the data,
10:34 you regularize it, you get it ready for analysis, and then you feed the clean data into the
10:41 downstream statistical library or data analysis library that you're using.
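(For readers newer to pandas, here is a minimal sketch of the kind of workflow being described. The file name and column names are invented for illustration.)

```python
import pandas as pd

# Read tabular data from a CSV file into a DataFrame; pandas has
# similar readers for Parquet, Excel, JSON, SQL databases, and more.
df = pd.read_csv("sales.csv")  # hypothetical file

# Slice and filter: keep one region, select a subset of columns.
west = df[df["region"] == "West"][["customer", "amount"]]

# Group by a key and compute summary statistics, much like SQL GROUP BY.
summary = df.groupby("region")["amount"].agg(["count", "mean", "sum"])
print(summary)
```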
10:46 - That whole data wrangling side of things, right?
10:48 - Yeah, that's right, that's right.
10:50 And so, you know, for some history, Python had arrays, like matrices and what we call
10:56 tensors now, multi-dimensional arrays going back all the way to 1995, which is pretty
11:02 early history for Python.
11:04 Like the Python programming language has only been around since like 1990 or 1991, if my
11:09 memory serves.
11:10 But what became NumPy in 2005, 2006, started out as Numeric in 1995, and it provided numerical
11:19 computing, multi-dimensional arrays, matrices, the kind of stuff that you might do in MATLAB,
11:24 but it was mainly focused on numerical computing and not with the type of business datasets
11:30 that you find in database systems, which contain a lot of strings or dates or non-numeric data.
11:36 And so my initial interest was I found Python to be a really productive programming language.
11:41 I really liked writing code in it, writing simple scripts, like, you know, doing random
11:45 things for my job.
11:47 But then you had this numerical computing library, NumPy, which enabled you to work
11:51 with large numeric arrays and large datasets with a single data type.
11:57 But working with this more tabular type data, stuff that you would do in Excel or stuff
12:00 that you do in a database, it wasn't very easy to do that with NumPy or it wasn't really
12:05 designed for that.
12:06 And so that's what led to building this, like, higher-level library that deals with these
12:11 tabular datasets, the pandas library, which was originally built, you know,
12:16 with a really close relationship with NumPy.
12:19 So Pandas itself was like a thin layer on top of NumPy originally.
12:24 This portion of Talk Python to Me is brought to you by Neo4j.
12:27 I have told you about Neo4j, the native graph database on previous AdSpots.
12:32 This time, I want to tell you about their relatively new podcast, Graph Stuff.
12:37 If you care about graph databases and modeling with graphs, you should definitely give it
12:41 a listen.
12:42 On their season finale last year, they talked about the intersection of LLMs and knowledge
12:47 graphs.
12:48 Remember when ChatGPT launched?
12:50 It felt like the LLM was a magical tool out of the toolbox.
12:54 However, the more you use it, the more you realize that's not the case.
12:57 The technology is brilliant, but it's prone to issues such as hallucinations.
13:03 But there's hope.
13:04 If you feed the LLM reliable current data, ground it in the right data and context, then
13:09 it can make the right connections and give the right answers.
13:12 On the episode, the team at Neo4j explores how to get the results by pairing LLMs with
13:18 knowledge graphs and vector search.
13:21 Check out their podcast episode on Graph Stuff.
13:24 They share tips for retrieval methods, prompt engineering, and more.
13:28 So just visit talkpython.fm/neo4j-graphstuff to listen to an episode.
13:35 That's talkpython.fm/neo4j-graphstuff.
13:39 The link is in your podcast player's show notes.
13:42 Thank you to Neo4j for supporting Talk Python to Me.
13:46 One thing I find interesting about Pandas is it's almost its own programming environment
13:52 these days in the sense that, you know, traditional Python, we do a lot of loops, we do a lot
13:59 of attribute dereferencing, function calling, and a lot of what happens in Pandas is more
14:05 functional.
14:07 It's more applied to us.
14:09 It's almost like set operations, right?
14:11 And a lot of vector operations and so on.
14:14 Yeah, that was behavior that was inherited from NumPy.
14:17 So NumPy is very array oriented, vector oriented.
14:21 So rather than write a for loop, you would write an array expression, which would operate
14:26 on whole batches of data in a single function call, which is a lot faster because you can
14:32 drop down into C code and get good performance that way.
14:37 And so pandas adopted the NumPy way, the NumPy-like array expression or vector
14:43 operations.
14:44 And interestingly, that's extended to the types of non-numeric data operations
14:49 that you can do in pandas, like, you know, vectorized set lookups, where you would say,
14:53 oh, I have this array of strings, and I have this
14:57 set of strings, and I want to compute a Boolean array which says whether or not each
15:01 string is contained in this set of strings.
15:04 And so in pandas, that's the isin function.
15:06 So you would say, column A isin some set of strings, and
15:12 that single function call produces a whole Boolean array that you can use for subsetting
15:17 later on.
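(A concrete sketch of the vectorized isin lookup just described; the data here is made up.)

```python
import pandas as pd

df = pd.DataFrame({"A": ["apple", "banana", "cherry", "banana", "fig"]})

# One vectorized call computes a whole Boolean array saying whether
# each string in column A is contained in the given set of strings...
mask = df["A"].isin({"banana", "cherry"})

# ...and that Boolean array can then be used for subsetting later on.
subset = df[mask]
print(subset)
```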
15:18 Yeah, there's a ton of things that are really interesting in there.
15:20 One of the challenges, maybe you could speak to this a little bit, then I want to come
15:23 back to your performance comment.
15:25 One of the challenges I think is that some of these operations are not super obvious
15:29 that they exist or that they're discoverable, right?
15:32 Like instead of just indexing into say a column, you can index on an expression that might
15:37 filter out the columns or project them or things like that.
15:40 How do you recommend people like kind of discover a little bigger breadth of what they can do?
15:45 There's plenty of great books written about Pandas.
15:48 So there's my book, Python for Data Analysis.
15:51 I think Matt Harrison has written an excellent book, Effective Pandas.
15:55 The pandas documentation, I think, provides really nitty-gritty detail about how all the
15:59 different things work.
16:01 But when I was writing this book, Python for Data Analysis, my goal was to, you know,
16:06 create a primer, like a tutorial on how to solve data problems with Pandas.
16:12 And so for that, you know, I had to introduce some basics of how NumPy works so people could
16:16 understand array oriented computing, basics of Python, so you know enough Python to be
16:21 able to understand the things that pandas is doing.
16:25 It builds incrementally.
16:26 And so like, as you go through the book, the content gets more and more advanced.
16:30 You learn and master an initial set of techniques, and then you can start
16:35 learning about more advanced techniques.
16:37 So it's definitely a pedagogical resource.
16:41 And as you're showing there on the screen, it is now freely available
16:44 on the internet.
16:46 So JJ Allaire helped me port the book to use Quarto, which is a new technical publishing
16:52 system for writing books and blogs and websites; you know, quarto.org.
16:57 And that's how I was able to publish my book on the internet. Essentially,
17:03 you can use Quarto to write books using Jupyter notebooks, which is cool.
17:06 My book was written a long time ago in O'Reilly's DocBook XML.
17:10 So not particularly fun to edit.
17:12 But yeah, Quarto is built on Pandoc, which is a sort of markup language transpilation
17:18 system.
17:19 So you can use Pandoc to convert documents from
17:23 one format to another.
17:25 And so that's kind of the root framework that Quarto is built on: you know,
17:30 starting with one document format and generating many different types of output formats.
17:33 That's cool.
17:34 I didn't realize your book was available just to read on the internet.
17:38 If you navigate around.
17:40 In the third edition, I was able to negotiate with O'Reilly and, you know, add an
17:45 addendum, an amendment to my very old book contract from 2011, to let me release
17:52 the book for free on my website.
17:54 So yeah, it's just available there at wesmckinney.com/book.
17:58 I find that like a lot of people really like the print book.
18:02 And so I think that having the online book just available, like whenever you are somewhere
18:06 and you want to look something up is great.
18:08 Print books are hard to search.
18:09 Yeah, that's true.
18:10 That's true.
18:11 Yeah.
18:12 And like, if you go back to the book and just look at
18:14 the search bar, just search for, like, groupby, you know, all one
18:18 word, and yeah, it comes up really fast.
18:21 You can go to that section.
18:23 And it's pretty cool.
18:25 I thought that releasing the book for free online would affect sales, but people
18:29 just really like having paper books, it seems even in 2024.
18:33 Yeah, even digital books are nice.
18:34 You've got them with you all the time. I think it's about taking notes: where
18:38 do I put my highlights?
18:39 And how do I remember it?
18:40 And that's right.
18:41 Yeah, yeah, stuff like that.
18:42 This Quarto thing looks super interesting.
18:45 If you look at Pandoc, if people haven't looked at this before, the conversion matrix.
18:57 I don't know how you would describe this. Busy and complete?
18:57 What is this?
18:58 This is crazy.
18:59 It's very busy.
19:00 Yeah, it can convert from, looks like, about, you know, 30 or 40 input
19:05 formats to, you know, 50 or 60 output formats, maybe more than that, kind of
19:10 just eyeballing it.
19:11 But yeah, it's pretty impressive.
19:12 And then if you took the combinatorics of, like, how many different ways could you combine
19:16 the 30 with the 50?
19:18 That's kind of what it looks like.
19:19 That's right.
19:20 It's truly amazing.
19:21 So you've got Markdown, you want to turn it into a PDF, or you've got a DokuWiki page and
19:26 you want to turn it into an EPUB or whatever, right?
19:29 Or even like reveal JS, probably to PowerPoint, I would imagine.
19:33 I don't know.
19:34 Yeah.
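(Those conversions are single commands; here is a hedged sketch that drives the pandoc CLI from Python, assuming pandoc is installed. The input files are hypothetical, and the PDF case additionally needs a LaTeX engine.)

```python
import subprocess

# Markdown to PDF.
subprocess.run(["pandoc", "notes.md", "-o", "notes.pdf"], check=True)

# DokuWiki markup to EPUB; the output format is inferred from the extension.
subprocess.run(
    ["pandoc", "--from", "dokuwiki", "page.txt", "-o", "book.epub"],
    check=True,
)
```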
19:36 As historical backstory about Quarto:
19:38 you know, it helps to keep in mind that JJ created ColdFusion, which was this,
19:44 you know, essentially early publishing system for the internet, similar
19:48 to CGI and PHP and other dynamic web publishing systems.
19:53 And so early on at RStudio, they created R Markdown, which is basically
19:59 a set of extensions to Markdown that allow you to have code cells written in R, and then
20:05 eventually they added support for some other languages. It's kind of like a Jupyter
20:08 notebook in the sense that you could have some Markdown and some code and some plots
20:12 and output, and you would run the R Markdown renderer and it would, you
20:17 know, generate all the output and insert it into the document.
20:21 And so you could use that to write blogs and websites and everything.
20:24 But R Markdown was written in R, and so that limited it in a sense; like, it made it
20:29 harder to install because you would have to install R to use it.
20:32 And also, it had an association with R that perhaps was, like, unmerited.
20:37 And so in the meantime, you know, with everything that's happened with web technology,
20:42 it's now very easy to put a complete JavaScript engine in a small footprint, you know, on
20:47 a machine with no dependencies, and to be able to run a system
20:53 that's written in JavaScript.
20:55 And so Quarto is completely language agnostic; it's written in TypeScript, and uses Pandoc
21:01 as an underlying engine.
21:03 And it's very easy to install.
21:05 And so it addresses some of the portability and extensibility issues that were
21:09 present in R Markdown.
21:11 But as a result, you know, I think the Posit team has more
21:16 than a decade, or if you include ColdFusion, you know, more than 25 years
21:20 of experience in building really developer-friendly technical publishing tools.
21:25 And so, it's not data science, but it's something that is an important part
21:30 of the data science workflow, which is, how do you present your analysis and
21:35 your work, make it available for consumption in different formats?
21:38 And so having this system that can, you know, publish outputs in many different
21:43 places is super valuable.
21:46 So a lot of people start out in Jupyter notebooks, but there's, you know,
21:49 many different possible input formats.
21:51 And so to be able to, you know, use the same source to publish to a website or to a Confluence
21:56 page or to a PDF is like, super valuable.
21:59 Yeah, it's super interesting.
22:00 Okay, so then I got to explore some more.
22:03 Let's go back to this for a minute.
22:05 Just how about some kind words from the audience for you?
22:08 Ravid says, Wes, your work has changed my life.
22:11 It's very, very nice.
22:12 I'm happy to hear it.
22:13 But yeah, yeah, I'm more than happy to talk.
22:16 Let's talk in depth about pandas.
22:18 And I think history of the project is interesting.
22:20 And I think also how the project has developed in the intervening 15, 16 years is pretty interesting
22:28 as well.
22:29 Yeah, let's talk derivatives for a minute.
22:30 So growth and speed of adoption and all those things.
22:34 When you first started working on this, and you first put it out, did you foresee a world
22:37 where this was so popular and so important?
22:42 Did you think, yeah, pretty soon black holes? I'm pretty sure I'll be part of that somehow.
22:46 It was always the aspiration of making Python this mainstream language for statistical computing
22:52 and data analysis.
22:54 Like, it didn't occur to me that it would become this popular, or that it would
22:59 become like one of the main tools that people use for working with data
23:04 in a business setting. If that had been the aspiration, or if that
23:08 was, you know, what I needed to achieve to be satisfied, that would have been completely
23:12 unreasonable.
23:13 And in a certain sense, like, I don't know whether its popularity
23:18 is deserved or not.
23:19 Like I think there's many other worthy efforts that have been created over the years,
23:23 really great work that others have done in this domain.
23:28 And so the fact that pandas caught on and became as popular as it is, I think it's
23:33 a combination of timing.
23:35 And you know, there was like a developer relations aspect, that there was content available.
23:40 Like I wrote my book, and that made it easier for people to learn how to use the
23:43 project.
23:44 But also we had a serendipitous open source developer community that came
23:49 together that allowed the project to grow and expand really rapidly in the early
23:55 2010s.
23:56 And I definitely spent a lot of work like recruiting people to work on the project and
24:01 encouraging, you know, others to work on it, because sometimes people create open source
24:04 projects, and then it's hard for others to get involved and get a seat at the
24:08 table, so to speak.
24:09 But I was very keen to bring on others and to give them responsibility and, you know,
24:16 ultimately hand over the reins of the project to others.
24:20 And I've spoken a lot about that, you know, over the years, how important it is
24:23 for open source project creators to make room for others in, you know, steering
24:28 and growing the project so that they can become owners of it as well.
24:32 It's tough to make space and tough to bring on folks.
24:36 Have you heard of the Djangonauts?
24:38 The Djangonauts?
24:39 I think it's Djangonauts dot space.
24:40 They have an awesome domain.
24:42 But it's basically, like, kind of a boot camp for taking people who just
24:46 like Django and turning them into actual contributors or core contributors.
24:50 What's your onboarding story for people who do want to participate?
24:53 I'm embarrassed to say that I don't have a comprehensive view of, like, all of the
24:58 different, you know, community outreach channels that the pandas project has used to help grow
25:02 new contributors.
25:04 So one of the core team members, Marc Garcia, has done an amazing job organizing documentation
25:12 sprints and other like contributor sourcing events, essentially creating very friendly,
25:18 accessible events where people who are interested in getting involved in pandas can meet each
25:22 other and then assist each other in making their first pull request.
25:27 And it could be something as simple as, you know, making a small improvement to
25:31 the pandas documentation, because it's such a large project.
25:34 The documentation is, like, always something that could be better, you know, either adding
25:40 more examples or documenting things that aren't documented or, yeah, just
25:44 making the documentation better.
25:46 And so it's something that for new contributors is more accessible than working on the internals
25:52 of, like, one of the algorithms or something.
25:55 Or, like, working on some significant performance improvement might be a bit intimidating
26:00 if you've never worked on the pandas code base.
26:02 And it's a pretty large code base, because it's been worked on continuously
26:05 for, you know, like going on 20 years.
26:08 So yeah, it can take a while to really get to a place where you can be productive.
26:13 And that can be discouraging for new contributors, especially those who don't have a lot of open
26:18 source experience.
26:19 That's one of the ironies of these big projects: they're just so finely
26:24 polished.
26:25 So many people are using them.
26:27 Every edge case matters to somebody.
26:29 Right.
26:30 And so to become a contributor and make changes to that, it takes a while, I'm sure.
26:34 Yeah, yeah.
26:35 I mean, I think a big thing that helped is allowing people to get paid
26:39 to work on pandas, or to be able to contribute to pandas as a part of their job description,
26:46 like, you know, maybe part of their job is maintaining pandas.
26:50 So Anaconda, you know, was one of the earliest companies who had engineers
26:55 on staff, you know, like Brock Mendel, Tom Augspurger, Jeff Reback, where part
27:01 of their job was maintaining and developing pandas.
27:03 And that was huge, because prior to that, the project was purely based on volunteers.
27:09 Like, I was a volunteer, and everyone was working on the project as a passion
27:13 project in their free time.
27:15 And then Travis Oliphant, one of the founders, he and Peter Wang founded Anaconda.
27:21 Travis spun out from Anaconda to create Quansight and has continued to sponsor development in
27:26 and around pandas.
27:27 And that's enabled people like Marc to do these community building events, and for
27:32 it to not be, you know, something that's totally uncompensated.
27:35 Yeah.
27:36 That's a lot of stuff going on.
27:37 And I think the interest is awesome, right?
27:40 I mean, it's just a different level of problems I feel like we could take on, you
27:45 know: you know what, I've got this entire week and it's my job to make this
27:48 work, rather than, I've got two hours and can't really take on a huge project.
27:53 And so I'll work on the smaller improvements or whatever.
27:56 Yeah.
27:57 Many people know, but I haven't been involved day to day in pandas since 2013.
28:01 So that's getting on.
28:02 That's a lot of years.
28:03 I still talk to the pandas contributors.
28:06 We had a pandas core developer meetup here in Nashville pre-COVID.
28:10 I think it was in 2019, maybe.
28:13 So I'm still in active contact with the pandas developers, but it's been a different team
28:18 of people leading the project.
28:20 It's taken on a life of its own, which is, which is amazing.
28:23 That's exactly, yeah.
28:24 As a project creator, that's exactly what you want: to not be beholden to the project
28:28 that you created and, you know, kind of forced to be responsible for it
28:32 and take care of it for the rest of your life.
28:35 But if you look at the community, a lot of the most kind of intensive community
28:39 development has happened since, like, I moved on to work on other projects.
28:43 And so now the project has, I don't know the exact count, but thousands of contributors.
28:49 And so, you know, to have thousands of different unique individuals contributing to an open
28:53 source project is a, it's a big deal.
28:55 So I think even, I don't know what it says on the bottom of GitHub, it says, you know,
29:00 3,200 contributors, but that's maybe not even the full story because sometimes, you know,
29:07 people don't have their email address associated with their GitHub profile, and, you
29:11 know, it depends on how GitHub counts contributors.
29:13 I would say probably the true number is closer to 4,000.
29:16 That's a testament, you know, to the, to the core team and all the outreach that they've
29:20 done and work making, making the project accessible and easy to contribute to.
29:25 Because if you go and try to make a pull request to a project, there's many
29:29 different ways that you can fail.
29:31 So, like, either there's, technically, issues with the build system
29:36 or the developer tooling.
29:38 And so you struggle with the developer tooling.
29:40 And so if you aren't working on it every day and every night, you can't make heads or tails
29:43 of how the developer tools work.
29:45 But then there's also like the level of accessibility of the core development team.
29:50 Like if they aren't there to support you in getting involved in the project
29:54 and learning how it works and creating documentation about how to contribute and what's expected
29:59 of you, that can also be, you know, a source of frustration where people churn out of the
30:04 project, you know, because it's just, it's too hard to find their sea legs.
30:09 And maybe also, you know, sometimes development teams are unfriendly or unhelpful, or, you
30:14 know, they make others feel like they're annoyed
30:18 with them or like they're wasting their time or something.
30:20 It's like, I don't want to look at your pull request, you know, and give you feedback,
30:24 because, you know, I could do it more quickly by myself or something.
30:28 Like sometimes you see that in open source projects.
30:30 But they've created a very welcoming environment.
30:33 And yeah, I think the contribution numbers speak for themselves.
30:37 - They definitely do.
30:38 Maybe the last thing before we move on to the other stuff you're working on, but the
30:41 other interesting GitHub statistic here is the "used by": 1.6 million projects.
30:47 I don't know if I've ever seen a "used by" that high.
30:50 There's probably some that are higher, but not many.
30:52 - Yeah, it's a lot of projects.
30:54 I think it's interesting.
30:55 I think, like many projects, it's reached a point where it's an essential and assumed
30:59 part of many people's toolkit.
31:03 Like, the first thing that they write at the top of a file that they're working
31:07 on is import pandas as pd or import numpy as np. I think, in a sense,
31:12 one of the reasons why, you know, pandas has gotten so popular is that it is
31:16 beneficial to the community, to the Python community to have fewer solutions, you know,
31:21 kind of the Zen of Python: there should be one, and preferably only one, obvious
31:25 way to do it.
31:26 And so if there were 10 different pandas-like projects, you know, that creates skill portability
31:32 problems, and it's just easier if everyone says, oh, pandas is the thing that
31:36 we use, and you change jobs and you can take all your skills, like how to use pandas, with
31:40 you.
31:41 And I think that's also one of the reasons why Python has become so successful in the
31:45 business world is because you can teach somebody even without a lot of programming experience,
31:50 how to use Python, how to use pandas and become productive doing basic work very, very quickly.
31:56 I remember back in the early 2010s, there were a lot of articles
32:02 and talks about how to address the data scientist shortage.
32:06 And I gave a talk at Web Summit in Dublin in, gosh, maybe 2017, I'd have
32:16 to look exactly.
32:17 But basically it was about the data scientist shortage.
32:21 And my thesis was always, we should make it easier to be a data scientist or like lower
32:25 the bar for like what sort of skills you have to master before you can do productive work
32:31 in a business setting.
32:32 And so I think the fact that there is just pandas and that's like the one thing that
32:36 people have to learn how to use is like their essential starting point for doing any data
32:41 work has also led to this piling on of like people being motivated to make this one thing
32:47 better because you make improvements to pandas and they benefit millions of projects and
32:52 millions of people around the world.
32:53 And that's, yeah, so it's like a steady snowballing effect.
32:59 This portion of Talk Python to Me is brought to you by Mailtrap, an email delivery platform
33:04 that developers love.
33:06 An email sending solution with industry best analytics, SMTP and email API SDKs for major
33:13 programming languages and 24/7 human support.
33:17 Try for free at mailtrap.io.
33:22 I think doing data science is getting easier.
33:24 We've got a lot of interesting frameworks and tools.
33:27 Shiny for Python, one of them, right, that makes it easier to share and run your code,
33:32 you know?
33:33 Yeah, Shiny for Python, Streamlit, Dash, like these different interactive data application
33:38 publishing frameworks.
33:39 So you can go from a few lines of pandas code, loading some data and doing some analysis
33:45 and visualization to publishing that as an interactive website without having to know
33:51 how to use any web development frameworks or Node.js or anything like that.
33:56 And so to be able to get up and running and build a working interactive web application
34:01 that's powered by Python is, yeah, it's a game changer in terms of shortening end-to-end
34:07 development life cycles.
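(As a rough sketch of what that looks like, here is a tiny app using Shiny for Python's documented core API; no web framework knowledge required. The slider and text output are invented for illustration.)

```python
from shiny import App, render, ui

app_ui = ui.page_fluid(
    ui.input_slider("n", "Number of rows", min=5, max=100, value=25),
    ui.output_text("summary"),
)

def server(input, output, session):
    @render.text
    def summary():
        # Any pandas analysis or plotting code could live here instead.
        return f"You asked for {input.n()} rows."

# Launch locally with: shiny run app.py
app = App(app_ui, server)
```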
34:08 What do you think about JupyterLite and these PyOdide and basically Jupyter in a browser
34:17 type of things, WebAssembly and all that?
34:19 Yeah, so I'm definitely very excited about it, been following WebAssembly in general.
34:24 And so I guess some people listening will know about WebAssembly, but basically it's
34:28 a portable machine code that can be compiled and executed within your browser in a sandbox
34:35 environment.
34:36 So it protects against security issues and allows, prevents like the person who wrote
34:42 the WebAssembly code from doing something malicious on your machine, which is very important.
34:46 Won't necessarily stop them from like, you know, mining cryptocurrency while you have
34:49 the browser tab open.
34:50 That's a whole separate problem.
34:51 And it's enabled us to run the whole scientific Python stack, including Jupyter and NumPy
34:57 and Pandas totally in the browser without having a client and server and needing to
35:03 run a container someplace in the cloud.
35:05 And so I think in terms of application deployment, like being able to deploy an
35:10 interactive data application like with Shiny, for example, without needing to have a server,
35:16 that's actually pretty amazing.
35:17 And so I think that simplifies things, opens up new use cases, like new application architectures,
35:22 and makes things a lot easier, because setting up and running a server creates
35:27 brittleness; like, it has cost.
35:29 And so if the browser is doubling as your server process, I think that's
35:34 really cool.
35:35 You also have like other projects like DuckDB, which is a high performance, embeddable analytic
35:41 SQL engine.
35:43 And so now with DuckDB compiled to Wasm, you can get a high performance database running
35:49 in your browser.
35:50 And so you can get low latency, interactive queries and interactive dashboards.
35:55 And so it's, yeah, there's WebAssembly has opened up this whole kind of new world of
36:00 possibilities and it's transformative, I think.
36:03 For Python in particular, you mentioned Pyodide, which is kind of a whole packaged stack.
36:08 So it's like a framework for building and packaging, basically building an application
36:14 and managing its dependencies.
36:16 So you could create a WebAssembly version of your application to be deployed like this.
36:21 But yeah, I think either the Pyodide main creator or a maintainer went
36:26 to Anaconda, and they created PyScript, which is another attempt to make it
36:29 even easier to use Python to create web applications,
36:36 interactive web applications.
36:37 There's so many cool things here, like in the R community, they have WebR, which is
36:41 similar to PyScript and Pyodide in some ways, like compiling the whole R stack to WebAssembly.
36:46 There was just an article I saw on Hacker News where they worked on figuring out how
36:50 to trick LLVM into compiling Fortran code, like legacy Fortran code, to WebAssembly.
36:57 Because when you're talking about all of this scientific computing stack, you need the linear
37:00 algebra and all of the 40 years of Fortran code that have been built to support scientific
37:06 applications.
37:07 And now you can compile that and run it in the browser.
37:09 So yeah, that's pretty wild to think of putting that in there, but very useful.
37:12 I didn't realize that you could use DuckDB as a WebAssembly component.
37:16 That's pretty cool.
37:17 Yeah, there's a company, I'm not an investor or plugging them or anything, but it's called
37:21 evidence.dev.
37:23 It's like a whole open source business intelligence application
37:28 that's powered by DuckDB.
37:30 And so if you have data that fits in the browser, you know, you can have a whole interactive
37:34 dashboard, or do business intelligence, like, fully in the browser with
37:39 no need of a server.
37:41 Yeah, it's very, very cool.
37:43 So I've been following DuckDB since the early days, and, you know, my
37:48 company Voltron Data, we became members of the DuckDB Foundation and actively
37:54 built a relationship with DuckDB Labs,
37:56 so we could help accelerate progress in this space, because I think the impact
38:02 is so immense, and it's hard to predict, like, what people
38:07 are going to build with all this stuff.
38:09 And, I guess going back, you know, 15 years ago to Python,
38:13 like, one of the reasons I became so passionate about building stuff for Python was,
38:19 I think the way that Peter Wang puts it is, you know, giving people superpowers.
38:24 So we want to enable people to build things with much less code and much less time.
38:29 And so by making things that much more accessible, that much easier to do; like, the
38:34 mantra in pandas was like, how do we make things one line of code?
38:37 Like, this must be easy.
38:39 It's like one line of code, one line of code.
38:41 Make this as terse and simple and easy to do as possible, so that
38:45 you can move on and focus on building the more interesting parts of your application
38:49 rather than struggling with how to read a CSV file or you know, how to do whichever
38:54 data munging technique that you need for your, for your data set.
38:58 - That would be an interesting mental model for DuckDB.
39:00 It's kind of an equivalent to SQLite, but more of an analytics database, for folks, you know,
39:04 in process and that kind of thing, right?
39:07 What do you think?
39:08 - Yeah, so yeah, DuckDB is like SQLite.
39:09 And in fact, it can run the whole SQLite test suite, I believe.
39:14 So it's a full database, but it's for analytic processing.
39:17 So it's optimized for analytic processing, as compared, you know, with SQLite, which
39:21 is not built for analytic processing.
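(A minimal sketch of that in-process, SQLite-like feel using DuckDB's Python API; the query and file name are illustrative.)

```python
import duckdb

# DuckDB runs in-process, like SQLite, but is optimized for analytics.
# It can query a Parquet file directly, with no server and no import step.
rel = duckdb.sql(
    "SELECT region, SUM(amount) AS total "
    "FROM 'sales.parquet' GROUP BY region"
)

# Results interoperate with the rest of the Python data stack.
df = rel.df()      # as a pandas DataFrame
tbl = rel.arrow()  # as an Apache Arrow table
```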
39:23 - Yeah, cool.
39:24 All right, well, let's talk about some things that you're working on beyond pandas.
39:29 You talked about Apache Arrow earlier.
39:31 What are you doing with Arrow and how's it fit in your world?
39:34 - The backstory there was, I don't know if you can hear the sirens in downtown Nashville,
39:39 but.
39:40 - No, actually, I don't hear the sirens.
39:41 - It's good, the microphone filters it out, filters it out pretty well.
39:44 - Yay for dynamic microphones, they're amazing.
39:47 - Yeah, so around the mid-2010s, 2015, I started working at Cloudera,
39:53 which is a company that was like one of the pioneers in the big data ecosystem.
39:58 And I had spent several years, five, six years, working
40:03 on pandas.
40:04 And so I'd gone through the experience of building pandas from top to bottom.
40:09 And it was this full-stack system that had its own, you know, mini query engine,
40:15 all of its own algorithms and data structures and all this stuff that we had to build from
40:19 scratch.
40:20 And I started thinking about, you know, what if it was possible to build some of the underlying
40:24 computing technology, like data readers, like file readers, all the algorithms that power
40:31 the core components of pandas, like group operations, aggregations, filtering, selection,
40:37 all those things.
40:38 Like what if it were possible to have a general purpose library that isn't specific to Python,
40:43 isn't specific to pandas, but is really, really fast, really efficient, and has a large community
40:48 building, building it so that you could take that code with you and use it to build many
40:52 different types of libraries, not just data frame libraries, but also database engines
40:57 and stream processing engines and all kinds of things.
41:00 That was kind of what was in my mind when I started getting interested in what turned
41:04 into Arrow.
41:06 And one of the problems we realized we needed to solve, this was, like, a group of other open
41:10 source developers and me, was that we needed to create a way to represent data that was
41:15 not tied to a specific programming language.
41:19 And that could be used for a very efficient interchange between components.
41:24 And the idea is that you would have this immutable, this kind of constant data structure, which
41:29 is like it's the same in every programming language.
41:32 And then you can use that as the basis for writing all of your algorithms.
41:35 So as long as it's arrow, you have these reusable algorithms that process arrow data.
41:41 So we started with building the arrow format and standardizing it.
41:44 And then we've built a whole ecosystem of components like library components and different
41:50 programming languages for building applications that use the arrow format.
41:54 So that includes not only tools for building and interacting with the data, but also file
42:00 readers.
42:01 So you can read CSV files and JSON data and parquet files, read data out of database systems,
42:07 you know, wherever the data comes from, we want to have an efficient way to get it into
42:10 the arrow format.
42:11 And then we moved on to building data processing engines that are native to the arrow format,
42:18 so that arrow goes in, the data is processed, arrow goes out.
42:22 So DuckDB, for example, supports arrow as a preferred input format.
42:26 And DuckDB is more or less Arrow-like in its internals; it has kind of the Arrow format
42:32 plus a number of extensions that are DuckDB-specific for better performance within the
42:37 context of DuckDB.
42:39 And similarly in numerous communities; so there's the Rust community, which has built DataFusion,
42:43 which is an execution engine, a SQL engine, for Arrow.
42:48 And so yeah, we've kind of like looked at the different layers of the stack, like data
42:51 access, computing, data transport, everything under the sun.
42:55 And then we've built libraries across many different programming languages so that
42:59 you can pick and choose the pieces that you need to build your system.
43:03 And the goal ultimately was that, in the future, which is now, we don't want people
43:07 to have to reinvent the wheel whenever they're building something like pandas; that they could
43:11 just pick up these off-the-shelf components, they can design the developer experience,
43:17 the user experience that they want to create, and they can build it. You know, so if you
43:21 were building pandas now, you could build a pandas-like library based on the Arrow components
43:26 in much less time, and it would be fast and efficient and interoperable with the whole
43:31 ecosystem of other projects that use Arrow.
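(A small sketch of picking up those off-the-shelf pieces via the pyarrow library; the file names are invented.)

```python
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Read a CSV file straight into the language-independent Arrow format.
table = pacsv.read_csv("events.csv")

# The same in-memory table can be written out as a Parquet file...
pq.write_table(table, "events.parquet")

# ...or handed to pandas (or DuckDB, Polars, etc.) without copying
# the data through a bespoke, library-specific representation.
df = table.to_pandas()
```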
43:33 It's very cool.
43:34 I mean, it was really ambitious, and in some ways obvious to people; they would
43:39 hear about Arrow and say, that sounds obvious: clearly, we should have a universal
43:43 way of transporting data between systems and processing it in memory.
43:47 Why hasn't this been done in the past?
43:49 And it turns out that, as is true with many open source software efforts,
43:54 the social problems are harder than the technical problems.
43:58 And so if you can solve the people coordination and consensus problems, solving the technical
44:03 issues is much, much easier by comparison.
44:06 So I think we were lucky in that we found the right group of people, the right
44:10 personalities. I met Jacques Nadeau, who had been
44:15 at MapR and was working on his startup Dremio.
44:19 Like, I knew instantly when I met Jacques Nadeau, I was like,
44:23 he's going to help me make this happen.
44:25 And I met Julien Le Dem, who had also co-created Parquet.
44:29 I was like, yes, I found the right people; we're
44:33 gonna make this happen.
44:34 It's been a labor of love and much, much work and stress and everything.
44:38 But I've been working on things circling, you know, with Arrow as the sun;
44:43 I've been building kind of satellites and moons and planets circling the Arrow sun over
44:48 the last eight years or so.
44:49 And that's kept me pretty busy.
44:50 Yeah, it's only getting more exciting and interesting.
44:53 Over here, it says it uses efficient analytic operations on modern hardware like CPUs and
44:59 GPUs.
45:00 One of the big challenges of Python has been the GIL, also one of its big benefits, but
45:05 one of its challenges when you get to multi-core computational stuff.
45:09 What's the story here?
45:10 Yeah.
45:11 So in Arrow land, when we're talking about analytic efficiency, it mainly has to do with
45:17 the underlying hardware, like how a modern CPU works or how a GPU works.
45:24 And so when the data is arranged in column-oriented format, that enables the data to
45:29 be moved efficiently through the CPU cache pipelines.
45:34 So the data is made available efficiently to the CPU cores.
45:39 And so we spent a lot of energy in Arrow making decisions, firstly, to enable very
45:45 CPU-cache or GPU-cache efficient analytics on the data.
45:49 So when we were deciding things, we would kind of always break ties and make decisions based
45:53 on what's going to be more efficient for the computer chip.
45:57 The other thing is, and this is true with GPUs, which have a different kind of
46:02 multi-core parallelism model than CPUs,
46:08 but in CPUs, they've focused on adding what are called single instruction, multiple data
46:13 intrinsics, like built-in operations in the processor, where, you know, now you can process
46:20 up to 512 bits of data in a single CPU instruction.
46:25 And so, if my brain's doing the math right, that's like 16 32-bit floats, or, you know,
46:31 eight 64-bit integers in a single CPU cycle.
46:34 There's like intrinsic operations.
46:35 So multiply this number by that one, multiply that number to these eight things all at once,
46:41 something like that.
46:42 That's right.
46:43 Yeah.
46:44 Or you might say like, oh, I have a bit mask, and I want to gather, from this array of
46:48 integers, the elements where the bits are set in this bit mask.
46:52 And so there's like a gather instruction, which allows you to select a subset of a sort of
46:58 SIMD vector of integers, you know, using a bit mask.
47:01 And so that turns out to be like a pretty critical operation in certain data analytic
47:05 workloads.
47:06 So yeah, we were really, we wanted to have a data format that was essentially, you know,
47:11 future proofed in the sense that it's, it's ideal for the coming wave, like current current
47:16 generation of CPUs, but also given that a lot of processing is moving to GPUs and to
47:21 FPGAs and to custom silicon, like we wanted Arrow to be usable there as well.
47:27 And it's Arrow has been successfully, you know, used as the foundation of GPU computing
47:32 libraries.
47:33 Like we kind of at Voltron Data, we built, we've built a whole accelerator native GPU
47:38 native, you know, scalable execution engine that's, that's Arrow based.
47:42 And so I think the fact that we, that was our aspiration and we've been able to prove
47:46 that out in real world workloads and show the kinds of efficiency gains that you can
47:50 get with using modern computing hardware correctly, or at least as well as it's intended to be
47:56 used.
47:57 That's a big deal in terms of like making applications faster, reducing the carbon footprint
48:01 of large scale data workloads, things like that.
48:03 Yeah.
48:04 Amazing.
48:05 All right.
48:06 Let's see, what else have I got on deck here to talk to you about?
48:07 Ibis? Want to talk about Ibis, or which one do you want to? We got a little time left.
48:11 We got a couple of things to cover.
48:13 Yeah.
48:14 Let's, we can talk about Ibis.
48:15 Yeah.
48:16 We could, we could probably spend another hour talking.
48:17 Yes.
48:18 Easy.
48:19 So one of the more interesting areas in recent years has been new DataFrame libraries and
48:24 DataFrame APIs that transpile or compile to, and execute on, different backends.
48:29 And so around the time that I was helping start Arrow, I created this project called
48:33 Ibis, which is basically a portable DataFrame API that knows how to generate SQL queries
48:40 and compile to pandas and Polars and different DataFrame backends.
48:46 And the goal is to provide a really productive DataFrame API that gives you portability across
48:53 different execution backends with the goal of enabling what we call the multi-engine
48:57 data stack.
48:58 So you aren't stuck with using one particular system because all of the code that you've
49:03 written is specialized to that system.
49:05 You have this tool, so maybe you could work with, you know, DuckDB on your laptop,
49:10 or pandas or Polars, with Ibis on your laptop.
49:13 But if you need to run that workload someplace else, maybe with, you know, ClickHouse
49:17 or BigQuery, or maybe it's a large big data workload that's too big to fit on your laptop
49:22 and you need to use Spark SQL or something, you can just ask Ibis, say, "Hey, I want
49:28 to do the same thing on this larger dataset over here." And it has all the logic to generate the correct query representation and run that workload
49:35 for you.
49:36 So it's super useful.
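(A hedged sketch of that portability with Ibis; the backend, file, and column names are illustrative, and the same expression could be pointed at another backend.)

```python
import ibis

con = ibis.duckdb.connect()            # could be ClickHouse, BigQuery, ...
t = con.read_parquet("sales.parquet")  # hypothetical dataset

expr = (
    t.filter(t.amount > 0)
     .group_by("region")
     .aggregate(total=t.amount.sum())
)

print(ibis.to_sql(expr))  # the SQL Ibis generates for this backend
df = expr.execute()       # run it and get the results back locally
```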
49:37 But there's a whole wave of, like, you know, work right now to help enable people to work
49:42 in a pandas-like way, but work with big data or, you know, get better performance
49:47 than pandas, because pandas is a Swiss Army knife, but isn't a chainsaw.
49:51 So if you were rebuilding pandas from scratch, it would end up a lot different.
49:55 There's areas of the project that are, you know, more bloated or have performance overhead
49:59 that's hard to get rid of.
50:01 And so that's why you have Ritchie Vink, who started the Polars project, which is kind of a reimagining
50:06 of pandas data frames, written in Rust and exposed in Python.
50:11 And Polars, of course, is built on Apache Arrow at its core.
50:15 So it's an Arrow-native data frame library in Rust, with, you know, all the benefits that
50:20 come with building Python extensions in Rust: you avoid the GIL and you can manage
50:25 the multithreading in a systems language, all that fun stuff.
50:28 Yeah.
50:29 When you're talking about Arrow and supporting different ways of using it and things being
50:32 built on it, certainly Polars came to mind for me.
50:35 You know, when you talk about Ibis, I think it's interesting that a lot of these data
50:39 frame libraries try to make their API pandas-like, but not identical, potentially,
50:46 you know, thinking of Dask and others.
50:49 But Ibis sort of has the ability to configure it and extend it and make it different, kind
50:54 of like, for example, Dask, which is one of the backends here.
50:57 But the API doesn't change, right?
50:59 It just talks to the different backends.
51:01 Yeah.
51:02 There's different schools of thought on this.
51:03 So there's another project called Modin, which is similar to Ibis in many ways, in the sense
51:07 of, like, transpilation and sort of dynamically supporting different backends, but it sought
51:12 to closely emulate the exact details of the API: the function names and the function
51:19 arguments must be exactly the same as pandas, with the goal of being a drop-in replacement
51:25 for people's pandas code.
51:26 And that's one approach, kind of the pandas emulation route.
51:29 And there's a library called Koalas for Spark, which is like a PySpark emulation layer for
51:35 the pandas API.
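The drop-in idea Modin aims for is roughly a one-line change: swap the import and keep the rest of your pandas code as-is (a sketch; the file and column names are made up):

    # Same API surface as pandas, different engine underneath.
    import modin.pandas as pd  # instead of: import pandas as pd

    df = pd.read_csv("big_file.csv")          # parallelized across cores by Modin
    summary = df.groupby("category").mean()   # same call, same arguments as pandas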
51:36 And then there's other projects, like Polars and Ibis and Dask DataFrame, that take design
51:41 cues from pandas in the sense of the general way in which the API works, but have
51:47 made meaningful departures in the interest of doing things better than pandas
51:52 did in certain parts of the API, making things simpler, and not being beholden to
51:57 decisions that were made in pandas, you know, 15 years ago.
51:59 Not to say there's anything bad about the pandas API, but like with any API, it's large,
52:04 very large, as evidenced by, you know, the 2,000 pages of documentation.
52:10 And so I understand the desire to make things simpler, but also to refine certain things
52:14 and make certain types of workloads easier to express.
52:18 And so Polars, for example, is very expression-based.
52:21 Everything is column expressions and is lazy, not eagerly computed, whereas
52:26 pandas is eager execution, just like NumPy is, which is how pandas became eagerly executed
52:33 in the first place.
52:34 And so I think the mantra with Polars was: we don't want the eager execution
52:40 by default that pandas provides; we want to build expressions so that we can
52:44 do query optimization and take inefficient code and, under the hood, rewrite it to be
52:50 more efficient, which is, you know, what you can do with a query optimizer.
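Here's a small sketch of that expression-based, lazy style in Polars (file and column names are invented; some method spellings have shifted between Polars versions):

    import polars as pl

    # scan_csv is lazy: nothing is read or computed yet, we just build a plan.
    lazy = (
        pl.scan_csv("events.csv")
        .filter(pl.col("amount") > 0)
        .group_by("user_id")
        .agg(pl.col("amount").sum().alias("total"))
    )

    # collect() runs the query optimizer, which can prune unused columns and
    # push the filter down into the scan before executing in parallel.
    df = lazy.collect()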
52:53 And ultimately, that matters a lot when you're executing code remotely or in
52:57 a big data system: you want the freedom to take a lazy analytic
53:03 expression and rewrite it. You might need to seriously rewrite the
53:08 expression in the case of, say, Dask, for example, because Dask has to do planning across
53:13 a distributed cluster.
53:14 And so, you know, Dask DataFrame is very pandas-like, but it also includes some explicit details
53:20 for being able to control how the data is partitioned, and it has some knobs to turn
53:25 in terms of having more control over what's happening on a distributed cluster.
53:29 And I think the goal there is to give the developer more control, as opposed to
53:32 trying to be intelligent and, you know, making all of the decisions on behalf of the developer.
53:37 So, you know, if you know a lot about your data set, then you can
53:41 make, you know, decisions about how to schedule and execute it.
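For instance, Dask DataFrame exposes partitioning directly, so you can turn those knobs yourself. A sketch, with invented file names and partition counts:

    import dask.dataframe as dd

    # Each file (or chunk of a file) becomes a partition of the DataFrame.
    df = dd.read_csv("events-*.csv")

    # Explicit control: repartition if you know your data's size and skew.
    df = df.repartition(npartitions=16)

    # Operations build a task graph lazily; compute() executes it.
    result = df.groupby("user_id").amount.sum().compute()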
53:45 Of course, Dask is building query optimization to start making more
53:49 of those decisions on behalf of the user.
53:51 But you know, Dask has become very popular and impactful in making distributed computing
53:56 easier in Python.
53:57 So they've gotten, I think, a long way without turning into a database.
54:01 And I think Dask never aspired to be a database engine; a lot of distributed
54:05 computing is, you know, not database-like. It could be distributed array computing
54:09 or distributed model training, or just being able to easily run distributed Python functions
54:14 on a cluster and do distributed computing that way.
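That "just run Python functions on a cluster" use case looks roughly like this with dask.distributed (a sketch; process_file stands in for whatever work you're fanning out):

    from dask.distributed import Client

    def process_file(path):
        # Stand-in for any plain Python work you want to distribute.
        with open(path) as f:
            return len(f.read())

    client = Client()  # spins up a local cluster; point it at a real one in production

    # Fan the function out across workers, then gather the results.
    futures = client.map(process_file, ["a.txt", "b.txt", "c.txt"])
    results = client.gather(futures)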
54:16 It was amazing, like how many people were using PySpark in the early days, just for
54:21 the convenience of being able to run Python functions in parallel on a cluster.
54:25 Yeah, and that's pretty interesting.
54:27 Not exactly what it's designed for, right?
54:30 Dask, you know, you probably come across situations where you do a sequence of operations that are
54:35 kind of commutative in the end, in practice, but from a computational perspective, like,
54:39 how do I distribute this amongst different servers? Maybe one order matters a lot more
54:44 than the other for performance, you know?
54:47 Yeah, yeah.
54:48 Interesting.
54:49 All right.
54:50 One final thing.
SQLGlot.
54:52 Yeah.
54:53 So the SQLGlot project was started by Toby Mao.
54:54 He's a Netflix alum and a really talented developer who's created
55:00 this SQL query transpilation framework for Python, kind of an underlying core library.
55:08 And so the problem that's being solved there is that SQL, despite being a quote-unquote
55:12 standard, is not at all standardized across different database systems.
55:17 And so if you want to take your SQL queries written for one engine and use them someplace
55:21 else, without something like SQLGlot, you would have to manually rewrite them and make sure
55:25 you get the typecasting and coalescing rules correct.
55:30 And so SQLGlot understands the intricacies and the quirks of every database's
55:36 SQL dialect and knows how to correctly translate from one dialect to another.
55:41 And so Ibis now uses SQLGlot as its underlying engine for query transpilation and generating
55:47 SQL outputs.
55:48 So originally, Ibis had its own kind of bad version of SQLGlot, a query transpilation,
55:54 like SQL transpilation, layer that was, I think, powered by SQLAlchemy and a bunch
56:00 of custom code.
56:01 And so I think they've been able to delete a lot of code in Ibis by moving to SQLGlot.
56:06 And I know that, you know, SQLGlot is also being
56:11 used by people building new products that are Python-powered and things like that.
56:16 So Toby's company, Tobiko Data, is building a product called SQLMesh that's
56:23 powered by SQLGlot.
56:24 So it's a very cool project, and maybe a bit in the weeds, but if you've ever needed to convert
56:28 a SQL query from one dialect to another, SQLGlot is here to save the day.
56:32 I would say, you know, even simple things like how do you specify a parameter variable,
56:38 you know, for a parameterized query, right?
56:39 In Microsoft SQL Server, it's like @ and the parameter name, and in Oracle, I think it's a
56:45 colon and the name, or a question mark in some other drivers. Even those simple things.
56:49 It's a pain.
56:50 And without it, you end up with little Bobby tables, which is also not good.
56:53 So that's true.
56:54 That's true.
56:55 Nobody wants to talk to him.
56:57 Yeah, this is really cool.
56:58 SQLGlot, like polyglot, but for all the dialects of SQL.
57:02 Nice.
57:03 And you can do things like say, read DuckDB and write to Hive, or read DuckDB and then
57:09 write to Spark, or whatever.
57:11 It's pretty cool.
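In code, that round trip is about one line. A sketch with a made-up query; sqlglot.transpile(sql, read=..., write=...) returns a list of translated statements:

    import sqlglot

    # Translate a DuckDB-flavored query into Spark SQL.
    sql = "SELECT user_id, EPOCH_MS(created_at) AS created FROM events"
    print(sqlglot.transpile(sql, read="duckdb", write="spark")[0])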
57:12 All right, Wes, I think we're getting short on time, but you know, I know everybody appreciated
57:17 hearing from you and hearing what you're up to these days.
57:20 Anything you want to add before we wrap up?
57:22 I don't think so.
57:23 Yeah.
57:24 I enjoyed the conversation and yeah, there's a lot of stuff going on and still plenty of
57:30 things to get excited about.
57:32 So I think often people feel like all the exciting problems in the Python ecosystem
57:36 have been solved, but there's still a lot to do.
57:39 And yeah, we've made a lot of progress in the last 15-plus years, but in some ways it
57:45 feels like we're just getting started.
57:47 So we're just excited to see where things go next.
57:49 Yeah.
57:50 Every time I think, oh, all the problems are solved, then you discover all these new things
57:53 that are so creative and you're like, oh, well, that was a big problem.
57:55 I didn't even know it was a problem.
57:57 It's great.
57:58 All right.
57:59 Well, thanks for being here and taking the time and keep us updated on what you're up
58:02 to.
58:03 All right.
58:04 Thanks for joining us.
58:05 Bye-bye.
58:07 This has been another episode of Talk Python to Me.
58:08 Thank you to our sponsors.
58:10 Be sure to check out what they're offering.
58:12 It really helps support the show.
58:14 It's time to stop asking relational databases to do more than they were made for and simplify
58:19 complex data models with graphs.
58:22 Check out the sample FastAPI project and see what Neo4j, a native graph database, can do
58:28 for you.
58:29 You can find out more at talkpython.fm/neo4j.
58:33 Mailtrap, an email delivery platform that developers love.
58:37 Try for free at mailtrap.io.
58:39 Want to level up your Python?
58:42 We have one of the largest catalogs of Python video courses over at Talk Python.
58:46 Our content ranges from true beginners to deeply advanced topics like memory and async.
58:51 And best of all, there's not a subscription in sight.
58:54 Check it out for yourself at training.talkpython.fm.
58:57 Be sure to subscribe to the show.
58:58 Open your favorite podcast app and search for Python.
59:01 We should be right at the top.
59:03 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the Direct
59:08 RSS feed at /rss on talkpython.fm.
59:12 We're live streaming most of our recordings these days.
59:15 If you want to be part of the show and have your comments featured on the air, be sure
59:19 to subscribe to our YouTube channel at talkpython.fm/youtube.
59:24 This is your host, Michael Kennedy.
59:25 Thanks so much for listening.
59:26 I really appreciate it.
59:27 Now get out there and write some Python code.
59:29 [MUSIC PLAYING]