#471: Learning and teaching Pandas Transcript

Recorded on Sunday, Jul 7, 2024.

00:00 If you want to get better at something, oftentimes the path is pretty clear.

00:03 If you want to get better at swimming, you go to the pool and practice your strokes and put in time doing the laps.

00:09 Want to get better at mountain biking? Hit the trails and work on drills focusing on different aspects of riding.

00:14 You can do the same for programming.

00:16 Reuven Lerner is back on the podcast to talk about his book, Pandas Workout.

00:21 We dive into strategies for learning pandas in Python, as well as some of his workout exercises.

00:27 This is Talk Python to Me, episode 471, recorded July 7th, 2024.

00:33 Are you ready for your host? Here he is.

00:36 You're listening to Michael Kennedy on Talk Python to Me.

00:39 Live from Portland, Oregon, and this segment was made with Python.

00:43 Welcome to Talk Python to Me, a weekly podcast on Python.

00:50 This is your host, Michael Kennedy.

00:52 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython.

00:58 Both accounts over at fosstodon.org.

01:00 And keep up with the show and listen to over nine years of episodes at talkpython.fm.

01:05 If you want to be part of our live episodes, you can find the live streams over on YouTube.

01:10 Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified about upcoming shows.

01:16 This episode is brought to you by Sentry.

01:18 Don't let those errors go unnoticed.

01:20 Use Sentry like we do here at Talk Python.

01:22 Sign up at talkpython.fm/sentry.

01:25 And it's also brought to you by Scalable Path.

01:28 If you're a founder or engineering leader, you know how hard it is to find top-tier developers while keeping costs low.

01:35 Scalable Path is a software staffing company that helps you build remote dev teams that just fit.

01:40 Build your team at talkpython.fm/scalablepath.

01:45 Before we jump into the interview, I want to let you know that we still have some spots left in my Code in a Castle event.

01:51 If you're looking to learn some of the premier frameworks and techniques in Python,

01:55 and you'd like to have a bucket list type of experience while doing so, then check out talkpython.fm/Castle.

02:02 In October, I'll be running a six-day Python course for an intimate audience in a villa in Tuscany.

02:09 Half the time we'll be learning Python, and the other half will be exploring the best of what Italy has to offer.

02:14 Check out the course outline, the excursions, and all the details at talkpython.fm/Castle.

02:20 Or if you'd like to just shoot me an email, michael@talkpython.fm, or find me on the socials, and I'm happy to talk about it.

02:27 Hope to see you there.

02:28 Reuven, welcome back to Talk Python to Me.

02:31 How are you doing?

02:31 I'm doing great.

02:33 Great to be back here.

02:34 Nice to see you.

02:35 Yeah, it's great to see you as well.

02:36 I'm a little concerned, though.

02:38 There's some possibility that maybe my Facebook ads are going to get messed up.

02:42 We are talking about pandas and pandas internationally, and I heard that you're some kind of animal trafficker.

02:49 Do you want to start the show with that story?

02:51 It's out of control.

02:51 It is the craziest story.

02:53 So I occasionally advertise on Facebook, advertise my pandas and Python training.

02:59 And I guess it was like two years ago, I tried doing a little bit more advertising.

03:03 And basically, I didn't really pay much attention to it until about a year ago, when I tried to do some more advertising and it said, you are not allowed to advertise on any Meta properties.

03:15 That's really weird.

03:16 Like, what did I do?

03:17 I looked.

03:17 I could not find any indication of what I'd done wrong.

03:20 So it says, if you want to appeal, click here.

03:22 So I clicked here.

03:23 And within 30 minutes or so, I get email back saying, your appeal has been checked and denied.

03:28 You will never be allowed to advertise on any Meta property again.

03:31 And this was like, what have I done?

03:33 Like, what are you teaching?

03:35 I'm pretty innocent.

03:35 Some courses and some books.

03:37 Come on, man.

03:38 I figured also, I appealed.

03:41 Someone must have looked through this.

03:42 Anyway, in poking around online, someone said to me, oh, I was caught by the same thing.

03:47 It is illegal to sell rare and endangered animals.

03:51 And they believe that since you were selling Python and Pandas training, then you must have been trading in live and endangered animals.

03:57 And thus, you are banned for life.

03:59 Well, I said, you know what?

04:00 I thought, first of all, like, nuts.

04:02 And yet, a great story to tell to my machine learning classes.

04:05 So good on that front.

04:06 But I was like, this is easy to resolve, right?

04:08 I'll just contact someone at Facebook.

04:09 No one's available.

04:11 I've got to tap my connections, like people I know who work there.

04:14 And I finally get the answer back saying that for legal purposes, they delete all data that's more than six months old.

04:20 Something like that.

04:21 So they could not go back and check to see why I was banned.

04:25 And thus, this ban really would be for life.

04:27 So because I waited a while, because it had been like a year between the ban and me noticing.

04:31 Oh, my gosh.

04:33 So they couldn't do anything about it.

04:34 So I was like half laughing and half like, you got to be kidding me about this.

04:39 And I posted on my blog about this.

04:42 And you guys picked it up on Python Bytes.

04:45 It was picked up in a bunch of other places like Hacker News.

04:48 And about a month later, I checked and it was back.

04:51 No one told me anything.

04:53 No one said anything.

04:54 Just magically and quietly, my account was restored.

04:58 Running to the press never helps.

04:59 It never helps.

05:00 That's right.

05:01 That's right.

05:02 But it was so absurd.

05:03 And I think it was at least like four senior engineers at Facebook who tried to help me with this.

05:08 And they were all like, we tried.

05:09 There's nothing we can do.

05:10 Crazy.

05:11 It reminds me, honestly, it reminds me so much of my App Store review experience, getting the new Talk Python Training mobile apps into the iOS App Store and the Google app store.

05:24 They were tragically incompetent in their own special way, right?

05:30 Somewhat malicious, somewhat not malicious.

05:32 Just like, for example, Apple.

05:34 I think it was Apple.

05:35 It could have been.

05:35 I don't know.

05:36 Apple or Google.

05:37 It has to be one of them.

05:37 Said, you know, we've denied your application to publish this application because you're trying to impersonate an existing one.

05:44 I said, what app am I trying to impersonate?

05:47 Like, it's hundreds of hours of stuff that we've created.

05:50 Like, they're like, well, first you might be hijacked.

05:52 So, well, in the description, it says, if you want to learn Python, you can take our courses.

05:58 Well, there's already an app called LearnPython, and you're trying to impersonate.

06:02 I'm like, what?

06:03 Like, it said if you want to learn Python, but there's an app called LearnPython.

06:07 I'm like, I just, I don't even under, like, my mind is, how do you read that description and think this is a trademark?

06:13 So, if you come up with an app called, like, Eat at a Restaurant, then no food order delivery place.

06:20 Yeah, you're going to be acquired by Uber Eats straight away.

06:22 Exactly.

06:23 And so, we went, but we, you would think, okay, fair mistake.

06:26 Like, yeah, yeah, okay.

06:27 Something caught it.

06:27 Just like you said, a simple request and a human will look and it'll be fine.

06:31 No, they're like, well, look, obviously you're doing this.

06:33 I'm like, obviously.

06:34 Let's try another scenario.

06:35 What if I said I wanted to learn to play the guitar?

06:38 And there was an app called Learn Guitar, but I don't want to learn.

06:41 It's not, it's not, the title is not Learn Guitar.

06:43 It's just the act of learning a guitar.

06:45 How else would you explain it?

06:46 Like, oh, okay.

06:47 So, I guess we see, we understand now.

06:50 You can have that sentence.

06:52 And it's just, these things are really, and you're at the complete mercy of them.

06:56 It's really, it's both like comically funny, but it's also like painful because we spent

07:01 four months building that app.

07:02 They wouldn't accept for this stupid reason, you know?

07:04 And they're taking a fair amount of money off the top for whatever people are earning from

07:09 these apps.

07:09 It's not like this is a charity or something.

07:11 What I found amazing with Facebook was there was literally no way to contact a human being.

07:17 Like, I tried all sorts of searches and forums and on, on, and on.

07:22 And like, and, and I found a lot of other people who had been caught up in this sort

07:27 of ridiculous situation, but like, there's no form even to say, hey, I think you made a

07:33 mistake.

07:33 Why don't you have a human look at this?

07:35 Because that would cost them too much, you know, someone's salary.

07:38 Yeah, absolutely.

07:38 It's, anyway, I don't want to spend too much time on it, but boy, is that a crazy story?

07:42 I mean, we're going to be trading in some, we'll be trading in some pandas today.

07:46 And so I just, I'm going to bleep that part.

07:49 Every time the word pandas is said, we're bleeping it out on the YouTube

07:52 version.

07:52 Well, this could be entertaining then.

07:54 Wow.

07:55 Reuven was really testy.

07:57 Like all those bleep, bleep, bleep.

07:59 No, seriously though.

08:01 You know, let's, let's catch up.

08:04 We'll talk about your book.

08:05 What have you been up to since the last time we spoke?

08:08 I can't remember the last time I had you on.

08:09 It's been a couple of years, I think.

08:10 It's been, it's been some time.

08:11 Yeah.

08:12 So I'm continuing to do Python pandas training at companies.

08:17 Since the pandemic, much more of it is online, just because companies are now used to doing

08:22 stuff on Zoom, Teams, WebEx, rather than bringing me there in person.

08:26 That's fine.

08:27 So I travel to conferences rather than to clients.

08:29 And, you know, it sort of extended my flexibility on that.

08:31 And I've also been building up my online training stuff, both a whole lot of courses.

08:36 And I've got an online bootcamp that I run for Python and pandas twice a year.

08:39 The big thing, which is actually connected to what we're doing here also, is I have a new

08:43 newsletter called Bamboo Weekly, where I have pandas exercises.

08:47 It's sort of like the same style as the book, but every week I take a topic and I do it, I

08:51 do it based on current events.

08:52 So if there's something going on in the news, I try to find a data set, a real world data set

08:56 that has something to do with that.

08:58 And then we try to experiment with that.

09:00 Yeah, that sounds really fun to see like a real world example every week.

09:02 Yeah, yeah.

09:03 So people are like, how do you come up with that?

09:05 And I say, well, I listen to a lot of podcasts and I read a lot of newspapers and something

09:09 somewhere, like I read The Economist, for example, and they had this short, cute article about

09:15 the number of animals that go through Heathrow Airport every year.

09:18 I was like, wait, there's got to be a data set for that.

09:21 And sure enough, the Heathrow Airport Authority publishes a data set in CSV of how many animals

09:26 go through.

09:27 And it was like 200 horses and one and a half billion butterflies and everything in

09:32 between.

09:32 And so it was great fun to sort of play with that data and ask questions about it and give

09:39 people practice with something that's dirty and messy.

09:42 I don't mean the animals being dirty, right?

09:43 Like very messy.

09:44 Like you need to like really, you know, wrestle with it because that's the only way you're

09:47 going to improve.

09:47 So I'm having a lot of fun on all the training front.

09:50 I definitely see more and more use of, I'm sure this won't surprise you, more and more use

09:54 of Python in the data space as it just like catches fire there.

09:58 Yeah.

09:58 It's just going downhill, picking up speed, isn't it?

10:01 It's extraordinary.

10:02 Like I still remember asking people in my intro Python courses, so what are you here for?

10:07 They're like, yeah, my company's thinking of doing some stuff with data analysis and pandas

10:11 or at least as data analysis, I knew that even in NumPy before pandas got you out.

10:16 And me thinking, hmm, I should really learn this stuff because it sounds like it's going

10:19 to be popular.

10:19 And holy cow, it's like it's everywhere.

10:22 And the whole ecosystem is just growing, growing, growing.

10:25 It's like people are sort of seeing pandas as the like underlying infrastructure on which

10:29 they build their software tools and their companies.

10:32 Or even the thing that defines the API on which they can innovate, right?

10:36 Think Dask or a whole bunch of other things in that space, right?

10:40 That if you know pandas, chances are, if you're not too crazy, you can like do grid computing

10:45 by changing the import.

10:46 It's extraordinary.

10:47 Just extraordinary.

10:48 Yeah.

10:49 Yeah, it absolutely is.

10:50 And so back to your newsletter, pandas eat bamboo.

10:54 Is that right?

10:54 That's so-

10:55 I understand why you're banned from Facebook.

10:56 Okay.

10:56 All right.

10:57 It was actually my father's idea.

10:59 I was like, I need, like, a catchy name for something.

11:02 And so he was like, well, what about something with bamboo?

11:05 I was like, right, right.

11:06 It's like, you know, it's food for thought and food for pandas.

11:09 Yeah.

11:09 Yeah.

11:10 That's actually really clever.

11:11 I like it a lot.

11:12 I like it a lot.

11:12 So you've been on a journey.

11:14 You've been on a journey to write a book, right?

11:16 Yes.

11:17 I was convinced to do a second one.

11:19 Right.

11:20 So I did a book with Manning called Python Workout, which was exercises in Python.

11:24 And when that finished, they said, so what other topics do you think would be useful?

11:29 I said, well, I'm doing a lot of pandas and people definitely need a lot of exercise and

11:32 a lot of practice in that.

11:33 And so I got to work both collecting the exercises I do with my corporate training and also like

11:39 coming up with some new ones as I learn new things.

11:41 Because pandas is so, so, so huge that it's very easy to get lost in there and not even

11:46 know what are the important topics to learn.

11:49 And so thanks to like working with people all the time, I sort of see also where they

11:53 get stuck and where they have problems and where it's really confusing to them.

11:56 So yeah.

11:57 So 200 and some odd exercises later.

11:59 By the way, I'll just tell you why did it come out when it came out?

12:02 Because I'm terribly bad at deadlines.

12:04 I said to the Manning people, I really want to have it at my booth at PyCon.

12:07 They said, okay, big talker.

12:09 You want it mid-May, huh?

12:11 So you better have everything done.

12:12 And they sort of backtracked on the calendar and said, okay, so you better have the whole

12:15 thing done by December.

12:16 And then finally I had some like, you know, fire under me and got it done.

12:21 Awesome.

12:21 As opposed to, I should get more done.

12:22 Yeah.

12:23 The version I actually have in my Apple Books here I pulled up is the Manning Early Access

12:28 Preview version.

12:29 But it's out for real now.

12:31 Yeah.

12:31 Here.

12:31 I've even got the paper copy here.

12:33 Woohoo.

12:34 Look at that.

12:34 I know.

12:35 That's a pretty hefty book, honestly.

12:36 I still keep looking at it.

12:37 That's a proper book.

12:38 It's like, wow.

12:39 I know.

12:39 I know.

12:40 It feels like it's really quite some feeling there to see.

12:42 I guess I really finished it.

12:44 Congratulations.

12:44 So I wanted to talk to you about maybe just kind of following on with your Bamboo idea.

12:50 Like give us some examples.

12:52 Give us some problems that people are solving with pandas.

12:54 And I mean, we're not going to talk through the code super detailed, but you could say like this

12:59 aspect or this feature of pandas like .loc or whatever is how you access and solve these

13:03 problems, right?

13:04 Like, so just kind of exploring that space.

13:06 I had Wes McKinney on before five episodes ago, something like that.

13:09 And I was just like, how do you learn pandas?

13:11 It's like so big, you know?

13:13 So I've actually changed.

13:15 Well, let me first say I'm definitely one of those people who thinks you should learn Python

13:20 before pandas.

13:21 Like I definitely think that knowing the language well will serve you very, very well in all sorts

13:28 of weird, small ways necessarily.

13:29 But at the same time, when you learn pandas, you have to learn that some of the paradigms

13:33 you learned, like some of the idioms from Python are not appropriate.

13:37 So I was giving a class in like optimizing pandas, like a short class, we'll call it micro

13:42 class, like 90 minutes long, about a year or so ago.

13:44 And at the end, I was like, oh, and by the way, obviously just never do for loops.

13:48 And everyone's like, wait, wait, wait, what?

13:50 I said, what do you mean what?

13:51 And they were like, we were taught in our intro pandas class that you should do a for loop

13:56 to do anything across the data frame.

13:58 And I was like, okay, what?

13:59 How can this be?

14:00 And so people think that because it's in Python, you should do it the same way.

14:06 But there are all these different idioms, especially vectorization, that you need to internalize.

14:11 Otherwise, as I like to say to people, you should hope to be paid by the hour because like these

14:17 things are just going to take forever to run.

14:19 And so people don't necessarily understand sort of how to approach pandas stuff.

14:24 And then they don't understand sort of, let's see, I mean, I'll give like an outline and then

14:28 we can sort of go into it.

14:29 Certainly how to access things with .loc.

14:32 Certainly how multi-indexes work.

14:34 How to work with the different dtypes.

14:37 Because these are things that we don't think about in the standard Python world each day.

14:41 The closest analogy would maybe be that when you're working with a dictionary, you want

14:46 to think about what your keys are going to be.

14:47 But even then, it doesn't come close to it.

14:49 So like .loc, people are like, so fine.

14:52 So .loc is used for retrieving from a series or retrieving from a data frame.

14:57 But it's so much more than that because you have two parts to it, or potentially two parts.

15:01 You have the row selector and you have the column selector.

15:04 And each of those can be an individual name, a list of names, a Boolean series, or even a

15:11 lambda.

15:11 And you can mix and match those in so many different ways.

15:13 And once you see those options, like your head sort of explodes with, oh, wait, I never thought

15:19 that I could access my data like that.

15:21 And then you're like, wait, you can also assign your data like that.

15:23 And then they're just sort of astonished.
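As a rough illustration of the .loc forms described here, the following sketch uses a small made-up data frame (the column names and values are hypothetical, not from the book). It shows a single name, a list of names, a Boolean series, and a lambda as selectors, plus assignment through the same syntax.

```python
import pandas as pd

# Hypothetical data, made up for illustration
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Cairo", "Perth"],
        "temp_c": [4, 19, 28, 23],
        "rain_mm": [60, 1, 0, 15],
    },
    index=["a", "b", "c", "d"],
)

# Row selector only: a single label returns that row as a Series
row_b = df.loc["b"]

# List of row labels plus a single column name: returns a Series
temps = df.loc[["a", "c"], "temp_c"]

# Boolean series for the rows, list of names for the columns
warm = df.loc[df["temp_c"] > 20, ["city", "temp_c"]]

# A lambda (any callable) receives the data frame and returns a selector
dry = df.loc[lambda d: d["rain_mm"] < 10, "city"]

# The same selectors work on the left side of an assignment
df.loc[df["rain_mm"] == 0, "rain_mm"] = -1
```

The row and column selectors are independent, so any of the four forms can be mixed on either side of the comma.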

15:25 Yeah, it's really wild.

15:27 You know, I think pandas has its own idiomatic style that is different than what you would

15:32 call pythonic, right?

15:34 Like it's pandonic.

15:35 I don't know what the name is, but idiomatic pandas, right?

15:38 I like that.

15:38 Where there's things that are specific to pandas, like this vectorization stuff, right?

15:44 Instead of looping over, right?

15:46 You know, you think about the Python performance angles or the data science performance angles,

15:51 right?

15:51 A lot of the speed that we get out of tools like pandas and NumPy and Polars and others is,

15:56 as you take the data, you push it down into some native layer and you just leave it there.

16:01 And you kind of speak to the native layer from Python and you say, deep down in your insides,

16:06 you got this thing, multiply all 1 million of them by two or whatever, right?

16:10 But if you loop over it, you're like converting it out of C into Python objects, then you're

16:17 operating on it, then you're putting it back, just 2 million times.

16:20 All of a sudden, all those benefits are gone.

16:22 And so certainly learning those types of things.

16:25 I mean, the vectorization, I think most people get pretty soon, although it sounds like not

16:29 really everyone.

16:30 I was really like flabbergasted.

16:34 But there's way more than that, right?

16:35 There's a whole, that's just probably the most obvious thing.
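A minimal sketch of the contrast being described, assuming a series of made-up integers: the loop version pulls every value up into Python and back, while the vectorized version is a single expression whose loop runs in compiled code. Both produce the same result. Timing is left out on purpose, since it varies by machine.

```python
import numpy as np
import pandas as pd

# Made-up data: 100,000 integers with an explicit dtype
s = pd.Series(np.arange(100_000, dtype="int64"))

# Python-level loop: each element is taken out as an object,
# multiplied, and collected into a new Series
looped = pd.Series([value * 2 for value in s])

# Vectorized: one expression, with the loop running in native code
vectorized = s * 2
```

Wrapping each version in `time.perf_counter()` calls is an easy way to see the gap on your own machine.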

16:38 This portion of Talk Python to Me is brought to you by Sentry.

16:41 Code breaks.

16:42 It's a fact of life.

16:44 With Sentry, you can fix it faster.

16:46 As I've told you all before, we use Sentry on many of our apps and APIs here at Talk Python.

16:51 I recently used Sentry to help me track down one of the weirdest bugs I've run into in a long

16:57 time.

16:57 Here's what happened.

16:58 When signing up for our mailing list, it would crash under a non-common execution path, like

17:04 situations where someone was already subscribed or entered an invalid email address or something

17:09 like this.

17:10 The bizarre part was that our logging of that unusual condition itself was crashing.

17:16 How is it possible for our log to crash?

17:19 It's basically a glorified print statement.

17:22 Well, Sentry to the rescue.

17:24 I'm looking at the crash report right now, and I see way more information than you'd expect

17:28 to find in any log statement.

17:29 And because it's production, debuggers are out of the question.

17:33 I see the traceback of course, but also the browser version, client OS, server OS, server OS version,

17:40 whether it's production or QA, the email and name of the person signing up.

17:45 That's the person who actually experienced the crash.

17:47 Dictionaries of data on the call stack and so much more.

17:50 What was the problem?

17:51 I initialized the logger with the string "info" for the level rather than the enumeration .info,

17:58 which was an integer based enum.

18:00 So the logging statement would crash saying that I could not use less than or equal to between strings and ints.
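The kind of failure described here can be reproduced in a few lines. This is only a sketch under assumptions: `Level` and `TinyLogger` are hypothetical stand-ins, not the logging library used on the site, and the real logger's internals may differ.

```python
from enum import IntEnum


class Level(IntEnum):
    # Hypothetical integer-based levels standing in for the real enum
    DEBUG = 10
    INFO = 20
    WARNING = 30


class TinyLogger:
    """Illustrative stand-in, not the logging library from the story."""

    def __init__(self, min_level):
        self.min_level = min_level
        self.lines = []

    def log(self, level, message):
        # This comparison is where a string level fails
        if self.min_level <= level:
            self.lines.append(f"{level.name}: {message}")


# Configured correctly with the enum member
good = TinyLogger(Level.INFO)
good.log(Level.WARNING, "already subscribed")

# Configured with a string instead of Level.INFO
bad = TinyLogger("info")
try:
    bad.log(Level.WARNING, "already subscribed")
    error = None
except TypeError as exc:
    # Python refuses to order a str against an integer-based value
    error = str(exc)
```

Nothing fails at construction time, which is why the bug only surfaced on the rarely used path that actually logged something.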

18:07 Crazy town.

18:08 But with Sentry, I captured it, fixed it, and I even helped the user who experienced that crash.

18:14 Don't fly blind.

18:16 Fix code faster with Sentry.

18:18 Create your Sentry account now at talkpython.fm/sentry.

18:22 And if you sign up with the code TALKPYTHON, all capital, no spaces, it's good for two free months of Sentry's business plan,

18:30 which will give you up to 20 times as many monthly events as well as other features.

18:34 Accessing things in that way.

18:36 By the way, you asked before, like, how should people even approach learning pandas?

18:40 And so I've started thinking about it a little differently based on feedback from people that I would sort of walk through it.

18:44 Okay, here's a series.

18:46 Here's a data frame.

18:47 And after creating some, you know, fake ones with, you know, made up numbers or random numbers,

18:52 then we'll start reading in files.

18:53 And then we'll do this.

18:55 And then we'll get to visualization.

18:56 And a student of mine said, you know, you're sort of missing the big, like, people miss the big picture doing that.

19:02 And they want to get the excitement.

19:03 Why don't you start with read a CSV file, visualize it right there inside of Jupyter,

19:08 and then people will be so impressed and amazed.

19:11 And then you fill in the gaps.

19:12 And so I've started doing that a bit.

19:14 And I think that has not been a bad approach to catch their attention, give them a sense of what the possibilities are.

19:20 And like, okay, let's now walk through each of these little pieces and build up to what we saw that first day.

19:25 And that's been fun.

19:27 Yeah, I totally agree.

19:28 That's also something that I really strive for.

19:31 I don't always do whenever I'm doing presentations.

19:33 But, you know, just because someone has chosen to sit in a seat for a day-long course or an hour-long presentation doesn't mean that they couldn't use a little inspiration, right?

19:43 And if you're like, wow, you did those three lines.

19:45 And now we have this picture.

19:46 And I understand all that.

19:47 Like, tell me more.

19:48 Right?

19:48 That right there.

19:48 That's right.

19:50 Right.

19:50 Now you have their actual attention the whole time.

19:53 And they're enjoying it.

19:54 And it's, yeah, it's so often it's like, well, in order to show you the nice stuff, I got to give you every level of detail.

20:00 It's like, no, you don't.

20:01 You're going to make a leap.

20:03 Don't do it that way.

20:05 That's right.

20:06 That's right.

20:06 And you mentioned, like, three lines of code.

20:08 One of the amazing things about Pandas is how often you can write very, very little code.

20:12 But it's, like, getting to that code that takes a while, really taking advantage of it.

20:17 Yeah.

20:17 Or even knowing that it's possible, right?

20:19 Right.

20:20 Right, right, right.

20:21 People don't even know.

20:21 So a lot of times, like, I've been using it now for long enough that I sort of intuit, oh, there's got to be a method that does this.

20:29 Like, someone has encountered this problem before.

20:31 And so either there's a method or there's an option or there's an add-on, like, something somewhere that just makes it trivially easy.

20:38 And that's part of the exploration that I try to do both in the book and Bamboo Weekly and, like, in my training in general.

20:43 And also, truth be told, I'm constantly learning stuff, right?

20:45 Like, it's rare for me to teach and not discover some option or some method that I did not know about because it's just so incredibly vast.

20:52 Yeah.

20:52 Or Pandas 2 comes out or something like that.

20:54 Yes.

20:55 Yes, indeed.

20:56 I mean, I've been exploring, I mean, tomorrow, tomorrow I head off to Prague for EuroPython,

21:01 where I'm giving a talk on PyArrow in Pandas.

21:04 And so I've been looking into that a lot.

21:06 And, oh, boy, right?

21:07 I mean, I've been using it for, say, a year or so.

21:10 But it's amazing.

21:11 And yet there are all these subtle changes that are happening in Pandas as a result that you need to know about or you need to expect when you use it.

21:19 But it's another, like, tool in our toolbox that we can pull out to make Pandas more effective, more efficient, deal with larger data, and also interact with other and interoperate with other systems.

21:30 Yeah, for sure.

21:31 It's just like, as I said, the whole ecosystem is just exploding.

21:33 It's really quite something.

21:34 Yeah, it really is.

21:35 And I think the Pandas 2 stuff is going to make a pretty big difference, changing, like, the internals away from just tables of numbers, basically.

21:41 Okay.

21:42 Let's talk about, so the way I thought we could maybe explore this Pandas workout book is let's just pick some fun exercises that you put together and talk about them.

21:53 Give us a quick overview of what the workout aspect of this book means anyway.

21:57 Sure.

21:58 Sure.

21:59 So the idea is that you can't learn everything all at once or quickly.

22:05 That's sort of like working out physically.

22:07 It's a long haul.

22:09 It's a long haul.

22:10 And every day you get a little better, a little stronger, a little more flexible.

22:14 And so if you see, you know, your Pandas learning journey as it's going to take me several months, not going to take me a day.

22:22 Then if every day you do a little bit of practice, you learn something new in some new direction, at the end of that journey, you're going to be able to solve many, many more problems in better, more idiomatic and more efficient ways.

22:34 And you'll be able to put these pieces together in ways that you didn't even expect.

22:38 So that's the basic idea.

22:39 And so the book is divided into, I want to say, 12 chapters.

22:42 I know we sort of rejiggered it at some point where each chapter focuses on a different aspect of Pandas.

22:48 But it's really sort of the total experience of going through it.

22:53 And so we have 200 exercises, plus there's like a, for lack of a better term, a midterm and a final.

22:59 Like meatier projects that people can go through, which people asked for after the Python book.

23:04 And each exercise then has not only the main exercise where I pose a problem, I give an explanation, I give an answer, note the order.

23:11 The answer comes at the end, so you won't peek as easily.

23:14 We talked about putting it in the back of the book and like that just didn't work out so well.

23:18 So at least you have to wade through the explanation a little bit or turn the page.

23:21 And then after the-

23:22 You can buy the one book with the questions.

23:24 You know, there was talk about that too.

23:27 Oh, you're serious.

23:27 Okay.

23:27 You can buy the answers later.

23:29 There was definitely-

23:30 I was joking.

23:31 Ten times the price.

23:32 I didn't mean to give anyone ideas.

23:34 I'm joking.

23:34 No, no, it's fine.

23:36 And then like after the answer, we have three, what we call beyond the exercise, which are,

23:41 okay, now that you've kind of gotten the basics, let's push either on the same data set or even like sort of go farther.

23:50 And so it's like, that's why we say it's 200 exercises because it's 50 official exercises and another three for each one.

23:56 And those tend to be much harder.

23:58 And I don't give a full explanation, although I do give the solution online in a Jupyter notebook.

24:01 You can download the Jupyter notebooks for all these things and see how I solved it.

24:04 Nice.

24:05 And for many of them, there's a link to the Pandas Tutor.

24:08 I think, yeah, you've definitely spoken to Philip Guo in the past.

24:10 So he does not only Python Tutor, but Pandas Tutor.

24:13 He and his team.

24:14 It's amazing.

24:15 And so you can just click on a link and it will take you to usually a miniature version of the data set because it's too big for Pandas Tutor.

24:22 And then you can sort of see the visualization, how these things work.

24:26 Yeah.

24:26 I don't think I talked about Pandas Tutor.

24:28 Maybe I have talked to Philip about it, but I've certainly used it in some of my courses.

24:34 But it's worth bringing it up just to talk.

24:36 Like, see how that thing works?

24:37 It's something special, this thing.

24:39 It really gives you some deep understanding into like, okay, if I run, you know, this group by command, then here's how all the pieces like flow back together.

24:48 And so you're using this during your book?

24:50 So, yeah.

24:50 So, I mean, every exercise has a link in the solution.

24:54 You click on it and it brings you to the code that I use.

24:56 It's usually, again, a sort of miniature version of the code, miniature version of the data set that you can then see the visualization there in your browser.

25:04 Nice.

25:04 Yeah, because you've got to basically, I think, encode the data in the URL.

25:07 So it can't be too much.

25:09 Right.

25:10 Right.

25:11 So I would like basically take the data in the data frame, turn it into a Python dictionary, and then like set it up because it won't work with files for security reasons.

25:21 Yeah.

25:21 So assign that dictionary to a variable, or no, no, I'm sorry, I turned it into CSV in a string and then assigned that string to a variable and read it in.

25:28 And then I would see if it overflowed the Pandas Workout, no, Pandas Tutor limit.

25:35 And if it did, I sort of iterated until I got it small enough to fit in there and big enough to be useful and interesting.

25:40 That took a few iterations.
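
A minimal sketch of the CSV-in-a-string technique described above, assuming a tiny made-up sample; the column names and values are hypothetical and not taken from the book's data set.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature sample; column names are illustrative only.
csv_text = """trip_distance,total_amount
1.2,8.5
3.4,15.0
12.0,42.3
"""

# StringIO wraps the string so read_csv can treat it like a file,
# which avoids touching the file system at all.
df = pd.read_csv(StringIO(csv_text))

# Going the other way: shrink a DataFrame and turn it back into a CSV string.
small_csv = df.head(2).to_csv(index=False)
print(small_csv)
```

Shrinking with `head` (or `sample`) and re-exporting is one way to iterate toward a data set small enough to paste into a browser-based tool.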

25:42 That's cool.

25:42 No, but it's a super, super good resource.

25:44 For learning pandas, I also think for exploring, right?

25:48 Like you end up with some code that like, I'm not really sure what this does, or this is actually new to me.

25:52 Like a lot of things you might encounter in this book.

25:54 You're like, let me visualize that.

25:56 Right.

25:56 Because, yeah.

25:57 Right.

25:57 Right.

25:57 I mean, as I said, like I do a lot of training in pandas and very often people have already played with it a bit, used it, even used it for a year or two.

26:06 And because it's so large, like it's not unusual for someone to say, oh, I had no idea that this functionality existed.

26:13 Why have I been wasting my time doing X and Y and Z?

26:17 I'll give you one example that I've been using more and more.

26:20 So, you know, Matt Harrison's a big fan of the method chaining approach.

26:24 And at first I was like, yeah, yeah, yeah, Matt, whatever.

26:26 Like, yo, stop pushing that on everyone.

26:28 You and your fluent interfaces.

26:30 Keep them to yourself.

26:31 Right.

26:31 And then I was like, actually, this is a great way to build things up little by little, line by line.

26:38 And I can use this.

26:39 It's pedagogically very useful.

26:40 Because I say, okay, let's think about how we want to break down this problem.

26:43 We'll do this.

26:44 Then we'll do this.

26:44 Then we'll do this and you can see it sort of going line by line until voila, we have the analysis that we want.

26:50 And so I inserted that into a lot of places in the book.

26:53 Sort of like one of the last edits that I did was to go back and change it to be more method chaining.

26:58 And I use it now all the time in my training and in Bamboo Weekly.

27:02 And so I bow to Matt on that.

27:06 He was right.

27:07 And I was stubbornly resistant for no good reason.
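
A rough sketch of the line-by-line method chaining style discussed above. The data frame and column names are invented for illustration; only the chaining pattern itself comes from the conversation.

```python
import pandas as pd

# Hypothetical taxi-like data, made up for this sketch.
df = pd.DataFrame({
    "passenger_count": [1, 2, 1, 3, 2, 1],
    "trip_distance": [0.8, 2.5, 11.0, 4.2, 1.1, 15.3],
    "total_amount": [6.0, 12.5, 40.0, 18.0, 7.5, 55.0],
})

result = (
    df
    # Step 1: keep only rides longer than one mile.
    .loc[lambda d: d["trip_distance"] > 1]
    # Step 2: add a derived column.
    .assign(cost_per_mile=lambda d: d["total_amount"] / d["trip_distance"])
    # Step 3: group by passenger count and pick the new column.
    .groupby("passenger_count")["cost_per_mile"]
    # Step 4: average within each group.
    .mean()
    # Step 5: order from most to least expensive per mile.
    .sort_values(ascending=False)
)
print(result)
```

Each line is one step of the analysis, so you can comment out the trailing lines to inspect any intermediate result.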

27:10 Yeah.

27:10 I really like that style as well.

27:12 You see, it's officially night here where I am.

27:14 It's just everything switched dark.

27:16 I just said that yesterday when I was doing office hours and all of a sudden I was sharing my screen and it changed the color.

27:22 I'm glad I'm not the only one that happens to you.

27:25 So, yeah, I'm a big fan of the method chaining fluent interfaces.

27:29 I mean, I would love to even see like Python itself in the standard library adopt that more.

27:34 Right.

27:35 Like there's so many things you operate on that will change something, but then it will return.

27:39 It will not return anything.

27:41 It's like a void method as much as we have those.

27:43 Right.

27:43 It returns None effectively.

27:45 So you can't say, you know, dot sort, dot this, dot that.

27:48 You have to like multi-step it.

27:50 And I would just love to see more of it.

27:51 But let's talk.

27:52 Well, I'll just say there on that front.

27:55 So pandas does have the option to either get back a new data frame or to say inplace=True.

28:02 And then it does it locally, like it does it on that data structure and then returns none.

28:06 And people are consistently convinced that this is faster, more efficient, better.

28:11 And so I've been like trying to tell people, no, the pandas core developers keep saying, do not think that is true.

28:17 It is not true.

28:18 And we are getting rid of inplace=True at some point.

28:21 Stop using it so that you can do method chaining.

28:24 And so no small number of people, again, in my course, are like, oh, really?

28:28 I had no idea.

28:29 I feel like, you know, I'm spreading the gospel, as it were.

28:34 Throw that whole expression in some parentheses and dot yourself away.
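
To make the point about `inplace=True` and return values concrete, here is a small sketch with made-up data. It shows that an in-place call returns `None` (so nothing can be chained after it), while the default form returns a new object that can be chained inside parentheses.

```python
import pandas as pd

df = pd.DataFrame({"name": ["b", "a", "c"], "score": [2, 3, 1]})

# With inplace=True the method mutates df and returns None,
# so there is nothing left to chain onto.
returned = df.sort_values("score", inplace=True)
print(returned)

# Without inplace, every call returns a new object, so the whole
# expression can be wrapped in parentheses and chained.
top = (
    df
    .sort_values("score", ascending=False)
    .reset_index(drop=True)
    .head(2)
)
print(top)

# Plain Python lists behave like the in-place case: list.sort() returns None.
numbers = [3, 1, 2]
sort_result = numbers.sort()
print(sort_result, numbers)
```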

28:37 Let's go.

28:38 All right.

28:38 That's right.

28:39 Let's dot our way on over to exercise seven of the many.

28:44 And let's just talk about long, medium and short taxi rides.

28:48 Tell us like, we can only talk so much about the code, but let's talk a bit about it and get people.

28:54 Like I said, I want to expose people to like what are some of the problems and aspects of pandas that you can use to solve them.

29:00 Sure.

29:00 So one of my favorite data sets to work with is New York City taxi information.

29:06 It's like everyone can identify with it.

29:09 You understand it.

29:10 And so this exercise uses a very small subset of that.

29:13 Maybe we'll talk next about the pandemic taxis, which is a much larger one.

29:16 But this is just 100,000, no, 10,000, 10,000 taxi rides from like five years ago, six years ago.

29:21 And the question is, well, how can we divide up this data set, which tells us how long, how far, how much people paid, when they were picked up, all that information.

29:31 How can we find out like the distance that they went and categorize that?

29:37 And the reason that this can be useful is we so often have numeric data that we need to put into categories, right?

29:42 What are best sellers?

29:44 What are poor sellers?

29:45 Who are, you know, the most, you know, employee of the year, that sort of thing.

29:49 There's so many places where turning something into a category will be useful.

29:52 And so it's very tempting to think, okay, I'll like do some if statements or I'll do some for statements.

29:58 But actually, pandas provides us with pd.cut, which just does it for us.

30:04 And this is one of those examples of once you learn it, you're like, oh, wow, I get it.

30:09 I don't have to have, oh, yeah, I even have here.

30:11 Like you might think you would want to say, let's set all the categories, you know, to be medium.

30:16 And then where it's less than two miles, we'll call it short.

30:19 And where it's greater than 10 miles, we'll call it long.

30:21 But you can just say pd.cut, we're going to cut it at two.

30:24 We're going to cut it at 10.

30:25 Anything less than two is short.

30:27 Anything greater than 10 is long.

30:28 Anything in the middle is medium.

30:29 Done.

30:30 Oh, that's interesting.

30:31 And pd.cut gives you back a new series.

30:32 This portion of Talk Python to Me is brought to you by Scalable Path.

30:37 If you're a founder or engineering leader, you know how hard it is to find top tier developers while keeping costs low.

30:44 That's where Scalable Path comes in.

30:46 They're a software staffing company that helps you build remote dev teams that just fit.

30:51 If you're wondering what sets this staffing company apart, well, one big differentiator is their approach.

30:57 They're founded and run by developers.

30:59 Scalable Path understands that finding the right developer is not just about technical skills.

31:05 It's about personality, work ethic, and how well they mesh with your team.

31:10 Their software architects will take the time to understand your vision and needs and then develop technical challenges for the roles you're looking to hire.

31:18 And these technical tests are conducted live on video by senior software developers so there's no gaming the system.

31:25 And Scalable Path takes it one step further.

31:27 They evaluate each developer's soft skills like communication, attitude, and work style before presenting best suited candidates to you.

31:36 Scalable Path has built a network of over 35,000 remote developers.

31:42 No more endless searches or sleepless nights worrying about the right hire.

31:45 And here's a special offer for Talk Python listeners.

31:48 You'll get 20% off of your first month.

31:51 So are you ready to scale your dev team and your business?

31:54 Get started by visiting talkpython.fm/scalablepath.

31:59 The link is in your podcast player's show notes.

32:01 Thank you to Scalable Path for supporting the show.

32:05 So the way you're doing it is you're setting everything to medium and then you're defining short and you're defining long instead of defining the three categories.

32:13 So that's like the sort of you might think this would be a good way to do it.

32:17 And it works, right?

32:19 As I like to say in my courses, unfortunately, this works.

32:21 Right?

32:23 Like so you will get the right answer this way.

32:26 But if you then like go to the next page here, you'll see that you can just use pd.cut.

32:30 I see.

32:31 Yeah.

32:32 And then you just say, here are the bins.

32:34 Here are the labels.

32:36 Go.
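
A hedged sketch of the two approaches being compared: assigning "medium" everywhere and overwriting with `.loc`, versus handing bins and labels to `pd.cut`. The sample distances are invented, and they deliberately avoid the exact boundary values 2 and 10, because `pd.cut` treats the right edge of each bin as inclusive by default, which can differ from a strict `<` or `>` comparison.

```python
import pandas as pd

# Hypothetical distances in miles, chosen to avoid the exact bin edges.
df = pd.DataFrame({"trip_distance": [0.5, 1.9, 3.0, 5.0, 9.9, 10.1, 25.0]})

# Approach 1: start with "medium", then overwrite the short and long rides.
df["category_loc"] = "medium"
df.loc[df["trip_distance"] < 2, "category_loc"] = "short"
df.loc[df["trip_distance"] > 10, "category_loc"] = "long"

# Approach 2: describe the bin edges and labels, and let pd.cut do the work.
df["category_cut"] = pd.cut(
    df["trip_distance"],
    bins=[0, 2, 10, float("inf")],
    labels=["short", "medium", "long"],
)
print(df)
```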

32:36 Well, this goes back to that thing we talked about, like looping over stuff versus just going, this is what I want you to do.

32:43 Do it 100% deep down inside the best you can.

32:47 Don't bother me.

32:48 Just figure it out for me.

32:49 Right?

32:49 That's exactly right.

32:50 And this means that like a whole lot of people have worked on pd.cut and have made it work efficiently, way more efficiently than you or I could do in our code.

33:00 Presumably.

33:01 Right?

33:02 It's not going to be any worse.

33:03 And it'll probably be a lot better.

33:05 The other thing is you get back and you can see it sorted at the end, the output of the series you get here.

33:11 So you feed it a series and you get back a series.

33:14 And the series looks at first glance like it's a bunch of strings.

33:18 And so you're going to have short, short, medium, medium, long, long, long, but it's actually not strings.

33:21 It's actually a category, which is a pandas version of an enum.

33:25 So really it's very small because it's just integers being stored there.

33:29 And then those integers are associated with strings.

33:31 So that's an example of like where they have thought through it.

33:34 And they've basically said, yeah, we're going to make this more efficient than you would probably think to do on your own.
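
The category point can be checked directly. This sketch, using made-up values, looks at the dtype of what `pd.cut` returns and at the integer codes stored behind the labels.

```python
import pandas as pd

distances = pd.Series([0.5, 3.0, 12.0, 1.0, 7.5])
categories = pd.cut(
    distances,
    bins=[0, 2, 10, float("inf")],
    labels=["short", "medium", "long"],
)

# The result prints like strings but has the pandas "category" dtype.
print(categories.dtype)

# The distinct labels are stored once...
print(categories.cat.categories.tolist())

# ...and each row holds only a small integer code pointing at a label.
print(categories.cat.codes.tolist())
```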

33:38 Right.

33:39 Awesome.

33:39 Well, I would, you know, to be honest, I'd have been happy if I came up with that first solution.

33:44 Just using loc in there.

33:46 That's pretty cool.

33:46 But this, this cut is super nice.

33:48 Yeah.

33:48 Just define the boundaries as bins and then off it goes.

33:52 Right.

33:52 Right.

33:53 And there's an option there, include_lowest.

33:55 And so I actually didn't know about this for a while.

33:58 So I'd be like, okay, so, so think about the bins has to be from some, some, you know, some small number to some large number.

34:03 And the question is then, well, wait a second.

34:05 What about that leftmost bin?

34:07 What about that?

34:07 Like if it's up to and not including, it's like less than, but not less than or equal, then how do I do it?

34:12 So I would always be like, well, I'll take the min.

34:15 I'll do like the series dot min minus one.

34:18 And then it's guaranteed to be lower than that.

34:19 But no, it turns out the pandas developers thought about this long before I did.

34:23 And there's an option, a keyword argument you can pass, include_lowest=True.

34:26 Done.

34:27 Now it's less than or equal as opposed to just less than.
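
A small sketch of what `include_lowest=True` changes, with invented values. By default the left edge of the first bin is excluded, so a value sitting exactly on it gets no category; with `include_lowest=True` the first bin includes its left edge.

```python
import pandas as pd

distances = pd.Series([0.0, 1.0, 5.0, 20.0])
bins = [0, 2, 10, 20]
labels = ["short", "medium", "long"]

# Default: the first bin is (0, 2], so exactly 0.0 falls outside and becomes NaN.
default = pd.cut(distances, bins=bins, labels=labels)
print(default.tolist())

# include_lowest=True makes the first bin include its left edge.
with_lowest = pd.cut(distances, bins=bins, labels=labels, include_lowest=True)
print(with_lowest.tolist())
```

This removes the need for the `series.min() - 1` workaround mentioned above.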

34:29 Right.

34:30 Because all these kind of things, the boundary conditions are always tricky, especially on floating points, right?

34:35 Oh, yes.

34:36 Oh, yes.

34:37 There's probably at least two spacecrafts that have crashed because of this.

34:41 All right.

34:41 The next one we want to go to is number 12, Finding Outliers.

34:47 Yeah.

34:47 So, you know, this is not a book about statistics and I'm not an expert in statistics, but there are a whole lot of statistical ideas that permeate working with pandas.

34:58 And so it's like, you know, mean, standard deviation, median, understanding sort of how they can differ from one another.

35:05 Right.

35:05 And especially the whole mean versus median thing where it's so easy to have a few outliers pull your data up or pull your data down and then sort of fool you.

35:14 Right.

35:15 And so often people want to find the opposite.

35:18 And I've certainly found this, especially with corporate training with like cybersecurity people where they're like, we are always looking for outliers.

35:24 Like, you know, who was the user who was logging in at unusual times?

35:27 Right.

35:27 Who logged in from 20 IP addresses within one minute?

35:31 Right.

35:31 That's right.

35:33 That's not a good person.

35:34 That's not a good person.

35:34 They're not good.

35:35 I'm going to tell you that right now.

35:37 They just try to be very, very effective.

35:39 And so and so.

35:43 So this finding outliers is, OK, let's find out then not who was sort of normal, but like who was abnormal.

35:50 Right.

35:50 Who who is exhibiting, you know, unusual behavior.

35:54 And so here it's like, OK, let's take our set of numbers and let's find out who was more than one standard deviation.

36:01 I can't remember if it's one or two above the mean or below the mean.

36:04 And let's just find those values and ignore the sort of normal values.

36:08 And then we could talk a little about, you know, IQR, the interquartile range, which is a standard statistical idea that we don't talk about enough, even though it's super useful.

36:17 Right.

36:17 People came up with these ideas a long time ago.

36:19 Oh, here's a fun fact.

36:21 John Tukey, who came up with a lot of these ideas, he also invented the word bit in computers.

36:26 Oh, wow.

36:26 OK.

36:27 So, yeah, I was like, wow, what a guy.

36:29 So you can see here then that like I'm going to take the, you know, trip distance.

36:34 Let's find, you know, trip distance.

36:36 Let's find where the trip distance is less than the trip distance at the first quartile minus one and a half times the IQR.

36:45 Right.

36:45 Meaning let's take the distance between the first quartile and the third quartile.

36:50 That's our interquartile range.

36:51 That gives us a sense of like where is the bulk, where are the bulk of the numbers?

36:55 And let's find out who is below that or who is above that.

36:58 And so it's not surprising if we're looking for taxi distances, you're not going to have a lot of outliers that are very low because you can't go below zero miles in your taxi.

37:08 But you can go very high.

37:10 You can go very large.

37:11 And so looking for those trips that are greater than the 75th percentile plus one and a half times the IQR, we actually find, you know, I see here in front of me, you know, about 1800 taxi rides out of the 10,000.

37:24 So about 19 percent of the taxi rides are much, much longer than the mean ride.
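
The IQR rule described above can be sketched as follows. The values are made up (they are not the 10,000-ride sample from the book), and the 1.5 multiplier is the one mentioned in the conversation.

```python
import pandas as pd

# Hypothetical trip distances with one obviously long ride.
s = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 30.0], name="trip_distance")

q1 = s.quantile(0.25)   # first quartile (25th percentile)
q3 = s.quantile(0.75)   # third quartile (75th percentile)
iqr = q3 - q1           # interquartile range

# Low outliers: below Q1 - 1.5 * IQR. High outliers: above Q3 + 1.5 * IQR.
low_outliers = s[s < q1 - 1.5 * iqr]
high_outliers = s[s > q3 + 1.5 * iqr]

print(len(low_outliers), len(high_outliers))
```

As in the taxi data, distances cannot go below zero, so the low side is empty here and the outliers all sit on the high side.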

37:28 Interesting.

37:29 And you can use this and you can imagine people in like the New York City Taxi Limousine Commission saying, oh, we can use this to plan, to charge, to, you know, send taxis to different places.

37:39 Or have a special program for long distance stuff or whatever.

37:42 Yeah.

37:43 Right.

37:43 Yeah.

37:44 Right.

37:44 Or if you're Uber, you know where to play.

37:45 They actually used to have the longitude and latitude of where people were picked up and dropped off.

37:50 And they got rid of that.

37:51 And I'm sure both for privacy reasons and because Uber and Lyft can look at that data and say, oh, well, we know now when to send cars where.

37:58 Although while true, I imagine that that data has been downloaded and archived with geolocation for certain people.

38:07 You can still get all that old data.

38:10 It's just the newer stuff that they don't do.

38:11 And they still do it by neighborhood.

38:12 I see.

38:13 I also wonder, like, how precise that longitude and latitude was.

38:16 Like, you could probably identify which home, like, was going to which other home.

38:20 Yeah.

38:21 Now you're getting into a problem.

38:22 How bad that would look.

38:23 Yeah.

38:23 Now you're getting into a shady spot or certain types of establishments, you know, that could be fraught with consequences for the people whose home you've already identified where they were picked up at or all sorts of stuff.

38:34 Exactly.

38:35 All right.

38:35 Fair.

38:35 Exactly.

38:36 Fair.

38:36 That's way more important than whether Uber can lease their vehicles strategically, you know?

38:41 Right.

38:41 Right.

38:42 And they might have some data on their own, too, you know?

38:44 Nice.

38:44 Okay.

38:45 So you can use this IQR, interquartile range, feature to pull that out here, right?

38:51 Right.

38:51 Right.

38:52 And, you know, we can pull that out and then do simple multiplication.

38:55 Right at the end of the day, what we're doing is we're pulling out this data and then just doing very, very simple statistical analysis of it.

39:02 Just to sort of say, how many outliers do we have?

39:05 How many low outliers?

39:05 How many high outliers?

39:06 And you can see this, like, at the end of the day, it's not that complicated in terms of math.

39:11 Sometimes people are like, well, how much math do I have to know to learn this stuff?

39:14 I'm like, I can get through it.

39:15 I promise you.

39:16 Not that much.

39:17 It's not that hard.

39:18 A few basic ideas and you're basically set.

39:20 Yeah.

39:21 Very cool.

39:21 All right.

39:21 On to the next one, which is pandemic taxis.

39:25 That's a different kind of taxi.

39:26 So this was actually, so the two studies we looked at so far were with my tiny little 10,000 taxi ride sample from a few years ago.

39:36 I then took taxi rides from January and July of 2019 and 2020.

39:42 So four months there.

39:43 And the question was, I think I actually only looked at July in this exercise.

39:48 Yeah.

39:48 July 2019, July 2020, comparing taxi rides.

39:51 Now, just to remind you, July 2020, not a great time for tourism in New York City or anywhere else.

39:58 And so the question was, what differences do we see between July 2019 and July 2020?

40:04 How much did it go down in terms of taxi use?

40:09 How much less did people pay?

40:10 And then my favorite part of this was, did people use cash more in 2020 or less in 2020?

40:18 And so you see, first of all, it's like a decline of something like 80% in terms of taxi rides from 2019 to 2020.

40:24 Again, not a huge surprise to anyone who lived through the pandemic and like saw what was going on there.

40:29 And so my gut feeling was, well, no one wanted to touch anyone else or touch anything that anyone else had touched.

40:35 So clearly people would have used credit cards much more.

40:38 And no, it turns out.

40:39 There's that screen, Reuven.

40:41 You got to touch the OK.

40:42 That drove me crazy during the pandemic.

40:45 It's like, I'm going to just do touchless.

40:46 And then you like sign it.

40:48 Are you OK?

40:48 You're like, mm-hmm.

40:50 I never thought of that.

40:52 I guess in Israel, like you just, you know, do the tapping of your card.

40:57 You didn't have to sign anything.

40:58 But yeah, the US, they're like, oh, hit OK to confirm.

41:01 It's like, OK, I thought we had just escaped me touching this disease-ridden thing.

41:05 But no.

41:06 Also, you have to like indicate how much of a tip you want to leave.

41:09 But that's a whole separate thing.

41:10 We're at the, I don't know, the car wash.

41:13 Don't you want to tip the car?

41:14 No, I don't want to tip the car wash.

41:16 That's the other thing is you should have looked at tips.

41:18 Like that could have been interesting, actually, as well.

41:21 I think, oh, I did.

41:22 I think, no, I think I did.

41:23 I definitely looked at tips, but I can't remember what it was like.

41:26 Around the pandemic, yeah.

41:27 Because I know a lot of people gave tips to like kind of say thanks.

41:29 And I was wondering if that would show up, you know, like, thanks for being out here amongst the diseased ones, you know.

41:34 So it turns out that people used cash more during the pandemic than credit.

41:39 And I like raised this with my family.

41:41 I was like, what the heck?

41:42 I think it was my sister who said, well, everyone who would use, like the people who are sort of higher income earners who use their credit cards more,

41:50 they were staying home.

41:51 It's the people who like were forced to go to work who use cash more, who were taking the taxis because they didn't want to take the subway.

41:57 Like there was like sort of an in-between sort of thing.

41:59 OK, interesting.

42:00 You had all these like wild pieces of data and analysis that you can do.

42:04 And here it was not just like pulling all this data.

42:08 But one of the big things was also I want to read in a lot of data.

42:12 I want to read in from two different CSV files.

42:14 Now what do I do?

42:15 Right?

42:15 Because it's very obvious.

42:17 You read in a CSV file, you get a data frame.

42:18 You read another CSV file, you get a data frame.

42:20 But I want to treat those as one.

42:22 And I want to then be able to distinguish between them.

42:24 How do I do that?

42:25 And so sure enough, Pandas has a concat method.

42:28 And I use concat now all the time.

42:30 I read in multiple CSV files into multiple data frames.

42:35 I concat them together.

42:36 And you can concat them either horizontally or vertically, depending on what you want to do.

42:40 And this comes back to where I said-

42:42 If their columns are the same, then you might do it vertically.

42:44 But if you want to augment it potentially, it's like, oh, now we have the sale percentage data,

42:49 but it goes along with the columns exactly or the rows exactly.

42:52 Precisely.

42:53 Okay.

42:53 Precisely.

42:54 And so another nice way to do this also is not just read this one and read that one,

43:00 but you can use a list comprehension with something like glob.

43:03 Right?

43:04 So glob.glob on *.csv.

43:06 Get back a list of data frames and then just hand that to pd.concat.

43:10 And so that's where knowing Python and being able to pull that out and use those techniques

can really, really come in handy.
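
A self-contained sketch of the glob-plus-concat idea. The file names and contents are hypothetical, created in a temporary directory just so the example runs; the extra `source` column is one possible way to tell the rows apart afterwards, not necessarily how the book does it.

```python
import glob
import os
import tempfile

import pandas as pd

# Create two tiny hypothetical CSV files so the sketch can run anywhere.
tmpdir = tempfile.mkdtemp()
pd.DataFrame({"total_amount": [10.0, 12.5]}).to_csv(
    os.path.join(tmpdir, "rides_2019.csv"), index=False
)
pd.DataFrame({"total_amount": [8.0]}).to_csv(
    os.path.join(tmpdir, "rides_2020.csv"), index=False
)

# glob.glob returns the matching file names; sorted() makes the order predictable.
filenames = sorted(glob.glob(os.path.join(tmpdir, "*.csv")))

# List comprehension: one data frame per file, each tagged with its origin.
frames = [
    pd.read_csv(filename).assign(source=os.path.basename(filename))
    for filename in filenames
]

# Stack the frames vertically into one data frame with a fresh index.
df = pd.concat(frames, ignore_index=True)
print(df)
```

Passing `axis="columns"` to `pd.concat` would join the frames side by side instead of stacking them.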

43:16 Yeah.

43:17 Earlier you mentioned, I want to go back to it real quick.

43:19 You mentioned that learn Python first and then Pandas.

43:22 A lot of times when I think learn Python, I think, okay, well, learn the language and then

43:27 learn the standard library to a good degree.

43:30 And then begin to chop away at the half million things on PyPI that are interesting.

43:35 You know, and it's like a never ending sort of thing.

43:37 And then there's, of course, this joke, like I learned Python.

43:40 It was a good weekend or something like that.

43:42 Right?

43:42 Like, how do you, how do you square these two things?

43:44 But I feel like the amount of Python that you need to learn is mostly centered around the

43:49 language and really is actually not that much.

43:51 And then you kind of learn the pandas way.

43:54 Right.

43:54 Theoretically, if you wanted to.

43:56 Right.

43:56 Right.

43:56 For sure.

43:57 Look, I am a big fan of objects and classes and all that stuff.

44:01 But when I'm talking to people who specifically want to use Python for data analytics with pandas,

44:08 I say to them, objects will help you, but they're not going to be crucial.

44:12 Like if there's a part of the course that you want to sort of drop, save some money, save some time, then that's a place where we can save.

44:18 Because the odds of an analyst using pandas writing their own classes are pretty slim.

44:23 I think it'll give them some perspective on how these classes work.

44:26 Sure.

44:27 But I don't think that like they need to learn that.

44:29 And like a lot of the standard library there, it's hard to say.

44:32 Right.

44:33 So as I said, I love glob.

44:34 Right.

44:35 Globbing is fantastic.

44:36 But that's definitely not in like my intro class.

44:40 I'm going to say, oh, by the way.

44:41 Yeah.

44:41 I would bet there's probably 10 modules that if you knew, you might not need to learn more for six months.

44:47 You know, doing Jupyter pandas type of work.

44:50 Right.

44:50 Like pathlib and a couple of things like that.

44:53 Right.

44:53 That's right.

44:53 That's right.

44:54 I mean, I just made up that number.

44:55 Yeah.

44:56 It's mostly like just being able to sort of work with the core data structures, understanding the syntax and how it works.

45:02 And even like defining some simple functions.

45:04 Right.

45:05 I think most people using pandas are not at the end going to define functions.

45:09 Although as I've gone on with my use of pandas, I see, you know, lambda, this is where it's at.

45:15 Like knowing how to use it really, really helps.

45:17 So there I do sort of go back with the pandas people.

45:19 I'm like, okay, this is going to be super weird.

45:21 We're going to talk about anonymous functions now.

45:24 And then, and then like, I typically do that with the more advanced pandas groups, not with the, the like introductory ones.

45:30 But if you can sort of wrap your mind around that, then it does help quite a bit.

45:34 It does.

45:34 So it takes what would be multiple step things and it lets you turn it into one of those chained expressions.

45:40 Because instead of defining a function somewhere else, you can just put it in line as part of a lambda and just keep on going, you know?

45:45 Great.

45:46 That's right.

45:46 And the fact that, so, so it's especially useful to find.

45:49 So we mentioned .loc earlier.

45:51 So .loc allows you to choose rows.

45:53 Let's start with that.

45:54 And so I can pass it a lambda and that lambda then gets the data frame you're working on as an argument.

46:02 And then whatever returns, if it returns a Boolean series, then it allows you to filter.

46:07 Well, you can then have multiple .loc calls and multiple lambdas in a row to do successive filtering.

46:13 And yes, that is less efficient than doing it all in one fell swoop.

46:16 But boy, oh boy, it's easier to think about and it's easier to use.

46:20 So you just like whittle it down, whittle it down, whittle it down each line with its own lambda.

46:24 And so understanding how to do that, understanding why it's important that the lambda, inside the lambda, you're using the temporary parameter from the lambda as opposed to the overall variable for your data frame.

46:36 Because you're chaining it, that's useful as well.
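
A sketch of successive filtering with `.loc` and lambdas, using invented data. Each lambda receives the intermediate data frame produced by the previous step, which is why it uses its own parameter (`d` here) rather than the outer `df` variable.

```python
import pandas as pd

# Hypothetical rides, made up for this sketch.
df = pd.DataFrame({
    "passenger_count": [1, 2, 1, 4, 1],
    "trip_distance": [0.5, 3.0, 12.0, 8.0, 6.0],
    "total_amount": [5.0, 14.0, 45.0, 30.0, 9.0],
})

result = (
    df
    # d is the full data frame here.
    .loc[lambda d: d["trip_distance"] > 2]
    # d is now only the rows that survived the previous filter.
    .loc[lambda d: d["passenger_count"] == 1]
    # Whittle it down once more.
    .loc[lambda d: d["total_amount"] > 20]
)
print(result)
```

Commenting out the last `.loc` line shows the intermediate result after the first two filters.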

46:38 Well, 100%.

46:39 And then you can also like comment out certain lines at the end and see what the intermediate values are.

46:44 And so you say it's less efficient.

46:45 It's less computationally efficient.

46:48 It might be more efficient as a human being trying to understand what the heck is happening, right?

46:51 That is spot on.

46:53 I often tell people that I think Python is a language for an age in which computers are cheap and people are expensive.

47:01 Yes.

47:02 Because like, right, our efficiency is the big bottleneck in terms of time, in terms of money.

47:08 So, right, if it takes my computer a few seconds more, who the heck cares?

47:12 Right.

47:12 My M2 Pro doesn't care if it's like a tiny bit more.

47:16 I don't care.

47:16 That's right.

47:17 That's right.

47:18 Or my cluster of GPUs, take your pick.

47:21 Although at that point, it starts to cost real money to burn those things, you know?

47:24 Right.

47:24 But still, I mean, depending on that.

47:26 Look, if you see that your computation is taking a long time, okay.

47:29 So then you sort of find that and you improve that.

47:32 Yeah.

47:32 That's a really good point.

47:33 Like a lot of things are, don't worry about that until it's actually going to become a problem.

47:36 It looks slow.

47:37 It turns out it's probably a blink of an eye.

47:39 All right.

47:39 Let's keep moving on.

47:41 We're getting short on time and we have a plethora of things to work with.

47:45 I kind of want to go to wine words.

47:48 Wine words.

47:49 Wine words.

47:50 All right.

47:50 Let me remember what exercise that is.

47:52 That is 37.

47:53 All right.

47:54 Yeah.

47:54 So people think of pandas, not wrongly, as being great at working with numbers.

48:00 But it turns out that it's fantastic at working with two other kinds of data types.

48:05 One is strings and one is dates and time.

48:08 And you can get a ton, a ton out of analysis working with these if you know how to work with it.

48:14 So there is, I forget where it is.

48:17 There's a machine learning archive of data sets.

48:23 And one of them is 150,000 wine reviews from Wine Magazine.

48:28 Oh, wow.

48:29 And so I said, okay.

48:30 Is it Kaggle maybe or is it somewhere else?

48:31 I think Kaggle has a version of it, but I think it's elsewhere.

48:35 Wine Mag.

48:36 150,000 reviews.

48:37 Okay.

48:38 Beautiful.

48:39 I don't know.

48:39 I'll find out where it is.

48:40 Anyway.

48:41 No worries.

48:41 So I said, okay, let's find out.

48:44 Right?

48:45 Because you drink a bottle of wine and you read the back and you sort of like, you know,

48:49 roll your eyes at what they've written.

48:50 Although, although I'll put in a plug.

48:52 I read this fantastic book a few years ago called Cork Dork by this journalist who decided

48:57 to become a sommelier.

48:58 And she took the exam and her journey toward there.

49:00 She was like, and she convinced me these words actually have real meaning and people are very

49:04 serious about it.

49:05 So I will not roll my eyes quite as much anymore.

49:07 Getting hints of nutmeg, but they're troubling.

49:10 Yes.

49:11 They don't belong.

49:13 So the question was, okay, what about these reviews of wine?

49:17 What words are people using?

49:19 And are they using certain words more with California wines and certain words more with French wines

49:24 or certain words more with red or with white or rosé?

49:27 And so we can then take this text, break it apart and search for it.

49:33 We can search for it using plain old pandas stuff.

49:35 We can search for it using regular expressions, which it has built in and works very well.

49:39 So here, so let's see.

49:42 So analysis on the words.

49:43 Here, so what are the 10 most common words for red wine, right?

49:47 So how do we do that?

49:48 Well, we have to take the description and break it apart.

49:51 And if you're used to using just plain old Python, you're like, oh, well, I guess I'll

49:55 break that into a list.

49:56 But now what?

49:57 Now I have a series of lists.

49:58 Now what do I do?

50:00 And so one of the key methods to know here is something called explode.

50:03 And explode is let's take a series of lists and turn that into a very, very, very long series.

50:10 And so basically each element in the list becomes an element in the series and they all share

50:15 an index.

50:16 So you know where they're originally from.

50:18 And then, you know, the world's your oyster.

50:20 So you can get rid of punctuation, right?

50:23 So I have a S dot stir lower.

50:25 Okay.

50:25 So we lowercase everything.

50:26 Then dot stir split.

50:27 And we split into a list.

50:28 Then we say explode, get into a long series.

50:30 Then we once again run strip, to get rid of punctuation.

50:33 And now we can use isin, which is yet another fantastic pandas method to say, are these words

50:40 in, like find the lines where these words are located.

50:43 And then we can just do a value_counts.

50:45 Definitely my favorite method.

50:46 How often does this thing show up?

50:48 And then we use head.

50:49 Ta-da.

50:50 We've got the 10 most common words there.

50:53 And when you break it down like that.

50:55 value_counts sorts probably most common to least common, right?

50:59 Yeah, yeah, exactly.

51:00 So value_counts, not only does it count how often something shows up, but exactly.

51:04 It sorts it from most common to least common.

51:06 So then head is just head.

51:07 Yeah.

51:08 However much you take there.

51:10 And then that's the ones.

51:11 Those are the popular ones.

51:12 That's right.

51:12 That's right.
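
The chain of methods described above (lowercase, split, explode, strip, isin, value_counts, head) could be sketched roughly as follows. This is a minimal illustration, not the book's actual solution: the function name `top_words`, the sample descriptions, and the word list are all made up, and keeping only the words found in a list is just one plausible reading of the isin step.

```python
import pandas as pd

def top_words(s, words_of_interest, n=10):
    """Return the n most frequent words in s that appear in words_of_interest."""
    words = (
        s.str.lower()                # lowercase everything
         .str.split()                # each description becomes a list of words
         .explode()                  # one word per row; the original index repeats
         .str.strip(',.;:!?"()')     # trim punctuation from both ends of each word
    )
    # isin keeps only the rows whose word is in the list;
    # value_counts counts them and sorts from most to least common
    return words[words.isin(words_of_interest)].value_counts().head(n)

# Made-up descriptions and word list, for illustration only
descriptions = pd.Series([
    "Ripe cherry and plum, with soft tannins.",
    "Bright cherry fruit; firm tannins and oak.",
    "Crisp apple and citrus, no oak.",
])
wine_words = ["cherry", "plum", "tannins", "oak", "apple", "citrus"]

result = top_words(descriptions, wine_words)
print(result)
```

Because explode repeats the original index for every word, each word can still be traced back to the review it came from.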

51:13 So here I have like, you know, the page we're looking at.

51:15 So we find where the country is France.

51:17 So that's the row selector.

51:19 And we want the column selector to be description.

51:21 So we only want the description.

51:22 And then we have our function top 10 words, which did what I just described.

51:26 We're going to take these common wine words, lowercase them, get rid of punctuation, so

51:29 forth, pass it in.

51:30 And we get back our top words.

51:31 And we find out what words are used to, you know, are associated with French wines as opposed

51:36 to California wines and so on and so forth.
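
The row selector plus column selector mentioned here maps onto `.loc`. A small sketch, assuming hypothetical column names `country` and `description` and invented rows:

```python
import pandas as pd

# Invented rows; the column names are assumptions for illustration
wines = pd.DataFrame({
    "country": ["France", "US", "France"],
    "description": [
        "Elegant, mineral finish.",
        "Jammy and oaky.",
        "Mineral and crisp.",
    ],
})

# Row selector: a boolean Series. Column selector: a single column name.
french = wines.loc[wines["country"] == "France", "description"]

# The same lowercase / split / explode / strip / count chain, applied to that slice
french_words = (
    french.str.lower()
          .str.split()
          .explode()
          .str.strip(",.")
          .value_counts()
)
print(french_words.head(10))
```

Swapping the comparison value gives the word counts for a different country, which is what makes the side-by-side comparison cheap.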

51:38 Excellent.

51:39 Yeah.

51:39 Very, very cool example.

51:40 All right.

51:41 I think we've got time for two more to go through.

51:44 You want to pick two that are popular or you want to leave?

51:47 Let's see.

51:48 Let's see.

51:49 Let's do, maybe 32 multi-state.

51:54 Actually.

51:54 Hmm.

51:55 Yeah.

51:55 Multi.

51:56 Oh, no, let's do best tippers.

51:58 Best tippers.

51:58 Let's do best tippers.

51:59 Yeah.

52:00 You mentioned tippers.

52:01 Oh, you're going to open up a whole, this is going to be a whole thing.

52:03 Best tippers.

52:04 Number 42.

52:05 Let's go.

52:06 No, no, no judgment here, folks.

52:08 All right.

52:08 So the question is like, we'll try to understand when people tip their taxi

52:13 drivers more generously.

52:14 All right.

52:14 So the question was, did they tip better?

52:17 And we looked at 2019 before the pandemic in January and July.

52:21 So do they tip more in the winter or do they tip more in the summer?

52:24 And so this involves several things.

52:26 First of all, it involved using dates and times.

52:29 Again, one of these things that pandas is just amazing at that people just are not aware of all

52:34 the flexibility you have there.

52:35 Another thing is how easy it is to create a new column, right?

52:39 So we spoke before a bit about how you can use broadcasting to just multiply

52:42 a column by something, or add one column to another column.

52:46 But you can create a new column just by assigning to it.

52:48 Again, it's sort of like assigning to a dictionary.

52:50 It just then is there.

52:51 And so you can calculate the percentage that people tip, put that in a new column and then

52:56 say, well, let's now group by the month and let's find out.

53:01 Take the mean or the max or some sort of deviation.

53:03 Exactly.

53:04 Exactly.

53:05 And we can find out whether people tip more on average in January or in July.
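
A rough sketch of the steps just described: assign a new column, then group by the month of a datetime column. The column names are assumed to follow the NYC taxi data layout, and the four rows are invented, so the numbers mean nothing.

```python
import pandas as pd

# Invented rides; column names are assumptions for illustration
rides = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2019-01-05 08:15", "2019-01-20 22:40",
        "2019-07-04 13:05", "2019-07-18 18:30",
    ]),
    "fare_amount": [10.0, 20.0, 10.0, 40.0],
    "tip_amount": [2.0, 0.0, 3.0, 8.0],
})

# Creating a column is just assignment, much like adding a key to a dict.
# The division and multiplication are broadcast across every row.
rides["tip_pct"] = rides["tip_amount"] / rides["fare_amount"] * 100

# The .dt accessor pulls the month out of each timestamp; group on it and average
by_month = rides.groupby(rides["tpep_pickup_datetime"].dt.month)["tip_pct"].mean()

# Share of rides with no tip at all: the mean of a boolean Series
no_tip_share = (rides["tip_amount"] == 0).mean()

print(by_month)
print(no_tip_share)
```

Replacing `.mean()` with `.max()` or `.std()` answers the other variations of the question with the same grouping.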

53:11 I honestly don't remember what the answer is, which will hopefully not tick off too many

53:18 readers.

53:19 Here we go.

53:19 Let's see.

53:20 Oh, here we go.

53:21 Go back one page there.

53:22 So 32% of taxi riders in New York don't tip at all.

53:25 That's right.

53:26 That was a surprising thing to me.

53:28 And then I don't remember exactly what it was in a different month.

53:31 And you're talking summer versus winter.

53:33 There's probably a tourism angle, locals versus tourists.

53:37 I mean, I know people go to New York when it's not summer, but not as much, I imagine.

53:41 That's right.

53:42 That's right.

53:42 There are all these different factors that, that like, you know, come into it.

53:47 Yeah.

53:47 One of the takeaways I'm feeling as we talk about all these is like,

53:50 there's a lot of interesting questions that can be asked and answered really quickly for

53:54 like sociology and urban planning and all kinds of interesting questions that don't feel like

54:00 programmer questions.

54:01 Yes.

54:02 So, so, you know, I, I, I mean, I talked to a lot of people, you probably do too, who are

54:06 like interested in advancing their careers with Python.

54:10 And they're like, well, you know, I don't have a computer science background.

54:13 Can I get a job as a programmer?

54:15 And their vision of a programmer is either someone working at a startup or at like one

54:20 of the big companies, Google, Amazon, Facebook, and so forth.

54:22 I say to them, look, there are an awful lot of people who have great jobs working at supermarket

54:28 chains and insurance companies in the backroom analyzing data.

54:33 And they are crucial, but we don't think of them as programmers and in governments and in

54:38 cities and weather forecasting places.

54:39 Like everyone nowadays is using data and collecting and analyzing it and having these skills either

54:45 gives you abilities to do your job better or gives you the ability to move into a new job

54:51 that you couldn't have done before, one that these places are desperately looking to fill.

54:55 Yeah, I can't.

54:55 Yeah, I totally agree.

54:56 I can't remember which episode it was.

54:58 I didn't name it right.

54:59 So there's an episode I did quite a while ago about R programmers and R data scientists

55:04 and Python data scientists working together.

55:06 And I think it was the research arm of Kroger, maybe that had like 200 data scientists.

55:12 Like that's a wow, that's a proper group of data scientists.

55:16 I mean, a lot of times data scientists, I feel like there's a couple of them for a company

55:20 compared to a software team or something.

55:21 No, and I tell people also, it's worth learning to write and it's worth learning

55:27 to speak.

55:27 Not because you're going to be like, you know, a Pulitzer Prize winning writer, not because

55:31 you're going to be like, you know, whatever prize you would get for like speaking well,

55:34 just because having these skills makes you more effective at your job.

55:38 And people who can write a program to suck up some data, analyze it and come back with

55:42 a result, especially if it's like a public data set that has something to do with what

55:47 they're working on.

55:47 They are so much more valuable to their company than they would have been otherwise.

55:52 Oh yeah, absolutely.

55:53 So I found it because I have a search engine on Talk Python that searches the transcripts

55:58 and everything else.

55:59 If people want to know something about historical shows, when you hit search, it's not just like

56:03 the show notes.

56:03 So Scaling data science across Python and R, episode 236, with Ethan Swan and Bradley Boehmke,

56:12 and the company is 84.51°.

56:16 So anyway, that's...

56:17 Oh, I've seen that name before.

56:18 Yeah, yeah, yeah.

56:19 That's what it was.

56:19 So people can check that out.

56:20 If they're interested, but...

56:22 Very, very cool.

56:22 Yeah.

56:23 Super cool.

56:23 Okay.

56:24 Last one.

56:25 What are we doing?

56:25 Let's do cities.

56:27 Cities.

56:28 What number is it?

56:29 43.

56:30 Okay.

56:30 Oh, that's right.

56:31 So I found a few years ago this JSON file containing the thousand largest cities in the United States.

56:39 So first of all, like good to work with JSON because people need to know how to work with

56:44 that kind of data.

56:44 Yeah.

56:45 Second of all, okay, let's not look at the numbers so much as let's do some plotting.

56:51 And so I'm going to like make enemies now.

56:54 I really cannot handle Matplotlib.

56:56 I find it, it's like so incredibly powerful.

57:00 And anytime I want to do even the tiniest thing with it, I have to look up the documentation

57:04 and remind myself how it works.

57:05 Maybe it's because I don't use it enough.

57:07 And so I just use the pandas interface for plotting 95% of the time.

57:12 Let's call it 90% of the time.

57:13 The rest of the time I usually use Seaborn.

57:15 So, so I said, okay, let's see if we can do some plotting here of these largest cities.

57:20 So for example, let's see growth in Pennsylvania cities, like which cities in Pennsylvania?

57:25 Oh, I'm sorry.

57:26 Like, let's do a bar plot.

57:27 How many large cities are in each state?

57:29 Okay.

57:30 Well, that's a group by, right?

57:31 So you have to know how to do grouping.

57:32 You have to know what you're grouping on and what, how to use count as opposed to mean,

57:36 right?

57:36 That count even exists.

57:37 But then if I want to see a bar plot where it's sorted from smallest to largest, you got

57:42 to know how to do sorting.

57:42 And then we can do a bar plot.

57:44 So sure enough, you do a .plot.bar on your series.

57:47 Kablam.

57:48 You have a bar plot right there.
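
The group, count, sort, and plot steps might look like this. The six rows and the helper name `plot_city_counts` are illustrative only; the plotting call is wrapped in a function because it needs matplotlib installed, which is an assumption about the environment.

```python
import pandas as pd

# A tiny invented stand-in for the cities data
cities = pd.DataFrame({
    "city": ["Los Angeles", "San Diego", "San Jose", "New York", "Houston", "Dallas"],
    "state": ["California", "California", "California", "New York", "Texas", "Texas"],
})

# Group on state, count the cities in each group, sort smallest to largest
counts = cities.groupby("state")["city"].count().sort_values()
print(counts)

def plot_city_counts(counts):
    """Draw one bar per state using the pandas plotting interface (requires matplotlib)."""
    return counts.plot.bar()
```

Calling `plot_city_counts(counts)` in a notebook draws the bars in the sorted order, since the plot follows the order of the Series.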

57:50 Yeah, that's awesome.

57:50 And we just get it and like, you know, it's nicely sorted there.

57:52 Quick takeaways.

57:54 Holy cow.

57:55 California has more big cities than I realized.

57:57 And, you know, where's New York and New Jersey?

58:00 Like it's way down the line.

58:01 You think of those as having like pretty megatropolis type places.

58:04 Massachusetts.

58:06 That's right.

58:06 That's right.

58:07 But it's how many cities.

58:09 Yeah.

58:09 Right.

58:10 So, so, so, so right.

58:11 Right.

58:11 That's like the, so keep going.

58:13 So like do a bar plot of growth in Pennsylvania.

58:16 So how are we going to do that?

58:17 Well, wait a second.

58:18 If we want to do growth, it's a percentage, but it's written in the JSON as a number and then

58:23 percent sign.

58:24 That's a string.

58:25 So we're going to have to get rid of the percent sign.

58:27 And then we're going to have to change the dtype from string to floating point.

58:31 And then we can sort it and then we can graph it.
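
As a sketch of that cleanup: strip the percent sign, convert the dtype with astype, then sort. The city names and growth figures below are placeholders, not values from the real file, and the column name `growth` is an assumption.

```python
import pandas as pd

# Placeholder cities and figures, for illustration only
pa = pd.DataFrame({
    "city": ["Alpha", "Beta", "Gamma"],
    "growth": ["11.2%", "-8.3%", "2.6%"],   # strings, as described in the JSON
})

growth_pct = (
    pa.set_index("city")["growth"]
      .str.rstrip("%")      # drop the trailing percent sign
      .astype(float)        # string dtype -> floating point
      .sort_values()        # greatest shrinkage first, greatest growth last
)
print(growth_pct)
```

With matplotlib installed, `growth_pct.plot.bar()` would then draw the negative and positive bars in that sorted order.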

58:34 And so the astype, changing the dtype, basically is a deep down inside of pandas parse-as-a-number

58:41 operation, right?

58:42 Rather than looping over and parsing or whatever.

58:45 Yeah.

58:45 Yeah.

58:46 Yeah.

58:46 Yeah.

58:46 Like basically, I mean, it has to do the looping.

58:48 It's doing the looping there at, you know, below sea level.

58:51 Yeah.

58:51 Below sea level.

58:52 Yeah.

58:53 Down by the Dead Sea.

58:53 I see.

58:54 Okay.

58:54 Got it.

58:57 And so you see here like this bar plot.

58:59 Now we have a whole bunch of cities that have gone down in size and a bunch that have gone

59:02 up in size.

59:03 And pandas is very happy to, you know, show, show the bar plot there.

59:07 You know, again, sorted so that we see it from the greatest sort of shrinkage to the greatest

59:13 growth.

59:13 Right.

59:14 With Pittsburgh having the most shrinkage and Allentown having the most growth.

59:18 I would not guess that.

59:19 And I would not have either given that like my sister-in-law and brother-in-law live in

59:22 Allentown and they're like, Oh, Allentown.

59:24 Actually, but I think that's a common thing that people think there, even though it seems

59:29 like a perfectly great place to be.

59:30 Yeah, sure.

59:31 Funny.

59:32 Awesome.

59:33 Well, there's a lot of cool takeaways.

59:35 One more.

59:36 Let me show you one.

59:37 Go to the last one here in this exercise.

59:39 Okay.

59:39 You got it.

59:40 So one of the most important types of plots you can do in data analysis is a scatter plot.

59:47 And you take your data frame and you say, give me the plot based on, you know,

59:52 the X axis should be this, the Y axis should be that.

59:55 Well, if we do a scatter plot based on longitude and latitude of the thousand largest U.S.

59:59 cities, you basically come up with a map of the U.S.
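
A small sketch of loading JSON records and plotting longitude against latitude. The three records, the field names, and the helper `plot_city_map` are hypothetical; the real file has a thousand cities, and drawing the plot assumes matplotlib is installed.

```python
import io
import pandas as pd

# Hypothetical records standing in for the real JSON file
json_text = """[
  {"city": "Alpha", "latitude": 40.25, "longitude": -75.0},
  {"city": "Beta",  "latitude": 34.5,  "longitude": -118.25},
  {"city": "Gamma", "latitude": 29.75, "longitude": -95.5}
]"""

# read_json accepts a path, URL, or file-like object
cities = pd.read_json(io.StringIO(json_text))
print(cities.dtypes)

def plot_city_map(df):
    """Scatter plot with longitude on the x axis and latitude on the y axis (requires matplotlib)."""
    return df.plot.scatter(x="longitude", y="latitude")
```

With enough cities, `plot_city_map(cities)` traces out the rough outline of the country because each point lands at its geographic position.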

01:00:02 Yeah.

01:00:03 How interesting.

01:00:03 The other thing that you realize is California is technically bigger than a lot of other states.

01:00:08 So that also accounts for why.

01:00:09 Oh, yes.

01:00:10 Yeah.

01:00:10 Interesting.

01:00:11 But there's a lot of people in California.

01:00:12 Good place to be.

01:00:13 There are a lot of people there for sure.

01:00:14 If they're willing to pay the sunshine tax, it's nice there.

01:00:17 Cool.

01:00:19 All right, Reuven.

01:00:20 This is really, really excellent.

01:00:22 I know you want to give a shout out to a couple of exercises from Bamboo Weekly.

01:00:26 So I'll include those in the show notes as well.

01:00:29 And people can like take those and jump over and look at them as well.

01:00:32 That's fine.

01:00:33 That's great.

01:00:33 Yeah.

01:00:34 No, we went through a lot here just during this time.

01:00:36 Yeah, for sure.

01:00:37 And as I've discovered, there's basically, I mean, there is an infinite number of these sorts of things that you can do and different permutations of various sorts.

01:00:46 I mean, we didn't even talk about like, you know, multi-indexing, which opens up literally a whole new dimension of this stuff.

01:00:51 Yeah.

01:00:51 But it's great fun.

01:00:52 It's great, great fun.

01:00:53 And a lot of the Pandas 2 stuff coming on changes the foundation and opens up more possibilities still.

01:00:58 Absolutely.

01:00:58 All right.

01:00:59 Well, final thoughts.

01:01:00 Final call to action.

01:01:02 What do we got here?

01:01:03 What do you say?

01:01:03 Well, call to action, I mean, you can take a look at like, I guess my overall courses and so forth at LernerPython.com.

01:01:10 Or if you just want to improve your Pandas knowledge, you can look at BambooWeekly.com, the newsletter there.

01:01:15 Does it cost money or does it just cost an email?

01:01:17 So it is a paid newsletter, but the first two questions and answers every week are free.

01:01:23 So like it's usually between five and nine questions each week.

01:01:28 But even if you don't pay, I still want people to get some Pandas learning improvement practice so they can totally do that.

01:01:36 Excellent.

01:01:36 All right.

01:01:37 Yeah.

01:01:37 And check out the book, obviously.

01:01:38 Oh, yeah.

01:01:39 That too.

01:01:39 PandasWorkout.com.

01:01:41 Cool.

01:01:41 I'll pull this up.

01:01:42 Yeah.

01:01:42 Awesome.

01:01:43 It's actually pretty thick.

01:01:44 Yes, it is.

01:01:45 So at the end, I was like, oh, that's why it took a long time.

01:01:51 Also procrastination, but you know.

01:01:53 Yeah.

01:01:53 I'd love to hear from people if they have.

01:01:56 I always love to hear like, you know, sort of interesting problems, data sets, issues that people are experiencing,

01:02:02 just so I can sort of figure out what are the next directions for me to explore that I can then try to help people with as well.

01:02:08 Awesome.

01:02:09 Well, I appreciate all your help coming on the show, sharing your knowledge, and just riffing on Pandas together.

01:02:15 It was a lot of fun.

01:02:16 Excellent.

01:02:16 My great pleasure.

01:02:17 Yeah.

01:02:17 Bye.

01:02:18 All right.

01:02:18 Bye-bye.

01:02:20 This has been another episode of Talk Python to Me.

01:02:23 Thank you to our sponsors.

01:02:24 Be sure to check out what they're offering.

01:02:26 It really helps support the show.

01:02:28 Take some stress out of your life.

01:02:30 Get notified immediately about errors and performance issues in your web or mobile applications with Sentry.

01:02:36 Just visit talkpython.fm/sentry and get started for free.

01:02:41 And be sure to use the promo code TALKPYTHON, all one word.

01:02:44 This episode is brought to you by Scalable Path.

01:02:47 If you're a founder or engineering leader, you know how hard it is to find top-tier developers while keeping costs low.

01:02:53 Scalable Path is a software staffing company that helps you build remote dev teams that just fit.

01:02:59 Build your team at talkpython.fm/scalablepath.

01:03:03 Want to level up your Python?

01:03:04 We have one of the largest catalogs of Python video courses over at Talk Python.

01:03:09 Our content ranges from true beginners to deeply advanced topics like memory and async.

01:03:14 And best of all, there's not a subscription in sight.

01:03:16 Check it out for yourself at training.talkpython.fm.

01:03:20 Be sure to subscribe to the show.

01:03:21 Open your favorite podcast app and search for Python.

01:03:24 We should be right at the top.

01:03:26 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

01:03:35 We're live streaming most of our recordings these days.

01:03:38 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:03:46 This is your host, Michael Kennedy.

01:03:48 Thanks so much for listening.

01:03:49 I really appreciate it.

01:03:50 Now get out there and write some Python code.

01:03:52 I'll see you next time.
