#471: Learning and teaching Pandas Transcript

Recorded on Sunday, Jul 7, 2024.

00:00 If you want to get better at something, oftentimes the path is pretty clear.

00:03 If you want to get better at swimming, you go to the pool and practice your strokes

00:07 and put in time doing the laps. Want to get better at mountain biking?

00:10 Hit the trails and work on drills focusing on different aspects of riding.

00:14 You can do the same for programming.

00:16 Reuven Lerner is back on the podcast to talk about his book, Pandas Workout.

00:21 We dive into strategies for learning pandas in Python, as well as some of his workout exercises.

00:27 This is Talk Python to Me, episode 471, recorded July 7th, 2024.

00:33 Welcome to Talk Python to Me, a weekly podcast on Python.

00:50 This is your host, Michael Kennedy.

00:52 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both accounts over at fosstodon.org.

01:00 And keep up with the show and listen to over nine years of episodes at talkpython.fm.

01:05 If you want to be part of our live episodes, you can find the live streams over on YouTube.

01:10 Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified about upcoming shows.

01:16 This episode is brought to you by Sentry.

01:18 Don't let those errors go unnoticed. Use Sentry like we do here at Talk Python.

01:22 Sign up at talkpython.fm/sentry.

01:25 And it's also brought to you by Scalable Path.

01:28 If you're a founder or engineering leader, you know how hard it is to find top tier developers while keeping costs low.

01:34 Scalable Path is a software staffing company that helps you build remote dev teams that just fit.

01:40 Build your team at talkpython.fm/scalablepath.

01:44 Before we jump into the interview, I want to let you know that we still have some spots left in my

01:49 Code in a Castle event. If you're looking to learn some of the premier frameworks and techniques in

01:54 Python, and you'd like to have a bucket list type of experience while doing so,

01:59 then check out talkpython.fm/castle.

02:01 In October, I'll be running a six day Python course for an intimate audience in a villa in

02:08 Tuscany. Half the time we'll be learning Python and the other half will be exploring the best of

02:13 what Italy has to offer. Check out the course outline, the excursions and all the details at

02:17 talkpython.fm/castle. Or if you'd like to just shoot me an email, michael@talkpython.fm or find

02:25 me on the socials and I'm happy to talk about it. Hope to see you there.

02:27 Reuven, welcome back to Talk Python To Me. How are you doing?

02:32 I'm doing great. Great to be back here. Nice to see you.

02:34 Yeah, it's great to see you as well. I'm a little concerned though.

02:38 There's some possibilities that maybe my Facebook ads are going to get messed up.

02:42 We are talking about pandas and pandas internationally. And I heard that you're

02:47 some kind of animal trafficker. Do you want to start the show with that story? It's out of control.

02:52 It is the craziest story. So I occasionally advertise on Facebook, advertise my pandas

02:58 and Python training. And I guess it was like two years ago, I tried doing a little bit more

03:03 advertising. And basically I didn't really pay much attention to it until about a year ago

03:09 when I noticed that when I tried to do some more advertising and it said,

03:12 you are not allowed to advertise on any Meta properties. That's really weird. Like, what did

03:16 I do? I looked, I could not find any indication of what I'd done wrong. So it says, if you want

03:21 to appeal, click here. So I clicked here and within 30 minutes or so, I get email back saying,

03:25 your appeal has been checked and denied. You will never be allowed to advertise on

03:30 any Meta property again. And this was like, what have I done? Like, what are you doing?

03:34 I feel like I'm pretty innocent. You're some courses and some books. Come on, man.

03:38 I'd like, I figured also I appealed, someone must've looked through this. Anyway, in poking

03:43 around online, someone said to me, oh, I was caught by the same thing. It is illegal to sell

03:49 rare and endangered animals. And they believe that since you were selling Python and pandas

03:54 training, then you must've been trading in live and endangered animals and thus you are banned

03:58 for life. Well, I said, you know what? I got some, first of all, like nuts and yet a great story to

04:03 tell to my machine learning classes. So good on that front. But I was like, this is easy to resolve,

04:07 right? I'll just contact someone at Facebook. No one's available. I got texts on my connections,

04:12 like people I know who work there. And I finally get the answer back saying that for legal purposes,

04:17 they delete all data that's more than six months old, something like that. So they could not go

04:23 back and check to see why I was banned. And thus this ban really would be for life. So because I

04:28 waited a while, cause it'd been like a year between the ban and me noticing. So they couldn't

04:33 do anything about it. So I was like half laughing and half like, you got to be kidding me about this.

04:39 And I posted on my blog about this, and you guys picked it up on Python Bytes,

04:45 it was picked up in a bunch of other places like Hacker News. And about a month later I check and

04:50 it was back. No one told me anything. No one said anything just magically and quietly. My account

04:57 was restored. so they can never helps. That's right. That's right. But it was so

05:03 absurd. I think it was at least like four senior engineers at Facebook who tried to help me with

05:08 this. And they're all like, we tried, there's nothing we can do. Crazy. It reminds me, honestly,

05:13 it reminds me so much of my app store review experience, getting the new Talk Python Training

05:19 mobile apps in iOS and the apps in the iOS app store and the Google app store. They were

05:24 tragically incompetent in their own special way. Right? Somewhat malicious, somewhat not malicious,

05:32 just like for example, Apple, I think it was Apple. It could have been, I don't know,

05:36 Apple or Google has to be one of them said, you know, we've denied your application to publish

05:40 this application because you're trying to impersonate an existing one. I said, what

05:44 app am I trying? What am I trying to impersonate? Like it's hundreds of hours of stuff that we've

05:49 created. Like they're like, well, first you might be hijacked. So, so what this is, well,

05:54 in the description it says, if you want to learn Python, you can take our courses. Well, there's

05:59 already an app called Learn Python, and you're trying to impersonate it. I'm like, what? They're

06:03 like, it said, if you want to learn Python, but there's an app called Learn Python. I'm like,

06:08 I just, I don't even under like my mind is how do you read that description and think this is

06:12 a trademark? So if you come up with an app called like eat at a restaurant, then no food ordering,

06:18 yeah. You're going to be acquired by Uber Eats straight away. Exactly. And so we went, but we,

06:24 you would think, okay, fair mistake. Like, yeah, yeah. Okay. Something caught it. Just,

06:28 just like you said, a simple request and a human will look and it'd be fine. No, they're like,

06:31 well, look, obviously you're doing this. I'm like, obviously let's try another scenario.

06:35 What if I said I wanted to learn to play the guitar and there was an app called learn guitar,

06:40 but I don't want to learn. It's not, it's not, the title is not learn guitar. Just

06:44 the act of learning a guitar. How else would you explain it? Like, oh, okay. So I guess we,

06:48 we see, we understand now you can, you can have that sentence. And it's just,

06:52 these things are really, and you're at the, at the complete mercy of them. It's, it's really,

06:57 it's both like comically funny, but it's also like painful because we spent four months building

07:02 that app. They wouldn't accept for this stupid reason, you know? And they're taking a fair

07:05 amount of money off the top for whatever people are earning from these apps. It's not like this

07:10 is a charity or something. What I found amazing with Facebook was there was literally no way to

07:16 contact a human being. Like I tried all sorts of searches and forms and on, on, and on. And like,

07:22 and, and I found a lot of other people who had been caught up in this sort of ridiculous

07:28 situation, but like, there's no form even to say, Hey, I think you made a mistake. Why don't you

07:34 have a human look at this? Because that would cost them too much, you know, someone's salary.

07:38 Absolutely. It's anyway, I don't want to spend too much time on it, but boy, is that a crazy story?

07:42 I mean, we're going to be trading in some, we'll be trading in some pandas today. And so I just,

07:48 I'm going to bleep that part. Every time the word pandas is said, we're bleeping it out on

07:51 the YouTube version. Well, this could be entertaining then. Wow. Reuven was really

07:57 testy. Like all those bleeps. No, seriously though. You know, let's, let's catch up. We'll

08:04 talk about your book. What have you been up to since the last time we spoke? I can't remember

08:08 last time I had you on. It's been a couple of years, I think. It's been, it's been some time.

08:11 Yeah. So I'm continuing to do Python and pandas training at companies. Since the pandemic,

08:18 much more of it is online just because companies are now used to doing stuff on Zoom,

08:23 Teams, WebEx, rather than bringing me there in person. That's fine. So I travel to conferences

08:28 rather than to clients. And, you know, it sort of extended my flexibility on that. And I've also

08:32 been building up my online training stuff, both a whole lot of courses. And I've got an online

08:37 bootcamp that I run for Python and Pandas twice a year. The big thing, which is actually connected

08:41 to what we're doing here also is I have a new newsletter called Bamboo Weekly, where I have

08:45 pandas exercises. It's sort of like the same style as the book, but every week I take a topic and I

08:50 do it, I do it based on current events. So if there's something going on in the news, I try to

08:54 find a dataset, a real world dataset that has something to do with that. And then we, we try

08:59 to experiment with that. Yeah, that's actually fun to see like a real world example every week.

09:02 Yeah. Yeah. So people are like, how do you come up with that? And I say, well, I listen to a lot

09:06 of podcasts and I read a lot of newspapers and something somewhere, like I read The Economist,

09:11 for example, and they had this short, cute article about the number of animals that go

09:16 through Heathrow airport every year. I was like, wait, there's got to be a dataset for that. And

09:21 sure enough, the Heathrow airport authority publishes a dataset in CSV of how many animals

09:26 go through. And it was like 200 horses and one and a half billion butterflies and everything in

09:31 between. And so it was great fun to sort of play with that data and ask questions about it and

09:39 give people practice with something that's dirty and messy. I don't mean the animals being dirty,

09:43 right? Like very messy. Like you need to like really, you know, wrestle with it because that's

09:46 the only way you're going to improve. So I'm, I'm having a lot of fun on all the, the training

09:50 front. I definitely see more and more use of, I'm sure this won't surprise you, more and more use of

09:54 Python in the data space as it just like catches fire there. Yeah. It's just going downhill,

10:00 picking up speed, isn't it? It's extraordinary. Like I still remember asking people in my intro

10:05 Python courses. So what are you here for? They're like, yeah, my company's thinking of doing some

10:09 stuff with data analysis and pandas. And, or at least as data analysis, I think it was then even

10:13 in NumPy before pandas came out and me thinking, Hmm, I should really learn this stuff because it

10:18 sounds like it's going to be popular and holy cow, it's like, it's everywhere. And the whole ecosystem

10:24 is just growing and growing and growing. It's like people are sort of seeing pandas as the like

10:28 underlying infrastructure on which they build their software tools and their companies.

10:32 Right. Or even the thing that defines the API on which they can innovate, right? Think Dask or a

10:38 whole bunch of other things in that space. Right. That if you know pandas, chances are, if you're

10:42 not too crazy, you can like do grid computing by changing the import. It's extraordinary.

10:47 Just extraordinary. Yeah. Yeah, it absolutely is. And so back to your newsletter, pandas eat bamboo.

10:54 Is that right? That's so, I understand why you're back on Facebook. Okay. All right.

10:58 It was actually my father's idea. I was like, I need a catchy name for something. And so he was

11:04 like, well, we'll have something with bamboo. I was like, right, right. It's like, it's food for

11:08 thought and food for pandas. Yeah. Yeah. That's actually really clever. I like it a lot. I like

11:12 it a lot. So you've been on a journey. You've been on a journey to write a book, right?

11:17 Yes. I was convinced to do a second one. Right. So I did a book with Manning called Python workout,

11:22 which was exercise in Python. And when that finished, they said, so what other topics do

11:28 you think would be useful? I said, well, doing a lot of pandas and people definitely need a lot

11:31 of exercise and a lot of practice in that. And so I got to work both collecting the exercises I do

11:37 with my corporate training and also like coming up with some new ones as I learn new things.

11:41 Cause pandas is so, so, so huge that it's very easy to get lost in there and not even know what

11:47 are the important topics to learn. And, and so thanks to like working with people all the time,

11:52 I sort of see also where they get stuck and where they have problems and where it's really confusing

11:55 to them. So, yeah, so 200 and somewhat exercises later, by the way, I'll just tell you why did it

12:01 come out when it came out? Cause I'm terribly bad at deadlines. I said to the Manning people,

12:05 I really want to have it at my booth at PyCon. They said, okay, big talker, you want it mid May,

12:10 huh? So you'd better have everything done. And they sort of backtracked on the calendar and said,

12:14 okay, so you've got to have the whole thing done by December. And then finally I had some like,

12:18 you know, fire under me and got it done. I supposed to get bored.

12:22 Yeah. The version I actually have in my Apple books here I pulled up is the Manning early

12:28 access preview version, but it's out, out for real now. Yeah.

12:31 Here. I've even got the paper copy here.

12:33 Look at that.

12:34 I know.

12:35 That's a pretty hefty book, honestly.

12:36 I still keep looking at it, baby.

12:37 That's a proper book.

12:38 It's like, wow. I know. I know. It feels like it's really quite some feeling there to see.

12:42 I guess I really finished it.

12:44 Congratulations. So I wanted to talk to you about maybe just kind of following on with

12:49 your bamboo idea, like give us some examples, give us some problems that people are solving

12:53 with pandas. And I mean, we're not going to talk through the code super detailed,

12:57 but you could say like this aspect or this feature of pandas like .loc or whatever

13:02 is how you access and solve these problems. Right. Like, so just kind of exploring that space. I had

13:06 Wes McKinney on before five episodes ago, something like that. And I was just like,

13:10 how do you learn pandas? It's like so big, you know?

13:13 So I've actually changed. Well, let me, let me, let me first say, I'm definitely one of those

13:18 people who thinks you should learn Python before pandas. Like, I definitely think that knowing the

13:23 language well, will serve you very, very well, in all sorts of weird, small ways necessarily.

13:29 But at the same time, when you learn pandas, you have to learn that some of the paradigms you

13:34 learned, like some of the idioms from Python are not appropriate. So I was giving a class in like

13:39 optimizing pandas, like a short class, we'll call it micro class, like 90 minutes long,

13:43 about a year or so ago. And at the end, I was like, oh, and by the way, obviously just never do for loops. And everyone's like, wait, wait, wait, what?

13:50 I said, what do you mean? What? And they were like, we were taught in our intro pandas class

13:55 that you should do a for loop to do anything across the data frame. And I was like, okay,

13:59 what? How can this be? And so people think that because it's in Python, you should do it the same

14:06 way, but there are all these different idioms, especially whole vectorization that you need to

14:10 internalize. Otherwise, as I like to say to people, you should hope to be paid by the hour

14:16 because like these things are just gonna take forever to run. And so people like don't

14:21 necessarily understand sort of how to approach panda stuff. And then they don't understand sort

14:25 of, let's see, I mean, I'll give like, like outline and then we can sort of go into it.

14:30 Certainly how to access things with .loc, certainly how multi-indexes work,

14:34 how to work with the different dtypes. These are things that we

14:38 don't think about in the standard Python world each day, right? The closest analogy would maybe

14:43 be that when you're working with a dictionary, you want to think about what your keys are going

14:47 to be, but even then it doesn't come close to it. So with .loc, people are like, so, so fine. So

14:52 .loc is used for retrieving from a series or retrieving from a data frame. But it's so much

14:57 more than that because you have two parts to it or potentially two parts. You have the row selector

15:03 and you have the column selector and each of those can be an individual name, a list of names, a

15:09 Boolean series, or even a Lambda. And you can mix and match those in so many different ways. And

15:14 once you see those options, like your head sort of explodes with, Oh wait, I never thought that I

15:19 could access my data like that. And then you're like, wait, you can also assign your data like

15:23 that. And then they're just sort of astonished. Yeah. It's really wild. You know, I think pandas

15:28 has its own idiomatic style that is different than what you would call Pythonic, right? Like

15:34 it's Pydantic. I don't know what the name is, but idiomatic pandas, right? Where there's things that

15:40 are specific to pandas, like this vectorization stuff, right? Instead of looping over, right?
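
The row and column selectors that Reuven lists for .loc a moment earlier can be sketched as follows. This is a minimal illustration on a made-up DataFrame; the column names and values are assumptions, not data from the book.

```python
import pandas as pd

# Made-up data for illustration only; not taken from the book.
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune", "Reno"],
     "temp": [3, 19, 31, 12],
     "rain": [40, 5, 80, 10]},
    index=["a", "b", "c", "d"],
)

# The row selector and column selector can each be a single label...
single = df.loc["b", "temp"]

# ...a list of labels...
subset = df.loc[["a", "c"], ["city", "rain"]]

# ...a boolean Series...
warm = df.loc[df["temp"] > 10, "city"]

# ...or a callable (for example a lambda) that receives the DataFrame.
wet = df.loc[lambda d: d["rain"] > 20, ["city", "temp"]]

# The same selectors also work on the left-hand side of an assignment.
df.loc[df["temp"] > 30, "temp"] = 30
```

Each selector form can be combined with any other, which is where the "mix and match" variety comes from.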

15:46 You know, you think about the Python performance angles or the data science performance angles,

15:51 right? A lot of the speed that we get out of tools like pandas and NumPy and Polars and others is

15:56 because you take the data, you push it down into some native layer and you just leave it there.

16:00 And you tell, you kind of speak to the native layer from Python and you say, deep down in your

16:06 insides, you got this thing, multiply all 1 million of them by two or whatever. Right. But if

16:10 you loop over it, you're like kind of running out of C into, into, you know, Python objects, then

16:17 you're operating out, then you're putting it back, like just, you know, 2 million times, all of a

16:20 sudden, all those benefits are gone. And so certainly learning those types of things. I mean,

16:25 the vectorization, I think most people get pretty soon, although it sounds like not necessarily

16:30 everyone. I was really like flabbergasted. But there's way more than that, right? There's,

16:35 there's a whole, that's just probably the most obvious thing.
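
As a rough sketch of the difference between the vectorized style and the for-loop style discussed above, assuming a made-up Series of one million integers (the helper name double_with_loop is hypothetical):

```python
import numpy as np
import pandas as pd

# One million made-up values, echoing the "1 million" in the conversation.
s = pd.Series(np.arange(1_000_000))

# Vectorized: one expression, and the multiplication runs in compiled code.
doubled = s * 2

# Loop version: each element is pulled out into Python, multiplied there,
# and collected again. Same result, typically far slower.
def double_with_loop(series):
    out = []
    for value in series:
        out.append(value * 2)
    return pd.Series(out, index=series.index)

# Only the first 1,000 values, so the loop stays quick to run.
looped = double_with_loop(s.head(1_000))
```

Timing both versions on the full Series (for example with %timeit in Jupyter) is a quick way to see the gap on your own machine.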

16:37 This portion of Talk Python to me is brought to you by Sentry. Code breaks. It's a fact of life.

16:43 With Sentry, you can fix it faster. As I've told you all before, we use Sentry on many of our apps

16:49 and APIs here at Talk Python. I recently used Sentry to help me track down one of the weirdest

16:55 bugs I've run into in a long time. Here's what happened. When signing up for our mailing list,

17:00 it would crash under an uncommon execution path, like situations where someone was already

17:06 subscribed or entered an invalid email address or something like this. The bizarre part was that our

17:12 logging of that unusual condition itself was crashing. How is it possible for a log to crash?

17:19 It's basically a glorified print statement. Well, Sentry to the rescue. I'm looking at the crash

17:24 report right now, and I see way more information than you would expect to find in any log statement.

17:30 And because it's production, debuggers are out of the question. I see the traceback, of course,

17:35 but also the browser version, client OS, server OS, server OS version, whether it's production

17:41 or QA, the email and name of the person signing up, that's the person who actually experienced

17:46 the crash, dictionaries of data on the call stack, and so much more. What was the problem?

17:51 I initialized the logger with the string info for the level rather than the enumeration dot info,

17:58 which was an integer-based Enum. So the logging statement would crash, saying that I could not

18:04 use less than or equal to between strings and ints. Crazy town. But with Sentry, I captured it,

18:11 fixed it, and I even helped the user who experienced that crash. Don't fly blind. Fix

18:16 code faster with Sentry. Create your Sentry account now at talkpython.fm/Sentry. And if you

18:22 sign up with the code TALKPYTHON, all capital, no spaces, it's good for two free months of Sentry's

18:29 business plan, which will give you up to 20 times as many monthly events as well as other features.
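
The logging bug described in the sponsor segment above comes down to comparing a string with an integer-based enum. A minimal sketch, assuming a hypothetical Level enum and should_log check (the episode does not name the logging library, so neither is a real API):

```python
from enum import IntEnum

# Hypothetical stand-ins: the episode does not name the logging library,
# so Level and should_log are illustrative only.
class Level(IntEnum):
    DEBUG = 10
    INFO = 20
    ERROR = 40

def should_log(configured_level, message_level):
    # A typical level check: emit only if the message meets the threshold.
    return configured_level <= message_level

# Both sides are integers, so the comparison works.
ok = should_log(Level.INFO, Level.ERROR)

# A string threshold cannot be ordered against an integer level.
try:
    should_log("INFO", Level.ERROR)
    failed = False
except TypeError:
    failed = True
```

Because the comparison only runs when a message is actually logged, a mistake like this can stay hidden until a rarely used code path is hit.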

18:34 Accessing things in that way, by the way, you asked before, like, how should people even approach

18:39 learning pandas? And so I've started thinking about it a little differently based on feedback

18:42 from people, that I would sort of walk through it. Okay, here's a series, here's a data frame.

18:47 And after creating some fake ones with made-up numbers or random numbers, then we'll start

18:52 reading in files. And then we'll do this, and then we'll get to visualization. And a student

18:57 of mine said, you know, you're sort of missing the big, like, people miss the big picture doing

19:02 that, and they want to get the excitement. Why don't you start with, read a CSV file, visualize

19:07 it right there inside of Jupyter, and then people will be so impressed and amazed, and then you fill

19:12 in the gaps. And so I've started doing that a bit. And I think that has not been a bad approach to

19:17 catch their attention, give them a sense of what the possibilities are. I'm like, okay, let's now

19:21 walk through each of these little pieces and build up to what we saw that first day. And that's been

19:26 fun. Yeah, I totally agree. That's also something that I really strive for. I don't always do

19:32 whenever I'm doing presentations, but, you know, just because someone is chosen to sit in a seat

19:37 for a day long course or an hour long presentation, doesn't mean that they couldn't use a little

19:42 inspiration. Right. And if you can like, wow, you did those three lines, and now we have this

19:45 picture and I understand all that, like, tell me more. Right. That right there. That's right. Right.
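
The "read a CSV file, then visualize it" opening might look something like this. It is a sketch under assumptions: the CSV text is made up and stands in for a real file, and the show helper is hypothetical.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file; pd.read_csv also accepts a path.
CSV_TEXT = """month,visitors
Jan,120
Feb,135
Mar,180
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

def show(frame):
    # In Jupyter the chart renders inline. Plotting needs matplotlib installed.
    return frame.plot.line(x="month", y="visitors")
```

In a notebook, calling show(df) in a cell is enough to get a chart on screen, and the individual pieces can be explained afterwards.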

19:50 Now you have their actual attention the whole time and they're enjoying it. And it's, yeah,

19:54 it's so often it's like, well, in order to show you the nice stuff, I got to give you every level

19:59 of details. Like, no, you don't. You're going to, you're going to make them leave. Don't do it that

20:03 way. That's right. That's right. And you mentioned like three lines of code. One of the amazing

20:08 things about pandas is how often you can write very, very little code, but it's like getting

20:14 to that code that takes a while really taking advantage of it. Yeah. We've been knowing that.

20:18 Right. Right. Right. Right. People don't even know. So a lot of times, like I've been using

20:24 it now for long enough that I sort of intuit, oh, there's got to be a method that does this. Like

20:29 someone has encountered this problem before. And so either as a method or there's an option

20:34 or there's an add on like something somewhere that just makes it trivially easy. And that's part of

20:39 the exploration that I try to do both in the book and the weekly, like in my training journal. And

20:43 also truth be told, I'm constantly learning stuff, right? Like it's, it's a rare for me to teach and

20:48 not discover some option or some method that I did not know about because it's just so incredibly

20:52 vast. Yeah. Or pandas 2 comes out or something like that. Yes. Yes, indeed. I mean, I've been

20:57 exploring, I mean tomorrow, tomorrow I head off to Prague for EuroPython, where I'm giving a talk

21:02 on PyArrow in pandas. And so I've been looking into that a lot and oh boy, right. I mean, I've

21:08 been using it for say a year or so, but it's amazing. And yet there are all these subtle

21:12 changes that are happening in pandas as a result that you need to know about it. You need to expect

21:17 when, when you use it, but it's another like tool in our toolbox that we can pull out to make pandas

21:23 more effective, more efficient, deal with larger data and also interact with other and interoperate

21:29 with other systems. Yeah, for sure. It's just like, as I said, the whole ecosystem is just

21:33 exploding. It's really quite something. Yeah, it really is. And I think the pandas 2 stuff is

21:36 going to make a pretty big difference in changing like the internals away from just tables of

21:40 numbers and basically, okay. Let's talk about, so the way I thought we could maybe explore this

21:46 pandas workout book is let's just pick some fun exercises that you put together and, and talk

21:52 about them. Like, give us a quick overview of what the workout aspect of this book means anyway,

21:57 then we'll, we'll pick the first one. Sure. So the idea is that you can't learn everything all

22:03 at once or quickly. That's sort of like, you know, working out physically, it's a long haul

22:09 and every day you get a little better, a little stronger, a little more flexible. And so if you

22:15 see, you know, your pandas learning journey as it's going to take me several months, not, it's

22:21 going to take me a day. Then if every day you do a little bit of practice, you learn something new

22:25 in some new direction. At the end of that journey, you're going to be able to solve many, many more

22:30 problems in better, more idiomatic and more efficient ways. And you'll be able to put these

22:35 pieces together in ways that you didn't even expect. So that's, that's the basic idea. And so

22:40 the book is divided into, I want to say 12 chapters. I know we sort of rejiggered it at

22:43 some point where each chapter focuses on a different aspect of pandas. But it's really

22:49 sort of the, the total experience of going through it. And so we have 200 exercises,

22:55 plus there's like a, for lack of better term, a midterm and a final, like meatier projects that

23:01 people can go through that people ask for after the Python book. And each exercise then has not

23:07 only the main exercise where I pose a problem, I give an explanation, I give an answer, note the

23:11 order, the answer comes at the end. So you won't peek as easily. We talked about putting in the

23:15 back of the book and like, that just didn't work out so well. So at least you have to wade through

23:19 the explanation a little bit or turn the page. And then you can buy the one book with the questions.

23:24 You can buy the answers later. I was joking. I didn't mean, I didn't mean to give anyone ideas.

23:34 I'm joking. No, no, it's fine. And then like after the answer, we have three, what we call beyond the

23:39 exercise, which are okay. Now that you've kind of gotten the basics, let's push either on the same

23:46 data set or even like, like sort of go farther. And so it's like, that's why we say it's 200

23:52 exercises because it's 50 official exercises and another three for each one. And those tend to be

23:57 much harder. And I don't give a full explanation, although I do give the solution online in the

24:01 Jupyter notebook, you can download Jupyter for all these things and see how I solved it. And for

24:05 many of them, there's a link to Pandas Tutor. I think, yeah, you've definitely spoken to Philip

24:10 Guo in the past. So he has not only Python Tutor but also Pandas Tutor, he and his team. It's amazing.

24:15 And so you can just click on a link and it will take you to usually a miniature version of the

24:20 data set because it's too big for Pandas Tutor. And then you can sort of see the visualization,

24:25 how these things work. Yeah. I haven't, I don't think I talked about Pandas Tutor. Maybe I have

24:29 talked to Philip about it, but I've certainly used it in some of my courses, but it's worth

24:34 bringing up just to talk, like, see how that thing works. It's, it's something special. This

24:39 thing, it really gives you some deep understanding into like, okay, if I run, you know, this group

24:44 by command, then here's how all the pieces like flow back together. And so you're using this

24:49 during your book. So, yeah. So, I mean, every exercise has a link in the solution. You click

24:54 on it and it brings you to the code that I use. Usually, again, it's a sort of miniature version

24:58 of the code, miniature version of the data set that you can then see the visualization there

25:02 in your browser. Nice. Yeah. Cause you've got to basically, I think, encode the data in the URL or something. So it can't be too much.

25:09 Right. So I would like basically take the data in the data frame, turn it into a Python dictionary,

25:17 and then like set it up because it won't work with files for security reasons. So assign that

25:22 dictionary to a variable or no, no, I'm sorry. I turned to CSV in a string and then assign that

25:27 string to a variable and read it in. And then I would see if it overflowed the

25:32 Pandas Tutor limit. And if it did, I sort of iterated until I got it small enough to fit in

25:37 there and big enough to be useful and interesting. I'll take a few iterations.
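
The workflow described above (turn the data into CSV text, read it back from a string, and shrink it until it fits) could be sketched like this. The size limit, the helper name shrink_to_fit, and the sample data are all assumptions; the episode does not state the real Pandas Tutor limit.

```python
import io
import pandas as pd

# Hypothetical limit: the real Pandas Tutor size cap is not stated in the episode.
MAX_CHARS = 200

def shrink_to_fit(df, max_chars=MAX_CHARS):
    """Return CSV text for the largest leading slice of df that fits in max_chars."""
    n = len(df)
    while n > 0:
        csv_text = df.head(n).to_csv(index=False)
        if len(csv_text) <= max_chars:
            return csv_text
        n -= 1
    raise ValueError("even a single row exceeds the limit")

# Made-up data standing in for a full exercise data set.
full = pd.DataFrame({"id": range(100), "value": [i * i for i in range(100)]})
csv_text = shrink_to_fit(full)

# No files involved: the string is read back through an in-memory buffer.
small = pd.read_csv(io.StringIO(csv_text))
```

Dropping one row at a time is the simplest approach; in practice picking rows that keep the example interesting matters as much as the size.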

25:42 It's cool. No, but it's a super, super good resource for learning pandas. I also think

25:46 for exploring, right? Like you end up with some code that like, I'm not really sure what this

25:50 does, or this is actually new to me. Like a lot of the things you might encounter in this book,

25:54 you're like, let me visualize that. Right. Cause yeah, it's great.

25:57 Right. I mean, I, as I said, like I do a lot of training in pandas and very often people

26:03 have already played with it a bit, used it, even use it for a year or two. And because it's so

26:07 large, like it's not unusual for someone to say, Oh, I had no idea that this functionality existed.

26:13 Why have I been wasting my time doing X and Y and Z? I'll give you one, one example that I've

26:19 been using more and more. So, you know, Matt Harrison's a big fan of the method chaining

26:23 approach. And at first I was like, yeah, yeah, yeah, Matt, whatever. Like, like, yo,

26:27 stop pushing that on everyone. That's your opinion. Keep it to yourself.

26:31 Right. And then I was like, actually, this is a great way to build things up little by little,

26:37 line by line. And I can use this. It's pedagogically very useful because I say,

26:40 okay, let's think about how we want to break down this problem. We'll do this. Then we'll do this.

26:44 Then we'll do this. And you can see it sort of going line by line until voila, we have the

26:48 analysis that we want. And so I inserted that into a lot of places in the book. Sort of like

26:54 one of the last edits that I did was to go back and change it to be more method chaining. And I

26:59 use it now all the time in my training and in Bamboo Weekly. And I, so I, I, I bow to Matt on

27:05 that. He was, he was right. And I was stubbornly resistant for no good reason.

27:10 Yeah. I really like that style as well. You see, it's officially night here where I am.

27:14 It's just, everything switched to dark.

27:15 But I just had that yesterday when it, like I was doing office hours and all of a sudden,

27:20 like I was sharing my screen and it changed the color. I'm glad I'm not the only one.

27:24 So yeah, I, I'm a big fan of the method chaining fluent interfaces. I mean, I would love to even

27:30 see, like, Python itself, in the standard library, adopt that more, right? Like there's

27:35 so many things you operate on that'll change something, but then it'll, it'll return.

27:40 No, it'll not return anything. It's like a void method as much as we have those,

27:43 right? It returns none effectively. So you can't say, you know, dot sort dot this dot that you have

27:48 to like multi-step it. And I would just love to see more of it, but let's talk.

27:51 Well, I'll just, I'll just say there on that front. So pandas, like, does have the option to

27:58 either get back a new data frame or to say inplace equals True. And then it does it locally,

28:03 like it does it on that data structure and then returns none. And people are consistently

28:08 convinced that this is faster, more efficient, better. And so I've been like trying to tell

28:13 people, no, the pandas core developers keep saying, do not think that is true. It is not true.

28:18 And we are getting rid of inplace equals True at some point, stop using it so that you can do

28:23 method chaining. And so no small number of people again, in my course, they're like, Oh, really? I

28:28 had no idea. I feel like, you know, I'm spreading the gospel.

28:33 Throw that whole expression in some parentheses and dot yourself away. Let's go. All right.

28:38 That's right.
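
A minimal sketch of the contrast being discussed: a parenthesized method chain versus `inplace=True`. The data frame and column names are hypothetical, made up for this example.

```python
import pandas as pd

# Hypothetical ride data, invented for illustration.
rides = pd.DataFrame(
    {
        "passenger_count": [1, 2, 1, 4],
        "trip_distance": [1.5, 0.0, 7.2, 3.3],
        "total_amount": [9.0, 3.0, 28.5, 15.0],
    }
)

# Each method returns a new object, so the steps chain inside parentheses.
result = (
    rides
    .loc[lambda df_: df_["trip_distance"] > 0]  # drop zero-length rides
    .assign(cost_per_mile=lambda df_: df_["total_amount"] / df_["trip_distance"])
    .sort_values("cost_per_mile", ascending=False)
    .reset_index(drop=True)
)

# By contrast, inplace=True mutates the frame and returns None,
# so nothing can be chained after it.
scratch = rides.copy()
returned = scratch.sort_values("trip_distance", inplace=True)

print(result)
print(returned)
```

Commenting out any line of the chain shows the intermediate result, which is what makes this style handy for building an analysis step by step.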

28:39 Let's dot our way on over to exercise seven of the many, and let's just talk about long,

28:46 medium and short taxi rides. Tell us like, kind of like we can only talk so much about the code,

28:51 but like, let's, let's talk a bit about it and get people, like I said, I want to expose people to

28:56 like, what are some of the problems and aspects of Pandas that you can use to solve them?

29:00 Sure. So one of my favorite datasets to work with is New York City taxi information. It's like,

29:07 everyone can identify with it. You understand it. And so this exercise uses a very small

29:12 subset of that. Maybe we'll talk next about the pandemic taxis, which is a much larger one,

29:16 but this is just a hundred thousand, no, 10,000, 10,000 taxi rides from like five years ago,

29:20 six years ago. And the question is, well, how can we divide up this dataset, which tells us

29:27 how long, how far, how much people paid when they were picked up all that information.

29:31 How can we find out like the distance that they went and categorize that? And the reason that

29:37 this can be useful is we so often have numeric data that we need to put into categories, right?

29:42 What are best sellers? What are poor sellers who are, you know, the most, you know, you know,

29:47 employee of the year, that sort of thing. There's so many places where turning something into a

29:51 category will be useful. And so it's very tempting to think, okay, I'll like do some

29:56 if statements or I'll do some for statements, but actually pandas provides us with pd.cut,

30:02 which just does it for us. And this is one of those examples of once you learn it, you're like,

30:07 oh, wow, I get it. I don't have to have, oh yeah, even up here. Like you might think you would want

30:12 to say, let's set all the categories, you know, to be medium. And then where it's less than two

30:18 miles, we'll call it short or it's greater than 10 miles. We'll call it long, but you can just

30:22 say pd.cut, we're going to cut it at two, we're going to cut it at 10. Anything less than two is short.

30:26 Anything greater than 10 is long. Anything in the middle is medium. Done. pd.cut gives you back a

30:31 new series. - This portion of Talk Python to me is brought to you by Scalable Path. If you're a

30:37 founder or engineering leader, you know how hard it is to find top tier developers while keeping

30:43 costs low. That's where Scalable Path comes in. They're a software staffing company that helps

30:48 you build remote dev teams that just fit. If you're wondering what sets this staffing company apart,

30:54 well, one big differentiator is their approach. They're founded and run by developers. Scalable

31:00 Path understands that finding the right developer is not just about technical skills. It's about

31:06 personality, work ethic, and how well they mesh with your team. Their software architects

31:11 will take the time to understand your vision and needs, and then develop technical challenges for

31:16 the roles you're looking to hire. And these technical tests are conducted live on video by

31:22 senior software developers, so there's no gaming the system. And Scalable Path takes it one step

31:27 further. They evaluate each developer's soft skills like communication, attitude, and work style

31:33 before presenting best suited candidates to you. Scalable Path has built a network of over 35,000

31:40 remote developers. No more endless searches or sleepless nights worrying about the right hire.

31:45 And here's a special offer for Talk Python listeners. You'll get 20% off of your first month.

31:52 So are you ready to scale your dev team and your business? Get started by visiting

31:56 talkpython.fm/scalablepath. The link is in your podcast player's show notes.

32:01 Thank you to Scalable Path for supporting the show.

32:04 So the way you're doing is you're setting everything to medium, and then you're defining

32:09 short and you're defining long instead of defining the three categories.

32:13 So that's like the sort of, you might think this would be a good way to do it.

32:18 And it works, right? As I like to say in my courses, unfortunately this works.

32:21 Right? Like, so you will get the right answer this way. But if you then, like,

32:27 go to the next page here, you'll see that you can just use pd.cut.

32:30 I see.

32:31 Yeah. And then you just say, here are the bins. Here are the labels. Go.

32:36 Well, this goes back to that thing we talked about, like looping over stuff versus just going,

32:42 this is what I want you to do. Do it a hundred percent deep down inside the best you can.

32:47 Don't bother me. Just figure it out for me.

32:49 That's exactly right. And this means that like a whole lot of people have worked on pd.cut

32:55 and have made it work efficiently, way more efficiently than you or I could do in our code.

33:00 Presumably, right? It's not going to be any worse. And it'll probably be a lot better.

33:05 The other thing is you get back and you can see it's sorted at the end.

33:09 What the output of the series you get here. So you feed it a series and you get back a series,

33:14 but then the series looks at first glance, like it's a bunch of strings.

33:18 And so you're going to have short or short, medium, medium, long, long, long,

33:20 but it's actually not strings. It's actually a category, which is pandas' version of an enum.

33:25 So really it's very small because it's just integers being stored there.

33:29 And then those integers are associated with strings. So that's an example of like where

33:33 they have thought through it. And they've basically said, yeah, we're going to make

33:36 this more efficient than you would probably think to do on your own.
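
As a rough sketch of the bins-and-labels idea and the category result just described, assuming made-up distances rather than the book's taxi data:

```python
import pandas as pd

# Hypothetical trip distances in miles.
distances = pd.Series([0.5, 1.9, 2.0, 6.4, 10.0, 18.7], name="trip_distance")

# Bin edges and one label per bin. With the default right=True, each bin
# includes its right edge, so exactly 2.0 is "short" and exactly 10.0 is "medium".
categories = pd.cut(
    distances,
    bins=[0, 2, 10, float("inf")],
    labels=["short", "medium", "long"],
)

print(categories)
print(categories.dtype)      # category, not object/string
print(categories.cat.codes)  # the small integers actually stored
```

The labels look like strings when printed, but the series stores integer codes that point into the short list of categories.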

33:38 Right. Awesome. Well, I would, to be honest, I'd have been happy if I came up with that first solution.

33:43 Just using loc and setting it in there. That's pretty cool. But this cut is super nice.

33:48 Yeah. Just define the boundaries as bins and then off it goes.

33:52 Right. Right. And there's an option there, include_lowest. And so I actually didn't

33:57 know about this for a while. So I'd be like, okay, so think about the bins has to be from

34:00 some small number to some large number. And the question is then, well, wait a second.

34:05 What about that leftmost bin? What about that? Like if it's up to and not including,

34:09 it's like less than, but not less than equal, then how do I do it? So I will always be like,

34:13 well, I'll take the min. I'll do like the series dot min minus one. And then it's guaranteed to

34:18 be lower than that. But no, it turns out the pandas developers thought about this long before

34:22 I did. And there's an option, a keyword argument you can pass, include_lowest equals True. Done.

34:27 Now it's less than or equal as opposed to just less than.
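
A small sketch of what `include_lowest=True` changes, using invented values: by default the leftmost bin edge is excluded, so a value sitting exactly on it falls outside every bin.

```python
import pandas as pd

# Hypothetical distances; 0.0 sits exactly on the leftmost bin edge.
distances = pd.Series([0.0, 1.0, 2.0, 5.0, 12.0])
bins = [0, 2, 10, 20]
labels = ["short", "medium", "long"]

# Default: the first bin is (0, 2], so 0.0 is not in any bin and becomes NaN.
default_cut = pd.cut(distances, bins=bins, labels=labels)

# include_lowest=True closes the left edge of the first bin, so 0.0 is "short".
inclusive_cut = pd.cut(distances, bins=bins, labels=labels, include_lowest=True)

print(default_cut)
print(inclusive_cut)
```

This avoids the workaround of starting the bins at the series minimum minus one.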

34:29 Right. Because all these kinds of things, the boundary conditions are always tricky,

34:34 especially on floating points, right? Oh yes. Oh yes.

34:37 There's probably at least two spacecrafts that have crashed because of this. All right.

34:41 The next one we want to go to is number 12, finding outliers.

34:47 Yeah. So, you know, it's, this is not a book about statistics and I'm not an expert in

34:52 statistics, but there are a whole lot of statistical ideas that permeate working with

34:58 Pandas. And so it's like, you know, mean standard deviation, median understanding sort of how they

35:03 can differ from one another. Right. And especially the whole mean versus median thing where it's so

35:08 easy to have a few outliers, pull your data up or pull your data down and then sort of fool you.
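
To make the mean-versus-median point concrete, here is a tiny sketch with invented numbers in which a single long trip drags the mean far from the median:

```python
import pandas as pd

# Hypothetical trip distances in miles; the last value is an outlier.
distances = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 40.0])

mean_distance = distances.mean()      # pulled upward by the single long trip
median_distance = distances.median()  # barely affected by it

print(mean_distance, median_distance)
```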

35:14 Right. And so often people want to find the opposite. And I've certainly found this,

35:18 especially with our corporate training with, like, cybersecurity people where they're like,

35:22 we are always looking for outliers. Like, you know, who was the user who was logging in at

35:26 unusual times? Who logged in from 20 IP addresses within one minute?

35:32 That's right. That's not a good person. They're not good. I'm telling you that right now.

35:36 They just try to be very, very effective. And so, and so, so this finding outliers is, okay,

35:44 let's find out them, not who was sort of normal, but like who was abnormal, right? Who, who was

35:51 exhibiting, you know, unusual behavior. And so here it's like, okay, let's take our set of numbers

35:58 and let's find out who was more than one standard deviation. It's one or two above the mean or

36:03 below the mean. And let's just find those values and ignore the sort of normal values. And then

36:09 we could talk a little about, you know, IQR, the inter quartile range, which is a standard

36:13 statistical idea that we don't talk about enough, even though it's super useful, right? People came

36:18 up with these ideas a long time ago. Oh, here's a fun fact. John Tukey, who came up with a lot of

36:22 these ideas, he also invented the word bit in computers. So yeah, I was like, wow, what a guy.

36:29 So you can see here then that like, I'm going to take the, you know, trip distance, let's find,

36:34 you know, trip distance. Let's find where the trip distance is less than the trip distance

36:39 at the first, the first quartile minus one and a half times the IQR, right? Meaning let's take the

36:46 distance between the first quartile and the third quartile. That's our inter quartile range. That

36:51 gives us a sense of like, where is the bulk, where the bulk of the numbers, and let's find out who

36:56 is below that or who's above that. And so it's not surprising if we're looking for taxi distances,

37:02 you're not going to have a lot of outliers that are very low because you can't go below zero miles

37:07 in your taxi, but you can go very high. You can go very large. And so looking for those trips that

37:14 are greater than the 75th percentile plus one and a half times the IQR, we actually find, you know,

37:20 I see here, I found me, you know, about 1800 taxi rides out of the 10,000. So about 19% of the taxi

37:25 rides are much, much longer than the mean ride. And you can use this, then you can imagine people

37:30 in like the New York City Taxi and Limousine Commission saying, oh, we can use this to plan to charge to,

37:37 you know, send taxis at different places. - Sure, have a special program for long distance

37:41 stuff or whatever. Yeah. - Right. Right. Or if you're Uber, you know where to place, they actually used to have the geography, the longitude and latitude of where

37:48 people were picked up and dropped off and they got rid of that. And I'm sure both for privacy

37:52 reasons and because Uber and Lyft can look at that data and say, oh, well, we know now when to send

37:57 cars where. - Well, although while true, I imagine that that data has been downloaded and archived

38:04 with geolocation for certain people, right? - Oh yeah, you can still get all that old data.

38:09 It's just the newer stuff that they don't do. And they still do it by neighborhood.

38:12 - I see. - I also wonder like how precise that longitude and latitude was, like you would probably identify which home like was going to which other

38:19 home. - Yeah, now you're getting into a problem.

38:22 - I can only imagine how bad that would look. - Yeah, now you're getting into a shady spot or

38:25 certain types of establishments, you know, that could be fraught with consequences for the people

38:31 whose home you've already identified where they were picked up at or all sorts of stuff, right?

38:34 - Exactly. - All right, fair.

38:35 - Exactly. - Fair. That's way more important than whether Uber can place their vehicles strategically, you know?

38:40 - Right. Right. And they might have some data on their own too, you know.

38:44 - Nice. Okay, so you can use this IQR, interquartile range feature to pull that out here,

38:50 right? - Right. Right. And, you know, we can pull that out and then do simple multiplication. Right at the end of the day,

38:56 what we're doing is we're pulling out this data and then just doing very, very simple statistical

39:01 analysis of it. Just to sort of say, how many outliers do we have? How many low outliers? How

39:05 many high outliers? And you can see this, like at the end of the day, it's not that complicated in

39:10 terms of math. Sometimes people are like, "Well, how much math do I have to know to learn this

39:14 stuff?" I'm like, "I can get through it. I promise you. Not that much. It's not that hard. A few

39:18 basic ideas and you're basically set." - Yeah, very cool. All right, onto the next

39:22 one, which is pandemic taxis. That's a different kind of taxi.
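
Looking back at the outliers exercise, the IQR rule described above (below the first quartile minus 1.5 times the IQR, or above the third quartile plus 1.5 times the IQR) can be sketched like this. The distances are invented, not taken from the taxi data.

```python
import pandas as pd

# Hypothetical trip distances in miles, with one very long ride.
distances = pd.Series(
    [0.5, 1.0, 1.2, 1.5, 2.0, 2.2, 2.5, 3.0, 3.5, 25.0], name="trip_distance"
)

q1 = distances.quantile(0.25)  # first quartile (25th percentile)
q3 = distances.quantile(0.75)  # third quartile (75th percentile)
iqr = q3 - q1                  # interquartile range

# Boolean masks select the values outside the two fences.
low_outliers = distances.loc[distances < q1 - 1.5 * iqr]
high_outliers = distances.loc[distances > q3 + 1.5 * iqr]

print(len(low_outliers), "low outliers;", len(high_outliers), "high outliers")
```

As in the discussion, the low side comes back empty because a distance cannot go below zero, while the high side catches the unusually long ride.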

39:26 - So this was actually, so the two or so that we looked at so far were with my tiny little

39:33 10,000 taxi ride sample from a few years ago. I then took taxi rides from January and July of

39:40 2019 and 2020. So four months there. And the question was, I think I actually only looked at

39:45 July in this exercise. Yeah, July, 2019, July, 2020, comparing taxi rides. Now, just to remind

39:52 you, July, 2020, not a great time for tourism in New York City or anywhere else. And so the

39:58 question was, what differences do we see between July, 2019 and July, 2020? How much did it go down

40:07 in terms of taxi use? How much less did people pay? And then my favorite part of this was,

40:12 did people use cash more in 2020 or less in 2020? And so you see, first of all, it's like a decline

40:20 of some like 80% in terms of taxi rides from 2019 to 2020. Again, not a huge surprise to anyone who

40:26 lived through the pandemic and saw what was going on then. And so my gut feeling was, well, no one

40:31 wanted to touch anyone else or touch anything that anyone else had touched. So clearly people would

40:36 have used credit cards much more. And no, it turns out-

40:39 There's that screen, Reuven. You got to touch the okay. That drove me crazy during the pandemic.

40:44 It's like, I'm going to just do touchless. And then you sign it or you okay. You're like,

40:51 I never thought of that. I guess in Israel, you just do the tapping of your card. You didn't have

40:57 to sign anything, but huh. Yeah. The US, they're like, oh, hit okay to confirm. It's like, okay. I thought we had just escaped me touching this disease-ridden

41:05 thing, but no. Also, you have to indicate how much of a tip you want to leave, but that's a

41:09 whole separate thing. Yeah. We're at the car wash. Don't you want to tip the car? No, I don't want

41:15 to tip the car wash. That's the other thing is you should have looked at tips. That could have

41:19 been interesting actually as well. I think, no, I think I did. I definitely looked at tips,

41:24 but I can't remember what it was like- Like around the pandemic. Yeah. Because I know a lot

41:27 of people gave tips to say thanks. And I was wondering if that would show up. Thanks for being

41:31 out here amongst the diseased ones. So it turns out that people used cash more

41:37 during the pandemic than credit. And I raised this with my family. I was like, what the heck?

41:42 I think it was my sister who said, well, everyone who would use, the people who were higher income

41:48 earners who use their credit cards more, they were staying home. It's the people who were forced to

41:53 go to work who use cash more, who were taking the taxis because they didn't want to take the subway.

41:57 Like there was sort of an in-between sort of thing. So you get all these wild pieces of data

42:03 and analysis that you can do. And here it was not just pulling out all this data, but one of the big

42:08 things was also, I want to read in a lot of data. I want to read in from two different CSV files.

42:14 Now what do I do? Because it's very obvious. You read in a CSV file, you get a data frame. You read

42:18 another CSV file, you get a data frame. But I want to treat those as one. And I want to then be able

42:23 to distinguish between them. How do I do that? And so sure enough, Pandas has a concat method.

42:28 And I use concat now all the time. I read in multiple CSV files into multiple data frames.

42:35 I concat them together and you can concat them either horizontally or vertically, depending on

42:38 what you like, what you want to do. And- I see. If they're like, their columns are the same,

42:43 then you might do it vertically. But if you want to augment it potentially, because now we have the

42:48 sale percentage data, but it goes along with the columns exactly, or the rows exactly.
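
Here is a minimal sketch of both directions of `pd.concat`. The frames and column names are invented, and using `keys` to remember which frame each row came from is just one option (adding a year column before concatenating would work too).

```python
import pandas as pd

# Hypothetical stand-ins for two monthly CSV files with identical columns.
july_2019 = pd.DataFrame({"total_amount": [12.5, 30.0], "payment_type": [1, 2]})
july_2020 = pd.DataFrame({"total_amount": [9.0], "payment_type": [2]})

# Vertical (the default): stack rows; keys add an outer index level
# recording the source of each row.
both = pd.concat([july_2019, july_2020], keys=[2019, 2020], names=["year", "row"])

# The source label can then be used to compare the two periods.
rides_per_year = both.groupby(level="year").size()

# Horizontal: rows line up on the index and new columns are added alongside.
extra = pd.DataFrame({"tip_amount": [2.0, 0.0]})
wider = pd.concat([july_2019, extra], axis="columns")

print(rides_per_year)
print(wider)
```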

42:52 Precisely. Precisely. And so another nice way to do this also is not just read this one and read

43:00 that one, but you can use a list comprehension with something like glob. So glob.glob on star.csv,

43:06 get back a list of data frames and then just hand that to pd.concat. And so that's where

43:12 knowing Python and being able to pull that out and use those techniques can really, really come in

43:16 handy. Yeah. Earlier you mentioned, I want to go back to it real quick. You mentioned that learn

43:20 Python first and then Pandas. A lot of times when I think learn Python, I think, okay, well,

43:25 learn the language and then learn the standard library to a good degree, and then begin to chop

43:31 away at the half million things on PyPI that are interesting. And it's like a never ending sort of

43:37 thing. And then there's of course, this joke, like I learned Python, it was a good weekend or

43:41 something like that. Right? Like, how do you square these two things? But I feel like the

43:45 amount of Python that you need to learn is mostly centered around the language and really is

43:50 actually not that much. And then you kind of learn the Pandas way, right? Theoretically,

43:55 if you wanted to. Right. Right. For sure. Look, I am a big fan of objects and classes and all

44:00 that stuff. But when I'm talking to people who specifically want to use Python for data analytics

44:07 with Pandas, I say to them, objects will help you, but they're not going to be crucial. Like

44:12 if there's a part of the course that you want to sort of drop, save some money, save some time,

44:16 then that's a place where we can save because the odds of an analyst using Pandas writing their own

44:21 classes are pretty slim. I'll give them some perspective on how these classes work. Sure. But

44:27 I don't think that like they need to learn that. So and like a lot of the standard library there,

44:31 it's hard to say. Right. So as I said, I love glob, right? Globbing is fantastic.
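
The glob pattern mentioned a few minutes earlier (a list comprehension over `glob.glob`, handed to `pd.concat`) might look roughly like this. The file names and contents are invented, and the files are written to a temporary directory only so the sketch runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as data_dir:
    # Write two tiny CSV files so there is something to find.
    pd.DataFrame({"fare": [10.0, 12.0]}).to_csv(
        os.path.join(data_dir, "a.csv"), index=False
    )
    pd.DataFrame({"fare": [8.0]}).to_csv(
        os.path.join(data_dir, "b.csv"), index=False
    )

    # One data frame per matching file; sorted() makes the order predictable.
    pattern = os.path.join(data_dir, "*.csv")
    frames = [pd.read_csv(filename) for filename in sorted(glob.glob(pattern))]

# Hand the whole list to pd.concat to get a single frame.
all_rides = pd.concat(frames, ignore_index=True)
print(all_rides)
```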

44:36 But that's definitely not in like my intro class. I would say, oh, by the way. Yeah. I would bet

44:41 there's probably 10 modules that if you knew, you might not need to learn more for six months,

44:47 you know, doing Jupyter Pandas type of work, right? Like pathlib and a couple of things like

44:52 that. Right. That's right. That's right. I mean, I just made up that number. Yeah. It's mostly like

44:57 just being able to sort of work with the core data structures, understanding the syntax and how it

45:01 works and even like defining some simple functions. Right. I think most people using Pandas

45:06 are not at the end going to define functions, although as I've gone on with my use of Pandas,

45:11 I see, you know, Lambda, this is where it's at. Like knowing how to use it really, really helps.

45:17 So there I do sort of go back with the Pandas people. I'm like, OK, this is going to be super

45:21 weird. We're going to talk about anonymous functions now. And then and then like I typically

45:26 do that with a more advanced part of Pandas groups, not with the like introductory ones.

45:30 But if you can sort of wrap your mind around that, then it does help quite a bit. It does.

45:34 So it takes what would be multiple step things and lets you turn it into one of those chained

45:39 expressions because instead of defining a function somewhere else, you can just put it in line as

45:43 part of a Lambda and just keep on going, you know. Great. That's right. And the fact that so it's

45:48 especially useful, I find, so we mentioned dot loc earlier. So dot loc allows you to choose

45:52 rows. Start with that. And so I can pass it a Lambda and that Lambda then gets the data frame

45:59 you're working on as an argument. And then whatever it returns, if it returns a Boolean

46:05 series, then it allows you to filter. Well, you can then have multiple dot locs and multiple

46:10 Lambdas in a row to do successive filtering. And yes, that is less efficient than doing it all in

46:15 one fell swoop. But boy, oh boy, it's easier to think about. It's easier to use. So you just like

46:20 whittle it down, whittle it down, whittle it down each line with its own Lambda. And so understanding

46:25 how to do that, understanding why it's important that inside the lambda you're using the

46:30 temporary parameter from the Lambda as opposed to the overall variable for your data frame,

46:35 because you're chaining it, that's useful as well. Well, 100%, then you can also like comment out

46:41 certain lines at the end and see what the intermediate values are. And so you say it's

46:44 less efficient. It's less computationally efficient. It might be more efficient as a

46:49 human being trying to understand what the heck is happening, right? That is spot on. I like,

46:54 I often tell people that I think Python is a language for an age in which computers are cheap

47:00 and people are expensive. Because like, right, our efficiency is the big bottleneck in terms of time,

47:07 in terms of money. So right, if it takes my computer a few seconds more, who the heck cares?

47:11 Right. My M2 pro doesn't care if it's like a tiny bit more, I don't care.

47:16 That's right. That's right.
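
The successive `.loc` filtering with lambdas discussed above can be sketched as follows, with an invented data frame. Each lambda receives the frame as it exists at that point in the chain, which is why it uses its own parameter (`df_` here, a name chosen for this example) rather than the outer variable.

```python
import pandas as pd

# Hypothetical ride data, invented for illustration.
rides = pd.DataFrame(
    {
        "passenger_count": [1, 3, 2, 1, 5],
        "trip_distance": [0.4, 6.0, 12.0, 3.1, 8.5],
        "total_amount": [5.0, 22.0, 41.0, 13.5, 30.0],
    }
)

# Whittle the data down one condition per line. Commenting out a line
# shows the intermediate result of the remaining steps.
filtered = (
    rides
    .loc[lambda df_: df_["trip_distance"] > 1]
    .loc[lambda df_: df_["passenger_count"] > 1]
    .loc[lambda df_: df_["total_amount"] < 40]
)

print(filtered)
```

Combining the three conditions into one mask with `&` would do less work, at the cost of a denser expression.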

47:18 Or my cluster of GPUs, take your pick. Although at that point it starts to cost real money to

47:22 burn those things, you know? Right. But still, I mean, depending, look, if you see that your computation is taking a long time, okay, so then you sort of find that and you

47:31 improve that. Yeah. That's a really good point. Like a lot of things are, don't worry about that

47:35 until it's actually going to become a problem. It looks slow. It turns out it's probably a blink

47:38 of an eye. All right. Let's keep moving on. We're getting short on time and we have a plethora of

47:44 things to work with. I kind of want to go to wine words. Wine words. Wine words. All right.

47:50 Let me remember what exercise that is. That is 37. All right.

47:54 Yeah. So people think of pandas, not wrongly, as being great at working with numbers,

48:00 but it turns out that it's fantastic at working with two other kinds of data types. One is strings

48:07 and one is dates and time. And you can get a ton, a ton out of analysis working with these if you

48:13 know how to work with it. So there is, I forget where it is, like there's a machine learning,

48:20 like archive of data sets. And one of them is 150,000 wine reviews from Wine Magazine.

48:28 Oh, wow. Is it Kaggle maybe or is it somewhere else?

48:31 I think Kaggle has a version of it, but I think it's elsewhere.

48:35 Wine Mag, 150,000 reviews. Okay. Beautiful.

48:39 I don't know. I'll find out where it is. Anyway.

48:41 No worries.

48:41 So I said, okay, let's find out. Because you drink a bottle of wine and you read the back

48:47 and you sort of like, you know, roll your eyes at what they've written. Although I'll put in a plug,

48:51 I read this fantastic book a few years ago called Cork Dork by this journalist who decided to become

48:57 a Sommelier and she took the exam and her journey toward there, she convinced me these words actually

49:03 have real meaning and people are very serious about it. So I will not roll my eyes quite as

49:07 much anymore.

49:07 I'm getting hints of nutmeg, but they're troubling.

49:10 Yes. Yes.

49:12 They don't belong.

49:13 So the question was, okay, what about these reviews of wine? What words are people using?

49:19 And are they using certain words more with California wines and certain words more with

49:24 French wines or certain words more with red or with white or rosé? And so we can then take this

49:30 text, break it apart and search for it. We can search for it using plain old pandas stuff. We can

49:35 search for it using regular expressions, which it has built in and works very well.

49:39 So here, so let's see some analysis on the words. So what are the 10 most common words for red wine?

49:47 Right. So how do we do that? Well, we have to take the description and break it apart.

49:51 And if you're used to using just plain old Python, you're like, oh, well, I guess I'll

49:55 break that into a list. But now what? Now I have a series of lists. Now what do I do?

49:59 And so one of the key methods to know here is something called explode. And explode is

50:05 let's take a series of lists and turn that into a very, very, very long series. And so basically

50:12 each element in the list becomes an element in the series and they all share an index. So you

50:16 know where they're originally from and then you know the world's your oyster. So you can get rid

50:21 of punctuation. Right. So I have s dot str dot lower. Okay. So we lowercase everything. Then

50:26 dot str dot split, and we split into a list. Then we say explode, get a long series. Then we once

50:31 again run strip, get rid of punctuation. And now we can use isin, which is yet another fantastic

50:37 pandas method to say, are these words in like find the lines where these words are located.

50:43 And then we can just do a value counts. Definitely my favorite method. How often does this thing show

50:48 up? And then we use head. Ta-da. We've got the 10 most common, 10 most common words there.
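
The pipeline just walked through (lowercase, split, explode, strip punctuation, `isin`, `value_counts`, `head`) could be sketched like this. The reviews and the word list are invented, and the function is written for this example rather than copied from the book.

```python
import pandas as pd

# Hypothetical review texts standing in for the wine descriptions.
descriptions = pd.Series(
    [
        "Ripe cherry and soft tannins.",
        "Crisp, bright citrus; cherry finish!",
        "Soft oak, ripe plum and cherry.",
    ],
    name="description",
)

# Hypothetical list of wine words to look for.
wine_words = ["cherry", "ripe", "soft", "citrus", "plum", "oak"]


def top_10_words(s, words):
    """Count how often each of the given words appears in a series of texts."""
    all_words = (
        s
        .str.lower()           # lowercase everything
        .str.split()           # each text becomes a list of words
        .explode()             # one word per row, keeping the original index
        .str.strip(",.;:!?")   # remove surrounding punctuation
    )
    # Keep only the words of interest; value_counts sorts most common first.
    return all_words.loc[all_words.isin(words)].value_counts().head(10)


top_words = top_10_words(descriptions, wine_words)
print(top_words)
```

To compare regions, the same function can be called on a selection such as the description column for rows where the country is France.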

50:53 Cause the most common, the counts, the value counts, sorts probably most,

50:57 most common to least common, right? Yeah, yeah, exactly. So value counts,

51:01 not only does it count how often something shows up, but exactly. It sorts it from most common to

51:05 least common. So then head, yeah. If you want, yeah. However much you take there and then that's,

51:10 that's the one. So those are the popular ones. That's right. That's right. So here I have like,

51:14 you know, the page we're looking at. So we get find where the country is France. So that's

51:18 the row selector and we want the column to be description. So we only want the description.

51:22 And then we have our function top 10 words, which did what I just described. We're going to take

51:26 these common wine words, lowercase, then get rid of the punctuation, so forth, pass it in. And we

51:30 get back our top words and we find out what words are used to, you know, are associated with French

51:35 wines as opposed to California wines and so on and so forth. Excellent. Yeah. Very, very cool example.

51:40 All right. We, I think we've got time for two more to go through. You want to pick two that

51:45 are popular or do you want to leave it? Let's see. Let's see. Let's do, maybe 32

51:53 multi-city. Actually. Yeah. Multi best tippers. Yeah. You mentioned, Oh, you're going to open up

52:01 a whole, this is going to be a whole thing. Best tippers. Number 42. Let's go. No, no, no judgment

52:07 here, folks. All right. So the question is like, so we'll try to understand as I read here,

52:11 when people tip their taxi drivers more generously. All right. So the question was,

52:15 did they tip better? And we looked at 2019 before the pandemic in January and July. So do they tip

52:21 more in the winter or do they tip more in the summer? And so this involves several things.

52:26 First of all, it involved using dates and time. Again, one of these things that pandas is just

52:31 amazing at that people are not aware of all the flexibility you have there. Another thing is how

52:37 easy it is to create a new column, right? So we spoke before a bit, how you can use broadcasting

52:41 to just multiply. I have a column multiplied by something or a column adding to another column,

52:46 but you can create a new column just by assigning to it. Again, it's sort of like assigning to a

52:49 dictionary. It just then is there. And so you can calculate the percentage that people tip,

52:54 put that in a new column and then say, well, let's now group by the month and let's find out.

53:00 - Take the mean or the max or some sort of deviation.

53:03 - Exactly. Exactly. And we can find out like, you know, whether people tip more on average

53:09 in January or in July. I honestly don't remember what the answer is, which will hopefully not tick

53:17 off too many readers. Here we go. Let's see what it says. Oh, here we go. Go back one page there.

53:22 So 32% of taxi riders in New York don't tip at all. That's right. That was a surprising thing to me.

53:27 And then I don't remember exactly what it was for month.
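
A rough sketch of the steps in this exercise: create a tip-percentage column by assignment, then group by the pickup month. The rows are invented, and the column names are assumed to follow the public NYC taxi files; treat both as assumptions.

```python
import pandas as pd

# Hypothetical rides from January and July 2019.
rides = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            [
                "2019-01-05 08:15",
                "2019-01-20 23:40",
                "2019-07-04 12:00",
                "2019-07-18 17:30",
            ]
        ),
        "fare_amount": [10.0, 20.0, 10.0, 40.0],
        "tip_amount": [2.0, 0.0, 1.0, 10.0],
    }
)

# Assigning to a new column name creates the column, much like a dict key.
rides["tip_percentage"] = rides["tip_amount"] / rides["fare_amount"] * 100

# Group by the month extracted from the datetime column, then average.
mean_tip_by_month = rides.groupby(rides["tpep_pickup_datetime"].dt.month)[
    "tip_percentage"
].mean()

# The mean of a boolean series is the fraction of True values.
share_no_tip = (rides["tip_amount"] == 0).mean()

print(mean_tip_by_month)
print(share_no_tip)
```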

53:31 - And you know, you're talking summer versus winter. There's probably a tourism

53:35 angle, locals versus tourists. I mean, I know people go to New York when it's

53:39 not the summer, but not as much, I imagine.

53:41 - That's right. That's right. There are all these different factors that like, you know, come into

53:46 it. - Yeah. One of the things, kind of takeaways I'm feeling as we were talking about all these

53:49 is like, there's a lot of interesting questions that can be asked and answered really quickly

53:54 for like sociology and urban planning and all kinds of interesting questions that don't feel

54:00 like programmer questions. - Yes. So, you know, I mean, I talked to a lot of people, you probably

54:05 do too, who are like interested in advancing their careers with Python and they're like, well, you

54:11 know, I don't have a computer science background. Can I get a job as a programmer? And their vision

54:16 of a programmer is either someone working at a startup or at like one of the big companies,

54:20 Google, Amazon, Facebook, and so forth. I say to them, look, there are an awful lot of people who

54:26 have great jobs working at supermarket chains and insurance companies in the backroom analyzing data

54:33 and they are crucial, but we don't think of them as programmers and in governments and in cities

54:38 and weather forecasting, like everyone nowadays is using data and collecting and analyzing it

54:43 and having these skills either gives you abilities to do your job better or gives you the ability to

54:49 move into a new job that you couldn't have done before that these places are desperately looking

54:54 to fill. - Yeah, I totally agree. I can't remember which episode it was. I didn't name it right. So

55:00 there's an episode I did quite a while ago about R programmers and R data scientists and Python

55:05 data scientists working together. And I think it was the research arm of Kroger maybe that had like

55:10 200 data scientists. - Wow. - That's a proper group of data scientists. I mean, a lot of times data

55:17 scientists, I feel like there's a couple of them for a company compared to a software team or

55:21 something. - No, and I tell people also, it's worth learning to write and it's worth learning to

55:27 speak, not because you're going to be like a Pulitzer prize winning writer, not because you're

55:31 going to be like whatever prize you would get for like speaking well, just because having these

55:36 skills makes you more effective at your job. And people who can write a program to suck up some

55:40 data, analyze it and come back with a result, especially if it's like a public data set that

55:45 has something to do with what they're working on, they are so much more valuable to their company

55:50 than they would have been otherwise. - Oh yeah, absolutely. So I found it because I have a search

55:55 engine on Talk Python that searches the transcripts and everything else. So if you want to know

55:59 something, people, about historical shows, when you hit search, it's not just like the show notes.

56:04 So "Scaling data science across Python and R," episode 236, with Ethan Swan and Bradley Boehmke,

56:12 and the company is 84.51 degrees. So anyway, that's- - Oh, I've seen that before.

56:18 - Yeah, yeah, yeah. That's what it was. So people can check that out if they're

56:21 interested, but- - Very, very cool.

56:22 - Yeah, super cool. Okay, last one. What are we doing?

56:25 - Let's do cities. - Cities. What number is it?

56:29 - So 43. - Okay, oh, that's right.

56:31 - So I found a few years ago this JSON file containing the thousand largest cities in the

56:38 United States. So first of all, good to work with JSON because people need to know how to work with

56:44 that kind of data. Second of all, okay, let's not look at the numbers so much as let's do some

56:50 plotting. And so I'm going to make enemies now. I really cannot handle matplotlib. I find it,

56:57 it's like so incredibly powerful. And anytime I want to do even the tiniest thing with it,

57:02 I have to look up the documentation and remind myself how it works. Maybe it's because I don't

57:06 use it enough. And so I just use the pandas interface for plotting 95% of the time. Let's

57:12 call it 90% of the time. The rest of the time I usually use Seaborn. So I said, okay, let's see

57:17 if we can do some plotting here of these largest cities. So for example, let's see growth in

57:22 Pennsylvania cities, like which cities in Pennsylvania, oh, I'm sorry, like let's do a

57:27 bar plot. How many large cities are in each state? Okay, well, that's a group by, right? So you have

57:31 to know how to do grouping. You have to know what you're grouping on and how to use count as opposed

57:35 to mean, right? That can't even exist. But then if I want to see a bar plot where it's sorted from

57:40 smallest to largest, you got to know how to do sorting. And then we can do a bar plot. So sure

57:44 enough, you do a .plot.bar on your series, kablam, you have a bar plot right there. And we

57:50 just get it in like, you know, it's nicely sorted there. Quick takeaways. Holy cow, does California

57:55 have more big cities than I realized? And you know, where's New York and New Jersey? Like it's

58:00 way down the line. You think of those as having like pretty metropolis type places. Massachusetts.
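As a rough sketch of the "how many large cities are in each state" exercise: read the JSON, group by state, count, sort, and plot. The field names `city` and `state` and the placeholder rows are assumptions for illustration, not the contents of the actual file.

```python
import io

import pandas as pd

# Placeholder records standing in for the JSON file of the 1,000 largest cities.
# Field names and values are assumptions made for this example.
cities_json = io.StringIO("""[
  {"city": "City A", "state": "California"},
  {"city": "City B", "state": "California"},
  {"city": "City C", "state": "California"},
  {"city": "City D", "state": "Texas"},
  {"city": "City E", "state": "Texas"},
  {"city": "City F", "state": "New York"}
]""")

df = pd.read_json(cities_json)

# Group on the state, count the cities in each group (count, not mean),
# then sort from smallest to largest so the bars come out in order.
cities_per_state = df.groupby("state")["city"].count().sort_values()
print(cities_per_state)


def plot_cities_per_state(counts):
    """Draw the bar plot through the pandas plotting interface (needs matplotlib)."""
    return counts.plot.bar()
```

Calling `plot_cities_per_state(cities_per_state)` in a notebook would draw the sorted bars; the function is left uncalled here so the sketch runs without matplotlib installed.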

58:06 That's right. That's right. But it's how many cities, right? So, right, right, right. That's

58:12 like the fun. So keep going. So like do a bar plot of growth in Pennsylvania. So how are we going to

58:17 do that? Well, wait a second. If we want to do growth, it's a percentage, but it's written in

58:21 the JSON as number and then percent sign. That's a string. So we're going to have to get rid of the

58:26 percent sign. And then we're going to have to change the dtype from string to a floating point.

58:31 And then we can sort it and then we can graph it. And so changing the dtype basically is a

58:38 deep-down-inside-of-pandas parse-as-an-integer operation, right? Rather than looping over and

58:44 parsing or whatever. Yeah. Yeah. Yeah. Like basically, I mean, it has to do the looping.

58:48 It's doing the looping there at the below sea level. Yeah. Below sea level. Yeah. Down by the

58:53 Dead Sea. I see. Okay. Got it. And so you see here like this bar plot. Now we have a whole bunch of

59:00 cities that have gone down in size and a bunch of things going up in size and pandas is very happy

59:04 to show the bar plot there again, sorted so that we see it from the greatest sort of shrinkage to

59:12 the greatest growth. Right. With Pittsburgh having the most shrinkage and Allentown having the most

59:17 growth. I would not guess that. I would not have either given that like my sister-in-law and

59:21 brother-in-law live in Allentown and they're like, "Oh, Allentown." But I think that's a common thing

59:27 that people think there, even though it seems like a perfectly great place to be. Yeah. Sure.
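The Pennsylvania growth plot discussed above hinges on turning strings like "4.5%" into numbers. One possible way to do it, under the assumption that every value is a well-formed number followed by a percent sign: strip the sign with the `.str` accessor, then change the dtype with `astype`. The column name `growth_from_2000_to_2013` and the rows below are placeholders, not real figures.

```python
import pandas as pd

# Placeholder rows; the growth values are invented and stored as strings,
# the way a number followed by a percent sign arrives from JSON.
df = pd.DataFrame({
    "city": ["City A", "City B", "City C", "City D"],
    "state": ["Pennsylvania", "Pennsylvania", "Pennsylvania", "Ohio"],
    "growth_from_2000_to_2013": ["-8.3%", "4.5%", "1.2%", "3.0%"],
})

# Remove the trailing percent sign, then convert the dtype from string to float.
# pandas does the per-element parsing internally; no explicit Python loop is needed.
df["growth"] = df["growth_from_2000_to_2013"].str.rstrip("%").astype(float)

# Keep only Pennsylvania, index by city name, and sort from the greatest
# shrinkage to the greatest growth.
pa_growth = (
    df.loc[df["state"] == "Pennsylvania"]
    .set_index("city")["growth"]
    .sort_values()
)
print(pa_growth)


def plot_growth(growth):
    """Bar plot of the sorted growth values (needs matplotlib)."""
    return growth.plot.bar()
```

Negative values simply draw as bars below zero, which is how shrinking and growing cities end up on the same chart.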

59:31 Funny. Awesome. Well, there's a lot of cool takeaways from this.

59:35 One more. Let me show you one. Go to the last one here in this exercise.

59:39 Okay. You got it.

59:40 So one of the most important types of plots you can do in data analysis is a scatter plot. And

59:47 you take your data frame and you say, "Give me the plot with based on the X axis should be this,

59:53 the Y axis should be that." Well, if we do a scatter plot based on longitude and latitude

59:58 of the thousand largest US cities, you basically come up with a map of the US.
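A minimal sketch of that scatter plot, assuming the data frame has numeric `longitude` and `latitude` columns. The four rows use rounded, approximate coordinates purely for illustration; with the full thousand cities the outline of the country starts to appear.

```python
import pandas as pd

# A few cities with rounded, approximate coordinates (illustrative only).
df = pd.DataFrame({
    "city": ["New York", "Los Angeles", "Chicago", "Houston"],
    "latitude": [40.7, 34.1, 41.9, 29.8],
    "longitude": [-74.0, -118.2, -87.6, -95.4],
})


def plot_city_map(cities):
    """Scatter plot with longitude on the x axis and latitude on the y axis.

    Needs matplotlib, which pandas uses as its default plotting backend.
    """
    return cities.plot.scatter(x="longitude", y="latitude")
```

Longitude goes on the x axis because it runs east to west; swapping the two axes would turn the map on its side.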

01:00:02 Yeah. How interesting. The other thing that you realize is California is technically bigger than

01:00:07 a lot of other states. So that also accounts for why. Oh, yes.

01:00:10 Yeah. Interesting. But there's a lot of people in California. Good place to be.

01:00:13 There are a lot of people there for sure.

01:00:14 If they're willing to pay the sunshine tax, it's nice there.

01:00:17 Cool. All right, Reuven. This is really, really excellent. I know you want to give a shout out to

01:00:24 a couple of exercises from Bamboo Weekly. So I'll include those in the show notes as well.

01:00:29 People can pick those and jump over and look at them as well.

01:00:32 That's fine. That's great. Yeah. We went through a lot here just during this time.

01:00:36 Yeah, for sure. As I've discovered, there's basically, I mean, there is an infinite number of these sorts of things that you can do and different

01:00:43 permutations of various sorts. I mean, we didn't even talk about multi-indexing, which opens up

01:00:49 literally a whole new dimension of this stuff. But it's great fun. It's pretty good fun.

01:00:53 And a lot of the Pandas 2 stuff coming on changes the foundation and opens up more

01:00:57 possibilities still. Absolutely.

01:00:59 All right. Well, final thoughts, final call to action. What do we got here? What do you say?

01:01:03 Well, call to action. I mean, you can take a look at my overall courses and so forth at

01:01:09 lernerpython.com. Or if you just want to improve your Pandas knowledge, you can look at

01:01:13 bambooweekly.com, the newsletter there. Does it cost money or does it just cost an email?

01:01:18 So it is a paid newsletter, but the first two questions and answers every week are free.

01:01:23 So it's usually between five and nine questions each week. But even if you don't pay,

01:01:30 I still want people to get some Pandas learning improvement practice so they can totally do that.

01:01:36 Excellent. All right. Yeah. And check out the book, obviously.

01:01:38 Oh, yeah. That too. Pandasworkout.com. I'll pull this up.

01:01:42 It's actually pretty thick.

01:01:44 Yes, it is.

01:01:45 At the end, I was like, "Oh, that's why it took a long time." Also procrastination.

01:01:52 I'd love to hear from people if they have... I always love to hear interesting problems,

01:01:59 data sets, issues that people are experiencing just so I can figure out what are the next

01:02:04 directions for me to explore that I can then try to help people with as well.

01:02:08 Awesome. Well, I appreciate all your help coming on the show, sharing your knowledge, and just riffing on Pandas together. It was a lot of fun.

01:02:16 Excellent. My great pleasure.

01:02:17 Yeah. Bye.

01:02:18 All right. Bye-bye.

01:02:20 Bye.

01:02:20 This has been another episode of Talk Python to Me. Thank you to our sponsors. Be sure to check

01:02:25 out what they're offering. It really helps support the show. Take some stress out of your life. Get

01:02:30 notified immediately about errors and performance issues in your web or mobile applications with

01:02:35 Sentry. Just visit talkpython.fm/sentry and get started for free. And be sure to use the promo

01:02:42 code TALKPYTHON, all one word. This episode is brought to you by Scalable Path. If you're a

01:02:47 founder or engineering leader, you know how hard it is to find top-tier developers while keeping

01:02:52 costs low. Scalable Path is a software staffing company that helps you build remote dev teams

01:02:58 that just fit. Build your team at talkpython.fm/scalablepath. Want to level up your Python?

01:03:05 We have one of the largest catalogs of Python video courses over at Talk Python. Our content

01:03:09 ranges from true beginners to deeply advanced topics like memory and async. And best of all,

01:03:15 there's not a subscription in sight. Check it out for yourself at training.talkpython.fm.

01:03:19 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.

01:03:24 We should be right at the top. You can also find the iTunes feed at /iTunes, the Google Play feed

01:03:30 at /play, and the Direct RSS feed at /rss on talkpython.fm. We're live streaming most of

01:03:37 our recordings these days. If you want to be part of the show and have your comments featured on

01:03:41 the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube. This is your

01:03:47 host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and

01:03:51 write some Python code.
