00:00 If you want to get better at something, oftentimes the path is pretty clear.
00:03 If you want to get better at swimming, you go to the pool and practice your strokes
00:07 and put in time doing the laps. Want to get better at mountain biking?
00:10 Hit the trails and work on drills focusing on different aspects of riding.
00:14 You can do the same for programming.
00:16 Reuven Lerner is back on the podcast to talk about his book, Pandas Workout.
00:21 We dive into strategies for learning pandas in Python, as well as some of his workout exercises.
00:27 This is Talk Python to Me, episode 471, recorded July 7th, 2024.
00:33 Welcome to Talk Python to Me, a weekly podcast on Python.
00:50 This is your host, Michael Kennedy.
00:52 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both accounts over at mastodon.org.
01:00 And keep up with the show and listen to over nine years of episodes at talkpython.fm.
01:05 If you want to be part of our live episodes, you can find the live streams over on YouTube.
01:10 Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified about upcoming shows.
01:16 This episode is brought to you by Sentry.
01:18 Don't let those errors go unnoticed. Use Sentry like we do here at Talk Python.
01:22 Sign up at talkpython.fm/sentry.
01:25 And it's also brought to you by Scalable Path.
01:28 If you're a founder or engineering leader, you know how hard it is to find top tier developers while keeping costs low.
01:34 Scalable Path is a software staffing company that helps you build remote dev teams that just fit.
01:40 Build your team at talkpython.fm/scalablepath.
01:44 Before we jump into the interview, I want to let you know that we still have some spots left in my
01:49 Code in a Castle event. If you're looking to learn some of the premier frameworks and techniques in
01:54 Python, and you'd like to have a bucket list type of experience while doing so,
01:59 then check out talkpython.fm/castle.
02:01 In October, I'll be running a six day Python course for an intimate audience in a villa in
02:08 Tuscany. Half the time we'll be learning Python and the other half will be exploring the best of
02:13 what Italy has to offer. Check out the course outline, the excursions and all the details at
02:17 talkpython.fm/castle. Or if you'd like to just shoot me an email, michael@talkpython.fm or find
02:25 me on the socials and I'm happy to talk about it. Hope to see you there.
02:27 Reuven, welcome back to Talk Python To Me. How are you doing?
02:32 I'm doing great. Great to be back here. Nice to see you.
02:34 Yeah, it's great to see you as well. I'm a little concerned though.
02:38 There's some possibilities that maybe my Facebook ads are going to get messed up.
02:42 We are talking about pandas and pandas internationally. And I heard that you're
02:47 some kind of animal trafficker. Do you want to start the show with that story? It's out of control.
02:52 It is the craziest story. So I occasionally advertise on Facebook, advertise my pandas
02:58 and Python training. And I guess it was like two years ago, I tried doing a little bit more
03:03 advertising. And basically I didn't really pay much attention to it until about a year ago
03:09 when I noticed that when I tried to do some more advertising and it said,
03:12 you are not allowed to advertise on any meta properties. That's really weird. Like, what did
03:16 I do? I looked, I could not find any indication of what I'd done wrong. So it says, if you want
03:21 to appeal, click here. So I clicked here and within 30 minutes or so, I get email back saying,
03:25 your appeal has been checked and denied. You will never be allowed to advertise on
03:30 any meta property again. And this was like, what have I done? Like, what are you doing?
03:34 I feel like I'm pretty innocent. You're selling some courses and some books. Come on, man.
03:38 And I figured, since I appealed, someone must've looked through this. Anyway, in poking
03:43 around online, someone said to me, oh, I was caught by the same thing. It is illegal to sell
03:49 rare and endangered animals. And they believe that since you were selling Python and pandas
03:54 training, then you must've been trading in live and endangered animals and thus you are banned
03:58 for life. Well, I said, you know what? First of all, this is nuts, and yet it's a great story to
04:03 tell to my machine learning classes. So good on that front. But I was like, this is easy to resolve,
04:07 right? I'll just contact someone at Facebook. No one's available. I sent texts to my connections,
04:12 like people I know who work there. And I finally get the answer back saying that for legal purposes,
04:17 they delete all data that's more than six months old, something like that. So they could not go
04:23 back and check to see why I was banned. And thus this ban really would be for life. So because I
04:28 waited a while, cause it'd been like a year between the ban and me noticing. So they couldn't
04:33 do anything about it. So I was like half laughing and half like, you got to be kidding me about this.
04:39 and I posted on my blog about this and you guys picked it up on Python Bytes.
04:45 It was picked up in a bunch of other places like Hacker News. And about a month later I check and
04:50 it was back. No one told me anything. No one said anything just magically and quietly. My account
04:57 was restored. So I guess complaining never helps, right? That's right. That's right. But it was so
05:03 absurd. I think it was at least like four senior engineers at Facebook who tried to help me with
05:08 this. And they're all like, we tried, there's nothing we can do. Crazy. It reminds me, honestly,
05:13 it reminds me so much of my app store review experience, getting the new Talk Python Training
05:19 mobile apps in iOS and the apps in the iOS app store and the Google app store. They were
05:24 tragically incompetent in their own special way. Right? Somewhat malicious, somewhat not malicious,
05:32 just like for example, Apple, I think it was Apple. It could have been, I don't know,
05:36 Apple or Google, it has to be one of them, said, you know, we've denied your application to publish
05:40 this application because you're trying to impersonate an existing one. I said, what
05:44 app am I trying to impersonate? It's hundreds of hours of stuff that we've
05:49 created. They're like, well, first, you might be hijacked. So what is this? Well,
05:54 in the description it says, if you want to learn Python, you can take our courses. Well, there's
05:59 already an app called learn Python and you're trying to impersonate. I'm like, what? They're
06:03 like, it said, if you want to learn Python, but there's an app called learn Python. I'm like,
06:08 I just, I don't even under like my mind is how do you read that description and think this is
06:12 a trademark? So if you come up with an app called like eat at a restaurant, then no food ordering,
06:18 yeah. You're going to be acquired by UberEATS straight away. Exactly. And so we went, but we,
06:24 you would think, okay, fair mistake. Like, yeah, yeah. Okay. Something caught it. Just,
06:28 just like you said, a simple request and a human will look and it'd be fine. No, they're like,
06:31 well, look, obviously you're doing this. I'm like, obviously let's try another scenario.
06:35 What if I said I wanted to learn to play the guitar and there was an app called learn guitar,
06:40 but I don't want to learn. It's not, it's not, the title is not learn guitar. Just
06:44 the act of learning a guitar. How else would you explain it? Like, oh, okay. So I guess we,
06:48 we see, we understand now you can, you can have that sentence. And it's just,
06:52 these things are really, and you're at the, at the complete mercy of them. It's, it's really,
06:57 it's both like comically funny, but it's also like painful because we spent four months building
07:02 that app. They wouldn't accept for this stupid reason, you know? And they're taking a fair
07:05 amount of money off the top for whatever people are earning from these apps. It's not like this
07:10 is a charity or something. What I found amazing with Facebook was there was literally no way to
07:16 contact a human being. Like I tried all sorts of searches and forms and on, on, and on. And like,
07:22 and, and I found a lot of other people who had been caught up in this sort of ridiculous
07:28 situation, but like, there's no form even to say, Hey, I think you made a mistake. Why don't you
07:34 have a human look at this? Because that would cost them too much, you know, someone's salary.
07:38 Absolutely. It's anyway, I don't want to spend too much time on it, but boy, is that a crazy story?
07:42 I mean, we're going to be trading in some, we'll be trading in some pandas today. And so I just,
07:48 I'm going to bleep that part. Every time the word pandas is said, we're bleeping it out on
07:51 the YouTube version. Well, this could be entertaining then. Wow. Reuven was really
07:57 testy. Like all those bleeps. No, seriously though. You know, let's, let's catch up. We'll
08:04 talk about your book. What have you been up to since the last time we spoke? I can't remember
08:08 last time I had you on. It's been a couple of years, I think. It's been, it's been some time.
08:11 Yeah. So I'm continuing to do Python and pandas training at companies. Since the pandemic,
08:18 much more of it is online just because companies are now used to doing stuff on Zoom,
08:23 Teams, WebEx, rather than bringing me there in person. That's fine. So I travel to conferences
08:28 rather than to clients. And, you know, it sort of extended my flexibility on that. And I've also
08:32 been building up my online training stuff, both a whole lot of courses. And I've got an online
08:37 bootcamp that I run for Python and Pandas twice a year. The big thing, which is actually connected
08:41 to what we're doing here also is I have a new newsletter called Bamboo Weekly, where I have
08:45 pandas exercises. It's sort of like the same style as the book, but every week I take a topic and I
08:50 do it, I do it based on current events. So if there's something going on in the news, I try to
08:54 find a dataset, a real world dataset that has something to do with that. And then we, we try
08:59 to experiment with that. Yeah, that's actually fun to see like a real world example every week.
09:02 Yeah. Yeah. So people are like, how do you come up with that? And I say, well, I listen to a lot
09:06 of podcasts and I read a lot of newspapers and something somewhere, like I read The Economist,
09:11 for example, and they had this short, cute article about the number of animals that go
09:16 through Heathrow airport every year. I was like, wait, there's got to be a dataset for that. And
09:21 sure enough, the Heathrow airport authority publishes a dataset in CSV of how many animals
09:26 go through. And it was like 200 horses and one and a half billion butterflies and everything in
09:31 between. And so it was great fun to sort of play with that data and ask questions about it and
09:39 give people practice with something that's dirty and messy. I don't mean the animals being dirty,
09:43 right? Like very messy. Like you need to like really, you know, wrestle with it because that's
09:46 the only way you're going to improve. So I'm, I'm having a lot of fun on all the, the training
09:50 front. I definitely see more and more use of, I'm sure this won't surprise you, more and more use of
09:54 Python in the data space as it just like catches fire there. Yeah. It's just going downhill,
10:00 picking up speed, isn't it? It's extraordinary. Like I still remember asking people in my intro
10:05 Python courses. So what are you here for? They're like, yeah, my company's thinking of doing some
10:09 stuff with data analysis and pandas. Or at least data analysis, I think it was even then
10:13 in NumPy, before pandas came out. And me thinking, Hmm, I should really learn this stuff because it
10:18 sounds like it's going to be popular and holy cow, it's like, it's everywhere. And the whole ecosystem
10:24 is just growing and growing and growing. It's like people are sort of seeing pandas as the like
10:28 underlying infrastructure on which they build their software tools and their companies.
10:32 Right. Or even the thing that defines the API on which they can innovate, right? Think Dask or a
10:38 whole bunch of other things in that space. Right. That if you know pandas, chances are, if you're
10:42 not too crazy, you can like do grid computing by changing the import. It's extraordinary.
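As a rough illustration of that "change the import" idea, here is a minimal sketch assuming Dask's dataframe module, which intentionally mirrors the pandas API; the file and column names below are made up:

```python
import pandas as pd
import dask.dataframe as dd  # Dask's pandas-like, parallel DataFrame

# Plain pandas: the whole file is loaded into memory at once.
df = pd.read_csv("rides.csv")
print(df.groupby("vendor_id")["trip_distance"].mean())

# Dask: nearly identical code, but the data is split into partitions
# that can be processed in parallel or on a cluster.
ddf = dd.read_csv("rides.csv")
print(ddf.groupby("vendor_id")["trip_distance"].mean().compute())
```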
10:47 Just extraordinary. Yeah. Yeah, it absolutely is. And so back to your newsletter, pandas eat bamboo.
10:54 Is that right? That's so, I understand why you're back on Facebook. Okay. All right.
10:58 It was actually my father's idea. I was like, I need a catchy name for something. And so he was
11:04 like, well, we'll have something with bamboo. I was like, right, right. It's like, it's food for
11:08 thought and food for pandas. Yeah. Yeah. That's actually really clever. I like it a lot. I like
11:12 it a lot. So you've been on a journey. You've been on a journey to write a book, right?
11:17 Yes. I was convinced to do a second one. Right. So I did a book with Manning called Python Workout,
11:22 which was exercises in Python. And when that finished, they said, so what other topics do
11:28 you think would be useful? I said, well, doing a lot of pandas and people definitely need a lot
11:31 of exercise and a lot of practice in that. And so I got to work both collecting the exercises I do
11:37 with my corporate training and also like coming up with some new ones as I learn new things.
11:41 Cause pandas is so, so, so huge that it's very easy to get lost in there and not even know what
11:47 are the important topics to learn. And, and so thanks to like working with people all the time,
11:52 I sort of see also where they get stuck and where they have problems and where it's really confusing
11:55 to them. So, yeah, so 200 and somewhat exercises later, by the way, I'll just tell you why did it
12:01 come out when it came out? Cause I'm terribly bad at deadlines. I said to the Manning people,
12:05 I really want to have it at my booth at PyCon. They said, okay, big talker, you want it mid May,
12:10 huh? So you'd better have everything done. And they sort of backtracked on the calendar and said,
12:14 okay, so you've got to have the whole thing done by December. And then finally I had some like,
12:18 you know, fire under me and got it done, as opposed to getting bored.
12:22 Yeah. The version I actually have in my Apple books here I pulled up is the Manning early
12:28 access preview version, but it's out, out for real now. Yeah.
12:31 Here. I've even got the paper copy here.
12:33 Look at that.
12:34 I know.
12:35 That's a pretty hefty book, honestly.
12:36 I still keep looking at it, baby.
12:37 That's a proper book.
12:38 It's like, wow. I know. I know. It feels like it's really quite some feeling there to see.
12:42 I guess I really finished it.
12:44 Congratulations. So I wanted to talk to you about maybe just kind of following on with
12:49 your bamboo idea, like give us some examples, give us some problems that people are solving
12:53 with pandas. And I mean, we're not going to talk through the code super detailed,
12:57 but you could say like this aspect or this feature of pandas like .loc or whatever
13:02 is how you access and solve these problems. Right. Like, so just kind of exploring that space. I had
13:06 Wes McKinney on before, five episodes ago, something like that. And I was just like,
13:10 how do you learn pandas? It's like so big, you know?
13:13 So I've actually changed. Well, let me, let me, let me first say, I'm definitely one of those
13:18 people who thinks you should learn Python before pandas. Like, I definitely think that knowing the
13:23 language well will serve you very, very well, in all sorts of weird, small ways you wouldn't necessarily expect.
13:29 But at the same time, when you learn pandas, you have to learn that some of the paradigms you
13:34 learned, like some of the idioms from Python are not appropriate. So I was giving a class in like
13:39 optimizing pandas, like a short class, we'll call it micro class, like 90 minutes long,
13:43 about a year or so ago. And at the end, I was like, oh, and by the way, obviously just never do for loops. And everyone's like, wait, wait, wait, what?
13:50 I said, what do you mean? What? And they were like, we were taught in our intro pandas class
13:55 that you should do a for loop to do anything across the data frame. And I was like, okay,
13:59 what? How can this be? And so people think that because it's in Python, you should do it the same
14:06 way, but there are all these different idioms, especially the whole vectorization thing, that you need to
14:10 internalize. Otherwise, as I like to say to people, you should hope to be paid by the hour
14:16 because like these things are just gonna take forever to run. And so people like don't
14:21 necessarily understand sort of how to approach pandas stuff. And then they don't understand,
14:25 let's see, I mean, I'll give an outline and then we can sort of go into it.
14:30 Certainly how to access things with .loc, certainly how multi-indexes work,
14:34 how to work with the different dtypes. Cause these are things that we
14:38 don't think about in the standard Python world each day, right? The closest analogy would maybe
14:43 be that when you're working with a dictionary, you want to think about what your keys are going
14:47 to be, but even then it doesn't come close to it. So with .loc, people are like, so fine. So
14:52 .loc is used for retrieving from a series or retrieving from a data frame. But it's so much
14:57 more than that because you have two parts to it or potentially two parts. You have the row selector
15:03 and you have the column selector and each of those can be an individual name, a list of names, a
15:09 Boolean series, or even a Lambda. And you can mix and match those in so many different ways. And
15:14 once you see those options, like your head sort of explodes with, Oh wait, I never thought that I
15:19 could access my data like that. And then you're like, wait, you can also assign your data like
15:23 that. And then they're just sort of astonished.
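To make that concrete, here is a tiny sketch of the .loc selectors being described, with a made-up DataFrame; the row selector and column selector can each be a label, a list, a boolean series, or a lambda:

```python
import pandas as pd

df = pd.DataFrame(
    {"distance": [1.2, 5.0, 12.3], "fare": [6.5, 18.0, 42.0]},
    index=["a", "b", "c"],
)

# Row selector + column selector: individual labels...
print(df.loc["a", "fare"])

# ...lists of labels...
print(df.loc[["a", "c"], ["distance", "fare"]])

# ...a boolean series...
print(df.loc[df["distance"] > 2, "fare"])

# ...or a lambda that receives the DataFrame and returns a boolean series.
print(df.loc[lambda d: d["fare"] > 10, "distance"])

# And the same machinery works for assignment, not just retrieval.
df.loc[df["distance"] > 10, "fare"] = 40.0
```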
15:28 Yeah. It's really wild. You know, I think pandas has its own idiomatic style that is different than what you would call Pythonic, right? Like
15:34 it's Pydantic. I don't know what the name is, but idiomatic pandas, right? Where there's things that
15:40 are specific to pandas, like this vectorization stuff, right? Instead of looping over, right?
15:46 You know, you think about the Python performance angles or the data science performance angles,
15:51 right? A lot of the speed that we get out of tools like pandas and NumPy and Polars and others is
15:56 because you take the data, you push it down into some native layer and you just leave it there.
16:00 And you tell, you kind of speak to the native layer from Python and you say, deep down in your
16:06 insides, you got this thing, multiply all 1 million of them by two or whatever. Right. But if
16:10 you loop over it, you're kind of coming out of C into, you know, Python objects, then
16:17 you're operating on them, then you're putting them back, like, you know, 2 million times, and all of a
16:20 sudden, all those benefits are gone. And so certainly learning those types of things. I mean,
16:25 the vectorization, I think most people get pretty soon, although it sounds like not necessarily
16:30 everyone. I was really like flabbergasted. But there's way more than that, right? There's,
16:35 there's a whole, that's just probably the most obvious thing.
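A minimal sketch of the difference being described here, with made-up data and a hypothetical column name:

```python
import pandas as pd

df = pd.DataFrame({"distance": range(1_000_000)})

# The slow way: loop in Python, crossing between C and Python objects
# for every single row.
doubled = []
for value in df["distance"]:
    doubled.append(value * 2)
df["doubled_slow"] = doubled

# The vectorized way: one instruction to the native layer, which does
# all million multiplications down in C in one shot.
df["doubled"] = df["distance"] * 2
```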
16:37 This portion of Talk Python to me is brought to you by Sentry. Code breaks. It's a fact of life.
16:43 With Sentry, you can fix it faster. As I've told you all before, we use Sentry on many of our apps
16:49 and APIs here at Talk Python. I recently used Sentry to help me track down one of the weirdest
16:55 bugs I've run into in a long time. Here's what happened. When signing up for our mailing list,
17:00 it would crash under a non-common execution path, like situations where someone was already
17:06 subscribed or entered an invalid email address or something like this. The bizarre part was that our
17:12 logging of that unusual condition itself was crashing. How is it possible for a log to crash?
17:19 It's basically a glorified print statement. Well, Sentry to the rescue. I'm looking at the crash
17:24 report right now, and I see way more information than you would expect to find in any log statement.
17:30 And because it's production, debuggers are out of the question. I see the trace back, of course,
17:35 but also the browser version, client OS, server OS, server OS version, whether it's production
17:41 or Q&A, the email and name of the person signing up, that's the person who actually experienced
17:46 the crash, dictionaries of data on the call stack, and so much more. What was the problem?
17:51 I initialized the logger with the string info for the level rather than the enumeration dot info,
17:58 which was an integer-based Enum. So the logging statement would crash, saying that I could not
18:04 use less than or equal to between strings and ints. Crazy town. But with Sentry, I captured it,
18:11 fixed it, and I even helped the user who experienced that crash. Don't fly blind. Fix
18:16 code faster with Sentry. Create your Sentry account now at talkpython.fm/Sentry. And if you
18:22 sign up with the code TALKPYTHON, all capital, no spaces, it's good for two free months of Sentry's
18:29 business plan, which will give you up to 20 times as many monthly events as well as other features.
18:34 Accessing things in that way, by the way, you asked before, like, how should people even approach
18:39 learning pandas? And so I've started thinking about it a little differently based on feedback
18:42 from people, that I would sort of walk through it. Okay, here's a series, here's a data frame.
18:47 And after creating some fake ones with made-up numbers or random numbers, then we'll start
18:52 reading in files. And then we'll do this, and then we'll get to visualization. And a student
18:57 of mine said, you know, you're sort of missing the big, like, people miss the big picture doing
19:02 that, and they want to get the excitement. Why don't you start with, read a CSV file, visualize
19:07 it right there inside of Jupyter, and then people will be so impressed and amazed, and then you fill
19:12 in the gaps. And so I've started doing that a bit. And I think that has not been a bad approach to
19:17 catch their attention, give them a sense of what the possibilities are. I'm like, okay, let's now
19:21 walk through each of these little pieces and build up to what we saw that first day. And that's been
19:26 fun. Yeah, I totally agree. That's also something that I really strive for. I don't always do
19:32 whenever I'm doing presentations, but, you know, just because someone is chosen to sit in a seat
19:37 for a day long course or an hour long presentation, doesn't mean that they couldn't use a little
19:42 inspiration. Right. And if you can like, wow, you did those three lines, and now we have this
19:45 picture and I understand all that, like, tell me more. Right. That right there. That's right. Right.
19:50 Now you have their actual attention the whole time and they're enjoying it. And it's, yeah,
19:54 it's so often it's like, well, in order to show you the nice stuff, I got to give you every level
19:59 of details. Like, no, you don't. You're going to, you're going to make them leave. Don't do it that
20:03 way. That's right. That's right. And you mentioned like three lines of code. One of the amazing
20:08 things about pandas is how often you can write very, very little code, but it's like getting
20:14 to that code that takes a while really taking advantage of it. Yeah. We've been knowing that.
20:18 Right. Right. Right. Right. People don't even know. So a lot of times, like I've been using
20:24 it now for long enough that I sort of intuit, oh, there's got to be a method that does this. Like
20:29 someone has encountered this problem before. And so either as a method or there's an option
20:34 or there's an add on like something somewhere that just makes it trivially easy. And that's part of
20:39 the exploration that I try to do both in the book and the weekly, like in my training journal. And
20:43 also truth be told, I'm constantly learning stuff, right? Like it's, it's a rare for me to teach and
20:48 not discover some option or some method that I did not know about because it's just so incredibly
20:52 vast. Yeah. Or pandas 2 comes out or something like that. Yes. Yes, indeed. I mean, I've been
20:57 exploring, I mean tomorrow, tomorrow I head off to Prague for EuroPython, where I'm giving a talk
21:02 on PyArrow in pandas. And so I've been looking into that a lot, and oh boy, right. I mean, I've
21:08 been using it for say a year or so, but it's amazing. And yet there are all these subtle
21:12 changes that are happening in pandas as a result that you need to know about it. You need to expect
21:17 when, when you use it, but it's another like tool in our toolbox that we can pull out to make pandas
21:23 more effective, more efficient, deal with larger data and also interact with other and interoperate
21:29 with other systems. Yeah, for sure. It's just like, as I said, the whole ecosystem is just
21:33 exploding. It's really quite something. Yeah, it really is. And I think the pandas 2 stuff is
21:36 going to make a pretty big difference in changing like the internals away from just tables of
21:40 numbers and basically, okay. Let's talk about, so the way I thought we could maybe explore this
21:46 pandas workout book is let's just pick some fun exercises that you put together and, and talk
21:52 about them. Like, give us a quick overview of what the workout aspect of this book means anyway,
21:57 then we'll, we'll pick the first one. Sure. So the idea is that you can't learn everything all
22:03 at once or quickly. That's sort of like, you know, working out physically, it's a long haul
22:09 and every day you get a little better, a little stronger, a little more flexible. And so if you
22:15 see, you know, your pandas learning journey as it's going to take me several months, not, it's
22:21 going to take me a day. Then if every day you do a little bit of practice, you learn something new
22:25 in some new direction. At the end of that journey, you're going to be able to solve many, many more
22:30 problems in better, more idiomatic and more efficient ways. And you'll be able to put these
22:35 pieces together in ways that you didn't even expect. So that's, that's the basic idea. And so
22:40 the book is divided into, I want to say 12 chapters. I know we sort of rejiggered it at
22:43 some point where each chapter focuses on a different aspect of pandas. But it's really
22:49 sort of the, the total experience of going through it. And so we have 200 exercises,
22:55 plus there's like a, for lack of a better term, a midterm and a final, like larger projects that
23:01 people can go through that people ask for after the Python book. And each exercise then has not
23:07 only the main exercise where I pose a problem, I give an explanation, I give an answer, note the
23:11 order, the answer comes at the end. So you won't peek as easily. We talked about putting in the
23:15 back of the book and like, that just didn't work out so well. So at least you have to wait through
23:19 the explanation a little bit or turn the page. And then you can buy the one book with the questions.
23:24 You can buy the answers later. I was joking. I didn't mean, I didn't mean to give anyone ideas.
23:34 I'm joking. No, no, it's fine. And then like after the answer, we have three, what we call beyond the
23:39 exercise, which are okay. Now that you've kind of gotten the basics, let's push either on the same
23:46 data set or even like, like sort of go farther. And so it's like, that's why we say it's 200
23:52 exercises because it's 50 official exercises and another three for each one. And those tend to be
23:57 much harder. And I don't give a full explanation, although I do give the solution online in the
24:01 Jupyter notebook. You can download the notebooks for all these things and see how I solved it. And for
24:05 many of them, there's a link to the Pandas Tutor. I think, yeah, you've definitely spoken to Philip
24:10 Guo in the past. So he's got not only Python Tutor but Pandas Tutor, he and his team. It's amazing.
24:15 And so you can just click on a link and it will take you to usually a miniature version of the
24:20 data set because it's too big for pandas tutor. And then you can sort of see the visualization,
24:25 how these things work. Yeah. I haven't, I don't think I talked about pandas tutor. Maybe I have
24:29 talked to Philip about it, but I've certainly used it in some of my courses, but it's worth
24:34 bringing up just to talk, like, see how that thing works. It's, it's something special. This
24:39 thing, it really gives you some deep understanding into like, okay, if I run, you know, this group
24:44 by command, then here's how all the pieces like flow back together. And so you're using this
24:49 during your book. So, yeah. So, I mean, every exercise has a link in the solution. You click
24:54 on it and brings you to the code that I use. Usually again, I'll sort of miniature version
24:58 of the code, miniature version of the data set that you can then see the visualization there
25:02 in, in your browser. Nice. Yeah. Cause you've got to basically, I think, encode the data in the URL or something. So it can't be too much.
25:09 Right. So I would like basically take the data in the data frame, turn it into a Python dictionary,
25:17 and then like set it up because it won't work with files for security reasons. So assign that
25:22 dictionary to a variable, or no, no, I'm sorry, I turned it to CSV in a string and then assigned that
25:27 string to a variable and read it in. And then I would see if it overflowed the Pandas Tutor limit,
25:32 and if it did, I sort of iterated until I got it small enough to fit in
25:37 there and big enough to be useful and interesting. It took a few iterations.
25:42 It's cool. No, but it's a super, super good resource for learning pandas. I also think
25:46 for exploring, right? Like you end up with some code that like, I'm not really sure what this
25:50 does, or this is actually new to me. Like a lot of the things you might encounter in this book,
25:54 you're like, let me visualize that. Right. Cause yeah, it's great.
25:57 Right. I mean, I, as I said, like I do a lot of training in pandas and very often people
26:03 have already played with it a bit, used it, even use it for a year or two. And because it's so
26:07 large, like it's not unusual for someone to say, Oh, I had no idea that this functionality existed.
26:13 Why have I been wasting my time doing X and Y and Z? I'll give you one, one example that I've
26:19 been using more and more. So, you know, Matt Harrison's a big fan of the method chaining
26:23 approach. And at first I was like, yeah, yeah, yeah, Matt, whatever. Like, yo,
26:27 stop pushing that on everyone. That's your opinion. Keep it to yourself.
26:31 Right. And then I was like, actually, this is a great way to build things up little by little,
26:37 line by line. And I can use this. It's pedagogically very useful because I say,
26:40 okay, let's think about how we want to break down this problem. We'll do this. Then we'll do this.
26:44 Then we'll do this. And you can see it sort of going line by line until voila, we have the
26:48 analysis that we want. And so I inserted that into a lot of places in the book. Sort of like
26:54 one of the last edits that I did was to go back and change it to be more method chaining. And I
26:59 use it now all the time in my training and in Bamboo Weekly. And I, so I, I, I bow to Matt on
27:05 that. He was, he was right. And I was stubbornly resistant for no good reason.
27:10 Yeah. I really like that style as well. You see, it's officially night here where I am.
27:14 It's just, everything switched to dark.
27:15 But I just had that yesterday when it, like I was doing office hours and all of a sudden,
27:20 like I was sharing my screen and it changed the color. I'm glad I'm not the only one.
27:24 So yeah, I, I'm a big fan of the method chaining fluent interfaces. I mean, I would love to even
27:30 see like I thought on itself in the standard library, adopt that more, right? Like there's
27:35 so many things you operate on that'll change something, but then it'll, it'll return.
27:40 No, it'll not return anything. It's like a void method as much as we have those,
27:43 right? It returns none effectively. So you can't say, you know, dot sort dot this dot that you have
27:48 to like multi-step it. And I would just love to see more of it, but let's talk.
27:51 Well, I'll just, I'll just say there on that front. So pandas does have the option to
27:58 either get back a new data frame or to say inplace=True. And then it does it locally,
28:03 like it does it on that data structure and then returns none. And people are consistently
28:08 convinced that this is faster, more efficient, better. And so I've been like trying to tell
28:13 people, no, the pandas core developers keep saying, do not think that is true. It is not true.
28:18 And we are getting rid of inplace=True at some point, stop using it so that you can do
28:23 method chaining. And so no small number of people again, in my course, they're like, Oh, really? I
28:28 had no idea. I feel like, you know, I'm spreading the gospel.
28:33 Throw that whole expression in some parentheses and dot yourself away. Let's go. All right.
28:38 That's right.
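Here is a minimal sketch of that parenthesized, method-chaining style, with a hypothetical CSV and column names; note there is no inplace=True anywhere, since each step just returns a new frame to chain on:

```python
import pandas as pd

# Build the analysis up one step at a time inside parentheses; each method
# returns a new DataFrame, so there's no need for inplace=True anywhere.
result = (
    pd.read_csv("rides.csv")                      # hypothetical file name
    .dropna(subset=["trip_distance"])
    .loc[lambda d: d["trip_distance"] > 0]        # filter with a lambda
    .assign(fare_per_mile=lambda d: d["total_amount"] / d["trip_distance"])
    .sort_values("fare_per_mile", ascending=False)
    .head(10)
)
print(result)
```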
28:39 Let's dot our way on over to exercise seven of the many, and let's just talk about long,
28:46 medium and short taxi rides. Tell us like, kind of like we can only talk so much about the code,
28:51 but like, let's, let's talk a bit about it and get people, like I said, I want to expose people to
28:56 like, what are some of the problems and aspects of Pandas that you can use to solve them?
29:00 Sure. So one of my favorite datasets to work with is a New York city taxi information. It's like,
29:07 everyone can identify with it. You understand it. And so this exercise uses a very small
29:12 subset of that. Maybe we'll talk next about the pandemic taxis, which is a much larger one,
29:16 but this is just a hundred thousand, no, 10,000, 10,000 taxi rides from like five years ago,
29:20 six years ago. And the question is, well, how can we divide up this dataset, which tells us
29:27 how long, how far, how much people paid when they were picked up all that information.
29:31 How can we find out like the distance that they went and categorize that? And the reason that
29:37 this can be useful is we so often have numeric data that we need to put into categories, right?
29:42 What are best sellers? What are poor sellers? Who is, you know,
29:47 the employee of the year, that sort of thing. There's so many places where turning something into a
29:51 category will be useful. And so it's very tempting to think, okay, I'll like do some
29:56 if statements or I'll do some for statements, but actually pandas provides us with PD.cut,
30:02 which just does it for us. And this is one of those examples of once you learn it, you're like,
30:07 oh, wow, I get it. I don't have to have, oh yeah, even up here. Like you might think you would want
30:12 to say, let's set all the categories, you know, to be medium. And then where it's less than two
30:18 miles, we'll call it short or it's greater than 10 miles. We'll call it long, but you can just
30:22 say pd.cut, we're going to cut it at two, we're going to cut it at 10. Anything less than two is short.
30:26 Anything greater than 10 is long. Anything in the middle is medium. Done. pd.cut gives you back a
30:31 new series. - This portion of Talk Python to me is brought to you by Scalable Path. If you're a
30:37 founder or engineering leader, you know how hard it is to find top tier developers while keeping
30:43 costs low. That's where Scalable Path comes in. They're a software staffing company that helps
30:48 you build remote dev teams that just fit. If you're wondering what sets this staffing company apart,
30:54 well, one big differentiator is their approach. They're founded and run by developers. Scalable
31:00 Path understands that finding the right developer is not just about technical skills. It's about
31:06 personality, work ethic, and how well they mesh with your team. They are software architects,
31:11 will take the time to understand your vision and needs, and then develop technical challenges for
31:16 the roles you're looking to hire. And these technical tests are conducted live on video by
31:22 senior software developers, so there's no gaming the system. And Scalable Path takes it one step
31:27 further. They evaluate each developer's soft skills like communication, attitude, and work style
31:33 before presenting best suited candidates to you. Scalable Path has built a network of over 35,000
31:40 remote developers. No more endless searches or sleepless nights worrying about the right hire.
31:45 And here's a special offer for Talk Python listeners. You'll get 20% off of your first month.
31:52 So are you ready to scale your dev team and your business? Get started by visiting
31:56 talkpython.fm/scalablepath. The link is in your podcast player's show notes.
32:01 Thank you to Scalable Path for supporting the show.
32:04 So the way you're doing it is you're setting everything to medium, and then you're defining
32:09 short and you're defining long instead of defining the three categories.
32:13 So that's like the sort of, you might think this would be a good way to do it.
32:18 And it works, right? As I like to say in my courses, unfortunately this works.
32:21 Right? Like, so you will get the right answer this way. But if you then like pay,
32:27 go to the next page here, you'll see that you can just use PD cut.
32:30 I see.
32:31 Yeah. And then you just say, here are the bins. Here are the labels. Go.
32:36 Well, this goes back to that thing we talked about, like looping over stuff versus just going,
32:42 this is what I want you to do. Do it a hundred percent deep down inside the best you can.
32:47 Don't bother me. Just figure it out for me.
32:49 That's exactly right. And this means that like a whole lot of people have worked on PD cut
32:55 and have made it work efficiently, way more efficiently than you or I could do in our code.
33:00 Presumably, right? It's not going to be any worse. And it'll probably be a lot better.
33:05 The other thing is you get back and you can see it's sorted at the end.
33:09 Look at the output of the series you get here. So you feed it a series and you get back a series,
33:14 but the series looks, at first glance, like it's a bunch of strings.
33:18 And so you're going to have short, short, medium, medium, long, long, long,
33:20 but it's actually not strings. It's actually a category, which is pandas' version of an enum.
33:25 So really it's very small because it's just integers being stored there.
33:29 And then those integers are associated with strings. So that's an example of like where
33:33 they have thought through it. And they've basically said, yeah, we're going to make
33:36 this more efficient than you would probably think to do on your own.
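As a small illustration of that category idea, here are the same values stored as plain strings versus as a categorical dtype (made-up data):

```python
import pandas as pd

s = pd.Series(["short", "medium", "long"] * 100_000)

cat = s.astype("category")

# Underneath, a categorical stores small integer codes plus a single copy
# of each label, rather than repeating the strings over and over.
print(cat.cat.categories)   # the unique labels, stored once
print(cat.cat.codes.head()) # the integer codes actually held per row

# The memory difference is dramatic for repetitive string data.
print(s.memory_usage(deep=True))
print(cat.memory_usage(deep=True))
```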
33:38 Right. Awesome. Well, I would, to be honest, I've been happy if I came up with that first solution.
33:43 Just using .loc and setting it in there. That's pretty cool. But this cut is super nice.
33:48 Yeah. Just define the boundaries as bins and then off it goes.
33:52 Right. Right. And there's an option there, include_lowest. And so I actually didn't
33:57 know about this for a while. So I'd be like, okay, so think about the bins has to be from
34:00 some small number to some large number. And the question is then, well, wait a second.
34:05 What about that leftmost bin? What about that? Like if it's up to and not including,
34:09 it's like less than, but not less than equal, then how do I do it? So I will always be like,
34:13 well, I'll take the min. I'll do like the series dot min minus one. And then it's guaranteed to
34:18 be lower than that. But no, it turns out the Panda's developers thought about this long before
34:22 I did. And there's an option, a keyword argument you can pass: include_lowest=True. Done.
34:27 Now it's less than or equal, as opposed to just less than.
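Putting the pieces he's described together, a small sketch of pd.cut with those bins, labels, and include_lowest (the data and column values here are made up):

```python
import pandas as pd

distances = pd.Series([0.5, 1.8, 4.2, 7.9, 15.0, 23.4])

categories = pd.cut(
    distances,
    bins=[0, 2, 10, distances.max()],    # boundaries at 2 and 10 miles
    labels=["short", "medium", "long"],
    include_lowest=True,                 # make the first bin closed on the left
)

print(categories)
print(categories.dtype)           # category, not plain strings
print(categories.value_counts())  # how many rides fall into each bucket
```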
34:29 Right. Because all these kinds of things, the boundary conditions are always tricky,
34:34 especially on floating points, right? Oh yes. Oh yes.
34:37 There's probably at least two spacecrafts that have crashed because of this. All right.
34:41 The next one we want to go to is number 12, finding outliers.
34:47 Yeah. So, you know, it's, this is not a book about statistics and I'm not an expert in
34:52 statistics, but there are a whole lot of statistical ideas that permeate working with
34:58 Pandas. And so it's like, you know, mean standard deviation, median understanding sort of how they
35:03 can differ from one another. Right. And especially the whole mean versus median thing where it's so
35:08 easy to have a few outliers, pull your data up or pull your data down and then sort of fool you.
35:14 Right. And so often people want to find the opposite. And I've certainly found this,
35:18 especially with our corporate training with like a cybersecurity people where they're like,
35:22 we are always looking for outliers. Like, you know, who was the user who was logging in at
35:26 unusual times? Who logged in for 20 IP addresses within one minute.
35:32 That's right. That's not a good person. They're not good. I'm telling you that right now.
35:36 They just try to be very, very effective. And so, and so, so this finding outliers is, okay,
35:44 let's find out them, not who was sort of normal, but like who was abnormal, right? Who, who was
35:51 exhibiting, you know, unusual behavior. And so here it's like, okay, let's take our set of numbers
35:58 and let's find out who was more than one standard deviation, or two, above the mean or
36:03 below the mean. And let's just find those values and ignore the sort of normal values. And then
36:09 we could talk a little about, you know, IQR, the inter quartile range, which is a standard
36:13 statistical idea that we don't talk about enough, even though it's super useful, right? People came
36:18 up with these ideas a long time ago. Oh, here's a fun fact. John Tukey, who came up with a lot of
36:22 these ideas, he also invented the word bit in computers. So yeah, I was like, wow, what a guy.
36:29 So you can see here then that, like, I'm going to take the, you know, trip distance.
36:34 Let's find where the trip distance is less than the trip distance
36:39 at the first quartile minus one and a half times the IQR, right? Meaning, let's take the
36:46 distance between the first quartile and the third quartile. That's our interquartile range. That
36:51 gives us a sense of where the bulk of the numbers is, and let's find out who
36:56 is below that or who's above that. And so it's not surprising if we're looking for taxi distances,
37:02 you're not going to have a lot of outliers that are very low because you can't go below zero miles
37:07 in your taxi, but you can go very high. You can go very large. And so looking for those trips that
37:14 are greater than the 75th percentile plus one and a half times the IQR, we actually find, you know,
37:20 I see here, I found, you know, about 1,800 taxi rides out of the 10,000. So about 19% of the taxi
37:25 rides are much, much longer than the mean ride.
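A rough sketch of the IQR-based filter just described, assuming a trip_distance column (the numbers here are made up):

```python
import pandas as pd

trip_distance = pd.Series([0.8, 1.2, 1.9, 2.4, 3.1, 3.8, 5.0, 9.7, 18.2, 31.5])

q1 = trip_distance.quantile(0.25)
q3 = trip_distance.quantile(0.75)
iqr = q3 - q1  # the interquartile range: where the middle half of the data lives

# Classic Tukey fences: anything beyond 1.5 * IQR from the quartiles is an outlier.
low_outliers = trip_distance[trip_distance < q1 - 1.5 * iqr]
high_outliers = trip_distance[trip_distance > q3 + 1.5 * iqr]

print(len(low_outliers), "low outliers")    # near zero for taxi distances
print(len(high_outliers), "high outliers")  # the very long rides
```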
37:30 And you can use this; you can imagine people in, like, the New York City Taxi and Limousine Commission saying, oh, we can use this to plan, to charge, to
37:37 you know, send taxis at different places. - Sure, have a special program for long distance
37:41 stuff or whatever. Yeah. - Right. Right. Or if you're Uber, you know where to place, they actually used to have the geography, the longitude and latitude of where
37:48 people were picked up and dropped off and they got rid of that. And I'm sure both for privacy
37:52 reasons and because Uber and Lyft can look at that data and say, oh, well, we know now when to send
37:57 cars where. - Well, although while true, I imagine that that data has been downloaded and archived
38:04 with geolocation for certain people, right? - Oh yeah, you can still get all that old data.
38:09 It's just the newer stuff that they don't do. And they still do it by neighborhood.
38:12 - I see. - I also wonder like how precise that longitude and latitude was, like you would probably identify which home like was going to which other
38:19 home. - Yeah, now you're getting into a problem.
38:22 - I can only imagine how bad that would look. - Yeah, now you're getting into a shady spot or
38:25 certain types of establishments, you know, that could be brought with consequences for the people
38:31 whose home you've already identified where they were picked up at or all sorts of stuff, right?
38:34 - Exactly. - All right, fair.
38:35 - Exactly. - Fair. That's way more important than whether Uber can place their vehicles strategically, you know?
38:40 - Right. Right. And they might have some data on their own too, you know.
38:44 - Nice. Okay, so you can use this IQR, interquartile range feature to pull that out here,
38:50 right? - Right. Right. And, you know, we can pull that out and then do simple multiplication. Right at the end of the day,
38:56 what we're doing is we're pulling out this data and then just doing very, very simple statistical
39:01 analysis of it. Just to sort of say, how many outliers do we have? How many low outliers? How
39:05 many high outliers? And you can see this, like at the end of the day, it's not that complicated in
39:10 terms of math. Sometimes people are like, "Well, how much math do I have to know to learn this
39:14 stuff?" I'm like, "I can get through it. I promise you. Not that much. It's not that hard. A few
39:18 basic ideas and you're basically set." - Yeah, very cool. All right, onto the next
39:22 one, which is endemic taxis. That's a different kind of taxi.
39:26 - So this was actually, so the two or so that we looked at so far were with my tiny little
39:33 10,000 taxi ride sample from a few years ago. I then took taxi rides from January and July of
39:40 2019 and 2020. So four months there. And the question was, I think I actually only looked at
39:45 July in this exercise. Yeah, July, 2019, July, 2020, comparing taxi rides. Now, just to remind
39:52 you, July, 2020, not a great time for tourism in New York City or anywhere else. And so the
39:58 question was, what differences do we see between July, 2019 and July, 2020? How much did it go down
40:07 in terms of taxi use? How much less did people pay? And then my favorite part of this was,
40:12 did people use cash more in 2020 or less in 2020? And so you see, first of all, it's like a decline
40:20 of some like 80% in terms of taxi rides from 2019 to 2020. Again, not a huge surprise to anyone who
40:26 lived through the pandemic and saw what was going on then. And so my gut feeling was, well, no one
40:31 wanted to touch anyone else or touch anything that anyone else had touched. So clearly people would
40:36 have used credit cards much more. And no, it turns out-
40:39 There's that screen, Reuven. You got to touch the okay. That drove me crazy during the pandemic.
40:44 It's like, I'm going to just do touchless. And then you sign it or you okay. You're like,
40:51 I never thought of that. I guess in Israel, you just do the tapping of your card. You didn't have
40:57 to sign anything, but huh. Yeah. In the US, they're like, oh, hit okay to confirm. It's like, okay. I thought we had just escaped me touching this disease-ridden
41:05 thing, but no. Also, you have to indicate how much of a tip you want to leave, but that's a
41:09 whole separate thing. Yeah. We're at the car wash. Don't you want to tip the car? No, I don't want
41:15 to tip the car wash. That's the other thing is you should have looked at tips. That could have
41:19 been interesting actually as well. I think, no, I think I did. I definitely looked at tips,
41:24 but I can't remember what it was like- Like around the pandemic. Yeah. Because I know a lot
41:27 of people gave tips to say thanks. And I was wondering if that would show up. Thanks for being
41:31 out here amongst the diseased ones. So it turns out that people used cash more
41:37 during the pandemic than credit. And I raised this with my family. I was like, what the heck?
41:42 I think it was my sister who said, well, everyone who would use, the people who were higher income
41:48 earners who use their credit cards more, they were staying home. It's the people who were forced to
41:53 go to work who use cash more, who were taking the taxis because they didn't want to take the subway.
41:57 Like there was sort of an in-between sort of thing. So you get all these wild pieces of data
42:03 and analysis that you can do. And here it was not just pulling out all this data, but one of the big
42:08 things was also, I want to read in a lot of data. I want to read in from two different CSV files.
42:14 Now what do I do? Because it's very obvious. You read in a CSV file, you get a data frame. You read
42:18 another CSV file, you get a data frame. But I want to treat those as one. And I want to then be able
42:23 to distinguish between them. How do I do that? And so sure enough, Pandas has a concat method.
42:28 And I use concat now all the time. I read in multiple CSV files into multiple data frames.
42:35 I can cat them together and you can get at them either horizontally or vertically, depending on
42:38 what you like, what you want to do. And- I see. If they're like, their columns are the same,
42:43 then you might do it vertically. But if you want to augment it potentially, because now we have the
42:48 sale percentage data, but it goes along with the columns exactly, or the rows exactly.
42:52 Precisely. Precisely. And so another nice way to do this also is not just read this one and read
43:00 that one, but you can use a list comprehension with something like glob. So glob.glob on *.csv,
43:06 get back a list of data frames, and then just hand that to pd.concat. And so that's where
43:12 knowing Python and being able to pull that out and use those techniques can really, really come in handy.
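A minimal sketch of that pattern, assuming some hypothetical monthly CSVs with matching columns:

```python
import glob
import pandas as pd

# Read every matching CSV into its own DataFrame with a list comprehension...
frames = [pd.read_csv(filename) for filename in sorted(glob.glob("taxi-*.csv"))]

# ...then stack them into one DataFrame. The default stacks rows vertically;
# axis="columns" would line the frames up side by side instead.
df = pd.concat(frames)
print(len(df), "total rows from", len(frames), "files")
```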
43:16 Yeah. Earlier you mentioned, I want to go back to it real quick. You mentioned that learn
43:20 Python first and then Pandas. A lot of times when I think learn Python, I think, okay, well,
43:25 learn the language and then learn the standard library to a good degree, and then begin to chop
43:31 away at the half million things on PyPI that are interesting. And it's like a never ending sort of
43:37 thing. And then there's of course, this joke, like I learned Python, it was a good weekend or
43:41 something like that. Right? Like, how do you square these two things? But I feel like the
43:45 amount of Python that you need to learn is mostly centered around the language and really is
43:50 actually not that much. And then you kind of learn the Pandas way, right? Theoretically,
43:55 if you wanted to. Right. Right. For sure. Look, I am a big fan of objects and classes and all
44:00 that stuff. But when I'm talking to people who specifically want to use Python for data analytics
44:07 with Pandas, I say to them, objects will help you, but they're not going to be crucial. Like
44:12 if there's a part of the course that you want to sort of drop, save some money, save some time,
44:16 then that's a place where we can save because the odds of an analyst using Pandas writing their own
44:21 classes are pretty slim. I feel like giving them some perspective on how these classes work, sure. But
44:27 I don't think that like they need to learn that. So and like a lot of the standard library there,
44:31 it's hard to say. Right. So as I said, I love glob, right? Globbing is fantastic.
44:36 But that's definitely not in like my intro class. I would say, oh, by the way. Yeah. I would bet
44:41 there's probably 10 modules that if you knew, you might not need to learn more for six months,
44:47 you know, doing Jupyter Pandas type of work, right? Like pathlib and a couple of things like
44:52 that. Right. That's right. That's right. I mean, I just made up that number. Yeah. It's mostly like
44:57 just being able to sort of work with the core data structures, understanding the syntax and how it
45:01 works and even like defining some simple functions. Right. I think most people using Pandas
45:06 are not at the end going to define functions, although as I've gone on with my use of Pandas,
45:11 I see, you know, Lambda, this is where it's at. Like knowing how to use it really, really helps.
45:17 So there I do sort of go back with the Pandas people. I'm like, OK, this is going to be super
45:21 weird. We're going to talk about anonymous functions now. And then and then like I typically
45:26 do that with a more advanced part of Pandas groups, not with the like introductory ones.
45:30 But if you can sort of wrap your mind around that, then it does help quite a bit. It does.
45:34 So it takes what would be multiple step things and lets you turn it into one of those chained
45:39 expressions because instead of defining a function somewhere else, you can just put it in line as
45:43 part of a Lambda and just keep on going, you know. Great. That's right. And I find it
45:48 especially useful, so we mentioned .loc earlier. So .loc allows you to choose
45:52 rows, to start with that. And so I can pass it a Lambda, and that Lambda then gets the data frame
45:59 you're working on as an argument. And then whatever it returns, if it returns a Boolean
46:05 series, then it allows you to filter. Well, you can then have multiple .locs and multiple
46:10 Lambdas in a row to do successive filtering. And yes, that is less efficient than doing it all in
46:15 one fell swoop. But boy, oh boy, it's easier to think about. It's easier to use. So you just like
46:20 whittle it down, whittle it down, whittle it down each line with its own Lambda. And so understanding
46:25 how to do that, and understanding why it's important that inside the Lambda you use the
46:30 Lambda's temporary parameter as opposed to the overall variable for your data frame,
46:35 because you're chaining it, that's useful as well.
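Something like this, sketched with made-up column names: each .loc takes a lambda, and the lambda's parameter is the intermediate frame at that point in the chain, not the original variable:

```python
import pandas as pd

df = pd.DataFrame({
    "trip_distance": [0.5, 2.3, 8.1, 12.7, 30.0],
    "total_amount":  [5.0, 11.5, 27.0, 48.0, 120.0],
    "payment_type":  [1, 2, 1, 1, 2],
})

result = (
    df
    # Whittle the data down one .loc at a time; each lambda receives the
    # frame as it exists at that step of the chain (here named d), not df.
    .loc[lambda d: d["trip_distance"] > 1]
    .loc[lambda d: d["total_amount"] < 100]
    .loc[lambda d: d["payment_type"] == 1]
)
print(result)
```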
46:41 Well, 100%, then you can also like comment out certain lines at the end and see what the intermediate values are. And so you say it's
46:44 less efficient. It's less computationally efficient. It might be more efficient as a
46:49 human being trying to understand what the heck is happening, right? That is spot on. I like,
46:54 I often tell people that I think Python is a language for an age in which computers are cheap
47:00 and people are expensive. Because like, right, our efficiency is the big bottleneck in terms of time,
47:07 in terms of money. So right, if it takes my computer a few seconds more, who the heck cares?
47:11 Right. My M2 pro doesn't care if it's like a tiny bit more, I don't care.
47:16 That's right. That's right.
47:18 Or my cluster of GPUs, take your pick. Although at that point it starts to cost real money to
47:22 burn those things, you know? Right. But still, I mean, depending, look, if you see that your computation is taking a long time, okay, so then you sort of find that and you
47:31 improve that. Yeah. That's a really good point. Like a lot of things are, don't worry about that
47:35 until it's actually going to become a problem. It looks slow. It turns out it's probably a blink
47:38 of an eye. All right. Let's keep moving on. We're getting short on time and we have a plethora of
47:44 things to work with. I kind of want to go to wine words. Wine words. Wine words. All right.
47:50 Let me remember what exercise that is. That is 37. All right.
47:54 Yeah. So people think of pandas, not wrongly, as being great at working with numbers,
48:00 but it turns out that it's fantastic at working with two other kinds of data types. One is strings
48:07 and one is dates and times. And you can get a ton, a ton out of analysis working with these if you
48:13 know how to work with them. So there is, I forget where it is, like there's a machine learning,
48:20 like archive of data sets. And one of them is 150,000 wine reviews from Wine Magazine.
48:28 Oh, wow. Is it Kaggle maybe or is it somewhere else?
48:31 I think Kaggle has a version of it, but I think it's elsewhere.
48:35 Wine Mag, 150,000 reviews. Okay. Beautiful.
48:39 I don't know. I'll find out where it is. Anyway.
48:41 No worries.
48:41 So I said, okay, let's find out. Because you drink a bottle of wine and you read the back
48:47 and you sort of like, you know, roll your eyes at what they've written. Although I'll put in a plug,
48:51 I read this fantastic book a few years ago called Cork Dork by this journalist who decided to become
48:57 a sommelier, and she took the exam, and on her journey there, she convinced me these words actually
49:03 have real meaning and people are very serious about it. So I will not roll my eyes quite as
49:07 much anymore.
49:07 I'm getting hints of nutmeg, but they're troubling.
49:10 Yes. Yes.
49:12 They don't belong.
49:13 So the question was, okay, what about these reviews of wine? What words are people using?
49:19 And are they using certain words more with California wines and certain words more with
49:24 French wines or certain words more with red or with white or rosé? And so we can then take this
49:30 text, break it apart and search for it. We can search for it using plain old Pandas stuff. We can
49:35 search for it using regular expressions, which it has built in and works very well.
49:39 So here, let's see some analysis on the words. So, first: the 10 most common words for red wine.
49:47 Right. So how do we do that? Well, we have to take the description and break it apart.
49:51 And if you're used to using just plain old Python, you're like, oh, well, I guess I'll
49:55 break that into a list. But now what? Now I have a series of lists. Now what do I do?
49:59 And so one of the key methods to know here is something called explode. And explode is:
50:05 let's take a series of lists and turn that into a very, very, very long series. And so basically
50:12 each element in the list becomes an element in the series, and they all share an index. So you
50:16 know where they're originally from, and then, you know, the world's your oyster. So you can get rid
50:21 of punctuation. Right. So I have s.str.lower. Okay. So we lowercase everything. Then
50:26 .str.split, and we split into a list. Then we say explode, get it into a long series. Then we once
50:31 again run .str.strip, get rid of punctuation. And now we can use isin, which is yet another fantastic
50:37 pandas method to say, are these words in... like, find the lines where these words are located.
50:43 And then we can just do a value_counts. Definitely my favorite method. How often does this thing show
50:48 up? And then we use head. Ta-da. We've got the 10 most common, 10 most common words there.
50:53 Cause the most common, the counts, the value_counts, sorts probably most
50:57 common to least common, right? Yeah, yeah, exactly. So value_counts,
51:01 not only does it count how often something shows up, but exactly, it sorts it from most common to
51:05 least common. So then head, yeah. If you want, yeah. However much you take there, and then that's,
51:10 that's the one. So those are the popular ones. That's right. That's right. So here I have like,
51:14 you know, the page we're looking at. So we find where the country is France. So that's
51:18 the row selector, and we want the column selector to be description. So we only want the description.
51:22 And then we have our function, top 10 words, which does what I just described. We're going to take
51:26 these common wine words, lowercase them, then get rid of the punctuation, and so forth, pass it in. And we
51:30 get back our top words, and we find out what words are, you know, associated with French
51:35 wines as opposed to California wines and so on and so forth. Excellent. Yeah. Very, very cool example.
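A rough sketch of the pipeline being described, for readers following along; the file name, column names, and word list here are assumptions in the spirit of the winemag dataset, not the book's exact code:

```python
import pandas as pd

df = pd.read_csv("winemag-150k-reviews.csv")  # assumed file name

# Illustrative list; the book uses its own set of common wine words
wine_words = ["fruit", "oak", "tannins", "crisp", "earthy", "cherry"]

def top_10_words(descriptions: pd.Series) -> pd.Series:
    words = (
        descriptions
        .str.lower()           # lowercase everything
        .str.split()           # each description becomes a list of words
        .explode()             # one long series, one word per row
        .str.strip(",.;:!?")   # strip punctuation from each word
    )
    # keep only the wine words, count them, and take the ten most common
    return words[words.isin(wine_words)].value_counts().head(10)

# Row selector: country is France; column selector: just the description
print(top_10_words(df.loc[df["country"] == "France", "description"]))
```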
51:40 All right. We, I think we've got time for two more to go through. You want to pick two that
51:45 are popular, or do you want to leave it? Let's see. Let's see. Let's do, maybe 32,
51:53 multi-city. Actually... yeah. Multi... best tippers. Yeah. You mentioned... oh, you're going to open up
52:01 a whole, this is going to be a whole thing. Best tippers. Number 42. Let's go. No, no, no judgment
52:07 here, folks. All right. So the question is like, so we'll try to understand as I read here,
52:11 when people tip their taxi drivers more generously. All right. So the question was,
52:15 did they tip better? And we looked at 2019 before the pandemic in January and July. So do they tip
52:21 more in the winter or do they tip more in the summer? And so this involves several things.
52:26 First of all, it involved using dates and times. Again, one of those things that pandas is just
52:31 amazing at, and people are not aware of all the flexibility you have there. Another thing is how
52:37 easy it is to create a new column, right? So we spoke a bit before about how you can use broadcasting
52:41 to just multiply a column by something, or add a column to another column,
52:46 but you can create a new column just by assigning to it. Again, it's sort of like assigning to a
52:49 dictionary. It just then is there. And so you can calculate the percentage that people tip,
52:54 put that in a new column and then say, well, let's now group by the month and let's find out.
53:00 - Take the mean or the max or some sort of deviation.
53:03 - Exactly. Exactly. And we can find out like, you know, whether people tip more on average
53:09 in January or in July. I honestly don't remember what the answer is, which will hopefully not tick
53:17 off too many readers. Here we go. Let's see what it says. Oh, here we go. Go back one page there.
53:22 So 32% of taxi riders in New York don't tip at all. That's right. That was a surprising thing to me.
53:27 And then I don't remember exactly what it was for month.
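The tipping analysis follows this general shape; the column names below are the usual NYC yellow-taxi ones and are assumptions here, not a quote from the book:

```python
import pandas as pd

# Assumed NYC yellow-taxi column names; adjust to match your file
df = pd.read_csv(
    "yellow_tripdata_2019.csv",
    usecols=["tpep_pickup_datetime", "fare_amount", "tip_amount"],
    parse_dates=["tpep_pickup_datetime"],
)

# Creating a new column is just assignment, like adding a key to a dict;
# broadcasting does the element-wise arithmetic for us.
df["tip_pct"] = df["tip_amount"] / df["fare_amount"] * 100

# What fraction of riders left no tip at all?
print((df["tip_amount"] == 0).mean())

# Group by the pickup month and compare the average tip percentage
print(df.groupby(df["tpep_pickup_datetime"].dt.month)["tip_pct"].mean())
```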
53:31 - And you know, you're talking summer versus winter. There's probably a tourism
53:35 angle, you know, locals versus tourists. I mean, I know people go to New York even when it's
53:39 not summer, but not as much, I imagine.
53:41 - That's right. That's right. There are all these different factors that like, you know, come into
53:46 it. - Yeah. One of the kind of takeaways I'm feeling as we've been talking about all these
53:49 is that there are a lot of interesting questions that can be asked and answered really quickly,
53:54 for sociology and urban planning and all kinds of fields, interesting questions that don't feel
54:00 like programmer questions. - Yes. So, you know, I mean, I talked to a lot of people, you probably
54:05 do too, who are like interested in advancing their careers with Python and they're like, well, you
54:11 know, I don't have a computer science background. Can I get a job as a programmer? And their vision
54:16 of a programmer is either someone working at a startup or at like one of the big companies,
54:20 Google, Amazon, Facebook, and so forth. I say to them, look, there are an awful lot of people who
54:26 have great jobs working at supermarket chains and insurance companies in the backroom analyzing data
54:33 and they are crucial, but we don't think of them as programmers and in governments and in cities
54:38 and weather forecasting, like everyone nowadays is using data and collecting and analyzing it
54:43 and having these skills either gives you abilities to do your job better or gives you the ability to
54:49 move into a new job that you couldn't have done before that these places are desperately looking
54:54 to fill. - Yeah, I totally agree. I can't remember which episode it was. I didn't name it right. So
55:00 there's an episode I did quite a while ago about programmers and data scientists, Python
55:05 data scientists, working together. And I think it was the research arm of Kroger, maybe, that had like
55:10 200 data scientists. - Wow. - That's a proper group of data scientists. I mean, a lot of times data
55:17 scientists, I feel like there's a couple of them for a company compared to a software team or
55:21 something. - No, and I tell people also, it's worth learning to write and it's worth learning to
55:27 speak, not because you're going to be like a Pulitzer prize winning writer, not because you're
55:31 going to be like whatever prize you would get for like speaking well, just because having these
55:36 skills makes you more effective at your job. And people who can write a program to suck up some
55:40 data, analyze it and come back with a result, especially if it's like a public data set that
55:45 has something to do with what they're working on, they are so much more valuable to their company
55:50 than they would have been otherwise. - Oh yeah, absolutely. So I found it because I have a search
55:55 engine on Talk Python that searches the transcripts and everything else. So if you want to know
55:59 something, people, about historical shows: when you hit search, it's not just the show notes.
56:04 So, Scaling Data Science Across Python and R, episode 236, with Ethan Swan and Bradley Boehmke,
56:12 and the company is 84.51°. So anyway, that's- - Oh, I've seen that before.
56:18 - Yeah, yeah, yeah. That's what it was. So people can check that out if they're
56:21 interested, but- - Very, very cool.
56:22 - Yeah, super cool. Okay, last one. What are we doing?
56:25 - Let's do cities. - Cities. What number is it?
56:29 - So 43. - Okay, oh, that's right.
56:31 - So I found a few years ago this JSON file containing the thousand largest cities in the
56:38 United States. So first of all, good to work with JSON because people need to know how to work with
56:44 that kind of data. Second of all, okay, let's not look at the numbers so much as let's do some
56:50 plotting. And so I'm going to make enemies now. I really cannot handle matplotlib. I find it,
56:57 it's like so incredibly powerful. And anytime I want to do even the tiniest thing with it,
57:02 I have to look up the documentation and remind myself how it works. Maybe it's because I don't
57:06 use it enough. And so I just use the pandas interface for plotting 95% of the time. Let's
57:12 call it 90% of the time. The rest of the time I usually use Seaborn. So I said, okay, let's see
57:17 if we can do some plotting here of these largest cities. So for example, let's see growth in
57:22 Pennsylvania cities, like which cities in Pennsylvania... oh, I'm sorry, like, let's do a
57:27 bar plot: how many large cities are in each state? Okay, well, that's a group by, right? So you have
57:31 to know how to do grouping. You have to know what you're grouping on, and how to use count as opposed
57:35 to mean, right? A mean doesn't even make sense there. But then if I want to see a bar plot where it's sorted from
57:40 smallest to largest, you've got to know how to do sorting. And then we can do a bar plot. So sure
57:44 enough, you do a .plot.bar on your series, kablam, you have a bar plot right there. And we
57:50 just get it in like, you know, it's nicely sorted there. Quick takeaways. Holy cow, does California
57:55 have more big cities than I realized? And you know, where's New York and New Jersey? Like it's
58:00 way down the line. You think of those as having like pretty metropolis type places. Massachusetts.
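A minimal sketch of that count-and-plot step, assuming the JSON has city and state fields (the file name here is made up):

```python
import pandas as pd

cities = pd.read_json("largest_us_cities.json")  # assumed file and field names

# Count how many of the 1,000 largest cities each state has, sort the
# counts so the bars run from fewest to most, and draw a bar plot
# straight from the resulting series.
(
    cities
    .groupby("state")["city"].count()
    .sort_values()
    .plot.bar(figsize=(12, 5), title="Large cities per state")
)
```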
58:06 That's right. That's right. But it's how many cities, right? So, right, right, right. That's
58:12 like the fun. So keep going. So like do a bar plot of growth in Pennsylvania. So how are we going to
58:17 do that? Well, wait a second. If we want to do growth, it's a percentage, but it's written in
58:21 the JSON as number and then percent sign. That's a string. So we're going to have to get rid of the
58:26 percent sign. And then we're going to have to change the dtype from string to a floating point.
58:31 And then we can sort it, and then we can graph it. And so changing the dtype is basically a
58:38 deep-down-inside-of-pandas parse-as-a-number operation, right? Rather than you looping over and
58:44 parsing it or whatever. Yeah. Yeah. Yeah. Like basically, I mean, it has to do the looping.
58:48 It's doing the looping there below the C level. Yeah. Below sea level, yeah. Down by the
58:53 Dead Sea. I see. Okay. Got it. And so you see here, like, this bar plot. Now we have a whole bunch of
59:00 cities that have gone down in size and a bunch of things going up in size and pandas is very happy
59:04 to show the bar plot there again, sorted so that we see it from the greatest sort of shrinkage to
59:12 the greatest growth. Right. With Pittsburgh having the most shrinkage and Allentown having the most
59:17 growth. I would not guess that. I would not have either given that like my sister-in-law and
59:21 brother-in-law live in Allentown and they're like, "Oh, Allentown." But I think that's a common thing
59:27 that people think there, even though it seems like a perfectly great place to be. Yeah. Sure.
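And a sketch of the Pennsylvania growth plot; the growth column name is an assumption based on the commonly circulated version of this dataset:

```python
import pandas as pd

cities = pd.read_json("largest_us_cities.json")  # assumed file and field names

# Growth arrives as strings like "2.4%", so strip the percent sign and
# change the dtype to float before sorting and plotting.
pa = cities.loc[cities["state"] == "Pennsylvania"].copy()
pa["growth"] = pa["growth_from_2000_to_2013"].str.rstrip("%").astype(float)

pa.set_index("city")["growth"].sort_values().plot.bar(
    title="Growth of Pennsylvania cities"
)
```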
59:31 Funny. Awesome. Well, there's a lot of cool takeaways from this.
59:35 One more. Let me show you one. Go to the last one here in this exercise.
59:39 Okay. You got it.
59:40 So one of the most important types of plots you can do in data analysis is a scatter plot. And
59:47 you take your data frame and you say, "Give me a plot where the X axis should be this,
59:53 the Y axis should be that." Well, if we do a scatter plot based on longitude and latitude
59:58 of the thousand largest US cities, you basically come up with a map of the US.
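That scatter plot is a one-liner with the pandas plotting interface, again assuming longitude and latitude fields in the same hypothetical file:

```python
import pandas as pd

cities = pd.read_json("largest_us_cities.json")  # assumed file and field names

# Plotting longitude against latitude effectively draws a map of where
# the 1,000 largest US cities sit.
cities.plot.scatter(x="longitude", y="latitude", s=5, alpha=0.5)
```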
01:00:02 Yeah. How interesting. The other thing that you realize is California is technically bigger than
01:00:07 a lot of other states. So that also accounts for why. Oh, yes.
01:00:10 Yeah. Interesting. But there's a lot of people in California. Good place to be.
01:00:13 There are a lot of people there for sure.
01:00:14 If they're willing to pay the sunshine tax, it's nice there.
01:00:17 Cool. All right, Reuven. This is really, really excellent. I know you want to give a shout out to
01:00:24 a couple of exercises from Bamboo Weekly. So I'll include those in the show notes as well.
01:00:29 People can pick those and jump over and look at them as well.
01:00:32 That's fine. That's great. Yeah. We went through a lot here just during this time.
01:00:36 Yeah, for sure. As I've discovered, there's basically, I mean, there is an infinite number of these sorts of things that you can do and different
01:00:43 permutations of various sorts. I mean, we didn't even talk about multi-indexing, which opens up
01:00:49 literally a whole new dimension of this stuff. But it's great fun. It's pretty good fun.
01:00:53 And a lot of the Pandas 2 stuff coming on changes the foundation and opens up more
01:00:57 possibilities still. Absolutely.
01:00:59 All right. Well, final thoughts, final call to action. What do we got here? What do you say?
01:01:03 Well, call to action. I mean, you can take a look at my overall courses and so forth at
01:01:09 lernerpython.com. Or if you just want to improve your Pandas knowledge, you can look at
01:01:13 bambooweekly.com, the newsletter there. Does it cost money or does it just cost an email?
01:01:18 So it is a paid newsletter, but the first two questions and answers every week are free.
01:01:23 So it's usually between five and nine questions each week. But even if you don't pay,
01:01:30 I still want people to get some Pandas learning improvement practice so they can totally do that.
01:01:36 Excellent. All right. Yeah. And check out the book, obviously.
01:01:38 Oh, yeah. That too. Pandasworkout.com. I'll pull this up.
01:01:42 It's actually pretty thick.
01:01:44 Yes, it is.
01:01:45 At the end, I was like, "Oh, that's why it took a long time." Also procrastination.
01:01:52 I'd love to hear from people if they have... I always love to hear interesting problems,
01:01:59 data sets, issues that people are experiencing just so I can figure out what are the next
01:02:04 directions for me to explore that I can then try to help people with as well.
01:02:08 Awesome. Well, I appreciate all your help coming on the show, sharing your knowledge, and just riffing on Pandas together. It was a lot of fun.
01:02:16 Excellent. My great pleasure.
01:02:17 Yeah. Bye.
01:02:18 All right. Bye-bye.
01:02:20 Bye.
01:02:20 This has been another episode of Talk Python to Me. Thank you to our sponsors. Be sure to check
01:02:25 out what they're offering. It really helps support the show. Take some stress out of your life. Get
01:02:30 notified immediately about errors and performance issues in your web or mobile applications with
01:02:35 Sentry. Just visit talkpython.fm/sentry and get started for free. And be sure to use the promo
01:02:42 code TALKPYTHON, all one word. This episode is brought to you by Scalable Path. If you're a
01:02:47 founder or engineering leader, you know how hard it is to find top-tier developers while keeping
01:02:52 costs low. Scalable Path is a software staffing company that helps you build remote dev teams
01:02:58 that just fit. Build your team at talkpython.fm/scalablepath. Want to level up your Python?
01:03:05 We have one of the largest catalogs of Python video courses over at Talk Python. Our content
01:03:09 ranges from true beginners to deeply advanced topics like memory and async. And best of all,
01:03:15 there's not a subscription in sight. Check it out for yourself at training.talkpython.fm.
01:03:19 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.
01:03:24 We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed
01:03:30 at /play, and the Direct RSS feed at /rss on talkpython.fm. We're live streaming most of
01:03:37 our recordings these days. If you want to be part of the show and have your comments featured on
01:03:41 the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube. This is your
01:03:47 host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and
01:03:51 write some Python code.