#341: 25 Pandas Functions You Didn’t Know Existed Transcript

Recorded on Thursday, Nov 4, 2021.

00:00 Do you do anything with Jupyter Notebooks? If you do, there's a very good chance you're working with the Pandas library. This is one of the primary tools for anyone doing computational work or data exploration with Python, yet this library is massive, and the idiomatic way to use it can be hard to discover. That's why I've invited Bex Tuychiev to be our guest. He wrote an excellent article highlighting 25 idiomatic Pandas functions and properties we should all keep in our data toolkit. I'm sure there is something here for all of us to take away and use Pandas that much better.

00:33 This is Talk Python to Me episode 341, recorded November 4, 2021. Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy, and keep up with the show and listen to past episodes at Talkpython.FM. Follow the show on Twitter via @talkpython. We've started streaming most of our episodes live on YouTube. Subscribe to our YouTube channel over at 'Talkpython.FM/YouTube' to get notified about upcoming shows and be part of that episode.

01:18 This episode is brought to you by Shortcut and Linode, and the transcripts are sponsored by Assembly AI.

01:26 Bex, welcome to Talk Python to Me.

01:29 Hello, Michael. Thanks for having me.

01:30 Yeah, it's fantastic to have you here on the show. Your article, 25 Pandas Functions that you didn't know, or probably don't know, I guess, as we'll see, really caught my attention. Honestly, I didn't know many of them, so I've learned a bunch by reading it. I do spend more time on the web side of Python and the database side of Python than I do on the data science side. But certainly Pandas is a super important part of Python these days.

02:00 And honestly, the whole data science side is the fastest growing part of Python. Pandas is one of the first libraries that you will be introduced to in any beginner Python or beginner data science course. And it's amazing how much it has grown since it was first launched. The funny thing about the article is that before writing it, I also didn't know most of the functions. I would always get annoyed by people who used some complex functions, and I just wanted to know how they worked and explain them to my audience. That was the idea of the article.

02:36 Both you and the audience learning. That's the little secret behind these types of things, behind the tutorials, behind articles and podcasts, and even behind courses. A lot of times we dive into them because we're like, I really want to learn these things, so let me put it in a format I can present to the rest of the world and help everyone else out. Right?

02:55 Yeah. Awesome.

02:58 Before we get into this, I want to talk about your articles and some Kaggle competitions and we'll dive into the 25 functions, but let's start with your story. How did you get into programming and Python?

03:07 Right after I finished high school, I got interested in web development. I learned HTML and CSS, and I was hoping for things to get more exciting. But at some point I just got bored, because I'm really into math, and web development has nothing to do with math. It was very boring. So I switched to learning Python, learned it for a while, and discovered that data science is mostly connected to math and statistics.

03:39 I just bought a really good course, and that's how it started.

03:44 Yes, that's fantastic. I think people do often feel like you have to be really good at math to be good at programming. And honestly, most of programming has very little to do with math.

03:54 Yes, of course, data science does.

03:57 Data science is unique in this way, I guess, as is computational science; if you're an astrophysicist, you do a lot of math as well. But for most of us, math is just a structured way of thinking, and we write structured programs, and that's kind of the end of the relationship there. But if someone is out there and they really love math and want to take it farther, but they want to do that with computers, it sounds like recommending data science might be the right path.

04:22 Yeah, of course. It's really beautiful how software and math connect together in data science. Look at the kinds of things you can build, like neural networks and state-of-the-art machine learning algorithms. It's really amazing.

04:35 Yeah. It's one of these areas that's just growing so fast. And there's such big advancements.

04:44 I think back to when I was in College and we talked about artificial intelligence and AI, and it was all about the Turing test.

04:53 Could you get a chat bot that would trick a human into thinking it was another actual human? And it never really seemed to come into reality. It always seemed like it was 30 years out. And then all of a sudden, we have self-driving cars, and we have GitHub Copilot.

05:11 The step jump over the last couple of years has been amazing.

05:14 Yes. I was also amazed by GitHub Copilot right after it was launched. I wrote an article on it as a kind of intro, and it really took off; so many people were interested in it. The article received more than 50,000 views. A lot of people are amazed by it.

05:30 I'm amazed by it as well. It's also bringing to light some interesting, almost legal and philosophical things. Right? If people put code on GitHub, they didn't actually intend to train an AI with it. If they put code on GitHub that's under GPL, well, is what the AI produces now GPL, or can that be used in closed source? These are not known. These are interesting questions.

05:58 I don't think we're going to answer them. We're not going to completely figure them out today.

06:02 Let's focus on something a little smaller. So you mentioned your articles, and you've been doing a lot of writing. You're a top ten writer in artificial intelligence on Medium, and you're also a Kaggle master.

06:17 Yeah.

06:18 Let's talk about those two things for a little bit. Just give us a sense of the stuff that you write about on Medium and maybe some of your favorite articles before we dive into the one that I picked out.

06:26 I started writing on Medium a year ago. It was purely for educational purposes. I really liked how the things you learn get locked into your brain by writing about them. So it was a really amazing way to learn something new. But as my number of articles grew, my audience grew, and I met a lot of people. Writing opened a lot of doors for me.

06:53 Yeah.

06:54 And most important of all, I'm more confident about my knowledge than ever before.

06:59 That's fantastic. I really like that you point out that it opened doors, because so many people feel like, I'm not ready to write, I'm not ready to speak at a user group or a conference, or I'm not ready to appear on a podcast, or any of these sorts of ways where you put yourself out there.

07:14 Right.

07:15 But when you do that, the act of doing it pushes you to grow. And it also opens doors to people. If you're out there and you're genuine, you don't have to be an absolute expert in everything. You just have to be excited and interested. Other people who are excited want to talk to you and work on something with you, right?

07:30 Yes. You just have to be one step ahead of your audience. And that's it, right? When you write articles, that's enough.

07:36 And not necessarily ahead in everything, just in the little area that you're interested in. Right?

07:39 Yes.

07:40 Awesome. So that's really great that you're doing this writing and such. The other thing is Kaggle. Tell us about what you've been doing on Kaggle.

07:47 I really admired people who had been doing competitions on Kaggle for a while, and I really had this imposter syndrome. I couldn't join the competitions because I thought that they were too complex, that I had a lot of things to learn before I joined them. I still do. But after I joined the Tabular Playground competitions, I learned that I can do it. So I started posting my articles in the form of notebooks on Kaggle as well. They started getting a lot of views and really nice comments from the audience. The community on Kaggle is even more amazing than on Medium. For an article that gets read by 1,000 people on Medium, I usually receive one or two comments. But if you post the same article as a notebook on Kaggle, the audience loves it, because Kaggle is well suited for these kinds of tutorials, and I usually receive 30 or 40 comments. That's really amazing as a writer, to be part of that kind of community.

08:47 Yeah. That's really amazing. I had no idea. I didn't realize you could post on Kaggle. You kind of post your solutions and then have a conversation around them, sort of. Right.

08:55 Yes.

08:55 Okay. Awesome. If people want to get started with Kaggle, what do they need to do? Maybe before we drop this topic, if people haven't done stuff with Kaggle yet but want to use it to learn, what's your advice there?

09:05 Yeah.

09:07 Right. After you create an account, they have a whole suite of free courses you can take. I think those are very good starting points for any beginner. They also have two or three beginner-level competitions, so you don't get intimidated by those Grandmasters or Masters. They are just simple data sets you can work with; you just have to submit your predictions and you get a score. Nothing too complex. That's really the amazing part of Kaggle. That's why those beginner competitions have, like, 100,000 people competing at any single time.

09:50 One of the challenges when you're learning is finding a structured problem to approach.

09:56 Maybe in the web world, people try to build things that are too ambitious. They're like, oh, I want to build Airbnb, and you hardly understand CSS yet. Let's take it down a notch and find the right-sized problem for you to address. Data science has the same problem, but I think it has another aspect, which is that you need data to start from, right?

10:18 Kaggle helps in bringing that kind of stuff over.

10:21 Yeah. Kaggle has an amazing list of data sets. I almost always use Kaggle data sets for my articles, because most of them are digestible and small enough for people to take advantage of.

10:36 Awesome. A question from the audience: Brandon Bennett asks, are Kaggle competitions just machine learning and artificial intelligence related, or are there other types?

10:45 Yeah. Kaggle competitions are only AI or data science related.

10:49 Yeah. Okay.

10:50 For example, the latest one launched on Kaggle, I think, is about finding the cuteness quotient of pets. You take thousands of images and process them with Python or R, and the neural network learns the structure, learns the cuteness quotient, and spits out a new quotient for any new image you give it.

11:15 That's amazing.

11:18 Here's a machine learning model that can answer, is it a cat or a dog, and now it's giving you a cuteness score?

11:23 Yeah.

11:24 I can definitely see my daughter getting into data science with this one. She's all about pets and cats and dogs. I personally want to put a vote out there for the Golden Cocker, the golden retriever mixed with a Cocker spaniel. Boy, those things are cute. Okay, so Kaggle sounds really great for learning. And I suspect knowing something about Pandas will pay off.

11:49 Right. Like, it's such a foundational aspect.

11:53 Pandas is used extensively.

11:55 It is. And I feel like Pandas is one of those things that you can learn really quickly. You could learn to do stuff with Pandas in a day.

12:04 But then in a year, you could still be learning stuff about Pandas, even if you used it every day for that year. You know what I mean?

12:10 Yeah. Most data science libraries are just very vast. There are a lot of functionalities, and most of the time you can get by learning ten or 15% of all those functions. But when you really need to do something, like really rare edge cases or unique cases, you need to know those rare functions that are buried in the documentation, just so that you don't have to reinvent the wheel.

12:37 Yeah. In Python, we speak about Pythonic code. There's code that we could write that might run, but it looks like it comes from Java or it looks like it comes from C, and somebody just got it working. I suspect you have the same thing in data science and around Pandas. It's like, yeah, you technically could do this with Pandas, but why don't you just call this function? And probably the answer is, well, I didn't know that function existed. Of course I would have called it if I'd known to, but I just didn't know.

13:06 Right.

13:06 I'm new.

13:06 Yeah.

13:07 So hopefully we can shine a light on some of those things that you can do.

13:12 Not that we'll necessarily cover it in your article, but if you're doing a for loop with a data frame, you're probably doing it wrong. Right?

13:18 Yeah. The golden rule is to never use loops, like, completely.
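To make that concrete, here is a small sketch of the loop-free style being described; the frame and column names are made up:

    import pandas as pd

    df = pd.DataFrame({"price": [9.5, 120.0, 42.0], "quantity": [3, 1, 2]})

    # Vectorized column math instead of a Python for loop:
    df["total"] = df["price"] * df["quantity"]

    # Element-wise logic via map and a lambda, still no explicit loop:
    df["size"] = df["total"].map(lambda t: "big" if t > 100 else "small")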

13:23 Yeah, that's pretty interesting. It definitely takes a different way of thinking: sort of set-based processing, and passing lambda expressions to various places, and maps and whatnot. Okay, we're going to talk about some of those. Let's dive in. First of all, how did you pick these 25? Were these just 25 that you saw people use and found interesting, or didn't even know existed? What was your philosophy for this kind of article?

13:46 I usually go to the API reference of the documentation. It lists every single class and functionality of a library, and I just read them one by one, decide which of those are going to be beneficial to me and possibly for my audience, and I pick them out one by one.

14:05 Yes, that's really cool. I love to discover these types of things. So why don't you kick it off with number one?

14:12 The first one is ExcelWriter.

14:15 It's a class for writing to Excel sheets. If you have multiple data frames, you can write them to one Excel file as separate tabs with separate sheets.

14:27 DataFrames have this to_excel function, and if you give it the ExcelWriter instance, it's going to write to a separate sheet. It enables you to write to separate sheets.

14:37 This is super neat. So in your example here, which of course we'll link to in the article, and people can check out the code samples under each one of these, you've got two data frames, and you want to put them into some kind of Excel spreadsheet. So you create one of the writers, this is the class you're talking about, and then you go to the data frame and say to_excel, and you give it the writer and a sheet name. And you can do that for each data frame and give them different sheet names, and they just pile up along the bottom. Right?

15:05 It's really neat.

15:06 It's ridiculously simple. Right? Given the data frames, it's three lines of code to create an Excel file and write it. Yes.
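For reference, a minimal sketch of that pattern; the frame contents and sheet names are made up, and an Excel engine such as openpyxl needs to be installed:

    import pandas as pd

    df1 = pd.DataFrame({"a": [1, 2]})
    df2 = pd.DataFrame({"b": [3, 4]})

    # One workbook, one tab per frame; pandas delegates the writing to openpyxl.
    with pd.ExcelWriter("report.xlsx") as writer:
        df1.to_excel(writer, sheet_name="First")
        df2.to_excel(writer, sheet_name="Second")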

15:13 If you didn't know this, you'd have to create two separate Excel files and put them together manually later, which is not programmatic.

15:21 Right. Or maybe, say you don't know that you can write to Excel. I'm pretty sure I could write to a CSV, and there are multiple levels. Right? One level is, I'm going to write it line by line, putting the commas in there myself. Another one could be to_csv and read_csv. But this one is more structured. And then you could possibly use some of the more advanced tooling to do things like stylize or highlight aspects of it or whatever, like openpyxl or something like that. Now, for this one, it says that you need to have the right supporting libraries installed. You, for example, have to have different libraries. I can't remember which one it was. I think it was openpyxl.

16:04 Yeah, that's it. Right here. I knew it was in here. Yeah. Openpyxl, if you want to work with xlsx files, and there are other ones as well. Right?

16:12 Otherwise you'll get an error.

16:14 Right. So basically, Pandas delegates to this library, which actually understands Excel, and writes to it. There's another one where it talks about using 'fsspec', and this caught my attention, like, oh wow, this is way more flexible, because I'm not sure if people are aware of what fsspec is. Are you familiar with fsspec?

16:35 No.

16:35 So fsspec is this library that allows you to treat different destinations as Python file systems, like with open and some file name, but instead of a plain file name, you can do all sorts of stuff. So let me see if I can find some of the documentation here of the things it integrates with, a bunch of different places.

16:58 But it goes to places like S3 storage and FTP and databases and Zip files and all of these types of crazy things. And it even does caching, it says. Right? So this ExcelWriter already sounds really interesting because it writes to Excel-like destinations. It could be an Excel file in a database or something, with basically hardly any changes to the code.
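As a sketch of that flexibility, assuming the optional fsspec and s3fs packages are installed, and with a hypothetical bucket name:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2]})

    # Pandas hands "s3://" paths off to fsspec/s3fs under the hood.
    df.to_csv("s3://my-hypothetical-bucket/output.csv")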

17:23 Yeah.

17:24 Yeah, that's super cool. So, good one to kick it off. There's a lot going on.

17:30 This portion of Talk Python to Me is brought to you by Shortcut, formerly known as Clubhouse.io. Happy with your project management tool? Most tools are either too simple for a growing engineering team to manage everything, or way too complex for anyone to want to use them without constant prodding. Shortcut is different, though, because it's worse. Wait, no, I mean it's better. Shortcut is project management built specifically for software teams. It's fast, intuitive, flexible, powerful, and many other nice positive adjectives. Key features include team-based workflows: individual teams can use default workflows or customize them to match the way they work. Org-wide goals and roadmaps: the work in these workflows is automatically tied into larger company goals. It takes one click to move from a roadmap to a team's work to individual updates and back. Tight version control integration:

18:19 whether you use GitHub,

18:20 GitLab, or Bitbucket, Shortcut ties directly into them, so you can update progress from the command line. A keyboard-friendly interface: the rest of Shortcut is just as friendly as their power bar, allowing you to do virtually anything without touching your mouse. Throw that thing in the trash. Iteration planning: set weekly priorities and let Shortcut run the schedule for you, with accompanying burndown charts and other reporting. Give it a try over at Talkpython.FM/Shortcut. Again, that's 'Talkpython.FM/shortcut'. Choose Shortcut, because you shouldn't have to project manage your project management.

18:59 The next one is Pipe, right?

19:01 Yeah.

19:04 There's like a lumberjack-looking dude smoking a pipe. That's very cool. Yes.

19:08 Tell us about pipe. When you do data analysis, most of the time the data you'll be dealing with will not be clean. You have to perform some operations, and pipe really offers a way to package all those operations into a single line of code or into a single block of code.

19:29 It's kind of like scikit-learn pipelines, but you just have to run a single line of code to perform several operations at the same time. It's really just a neat way to do data cleaning, right?

19:40 And it's what's called a fluent API. So if I call dataframe.pipe, what comes back is another data frame. And then I could call pipe on it again, and then pipe and pipe, and chain those together, applying different operations and transformations. It's almost like a map-reduce or aggregation-framework type of thing. Right? It's pretty flexible.

20:01 It's just like that across Pandas in its entirety.

20:05 One of the amazing features of Pandas: consistency, always.

20:08 Yeah, I really like it. It looks super neat. So you need to do transformations on a data frame with custom functions and get answers out. Yeah. Another thing that you pointed out here is that you could apply it to the whole data frame, or you could pass a set of columns as part of what you're piping across. What does that do? It reduces the result to just those; if you pass in three things, just those three columns.

20:34 And then these two functions, remove_outliers and encode_categoricals, are functions that accept arguments, and when you pass them to pipe, you just have to pass the function name.

20:45 Got it.

20:46 Which means you can't pass the arguments directly. Actually, you just have to provide them after the comma. So this remove_outliers function accepts one argument, the list, and it performs the outlier removal and returns the result.

21:02 I see. So your function might take the data frame, but it might also take additional information, like, I want to exclude things that are over $100 and just throw them away. Well, you've got to pass that 100 in, because it needs to know 100 versus some other cutoff value. Right. Got it. Yes. Okay.
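A minimal sketch of that chaining; the helper functions and column name here are hypothetical stand-ins for the article's remove_outliers and encode_categoricals:

    import pandas as pd

    df = pd.DataFrame({"price": [20.0, None, 150.0, 80.0]})

    def fill_missing(frame):
        # Stand-in cleaning step: replace NaNs with zero.
        return frame.fillna(0)

    def remove_outliers(frame, column, cutoff):
        # Keep only rows at or below the cutoff in the given column.
        return frame[frame[column] <= cutoff]

    # Arguments after the comma are forwarded to the function by pipe.
    result = df.pipe(fill_missing).pipe(remove_outliers, "price", 100)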

21:18 Cool.

21:18 And you say it resembles a scikit-learn pipeline. Yeah. It's pretty cool. All right. We're up to number three: factorize.

21:27 Tell us about this one. Machine learning algorithms only accept numerical data, and most real-world data sets contain categoricals, which means there are, like, class one, class two, or class three, and you have to encode them to numerics like 0, 1, 2, 3, using, say, a one-hot encoder or label encoder in scikit-learn. But you can do it in Pandas as well. You just have to pass the column to factorize, and it encodes each class as a numeric.

21:58 I see. So let me see if I can give an audio-friendly example for listeners here. Say we've got a data frame where one of the pieces is what the weather was like: sunny, rainy, sun, rain, snow, clouds, something like that. You can't feed 'sun' to the machine learning model. You've got to give it a number, right?

22:21 Yeah.

22:21 So this will convert that to, like, zero for sun, and everywhere sun appeared, you would now have a zero; one for rain, everywhere there's rain; and so on. It just figures out how many different categories there are, and then it gives them a number that can be sent off to machine learning. Right?
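In code, that weather example comes out roughly like this:

    import pandas as pd

    weather = pd.Series(["sun", "rain", "sun", "snow", "clouds"])
    codes, uniques = pd.factorize(weather)
    # codes   -> [0, 1, 0, 2, 3], one integer per original value
    # uniques -> ['sun', 'rain', 'snow', 'clouds']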

22:36 You explained that well.

22:38 Awesome. So I'm learning, right.

22:40 I'm just following along with you here. Awesome. Okay. That's a really cool one. This next one seems a little bit crazy, but it looks very useful: explode, right? What does explode do?

22:51 Surveys usually contain questions that are multiple choice. You can pick more than one answer to a question, and that's recorded as one answer. So you're going to end up with a list in a single cell of the table. If you have question one and the user picks the answers...

23:15 A, B, C, it's going to end up as a list in a single cell of a table.

23:18 Right.

23:18 So, for example, here you have a series that has one and then six and then seven, and then the fourth element is a list of three other numbers. And you're like, wait a minute, those are not supposed to be multi-dimensional. I want a straight series. Right?

23:31 You want a straight series. And when you call explode on this series, it just expands the series vertically: it takes the elements of the single-cell list and expands them vertically.
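Sketched out, the series being described behaves like this:

    import pandas as pd

    s = pd.Series([1, 6, 7, [3, 5, 8]])
    s.explode()
    # 0    1
    # 1    6
    # 2    7
    # 3    3
    # 3    5
    # 3    8
    # The exploded values all keep the original index label (3).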

23:46 Yeah. These are the types of things that you were talking about with loops. Right. It would be easy to go through and say, I'm going to build up a new data frame. And if I see a list instead of a number, I'm going to just start appending those from the list with an inner loop, and then we'll carry on. Right. And here you've literally done it in one line.

24:02 Yeah. This would be crazy complex if you did it, like, manually.

24:05 Right. And honestly slower. Right? Because a lot of this is probably implemented in C, where you would be doing it at the Python layer.

24:12 It can be very slow.

24:13 All right. Another question from Brandon out there. Glad he's here in the live stream. How would I apply Explode to the entire data frame? I'm guessing he's thinking about maybe if you had multiple columns and they each potentially had this.

24:26 I don't think that's possible.

24:27 Yeah.

24:27 I don't think Pandas allows that.

24:29 Yes. Okay. So it's got to be on a series, not on the whole data frame. Right.

24:33 Got it.

24:33 Okay.

24:34 Cool.

24:35 These all have fun names that stand out. On to the next one. Very fun names.

24:39 Yeah.

24:39 And you picked some cool pictures, right? Yeah.

24:42 All right.

24:42 So what's the next one?

24:43 Squeeze.

24:44 Squeeze.

24:45 As you can see, some conditional operations return data frames even when the result is a single cell. As you can see from the subset, we're asking the diamonds data frame to return all diamonds priced below a certain threshold, and it returns a single result, which is 326, but it's returned as a data frame, which is not comfortable to work with, a single-cell data frame.

25:13 Right, because Pandas doesn't know ahead of time that the '.loc' call is going to result in a single item. This happens a lot in databases, too. You do a query and the result is actually a single thing, but the framework has no way to know that the data is structured in a way that's unique, or that it's one thing. And I suspect that's common here with data frames as well. You structured it like, I know this is going to give me the one answer.

25:35 Yes. But it returns the whole table.

25:38 Yeah. Well, now I've got to dig in and get the first value of the first column. Okay. So squeeze helps fix this.

25:44 Just call squeeze on a single-cell data frame or series, and it removes all the dimensionality and just returns the number.

25:53 Interesting. That's cool. What happens if I call it on one that's got more than one item? Does it just give you the first, or does it freak out and let you know? I never tried that.

26:03 Yeah.

26:05 Don't do that. Right.

26:07 Maybe if you just actually want the first answer, maybe it's okay, but it also might give you an exception.

26:11 I don't know. Try it now.
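A small sketch; and per the Pandas docs, calling squeeze on anything bigger than one element just returns the object unchanged, which answers the question above:

    import pandas as pd

    df = pd.DataFrame({"price": [326, 500, 700]})

    subset = df.loc[df["price"] < 400, "price"]  # a one-element Series
    subset.squeeze()        # 326 -- a plain scalar, all dimensions removed

    df["price"].squeeze()   # more than one element: returned as-is, no error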

26:13 Yeah. Exactly. Cool. So the next one has to do with finding things in a range, right?

26:18 Yeah, between. You want to take all the rows that fall in some range. For example, here in the code example, I'm choosing all diamonds that are priced between $3,500 and $3,700.

26:36 Nice. So of course, you could do this probably as an expression. You could definitely do this as a loop, but both of those are slower, I'm sure, because they're not implemented internally, right?

26:47 Yeah. Less elegant. This one is better and faster and shorter. Yeah.

26:52 One of the things, the third parameter you can pass here to between, in addition to the lower bound and upper bound, is whether or not it includes the endpoints. Right?

27:00 This one has inclusive set to 'neither', so it's like an open interval.
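A sketch, assuming the diamonds data set is loaded via Seaborn:

    import seaborn as sns

    diamonds = sns.load_dataset("diamonds")

    # inclusive accepts "both", "neither", "left", or "right" (Pandas 1.3+)
    mid_range = diamonds[diamonds["price"].between(3500, 3700, inclusive="neither")]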

27:06 Nice. Another thing that I've seen here, which is not one of your 25, but it looks nice: I'm used to quickly visualizing a data frame when I get it back with head or tail. I want to know, okay, what did I get back here? Well, show me the front, that'll be good. Do a head, or let's go to the end and see what happened at the end. But here you have sample. That's interesting.

27:25 I use it often, because some data sets have ordering, for example time series data sets, and the first few rows might not be too representative of the whole data frame. So I just call .sample with five or ten rows, and that randomly samples the set, and sometimes that shows the data set better than head or tail.
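Continuing with the diamonds frame, that's simply:

    diamonds.sample(5, random_state=42)  # five random rows; the seed makes it repeatable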

27:49 Right. Exactly.

27:50 So it just kind of randomly picks some stuff throughout the data set to show you what's going on. Right. For large data sets, that's really handy. Nice to know. So the next one has to do with, I'm guessing, when you're doing matrix multiplication and vectors and truly doing math. Most of the time, that's where I would expect this to show up.

28:09 Yeah.

28:13 Transpose.

28:15 You usually don't do math or matrix multiplication in Pandas; you'll do it in NumPy. But this one I use mostly on the result of describe. You see, describe returns the axes inverted, so the five-number summary is given as rows, and that's really a problem when you have multiple columns, because the output starts to expand horizontally, which makes you scroll to the right, which you don't want.

28:45 So when you do describe on a data set, it'll say, here's the count of this column, the mean of this column, the standard deviation, and so on. The number of columns is unbounded, but the list of stats, count, mean, standard deviation, minimum, and a few more, is fixed, and that fits pretty well. So you're saying if you transpose, or flip the rows and columns, so that it goes vertical instead of across, that's an easier way to look at it.

29:12 Yeah, I agree.
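Side by side, the two shapes look like:

    diamonds.describe()    # stats as rows, one column per numeric column (grows sideways)
    diamonds.describe().T  # transposed: one row per column, the fixed stats across the top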

29:13 And it's as easy as saying .T, so it's not too hard to do. You might as well; it's an attribute. Yeah. Cool. All right, that's a really good one. So you're saying if I'm going to do some kind of matrix multiplication stuff, I should not do it in Pandas, I should stick to NumPy.

29:28 Yeah. Numpy is like purely for mathematical purposes, and it's much faster than Pandas.

29:34 I suspect that NumPy has a good transpose as well.

29:36 Right, it has the same attribute.

29:39 Yes. There's a lot of synergy between those two libraries. So the next one has to do with styling things and how they look. Right? One of the things that's cool about Pandas is it mixes well with Jupyter notebooks, and Jupyter notebooks have a nice sort of explore-the-data, let's-see-what's-going-on feel. Let me just look at it. Right? So this Styler thing, the style attribute, helps you with that. Right?

30:00 Yeah. It takes advantage of the fact that Jupyter uses HTML and CSS under the hood, so you can use some HTML and CSS knowledge to style your data frame based on some logic or conditionals. Here, for example, after you take the transpose of the describe, you can highlight the maximums of each row or column using the highlight_max function.

30:28 Pandas offers a lot of functions under the style attribute. You can use the built-in functions, or you can come up with some custom logic to style your data frame using HTML and CSS.

30:40 Okay. Yeah, this is great. So you can say, for example, style.highlight_max, and then you give it some CSS values, like color: dark red, or something like that. Right?

30:50 Then you don't have to look at the raw numbers. It just shows you the most important metrics or the ones that you want. It's really useful when you have multiple columns. You just don't want to look at all those crazy numbers.

31:06 Yeah. A reasonable or maybe straightforward thing you might start out doing is, well, let me just sort it, so the highest one's at the top. But in this example, you've got multiple columns, and the max of one column is in one row, but it's a different row for a different attribute. Right? So sorting is going to do nothing, unless you come up with a whole bunch of variations and try to compare them. A little bit of color, a little bit of picture goes a long way.

31:33 Yeah, they do.

31:34 Yeah, absolutely.
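A sketch of the call being described; the color is an arbitrary CSS value:

    diamonds.describe().T.style.highlight_max(color="darkred")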

31:37 This portion of Talk Python to Me is sponsored by Linode. Cut your cloud bills in half with Linode's Linux virtual machines. Develop, deploy, and scale your modern applications faster and easier. Whether you're developing a personal project or managing larger workloads, you deserve simple, affordable, and accessible cloud computing solutions. Get started on Linode today with $100 in free credit for listeners of Talk Python. You can find all the details over at 'Talkpython.FM/Linode'. Linode has data centers around the world with the same simple and consistent pricing regardless of location. Choose the data center that's nearest to you. You also receive 24/7, 365 human support with no tiers or handoffs, regardless of your plan size. Imagine that: real human support for everyone. You can choose shared or dedicated compute instances, or you can use your $100 in credit on S3-compatible object storage, managed Kubernetes clusters, and more. If it runs on Linux, it runs on Linode. Visit 'Talkpython.fm/linode' and click the Create Free Account button to get started. You can also find the link right in your podcast player's show notes. Thank you to Linode for supporting Talk Python.

32:51 Yeah. The second example you have here in your article is a little more nuanced. This looks great. Tell us about that.

32:56 This one is background_gradient, so it colors each cell of the column based on its magnitude. It's kind of like a continuous palette. It shows where the maximums or the minimums are and how they compare to each other. Yeah.

33:15 It's almost like a heat map in an Excel table, by making the cells different colors. You can pass in a color map and all sorts of stuff to control how that looks.

33:24 Yeah.
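Roughly; cmap can be any Matplotlib colormap name, and "coolwarm" here is an arbitrary choice:

    diamonds.describe().T.style.background_gradient(cmap="coolwarm")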

33:24 Cool. I like it. This is great. It's one of these things where, again, one line of code and you can dramatically improve the presentation value or the informational value of what you're looking at. Right. Nice. All right. I feel like that's similar to your number nine.

33:39 Yeah. This one is Pandas options. It's kind of like the settings on your phone: you set them globally, and they apply to all the data frames, series, and functions you're going to be using inside the project or inside that session of the Jupyter notebook.

33:53 So if you want a certain number of columns shown, or some kind of color or something like that, you can just set that up at the beginning.

34:01 You just don't have to call them every single time or change them every single time. It's just a shorthand way of doing things like setting global settings.

34:11 Yeah. You could probably even do something like have a little JSON file that describes the look and feel of what you're doing. On your first line, just load it up and set it, and then go from there, something to that effect, right?

34:22 Yeah.

34:23 So you don't have to completely fill the first three lines of your notebook with setup code.

34:28 Yeah. For example, one of those options is display.max_rows. If you set it to five and you just call the data frame, it's going to only show the first five rows, so you don't have to call .head every time.
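For example, the two options mentioned here:

    import pandas as pd

    pd.set_option("display.max_rows", 5)   # show at most five rows per frame
    pd.set_option("display.precision", 4)  # four decimal places in the output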

34:41 That's interesting. Yes. Because, of course, if there's enough rows, it won't print the whole thing out. Right. Probably you don't want to print 10 million rows and completely lock up the system.

34:50 Yeah, it is good.

34:52 Another one that's kind of nice is display.precision. If you set that, you won't see the 1.27e5 or whatever. Right?

35:02 It's really annoying when you're working with math functions, and it keeps giving scientific notation when you just want to see the first four or five decimal places.

35:15 Yeah. Scientific notation is great when you're dealing with huge numbers or tremendously small numbers, like how many meters across is an atom. Okay, you're going to need an e-something. But for human beings, often you want to just look at the number and go, yeah, that's a million, not 1.2e6 or whatever.

35:33 It's going to be really annoying.

35:34 That's cool. And this is just one of those options you can set up, and it just globally applies to that notebook. So another thing that's interesting about Pandas is the columns have types usually, but not always.

35:47 It's one of those beginner-level things that you'll encounter, but it can get really annoying if the data types are incorrect for your columns. The most important one is the object data type.

35:58 Right. That's like, I don't really know. So we're just going to store it.

36:03 I'm just going to put it inside of an object. And the object data type is the worst one. It limits the functionality of Pandas, and it's also the most memory-consuming.

36:14 Right. So the next function, what number are we on here, number ten on the hit list, is convert_dtypes.

36:25 When you call it on the whole data frame, it tries to infer the correct data type for each column.

Whether it's a float or integer or string, like that.

36:36 So, for example, you're reading a CSV file, and some of the columns are detected correctly, like floats, but others get this object type. But after calling convert_dtypes, it's like, you know what? No, those are strings.

But it can't handle datetimes, because there are so many formats, and it can't possibly know all of them.
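A small sketch with a made-up frame:

    import pandas as pd

    df = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]}, dtype="object")
    df.dtypes                    # both columns: object
    df.convert_dtypes().dtypes   # name -> string, score -> Int64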

36:54 Why are datetimes so hard? They really shouldn't be, but they really are. It's crazy. And then you throw in time zones, and forget it. And throw in daylight saving and all these other things.

37:07 Daylight saving is crazy.

37:08 Yes. I expect some of the Kaggle stuff. Part of the challenge is like normalize these dates because who knows or something along those lines.

37:15 Time zones are just, like, a total mess.

37:18 Yeah, for sure. So related to converting the data types is selecting them. Yeah. Which is a way to filter what's in there. You can filter by column or rows or even a condition, but this is saying, I only want the strings, or I only want the numbers. Right?

37:36 While doing machine learning, you have to apply certain preprocessing functions to only subsets of the data, like only the categoricals or only the numerics. So this function becomes very handy. You just pass the data type, using NumPy, and it gives you the subset of the data frame with that data type.

37:58 Nice. So you would say, like, dataframe.select_dtypes and then include equals np.number, and now instantly the resulting data frame is a subset that only has numbers, right?

38:09 Yes.

38:09 That's cool. And then also you point out that you can do the reverse, to give you just the informational bits, like categories and strings, by saying exclude.

38:20 Yeah.
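Continuing with the diamonds frame, roughly:

    import numpy as np

    numeric_only = diamonds.select_dtypes(include=np.number)
    everything_else = diamonds.select_dtypes(exclude=np.number)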

38:21 Very nice. Okay. Well, we just missed it with Halloween here.

38:25 Yeah.

38:26 But mask. You've got a cool picture, like a mask, here. Mask is number twelve.

38:34 It is conditional

38:36 You can use it on series or data frames, and it replaces the values wherever some condition is true.

38:44 Yeah. Okay.

38:46 In this example here.

38:47 You've got a bunch of ages, and I want to take all those rows that are above 60 or below 50 and convert those values to NaN.

38:59 Okay. So this is like an in-place update, or I guess it creates another one, as if you updated it. It finds all the stuff that's outside of your range and then applies this other value. Right? Like, if it's outside of this range, in this case, you're going to set it to not-a-number, but it could be set to zero or max or anything. Yeah.

39:20 Cool.
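A sketch of that ages example, with made-up numbers:

    import numpy as np
    import pandas as pd

    ages = pd.Series([49, 55, 61, 58, 72, 50])

    # Wherever the condition is True, replace the value with `other`.
    ages.mask(~ages.between(50, 60), other=np.nan)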

39:21 A very good one. Similar, I guess, is min and max, and some of these, as we get a little further down your recommendations, I like because they're not just, here, you can apply this function, but, apply it in this scenario or this context to get an interesting outcome. Right? So that's what number 13 is: min and max along the columns axis. Usually when you call min and max on a column...

39:44 It just returns the minimum or maximum of that column. But sometimes you want it row-wise: it treats rows as columns and gives min and max across the rows. That's a handy way of doing something that would take a lot of code if you did it manually.

40:02 Another one of these tricks, or techniques, that lets you avoid looping. Right here...

40:06 I show a good example of comparing four different libraries on five data sets. You want the best performance on each data set, so you have to find the best score across the rows.

40:18 Exactly. So the columns are the different libraries, like XGBoost, CatBoost, scikit-learn, and so on, being applied to the same data sets. And you want to know, for row one, which one did the best; row two, which one did the best.

40:32 Yeah.

40:33 Cool.
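A sketch of that comparison table, with made-up scores:

    import pandas as pd

    scores = pd.DataFrame(
        {"XGBoost": [0.92, 0.88], "CatBoost": [0.94, 0.86], "sklearn": [0.90, 0.89]},
        index=["dataset_1", "dataset_2"],
    )

    scores.max(axis=1)  # best score across each row, i.e. per data set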

40:33 Number 14 nlargest and nsmallest.

40:37 Yeah. We were talking about the maximum and minimum. When you pass a number and a column name, it returns the data frame containing the n smallest or largest rows by that column.

40:53 Nice. So if I were to call min or max, that would give me the smallest or the largest one, respectively. Right?

40:59 Yeah.

40:59 But a really interesting or common question you might have is, what are the top ten selling products this month? Right? And this lets you just say nlargest(10), and then you pick the column on which to judge it. Here you have price. Right?

41:13 The five most expensive diamonds in the diamonds data set, yeah.

41:16 Again, one of these things. No more looping or any of that stuff. No more if statements just call it right.

41:22 This one is, like, the five cheapest diamonds.
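In code:

    diamonds.nlargest(5, "price")    # the five most expensive diamonds
    diamonds.nsmallest(5, "price")   # the five cheapest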

41:25 Yeah, nsmallest and nlargest. Fantastic. Also, sometimes when you're asking for a minimum or maximum thing, you don't actually want the minimum or maximum; you want to know where it is, because you're going to get that thing back and say, I need that whole row, because I want to learn more information about it. Right? But if you said, well, what's the minimum price? It's seven. Oh, okay, great. Now do I need to loop through until I find a thing with a seven or something like this? So you've got a recommendation for that.

41:52 Yeah. There's idxmax and idxmin, which return the index values of the maximum or minimum, so that you can look up the row or column where they are stored.

42:03 Fantastic.

42:04 Yeah.

42:04 So here's the row that contains the minimum price. I love it. Really nice. So many of these are really easy to apply, right? It's not a lot of research to learn how to apply idxmax, but at the same time, knowing that it exists, all of a sudden you can use it really easily. You probably wouldn't have known to look for it, right?
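Roughly:

    cheapest = diamonds["price"].idxmin()  # the index label of the minimum price
    diamonds.loc[cheapest]                 # the whole row stored at that label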

42:25 Yeah.

42:25 Cool. People often talk about differences between beginner developers and expert developers. And I think a lot of times beginners look at folks like you, who have a lot of experience, like, oh, this guy is so incredibly smart, and he just has this way of solving these problems, it's so amazing. And to some degree, that's probably true. But a lot of it is just building up layers and layers of these: oh, I know I can use idxmax, I know I can use nlargest, and you just sort of pile them together, and then, bam, the solution becomes easier, because you have these little building blocks. Right? So I think this is really valuable for people getting started.

43:03 I usually think that the biggest difference between beginner level and a more experienced programmer is just how much time they spend on the documentation.

43:13 Yeah.

43:15 If you patiently read the docs, you're just going to be a really good user of that particular tool. I agree.

43:21 I agree. There's just more you understand, and better; you know more of what it has to offer, so there's less you've got to reinvent. All right. I talked about how you have something that may be well known, but then you apply it in a scenario. Number 16 is value_counts with dropna=False.

43:39 What's this one about? When you have a series with categoricals, you want to see their proportions or their counts in the total series, and that usually doesn't include the null values. So you'd have to call isnull and chain it with sum to learn the number of NaNs in that column. But you can do it efficiently with value_counts by setting dropna to False, which includes the proportions of the null values as well.

44:09 Yeah. So it gives you basically a percentage, as a ratio. Here it's the ratio of the number of the different categories that appear. Right? Very cool. And now not-a-numbers are included. That's great.

44:22 Yeah.
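A sketch, with a hypothetical categorical column:

    # normalize=True gives proportions; dropna=False keeps NaN as its own bucket
    df["education"].value_counts(dropna=False, normalize=True)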

44:22 Number 17: clip. This is a good one for data that exceeds, I don't know, maybe a range. Maybe some instrument is supposed to collect zero to 100, and it goes crazy and goes outside zero or 100. Yeah.

44:35 For example, we go back to the ages example, where I just want to have ages between, like, 18 and 60, and I want to exclude all the values outside that. When you call clip with those custom values, it's going to impose those hard limits on the whole series.

44:52 Right. So it'll replace the ones that are over with the maximum you set and the ones that are too low. It will bring them up to the minimum. Right.

44:59 Yeah.

44:59 Very cool.

45:00 Again, I guess, on the whole data set, no looping one column at a time. Yeah.
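Continuing the ages example from mask:

    ages.clip(lower=18, upper=60)  # out-of-range values are snapped to the bounds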

45:05 We talked about how difficult time is, but you do have some recommendations for searching for data that appears at a certain time or in a time range. Right. What's number 18.

45:15 This one is about subsetting rows of the data frame at some particular time of the day, like any time of the day: for example, 3:00, 9:30, or any time that you want. You just take all those rows and return them using at_time.

45:33 Yeah, that's super easy. Right? Just pass in at_time, and you literally specify times, like '15:00'.

45:44 Like you'd write in a conversation or a message.

45:46 And then the other one, which is also interesting, is between_time. Right? Like, what happened in the morning?

45:51 For example, like what are those sales that happened in the morning or after midnight or during some particular interval? This one is really handy to do that. Yeah.

46:03 So super easy. Just data frame between_time, or is that a series?

46:07 It doesn't matter. It usually just has to have a datetime index. That's it. Yes.
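A sketch with an hourly datetime index; dates and values are made up:

    import pandas as pd

    idx = pd.date_range("2021-11-04", periods=48, freq="H")
    ts = pd.DataFrame({"sales": range(48)}, index=idx)

    ts.at_time("15:00")                 # rows stamped exactly at 15:00
    ts.between_time("09:45", "12:00")   # rows in the late-morning window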

46:13 Okay. So then you're passing strings like 9:45 to 12:00, and that's like a late morning or something. Beautiful. The next one here has to do with time series: number 19, bdate_range. Tell us about this.

46:28 This one stands for business date range, so it has calendars built in internally.

46:39 How can I say... when you want to index a time series data frame, you may want to include only working days, like you want to exclude all the weekends. You can't possibly know which dates are weekends for every single week of the year. So when you call bdate_range, it indexes the data frame using only weekdays. And I think it can also exclude holidays.

47:09 Oh, my gosh. I was just wondering about holidays. Like there's another wrinkle in there already. Things like leap year and stuff like that is built into this, I would imagine. So this is super cool.

47:19 Yeah. This is very important when you're doing time series forecasting or financial analysis, like working with stocks, because stocks are only traded on weekdays and not on holidays. So it's very important.

47:32 Or even if you're doing, like, traffic analysis: you want to understand accidents that are a result of rush hour. Right? You wouldn't want to look on a weekend.

47:40 Yeah.
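Roughly; skipping specific holidays needs the custom business-day frequency, per the documented parameters:

    import pandas as pd

    weekdays = pd.bdate_range("2021-01-01", "2021-12-31")  # Monday-Friday only

    # With freq="C" you can also pass explicit holidays to exclude:
    no_eve = pd.bdate_range(
        "2021-12-20", "2021-12-31", freq="C", holidays=["2021-12-24"]
    )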

47:41 The next one has to do with correlation. Autocorr. C-O-R-R. Yeah.

47:47 Auto correlation.

47:47 Yeah. I don't do much with time series. You're going to have to tell us about this one. What's going on here.

47:52 The autocorrelation of a series, or time series, tells you the predictability of the time series from itself.

48:01 Do you know about correlation coefficients?

48:03 Yeah, exactly. It tells you how much the model matches the actual data. Like it's 97%. Likely that the model will predict the stuff coming up. Right. It could be linear or more complicated, but that's something like that.

48:14 Yeah. The gist of this is that if the time series has a high correlation with itself, it means you can predict it more easily.

48:23 Got it. Yeah.

48:24 Basically. How predictable or unpredictable is this thing? Yeah.

48:27 There's a lot of detail about autocorrelation, and it has very many applications in time series. But the gist is that it shows you how much predictability the series has, like, at each interval.
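In code it's a single call on any numeric series; lag is how far the series is shifted against itself:

    s = ts["sales"]     # reusing the hourly frame from the at_time sketch
    s.autocorr(lag=1)   # near 1.0 means the next step is highly predictable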

48:40 Cool.

48:40 It sounds very useful if you're doing that kind of stuff. All right. Number 21 hasnans

48:45 It's an attribute. You just call it on a series, and it returns True if you have at least one missing value in the series, otherwise False.

48:56 Yeah. There was this quote, I can't remember who it's attributed to, that says something to the effect of: data cleanup and data wrangling is not the dirty work, it is the work of data science. You get everything ready, and then you just hit it with the magic at the end. Right? And this feels like it lands right in that realm: given some data frame or series, does it have not-a-numbers, or is it all good?

49:19 Yeah. Missing values are a huge problem in machine learning. Most scikit-learn algorithms don't accept missing values, so you either have to drop them or impute them using some techniques. And this one is very handy to detect those missing values.
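For example:

    import pandas as pd

    s = pd.Series([1.0, None, 3.0])
    s.hasnans  # True -- note it's an attribute, not a method call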

49:35 Right. I suspect this is the first test: if it has NaNs, then go do stuff, but if it says False, then you're good to roll through.

49:47 Are you familiar with missingno?

49:52 Yeah.

49:53 Another thing that came to mind is this missingno package, as in 'no numbers', which is a way to not just answer yes or no, but to get visualizations. Have you used this?

50:05 Yeah. I also wrote an article on it, I think.

50:07 Okay. Yeah. So definitely. That's awesome. Things like this sound really useful to me.

50:12 I really like the missing-values matrix. It shows how missing values are correlated across all the columns.

50:21 Right. Is it a whole bunch of missing data in one row?

50:23 Yeah.

50:24 And then it's all good, or is it interspersed like this one is missing the birthday, but that one is missing the name or something like that, right.

50:30 It's a really good package.
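A sketch; missingno is a separate package (pip install missingno), and df stands for any frame with gaps:

    import missingno as msno

    msno.matrix(df)  # one column per frame column, white gaps where values are missing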

50:31 Yeah. Fantastic. All right. Number 22: at and iat.

50:35 These are like faster versions of loc and iloc.

50:39 They enable you to index your data frame, but they're specifically designed for retrieving single values.

50:48 Nice. It's almost like an array index.

50:51 Yeah.

50:51 A little bit. What's the difference between at and iat?

50:55 Using at, you can use column labels, like, as you can see here, we are using 'cut' and an index. But with iat, you have to know the position of that column.
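Continuing with the diamonds frame, where 'cut' happens to be the second column:

    diamonds.at[0, "cut"]  # label-based: row label 0, column "cut"
    diamonds.iat[0, 1]     # position-based: first row, second column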

51:03 I see. So with at, it would be row label and then column name, and iat is row and column number. It's probably less flexible; you've got to know which position 'cut' is at, because it could be moved around as people are creating or inserting data. Okay. argsort, as in argument sort.

51:19 This one just returns the indices that would sort a data frame.

51:23 Okay.

51:23 Based on some column. In doing data analysis, you sometimes want the indices, not the actual sorted data, so that you can use those indices multiple times over.

51:34 Got it. So you get the sort: say I want to sort by the total bill, but then give me the indexes as if it were sorted, without actually changing it. Then you could go and request data at those indexes. Got it?

51:46 Yeah.
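A sketch using Seaborn's tips data set, since total_bill came up:

    import seaborn as sns

    tips = sns.load_dataset("tips")

    order = tips["total_bill"].argsort()  # integer positions that would sort the column
    tips.iloc[order]                      # the rows, in ascending total_bill order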

51:47 Nice. All right. We're closing in on the end, and we've brought in the cat: the cat accessor.

51:52 Yeah. I should have put in an image here.

51:54 Yes. There would have been some kind of cool cat you can put in there.

51:57 Yeah.

51:58 Pandas enables you to perform some dtype-specific functions. There's the dt accessor for datetimes, and also str for strings. This one is strictly for categorical purposes. It has a large suite of categorical functions that make it easier to work on categories, ordinal or nominal data.
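For instance:

    cuts = diamonds["cut"].astype("category")  # make sure the dtype is categorical
    cuts.cat.categories                        # the distinct category labels
    cuts.cat.codes                             # the integer code behind each value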

52:20 Yes. Fantastic. And let's bring it to the 25th with GroupBy.nth.

52:27 Yes. This one is less used, for very rare edge cases. When you group by some column, possibly a categorical column, you may want to look at particular rows of those groups. Calling nth on a grouped data frame returns the nth row of each group.

52:47 Got it. Okay. Yeah. That looks really cool.

52:49 Yeah.
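In code:

    diamonds.groupby("cut").nth(2)  # the third row (n=2, zero-based) of each cut group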

52:50 All right. Well, that's it for our list. Hopefully people out there listening have definitely learned something. Now, your title, just to put a little disclaimer in here for everyone: 25 Pandas Functions You Didn't Know Existed, P(guarantee) = 0.8. So you had this 80%...

53:06 Yeah.

53:07 I love it. It's a little bit of a stats joke in the title.

53:10 No one complained about that. So I think that was right.

53:13 Yes, it sounds about right. It seems like there are a lot of neat use cases here that people can find. These are the 25 that you found interesting. Other people might find them interesting as well, but there's only so much you can cover.

53:24 Yeah. These are the types of things, though, where people can say, all right, today I'm going to try to work number one into my data analysis; I'm doing some Excel stuff anyway, so let's do the ExcelWriter one. And then maybe later, it's like, oh, I'm working with survey-type data, so let me work with explode. If you work these in one at a time, eventually they become part of your tool chest, and that's good, right?

53:48 Yeah. Just expanding our tools and skills.

53:51 I think part of the trick is to make sure that you apply it a little bit, right. You know, they're out there, but just as you use them, like, bring them in.

53:58 It just saves time and resources.

54:00 Awesome. Yeah. Half the battle is just knowing that it exists, right? It's not that it's necessarily hard to use. It's. Like, I just didn't know this was even an option.

54:07 All of these are very easy to use. You just have to know that they exist.

54:11 Yeah. I feel like so much of Pandas is that way, but it's hard to know, because there's so much there. Cool. Brandon in the live stream says it's very helpful; thank you for the article, Bex. You're welcome. Yeah, I agree. Thanks for doing this one. I do want to point out, we certainly don't have time to cover it, but let me pull it up here so I can make sure it goes in the links as well. You did the same thing for NumPy, right? And, I've got to say, you were a little more confident here: your P(guarantee) equals 0.85 instead of 0.8.

54:44 NumPy functions are a little bit harder to understand; that's why most people don't bother to learn them. So I was a bit more confident, because I also didn't know most of these functions.

54:54 Yeah. Fantastic. All right. So if people like this flow and they want to go a little deeper, into the NumPy layer, they can check that out. They can also check out a bunch of your other writing.

55:07 Great.

55:08 Anything else you want to add to this article before we call it good on that topic?

55:12 No, I think we covered everything.

55:13 Yes, we covered it. Well, I think it was fun.

55:16 All right. Now, before you get out of here, there are the two questions you've got to answer. If you're going to write some Python code, what editor do you use, or what are you going to use for data analysis?

55:27 I usually use Jupyter Lab, but if I have to do pure Python, that's always PyCharm. I love it.

55:34 Awesome. That's a good combo.

55:35 Yes.

55:36 And then a notable PyPI package. It doesn't have to be something super popular, but something you ran across where you're like, people should know about this. This is something I learned about.

55:46 I recently came across UMAP. It's for dimensionality reduction, umap-learn in Python. It's usually used for very large data sets, to project them to 2D so that you can visualize them.

55:58 Awesome.

55:59 This one is a really useful package. Nice.
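A minimal sketch; umap-learn is a separate package (pip install umap-learn), and X here stands for any (n_samples, n_features) array:

    import numpy as np
    import umap

    X = np.random.rand(1000, 100)  # hypothetical 100-dimensional data
    embedding = umap.UMAP(n_components=2).fit_transform(X)
    # embedding.shape -> (1000, 2), ready for a scatter plot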

56:01 So definitely people are trying to project down to 2D. That's one of the problems right. How do you look at some of the stuff that's.

56:08 Like, 100 dimensional or 200 dimensions?

56:12 I don't have any idea at all how to do 100 dimensions. I remember when we were doing some work with complex analysis, it was two-dimensional, but each dimension was complex numbers, so four-dimensional. That was a challenge. I have no idea how to approach 100.

56:27 No one does. That's why this kind of dimensionality reduction techniques exist.

56:31 Yeah, fantastic. And of course, it's important in machine learning and stuff, right? There are dimensions that you can just throw away, because they don't actually contribute to the predictions, right?

56:40 Yeah.

56:42 Super.

56:42 All right.

56:43 Bex, thank you for being here. Final call to action: people want to get deeper into Pandas, maybe learn more from some of your articles. What do you tell them?

56:51 As I said, first, check the documentation. The documentation should usually be your first choice. It's the best place to learn about the library.

56:58 It takes a little dedication, but go through it and find out what it has to offer and go from there. Right.

57:03 It's a bit hard to read, but the documentation always gives the best information about the library, because it's written by the package creators, so they know the library the best, for sure.

57:16 All right. Well, thank you for being here and thanks for writing the article and sharing it with us.

57:20 Thank you.

57:20 Yes, you bet. Bye.

57:21 Thank you. Bye.

57:23 This has been another episode of Talk Python to me. Thank you to our sponsors. Be sure to check out what they're offering.

57:29 It really helps support the show.

57:31 Choose Shortcut, formerly Clubhouse.io, for tracking all of your project's work, because you shouldn't have to project manage your project management. Visit talkpython.FM/Shortcut. Simplify your infrastructure and cut your cloud bills in half with Linode's Linux virtual machines. Develop, deploy, and scale your modern applications faster and easier. Visit 'Talkpython.fm/linode' and click the Create Free Account button to get started.

57:56 Do you need a great automatic speech to Text API?

57:59 Get human level accuracy in just a few lines of code.

58:01 Visit Talkpython.FM/AssemblyAI. When you're ready to level up your Python, we have one of the largest catalogs of Python video courses over at Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async. And best of all,

58:16 there's not a subscription in sight.

58:18 Check it out for yourself at 'Training.Talkpython.FM'. Be sure to subscribe to the show: open your favorite podcast app and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on 'Talkpython.FM'. We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at 'Talkpython.FM/YouTube'. This is your host, Michael Kennedy. Thanks so much for listening.

58:51 I really appreciate it.

58:52 Now get out there and write some Python code.
