25 Pandas Functions You Didn’t Know Existed

Episode #341, published Wed, Nov 17, 2021, recorded Thu, Nov 4, 2021

Do you do anything with Jupyter notebooks? If you do, there is a very good chance you're working with the pandas library. This is one of THE primary tools of anyone doing computational work or data exploration with Python. Yet, this library is massive and knowing the idiomatic way to use it can be hard to discover.

That's why I've invited Bex Tuychiev to be our guest. He wrote an excellent article highlighting 25 idiomatic Pandas functions and properties we should all keep in our data toolkit. I'm sure there is something here for all of us to take away and use pandas that much better.

The 25 functions

ExcelWriter is a generic class for creating Excel files (with sheets!) and writing DataFrames to them.
pipe is one of the best functions for doing data cleaning in a concise, compact manner in Pandas.
factorize: This function is a pandas alternative to Sklearn’s LabelEncoder
A function with an interesting name is explode.
Another function with a funky name is squeeze and is used in very rare but annoying edge cases.
between: A rather nifty function for boolean indexing numeric features within a range.
All DataFrames have a simple T attribute, which stands for transpose.
Did you know that Pandas allows you to style DataFrames?
Pandas options
convert_dtypes: We all know that pandas has an annoying tendency to mark some columns as object data type. Instead of manually specifying their types, you can use convert_dtypes method which tries to infer the best data type.
A function I use all the time is select_dtypes.
mask allows you to quickly replace cell values where a custom condition is true.
min and max along the columns axis
nlargest and nsmallest.
idxmax and idxmin: When you want the index label of the max/min rather than the value itself, use idxmax/idxmin.
value_counts with dropna=False: A common way to find the percentage of missing values is to chain isnull and sum and divide by the length of the array. You can do the same thing with value_counts by passing dropna=False (plus normalize=True to get proportions).
clip function makes it really easy to find outliers outside a range and replace them with the hard limits.
at_time allows you to subset values at a specific date or time.
bdate_range is a short-hand function to create TimeSeries indices with business-day frequency
autocorr
Pandas offers a quick method to check if a given series contains any nulls with hasnans attribute
at and iat: These two accessors are much faster alternatives to loc and iloc with a disadvantage. They only allow selecting or replacing a single value at a time
argsort: You should use this function when you want to extract the indices that would sort an array
When a column is a category, you can use several special functions using the cat accessor.
GroupBy.nth: This function only works with GroupBy objects. Specifically, after grouping, nth returns the nth row from each group
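
Several of the one-liners in this list are easiest to grasp side by side. The following is a minimal sketch using a small made-up DataFrame (the column names and values are invented for illustration, not taken from the article):

```python
import pandas as pd

# Hypothetical data, invented for this sketch.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Cairo", "Pune"],
        "temp_c": [-3.0, 19.0, 41.0, 28.0],
    }
)

# between: boolean mask for values inside a range (both ends inclusive by default).
mild = df[df["temp_c"].between(0, 30)]

# clip: cap values at hard lower and upper limits.
clipped = df["temp_c"].clip(lower=0, upper=35)

# mask: replace values where the condition is True.
masked = df["temp_c"].mask(df["temp_c"] < 0, 0.0)

# nlargest: the top-n rows by a column.
hottest = df.nlargest(2, "temp_c")

# idxmax: index label of the maximum value (here the default integer index).
hottest_label = df["temp_c"].idxmax()
```

Each call returns a new object and leaves `df` untouched, which is what makes these methods easy to chain.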

Episode Deep Dive

Guests Introduction and Background

Bex Tuychiev is a seasoned Python developer and data science enthusiast. He’s recognized as a Kaggle master, where he regularly participates in competitions and shares tutorials. He also writes top-rated articles on Medium, specializing in artificial intelligence and data science topics. During this episode, Bex explains how he came to focus on Python for data science, how he uses writing to solidify his learning, and why pandas is such a critical library for data exploration and analysis.

What to Know If You're New to Python

Here are a few basics to help you follow along with the discussion in this episode and get more out of pandas:

Variables and Data Structures: Python lets you store data in lists, dictionaries, and more. Pandas builds upon these to store tabular data.
Avoiding Loops in Pandas: A major theme is that you often replace Python for-loops with vectorized operations and built-in functions in pandas.
Indexing and Slicing: Pandas data frames can be sliced similarly to Python lists, but with more powerful indexing options like .loc and .iloc.
pip / venv: Installing pandas or other libraries is done via tools like pip install pandas. Virtual environments (venv or others) help you manage project dependencies without conflicts.

Key Points and Takeaways

25 Lesser-Known Pandas Functions: Bex shares a collection of “hidden gem” pandas functions and features that can drastically improve productivity. Many of these allow you to work more efficiently without falling back to manual loops or if-statements. Knowing they exist is often more important than learning the details by heart. You can immediately apply them in data cleaning, manipulation, or advanced exploration tasks.
- Links and Tools:
  - pandas
Fluent Data Processing with pipe: One standout function is DataFrame.pipe(), which helps you chain operations together in a very readable way. Instead of writing nested function calls or creating intermediate data frames, you can design a pipeline that processes data step by step. This fluent style mirrors scikit-learn pipelines and can make your notebooks more maintainable.
- Links and Tools:
  - pipe() docs
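
As a rough illustration of the chaining style described above, here is a minimal sketch. The helper functions `drop_missing` and `add_total` and the sample data are hypothetical, made up for this example:

```python
import pandas as pd


def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step: remove rows with any missing value."""
    return df.dropna()


def add_total(df: pd.DataFrame, price_col: str, qty_col: str) -> pd.DataFrame:
    """Hypothetical cleaning step: add a derived 'total' column."""
    return df.assign(total=df[price_col] * df[qty_col])


# Invented sample data with one missing price.
raw = pd.DataFrame({"price": [2.0, None, 5.0], "qty": [3, 1, 2]})

# Each pipe call passes the DataFrame as the first argument of the function
# and forwards any extra keyword arguments.
clean = (
    raw
    .pipe(drop_missing)
    .pipe(add_total, price_col="price", qty_col="qty")
)
```

Because every step takes a DataFrame and returns a DataFrame, steps can be reordered or reused across notebooks without intermediate variables.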
Converting Categorical Data with factorize: factorize quickly transforms text-based categories (like "sun" or "rain") into numeric labels for machine learning. This is similar to label encoding but built right into pandas. It’s especially handy if you want to skip external libraries for simple encoding tasks.
- Links and Tools:
  - factorize() docs
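
A minimal sketch of the encoding described above, using an invented weather Series:

```python
import pandas as pd

# Invented sample data.
weather = pd.Series(["sun", "rain", "sun", "snow", "rain"])

# factorize returns integer codes plus the unique values they refer to.
# By default, codes follow order of first appearance rather than sorted order.
codes, uniques = pd.factorize(weather)
```

Here `codes` holds one integer per row and `uniques` maps each integer back to its label, so `uniques[codes]` reconstructs the original values.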
Handling Nested or Multi-Value Cells via explode: Survey results often have rows where one cell contains multiple values (lists). explode automatically splits those lists into separate rows, preserving other columns. It’s a perfect example of removing loops and manual code for an operation that can be done in one line.
- Links and Tools:
  - explode() docs
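
To make the survey scenario concrete, here is a small sketch with made-up responses (the column names are assumptions for illustration):

```python
import pandas as pd

# Invented survey data: one cell can hold a list of answers.
survey = pd.DataFrame(
    {
        "respondent": ["a", "b"],
        "languages": [["Python", "R"], ["SQL"]],
    }
)

# One row per list element; the other columns are repeated.
exploded = survey.explode("languages")

# The original index labels are repeated too, so reset them if needed.
exploded_clean = exploded.reset_index(drop=True)
```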
Highlighting Key Insights with Pandas Styling: The DataFrame.style attribute lets you add color scales, highlights, and conditional formatting directly in Jupyter notebooks. This is useful for quickly spotting trends or outliers. Background gradients, highlighting min/max values, and custom CSS can all provide immediate, visual feedback about your data.
- Links and Tools:
  - Styling docs
Checking for Missing Values: Pandas offers many ways to handle NaN and missing data. A particularly quick check is .hasnans on a Series, letting you decide if you need to drop or impute missing values. Missing data is a central challenge for data scientists, and acknowledging it early can save a lot of time.
- Links and Tools:
  - MissingNo (external library) – For visualizing missing data.
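
The quick check above, together with the value_counts trick from the list of 25, might look like this on an invented Series:

```python
import numpy as np
import pandas as pd

# Invented sample data with two missing entries out of five.
ages = pd.Series([34, np.nan, 29, np.nan, 41])

# Quick yes/no answer: does the Series contain any missing values?
has_missing = ages.hasnans

# Share of each value, with NaN counted as its own category.
shares = ages.value_counts(dropna=False, normalize=True)

# The more familiar route to the missing-value share, for comparison.
missing_share = ages.isna().mean()
```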
Time Series Filtering with at_time and between_time: If you have a DatetimeIndex, these methods let you filter rows occurring at a specific time of day or within a time-of-day range (e.g., "business hours") in just one line. They are handy for tasks like slicing out morning data or restricting stock trading data to market hours. Both filter on clock time only, so excluding weekends takes a separate step.
- Links and Tools:
  - at_time() docs
  - between_time() docs
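
A small sketch of both methods on an invented series that spans two days, to show that the filter applies to the time of day regardless of the date:

```python
import pandas as pd

# Invented readings every six hours across two days.
index = pd.date_range("2021-11-04", periods=8, freq="360min")
readings = pd.Series(range(8), index=index)

# Rows recorded at exactly 12:00, on any date.
at_noon = readings.at_time("12:00")

# Rows recorded between 06:00 and 12:00 (both ends included by default).
mornings = readings.between_time("06:00", "12:00")
```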
Business Date Ranges (bdate_range): For time series work, bdate_range builds a date index at business-day frequency. It skips weekends by default and can also skip holidays if you pass a custom business-day frequency along with a list of holidays. This is useful when analyzing stock data or any schedule-bound events.
- Links and Tools:
  - bdate_range() docs
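
As a sketch, the week following the recording date (Thursday, Nov 4, 2021) works as an example. The holiday below is hypothetical, chosen only to show the `holidays` argument, which pandas applies only with a custom frequency such as `"C"`:

```python
import pandas as pd

# Default business-day frequency: Saturday and Sunday are skipped.
days = pd.bdate_range(start="2021-11-04", end="2021-11-10")

# Custom business-day frequency with a made-up holiday on Monday, Nov 8.
days_no_holiday = pd.bdate_range(
    start="2021-11-04",
    end="2021-11-10",
    freq="C",
    holidays=["2021-11-08"],
)
```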
The Speed of Pandas vs. Loops: A recurring lesson is that using built-in pandas methods is much faster than writing loops in plain Python. Pandas uses efficient, C-backed code under the hood, so shifting your mindset to vectorized operations can drastically speed up your workflow.
- Links and Tools:
  - NumPy – Underlies much of pandas’ performance benefits.
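
To show what the shift in mindset looks like, here is a minimal sketch computing the same column two ways on invented data. It makes no timing claims; on a frame this small the difference is negligible, and the gap only becomes noticeable as the row count grows:

```python
import pandas as pd

# Invented sample data.
df = pd.DataFrame({"price": [2.0, 3.5, 4.0], "qty": [3, 2, 5]})

# Loop version: visits one row at a time in Python.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["qty"])

# Vectorized version: one expression over whole columns.
totals_vec = df["price"] * df["qty"]
```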
Bex’s Tips for Mastering Libraries: Bex repeatedly emphasizes reading official documentation to discover lesser-known gems. He also points out that contributing to or engaging with communities (like Kaggle or Medium articles) accelerates learning. You need to do more than just skim; deep exploration of the docs, plus real-world application, cements these skills.
- Links and Tools:
  - Kaggle
  - Medium

Interesting Quotes and Stories

“You just have to be one step ahead of your audience, and that’s it.” – Bex on writing articles or sharing knowledge even if you’re not an absolute expert.
“I used to get annoyed seeing these complex functions, so I wrote the article to learn them, and share with the audience.” – Bex explaining his motivation for discovering 25 lesser-known pandas features.

Key Definitions and Terms

Vectorized Operations: Performing array-wide or column-wide operations in one step rather than looping through each element.
Categorical Encoding: Turning text labels (categories) into numeric values for machine learning.
Time-Series Analysis: Working with time-indexed data, often focusing on specialized indexing and filtering.
Missing Values / NaN: Indicators in data that information is not available or not applicable, requiring cleaning or imputation.

Learning Resources

If you want to deepen your Python and data science skills, here are some courses from Talk Python Training.

Python for Absolute Beginners: Perfect for those just starting their coding journey in Python.
Move from Excel to Python with Pandas: Ideal if you’re transitioning from spreadsheets to pandas for data manipulation.
Data Science Jumpstart with 10 Projects: Get hands-on with real-world examples and dive deeper into data science techniques.

Overall Takeaway

This episode serves as a reminder that pandas contains many tools beyond the basics everyone knows. Embracing these lesser-known functions will improve efficiency, clarity, and performance in your data science workflows. By continuously exploring the documentation, writing about what you learn, and engaging with open communities (like Kaggle), you’ll keep discovering new ways to take advantage of Python’s rich data ecosystem. Above all, remember that sometimes “knowing about a feature” is the biggest leap toward more powerful and elegant solutions.

Links from the show

Bex Tuychiev: linkedin.com
Bex's Medium profile: ibexorigin.medium.com

Numpy 25 functions article: towardsdatascience.com
missingno package: coderzcolumn.com
Watch this episode on YouTube: youtube.com
Episode #341 deep-dive: talkpython.fm/341
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Episode Transcript

00:00 Do you do anything with Jupyter Notebooks?

00:02 If you do, there's a very good chance you're working with the Pandas library.

00:05 This is one of the primary tools for anyone doing computational work or data exploration with Python.

00:12 Yet, this library is massive, and knowing the idiomatic way to use it can be hard to discover.

00:18 That's why I've invited Bex Tuychiev to be our guest.

00:21 He wrote an excellent article highlighting 25 idiomatic Pandas functions and properties

00:26 we should all keep in our data toolkit.

00:28 I'm sure there is something here for all of us to take away and use Pandas that much better.

00:33 This is Talk Python To Me, episode 341, recorded November 4th, 2021.

00:39 Welcome to Talk Python To Me, a weekly podcast on Python.

00:55 This is your host, Michael Kennedy.

00:57 Follow me on Twitter, where I'm @mkennedy, and keep up with the show and listen to past

01:01 episodes at talkpython.fm.

01:03 And follow the show on Twitter via at Talk Python.

01:06 We've started streaming most of our episodes live on YouTube.

01:09 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming

01:15 shows and be part of that episode.

01:17 This episode is brought to you by Shortcut and Linode, and the transcripts are sponsored

01:23 by Assembly AI.

01:24 Bex, welcome to Talk Python To Me.

01:29 Hello, Michael.

01:29 Thanks for having me.

01:30 Hey, it's fantastic to have you here on the show.

01:33 Your article 25 Pandas functions that you didn't know or probably don't know, I guess, as we'll

01:40 see.

01:41 That really caught my attention.

01:43 Honestly, I don't know many of them.

01:45 So I learned a bunch by watching it.

01:47 You know, I do spend more time on the web side of Python and the database side of Python than

01:53 I do on the data science.

01:54 But certainly Pandas is a super important part of Python these days.

02:00 And honestly, the whole data science side is the fastest growing part of Python.

02:04 Pandas is like one of the first libraries that you will be introduced in any beginner Python

02:09 or in any beginner data science course.

02:12 And it's amazing how much it has grown since it was first launched.

02:18 And the funny thing about the article is that before writing it, I also didn't know most of the functions.

02:23 I would always get annoyed by people who use some like complex functions.

02:28 And I just wanted to know how they worked and explain it to my audience.

02:33 So that was the idea of the article.

02:36 Both me and the audience learning.

02:37 That's the little bit of secret behind these types of things.

02:41 Behind the tutorials, behind articles, behind podcasts, and even behind courses.

02:45 A lot of times we dive into them because we're like, oh, I really want to learn these things.

02:50 And just let me, you know, put it in a format I can present to the rest of the world and help

02:54 everyone else out, right?

02:55 Yeah, yeah.

02:56 Awesome.

02:56 Yeah, before we get into this, I want to talk about your articles and some Kaggle competitions.

03:01 And then we'll dive into the 25 functions.

03:04 But, you know, let's start with your story.

03:06 How did you get into programming in Python?

03:07 Right after I finished high school, I got interested in web development.

03:11 I learned some HTML and CSS.

03:13 And I was hoping to get things to get more, to be more exciting.

03:18 But at some time, I just got bored because I'm really into math.

03:23 And web development had nothing to do with math, so it was very boring.

03:28 So I switched to learning Python.

03:30 Learned it for a while and discovered that data science is more, mostly connected to math and statistics.

03:37 So I just bought a really good course.

03:41 And that's how it starts.

03:44 Yeah, that's fantastic.

03:45 You know, I think people do often feel like you have to be really good at math to be good

03:49 at programming.

03:49 And honestly, most of programming has very little to do with math.

03:54 Yes, of course.

03:55 Yeah.

03:56 But data science does.

03:57 So data science is unique in this way.

03:59 I mean, I guess computational science, right?

04:01 If you're an astrophysicist, you do a lot of math as well.

04:04 But for most of us, math is just a structured way of thinking.

04:08 And we have structured programs.

04:10 And that's kind of the end of the relationship there.

04:12 But if someone is out there and they really love math and they want to take it farther,

04:18 but they want to do that in computers, it sounds like recommending data science might be the

04:21 right path.

04:22 Yeah, of course.

04:23 It's very, really beautiful how software and math connect together in data science,

04:28 what kind of things it can achieve for neural networks and state-of-the-art machine learning

04:34 algorithms.

04:34 It's really amazing.

04:36 Yeah.

04:36 It's one of these areas that's just growing so fast.

04:40 And there's such big advancements.

04:42 Yeah.

04:42 You know, you look at, I think back to when I was in college and we talked about artificial

04:48 intelligence and AI, and it was all about the Turing test, you know?

04:53 Could you get a chat bot that would trick a human into thinking that it was an actual other

04:58 human?

04:59 And it never really seemed to come into reality.

05:03 It always seemed like, oh, there's, it's kind of always 30 years out.

05:06 And then all of a sudden we have self-driving cars and we have GitHub Copilot.

05:10 Yeah.

05:11 It's just the step jump over the last couple of years has been amazing.

05:14 Yeah.

05:14 I was also amazed by GitHub Copilot.

05:16 Like right after it was launched, I wrote an article on it, like as a kind of intro.

05:21 And it really took off.

05:23 Like so many people were interested in it.

05:25 Like it received like more than 50,000 views, the article.

05:28 Yeah.

05:29 A lot of people are amazed by it.

05:30 I'm amazed by it as well.

05:31 I think it's, it is amazing.

05:33 It's also bringing to light some interesting, almost legal and philosophical things, right?

05:40 If people put code on GitHub, they didn't necessarily intend to train an AI with it.

05:45 If they put code on GitHub, that's under GPL.

05:48 Well, what the AI knows, is that now GPL or is that completely, you know, can that be used

05:53 in closed source?

05:54 These are not known, right?

05:55 These are, these are interesting questions.

05:57 Yeah.

05:58 I don't think we're going to answer.

05:59 We're not going to completely fill them out today.

06:01 Let's focus on something more, a little smaller.

06:05 So you mentioned your articles and you've been doing a lot of writing.

06:09 So you're a top 10 writer in artificial intelligence on Medium.

06:13 Yeah.

06:14 Yeah.

06:14 And you're also a Kaggle master.

06:17 Yeah.

06:18 Yeah.

06:19 Let's talk about those two things for a little bit.

06:20 Just give us a sense of the stuff that you write about on Medium and maybe some of your

06:23 favorite articles before we dive into this one that I picked out.

06:26 I started writing on Medium a year ago.

06:29 It was just purely for educational purposes.

06:31 I really liked how like what the things you learn will be like, will be locked into your

06:37 brain by writing about them.

06:39 So it was a really amazing way to learn something new.

06:42 But as my number of articles grew, like my audience grew and I met a lot of people.

06:49 I had, it opened a lot of doors for me writing.

06:53 Yeah.

06:54 And most important of all, I'm more confident about my knowledge than ever before.

06:59 That's fantastic.

07:00 I really like that you point out that it opened doors because so many people feel like I'm

07:06 not ready to write.

07:07 I'm not ready to speak at a user group or a conference, or I'm not ready to appear on a podcast or any

07:12 of these sorts of ways where you put yourself out there.

07:14 Right.

07:14 Yeah.

07:15 But when you do that, the act of doing that pushes you to grow.

07:19 And it also opens doors to people.

07:21 You know, if you're out there and you're genuine, you don't have to be an absolute expert in

07:25 everything.

07:25 You just have to be excited and interested.

07:27 Other people who are excited want to talk to you and work on something with you, right?

07:30 Yes.

07:31 You just have to be one step ahead of your audience and that's it.

07:34 Right.

07:34 When you write articles.

07:35 That's right.

07:36 And not necessarily in everything they know, just the little area that you're interested in,

07:39 right?

07:39 Yes.

07:40 Yes.

07:40 Yeah.

07:40 Awesome.

07:41 And so that's really great that you're doing this writing stuff.

07:44 The other thing is Kaggle.

07:45 Tell us about what you've been doing at Kaggle.

07:48 I really admire people who do, who do like do competitions on Kaggle for a while.

07:53 And I really had this like imposter syndrome.

07:56 I couldn't join the competitions because I thought that they were too complex that I had

08:01 like a lot of things to learn before I joined them.

08:03 I still do.

08:04 But after I joined like the tabular playground competitions, I learned that I can do it.

08:11 Yeah.

08:12 So I started posting my articles in the form of notebooks on Kaggle as well, which started

08:17 getting a lot of views and really nice comments from the audience.

08:20 The community on Kaggle is even more amazing than on Medium.

08:24 For an article that gets like read by thousands of people on Medium, I usually receive like

08:29 one or two comments.

08:30 But if you write, if you post the same article as a notebook on Kaggle, like the audience loves

08:35 it because Kaggle is mostly suited for this kind of tutorials.

08:40 And I usually receive like 30 or 40 comments.

08:42 And that's really amazing as a writer to be part of that kind of community.

08:46 Yeah.

08:47 That's really amazing.

08:48 I had no idea.

08:48 I didn't realize you could post on Kaggle.

08:50 Yeah.

08:51 No.

08:51 You kind of post your solutions and then have a conversation around them sort of, right?

08:55 Yes.

08:56 Okay.

08:56 Awesome.

08:57 People want to get started with Kaggle.

08:58 What do they need to do?

08:59 Like maybe before we drop this topic, if people haven't done stuff with Kaggle yet, but they

09:03 maybe want to use it to learn.

09:04 What's your advice there?

09:05 Yeah.

09:06 I just, right after you create an account, they have a whole suite of courses, free courses

09:12 you can take.

09:13 I think those are the, those are very best, very good starting points for any beginner.

09:19 And also they have like two or three beginner level competitions.

09:23 So you don't get intimidated by those grandmasters or masters.

09:27 They're just a simple datasets you can work with and you just have to submit your predictions

09:33 and just get a score and nothing too complex.

09:37 And that's really the amazing part of Kaggle.

09:40 That's why those three competitions, I think, have like 100,000 people competing

09:46 at any single time.

09:49 That's wild.

09:50 One of the challenges when you're learning is finding a structured problem to approach,

09:55 right?

09:56 Maybe in the web world, people try to build things that are too ambitious.

10:00 They're like, oh, I want to build Airbnb.

10:02 You're like, whoa, whoa, whoa, whoa.

10:03 You don't really hardly understand CSS.

10:05 Let's take it down a notch and let's go slow.

10:08 And we'll get a right-sized problem for you to address.

10:10 Data science has the same problem, but I think it has another aspect, which is, and you need

10:14 the data to start from, right?

10:17 Yeah.

10:17 And I feel like Kaggle helps in bringing that kind of stuff over.

10:22 Yeah.

10:23 Kaggle like has an amazing list of datasets.

10:26 I almost always use Kaggle datasets for my, for my articles because most of them are digestible

10:33 and small enough for people to get an advantage of.

10:35 Awesome.

10:36 A question from the audience from Brandon Bennett asks, are Kaggle competitions, just machine

10:42 learning and artificial intelligence related?

10:44 Are there other types?

10:45 Yeah.

10:45 Kaggle competitions are only AI or data science related.

10:49 Yeah.

10:49 Okay.

10:50 So for example, the latest launched on Kaggle, I think is about finding the cuteness quotient

10:57 of pets.

10:58 It was, yeah, you just take in like thousands of images and you process them with Python or

11:05 R and the neural network learns the structure and learns the cuteness quotient and just spits

11:12 out a new quotient for any new image you get.

11:15 That's amazing.

11:16 So it used to be, here's a machine learning model that can answer, is it a cat or a dog?

11:21 And now it's giving you a cuteness score.

11:23 Yeah.

11:24 Yeah.

11:24 I can definitely see my daughter getting into data science with this one.

11:27 She's all about pets and cats and dogs.

11:30 And I personally want to put a vote out there for the golden cocker, the golden retriever mixed

11:37 with the cocker spaniel.

11:38 Boy, those things are cute.

11:39 Okay.

11:40 So that's Kaggle.

11:41 Sounds really great for learning.

11:43 Yeah.

11:43 And I suspect knowing something about pandas will pay off.

11:47 Oh, of course.

11:48 Right?

11:49 Like it's such a foundational aspect.

11:52 Yeah.

11:52 Pandas is used extensively.

11:55 It is.

11:56 And I feel like pandas is one of those things that you could learn it really quickly.

12:00 You could learn to do stuff with pandas in a day.

12:03 Yeah.

12:04 Yeah.

12:05 But then in a year, you could still be learning stuff about pandas.

12:08 If you use it every day for a year, you know what I mean?

12:10 Yeah.

12:11 Most data science libraries are just very vast.

12:14 There are a lot of functionalities.

12:16 And most of the time, like you can get around by learning like 10 or 15% of all those functions.

12:22 But when you really need to get something like really rare edge cases or unique cases,

12:28 you really need to know some of those rare functions that are buried in the documentation

12:34 just so that you don't have to reinvent the wheel.

12:37 Yeah.

12:37 In Python, we speak about Pythonic code.

12:41 There's code that we could write that might be code that runs, but it looks like it comes

12:46 from Java or it looks like it comes from C and somebody just got it working.

12:50 And I suspect you have the same thing in data science and around pandas.

12:54 It's like, yeah, you technically could do this with pandas, but why don't you just call

12:58 this function?

12:59 And probably the answer is, well, I didn't know that function existed.

13:02 Of course, I would have called it if I had known to do it, but I just didn't know, right?

13:06 I'm new.

13:06 Yeah.

13:07 So hopefully we can shine a light on some of those things that you can do.

13:10 I mean, for example, not that we'll necessarily cover it in your article, but if you're doing

13:15 a for loop with a data frame, you're probably doing it wrong, right?

13:18 The golden rule is to never use loops, like ditch loops completely.

13:23 Yeah.

13:23 That's pretty interesting.

13:24 It definitely takes a different way of thinking, sort of set-based processing and passing in

13:30 expressions and lambdas to various places and whatnot.

13:32 Yeah.

13:33 Maps and whatnot.

13:34 Okay.

13:34 We're going to talk about some of those.

13:35 Let's dive in.

13:36 First of all, how did you pick these 25?

13:39 Were these just 25 that you saw people use?

13:41 They were interesting.

13:42 You're like, I didn't even know that existed.

13:43 Or what was your philosophy here?

13:45 For this kind of articles, I usually go to the API reference of the documentation.

13:49 It just lists every single class and functionality of some library, the API reference.

13:54 And I just read them one by one.

13:56 I decide which one of those is going to be beneficial to me and possibly for my audience.

14:02 And I just pick them out.

14:04 Yeah.

14:04 One by one.

14:04 Yeah.

14:05 Yeah.

14:05 That's really cool.

14:05 I love to discover these types of things.

14:07 So why don't we, you kick it off with number one?

14:10 Yeah.

14:11 What's number one here?

14:12 The first one is ExcelWriter.

14:13 It's a class for writing to Excel sheets.

14:18 So if you have multiple data frames, you can write to Excel sheets as separate tabs with

14:24 separate sheets.

14:24 Pandas data frames usually have this to_excel function, but if you give it the

14:31 ExcelWriter instance, it's going to write it to a separate sheet.

14:35 It's going to enable you to write to separate sheets.

14:37 Yeah.

14:38 This is super neat.

14:38 So in your example here, which of course we'll link to the article and people can check

14:43 out, they all have a bunch of code samples under each one of these.

14:46 You've got two data frames.

14:48 Yeah.

14:48 And you want to put them into some kind of Excel spreadsheet.

14:52 So you create one of these writers.

14:54 This is the function you're talking about.

14:56 And then you go to the data frame, you say to Excel and you give it the writer and a sheet

14:59 name and you give it, you can do that for each data frame and give it different sheet

15:03 names and it just piles up along the bottom.

15:05 Right.

15:05 It's really neat.

15:06 It's ridiculously simple, right?

15:08 It's like given the data frames, it's three lines of code to create an Excel file and

15:12 write it.

15:12 Yeah.

15:13 Yeah.

15:13 If you didn't know this, you'd have to create two separate Excel files and just add them together

15:18 later manually, which is not programmatic.

15:20 Right.

15:21 Or maybe you say you don't know that you can write to Excel.

15:24 I mean, I'm pretty sure I could write to CSV.

15:26 Ah, yeah.

15:27 And there's multiple levels, right?

15:28 Like one level is like, I'm going to write it line by line, putting the commas in there

15:32 myself.

15:33 Another one could be the write CSV, right?

15:35 Read CSV, write CSV.

15:36 But this one is like more structured, right?

15:39 And then you could possibly use some of the more advanced tooling to do things like stylize

15:44 or highlight aspects of it or whatever, right?

15:47 Like openpyxl or something like that.

15:49 Now for this one, you talk about, all right.

15:52 It says that you need to have the right supporting libraries there, right?

15:57 You, for example, have to have different libraries.

16:00 I can't remember which one it was.

16:01 I think it was Py.

16:02 openpyxl.

16:03 openpyxl.

16:04 Yeah, that's it right here.

16:05 I knew it was in here.

16:06 Yeah.

16:06 openpyxl.

16:07 If you want to work with XLSX files.

16:10 And there's other ones as well, right?

16:12 Otherwise, you'll get an error.

16:13 Right.

16:14 So basically, Pandas delegates to this library, which actually understands Excel and writes

16:20 to it.

16:20 Yeah.

16:21 There's another one where it talks about using fsspec.

16:25 And this caught my attention as like, oh, wow, this is way more flexible.

16:29 Because I'm not sure people are aware of what fsspec is.

16:33 Are you familiar with fsspec?

16:34 No, no.

16:35 So fsspec is this library that allows you to treat different destinations as Python file systems.

16:43 Like, you know, with open some file name.

16:45 Instead of file name, you can do all sorts of stuff.

16:48 So let me see if I can find some of the documentation here of the things that it can go to.

16:54 Yeah.

16:54 Integrates with a bunch of different places, but it goes to places like S3 storage and

17:01 FTP and database and zip files and all of these types of crazy things.

17:07 And it even does caching, I guess, is right?

17:10 So this Excel writer, while it already sounds really interesting because it writes to Excel,

17:15 like destination of these Excel files, like this could be an Excel file in a database or

17:20 something with basically hardly any changes to the code.

17:23 Yes.

17:24 Yeah, that's super cool.

17:25 So good one to kick it off there.

17:27 A lot going on.

17:28 This portion of Talk Python To Me is brought to you by Shortcut, formerly known as clubhouse.io.

17:35 Happy with your project management tool?

17:37 Most tools are either too simple for a growing engineering team to manage everything,

17:41 or way too complex for anyone to want to use them without constant prodding.

17:45 Shortcut is different though, because it's worse.

17:48 No, wait, no, I mean, it's better.

17:49 Shortcut is project management built specifically for software teams.

17:53 It's fast, intuitive, flexible, powerful, and many other nice positive adjectives.

17:58 Key features include team-based workflows.

18:04 Or customize them to match the way they work.

18:10 Tight version control integration.

18:19 Whether you use GitHub, GitLab, or Bitbucket, Clubhouse ties directly into them, so you can update progress from the command line.

18:26 Keyboard friendly interface.

18:28 The rest of Shortcut is just as friendly as their power bar, allowing you to do virtually anything without touching your mouse.

18:34 Throw that thing in the trash.

18:35 Iteration planning.

18:38 Set weekly priorities and let Shortcut run the schedule for you with accompanying burndown charts and other reporting.

18:44 Give it a try over at talkpython.fm/shortcut.

18:49 Again, that's talkpython.fm/shortcut.

18:53 Choose Shortcut because you shouldn't have to project manage your project management.

18:57 The next one is pipe, right?

19:01 Yeah.

19:02 The image also.

19:03 Yeah.

19:04 There's like a lumberjack-looking dude smoking a pipe there.

19:07 That's very cool.

19:08 Yeah.

19:08 Yes.

19:08 Tell us about pipe.

19:09 When you do data analysis, like most of the time, the data you'll be dealing with will be like not clean.

19:16 You have to perform some operations.

19:18 And pipe really offers a way to package all those operations into a single line of code or a single block of code.

19:28 It's kind of like scikit-learn pipelines, where you just have to run a single line of code and perform several operations at the same time.

19:37 It's really just a neat way to do data cleaning.

19:40 Right.

19:40 And it's what's called a fluent API.

19:43 So if I call data frame dot pipe, what comes back is another data frame and then I could call dot pipe on it again and then dot pipe and dot pipe and chain those together.

19:52 Yes.

19:52 Applying different operations and transformations.

19:55 It's almost like a map reduce or aggregation framework type of thing here.

20:00 Right.

20:00 It's pretty flexible.

20:01 I would think it's, you know, in its entirety, one of the amazing features of pandas, like consistency, always.

20:08 Yeah, I really like it.

20:10 It looks super neat.

20:11 So you need to do transformations on a data frame with custom functions and get answers out.

20:17 Yeah.

20:18 Another thing that you pointed out here is that as part of this, you could apply it to the whole data frame or you could pass a set of columns.

20:26 Yes.

20:26 As part of it.

20:27 So as what you're piping across, what does that do?

20:30 That reduces the result to just those.

20:32 If you pass in three things, just those three columns.

20:34 And then these two functions, remove_outliers and encode_categoricals, are functions that accept arguments.

20:41 And when you pass it to pipe, we just have to pass the function name.

20:45 Got it.

20:45 Which means you can pass the arguments.

20:47 So to pass the arguments, actually, you just have to provide them after the comma.

20:53 So this remove outliers function just accepts one argument as a list and it performs like outlier removal and just returns the whole data frame.

21:02 I see.

21:03 So you can pass like your function might take the data frame, but it might also take additional information.

21:07 Like I want to exclude things that are over a hundred dollars and just throw them away.

21:11 Well, you got to pass that hundred in because it needs to know a hundred versus some other cutoff value.

21:16 Right.

21:16 Got it.

21:17 Yes.

21:17 Okay.

21:17 Cool.

21:18 And you say it resembles scikit-learn pipeline.

21:21 Yeah, that's pretty cool.
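
As a rough illustration of the chaining described above (a sketch, not the article's code: both helper functions and the toy data are hypothetical stand-ins), extra arguments for a piped function simply follow the function name:

```python
import pandas as pd


def remove_outliers(df, columns, cutoff):
    """Hypothetical helper: keep rows where every listed column is <= cutoff."""
    keep = (df[columns] <= cutoff).all(axis=1)
    return df[keep]


def encode_categoricals(df, columns):
    """Hypothetical helper: replace each listed column with integer codes."""
    out = df.copy()
    for col in columns:
        out[col] = pd.factorize(out[col])[0]
    return out


# Toy data, made up for this example.
sales = pd.DataFrame(
    {"price": [20, 250, 75, 99], "cut": ["Ideal", "Good", "Ideal", "Fair"]}
)

# pipe passes the DataFrame as the first argument and returns whatever the
# function returns, so the calls chain as long as each one returns a DataFrame.
cleaned = (
    sales.pipe(remove_outliers, ["price"], 100)
    .pipe(encode_categoricals, ["cut"])
)
print(cleaned)
```

Because each helper returns a DataFrame, another .pipe call can always be appended to the chain.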

21:23 All right.

21:24 We're up to number three factorize.

21:27 Yeah.

21:28 Tell us about this one.

21:29 In machine learning, algorithms only accept numerical data, and most real-world data sets contain categoricals, which means there are values like class one, class two, or class three.

21:41 And you have to encode them to numeric, like zero, one, two, three, using something like the one-hot encoder or label encoder in scikit-learn.

21:49 But you can do that in pandas as well.

21:51 You just have to pass the column to factorize and it just encodes them with numericals for each class.

21:58 I see.

21:59 So let me see if I can give an audio friendly example for listeners here.

22:03 Yeah.

22:03 If we've got something that says a data frame where one of the pieces is what the weather was like sunny, rainy, sun, rain, snow, clouds, something like that.

22:16 You can't feed sun to the machine learning model.

22:20 You got to give it a number, right?

22:21 Yes.

22:21 So this will convert that to like zero for sun and everywhere sun appeared, you would now have a zero.

22:26 One for rain, everywhere there was a rain and so on.

22:30 So it just does that, figures out how many different categories there are and then gives them a number that can be sent off to machine learning, right?
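
The weather example can be sketched in a few lines (the values are made up for illustration):

```python
import pandas as pd

# Toy weather observations, invented for this example.
weather = pd.Series(["sun", "rain", "sun", "snow", "rain"])

# factorize returns the integer codes plus the unique values they stand for.
codes, uniques = pd.factorize(weather)
print(codes)
print(list(uniques))
```

Codes are assigned in order of first appearance, so "sun" maps to 0 here only because it shows up first.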

22:36 You explained that.

22:37 Awesome.

22:39 You see, I'm learning, right?

22:40 I'm just following along with you here.

22:42 Awesome.

22:42 Okay.

22:43 That's a really cool one.

22:44 This next one seems a little bit crazy, but it looks very useful.

22:48 Explode, right?

22:49 What is explode?

22:50 Survey data.

22:51 Surveys usually contain questions that are multiple choice.

22:55 You can just pick a lot of like more than one answer to one question and that's recorded as one answer.

23:01 So you're just going to end up with these kinds of lists in a single cell of the table.

23:07 Like a question.

23:08 Oh, if you have a question one and the user picks the answers A, B, and C, that A, B, C is going to end up as a list in a single cell of a table.

23:18 Right.

23:18 So for an example here, you have a series that has one and then six and then seven.

23:23 And then the fourth element is a list of three other numbers.

23:26 And you're like, wait a minute, those are not supposed to just be multidimensional.

23:30 I want a straight series, right?

23:31 You want a straight series.

23:32 And when you call explode on this series, it's going to just expand the series vertically and just going to fill up.

23:40 It just takes the elements of the single cell lists and just expands them vertically.

23:46 Yeah.

23:47 And these are the types of things that you were talking about with loops, right?

23:50 It would be easy to go through and say, I'm going to build up a new data frame.

23:54 And if I see a list instead of a number, I'm going to just start appending those from the list with an inner loop and then we'll carry on.

23:59 Right.

23:59 And here you've literally done it in one line.
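
A minimal sketch of that one-liner, using a small made-up Series whose last element is a list:

```python
import pandas as pd

# Three scalars followed by one list-valued cell (illustrative values).
answers = pd.Series([1, 6, 7, [46, 56, 49]])

# explode expands the list vertically; the expanded rows repeat the
# original index label of the cell they came from.
exploded = answers.explode()
print(exploded)
```

The repeated index label is handy for tracing each expanded row back to its original cell.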

24:01 Yeah.

24:02 Yeah.

24:02 This would be crazy complex if you did it like manually.

24:04 Right.

24:06 And honestly, slower, right?

24:07 Because a lot of this is probably implemented in C, whereas you would be doing it at the Python layer.

24:12 It's going to be very slow.

24:13 All right.

24:13 Another question from Brandon out there.

24:15 Glad he's here in the live stream.

24:16 How would I apply Explode to the entire data frame?

24:20 I'm guessing he's thinking about maybe if you had multiple columns and they each potentially had this.

24:26 I don't think that's possible.

24:27 Yeah.

24:27 I don't think Pandas allows that.

24:29 Yeah.

24:29 Okay.

24:29 So it's got to be on a series, not on a data frame.
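
For what it's worth, pandas also provides an explode method on DataFrames; it takes the name of the column to expand and repeats the other columns alongside it. A minimal sketch with made-up survey data:

```python
import pandas as pd

# Hypothetical survey responses: one row per question, answers stored as lists.
survey = pd.DataFrame(
    {"question": ["q1", "q2"], "answers": [["A", "B", "C"], ["B"]]}
)

# Expand the list column; the "question" value is repeated for each answer.
long_form = survey.explode("answers")
print(long_form)
```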

24:32 Right.

24:33 Got it.

24:33 Okay.

24:34 Cool.

24:34 So these are all fun names that stand out.

24:37 The next one.

24:38 You're a fun name.

24:39 Yeah.

24:40 And you pick some cool pictures, right?

24:41 Yeah.

24:42 Yeah.

24:42 All right.

24:42 So what's the next one?

24:43 Squeeze.

24:44 Squeeze.

24:44 As you can see, there are some conditional operations that return DataFrames even if the result is a single cell.

24:53 As you can see from the subset, we're just asking the diamonds data frame to return all diamonds that are priced below $1.

25:02 And it just returns a single result, which is 326.

25:08 But it's returned as a data frame, which is not comfortable to work with, like a single cell data frame.

25:13 Right.

25:13 Because pandas doesn't know ahead of time that a .loc call is going to result in a single item.

25:18 This happens a lot in databases, too.

25:20 You do a query, and the result is actually a single thing.

25:23 But the framework has no way to know that the data is structured in a way that's unique or that's a one thing.

25:29 And I suspect that's common here with data frames as well.

25:32 You're structured.

25:33 Like, I know this is going to give me the one answer.

25:35 Yeah.

25:36 But it just returns the whole table.

25:38 Yeah, yeah, you're like, well, now I got to, like, dig in and give me the first row, first column.

25:42 Yeah, okay.

25:43 So squeeze helps fix this?

25:44 You just call squeeze on a single-cell DataFrame or Series, and it removes all the dimensionality and just returns the number.

25:53 Interesting.

25:53 That's cool.

25:53 What happens if I call it on one that's got more than one item?

25:57 Do you know?

25:58 Does it just give you the first, or does it freak out and let you know?

26:01 I never tried that.

26:02 Yeah, I never tried that.

26:04 Yeah, don't do that, right?

26:06 It's like, maybe if you just actually want the first answer, maybe it's okay, but it also might give you an exception.

26:11 I don't know.

26:12 I'll be fine to try it now.

26:13 Yeah, exactly.
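
The open question is easy to check. In this sketch (toy data standing in for the diamonds data set), squeeze turns a 1x1 DataFrame into a scalar, turns a single-column DataFrame into a Series, and hands back a DataFrame with several rows and columns unchanged rather than raising:

```python
import pandas as pd

# Toy stand-in for the diamonds data set.
diamonds = pd.DataFrame(
    {"price": [326, 334, 400], "carat": [0.23, 0.21, 0.30]}
)

# A .loc selection that happens to match exactly one cell is still a DataFrame.
subset = diamonds.loc[diamonds.index < 1, ["price"]]
value = subset.squeeze()                  # 1x1 DataFrame -> scalar

column = diamonds[["price"]].squeeze()    # one column -> Series
unchanged = diamonds.squeeze()            # nothing to squeeze -> same shape
print(value, type(column).__name__, unchanged.shape)
```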

26:14 Cool.

26:14 So the next one has to do with finding things in a range, right?

26:18 Yeah, between, yeah.

26:20 Yeah, it's like, just the name suggests, like, you want to take all the rows that are in between some range.

26:27 For example, here in the code example, I'm choosing all diamonds that are priced between $3,500 and $3,700.

26:35 Nice.

26:36 So, of course, you could do this probably as an expression.

26:39 You could definitely do this as a loop.

26:42 But both of those are slower, I'm sure, because they're not implemented internally, right?

26:47 Yeah, less elegant.

26:48 This one is better and faster and shorter.

26:51 Yeah, and one of the things, the third parameter you can pass here to between, in addition to, like, the lower bound and upper bound, is whether or not it includes the endpoints, right?

27:00 In this one, inclusive is "neither".

27:03 So it's like open set.
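
A small sketch of between with toy prices. Note the assumption: passing a string such as "neither" for inclusive needs a reasonably recent pandas (1.3 or later, as far as I know); older versions took a boolean instead.

```python
import pandas as pd

# Made-up prices for illustration.
prices = pd.Series([3400, 3500, 3600, 3700, 3800])

# By default both endpoints are included.
closed = prices[prices.between(3500, 3700)]

# inclusive="neither" drops both endpoints (an open interval).
open_interval = prices[prices.between(3500, 3700, inclusive="neither")]
print(closed.tolist(), open_interval.tolist())
```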

27:05 Nice.

27:06 Another thing that I've seen here, which is not one of your 25, but looks nice.

27:10 I'm used to visualizing, quickly visualizing a data frame when I get it back with head or tail.

27:17 And I want to know, like, okay, kind of what did I get back here?

27:19 Show me the front.

27:20 That'll be good.

27:21 Do head.

27:21 Or let's go to the end and see what happened at the end.

27:23 But here you have .sample.

27:25 That's interesting.

27:25 I use it often because some data sets have, like, ordering, for example, time series data sets.

27:31 And the first few rows might be not too representative of the whole data frame.

27:36 So I just call sample with, like, five or ten rows.

27:40 And that randomly samples the data set.

27:43 And sometimes that represents the data set better than head or tail.

27:49 Right, exactly.

27:50 And so it just kind of randomly picks some stuff throughout the data set to show you what's going on, right?

27:55 For large data sets, that's really handy.
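
A minimal sketch of sample on an ordered toy data set; random_state is only set here so the example is repeatable:

```python
import pandas as pd

# An ordered toy data set, e.g. one row per day.
readings = pd.DataFrame({"day": range(100), "value": range(100, 200)})

# head() would only show days 0-4; sample() picks rows from anywhere.
peek = readings.sample(n=5, random_state=42)
print(peek)
```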

27:57 Nice to know.

27:58 Yeah.

27:58 So the next one has to do with, I'm guessing, like, when you're doing matrix multiplication and vectors and, like, truly doing math.

28:07 Most of the time I would expect this to show up.

28:09 Yeah.

28:10 Most of the time, yes.

28:11 Yeah.

28:11 Transpose.

28:12 Yeah.

28:13 It stands for transpose.

28:14 I usually, you usually don't do math or matrix multiplication in Pandas.

28:19 You will do it in NumPy.

28:21 But this one, I use it mostly on the result of describe.

28:25 You see here, describe returns.

28:28 The axis inverted.

28:30 So the five-number summary is given as rows.

28:33 And that's really a problem when you have multiple columns because the data set starts to expand horizontally, which makes you scroll to the right, which you don't want.

28:45 So when you do describe, you get things like given a data set, it'll say, here's the count of this index, the mean of this index, or this value of a column, standard deviation, and so on.

28:56 And the number of options there is unbounded.

28:58 But the fact that it goes count, mean, standard deviation, minimum, and then a few more things, that's fixed.

29:03 And that fits pretty well.

29:04 So you're saying if you transpose or flip the rows and columns so that you make it go vertical instead of across, that's an easier way to look at it.

29:12 Yeah, yeah.

29:12 I agree.

29:13 And it's as easy as saying .T.

29:15 So it's not too hard to do, right?

29:17 You might as well.

29:17 It's an attribute.
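
To make the shape change concrete, here is a sketch with a small made-up DataFrame: describe puts the statistics in rows and one column per feature, and .T flips that so each feature becomes a row.

```python
import pandas as pd

# Toy numeric data with three features.
measurements = pd.DataFrame(
    {"carat": [0.2, 0.3, 0.4], "depth": [61.0, 62.5, 60.1], "price": [326, 334, 400]}
)

summary = measurements.describe()   # statistics as rows, features as columns
flipped = summary.T                 # features as rows, statistics as columns
print(flipped)
```

With many features, the flipped version grows downward instead of sideways, which avoids horizontal scrolling.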

29:18 Yeah.

29:19 Cool.

29:19 All right.

29:20 That's a really good one.

29:22 So you're saying if I'm going to do like some kind of matrix multiplication stuff, I should not do it in Pandas.

29:27 I should just stick to NumPy.

29:28 Yeah.

29:29 NumPy is like purely for mathematical purposes.

29:31 And it's much faster than Pandas.

29:34 I suspect that NumPy has a good transpose as well.

29:36 But yeah, there.

29:37 It has the same attribute.

29:39 Yeah.

29:39 There's a lot of synergy between those two libraries.

29:41 So the next one has to do with styling things and how they look, right?

29:45 One of the things that's cool about Pandas is it mixes well with Jupyter Notebooks.

29:50 And Jupyter Notebooks have a nice sort of explore the data.

29:53 And let's see what's going on.

29:54 Let me just look at it, right?

29:55 So this styler thing, the style attribute helps you with that, right?

29:59 Yeah.

30:00 Here, like it takes advantage of that.

30:02 The fact that Jupyter uses HTML and CSS under the hood.

30:06 So you can take advantage of that and use some HTML and CSS knowledge to style your data frame based on some Pythonic loops or conditionals.

30:17 Here, for example, after you take the transpose of the describe, you can just highlight the maximums of each row or column using the highlight_max function.

30:28 Yeah.

30:29 Pandas offers a lot of functions after the style attribute.

30:32 You can use the built-in functions or you can come up with some custom logic to style your data frame using HTML and CSS.

30:39 Okay.

30:40 Yeah, this is great.

30:41 So you can say, for example, here dot style dot highlight max.

30:46 And then you give it some CSS values like colors, dark red or something like that, right?

30:50 You just don't have to look at the row numbers.

30:53 It just shows you the most important metrics or the ones that you want.

30:57 It's really useful when you have like multiple columns.

31:00 You just don't want to have to.

31:01 You just don't want to look at all those crazy numbers and you just use some.

31:06 Yeah, like a real reasonable or maybe straightforward thing you might start out by doing.

31:11 So, well, let me just sort it.

31:13 We'll sort it so the highest one's at the top.

31:15 But in this example, you've got multiple columns and the max of one column is in one value, but it's a different row for a different attribute of it, right?

31:24 So sorting it is going to do nothing except for like if you come up with a whole bunch of variations and try to look at it and a little bit of color, a little bit of picture goes a long ways.

31:33 Yeah, yeah.

31:34 Visual.

31:34 Yeah, absolutely.

31:35 This portion of Talk Python To Me is sponsored by Linode.

31:40 Cut your cloud bills in half with Linode's Linux virtual machines.

31:44 Develop, deploy and scale your modern applications faster and easier.

31:49 Whether you're developing a personal project or managing larger workloads, you deserve simple, affordable and accessible cloud computing solutions.

31:56 Get started on Linode today with $100 in free credit for listeners of Talk Python.

32:01 You can find all the details over at talkpython.fm/Linode.

32:05 Linode has data centers around the world with the same simple and consistent pricing, regardless of location.

32:12 Choose the data center that's nearest to you.

32:15 You also receive 24/7/365 human support with no tiers or handoffs, regardless of your plan size.

32:23 Imagine that real human support for everyone.

32:25 You can choose shared or dedicated compute instances, or you can use your $100 in credit on S3 compatible object storage, managed Kubernetes clusters and more.

32:36 If it runs on Linux, it runs on Linode.

32:38 Visit talkpython.fm and click the create free account button to get started.

32:43 You can also find the link right in your podcast player show notes.

32:46 Thank you to Linode for supporting Talk Python.

32:49 Yeah, the second example you have here in your article is a little more nuanced.

32:55 This looks great.

32:55 Tell us about that.

32:56 This one is like background gradient.

32:58 So it just colors each cell of the column based on its magnitude.

33:04 It's kind of like a continuous palette.

33:08 It just shows where the maximum or the minimums are and just how they compare to each other.

33:15 Yeah, it's almost like if you could do a heat map in an Excel table, you know, by making the cells different colors.

33:20 You can pass in a color map and all sorts of stuff to control how that looks.

33:24 Yeah.

33:24 Yeah, cool.

33:25 I like it.

33:25 This is great.

33:26 You know, it's one of these things where, again, one line of code and you can dramatically improve the presentation value or the informational value of what you're looking at.

33:34 Right.

33:35 Nice.

33:35 All right.

33:36 I feel like that's similar to your number nine.

33:39 Yeah.

33:39 This one is Pandas options.

33:41 It's kind of like the settings of your phone.

33:43 You just set them globally and it applies to all the data frames, the series and all the functions that you are going to be using inside the project or inside the session of Jupyter Notebook.

33:53 So if you want to have some sort of number of columns that are shown or some kind of color or something like that, you can just set that up at the beginning.

34:01 Yeah.

34:01 You just don't have to call them every single time or change them every single time.

34:05 It's just a shorthand way of doing things, like setting global settings.

34:11 Yeah.

34:11 You could probably even do something like have a little JSON file that describes the look and feel of what you're doing.

34:17 Just your first line, just load it up and set it and then, you know, go from there.

34:21 Something to that effect, right?

34:22 Yeah.

34:23 Yeah.

34:23 So you don't have to completely fill the first few lines of your notebook with like setup code.

34:28 Yeah.

34:28 For example, one of those examples is like a display max rows.

34:32 If you set it to five and you just call the data frame, it's going to only show the first five rows.

34:38 So you don't have to call .head every time.

34:40 Oh, that's interesting.

34:41 Yeah.

34:42 Because of course, if there's enough rows, it won't print the whole thing out, right?

34:45 Probably.

34:46 Yeah.

34:46 You don't want to print 10 million rows and completely lock up the system.

34:49 Yeah.

34:50 Yeah.

34:50 That's going to.

34:51 Cool.

34:52 Oh, and another one that's kind of nice is display precision.

34:55 And if you set that, you won't see the, you know, 1.27e to the five or whatever, right?

35:01 You can.

35:02 It's really annoying when you're working with like math functions.

35:06 It just keeps giving you scientific notation when you just want to see the first four or five decimal places.

35:15 Yeah.

35:15 Scientific notation is great when you're dealing with huge numbers or tremendously small numbers, right?

35:20 Like how many meters across is an atom?

35:23 Okay.

35:23 So you're going to need an E to something.

35:25 But for human beings often, you know, you want to just look at the number and go, yeah, that's a million, not like, you know, 1.2e to the six or seven, whatever.

35:33 It's going to be really annoying.

35:34 That's cool.

35:35 And this is just one of those options you can set up and it just globally applies to that notebook.
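
The two options mentioned above can be set like this. It is a sketch with arbitrary values; option_context is shown as one way to override a setting temporarily without touching the global value:

```python
import pandas as pd

# Global settings for the rest of the session or notebook.
pd.set_option("display.max_rows", 5)
pd.set_option("display.precision", 4)

# Temporarily override a setting for one block only.
with pd.option_context("display.max_rows", 20):
    inside = pd.get_option("display.max_rows")

outside = pd.get_option("display.max_rows")
print(inside, outside)
```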

35:39 So another thing that's interesting about pandas is the columns have types usually, but not always.

35:46 It's one of those like beginning level things that you will encounter, but it can get really annoying if the data types are incorrect for your column.

35:55 The most important one is the object data type.

35:58 Right.

35:58 That's like, I don't really know.

36:00 So we're just going to store it.

36:02 Yeah.

36:03 I'm just going to put it inside of an object, and the object data type is the worst one.

36:10 It also limits the functionality of pandas and it's also the most memory consuming.

36:14 Right.

36:15 So the next function, what number are we on here?

36:18 Number 10.

36:18 10, yes.

36:19 And on the hit list is convert_dtypes, as in convert data types.

36:25 When you call it on the whole data frame, it just, it tries to infer the correct data type for each column.

36:32 If it's a float or integer or string like that.

36:36 So your example, you're reading a CSV file and some of the columns are detected correctly like floats, but others get this object.

36:43 But after calling convert_dtypes, it's like, you know what?

36:45 No, those are strings.

36:46 But it can't handle the date times because there are so many date time formats and pandas can't possibly know all of them.
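
A sketch of that behavior on a made-up frame whose columns all start out as object. The date column is not turned into datetimes by convert_dtypes, so pd.to_datetime is called explicitly:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_float_dtype

# Everything is forced to object dtype to mimic a badly typed load.
raw = pd.DataFrame(
    {
        "cut": ["Ideal", "Good"],
        "carat": [0.23, 0.21],
        "sold_on": ["2021-11-04", "2021-11-05"],
    },
    dtype=object,
)

converted = raw.convert_dtypes()
print(converted.dtypes)

# Dates need an explicit conversion step.
converted["sold_on"] = pd.to_datetime(converted["sold_on"])
```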

36:54 Why are date times so hard?

36:55 They really shouldn't be, but they really are.

36:58 It's crazy.

36:58 And then you throw in time zones and you'll forget it.

37:01 Okay.

37:01 And throw in daylight savings and all these other things.

37:04 Oh, yeah.

37:05 That's crazy.

37:06 Yeah.

37:07 Daylight saving is crazy.

37:08 Yeah.

37:08 I suspect some of the Kaggle stuff.

37:10 Part of the challenge is like normalize these dates because who knows or something along those lines.

37:15 Time zones are like total mess.

37:18 Yeah, for sure.

37:18 So related to converting the data types is to select them.

37:24 Yeah.

37:24 Which is a way to filter what's in there.

37:28 Like you can filter by column or rows or even a condition.

37:31 But this is saying like, I only want the strings or only want the numbers, right?

37:36 While doing machine learning, you have to apply certain pre-processing functions to only a subset of the data.

37:43 Like only on categoricals or only numerics.

37:46 So this function will become very handy.

37:49 You just pass the data type using NumPy.

37:54 And it just gives the subset of the data frame with that data type.

37:58 Nice.

37:59 So you would say DataFrame.select_dtypes and then include=np.number.

38:04 And now instantly the resulting data frame is a subset that only has numbers, right?

38:09 Yes.

38:10 That's cool.

38:10 And then also you point out that you can do the reverse.

38:12 Just like give you just the other, like just the informational bits, like categories and stuff or rating by saying exclude.

38:20 Yeah.
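
Both directions, include and exclude, in a short sketch with toy columns:

```python
import numpy as np
import pandas as pd

# Toy frame mixing numeric and text columns.
diamonds = pd.DataFrame(
    {"cut": ["Ideal", "Good"], "carat": [0.23, 0.21], "price": [326, 327]}
)

numeric = diamonds.select_dtypes(include=np.number)   # only number columns
others = diamonds.select_dtypes(exclude=np.number)    # everything else
print(list(numeric.columns), list(others.columns))
```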

38:21 Very nice.

38:21 Okay.

38:22 Well, we just missed it with Halloween here.

38:24 Yeah.

38:25 Yeah.

38:26 Mask.

38:26 But mask.

38:27 Yeah.

38:28 Cool.

38:28 Like a mask here.

38:30 But mask is number 12.

38:33 That's about it.

38:34 It's a conditional; you can use it on series or data frames, and it just returns the subset of the data where some condition is true.

38:44 Yeah.

38:45 Okay.

38:45 So, yeah.

38:46 And this example here, you've got a bunch of ages.

38:49 And I want to subset them using between.

38:51 I want to take all those rows that are beyond 60 or below 50 and convert those values to NaN.

38:59 Okay.

38:59 So, this is like an in-place update or I guess it replaces, creates another one that is like as if you updated it.

39:06 And it finds all the stuff that's, I guess, outside of your range and then applies this other value, right?

39:12 Like if it's stuff that's outside of this range, in this case, you're going to set it to not a number, but it could be set to zero or max or anything.
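
A sketch of the ages example with made-up values. mask returns a new Series in which every value where the condition is true has been replaced, and the original is left untouched:

```python
import numpy as np
import pandas as pd

# Illustrative ages.
ages = pd.Series([15, 52, 58, 61, 70])

# Replace everything outside the 50-60 range with NaN.
outside = ~ages.between(50, 60)
masked = ages.mask(outside, other=np.nan)
print(masked)
```

Any other replacement value, such as 0, can be passed as other instead of NaN.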

39:19 Uh-huh.

39:20 Yeah.

39:20 Cool.

39:21 A very good one.

39:22 Similar, I guess, is min and max.

39:25 And then some of these, as we get a little farther down your recommendations, I like them.

39:28 They're not just, oh, here, you can apply this function, but apply it in this scenario or this context to get an interesting outcome, right?

39:36 So, that's what number 13 is like.

39:37 Min and max along columns axis.

39:40 Usually, when you call min or max on a column, it just returns the minimum or maximum of that column.

39:47 But sometimes you want it row-wise, like it just treats rows as columns and it gives min and max across the rows.

39:56 That's usually useful.

39:58 A handy way of doing something that would take a lot of code if done manually.

40:02 Another one of these tricks or techniques that lets you avoid looping, right?

40:06 Here I show a good example of like comparing four different libraries on five datasets.

40:11 You want the best performance on each dataset.

40:14 So, you have to find the best score across the rows.

40:18 Exactly.

40:18 So, the columns are the different libraries like XGBoost, CatBoost, scikit-learn, and so on, being applied to the same dataset.

40:25 And you want to just go for row one, what one did the best?

40:28 Row two, what one did the best?

40:29 Yeah.

40:30 Yeah.

40:30 Very nice.
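
A sketch of the row-wise comparison with invented scores (the numbers and the number of libraries are placeholders, not results from the article):

```python
import pandas as pd

# Hypothetical scores: one row per data set, one column per library.
scores = pd.DataFrame(
    {
        "XGBoost": [0.91, 0.80, 0.75],
        "CatBoost": [0.89, 0.85, 0.70],
        "scikit-learn": [0.90, 0.82, 0.78],
    },
    index=["dataset_1", "dataset_2", "dataset_3"],
)

# axis=1 computes across the columns, giving one value per row.
best = scores.max(axis=1)
worst = scores.min(axis=1)
print(best)
```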

40:31 It takes a lot of code if done manually.

40:32 Yeah.

40:33 Cool.

40:33 Number 14, nlargest and nsmallest.

40:37 Yeah.

40:38 We're talking about those max or minimums.

40:41 So, with nlargest or nsmallest, when you pass a number and a column name, it just returns the data frame that contains the largest or smallest N rows of that column.

40:53 Nice.

40:53 So, if I were to call min or max, that would give me the smallest or the largest one, respectively, right?

40:59 Yes.

40:59 But a really interesting or common question you might have is like, what are the top 10 selling products this month, right?

41:07 Yeah.

41:07 And this lets you just say nlargest(10), and then you pick the column on which to judge it.

41:12 Here you have price, right?

41:13 Five most expensive diamonds in the diamonds data set.

41:16 Yeah.

41:16 Again, one of these things that, you know, no more looping or any of that stuff.

41:20 No more if statements.

41:21 Just call it, right?

41:22 This one is like the five cheapest diamonds.

41:25 Yeah.

41:25 And so, nsmallest and nlargest.

41:27 Fantastic.
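
In code, with a toy stand-in for the diamonds data set:

```python
import pandas as pd

# Made-up prices for illustration.
diamonds = pd.DataFrame(
    {"price": [326, 18823, 2757, 554, 18818], "carat": [0.23, 2.29, 0.70, 0.30, 2.00]}
)

# Whole rows come back, ordered by the chosen column.
most_expensive = diamonds.nlargest(2, "price")
cheapest = diamonds.nsmallest(2, "price")
print(most_expensive)
```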

41:28 Also, sometimes when you're asking for a minimum or maximum thing, you don't actually want the minimum or maximum.

41:34 You want to know where that is because you're going to get that thing back and say, I need that whole row because I want to learn more information about it, right?

41:42 But if you said, well, what's the minimum price?

41:44 It's seven.

41:44 Like, oh, okay, great.

41:46 Now do I need to like loop through until I find that thing that has seven or something like this?

41:50 So, you've got a recommendation for that.

41:52 Yeah.

41:52 The idxmax and idxmin.

41:54 This returns the index values of minimum or max.

41:58 So that you can look at the row that they are stored at or the column.

42:03 Fantastic.

42:03 Yeah.

42:04 So, here's the row that contains the minimum price.
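
A sketch of looking up the whole row for the minimum price, on toy data:

```python
import pandas as pd

# Made-up rows for illustration.
diamonds = pd.DataFrame(
    {"price": [554, 326, 18823], "cut": ["Good", "Ideal", "Premium"]}
)

# idxmin/idxmax return the index label, which .loc can use to fetch the row.
cheapest_label = diamonds["price"].idxmin()
cheapest_row = diamonds.loc[cheapest_label]
priciest_label = diamonds["price"].idxmax()
print(cheapest_row)
```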

42:07 I love it.

42:08 Yeah.

42:08 Really nice.

42:08 So, so many of these are really easy to apply, right?

42:11 Like, it's not a lot of research to learn how to apply idxmax, but at the same time, knowing that it exists, now all of a sudden you can use it really easily.

42:22 But you probably wouldn't have known to look for it, right?

42:24 Yeah.

42:25 Yeah.

42:25 Cool.

42:26 People often talk about differences between beginner developers and expert developers.

42:32 And I think a lot of times beginners look at folks like you who have a lot of experience.

42:36 They're like, oh, this guy is so incredibly smart and he just has this way of solving these problems.

42:40 It's so amazing.

42:41 And, you know, to some degree, that's probably true.

42:44 But a lot of it is like just building up layers and layers of these like, oh, I know I can use idxmax.

42:51 I know that I can use nlargest.

42:53 And you just sort of pile them together.

42:54 And then like, bam, like the solution becomes easier because you have these little building blocks.

42:59 Right.

42:59 So it's, I think it's really valuable for people getting into Pandas.

43:03 I usually think that the biggest difference between a beginner-level and a more experienced programmer is just how much time they spend on the documentation.

43:12 Yeah.

43:13 Yeah.

43:14 Yeah.

43:14 If you read the docs, like if you patiently read the docs, you're just going to become a really good user of that particular tool or library.

43:21 I agree.

43:21 There's just more, you understand it better.

43:24 You know more of what it has to offer.

43:26 So it's like, it's less you've got to reinvent.

43:29 Yeah.

43:29 All right.

43:29 I talked about how you have something that may be well known, but then applying it in a scenario.

43:34 And this number 16 is value_counts with dropna=False.

43:39 What's this one about?

43:40 When you have a series with like categoricals, you just want to see the proportions or their numbers as a whole in the total series.

43:48 And that usually doesn't include the null values.

43:51 So you have to call isnull and chain it with sum so that you learn the number of NaNs in that column.

43:59 But you can do it efficiently with value_counts by setting dropna to False, which includes the proportions of the null values as well.

44:09 Yeah.

44:09 So it just gives you a, basically a percentage as a ratio here.

44:13 It's just a ratio of the number of the different categories that have appeared here.

44:18 Right.

44:19 So very cool.

44:20 And now just not a number is included.
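
Both approaches side by side, on a made-up categorical Series. normalize=True is what turns the counts into ratios:

```python
import pandas as pd

# Illustrative categories with one missing value.
cuts = pd.Series(["Ideal", "Good", "Ideal", None])

# The chained approach: fraction of missing values only.
missing_ratio = cuts.isnull().sum() / len(cuts)

# value_counts with dropna=False reports the missing share with the rest.
proportions = cuts.value_counts(dropna=False, normalize=True)
print(proportions)
```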

44:21 That's great.

44:22 Yeah.

44:22 Number 17 clip.

44:24 This is a good one.

44:25 Yeah.

44:25 For data that exceeds, I don't know, maybe a range, maybe some instrument is supposed to collect zero to a hundred and it goes crazy and goes outside of a hundred.

44:34 Yeah.

44:35 For example, we go back to the ages example where I just want to have ages between 18 and 60.

44:43 And I want to exclude all those values.

44:46 And when you call clip with those custom values, it's just going to impose those hard limits on the whole series.

44:52 Right.

44:52 So it'll replace the ones that are over with the maximum that you said and the ones that are too low, it'll bring them up to the minimum.
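
The same thing in code, with illustrative ages:

```python
import pandas as pd

# Made-up ages, some outside the allowed range.
ages = pd.Series([5, 18, 33, 60, 95])

# Values below 18 become 18; values above 60 become 60.
clipped = ages.clip(18, 60)
print(clipped.tolist())
```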

44:59 Right.

44:59 Yeah.

44:59 Very cool.

45:00 Again, against the whole data set, not looping.

45:02 Only at one column at a time.

45:04 Yeah.

45:05 We talked about how difficult time is, but you do have some recommendations for searching for data that appears at a certain time or in a time range, right?

45:14 What's number 18?

45:15 This one is like a subsetting of rows of the data frame at some particular time of the day, like any time of the day.

45:23 Like, for example, three o'clock, 9:30, 10:30, or any time that you want.

45:29 You're just going to take all those rows and return them using at_time.

45:33 Yeah, that's super easy, right?

45:35 Just pass in at time and you literally specify times, right?

45:39 Like 15 colon zero, zero as a string.

45:42 Like a real conversation or messaging.

45:46 And then the other one, which is also interesting, is between_time, right?

45:49 Like what happened in the morning, for example?

45:52 Like what are those sales that happened in the morning or after midnight or during some particular interval?

46:00 This one is really handy to do that.

46:03 Yeah.

46:03 So super easy.

46:04 Just DataFrame.between_time.

46:05 Or is that a Series?

46:06 No, it doesn't matter.

46:08 Okay.

46:08 It doesn't matter.

46:09 It has to have a DatetimeIndex.

46:13 That's it.

46:13 Yeah.

46:13 Okay.

46:14 So then you just pass in strings like "9:45" to "12:00".

46:19 And you know, that's like late morning or something.
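
To make the two calls concrete, here is a sketch on a made-up `sales` frame with a 15-minute DatetimeIndex spanning two days (the frame and its `amount` column are assumptions for illustration). Both methods work on a DataFrame or a Series as long as the index is a DatetimeIndex.

```python
import pandas as pd

# Hypothetical sales log: one row every 15 minutes over two full days.
index = pd.date_range("2021-11-01 00:00", periods=192, freq="15min")
sales = pd.DataFrame({"amount": range(192)}, index=index)

# Rows stamped exactly 15:00, regardless of the date.
at_three = sales.at_time("15:00")

# Rows from 09:45 through 12:00 (both ends included by default), on any date.
late_morning = sales.between_time("09:45", "12:00")

# The same call works on a Series with a DatetimeIndex.
late_morning_amounts = sales["amount"].between_time("09:45", "12:00")

print(at_three)
print(late_morning.head())
```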

46:21 Beautiful.

46:21 The next one here has to do with time series.

46:24 Number 19, bdate_range.

46:26 Tell us about this.

46:27 Well, this one stands for business date range.

46:32 So pandas has a lot of calendars built in internally.

46:37 It's for when you want to, how can I say, index a time series data frame.

46:43 You want to include only working days.

46:47 You want to exclude all the weekends.

46:49 Yeah.

46:50 You can't do that for every single day of the year or for every single week of the year, because you can't possibly know which days are weekends.

46:57 So when you call bdate_range, it just indexes the data frame using only weekdays.

47:05 And also it excludes the holidays, I think.

47:09 Oh my gosh.

47:09 I was just wondering about holidays.

47:11 Like there's another wrinkle in there.

47:12 Already things like leap year and stuff like that is built into this, I would imagine.

47:17 So this is super cool.

47:18 Yeah.

47:19 This is very important when you are doing time series forecasting or analysis, or working with stocks, because stocks are only traded on weekdays and not on holidays.

47:31 So it will be very important.

47:32 Or even if you're doing traffic analysis, you want to understand accidents that are a result of rush hour, right?

47:39 You wouldn't want to look on a weekend.
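
A hedged sketch of the holiday question raised above. With the default business-day frequency, `bdate_range` skips weekends only. To skip holidays as well, one option is the custom business-day frequency `"C"` together with a `holidays` list you supply yourself. The date range below is an arbitrary example around US Thanksgiving 2021.

```python
import pandas as pd

# Default business-day frequency: Monday to Friday only.
# 2021-11-22 is a Monday; 2021-11-25 (a Thursday) was US Thanksgiving.
weekdays = pd.bdate_range(start="2021-11-22", end="2021-11-30")

# Holidays are not skipped by default. With the custom business-day
# frequency "C" you can pass the dates to exclude explicitly.
trading_days = pd.bdate_range(
    start="2021-11-22",
    end="2021-11-30",
    freq="C",
    holidays=["2021-11-25"],
)

print(weekdays)
print(trading_days)
```

Leap years and month lengths come for free because the range is built from real calendar dates.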

47:40 Yeah.

47:41 All right.

47:41 The next one has to do with correlation.

47:43 autocorr, C-O-R-R.

47:46 Yeah.

47:46 Autocorrelation.

47:47 Yeah.

47:48 I don't do much with time series.

47:49 You're going to have to tell us about this one.

47:51 What's going on here?

47:51 The autocorrelation of a series or time series tells you the predictability of the time series with itself.

48:00 Do you know about the correlation coefficient?

48:03 Yeah, exactly.

48:03 It tells you how much the model matches the actual data.

48:07 Like it's 97% likely that the model will predict the stuff coming up, right?

48:12 Could be linear or more complicated, but that's something like that.

48:14 Yeah.

48:15 The gist of this is that if a time series has a high autocorrelation with itself, it means that you can predict it more easily.

48:23 Got it.

48:23 Yeah.

48:23 It's basically how predictable or unpredictable is this thing.

48:27 Yeah.

48:27 There are a lot of details about autocorrelation and it has many applications in time series.

48:34 But the gist is that it shows you how much predictability it has at each interval.
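
As a rough illustration, `Series.autocorr(lag=k)` is the Pearson correlation between a series and a copy of itself shifted by `k` steps. The three series below are invented to show the extremes: a steady trend, a strict alternation, and seeded random noise.

```python
import numpy as np
import pandas as pd

# A steady upward trend: each value is fully determined by the previous one.
trend = pd.Series(np.arange(100, dtype=float))

# A strict alternation: each value is the negative of the previous one.
flip = pd.Series([1.0, -1.0] * 50)

# Hypothetical unstructured noise (fixed seed so the run is repeatable).
rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(size=1000))

trend_lag1 = trend.autocorr(lag=1)  # close to 1: highly predictable
flip_lag1 = flip.autocorr(lag=1)    # close to -1: predictable, sign flips
flip_lag2 = flip.autocorr(lag=2)    # close to 1: repeats every two steps
noise_lag1 = noise.autocorr(lag=1)  # close to 0: little to predict from

print(trend_lag1, flip_lag1, flip_lag2, noise_lag1)
```

Checking several lags, as with `flip`, is what "at each interval" refers to.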

48:40 Cool.

48:40 It sounds very useful if you're doing that kind of stuff.

48:42 All right.

48:43 Number 21, hasnans.

48:45 It's also an attribute.

48:47 You just call it on a series and it returns true or false.

48:51 If you have, it returns true if you have at least one missing value in a series.

48:56 Yeah.

48:56 So there was this quote, I don't remember who it's attributed to.

48:59 Sorry.

49:00 That says something to the effect of like data cleanup and data wrangling is not the dirty work.

49:06 It is the work of data science, like to get everything ready.

49:09 And then you just like hit it with the magic at the end.

49:11 Right.

49:12 And this feels like it lands right in that realm. Given some data frame or series,

49:16 does it have not a numbers or is it all good?

49:19 Yeah.

49:19 Missing values is like a huge problem in machine learning.

49:22 Most scikit-learn algorithms don't accept missing values.

49:26 So you either have to drop them or impute them using some techniques.

49:31 And this one is very handy to detect those missing values.

49:35 Right.

49:36 I suspect this is the first test.

49:37 If it has not a numbers, then we're going to go do stuff.

49:41 But if it says false, then you're good to go.

49:43 Just roll.

49:43 Yeah.

49:44 Yeah.

49:44 Go with that.

49:45 But it usually returns True.

49:46 Unfortunately.
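
A short sketch of that first test, on made-up data. `hasnans` is an attribute of a Series (no parentheses). A DataFrame does not have it, so the per-column dictionary and the `isna().any().any()` check below are just one way to ask the same question of a whole frame.

```python
import numpy as np
import pandas as pd

complete = pd.Series([1.0, 2.0, 3.0])
gappy = pd.Series([1.0, np.nan, 3.0])

print(complete.hasnans)  # False
print(gappy.hasnans)     # True

# Hypothetical frame: check each column, then the frame as a whole.
df = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, np.nan, 3.0]})
per_column = {name: df[name].hasnans for name in df.columns}
any_missing = bool(df.isna().any().any())

print(per_column)
print(any_missing)
```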

49:48 Are you familiar with missingno?

49:51 Let me.

49:51 Yeah.

49:52 Yeah.

49:52 This is another thing that sort of came to mind, this whole thing, this

49:57 missingno package, as in no numbers.

50:00 So a way to not just answer yes or no, but to get visualizations.

50:03 Have you used this?

50:04 Yeah.

50:05 Yeah.

50:05 I also wrote an article on it, I think.

50:07 Okay.

50:07 Well, yeah.

50:08 So definitely.

50:08 That's awesome.

50:09 Yeah.

50:09 Things like this sound really useful to me.

50:11 They seem like.

50:12 I really like that missingness matrix.

50:14 It just shows how missing values are correlated

50:20 with other columns.

50:21 Right.

50:21 Is it a whole bunch of missing data in one row?

50:23 Yeah.

50:24 And then it's all good?

50:25 Or is it interspersed?

50:26 Like this one's missing the birthday, but that one's missing the name or something like

50:29 that.

50:29 Right.

50:30 Yeah.

50:30 It's a really good package.

50:31 Yeah.

50:31 Fantastic.

50:32 All right.

50:32 Number 22.

50:34 at and iat.

50:35 These are like faster versions of loc and iloc.

50:39 They just enable you to index your data frame.

50:43 But they are specifically designed for retrieving single values.

50:48 Nice.

50:48 It's almost like an array index.

50:51 Yeah.

50:51 A little bit.

50:52 What's the difference between at and iat?

50:54 Using at, you can use column labels.

50:57 As you can see here, we are using cut and an index.

51:00 But with iat, you have to know the index of that column.

51:03 I see.

51:04 So with at, it would be like row and then column name, where iat is row and column number.

51:09 It's probably less flexible.

51:10 You've got to know that cut is four, and it could be moved around as people are creating

51:15 or inserting data.
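
A minimal sketch of the difference, using a tiny stand-in for the diamonds table from the article (the values and the row labels 10, 20, 30 are invented). `at` takes a row label and a column label, while `iat` takes integer positions for both.

```python
import pandas as pd

# Hypothetical stand-in for the diamonds data; row labels are 10, 20, 30.
diamonds = pd.DataFrame(
    {
        "carat": [0.23, 0.21, 0.29],
        "cut": ["Ideal", "Premium", "Good"],
        "price": [326, 326, 334],
    },
    index=[10, 20, 30],
)

# Label-based: row labelled 20, column named "cut".
by_label = diamonds.at[20, "cut"]

# Position-based: second row, second column (breaks if columns are reordered).
by_position = diamonds.iat[1, 1]

# Both accessors can also assign a single cell.
diamonds.at[30, "price"] = 340

print(by_label, by_position)
```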

51:16 Yeah.

51:16 Okay.

51:16 argsort.

51:19 This one just returns the indices that would sort a data frame.

51:23 Okay.

51:23 Based on some column.

51:25 So during data analysis, you sometimes want the indices, not the actual sorted data,

51:31 so that you can use those indices multiple times over.

51:34 Got it.

51:35 So you get the sorted.

51:35 Say, I want to sort by the total bill.

51:38 Yeah.

51:38 But then give me the indexes as if it was sorted, but don't actually change it.

51:43 So then you could go and then request data off those indexes.
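
Sketched on a few invented rows in the style of the tips dataset: `argsort` is called on one column (a Series) and returns integer positions, which can then be reused with `iloc` as often as needed while the original frame stays in its original order.

```python
import pandas as pd

# Hypothetical slice of a tips-style table.
tips = pd.DataFrame(
    {"total_bill": [16.99, 10.34, 21.01, 23.68], "tip": [1.01, 1.66, 3.50, 3.31]}
)

# Integer positions that would put total_bill in ascending order.
positions = tips["total_bill"].argsort().to_numpy()

# Reuse the positions: reorder the whole frame, or look up a single value.
sorted_tips = tips.iloc[positions]
tip_on_largest_bill = tips["tip"].iloc[positions[-1]]

print(positions.tolist())  # [1, 0, 2, 3]
print(sorted_tips)
```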

51:46 Got it.

51:46 Yeah.

51:47 Nice.

51:47 All right.

51:47 We're closing in on the end and we've brought in the cat, the cat accessor.

51:51 Cat accessor.

51:52 Yeah.

51:52 I should have put an image here.

51:54 Yeah.

51:54 There would have been some kind of cool cat you can put in there.

51:57 Yeah.

51:58 Pandas enables you to perform some data type specific functions.

52:03 There is the dt accessor for datetime and also str for strings.

52:08 And this one is strictly for categorical purposes.

52:11 It has a large suite of categorical functions that make it easier to work on categories, ordinal

52:19 or nominal data.
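
A brief sketch of the `cat` accessor on a made-up `sizes` column. It shows a nominal view (categories and integer codes) and then an ordinal one, where declaring an order makes comparisons meaningful.

```python
import pandas as pd

# Hypothetical categorical column.
sizes = pd.Series(["small", "large", "medium", "small"], dtype="category")

# Nominal view: the distinct categories and the integer code of each row.
categories = list(sizes.cat.categories)  # inferred categories are sorted
codes = sizes.cat.codes.tolist()

# Ordinal view: declare an explicit order so comparisons make sense.
ordered = sizes.cat.reorder_categories(["small", "medium", "large"], ordered=True)
at_least_medium = (ordered >= "medium").tolist()

print(categories)        # ['large', 'medium', 'small']
print(codes)             # [2, 0, 1, 2]
print(ordered.max())     # large
print(at_least_medium)   # [False, True, True, False]
```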

52:20 Yeah.

52:21 Fantastic.

52:22 And let's bring it to the 25th with GroupBy.nth.

52:27 Yeah.

52:27 This one is less useful, or used in very rare edge cases.

52:32 When you group by some column, possibly a categorical column, you want to look at those rows or groups,

52:38 right?

52:39 Calling nth on a grouped data frame just returns the nth row of each group.
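
A small sketch with invented rows. One caveat worth stating as an assumption about your environment: the shape of the result differs between pandas versions (from 2.0 on, `nth` keeps the original row index, while earlier versions index the result by the group key), so the example only looks at the selected `price` values.

```python
import pandas as pd

# Hypothetical data: three "Ideal" rows and two "Good" rows.
df = pd.DataFrame(
    {
        "cut": ["Ideal", "Good", "Ideal", "Good", "Ideal"],
        "price": [326, 327, 334, 335, 336],
    }
)

grouped = df.groupby("cut")

first_rows = grouped.nth(0)   # first row of each group
second_rows = grouped.nth(1)  # second row of each group
third_rows = grouped.nth(2)   # only "Ideal" has a third row

print(first_rows)
print(second_rows)
print(third_rows)
```

Groups that are too short for the requested position simply contribute no row.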

52:47 Got it.

52:47 Okay.

52:48 Yeah.

52:49 That looks really cool.

52:49 Yeah.

52:50 All right.

52:50 Well, that's it for our list.

52:52 Hopefully people out there listening have definitely learned something.

52:56 Now, your title was just to put a little disclaimer in here for everyone.

53:00 It's 25 Pandas Functions You Didn't Know Existed,

53:02 pipe, P(Guarantee) = 0.8.

53:04 So you had this 80%.

53:06 Yeah.

53:06 I'm guaranteed.

53:07 I love it.

53:08 That's a little bit of a stats joke in the title.

53:10 No one complained about that.

53:12 So I think that was right.

53:13 Yeah.

53:14 It sounds about right.

53:15 It seems like there's a lot of neat use cases here that people can find.

53:18 These are your 25 that you found interesting.

53:20 Other people might find them as well.

53:22 There are so many.

53:23 Oh, so many.

53:24 Yeah.

53:25 These are the types of things, though, that people can say, all right, today I'm going to

53:29 try to work with number one as I'm doing my data analysis and stuff.

53:32 I just, I know I'm going to be doing some Excel stuff.

53:34 So let's do the Excel writer one.

53:36 And then, you know, maybe later it's like, oh, I know I'm doing survey type of data.

53:40 So let me work with explode and just try to, you know, if you work these in one at a time,

53:45 eventually they become part of your tool chest and they're good, right?

53:48 Yeah.

53:48 And just expanding your tool set and skills.

53:51 I think part of the trick is to make sure that you apply it a little bit, right?

53:55 I mean, you know, they're out there, but just as you use them, like bring them in.

53:58 It just saves you time and resources.

54:00 Awesome.

54:00 Yeah.

54:01 Half the battle is just knowing that it exists, right?

54:03 It's not that it's necessarily hard to use.

54:04 It's like, I just didn't know this was even an option.

54:07 Yeah.

54:08 All of these are very easy to use.

54:09 You just know that they exist.

54:11 Yeah.

54:11 I feel like so much of Pandas is that way, but they're so, it's hard to know because there's

54:15 so much to do there.

54:16 It's cool.

54:17 Out on the live stream, Brandon, now that we're closing it out,

54:19 wanted to throw out, he said, very helpful.

54:21 Thank you for the article, Bex.

54:22 Cool.

54:23 You're welcome.

54:23 Yeah, I agree.

54:24 Yeah.

54:24 Thanks for doing this one.

54:25 I do want to point out, we certainly don't have time to cover it, but let me pull it up

54:30 here so I can make sure it goes in the links as well.

54:32 You did the same thing for NumPy, right?

54:34 And you also were a little more confident.

54:36 I got to say, you're a little more confident here.

54:37 Your P of guarantee equals 0.85 instead of 0.8.

54:40 NumPy practices are a little bit harder to understand.

54:44 That's why most people don't bother to learn those.

54:48 So I was a bit more confident because I also didn't know most of these functions.

54:52 That's why I was a bit more confident.

54:54 Yeah.

54:54 Fantastic.

54:55 All right.

54:56 So if people like this flow and they want to kind of go a little deeper and go into the

55:00 NumPy layer, they can check that out.

55:01 And they can also check out a bunch of your other writing.

55:03 I also have the same for Sklearn.

55:05 Okay.

55:05 Right on for Sklearn.

55:07 Great.

55:07 All right.

55:08 Anything else you want to add to this article before we call it good on that topic?

55:12 I think we covered everything.

55:13 Yeah.

55:14 We covered it well.

55:14 I think it was fun.

55:15 Yeah.

55:15 It was fun.

55:16 All right.

55:16 Now, before you get out of here, there's the two questions you've got to answer.

55:21 If you're going to write some Python code, what editor do you use?

55:25 What are you going to use?

55:25 For data analysis, I usually use JupyterLab.

55:28 Yep.

55:29 But if I have to do pure Python, that's always PyCharm.

55:33 I love it.

55:34 Awesome.

55:34 That's a good combo.

55:35 Yeah.

55:35 And then notable PyPI package.

55:38 Something.

55:39 It doesn't have to be something super popular, but something that you've come across that

55:42 you're like, people should know about this.

55:44 This is something I learned about.

55:45 I recently came across UMAP.

55:48 UMAP?

55:49 It's for dimensionality reduction.

55:50 UMAP Python.

55:51 It's usually used for like very large data sets to project them to 2D so that you can

55:57 visualize them.

55:58 This one is a really useful package.

56:00 Nice.

56:01 So definitely people are trying to project down to 2D.

56:04 I mean, that's one of the problems, right?

56:05 Is how do you look at some of this stuff that's...

56:08 Like 100 dimensional or 200 dimensions.

56:10 You just can't visualize.

56:12 I don't even have any idea at all how to do 100 dimensions.

56:16 I remember we were doing some work with complex analysis and two dimensional.

56:21 Each dimension was complex numbers.

56:23 So four dimensional.

56:24 That was a challenge.

56:25 I have no idea how to approach 100.

56:27 No one does.

56:28 That's why these kinds of dimensionality reduction techniques exist.

56:31 Yeah.

56:31 Fantastic.

56:32 And of course, it's important in machine learning and stuff, right?

56:35 There's like dimensions that you can just throw away because they don't actually contribute

56:38 to the predictions and stuff, right?

56:40 Yeah.

56:40 UMAP does that exactly.

56:41 Excellent.

56:41 Super.

56:42 All right, Bex.

56:43 Thank you for being here.

56:44 Final call to action.

56:45 People want to get deeper in Pandas, maybe learn more about some of your articles.

56:50 You know, what do you tell them?

56:51 As I said, just first check the documentation.

56:53 The documentation is usually, it should be your first choice.

56:56 It's the best place to learn about a library.

56:59 It takes a little dedication, but go through it and find out what it has to offer and go

57:02 from there, right?

57:03 It's a bit hard to read, but the documentation always gives the best information about

57:09 the library because it's written by the package creators.

57:13 So they know the library the best.

57:15 For sure.

57:15 Yeah.

57:16 All right.

57:17 Well, thank you for being here.

57:18 Thanks for writing the article and sharing that with us.

57:19 Thanks for having me.

57:20 Yeah, you bet.

57:21 Bye.

57:21 Thank you.

57:22 Bye.

57:22 This has been another episode of Talk Python To Me.

57:26 Thank you to our sponsors.

57:28 Be sure to check out what they're offering.

57:29 It really helps support the show.

57:31 Choose Shortcut, formerly Clubhouse.io, for tracking all of your project's work because

57:36 you shouldn't have to project manage your project management.

57:39 Visit talkpython.fm/shortcut.

57:43 Simplify your infrastructure and cut your cloud bills in half with Linode's Linux virtual

57:46 machines.

57:47 Develop, deploy, and scale your modern applications faster and easier.

57:50 Visit talkpython.fm/linode and click the create free account button to get started.

57:55 Do you need a great automatic speech to text API?

57:59 Get human level accuracy in just a few lines of code.

58:01 Visit talkpython.fm/assemblyai.

58:04 Want to level up your Python?

58:06 We have one of the largest catalogs of Python video courses over at Talk Python.

58:11 Our content ranges from true beginners to deeply advanced topics like memory and async.

58:15 And best of all, there's not a subscription in sight.

58:18 Check it out for yourself at training.talkpython.fm.

58:21 Be sure to subscribe to the show.

58:23 Open your favorite podcast app and search for Python.

58:26 We should be right at the top.

58:27 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the

58:33 direct RSS feed at /rss on talkpython.fm.

58:36 We're live streaming most of our recordings these days.

58:40 If you want to be part of the show and have your comments featured on the air, be sure to

58:44 subscribe to our YouTube channel at talkpython.fm/youtube.

58:48 This is your host, Michael Kennedy.

58:49 Thanks so much for listening.

58:51 I really appreciate it.

58:52 Now get out there and write some Python code.

58:54 Bye.
