WEBVTT

00:00:00.001 --> 00:00:02.040
Do you do anything with Jupyter Notebooks?

00:00:02.040 --> 00:00:05.740
If you do, there's a very good chance you're working with the Pandas library.

00:00:05.740 --> 00:00:12.140
This is one of the primary tools for anyone doing computational work or data exploration with Python.

00:00:12.140 --> 00:00:18.100
Yet, this library is massive, and knowing the idiomatic way to use it can be hard to discover.

00:00:18.100 --> 00:00:21.240
That's why I've invited Bex Tuychiev to be our guest.

00:00:21.240 --> 00:00:26.460
He wrote an excellent article highlighting 25 idiomatic Pandas functions and properties

00:00:26.460 --> 00:00:28.380
we should all keep in our data toolkit.

00:00:28.560 --> 00:00:33.120
I'm sure there is something here for all of us to take away and use Pandas that much better.

00:00:33.120 --> 00:00:39.140
This is Talk Python to Me, episode 341, recorded November 4th, 2021.

00:00:39.140 --> 00:00:55.340
Welcome to Talk Python to Me, a weekly podcast on Python.

00:00:55.340 --> 00:00:57.060
This is your host, Michael Kennedy.

00:00:57.120 --> 00:01:01.280
Follow me on Twitter, where I'm @mkennedy, and keep up with the show and listen to past

00:01:01.280 --> 00:01:03.260
episodes at talkpython.fm.

00:01:03.260 --> 00:01:06.300
And follow the show on Twitter via @talkpython.

00:01:06.300 --> 00:01:09.900
We've started streaming most of our episodes live on YouTube.

00:01:09.900 --> 00:01:15.660
Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming

00:01:15.660 --> 00:01:17.500
shows and be part of that episode.

00:01:17.500 --> 00:01:23.380
This episode is brought to you by Shortcut and Linode, and the transcripts are sponsored

00:01:23.380 --> 00:01:24.660
by Assembly AI.

00:01:24.660 --> 00:01:29.000
Bex, welcome to Talk Python to Me.

00:01:29.000 --> 00:01:29.800
Hello, Michael.

00:01:29.800 --> 00:01:30.900
Thanks for having me.

00:01:30.900 --> 00:01:33.720
Hey, it's fantastic to have you here on the show.

00:01:33.720 --> 00:01:40.960
Your article 25 Pandas functions that you didn't know or probably don't know, I guess, as we'll

00:01:40.960 --> 00:01:41.260
see.

00:01:41.260 --> 00:01:42.940
That really caught my attention.

00:01:43.520 --> 00:01:45.240
Honestly, I don't know many of them.

00:01:45.240 --> 00:01:47.400
So I learned a bunch by watching it.

00:01:47.400 --> 00:01:53.740
You know, I do spend more time on the web side of Python and the database side of Python than

00:01:53.740 --> 00:01:54.720
I do on the data science.

00:01:54.720 --> 00:02:00.580
But certainly Pandas is a super important part of Python these days.

00:02:00.680 --> 00:02:04.460
And honestly, the whole data science side is the fastest growing part of Python.

00:02:04.460 --> 00:02:09.960
Pandas is like one of the first libraries that you will be introduced in any beginner Python

00:02:09.960 --> 00:02:12.500
or in any beginner data science course.

00:02:12.500 --> 00:02:17.020
And it's amazing how much it has grown since it was first launched.

00:02:18.020 --> 00:02:23.540
And the funny thing about the article is that before writing it, I also didn't know most of the functions.

00:02:23.540 --> 00:02:28.100
I would always get annoyed by people who use some like complex functions.

00:02:28.100 --> 00:02:33.280
And I just wanted to know how they worked and explain it to my audience.

00:02:33.280 --> 00:02:35.660
So that was the idea of the article.

00:02:36.080 --> 00:02:37.780
Both me and the audience learning.

00:02:37.780 --> 00:02:41.160
That's the little bit of secret behind these types of things.

00:02:41.160 --> 00:02:45.620
Behind the tutorials, behind articles, behind podcasts, and even behind courses.

00:02:45.620 --> 00:02:50.000
A lot of times we dive into them because we're like, oh, I really want to learn these things.

00:02:50.000 --> 00:02:54.780
And just let me, you know, put it in a format I can present to the rest of the world and help

00:02:54.780 --> 00:02:55.700
everyone else out, right?

00:02:55.700 --> 00:02:56.300
Yeah, yeah.

00:02:56.300 --> 00:02:56.700
Awesome.

00:02:56.700 --> 00:03:01.720
Yeah, before we get into this, I want to talk about your articles and some Kaggle competitions.

00:03:01.720 --> 00:03:04.440
And then we'll dive into the 25 functions.

00:03:04.680 --> 00:03:06.380
But, you know, let's start with your story.

00:03:06.380 --> 00:03:07.740
How did you get into programming in Python?

00:03:07.740 --> 00:03:11.800
Right after I finished high school, I got interested in web development.

00:03:11.800 --> 00:03:13.880
I learned some HTML and CSS.

00:03:13.880 --> 00:03:18.540
And I was hoping for things to get more exciting.

00:03:18.540 --> 00:03:23.800
But at some time, I just got bored because I'm really into math.

00:03:23.800 --> 00:03:28.460
And web development had nothing to do with math, so it was very boring.

00:03:28.460 --> 00:03:30.500
So I switched to learning Python.

00:03:30.500 --> 00:03:34.660
Learned it for a while and discovered that data science is more,

00:03:34.660 --> 00:03:37.000
mostly connected to math and statistics.

00:03:37.000 --> 00:03:41.620
So I just bought a really good course.

00:03:41.620 --> 00:03:44.180
And that's how it starts.

00:03:44.180 --> 00:03:45.060
Yeah, that's fantastic.

00:03:45.060 --> 00:03:49.220
You know, I think people do often feel like you have to be really good at math to be good

00:03:49.220 --> 00:03:49.780
at programming.

00:03:49.780 --> 00:03:54.460
And honestly, most of programming has very little to do with math.

00:03:54.460 --> 00:03:55.620
Yes, of course.

00:03:55.620 --> 00:03:56.000
Yeah.

00:03:56.000 --> 00:03:57.060
But data science does.

00:03:57.060 --> 00:03:59.280
So data science is unique in this way.

00:03:59.360 --> 00:04:01.080
I mean, I guess computational science, right?

00:04:01.080 --> 00:04:04.160
If you're an astrophysicist, you do a lot of math as well.

00:04:04.160 --> 00:04:08.720
But for most of us, math is just a structured way of thinking.

00:04:08.720 --> 00:04:10.280
And we have structured programs.

00:04:10.280 --> 00:04:12.520
And that's kind of the end of the relationship there.

00:04:12.520 --> 00:04:18.100
But if someone is out there and they really love math and they want to take it farther,

00:04:18.100 --> 00:04:21.620
but they want to do that in computers, it sounds like recommending data science might be the

00:04:21.620 --> 00:04:22.100
right path.

00:04:22.100 --> 00:04:23.220
Yeah, of course.

00:04:23.220 --> 00:04:28.800
It's very, really beautiful how software and math connect together in data science,

00:04:28.800 --> 00:04:34.260
what kind of things it can achieve for neural networks and state-of-the-art machine learning

00:04:34.260 --> 00:04:34.660
algorithms.

00:04:34.660 --> 00:04:35.880
It's really amazing.

00:04:36.080 --> 00:04:36.240
Yeah.

00:04:36.240 --> 00:04:40.400
It's one of these areas that's just growing so fast.

00:04:40.400 --> 00:04:42.580
And there's such big advancements.

00:04:42.580 --> 00:04:42.840
Yeah.

00:04:42.840 --> 00:04:48.800
You know, you look at, I think back to when I was in college and we talked about artificial

00:04:48.800 --> 00:04:53.040
intelligence and AI, and it was all about the Turing test, you know?

00:04:53.040 --> 00:04:58.980
Could you get a chat bot that would trick a human into thinking that it was an actual other

00:04:58.980 --> 00:04:59.400
human?

00:04:59.400 --> 00:05:02.940
And it never really seemed to come into reality.

00:05:03.080 --> 00:05:06.240
It always seemed like, oh, there's, it's kind of always 30 years out.

00:05:06.240 --> 00:05:10.180
And then all of a sudden we have self-driving cars and we have GitHub Copilot.

00:05:10.180 --> 00:05:11.160
Yeah.

00:05:11.160 --> 00:05:14.480
It's just the step jump over the last couple of years has been amazing.

00:05:14.480 --> 00:05:14.780
Yeah.

00:05:14.780 --> 00:05:16.520
I was also amazed by GitHub Copilot.

00:05:16.520 --> 00:05:21.760
Like right after it was launched, I wrote an article on it, like as a kind of intro.

00:05:21.760 --> 00:05:23.340
And it really took off.

00:05:23.340 --> 00:05:25.520
Like so many people were interested in it.

00:05:25.520 --> 00:05:28.880
Like it received like more than 50,000 views, the article.

00:05:28.880 --> 00:05:29.240
Yeah.

00:05:29.240 --> 00:05:30.780
A lot of people are amazed by it.

00:05:30.780 --> 00:05:31.880
I'm amazed by it as well.

00:05:31.940 --> 00:05:33.380
I think it's, it is amazing.

00:05:33.380 --> 00:05:40.360
It's also bringing to light some interesting, almost legal and philosophical things, right?

00:05:40.360 --> 00:05:45.080
If people put code on GitHub, they didn't necessarily intend to train an AI with it.

00:05:45.080 --> 00:05:48.080
If they put code on GitHub, that's under GPL.

00:05:48.080 --> 00:05:53.460
Well, what the AI knows, is that now GPL or is that completely, you know, can that be used

00:05:53.460 --> 00:05:54.000
in closed source?

00:05:54.000 --> 00:05:55.460
These are not known, right?

00:05:55.460 --> 00:05:57.580
These are, these are interesting questions.

00:05:57.900 --> 00:05:58.260
Yeah.

00:05:58.260 --> 00:05:59.340
I don't think we're going to answer.

00:05:59.340 --> 00:06:01.980
We're not going to completely fill them out today.

00:06:01.980 --> 00:06:05.240
Let's focus on something more, a little smaller.

00:06:05.240 --> 00:06:09.280
So you mentioned your articles and you've been doing a lot of writing.

00:06:09.280 --> 00:06:13.540
So you're a top 10 writer in artificial intelligence on Medium.

00:06:13.540 --> 00:06:13.640
Yeah.

00:06:13.640 --> 00:06:14.160
Yeah.

00:06:14.160 --> 00:06:14.540
Yeah.

00:06:14.720 --> 00:06:17.300
And you're also a Kaggle master.

00:06:17.300 --> 00:06:18.320
Yeah.

00:06:18.320 --> 00:06:18.560
Yeah.

00:06:18.560 --> 00:06:19.140
Yeah.

00:06:19.140 --> 00:06:20.620
Let's talk about those two things for a little bit.

00:06:20.620 --> 00:06:23.980
Just give us a sense of the stuff that you write about on Medium and maybe some of your

00:06:23.980 --> 00:06:26.600
favorite articles before we dive into this one that I picked out.

00:06:26.600 --> 00:06:29.000
I started writing on Medium a year ago.

00:06:29.000 --> 00:06:31.820
It was just purely for educational purposes.

00:06:31.820 --> 00:06:37.920
I really liked how the things you learn will be locked into your

00:06:37.920 --> 00:06:39.220
brain by writing about them.

00:06:39.220 --> 00:06:42.740
So it was a really amazing way to learn something new.

00:06:42.740 --> 00:06:49.240
But as my number of articles grew, like my audience grew and I met a lot of people.

00:06:49.920 --> 00:06:53.440
Writing opened a lot of doors for me.

00:06:53.440 --> 00:06:54.260
Yeah.

00:06:54.260 --> 00:06:59.400
And most important of all, I'm more confident about my knowledge than ever before.

00:06:59.400 --> 00:07:00.080
That's fantastic.

00:07:00.080 --> 00:07:06.100
I really like that you point out that it opened doors because so many people feel like I'm

00:07:06.100 --> 00:07:07.020
not ready to write.

00:07:07.020 --> 00:07:12.660
I'm not ready to speak at a user group or a conference, or I'm not ready to appear on a podcast or any

00:07:12.660 --> 00:07:14.720
of these sorts of ways where you put yourself out there.

00:07:14.720 --> 00:07:14.920
Right.

00:07:14.920 --> 00:07:15.260
Yeah.

00:07:15.260 --> 00:07:19.480
But when you do that, the act of doing that pushes you to grow.

00:07:19.640 --> 00:07:21.520
And it also opens doors to people.

00:07:21.520 --> 00:07:25.540
You know, if you're out there and you're genuine, you don't have to be an absolute expert in

00:07:25.540 --> 00:07:25.900
everything.

00:07:25.900 --> 00:07:27.840
You just have to be excited and interested.

00:07:27.840 --> 00:07:30.800
Other people who are excited want to talk to you and work on something with you, right?

00:07:30.800 --> 00:07:31.300
Yes.

00:07:31.300 --> 00:07:34.360
You just have to be one step ahead of your audience and that's it.

00:07:34.360 --> 00:07:34.660
Right.

00:07:34.660 --> 00:07:35.620
When you write articles.

00:07:35.620 --> 00:07:36.040
That's right.

00:07:36.040 --> 00:07:39.740
And not necessarily in everything they know, just the little area that you're interested in,

00:07:39.740 --> 00:07:39.920
right?

00:07:39.920 --> 00:07:40.300
Yes.

00:07:40.300 --> 00:07:40.600
Yes.

00:07:40.600 --> 00:07:40.920
Yeah.

00:07:40.920 --> 00:07:41.340
Awesome.

00:07:41.340 --> 00:07:44.180
And so that's really great that you're doing this writing stuff.

00:07:44.180 --> 00:07:45.860
The other thing is Kaggle.

00:07:45.860 --> 00:07:47.780
Tell us about what you've been doing at Kaggle.

00:07:48.000 --> 00:07:53.440
I really admired people who do competitions on Kaggle for a while.

00:07:53.440 --> 00:07:56.620
And I really had this like imposter syndrome.

00:07:56.620 --> 00:08:01.500
I couldn't join the competitions because I thought that they were too complex that I had

00:08:01.500 --> 00:08:03.820
like a lot of things to learn before I joined them.

00:08:03.880 --> 00:08:04.880
I still do.

00:08:04.880 --> 00:08:11.700
But after I joined like the tabular playground competitions, I learned that I can do it.

00:08:11.700 --> 00:08:12.140
Yeah.

00:08:12.140 --> 00:08:17.200
So I started posting my articles in the form of notebooks on Kaggle as well, which started

00:08:17.200 --> 00:08:20.540
getting a lot of views and really nice comments from the audience.

00:08:20.660 --> 00:08:24.040
The community on Kaggle is even more amazing than on Medium.

00:08:24.040 --> 00:08:29.220
For an article that gets like read by thousands of people on Medium, I usually receive like

00:08:29.220 --> 00:08:30.120
one or two comments.

00:08:30.120 --> 00:08:35.440
But if you write, if you post the same article as a notebook on Kaggle, like the audience loves

00:08:35.440 --> 00:08:39.680
it because Kaggle is mostly suited for this kind of tutorials.

00:08:40.040 --> 00:08:42.520
And I usually receive like 30 or 40 comments.

00:08:42.520 --> 00:08:46.860
And that's really amazing as a writer to be part of that kind of community.

00:08:46.860 --> 00:08:47.320
Yeah.

00:08:47.320 --> 00:08:48.140
That's really amazing.

00:08:48.140 --> 00:08:48.800
I had no idea.

00:08:48.800 --> 00:08:50.960
I didn't realize you could post on Kaggle.

00:08:50.960 --> 00:08:51.420
Yeah.

00:08:51.420 --> 00:08:51.700
No.

00:08:51.700 --> 00:08:55.020
You kind of post your solutions and then have a conversation around them sort of, right?

00:08:55.020 --> 00:08:55.400
Yes.

00:08:55.400 --> 00:08:55.740
Yes.

00:08:55.740 --> 00:08:56.060
Yes.

00:08:56.060 --> 00:08:56.420
Okay.

00:08:56.420 --> 00:08:57.020
Awesome.

00:08:57.020 --> 00:08:58.420
People want to get started with Kaggle.

00:08:58.420 --> 00:08:59.520
What do they need to do?

00:08:59.520 --> 00:09:03.740
Like maybe before we drop this topic, if people haven't done stuff with Kaggle yet, but they

00:09:03.740 --> 00:09:04.920
maybe want to use it to learn.

00:09:04.920 --> 00:09:05.660
What's your advice there?

00:09:05.660 --> 00:09:06.060
Yeah.

00:09:06.060 --> 00:09:12.220
I just, right after you create an account, they have a whole suite of courses, free courses

00:09:12.220 --> 00:09:13.240
you can take.

00:09:13.240 --> 00:09:19.180
I think those are very good starting points for any beginner.

00:09:19.480 --> 00:09:23.140
And also they have like two or three beginner level competitions.

00:09:23.140 --> 00:09:27.240
So you don't get intimidated by those grandmasters or masters.

00:09:27.240 --> 00:09:33.540
They're just simple datasets you can work with and you just have to submit your predictions

00:09:33.540 --> 00:09:37.840
and just get a score and nothing too complex.

00:09:37.840 --> 00:09:40.380
And that's really the amazing part of Kaggle.

00:09:40.380 --> 00:09:46.360
That's why those three competitions, I think they have like 100,000 people competing

00:09:46.360 --> 00:09:49.580
at any single time.

00:09:49.580 --> 00:09:50.140
That's wild.

00:09:50.140 --> 00:09:55.580
One of the challenges when you're learning is finding a structured problem to approach,

00:09:55.580 --> 00:09:56.740
right?

00:09:56.740 --> 00:10:00.820
Maybe in the web world, people try to build things that are too ambitious.

00:10:00.820 --> 00:10:02.500
They're like, oh, I want to build Airbnb.

00:10:02.500 --> 00:10:03.660
You're like, whoa, whoa, whoa, whoa.

00:10:03.660 --> 00:10:05.460
You don't really hardly understand CSS.

00:10:05.460 --> 00:10:07.980
Let's take it down a notch and let's go slow.

00:10:08.160 --> 00:10:10.160
And we'll get a right-sized problem for you to address.

00:10:10.160 --> 00:10:14.900
Data science has the same problem, but I think it has another aspect, which is, and you need

00:10:14.900 --> 00:10:17.260
the data to start from, right?

00:10:17.260 --> 00:10:17.500
Yeah.

00:10:17.500 --> 00:10:21.700
And I feel like Kaggle helps in bringing that kind of stuff over.

00:10:22.140 --> 00:10:22.420
Yeah.

00:10:22.420 --> 00:10:23.020
Yeah.

00:10:23.020 --> 00:10:26.300
Kaggle like has an amazing list of datasets.

00:10:26.300 --> 00:10:33.140
I almost always use Kaggle datasets for my, for my articles because most of them are digestible

00:10:33.140 --> 00:10:35.920
and small enough for people to take advantage of.

00:10:35.920 --> 00:10:36.320
Awesome.

00:10:36.320 --> 00:10:42.320
A question from the audience from Brandon Bennett asks, are Kaggle competitions, just machine

00:10:42.320 --> 00:10:44.180
learning and artificial intelligence related?

00:10:44.180 --> 00:10:45.220
Are there other types?

00:10:45.220 --> 00:10:45.860
Yeah.

00:10:45.860 --> 00:10:49.320
Kaggle competitions are only AI or data science related.

00:10:49.560 --> 00:10:49.760
Yeah.

00:10:49.760 --> 00:10:50.280
Okay.

00:10:50.280 --> 00:10:57.560
So for example, the latest launched on Kaggle, I think is about finding the cuteness quotient

00:10:57.560 --> 00:10:58.700
of pets.

00:10:58.700 --> 00:11:05.680
It was, yeah, you just take in like thousands of images and you process them with Python or

00:11:05.680 --> 00:11:12.500
R and the neural network learns the structure and learns the cuteness quotient and just spits

00:11:12.500 --> 00:11:15.740
out a new quotient for any new image you get.

00:11:15.740 --> 00:11:16.280
That's amazing.

00:11:16.820 --> 00:11:21.820
So it used to be, here's a machine learning model that can answer, is it a cat or a dog?

00:11:21.820 --> 00:11:23.640
And now it's giving you a cuteness score.

00:11:23.640 --> 00:11:24.000
Yeah.

00:11:24.000 --> 00:11:24.220
Yeah.

00:11:24.220 --> 00:11:27.460
I can definitely see my daughter getting into data science with this one.

00:11:27.460 --> 00:11:30.380
She's all about pets and cats and dogs.

00:11:30.380 --> 00:11:37.520
And I personally want to put a vote out there for the golden cocker, the golden retriever mixed

00:11:37.520 --> 00:11:38.500
with the cocker spaniel.

00:11:38.500 --> 00:11:39.420
Boy, those things are cute.

00:11:39.420 --> 00:11:39.920
Okay.

00:11:40.260 --> 00:11:41.260
So that's Kaggle.

00:11:41.260 --> 00:11:43.220
Sounds really great for learning.

00:11:43.220 --> 00:11:43.580
Yeah.

00:11:43.580 --> 00:11:47.400
And I suspect knowing something about pandas will pay off.

00:11:47.400 --> 00:11:48.940
Oh, of course.

00:11:48.940 --> 00:11:49.600
Right?

00:11:49.600 --> 00:11:52.580
Like it's such a foundational aspect.

00:11:52.580 --> 00:11:52.880
Yeah.

00:11:52.880 --> 00:11:55.540
Pandas is used extensively.

00:11:55.540 --> 00:11:56.240
It is.

00:11:56.240 --> 00:12:00.960
And I feel like pandas is one of those things that you could learn it really quickly.

00:12:00.960 --> 00:12:03.900
You could learn to do stuff with pandas in a day.

00:12:03.900 --> 00:12:04.800
Yeah.

00:12:04.800 --> 00:12:05.040
Yeah.

00:12:05.040 --> 00:12:08.240
But then in a year, you could still be learning stuff about pandas.

00:12:08.240 --> 00:12:10.920
If you use it every day for a year, you know what I mean?

00:12:10.920 --> 00:12:11.220
Yeah.

00:12:11.220 --> 00:12:14.740
Most data science libraries are just very vast.

00:12:14.740 --> 00:12:16.800
There are a lot of functionalities.

00:12:16.800 --> 00:12:22.700
And most of the time, like you can get by with learning like 10 or 15% of all those functions.

00:12:22.960 --> 00:12:28.440
But when you really need to get something like really rare edge cases or unique cases,

00:12:28.440 --> 00:12:34.220
you really need to know some of those rare functions that are buried in the documentation

00:12:34.220 --> 00:12:37.200
just so that you don't have to reinvent the wheel.

00:12:37.200 --> 00:12:37.520
Yeah.

00:12:37.520 --> 00:12:41.100
In Python, we speak about Pythonic code.

00:12:41.100 --> 00:12:46.680
There's code that we could write that might be code that runs, but it looks like it comes

00:12:46.680 --> 00:12:50.280
from Java or it looks like it comes from C and somebody just got it working.

00:12:50.280 --> 00:12:54.100
And I suspect you have the same thing in data science and around pandas.

00:12:54.100 --> 00:12:58.680
It's like, yeah, you technically could do this with pandas, but why don't you just call

00:12:58.680 --> 00:12:59.300
this function?

00:12:59.300 --> 00:13:02.740
And probably the answer is, well, I didn't know that function existed.

00:13:02.740 --> 00:13:06.420
Of course, I would have called it if I had known to do it, but I just didn't know, right?

00:13:06.420 --> 00:13:06.860
I'm new.

00:13:06.860 --> 00:13:07.460
Yeah.

00:13:07.460 --> 00:13:10.700
So hopefully we can shine a light on some of those things that you can do.

00:13:10.700 --> 00:13:15.100
I mean, for example, not that we'll necessarily cover it in your article, but if you're doing

00:13:15.100 --> 00:13:18.520
a for loop with a data frame, you're probably doing it wrong, right?

00:13:18.520 --> 00:13:23.100
The golden rule is to never use loops, like ditch loops completely.

00:13:23.100 --> 00:13:23.560
Yeah.

00:13:23.560 --> 00:13:24.460
That's pretty interesting.

00:13:24.460 --> 00:13:30.140
It definitely takes a different way of thinking, sort of set-based processing and passing in

00:13:30.140 --> 00:13:32.820
expressions and lambdas to various places and whatnot.

00:13:32.820 --> 00:13:33.180
Yeah.

00:13:33.180 --> 00:13:34.080
Maps and whatnot.
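
NOTE Editor's sketch of the "avoid for loops on a data frame" advice above, not code from the article. It assumes pandas is installed; the column names and values are made up for illustration.
```python
import pandas as pd
# Hypothetical order data, invented for this sketch.
orders = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})
# Loop version: works, but visits the rows one at a time in Python.
totals_loop = []
for _, row in orders.iterrows():
    totals_loop.append(row["price"] * row["quantity"])
# Vectorized version: a single expression over whole columns.
orders["total"] = orders["price"] * orders["quantity"]
```
Both versions compute the same totals; the column-wise form is the one usually considered idiomatic pandas.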

00:13:34.080 --> 00:13:34.440
Okay.

00:13:34.440 --> 00:13:35.640
We're going to talk about some of those.

00:13:35.640 --> 00:13:36.940
Let's dive in.

00:13:36.940 --> 00:13:39.140
First of all, how did you pick these 25?

00:13:39.140 --> 00:13:41.720
Were these just 25 that you saw people use?

00:13:41.720 --> 00:13:42.340
They were interesting.

00:13:42.340 --> 00:13:43.800
You're like, I didn't even know that existed.

00:13:43.800 --> 00:13:45.300
Or what was your philosophy here?

00:13:45.300 --> 00:13:49.140
For this kind of articles, I usually go to the API reference of the documentation.

00:13:49.140 --> 00:13:54.280
It just lists every single class and functionality of some library, the API reference.

00:13:54.280 --> 00:13:56.800
And I just read them one by one.

00:13:56.800 --> 00:14:02.000
I decide which one of those is going to be beneficial to me and possibly for my audience.

00:14:02.180 --> 00:14:04.020
And I just pick them out.

00:14:04.020 --> 00:14:04.260
Yeah.

00:14:04.260 --> 00:14:04.760
One by one.

00:14:04.760 --> 00:14:05.260
Yeah.

00:14:05.260 --> 00:14:05.460
Yeah.

00:14:05.460 --> 00:14:05.980
That's really cool.

00:14:05.980 --> 00:14:07.880
I love to discover these types of things.

00:14:07.880 --> 00:14:10.860
So why don't we, you kick it off with number one?

00:14:10.860 --> 00:14:11.380
Yeah.

00:14:11.380 --> 00:14:12.060
What's number one here?

00:14:12.060 --> 00:14:13.460
The first one is ExcelWriter.

00:14:13.940 --> 00:14:18.780
It's a class for writing to Excel sheets.

00:14:18.780 --> 00:14:24.140
So if you have multiple data frames, you can write to Excel sheets as separate tabs with

00:14:24.140 --> 00:14:24.880
separate sheets.

00:14:24.880 --> 00:14:31.600
Usually, pandas data frames have this to_excel function, but if you give it the

00:14:31.600 --> 00:14:35.300
ExcelWriter instance, it's going to write it to a separate sheet.

00:14:35.300 --> 00:14:37.820
It's going to enable you to write to separate sheets.

00:14:37.820 --> 00:14:38.040
Yeah.

00:14:38.040 --> 00:14:38.760
This is super neat.

00:14:38.760 --> 00:14:43.480
So in your example here, which of course we'll link to the article and people can check

00:14:43.480 --> 00:14:46.120
out, they all have a bunch of code samples under each one of these.

00:14:46.120 --> 00:14:48.600
You've got two data frames.

00:14:48.600 --> 00:14:48.900
Yeah.

00:14:48.900 --> 00:14:52.820
And you want to put them into some kind of Excel spreadsheet.

00:14:52.820 --> 00:14:54.040
So you create one of these writers.

00:14:54.040 --> 00:14:56.260
This is the function you're talking about.

00:14:56.260 --> 00:14:59.940
And then you go to the data frame, you say to_excel and you give it the writer and a sheet

00:14:59.940 --> 00:15:03.820
name and you give it, you can do that for each data frame and give it different sheet

00:15:03.820 --> 00:15:05.520
names and it just piles up along the bottom.

00:15:05.520 --> 00:15:05.740
Right.

00:15:05.740 --> 00:15:06.440
It's really neat.

00:15:06.440 --> 00:15:08.080
It's ridiculously simple, right?

00:15:08.080 --> 00:15:12.440
It's like given the data frames, it's three lines of code to create an Excel file and

00:15:12.440 --> 00:15:12.800
write it.
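
NOTE Editor's sketch of the ExcelWriter pattern described above, not the article's own code. It assumes pandas and the openpyxl engine are installed; the data frames, sheet names, and the write_sheets helper are hypothetical, and an in-memory buffer stands in for a real file path.
```python
import io
import pandas as pd
def write_sheets(frames, target):
    """Write each data frame in `frames` (sheet name -> DataFrame) to its own sheet."""
    # ExcelWriter needs an Excel engine such as openpyxl to be installed.
    with pd.ExcelWriter(target, engine="openpyxl") as writer:
        for sheet_name, frame in frames.items():
            frame.to_excel(writer, sheet_name=sheet_name, index=False)
# Hypothetical data frames, invented for this sketch.
frames = {
    "Fruit": pd.DataFrame({"name": ["apple", "pear"], "count": [3, 5]}),
    "Cities": pd.DataFrame({"city": ["Oslo", "Lima"]}),
}
buffer = io.BytesIO()  # an in-memory stand-in for a path like "report.xlsx"
write_sheets(frames, buffer)
buffer.seek(0)
sheets = pd.read_excel(buffer, sheet_name=None)  # dict of sheet name -> DataFrame
```
Passing a file name instead of the buffer should produce one workbook with one tab per data frame.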

00:15:12.800 --> 00:15:13.020
Yeah.

00:15:13.020 --> 00:15:13.360
Yeah.

00:15:13.360 --> 00:15:18.220
If you didn't know this, you'd have to create two separate Excel files and just add them together

00:15:18.220 --> 00:15:20.960
later manually, which is not programmatic.

00:15:20.960 --> 00:15:21.360
Right.

00:15:21.360 --> 00:15:24.700
Or maybe you say you don't know that you can write to Excel.

00:15:24.700 --> 00:15:26.640
I mean, I'm pretty sure I could write to CSV.

00:15:26.640 --> 00:15:27.320
Ah, yeah.

00:15:27.320 --> 00:15:28.700
And there's multiple levels, right?

00:15:28.700 --> 00:15:32.860
Like one level is like, I'm going to write it line by line, putting the commas in there

00:15:32.860 --> 00:15:33.260
myself.

00:15:33.260 --> 00:15:35.820
Another one could be the write CSV, right?

00:15:35.820 --> 00:15:36.940
Read CSV, write CSV.

00:15:36.940 --> 00:15:39.380
But this one is like more structured, right?

00:15:39.380 --> 00:15:44.660
And then you could possibly use some of the more advanced tooling to do things like stylize

00:15:44.660 --> 00:15:47.160
or highlight aspects of it or whatever, right?

00:15:47.160 --> 00:15:49.140
Like openpyxl or something like that.

00:15:49.140 --> 00:15:52.520
Now for this one, you talk about, all right.

00:15:52.520 --> 00:15:57.080
It says that you need to have the right supporting libraries there, right?

00:15:57.080 --> 00:15:59.920
You, for example, have to have different libraries.

00:16:00.120 --> 00:16:01.360
I can't remember which one it was.

00:16:01.360 --> 00:16:02.240
I think it was Py.

00:16:02.240 --> 00:16:03.400
openpyxl.

00:16:03.400 --> 00:16:04.540
openpyxl.

00:16:04.540 --> 00:16:05.540
Yeah, that's it right here.

00:16:05.540 --> 00:16:06.360
I knew it was in here.

00:16:06.360 --> 00:16:06.800
Yeah.

00:16:06.800 --> 00:16:07.680
openpyxl.

00:16:07.680 --> 00:16:10.160
If you want to work with XLSX files.

00:16:10.160 --> 00:16:12.360
And there's other ones as well, right?

00:16:12.360 --> 00:16:13.980
Otherwise, you'll get an error.

00:16:13.980 --> 00:16:14.440
Right.

00:16:14.500 --> 00:16:20.480
So basically, Pandas delegates to this library, which actually understands Excel and writes

00:16:20.480 --> 00:16:20.800
to it.

00:16:20.800 --> 00:16:21.020
Yeah.

00:16:21.020 --> 00:16:25.160
There's another one where it talks about using fsspec.

00:16:25.160 --> 00:16:29.180
And this caught my attention as like, oh, wow, this is way more flexible.

00:16:29.180 --> 00:16:33.320
Because I'm not sure people are aware of what fsspec is.

00:16:33.320 --> 00:16:34.960
Are you familiar with fsspec?

00:16:34.960 --> 00:16:35.680
No, no.

00:16:35.680 --> 00:16:43.300
So fsspec is this library that allows you to treat different destinations as Python file systems.

00:16:43.300 --> 00:16:45.820
Like, you know, with open some file name.

00:16:45.820 --> 00:16:48.040
Instead of file name, you can do all sorts of stuff.

00:16:48.040 --> 00:16:54.420
So let me see if I can find some of the documentation here of the things that it can go to.

00:16:54.840 --> 00:16:54.920
Yeah.

00:16:54.920 --> 00:17:01.680
Integrates with a bunch of different places, but it goes to places like S3 storage and

00:17:01.680 --> 00:17:07.980
FTP and database and zip files and all of these types of crazy things.

00:17:07.980 --> 00:17:10.060
And it even does caching, I guess, right?

00:17:10.060 --> 00:17:15.160
So this Excel writer, while it already sounds really interesting because it writes to Excel,

00:17:15.160 --> 00:17:20.020
like destination of these Excel files, like this could be an Excel file in a database or

00:17:20.020 --> 00:17:23.680
something with basically hardly any changes to the code.

00:17:23.680 --> 00:17:24.320
Yes.

00:17:24.460 --> 00:17:25.220
Yeah, that's super cool.

00:17:25.220 --> 00:17:27.000
So good one to kick it off there.

00:17:27.000 --> 00:17:28.060
A lot going on.
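
NOTE Editor's sketch of the fsspec idea discussed above: the same open-style code can point at different storage backends. It assumes the fsspec package is installed and uses its built-in in-memory file system; the path is made up for illustration.
```python
import fsspec
# "memory://" is fsspec's built-in in-memory file system; the path is made up.
url = "memory://demo/hello.txt"
# Write through fsspec exactly as with a regular open() call.
with fsspec.open(url, "w") as handle:
    handle.write("hello from fsspec")
# Read the same "file" back.
with fsspec.open(url, "r") as handle:
    text = handle.read()
```
Other URL schemes, such as S3, typically need an extra backend package and credentials, so treat those as assumptions to check against the fsspec documentation.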

00:17:28.060 --> 00:17:35.260
This portion of Talk Python to Me is brought to you by Shortcut, formerly known as clubhouse.io.

00:17:35.260 --> 00:17:37.120
Happy with your project management tool?

00:17:37.120 --> 00:17:41.420
Most tools are either too simple for a growing engineering team to manage everything,

00:17:41.420 --> 00:17:45.520
or way too complex for anyone to want to use them without constant prodding.

00:17:45.520 --> 00:17:48.420
Shortcut is different though, because it's worse.

00:17:48.420 --> 00:17:49.880
No, wait, no, I mean, it's better.

00:17:49.880 --> 00:17:53.640
Shortcut is project management built specifically for software teams.

00:17:53.820 --> 00:17:58.420
It's fast, intuitive, flexible, powerful, and many other nice positive adjectives.

00:17:58.420 --> 00:18:04.020
Key features include team-based workflows.

00:18:04.020 --> 00:18:09.620
Or customize them to match the way they work.

00:18:10.220 --> 00:18:19.220
Tight version control integration.

00:18:19.220 --> 00:18:26.220
Whether you use GitHub, GitLab, or Bitbucket, Clubhouse ties directly into them, so you can update progress from the command line.

00:18:26.820 --> 00:18:28.820
Keyboard friendly interface.

00:18:28.820 --> 00:18:34.820
The rest of Shortcut is just as friendly as their power bar, allowing you to do virtually anything without touching your mouse.

00:18:34.820 --> 00:18:35.820
Throw that thing in the trash.

00:18:35.820 --> 00:18:36.420
Iteration planning.

00:18:38.420 --> 00:18:44.420
Set weekly priorities and let Shortcut run the schedule for you with accompanying burndown charts and other reporting.

00:18:44.420 --> 00:18:49.020
Give it a try over at talkpython.fm/shortcut.

00:18:49.020 --> 00:18:53.020
Again, that's talkpython.fm/shortcut.

00:18:53.020 --> 00:18:57.020
Choose shortcut because you shouldn't have to project manage your project management.

00:18:57.020 --> 00:19:01.620
The next one is pipe, right?

00:19:01.620 --> 00:19:02.620
Yeah.

00:19:02.620 --> 00:19:03.620
The image also.

00:19:03.620 --> 00:19:04.620
Yeah.

00:19:04.620 --> 00:19:07.620
There's like a lumberjack looking dude smoking a pipe there.

00:19:07.620 --> 00:19:08.620
That's very cool.

00:19:08.620 --> 00:19:08.620
Yeah.

00:19:08.620 --> 00:19:08.920
Yes.

00:19:08.920 --> 00:19:09.620
Tell us about pipe.

00:19:09.620 --> 00:19:16.120
When you do data analysis, like most of the time, the data you'll be dealing with will be like not clean.

00:19:16.120 --> 00:19:18.620
You have to perform some operations.

00:19:18.620 --> 00:19:28.720
And pipe really offers a way to package all those operations into a single line of code or into a single block of code.

00:19:28.720 --> 00:19:37.720
It's kind of like scikit-learn pipelines, where you only have to run a single line of code and it performs several operations at the same time.

00:19:37.720 --> 00:19:40.320
It's really just a neat way to do data cleaning.

00:19:40.320 --> 00:19:40.720
Right.

00:19:40.720 --> 00:19:43.320
And it's what's called a fluent API.

00:19:43.320 --> 00:19:52.420
So if I call data frame dot pipe, what comes back is another data frame and then I could call dot pipe on it again and then dot pipe and dot pipe and chain those together.

00:19:52.420 --> 00:19:52.820
Yes.

00:19:52.820 --> 00:19:55.720
Applying different operations and transformations.

00:19:55.720 --> 00:20:00.120
It's almost like a map reduce or aggregation framework type of thing here.

00:20:00.120 --> 00:20:00.320
Right.

00:20:00.320 --> 00:20:01.420
It's pretty flexible.

00:20:01.420 --> 00:20:08.920
I would think, in its entirety, it's one of the amazing features of pandas: consistency, always.

00:20:08.920 --> 00:20:10.020
Yeah, I really like it.

00:20:10.020 --> 00:20:11.320
It looks super neat.

00:20:11.320 --> 00:20:17.920
So you need to do transformations on a data frame with custom functions and get answers out.

00:20:17.920 --> 00:20:18.120
Yeah.

00:20:18.120 --> 00:20:26.320
Another thing that you pointed out here is that as part of this, you could apply it to the whole data frame or you could pass a set of columns.

00:20:26.320 --> 00:20:26.720
Yes.

00:20:26.720 --> 00:20:27.420
As part of it.

00:20:27.420 --> 00:20:30.020
So as what you're piping across, what does that do?

00:20:30.020 --> 00:20:32.320
That reduces the result to just those.

00:20:32.320 --> 00:20:34.520
If you pass in three things, just those three columns.

00:20:34.520 --> 00:20:41.120
And then these two functions, remove outliers and encode categoricals, are functions that accept arguments.

00:20:41.120 --> 00:20:45.520
And when you pass it to pipe, we just have to pass the function name.

00:20:45.520 --> 00:20:45.720
Got it.

00:20:45.720 --> 00:20:47.720
Which means you can pass the arguments.

00:20:47.720 --> 00:20:53.320
So to pass the arguments, actually, you just have to provide them after the comma.

00:20:53.320 --> 00:21:02.320
So this remove outliers function just accepts one argument as a list and it performs like outlier removal and just returns the whole data frame.

00:21:02.320 --> 00:21:03.320
I see.

00:21:03.320 --> 00:21:07.720
So you can pass like your function might take the data frame, but it might also take additional information.

00:21:07.720 --> 00:21:11.920
Like I want to exclude things that are over a hundred dollars and just throw them away.

00:21:11.920 --> 00:21:16.420
Well, you got to pass that hundred in because it needs to know a hundred versus some other cutoff value.

00:21:16.420 --> 00:21:16.720
Right.

00:21:16.720 --> 00:21:17.220
Got it.

00:21:17.220 --> 00:21:17.320
Yes.

00:21:17.320 --> 00:21:17.620
Yes.

00:21:17.620 --> 00:21:17.920
Okay.

00:21:17.920 --> 00:21:18.920
Cool.

00:21:18.920 --> 00:21:21.920
And you say it resembles scikit-learn pipeline.

00:21:21.920 --> 00:21:23.320
Yeah, that's pretty cool.
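
NOTE
A minimal sketch of the chaining described above, not the article's actual code.
The helper bodies, the cutoff of 100, and the toy data are assumptions made up for
illustration; DataFrame.pipe, drop_duplicates, and pd.factorize are the pandas calls.
```python
import pandas as pd
# Hypothetical helper: drop rows where any listed column exceeds the cutoff.
def remove_outliers(df, columns, upper):
    keep = (df[columns] <= upper).all(axis=1)
    return df[keep]
# Hypothetical helper: replace each listed column with integer codes.
def encode_categoricals(df, columns):
    out = df.copy()
    for col in columns:
        out[col] = pd.factorize(out[col])[0]
    return out
orders = pd.DataFrame({
    "price": [20, 250, 80, 20],
    "cut": ["Ideal", "Good", "Premium", "Ideal"],
})
# pipe passes the data frame as the first argument; extra arguments follow the function name.
cleaned = (
    orders.drop_duplicates()
    .pipe(remove_outliers, ["price"], upper=100)
    .pipe(encode_categoricals, ["cut"])
)
print(cleaned)
```
Each step returns a data frame, which is what lets the calls chain.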

00:21:23.320 --> 00:21:24.320
All right.

00:21:24.320 --> 00:21:27.320
We're up to number three factorize.

00:21:27.320 --> 00:21:28.320
Yeah.

00:21:28.320 --> 00:21:29.320
Tell us about this one.

00:21:29.320 --> 00:21:41.320
In machine learning, algorithms only accept numerical data, and most real-world data sets contain categoricals, which means there are, like, a class one, class two, or class three.

00:21:41.320 --> 00:21:49.120
And you have to encode them to numeric, like zero, one, two, three, using something like OneHotEncoder or LabelEncoder in scikit-learn.

00:21:49.120 --> 00:21:51.820
But you can do that in pandas as well.

00:21:51.820 --> 00:21:58.820
You just have to pass the column to factorize and it just encodes them with numericals for each class.

00:21:58.820 --> 00:21:59.220
I see.

00:21:59.220 --> 00:22:03.320
So let me see if I can give an audio friendly example for listeners here.

00:22:03.320 --> 00:22:03.720
Yeah.

00:22:03.720 --> 00:22:16.320
If we've got, say, a data frame where one of the pieces is what the weather was like: sun, rain, snow, clouds, something like that.

00:22:16.320 --> 00:22:20.220
You can't feed sun to the machine learning model.

00:22:20.220 --> 00:22:21.320
You got to give it a number, right?

00:22:21.320 --> 00:22:21.820
Yes.

00:22:21.820 --> 00:22:26.920
So this will convert that to like zero for sun and everywhere sun appeared, you would now have a zero.

00:22:26.920 --> 00:22:30.120
One for rain, everywhere there was a rain and so on.

00:22:30.120 --> 00:22:36.520
So it just does that, figures out how many different categories there are and then gives them a number that can be sent off to machine learning, right?

00:22:36.520 --> 00:22:37.520
You explained that.

00:22:37.520 --> 00:22:39.020
Awesome.

00:22:39.020 --> 00:22:40.180
You see, I'm learning, right?

00:22:40.180 --> 00:22:42.520
I'm just following along with you here.

00:22:42.520 --> 00:22:42.820
Awesome.

00:22:42.820 --> 00:22:43.420
Okay.

00:22:43.420 --> 00:22:44.580
That's a really cool one.
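
NOTE
A small sketch of the weather example from the conversation, using pd.factorize.
The sample values are made up for illustration.
```python
import pandas as pd
weather = pd.Series(["sun", "rain", "sun", "snow", "rain"])
# factorize returns integer codes plus the unique values they stand for,
# numbered in order of first appearance.
codes, categories = pd.factorize(weather)
print(codes)             # [0 1 0 2 1]
print(list(categories))  # ['sun', 'rain', 'snow']
```
The second return value lets you map the numbers back to the original labels.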

00:22:44.580 --> 00:22:48.220
This next one seems a little bit crazy, but it looks very useful.

00:22:48.220 --> 00:22:49.660
Explode, right?

00:22:49.660 --> 00:22:50.420
What is explode?

00:22:50.420 --> 00:22:51.320
Survey data.

00:22:51.320 --> 00:22:55.360
Surveys usually contain questions that are multiple choice.

00:22:55.780 --> 00:23:01.980
You can pick more than one answer to one question, and that's recorded as one answer.

00:23:01.980 --> 00:23:07.020
So you're just going to end up with these kinds of lists in a single cell of the table.

00:23:07.020 --> 00:23:08.360
Like a question.

00:23:08.360 --> 00:23:18.520
Oh, if you have a question one and the user just picks the answers A, B, C, then A, B, C is going to end up as a list in a single cell of a table.

00:23:18.520 --> 00:23:18.880
Right.

00:23:18.880 --> 00:23:23.500
So for an example here, you have a series that has one and then six and then seven.

00:23:23.500 --> 00:23:26.860
And then the fourth element is a list of three other numbers.

00:23:26.860 --> 00:23:30.180
And you're like, wait a minute, those are not supposed to just be multidimensional.

00:23:30.180 --> 00:23:31.640
I want a straight series, right?

00:23:31.640 --> 00:23:32.640
You want a straight series.

00:23:32.640 --> 00:23:40.600
And when you call explode on this series, it's going to just expand the series vertically and just going to fill up.

00:23:40.600 --> 00:23:46.120
It just takes the elements of the single cell lists and just expands them vertically.

00:23:46.120 --> 00:23:47.040
Yeah.

00:23:47.040 --> 00:23:50.580
And these are the types of things that you were talking about with loops, right?

00:23:50.580 --> 00:23:54.100
It would be easy to go through and say, I'm going to build up a new data frame.

00:23:54.100 --> 00:23:59.660
And if I see a list instead of a number, I'm going to just start appending those from the list with an inner loop and then we'll carry on.

00:23:59.660 --> 00:23:59.880
Right.

00:23:59.880 --> 00:24:01.980
And here you've literally done it in one line.

00:24:01.980 --> 00:24:02.260
Yeah.

00:24:02.260 --> 00:24:02.440
Yeah.

00:24:02.440 --> 00:24:04.920
This would be crazy complex if you did it like manually.

00:24:04.920 --> 00:24:06.040
Right.

00:24:06.040 --> 00:24:07.740
And honestly, slower, right?

00:24:07.740 --> 00:24:12.460
Because a lot of this is probably implemented in C, whereas you would be doing it at the Python layer.

00:24:12.460 --> 00:24:13.500
It's going to be very slow.

00:24:13.500 --> 00:24:13.860
All right.

00:24:13.860 --> 00:24:15.540
Another question from Brandon out there.

00:24:15.660 --> 00:24:16.840
Glad he's here in the live stream.

00:24:16.840 --> 00:24:20.700
How would I apply Explode to the entire data frame?

00:24:20.700 --> 00:24:26.180
I'm guessing he's thinking about maybe if you had multiple columns and they each potentially had this.

00:24:26.180 --> 00:24:27.200
I don't think that's possible.

00:24:27.200 --> 00:24:27.600
Yeah.

00:24:27.600 --> 00:24:29.200
I don't think Pandas allows that.

00:24:29.200 --> 00:24:29.500
Yeah.

00:24:29.500 --> 00:24:29.700
Okay.

00:24:29.700 --> 00:24:32.700
So it's got to be on a series, not on a data frame.

00:24:32.700 --> 00:24:33.460
Right.

00:24:33.460 --> 00:24:33.860
Got it.

00:24:33.860 --> 00:24:34.120
Okay.

00:24:34.120 --> 00:24:34.740
Cool.
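
NOTE
A hedged sketch of explode. The Series mirrors the example described above (the three
numbers in the list are made up). The discussion treats explode as a Series method;
pandas also provides DataFrame.explode, which takes the name of the list-valued column,
and the survey frame below is an invented illustration of that.
```python
import pandas as pd
# Series case: one cell holds a list.
answers = pd.Series([1, 6, 7, [46, 56, 49]])
flat = answers.explode()
print(flat.tolist())  # [1, 6, 7, 46, 56, 49]
# DataFrame case: name the list-valued column; other columns repeat per element.
survey = pd.DataFrame({"respondent": ["r1", "r2"], "q1": [["A", "B", "C"], ["B"]]})
long_form = survey.explode("q1")
print(long_form)
```
The exploded rows keep their original index label, so 3 appears three times in flat.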

00:24:34.740 --> 00:24:37.800
So these are all fun names that stand out.

00:24:37.800 --> 00:24:38.600
The next one.

00:24:38.600 --> 00:24:39.420
You're a fun name.

00:24:39.420 --> 00:24:39.700
Yeah.

00:24:39.700 --> 00:24:40.180
Yeah.

00:24:40.180 --> 00:24:41.600
And you pick some cool pictures, right?

00:24:41.600 --> 00:24:42.040
Yeah.

00:24:42.040 --> 00:24:42.280
Yeah.

00:24:42.280 --> 00:24:42.580
All right.

00:24:42.660 --> 00:24:43.580
So what's the next one?

00:24:43.580 --> 00:24:44.020
Squeeze.

00:24:44.020 --> 00:24:44.600
Squeeze.

00:24:44.600 --> 00:24:44.920
Squeeze.

00:24:44.920 --> 00:24:53.160
As you can see, there are some conditional operations that return whole data frames, even if the result is a single cell.

00:24:53.360 --> 00:25:02.520
As you can see from the subset, we're just asking the diamonds data frame to return all diamonds that are priced below $1.

00:25:02.520 --> 00:25:08.080
And it just returns a single result, which is 326.

00:25:08.080 --> 00:25:13.400
But it's returned as a data frame, which is not comfortable to work with, like a single cell data frame.

00:25:13.520 --> 00:25:13.620
Right.

00:25:13.620 --> 00:25:18.440
Because pandas doesn't know ahead of time that a .loc call is going to result in a single item.

00:25:18.440 --> 00:25:20.220
This happens a lot in databases, too.

00:25:20.220 --> 00:25:23.700
You do a query, and the result is actually a single thing.

00:25:23.700 --> 00:25:29.340
But the framework has no way to know that the data is structured in a way that's unique or that's a one thing.

00:25:29.340 --> 00:25:32.760
And I suspect that's common here with data frames as well.

00:25:32.760 --> 00:25:33.620
You're structured.

00:25:33.620 --> 00:25:35.760
Like, I know this is going to give me the one answer.

00:25:35.760 --> 00:25:36.200
Yeah.

00:25:36.200 --> 00:25:37.900
But it just returns the whole table.

00:25:38.120 --> 00:25:42.660
Yeah, yeah, you're like, well, now I got to, like, dig in and give me the first row, first column.

00:25:42.660 --> 00:25:43.100
Yeah, okay.

00:25:43.100 --> 00:25:44.480
So squeeze helps fix this?

00:25:44.480 --> 00:25:53.040
You just call squeeze on a single-cell data frame or series, and it removes all the dimensionality and just returns the number.

00:25:53.040 --> 00:25:53.560
Interesting.

00:25:53.560 --> 00:25:53.980
That's cool.

00:25:53.980 --> 00:25:57.760
What happens if I call it on one that's got more than one item?

00:25:57.760 --> 00:25:58.120
Do you know?

00:25:58.120 --> 00:26:01.020
Does it just give you the first, or does it freak out and let you know?

00:26:01.020 --> 00:26:02.100
I never tried that.

00:26:02.100 --> 00:26:04.600
Yeah, I never tried that.

00:26:04.600 --> 00:26:06.040
Yeah, don't do that, right?

00:26:06.100 --> 00:26:11.520
It's like, maybe if you just actually want the first answer, maybe it's okay, but it also might give you an exception.

00:26:11.520 --> 00:26:12.040
I don't know.

00:26:12.040 --> 00:26:13.340
I'll be fine to try it now.

00:26:13.340 --> 00:26:14.000
Yeah, exactly.
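
NOTE
A sketch of squeeze on a toy diamonds frame (values and filter are assumptions, not the
article's data). It also tries the open question above: with more than one element,
squeeze raises no exception and drops nothing.
```python
import pandas as pd
diamonds = pd.DataFrame({"carat": [0.23, 0.21, 0.29], "price": [326, 327, 334]})
# A .loc selection that happens to match one row and one column is still a 1x1 DataFrame.
subset = diamonds.loc[diamonds["price"] < 327, ["price"]]
print(subset.shape)      # (1, 1)
print(subset.squeeze())  # 326
# More than one element: only axes of length one are removed.
one_column = diamonds[["price"]].squeeze()  # single column becomes a Series
unchanged = diamonds.squeeze()              # 3x2 DataFrame comes back as-is
```
So squeeze is safe to call, but it only yields a scalar when the result is truly 1x1.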

00:26:14.000 --> 00:26:14.740
Cool.

00:26:14.740 --> 00:26:18.660
So the next one has to do with finding things in a range, right?

00:26:18.660 --> 00:26:20.020
Yeah, between, yeah.

00:26:20.020 --> 00:26:27.820
Yeah, it's like, just as the name suggests, you want to take all the rows that are in between some range.

00:26:27.820 --> 00:26:35.480
For example, here in the code example, I'm choosing all diamonds that are priced between $3,500 and $3,700.

00:26:35.480 --> 00:26:36.460
Nice.

00:26:36.460 --> 00:26:39.680
So, of course, you could do this probably as an expression.

00:26:39.680 --> 00:26:42.340
You could definitely do this as a loop.

00:26:42.340 --> 00:26:47.160
But both of those are slower, I'm sure, because they're not implemented internally, right?

00:26:47.160 --> 00:26:48.200
Yeah, less elegant.

00:26:48.200 --> 00:26:51.700
This one is better and faster and shorter.

00:26:51.700 --> 00:27:00.700
Yeah, and one of the things, the third parameter you can pass here to between, in addition to, like, the lower bound and upper bound, is whether or not it includes the endpoints, right?

00:27:00.700 --> 00:27:03.540
In this one, inclusive is neither.

00:27:03.540 --> 00:27:05.940
So it's like open set.

00:27:05.940 --> 00:27:06.420
Nice.
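
NOTE
A sketch of Series.between with the price range mentioned above, on made-up prices.
```python
import pandas as pd
diamonds = pd.DataFrame({"price": [3400, 3500, 3600, 3700, 3800]})
# inclusive="neither" excludes both endpoints (string values need pandas 1.3 or later).
open_range = diamonds[diamonds["price"].between(3500, 3700, inclusive="neither")]
# The default, inclusive="both", keeps the endpoints.
closed_range = diamonds[diamonds["price"].between(3500, 3700)]
print(open_range["price"].tolist())    # [3600]
print(closed_range["price"].tolist())  # [3500, 3600, 3700]
```
between returns a boolean Series, so it is used here as a row filter.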

00:27:06.620 --> 00:27:10.400
Another thing that I've seen here, which is not one of your 25, but looks nice.

00:27:10.400 --> 00:27:17.100
I'm used to visualizing, quickly visualizing a data frame when I get it back with head or tail.

00:27:17.100 --> 00:27:19.680
And I want to know, like, okay, kind of what did I get back here?

00:27:19.680 --> 00:27:20.580
Show me the front.

00:27:20.580 --> 00:27:21.120
That'll be good.

00:27:21.120 --> 00:27:21.640
Do a head.

00:27:21.640 --> 00:27:23.740
Or let's go to the end and see what happened at the end.

00:27:23.740 --> 00:27:25.000
But here you have .sample.

00:27:25.000 --> 00:27:25.740
That's interesting.

00:27:25.740 --> 00:27:31.220
I use it often because some data sets have, like, ordering, for example, time series data sets.

00:27:31.220 --> 00:27:36.780
And the first few rows might be not too representative of the whole data frame.

00:27:36.780 --> 00:27:40.820
So I just call sample with, like, five or ten rows.

00:27:40.820 --> 00:27:43.280
And that randomly samples the data set.

00:27:43.280 --> 00:27:49.180
And sometimes that represents the data set better than head or tail.

00:27:49.180 --> 00:27:50.100
Right, exactly.

00:27:50.100 --> 00:27:55.400
And so it just kind of randomly picks some stuff throughout the data set to show you what's going on, right?

00:27:55.540 --> 00:27:57.240
For large data sets, that's really handy.

00:27:57.240 --> 00:27:58.160
Nice to know.
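
NOTE
A sketch of sample on an ordered, time-series style frame. The data is invented;
random_state is optional and is only set here so the draw is repeatable.
```python
import pandas as pd
readings = pd.DataFrame({
    "day": pd.date_range("2021-01-01", periods=100, freq="D"),
    "value": range(100),
})
# head() would only show the first days of January; sample() draws rows at random.
peek = readings.sample(5, random_state=0)
print(peek)
```
The sampled rows keep their original index labels, which shows where they came from.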

00:27:58.160 --> 00:27:58.420
Yeah.

00:27:58.420 --> 00:28:07.420
So the next one has to do with, I'm guessing, like, when you're doing matrix multiplication and vectors and, like, truly doing math.

00:28:07.420 --> 00:28:09.840
Most of the time I would expect this to show up.

00:28:09.840 --> 00:28:10.020
Yeah.

00:28:10.020 --> 00:28:11.240
Most of the time, yes.

00:28:11.240 --> 00:28:11.680
Yeah.

00:28:11.680 --> 00:28:12.400
Transpose.

00:28:12.400 --> 00:28:12.680
Yeah.

00:28:12.680 --> 00:28:13.040
Yeah.

00:28:13.040 --> 00:28:14.020
It stands for transpose.

00:28:14.020 --> 00:28:19.900
I usually, you usually don't do math or matrix multiplication in Pandas.

00:28:19.900 --> 00:28:21.120
You will do it in NumPy.

00:28:21.260 --> 00:28:25.960
But this one, I use it mostly for when you, on the result of describe.

00:28:25.960 --> 00:28:28.400
You see here, describe returns.

00:28:28.400 --> 00:28:30.120
The axis inverted.

00:28:30.120 --> 00:28:33.640
So the five-number summary is given as rows.

00:28:33.820 --> 00:28:45.420
And that's really a problem when you have multiple columns because the data set starts to expand horizontally, which makes you scroll to the right, which you don't want.

00:28:45.420 --> 00:28:56.420
So when you do describe, you get things like given a data set, it'll say, here's the count of this index, the mean of this index, or this value of a column, standard deviation, and so on.

00:28:56.420 --> 00:28:58.880
And the number of options there is unbounded.

00:28:58.880 --> 00:29:03.580
But the fact that it goes count, mean, standard deviation, minimum, and then a few more things, that's fixed.

00:29:03.580 --> 00:29:04.520
And that fits pretty well.

00:29:04.620 --> 00:29:12.160
So you're saying if you transpose or flip the rows and columns so that you make it go vertical instead of across, that's an easier way to look at it.

00:29:12.160 --> 00:29:12.600
Yeah, yeah.

00:29:12.600 --> 00:29:13.200
I agree.

00:29:13.200 --> 00:29:15.140
And it's as easy as saying .T.

00:29:15.140 --> 00:29:17.060
So it's not too hard to do, right?

00:29:17.060 --> 00:29:17.640
You might as well.

00:29:17.640 --> 00:29:18.440
It's an attribute.

00:29:18.440 --> 00:29:19.180
Yeah.

00:29:19.180 --> 00:29:19.640
Cool.
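
NOTE
A sketch of transposing the output of describe with the .T attribute, on toy data.
```python
import pandas as pd
diamonds = pd.DataFrame({"carat": [0.23, 0.21, 0.29, 0.31], "price": [326, 327, 334, 335]})
summary = diamonds.describe()  # statistics as rows, one column per numeric column
flipped = summary.T            # one row per column, statistics across the top
print(flipped)
```
With many columns, the flipped table grows downward rather than sideways.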

00:29:19.640 --> 00:29:20.560
All right.

00:29:20.560 --> 00:29:21.480
That's a really good one.

00:29:22.020 --> 00:29:27.600
So you're saying if I'm going to do like some kind of matrix multiplication stuff, I should not do it in Pandas.

00:29:27.600 --> 00:29:28.700
I should just stick to NumPy.

00:29:28.700 --> 00:29:29.020
Yeah.

00:29:29.020 --> 00:29:31.780
NumPy is like purely for mathematical purposes.

00:29:31.780 --> 00:29:34.120
And it's much faster than Pandas.

00:29:34.120 --> 00:29:36.640
I suspect that NumPy has a good transpose as well.

00:29:36.640 --> 00:29:37.980
But yeah, there.

00:29:37.980 --> 00:29:39.340
It has the same attribute.

00:29:39.340 --> 00:29:39.720
Yeah.

00:29:39.720 --> 00:29:41.520
There's a lot of synergy between those two libraries.

00:29:41.520 --> 00:29:45.940
So the next one has to do with styling things and how they look, right?

00:29:45.940 --> 00:29:50.640
One of the things that's cool about Pandas is it mixes well with Jupyter Notebooks.

00:29:50.980 --> 00:29:53.720
And Jupyter Notebooks have a nice sort of explore the data.

00:29:53.720 --> 00:29:54.640
And let's see what's going on.

00:29:54.640 --> 00:29:55.660
Let me just look at it, right?

00:29:55.660 --> 00:29:59.800
So this styler thing, the style attribute helps you with that, right?

00:29:59.800 --> 00:30:00.360
Yeah.

00:30:00.360 --> 00:30:02.960
Here, like it takes advantage of that.

00:30:02.960 --> 00:30:06.860
The fact that Jupyter uses HTML and CSS under the hood.

00:30:06.860 --> 00:30:17.380
So you can take advantage of that and use some HTML and CSS knowledge to style your data frame based on some Pythonic loops or conditionals.

00:30:17.840 --> 00:30:28.440
Here, for example, after you take the transpose of the describe, you can just highlight the maximums of each row or column using the highlight max function.

00:30:28.440 --> 00:30:29.000
Yeah.

00:30:29.000 --> 00:30:32.600
Pandas offers a lot of functions after the style attribute.

00:30:32.860 --> 00:30:39.840
You can use the built-in functions or you can come up with some custom logic to style your data frame using HTML and CSS.

00:30:39.840 --> 00:30:40.340
Okay.

00:30:40.340 --> 00:30:41.760
Yeah, this is great.

00:30:41.760 --> 00:30:46.660
So you can say, for example, here dot style dot highlight max.

00:30:46.860 --> 00:30:50.960
And then you give it some CSS values like colors, dark red or something like that, right?

00:30:50.960 --> 00:30:53.320
You just don't have to look at the row numbers.

00:30:53.320 --> 00:30:57.460
It just shows you the most important metrics or the ones that you want.

00:30:57.460 --> 00:31:00.480
It's really useful when you have like multiple columns.

00:31:00.480 --> 00:31:01.900
You just don't want to have to.

00:31:01.900 --> 00:31:06.420
You just don't want to look at all those crazy numbers and you just use some.

00:31:06.420 --> 00:31:11.980
Yeah, like a real reasonable or maybe straightforward thing you might start out by doing.

00:31:11.980 --> 00:31:13.160
So, well, let me just sort it.

00:31:13.160 --> 00:31:15.080
We'll sort it so the highest one's at the top.

00:31:15.080 --> 00:31:24.880
But in this example, you've got multiple columns and the max of one column is in one value, but it's a different row for a different attribute of it, right?

00:31:24.880 --> 00:31:33.420
So sorting it is going to do nothing except for like if you come up with a whole bunch of variations and try to look at it and a little bit of color, a little bit of picture goes a long ways.

00:31:33.420 --> 00:31:34.060
Yeah, yeah.

00:31:34.060 --> 00:31:34.720
Visual.

00:31:34.720 --> 00:31:35.540
Yeah, absolutely.

00:31:35.540 --> 00:31:40.760
This portion of Talk Python to Me is sponsored by Linode.

00:31:40.760 --> 00:31:44.700
Cut your cloud bills in half with Linode's Linux virtual machines.

00:31:44.700 --> 00:31:48.820
Develop, deploy and scale your modern applications faster and easier.

00:31:49.140 --> 00:31:56.380
Whether you're developing a personal project or managing larger workloads, you deserve simple, affordable and accessible cloud computing solutions.

00:31:56.380 --> 00:32:01.620
Get started on Linode today with $100 in free credit for listeners of Talk Python.

00:32:01.620 --> 00:32:05.980
You can find all the details over at talkpython.fm/Linode.

00:32:05.980 --> 00:32:12.560
Linode has data centers around the world with the same simple and consistent pricing, regardless of location.

00:32:12.560 --> 00:32:15.040
Choose the data center that's nearest to you.

00:32:15.900 --> 00:32:23.000
You also receive 24/7/365 human support with no tiers or handoffs, regardless of your plan size.

00:32:23.000 --> 00:32:25.880
Imagine that real human support for everyone.

00:32:25.880 --> 00:32:36.080
You can choose shared or dedicated compute instances, or you can use your $100 in credit on S3 compatible object storage, managed Kubernetes clusters and more.

00:32:36.080 --> 00:32:38.660
If it runs on Linux, it runs on Linode.

00:32:38.660 --> 00:32:43.280
Visit talkpython.fm and click the create free account button to get started.

00:32:43.280 --> 00:32:46.160
You can also find the link right in your podcast player show notes.

00:32:46.160 --> 00:32:49.100
Thank you to Linode for supporting Talk Python.

00:32:49.100 --> 00:32:55.040
Yeah, the second example you have here in your article is a little more nuanced.

00:32:55.040 --> 00:32:55.840
This looks great.

00:32:55.840 --> 00:32:56.720
Tell us about that.

00:32:56.720 --> 00:32:58.520
This one is like background gradient.

00:32:58.520 --> 00:33:03.880
So it just colors each cell of the column based on its magnitude.

00:33:04.140 --> 00:33:08.760
It's kind of like a continuous palette.

00:33:08.760 --> 00:33:15.020
It just shows where the maximum or the minimums are and just how they compare to each other.

00:33:15.020 --> 00:33:20.880
Yeah, it's almost like if you could do a heat map in an Excel table, you know, by making the cells different colors.

00:33:20.880 --> 00:33:24.480
You can pass in a color map and all sorts of stuff to control how that looks.

00:33:24.480 --> 00:33:24.740
Yeah.

00:33:24.860 --> 00:33:25.200
Yeah, cool.

00:33:25.200 --> 00:33:25.600
I like it.

00:33:25.600 --> 00:33:26.500
This is great.

00:33:26.500 --> 00:33:34.940
You know, it's one of these things where, again, one line of code and you can dramatically improve the presentation value or the informational value of what you're looking at.

00:33:34.940 --> 00:33:35.100
Right.

00:33:35.100 --> 00:33:35.700
Nice.

00:33:35.700 --> 00:33:36.540
All right.

00:33:36.540 --> 00:33:39.260
I feel like that's similar to your number nine.

00:33:39.260 --> 00:33:39.700
Yeah.

00:33:39.700 --> 00:33:41.120
This one is Pandas options.

00:33:41.380 --> 00:33:43.140
It's kind of like the settings of your phone.

00:33:43.140 --> 00:33:53.720
You just set them globally and it applies to all the data frames, the series and all the functions that you are going to be using inside the project or inside the session of Jupyter Notebook.

00:33:53.720 --> 00:34:01.180
So if you want to have some sort of number of columns that are shown or some kind of color or something like that, you can just set that up at the beginning.

00:34:01.180 --> 00:34:01.840
Yeah.

00:34:01.840 --> 00:34:05.580
You just don't have to call them every single time or change them every single time.

00:34:05.580 --> 00:34:11.140
It's just a shorthand way of doing things like setting global settings.

00:34:11.140 --> 00:34:11.500
Yeah.

00:34:11.500 --> 00:34:17.520
You could probably even do something like have a little JSON file that describes the look and feel of what you're doing.

00:34:17.520 --> 00:34:21.740
Just your first line, just load it up and set it and then, you know, go from there.

00:34:21.740 --> 00:34:22.800
Something to that effect, right?

00:34:22.800 --> 00:34:23.080
Yeah.

00:34:23.080 --> 00:34:23.380
Yeah.

00:34:23.380 --> 00:34:28.160
So you don't have to completely fill the first few lines of your notebook with like setup code.

00:34:28.160 --> 00:34:28.540
Yeah.

00:34:28.540 --> 00:34:32.460
For example, one of those examples is like a display max rows.

00:34:32.460 --> 00:34:38.180
If you set it to five and you just call the data frame, it's going to only show the first five rows.

00:34:38.180 --> 00:34:40.980
So you don't have to call .head every time.

00:34:40.980 --> 00:34:41.680
Oh, that's interesting.

00:34:41.680 --> 00:34:42.040
Yeah.

00:34:42.040 --> 00:34:45.720
Because of course, if there's enough rows, it won't print the whole thing out, right?

00:34:45.720 --> 00:34:46.040
Probably.

00:34:46.040 --> 00:34:46.580
Yeah.

00:34:46.580 --> 00:34:49.660
You don't want to print 10 million rows and completely lock up the system.

00:34:49.660 --> 00:34:50.060
Yeah.

00:34:50.060 --> 00:34:50.440
Yeah.

00:34:50.440 --> 00:34:50.620
Yeah.

00:34:50.620 --> 00:34:50.720
Yeah.

00:34:50.720 --> 00:34:51.520
That's going to.

00:34:51.520 --> 00:34:52.220
Cool.

00:34:52.220 --> 00:34:55.400
Oh, and another one that's kind of nice is display precision.

00:34:55.740 --> 00:35:01.980
And if you set that, you won't see the, you know, 1.27e to the five or whatever, right?

00:35:01.980 --> 00:35:02.320
You can.

00:35:02.320 --> 00:35:06.860
It's really annoying when you're working with like math functions.

00:35:06.860 --> 00:35:15.020
It just keeps giving you scientific notation when you just want to see the first four or five decimal places.

00:35:15.200 --> 00:35:15.400
Yeah.

00:35:15.400 --> 00:35:20.660
Scientific notation is great when you're dealing with huge numbers or tremendously small numbers, right?

00:35:20.660 --> 00:35:23.100
Like how many meters across is an atom?

00:35:23.100 --> 00:35:23.440
Okay.

00:35:23.440 --> 00:35:25.220
So you're going to need an E to something.

00:35:25.220 --> 00:35:33.420
But for human beings often, you know, you want to just look at the number and go, yeah, that's a million, not like, you know, 1.2e to the six or seven, whatever.

00:35:33.420 --> 00:35:34.680
It's going to be really annoying.

00:35:34.680 --> 00:35:35.300
That's cool.

00:35:35.300 --> 00:35:39.540
And this is just one of those options you can set up and it just globally applies to that notebook.
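
NOTE
A sketch of the two display options mentioned above, set with pd.set_option. The
values 5 and 3 are arbitrary choices for illustration.
```python
import pandas as pd
# Set once, typically at the top of a notebook; applies to every later display.
pd.set_option("display.max_rows", 5)   # long frames are truncated when displayed
pd.set_option("display.precision", 3)  # floats are shown with 3 decimal places
print(pd.get_option("display.max_rows"))   # 5
print(pd.get_option("display.precision"))  # 3
# reset_option restores the library default for a single option.
pd.reset_option("display.max_rows")
```
These options change only how data is displayed, not the stored values.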

00:35:39.540 --> 00:35:46.880
So another thing that's interesting about pandas is the columns have types usually, but not always.

00:35:46.880 --> 00:35:55.940
It's one of those like beginning level things that you will encounter, but it can get really annoying if the data types are incorrect for your column.

00:35:55.940 --> 00:35:58.720
The most important one is the object data type.

00:35:58.720 --> 00:35:58.980
Right.

00:35:58.980 --> 00:36:00.600
That's like, I don't really know.

00:36:00.600 --> 00:36:02.260
So we're just going to store it.

00:36:02.260 --> 00:36:02.500
Yeah.

00:36:02.500 --> 00:36:02.760
Yeah.

00:36:03.120 --> 00:36:10.140
I'm just going to put it inside of an object and objects are like object data type is the worst one.

00:36:10.140 --> 00:36:14.740
It also limits the functionality of pandas and it's also the most memory consuming.

00:36:14.740 --> 00:36:15.140
Right.

00:36:15.140 --> 00:36:18.180
So the next function, what number are we on here?

00:36:18.180 --> 00:36:18.780
Number 10.

00:36:18.780 --> 00:36:19.420
10, yes.

00:36:19.420 --> 00:36:25.340
And on the hit list is convert underscore dtypes, as in convert data types.

00:36:25.340 --> 00:36:31.960
When you call it on the whole data frame, it just, it tries to infer the correct data type for each column.

00:36:32.380 --> 00:36:36.100
If it's a float or integer or string like that.

00:36:36.100 --> 00:36:43.740
So your example, you're reading a CSV file and some of the columns are detected correctly like floats, but others get this object.

00:36:43.740 --> 00:36:45.760
But after calling convert_dtypes, it's like, you know what?

00:36:45.760 --> 00:36:46.580
No, those are strings.

00:36:46.580 --> 00:36:54.060
But it can't handle the date times because there are so many date time formats and pandas can't possibly know all of them.
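
NOTE
A sketch of convert_dtypes on object columns, followed by an explicit date conversion.
The frame, the column names, and the date format are assumptions for illustration.
```python
import pandas as pd
# Object columns, as can happen after a messy CSV read.
raw = pd.DataFrame({
    "cut": pd.Series(["Ideal", "Premium"], dtype=object),
    "price": pd.Series([326, 327], dtype=object),
    "sold": pd.Series(["2021-11-01", "2021-11-04"], dtype=object),
})
converted = raw.convert_dtypes()
print(converted.dtypes)  # cut and sold become string, price becomes an integer dtype
# Date strings are not parsed automatically; convert them explicitly.
converted["sold"] = pd.to_datetime(converted["sold"], format="%Y-%m-%d")
```
Passing the format explicitly avoids guessing among the many date layouts.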

00:36:54.060 --> 00:36:55.480
Why are date times so hard?

00:36:55.480 --> 00:36:58.060
They really shouldn't be, but they really are.

00:36:58.060 --> 00:36:58.780
It's crazy.

00:36:58.780 --> 00:37:01.320
And then you throw in time zones and you'll forget it.

00:37:01.320 --> 00:37:01.800
Okay.

00:37:01.800 --> 00:37:04.840
And throw in daylight savings and all these other things.

00:37:04.840 --> 00:37:05.520
Oh, yeah.

00:37:05.520 --> 00:37:06.460
That's crazy.

00:37:06.460 --> 00:37:07.220
Yeah.

00:37:07.220 --> 00:37:08.340
Daylight saving is crazy.

00:37:08.340 --> 00:37:08.680
Yeah.

00:37:08.680 --> 00:37:10.240
I suspect some of the Kaggle stuff.

00:37:10.240 --> 00:37:15.500
Part of the challenge is like normalize these dates because who knows or something along those lines.

00:37:15.500 --> 00:37:17.660
Time zones are like total mess.

00:37:18.000 --> 00:37:18.960
Yeah, for sure.

00:37:18.960 --> 00:37:24.340
So related to converting the data types is to select them.

00:37:24.340 --> 00:37:24.860
Yeah.

00:37:24.860 --> 00:37:28.160
Which is a way to filter what's in there.

00:37:28.160 --> 00:37:31.680
Like you can filter by column or rows or even a condition.

00:37:31.680 --> 00:37:36.260
But this is saying like, I only want the strings or only want the numbers, right?

00:37:36.260 --> 00:37:43.820
While doing machine learning, you have to apply certain pre-processing functions to only a subset of the data.

00:37:43.980 --> 00:37:46.800
Like only on categoricals or only numerics.

00:37:46.800 --> 00:37:49.860
So this function will become very handy.

00:37:49.860 --> 00:37:54.360
You just pass the data type using NumPy.

00:37:54.360 --> 00:37:58.940
And it just gives you the subset of the data frame with that data type.

00:37:58.940 --> 00:37:59.260
Nice.

00:37:59.260 --> 00:38:04.560
So you would say like DataFrame.select_dtypes and then include=np.number.

00:38:04.560 --> 00:38:09.540
And now instantly the resulting data frame is a subset that only has numbers, right?

00:38:09.540 --> 00:38:09.920
Yes.

00:38:10.000 --> 00:38:10.320
That's cool.

00:38:10.320 --> 00:38:12.580
And then also you point out that you can do the reverse.

00:38:12.580 --> 00:38:20.840
Just like give you just the other, like just the informational bits, like categories and stuff or rating by saying exclude.
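
NOTE
A small sketch of the include and exclude calls described above. The frame and its column names are made up for illustration.
```python
import numpy as np
import pandas as pd
# Hypothetical frame mixing numeric and text columns.
df = pd.DataFrame({"carat": [0.23, 0.31], "price": [326, 335], "cut": ["Ideal", "Good"]})
numeric = df.select_dtypes(include=np.number)      # only the numeric columns
non_numeric = df.select_dtypes(exclude=np.number)  # everything else
print(list(numeric.columns), list(non_numeric.columns))
```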

00:38:20.840 --> 00:38:21.200
Yeah.

00:38:21.200 --> 00:38:21.840
Very nice.

00:38:21.840 --> 00:38:22.200
Okay.

00:38:22.200 --> 00:38:24.880
Well, we just missed it with Halloween here.

00:38:24.880 --> 00:38:25.940
Yeah.

00:38:25.940 --> 00:38:26.200
Yeah.

00:38:26.200 --> 00:38:26.680
Mask.

00:38:26.680 --> 00:38:27.720
But mask.

00:38:27.720 --> 00:38:28.380
Yeah.

00:38:28.380 --> 00:38:28.780
Cool.

00:38:28.780 --> 00:38:30.240
Like a mask here.

00:38:30.240 --> 00:38:33.900
But mask is number 12.

00:38:33.900 --> 00:38:34.800
That's about it.

00:38:34.940 --> 00:38:44.680
It's a conditional on, you can use it on, on series or data frames and it just returns the subset of the data where some condition is true.

00:38:44.680 --> 00:38:45.080
Yeah.

00:38:45.080 --> 00:38:45.700
Okay.

00:38:45.700 --> 00:38:46.740
So, yeah.

00:38:46.740 --> 00:38:49.060
And this example here, you've got a bunch of ages.

00:38:49.060 --> 00:38:51.340
And I want to subset them using between.

00:38:51.340 --> 00:38:59.440
I want to take all those rows that are beyond 60 or below 50 and convert those values to NaN.

00:38:59.440 --> 00:38:59.720
Okay.

00:38:59.720 --> 00:39:05.780
So, this is like an in-place update or I guess it replaces, creates another one that is like as if you updated it.

00:39:06.040 --> 00:39:12.600
And it finds all the stuff that's, I guess, outside of your range and then applies this other value, right?

00:39:12.600 --> 00:39:19.840
Like if it's stuff that's outside of this range, in this case, you're going to set it to not a number, but it could be set to zero or max or anything.
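
NOTE
A sketch of the ages example as described, with made-up values. Strictly speaking, mask keeps the original shape and replaces values where the condition is True, returning a new object rather than updating in place. The between call for building the condition is an assumption about how the range is expressed.
```python
import numpy as np
import pandas as pd
ages = pd.Series([15, 52, 58, 61, 75], name="age")
# True where the age falls outside the inclusive 50 to 60 range.
outside = ~ages.between(50, 60)
# mask returns a new Series; values where the condition is True become NaN.
masked = ages.mask(outside, other=np.nan)
print(masked)
```
Passing other=0 or any other replacement value works the same way.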

00:39:19.840 --> 00:39:20.460
Uh-huh.

00:39:20.460 --> 00:39:20.740
Yeah.

00:39:20.740 --> 00:39:21.020
Cool.

00:39:21.020 --> 00:39:22.180
A very good one.

00:39:22.180 --> 00:39:24.920
Similar, I guess, is min and max.

00:39:25.020 --> 00:39:28.840
And then some of these, as we get a little farther down your recommendations, I like them.

00:39:28.840 --> 00:39:36.120
They're not just, oh, here, you can apply this function, but apply it in this scenario or this context to get an interesting outcome, right?

00:39:36.120 --> 00:39:37.880
So, that's what number 13 is like.

00:39:37.880 --> 00:39:40.580
Min and max along columns axis.

00:39:40.580 --> 00:39:47.480
Usually, when you call min or max on a column, it just returns the minimum or maximum of that column.

00:39:47.920 --> 00:39:56.440
But sometimes you want it row-wise, like it just treats rows as columns and it gives min and max across the rows.

00:39:56.440 --> 00:39:58.380
That's usually useful.

00:39:58.380 --> 00:40:02.540
A handy way of doing something that would take a lot of code if done manually.

00:40:02.540 --> 00:40:06.580
Another one of these tricks or techniques that lets you avoid looping, right?

00:40:06.580 --> 00:40:11.260
Here I show a good example of like comparing four different libraries on five datasets.

00:40:11.260 --> 00:40:14.420
You want the best performance on each dataset.

00:40:14.900 --> 00:40:18.140
So, you have to find the best score across the rows.

00:40:18.140 --> 00:40:18.580
Exactly.

00:40:18.580 --> 00:40:25.620
So, the columns are the different libraries like XGBoost, CatBoost, Scikit-Learn, and so on, being applied to the same dataset.

00:40:25.620 --> 00:40:28.540
And you want to just go for row one, what one did the best?

00:40:28.540 --> 00:40:29.940
Row two, what one did the best?
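
NOTE
A sketch of the row-wise comparison described above. The scores are invented and only three libraries are shown; idxmax(axis=1) is added to answer the "which one did the best" question.
```python
import pandas as pd
# Hypothetical scores: one row per dataset, one column per library.
scores = pd.DataFrame(
    {"xgboost": [0.91, 0.78, 0.85], "catboost": [0.93, 0.75, 0.84], "sklearn": [0.89, 0.80, 0.86]},
    index=["dataset_1", "dataset_2", "dataset_3"],
)
best_score = scores.max(axis=1)        # best score in each row
best_library = scores.idxmax(axis=1)   # column label that achieved it
print(best_score, best_library, sep="\n")
```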

00:40:29.940 --> 00:40:30.180
Yeah.

00:40:30.180 --> 00:40:30.780
Yeah.

00:40:30.780 --> 00:40:31.280
Very nice.

00:40:31.280 --> 00:40:32.680
It takes a lot of code if done manually.

00:40:32.680 --> 00:40:33.020
Yeah.

00:40:33.020 --> 00:40:33.380
Cool.

00:40:33.380 --> 00:40:37.080
Number 14, nlargest and nsmallest.

00:40:37.080 --> 00:40:38.020
Yeah.

00:40:38.020 --> 00:40:41.020
We're talking about those max or minimums.

00:40:41.800 --> 00:40:53.560
So, nlargest, when you pass a number and a column name, it just returns the data frame that contains the smallest or largest N rows of that column.

00:40:53.560 --> 00:40:53.860
Nice.

00:40:53.980 --> 00:40:59.380
So, if I were to call min or max, that would give me the smallest or the largest one, respectively, right?

00:40:59.380 --> 00:40:59.780
Yes.

00:40:59.780 --> 00:41:07.020
But a really interesting or common question you might have is like, what are the top 10 selling products this month, right?

00:41:07.020 --> 00:41:07.240
Yeah.

00:41:07.360 --> 00:41:12.440
And this lets you just say nlargest 10, and then you pick the column on which to judge it.

00:41:12.440 --> 00:41:13.560
Here you have price, right?

00:41:13.560 --> 00:41:16.460
Five most expensive diamonds in the diamonds data set.

00:41:16.460 --> 00:41:16.720
Yeah.

00:41:16.720 --> 00:41:20.260
Again, one of these things that, you know, no more looping or any of that stuff.

00:41:20.260 --> 00:41:21.160
No more if statements.

00:41:21.160 --> 00:41:22.740
Just call it, right?

00:41:22.740 --> 00:41:25.320
This one is like the five cheapest, most cheapest diamonds.

00:41:25.320 --> 00:41:25.740
Yeah.

00:41:25.740 --> 00:41:27.220
And so, nsmallest and nlargest.
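
NOTE
A sketch of nlargest and nsmallest on a tiny stand-in for the diamonds data; the rows are invented.
```python
import pandas as pd
# Hypothetical slice of a diamonds-like table.
diamonds = pd.DataFrame({
    "cut": ["Ideal", "Good", "Premium", "Fair", "Ideal", "Good"],
    "price": [326, 18823, 2757, 337, 18818, 554],
})
top2 = diamonds.nlargest(2, "price")        # the two most expensive rows
cheapest2 = diamonds.nsmallest(2, "price")  # the two cheapest rows
print(top2, cheapest2, sep="\n")
```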

00:41:27.220 --> 00:41:27.660
Fantastic.

00:41:28.200 --> 00:41:34.880
Also, sometimes when you're asking for a minimum or maximum thing, you don't actually want the minimum or maximum.

00:41:34.880 --> 00:41:42.260
You want to know where that is because you're going to get that thing back and say, I need that whole row because I want to learn more information about it, right?

00:41:42.260 --> 00:41:44.140
But if you said, well, what's the minimum price?

00:41:44.140 --> 00:41:44.980
It's seven.

00:41:44.980 --> 00:41:46.000
Like, oh, okay, great.

00:41:46.000 --> 00:41:50.320
Now do I need to like loop through until I find that thing that has seven or something like this?

00:41:50.320 --> 00:41:52.020
So, you've got a recommendation for that.

00:41:52.020 --> 00:41:52.220
Yeah.

00:41:52.220 --> 00:41:54.280
The idxmax and idxmin.

00:41:54.280 --> 00:41:57.920
This returns the index values of the minimum or maximum.

00:41:58.120 --> 00:42:03.040
So that you can look at the row that they are stored at or the column.

00:42:03.040 --> 00:42:03.640
Fantastic.

00:42:03.640 --> 00:42:04.360
Yeah.

00:42:04.360 --> 00:42:07.600
So, here's the row that contains the minimum price.
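
NOTE
A sketch of using idxmin to fetch the whole row that holds the minimum price. The data and the string index labels are hypothetical.
```python
import pandas as pd
diamonds = pd.DataFrame(
    {"cut": ["Premium", "Ideal", "Good"], "price": [2757, 326, 18823]},
    index=["a", "b", "c"],
)
cheapest_label = diamonds["price"].idxmin()   # index label of the minimum price
cheapest_row = diamonds.loc[cheapest_label]   # the full row at that label
print(cheapest_label)
print(cheapest_row)
```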

00:42:07.600 --> 00:42:08.200
I love it.

00:42:08.200 --> 00:42:08.340
Yeah.

00:42:08.340 --> 00:42:08.880
Really nice.

00:42:08.880 --> 00:42:11.720
So, so many of these are really easy to apply, right?

00:42:11.720 --> 00:42:22.260
Like, it's not a lot of research to learn how to apply idmax, or idxmax, but at the same time, knowing that it exists, now all of a sudden you can use it really easily.

00:42:22.260 --> 00:42:24.940
But you probably wouldn't have known to look for it, right?

00:42:24.940 --> 00:42:25.280
Yeah.

00:42:25.280 --> 00:42:25.660
Yeah.

00:42:25.660 --> 00:42:26.140
Cool.

00:42:26.540 --> 00:42:32.040
People often talk about differences between beginner developers and expert developers.

00:42:32.040 --> 00:42:36.700
And I think a lot of times beginners look at folks like you who have a lot of experience.

00:42:36.700 --> 00:42:40.840
They're like, oh, this guy is so incredibly smart and he just has this way of solving these problems.

00:42:40.840 --> 00:42:41.640
It's so amazing.

00:42:41.640 --> 00:42:44.280
And, you know, to some degree, that's probably true.

00:42:44.720 --> 00:42:51.660
But a lot of it is like just building up layers and layers of these like, oh, I know I can use idmax, idxmax.

00:42:51.660 --> 00:42:53.440
I know that I can use nlargest.

00:42:53.440 --> 00:42:54.820
And you just sort of pile them together.

00:42:54.820 --> 00:42:59.380
And then like, bam, like the solution becomes easier because you have these little building blocks.

00:42:59.380 --> 00:42:59.620
Right.

00:42:59.640 --> 00:43:03.240
So it's, I think it's really valuable for people getting into Pandas.

00:43:03.240 --> 00:43:12.720
I usually think that the biggest difference between a beginner level and a more experienced programmer is just, is like just how much time they spend on the documentation.

00:43:12.720 --> 00:43:13.460
Yeah.

00:43:13.460 --> 00:43:14.080
Yeah.

00:43:14.080 --> 00:43:14.180
Yeah.

00:43:14.180 --> 00:43:21.180
If you read the docs, like if you patiently read the docs, you're just going to become a really good user of that particular tool or library.

00:43:21.300 --> 00:43:21.720
I agree.

00:43:21.720 --> 00:43:24.620
There's just more, you understand it better.

00:43:24.620 --> 00:43:26.920
You know more of what it has to offer.

00:43:26.920 --> 00:43:29.240
So it's like, it's less you've got to reinvent.

00:43:29.240 --> 00:43:29.640
Yeah.

00:43:29.640 --> 00:43:29.960
All right.

00:43:29.960 --> 00:43:34.960
I talked about how you have something that may be well known, but then applying it in a scenario.

00:43:34.960 --> 00:43:39.180
And this number 16 is value_counts with dropna=False.

00:43:39.180 --> 00:43:40.060
What's this one about?

00:43:40.060 --> 00:43:48.420
When you have a series with like categoricals, you just want to see the proportions or their numbers as a whole in the total series.

00:43:48.860 --> 00:43:51.820
And that usually doesn't include the null values.

00:43:51.820 --> 00:43:59.620
So you have to call isnull and chain it with sum so that you learn the number of NaNs in that column.

00:43:59.620 --> 00:44:09.060
But you can do it efficiently with value_counts by setting dropna to False, which includes the proportions of the null values as well.

00:44:09.060 --> 00:44:09.300
Yeah.

00:44:09.300 --> 00:44:13.760
So it just gives you a, basically a percentage as a ratio here.

00:44:13.880 --> 00:44:18.920
It's just a ratio of the number of the different categories that have appeared here.

00:44:18.920 --> 00:44:19.280
Right.

00:44:19.280 --> 00:44:20.120
So very cool.

00:44:20.120 --> 00:44:21.620
And now just not a number is included.
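
NOTE
A sketch comparing the two approaches mentioned above on invented data. normalize=True is what turns the counts into ratios; that parameter is not named in the conversation and is added here for illustration.
```python
import numpy as np
import pandas as pd
cuts = pd.Series(["Ideal", "Good", np.nan, "Ideal", np.nan, "Ideal"])
missing = cuts.isnull().sum()                              # the two-step way to count NaNs
counts = cuts.value_counts(dropna=False)                   # counts, with NaN as its own row
shares = cuts.value_counts(dropna=False, normalize=True)   # same, as ratios of the total
print(missing)
print(counts)
print(shares)
```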

00:44:21.620 --> 00:44:22.080
That's great.

00:44:22.080 --> 00:44:22.440
Yeah.

00:44:22.440 --> 00:44:24.300
Number 17 clip.

00:44:24.300 --> 00:44:25.380
This is a good one.

00:44:25.380 --> 00:44:25.680
Yeah.

00:44:25.680 --> 00:44:25.840
Yeah.

00:44:25.840 --> 00:44:34.880
For data that exceeds, I don't know, maybe a range, maybe some instrument is supposed to collect zero to a hundred and it goes crazy and goes outside of a hundred.

00:44:34.880 --> 00:44:35.160
Yeah.

00:44:35.160 --> 00:44:43.940
For example, we go back to the ages example where I just want to have ages between like 18 or 60, 18 and 60.

00:44:43.940 --> 00:44:46.140
And I want to exclude all those values.

00:44:46.320 --> 00:44:52.500
And when you call clip with those custom values, it's just going to impose those hard limits on the whole series.

00:44:52.500 --> 00:44:52.860
Right.

00:44:52.860 --> 00:44:59.260
So it'll replace the ones that are over with the maximum that you said and the ones that are too low, it'll bring them up to the minimum.
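
NOTE
A sketch of clip with the 18 and 60 limits from the conversation; the ages are invented. Values are capped at the limits rather than removed.
```python
import pandas as pd
ages = pd.Series([5, 18, 42, 60, 97])
# Anything below 18 becomes 18, anything above 60 becomes 60.
clipped = ages.clip(lower=18, upper=60)
print(clipped.tolist())
```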

00:44:59.260 --> 00:44:59.580
Right.

00:44:59.580 --> 00:44:59.940
Yeah.

00:44:59.940 --> 00:45:00.460
Very cool.

00:45:00.460 --> 00:45:02.980
Again, against the whole data set, not looping.

00:45:02.980 --> 00:45:04.840
Only at one column at a time.

00:45:04.840 --> 00:45:05.140
Yeah.

00:45:05.360 --> 00:45:14.280
We talked about how difficult time is, but you do have some recommendations for searching for data that appears at a certain time or in a time range, right?

00:45:14.280 --> 00:45:15.140
What's number 18?

00:45:15.140 --> 00:45:23.560
This one is like a subsetting of rows of the data frame at some particular time of the day, like any time of the day.

00:45:23.560 --> 00:45:29.560
But you like, for example, three o'clock, 9:30, 10:30, or any time that you want.

00:45:29.560 --> 00:45:33.540
You're just going to take all those rows and return them using at_time.

00:45:33.780 --> 00:45:35.540
Yeah, that's super easy, right?

00:45:35.540 --> 00:45:39.580
Just call at_time and you literally specify times, right?

00:45:39.580 --> 00:45:42.780
Like 15 colon zero, zero as a string.

00:45:42.780 --> 00:45:46.420
Like a real conversation or messaging.

00:45:46.420 --> 00:45:49.660
And then the other one, which is also interesting, is between time, right?

00:45:49.660 --> 00:45:52.560
Like what happened in the morning, for example?

00:45:52.560 --> 00:46:00.380
Like what are those sales that happened in the morning or after midnight or during some particular interval?

00:46:00.380 --> 00:46:02.840
This one is really handy to do that.

00:46:03.020 --> 00:46:03.200
Yeah.

00:46:03.200 --> 00:46:04.020
So super easy.

00:46:04.020 --> 00:46:05.820
Just DataFrame.between_time.

00:46:05.820 --> 00:46:06.860
Or is that a series?

00:46:06.860 --> 00:46:08.560
No, it doesn't matter.

00:46:08.560 --> 00:46:08.920
Okay.

00:46:08.920 --> 00:46:09.520
It doesn't matter.

00:46:09.520 --> 00:46:13.020
It usually has to be, it has to have a DatetimeIndex.

00:46:13.020 --> 00:46:13.460
That's it.

00:46:13.460 --> 00:46:13.760
Yeah.

00:46:13.760 --> 00:46:13.980
Okay.

00:46:14.060 --> 00:46:19.280
So then you just pass in strings like 9 colon 45 to 12 colon zero zero.
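
NOTE
A sketch of at_time and between_time using the time strings quoted above. The sales frame is hypothetical; both methods need a DatetimeIndex, as mentioned.
```python
import pandas as pd
# Hypothetical sales records indexed by timestamp.
times = pd.to_datetime([
    "2021-11-01 09:30", "2021-11-01 11:15", "2021-11-01 15:00",
    "2021-11-02 10:00", "2021-11-02 15:00",
])
sales = pd.DataFrame({"amount": [10, 20, 30, 40, 50]}, index=times)
at_three = sales.at_time("15:00")                   # rows at exactly 15:00 on any day
late_morning = sales.between_time("9:45", "12:00")  # rows between 09:45 and 12:00 on any day
print(at_three, late_morning, sep="\n")
```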

00:46:19.280 --> 00:46:21.160
And you know, that's like late morning or something.

00:46:21.160 --> 00:46:21.560
Beautiful.

00:46:21.560 --> 00:46:24.520
The next one here has to do with time series.

00:46:24.520 --> 00:46:26.960
Number 19, bdate_range.

00:46:26.960 --> 00:46:27.660
Tell us about this.

00:46:27.660 --> 00:46:32.320
Well, this one is like, stands for business date range, business date range.

00:46:32.320 --> 00:46:37.580
So like pandas has a lot built in internally for calendars.

00:46:37.580 --> 00:46:43.860
Like it just, when you want to, how can I say, when you want to index the data frame, you want time series data frame.

00:46:43.920 --> 00:46:47.120
You want to include only like working days.

00:46:47.120 --> 00:46:49.980
Like you want to exclude all the weekends.

00:46:49.980 --> 00:46:50.240
Yeah.

00:46:50.340 --> 00:46:57.080
You can't do that by hand for every single week of the year, because you can't possibly know which days are weekends.

00:46:57.080 --> 00:47:05.580
So when you call bdate_range, it just indexes the data frame using only weekdays.

00:47:05.580 --> 00:47:09.000
And also it excludes the holidays, I think.
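
NOTE
A sketch of bdate_range on a span that crosses one weekend. By default it skips Saturdays and Sundays only; holidays are skipped when you pass freq="C" together with an explicit holidays list. The holiday date below is made up for illustration.
```python
import pandas as pd
# 2021-11-01 is a Monday; the span below crosses one weekend.
weekdays = pd.bdate_range(start="2021-11-01", end="2021-11-09")
# Custom business-day frequency with a hypothetical holiday on 2021-11-04.
no_holiday = pd.bdate_range(start="2021-11-01", end="2021-11-09", freq="C", holidays=["2021-11-04"])
print(weekdays)
print(no_holiday)
```
The resulting DatetimeIndex can be passed as the index of a time series data frame.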

00:47:09.000 --> 00:47:09.800
Oh my gosh.

00:47:09.800 --> 00:47:11.360
I was just wondering about holidays.

00:47:11.360 --> 00:47:12.820
Like there's another wrinkle in there.

00:47:12.820 --> 00:47:17.960
Already things like leap year and stuff like that is built into this, I would imagine.

00:47:17.960 --> 00:47:18.940
So this is super cool.

00:47:18.940 --> 00:47:19.400
Yeah.

00:47:19.420 --> 00:47:31.440
This is very important for when you are doing time series forecasting or analysis, or working with stocks, because stocks are only traded on weekdays and not on holidays.

00:47:31.440 --> 00:47:32.960
So it will be very important.

00:47:32.960 --> 00:47:39.500
Or even if you do in like traffic analysis, you want to understand accidents that are a result of rush hour, right?

00:47:39.500 --> 00:47:40.860
You wouldn't want to look on a weekend.

00:47:40.860 --> 00:47:41.260
Yeah.

00:47:41.260 --> 00:47:41.620
All right.

00:47:41.620 --> 00:47:43.540
The next one has to do with correlation.

00:47:43.540 --> 00:47:46.060
autocorr, C-O-R-R.

00:47:46.060 --> 00:47:46.780
Yeah.

00:47:46.780 --> 00:47:47.800
Autocorrelation.

00:47:47.800 --> 00:47:48.280
Yeah.

00:47:48.400 --> 00:47:49.720
I don't do much with time series.

00:47:49.720 --> 00:47:51.100
You're going to have to tell us about this one.

00:47:51.100 --> 00:47:51.820
What's going on here?

00:47:51.820 --> 00:48:00.420
The autocorrelation of a series or time series tells you the predictability of the time series with itself.

00:48:00.420 --> 00:48:02.720
It's, do you know about correlation coefficient?

00:48:03.060 --> 00:48:03.660
Yeah, exactly.

00:48:03.660 --> 00:48:07.580
It tells you how much the model matches the actual data.

00:48:07.580 --> 00:48:12.120
Like it's 97% likely that the model will predict the stuff coming up, right?

00:48:12.120 --> 00:48:14.720
Could be linear or more complicated, but that's something like that.

00:48:14.720 --> 00:48:14.860
Yeah.

00:48:14.860 --> 00:48:15.180
Yeah.

00:48:15.180 --> 00:48:23.020
The gist of this is that if a time series has a high autocorrelation with itself, it means that you can predict it more easily.

00:48:23.020 --> 00:48:23.460
Got it.

00:48:23.460 --> 00:48:23.660
Yeah.

00:48:23.740 --> 00:48:27.700
It's basically how predictable or unpredictable is this thing.

00:48:27.700 --> 00:48:27.880
Yeah.

00:48:27.880 --> 00:48:34.040
There's a lot of details about autocorrelation and it has very many applications in time series.

00:48:34.040 --> 00:48:40.320
But the gist is that like it shows you how much predictability it has like at each interval.
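
NOTE
A sketch of Series.autocorr at two lags. The series is artificial: it repeats every four steps, so it correlates perfectly with itself shifted by four and much less at lag one.
```python
import pandas as pd
# Hypothetical series with a period of 4.
series = pd.Series([1.0, 2.0, 3.0, 4.0] * 6)
lag1 = series.autocorr(lag=1)   # each value against the one right before it
lag4 = series.autocorr(lag=4)   # each value against the one four steps earlier
print(lag1, lag4)
```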

00:48:40.320 --> 00:48:40.640
Cool.

00:48:40.640 --> 00:48:42.660
It sounds very useful if you're doing that kind of stuff.

00:48:42.660 --> 00:48:43.320
All right.

00:48:43.320 --> 00:48:45.860
Number 21, hasnans.

00:48:45.940 --> 00:48:47.140
It's also an attribute.

00:48:47.140 --> 00:48:51.700
You just call it on a series and it returns true or false.

00:48:51.700 --> 00:48:56.260
If you have, it returns true if you have at least one missing value in a series.
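
NOTE
A sketch of the hasnans attribute on two made-up series. hasnans is defined on Series; for a whole data frame, chaining isna().any() is one option, shown here as an assumption about how you might extend the check.
```python
import numpy as np
import pandas as pd
complete = pd.Series([1.0, 2.0, 3.0])
gappy = pd.Series([1.0, np.nan, 3.0])
print(complete.hasnans, gappy.hasnans)
# Per-column check for a DataFrame built from the two series.
df = pd.DataFrame({"a": complete, "b": gappy})
per_column = df.isna().any()
print(per_column)
```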

00:48:56.260 --> 00:48:56.640
Yeah.

00:48:56.640 --> 00:48:59.640
So there was this quote, I don't remember who it's attributed to.

00:48:59.640 --> 00:49:00.160
Sorry.

00:49:00.160 --> 00:49:06.800
That says something to the effect of like data cleanup and data wrangling is not the dirty work.

00:49:06.800 --> 00:49:09.560
It is the work of data science, like to get everything ready.

00:49:09.560 --> 00:49:11.800
And then you just like hit it with the magic at the end.

00:49:11.800 --> 00:49:12.040
Right.

00:49:12.080 --> 00:49:16.820
And this feels like that lands right in that realm is like given some data frame or series,

00:49:16.820 --> 00:49:19.000
does it have not-a-numbers or is it all good?

00:49:19.000 --> 00:49:19.420
Yeah.

00:49:19.420 --> 00:49:22.480
Missing values is like a huge problem in machine learning.

00:49:22.480 --> 00:49:26.780
Most scikit-learn algorithms don't accept missing values.

00:49:26.780 --> 00:49:31.760
So you either have to drop them or impute them using some techniques.

00:49:31.760 --> 00:49:35.720
And this one is very handy to detect those missing values.

00:49:35.720 --> 00:49:36.040
Right.

00:49:36.040 --> 00:49:37.660
I suspect this is the first test.

00:49:37.660 --> 00:49:41.320
Like if it has not-a-numbers, then we're going to go do stuff.

00:49:41.440 --> 00:49:43.220
But if it says false, then you're good to go.

00:49:43.220 --> 00:49:43.680
Just roll.

00:49:43.680 --> 00:49:44.440
Yeah.

00:49:44.440 --> 00:49:44.660
Yeah.

00:49:44.660 --> 00:49:45.160
Go with that.

00:49:45.160 --> 00:49:46.740
But it usually returns True.

00:49:46.740 --> 00:49:48.080
Unfortunately.

00:49:48.080 --> 00:49:51.320
Are you familiar with missingno?

00:49:51.320 --> 00:49:51.860
Let me.

00:49:51.860 --> 00:49:52.500
Yeah.

00:49:52.500 --> 00:49:52.740
Yeah.

00:49:52.740 --> 00:49:57.320
This is another thing that I would sort of came to mind is like this whole thing, this

00:49:57.320 --> 00:50:00.700
missingno package as in like no numbers.

00:50:00.700 --> 00:50:03.700
So a way to not just answer yes or no, but to get visualizations.

00:50:03.700 --> 00:50:04.940
Have you used this?

00:50:04.940 --> 00:50:05.320
Yeah.

00:50:05.320 --> 00:50:05.520
Yeah.

00:50:05.520 --> 00:50:07.500
I also wrote an article on it, I think.

00:50:07.500 --> 00:50:07.840
Okay.

00:50:07.840 --> 00:50:08.380
Well, yeah.

00:50:08.380 --> 00:50:08.960
So definitely.

00:50:08.960 --> 00:50:09.560
That's awesome.

00:50:09.560 --> 00:50:09.840
Yeah.

00:50:09.940 --> 00:50:11.880
Things like this sound really useful to me.

00:50:11.880 --> 00:50:12.480
They seem like.

00:50:12.480 --> 00:50:14.760
I really like that missingness matrix.

00:50:14.760 --> 00:50:20.320
It just shows the reasons why missing values are correlated to how missing values are correlated

00:50:20.320 --> 00:50:21.280
with other columns.

00:50:21.280 --> 00:50:21.680
Right.

00:50:21.680 --> 00:50:23.860
Is it a whole bunch of missing data in one row?

00:50:23.860 --> 00:50:24.220
Yeah.

00:50:24.220 --> 00:50:25.220
And then it's all good?

00:50:25.220 --> 00:50:26.440
Or is it interspersed?

00:50:26.440 --> 00:50:29.800
Like this one's missing the birthday, but that one's missing the name or something like

00:50:29.800 --> 00:50:29.940
that.

00:50:29.940 --> 00:50:30.100
Right.

00:50:30.100 --> 00:50:30.440
Yeah.

00:50:30.440 --> 00:50:31.560
It's a really good package.

00:50:31.560 --> 00:50:31.840
Yeah.

00:50:31.840 --> 00:50:32.260
Fantastic.

00:50:32.260 --> 00:50:32.940
All right.

00:50:32.940 --> 00:50:34.140
And number 22.

00:50:34.140 --> 00:50:35.500
at and iat.

00:50:35.500 --> 00:50:39.080
These are like faster versions of loc and iloc.

00:50:39.080 --> 00:50:43.360
It just enables you to index your data frame.

00:50:43.920 --> 00:50:48.640
But this one is specifically designed for retrieving single value conditionals.

00:50:48.640 --> 00:50:48.940
Nice.

00:50:48.940 --> 00:50:51.520
It's almost like an array index.

00:50:51.520 --> 00:50:51.940
Yeah.

00:50:51.940 --> 00:50:52.500
A little bit.

00:50:52.500 --> 00:50:54.780
What's the difference between at and iat?

00:50:54.780 --> 00:50:57.000
Using at, you can use like column labels.

00:50:57.000 --> 00:51:00.280
Like as you can see here, we are using cut and an index.

00:51:00.740 --> 00:51:03.820
But with iat, you have to know the index of that column.

00:51:03.820 --> 00:51:04.220
I see.

00:51:04.220 --> 00:51:09.480
So with at, it would be like row and then column name, where iat is row and column number.
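
NOTE
A sketch of at versus iat on a made-up diamonds-like frame. Both read a single value; at takes labels and iat takes integer positions.
```python
import pandas as pd
diamonds = pd.DataFrame({"carat": [0.23, 0.21], "cut": ["Ideal", "Premium"], "price": [326, 327]})
by_label = diamonds.at[1, "cut"]   # row label 1, column name "cut"
by_position = diamonds.iat[1, 1]   # row position 1, column position 1
print(by_label, by_position)
```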

00:51:09.480 --> 00:51:10.820
It's probably less flexible.

00:51:10.820 --> 00:51:15.180
You got to know that cut is four because it could be moved around as people are creating

00:51:15.180 --> 00:51:16.080
or inserting data.

00:51:16.080 --> 00:51:16.260
Yeah.

00:51:16.260 --> 00:51:16.760
Okay.

00:51:16.760 --> 00:51:19.080
Argsort.

00:51:19.080 --> 00:51:23.580
This one just returns the indices that would sort a data frame.

00:51:23.580 --> 00:51:23.900
Okay.

00:51:23.900 --> 00:51:24.880
Based on some column.

00:51:25.280 --> 00:51:31.060
So during data analysis, you sometimes want the indices, not the actual sorted data

00:51:31.060 --> 00:51:34.600
so that you can use those indices multiple times over.

00:51:34.600 --> 00:51:35.100
Got it.

00:51:35.100 --> 00:51:35.960
So you get the sorted.

00:51:35.960 --> 00:51:38.580
Say, I want to sort by the total bill.

00:51:38.580 --> 00:51:38.900
Yeah.

00:51:38.900 --> 00:51:43.100
But then give me the indexes as if it was sorted, but don't actually change it.

00:51:43.100 --> 00:51:46.380
So then you could go and then request data off those indexes.
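
NOTE
A sketch of argsort on a total_bill column with invented values. The positions it returns can be reused with iloc without changing the original frame.
```python
import pandas as pd
tips = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01, 23.68], "tip": [1.01, 1.66, 3.50, 3.31]})
order = tips["total_bill"].argsort()      # integer positions that would sort the column
sorted_view = tips.iloc[order.to_numpy()] # reuse the positions; tips itself is unchanged
print(order.tolist())
print(sorted_view)
```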

00:51:46.380 --> 00:51:46.660
Got it.

00:51:46.660 --> 00:51:46.860
Yeah.

00:51:46.860 --> 00:51:47.020
Yeah.

00:51:47.020 --> 00:51:47.380
Nice.

00:51:47.380 --> 00:51:47.860
All right.

00:51:47.860 --> 00:51:51.860
We're closing in on the end and we've brought in the cat, the cat accessor.

00:51:51.860 --> 00:51:52.580
Cat accessor.

00:51:52.580 --> 00:51:52.800
Yeah.

00:51:52.800 --> 00:51:54.180
I should have put an image here.

00:51:54.180 --> 00:51:54.800
Yeah.

00:51:54.920 --> 00:51:57.100
There would have been some kind of cool cat you can put in there.

00:51:57.100 --> 00:51:58.180
Yeah.

00:51:58.180 --> 00:52:03.440
As like pandas enables you to perform some like data type specific functions.

00:52:03.440 --> 00:52:08.720
Like there is the dt accessor for datetimes and also str for strings.

00:52:08.720 --> 00:52:11.660
And this one is strictly for categorical purposes.

00:52:11.660 --> 00:52:19.820
It has like a large suite of categorical functions that makes it easier to work on categories, ordinals

00:52:19.820 --> 00:52:20.760
or nominal data.
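
NOTE
A sketch of two things the cat accessor offers, on a made-up sizes column: declaring an explicit order for ordinal data and reading the integer codes behind the labels. These are only two of the available methods.
```python
import pandas as pd
sizes = pd.Series(["small", "large", "medium", "small"], dtype="category")
# Declare an explicit order so comparisons and sorting follow it.
sizes = sizes.cat.set_categories(["small", "medium", "large"], ordered=True)
print(sizes.cat.categories.tolist())
print(sizes.cat.codes.tolist())   # position of each value in the category list
print(sizes.max())                # uses the declared order, not alphabetical order
```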

00:52:20.760 --> 00:52:21.120
Yeah.

00:52:21.120 --> 00:52:21.620
Fantastic.

00:52:22.120 --> 00:52:27.500
And let's bring it to the 25th with nth, GroupBy.nth.

00:52:27.500 --> 00:52:27.740
Yeah.

00:52:27.740 --> 00:52:32.540
This one is less useful, or used in very rare edge cases.

00:52:32.540 --> 00:52:38.840
When you group by some column, possibly a categorical column, we want to look at those rows or groups,

00:52:38.840 --> 00:52:39.380
right?

00:52:39.380 --> 00:52:47.360
Calling nth on a grouped data frame just returns the nth row of each group.
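
NOTE
A sketch of nth on a grouped frame with invented rows. The position is zero-based, so nth(1) picks the second row of each group. How the result is indexed differs between pandas versions, so the example only looks at the selected values.
```python
import pandas as pd
diamonds = pd.DataFrame({
    "cut": ["Ideal", "Good", "Ideal", "Good", "Ideal"],
    "price": [326, 327, 334, 335, 336],
})
# Zero-based: the second row seen for each cut.
second_per_cut = diamonds.groupby("cut").nth(1)
print(second_per_cut)
```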

00:52:47.640 --> 00:52:47.820
Got it.

00:52:47.820 --> 00:52:48.440
Okay.

00:52:48.440 --> 00:52:49.060
Yeah.

00:52:49.060 --> 00:52:49.860
That looks really cool.

00:52:49.860 --> 00:52:50.380
Yeah.

00:52:50.380 --> 00:52:50.780
All right.

00:52:50.780 --> 00:52:52.700
Well, that's it for our list.

00:52:52.700 --> 00:52:56.240
Hopefully people out there listening have definitely learned something.

00:52:56.240 --> 00:53:00.100
Now, your title was just to put a little disclaimer in here for everyone.

00:53:00.100 --> 00:53:02.500
It's 25 Pandas functions you didn't know existed.

00:53:02.500 --> 00:53:04.880
Pipe P guarantee equals 0.8.

00:53:04.960 --> 00:53:06.600
So you had this 80%.

00:53:06.600 --> 00:53:06.880
Yeah.

00:53:06.880 --> 00:53:07.440
I'm guaranteed.

00:53:07.440 --> 00:53:08.360
I love it.

00:53:08.360 --> 00:53:10.840
That's a little bit of a stats joke in the title.

00:53:10.840 --> 00:53:12.260
No one complained about that.

00:53:12.260 --> 00:53:13.760
So I think that was right.

00:53:13.760 --> 00:53:14.140
Yeah.

00:53:14.140 --> 00:53:15.420
It sounds about right.

00:53:15.420 --> 00:53:18.840
It seems like there's a lot of neat use cases here that people can find.

00:53:18.840 --> 00:53:20.960
These are your 25 that you found interesting.

00:53:20.960 --> 00:53:22.500
Other people might find them as well.

00:53:22.880 --> 00:53:23.700
There are so many.

00:53:23.700 --> 00:53:24.860
Oh, so many.

00:53:24.860 --> 00:53:25.080
Yeah.

00:53:25.080 --> 00:53:29.200
These are the types of things, though, that people can say, all right, today I'm going to

00:53:29.200 --> 00:53:32.560
try to work with number one as I'm doing my data analysis and stuff.

00:53:32.560 --> 00:53:34.380
I just, I know I'm going to be doing some Excel stuff.

00:53:34.380 --> 00:53:36.380
So let's do the ExcelWriter one.

00:53:36.380 --> 00:53:40.660
And then, you know, maybe later it's like, oh, I know I'm doing survey type of data.

00:53:40.660 --> 00:53:45.080
So let me work with explode and just try to, you know, if you work these in one at a time,

00:53:45.080 --> 00:53:48.420
eventually they become part of your tool chest and they're good, right?

00:53:48.540 --> 00:53:48.720
Yeah.

00:53:48.720 --> 00:53:48.940
Yeah.

00:53:48.940 --> 00:53:51.140
And just expanding your tool set and skills.

00:53:51.140 --> 00:53:55.040
I think part of the trick is to make sure that you apply it a little bit, right?

00:53:55.040 --> 00:53:58.120
I mean, you know, they're out there, but just as you use them, like bring them in.

00:53:58.120 --> 00:54:00.260
It just saves you time and resources.

00:54:00.260 --> 00:54:00.720
Awesome.

00:54:00.720 --> 00:54:01.220
Yeah.

00:54:01.220 --> 00:54:03.420
Half the battle is just knowing that it exists, right?

00:54:03.420 --> 00:54:04.920
It's not that it's necessarily hard to use.

00:54:04.920 --> 00:54:07.100
It's like, I just didn't know this was even an option.

00:54:07.100 --> 00:54:08.040
Yeah.

00:54:08.040 --> 00:54:09.640
All of these are very easy to use.

00:54:09.640 --> 00:54:11.120
You just need to know that they exist.

00:54:11.120 --> 00:54:11.400
Yeah.

00:54:11.400 --> 00:54:15.460
I feel like so much of Pandas is that way, but they're so, it's hard to know because there's

00:54:15.460 --> 00:54:16.400
so much to do there.

00:54:16.400 --> 00:54:16.980
It's cool.

00:54:17.360 --> 00:54:19.640
Out of the live stream, Brandon, just wanted, now we're cutting it out.

00:54:19.640 --> 00:54:21.340
I wanted to throw out, he said, very helpful.

00:54:21.340 --> 00:54:22.800
Thank you for the article, Bex.

00:54:22.800 --> 00:54:23.360
Cool.

00:54:23.360 --> 00:54:23.760
You're welcome.

00:54:23.760 --> 00:54:24.360
Yeah, I agree.

00:54:24.360 --> 00:54:24.560
Yeah.

00:54:24.560 --> 00:54:25.360
Thanks for doing this one.

00:54:25.360 --> 00:54:30.140
I do want to point out, we certainly don't have time to cover it, but let me pull it up

00:54:30.140 --> 00:54:32.620
here so I can make sure it goes in the links as well.

00:54:32.620 --> 00:54:34.340
You did the same thing for NumPy, right?

00:54:34.340 --> 00:54:36.100
And you also were a little more confident.

00:54:36.100 --> 00:54:37.940
I got to say, you're a little more confident here.

00:54:37.940 --> 00:54:40.740
Your P of guarantee equals 0.85 instead of 0.8.

00:54:40.740 --> 00:54:44.020
NumPy practices are a little bit harder to understand.

00:54:44.540 --> 00:54:48.020
That's why most of them don't bother to learn those, most people.

00:54:48.020 --> 00:54:52.500
So I was a bit confident because I also didn't know most of these functions.

00:54:52.500 --> 00:54:54.440
That's why I was a bit more confident.

00:54:54.440 --> 00:54:54.820
Yeah.

00:54:54.820 --> 00:54:55.400
Fantastic.

00:54:55.400 --> 00:54:56.340
All right.

00:54:56.340 --> 00:55:00.020
So if people like this flow and they want to kind of go a little deeper and go into the

00:55:00.020 --> 00:55:01.700
NumPy layer, they can check that out.

00:55:01.700 --> 00:55:03.640
And they can also check out a bunch of your other writing.

00:55:03.740 --> 00:55:05.220
I also have the same for sklearn.

00:55:05.220 --> 00:55:05.740
Okay.

00:55:05.740 --> 00:55:07.420
Right on for sklearn.

00:55:07.420 --> 00:55:07.640
Great.

00:55:07.640 --> 00:55:08.280
All right.

00:55:08.280 --> 00:55:12.020
Anything else you want to add to this article before we call it good on that topic?

00:55:12.020 --> 00:55:13.680
I think we covered everything.

00:55:13.680 --> 00:55:14.040
Yeah.

00:55:14.040 --> 00:55:14.660
We covered it well.

00:55:14.660 --> 00:55:15.400
I think it was fun.

00:55:15.400 --> 00:55:15.640
Yeah.

00:55:15.640 --> 00:55:16.560
It was fun.

00:55:16.560 --> 00:55:16.760
All right.

00:55:16.760 --> 00:55:21.680
Now, before you get out of here, there's the two questions you've got to answer.

00:55:21.680 --> 00:55:25.140
If you're going to write some Python code, what editor do you use?

00:55:25.140 --> 00:55:25.980
What are you going to use?

00:55:25.980 --> 00:55:28.800
For data analysis, I usually use JupyterLab.

00:55:28.800 --> 00:55:29.180
Yep.

00:55:29.180 --> 00:55:33.400
But if I have to do pure Python, that's always PyCharm.

00:55:33.400 --> 00:55:34.220
I love it.

00:55:34.220 --> 00:55:34.540
Awesome.

00:55:34.540 --> 00:55:35.220
That's a good combo.

00:55:35.220 --> 00:55:35.960
Yeah.

00:55:35.960 --> 00:55:38.300
And then notable PyPI package.

00:55:38.300 --> 00:55:39.160
Something.

00:55:39.160 --> 00:55:42.160
It doesn't have to be something super popular, but something that you've come across that

00:55:42.160 --> 00:55:44.500
people are like, you're like, people should know about this.

00:55:44.500 --> 00:55:45.860
This is something I learned about.

00:55:45.860 --> 00:55:48.240
I recently came across UMAP.

00:55:48.240 --> 00:55:49.000
UMAP?

00:55:49.000 --> 00:55:50.660
It's for dimensionality reduction.

00:55:50.660 --> 00:55:51.520
UMAP Python.

00:55:51.960 --> 00:55:57.720
It's usually used for like very large data sets to project them to 2D so that you can

00:55:57.720 --> 00:55:58.580
visualize them.

00:55:58.580 --> 00:56:00.640
This one is a really useful package.

00:56:00.640 --> 00:56:01.020
Nice.

00:56:01.020 --> 00:56:04.560
So definitely people are trying to project down to 2D.

00:56:04.560 --> 00:56:05.980
I mean, that's one of the problems, right?

00:56:05.980 --> 00:56:08.400
Is how do you look at some of this stuff that's...

00:56:08.400 --> 00:56:10.580
Like 100 dimensional or 200 dimensions.

00:56:10.580 --> 00:56:12.220
You just can't visualize.

00:56:12.220 --> 00:56:16.160
I don't even have any idea at all how to do 100 dimensions.

00:56:16.160 --> 00:56:21.120
I remember we were doing some work with complex analysis and two dimensional.

00:56:21.320 --> 00:56:23.240
Each dimension was complex numbers.

00:56:23.240 --> 00:56:24.000
So four dimensional.

00:56:24.000 --> 00:56:25.480
That was a challenge.

00:56:25.480 --> 00:56:27.660
I have no idea how to approach 100.

00:56:27.660 --> 00:56:28.660
No one does.

00:56:28.660 --> 00:56:31.660
That's why these kinds of dimensionality reduction techniques exist.

00:56:31.660 --> 00:56:31.980
Yeah.

00:56:31.980 --> 00:56:32.440
Fantastic.

00:56:32.440 --> 00:56:35.160
And of course, it's important in machine learning and stuff, right?

00:56:35.160 --> 00:56:38.560
There's like dimensions that you can just throw away because they don't actually contribute

00:56:38.560 --> 00:56:40.040
to the predictions and stuff, right?

00:56:40.100 --> 00:56:40.340
Yeah.

00:56:40.340 --> 00:56:41.600
UMAP does that exactly.

00:56:41.600 --> 00:56:41.920
Excellent.

00:56:41.920 --> 00:56:42.500
Super.

00:56:42.500 --> 00:56:43.640
All right, Bex.

00:56:43.640 --> 00:56:44.800
Thank you for being here.

00:56:44.800 --> 00:56:45.920
Final call to action.

00:56:45.920 --> 00:56:50.240
People want to get deeper in Pandas, maybe learn more about some of your articles.

00:56:50.240 --> 00:56:51.340
You know, what do you tell them?

00:56:51.340 --> 00:56:53.760
As I said, just first check the documentation.

00:56:53.760 --> 00:56:56.920
The documentation is usually, it should be your first choice.

00:56:56.920 --> 00:56:58.940
It's the best place to learn about a library.

00:56:59.080 --> 00:57:02.800
It takes a little dedication, but go through it and find out what it has to offer and go

00:57:02.800 --> 00:57:03.300
from there, right?

00:57:03.300 --> 00:57:09.260
It's a bit hard to read, but the documentation always gives the best information about

00:57:09.260 --> 00:57:12.860
the library because it's written by the package creators.

00:57:13.500 --> 00:57:15.540
So they know the library the best.

00:57:15.540 --> 00:57:15.960
For sure.

00:57:15.960 --> 00:57:16.700
Yeah.

00:57:16.700 --> 00:57:17.220
All right.

00:57:17.220 --> 00:57:18.180
Well, thank you for being here.

00:57:18.180 --> 00:57:19.840
Thanks for writing the article and sharing that with us.

00:57:19.840 --> 00:57:20.740
Thanks for having me.

00:57:20.740 --> 00:57:21.400
Yeah, you bet.

00:57:21.400 --> 00:57:21.820
Bye.

00:57:21.820 --> 00:57:22.300
Thank you.

00:57:22.300 --> 00:57:22.460
Bye.

00:57:22.460 --> 00:57:26.440
This has been another episode of Talk Python to Me.

00:57:26.440 --> 00:57:28.260
Thank you to our sponsors.

00:57:28.260 --> 00:57:29.860
Be sure to check out what they're offering.

00:57:29.860 --> 00:57:31.280
It really helps support the show.

00:57:31.280 --> 00:57:36.800
Choose Shortcut, formerly Clubhouse.io, for tracking all of your project's work because

00:57:36.800 --> 00:57:39.560
you shouldn't have to project manage your project management.

00:57:39.560 --> 00:57:42.400
Visit talkpython.fm/shortcut.

00:57:43.180 --> 00:57:46.980
Simplify your infrastructure and cut your cloud bills in half with Linode's Linux virtual

00:57:46.980 --> 00:57:47.400
machines.

00:57:47.400 --> 00:57:50.760
Develop, deploy, and scale your modern applications faster and easier.

00:57:50.760 --> 00:57:55.720
Visit talkpython.fm/linode and click the create free account button to get started.

00:57:55.720 --> 00:57:59.440
Do you need a great automatic speech to text API?

00:57:59.440 --> 00:58:01.960
Get human level accuracy in just a few lines of code.

00:58:01.960 --> 00:58:04.820
Visit talkpython.fm/assemblyai.

00:58:04.820 --> 00:58:06.600
Want to level up your Python?

00:58:06.600 --> 00:58:10.640
We have one of the largest catalogs of Python video courses over at Talk Python.

00:58:11.080 --> 00:58:15.820
Our content ranges from true beginners to deeply advanced topics like memory and async.

00:58:15.820 --> 00:58:18.500
And best of all, there's not a subscription in sight.

00:58:18.500 --> 00:58:21.400
Check it out for yourself at training.talkpython.fm.

00:58:21.400 --> 00:58:23.300
Be sure to subscribe to the show.

00:58:23.300 --> 00:58:26.080
Open your favorite podcast app and search for Python.

00:58:26.080 --> 00:58:27.380
We should be right at the top.

00:58:27.380 --> 00:58:33.180
You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the

00:58:33.180 --> 00:58:36.740
direct RSS feed at /rss on talkpython.fm.

00:58:36.740 --> 00:58:40.180
We're live streaming most of our recordings these days.

00:58:40.180 --> 00:58:44.300
If you want to be part of the show and have your comments featured on the air, be sure to

00:58:44.300 --> 00:58:48.020
subscribe to our YouTube channel at talkpython.fm/youtube.

00:58:48.420 --> 00:58:49.860
This is your host, Michael Kennedy.

00:58:49.860 --> 00:58:51.160
Thanks so much for listening.

00:58:51.160 --> 00:58:52.320
I really appreciate it.

00:58:52.320 --> 00:58:54.220
Now get out there and write some Python code.

00:58:54.220 --> 00:58:54.860
Bye.
