25 Pandas Functions You Didn’t Know Existed
Episode #341,
published Wed, Nov 17, 2021, recorded Thu, Nov 4, 2021
Do you do anything with Jupyter notebooks? If you do, there is a very good chance you're working with the pandas library. This is one of THE primary tools of anyone doing computational work or data exploration with Python. Yet this library is massive, and the idiomatic way to use it can be hard to discover.
That's why I've invited Bex Tuychiev to be our guest. He wrote an excellent article highlighting 25 idiomatic Pandas functions and properties we should all keep in our data toolkit. I'm sure there is something here for all of us to take away and use pandas that much better.
The 25 functions
- ExcelWriter is a generic class for creating Excel files (with sheets!) and writing DataFrames to them.
- pipe is one of the best functions for doing data cleaning in a concise, compact manner in Pandas.
- factorize: This function is a pandas alternative to scikit-learn's LabelEncoder.
- A function with an interesting name is explode.
- Another function with a funky name is squeeze, which is used in very rare but annoying edge cases.
- between: A rather nifty function for boolean indexing numeric features within a range.
- All DataFrames have a simple T attribute, which stands for transpose.
- Did you know that Pandas allows you to style DataFrames?
- Pandas options
- convert_dtypes: We all know that pandas has an annoying tendency to mark some columns as object data type. Instead of manually specifying their types, you can use the convert_dtypes method, which tries to infer the best data type.
- A function I use all the time is select_dtypes.
- mask allows you to quickly replace cell values where a custom condition is true.
- min and max along the columns axis
- nlargest and nsmallest.
- idxmax and idxmin: Sometimes you want the index label of the min/max rather than the value itself; for that, use idxmax/idxmin.
- value_counts with dropna=False: A common way to find the percentage of missing values is to chain isnull and sum and divide by the length of the array. You can do the same thing with value_counts and the relevant arguments.
- The clip function makes it really easy to find outliers outside a range and replace them with the hard limits.
- at_time allows you to subset values at a specific time of day on a datetime-indexed Series or DataFrame.
- bdate_range is a shorthand function for creating a DatetimeIndex with business-day frequency.
- autocorr
- Pandas offers a quick way to check whether a given Series contains any nulls: the hasnans attribute.
- at and iat: These two accessors are much faster alternatives to loc and iloc, with one disadvantage: they only allow selecting or replacing a single value at a time.
- argsort: You should use this function when you want to extract the indices that would sort an array
- When a column is a category, you can use several special functions using the cat accessor.
- GroupBy.nth: This function only works with GroupBy objects. Specifically, after grouping, nth returns the nth row from each group
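To make a few of the items above concrete, here is a minimal sketch that exercises a handful of them on a tiny DataFrame. The data, column names, and the helper functions drop_missing and add_fahrenheit are hypothetical, made up for illustration; they are not taken from Bex's article or from the episode.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Oslo", "Pune", "Lima"],
        "temp": [3.0, 19.5, None, 31.0, 45.0],
        "tags": [["cold"], ["mild", "dry"], ["cold"], ["hot"], ["hot", "dry"]],
    }
)


# pipe: chain plain functions into a readable cleaning pipeline.
def drop_missing(frame, column):
    """Hypothetical helper: drop rows where `column` is null."""
    return frame.dropna(subset=[column])


def add_fahrenheit(frame):
    """Hypothetical helper: add a Fahrenheit column derived from `temp`."""
    return frame.assign(temp_f=frame["temp"] * 9 / 5 + 32)


cleaned = df.pipe(drop_missing, "temp").pipe(add_fahrenheit)

# factorize: integer codes in order of first appearance, plus the unique labels.
codes, uniques = pd.factorize(df["city"])  # codes -> [0, 1, 0, 2, 1]

# explode: one row per list element; the original index label is repeated.
exploded = df.explode("tags")  # 7 rows

# squeeze: a 1x1 DataFrame collapses to a scalar.
single = pd.DataFrame({"a": [1]}).squeeze()

# between: boolean mask, inclusive on both ends by default; NaN gives False.
in_range = df["temp"].between(0, 20)

# select_dtypes: keep only the numeric columns.
numeric_cols = df.select_dtypes(include="number").columns.tolist()  # ['temp']

# nlargest and idxmax: the top values, and the index label of the maximum.
top_two = df["temp"].nlargest(2)  # values 45.0 and 31.0
hottest_label = df["temp"].idxmax()  # 4

# value_counts with dropna=False: NaN gets its own bucket, so with
# normalize=True its entry is the share of missing values.
shares = df["temp"].value_counts(dropna=False, normalize=True)
missing_share = shares[shares.index.isna()].iloc[0]  # 0.2

# clip: values outside [5, 40] are replaced by the limits; NaN stays NaN.
clipped = df["temp"].clip(lower=5, upper=40)

# hasnans: quick null check on a Series.
has_nulls = df["temp"].hasnans  # True

# at / iat: fast access to a single value by label or by position.
by_label = df.at[1, "temp"]  # 19.5
by_position = df.iat[1, 1]  # 19.5

# bdate_range: business days only (2021-11-01 was a Monday).
business_days = pd.bdate_range("2021-11-01", "2021-11-07")  # Mon through Fri

# GroupBy.nth: the first row of each city group. The exact index of the
# result differs between pandas versions, so only the values are relied on.
first_rows = df[["city", "temp"]].groupby("city").nth(0)
```

Each call is independent of the others, so any single line can be lifted into a notebook cell and tried against your own data.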
Links from the show
Bex Tuychiev: linkedin.com
Bex's Medium profile: ibexorigin.medium.com
Numpy 25 functions article: towardsdatascience.com
missingno package: coderzcolumn.com
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm
--- Stay in touch with us ---
Subscribe to us on YouTube: youtube.com
Follow Talk Python on Mastodon: talkpython
Follow Michael on Mastodon: mkennedy