
25 Pandas Functions You Didn’t Know Existed

Episode #341, published Wed, Nov 17, 2021, recorded Thu, Nov 4, 2021

Do you do anything with Jupyter notebooks? If you do, there is a very good chance you're working with the pandas library. This is one of THE primary tools of anyone doing computational work or data exploration with Python. Yet, this library is massive and knowing the idiomatic way to use it can be hard to discover.

That's why I've invited Bex Tuychiev to be our guest. He wrote an excellent article highlighting 25 idiomatic Pandas functions and properties we should all keep in our data toolkit. I'm sure there is something here for all of us to take away and use pandas that much better.

Watch this episode on YouTube
Watch the live stream version

The 25 functions

  1. ExcelWriter is a generic class for creating Excel files (with sheets!) and writing DataFrames to them.
  2. pipe is one of the best functions for doing data cleaning in a concise, compact manner in Pandas.
  3. factorize: This function is a pandas alternative to scikit-learn’s LabelEncoder.
  4. A function with an interesting name is explode.
  5. Another function with a funky name is squeeze, which is used in very rare but annoying edge cases.
  6. between: A rather nifty function for boolean indexing numeric features within a range.
  7. All DataFrames have a simple T attribute, which stands for transpose.
  8. Did you know that Pandas allows you to style DataFrames?
  9. Pandas options
  10. convert_dtypes: We all know that pandas has an annoying tendency to mark some columns as object data type. Instead of manually specifying their types, you can use the convert_dtypes method, which tries to infer the best data type.
  11. A function I use all the time is select_dtypes.
  12. mask allows you to quickly replace cell values where a custom condition is true.
  13. min and max along the columns axis
  14. nlargest and nsmallest.
  15. However, sometimes you want the index label of the min/max rather than the value itself. For that, use idxmax/idxmin.
  16. value_counts with dropna=False: A common way to find the percentage of missing values is to chain isnull and sum and divide by the length of the array. You can do the same thing with value_counts and the relevant arguments.
  17. The clip function makes it really easy to find outliers outside a range and replace them with the hard limits.
  18. at_time allows you to subset values at a specific time of day.
  19. bdate_range is a shorthand function to create time series indices (a DatetimeIndex) with business-day frequency.
  20. autocorr
  21. Pandas offers a quick way to check whether a given Series contains any nulls: the hasnans attribute.
  22. at and iat: These two accessors are much faster alternatives to loc and iloc, with one disadvantage: they only allow selecting or replacing a single value at a time.
  23. argsort: You should use this function when you want to extract the indices that would sort an array.
  24. When a column is a category, you can use several special functions using the cat accessor.
  25. GroupBy.nth: This function only works with GroupBy objects. Specifically, after grouping, nth returns the nth row from each group.
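
To make a handful of these concrete, here is a minimal sketch that exercises several of the functions above on a tiny DataFrame. The data, the column names (city, temp, tags), and the two helper functions passed to pipe are all made up for illustration and do not come from the article or the episode. The numbers in the comments refer to the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Pune", "Lima"],
    "temp": [3.0, 19.5, np.nan, 31.0, 45.0],
    "tags": [["a", "b"], ["c"], [], ["a"], ["b", "c"]],
})


# (2) pipe: chain small cleaning steps. Both helpers are made up for this sketch.
def drop_missing(frame, column):
    return frame.dropna(subset=[column])


def add_fahrenheit(frame):
    return frame.assign(temp_f=frame["temp"] * 9 / 5 + 32)


cleaned = df.pipe(drop_missing, "temp").pipe(add_fahrenheit)

# (3) factorize: integer codes in order of first appearance, plus the unique labels.
codes, uniques = pd.factorize(df["city"])

# (4) explode: one row per list element; an empty list becomes a single NaN row.
exploded = df["tags"].explode()

# (5) squeeze: collapse a 1x1 DataFrame down to a scalar.
pune_temp = df.loc[df["city"] == "Pune", ["temp"]].squeeze()

# (6) between: boolean mask for values in a range (both ends inclusive by default).
mild = df[df["temp"].between(0, 30)]

# (11) select_dtypes: keep only the numeric columns.
numeric = df.select_dtypes(include="number")

# (12) mask: replace values where the condition is True.
masked = df["temp"].mask(df["temp"] > 40, 40.0)

# (17) clip: pull values outside the range back to the limits.
clipped = df["temp"].clip(lower=5, upper=40)

# (14) nlargest and (15) idxmax: the top values, and the index label of the maximum.
top_two = df["temp"].nlargest(2)
hottest_label = df["temp"].idxmax()

# (16) value_counts with dropna=False: NaN gets its own row, so its share is visible.
shares = df["temp"].value_counts(dropna=False, normalize=True)
missing_share = shares[shares.index.isna()].sum()

# (21) hasnans: quick check for any missing values in a Series.
has_missing = df["temp"].hasnans

# (22) at / iat: fast access to one cell, by label or by integer position.
by_label = df.at[1, "temp"]
by_position = df.iat[1, 1]

# (25) GroupBy.nth: the first row of each group. The index of the result
# differs between pandas versions, so avoid relying on it.
first_rows = df[["city", "temp"]].groupby("city").nth(0)

print(cleaned, exploded, mild, top_two, first_rows, sep="\n\n")
```

Nothing here is tuned for real data. The point is only that each of these is a single call that replaces a few lines of hand-written indexing.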

Links from the show

Bex Tuychiev:
Bex's Medium profile:

Numpy 25 functions article:
missingno package:
Watch this episode on YouTube:
Episode transcripts:

--- Stay in touch with us ---
Subscribe to us on YouTube:
Follow Talk Python on Mastodon: talkpython
Follow Michael on Mastodon: mkennedy
