Pandas and Beyond with Wes McKinney

Episode #462, published Wed, May 15, 2024, recorded Thu, Apr 11, 2024

This episode dives into some of the most important data science libraries in the Python space with one of its pioneers: Wes McKinney. He's the creator or co-creator of the pandas, Apache Arrow, and Ibis projects, and an entrepreneur in this space.

Episode Deep Dive

Guest Introduction and Background

Wes McKinney is a pioneering force in the Python data ecosystem, best known for creating or co-creating pandas, Apache Arrow, and Ibis. He has a deep background in quantitative finance and data infrastructure. Wes is also an entrepreneur, having co-founded Voltron Data, and he now holds a software architect role at Posit (formerly RStudio) to help shape the company’s Python strategy. Throughout his career, he has focused on improving Python-based data workflows and building new libraries and standards for data science at scale.

What to Know If You’re New to Python

If you’re just getting started, you’ll want to know that pandas is the de facto library for working with tabular data (think “spreadsheets” in Python). Many of the discussions below center on topics such as dataframe APIs, vectorized data operations, and integrations with the broader ecosystem, including Arrow, Polars, and SQL.
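
As a rough illustration of the "spreadsheet in Python" idea, here is a minimal sketch that builds a small table and computes a new column from two existing ones. The column names and values are made up for this example; only the pandas calls themselves (`pd.DataFrame`, column arithmetic) are real API.

```python
import pandas as pd

# Hypothetical sample data: a small table, like a sheet in a spreadsheet.
df = pd.DataFrame(
    {
        "item": ["apple", "banana", "cherry"],
        "price": [0.50, 0.25, 3.00],
        "quantity": [4, 12, 1],
    }
)

# Vectorized: multiply two whole columns at once, with no Python-level loop.
df["total"] = df["price"] * df["quantity"]

# The same computation written as an explicit loop, for comparison.
loop_totals = [p * q for p, q in zip(df["price"], df["quantity"])]

print(df)
```

Both approaches give the same numbers. The column-at-a-time form is the style the rest of these notes refer to as "vectorized".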

Key Points and Takeaways

  1. Pandas as the Data Analysis Cornerstone: pandas has become the de facto standard for data analysis in Python, offering tabular data structures, built-in methods for filtering and grouping, and a large ecosystem of tools built around it. It’s widely adopted in industry and academia, partly because it’s a well-known entry point for data science and analytics in Python.
  2. Apache Arrow, Modern Data Interchange and Performance: Wes and colleagues started Apache Arrow to create a language-agnostic, columnar memory format for high-performance analytics. Arrow takes advantage of cache-efficient operations and vectorization, enabling faster workloads on modern CPUs and even GPUs. It also serves as a bridge between different systems (like Python and Rust-based projects) without needing bespoke data conversions.
  3. Ibis, One API, Many Backends: Ibis provides a dataframe-like API that can generate SQL or connect to various engines under the hood (DuckDB, Spark, Polars, pandas, etc.). This portability lets you run similar code whether you’re working with a local dataset on a laptop or on a large cluster. It simplifies the big data story by giving you a uniform interface.
  4. SQLGlot for Query Transpilation: SQLGlot translates SQL queries across different database dialects, which is helpful if you’ve ever had to move queries from, say, Hive or Spark to DuckDB or ClickHouse. Ibis recently adopted SQLGlot to handle SQL generation and dialect differences more seamlessly.
  5. Rise of Polars and Rust-based Data Tools: Polars is a Rust-based dataframe library that’s Arrow-native under the hood, offering lazy execution and advanced optimizations. It’s a testament to the growing push for speed, parallelism, and more efficient data handling, and a newer but notable addition to Python’s data stack.
  6. Posit (formerly RStudio) and Shiny for Python: Although historically associated with R, Posit is increasingly investing in Python. Wes’s role there helps shape cross-language interoperability. Shiny for Python is a prominent example, built around a “reactive programming” model for creating interactive dashboards without managing complex callback logic.
  7. Big Data, Dask, and Parallel Python: While the episode didn’t go very deep on parallel Python frameworks, Wes touched on the importance of scaling computations beyond a single node. Tools like Dask can spread pandas-like workloads across CPU cores or clusters, supporting bigger-than-memory data processing.
  8. WebAssembly & Browser-based Data Science: A short but exciting point is that projects like Pyodide and DuckDB compiled to WebAssembly open a future where Python and data workloads can run fully in the browser. This removes the need for specialized servers and can further simplify deploying interactive data applications.
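
To make the "filtering and grouping" from the first takeaway concrete, here is a minimal sketch using pandas. The `orders` table, its column names, and the threshold are hypothetical and chosen only for illustration; the episode itself does not walk through this code.

```python
import pandas as pd

# Hypothetical sample data: one row per order.
orders = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west", "east"],
        "amount": [120.0, 80.0, 45.0, 200.0, 60.0],
    }
)

# Filtering: keep only the rows where the condition holds.
large = orders[orders["amount"] >= 60.0]

# Grouping: total amount per region among the filtered rows.
totals = large.groupby("region")["amount"].sum()

print(totals)
```

The boolean mask and `groupby` are the same building blocks that tools like Ibis and Dask mirror in their own dataframe-style APIs, which is part of why pandas idioms carry over to them reasonably well.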

Interesting Quotes and Stories

  • Wes on unlocking capabilities for users: “One of the reasons I became passionate about Python was about giving people superpowers—making tasks easier so that you can focus on the interesting parts of your application.”
  • On the mission of Posit: “It’s really refreshing to work with people who are truly mission focused and want to bring open-source data science tools to everyone, sustainably and for the long term.”

Key Definitions and Terms

  • DataFrame: A tabular data structure, similar to a spreadsheet, commonly used in pandas and other libraries like Polars.
  • Vectorized Operations: Performing an operation on entire arrays or columns at once, rather than in Python-level loops.
  • Columnar Format: A way of storing data column-by-column rather than row-by-row (e.g., Arrow, Parquet), usually more efficient for analytics.
  • Transpilation: Converting code or queries from one form (or language) to another, e.g., from one SQL dialect to a different dialect.
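
The difference between row-oriented and columnar layouts can be sketched with nothing but the standard library. This is a toy model under stated assumptions, not how Arrow or Parquet are implemented: the `rows` data is invented, and `array("d")` merely stands in for the idea of keeping one column's values together in memory.

```python
from array import array

# Hypothetical records in a row-oriented layout: one dict per row.
rows = [
    {"city": "Oslo", "temp_c": 4.5},
    {"city": "Lima", "temp_c": 19.0},
    {"city": "Cairo", "temp_c": 28.5},
]

# The same data in a column-oriented layout: one sequence per column.
# array("d") stores the floats contiguously, loosely mirroring how
# columnar formats keep each column's values together.
columns = {
    "city": [row["city"] for row in rows],
    "temp_c": array("d", (row["temp_c"] for row in rows)),
}

# An analytic query over one column only needs that column's values.
mean_temp = sum(columns["temp_c"]) / len(columns["temp_c"])
print(mean_temp)
```

In the columnar form, an aggregate such as the mean reads a single contiguous column and never touches the `city` values, which is the basic reason this layout tends to suit analytics.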

Overall Takeaway

This conversation with Wes McKinney highlights a broad ecosystem push toward simpler yet highly performant data workflows in Python. From the maturity and ubiquity of pandas to the high-speed innovations of Arrow, Ibis, and Rust-powered libraries, the Python data landscape continues to evolve rapidly. Underlying all these tools is a shared goal: making real-world data work more productive and accessible so developers can focus on the insights rather than the plumbing.

Links from the show

Wes' Website: wesmckinney.com
Pandas: pandas.pydata.org
Apache Arrow: arrow.apache.org
Ibis: ibis-project.org
Python for Data Analysis - Groupby Summary: wesmckinney.com/book
Polars: pola.rs
Dask: dask.org
SQLGlot: sqlglot.com
Pandoc: pandoc.org
Quarto: quarto.org
Evidence framework: evidence.dev
PyScript: pyscript.net
DuckDB: duckdb.org
JupyterLite: jupyter.org
Djangonauts: djangonaut.space
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy