Understanding Pandas visually with PandasTutor

Episode #358, published Fri, Mar 25, 2022, recorded Mon, Feb 28, 2022

Pandas is a great library that allows you to accomplish a ton of filtering and processing in condensed syntax. But how well do you understand what's happening? Sam Lau and Philip Guo built a great site to help us visually explore how Pandas is processing your dataset with your specific syntax. It's called PandasTutor, and Sam is here to tell us about it.

Episode Deep Dive

Guests Introduction and Background

Sam Lau is a PhD student at UC San Diego, advised by Philip Guo, with a research focus on human-computer interaction (HCI), specifically how people use computers to teach and learn programming and data science. Drawing on his teaching experiences, Sam co-created PandasTutor (along with Philip Guo) to help students and developers visualize step-by-step transformations of their data in Python’s Pandas library.

What to Know If You're New to Python

If you’re just getting started and want to understand this episode fully, make sure you are comfortable writing and running simple Python code and can import libraries like pandas. Understanding basic data structures (lists, dictionaries) and how they might show up in data science is helpful too. If you’ve never worked with Pandas before, just know that it’s a key library in Python for data manipulation, letting you filter, group, and reshape tabular data. Mastering Pandas is often a foundational skill for data science in Python.

Key Points and Takeaways

  1. Visualizing Pandas Transformations with PandasTutor PandasTutor (https://pandastutor.com) is a web-based tool that reveals what happens behind each line of your Pandas code by showing intermediate data transformations step by step. It visually depicts how rows move, group, and aggregate so you can see exactly how filters, sorts, and group operations shape your data.
  2. Filtering and Sorting in Pandas One of the simplest yet most common data-wrangling tasks is filtering rows and sorting them by a column. PandasTutor draws arrows between rows to show which rows survive a filter (e.g., df[df["size"] == "medium"]) and highlights the exact column used for sorting. This makes it easier to follow operations that look straightforward but become confusing when the data frame is large or intermediate steps are hidden in the code.
  3. Deeper Understanding of GroupBy Grouping data and then running aggregations (sum, median, max, etc.) is crucial but often confuses learners. PandasTutor color-codes groups and draws multiple arrows converging into a single row for each group. This clarity is especially useful for novices trying to figure out why rows disappear or how their data is partitioned.
  4. Use Cases in Teaching and Presentations Sam described using PandasTutor’s visual approach in the classroom. Instead of manually commenting out lines or juggling multiple data frame printouts, an instructor can show transformations side by side, making it much easier for students to see how each step works. Educators can embed the generated diagrams into slides or share them via links.
  5. Beyond Beginners: Debugging and Stack Overflow While PandasTutor is excellent for newcomers, experienced developers can also paste complex Pandas snippets from Stack Overflow (or their own code) into PandasTutor. This “visual debugger” clarifies the flow and transformations, helping you verify the final data structure matches expectations.
    • Potential Future Feature:
      • A browser extension or bookmarklet to auto-visualize Pandas code snippets right on Stack Overflow
  6. Expansion to Other Data Tools The same design could extend to other operations, such as joins and pivots, and to other data frameworks, including R’s Tidyverse. Sam mentioned Tidy Data Tutor, a sibling project for the R ecosystem. With enough interest and development effort, tools like Dask or Spark might see similar approaches to illustrating distributed data transformations.
  7. Potential Integrations: JupyterLab or Offline Mode The discussion touched on embedding PandasTutor as a JupyterLab extension, so users can visualize their steps inline. Another direction might be using WebAssembly or Pyodide to sidestep sandbox restrictions, letting you work with real or larger data sets offline without server constraints.
  8. Dealing with Data Imports and Code Sandboxing Since PandasTutor runs your code in a Docker container with limited internet access, a typical Pandas CSV import from a URL might not work. Instead, you can embed short CSV data as a Python multiline string. This design keeps the tool secure but can require a little extra setup to show your own data transformations.
  9. Sam’s Ongoing Work and Future Plans Sam is currently finishing his PhD and working on a data science book (tentatively titled “Learning Data Science”). PandasTutor may expand to handle a broader range of Pandas operations—like joins and pivots—and potentially be open-sourced once it’s ready for wider community collaboration.
  10. Insights on Human-Computer Interaction in Data Education Sam’s research reveals how vital it is for students to see both code and data transformations at the same time. Interactive, visual step-by-step breakdowns can reduce friction in understanding. This HCI-centric approach underscores that tools like PandasTutor don’t just accelerate coding but also deepen learning and confidence.
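
As a rough illustration of points 2, 3, and 8, here is a minimal sketch that embeds a small CSV as a multiline string and then filters, sorts, and groups it. This is not code from the episode: the column names (name, size, price) and all of the values are made up, and it assumes only pandas plus the standard library's io.StringIO.

```python
import io

import pandas as pd

# Hypothetical data, embedded as a multiline string instead of being read
# from a URL (the sandbox described in point 8 may block network access).
CSV_TEXT = """name,size,price
latte,medium,4.50
mocha,large,5.25
espresso,small,3.00
cappuccino,medium,4.00
drip,large,2.75
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Filter: keep only the rows whose "size" column equals "medium".
# Bracket access is used because df.size is a built-in attribute
# (the total number of cells), not the "size" column.
medium = df[df["size"] == "medium"]

# Sort: order the surviving rows by price, cheapest first.
medium_sorted = medium.sort_values("price")

# GroupBy + aggregation: one output value per distinct size.
max_price_by_size = df.groupby("size")["price"].max()

print(medium_sorted)
print(max_price_by_size)
```

Each intermediate result gets its own variable here so it can be printed and inspected, which is the manual version of what a step-by-step visualization shows automatically.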

Interesting Quotes and Stories

  • Sam on teaching data transformations: “A lot of times, when we’re teaching data science, we have multiple lines of code that produce only the final output. Students ask, ‘What happened in the middle?’ and we’d be commenting code in and out. PandasTutor came directly from wanting a tool that shows each stage visually.”
  • On groupby confusion: “Groupby is especially tricky because rows just vanish into those groups. PandasTutor’s color coding and arrows let you see exactly where they went and how they’re aggregated.”

Key Definitions and Terms

  • DataFrame: A two-dimensional labeled data structure commonly used in Pandas.
  • Filtering: Selecting rows of data based on one or more conditions, e.g., df[df["column"] == value].
  • Sorting: Rearranging rows in ascending or descending order according to a specified column.
  • GroupBy: Splitting a DataFrame into groups based on some criteria, applying a function to each group independently, and combining the results.
  • Aggregation: Summarizing data in groups by a specific operation like sum, mean, median, or max.
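
The GroupBy and Aggregation definitions above can be made concrete by doing the split, apply, and combine steps by hand and comparing the result with the built-in call. This is a sketch for illustration only: the data and column names are hypothetical, and the manual loop is there to expose the steps, not as a recommended way to aggregate.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {
        "size": ["medium", "large", "small", "medium", "large"],
        "price": [4.50, 5.25, 3.00, 4.00, 2.75],
    }
)

# Split: iterating over a GroupBy object yields (key, sub-DataFrame) pairs.
# Apply: compute one summary value (here the median price) for each group.
medians = {}
for size, group in df.groupby("size"):
    medians[size] = group["price"].median()

# Combine: assemble the per-group results into a single Series.
by_hand = pd.Series(medians, name="price")

# The built-in aggregation performs all three steps in one call.
built_in = df.groupby("size")["price"].median()

print(by_hand)
print(built_in)
```

Both approaches produce one value per group, which is why the grouped result has fewer rows than the original DataFrame.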

Overall Takeaway

PandasTutor highlights the power of visual feedback in data science. Whether you’re a teacher, student, or experienced developer, seeing precisely how rows and columns transform across filters, sorts, and groupby operations can simplify debugging, enhance learning, and enrich your understanding of Pandas. Sam’s HCI-driven perspective reminds us that good tooling isn’t just about speed—it’s also about insight and clarity into what our code really does.

Links from the show

Sam Lau: samlau.me
Sam on Twitter: @samlau95
PandasTutor: pandastutor.com
PythonTutor: pythontutor.com
Principles and Techniques of Data Science book: textbook.ds100.org
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm
