Understanding Pandas visually with PandasTutor
Episode Deep Dive
Guests Introduction and Background
Sam Lau is a PhD student at UC San Diego, advised by Philip Guo, with a research focus on human-computer interaction (HCI), specifically how people use computers to teach and learn programming and data science. Drawing on his teaching experiences, Sam co-created PandasTutor (along with Philip Guo) to help students and developers visualize step-by-step transformations of their data in Python’s Pandas library.
What to Know If You're New to Python
If you’re just getting started and want to understand this episode fully, make sure you are comfortable writing and running simple Python code and can import libraries like pandas. Understanding basic data structures (lists, dictionaries) and how they might show up in data science is helpful too. If you’ve never worked with Pandas before, just know that it’s a key library in Python for data manipulation, letting you filter, group, and reshape tabular data. Mastering Pandas is often a foundational skill for data science in Python.
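As a rough sketch of how lists and dictionaries show up in Pandas work, a list of dictionaries maps directly onto the rows of a DataFrame. The column names and values below are invented for illustration and do not come from the episode.

```python
import pandas as pd

# Hypothetical records: each dictionary is one row, each key a column name.
records = [
    {"name": "Rex", "size": "medium", "weight_kg": 21.5},
    {"name": "Bella", "size": "small", "weight_kg": 6.0},
    {"name": "Duke", "size": "large", "weight_kg": 38.2},
]

df = pd.DataFrame(records)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names come from the dictionary keys
print(df)
```

If this runs and the printed table makes sense to you, you have enough background to follow the rest of the episode.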
Key Points and Takeaways
- Visualizing Pandas Transformations with PandasTutor
PandasTutor (https://pandastutor.com) is a web-based tool that reveals what happens behind each line of your Pandas code by showing intermediate data transformations step by step. It visually depicts how rows move, group, and aggregate so you can see exactly how filters, sorts, and group operations shape your data.
- Links and Tools:
- PandasTutor
- PythonTutor (a related project by Philip Guo)
- Filtering and Sorting in Pandas
One of the simplest yet most common data-wrangling tasks is filtering rows and sorting them by a column. PandasTutor draws arrows between rows to show which rows survive a filter (e.g., df[df["size"] == "medium"]) and highlights the exact column used for sorting. This helps clarify how seemingly straightforward operations can lead to confusion when the data frame is large or intermediate steps are hidden in the code.
- Deeper Understanding of GroupBy
Grouping data and then running aggregations (sum, median, max, etc.) is crucial but often confuses learners. PandasTutor color-codes groups and draws multiple arrows converging into a single row for each group. This clarity is especially useful for novices trying to figure out why rows disappear or how their data is partitioned.
- Use Cases in Teaching and Presentations
Sam described using PandasTutor’s visual approach in the classroom. Instead of manually commenting out lines or juggling multiple data frame printouts, an instructor can show transformations side by side, making it much easier for students to see how each step works. Educators can embed the generated diagrams into slides or share them via links.
- Beyond Beginners: Debugging and Stack Overflow
While PandasTutor is excellent for newcomers, experienced developers can also paste complex Pandas snippets from Stack Overflow (or their own code) into PandasTutor. This “visual debugger” clarifies the flow and transformations, helping you verify the final data structure matches expectations.
- Potential Future Feature:
- A browser extension or bookmarklet to auto-visualize Pandas code snippets right on Stack Overflow
- Expansion to Other Data Tools
The tool’s design could extend to more operations, such as joins and pivots, and to other data ecosystems, such as R’s Tidyverse. Sam mentioned Tidy Data Tutor, a sibling project for the R ecosystem. With enough interest and development effort, frameworks like Dask or Spark might see similar approaches to illustrating distributed data transformations.
- Links and Tools:
- Tidy Data Tutor (R variant by Sam’s colleague)
- Potential Integrations: JupyterLab or Offline Mode
The discussion touched on embedding PandasTutor as a JupyterLab extension, so users can visualize their steps inline. Another direction might be using WebAssembly or Pyodide to sidestep sandbox restrictions, letting you work with real or larger data sets offline without server constraints.
- Dealing with Data Imports and Code Sandboxing
Since PandasTutor runs your code in a Docker container with limited internet access, a typical Pandas CSV import from a URL might not work. Instead, you can embed short CSV data as a Python multiline string. This design keeps the tool secure but can require a little extra setup to show your own data transformations.
- Sam’s Ongoing Work and Future Plans
Sam is currently finishing his PhD and working on a data science book (tentatively titled “Learning Data Science”). PandasTutor may expand to handle a broader range of Pandas operations, like joins and pivots, and potentially be open-sourced once it’s ready for wider community collaboration.
- Insights on Human-Computer Interaction in Data Education
Sam’s research reveals how vital it is for students to see both code and data transformations at the same time. Interactive, visual step-by-step breakdowns can reduce friction in understanding. This HCI-centric approach underscores that tools like PandasTutor don’t just accelerate coding but also deepen learning and confidence.
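To make the sandboxing workaround and the filter-then-sort steps from the key points concrete, here is a minimal sketch. The data is hypothetical, and reading the string with io.StringIO is simply one common Pandas idiom. Whether PandasTutor expects exactly this form is an assumption.

```python
import io

import pandas as pd

# Hypothetical data embedded as a multi-line string instead of a URL download.
csv_text = """name,size,weight_kg
Rex,medium,21.5
Bella,small,6.0
Duke,large,38.2
Milo,medium,18.0
Luna,small,4.5
"""

# read_csv accepts any file-like object, so StringIO stands in for a file.
dogs = pd.read_csv(io.StringIO(csv_text))

# Step 1: filter. Use dogs["size"] here, because dogs.size is the
# DataFrame's element count, not the column named "size".
medium = dogs[dogs["size"] == "medium"]

# Step 2: sort the surviving rows by weight, lightest first.
result = medium.sort_values("weight_kg")

print(medium)  # intermediate result after the filter
print(result)  # final result after the sort
```

Keeping each step in its own variable mirrors by hand what the tool draws automatically: you can print every intermediate table instead of only the final one.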
Interesting Quotes and Stories
- Sam on teaching data transformations: “A lot of times, when we’re teaching data science, we have multiple lines of code that produce only the final output. Students ask, ‘What happened in the middle?’ and we’d be commenting code in and out. PandasTutor came directly from wanting a tool that shows each stage visually.”
- On groupby confusion: “Groupby is especially tricky because rows just vanish into those groups. PandasTutor’s color coding and arrows let you see exactly where they went and how they’re aggregated.”
Key Definitions and Terms
- DataFrame: A two-dimensional labeled data structure commonly used in Pandas.
- Filtering: Selecting rows of data based on one or more conditions, e.g., df[df["column"] == value].
- Sorting: Rearranging rows in ascending or descending order according to a specified column.
- GroupBy: Splitting a DataFrame into groups based on some criteria, applying a function to each group independently, and combining the results.
- Aggregation: Summarizing data in groups by a specific operation like sum, mean, median, or max.
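The GroupBy and Aggregation definitions above can be sketched in a few lines. The data is hypothetical; the loop only exists to show which rows land in which group before they are collapsed into one summary row each.

```python
import pandas as pd

# Hypothetical data for illustration.
dogs = pd.DataFrame({
    "name": ["Rex", "Bella", "Duke", "Milo", "Luna"],
    "size": ["medium", "small", "large", "medium", "small"],
    "weight_kg": [21.5, 6.0, 38.2, 18.0, 4.5],
})

# Split: iterate over the groups to see which rows belong to each one.
for size, rows in dogs.groupby("size"):
    print(size, rows["name"].tolist())

# Apply and combine: each group collapses into a single aggregated row.
summary = dogs.groupby("size")["weight_kg"].agg(["median", "max"])
print(summary)
```

Five input rows become three output rows, one per distinct size, which is the "rows disappearing" effect that often puzzles newcomers.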
Learning Resources
Below are a few options to deepen your knowledge of Python, Pandas, and data manipulation concepts:
- Python for Absolute Beginners: A gentle introduction if you’re brand-new to programming in Python.
- Move from Excel to Python with Pandas: Ideal if you know your way around spreadsheets and want to step up to Pandas for more sophisticated data analysis.
Overall Takeaway
PandasTutor highlights the power of visual feedback in data science. Whether you’re a teacher, student, or experienced developer, seeing precisely how rows and columns transform across filters, sorts, and groupby operations can simplify debugging, enhance learning, and enrich your understanding of Pandas. Sam’s HCI-driven perspective reminds us that good tooling isn’t just about speed—it’s also about insight and clarity into what our code really does.
Links from the show
Sam on Twitter: @samlau95
PandasTutor: pandastutor.com
PythonTutor: pythontutor.com
Principles and Techniques of Data Science book: textbook.ds100.org
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm