Learn Python with Talk Python's 270 hours of courses

The Intersection of Tabular Data and Generative AI

Episode #410, published Thu, Apr 6, 2023, recorded Sun, Apr 2, 2023

AI has taken the world by storm. It's gone from near zero to amazing in just a few years. We have ChatGPT, we have Stable Diffusion. But what about Jupyter Notebooks and pandas? In this episode, we meet Justin Waugh, the creator of Sketch. Sketch adds the ability to have conversational AI interactions about your pandas data frames (code and data). It's pretty powerful and I know you'll enjoy the conversation.


Episode Deep Dive

Guests Introduction and Background

Justin Waugh is the creator of Sketch at Approximate Labs and a seasoned expert working at the intersection of Python, data, and AI. With a background in experimental physics (complete with hands-on LabVIEW and GPU processing experience), Justin moved from academia into startups, taking his passion for high-performance computing and machine learning to the software world. He explored GPUs early on for electron counting in physics labs and found parallels between that work and the CUDA-accelerated convolution kernels powering modern deep learning frameworks and neural network architectures. Justin's drive to merge data-driven science with practical applications led him to found and work at multiple startups, ultimately creating tools such as Sketch and Lambda Prompt to integrate conversational AI into data workflows.


What to Know If You're New to Python

If you’re just getting started with Python and want to follow along more easily, focus on pandas data frames and Jupyter notebooks, since these are at the heart of the conversation about tabular data and conversational AI. It’s helpful to understand how to install packages (e.g. pip install) and how to load data into pandas (e.g. pd.read_csv). Also, knowing how to run code snippets within Jupyter cells will help you explore libraries like Sketch interactively.
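If those pieces are new to you, a minimal session looks something like this (the inline CSV text and column names are just for illustration; in practice you would pass a file path to read_csv):

```python
# Load tabular data into pandas and take a first look.
# In a notebook, you would typically run each step in its own cell.
import io
import pandas as pd

csv_text = "city,sales\nPortland,120\nDenver,95\nAustin,130\n"
df = pd.read_csv(io.StringIO(csv_text))  # the same call works with a file path

print(df.shape)   # (rows, columns)
print(df.head())  # first few rows
```

Once you have a data frame like df in hand, libraries such as Sketch attach their functionality directly to it.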


Key Points and Takeaways

  1. Sketch: A Conversational AI Layer for Pandas Sketch is a Python library that augments pandas data frames with an AI-powered “ask” and “howto” interface. You can literally ask your data questions (e.g., “Which columns might contain sensitive info?”) or ask Sketch how to code certain data transformations and get workable Python code in return. It uses Large Language Models (LLMs) under the hood and context from the data itself to generate answers and code. This approach dramatically cuts down on switching between your notebook and external documentation or Stack Overflow.
  2. Bringing Conversational AI into Data Analysis The conversation highlights how ChatGPT-like models don’t just assist with code generation but also interpret and explain data. Instead of purely writing transformations, these models can describe anomalies, identify potential data issues, and even highlight PII-related columns. This bidirectional conversation with your data frames opens new possibilities for collaborative data science and faster discovery.
  3. Data Sketches: Efficient Summaries for Large Datasets Justin’s background in data sketches (probabilistic data structures like HyperLogLog) plays a key role in how Sketch can quickly grasp the “shape” of data. These sketches let you approximate metrics—like unique values—without scanning an entire massive dataset. Combining that snapshot of the data with GPT-like models gives them the right context to answer questions about the data efficiently.
  4. Lambda Prompt: A Toolkit for Building AI Functions Justin also discussed Lambda Prompt, another library that turns LLM endpoints (like OpenAI’s) into straightforward Python functions using Jinja templates. By defining your own prompts as functions, you can chain or compose them for more complex tasks, such as generating SQL queries, rewriting code, or building custom chat-style features. It makes building AI-driven apps simpler and more “Pythonic.”
  5. GPU Computation and the Rise of AI Justin’s story about early GPU usage in physics labs illustrates how GPU hardware rapidly evolved from specialized graphics pipelines to mainstream parallel computing engines. The discussion highlights how frameworks leveraging GPU acceleration (like PyTorch or TensorFlow) have driven breakthroughs in image generation, text modeling, and large language models. This hardware+software synergy paved the way for advanced AI tools we see today.
  6. Ethics and Licensing in AI Training Data A recurring topic was whether AI systems, like GitHub Copilot or image generation models, inadvertently incorporate copyrighted or GPL-licensed material. Justin pointed out ongoing lawsuits and broader conversations about data usage, especially the potential for “license stripping” when code is regurgitated from a generative model. Although no definitive legal resolutions emerged, the discussion underscores that privacy and ethics in AI remain an evolving challenge.
  7. ChatGPT vs. GitHub Copilot for Python Coding The episode compared the broader context-based ChatGPT experience with Copilot’s integrated approach. ChatGPT can do more open-ended tasks and explanations, while Copilot excels at inline code suggestions in IDEs like VS Code. Combining them can significantly level up your productivity, but each tool addresses slightly different developer workflows.
  8. Practical Examples: Data Cleaning and Feature Engineering One highlight was showing how Sketch can parse addresses to extract city, state, and zip, or quickly group sales data by region. Being able to say “clean up messy addresses” and get workable Python code in a single step is especially valuable for people who handle daily data wrangling tasks. Even if the code is 90% correct, it can drastically reduce the time spent on boilerplate tasks.
  9. Integrating Sketch in Jupyter for a Better Notebook Flow Since many data scientists live in Jupyter Notebooks, Sketch offers a direct path to embedded AI queries. You can run df.sketch.ask("...") or df.sketch.howto("...") and remain within the environment without context switching to a browser. This synergy with Jupyter makes it an incredibly smooth experience for data exploration, data cleaning, and immediate code generation.
  10. Future of AI-Driven Data Tools The discussion closed on bigger visions for fully automated data pipelines, advanced conversational data analysis, and bridging more complex tasks like model training. Justin’s company, Approximate Labs, aims to unify the steps of data discovery, transformation, and high-level analysis through AI-driven solutions—indicative of a broader industry movement toward more intelligent data platforms.
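The address-parsing example from point 8 gives a feel for the kind of code Sketch's "howto" can hand back. Here is a hypothetical helper of that kind, written by hand with the standard library (the column format "City, ST 12345" and the function name are assumptions for illustration; Sketch's actual generated code may differ):

```python
import re

# Parse a free-form "City, ST 12345" string into its parts.
ADDRESS_RE = re.compile(
    r"^\s*(?P<city>[A-Za-z .'-]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})(?:-\d{4})?\s*$"
)

def parse_address(text):
    match = ADDRESS_RE.match(text)
    if match is None:
        return None  # leave unparseable rows for manual review
    return match.groupdict()

print(parse_address("Portland, OR 97201"))
# -> {'city': 'Portland', 'state': 'OR', 'zip': '97201'}
```

With pandas, the same helper can be applied across a column with df["address"].apply(parse_address), which is exactly the "90% correct, drastically less boilerplate" workflow discussed in the episode.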
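To make the data-sketches idea from point 3 concrete, here is a minimal pure-Python rendition of the HyperLogLog technique: hash each value, use the first few bits to pick a register, and track the longest run of leading zeros seen per register. This is an illustrative sketch of the algorithm, not Sketch's actual implementation:

```python
import hashlib
import math

def _hash64(value):
    # Stable 64-bit hash (SHA-1 prefix) so results are reproducible across runs.
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hll_estimate(values, p=10):
    """Approximate the number of distinct values using 2**p small registers."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        h = _hash64(v)
        idx = h >> (64 - p)               # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)  # remaining bits
        rank = (64 - p) - rest.bit_length() + 1  # leading-zero run, 1-based
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)      # standard bias correction
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:          # small-range (linear counting) correction
        return m * math.log(m / zeros)
    return raw
```

The memory cost is fixed (here 1,024 registers) no matter how large the dataset, which is why a sketch like this can summarize a massive table cheaply enough to hand the summary to an LLM as context.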
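The Lambda Prompt pattern from point 4, turning a prompt template into an ordinary Python function, can be sketched with the standard library. This toy version uses string.Template as a stand-in for lambdaprompt's Jinja templates, and the echo callable below stands in for a real LLM completion endpoint (both are assumptions for illustration, not lambdaprompt's API):

```python
from string import Template

def prompt_function(template_text, complete):
    """Turn a prompt template into a plain Python function.

    `complete` is any callable mapping a prompt string to a response,
    e.g. a wrapper around an LLM API.
    """
    template = Template(template_text)
    def fn(**kwargs):
        return complete(template.substitute(**kwargs))
    return fn

# Stand-in "LLM" that just echoes its prompt, for demonstration.
echo = lambda prompt: f"[model saw] {prompt}"

sql_helper = prompt_function(
    "Write a SQL query that selects $columns from $table.", echo
)
print(sql_helper(columns="name, total", table="orders"))
# -> [model saw] Write a SQL query that selects name, total from orders.
```

Because each prompt becomes a regular function, prompts compose like any other Python code: one function's output can feed another's input, which is how chains for SQL generation or code rewriting are built.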

Interesting Quotes and Stories

  • Justin on missing GitHub Copilot while offline: “I was on a flight recently ... I felt like I was walking through mud instead of running. I realized I’ve become reliant on it in a big way.”
  • On GPU usage in early physics: “At the time, I was using C++ and a distributed LabVIEW project just to move some motors and measure electrons—and I realized, it’s basically convolution kernels that the neural nets were doing, too.”

Key Definitions and Terms

  • Generative AI: A branch of AI that creates new content (text, images, audio) based on training data, often powered by large language models.
  • Data Sketches: Probabilistic data structures (like HyperLogLog) that let you estimate measures like unique counts quickly and with less memory.
  • pandas: A Python library for data manipulation and analysis, providing data structures and operations to manipulate numerical tables and time series.
  • LLM (Large Language Model): A neural network trained on vast text corpora to predict and generate human-like language responses and code.

Learning Resources

Below are a few curated learning resources to help deepen your knowledge and skill set around Python’s data and AI ecosystem:


Overall Takeaway

The rise of conversational AI for data analysis, as showcased by Sketch, signals a major transformation in how Python developers and data scientists interact with their datasets. By integrating language models directly into workflows—whether for code generation, data summarization, or interactive exploration—tools like Sketch and Lambda Prompt streamline repetitive tasks and open up new levels of creativity in data wrangling. This episode shows that, while powerful, AI-based solutions also bring considerations around ethics, licensing, and reliability. Overall, the conversation is a strong testament to Python’s vibrant community and the growing potential for AI-assisted development in everything from data cleaning to advanced analytics.

Links from the show

Sketch: github.com
Lambdaprompt: github.com
Python Bytes 320 - Coverage of Sketch: pythonbytes.fm
ChatGPT: chat.openai.com
Midjourney: midjourney.com
GitHub Copilot: github.com
GitHub Copilot Litigation site: githubcopilotlitigation.com
Attention is All You Need paper: research.google.com
Live Colab Demo: colab.research.google.com
AI Panda from Midjourney: digitaloceanspaces.com
Ray: pypi.org
Apache Arrow: arrow.apache.org

Python Web Apps that Fly with CDNs Course: talkpython.fm
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
