
Test-Driven Prompt Engineering for LLMs with Promptimize

Episode #417, published Tue, May 30, 2023, recorded Mon, May 22, 2023

Large language models and chat-based AIs are kind of mind blowing at the moment. Many of us are playing with them for working on code or just as a fun alternative to search. But others of us are building applications with AI at the core. And when doing that, the slightly unpredictable, probabilistic nature of LLMs makes writing and testing Python code very tricky. Enter promptimize from Maxime Beauchemin and Preset. It's a framework for non-deterministic testing of LLMs inside our applications. Let's dive inside the AIs with Max.

Watch this episode on YouTube
Watch the live stream version

Episode Deep Dive

Guests Introduction and Background

Maxime Beauchemin is a seasoned data engineer, open-source contributor, and entrepreneur. He created Apache Airflow and Apache Superset—both widely used data engineering and data visualization frameworks—and founded Preset, which offers a managed version of Superset. These days, Max is exploring how AI can help data analytics teams, particularly around prompt engineering for large language models (LLMs). In this episode, he discusses his new project, Promptimize, a tool designed to bring test-driven best practices to the uncertain world of LLM-based applications.

Key Points and Takeaways

  1. The Need for TDD in Prompt Engineering: LLM-powered apps often produce outputs that vary due to their probabilistic nature. This unpredictability makes typical “unit test” patterns insufficient or too rigid. A test-driven approach that tolerates shades of correctness—rather than strict pass/fail—lets you gradually improve prompt reliability.
  2. Prompt Crafting vs. Prompt Engineering: The episode draws a distinction between on-the-fly “prompt crafting” for one-off tasks and “prompt engineering” with structured, repeatable processes. Prompt engineering involves programmatic context management, output formatting instructions, and tests to measure performance across many queries.
    • Tools / Concepts:
      • LangChain
      • “Few-shot” / “Zero-shot” prompting
  3. Unpredictability in LLMs and Why It Matters: Unlike typical APIs, LLMs can change their output due to training updates or small prompt variations. This requires ongoing monitoring and re-validation rather than a one-and-done approach. Max compares it to “web scraping,” where external changes break your code unexpectedly.
    • Tools / Concepts:
      • “Temperature” setting in LLMs
      • Context window limitations
  4. How Promptimize Works: Promptimize is a Python-based framework that automates sending prompts to an LLM, capturing responses, measuring accuracy or relevancy, and compiling reports. By storing outcomes in a report file, developers can track improvements or regressions across prompts over time—similar to test coverage in classical TDD.
  5. Text-to-SQL as a Prime Use Case: One of Max’s main motivations is to use LLMs to generate SQL queries automatically. By providing the schema and table details in the prompt, the model can produce complex queries on behalf of the user. Promptimize helps ensure those queries stay correct and relevant as prompts and database schemas evolve.
  6. Scoring and Partial Success: Traditional unit tests expect a simple pass/fail. With LLM-driven features, partial correctness can still be valuable. Promptimize allows weighting test importance, quantifying partial matches (e.g., 70% accurate), and incrementally refining prompts for better results rather than discarding them outright.
    • Concepts:
      • Weighted scoring
      • Structured vs. free-form output
  7. Managing Cost and Performance: Calling large language models can be both time-consuming and expensive, especially with bigger models like GPT-4. Promptimize helps teams measure how various prompt versions affect response times and token usage, allowing data-driven decisions about cost vs. accuracy.
    • Tools / Concepts:
      • “Turbo” models from OpenAI
      • Performance measurement
  8. Bringing AI into Established Products: Max notes that integrating AI within a business app or data tool (e.g., Superset) adds new complexities. He sees TDD for prompts as essential to mitigate risk when shipping uncertain AI capabilities to thousands of users.
    • Tools / References:
      • Apache Airflow (from Max’s background)
      • Vector databases (for context retrieval)
  9. Using LLMs to Improve Everyday Coding: Beyond text-to-SQL, the discussion covers how ChatGPT or Bard can accelerate Python development. They can generate functions, docstrings, or entire modules quickly. Combined with TDD, devs can iterate, refine, and trust the final output more than ad hoc code generation.
  10. Handling Rapid AI Evolution: The AI space is moving so quickly that what works today may need revisiting in a few weeks. Storing a library of test prompts (like a “test suite”) ensures you won’t lose ground when upgrading or swapping models. Promptimize or a similar test library can anchor your progress.
    • Concepts:
      • “Evaluations” or “evals” for LLMs
      • Future-proofing your prompts
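
The mechanics behind points 4 and 6 (running each prompt as a test case and giving partial credit rather than a strict pass/fail) can be sketched in plain Python. All names here are hypothetical illustrations, not the actual Promptimize API:

```python
# Sketch: weighted, partial-credit scoring for LLM prompt tests.
# `PromptCase`, `evaluate`, and `fake_llm` are illustrative stand-ins,
# not the real Promptimize API.
from dataclasses import dataclass

@dataclass
class PromptCase:
    prompt: str
    expected_keywords: list  # tokens we expect somewhere in the response
    weight: float = 1.0      # importance of this case within the suite

    def score(self, response: str) -> float:
        """Fraction of expected keywords present: partial credit, not pass/fail."""
        hits = sum(kw.lower() in response.lower() for kw in self.expected_keywords)
        return hits / len(self.expected_keywords)

def evaluate(cases, run_llm):
    """Run every case through the model and return the weighted suite score in [0, 1]."""
    total = sum(c.weight for c in cases)
    return sum(c.score(run_llm(c.prompt)) * c.weight for c in cases) / total

# A fake, deterministic "LLM" so the example runs without an API key.
def fake_llm(prompt):
    return "SELECT customer, revenue FROM sales ORDER BY revenue DESC"

suite = [
    PromptCase("Top customers by revenue, as SQL", ["select", "revenue", "order by"], weight=2.0),
    PromptCase("Row count of the sales table, as SQL", ["count", "sales"], weight=1.0),
]
print(round(evaluate(suite, fake_llm), 2))  # a suite score between 0 and 1
```

Stored across runs, a suite score like this lets you compare prompt versions or model upgrades much as test coverage tracks a classical codebase.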
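
The schema-in-the-prompt approach from point 5 can be sketched as follows; the table definition and instruction wording are illustrative, not taken from Superset or Promptimize:

```python
# Sketch: embedding a database schema plus output-format instructions
# into a text-to-SQL prompt. Schema and wording are illustrative.
SCHEMA = """\
Table: sales
  order_id   INTEGER
  customer   TEXT
  revenue    REAL
  ordered_at DATE"""

def build_sql_prompt(question: str, schema: str = SCHEMA) -> str:
    """Combine the schema, strict formatting rules, and the user question."""
    return (
        "You are a SQL assistant. Using only the schema below, reply with "
        "a single SQL query and nothing else.\n\n"
        f"{schema}\n\nQuestion: {question}\nSQL:"
    )

print(build_sql_prompt("Total revenue per customer, highest first"))
```

Because the schema travels inside the prompt, it counts against the model's context window, which is why wide schemas run into the context window limitations mentioned above.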
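
Point 7's cost and latency tracking can be sketched with nothing more than a timer and a rough token estimate; `rough_tokens` is a hypothetical approximation, since real APIs report exact token usage per call:

```python
# Sketch: measuring latency and approximate token cost per LLM call.
import time

def rough_tokens(text: str) -> int:
    # Crude approximation (about 4 characters per token); real APIs
    # return exact token counts in their responses.
    return max(1, len(text) // 4)

def timed_call(run_llm, prompt):
    """Return the response plus wall-clock latency and an approximate token count."""
    start = time.perf_counter()
    response = run_llm(prompt)
    latency = time.perf_counter() - start
    return response, latency, rough_tokens(prompt) + rough_tokens(response)

def fake_llm(prompt):
    return "SELECT 1"

resp, secs, tokens = timed_call(fake_llm, "Ping the database, as SQL")
print(f"{secs:.4f}s, ~{tokens} tokens")
```

Aggregating these numbers across a prompt suite is what lets a team weigh a cheaper "turbo" model against a slower, more accurate one.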

Interesting Quotes and Stories

  • The Bird Incident: In a lighthearted moment, a bird literally flew into Max’s home office during the recording. While unexpected, it underscored how unpredictability can strike anywhere—whether in a live podcast or an AI’s output.
  • On the AI Tipping Point: “Everything is changing so fast in this space. It feels like we’re working on unsettled ground. But TDD for prompts keeps me oriented.” (Max Beauchemin)

Key Definitions and Terms

  • LLM (Large Language Model): A neural network trained on massive text corpora, capable of generating context-aware text.
  • Prompt Crafting: Writing instructions or sample data to guide an LLM for a single, immediate answer.
  • Prompt Engineering: A more systematic, repeatable approach to providing context, constraints, and test coverage for LLM interactions.
  • Context Window: The maximum amount of text (tokens) a model can consider in a single prompt.
  • Temperature: A setting controlling how deterministic or “creative” the model’s next-word predictions should be.
  • Few-Shot / Zero-Shot: Prompting strategies where “few-shot” includes some examples and “zero-shot” supplies no examples.
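
To make the last definition concrete, here is the same task phrased zero-shot and few-shot; the example sentences are illustrative:

```python
# Zero-shot: the instruction alone. Few-shot: the instruction preceded by
# worked examples that show the model the expected answer format.
task = "Classify the sentiment of: 'The new release is fantastic.'"

zero_shot = task

few_shot = (
    "Classify the sentiment of: 'I waited an hour for support.' -> negative\n"
    "Classify the sentiment of: 'Setup took thirty seconds.' -> positive\n"
    f"{task} ->"
)

print(zero_shot)
print(few_shot)
```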

Learning Resources

If you'd like to dive deeper into Python testing, AI, and the broader Python ecosystem mentioned in this episode, start with the links from the show at the bottom of this page.

Overall Takeaway

Today’s LLM-driven software challenges the usual development and testing patterns. By adapting TDD to prompt engineering, developers can mitigate the unpredictability of AI outputs and continually improve them. Whether it’s auto-generating SQL or coding Python functions on the fly, building a test suite for your prompts ensures your AI features remain reliable, auditable, and ready to evolve alongside the rapidly changing AI landscape.

And now for a poem

In test-driven dreams of code and AI,
Max wields his prompts with watchful eye,
From Airflow's flight to Superset's grace,
He ventures forth in the LLM space.

Chance words spark new paths to roam,
Yet Promptimize tethers them home,
Scoring each phrase, weighting each score,
To keep creation’s risks off the floor.

SQL spins from data’s clay,
A scripted dance to show the way,
While we refine, iterate, then gauge—
TDD for prompts on the wild LLM stage.

Links from the show

Max on Twitter: @mistercrunch
Promptimize: github.com
Introducing Promptimize ("the blog post"): preset.io
Preset: preset.io
Apache Superset: Modern Data Exploration Platform episode: talkpython.fm
ChatGPT: chat.openai.com
LeMUR: assemblyai.com
Microsoft Security Copilot: blogs.microsoft.com
AutoGPT: github.com
Midjourney: midjourney.com
Midjourney generated pytest tips thumbnail: talkpython.fm
Midjourney generated radio astronomy thumbnail: talkpython.fm
Prompt engineering: learnprompting.org
Michael's ChatGPT result for scraping Talk Python episodes: github.com
Apache Airflow: github.com
Apache Superset: github.com
Tay AI Goes Bad: theverge.com
LangChain: github.com
LangChain Cookbook: github.com
Promptimize Python Examples: github.com
TLDR AI: tldr.tech
AI Tool List: futuretools.io
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
