Taming Flaky Tests
Episode Deep Dive
Guests Introduction and Background
Gregory Kapfhammer is a faculty member at Allegheny College with a rich background in software testing and Python usage. He initially discovered Python in a graduate-level AI course and came back to it full force once he started using pytest, thanks to its simplicity and extensive plugin ecosystem. Gregory is especially focused on research around flaky tests and applying modern tools (like fuzz testing and property-based testing) to expose hidden problems in code and tests.
Owain Parry is a PhD student at the University of Sheffield, concentrating his research on flaky tests. Throughout his PhD, he has been studying real-world projects (mostly in Python) to identify, categorize, and help mitigate flakiness in testing. He has created and leveraged pytest plugins to detect non-deterministic behavior within test suites. Owain’s work also addresses how flaky tests affect developer confidence and workflow.
What to Know If You're New to Python
If you’re exploring Python for testing, here are a few quick pointers before digging deeper into flaky tests and testing best practices:
- Familiarize yourself with virtual environments to keep dependencies for different projects separate.
- Understand the basics of Python’s unittest or pytest libraries for test creation (a tiny pytest example follows this list).
- Know that Python has a strong testing culture with many features (like fixtures and decorators) to keep your test code well organized.
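To give a feel for how lightweight pytest is, here is a minimal sketch (the add function and file name are invented for illustration):

```python
# test_calc.py -- pytest discovers files named test_*.py
# and functions named test_* automatically.

def add(a: int, b: int) -> int:
    return a + b

def test_add():
    # A bare assert is all pytest needs; on failure it rewrites
    # the expression to show the values on both sides.
    assert add(2, 3) == 5
```

Run it with `pytest test_calc.py`; no test classes or special assertion methods are required.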
Key Points and Takeaways
- Flaky Tests and Their Impact
Flaky tests are tests that pass and fail unpredictably, even when the source code hasn’t changed. They can degrade developer confidence in test suites and create a “boy who cried wolf” scenario, where real failures risk getting ignored. Identifying flaky tests early is essential to maintain trust in continuous integration and deployment pipelines.
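To make the definition concrete, here is a deliberately contrived sketch (not from the episode) of a test that can pass or fail with no code change:

```python
import random

def test_sampler_is_reliable():
    # Flaky by construction: passes roughly 90% of the time.
    # Real flaky tests hide the same nondeterminism behind timing,
    # test ordering, or shared state instead of an explicit
    # random call, which is what makes them hard to spot.
    assert random.random() < 0.9
```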
- Root Causes of Flakiness
Many flaky tests arise from shared resources or external services: improper cleanup, incomplete setup, and environment differences can all cause intermittent failures. Parallel test execution or concurrency features in frameworks often expose these issues more quickly. (A fixture-cleanup sketch follows the links below.)
- Tools / Links:
- pytest fixtures
- pytest-socket plugin (to block unwanted network calls during tests)
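As a minimal sketch of the cleanup point (the environment variable here is hypothetical), a pytest yield fixture guarantees teardown runs even when a test fails, so state cannot leak into later tests:

```python
import os
import pytest

@pytest.fixture
def api_url():
    # Setup: point the code under test at a local stand-in service.
    os.environ["API_URL"] = "http://localhost:9999"  # hypothetical variable
    yield os.environ["API_URL"]
    # Teardown: runs even if the test failed; a leaked environment
    # variable is a classic shared-state source of flakiness.
    os.environ.pop("API_URL", None)

def test_uses_local_service(api_url):
    assert api_url.startswith("http://localhost")
```

In practice, pytest’s built-in monkeypatch fixture (monkeypatch.setenv) restores the environment automatically; the broader point is that teardown must run unconditionally.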
- Detecting Order-Dependent Tests
Tests that rely on a specific order can be uncovered by running your suite in random order or in parallel. If you see sporadic failures once the execution order is shuffled, you likely have a hidden dependency between tests. (A contrived example follows the links below.)
- Tools / Links:
- pytest-randomly
- pytest-xdist (for parallel test runs)
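Here is a contrived pair of order-dependent tests (the shared dictionary is invented for illustration); installing pytest-randomly shuffles execution order on each run and surfaces the hidden dependency:

```python
# Module-level state shared between tests: an accidental coupling.
CACHE = {}

def test_populate_cache():
    CACHE["user"] = "alice"
    assert CACHE["user"] == "alice"

def test_read_cache():
    # Passes only if test_populate_cache already ran; under a
    # shuffled or parallel order this fails intermittently.
    assert CACHE.get("user") == "alice"
```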
- Mitigation Strategies
Approaches vary from quarantining known flaky tests to systematically rerunning them. Some organizations track and measure test flakiness to spot the worst offenders, but success relies on promptly fixing these tests rather than ignoring or permanently disabling them. (A rerun sketch follows the links below.)
- Tools / Links:
- Cypress Flaky Test Management
- Spotify’s “Flaky Bot” concept (not open source, but an example approach mentioned in the discussion)
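One hedged sketch of the rerun approach, using the pytest-rerunfailures plugin (not named in the episode): mark a known-flaky test so it retries before being reported as failed.

```python
import random
import pytest

# Requires pytest-rerunfailures (pip install pytest-rerunfailures).
# Rerunning masks the symptom; track how often reruns are needed
# and fix the root cause rather than leaving the marker forever.
@pytest.mark.flaky(reruns=3, reruns_delay=1)
def test_occasionally_slow_service():
    # Stand-in for a call that intermittently times out.
    response_ok = random.random() < 0.7
    assert response_ok
```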
- When Flakiness Hides Real Bugs
Not all flaky tests are poorly written. Sometimes they expose genuine, intermittent bugs—like race conditions, unexpected resource contention, or mistaken assumptions. It’s valuable to examine failures rather than dismiss them as “just flaky.”
- Value of End-to-End and Integration Tests
Though more prone to flakiness, end-to-end tests catch real issues across multiple layers of an application. With careful setup/teardown or mocking of external services (see the monkeypatch sketch below), you can reduce flakiness while still benefiting from broader coverage.
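A minimal sketch of the mocking idea, assuming a hypothetical RateClient that would normally hit the network; pytest’s built-in monkeypatch fixture swaps in a deterministic stub:

```python
import pytest

class RateClient:
    """Stand-in for a client that hits a real HTTP API (hypothetical)."""
    def get_rate(self, currency: str) -> float:
        raise RuntimeError("real network call: slow and flaky in CI")

def convert(client: RateClient, amount: float, currency: str) -> float:
    return amount * client.get_rate(currency)

def test_convert(monkeypatch):
    client = RateClient()
    # Replace the network call with a deterministic stub so the
    # integration path is exercised without external flakiness.
    monkeypatch.setattr(client, "get_rate", lambda currency: 1.25)
    assert convert(client, 100.0, "EUR") == 125.0
```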
- Property-Based and Randomized Testing
Tools like Hypothesis can discover edge cases by automatically generating inputs. Tests failing “randomly” could indicate deeper issues in your code. Additionally, repeated runs in different environments might expose bugs you otherwise miss. A small Hypothesis example follows below.
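A small Hypothesis sketch (the property tested here is invented for illustration): the framework generates many inputs, including the edge cases a hand-written example would miss.

```python
from hypothesis import given
from hypothesis import strategies as st

@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    # Hypothesis feeds in many generated lists: empty, duplicates,
    # extreme values. A failure here is reproducible because
    # Hypothesis records and replays the failing input.
    once = sorted(xs)
    assert sorted(once) == once
```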
- Efficient Feedback Loops
Developers often run subsets of tests locally and reserve full suites for CI to save time. Coverage data from tools such as coverage.py can help identify which tests are relevant to your changes; the full suite (randomly ordered) should still run periodically in CI to catch unexpected dependencies.
- Fixture Scope and Setup/Teardown
Mistakes with fixture scope (e.g., using class-scoped when you meant function-scoped) lead to tests sharing state. Failing to clean up databases and files, or neglecting to mock out external services, can inadvertently cause data leakage or resource locking, creating flakiness; the sketch below contrasts the two scopes.
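A minimal sketch contrasting the two scopes (the fixtures are invented for illustration); the module-scoped list is shared across tests, so mutating it couples their outcomes to execution order:

```python
import pytest

@pytest.fixture(scope="module")
def shared_items():
    # One list for the whole module: every test sees prior mutations.
    return []

@pytest.fixture  # default scope="function": a fresh list per test
def fresh_items():
    return []

def test_shared_state(shared_items):
    shared_items.append(1)
    # Passes only while this is the first test to touch the fixture;
    # any other test appending first would make it fail.
    assert shared_items == [1]

def test_isolated_state(fresh_items):
    fresh_items.append(1)
    assert fresh_items == [1]  # always passes: no shared state
```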
- Psychological Effect and Developer Trust
Flaky tests can diminish trust in a test suite. When developers repeatedly see “broken” builds caused by intermittent failures, they might ignore or disable tests altogether, losing the protection and benefits that tests bring.
Interesting Quotes and Stories
- “Sometimes a flaky test is a silver lining because it reveals you never really accounted for concurrency or environment differences.” – Gregory Kapfhammer
- “If you just rerun flaky tests and call it good when they pass once, you might miss the fact that your system has a real nondeterministic bug.” – Owain Parry
- “Once developers lose trust in their tests, they tend to ignore even legitimate failures.” – Observed throughout the episode as a key psychological challenge
Key Definitions and Terms
- Flaky Test: A test that can pass or fail inconsistently without any changes to the code or test itself.
- Order-Dependent Test: A test whose outcome is influenced by the tests that ran before it.
- Test Quarantine: Temporarily isolating known flaky tests from the main pipeline to avoid breaking builds.
- Property-Based Testing: A testing style where the framework generates various inputs to verify broad properties or behaviors of code.
- Fixture Scope: In pytest, defines the lifetime and sharing mode (function, class, module, etc.) for setup data.
Learning Resources
- Getting started with pytest: Deepen your knowledge of writing and organizing tests, fixtures, parametric tests, and more.
- Python Memory Management and Tips: While not directly test-focused, better understanding Python’s internals can sometimes help with concurrency or memory-related flakiness in tests.
Overall Takeaway
Flaky tests may feel like a nuisance, but they are often key signals of deeper issues in code or assumptions about the environment. By tracking down these root causes—whether they stem from shared resources, concurrency, insufficient teardown, or real hidden bugs—teams can restore developer trust and improve overall software quality. Taking advantage of plugins, proper test design, and consistent review practices will help you keep flaky tests at bay, delivering a stronger, more reliable continuous integration story.
Links from the show
Owain Parry on Twitter: @oparry9
Radon: pypi.org
pytest-xdist: github.com
awesome-pytest: github.com
Tenacity: readthedocs.io
Stamina: github.com
Flaky Test Management: docs.cypress.io
Flaky Test Management (Datadog): datadoghq.com
Flaky Test Management (Spotify): engineering.atspotify.com
Flaky Test Management (Google): testing.googleblog.com
Detecting Test Pollution: github.com
Surveying the developer experience of flaky tests paper: www.gregorykapfhammer.com
Buildkite CI/CD: buildkite.com
Flake It: Finding and Fixing Flaky Test Cases: github.com
Unflakable: unflakable.com
CircleCI Test Detection: circleci.com
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm
--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy