Open Source Sports Analytics with PySport
Episode Deep Dive
Guests Introduction and Background
Koen Vossen is the maintainer and driving force behind PySport, a community-driven project that aggregates Python (and non-Python) open-source sports analytics libraries. He’s been writing code since his early days experimenting with LEGO Mindstorms and Visual Basic. Koen later learned Python professionally and now runs his own company, TeamTV, where they combine performance analytics and video for sports clubs. He’s also been active in PyData Eindhoven, helping bring people together to discuss data science and open source.
What to Know If You're New to Python
If you’re just getting started with Python and want to follow along with how Koen and others handle sports data, here are a few prerequisites:
- A basic grasp of Python’s ecosystem for data: Pandas and dataframes, simple scripts, and notebooks.
- Some familiarity with web requests (e.g., the requests library or similar) is helpful for working with live sports data or APIs.
- Knowing how to organize code into reusable modules is useful if you decide to build or contribute to a sports analytics library.
- You can set up a simple environment using venv or another environment manager to isolate your sports-analytics projects.
Key Points and Takeaways
- PySport as a Community Hub: Koen created PySport to unify sports analytics tools and connect both hobbyists and professional clubs. It serves as an “awesome list” of packages across multiple sports (NBA, NFL, F1, and beyond) and different programming languages. It takes the guesswork out of finding high-quality libraries by showing each project’s documentation, last commit date, and number of contributors.
- Kloppy for Soccer Data Standardization: Much of the soccer analytics world has struggled with inconsistent file formats and data layouts. Koen’s package, Kloppy, addresses this by standardizing disparate event and tracking data. Multiple contributors, including club analysts, have helped shape Kloppy so that it turns raw soccer data into consistent, Pandas-friendly formats for deeper analysis.
- Scraping vs. Official Data Sources: While some leagues and vendors offer open APIs, many data sets remain behind paywalls or are reachable only through unofficial endpoints. This forces developers to rely on community-built scrapers, which can break if the target site changes its layout or policies. PySport attempts to catalog these scrapers but also encourages caution about legality and long-term stability.
- Links / Tools:
- PyBall (NBA)
- MLBGame (Baseball) (less frequently updated)
- Data Availability Challenges: Sports data can be very rich, sometimes capturing the position of every player and the ball 25 times per second. Yet official tracking data is often locked away in proprietary formats or restrictive licensing deals. Projects like StatsBomb help by releasing free samples, but the overall ecosystem still faces hurdles in getting open, high-fidelity data.
- Motorsports Analytics with FastF1: Formula 1 (F1) data is surprisingly well-structured, covering telemetry, timing, track layout, and driver positions. The FastF1 package integrates closely with Pandas to give you session info, telemetry, and visualizations. It is a good example of how open or partially open data can fuel in-depth analysis and advanced plotting.
- Bridging Clubs, Research, and Fans: Koen sees PySport as a vehicle to bring professional clubs, open-source developers, and sports enthusiasts together. Many clubs already rely on open-source Python, but they need a nudge and a starting point. PySport’s curated list and community guidelines make it easier for domain experts (like video analysts) to cross over into coding and share improvements.
- Links / Tools:
- PyData Eindhoven (Community example)
- Common Sports Analytics Models (e.g., xG): Advanced metrics like expected goals (xG) in soccer quantify how likely a shot is to become a goal, depending on its context. Tools and data sources like socceraction, StatsBomb, and Kloppy help produce or work with these metrics. The approach could also transfer to similar “goal-based” sports such as hockey, where you evaluate shot quality.
- Visualization Libraries (mplsoccer, FastF1, more): Beautiful visuals drive home sports analytics insights. The mplsoccer package draws soccer pitch layouts with Matplotlib so you can plot data on top of them. Meanwhile, FastF1 includes built-in telemetry and track-position charts. Such libraries often come preconfigured with sports-specific templates, saving hours of custom drawing.
- JupyterLite Playground for Zero-Install: PySport hosts a Playground that runs on JupyterLite and Pyodide (WebAssembly), letting visitors try soccer analytics without installing anything. This environment patches certain libraries, such as requests, so you can fetch data in the browser. It’s especially helpful for new contributors or analysts who don’t want to manage local Python installs.
- DuckDB for Pythonic Querying: Toward the end of the episode, Koen highlighted DuckDB as an in-process SQL engine that integrates smoothly with Pandas and Parquet files. It lets you run SQL queries directly against existing DataFrames and files, which makes data exploration for sports analytics both powerful and straightforward.
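To make the data standardization point from the key points above a little more concrete, here is a minimal sketch that maps two made-up vendor event formats onto one common record type. It uses only the standard library and is not the API of any real package; the vendor layouts, field names, pitch dimensions, and the Event class are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Common schema: time in seconds, pitch coordinates scaled to 0..1."""
    timestamp: float
    player: str
    event_type: str
    x: float
    y: float


def from_vendor_a(row: dict) -> Event:
    # Hypothetical vendor A: clock as "MM:SS", coordinates in percent (0..100).
    minutes, seconds = row["clock"].split(":")
    return Event(
        timestamp=int(minutes) * 60 + float(seconds),
        player=row["player_name"],
        event_type=row["type"].lower(),
        x=row["x_pct"] / 100,
        y=row["y_pct"] / 100,
    )


def from_vendor_b(row: dict, pitch_length: float = 105.0, pitch_width: float = 68.0) -> Event:
    # Hypothetical vendor B: time in milliseconds, coordinates in meters.
    return Event(
        timestamp=row["t_ms"] / 1000,
        player=row["who"],
        event_type=row["action"].lower(),
        x=row["pos"][0] / pitch_length,
        y=row["pos"][1] / pitch_width,
    )


# The same shot as each (made-up) vendor might report it.
raw_a = {"clock": "12:30", "player_name": "Player 9", "type": "SHOT", "x_pct": 88.0, "y_pct": 50.0}
raw_b = {"t_ms": 750000, "who": "Player 9", "action": "Shot", "pos": (92.4, 34.0)}

events = [from_vendor_a(raw_a), from_vendor_b(raw_b)]
for event in events:
    print(event)
```

Once every source is converted to the same schema, downstream analysis code only has to be written once, which is the repetitive parsing work a shared library can take off your hands.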
Interesting Quotes and Stories
- On repetitive data transformations: “I noticed 80% of the code was just parsing the data to a useful format. It’s always the same code—why not build a library to fix that?”
- On bridging open source with professional clubs: “Lots of clubs already use open source, but they don’t always know how or who to collaborate with. PySport tries to make people aware of everything that’s already built.”
Key Definitions and Terms
- Tracking Data: High-frequency player or object positions (e.g., 25 frames/sec for each player). Often generated via camera systems and computer vision.
- Event Data: Discrete events in a game, such as passes, shots, and turnovers, commonly represented with timestamps, player info, and field coordinates.
- xG (Expected Goals): A model to quantify the likelihood of scoring given the shot context (position, defenders, angle, etc.).
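- WebAssembly (Pyodide/JupyterLite): A binary format that browsers can execute. Pyodide is a build of CPython compiled to WebAssembly, and JupyterLite uses it to run Python notebooks entirely in the browser, no server required.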
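As a rough illustration of the xG definition above, the toy sketch below turns a shot’s distance and goal-mouth angle into a probability with a logistic function. The coefficients are invented for illustration and not fitted to any data; real xG models are trained on large shot data sets and use more context (defenders, body part, and so on). The coordinate convention (meters, goal center at the origin) and the function names are assumptions.

```python
import math

GOAL_WIDTH = 7.32  # meters, standard soccer goal


def shot_features(x: float, y: float) -> tuple[float, float]:
    """Return (distance to goal center, angle subtended by the goal mouth).

    Assumes the goal center is at (0, 0), x is the distance from the goal
    line in meters, and y is the lateral offset in meters.
    """
    distance = math.hypot(x, y)
    # Angle between the lines from the shot location to each post.
    left = math.atan2(GOAL_WIDTH / 2 - y, x)
    right = math.atan2(-GOAL_WIDTH / 2 - y, x)
    return distance, left - right


def toy_xg(x: float, y: float) -> float:
    """Logistic model with illustrative, unfitted coefficients."""
    distance, angle = shot_features(x, y)
    intercept, w_distance, w_angle = -1.0, -0.1, 1.5  # hypothetical values
    z = intercept + w_distance * distance + w_angle * angle
    return 1 / (1 + math.exp(-z))


penalty_spot = toy_xg(11.0, 0.0)
edge_of_box = toy_xg(16.5, 10.0)
print(f"penalty spot: {penalty_spot:.2f}, wide at edge of box: {edge_of_box:.2f}")
```

With these made-up weights, a central shot from the penalty spot scores higher than a wide shot from the edge of the box, which matches the intuition that closer shots with a wider view of the goal are more likely to go in.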
Learning Resources
Below are some helpful Python courses to deepen your data-focused development.
- Data Science Jumpstart with 10 Projects: Build a strong foundation in data analysis with 10 hands-on projects.
- Python Data Visualization: Learn to create beautiful and insightful visualizations to deepen your sports analytics.
- Move from Excel to Python with Pandas: Transition your data workflows from Excel to code-driven analytics in Python.
Overall Takeaway
The sports analytics community is vibrant and ever-growing, with Python playing a central role in everything from scraping and data cleanup to advanced metrics and rich visualizations. PySport exemplifies the spirit of open collaboration by curating diverse tools, fostering contributions, and bridging the gap between professional clubs, hobbyists, and data scientists. Whether you’re a seasoned developer or just getting started, there’s ample room to join this community, tackle rich data challenges, and build truly innovative sports analytics solutions.
Links from the show
PySport on Twitter: @PySportOrg
Calling R from Python: medium.com
DuckDB: duckdb.org
PySport Playground: playground.pysport.org
NFLVerse: github.com
NBA Stats: nba.com
Sports Databases: opensource.pysport.org
Data sets: opensource.pysport.org
Visualizations: opensource.pysport.org
I/O: opensource.pysport.org
Models: opensource.pysport.org
Scrapers/APIs: opensource.pysport.org
Fast F1: docs.fastf1.dev
Fast F1 graphics: docs.fastf1.dev
Pysport Intro: pysport.org
New Talk Python Training Apps: talkpython.fm
Michael's blog post about the apps: mkennedy.codes
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm