Open Source Sports Analytics with PySport

Episode #416, published Mon, May 22, 2023, recorded Thu, May 11, 2023

If you're looking for fun data sets for learning, for teaching, or maybe for a conference talk, or you're just really into sports, sports offer a continuous stream of rich data that many people can relate to. Yet accessing that data can be tricky. Sometimes it's locked away in obscure file formats. Other times, the data exists but there's no clear API to access it. On this episode, we talk about PySport, something of an awesome list of libraries (mostly but not all Python) for accessing a wide variety of sports data from the NFL, NBA, F1, and more. We have Koen Vossen, maintainer of PySport, here to talk through some of the more popular projects.

Episode Deep Dive

Guests Introduction and Background

Koen Vossen is the maintainer and driving force behind PySport, a community-driven project that aggregates Python (and non-Python) open-source sports analytics libraries. He’s been writing code since his early days experimenting with LEGO Mindstorms and Visual Basic. Koen later learned Python professionally and now runs his own company, TeamTV, where they combine performance analytics and video for sports clubs. He’s also been active in PyData Eindhoven, helping bring people together to discuss data science and open source.

What to Know If You're New to Python

If you’re just getting started with Python and want to follow along with how Koen and others handle sports data, here are a few prerequisites:

  • A basic grasp of Python’s ecosystem for data: Pandas and dataframes, simple scripts, and notebooks.
  • Some familiarity with web requests (e.g., requests or similar libraries) is helpful for working with live sports data or APIs.
  • Knowing how to organize code into reusable modules is useful if you decide to build or contribute to a sports analytics library.
  • You can set up a simple environment using venv or another environment manager to isolate your sports-analytics projects.

Key Points and Takeaways

  1. PySport as a Community Hub: Koen created PySport to unify sports analytics tools and connect both hobbyists and professional clubs. It serves as an “awesome list” of packages across multiple sports (NBA, NFL, F1, and beyond) and different programming languages. It takes the guesswork out of finding high-quality libraries by showing each project’s documentation, last commit date, and number of contributors.
  2. Kloppy for Soccer Data Standardization: Much of the soccer analytics world has struggled with inconsistent file formats and data layouts. Koen’s package, Kloppy, addresses this by standardizing event and tracking data from different vendors. Multiple contributors, including club analysts, have shaped Kloppy so that it turns raw soccer data into consistent, Pandas-friendly formats for deeper analysis.
  3. Scraping vs. Official Data Sources: While some leagues and vendors offer open APIs, many data sets remain behind paywalls or are reachable only through unofficial endpoints. This forces developers to rely on community-built scrapers, which can break whenever the target site changes its layout or policies. PySport catalogs these scrapers but also encourages caution about legality and long-term stability.
  4. Data Availability Challenges: Sports data can be very rich. Tracking data, for example, can record the position of every player and the ball at 25 frames per second. Yet official tracking data is often locked away in proprietary formats or restricted by licensing deals. Vendors like StatsBomb help by releasing free samples, but the overall ecosystem still faces hurdles in getting open, high-fidelity data.
  5. Motorsports Analytics with FastF1: Formula 1 (F1) data is surprisingly well-structured: telemetry, timing, track layout, and driver positions. The FastF1 package integrates closely with Pandas to give you session info, telemetry, and visualizations. It is a terrific example of how open or partially open data can fuel in-depth analysis and advanced plotting.
  6. Bridging Clubs, Research, and Fans: Koen sees PySport as a vehicle to bring professional clubs, open-source developers, and sports enthusiasts together. Many clubs do rely on open-source Python, but they need a nudge and a starting point. PySport’s curated list and community guidelines make it easier for domain experts (like video analysts) to cross over into coding and share improvements.
  7. Common Sports Analytics Models (e.g., xG): Advanced metrics like expected goals (xG) in soccer quantify how likely a shot is to become a goal, depending on context. Tools like socceraction, StatsBomb, and Kloppy help generate these metrics. The approach could also transfer to similar “goal-based” sports such as hockey, where you evaluate shot quality.
  8. Visualization Libraries (mplsoccer, FastF1, more): Beautiful visuals drive home sports analytics insights. The mplsoccer package combines Matplotlib plots with soccer pitch layouts. Meanwhile, FastF1 includes built-in telemetry and track-position charts. Such libraries are often preconfigured with sports-specific templates, saving hours of custom drawing.
  9. JupyterLite Playground for Zero-Install Exploration: PySport hosts a Playground that runs on JupyterLite and Pyodide (WebAssembly), letting visitors try soccer analytics without installing anything. This environment patches certain libraries, such as requests, so you can fetch data in the browser. It’s especially helpful for new contributors or analysts who don’t want to manage local Python installs.
  10. DuckDB for Pythonic Querying: Toward the end of the episode, Koen highlighted DuckDB as an in-process SQL engine that integrates smoothly with Pandas and Parquet files. It’s great for quickly running SQL queries on top of existing data structures, making data exploration for sports analytics both powerful and straightforward.
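
To make the standardization idea from point 2 concrete, here is a minimal sketch in plain Python. It is not Kloppy's API: the two vendor layouts, their field names, the 105 x 68 m pitch size, and the Event class are all hypothetical, chosen only to show how differently shaped records can be mapped onto one shared schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical common schema that every vendor format is mapped onto."""
    timestamp_s: float  # seconds since the start of the period
    player: str
    event_type: str     # lower-cased event name, e.g. "shot"
    x: float            # position along the pitch length, normalized to 0..1
    y: float            # position along the pitch width, normalized to 0..1


def from_vendor_a(row: dict) -> Event:
    # Hypothetical vendor A: time in milliseconds, coordinates on a 0..100 grid.
    return Event(
        timestamp_s=row["time_ms"] / 1000,
        player=row["player_name"],
        event_type=row["type"].lower(),
        x=row["x"] / 100,
        y=row["y"] / 100,
    )


def from_vendor_b(row: dict) -> Event:
    # Hypothetical vendor B: "mm:ss" clock strings, coordinates in meters
    # on an assumed 105 x 68 m pitch.
    minutes, seconds = row["clock"].split(":")
    return Event(
        timestamp_s=int(minutes) * 60 + float(seconds),
        player=row["who"],
        event_type=row["action"].lower(),
        x=row["pos"][0] / 105,
        y=row["pos"][1] / 68,
    )


# The same shot, described in two different (made-up) vendor layouts.
events = [
    from_vendor_a({"time_ms": 61500, "player_name": "Player 9",
                   "type": "SHOT", "x": 88, "y": 50}),
    from_vendor_b({"clock": "01:01.5", "who": "Player 9",
                   "action": "Shot", "pos": (92.4, 34.0)}),
]

for event in events:
    print(event)
```

Once every source is converted to the same Event shape, downstream code such as filters, metrics, and plots only has to be written once. That is the kind of repeated parsing work the episode describes a shared library taking over.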

Interesting Quotes and Stories

  • On repetitive data transformations: “I noticed 80% of the code was just parsing the data to a useful format. It’s always the same code—why not build a library to fix that?”
  • On bridging open source with professional clubs: “Lots of clubs already use open source, but they don’t always know how or who to collaborate with. PySport tries to make people aware of everything that’s already built.”

Key Definitions and Terms

  • Tracking Data: High-frequency player or object positions (e.g., 25 frames/sec for each player). Often generated via camera systems and computer vision.
  • Event Data: Discrete events in a game such as passes, shots, turnovers, commonly represented with timestamps, player info, and field coordinates.
  • xG (Expected Goals): A model to quantify the likelihood of scoring given the shot context (position, defenders, angle, etc.).
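  • WebAssembly (Pyodide/JupyterLite): A binary format that browsers can execute. Pyodide is CPython compiled to WebAssembly, and JupyterLite builds on it to run Python notebooks entirely in the browser, with no server-side kernel required.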
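
As a rough illustration of the xG definition above, the sketch below scores a shot with a logistic function of two context features: distance to the goal and the angle the goal mouth subtends from the shot location. This is one common textbook formulation, not the model used by any tool named in the episode, and the coefficients are made-up placeholders rather than fitted values.

```python
import math

GOAL_WIDTH_M = 7.32  # width of a regulation soccer goal in meters


def shot_features(x_m: float, y_m: float) -> tuple[float, float]:
    """Return (distance, opening angle) for a shot.

    x_m is the distance from the goal line in meters (must be > 0),
    y_m is the lateral offset from the center of the goal in meters.
    """
    distance = math.hypot(x_m, y_m)
    # Angle between the lines from the shot location to each goalpost.
    to_near_post = math.atan2(y_m + GOAL_WIDTH_M / 2, x_m)
    to_far_post = math.atan2(y_m - GOAL_WIDTH_M / 2, x_m)
    return distance, to_near_post - to_far_post


def expected_goal(x_m: float, y_m: float,
                  intercept: float = -1.0,
                  w_distance: float = -0.1,
                  w_angle: float = 1.5) -> float:
    """Logistic xG sketch; the default weights are illustrative, not fitted."""
    distance, angle = shot_features(x_m, y_m)
    z = intercept + w_distance * distance + w_angle * angle
    return 1 / (1 + math.exp(-z))


close_shot = expected_goal(x_m=11.0, y_m=0.0)   # roughly the penalty spot
far_shot = expected_goal(x_m=25.0, y_m=10.0)    # long range, off-center
print(f"close: {close_shot:.3f}, far: {far_shot:.3f}")
```

In practice the weights would be fitted on historical shots, and real models usually add more context, such as defender positions, body part, or the type of assist.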

Overall Takeaway

The sports analytics community is vibrant and ever-growing, with Python playing a central role in everything from scraping and data cleanup to advanced metrics and rich visualizations. PySport exemplifies the spirit of open collaboration by curating diverse tools, fostering contributions, and bridging the gap between professional clubs, hobbyists, and data scientists. Whether you’re a seasoned developer or just getting started, there’s ample room to join this community, tackle rich data challenges, and build truly innovative sports analytics solutions.

Links from the show

Koen on Twitter: @mr_le_fox
PySport on Twitter: @PySportOrg
Calling R from Python: medium.com
DuckDB: duckdb.org
PySport Playground: playground.pysport.org
NFLVerse: github.com
NBA Stats: nba.com
Sports Databases: opensource.pysport.org
Data sets: opensource.pysport.org
Visualizations: opensource.pysport.org
I/O: opensource.pysport.org
Models: opensource.pysport.org
Scrapers/APIs: opensource.pysport.org
Fast F1: docs.fastf1.dev
Fast F1 graphics: docs.fastf1.dev
Pysport Intro: pysport.org

New Talk Python Training Apps: talkpython.fm
Michael's blog post about the apps: mkennedy.codes
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm
