Open Source Sports Analytics with PySport

Episode #416, published Mon, May 22, 2023, recorded Thu, May 11, 2023


If you're looking for fun data sets for learning, for teaching, maybe a conference talk, or even if you're just really into them, sports offers up a continuous stream of rich data that many people can relate to. Yet, accessing that data can be tricky. Sometimes it's locked away in obscure file formats. Other times, the data exists but without a clear API to access it. On this episode, we talk about PySport - something of an awesome list of a wide range of libraries (mostly but not all Python) for accessing a wide variety of sports data from the NFL, NBA, F1, and more. We have Koen Vossen, maintainer of PySport, to talk through some of the more popular projects.


Episode Deep Dive

Guests Introduction and Background

Koen Vossen is the maintainer and driving force behind PySport, a community-driven project that aggregates Python (and non-Python) open-source sports analytics libraries. He’s been writing code since his early days experimenting with LEGO Mindstorms and Visual Basic. Koen later learned Python professionally and now runs his own company, TeamTV, where they combine performance analytics and video for sports clubs. He’s also been active in PyData Eindhoven, helping bring people together to discuss data science and open source.

What to Know If You're New to Python

If you’re just getting started with Python and want to follow along with how Koen and others handle sports data, here are a few prerequisites:

A basic grasp of Python’s ecosystem for data: Pandas and dataframes, simple scripts, and notebooks.
Some familiarity with web requests (e.g., requests or similar libraries) is helpful for working with live sports data or APIs.
Knowing how to organize code into reusable modules is useful if you decide to build or contribute to a sports analytics library.
You can set up a simple environment using venv or another environment manager to isolate your sports-analytics projects.

Key Points and Takeaways

PySport as a Community Hub: Koen created PySport to unify sports analytics tools and connect both hobbyists and professional clubs. It serves as an “awesome list” of packages across multiple sports (NBA, NFL, F1, and beyond) and different programming languages. It removes guesswork in finding high-quality libraries by highlighting documentation, last commit date, and number of contributors.
- Links / Tools:
  - PySport.org
Kloppy for Soccer Data Standardization: Much of the soccer analytics world struggled with inconsistent file formats and data layouts. Koen’s package, Kloppy, addresses this by standardizing disparate event and tracking data. Multiple contributors, including club analysts, have shaped Kloppy to transform raw soccer data into consistent Pandas-friendly formats for deeper analysis.
- Links / Tools:
  - Kloppy on GitHub
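
To make the standardization idea concrete, here is a minimal Python sketch that maps two differently shaped vendor records onto one common event schema. Both vendor layouts, all field names, and the functions from_vendor_a and from_vendor_b are hypothetical and invented for illustration; this is not the actual API of the library above, nor any real vendor's format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Common schema that every vendor format is converted into."""
    timestamp: float  # seconds since kickoff
    team: str
    player: str
    event_type: str   # lower-case label such as "pass" or "shot"
    x: float          # 0..1 along the pitch length
    y: float          # 0..1 along the pitch width


def from_vendor_a(row: dict) -> Event:
    # Hypothetical vendor A: "mm:ss" clock, coordinates in metres on a 105 x 68 pitch.
    minutes, seconds = row["clock"].split(":")
    return Event(
        timestamp=int(minutes) * 60 + float(seconds),
        team=row["team"],
        player=row["player"],
        event_type=row["type"].lower(),
        x=row["x_m"] / 105.0,
        y=row["y_m"] / 68.0,
    )


def from_vendor_b(row: dict) -> Event:
    # Hypothetical vendor B: time in milliseconds, coordinates as 0..100 percentages.
    return Event(
        timestamp=row["time_ms"] / 1000.0,
        team=row["team_name"],
        player=row["player_name"],
        event_type=row["event"],
        x=row["x_pct"] / 100.0,
        y=row["y_pct"] / 100.0,
    )


# The same shot from the centre spot, described in each vendor's layout.
record_a = {"clock": "12:30", "team": "Home", "player": "P9",
            "type": "SHOT", "x_m": 52.5, "y_m": 34.0}
record_b = {"time_ms": 750000, "team_name": "Home", "player_name": "P9",
            "event": "shot", "x_pct": 50.0, "y_pct": 50.0}

event_a = from_vendor_a(record_a)
event_b = from_vendor_b(record_b)
print(event_a)
print(event_a == event_b)
```

Once every source is converted into the same Event shape, downstream analysis code only has to be written once instead of once per vendor.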
Scraping vs. Official Data Sources: While some leagues and vendors offer open APIs, many data sets remain behind paywalls or unofficial endpoints. This forces developers to rely on community-built scrapers, which can break if the target site changes layout or policies. PySport attempts to catalog these scrapers but also encourages caution and consideration of legality and long-term stability.
- Links / Tools:
  - PyBall (NBA)
  - MLBGame (Baseball) (less frequently updated)
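
One small, concrete form of that caution is checking a site's robots.txt before scraping it. The sketch below uses Python's standard urllib.robotparser on an inline, made-up robots.txt; the example.com URLs, the rules, and the user agent name are all placeholders. Passing this check does not by itself settle whether you may legally use the data, since terms of service and licensing still apply.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real scraper would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "sports-data-hobby-bot"  # hypothetical user agent name

# Ask whether each URL may be fetched under the parsed rules.
allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/stats/players")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/feed")

# Requested pause between requests, in seconds, if the site declares one.
delay_seconds = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay_seconds)
```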
Data Availability Challenges: Sports data can be very rich, sometimes capturing 25 frames per second for each player or the ball position. Yet, official tracking data is often locked away in proprietary formats or restricted licensing deals. Projects like StatsBomb help by releasing free samples, but the overall ecosystem still faces hurdles in getting open, high-fidelity data.
- Links / Tools:
  - StatsBomb Open Data
  - StatsBomb PyPI Parser
Motorsports Analytics with FastF1: Formula 1 (F1) data is surprisingly well-structured: telemetry, timing, track layout, and driver positions. The FastF1 package integrates strongly with Pandas to give you session info, telemetry, and visualizations. This provides a terrific example of how open data or partially open data can fuel in-depth analysis and advanced plotting.
- Links / Tools:
  - FastF1 on GitHub
Bridging Clubs, Research, and Fans: Koen sees PySport as a vehicle to bring professional clubs, open-source developers, and sports enthusiasts together. Many clubs do rely on open-source Python, but they need a nudge and a starting point. PySport’s curated list and community guidelines make it easier for domain experts (like video analysts) to cross over into coding and share improvements.
- Links / Tools:
  - PyData Eindhoven (Community example)
Common Sports Analytics Models (e.g., xG): Advanced metrics like expected goals (xG) in soccer quantify how likely a shot is to become a goal, depending on context. Tools like socceraction, StatsBomb, and Kloppy help generate these metrics. The approach could also transfer to similar “goal-based” sports such as hockey where you evaluate shot quality.
- Links / Tools:
  - socceraction
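
As a rough illustration of what an xG model does, the Python sketch below scores a shot from two features only: distance to the goal and the angle the goal mouth subtends from the shot location. The logistic form is a common choice for this kind of model, but the coefficients here are made up for illustration and are not fitted to any data or taken from any of the libraries above; real models use many more context features.

```python
import math

GOAL_WIDTH_M = 7.32  # width of a standard soccer goal in metres


def shot_features(x: float, y: float) -> tuple[float, float]:
    """Return (distance, opening angle) for a shot.

    x is the distance from the goal line in metres, y is the sideways
    offset from the centre of the goal in metres.
    """
    distance = math.hypot(x, y)
    half_goal = GOAL_WIDTH_M / 2
    # Angle between the lines from the shot location to each post.
    angle = math.atan2(y + half_goal, x) - math.atan2(y - half_goal, x)
    return distance, angle


def toy_xg(x: float, y: float,
           intercept: float = -1.0,
           w_distance: float = -0.1,
           w_angle: float = 1.5) -> float:
    """Logistic model with illustrative, unfitted coefficients."""
    distance, angle = shot_features(x, y)
    z = intercept + w_distance * distance + w_angle * angle
    return 1.0 / (1.0 + math.exp(-z))


close_central = toy_xg(6.0, 0.0)    # close range, straight in front of goal
penalty_spot = toy_xg(11.0, 0.0)    # from the penalty spot
wide_angle = toy_xg(11.0, 15.0)     # same depth, far out to the side
long_range = toy_xg(25.0, 0.0)      # central but far away

print(close_central, penalty_spot, wide_angle, long_range)
```

With these placeholder weights, closer and more central shots score higher than distant or wide ones, which is the qualitative behaviour an xG model is meant to capture.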
Visualization Libraries (mplsoccer, FastF1, more): Beautiful visuals drive home sports analytics insights. The mplsoccer package merges Matplotlib plots and soccer pitch layouts. Meanwhile, FastF1 includes built-in telemetry and track-position charts. Such libraries are often preconfigured with sports-specific templates, saving hours on custom drawing.
- Links / Tools:
  - mplsoccer on GitHub
JupyterLite Playground for Zero-Install: PySport hosts a Playground that runs on JupyterLite and Pyodide (WebAssembly), letting visitors try soccer analytics without installing anything. This environment patches certain libraries like requests so you can fetch data in the browser. It’s especially helpful for new contributors or analysts who don’t want to manage local Python installs.
- Links / Tools:
  - Playground at PySport.org
DuckDB for Pythonic Querying: Toward the end of the episode, Koen highlighted DuckDB as an in-process SQL engine that integrates smoothly with Pandas and Parquet files. It’s great for quickly running SQL queries on top of existing data structures, making data exploration for sports analytics both powerful and straightforward.
- Links / Tools:
  - DuckDB

Interesting Quotes and Stories

On repetitive data transformations: “I noticed 80% of the code was just parsing the data to a useful format. It’s always the same code—why not build a library to fix that?”
On bridging open source with professional clubs: “Lots of clubs already use open source, but they don’t always know how or who to collaborate with. PySport tries to make people aware of everything that’s already built.”

Key Definitions and Terms

Tracking Data: High-frequency player or object positions (e.g., 25 frames/sec for each player). Often generated via camera systems and computer vision.
Event Data: Discrete events in a game such as passes, shots, turnovers, commonly represented with timestamps, player info, and field coordinates.
xG (Expected Goals): A model to quantify the likelihood of scoring given the shot context (position, defenders, angle, etc.).
WebAssembly (Pyodide/JupyterLite): A way to run Python code entirely in the browser, no server required.
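
To get a feel for the size of tracking data, here is a back-of-the-envelope calculation in Python. The 25 frames per second comes from the episode; the 23 tracked objects (22 players plus the ball), the 90-minute match without stoppage time, and the 16 bytes per position (two 8-byte floats) are simplifying assumptions, so treat the result as an order of magnitude only.

```python
FRAME_RATE_HZ = 25        # from the episode: roughly 25 frames per second
TRACKED_OBJECTS = 23      # assumption: 22 players plus the ball
MATCH_SECONDS = 90 * 60   # assumption: 90 minutes, ignoring stoppage time
BYTES_PER_POSITION = 16   # assumption: x and y stored as two 8-byte floats


def frames_per_match(rate_hz: int = FRAME_RATE_HZ,
                     seconds: int = MATCH_SECONDS) -> int:
    """Number of tracking frames captured during one match."""
    return rate_hz * seconds


def positions_per_match(objects: int = TRACKED_OBJECTS) -> int:
    """Number of individual (x, y) positions across all frames."""
    return frames_per_match() * objects


def raw_megabytes() -> float:
    """Uncompressed size of the positions alone, in megabytes."""
    return positions_per_match() * BYTES_PER_POSITION / 1_000_000


print(frames_per_match(), positions_per_match(), raw_megabytes())
```

Under these assumptions a single match already holds millions of positions, which is why a full season of tracking data becomes a data engineering problem in its own right.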

Learning Resources

Below are some helpful Python courses to deepen your data-focused development.

Data Science Jumpstart with 10 Projects: Build a strong foundation in data analysis with 10 hands-on projects.
Python Data Visualization: Learn to create beautiful and insightful visualizations to deepen your sports analytics.
Move from Excel to Python with Pandas: Transition your data workflows from Excel to code-driven analytics in Python.

Overall Takeaway

The sports analytics community is vibrant and ever-growing, with Python playing a central role in everything from scraping and data cleanup to advanced metrics and rich visualizations. PySport exemplifies the spirit of open collaboration by curating diverse tools, fostering contributions, and bridging the gap between professional clubs, hobbyists, and data scientists. Whether you’re a seasoned developer or just getting started, there’s ample room to join this community, tackle rich data challenges, and build truly innovative sports analytics solutions.

Links from the show

Koen on Twitter: @mr_le_fox
PySport on Twitter: @PySportOrg
Calling R from Python: medium.com
DuckDB: duckdb.org
PySport Playground: playground.pysport.org
NFLVerse: github.com
NBA Stats: nba.com
Sports Databases: opensource.pysport.org
Data sets: opensource.pysport.org
Visualizations: opensource.pysport.org
I/O: opensource.pysport.org
Models: opensource.pysport.org
Scrapers/APIs: opensource.pysport.org
Fast F1: docs.fastf1.dev
Fast F1 graphics: docs.fastf1.dev
PySport Intro: pysport.org

New Talk Python Training Apps: talkpython.fm
Michael's blog post about the apps: mkennedy.codes
Watch this episode on YouTube: youtube.com
Episode #416 deep-dive: talkpython.fm/416
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Episode Transcript


00:00 If you're looking for fun data sets for learning, for teaching, maybe a conference talk, or even if you're just really into them,

00:06 sports offers up a continuous stream of rich data that many people can relate to.

00:11 Yet accessing that data can be tricky.

00:14 Sometimes it's locked away in obscure file formats.

00:18 Other times the data exists, but without a clear API to access it.

00:22 On this episode, we talk about PySport, something of an awesome list of a wide range of libraries,

00:28 mostly but not all Python, for accessing a wide variety of sports data from the NFL, NBA, F1, and more.

00:36 We have Koen Vossen, the founder of PySport, to talk through some of the more popular projects.

00:41 This is Talk Python To Me, episode 416, recorded May 11th, 2023.

00:57 Welcome to Talk Python To Me, a weekly podcast on Python.

01:03 This is your host, Michael Kennedy.

01:05 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both on fosstodon.org.

01:12 Be careful with impersonating accounts on other instances.

01:15 There are many.

01:16 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

01:22 We've started streaming most of our episodes live on YouTube.

01:25 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.

01:33 This episode is brought to you by JetBrains, who encourage you to get work done with PyCharm.

01:39 Download your free trial of PyCharm Professional at talkpython.fm/done-with-pycharm.

01:46 And it's brought to you by InfluxDB.

01:49 InfluxDB is the database purpose built for handling time series data at a massive scale for real-time analytics.

01:55 Try them for free at talkpython.fm/InfluxDB.

01:59 A quick announcement before we jump into the conversation around PySport.

02:04 We have over 240 hours of Python course video content at Talk Python.

02:10 And if you're watching that content on a mobile platform, phone, or tablet, the browser is definitely not the best experience.

02:17 For example, on iOS, it won't even auto-advance or start playing without your interaction.

02:23 We've had some mobile apps for our courses for a while now, but they have fallen a bit into disrepair for a couple of reasons, including App Store tyranny.

02:32 But over the past four months, we've completely reimagined and rewrote our mobile apps in a modern and beautiful platform.

02:39 And I'm super happy to announce that they are now available for Android and iOS on both phone and tablets in their respective app stores.

02:47 So please visit talkpython.fm/apps to see for yourself how beautiful and clean the new apps are and why I'm so excited about them.

02:57 Download them for free and even take a couple of our free courses included in there, as well as the paid ones that you might have gotten, of course.

03:04 Finally, if you're curious for a bit of a look behind the curtain about how and why we rewrote them, check out my personal site, mkennedy.codes, for the full story.

03:14 Thank you all for supporting our work.

03:16 Now, on to the show.

03:18 Koen, welcome to Talk Python To Me.

03:21 Yeah, thanks.

03:22 Thanks for having me.

03:22 Really cool to talk about this topic.

03:25 Yeah, I'm real excited to have you here.

03:27 You have quite a collection of libraries.

03:30 Now, up front, these are not all your libraries, right?

03:33 These are sort of kind of an awesome list of Python and even beyond Python, sports libraries, data sets, APIs, models, everything, right?

03:42 I think it's for people, it was quite hard to find open source packages.

03:47 And I tried to collect everything I could find and make it available for everyone to find what I need.

03:54 It sounds like a great mission.

03:55 And as people will see, there is a bunch of stuff that we'll get you to talk about.

04:01 So I noticed not absolutely every sport is covered there, but many of the popular sports are covered.

04:09 And if you're interested in sports, I think also if you're interested in just examples that connect with people, right?

04:16 Imagine you're a university professor and you don't want to use the New York City tax data one more time.

04:24 You want to say, well, maybe people are into soccer, American football or NBA, whatever it is, right?

04:31 Maybe you could come up with something more interesting.

04:33 F1, for example, right?

04:35 Yeah, definitely.

04:35 There's quite some cool data available to use in your courses.

04:40 Yeah, absolutely.

04:41 Yeah.

04:41 Also, if people are members of some kind of club or team, maybe they could use some of this to bring some cool visualizations or analysis to their own organization, right?

04:53 Yeah, that's also one of the things that PySport likes to encourage to use open source packages that are already available instead of building your own stuff, because that actually happens a lot.

05:07 So that's also a part of the mission of PySport to make people aware of what's already there and try to bring people together.

05:15 One of the big problems, not problems, it's an opportunity, but it's also a challenge of Python.

05:20 If you go to PyPI.org right now, there's 453,000 packages.

05:25 I didn't know the number, but that's quite a lot.

05:27 We're coming up on half a million.

05:29 And if your goal is to work with some specific data set or try to solve a certain type of problem, often the hardest part is figuring out, well, what library do I use?

05:39 Does it exist?

05:40 And if so, is it up to date and all of these things?

05:43 So having a list like this, a place that aggregates it and sorts it, filters it, super neat.

05:49 So really looking forward to talking to you about it.

05:51 Before we get to that, though, just give us a quick bit on your backstory.

05:54 How'd you get into programming in Python?

05:56 Yeah, this is an interesting story.

05:57 I think I started programming when I was, I think, around 12, when I get Lego Mindstorms.

06:04 Lego that you could also program.

06:06 My father gave me a Visual Basic book.

06:09 Yeah, I should just figure it out.

06:11 So that's where I started with programming.

06:13 And then during high school, I also did web development with PHP.

06:19 I'm not really sure at what age, but eventually I ended up with, I think, the first Dutch search engine.

06:26 They want to, yeah, they need a Python developer.

06:29 And I didn't really know Python, but that wasn't an issue.

06:33 So there I learned Python.

06:35 And from that point on, I, yeah, only, or mostly used Python.

06:40 Right on.

06:41 You're like, all right, forget this PHP stuff.

06:42 I'm going Python.

06:43 Yeah, well, to be honest, we, my company, we still use PHP.

06:48 Yeah, I'm sure.

06:49 Because, well, it works quite well.

06:52 The performance is okay.

06:53 I'm not really sure if I'm allowed to say it on this show, but it also has some advantages.

06:58 Yeah, absolutely.

06:59 Well, I mean, almost every language has some, something it's particularly good at and reasons to keep using it.

07:05 And then also there's just tons of software that was written, you know, pick your language, written in that language.

07:09 And it still works well.

07:10 And there's plenty of reasons to just keep going with it, right?

07:13 Yeah.

07:13 But, yeah, Python is really my language of most interest and what I really use on my day-to-day work.

07:23 Excellent.

07:24 Yeah.

07:25 Yeah, cool.

07:25 What are you doing these days?

07:26 Are you still working at the search engine?

07:28 No, that was quite a while ago.

07:30 I also worked at a huge online marketing agency where we did run the software department.

07:38 We created tools to collect all the data from all kinds of different sources and make it available for the teams.

07:45 Right now I'm running my own company.

07:48 It's called TeamTV, where we provide all kinds of tools where we use video and data for, for example, performance analysts, but also for highlight creation or live streaming.

08:02 Just to make sure that we try, at least we try to combine video and data in all possible ways within the sports domain.

08:10 Yeah, that sounds like a really interesting thing to be working on.

08:14 Yeah, I started mostly on the video engineering part.

08:18 So we built quite some stuff ourselves there.

08:20 So from people uploading huge amounts of footage that we need to transcode and how to scale it, how to serve it.

08:29 So that's stuff we build ourselves.

08:31 Yeah, later on, we keep on building more stuff around data and always keep the combination between data and the video.

08:40 Because, well, you can see some sort of metric, but you always want to see the footage behind it to actually understand the context of it.

08:49 Yeah, sure.

08:49 Sounds really fun.

08:50 And you're also involved with PyData Eindhoven, is that right?

08:53 Yeah, yeah.

08:54 Give them a shout out.

08:55 Yeah, yeah.

08:56 Five years ago, we started with PyData Eindhoven.

08:59 We were already friends with PyData Amsterdam.

09:02 They said, well, maybe you should also start an Eindhoven chapter.

09:06 I think this year will be the anniversary, the five-year anniversary for PyData Eindhoven.

09:11 And it's, yeah, it's an amazing community.

09:14 And that's, yeah, also inspired me to start with PySport.

09:18 I'm not really sure if people at all that are listening to this podcast know PyData.

09:24 Maybe, I think.

09:26 I hope so.

09:26 Yeah, I suspect most of them probably do.

09:29 At least the data science inclined among us.

09:32 Yeah, I can tell a little bit about it.

09:34 Right now, we have a nice way of organizing the meetups and trying to get more people involved

09:40 and talk about data science and share knowledge.

09:43 And then once a year, we have the conference where we try to get, yeah, collect money that

09:49 we can send to NumFocus and they can share it over all the open source projects.

09:54 So, yeah.

09:55 It's a really amazing community.

09:58 That's excellent.

09:59 Yeah, NumFocus does a lot to support the bigger data science oriented projects.

10:05 Yeah.

10:06 I think that's kind of unique amongst the, in the Python space.

10:09 You know, there's not really anything like that in the web or UI.

10:14 You know, there's not a lot of areas where there's like organization that says, okay, we're going to try to find the popular projects and support them across organizations.

10:22 Right.

10:22 Like people support Flask, but they don't also support Django in the same sort of organization.

10:28 Right.

10:28 I think it's also an opportunity for all the companies that are using those open source packages to give back.

10:36 And I think doing it through NumFocus, it makes it also easier because they use a lot of packages and can just donate to NumFocus and they will make sure it's distributed over those packages.

10:49 Right.

10:49 Absolutely.

10:50 If you use Pandas, you should also support NumPy, right?

10:53 Because that's kind of the foundation and so on.

10:56 Yeah, yeah, yeah.

10:57 Interesting.

10:58 Oh, that makes a lot of sense.

10:59 All right.

10:59 Well, let's jump into sports and your project, PySport.

11:06 Are there other people who are maintainers and working on this or is this just your project?

11:10 To get the meetup that we had just a couple of weeks ago, we had some more people collected.

11:17 And now we are building from there on to get more people involved with just PySport.

11:23 But one of the projects we built with PySport is the Kloppy package.

11:27 And there we have worked together with Jan van Haaren.

11:31 He's a head of data science at Club Brugge, a big club in Belgium.

11:37 We had the main maintainers there, but I think right now we have 22 contributors to the package.

11:44 So there's quite some people contributing there.

11:47 Yeah.

11:48 Yeah, that's a big group.

11:49 That's a lot of people contributing.

11:50 Yeah.

11:50 Let's start it this way.

11:51 Tell people what PySport is and about that.

11:53 And then we can talk a little broadly just about sports analytics before we get into the details.

11:57 The most important mission of PySport is to bridge the gap between the clubs and the sports analytics and just people and by using open source packages.

12:10 Because a lot of clubs are using open source packages.

12:14 Open source packages are used by the clubs and people want to have a way to contribute to their favorite club.

12:21 I think a lot of people are still struggling on how to do it.

12:24 And with PySport, yeah, we want to share the knowledge and teach people on how to do it.

12:31 So we try to get the experts from the clubs, but also getting the knowledge from, you know, like pandas or other big packages and see how we can get all that knowledge into the sports analytics community.

12:46 With Kloppy, we try to set an example on how to build such a package, how to work together on such a thing and also encourage people to contribute.

12:59 Yeah.

12:59 Show that you don't have to create a pull request that's a major refactor, but also minor things like fixing typing errors and documentation, and show people that that's also very valuable to a package.

13:15 Interesting.

13:16 So Kloppy is standardized soccer tracking and event data, right?

13:21 So you started out with soccer or as I guess a lot of the world might refer to it as football, but in the U.S., that's already taken.

13:30 It's a namespace collision.

13:31 Yeah.

13:32 Yeah.

13:34 It's sometimes, yeah, it's difficult to talk about football, but here in Europe, we call it football, but for the package, because it's international.

13:43 It's also worldwide, so I call it soccer.

13:46 Yeah.

13:47 Yeah.

13:47 Namespace.

13:48 Namespace.

13:49 So give us a quick bit of background on Kloppy, but since it's kind of one of the founding, you created this as a way to sort of set an example, right?

13:57 For how to create a package and it helps people understand this event, this club data.

14:03 Where that started is on Twitter, there was already quite some people talking about sports analytics, of course.

14:09 And one guy, Joe Mulberry, he's working at a Danish top club.

14:14 He asked for help because he created a notebook and he wanted to build a Flask API on top of it.

14:20 And I said, well, I know Python.

14:22 I don't know really much about soccer or about data, but yeah, I would like to be involved.

14:28 I would like to help you.

14:29 And when I received a notebook, I noticed that like 80% of the code was about reading and standardizing the data to a format that he could work with.

14:41 And when we talked about it, it seemed like most, at least a lot of people are struggling with that issue and doing the same thing over and over again.

14:49 Because in more notebooks that I saw, people were doing the same thing, but in different ways.

14:54 And some were not correct implementation or inefficient implementations.

15:00 So I thought, well, one thing I know is how to read data and how to get it into a standardized format, because that was also one of the things I did at an online marketing company.

15:11 Yeah.

15:12 Like, I don't know much about your data format, but I know about processing data and normalizing it and all that, right?

15:18 Yeah.

15:19 I built a package, starting with just tracking data, but also tried to explain what the next steps could be.

15:25 And then people said, well, this is really useful.

15:27 And from that part, I kept on adding deserializers for different kinds of data for the tracking data and also for the event data.

15:37 Yeah.

15:37 I tried to get knowledge from non-sport bigger projects.

15:42 So I also got Will Kunnen from the Texel package.

15:48 He also did several reviews on this package and give feedback to try to get the package on a higher level.

15:56 So people within sports analytics community could also gain more knowledge from there.

16:01 But maybe also a small background on the data.

16:06 So the tracking data, that's like positioning data for all players on the pitch.

16:11 I think it's most of the time 25 frames per second.

16:15 So you know the location for each player and the ball.

16:18 And on the other side, you have the event data.

16:20 So there are all passes and shots and things like that.

16:23 At this time from this position, there was a shot on the goal or there was a pass or there was a takeaway or penalty.

16:29 Yeah.

16:30 Yeah.

16:30 That's event data.

16:31 Yeah.

16:32 And all the vendors choose different formats.

16:34 Yeah.

16:35 Yeah.

16:36 Oh, geez.

16:37 That sounds hard.

16:39 So first of all, 25 hertz of all the people's location.

16:44 This is beyond somebody with just a pen and paper and notebook writing down, oh, at this time there was a shot on the goal by number 25.

16:53 How do they get that data?

16:54 That's crazy.

16:55 Yeah.

16:55 Those are quite advanced systems that they use.

16:58 So in the stadium, I think they have like 20 cameras around the pitch.

17:02 They use computer vision to detect all the players and combine it.

17:07 But I believe, and I'm not really sure if there are already vendors on the market that do it totally automated.

17:14 But I think from the systems that are currently used in soccer, there are still some people needed for difficult situations like a corner kick, where a lot of people are in a small area and a lot of occlusions happen.

17:27 You can't see the numbers, yeah.

17:27 Yeah, they can't see the numbers.

17:28 So just after a corner, some manual operator has to reassign some players or correct something.

17:35 But it's quite an advanced system already.

17:38 It sounds incredibly advanced.

17:39 It sounds like an awesome data set to work with because with that much data, you really can make a lot of interesting predictions and trends.

17:48 I mean, at some point, maybe we'll just put some sort of tracking RFID thing on the back of the player's heads, just stitch it on there.

17:56 And then you can fully automate it, you know?

17:58 Yeah, I think a soccer day ticket.

18:02 Yeah, maybe.

18:03 Yeah.

18:03 Not sure if all players would accept it.

18:05 But for example, on ice hockey, yeah, you can put it on the helmets.

18:09 Yeah, you can put it on the helmets, sure.

18:11 For football.

18:11 Yeah.

18:13 Yeah.

18:13 Things like automobile racing, you know, they have, not all of them, but for example, F1 has incredibly high frequency of, like, points that measure where is this car, how fast is it going, the cars are sending out real-time telemetry.

18:28 There's certainly many sports that have quite high fidelity in their data.

18:32 I must admit, I haven't seen the data from F1 yet, but it would be really interesting to learn from them and how to work with the data and see if it can be applied to football or soccer or other sports.

18:46 This portion of Talk Python To Me is brought to you by JetBrains, who encourage you to get work done with PyCharm.

18:53 PyCharm Professional is the complete IDE that supports all major Python workflows, including full-stack development.

19:00 That's front-end JavaScript, Python backend, and data support, as well as data science workflows with Jupyter.

19:06 PyCharm just works out of the box.

19:09 Some editors provide their functionality through piecemeal add-ins that you put together from a variety of sources.

19:16 PyCharm is ready to go from minute one.

19:18 And PyCharm thrives on complexity.

19:21 The biggest selling point for me personally is that PyCharm understands the code structure of my entire project, even across languages such as Python and SQL and HTML.

19:32 If you see your editor completing statements just because the word appears elsewhere in the file, but it's not actually relevant to that code block, that should make you really nervous.

19:42 I've been a happy paying customer of PyCharm for years.

19:45 Hardly a workday passes that I'm not deep inside PyCharm working on projects here at Talk Python.

19:52 What tool is more important to your productivity than your code editor?

19:56 You deserve one that works the best.

19:58 So download your free trial of PyCharm Professional today at talkpython.fm/done-with-pycharm and get work done.

20:07 That link is in your podcast player show notes.

20:09 Thank you to PyCharm from JetBrains for sponsoring the show and keeping Talk Python going strong.

20:15 I bet it's a lot, actually.

20:18 I bet it is, you know, just in terms of actual quantity of data, you know, how fast are sampling and how many cars for how long.

20:25 It's probably a lot of data.

20:26 That's also one of the interesting things about working with sports data.

20:30 I think the data engineering part and this package just focused on reading the data.

20:36 But then the next step, yeah, how to work with the data, especially if you would like to use the tracking data for a whole season.

20:43 Yeah, that's quite some data that also vendors can start struggling a bit with.

20:48 It just occurred to me, there's probably a whole nother demographic or aspect who would be interested in this kind of data.

20:55 It would be like sports betting people.

20:57 I mean, not that I have any interest in that at all.

20:59 But if you were trying to figure out like, OK, if this team plays that team, if you can understand, OK, this their star player,

21:05 if we match up their moves against the other person's moves, it turns out there's a weakness in this way for their defense or who knows.

21:12 Right. I mean, there's there's with that much data.

21:14 There's probably some interesting stuff you can do.

21:16 I think that a lot of vendors of the data also have the, yeah, the betting industry as well as their clients.

21:25 Because, yeah.

21:25 I don't really care to work for them or support them.

21:30 It's a little bit shady, I suppose.

21:31 But it does seem like you could, it's almost like really detailed information about companies for the stock market.

21:39 This is kind of like a little bit like that for the sports betting in some ways, I suppose.

21:42 Yeah.

21:43 Yeah.

21:44 Yeah.

21:44 Interesting.

21:44 I think one of the challenges here is probably a lot of this data is not easily offered up.

21:51 There's probably not a lot of JSON APIs with low latency that are super easy to access. Some there must be, but not many.

21:58 There's probably a lot of data out there that is not overly welcome to be given out, or it's given out in batches over slow periods or something like that.

22:09 Right.

22:09 Maybe speak to a little bit about the data availability.

22:11 Yeah, that's quite an issue.

22:13 And I know mostly about the soccer data, but I can imagine that the same applies to most of the other sports.

22:21 And I think data availability is a major issue, at least if you want to encourage the community to work with it and do research on it and get people to build more cool stuff without being within a club.

22:36 There are some companies that already provide quite a big set of open event data.

22:41 StatsBomb is one of them.

22:43 I think they provide around 1,500 data sets for event data.

22:48 But if you're looking at the tracking data, maybe there are like 10, maybe 15 sets available because all those vendors have deals with the leagues.

22:58 They are not allowed to share it.

23:00 So you have to know someone within a club.

23:03 Or use Beautiful Soup or Scrapy or something like that, right?

23:07 That's the other option.

23:08 But then it's still very hard to get the tracking data because I'm not sure if you can actually scrape it.

23:15 But that's one of the things that I noticed when working on the open source section of PySport's website.

23:23 There are really a lot of scrapers.

23:25 And I think that's an indication that there's an issue with data availability.

23:32 Yeah, it's not.

23:32 This plugs into the API, but this is a scraper.

23:34 Yeah.

23:35 I guess it's worth pointing out or throwing out a bit of word of caution.

23:39 Just because the website is publicly available and you can hit it with some kind of scraping tool,

23:45 that doesn't mean you legally can do stuff with the data.

23:48 You probably want to be pretty careful about that, right?

23:51 Yeah, because I think even when it's not explicitly mentioned, most of the time it's not allowed to scrape the data at all.

23:59 But also in soccer, there are quite some websites that explicitly forbid it.

24:05 And yeah, so the packages are there.

24:08 And it's also a bit, I was thinking about should I include them or should I not include them?

24:13 Because they kind of encourage illegal actions. But yeah, I'm still not really sure about it.

24:20 Yeah, sure.

24:21 I can see the case for both sides of that.

24:23 But I just want to let people know, just be careful with what you do with the data.

24:28 It's one thing if it's an academic research project and it's just for my own interest or whatever.

24:33 Yeah, if you start scraping that entire website and trying to make money out of it, you should not do it.

24:38 Or find a way to do it legitimately, right?

24:41 But just don't sneak through.

24:43 Yeah.

24:44 All right.

24:44 Well, I think it might be fun to, let's talk through some of the packages you have here.

24:52 So if you go to PySport.org and there's a nav bar and on the left it says open source.

24:58 And if people click that, then they end up with a whole bunch of, you know, I'll open it just this way for a moment.

25:05 We can look at it and talk about it.

25:06 So if you just click on it, it actually, there's a delay as it downloads.

25:10 Yeah, that's still something I need to fix because, yeah, it's quite some packages.

25:16 I mean, this is not a complaint.

25:17 It's just, I don't know how many pages that is, but that's a really small scroll bar.

25:21 What I noticed that's pretty cool is you can go in and there's a filter that you all have and you can filter by language.

25:28 Right now you have Haskell, Python and R and others.

25:31 And then you can pick by sports and then you can pick by type of thing, right?

25:36 So I filtered our discussion down to Python libraries just because, you know.

25:40 We have a single title.

25:41 Yeah.

25:41 And you could also pick amongst the different types of tools.

25:46 So we talked about the scrapers and probably to a lesser degree, the APIs, right?

25:51 The API clients, which is cool.

25:52 There are some in there.

25:53 They say, here's the API and we just built a strongly typed package rather than just doing straight REST, which is great.

25:59 But you also have models and calculators like for predicting things.

26:03 And then IO for file formats, visualization, open data and databases.
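
A minimal sketch of that kind of three-way filtering (language, sport, type of tool), assuming a simple in-memory catalogue. The `Package` fields, the entries, and `filter_packages` are hypothetical names for illustration, not how the PySport site is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    name: str
    language: str
    sport: str
    category: str


# Hypothetical catalogue entries, made up for illustration.
CATALOG = [
    Package("example-basketball-client", "Python", "basketball", "api-client"),
    Package("example-soccer-plots", "Python", "soccer", "visualization"),
    Package("example-nfl-models", "R", "american-football", "model"),
]


def filter_packages(packages, language=None, sport=None, category=None):
    """Keep the packages that match every filter that is not None."""
    return [
        p
        for p in packages
        if (language is None or p.language == language)
        and (sport is None or p.sport == sport)
        and (category is None or p.category == category)
    ]


python_only = filter_packages(CATALOG, language="Python")
print([p.name for p in python_only])
```

Leaving a filter as `None` means "do not filter on this field", which mirrors picking only a language and leaving sport and type open.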

26:08 Right.

26:09 So I encourage people to rather than try to read the whole list, which is hundreds and hundreds of packages to, you know, filter down maybe to the sport you're interested in or a couple of sports or the type of tooling you're interested in.

26:22 Yeah.

26:23 I think filtering is a must, but maybe if you have plenty of time, you could just scroll and see what's interesting because it's still, I think, a very interesting list to see what's available and get inspiration.

26:37 Yeah.

26:37 It's quite a list.

26:38 Yeah.

26:39 Yeah.

26:39 So what's the sort here?

26:40 If I come here, how do I, how does this get sorted?

26:43 Like, is there any meaning to the order they appear?

26:46 Is it just when they were entered or?

26:47 It's a good question.

26:49 I also open sourced the data collection part of this website, and the data is collected daily to provide an update.

26:57 And I must say, I think there's an order:

27:01 the order in which I added the packages, I think that's the order here.

27:05 But to be honest, this can be pretty random.

27:08 Excellent.

27:09 All right.

27:10 So here, I'll just sort of go through a couple of the scrapers here.

27:14 And we can maybe dive into one or two potentially.

27:17 So there's PyBall.

27:19 We'll just go through just to give people a sense, right?

27:22 Of the ones here, right?

27:23 So there's PyBall, which is a Python API.

27:27 Nice.

27:28 Wrapper for stats.nba.com with a focus on NBA and WNBA applications.

27:34 That's pretty cool.

27:35 I don't know anything about stats.nba.com.

27:38 But it looks like, yeah, this is a whole website with all sorts of data.

27:42 It's got players, teams, leaders.

27:44 Looks great, actually.

27:45 I think quite some people are also using this package.

27:49 I think it's the most used package when working with basketball data.

27:54 And note that they use the API to get this data.

27:59 Yeah, you get quite a bit of data here.

28:00 You've got like the player, their team, their age, their total number of points scored.

28:06 A lot of stuff you can do to sort of compare them.

28:09 And yeah, that's great.

28:10 So if you're into basketball, I think it's a great start.

28:13 It's also quite actively maintained.

28:17 That's also one of the things that I intentionally mentioned on the list.

28:22 Because some packages are not really maintained well.

28:25 I think it's a benefit.

28:28 Yeah, one of the things in the list that you call out is the number of contributors,

28:32 the latest version, when the last commit was to the package.

28:36 That's pretty cool.

28:37 In the beginning, I thought, well, maybe I can just manually update the list.

28:40 But then I decided, I think data engineering is fun.

28:47 Let's find a way to automatically fetch the data and update it.

28:52 Also, the license is pretty important to show here.

28:56 And also the last commit, to see how actively it's maintained, and the latest version.

29:01 And also the contributors.
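
As a rough sketch of how those signals (last commit, contributor count) could be turned into quick flags, assuming the values have already been fetched: the `PackageStats` fields, the one-year threshold, and the sample packages are all hypothetical and not taken from the PySport code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PackageStats:
    name: str
    last_commit: date
    contributors: int


def maintenance_flags(stats, today, stale_after_days=365):
    """Summarize how recently and by how many people a package is maintained."""
    days_since_commit = (today - stats.last_commit).days
    return {
        "days_since_commit": days_since_commit,
        # Assumed threshold: no commit for over a year suggests it may be stale.
        "possibly_stale": days_since_commit > stale_after_days,
        # A single contributor can be an invitation to help out.
        "single_maintainer": stats.contributors == 1,
    }


today = date(2023, 5, 11)
active = PackageStats("example-active", date(2023, 5, 1), 30)
quiet = PackageStats("example-quiet", date(2020, 1, 15), 1)

print(maintenance_flags(active, today))
print(maintenance_flags(quiet, today))
```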

29:07 Sure, the difference between a package with one contributor and one with 30 contributors.

29:13 That's a big difference.

29:13 It's a really big difference.

29:15 Yeah.

29:15 I think it's also good for people to see if there's a package with just a single contributor

29:21 that might give an opportunity to contribute to it or work together.

29:26 So PySport would like to encourage people to get involved in those projects.

29:31 Yeah, that's a good idea.

29:31 So that could help out here.

29:34 Yeah.

29:34 And each one of these packages, you can go in and open the details here.

29:37 And it gives you a little bit more information.

29:39 Like, for example, it actually lists the contributors and links to their GitHub profiles and shows

29:44 their website and the GitHub page and PyPI and so on.

29:48 Yeah.

29:48 And also, you can click on one of the contributors and see what other packages they built.

29:54 Oh, really?

29:55 Okay.

29:56 So, like, if I click on this one, yeah, they've done just this one.

30:00 Well, and this one, just a single one.

30:01 Yeah.

30:02 Some of them, they might have worked on multiple.

30:04 I know Dependabot's worked on a few.

30:05 Yeah.

30:07 That's a really nice contributor.

30:08 Yeah.

30:09 Yeah.

30:10 Yeah.

30:10 The absolutely prolific open source contributor.

30:13 Yeah.

30:14 Works on my project too.

30:15 This portion of Talk Python To Me is brought to you by Influx Data, the makers of InfluxDB.

30:22 InfluxDB is a database purpose built for handling time series data at a massive scale for real-time

30:29 analytics.

30:30 Developers can ingest, store, and analyze all types of time series data, metrics, events,

30:35 and traces in a single platform.

30:37 So, dear listener, let me ask you a question.

30:39 How would boundless cardinality and lightning-fast SQL queries impact the way that you develop

30:44 real-time applications?

30:45 InfluxDB processes large time series data sets and provides low-latency SQL queries, making

30:52 it the go-to choice for developers building real-time applications and seeking crucial insights.

30:58 For developer efficiency, InfluxDB helps you create IoT, analytics, and cloud applications

31:03 using timestamped data rapidly and at scale.

31:06 It's designed to ingest billions of data points in real-time with unlimited cardinality.

31:11 InfluxDB streamlines building once and deploying across various products and environments from

31:17 the edge, on-premise, and to the cloud.

31:20 Try it for free at talkpython.fm/influxDB.

31:24 The link is in your podcast player show notes.

31:26 Thanks to Influx Data for supporting the show.

31:32 I didn't realize you could actually see all the projects that PySport knows about that that

31:37 particular user works on.

31:38 That's a cool aspect of it.

31:40 I spent quite some time on fetching all the data and trying to combine it.

31:44 Also fetching data from PyPI and doing something similar for the R packages.

31:49 Yeah.

31:49 And seeing how to get all the available data in one place.

31:53 It also tries to fetch images or screenshots from the READMEs of the repositories.

31:59 That works for some.

32:01 Oh, yeah.

32:01 That's nice.

32:02 Screenshots are really going to be very helpful.

32:04 Less important on the scrapers, more on the visualizers, probably.

32:09 But still.

32:09 Yeah, definitely.

32:10 What is opensource.pySport.org written in?

32:13 It's written in React using Next.js.

32:19 So it was also quite an adventure for me because it's my first application with it, which might also explain

32:25 why it's still a bit slow on loading, because I didn't really dive into how to make it faster.

32:32 It uses Tailwind.

32:33 But in the backend, it's Python.

32:35 It's using Luigi.

32:37 Okay.

32:38 That's, I still think it's a pretty interesting tool because it's really simple to set up

32:45 like orchestration of some tasks.

32:48 Right.

32:48 Like the daily scraping, updating the packages and that kind of stuff.

32:51 Yeah.

32:52 And then there's a GitHub action that runs on a daily basis and then fetches all the data

32:58 and updates and commits it in a different branch.

33:01 And that one gets deployed to Vercel, I believe.
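
To illustrate the orchestration idea (tasks that declare what they depend on and then run in a valid order), here is a small sketch. It deliberately does not use Luigi's API; it swaps in `graphlib.TopologicalSorter` from the Python standard library, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical names).
DEPENDENCIES = {
    "fetch_github_metadata": set(),
    "fetch_pypi_metadata": set(),
    "merge_package_data": {"fetch_github_metadata", "fetch_pypi_metadata"},
    "write_site_data": {"merge_package_data"},
}


def run_pipeline(dependencies, run_task):
    """Run every task after all of its dependencies and return the order used."""
    order = list(TopologicalSorter(dependencies).static_order())
    for task in order:
        run_task(task)
    return order


executed = []
order = run_pipeline(DEPENDENCIES, executed.append)
print(order)
```

A scheduler such as a daily GitHub Action would then only need to trigger the final task and let the dependency graph decide what runs first.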

33:06 Okay.

33:06 Yeah.

33:06 Very interesting.

33:07 But if you are interested in the source, it's also open source.

33:13 Okay.

33:13 Great.

33:14 So, PyBall for NBA.

33:16 We have the hockey scraper, which is for scraping NHL play-by-play and shift data with six contributors.

33:23 That's pretty interesting.

33:25 What you'll see on the filter list for every sport, there's a package also for the NHL, for

33:31 ice hockey.

33:32 That's a little bit less maintained, I think.

33:36 But I have to, I'm not really sure if it still works because with those scrapers, it can work

33:43 today and not tomorrow.

33:44 It doesn't even necessarily mean that they were intentionally blocked.

33:48 It could just be, hey, we've redesigned our site.

33:51 Doesn't it look awesome?

33:52 You're like, oh, the CSS selectors no longer pull up the thing.

33:55 So, yeah.

33:57 So, that's also on the scraping part.

33:59 If its last commit is like a while ago, it might be broken.

34:05 Maybe, maybe not.

34:06 Yeah, sure.

34:06 All right.

34:07 Let's see some more.

34:08 I think the StatsBomb API is an official package.

34:12 It's also cool that StatsBomb provides an open source package for accessing their data.

34:18 Yeah.

34:18 What is StatsBomb?

34:19 I see that showing up in many places on these different packages.

34:22 Yeah.

34:23 StatsBomb is, I think, one of the leading providers of event data in football.

34:29 And I think in both soccer and American football.

34:33 So, they provide the event data.

34:35 So, everything that happens on the pitch, like passes, dribbles, interceptions, everything.

34:41 They are also one of the providers of the open data sets.

34:45 Okay.

34:45 Yeah.

34:46 They've got a free data section.

34:47 That's cool.

34:47 Yeah.

34:48 They proclaim themselves as data champions.

34:51 That's kind of cool.

34:52 Yeah.

34:53 I think the data is pretty good.

34:55 I think also one of the best in the market right now.

35:00 But at least that's what I heard from some users.

35:03 Sure.

35:04 They even have courses.

35:05 Modern scouting and data-driven recruitment.

35:08 That's kind of interesting, isn't it?

35:10 Yeah.

35:10 You also have to figure out how to apply data science in your job.

35:16 So, how to use it and how to use the data for scouting purposes.

35:21 Yeah.

35:21 If you work in a professional sports organization or even college sports, in the U.S. at least,

35:28 there's a lot of recruiting people up from lower levels.

35:31 That happens in all sports.

35:33 But I think the data is really helping to make the number of players that you have to watch

35:40 from the footage a lot less.

35:42 So, if you can already make a short list instead of watching 15,000 players, then it's really

35:49 convenient.

35:49 Sure.

35:50 Or maybe you're looking for a particular asset or a particular part of the play that a player

35:56 is good at.

35:57 Right?

35:57 Maybe you're looking for a quarterback for a football team that is especially good at running

36:03 the ball in addition to just throwing it.

36:05 Right?

36:05 You could ask the data for that and really narrow it quite quickly, I imagine.

36:10 And then you have to work with the data, figuring out how to extract it.

36:14 Because maybe that single metric that's really important for you is not available in the original

36:19 data set.

36:20 So, then you have to figure out how to work with the data and get those metrics out of

36:26 the raw data.

36:27 Yeah.

36:27 Maybe it's something calculated or inferred.

36:29 Yeah.

36:29 And that's also one of the things that happens in soccer based on the tracking data.

36:34 But it will probably happen also in football and all the other sports.

36:38 That clubs will define their own metrics based on, for example, tracking data.

36:43 And use that to figure out what players match their own play the most.

36:49 Cool.

36:49 Okay.

36:50 So, yeah.

36:50 As you can see, there's a bunch of StatsBomb ones here.

36:53 pybaseball and mlbgame seem to be a couple of things around baseball data.

36:59 And baseball is one of those games that's kind of, I feel like baseball is one of those games

37:05 that was almost created by a statistician just so they could come up with stats.

37:10 There's so many stats.

37:12 And, you know, people get averages, you know, that what kind of hitter are they?

37:15 Well, they're like a .300 hitter,

37:19 or, you know, 30% and all that.
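
As a quick worked example of that statistic: a batting average is hits divided by at-bats, conventionally shown to three decimals. The numbers below are made up for illustration.

```python
def batting_average(hits, at_bats):
    """Batting average is hits divided by at-bats."""
    if at_bats <= 0:
        raise ValueError("at_bats must be positive")
    return hits / at_bats


# Hypothetical season line: 150 hits in 500 at-bats.
avg = batting_average(hits=150, at_bats=500)
print(f"{avg:.3f}")  # prints 0.300
```

So a ".300 hitter" gets a hit in about 30% of at-bats.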

37:22 And I'm not a huge fan of baseball.

37:23 I find it kind of a slow game.

37:25 It's kind of fun to play, but not to watch.

37:27 And it's like, you know, same as golf.

37:29 I don't watch those things.

37:30 Yeah.

37:31 I'm sure they're fun to play, but it's just like, in terms of stats, these kinds of games,

37:35 there's probably a ton of stats here because it's all about stats there.

37:38 I also believe that the baseball data science departments are among the biggest departments

37:44 across all sports.

37:45 And maybe, but I'm not sure about it.

37:48 You can also make a lot of impact there.

37:51 Maybe.

37:51 Sure.

37:52 Because in other sports, for example soccer, a lot of things have an impact on the

37:57 eventual outcome.

37:58 It's also a discussion if all data is available to know what actually has the most impact.

38:06 So that's also one of the discussions within the soccer analyst community.

38:11 Yeah.

38:11 For both of these, pybaseball and mlbgame, you can see from your Luigi automation.

38:18 They're both quite, well, mlbgame is not particularly up to date.

38:22 I guess the pybaseball one is more up to date.

38:24 But, you know, 13 contributors, 30 contributors.

38:27 That's quite a lot.

38:28 That's quite a lot.

38:29 And pybaseball was updated this month, right?

38:33 But, you know, when I saw these, I'm like, oh, these are kind of similar.

38:36 And then I look at your page here and I see, oh, well, pybaseball is, you know, way more

38:41 up to date, modern.

38:42 And you should check that out first, right?

38:43 That's the kind of value you get for having the info.

38:45 Yeah.

38:46 That's also the intention, that you have quite a quick overview of, yeah, how well it's maintained.

38:53 And, yeah.

38:55 Yeah.

38:55 And that one also goes against the API.

38:57 So let's see.

38:58 A couple more.

38:59 I guess it's worth giving a shout out to nflfastpy.

39:04 That, well, you know, NFL's got a lot of data as well.

39:07 What else?

39:07 There's some college baseball.

39:09 Here's one that I think shows up across a lot of the different categories because

39:12 it seems to do a lot, which is FastF1.

39:16 Have you seen that?

39:17 Have you played with this any?

39:17 Also updated this month.

39:19 I should dig into it because quite some contributors.

39:22 And I think it's really interesting to also see the motorsports or cycling or more of those

39:29 sports to see what they are doing, how they're doing it.

39:32 Yeah.

39:32 I noticed looking through here that there's not a lot of motor sports compared to the other

39:36 sports.

39:37 And so people, if you're out there, like if you're in IndyCar or if you're in motocross

39:41 or something like that, and you've got a package, shoot it over to these guys and have them

39:45 put it in the list.

39:45 That'd be cool.

39:46 Yeah.

39:46 FastF1, they've got a page here that has a bunch of things.

39:49 It has access to timing data, telemetry, session results, and all the data is provided

39:55 in an extended pandas DataFrame format, which is pretty cool.

40:00 Right.

40:01 Integration with Matplotlib.

40:02 There's an examples gallery too.

40:05 You come over here and you can see it has things like position changes during the race.

40:10 So if you go up here, it'll do things like, you know,

40:15 get the session for season 2023, race one, with R for race, I guess, rather than practice or

40:22 qualifying.

40:22 And that's Bahrain.

40:23 And so then here's, you know, it has all the drivers, their time throughout the race, their

40:28 position.

40:28 You can probably see pit stops.

40:30 There's a lot of cool stuff you can see in here.
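
A minimal sketch of the session lookup described above, assuming the third-party `fastf1` package is installed and network access is available. `get_session`, `load`, and `laps` are FastF1 names; the helper `load_race_positions` and its column selection are illustrative choices, not code from the FastF1 gallery.

```python
def load_race_positions(year, round_number):
    """Return one row per driver per lap with the position held on that lap."""
    # Imported lazily so this module can be loaded without fastf1 installed.
    import fastf1

    # "R" selects the race rather than a practice or qualifying session.
    session = fastf1.get_session(year, round_number, "R")
    # Downloads the timing data for the session, so this needs network access.
    session.load()
    return session.laps[["Driver", "LapNumber", "Position"]]
```

Calling `load_race_positions(2023, 1)` would request the first race of the 2023 season, and the resulting DataFrame could be plotted lap by lap to show position changes.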

40:31 It looks really nice.

40:32 And also with those examples, I think it's really helpful to get people started with those

40:38 packages.

40:39 Yeah.

40:40 It's not exactly a Jupyter notebook.

40:41 It's the HTML of a Jupyter notebook.

40:43 But, you know, it's still exactly what you need, right?

40:46 But I think you can even download it as a notebook.

40:49 You download it right there.

40:50 Absolutely.

40:50 Yeah.

40:51 Yeah.

40:51 And apparently two and a half seconds to generate this script.

40:54 Let's see.

40:56 You've even got cool visualizations, like on the track, colored by speed around the tracks

41:03 of the start.

41:03 You know, there's a lot of cool data here.

41:05 I'm not really sure why I haven't seen this one before, but yeah, it looks really, really

41:09 cool.

41:09 Yeah.

41:11 When I looked around a couple of the different packages, for this one

41:13 the documentation and examples and stuff seem super good.

41:16 Okay.

41:17 So that's the scrapers.

41:18 There's many more.

41:20 There's plenty more there.

41:21 Another one, models, calculators.

41:23 Maybe take us through some of the ones that stand out in this category.

41:26 Like, for example, there's Laurie's code for Metrica tracking data.

41:30 I love it that it's just, it's Laurie's code.

41:32 Good job, Laurie.

41:32 Yeah.

41:33 So this is mostly about how to also do all kinds of modeling on top of it, do predictions

41:39 on top of data.

41:40 You know, one of the packages that I think is pretty interesting is socceraction.

41:45 Yeah, of course.

41:46 Again, it's soccer.

41:47 There's only Python, possibly.

41:49 But for example, they have soccer_xg, which is, what is that?

41:53 XGBoost models for soccer event data?

41:55 That's the expected goals.

41:59 So what's the expected value for a certain shot?

42:02 If it should go in or not.

42:05 So it's also based on the position on the pitch, how many players are between the player with

42:13 the ball and the goal.

42:14 So you can use it to determine, yeah, how, if a player should score a goal and how many goals

42:22 he should make.
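
To make the expected goals idea concrete, here is a toy sketch. It swaps in a plain logistic function with two inputs (distance to goal and defenders between ball and goal) rather than the gradient-boosted models mentioned above, and the coefficients are hypothetical values chosen for illustration, not fitted to any data.

```python
import math

# Hypothetical coefficients for illustration only, not fitted to real shots.
INTERCEPT = 0.5
PER_METER = -0.12
PER_DEFENDER = -0.6


def expected_goal(distance_m, defenders_in_path):
    """Map shot features to a scoring probability between 0 and 1."""
    score = INTERCEPT + PER_METER * distance_m + PER_DEFENDER * defenders_in_path
    return 1.0 / (1.0 + math.exp(-score))


# Made-up shots: (distance to goal in meters, defenders between ball and goal).
shots = [(8.0, 0), (18.0, 1), (30.0, 3)]
shot_xg = [expected_goal(distance, defenders) for distance, defenders in shots]

# Summing the per-shot probabilities gives the goals a player "should" have scored.
total_xg = sum(shot_xg)
print([round(x, 2) for x in shot_xg], round(total_xg, 2))
```

Comparing `total_xg` with the goals a player actually scored is one common way to read such a model.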

42:23 Yeah.

42:24 Yeah.

42:24 I think this is actually one of the really interesting aspects, the models and calculators.

42:28 You know, the prediction side is pretty cool.

42:30 There's quite some work to do for PySport because, for example, the expected goals.

42:36 That's also one of the things that I've seen in ice hockey.

42:39 Also, in other sports where you have to score within a goal.

42:44 And I think it would be cool to find a way to abstract it over all sports.

42:49 Yeah.

42:49 Because it is kind of the same idea, probably different data sets, but right.

42:54 Like scoring in hockey and scoring in soccer is from a structural perspective of the data

42:59 is kind of the same thing, even though it's really quite different in size of the goal and

43:04 how easy it is and all that.

43:06 Yeah.

43:06 But I think we can still learn from the other sports and see how they did it.

43:10 Yeah.

43:10 Train up a model, but on different data, right?

43:12 But same type of model, potentially.

43:14 Yeah.

43:14 Maybe some different features, but yeah.

43:17 Yeah.

43:17 So the next category is IO.

43:20 And obviously StatsBomb is in here, right?

43:23 A Python package to parse StatsBomb's JSON data to CSV, which is cool.

43:27 Some on soccer, the SPADL format, which I have no idea what that is.

43:31 Yeah.

43:31 That's also one of the things they built to make, like, an atomic data format.

43:37 That's also kind of standardized.

43:41 So there's some overlap between socceraction and kloppy.

43:44 I think they mostly focused on how to eventually work with the data.

43:49 So calculate also the expected threat and also like a contribution model.

43:55 So for every action towards a goal.

43:59 Right.

44:00 Right.

44:00 Okay.

44:01 So maybe there's a takeaway and then a pass and a pass and then a score.

44:04 Like all of those people should somehow get credit for that potentially, right?
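
One simple way to sketch that kind of credit assignment is to give each action the change in scoring probability it produced. This is an illustrative assumption, not socceraction's actual implementation, and the probabilities in the chain are made up.

```python
# Hypothetical chain: (player, action, scoring probability after the action).
START_PROBABILITY = 0.01
CHAIN = [
    ("defender", "takeaway", 0.03),
    ("midfielder", "pass", 0.08),
    ("winger", "pass", 0.30),
    ("striker", "shot", 1.00),
]


def credit_actions(chain, start_probability):
    """Credit each player with the change in scoring probability they caused."""
    credits = {}
    before = start_probability
    for player, _action, after in chain:
        credits[player] = credits.get(player, 0.0) + (after - before)
        before = after
    return credits


credits = credit_actions(CHAIN, START_PROBABILITY)
print(credits)
```

The credits add up to the total rise in scoring probability over the chain, so everyone involved in the build-up receives a share and not only the scorer.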

44:08 Yeah.

44:08 Okay.

44:08 Makes sense.

44:09 But they also built a way to load the data.

44:12 And we are currently also working together with them to see if we can make kloppy load

44:20 the data and have the kloppy package focus on loading it and standardizing it and then have

44:25 socceraction use it.

44:27 So, to see how the Lego blocks can work together.

44:30 Absolutely.

44:30 We have the NFLDB, a library to manage and update NFL data in a relational database.

44:36 That's kind of cool.

44:37 All right.

44:37 Let's see.

44:38 The next category is the visualization.

44:40 I think probably the most important part is probably the actual data acquisition, but

44:45 the most desired part is probably the visualization, right?

44:48 The data engineering part is not really, what do you call it, really sexy.

44:53 I mean, no one sees it.

44:54 The output is a structured CSV or Parquet file.

44:58 So that's not really cool to show.

45:00 But for example, mplsoccer, I think it's a really, really nice package used by every

45:09 person in the soccer community.

45:11 Yeah.

45:12 There's a lot of contributors here.

45:14 Yeah.

45:14 And the visualizations look really cool.

45:17 Yeah.

45:18 They also have a huge list of examples.

45:22 Okay.

45:22 So all kinds of, you can just copy and paste to create some pizza charts.

45:28 I love them.

45:28 Yeah.

45:29 Yeah.

45:29 We'll actually come back to the pizza charts in just a moment, actually.

45:32 But yeah, these are some good looking visualizations here.

45:35 Yeah.

45:35 And I think the interesting thing about this package is that at some point there were two

45:40 packages that did similar things.

45:42 And then they decided, well, we should just work together.

45:45 And they spent quite some time on integrating those packages.

45:48 And then there was one.

45:50 And I think that's really cool to see that instead of kind of competing, they decided to

45:56 work together and make, I think, one of the most awesome

45:59 packages for the soccer community.

46:01 It's really nice.

46:02 It's really nice.

46:03 There's a lot of soccer ones in here.

46:05 Yeah.

46:05 There's also one for a PT plot for American football, although I don't understand what PT

46:10 stands for.

46:11 And then FastF1 is also in there.

46:14 We already saw those pictures, but a lot of nice visualizations there.

46:17 Yeah.

46:18 And is that it for all the categories?

46:20 No.

46:20 Then there's the open data.

46:21 Yeah.

46:21 I think maybe when I look at this list, are some missing?

46:25 Okay.

46:25 It's still a bit limited on what data is available.

46:28 That's something that we should work on together with the leagues, to see if there's a way to make

46:35 some more data available.

46:36 Yeah.

46:37 If they have it and they offer it publicly, put it in the list, right?

46:40 When it's available, I would definitely add it.

46:43 But there's already some interesting data.

46:46 There may be a little bit smaller data sets, but you can definitely use it to start playing

46:52 around with it.

46:52 All right.

46:53 So I think that kind of covers the list with the Python filter sort on.

47:00 You wanted also to give a quick shout out to NFLverse, right?

47:04 Because, while not Python, it's quite a series of packages that does cool stuff in the NFL for

47:10 that data, right?

47:11 Yeah.

47:11 So it's not Python.

47:13 It's for R users.

47:14 But I think what's really interesting there, what they did is they created quite some different

47:20 packages, one for collecting the data, one for organizing it, one for reading the data,

47:26 one for doing all kinds of modeling, one for creating the visualizations.

47:30 And I think that's also an example for all the sports on how to make those packages available,

47:37 making sure that everything fits together.

47:40 Yeah, that's cool.

47:41 It's under the nflverse organization, but a bunch of different projects.

47:45 You know, you talked about having the data and stuff that's not immediately obvious or

47:50 predictable.

47:50 You might need a higher level sort of thinking about it.

47:53 And one of them that stands out here is nfl4th, which studies fourth-down decisions

47:58 with the nflverse models, which is kind of cool because that's one of the big

48:03 decisions that a coach makes and it can make the game or it can lose the game.

48:07 And there's a go, no go decision, right?

48:09 And there's a lot of, it's not just, well, they went this far, then they didn't make it.

48:13 It's well, it was the, they had 30 seconds left in the game and they had to do it or, you know,

48:18 because otherwise they were just going to lose anyway.

48:20 Right.

48:21 There's a lot of higher, like sort of inference and higher level things you want to bring into

48:25 that rather than just 30% of the time they make it on fourth down.

48:28 Right.

48:29 Yeah.

48:29 And that's, I think, also one of the reasons they just built an entire package around it

48:33 to work with it.

48:35 Yeah.

48:35 That's pretty interesting.

48:36 Now, before all the Python people say, I don't want to learn R, I don't care about R, it is

48:40 also worth pointing out that you can call R from Python.

48:44 I don't know how much like the visualization stuff still works super well or anything like

48:48 that, but you can use, what is it called?

48:51 rpy2.

48:53 Okay.

48:54 And you just pass it an R file and then you start calling functions or

48:59 whatever, get a, get a function out of it and call that function.

49:02 So it's worth, you know, if, if you really, really want to use some of these packages, maybe

49:07 it's worth doing a quick little integration and then turn it into a data frame, a pandas data

49:11 frame and running with it or something.
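
A hedged sketch of that workflow, assuming both R and the third-party `rpy2` package are installed. `rpy2.robjects.r` and `rpy2.robjects.globalenv` are rpy2 names; the helper `call_r_function` and the script and function names in the usage note are hypothetical. Converting the result to a pandas DataFrame is left out here.

```python
def call_r_function(r_file, function_name, *args):
    """Source an R script, look up one of its functions, and call it."""
    # Imported lazily so this module can be loaded without rpy2 or R installed.
    import rpy2.robjects as ro

    # Run the script so its definitions land in R's global environment.
    ro.r["source"](r_file)
    # Fetch the R function by name; the returned object is callable from Python.
    r_function = ro.globalenv[function_name]
    return r_function(*args)
```

For instance, `call_r_function("analysis.R", "summarize_games", 2022)` would run a hypothetical `summarize_games` function defined in a hypothetical `analysis.R` file.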

49:13 It looks interesting.

49:14 It's definitely worth a try.

49:16 It's nothing I've ever used, but I can see, you know, if you really care about NFL data

49:20 and you really care about Python, it might be worth, worth giving those, those combos a

49:24 look there.

49:25 I think there is one package to work with, with their data from Python.

49:30 So if you look at the list there, there should be at least one.

49:33 I think it's not on their website or on their GitHub page, but I think there's another one

49:39 that integrates well with it.

49:41 Sure.

49:41 Right.

49:42 Not under the organization, but maybe somebody else.

49:44 Yeah.

49:45 That does.

49:45 Yeah.

49:46 That's cool.

49:46 Excellent.

49:47 Maybe they use this, this integration.

49:49 I don't know.

49:50 All right.

49:51 And then the last thing I want to talk about here is interesting on two levels.

49:55 So you've got a playground.

49:57 So you've got a playground.pysport.org, which is a hosted notebook to play with some examples,

50:03 like in particular, kloppy and mplsoccer, right?

50:06 I think one of the issues or challenges for a lot of people also working within the bigger

50:11 clubs is that they don't always have a background in programming.

50:15 So often they start as a video analyst or working as a performance analyst, and then they think,

50:21 well, there's data.

50:21 I want to work with it.

50:23 And if you need to set up your Python environment for the first time, it can be a bit overwhelming.

50:28 So that's why I, for, well, there is JupyterLite, which is a very cool project based on Pyodide.

50:37 Let's see if, yeah, if you can use it.

50:39 And it is just a start with the Kloppy and the mplsoccer packages.

50:44 I just fetched the notebook from there, from my gallery, integrated into this one, into the

50:51 playground, and you can just start playing around with it.

50:55 Yeah.

50:56 And so here's a proper Jupyter notebook using all of their libraries and stuff.

51:00 But what's awesome about this, as you said, based on Pyodide, I'm not sure it necessarily

51:05 actually stuck in people's minds.

51:08 Like, this is running in WebAssembly on our front end, right?

51:11 Which is pretty epic.

51:13 It makes it really convenient for people to just start playing around with it without installing

51:18 Python and working with virtual environments.

51:21 You know how it works.

51:23 Yeah.

51:23 It makes it super easy for you to host it because all you're doing is serving up static files.

51:28 You're not hosting, you're not running a Kubernetes cluster or anything like that, right?

51:32 Trying to prevent abuse of it and so on.

51:34 Yeah.

51:34 So, yeah.

51:36 So from multiple sides it's good, for me and for the people using it.

51:41 For sure.

51:41 And it even does that wild, what's it called, pizza plot, that kind of style of plot that

51:47 we're looking at.

51:48 And it runs fast and great.

51:50 Yeah.

51:50 This is really, really nice.

51:51 Yeah.

51:52 Are you happy with Pyodide or JupyterLite?

51:54 Yeah.

51:54 There were some issues with it, especially around fetching data, because some

52:03 of these try to fetch the open data from StatsBomb or also some fonts and stuff like that.

52:09 Yeah.

52:09 So we had to work around it.

52:11 And what you see on top of here is the patching of the requests library to make

52:18 it work in JupyterLite.

52:19 Yeah.

52:20 I think it's better to have a working version than not patching it.

52:26 I think it's great.

52:27 And then everything that uses requests can just do its thing.

52:30 Yeah.

52:31 This is really cool.

52:31 When I saw that you had this, I thought, oh, this is clever that it's based on

52:35 JupyterLite.

52:35 And it's really nice.

52:37 Yeah.

52:37 So people can check that out.

52:39 Maybe people out there listening maintain some of these packages and have notebooks.

52:43 Like if they get them working here, could they submit them to you and have them added

52:47 in this list?

52:48 The entire playground is part of the PySport organization on GitHub.

52:53 You can just go to the repository and make a pull request.

52:57 And I will just review it and merge it.

53:01 And then it will be available here.

53:03 Yeah.

53:04 That's awesome.

53:04 So I'm really happy for more packages here.

53:06 More examples.

53:07 Yeah.

53:09 More examples would be very welcome.

53:10 Excellent.

53:10 All right.

53:11 Well, I think we're getting pretty much short on time for talking about sports analytics,

53:17 but really, really good work there.

53:19 Now, before you get out of here, I have the final two questions for you.

53:22 I always ask.

53:23 Notable PyPI package, something you've come across.

53:25 You're like, oh, this library is awesome.

53:27 People should check it out.

53:28 I mean, it's kind of the whole topic of this show.

53:30 So we talked about, you know, maybe a hundred.

53:32 We didn't mention them all, but went through a list of a hundred different Python packages.

53:37 But something you want to give a shout out to that you think is cool out there?

53:39 I'm not really sure if the entire Python world already knows it.

53:43 But at the last PySport meetup, I made an example using DuckDB.

53:48 That was something that people didn't know about, especially the integration with pandas data frames,

53:55 that you just build a data frame and run queries directly on top of it.

53:59 Yeah.

54:00 Interesting.

54:00 I'd heard of DuckDB, but I didn't realize the pandas kind of direct integration.

54:05 It also has direct Parquet support.

54:07 Interesting.

54:08 Okay.

54:08 That makes it quite easy to also play around with SQL queries.

54:14 And I was very happy that I had a presentation at the last PyData Eindhoven conference.

54:20 Yeah.

54:20 I think it's a package that, well, not everyone, but it's really worth checking out because it can make your life easier.

54:28 I think it's just a Swiss army knife for data engineering.

54:32 And yeah, I think it's a nice one.

54:34 Yeah.

54:35 Great recommendation.

54:36 And if you're going to write some Python code, what editor are you using these days?

54:39 I'm using PyCharm.

54:41 So I'm not, yeah, not sure if it's cool.

54:44 I love PyCharm.

54:45 PyCharm is awesome.

54:45 Okay.

54:45 Excellent one.

54:46 Yeah.

54:46 So I guess final call to action.

54:48 People are interested in open source sports analytics.

54:51 They're open and maybe interested in PySport, want to contribute back or, you know, be part of it in some way.

54:57 What do you tell them?

54:57 Yeah.

54:58 You can reach out on Twitter or LinkedIn to see where you can contribute.

55:05 And I think also, if you're not working in the sports domain and would like to contribute, please reach out because I think the knowledge from outside of sports is really useful within sports.

55:15 So there are a lot of options to contribute and, yeah, to build an even bigger and better community.

55:22 Yeah.

55:22 Absolutely.

55:23 All right.

55:24 Well, thank you so much for being here and sharing all these projects you've collected.

55:27 Thanks a lot for being on the show.

55:29 It's really, really nice.

55:31 Yeah.

55:31 Thank you.

55:32 You're welcome.

55:32 Bye.

55:33 Bye.

55:34 This has been another episode of Talk Python To Me.

55:37 Thank you to our sponsors.

55:38 Be sure to check out what they're offering.

55:40 It really helps support the show.

55:41 The folks over at JetBrains encourage you to get work done with PyCharm.

55:46 PyCharm Professional understands complex projects across multiple languages and technologies,

55:52 so you can stay productive while you're writing Python code and other code like HTML or SQL.

55:58 Download your free trial at talkpython.fm/donewithpycharm.

56:03 Influx Data encourages you to try InfluxDB.

56:06 InfluxDB is a database purpose-built for handling time series data at a massive scale for real-time analytics.

56:13 Try it for free at talkpython.fm/InfluxDB.

56:17 Want to level up your Python?

56:19 We have one of the largest catalogs of Python video courses over at Talk Python.

56:23 Our content ranges from true beginners to deeply advanced topics like memory and async.

56:28 And best of all, there's not a subscription in sight.

56:32 Check it out for yourself at training.talkpython.fm.

56:34 Be sure to subscribe to the show.

56:36 Open your favorite podcast app and search for Python.

56:39 We should be right at the top.

56:40 You can also find the iTunes feed at /itunes, the Google Play feed at /play,

56:45 and the direct RSS feed at /rss on talkpython.fm.

56:50 We're live streaming most of our recordings these days.

56:53 If you want to be part of the show and have your comments featured on the air,

56:56 be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

57:01 This is your host, Michael Kennedy.

57:02 Thanks so much for listening.

57:04 I really appreciate it.

57:05 Now get out there and write some Python code.
