#416: Open Source Sports Analytics with PySport Transcript
00:00 If you're looking for fun datasets for learning, for teaching, maybe a conference talk, or even if you're just really into them, sports offers up a continuous stream of rich data that many people can relate to.
00:12 Yet accessing that data can be tricky.
00:15 Sometimes it's locked away in obscure file formats, other times the data exists but without a clear API to access it.
00:22 On this episode, we talk about PySport, something of an awesome list of a wide range of libraries, mostly but not all Python, for accessing a wide variety of sports data from the NFL, NBA, F1, and more.
00:37 We have Koen Vossen, the founder of PySport, to talk through some of the more popular projects.
00:41 This is Talk Python To Me, episode 416, recorded May 11th, 2023.
00:57 Welcome to Talk Python To Me, a weekly podcast on Python. This is your host, Michael Kennedy.
01:06 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both on fosstodon.org. Be careful with impersonating accounts on other instances, there are many.
01:17 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.
01:22 We've started streaming most of our episodes live on YouTube. Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.
01:34 This episode is brought to you by JetBrains, who encourage you to get work done with PyCharm.
01:40 Download your free trial of PyCharm Professional at talkpython.fm/done-with-pycharm.
01:47 And it's brought to you by InfluxDB.
01:49 InfluxDB is the database purpose-built for handling time series data at a massive scale for real-time analytics.
01:56 Try them for free at talkpython.fm/influxdb.
02:01 A quick announcement before we jump into the conversation around PySport.
02:05 We have over 240 hours of Python course video content at Talk Python.
02:10 And if you're watching that content on a mobile platform, phone or tablet, the browser is definitely not the best experience.
02:18 For example, on iOS, it won't even auto-advance or start playing without your interaction.
02:24 We've had some mobile apps for our courses for a while now, but they have fallen a bit into disrepair for a couple of reasons, including app store tyranny.
02:33 But over the past four months, we've completely reimagined and rewritten our mobile apps on a modern and beautiful platform.
02:40 And I'm super happy to announce that they are now available for Android and iOS on both phone and tablets in their respective app stores.
02:48 So please visit talkpython.fm/apps to see for yourself how beautiful and clean the new apps are and why I'm so excited about them. Download them for free and even take a couple of our free courses included in there, as well as the paid ones that you might have gotten of course. Finally, if you're curious for a bit of a look behind the curtain about how and why we rewrote them, check out my personal site mkennedy.codes for the full story. Thank you all for supporting our work. Now, on to the show.
03:18 - Koen, welcome to Talk Python To Me.
03:21 - Yeah, thanks for having me.
03:23 Really cool to talk about this topic.
03:25 - Yeah, I'm real excited to have you here.
03:28 You have quite a collection of libraries.
03:31 Now, up front, these are not all your libraries, right?
03:33 These are sort of kind of an awesome list of Python, and even beyond Python, sports, libraries, data sets, APIs, models, everything, right?
03:42 - I think for people, it was quite hard to find open source packages, and I tried to collect everything I could find and make it available so everyone can find what they need.
03:54 - It sounds like a great mission.
03:56 And as people will see, there is a bunch of stuff that we'll get to talk about.
04:01 So I noticed not absolutely every sport is covered there, but many of the popular sports are covered.
04:09 And if you're interested in sports, I think also if you're interested in just examples that connect with people, right?
04:16 Imagine you're a university professor and you don't wanna use the New York City tax data one more time.
04:24 You wanna say, well, maybe people are into soccer, American football, or NBA, whatever it is, right?
04:31 Maybe you could come up with something more interesting.
04:33 F1, for example, right?
04:35 - Yeah, definitely.
04:36 There's quite some cool data available to use in your courses, yeah, absolutely.
04:41 - Yeah, also, if people are members of some kind of club or team, maybe they could use some of this to bring some cool visualizations or analysis to their own organization.
04:53 That's also one of the things that PySport likes to encourage: using open source packages that are already available instead of building your own stuff, because that actually happens a lot.
05:07 So that's also a part of the mission of PySport, to make people aware of what's already there and try to bring people together.
05:16 One of the big problems, not problems, it's an opportunity, but it's also a challenge of Python.
05:20 You know, if you go to pypi.org right now, there's 453,000 packages.
05:25 I didn't know the number, but that's quite a lot.
05:28 We're coming up on half a million.
05:30 And if your goal is to work with some specific data set or try to solve a certain type of problem, often the hardest part is figuring out, well, what library do I use? Does it exist?
05:40 And if so, is it up to date and all of these things.
05:43 So having a list like this, a place that aggregates it and sorts it, filters it, super neat.
05:48 So really looking forward to talking to you about it.
05:51 Before we get to that though, just give us a quick bit on your backstory.
05:54 How'd you get into programming in Python?
05:56 Yeah, this is an interesting story.
05:58 I think I started programming when I was around 12, when I got the Lego Mindstorms.
06:03 Lego that you could also program.
06:05 My father gave me a Visual Basic book.
06:09 Yeah, I should just figure it out.
06:11 So that's where I started with programming, and then during high school, I also did web development with PHP.
06:19 I'm not really sure at what age, but eventually I ended up with a, I think, the first Dutch search engine.
06:26 They needed a Python developer, and I didn't really know Python, but that wasn't an issue.
06:32 So there I learned Python.
06:35 And from that point on, I, yeah, only, or mostly used Python.
06:40 Right on, you're like, all right, forget this PHP stuff, go in Python.
06:44 Well, to be honest, at my company, we still use PHP.
06:48 Yeah, I'm sure.
06:49 Because, well, it works quite well.
06:52 The performance is okay.
06:53 I'm not really sure if I'm allowed to say it on this show, but it also has some advantages.
06:58 Yeah, absolutely.
06:59 Well, I mean, almost every language has some, something it's particularly good at and reasons to keep using it.
07:05 And then also there's just tons of software that was written, you know, pick your language written in that language and it still works well.
07:10 And there's plenty of reason to just keep going with it.
07:13 But, yeah, Python is really my language of most interest,
07:19 what I really use in my day-to-day work.
07:25 What are you doing these days?
07:26 Are you still working at the search engine?
07:28 No, I did that quite a while ago.
07:30 I also worked at a huge online marketing agency, where I ran the software department and we created tools to collect all the data from all kinds of different sources and make it available for the teams.
07:46 And right now I'm running my own company.
07:48 It's called TeamTV, where we provide all kinds of tools that use video and data, for example for performance analysis, but also for highlight creation or live streaming. We try, at least, to combine video and data in all possible ways within the sports domain.
08:11 That's, that's sounds like a really interesting thing to be working on.
08:14 I started mostly on the, on the video engineering part.
08:18 So we built quite some stuff ourselves there.
08:21 So from people uploading a huge amount of footage that we need to transcode and how to scale it, how to serve it, so that's stuff we build ourselves.
08:32 Later on, we keep on building more stuff around data and always keep the combination between data and the video because they, well, you can see some sort of metric, but you always want to see the footage behind it to actually understand the context of it.
08:49 Yeah, sure.
08:49 Sounds really fun.
08:50 And you're also involved with PyData Eindhoven.
08:53 Is that right?
08:54 Yeah, five years ago we started with PyData Eindhoven.
09:00 We were already friends with PyData Amsterdam.
09:03 They said, well, maybe you should also start an Eindhoven chapter.
09:06 I think this year will be the anniversary, the five-year anniversary, for PyData Eindhoven, and yeah, it's an amazing community.
09:15 That also inspired me to start with PySport.
09:19 I'm not really sure if all the people who listen to this podcast
09:23 know PyData, maybe.
09:25 I think, I hope so.
09:27 I suspect most, most of them probably do.
09:29 at least the data science inclined among us.
09:32 - Yeah, I can tell a little bit about it, but I think right now we have a nice way of organizing the meetups and trying to get more people involved and talk about data science and share knowledge.
09:43 And then once a year, we have the conference where we try to collect money that we can send to NumFOCUS, and they can share it over all the open source projects.
09:54 So yeah, it's a really amazing community here.
09:58 - Excellent. Yeah, NumFOCUS does a lot to support the bigger data science oriented projects.
10:06 I think that's kind of unique in the Python space.
10:09 You know, there's not really anything like that in the web or UI.
10:14 You know, there's not a lot of areas where there's like an organization that says, "Okay, we're going to try to find the popular projects and support them across organizations." Like people support Flask, but they don't also support Django in the same sort of organization.
10:28 I think it's, it's also an opportunity for all the companies that are using those open source packages to give back.
10:36 And I think doing it through NumFOCUS makes it also easier, because they use a lot of packages and can just donate to NumFOCUS, and they will make sure it's distributed over those, yeah.
10:50 If you use pandas, you should also support NumPy, right?
10:54 Because that's kind of the foundation of, and so.
10:56 - Yeah, yeah, yeah, yeah, yeah.
10:57 - Interesting.
10:58 Oh, that makes a lot of sense.
10:59 All right, well, let's jump into sports and your project, PySport.
11:06 There are other people who are maintainers and working on this, or is this just your project?
11:10 - At the meetup that we had just a couple of weeks ago, we got some more people together, and now we are building from there to get more people involved with just PySport.
11:23 But one of the projects we built with PySport is the Kloppy package.
11:27 And there we have worked together with Jan Van Haaren.
11:31 He's a head of data science at Club Brugge, a big club in Belgium.
11:37 We are the main maintainers there, but I think right now we have 22 contributors to the package.
11:44 So there's quite some people contributing there.
11:48 That's a big group.
11:49 That's a lot of people contributing.
11:50 Let's start it this way.
11:51 Tell people what PySport is and about that.
11:53 And then we can talk a little broadly just about sports analytics before we get into the details.
11:57 The most important mission of PySport is to bridge the gap between the clubs and sports analytics and just people, by using open source packages, because a lot of clubs are using open source packages.
12:14 Those packages are used by the clubs, and people want to have a way to contribute to their favorite club.
12:21 I think a lot of people are still struggling with how to do it.
12:24 And with PySport, yeah, we want to share the knowledge and teach people how to do it.
12:31 So we try to get the experts from the clubs, but also get the knowledge from, you know, like pandas or other big packages, and see how we can get all that knowledge into the sports analytics community. With Kloppy we try to set an example of how to build such a package, how to work together on such a thing, and also encourage people to contribute. We want to show that you don't have to create a pull request that's a major refactor, but that minor things, like fixing typos in the documentation, are also very valuable to a package.
13:16 - Interesting, so Kloppy is standardized soccer tracking and event data, right?
13:22 So you started out with soccer or as I guess a lot of the world might refer to it as football, but in the US, that's already taken, it's a namespace collision.
13:32 - Yeah, yeah, yeah, yeah.
13:33 Yeah, it's sometimes difficult to talk about football. Here in Europe we call it football, but in the package, because it's international,
13:43 so worldwide, I call it soccer.
13:47 Less namespace collision.
13:49 So give us a quick bit of background on Kloppy, but since it's kind of one of the founding, you created this as a way to sort of set an example, right.
13:57 For how to create a package and it helps people understand this event, this club data.
14:03 Where that started is on Twitter.
14:05 There was already quite some people talking about sports analytics, of course.
14:09 And one guy, Joe Mulberry, he's working at a Danish top club.
14:14 He asked for help because he created a notebook and he wanted to build a Flask API on top of it.
14:20 And I said, well, I know Python.
14:22 I don't know really much, much about soccer or about data, but yeah, I would like to be involved.
14:28 I would like to help you.
14:29 And when I received a notebook, I noticed that like 80% of the code was about reading and standardizing the data to a format that he could work with.
14:41 When we talked about it, it seemed like a lot of people were struggling with that issue and doing the same thing over and over again, because in more notebooks that I saw, people were doing the same thing, but in different ways, and some were incorrect or inefficient implementations.
15:00 So I thought, well, one thing I know is, is how to read data and how to get it into a standardized format because that was also one of the things I did at an online marketing company.
15:12 I don't know much, much about your data format, but I know about processing data and analyzing it and all that.
15:19 I built a package, started with just tracking data, but also try to explain what the next steps could be.
15:25 And then people said, well, this is really useful.
15:28 And from that part, I kept on adding the serializers for different kinds of data for the tracking data and also for the event data.
15:38 I also tried to get knowledge from bigger non-sports projects.
15:42 So I also got Will McGugan from the Textual package.
15:48 He also did several reviews on this package and gave feedback
15:53 to try to get the package to a higher level,
15:56 so people within the sports analytics community could also gain more knowledge from there.
16:02 But maybe it's also good to give a small bit of background on the data.
16:06 So the tracking data, that's like positioning data for all players on the pitch, I think most of the time at 25 frames a second.
16:15 So, you know, the location for each player and the ball and on the other side, you have the event data.
16:20 So there are all passes and shots and things like that.
16:23 At this time from this position, there was a shot on the goal or there was a pass or there was a takeaway or penalty.
16:30 - Yeah, that's event data, yeah.
16:32 And all the vendors choose different formats.
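A minimal sketch of the two kinds of data described above, and of what "standardizing" two vendor layouts could look like. Everything here is hypothetical: the class names, the field names, and both vendor formats are made up for illustration and are not Kloppy's API or any real vendor's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackingFrame:
    """Positions of every tracked object at one instant (about 25 per second)."""
    timestamp: float                          # seconds since kick-off
    ball: tuple[float, float]                 # (x, y) on the pitch
    players: dict[str, tuple[float, float]]   # player id -> (x, y)


@dataclass(frozen=True)
class Event:
    """A single action such as a pass or a shot."""
    timestamp: float                          # seconds since kick-off
    event_type: str
    player_id: str
    position: tuple[float, float]             # metres


def event_from_vendor_a(raw: dict) -> Event:
    # Hypothetical vendor A: time in milliseconds, coordinates already in metres.
    return Event(
        timestamp=raw["time_ms"] / 1000,
        event_type=raw["type"].lower(),
        player_id=str(raw["player"]),
        position=(raw["x"], raw["y"]),
    )


def event_from_vendor_b(raw: dict) -> Event:
    # Hypothetical vendor B: "mm:ss" clock, coordinates as percentages
    # of an assumed 105 x 68 metre pitch.
    minutes, seconds = raw["clock"].split(":")
    return Event(
        timestamp=float(int(minutes) * 60 + int(seconds)),
        event_type=raw["action"].lower(),
        player_id=str(raw["playerId"]),
        position=(raw["pos"][0] / 100 * 105, raw["pos"][1] / 100 * 68),
    )


a = event_from_vendor_a({"time_ms": 754000, "type": "SHOT", "player": 25, "x": 94.5, "y": 34.0})
b = event_from_vendor_b({"clock": "12:34", "action": "Shot", "playerId": "25", "pos": [90, 50]})
print(a)
print(b)
```

Once both records are in the same shape, analysis code only has to be written once, which is the point made about notebooks spending most of their code on reading and converting data.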
16:35 - Yeah, yeah, oh geez.
16:38 That sounds hard.
16:39 So first of all, 25 Hertz of all the people's location.
16:44 This is beyond somebody with just a pen and paper and notebook writing down, oh, at this time there was a shot on the goal by number 25.
16:52 Like, how do they get that data?
16:54 That's crazy.
16:55 - Yeah, yeah, those are quite advanced systems that they use.
16:58 So in the stadium, I think they have like 20 cameras around the pitch, and they use computer vision to detect all the players and combine it.
17:08 But I believe, and I'm not really sure if there are already vendors on the market that do it totally automated,
17:14 that in the systems currently used in soccer, there are still some people needed for difficult situations like a corner kick, where a lot of people are in a small area and a lot of occlusions happen.
17:26 - You can't see the numbers, yeah.
17:27 - Yeah, they can't see the numbers.
17:28 So just after a corner, some manual operator has to reassign some players or correct something, but it's quite a fun system already.
17:38 - It sounds incredibly advanced.
17:39 It sounds like an awesome data set to work with because with that much data, you really can make a lot of interesting predictions and trends.
17:48 I mean, at some point, maybe we'll just put some sort of like tracking RFID thing on the back of the player's heads, just stitch it on there.
17:56 then you can fully automate it, you know?
17:59 Yeah, I think, well, a soccer day ticket, yeah, maybe, yeah.
18:03 Not sure if all players would accept it, but for example, on ice hockey, yeah, you can put on the helmet.
18:09 Yeah, you could put on the helmet, sure.
18:11 For football, yeah.
18:13 Things like automobile racing, you know, they have, not all of them, but for example, F1 has an incredibly high frequency of, like, points that measure where this car is and how fast it's going; the cars are sending out real-time telemetry. There are certainly many sports that have quite high fidelity in their data.
18:32 I must admit I haven't seen the data from F1 yet, but it would be really interesting to learn from them how to work with data and see if it can be applied to football or soccer or other sports.
18:47 This portion of Talk Python to Me is brought to you by JetBrains, who encourage you to get work done with PyCharm.
18:53 PyCharm Professional is the complete IDE that supports all major Python workflows, including full stack development.
19:07 PyCharm just works out of the box.
19:09 Some editors provide their functionality through piecemeal add-ins that you put together from a variety of sources.
19:16 PyCharm is ready to go from minute one.
19:19 And PyCharm thrives on complexity.
19:22 The biggest selling point for me personally is that PyCharm understands the code structure of my entire project, even across languages such as Python and SQL and HTML.
19:33 If you see your editor completing statements just because the word appears elsewhere in the file, but it's not actually relevant to that code block, that should make you really nervous.
19:43 I've been a happy paying customer of PyCharm for years.
19:46 Hardly a workday passes that I'm not deep inside PyCharm working on projects here at Talk Python.
19:53 What tool is more important to your productivity than your code editor?
19:57 You deserve one that works the best.
19:59 So download your free trial of PyCharm Professional today at talkpython.fm/done-with-pycharm and get work done.
20:07 That link is in your podcast player show notes.
20:10 Thank you to PyCharm from JetBrains for sponsoring the show and keeping Talk Python going strong.
20:15 I bet it's a lot, actually.
20:18 I bet it is, you know, just in terms of actual quantity of data, you know, how fast they're sampling and how many cars for how long, it's probably a lot of data.
20:26 That's also one of the interesting things about working with sports data: the data engineering part. This package just focuses on reading the data, but then the next step is how to work with the data, especially if you would like to use the tracking data for a whole season. That's quite some data, which pandas can also start struggling a bit with.
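A rough back-of-the-envelope calculation shows why a season of tracking data gets large. The 25 frames per second figure comes from the conversation; the season length is an assumption (an 18-team league playing home and away), and stoppage time and substitutes are ignored.

```python
FRAMES_PER_SECOND = 25        # sampling rate mentioned in the conversation
MATCH_MINUTES = 90            # regulation time only, no stoppage time
TRACKED_OBJECTS = 22 + 1      # 22 players plus the ball
MATCHES_PER_SEASON = 306      # assumption: 18 teams, home and away (18 * 17)

frames_per_match = FRAMES_PER_SECOND * 60 * MATCH_MINUTES
positions_per_match = frames_per_match * TRACKED_OBJECTS
positions_per_season = positions_per_match * MATCHES_PER_SEASON

print(f"{frames_per_match:,} frames per match")
print(f"{positions_per_match:,} positions per match")
print(f"{positions_per_season:,} positions per season")
```

Under these assumptions that is 135,000 frames and about 3.1 million positions per match, and roughly 950 million positions per season, which is the scale at which a single in-memory table starts to become uncomfortable.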
20:48 It just occurred to me, there's probably a whole other demographic or aspect who would be interested in this kind of data, it would be like sports betting people.
20:57 I mean, not that I have any interest in that at all.
21:00 But if you were trying to figure out like, okay, if this team plays that team, if you can understand, okay, this, their star player, if we match up their moves against the other person's moves, it turns out there's a weakness in this way for their defense or who knows, right?
21:13 I mean, there's probably with that much data, there's probably some interesting stuff you can do.
21:16 I think that a lot of vendors of the data also have the betting industry as their clients.
21:25 Yeah, I don't really care to work for them or support them.
21:28 It's a little bit shady, I suppose.
21:31 But yeah, it does seem like it's almost like really detailed information about companies for the stock market.
21:38 This is kind of like a little bit like that for the sports betting in some ways, I suppose.
21:42 Yeah, yeah.
21:45 I think one of the challenges here is probably a lot of this data is not easily offered up.
21:51 There's probably not a lot of JSON APIs with low latency that are super easy to access.
21:56 For some there must be, but not...
21:58 There's probably a lot of data out there that is not overly welcome to be given out, or it's given out in batches over slow periods or something like that, right?
22:09 Maybe speak to a little bit about the data availability.
22:11 Yeah, that's quite an issue.
22:13 And I know mostly about the soccer data, but I can imagine that the same applies to most of the other sports.
22:21 And I think data availability is a major issue, at least if you want to encourage the community to work with it and do research on it and get people build more cool stuff without being within a club.
22:36 There are some companies that already provide quite a big set of open event data.
22:41 StatsBomb is one of them. I think they provide around 1500 data sets for event data.
22:48 But if you're looking at the tracking data, there may be like 10, maybe 15 sets available, because all those vendors have deals with the leagues. They are not allowed to share it.
23:00 So you have to know someone within a club or use Beautiful Soup or Scrapy or something like that.
23:07 That's the other option, but then it's still very hard to get tracking data, because I'm not sure if you can actually scrape it.
23:15 But that is one of the things that I noticed when working on the open source part of the PySport website: there are really a lot of scrapers, and I think that's an indication that there's an issue with data availability.
23:30 - Yeah, it's not "this plugs into the API," it's "this is a scraper."
23:35 - Yeah.
23:36 - I guess it's worth pointing out or throwing out a bit of word of caution, just because the website is publicly available and you can hit it with some kind of scraping tool, that doesn't mean you legally can do stuff with the data.
23:49 You probably want to be pretty careful about that, right?
23:51 - Yeah, because I think even when it's not explicitly mentioned, most of the times it's not allowed to scrape the data at all.
23:59 But also in soccer, quite some websites explicitly forbid it.
24:05 And yeah, so the scrapers are there, and I was thinking about it a bit: should I include them or should I not include them?
24:13 Because they kind of encourage non-legal actions, but yeah, I'm not really sure about it, again.
24:20 - Yeah, sure, I can see the case for both sides of that.
24:24 But yeah, I just wanna let people know, like, just be careful with what you do with the data.
24:28 It's one thing if it's an academic research project and it's just for my own interest or whatever.
24:33 But if you start scraping that website and trying to make money out of it, you should not do it.
24:38 Or find a way to do it legitimately.
24:41 But just don't sneak through.
24:44 All right.
24:44 Well, I think it might be fun to let's talk through some of the packages you have here.
24:51 So if you go to pysport.org and there's a nav bar and on the left it says open source.
24:57 And if people click that, then they end up with a whole bunch of, I'll open it just this way for a moment, we can look at it and talk about it.
25:06 So if you just click on it, it actually there's a delay as it downloads.
25:10 Yeah, that's still something I need to fix because, yeah, it's quite some packages.
25:16 It's not a complaint, it's just I don't know how many pages that is, but that's a really small scroll bar.
25:21 What I noticed that's pretty cool is you can go in, there's a filter that you all have, and you can filter by your language.
25:28 Right now you have Haskell, Python, and R, and others.
25:31 And then you can pick by sports, and then you can pick by type of thing, right?
25:36 So I filtered our discussion down to Python libraries just because, you know, - We have a single app for that.
25:41 - Yeah.
25:42 And you could also pick amongst the different types of tools.
25:46 So we talked about the scrapers, and probably to a lesser degree, the APIs, right, the API clients, which is cool.
25:53 There are some in there that say, "Here's the API, we just built a strongly typed package rather than just doing straight REST," which is great.
25:59 But you also have models and calculators like for predicting things and then IO for file formats, visualization, open data, and databases.
26:08 Right? So I encourage people to rather than try to read the whole list, which is hundreds and hundreds of packages, to, you know, filter down maybe to the sport you're interested in or a couple of sports or the type of tooling you're interested in. Yeah?
26:22 Yeah, I think filtering is a must, but maybe if you have plenty of time, you could just scroll and see what's interesting, because it's still, I think, a very interesting list to see what's available and get inspiration.
26:38 It's quite a list.
26:40 So what's the sort here?
26:41 If I come here, how do I, how does this get sorted?
26:44 Like, is there any meaning to the order they appear?
26:46 Is it just when they were entered or?
26:47 That's a good question.
26:49 I also open sourced the data collection part of this website, and the data is collected daily to provide an update.
26:57 I think there's an order.
27:01 The order in which I added the packages, I think that's the order here, but to be honest, it can be pretty random.
27:10 All right.
27:10 So here I'll just sort of go through a couple of the scrapers, and we can maybe dive into one or two, potentially. So there's py_ball.
27:19 We'll just go through a few just to give people a sense
27:22 of the ones here, right?
27:23 So there's py_ball, which is a Python API wrapper for stats.nba.com with a focus on NBA and WNBA applications.
27:34 That's pretty cool.
27:35 I don't know anything about stats.nba.com, but it looks like, yeah, this is a whole website with all sorts of data.
27:42 It's got players, teams, leaders.
27:44 Looks great.
27:46 I think quite some people are also using this package.
27:49 I think it's the most used package when working with basketball data.
27:54 And it's nice that they used the API to get this data.
27:59 Yeah, you get quite a bit of data here.
28:00 You've got like the player, their team, their age, their total number of points scored.
28:06 A lot of stuff you can do to sort of compare them.
28:08 And yeah, that's great.
28:10 So if you're into basketball, I think this is a great start.
28:13 That's also quite actively maintained.
28:17 That's also one of the things that I intentionally mentioned on the list, because some packages are not really maintained well.
28:26 I think it's a benefit.
28:28 Yeah, one of the things in the list that you call out is the number of contributors, the latest version, when the last commit was to the package.
28:36 That's pretty cool.
28:37 In the beginning, I thought, well, maybe I can just manually update the list, but...
28:41 But then I decided, I think data engineering is fun.
28:46 Let's find a way to automatically fetch the data and update it.
28:52 Also the license is pretty important to show it here.
28:56 And also the last commit, to see how actively it's maintained, the latest version, and also the contributors.
29:03 Because I think it's good when packages have some more contributors.
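As a rough illustration of the kind of per-package metadata discussed here, the sketch below reduces a repository payload to license, last activity, and a staleness flag. The field names `full_name`, `pushed_at`, and `license.spdx_id` follow the GitHub REST API repository response; the sample payload, the function name, and the one-year threshold are assumptions, and this is not how the PySport site itself is implemented. Contributor counts need a separate API call and are left out.

```python
from datetime import datetime, timezone


def summarize_repo(repo: dict, now: datetime, stale_after_days: int = 365) -> dict:
    """Reduce a GitHub-style repository payload to a few list-friendly fields."""
    # GitHub timestamps end in "Z"; convert to an offset that fromisoformat accepts.
    last_push = datetime.fromisoformat(repo["pushed_at"].replace("Z", "+00:00"))
    license_info = repo.get("license") or {}   # the API returns null when no license is detected
    days_since_push = (now - last_push).days
    return {
        "name": repo["full_name"],
        "license": license_info.get("spdx_id", "unknown"),
        "last_push": last_push.date().isoformat(),   # approximates the last commit date
        "stale": days_since_push > stale_after_days,
    }


# Hand-written sample payload; the repository name is made up.
sample = {
    "full_name": "example-org/example-sports-lib",
    "pushed_at": "2021-03-01T12:00:00Z",
    "license": {"spdx_id": "MIT"},
}
summary = summarize_repo(sample, now=datetime(2023, 5, 11, tzinfo=timezone.utc))
print(summary)
```

Passing `now` in explicitly keeps the function deterministic, which makes a daily job like the one described easier to test.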
29:08 - Sure, the difference between a package with one contributor and one with 30 contributors.
29:13 That's a big difference.
29:14 It's a really big difference, yeah.
29:15 - Yeah, I think it's also good for people to see if there's a package with just a single contributor that might give an opportunity to contribute to it or work together.
29:26 So I would like to encourage people to get involved in those projects.
29:31 - Yeah, that's a good idea.
29:32 - So yeah, that could help out.
29:34 - Yeah, and each one of these packages, you can go in and open the details here and it gives you a little bit more information.
29:39 Like, for example, it actually lists the contributors and links to their GitHub profiles, and shows their website and the GitHub page and PyPI and so on.
29:48 And also you can click on one of the contributors and see what other packages they built.
29:54 Oh, really?
29:56 So like, if I click on this one, yeah, they've done this, just this one.
30:00 Well, and this one, just a single one.
30:02 Some of them, they might have worked on multiple.
30:04 I know Dependabot has worked on a few.
30:07 That's a really nice contributor to have.
30:09 Yeah, yeah, an absolutely prolific open source contributor.
30:14 Works on my project too.
30:15 This portion of Talk Python to Me is brought to you by InfluxData, the makers of InfluxDB.
30:22 InfluxDB is a database purpose-built for handling time series data at a massive scale for real-time analytics.
30:30 Developers can ingest, store, and analyze all types of time series data, metrics, events, and traces in a single platform.
30:37 So dear listener, let me ask you a question.
30:39 How would boundless cardinality and lightning fast SQL queries impact the way that you develop real-time applications?
30:46 InfluxDB processes large time series data sets and provides low latency SQL queries, making it the go-to choice for developers building real-time applications and seeking crucial insights.
30:58 For developer efficiency, InfluxDB helps you create IoT, analytics, and cloud applications using timestamped data rapidly and at scale.
31:06 It's designed to ingest billions of data points in real time with unlimited cardinality.
31:12 InfluxDB streamlines building once and deploying across various products and environments from the edge, on-premise, and to the cloud.
31:20 Try it for free at talkpython.fm/influxdb.
31:24 The link is in your podcast player show notes.
31:27 Thanks to InfluxData for supporting the show.
31:30 I didn't realize you could actually see all the projects that PySport knows about that that particular user works on.
31:38 That's a cool aspect of it.
31:40 >> I spent quite some time on fetching all the data and trying to combine it.
31:44 I'm fetching the data for the Python packages and doing something similar for the R packages.
31:49 Yeah, and seeing how to get all the available data in one place.
31:53 It also tries to fetch images or screenshots from the READMEs of the repositories, which works for some.
32:01 >> Yeah, that's nice. Screenshots can be very helpful.
32:04 Less important on the scrapers, more on the visualizers probably, but still.
32:09 Yeah, definitely.
32:10 What is opensource.pysport.org written in?
32:14 It's in React, using Next.js.
32:19 So it was also quite an adventure for me because it's my first application of that kind. That might also explain why it's still a bit slow on loading, because I didn't really dive into how to make it faster. But in the backend it's Python, using Luigi.
32:38 I still think that's a pretty interesting tool because it makes it really simple to set up orchestration of some tasks,
32:49 like the daily scraping, updating the packages, and that kind of stuff.
32:53 And then there's a GitHub Action that runs on a daily basis, fetches all the data, updates it, and commits it to a different branch.
33:01 And that one gets deployed to Vercel, I believe.
33:05 - Okay, very interesting.
33:07 - But if you are interested in the source, it's also open source.
33:13 - Okay, great.
33:14 So, Py ball for NBA, we have the Hockey Scraper, which is for scraping NHL play-by-play and shift data with six contributors.
33:24 That's pretty interesting.
33:25 - What you'll see in the filtered list is that there's a package for every sport, also for the NHL, for ice hockey.
33:32 That one is a little bit less maintained, I think.
33:36 I'm not really sure if it still works, because with those scrapers, it can work today and not tomorrow.
33:44 It doesn't even necessarily mean that they were intentionally blocked, right?
33:49 It could just be, "Hey, we've redesigned our site.
33:51 Doesn't that look awesome?" You're like, "Oh, the CSS selectors no longer pull up the thing." So, yeah.
33:57 So that's also the thing with the scraping part.
33:59 If the last commit was a while ago, yeah,
34:03 it might be broken.
34:05 Maybe, maybe not.
34:06 Yeah, sure.
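The "last commit was a while ago" rule of thumb can be sketched in a few lines. This is only an illustration: the function name, the one-year threshold, and the sample dates are assumptions, not something PySport's site is known to use.

```python
from datetime import date, timedelta


def looks_stale(last_commit: date, today: date, max_age_days: int = 365) -> bool:
    """Flag a scraper whose last commit is older than max_age_days.

    The 365-day default is an arbitrary, hypothetical threshold.
    A stale scraper is not necessarily broken, only worth testing first.
    """
    return today - last_commit > timedelta(days=max_age_days)


# Hypothetical last-commit dates, checked against the recording date.
print(looks_stale(date(2021, 3, 1), today=date(2023, 5, 11)))   # True
print(looks_stale(date(2023, 4, 20), today=date(2023, 5, 11)))  # False
```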
34:07 All right.
34:07 Let's see some more.
34:08 And I think statsbombpy is an official package.
34:12 It's also cool that StatsBomb provides an open source package for accessing their data.
34:18 What is StatsBomb?
34:19 I see that showing up in many places on these different packages.
34:22 - Yeah, StatsBomb is, I think, one of the leading providers of event data, and I think in both soccer and American football.
34:33 So they provide event data, so everything that happens on the pitch, like passes, dribbles, interceptions, everything.
34:41 They are also one of the providers of the open data sets.
34:45 - Okay, yeah, they've got a free data section, that's cool.
34:48 - Yeah.
34:48 - They proclaim themselves as data champions.
34:52 Yeah, I think the data is pretty good.
34:55 I think it's also one of the best in the market right now.
35:00 But at least that's what I heard from some users.
35:04 They even have courses.
35:05 Modern Scouting and Data-Driven Recruitment.
35:08 That's kind of interesting, isn't it?
35:10 Yeah, you also have to figure out how to apply data science in your job.
35:16 So how to use it and how to use the data for scouting purposes.
35:22 If you work in a professional sports organization or even college sports, in the US at least, there's a lot of recruiting people up from lower levels.
35:31 That applies in all sports, but I think the data really helps to cut down the number of players that you have to watch on footage.
35:43 So if you can already make a short list instead of watching 15,000 players, then yeah, it's really convenient.
35:50 Or maybe you're looking for a particular asset or a particular part of the play that a player is good at, right?
35:57 Maybe you're looking for a quarterback for a football team who is especially good at running the ball in addition to just throwing it, right?
36:05 You could, you could ask the data for that and really narrow in quite quickly, I imagine.
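"Asking the data" for a quarterback who can both throw and run can be as simple as a filter plus a sort. A minimal sketch follows; the players, field names, and thresholds are all invented for illustration and do not come from any real data set.

```python
# Hypothetical per-game averages; none of these numbers are real.
players = [
    {"name": "QB A", "pass_yards_per_game": 260.0, "rush_yards_per_game": 45.0},
    {"name": "QB B", "pass_yards_per_game": 285.0, "rush_yards_per_game": 8.0},
    {"name": "QB C", "pass_yards_per_game": 240.0, "rush_yards_per_game": 52.0},
    {"name": "QB D", "pass_yards_per_game": 190.0, "rush_yards_per_game": 60.0},
]


def shortlist(candidates, min_pass, min_rush):
    """Keep players meeting both thresholds, best runners first."""
    keep = [
        p for p in candidates
        if p["pass_yards_per_game"] >= min_pass
        and p["rush_yards_per_game"] >= min_rush
    ]
    return sorted(keep, key=lambda p: p["rush_yards_per_game"], reverse=True)


names = [p["name"] for p in shortlist(players, min_pass=230.0, min_rush=40.0)]
print(names)  # ['QB C', 'QB A']
```

The point is the one made in the conversation: the shortlist decides whose footage to watch, it does not replace watching it.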
36:10 And then you have to work with the data, figuring out how to extract it, because maybe that single metric that's really important for you is not available in the original data set.
36:20 So then you have to figure out how to work with the data and get those metrics out of the raw data.
36:27 Maybe it's something calculated or inferred or...
36:29 And that's also one of the things that happens in soccer based on the tracking data, and it will probably also happen in football and all the other sports: clubs will define their own metrics based on tracking data and use them to figure out which players match their own style of play the most.
36:49 >> Cool. Okay. So yeah, as you can see, there's a bunch of stats bombs here.
36:54 pybaseball and mlbgame seem to be a couple of things around baseball data.
37:00 I feel like baseball is one of those games that was almost created by a statistician just so they could come up with stats.
37:10 There's so many stats and people get averages.
37:14 What kind of hitter are they?
37:15 Well, they're like a 0.3, you know, they're a .300 hitter, right?
37:19 For 30% and all that.
37:22 And I'm not a huge fan of baseball.
37:24 I find it kind of a slow game.
37:26 It's kind of fun to play, but to watch it.
37:27 And it's like, you know, same as golf.
37:29 I don't watch those things.
37:30 I'm sure they're fun to play, but it's just like in terms of stats, these kinds of games, there's probably a ton of stats here because it's all about stats.
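For anyone unfamiliar with the ".300 hitter" shorthand mentioned above: a batting average is hits divided by at-bats, conventionally shown to three decimals. A tiny sketch, with a made-up season line:

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Return hits per at-bat, or 0.0 when there are no at-bats."""
    if at_bats == 0:
        return 0.0
    return hits / at_bats


# Hypothetical season line: 150 hits in 500 at-bats.
avg = batting_average(150, 500)
print(f"{avg:.3f}")  # prints 0.300
```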
37:38 I also believe that the baseball data science departments are among the biggest across all sports.
37:45 And maybe, but I'm not sure about it.
37:48 You can also make a lot of impact there.
37:51 Maybe that's also because in other sports, for example soccer, a lot of things have an impact on the eventual outcome.
37:58 It's also a discussion whether all the data is available to know what actually has the most impact.
38:06 So this is also one of the discussions within the soccer analytics community.
38:11 Yeah. For both of these, pybaseball and mlbgame, you can see from your Luigi automation that, well, mlbgame is not particularly up to date.
38:22 I guess the pybaseball one is more up to date, but you know, 13 contributors, 30 contributors.
38:27 That's quite a lot.
38:28 That's quite a lot.
38:29 And pybaseball was updated this month, right?
38:33 But you know, when I saw these, I'm like, oh, these are kind of similar.
38:36 And then I look at your page here and I see, oh, well, pybaseball is, you know, way more up to date and modern, so I should check that out first.
38:43 That's the kind of value you get for having that info.
38:45 That's also the intention, that you have quite a quick overview of, yeah,
38:52 is it maintained?
38:53 And, yeah, that one also goes against the API.
38:58 So let's see, a couple more. I guess it's worth giving a shout out to nflfastpy. Well, you know, the NFL's got a lot of data as well. What else? There's some college baseball.
39:09 Here's one that I think shows up across a lot of the different categories, because it seems to do a lot, which is FastF1.
39:16 Have you seen that?
39:17 Have you played with this any? It was also updated this month.
39:19 I should dig into it, because it has quite some contributors.
39:23 And I think it's really interesting to also look at motor sports or cycling or more of those sports, to see what they are doing and how they are doing it.
39:32 I noticed looking through here that there's not a lot of motor sports compared to the other sports.
39:37 So people, if you're out there, like if you're in IndyCar or motocross or something like that, and you've got a package, then shoot it over to these guys and have them put it in the list.
39:45 That'd be cool.
39:46 Yeah, FastF1, they've got a page here that has a bunch of things.
39:49 It has access to timing data, telemetry, and session results, and all the data is provided in an extended pandas DataFrame format, which is pretty cool, right?
40:01 Integration with matplotlib.
40:03 There's an examples gallery too.
40:05 You come over here and you can see it has things like position changes during the race.
40:10 So if you go up here, it'll do things like,
40:15 you know, go get season 2023, race one, and R for race, I guess, rather than practice or qualifying.
40:22 And that's Bahrain.
40:24 And so then here's, you know, it has all the drivers, their time throughout the race, their position.
40:28 You can probably see pit stops.
40:30 There's a lot of cool stuff you can see in here.
40:31 It looks really nice.
40:32 And also with those examples, I think that's really helpful to get people started with those packages.
40:40 It's not exactly a Jupyter notebook.
40:41 It's the HTML of a Jupyter notebook, but you know, it's still exactly what you need, right?
40:46 But I think you can use it.
40:47 You can even download it as a notebook.
40:49 You download it right there.
40:51 And apparently it took two and a half seconds to run the script.
40:54 Let's see.
40:56 You've even got cool visualizations, like on the track, colored by speed around the track. You know, there's a lot of cool data here.
41:05 I'm not really sure why I haven't seen this one before, but yeah, it looks really, really cool.
41:11 I looked around a couple of the different packages, and for this one the documentation and examples and stuff seem super good.
41:17 So that's the scrapers.
41:18 There's many more, there's plenty more there.
41:21 Another one: models and calculators. Maybe take us through some of the ones that stand out in this category.
41:27 Like for example, there's Laurie's code for Metrica tracking data.
41:30 I love it.
41:30 That is just, it's Laurie's code.
41:33 So this is mostly about how to do all kinds of modeling on top of the data,
41:38 to do predictions on top of the data.
41:40 One of the packages that I think is pretty interesting is socceraction.
41:45 Of course, again, it's soccer.
41:47 But for example, they have soccer_xg, which is, what is that? XGBoost models for soccer event data?
41:56 Yes, the expected goals.
41:58 So what's the expected value for a certain shot, whether it should go in or not.
42:05 That's also based on the position on the pitch and how many players are between the player with the ball and the goal.
42:14 So you can use it to determine whether a player should score a goal and how many goals he should make.
42:22 - And the likelihoods.
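To make the expected-goals idea concrete, here is a toy sketch: a logistic function turns shot features into a scoring probability, and summing those probabilities gives how many goals a player "should" have scored. The two features mirror the ones mentioned above (position relative to the goal, players in between). The coefficients are invented for illustration; this is not how soccer_xg or any real provider computes xG, where the weights would be fitted on event data.

```python
import math


def expected_goal(distance_m: float, defenders_between: int) -> float:
    """Toy xG: probability that a shot results in a goal.

    The coefficients below are hypothetical, hand-picked values.
    """
    intercept, w_distance, w_defenders = 0.5, -0.12, -0.4
    z = intercept + w_distance * distance_m + w_defenders * defenders_between
    return 1.0 / (1.0 + math.exp(-z))  # logistic function, always in (0, 1)


# Hypothetical shots: (distance to goal in metres, defenders in the way).
shots = [(6.0, 0), (11.0, 1), (25.0, 3)]
total_xg = sum(expected_goal(d, n) for d, n in shots)

for d, n in shots:
    print(f"{d:>5.1f} m, {n} defenders -> xG {expected_goal(d, n):.2f}")
print(f"expected goals over these shots: {total_xg:.2f}")
```

Comparing a player's actual goals against the summed xG is what indicates whether they are finishing better or worse than the chances suggest.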
42:24 I think this is actually one of the really interesting aspects, the models and calculators.
42:28 You know, the prediction side is pretty cool.
42:30 There's quite some work to do for PySport, because expected goals, for example, is also one of the things that I've seen in ice hockey, and in other sports where you have to score in a goal.
42:44 And I think it would be cool to find a way to abstract it over all sports.
42:49 Yeah, because it is kind of the same idea, probably with different datasets. Scoring in hockey and scoring in soccer are, from a structural perspective of the data, kind of the same thing, even though they're quite different in the size of the goal and how easy it is and all that.
43:06 But I think we can still learn from, yeah, from the other sports and see how they did it.
43:10 Train up a model, but on different data.
43:13 But same, same type of model potentially.
43:14 Maybe some different features, but, yeah.
43:17 So the next category is IO, and obviously StatsBomb is in here, right?
43:23 A Python package to parse StatsBomb's JSON data to CSV, which is cool.
43:27 Some on soccer, the SPADL format, which I have no idea what that is.
43:31 That's also one of the things they built, to make like an atomic data format.
43:37 That's yeah.
43:39 Also kind of standardized.
43:41 So there's some overlap between socceraction and Kloppy.
43:44 I think they mostly focused on how to eventually work with the data,
43:49 so calculating the expected threat and also a contribution model.
43:55 So for every action towards our goal, I was there.
44:01 So maybe there was a takeaway and then a pass and a pass and then a score, like all of those people should somehow get credit for that potentially, right?
44:08 Makes sense.
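One way to picture "everyone in the chain gets credit" is to value each action by how much it changed the chance of scoring. The sketch below is in the spirit of action-valuing models but is not socceraction's API; the probabilities and player labels are invented for illustration.

```python
def credit_actions(actions):
    """Credit each player with the change in scoring probability they produced.

    actions: list of (player, action, p_before, p_after) tuples, where the
    probabilities are hypothetical estimates of scoring from that game state.
    """
    credit = {}
    for player, _action, p_before, p_after in actions:
        credit[player] = credit.get(player, 0.0) + (p_after - p_before)
    return credit


# A made-up possession: takeaway, two passes, then a goal.
chain = [
    ("Defender", "interception", 0.01, 0.03),
    ("Midfielder", "pass", 0.03, 0.10),
    ("Winger", "pass", 0.10, 0.35),
    ("Striker", "shot", 0.35, 1.00),
]

for player, value in credit_actions(chain).items():
    print(f"{player:<10} {value:+.2f}")
```

With this scheme the credits add up to the total change over the possession, so the scorer does not absorb all of it.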
44:09 But they also built a way to load the data, and we are currently
44:15 working together with them to see if we can make Kloppy load the data, so that the Kloppy package focuses on loading it and standardizing it,
44:24 and then socceraction uses it afterwards.
44:27 So we'll see how the Lego blocks can work together.
44:31 We have nfldb, a library to manage and update NFL data in a relational database.
44:36 That's kind of cool.
44:37 All right.
44:37 Let's see.
44:38 The next category is the visualization.
44:40 I've probably the, excuse me.
44:42 The most important part is probably the actual data acquisition, but the most desired part is probably the visualization, right?
44:48 The data engineering part is not really, really sexy.
44:53 I mean, no one sees it.
44:55 And the output is a structured CSV or Parquet file.
44:58 So it's not really cool to show. But for example, mplsoccer, I think, is a really, really nice package, used by every person in the soccer community.
45:12 There's a lot of, contributors here.
45:15 And the visualizations look really cool.
45:17 - Yeah.
45:18 - They also have a huge list of examples.
45:22 - Okay.
45:23 - So there are all kinds you can just copy and paste to create some.
45:27 - Pizza charts, I love them.
45:29 - Yeah.
45:29 - Yeah, we'll actually come back to the pizza charts in just a moment actually.
45:32 But yeah, these are some good looking visualizations here.
45:35 - Yeah, and I think the interesting thing about this package is that at some point there were two packages that did similar things, and then they decided, well, we should just work together.
45:46 And they spent quite some time on integrating those packages.
45:49 And then there was one.
45:50 And I think that's really cool to see that instead of kind of competing, they decided to work together and make, I think, one of the most awesome packages for the soccer community.
46:01 - It's really nice.
46:02 It's really nice.
46:03 There's a lot of soccer ones in here.
46:05 There's also one for a PT plot for American football, although I don't understand what PT stands for.
46:11 And then the fast formula one is also in there.
46:14 We already saw those pictures, but a lot of nice visualizations there.
46:17 - Yeah.
46:18 - And is that it for all the categories?
46:20 No, then there's the open data.
46:21 - Yeah, I think maybe when I look at this list, there are some missing.
46:25 It's still a bit limited on what data is available.
46:29 That's something where we should also work together with leagues to see if there's a way to make some more data available.
46:37 - Yeah, if they have it and offer it publicly, put it in the list, right?
46:40 - When it's available, I would definitely add it.
46:43 But there's already some interesting data. They may be somewhat smaller data sets, but you can definitely use them to start playing around.
46:52 - All right, so I think that kind of covers the list with the Python filter on.
47:00 You also wanted to give a quick shout out to nflverse, right? Because, while not Python, it's quite a series of packages that do cool stuff with NFL data, right?
47:11 So it's not Python, it's for the R users, but I think what's really interesting there is that they created quite some different packages: one for collecting the data, one for organizing it, one for reading the data, one for doing all kinds of modeling, one for creating the visualizations.
47:30 And I think that's also an example for other sports of how to make those packages available, making sure that everything fits together.
47:40 - Yeah, that's cool.
47:41 It's under the nflverse organization, but it's a bunch of different projects.
47:45 You talked about having the data and stuff that's not immediately obvious or predictable.
47:50 You might need a higher level sort of thinking about it.
47:53 And one of them that stands out here is nfl4th, which studies fourth-down decision data with the nflverse models, which is kind of cool because that's one of the big decisions that a coach makes.
48:05 It can make the game or it can lose the game and there's a go, no-go decision, right?
48:09 And it's not just, well, they went this far and then they didn't make it.
48:13 It's, well, they had 30 seconds left in the game and they had to do it, you know, because otherwise they were just going to lose anyway, right?
48:21 There's a lot of higher-level inference you want to bring into that, rather than just, 30% of the time they make it on fourth down, right?
48:29 And that's, I think, also one of the reasons they built an entire package around it, to work with it. - Yeah, it's pretty interesting.
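The point about context can be shown with a small win-probability comparison: the same 30% conversion chance leads to opposite decisions depending on the game situation. This is a simplified sketch with made-up numbers, not nfl4th's API or models, and it only compares going for it against punting.

```python
def go_for_it_win_prob(p_convert, wp_if_converted, wp_if_failed):
    """Win probability of going for it, weighted by the conversion chance."""
    return p_convert * wp_if_converted + (1 - p_convert) * wp_if_failed


def best_decision(p_convert, wp_if_converted, wp_if_failed, wp_if_punt):
    """Pick whichever option leaves the higher win probability."""
    wp_go = go_for_it_win_prob(p_convert, wp_if_converted, wp_if_failed)
    return ("go", wp_go) if wp_go > wp_if_punt else ("punt", wp_if_punt)


# Hypothetical situations, both with a 30% chance to convert.
late_and_trailing = best_decision(0.30, wp_if_converted=0.40,
                                  wp_if_failed=0.02, wp_if_punt=0.05)
early_and_tied = best_decision(0.30, wp_if_converted=0.55,
                               wp_if_failed=0.40, wp_if_punt=0.50)

print(late_and_trailing[0])  # go
print(early_and_tied[0])     # punt
```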
48:36 Now, before all the Python people say, "I don't want to learn R, I don't care about R." It is also worth pointing out that you can call R from Python.
48:44 I don't know how much the visualization stuff still works super well or anything like that, but you can use, what is it called?
48:52 -rpy2. -Okay.
48:54 And you just pass it an R file, and then you start calling functions or whatever.
49:00 Get a function out of it and call that function.
49:02 So, you know, if you really, really want to use some of these packages, maybe it's worth doing a quick little integration, then turning it into a pandas DataFrame and running with it or something.
49:13 It looks interesting. It's definitely worth a try.
49:16 It's nothing I've ever used, but I can see, you know, if you really care about NFL data and you really care about Python, it might be worth giving those combos a look there.
49:25 I think there is one package to work with their data from Python.
49:30 So if you look at the list there, there should be at least one.
49:33 I think it's not on their website or their GitHub page, but I think there's another one that integrates well with it.
49:42 Not under the organization, but maybe somebody else made one that does.
49:46 That's cool.
49:47 Maybe they use this, this integration.
49:49 I don't know.
49:50 All right.
49:51 And then the last thing I want to talk about here is interesting on two levels.
49:55 So you've got playground.pysport.org, which is a hosted notebook to play with some examples, in particular Kloppy and mplsoccer, right?
50:05 I think one of the issues or challenges for a lot of people, also those working within the bigger clubs, is that they don't always have a background in programming.
50:15 So often they start as a video analyst or working as a performance analyst, and then they think, well, there's data.
50:21 I want to work with it.
50:23 And if you need to set up your Python environment for the first time, it can be a bit overwhelming.
50:29 So that's why, well, there is JupyterLite, which is a very cool project based on Pyodide.
50:37 So let's see if, yeah, if you can use it.
50:39 And it's just a start, with the Kloppy and the mplsoccer packages.
50:44 I just fetched the notebooks from their galleries, integrated them into the playground, and you can just start playing around with it.
50:56 And so here's a proper Jupyter notebook using all of their libraries and stuff.
50:56 But what's awesome about this, as you said, is that it's based on Pyodide. I'm not sure it has necessarily sunk in for people: this is running in WebAssembly in the front end, right?
51:11 Which is pretty epic.
51:13 It makes it really convenient for people to just start playing around with it without installing Python and working with virtual environments.
51:21 You know how it works.
51:23 It makes it super easy for you to host it because all you're doing is serving up static files.
51:28 You're not hosting, you're not running a Kubernetes cluster or anything like that.
51:32 Trying to prevent abuse of it and so on.
51:35 So, yeah, multiple sides make it good, for me and for the people using it.
51:41 For sure.
51:41 And it even does that, that wild, what's it called?
51:44 pizza, pizza plot, that kind of style of plot that we were looking at.
51:48 And it, it runs fast and great.
51:50 This is really, really nice.
51:52 Are you happy with Pyodide and JupyterLite?
51:54 Yeah, well, there were some issues with it, especially around fetching data, because some of these try to fetch the open data from StatsBomb, or also some fonts and stuff like that.
52:09 So we had to work around it.
52:12 What you see at the top here is the patching of the requests library to make it work in JupyterLite.
52:19 I think it's better to have a working version than not patching it.
52:26 I think it's great.
52:27 And then everything that uses requests can just do its thing.
52:31 This is really cool.
52:31 I, when I saw that you had this, I thought, Oh, this is, this is clever that it's based on JupyterLite and it's really nice.
52:37 So people, people can check that out.
52:39 Maybe people out there are listening who maintain some of these packages and have notebooks.
52:43 Like if they get them working here, could they submit them to you and have them added in this list?
52:48 The entire playground is part of the PySport organization on GitHub.
52:53 You can just go see the repository and make a pull request.
52:57 And I will just review it and merge it and then it will be available.
53:03 Oh, yeah.
53:04 That's awesome.
53:04 So really happy for more packages here.
53:06 More examples.
53:08 - Yeah, more examples would be very welcome.
53:11 All right.
53:12 Well, I think we're getting pretty much short on time for talking about sports analytics, but really, really good work there.
53:19 Now, before you get out of here, I have the final two questions for you.
53:22 I always ask.
53:23 Notable PyPI package, something you've come across, like, oh, this library is awesome.
53:27 People should check it out.
53:28 I mean, it's kind of the whole topic of this show.
53:30 So we talked about, you know, maybe a hundred. We didn't mention them all, but we went through a list of a hundred different Python packages. But is there something you want to give a shout out to that's out there?
53:39 - I'm not really sure if the entire Python world already knows it, but at the last PySport meetup, I made an example using DuckDB, and that was something that people didn't know about, especially the integration with pandas DataFrames, where you just build a DataFrame and run queries directly on top of it.
54:00 Yeah, I've-- - Oh, interesting.
54:01 I'd heard of DuckDB, but I didn't realize it has that kind of direct pandas integration. It also has direct Parquet querying, interesting.
54:08 That makes it quite easy to also play around with SQL queries.
54:14 And I was very happy that I had a presentation on last Py data.
54:18 I drove a conference.
54:20 I think it's a package that, well, maybe not for everyone, but it's really worth checking out because it can make your life easier.
54:28 I think it's just a Swiss army knife for data engineering.
54:32 And yeah, I think it's a nice one.
54:35 Great recommendation.
54:36 And if you're going to write Python code, what editor are you using these days?
54:39 I'm using PyCharm.
54:42 So not, yeah, not sure if it's cool.
54:44 I love PyCharm.
54:45 PyCharm is awesome.
54:45 Excellent one.
54:46 So I guess final call to action, people are interested in open source sports analytics, they're open and maybe interested in PySport, want to contribute back or, you know, be part of it in some way, what do you tell them?
54:57 Yeah, you can reach out on Twitter or LinkedIn to see where you can contribute.
55:05 And I think it's also, if you're not working in the sport domain and would like to contribute, please reach out because I think the knowledge from outside of sports is really useful within sports.
55:16 So there are a lot of options to contribute and make it an even better community.
55:23 Yeah, absolutely.
55:24 All right.
55:25 Well, thank you so much for being here and sharing all these projects you've collected.
55:28 Thanks a lot for being on the show.
55:30 It's really, really nice.
55:32 Thank you.
55:33 You're welcome.
55:35 This has been another episode of Talk Python to Me.
55:37 Thank you to our sponsors.
55:39 Be sure to check out what they're offering.
55:40 It really helps support the show.
55:42 The folks over at JetBrains encourage you to get work done with PyCharm.
55:47 PyCharm Professional understands complex projects across multiple languages and technologies, so you can stay productive while you're writing Python code and other code like HTML or SQL.
55:58 Download your free trial at talkpython.fm/done-with-pycharm.
56:04 InfluxData encourages you to try InfluxDB.
56:07 InfluxDB is a database purpose-built for handling time series data at a massive scale for real-time analytics.
56:14 Try it for free at talkpython.fm/influxdb.
56:18 Want to level up your Python?
56:20 We have one of the largest catalogs of Python video courses over at Talk Python.
56:24 Our content ranges from true beginners to deeply advanced topics like memory and async.
56:29 And best of all, there's not a subscription in sight.
56:31 Check it out for yourself at training.talkpython.fm.
56:35 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.
56:39 We should be right at the top.
56:40 You can also find the iTunes feed at /iTunes, the Google Play feed at /play, and the Direct RSS feed at /rss on talkpython.fm.
56:50 We're live streaming most of our recordings these days.
56:53 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.
57:01 This is your host, Michael Kennedy.
57:03 Thanks so much for listening.
57:04 I really appreciate it.
57:05 Now get out there and write some Python code.
57:07 (upbeat music)