WEBVTT

00:00:00.001 --> 00:00:06.600
If you're looking for fun data sets for learning, for teaching, maybe a conference talk, or even if you're just really into them,

00:00:06.600 --> 00:00:11.560
sports offers up a continuous stream of rich data that many people can relate to.

00:00:11.560 --> 00:00:14.100
Yet accessing that data can be tricky.

00:00:14.100 --> 00:00:18.040
Sometimes it's locked away in obscure file formats.

00:00:18.040 --> 00:00:22.140
Other times the data exists, but without a clear API to access it.

00:00:22.140 --> 00:00:28.900
On this episode, we talk about PySport, something of an awesome list of a wide range of libraries,

00:00:28.900 --> 00:00:36.660
mostly but not all Python, for accessing a wide variety of sports data from the NFL, NBA, F1, and more.

00:00:36.660 --> 00:00:41.960
We have Koen Vossen, the founder of PySport, to talk through some of the more popular projects.

00:00:41.960 --> 00:00:47.540
This is Talk Python To Me, episode 416, recorded May 11th, 2023.

00:00:57.800 --> 00:01:03.760
Welcome to Talk Python To Me, a weekly podcast on Python.

00:01:03.760 --> 00:01:05.480
This is your host, Michael Kennedy.

00:01:05.480 --> 00:01:12.960
Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both on fosstodon.org.

00:01:12.960 --> 00:01:15.580
Be careful with impersonating accounts on other instances.

00:01:15.580 --> 00:01:16.540
There are many.

00:01:16.540 --> 00:01:21.600
Keep up with the show and listen to over seven years of past episodes at talkpython.fm.

00:01:22.300 --> 00:01:25.640
We've started streaming most of our episodes live on YouTube.

00:01:25.640 --> 00:01:33.180
Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.

00:01:33.980 --> 00:01:39.660
This episode is brought to you by JetBrains, who encourage you to get work done with PyCharm.

00:01:39.660 --> 00:01:46.560
Download your free trial of PyCharm Professional at talkpython.fm/done-with-pycharm.

00:01:46.560 --> 00:01:49.060
And it's brought to you by InfluxDB.

00:01:49.060 --> 00:01:55.660
InfluxDB is the database purpose built for handling time series data at a massive scale for real-time analytics.

00:01:55.940 --> 00:01:59.500
Try them for free at talkpython.fm/InfluxDB.

00:01:59.500 --> 00:02:04.700
A quick announcement before we jump into the conversation around PySport.

00:02:04.700 --> 00:02:10.540
We have over 240 hours of Python course video content at Talk Python.

00:02:10.540 --> 00:02:17.700
And if you're watching that content on a mobile platform, phone, or tablet, the browser is definitely not the best experience.

00:02:17.700 --> 00:02:23.620
For example, on iOS, it won't even auto-advance or start playing without your interaction.

00:02:23.960 --> 00:02:32.000
We've had some mobile apps for our courses for a while now, but they have fallen a bit into disrepair for a couple of reasons, including App Store tyranny.

00:02:32.000 --> 00:02:39.500
But over the past four months, we've completely reimagined and rewritten our mobile apps in a modern and beautiful platform.

00:02:39.500 --> 00:02:47.680
And I'm super happy to announce that they are now available for Android and iOS on both phones and tablets in their respective app stores.

00:02:47.800 --> 00:02:56.740
So please visit talkpython.fm/apps to see for yourself how beautiful and clean the new apps are and why I'm so excited about them.

00:02:57.180 --> 00:03:04.420
Download them for free and even take a couple of our free courses included in there, as well as the paid ones that you might have gotten, of course.

00:03:04.420 --> 00:03:14.280
Finally, if you're curious for a bit of a look behind the curtain about how and why we rewrote them, check out my personal site, mkennedy.codes, for the full story.

00:03:14.280 --> 00:03:16.420
Thank you all for supporting our work.

00:03:16.420 --> 00:03:18.240
Now, on to the show.

00:03:18.240 --> 00:03:21.180
Koen, welcome to Talk Python To Me.

00:03:21.180 --> 00:03:22.040
Yeah, thanks.

00:03:22.180 --> 00:03:22.920
Thanks for having me.

00:03:22.920 --> 00:03:25.380
Really cool to talk about this topic.

00:03:25.380 --> 00:03:27.320
Yeah, I'm real excited to have you here.

00:03:27.320 --> 00:03:30.840
You have quite a collection of libraries.

00:03:30.840 --> 00:03:33.860
Now, up front, these are not all your libraries, right?

00:03:33.860 --> 00:03:42.380
These are sort of kind of an awesome list of Python and even beyond Python, sports libraries, data sets, APIs, models, everything, right?

00:03:42.520 --> 00:03:47.040
I think it's for people, it was quite hard to find open source packages.

00:03:47.040 --> 00:03:54.600
And I tried to collect everything I could find and make it available for everyone to find what they need.

00:03:54.600 --> 00:03:55.860
It sounds like a great mission.

00:03:55.860 --> 00:04:01.660
And as people will see, there is a bunch of stuff that we'll get you to talk about.

00:04:01.820 --> 00:04:09.100
So I noticed not absolutely every sport is covered there, but many of the popular sports are covered.

00:04:09.100 --> 00:04:16.660
And if you're interested in sports, I think also if you're interested in just examples that connect with people, right?

00:04:16.660 --> 00:04:24.120
Imagine you're a university professor and you don't want to use the New York City taxi data one more time.

00:04:24.120 --> 00:04:31.280
You want to say, well, maybe people are into soccer, American football or NBA, whatever it is, right?

00:04:31.280 --> 00:04:33.140
Maybe you could come up with something more interesting.

00:04:33.140 --> 00:04:35.160
F1, for example, right?

00:04:35.160 --> 00:04:35.920
Yeah, definitely.

00:04:35.920 --> 00:04:40.180
There's quite some cool data available to use in your courses.

00:04:40.180 --> 00:04:41.180
Yeah, absolutely.

00:04:41.180 --> 00:04:41.640
Yeah.

00:04:41.640 --> 00:04:53.000
Also, if people are members of some kind of club or team, maybe they could use some of this to bring some cool visualizations or analysis to their own organization, right?

00:04:53.000 --> 00:05:07.060
Yeah, that's also one of the things that PySport likes to encourage to use open source packages that are already available instead of building your own stuff, because that actually happens a lot.

00:05:07.060 --> 00:05:15.840
So that's also a part of the mission of PySport to make people aware of what's already there and try to bring people together.

00:05:15.840 --> 00:05:20.740
One of the big problems, not problems, it's an opportunity, but it's also a challenge of Python.

00:05:20.740 --> 00:05:25.440
If you go to PyPI.org right now, there's 453,000 packages.

00:05:25.440 --> 00:05:27.760
I didn't know the number, but that's quite a lot.

00:05:27.760 --> 00:05:29.960
We're coming up on half a million.

00:05:29.960 --> 00:05:39.740
And if your goal is to work with some specific data set or try to solve a certain type of problem, often the hardest part is figuring out, well, what library do I use?

00:05:39.740 --> 00:05:40.540
Does it exist?

00:05:40.540 --> 00:05:43.900
And if so, is it up to date and all of these things?

00:05:43.960 --> 00:05:48.860
So having a list like this, a place that aggregates it and sorts it, filters it, super neat.

00:05:49.160 --> 00:05:51.480
So really looking forward to talking to you about it.

00:05:51.480 --> 00:05:54.640
Before we get to that, though, just give us a quick bit on your backstory.

00:05:54.640 --> 00:05:56.000
How'd you get into programming in Python?

00:05:56.000 --> 00:05:57.980
Yeah, this is an interesting story.

00:05:57.980 --> 00:06:04.020
I think I started programming when I was, I think, around 12, when I got Lego Mindstorms.

00:06:04.020 --> 00:06:05.740
Lego that you could also program.

00:06:06.120 --> 00:06:09.220
My father gave me a Visual Basic book.

00:06:09.220 --> 00:06:11.120
Yeah, I should just figure it out.

00:06:11.120 --> 00:06:13.440
So that's where I started with programming.

00:06:13.440 --> 00:06:19.440
And then during high school, I also did web development with PHP.

00:06:19.440 --> 00:06:26.120
I'm not really sure at what age, but eventually I ended up with, I think, the first Dutch search engine.

00:06:26.120 --> 00:06:29.420
They wanted, yeah, they needed a Python developer.

00:06:29.420 --> 00:06:32.700
And I didn't really know Python, but that wasn't an issue.

00:06:33.160 --> 00:06:35.140
So there I learned Python.

00:06:35.140 --> 00:06:40.420
And from that point on, I, yeah, only, or mostly used Python.

00:06:40.420 --> 00:06:41.000
Right on.

00:06:41.000 --> 00:06:42.620
You're like, all right, forget this PHP stuff.

00:06:42.620 --> 00:06:43.720
I'm going Python.

00:06:43.720 --> 00:06:48.720
Yeah, well, to be honest, we, my company, we still use PHP.

00:06:48.720 --> 00:06:49.740
Yeah, I'm sure.

00:06:49.740 --> 00:06:52.200
Because, well, it works quite well.

00:06:52.200 --> 00:06:53.740
The performance is okay.

00:06:53.740 --> 00:06:58.080
I'm not really sure if I'm allowed to say it on this show, but it also has some advantages.

00:06:58.080 --> 00:06:59.380
Yeah, absolutely.

00:06:59.520 --> 00:07:05.340
Well, I mean, almost every language has some, something it's particularly good at and reasons to keep using it.

00:07:05.340 --> 00:07:09.640
And then also there's just tons of software that was written, you know, pick your language, written in that language.

00:07:09.640 --> 00:07:10.840
And it still works well.

00:07:10.840 --> 00:07:13.360
And there's plenty of reasons to just keep going with it, right?

00:07:13.360 --> 00:07:13.660
Yeah.

00:07:13.660 --> 00:07:23.820
But, yeah, Python is really my language of most interest and what I really use on my day-to-day work.

00:07:23.820 --> 00:07:24.440
Excellent.

00:07:24.880 --> 00:07:25.060
Yeah.

00:07:25.060 --> 00:07:25.780
Yeah, cool.

00:07:25.780 --> 00:07:26.900
What are you doing these days?

00:07:26.900 --> 00:07:28.360
Are you still working at the search engine?

00:07:28.360 --> 00:07:30.260
No, that was quite a while ago.

00:07:30.260 --> 00:07:38.220
I also worked at a huge online marketing agency where we did run the software department.

00:07:38.220 --> 00:07:45.960
We created tools to collect all the data from all kinds of different sources and make it available for the teams.

00:07:45.960 --> 00:07:48.920
Right now I'm running my own company.

00:07:48.920 --> 00:08:02.080
It's called TeamTV, where we provide all kinds of tools where we use video and data, for example for performance analysts, but also for highlight creation or live streaming.

00:08:02.080 --> 00:08:10.540
Just to make sure that we try, at least we try to combine video and data in all possible ways within the sports domain.

00:08:10.540 --> 00:08:14.680
Yeah, that sounds like a really interesting thing to be working on.

00:08:14.680 --> 00:08:17.960
Yeah, I started mostly on the video engineering part.

00:08:18.260 --> 00:08:20.960
So we built quite some stuff ourselves there.

00:08:20.960 --> 00:08:29.360
So from people uploading huge amounts of footage that we need to transcode and how to scale it, how to serve it.

00:08:29.360 --> 00:08:31.780
So that's stuff we build ourselves.

00:08:31.780 --> 00:08:39.880
Yeah, later on, we keep on building more stuff around data and always keep the combination between data and the video.

00:08:40.340 --> 00:08:49.360
Because, well, you can see some sort of metric, but you always want to see the footage behind it to actually understand the context of it.

00:08:49.360 --> 00:08:49.860
Yeah, sure.

00:08:49.860 --> 00:08:50.940
Sounds really fun.

00:08:50.940 --> 00:08:53.600
And you're also involved with PyData Eindhoven, is that right?

00:08:53.600 --> 00:08:54.200
Yeah, yeah.

00:08:54.200 --> 00:08:55.980
Give them a shout out.

00:08:55.980 --> 00:08:56.880
Yeah, yeah.

00:08:56.880 --> 00:08:59.980
Five years ago, we started with PyData Eindhoven.

00:08:59.980 --> 00:09:02.760
We were already friends with PyData Amsterdam.

00:09:02.760 --> 00:09:06.100
They said, well, maybe you should also start an Eindhoven chapter.

00:09:06.540 --> 00:09:11.680
I think this year will be the anniversary, the five-year anniversary for PyData Eindhoven.

00:09:11.680 --> 00:09:14.180
And it's, yeah, it's an amazing community.

00:09:14.180 --> 00:09:18.960
And that's, yeah, also inspired me to start with PySport.

00:09:18.960 --> 00:09:24.560
I'm not really sure if all the people that are listening to this podcast know PyData.

00:09:24.560 --> 00:09:26.060
Maybe, I think.

00:09:26.060 --> 00:09:26.800
I hope so.

00:09:26.800 --> 00:09:29.300
Yeah, I suspect most of them probably do.

00:09:29.300 --> 00:09:32.420
At least the data science inclined among us.

00:09:32.420 --> 00:09:34.280
Yeah, I can tell a little bit about it.

00:09:34.520 --> 00:09:40.980
Right now, we have a nice way of organizing the meetups and trying to get more people involved

00:09:40.980 --> 00:09:43.000
and talk about data science and share knowledge.

00:09:43.000 --> 00:09:49.620
And then once a year, we have the conference where we try to get, yeah, collect money that

00:09:49.620 --> 00:09:54.740
we can send to NumFOCUS and they can share it over all the open source projects.

00:09:54.740 --> 00:09:55.540
So, yeah.

00:09:55.540 --> 00:09:57.800
It's a really amazing community.

00:09:58.240 --> 00:09:59.160
That's excellent.

00:09:59.160 --> 00:10:05.760
Yeah, NumFOCUS does a lot to support the bigger data science oriented projects.

00:10:05.760 --> 00:10:06.280
Yeah.

00:10:06.280 --> 00:10:09.760
I think that's kind of unique amongst the, in the Python space.

00:10:09.760 --> 00:10:14.220
You know, there's not really anything like that in the web or UI.

00:10:14.220 --> 00:10:22.240
You know, there's not a lot of areas where there's like an organization that says, okay, we're going to try to find the popular projects and support them across organizations.

00:10:22.240 --> 00:10:22.640
Right.

00:10:22.640 --> 00:10:28.500
Like people support Flask, but they don't also support Django in the same sort of organization.

00:10:28.500 --> 00:10:28.820
Right.

00:10:28.900 --> 00:10:36.020
I think it's also an opportunity for all the companies that are using those open source packages to give back.

00:10:36.020 --> 00:10:49.280
And I think doing it through NumFOCUS, it makes it also easier because they use a lot of packages and can just donate to NumFOCUS and they will make sure it's distributed over those packages.

00:10:49.280 --> 00:10:49.620
Right.

00:10:49.620 --> 00:10:50.040
Absolutely.

00:10:50.040 --> 00:10:53.980
If you use Pandas, you should also support NumPy, right?

00:10:53.980 --> 00:10:56.720
Because that's kind of the foundation and so on.

00:10:56.720 --> 00:10:57.460
Yeah, yeah, yeah.

00:10:57.860 --> 00:10:58.260
Interesting.

00:10:58.260 --> 00:10:59.460
Oh, that makes a lot of sense.

00:10:59.460 --> 00:10:59.740
All right.

00:10:59.740 --> 00:11:06.020
Well, let's jump into sports and your project, PySport.

00:11:06.020 --> 00:11:10.120
Are there other people who are maintainers and working on this or is this just your project?

00:11:10.120 --> 00:11:17.100
At the meetup that we had just a couple of weeks ago, we got some more people together.

00:11:17.100 --> 00:11:23.140
And now we are building from there on to get more people involved with just PySport.

00:11:23.140 --> 00:11:27.700
But one of the projects we built with PySport is the kloppy package.

00:11:27.700 --> 00:11:31.260
And there we have worked together with Jan van Haaren.

00:11:31.260 --> 00:11:37.520
He's a head of data science at Club Brugge, a big club in Belgium.

00:11:37.520 --> 00:11:44.760
We are the main maintainers there, but I think right now we have 22 contributors to the package.

00:11:44.760 --> 00:11:47.660
So there's quite some people contributing there.

00:11:47.660 --> 00:11:48.060
Yeah.

00:11:48.180 --> 00:11:49.000
Yeah, that's a big group.

00:11:49.000 --> 00:11:50.120
That's a lot of people contributing.

00:11:50.120 --> 00:11:50.420
Yeah.

00:11:50.420 --> 00:11:51.340
Let's start it this way.

00:11:51.340 --> 00:11:53.680
Tell people what PySport is and about that.

00:11:53.680 --> 00:11:57.900
And then we can talk a little broadly just about sports analytics before we get into the details.

00:11:57.900 --> 00:12:10.520
The most important mission of PySport is to bridge the gap between the clubs, the sports analytics community, and just people, by using open source packages.

00:12:10.520 --> 00:12:14.100
Because a lot of clubs are using open source packages.

00:12:14.100 --> 00:12:20.960
Open source packages are used by the clubs and people want to have a way to contribute to their favorite club.

00:12:21.280 --> 00:12:24.840
I think a lot of people are still struggling on how to do it.

00:12:24.840 --> 00:12:31.000
And with PySport, yeah, we want to share the knowledge and teach people on how to do it.

00:12:31.000 --> 00:12:46.260
So we try to get the experts from the clubs, but also getting the knowledge from, you know, like pandas or other big packages and see how we can get all that knowledge into the sports analytics community.

00:12:46.760 --> 00:12:59.200
With kloppy, we try to set an example on how to build such a package, how to work together on such a thing and also encourage people to contribute.

00:12:59.200 --> 00:12:59.760
Yeah.

00:12:59.760 --> 00:13:15.960
Show that you don't have to create a pull request that's a major refactor, but can also do minor things like fixing typos and documentation, and show people that that's also very valuable to a package.

00:13:15.960 --> 00:13:16.380
Interesting.

00:13:16.380 --> 00:13:21.980
So kloppy is standardized soccer tracking and event data, right?

00:13:21.980 --> 00:13:30.620
So you started out with soccer or as I guess a lot of the world might refer to it as football, but in the U.S., that's already taken.

00:13:30.620 --> 00:13:31.860
It's a namespace collision.

00:13:31.860 --> 00:13:32.580
Yeah.

00:13:32.580 --> 00:13:34.120
Yeah.

00:13:34.120 --> 00:13:43.560
It's sometimes, yeah, it's difficult to talk about football, but here in Europe, we call it football, but for the package, because it's international.

00:13:43.560 --> 00:13:46.420
It's also worldwide, so I call it soccer.

00:13:46.420 --> 00:13:47.160
Yeah.

00:13:47.160 --> 00:13:47.480
Yeah.

00:13:47.480 --> 00:13:48.640
Namespace.

00:13:48.640 --> 00:13:49.320
Namespace.

00:13:49.320 --> 00:13:57.860
So give us a quick bit of background on kloppy, but since it's kind of one of the founding, you created this as a way to sort of set an example, right?

00:13:57.860 --> 00:14:02.940
For how to create a package and it helps people understand this event, this club data.

00:14:03.160 --> 00:14:09.200
Where that started is on Twitter, there were already quite some people talking about sports analytics, of course.

00:14:09.200 --> 00:14:14.000
And one guy, Joe Mulberry, he's working at a Danish top club.

00:14:14.000 --> 00:14:20.680
He asked for help because he created a notebook and he wanted to build a Flask API on top of it.

00:14:20.680 --> 00:14:22.160
And I said, well, I know Python.

00:14:22.160 --> 00:14:28.280
I don't know really much about soccer or about data, but yeah, I would like to be involved.

00:14:28.620 --> 00:14:29.660
I would like to help you.

00:14:29.660 --> 00:14:41.340
And when I received a notebook, I noticed that like 80% of the code was about reading and standardizing the data to a format that he could work with.

00:14:41.340 --> 00:14:49.460
And when we talked about it, it seemed like most, at least a lot of people are struggling with that issue and doing the same thing over and over again.

00:14:49.460 --> 00:14:54.480
Because in more notebooks that I saw, people were doing the same thing, but in different ways.

00:14:54.480 --> 00:15:00.280
And some were incorrect implementations or inefficient implementations.

00:15:00.280 --> 00:15:11.780
So I thought, well, one thing I know is how to read data and how to get it into a standardized format, because that was also one of the things I did at an online marketing company.

00:15:11.780 --> 00:15:12.780
Yeah.

00:15:12.780 --> 00:15:18.960
Like, I don't know much about your data format, but I know about processing data and normalizing it and all that, right?

00:15:18.960 --> 00:15:19.200
Yeah.

00:15:19.200 --> 00:15:25.220
I built a package, starting with just tracking data, but also tried to explain what the next steps could be.

00:15:25.220 --> 00:15:27.960
And then people said, well, this is really useful.

00:15:27.960 --> 00:15:37.480
And from that part, I kept on adding deserializers for different kinds of data for the tracking data and also for the event data.

00:15:37.480 --> 00:15:37.980
Yeah.

00:15:37.980 --> 00:15:42.660
I tried to get knowledge from non-sport bigger projects.

00:15:42.660 --> 00:15:48.260
So I also got Will Kunnen from the Texel package.

00:15:48.460 --> 00:15:56.340
He also did several reviews on this package and gave feedback to try to get the package on a higher level.

00:15:56.340 --> 00:16:01.600
So people within sports analytics community could also gain more knowledge from there.

00:16:01.600 --> 00:16:06.220
But maybe also a small background on the data.

00:16:06.220 --> 00:16:10.580
So the tracking data, that's like positioning data for all players on the pitch.

00:16:11.400 --> 00:16:15.580
I think it's most of the time 25 frames per second.

00:16:15.580 --> 00:16:18.320
So you know the location for each player and the ball.

00:16:18.320 --> 00:16:20.300
And on the other side, you have the event data.

00:16:20.300 --> 00:16:23.320
So there are all passes and shots and things like that.

00:16:23.320 --> 00:16:29.160
At this time from this position, there was a shot on the goal or there was a pass or there was a takeaway or penalty.

00:16:29.160 --> 00:16:29.760
Yeah.

00:16:29.860 --> 00:16:30.040
Yeah.

00:16:30.040 --> 00:16:30.520
Yeah.

00:16:30.520 --> 00:16:31.720
That's event data.

00:16:31.720 --> 00:16:32.300
Yeah.

00:16:32.300 --> 00:16:34.780
And all the vendors choose different formats.

00:16:34.780 --> 00:16:35.680
Yeah.

00:16:35.680 --> 00:16:36.360
Yeah.

00:16:36.360 --> 00:16:37.140
Oh, geez.

00:16:37.140 --> 00:16:38.960
That sounds hard.

00:16:39.740 --> 00:16:44.420
So first of all, 25 hertz of all the people's location.

00:16:44.420 --> 00:16:52.640
This is beyond somebody with just a pen and paper and notebook writing down, oh, at this time there was a shot on the goal by number 25.

00:16:53.040 --> 00:16:54.340
How do they get that data?

00:16:54.340 --> 00:16:55.100
That's crazy.

00:16:55.100 --> 00:16:55.520
Yeah.

00:16:55.520 --> 00:16:55.760
Yeah.

00:16:55.760 --> 00:16:58.400
Those are quite advanced systems that they use.

00:16:58.400 --> 00:17:02.520
So in the stadium, I think they have like 20 cameras around the pitch.

00:17:02.520 --> 00:17:07.880
They use computer vision to detect all the players and combine it.

00:17:07.880 --> 00:17:14.280
But I believe, and I'm not really sure if there are already vendors on the market that do it totally automated.

00:17:14.740 --> 00:17:27.020
But I think for the systems that are currently used in soccer, there are still some people needed for difficult situations like a corner kick, where there are a lot of people in a small area and a lot of occlusions happen.

00:17:27.020 --> 00:17:27.540
You can't see the numbers, yeah.

00:17:27.540 --> 00:17:28.820
Yeah, they can't see the numbers.

00:17:28.820 --> 00:17:35.880
So just after a corner, some manual operator has to reassign some players or correct something.

00:17:35.880 --> 00:17:38.120
But it's quite an advanced system already.

00:17:38.120 --> 00:17:39.460
It sounds incredibly advanced.

00:17:39.560 --> 00:17:48.040
It sounds like an awesome data set to work with because with that much data, you really can make a lot of interesting predictions and trends.

00:17:48.040 --> 00:17:56.360
I mean, at some point, maybe we'll just put some sort of tracking RFID thing on the back of the player's heads, just stitch it on there.

00:17:56.360 --> 00:17:58.780
And then you can fully automate it, you know?

00:17:58.780 --> 00:18:02.580
Yeah, I think a soccer day ticket.

00:18:02.580 --> 00:18:03.420
Yeah, maybe.

00:18:03.420 --> 00:18:03.760
Yeah.

00:18:03.760 --> 00:18:05.720
Not sure if all players would accept it.

00:18:05.720 --> 00:18:09.740
But for example, in ice hockey, yeah, you can put it on the helmets.

00:18:09.740 --> 00:18:11.420
Yeah, you can put it on the helmets, sure.

00:18:11.420 --> 00:18:11.820
For football.

00:18:11.820 --> 00:18:13.100
Yeah.

00:18:13.100 --> 00:18:13.160
Yeah.

00:18:13.160 --> 00:18:28.000
Things like automobile racing, you know, they have, not all of them, but for example, F1 has incredibly high frequency of, like, points that measure where is this car, how fast is it going, the cars are sending out real-time telemetry.

00:18:28.000 --> 00:18:32.780
There's certainly many sports that have quite high fidelity in their data.

00:18:32.920 --> 00:18:44.700
I must admit, I haven't seen the data from F1 yet, but it would be really interesting to learn from them how to work with the data and see if it can be applied to football or soccer or other sports.

00:18:46.700 --> 00:18:53.040
This portion of Talk Python To Me is brought to you by JetBrains, who encourage you to get work done with PyCharm.

00:18:53.040 --> 00:19:00.220
PyCharm Professional is the complete IDE that supports all major Python workflows, including full-stack development.

00:19:00.220 --> 00:19:06.780
That's front-end JavaScript, Python backend, and data support, as well as data science workflows with Jupyter.

00:19:06.780 --> 00:19:09.400
PyCharm just works out of the box.

00:19:09.400 --> 00:19:16.200
Some editors provide their functionality through piecemeal add-ins that you put together from a variety of sources.

00:19:16.200 --> 00:19:18.940
PyCharm is ready to go from minute one.

00:19:18.940 --> 00:19:21.660
And PyCharm thrives on complexity.

00:19:21.660 --> 00:19:32.400
The biggest selling point for me personally is that PyCharm understands the code structure of my entire project, even across languages such as Python and SQL and HTML.

00:19:32.640 --> 00:19:42.240
If you see your editor completing statements just because the word appears elsewhere in the file, but it's not actually relevant to that code block, that should make you really nervous.

00:19:42.240 --> 00:19:45.920
I've been a happy paying customer of PyCharm for years.

00:19:45.920 --> 00:19:52.160
Hardly a workday passes that I'm not deep inside PyCharm working on projects here at Talk Python.

00:19:52.600 --> 00:19:56.480
What tool is more important to your productivity than your code editor?

00:19:56.480 --> 00:19:58.760
You deserve one that works the best.

00:19:58.760 --> 00:20:07.000
So download your free trial of PyCharm Professional today at talkpython.fm/done-with-pycharm and get work done.

00:20:07.000 --> 00:20:09.180
That link is in your podcast player show notes.

00:20:09.180 --> 00:20:15.440
Thank you to PyCharm from JetBrains for sponsoring the show and keeping Talk Python going strong.

00:20:15.440 --> 00:20:18.680
I bet it's a lot, actually.

00:20:18.680 --> 00:20:25.000
I bet it is, you know, just in terms of actual quantity of data, you know, how fast they're sampling and how many cars for how long.

00:20:25.000 --> 00:20:26.060
It's probably a lot of data.

00:20:26.060 --> 00:20:30.320
That's also one of the interesting things about working with sports data.

00:20:30.320 --> 00:20:36.460
I think the data engineering part. And this package just focuses on reading the data.

00:20:36.460 --> 00:20:43.120
But then the next step, yeah, how to work with the data, especially if you would like to use the tracking data for a whole season.

00:20:43.120 --> 00:20:48.820
Yeah, that's quite some data that also vendors can start struggling a bit with.

00:20:48.820 --> 00:20:55.320
It just occurred to me, there's probably a whole nother demographic or aspect who would be interested in this kind of data.

00:20:55.320 --> 00:20:57.380
It would be like sports betting people.

00:20:57.380 --> 00:20:59.860
I mean, not that I have any interest in that at all.

00:20:59.860 --> 00:21:05.740
But if you were trying to figure out like, OK, if this team plays that team, if you can understand, OK, this is their star player,

00:21:05.780 --> 00:21:12.920
if we match up their moves against the other person's moves, it turns out there's a weakness in this way for their defense or who knows.

00:21:12.920 --> 00:21:14.840
Right. I mean, with that much data,

00:21:14.840 --> 00:21:16.660
there's probably some interesting stuff you can do.

00:21:16.660 --> 00:21:25.040
I think that a lot of vendors of the data also have the, yeah, the betting industry as well as their clients.

00:21:25.040 --> 00:21:25.880
Because, yeah.

00:21:25.880 --> 00:21:28.340
I don't really care to work for them or support them.

00:21:30.000 --> 00:21:31.780
It's a little bit shady, I suppose.

00:21:31.780 --> 00:21:39.000
But it does seem like you could, it's almost like really detailed information about companies for the stock market.

00:21:39.000 --> 00:21:42.680
This is kind of like a little bit like that for the sports betting in some ways, I suppose.

00:21:42.680 --> 00:21:43.280
Yeah.

00:21:43.280 --> 00:21:43.680
Yeah.

00:21:43.680 --> 00:21:44.120
Yeah.

00:21:44.120 --> 00:21:44.400
Yeah.

00:21:44.400 --> 00:21:44.900
Interesting.

00:21:44.900 --> 00:21:51.200
I think one of the challenges here is probably a lot of this data is not easily offered up.

00:21:51.200 --> 00:21:58.220
There's probably not a lot of JSON APIs with low latency that are super easy to access or some there must be, but not.

00:21:58.220 --> 00:22:09.120
There's probably a lot of data out there that is not overly welcome to either be given out or it's given out over in batch over slow periods or something like that.

00:22:09.120 --> 00:22:09.320
Right.

00:22:09.320 --> 00:22:11.720
Maybe speak to a little bit about the data availability.

00:22:11.720 --> 00:22:13.640
Yeah, that's quite an issue.

00:22:13.640 --> 00:22:21.060
And I know mostly about the soccer data, but I can imagine that the same applies to most of the other sports.

00:22:21.060 --> 00:22:36.360
And I think data availability is a major issue, at least if you want to encourage the community to work with it and do research on it and get people build more cool stuff without being within a club.

00:22:36.360 --> 00:22:41.700
There are some companies that already provide quite a big set of open event data.

00:22:41.700 --> 00:22:43.160
StatsBomb is one of them.

00:22:43.160 --> 00:22:48.400
I think they provide around 1,500 data sets for event data.

00:22:48.760 --> 00:22:58.480
But if you're looking at the tracking data, maybe there are like 10, maybe 15 sets available because all those vendors have deals with the leagues.

00:22:58.480 --> 00:23:00.160
They are not allowed to share it.

00:23:00.160 --> 00:23:03.060
So you have to know someone within a club.

00:23:03.060 --> 00:23:07.260
Or use Beautiful Soup or Scrapy or something like that, right?

00:23:07.260 --> 00:23:08.680
That's the other option.

00:23:08.840 --> 00:23:15.700
But then it's still very hard to get the tracking data because I'm not sure if you can actually scrape it.

00:23:15.700 --> 00:23:23.320
But that's one of the things that I noticed when working on the open source of PySport's website.

00:23:23.320 --> 00:23:25.560
There are really a lot of scrapers.

00:23:25.560 --> 00:23:30.640
And I think that's an indication that there's an issue with data availability.

00:23:32.100 --> 00:23:32.560
Yeah, it's not.

00:23:32.560 --> 00:23:34.980
This plugs into the API, but this is a scraper.

00:23:34.980 --> 00:23:35.640
Yeah.

00:23:35.640 --> 00:23:39.600
I guess it's worth pointing out or throwing out a bit of word of caution.

00:23:39.600 --> 00:23:45.780
Just because the website is publicly available and you can hit it with some kind of scraping tool.

00:23:45.780 --> 00:23:48.860
That doesn't mean you legally can do stuff with the data.

00:23:48.860 --> 00:23:51.280
You probably want to be pretty careful about that, right?

00:23:51.280 --> 00:23:59.300
Yeah, because I think even when it's not explicitly mentioned, most of the time it's not allowed to scrape the data at all.

00:23:59.300 --> 00:24:05.240
But also in soccer, there are quite a few websites that explicitly forbid it.
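
NOTE
Editor's sketch, not from the episode: one technical signal to check before scraping is a site's robots.txt.
This is only a hint about what the site operator wants; it is not legal permission, and terms of service still apply.
The rules, the bot name, and the example.com URLs below are hypothetical, chosen just to show the standard-library call.
```python
from urllib.robotparser import RobotFileParser
# Hypothetical robots.txt rules; a real check would download the site's own file.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /stats/
"""
def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
# The /stats/ path is disallowed for every agent; other paths are not.
print(may_fetch(EXAMPLE_ROBOTS_TXT, "my-research-bot", "https://example.com/stats/players"))
print(may_fetch(EXAMPLE_ROBOTS_TXT, "my-research-bot", "https://example.com/about"))
```
A scraper could call a helper like this before each request and skip anything the rules disallow.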

00:24:05.240 --> 00:24:08.500
And yeah, so the packages are there.

00:24:08.500 --> 00:24:13.600
And it's also a bit, I was thinking about should I include them or should I not include them?

00:24:13.600 --> 00:24:20.600
Because they kind of encourage non-legal actions, but yeah, not really sure about it again.

00:24:20.600 --> 00:24:21.080
Yeah, sure.

00:24:21.080 --> 00:24:23.780
I can see the case for both sides of that.

00:24:23.780 --> 00:24:28.540
But I just want to let people know, just be careful with what you do with the data.

00:24:28.540 --> 00:24:33.100
It's one thing if it's an academic research project and it's just for my own interest or whatever.

00:24:33.100 --> 00:24:38.780
Yeah, if you start scraping that entire website and trying to make money out of it, you should not do it.

00:24:38.780 --> 00:24:41.640
Or find a way to do it legitimately, right?

00:24:41.640 --> 00:24:43.740
But just don't sneak through.

00:24:43.740 --> 00:24:44.000
Yeah.

00:24:44.080 --> 00:24:44.420
All right.

00:24:44.420 --> 00:24:52.000
Well, I think it might be fun to, let's talk through some of the packages you have here.

00:24:52.000 --> 00:24:58.000
So if you go to PySport.org and there's a nav bar and on the left it says open source.

00:24:58.000 --> 00:25:05.220
And if people click that, then they end up with a whole bunch of, you know, I'll open it just this way for a moment.

00:25:05.220 --> 00:25:06.640
We can look at it and talk about it.

00:25:06.640 --> 00:25:10.900
So if you just click on it, it actually, there's a delay as it downloads.

00:25:10.900 --> 00:25:16.320
Yeah, there's still something I need to fix because, yeah, it's quite some packages.

00:25:16.320 --> 00:25:17.980
I mean, this is not a complaint.

00:25:17.980 --> 00:25:21.620
It's just, I don't know how many pages that is, but that's a really small scroll bar.

00:25:21.860 --> 00:25:28.320
What I noticed that's pretty cool is you can go in and there's a filter that you all have and you can filter by your language.

00:25:28.320 --> 00:25:31.060
Right now you have Haskell, Python and R and others.

00:25:31.060 --> 00:25:36.040
And then you can pick by sports and then you can pick by type of thing, right?

00:25:36.040 --> 00:25:40.060
So I filtered our discussion down to Python libraries just because, you know.

00:25:40.060 --> 00:25:41.300
We have a single title.

00:25:41.300 --> 00:25:41.880
Yeah.

00:25:41.880 --> 00:25:46.460
And you could also pick amongst the different types of tools.

00:25:46.460 --> 00:25:51.000
So we talked about the scrapers and probably to a lesser degree, the APIs, right?

00:25:51.000 --> 00:25:52.860
The API clients, which is cool.

00:25:52.860 --> 00:25:53.840
There are some in there.

00:25:53.840 --> 00:25:59.400
They say, here's the API and we just built a strongly typed package rather than just doing straight REST, which is great.

00:25:59.400 --> 00:26:03.800
But you also have models and calculators like for predicting things.

00:26:03.800 --> 00:26:08.540
And then IO for file formats, visualization, open data and databases.

00:26:08.540 --> 00:26:09.140
Right.

00:26:09.140 --> 00:26:22.600
So I encourage people to rather than try to read the whole list, which is hundreds and hundreds of packages to, you know, filter down maybe to the sport you're interested in or a couple of sports or the type of tooling you're interested in.
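
NOTE
Editor's sketch, not from the episode: the language, sport, and type filters described above can be pictured as a simple filter over package records.
The Package fields and every entry in PACKAGES are invented for illustration; this is not how the PySport site is implemented.
```python
from dataclasses import dataclass
@dataclass(frozen=True)
class Package:
    """One entry in a package list (fields are illustrative)."""
    name: str
    language: str
    sport: str
    category: str
# Hypothetical entries; names and labels are invented for this sketch.
PACKAGES = [
    Package("example-nba-scraper", "Python", "basketball", "scraper"),
    Package("example-xg-model", "Python", "soccer", "model"),
    Package("example-soccer-plots", "R", "soccer", "visualization"),
    Package("example-f1-client", "Python", "motorsport", "api-client"),
]
def filter_packages(packages, language=None, sport=None, category=None):
    """Keep the packages that match every filter that was given."""
    return [
        p for p in packages
        if (language is None or p.language == language)
        and (sport is None or p.sport == sport)
        and (category is None or p.category == category)
    ]
python_soccer = filter_packages(PACKAGES, language="Python", sport="soccer")
print([p.name for p in python_soccer])
```
Leaving a filter as None means "do not filter on this field", which mirrors picking nothing in that dropdown.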

00:26:22.600 --> 00:26:22.740
Yeah.

00:26:22.740 --> 00:26:23.080
Yeah.

00:26:23.080 --> 00:26:37.480
I think filtering is a must, but maybe if you have plenty of time, you could just scroll and see what's interesting because it's still, I think, a very interesting list to see just what's available and get inspiration.

00:26:37.480 --> 00:26:37.960
Yeah.

00:26:37.960 --> 00:26:38.620
It's quite a list.

00:26:38.780 --> 00:26:39.140
Yeah.

00:26:39.140 --> 00:26:39.220
Yeah.

00:26:39.220 --> 00:26:40.660
So what's the sort here?

00:26:40.660 --> 00:26:43.820
If I come here, how do I, how does this get sorted?

00:26:43.820 --> 00:26:46.000
Like, is there any meaning to the order they appear?

00:26:46.000 --> 00:26:47.660
Is it just when they were entered or?

00:26:47.660 --> 00:26:49.040
It's a good question.

00:26:49.040 --> 00:26:57.280
I also open sourced the data collection part of this website, but it's collected daily, at least to provide an update.

00:26:57.280 --> 00:27:01.160
And I think, I must say, I think there's an order.

00:27:01.160 --> 00:27:05.620
And when I added the packages, I think that's the order here.

00:27:05.740 --> 00:27:08.900
But to be honest, this can be pretty random.

00:27:08.900 --> 00:27:09.180
Excellent.

00:27:09.180 --> 00:27:09.620
All right.

00:27:09.620 --> 00:27:10.300
Excellent.

00:27:10.300 --> 00:27:10.740
All right.

00:27:10.740 --> 00:27:14.880
So here, I'll just sort of go through a couple of the scrapers here.

00:27:14.880 --> 00:27:17.360
And we can maybe dive into one or two potentially.

00:27:17.360 --> 00:27:19.220
So there's PyBall.

00:27:19.940 --> 00:27:22.060
We'll just go through just to give people a sense, right?

00:27:22.060 --> 00:27:23.620
Of the ones here, right?

00:27:23.620 --> 00:27:27.420
So there's PyBall, which is a Python API.

00:27:27.420 --> 00:27:28.160
Nice.

00:27:28.160 --> 00:27:34.660
Wrapper for stats.nba.com with a focus on NBA and WNBA applications.

00:27:34.660 --> 00:27:35.560
That's pretty cool.

00:27:35.980 --> 00:27:38.740
I don't know anything about stats.nba.com.

00:27:38.740 --> 00:27:42.600
But it looks like, yeah, this is a whole website with all sorts of data.

00:27:42.600 --> 00:27:44.340
It's got players, teams, leaders.

00:27:44.340 --> 00:27:45.940
Looks great, actually.

00:27:45.940 --> 00:27:49.500
I think quite some people are also using this package.

00:27:49.500 --> 00:27:54.400
I think it's the most used package when working with basketball data.

00:27:54.400 --> 00:27:58.960
And it's not that they use the API to get this data.

00:27:59.100 --> 00:28:00.860
Yeah, you get quite a bit of data here.

00:28:00.860 --> 00:28:06.060
You've got like the player, their team, their age, their total number of points scored.

00:28:06.060 --> 00:28:09.000
A lot of stuff you can do to sort of compare them.

00:28:09.000 --> 00:28:10.020
And yeah, that's great.

00:28:10.020 --> 00:28:13.720
So if you're into basketball, I think it's a great start.

00:28:13.720 --> 00:28:17.040
It's also quite actively maintained.

00:28:17.040 --> 00:28:22.160
That's also one of the things that I intentionally mentioned on the list.

00:28:22.160 --> 00:28:25.960
Because some packages are not really maintained well.

00:28:25.960 --> 00:28:28.340
I think it's a benefit.

00:28:28.340 --> 00:28:32.160
Yeah, one of the things in the list that you call out is the number of contributors,

00:28:32.160 --> 00:28:36.620
the latest version, when the last commit was to the package.

00:28:36.620 --> 00:28:37.400
That's pretty cool.

00:28:37.400 --> 00:28:40.840
In the beginning, I thought, well, maybe I can just manually update the list.

00:28:40.840 --> 00:28:47.580
But then I decided, I think data engineering is fun.

00:28:47.580 --> 00:28:52.420
Let's find a way to automatically fetch the data and update it.

00:28:52.420 --> 00:28:56.520
Also, the license is pretty important to show it here.

00:28:56.520 --> 00:29:01.900
And also the latest commit, to see how actively it's maintained, the latest version.

00:29:01.900 --> 00:29:07.900
And also the contributors.

00:29:07.900 --> 00:29:13.000
Sure, the difference between a package with one contributor and one with 30 contributors.

00:29:13.000 --> 00:29:13.920
That's a big difference.

00:29:13.920 --> 00:29:15.220
It's a really big difference.

00:29:15.220 --> 00:29:15.440
Yeah.

00:29:15.440 --> 00:29:21.300
I think it's also good for people to see if there's a package with just a single contributor

00:29:21.300 --> 00:29:26.120
that might give an opportunity to contribute to it or work together.

00:29:26.280 --> 00:29:31.080
So PySport would like to encourage people to get involved in those projects.

00:29:31.080 --> 00:29:31.980
Yeah, that's a good idea.

00:29:31.980 --> 00:29:34.440
So that could help out here.

00:29:34.440 --> 00:29:34.600
Yeah.

00:29:34.600 --> 00:29:37.680
And each one of these packages, you can go in and open the details here.

00:29:37.680 --> 00:29:39.880
And it gives you a little bit more information.

00:29:39.880 --> 00:29:44.660
Like, for example, it actually lists the contributors and links to their GitHub profiles and shows

00:29:44.660 --> 00:29:48.560
their website and the GitHub page and PyPI and so on.

00:29:48.560 --> 00:29:48.700
Yeah.

00:29:48.700 --> 00:29:48.940
Yeah.

00:29:48.940 --> 00:29:54.200
And also, you can click on one of the contributors and see what other packages they built.

00:29:54.200 --> 00:29:55.840
Oh, really?

00:29:55.840 --> 00:29:56.280
Okay.

00:29:56.280 --> 00:30:00.520
So, like, if I click on this one, yeah, they've done just this one.

00:30:00.520 --> 00:30:01.980
Well, and this one, just a single one.

00:30:01.980 --> 00:30:02.220
Yeah.

00:30:02.220 --> 00:30:04.120
Some of them, they might have worked on multiple.

00:30:04.260 --> 00:30:05.820
I know Dependabot's worked on a few.

00:30:05.820 --> 00:30:07.100
Yeah.

00:30:07.100 --> 00:30:08.800
That's a really nice contributor.

00:30:08.800 --> 00:30:09.360
Yeah.

00:30:09.360 --> 00:30:10.120
Yeah.

00:30:10.120 --> 00:30:10.260
Yeah.

00:30:10.260 --> 00:30:13.920
The absolutely prolific open source contributor.

00:30:13.920 --> 00:30:14.260
Yeah.

00:30:14.260 --> 00:30:15.340
Works on my project too.

00:30:15.340 --> 00:30:22.120
This portion of Talk Python To Me is brought to you by Influx Data, the makers of InfluxDB.

00:30:22.120 --> 00:30:29.400
InfluxDB is a database purpose built for handling time series data at a massive scale for real-time

00:30:29.400 --> 00:30:29.840
analytics.

00:30:30.600 --> 00:30:35.460
Developers can ingest, store, and analyze all types of time series data, metrics, events,

00:30:35.460 --> 00:30:37.280
and traces in a single platform.

00:30:37.280 --> 00:30:39.720
So, dear listener, let me ask you a question.

00:30:39.720 --> 00:30:44.800
How would boundless cardinality and lightning-fast SQL queries impact the way that you develop

00:30:44.800 --> 00:30:45.940
real-time applications?

00:30:45.940 --> 00:30:52.060
InfluxDB processes large time series data sets and provides low-latency SQL queries, making

00:30:52.060 --> 00:30:57.480
it the go-to choice for developers building real-time applications and seeking crucial insights.

00:30:58.080 --> 00:31:03.180
For developer efficiency, InfluxDB helps you create IoT, analytics, and cloud applications

00:31:03.180 --> 00:31:06.580
using timestamped data rapidly and at scale.

00:31:06.580 --> 00:31:11.940
It's designed to ingest billions of data points in real-time with unlimited cardinality.

00:31:11.940 --> 00:31:17.860
InfluxDB streamlines building once and deploying across various products and environments from

00:31:17.860 --> 00:31:20.300
the edge, on-premise, and to the cloud.

00:31:20.660 --> 00:31:24.340
Try it for free at talkpython.fm/influxDB.

00:31:24.340 --> 00:31:26.780
The link is in your podcast player show notes.

00:31:26.780 --> 00:31:29.940
Thanks to Influx Data for supporting the show.

00:31:32.560 --> 00:31:37.160
I didn't realize you could actually see all the projects that PySport knows about that that

00:31:37.160 --> 00:31:38.640
particular user works on.

00:31:38.640 --> 00:31:40.160
That's a cool aspect of it.

00:31:40.160 --> 00:31:44.540
I spent quite some time on fetching all the data and trying to combine it.

00:31:44.540 --> 00:31:49.060
Also, fetching data from PyPI and also doing something similar for the R packages.

00:31:49.060 --> 00:31:49.780
Yeah.

00:31:49.780 --> 00:31:53.680
And seeing how to get all the available data in one place.

00:31:53.680 --> 00:31:59.940
It also tries to fetch images or screenshots from the READMEs of the repositories.

00:31:59.940 --> 00:32:01.340
That works for some.

00:32:01.340 --> 00:32:01.980
Oh, yeah.

00:32:01.980 --> 00:32:02.500
That's nice.

00:32:02.500 --> 00:32:04.680
Screenshots are really going to be very helpful.

00:32:04.680 --> 00:32:09.060
Less important on the scrapers, more on the visualizers, probably.

00:32:09.060 --> 00:32:09.760
But still.

00:32:09.760 --> 00:32:10.600
Yeah, definitely.

00:32:10.600 --> 00:32:13.940
What is opensource.pysport.org written in?

00:32:13.940 --> 00:32:19.640
It's written in React using Next.js.

00:32:19.640 --> 00:32:25.760
So it was also quite an adventure for me because it's my first application, which might also explain

00:32:25.760 --> 00:32:32.420
why it's still a bit slow on loading because I didn't really dive into how to make it faster.

00:32:32.420 --> 00:32:33.520
It uses Tailwind.

00:32:33.520 --> 00:32:35.420
But in the backend, it's Python.

00:32:35.420 --> 00:32:37.340
It's using Luigi.

00:32:37.340 --> 00:32:38.020
Okay.

00:32:38.920 --> 00:32:45.180
That's, I still think it's a pretty interesting tool because it's really simple to set up

00:32:45.180 --> 00:32:48.600
like orchestration of some tasks.

00:32:48.600 --> 00:32:48.880
Right.

00:32:48.880 --> 00:32:51.860
Like the daily scraping, updating the packages and that kind of stuff.

00:32:51.860 --> 00:32:52.160
Yeah.

00:32:52.160 --> 00:32:58.480
And then there's a GitHub action that runs on a daily basis and then fetches all the data

00:32:58.480 --> 00:33:01.600
and updates and commits it in a different branch.

00:33:01.600 --> 00:33:06.020
And that one gets deployed to Vercel, I believe.

00:33:06.020 --> 00:33:06.400
Okay.

00:33:06.400 --> 00:33:06.840
Yeah.

00:33:06.840 --> 00:33:07.440
Very interesting.

00:33:07.600 --> 00:33:13.480
But if you are interested in the source, it's also open source.

00:33:13.480 --> 00:33:13.740
Okay.

00:33:13.740 --> 00:33:14.220
Great.

00:33:14.220 --> 00:33:16.260
So, PyBall for NBA.

00:33:16.260 --> 00:33:23.860
We have the hockey scraper, which is for scraping NHL play-by-play and shift data with six contributors.

00:33:23.860 --> 00:33:25.060
That's pretty interesting.

00:33:25.060 --> 00:33:31.740
What you'll see on the filter list for every sport, there's a package also for the NHL, for

00:33:31.740 --> 00:33:32.260
ice hockey.

00:33:32.260 --> 00:33:36.240
That's a little bit less maintained, I think.

00:33:36.280 --> 00:33:43.540
But I have to, I'm not really sure if it still works because with those scrapers, it can work

00:33:43.540 --> 00:33:44.840
today and not tomorrow.

00:33:44.840 --> 00:33:48.520
It doesn't even necessarily mean that they were intentionally blocked.

00:33:48.520 --> 00:33:51.460
It could just be, hey, we've redesigned our site.

00:33:51.460 --> 00:33:52.560
Doesn't it look awesome?

00:33:52.560 --> 00:33:55.500
You're like, oh, the CSS selectors no longer pull up the thing.

00:33:55.500 --> 00:33:57.220
So, yeah.

00:33:57.220 --> 00:33:59.460
So, that's also on the scraping part.

00:33:59.460 --> 00:34:05.500
If its last commit was, like, a while ago, it might be broken.

00:34:05.500 --> 00:34:06.460
Maybe, maybe not.
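
NOTE
Editor's sketch, not from the episode: the "last commit was a while ago" rule of thumb can be written as a small date check.
The one-year threshold and the example dates are assumptions for illustration; an old commit only suggests a scraper may be broken, it does not prove it.
```python
from datetime import date
def days_since_last_commit(last_commit: date, today: date) -> int:
    """Number of whole days between the last commit and today."""
    return (today - last_commit).days
def looks_stale(last_commit: date, today: date, max_age_days: int = 365) -> bool:
    """Flag a package whose last commit is older than max_age_days (an arbitrary cutoff)."""
    return days_since_last_commit(last_commit, today) > max_age_days
# Example dates are made up; "today" is the recording date of the episode.
recording_day = date(2023, 5, 11)
print(looks_stale(date(2021, 1, 15), recording_day))
print(looks_stale(date(2023, 5, 1), recording_day))
```
For scrapers in particular, a shorter cutoff may be sensible, since a site redesign can break them overnight.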

00:34:06.460 --> 00:34:06.980
Yeah, sure.

00:34:06.980 --> 00:34:07.520
All right.

00:34:07.520 --> 00:34:08.560
Let's see some more.

00:34:08.560 --> 00:34:12.920
I think the StatsBomb API is an official package.

00:34:12.920 --> 00:34:18.200
It's also cool that StatsBomb provides an open source package for accessing their data.

00:34:18.280 --> 00:34:18.400
Yeah.

00:34:18.400 --> 00:34:19.220
What is StatsBomb?

00:34:19.220 --> 00:34:22.400
I see that showing up in many places on these different packages.

00:34:22.400 --> 00:34:23.060
Yeah.

00:34:23.060 --> 00:34:29.600
StatsBomb is, I think, one of the leading providers of event data in football.

00:34:29.600 --> 00:34:33.020
And I think in both football, as in soccer, and in American football.

00:34:33.020 --> 00:34:35.280
So, they provide the event data.

00:34:35.280 --> 00:34:41.180
So, everything that happens on the pitch, like passes, dribbles, interceptions, everything.

00:34:41.180 --> 00:34:45.020
They are also one of the providers of the open data sets.

00:34:45.020 --> 00:34:45.420
Okay.

00:34:45.420 --> 00:34:46.100
Yeah.

00:34:46.100 --> 00:34:47.360
They've got a free data section.

00:34:47.360 --> 00:34:47.880
That's cool.

00:34:47.880 --> 00:34:48.160
Yeah.

00:34:48.160 --> 00:34:51.180
They proclaim themselves as data champions.

00:34:51.180 --> 00:34:52.420
That's kind of cool.

00:34:52.420 --> 00:34:53.180
Yeah.

00:34:53.180 --> 00:34:55.100
I think the data is pretty good.

00:34:55.100 --> 00:35:00.020
I think also one of the best in the market right now.

00:35:00.020 --> 00:35:03.420
But at least that's what I heard from some users.

00:35:03.420 --> 00:35:03.900
Sure.

00:35:04.160 --> 00:35:05.380
They even have courses.

00:35:05.380 --> 00:35:08.300
Modern scouting and data-driven recruitment.

00:35:08.300 --> 00:35:10.200
That's kind of interesting, isn't it?

00:35:10.200 --> 00:35:10.380
Yeah.

00:35:10.380 --> 00:35:16.060
You also have to figure out how to apply data science in your job.

00:35:16.060 --> 00:35:21.160
So, how to use it and how to use the data for scouting purposes.

00:35:21.160 --> 00:35:21.800
Yeah.

00:35:21.800 --> 00:35:28.060
If you work in a professional sports organization or even college sports, the U.S. at least,

00:35:28.060 --> 00:35:31.720
there's a lot of recruiting people up from lower levels.

00:35:31.860 --> 00:35:33.360
That happens in all sports.

00:35:33.360 --> 00:35:40.860
But I think the data is really helping to make the number of players that you have to watch

00:35:40.860 --> 00:35:42.780
from the footage a lot less.

00:35:42.920 --> 00:35:49.600
So, if you can already make a short list instead of watching 15,000 players, then it's really

00:35:49.600 --> 00:35:49.980
convenient.

00:35:49.980 --> 00:35:50.340
Sure.

00:35:50.340 --> 00:35:56.860
Or maybe you're looking for a particular asset or a particular part of the play that a player

00:35:56.860 --> 00:35:57.480
is good at.

00:35:57.480 --> 00:35:57.860
Right?

00:35:57.860 --> 00:36:03.400
Maybe you're looking for a quarterback for a football team that is especially good at running

00:36:03.400 --> 00:36:05.020
the ball in addition to just throwing it.

00:36:05.020 --> 00:36:05.220
Right?

00:36:05.300 --> 00:36:10.440
You could ask the data for that and really narrow it quite quickly, I imagine.

00:36:10.440 --> 00:36:14.940
And then you have to work with the data, figuring out how to extract it.

00:36:14.940 --> 00:36:19.740
Because maybe that single metric that's really important for you is not available in the original

00:36:19.740 --> 00:36:20.480
data set.

00:36:20.480 --> 00:36:26.180
So, then you have to figure out how to work with the data and get those metrics out of

00:36:26.180 --> 00:36:27.220
the raw data.

00:36:27.220 --> 00:36:27.460
Yeah.

00:36:27.460 --> 00:36:29.300
Maybe it's something calculated or inferred.

00:36:29.300 --> 00:36:29.720
Yeah.

00:36:29.720 --> 00:36:34.060
And that's also one of the things that happens in soccer based on the tracking data.

00:36:34.060 --> 00:36:38.080
But it will probably happen also in football and all the other sports.

00:36:38.080 --> 00:36:42.980
That clubs will define their own metrics based on, for example, tracking data.

00:36:43.140 --> 00:36:49.740
And use that to figure out what players match their own play the most.

00:36:49.740 --> 00:36:49.940
Cool.

00:36:49.940 --> 00:36:50.500
Okay.

00:36:50.500 --> 00:36:50.940
So, yeah.

00:36:50.940 --> 00:36:53.960
That's what, as you can see, there's a bunch of StatsBomb packages here.

00:36:53.960 --> 00:36:59.980
pybaseball and mlbgame seem to be a couple of things around baseball data.

00:36:59.980 --> 00:37:05.640
And baseball is one of those games that's kind of, I feel like baseball is one of those games

00:37:05.640 --> 00:37:10.720
that was almost created by a statistician just so they could come up with stats.

00:37:10.720 --> 00:37:12.140
There's so many stats.

00:37:12.140 --> 00:37:15.640
And, you know, people get averages, you know, that what kind of hitter are they?

00:37:15.640 --> 00:37:19.180
Well, they're like a 0.3, you know, they're a 300 hitter.

00:37:19.180 --> 00:37:22.540
Or, you know, 30% and all that.

00:37:22.540 --> 00:37:23.900
And I'm not a huge fan of baseball.

00:37:23.900 --> 00:37:25.900
I find it kind of a slow game.

00:37:25.900 --> 00:37:27.520
It's kind of fun to play, but to watch.

00:37:27.520 --> 00:37:29.120
And it's like, you know, same as golf.

00:37:29.120 --> 00:37:30.040
I don't watch those things.

00:37:30.040 --> 00:37:31.240
Yeah.

00:37:31.240 --> 00:37:35.780
I'm sure they're fun to play, but it's just like, in terms of stats, these kinds of games,

00:37:35.780 --> 00:37:38.820
there's probably a ton of stats here because it's all about stats there.

00:37:38.820 --> 00:37:44.660
I also believe that the baseball data science departments are one of the biggest departments

00:37:44.660 --> 00:37:45.680
over all sports.

00:37:45.680 --> 00:37:48.620
And maybe, but I'm not sure about it.

00:37:48.620 --> 00:37:51.420
You can also make a lot of impact there.

00:37:51.420 --> 00:37:51.980
Maybe.

00:37:51.980 --> 00:37:52.440
Sure.

00:37:52.580 --> 00:37:57.900
Because in other sports, for example soccer, a lot of things have an impact on the

00:37:57.900 --> 00:37:58.820
eventual outcome.

00:37:58.820 --> 00:38:06.760
It's also a discussion if all data is available to know what actually has the most impact.

00:38:06.760 --> 00:38:11.120
So that's also one of the discussions within the soccer analyst community.

00:38:11.120 --> 00:38:11.420
Yeah.

00:38:11.420 --> 00:38:17.200
For both of these, pybaseball and mlbgame, you can see from your Luigi automation.

00:38:18.420 --> 00:38:22.440
They're both quite, well, mlbgame is not particularly up to date.

00:38:22.440 --> 00:38:24.140
I guess the pybaseball one is more up to date.

00:38:24.140 --> 00:38:27.540
But, you know, 13 contributors, 30 contributors.

00:38:27.540 --> 00:38:28.560
That's quite a lot.

00:38:28.560 --> 00:38:29.500
That's quite a lot.

00:38:29.500 --> 00:38:33.000
And pybaseball was updated this month, right?

00:38:33.000 --> 00:38:36.160
But, you know, when I saw these, I'm like, oh, these are kind of similar.

00:38:36.160 --> 00:38:41.040
And then I look at your page here and I see, oh, well, pybaseball is, you know, way more

00:38:41.040 --> 00:38:42.200
up to date, modern.

00:38:42.200 --> 00:38:43.800
And you should check that out first, right?

00:38:43.800 --> 00:38:45.760
That's the kind of value you get for having the info.

00:38:45.760 --> 00:38:46.000
Yeah.

00:38:46.000 --> 00:38:53.620
That's also the intention that you have a quite quick overview of, yeah, how it's maintained

00:38:53.620 --> 00:38:53.960
it.

00:38:53.960 --> 00:38:55.080
And, yeah.

00:38:55.080 --> 00:38:55.640
Yeah.

00:38:55.640 --> 00:38:57.900
And that one also goes against the API.

00:38:57.900 --> 00:38:58.560
So let's see.

00:38:58.560 --> 00:38:59.880
A couple more.

00:38:59.880 --> 00:39:04.160
I guess it's worth giving a shout out to nflfastpy.

00:39:04.160 --> 00:39:07.180
That, well, you know, NFL's got a lot of data as well.

00:39:07.180 --> 00:39:07.740
What else?

00:39:07.740 --> 00:39:09.260
There's some college baseball.

00:39:09.640 --> 00:39:12.980
Here's one that I think is that shows up across a lot of the different categories because

00:39:12.980 --> 00:39:16.320
it seems to do a lot, which is FastF1.

00:39:16.320 --> 00:39:17.140
Have you seen that?

00:39:17.140 --> 00:39:17.900
Have you played with this any?

00:39:17.900 --> 00:39:19.860
Also updated this month.

00:39:19.860 --> 00:39:22.720
I should dig into it because quite some contributors.

00:39:22.720 --> 00:39:29.540
And I think it's really interesting to also see the motor sports or cycling or more of those

00:39:29.540 --> 00:39:32.620
sports to see what they are doing, how they're doing it.

00:39:32.620 --> 00:39:32.840
Yeah.

00:39:32.840 --> 00:39:36.600
I noticed looking through here that there's not a lot of motor sports compared to the other

00:39:36.600 --> 00:39:37.060
sports.

00:39:37.520 --> 00:39:41.720
And so people, if you're out there, like if you're in IndyCar or if you're in motocross

00:39:41.720 --> 00:39:45.360
or something like that, and you've got a package, shoot it over to these guys and have them

00:39:45.360 --> 00:39:45.920
put it in the list.

00:39:45.920 --> 00:39:46.360
That'd be cool.

00:39:46.360 --> 00:39:46.780
Yeah.

00:39:46.780 --> 00:39:49.920
FastF1, they've got a page here that has a bunch of things.

00:39:49.920 --> 00:39:55.460
It has access to timing data, telemetry, session results, and all the data is provided

00:39:55.460 --> 00:40:00.760
in an extended pandas DataFrame format, which is pretty cool.

00:40:00.760 --> 00:40:01.060
Right.

00:40:01.060 --> 00:40:02.740
Integration with Matplotlib.

00:40:02.740 --> 00:40:05.180
There's an examples gallery too.

00:40:05.180 --> 00:40:10.060
You come over here and you can see it has things like position changes during the race.

00:40:10.060 --> 00:40:15.940
So this, it'll say, if you go up here, it'll do things like, you got to go forward, you

00:40:15.940 --> 00:40:22.200
know, get session, 2023, race one, 'R' for race, I guess, rather than practice or

00:40:22.200 --> 00:40:22.720
qualifying.

00:40:22.720 --> 00:40:23.900
And that's Bahrain.

00:40:23.900 --> 00:40:28.280
And so then here's, you know, it has all the drivers, their time throughout the race, their

00:40:28.280 --> 00:40:28.620
position.

00:40:28.620 --> 00:40:30.160
You can see probably pit stop.

00:40:30.160 --> 00:40:31.740
There's a lot of cool stuff you can see in here.

00:40:31.820 --> 00:40:32.760
It looks really nice.

00:40:32.760 --> 00:40:38.740
And also with those examples, I think it's really helpful to get people started with those

00:40:38.740 --> 00:40:39.700
packages.

00:40:39.700 --> 00:40:40.060
Yeah.

00:40:40.060 --> 00:40:41.800
It's not exactly a Jupyter notebook.

00:40:41.800 --> 00:40:43.960
It's the HTML of a Jupyter notebook.

00:40:43.960 --> 00:40:46.540
But, you know, it's still exactly what you need, right?

00:40:46.540 --> 00:40:49.480
But I think you can even download it as a notebook.

00:40:49.480 --> 00:40:50.460
You download it right there.

00:40:50.460 --> 00:40:50.760
Absolutely.

00:40:50.760 --> 00:40:51.040
Yeah.

00:40:51.100 --> 00:40:51.420
Yeah.

00:40:51.420 --> 00:40:51.500
Yeah.

00:40:51.500 --> 00:40:54.820
And apparently two and a half seconds to generate this script.

00:40:54.820 --> 00:40:56.480
Let's see.

00:40:56.480 --> 00:41:03.140
You've even got cool visualizations like on the track, color it by speed around the tracks

00:41:03.140 --> 00:41:03.520
of the start.

00:41:03.520 --> 00:41:05.340
You know, there's a lot of cool data here.

00:41:05.340 --> 00:41:09.580
I'm not really sure why I haven't seen this one before, but yeah, it looks really, really

00:41:09.580 --> 00:41:09.900
cool.

00:41:09.900 --> 00:41:11.100
Yeah.

00:41:11.100 --> 00:41:13.920
When I looked, I looked around a couple of the different packages and this one, like

00:41:13.920 --> 00:41:16.800
the documentation and examples and stuff seem, seem super good.

00:41:16.800 --> 00:41:17.500
Okay.

00:41:17.500 --> 00:41:18.900
So that's the scrapers.

00:41:18.900 --> 00:41:19.820
There's many more.

00:41:20.360 --> 00:41:21.280
There's plenty more there.

00:41:21.280 --> 00:41:23.580
Another one, models, calculators.

00:41:23.580 --> 00:41:26.840
Maybe take us through some of the ones that stand out in this category.

00:41:26.840 --> 00:41:30.560
Like, for example, there's Laurie's code for Metrica tracking data.

00:41:30.560 --> 00:41:32.160
I love it that it's just, it's Laurie's code.

00:41:32.160 --> 00:41:32.740
Good job, Laurie.

00:41:32.740 --> 00:41:33.780
Yeah.

00:41:33.780 --> 00:41:39.500
So this is mostly about how to also do all kinds of modeling on top of it, do predictions

00:41:39.500 --> 00:41:40.540
on top of data.

00:41:40.540 --> 00:41:45.520
You know, one of the packages that I think is pretty interesting is socceraction.

00:41:45.520 --> 00:41:46.120
Yeah, of course.

00:41:46.120 --> 00:41:47.040
Again, it's soccer.

00:41:47.040 --> 00:41:49.680
There's only Python, possibly.

00:41:49.680 --> 00:41:53.020
But for example, they have soccer_xg, which is, what is that?

00:41:53.020 --> 00:41:55.960
XGBoost models for soccer event data?

00:41:55.960 --> 00:41:58.400
That's the expected goals.

00:41:59.100 --> 00:42:02.440
So what's the expected value for a certain shot?

00:42:02.440 --> 00:42:05.780
If it should go in or not.

00:42:05.780 --> 00:42:13.140
So it's also based on the position on the pitch, how many players are between the player with

00:42:13.140 --> 00:42:14.420
the ball and the goal.

00:42:14.420 --> 00:42:22.040
So you can use it to determine, yeah, how, if a player should score a goal and how many goals

00:42:22.040 --> 00:42:22.780
he should make.
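
NOTE
Editor's sketch, not from the episode: an expected-goals value is a probability per shot, and summing it over shots gives the goals a player "should" have scored.
The logistic form and every coefficient below are invented for illustration; real models such as the ones mentioned are fitted on event data.
The features (distance, shooting angle, defenders between ball and goal) follow the description above.
```python
import math
def expected_goal_probability(distance_m: float, angle_rad: float, defenders_in_path: int) -> float:
    """Toy xG: a logistic function of three shot features."""
    # Coefficients are made up for this sketch, not fitted to any data.
    score = 0.5 - 0.12 * distance_m + 1.2 * angle_rad - 0.4 * defenders_in_path
    return 1.0 / (1.0 + math.exp(-score))
# Hypothetical shots: (distance in meters, visible goal angle in radians, defenders in the way).
shots = [(6.0, 1.0, 0), (25.0, 0.3, 3)]
shot_values = [expected_goal_probability(*shot) for shot in shots]
# Summing per-shot probabilities gives the expected number of goals from these shots.
total_expected_goals = sum(shot_values)
print(shot_values, total_expected_goals)
```
Comparing total expected goals with goals actually scored is the usual way to ask whether a player is finishing above or below expectation.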

00:42:23.120 --> 00:42:24.120
Yeah.

00:42:24.120 --> 00:42:24.380
Yeah.

00:42:24.380 --> 00:42:28.540
I think this is actually one of the really interesting aspects, the models and calculators.

00:42:28.540 --> 00:42:30.840
You know, the prediction side is pretty cool.

00:42:30.840 --> 00:42:36.900
There's quite some work to do for PySport because, for example, the expected goals.

00:42:36.900 --> 00:42:39.660
That's also one of the things that I've seen in ice hockey.

00:42:39.660 --> 00:42:44.180
Also, in other sports where you have to score within a goal.

00:42:44.180 --> 00:42:49.480
And I think it would be cool to find a way to abstract it over all sports.

00:42:49.480 --> 00:42:49.760
Yeah.

00:42:49.760 --> 00:42:54.300
Because it is kind of the same idea, probably different data sets, but right.

00:42:54.300 --> 00:42:59.980
Like scoring in hockey and scoring in soccer is from a structural perspective of the data

00:42:59.980 --> 00:43:04.680
is kind of the same thing, even though it's really quite different in size of the goal and

00:43:04.680 --> 00:43:06.020
how easy it is and all that.

00:43:06.020 --> 00:43:06.220
Yeah.

00:43:06.220 --> 00:43:10.680
But I think we can still learn from the other sports and see how they did it.

00:43:10.680 --> 00:43:10.900
Yeah.

00:43:10.900 --> 00:43:12.960
Train up a model, but on different data, right?

00:43:12.960 --> 00:43:14.600
But same type of model, potentially.

00:43:14.600 --> 00:43:14.980
Yeah.

00:43:14.980 --> 00:43:17.080
Maybe some different features, but yeah.

00:43:17.080 --> 00:43:17.360
Yeah.

00:43:17.360 --> 00:43:20.020
So the next category is IO.

00:43:20.020 --> 00:43:23.220
And obviously StatsBomb is in here, right?

00:43:23.220 --> 00:43:27.700
Python package to parse StatsBomb's JSON data to CSV, which is cool.

00:43:27.700 --> 00:43:31.700
Some on soccer, the SPADL format, which I have no idea what that is.

00:43:31.700 --> 00:43:31.880
Yeah.

00:43:31.880 --> 00:43:37.660
That's also one of the things they built, to make, like, an atomic data format.

00:43:37.660 --> 00:43:41.160
That's also kind of standardized.

00:43:41.660 --> 00:43:44.040
So there's some overlap between socceraction and kloppy.

00:43:44.040 --> 00:43:49.080
I think they mostly focused on how to eventually work with the data.

00:43:49.080 --> 00:43:55.720
So calculate also the expected threat and also like a contribution model.

00:43:55.720 --> 00:43:59.100
So for every action towards a goal.

00:43:59.100 --> 00:44:00.220
Right.

00:44:00.220 --> 00:44:00.840
Right.

00:44:00.840 --> 00:44:01.160
Okay.

00:44:01.160 --> 00:44:04.800
So maybe there's a takeaway and then a pass and a pass and then a score.

00:44:04.940 --> 00:44:08.180
Like all of those people should somehow get credit for that potentially, right?

00:44:08.180 --> 00:44:08.480
Yeah.

00:44:08.480 --> 00:44:08.820
Okay.

00:44:08.820 --> 00:44:09.560
Makes sense.

00:44:09.560 --> 00:44:12.960
But they also built a way to load the data.

00:44:12.960 --> 00:44:20.060
And we're currently also working together with them to see if we can make kloppy load

00:44:20.060 --> 00:44:25.800
the data and have the kloppy package focus on loading it and standardizing it and then have

00:44:25.800 --> 00:44:27.420
socceraction use it.

00:44:27.420 --> 00:44:30.480
So see how the Lego blocks can work together.

00:44:30.480 --> 00:44:30.880
Absolutely.

00:44:30.880 --> 00:44:36.240
We have the NFLDB, a library to manage and update NFL data in a relational database.

00:44:36.240 --> 00:44:37.180
That's kind of cool.

00:44:37.180 --> 00:44:37.800
All right.

00:44:37.800 --> 00:44:38.200
Let's see.

00:44:38.200 --> 00:44:40.200
The next category is the visualization.

00:44:40.920 --> 00:44:45.640
I think probably the most important part is probably the actual data acquisition, but

00:44:45.640 --> 00:44:48.620
the most desired part is probably the visualization, right?

00:44:48.620 --> 00:44:53.360
The data engineering part is not really, what do you call it, really sexy.

00:44:53.360 --> 00:44:54.960
I mean, no one sees it.

00:44:54.960 --> 00:44:58.680
The output is a structured CSV or Parquet file.

00:44:58.680 --> 00:45:00.880
So that's not really cool to show.

00:45:00.880 --> 00:45:09.060
But for example, mplsoccer, I think it's a really, really nice package used by every

00:45:09.060 --> 00:45:11.900
person in the soccer community.

00:45:11.900 --> 00:45:12.300
Yeah.

00:45:12.300 --> 00:45:14.520
There's a lot of contributors here.

00:45:14.520 --> 00:45:14.980
Yeah.

00:45:14.980 --> 00:45:17.800
And the visualizations look really cool.

00:45:17.800 --> 00:45:18.140
Yeah.

00:45:18.140 --> 00:45:22.020
They also have a huge list of examples.

00:45:22.020 --> 00:45:22.480
Okay.

00:45:22.480 --> 00:45:28.100
So all kinds of, you can just copy and paste to create some pizza charts.

00:45:28.100 --> 00:45:28.920
I love them.

00:45:28.920 --> 00:45:29.280
Yeah.

00:45:29.280 --> 00:45:29.600
Yeah.

00:45:29.600 --> 00:45:32.300
We'll actually come back to the pizza charts in just a moment, actually.

00:45:32.300 --> 00:45:35.740
But yeah, these are some good looking visualizations here.

00:45:35.740 --> 00:45:35.960
Yeah.

00:45:35.960 --> 00:45:40.440
And I think the interesting thing about this package is that at some point there were two

00:45:40.440 --> 00:45:42.780
packages that did similar things.

00:45:42.780 --> 00:45:45.940
And then they decided, well, we should just work together.

00:45:45.940 --> 00:45:48.940
And they spent quite some time on integrating those packages.

00:45:48.940 --> 00:45:50.320
And then there was one.

00:45:50.320 --> 00:45:56.060
And I think that's really cool to see that instead of kind of competing, they decided to

00:45:56.060 --> 00:45:59.260
work together and make, I think, one of the most awesome.

00:45:59.260 --> 00:46:01.160
packages for the soccer community.

00:46:01.160 --> 00:46:02.420
It's really nice.

00:46:02.420 --> 00:46:03.260
It's really nice.

00:46:03.260 --> 00:46:05.140
There's a lot of soccer ones in here.

00:46:05.140 --> 00:46:05.720
Yeah.

00:46:05.720 --> 00:46:10.420
There's also one, ptplot, for American football, although I don't understand what PT

00:46:10.420 --> 00:46:11.200
stands for.

00:46:11.540 --> 00:46:14.460
And then FastF1, for Formula One, is also in there.

00:46:14.460 --> 00:46:17.420
We already saw those pictures, but a lot of nice visualizations there.

00:46:17.420 --> 00:46:17.620
Yeah.

00:46:18.100 --> 00:46:20.120
And is that it for all the categories?

00:46:20.120 --> 00:46:20.460
No.

00:46:20.460 --> 00:46:21.660
Then there's the open data.

00:46:21.660 --> 00:46:21.940
Yeah.

00:46:21.940 --> 00:46:25.460
I think maybe when I look at this list, are some missing?

00:46:25.460 --> 00:46:25.760
Okay.

00:46:25.760 --> 00:46:28.540
It's still a bit limited on what data is available.

00:46:28.540 --> 00:46:35.440
That's something that we should work on together, also with leagues, to see if there's a way to make

00:46:35.440 --> 00:46:36.680
some more data available.

00:46:36.680 --> 00:46:37.240
Yeah.

00:46:37.240 --> 00:46:40.280
They have it and they offer it publicly, put it in the list, right?

00:46:40.280 --> 00:46:43.040
When it's available, I would definitely add it.

00:46:43.040 --> 00:46:46.340
But there's already some interesting data.

00:46:46.340 --> 00:46:52.060
They may be a little bit smaller data sets, but you can definitely use it to start playing

00:46:52.060 --> 00:46:52.640
around with it.

00:46:52.640 --> 00:46:53.140
All right.

00:46:53.640 --> 00:47:00.080
So I think that kind of covers the list with the Python filter sort of on.

00:47:00.080 --> 00:47:04.220
You wanted also to give a quick shout out to NFLverse, right?

00:47:04.220 --> 00:47:10.500
Because, while not Python, it's quite a series of packages that does cool stuff in the NFL for

00:47:10.500 --> 00:47:11.660
that data, right?

00:47:11.660 --> 00:47:11.920
Yeah.

00:47:11.920 --> 00:47:13.160
So it's not Python.

00:47:13.160 --> 00:47:14.600
It's for R users.

00:47:14.600 --> 00:47:20.880
But I think what's really interesting there, what they did is they created quite some different

00:47:20.880 --> 00:47:26.040
packages, one for collecting the data, one for organizing it, one for reading the data,

00:47:26.040 --> 00:47:30.140
one for doing all kinds of modeling, one for creating the visualizations.

00:47:30.140 --> 00:47:37.660
And I think that's also an example for all the sports on how to make those packages available,

00:47:37.660 --> 00:47:40.320
making sure that everything fits together.

00:47:40.320 --> 00:47:41.300
Yeah, that's cool.

00:47:41.300 --> 00:47:45.540
It's under the nflverse organization, but a bunch of different projects.

00:47:45.540 --> 00:47:50.320
You know, you talked about having the data and stuff that's not immediately obvious or

00:47:50.320 --> 00:47:50.680
predictable.

00:47:50.680 --> 00:47:53.480
You might need a higher level sort of thinking about it.

00:47:53.480 --> 00:47:58.880
And one of them that stands out here is nfl4th, which studies fourth-down

00:47:58.880 --> 00:48:03.580
decisions with the nflverse models, which is kind of cool because that's one of the big

00:48:03.580 --> 00:48:07.140
decisions that a coach makes and it can make the game or it can lose the game.

00:48:07.140 --> 00:48:09.660
And there's a go, no go decision, right?

00:48:09.660 --> 00:48:13.220
And there's a lot of, it's not just, well, they went this far, then they didn't make it.

00:48:13.220 --> 00:48:18.560
It's well, it was the, they had 30 seconds left in the game and they had to do it or, you know,

00:48:18.560 --> 00:48:20.660
because otherwise they were just going to lose anyway.

00:48:20.660 --> 00:48:21.060
Right.

00:48:21.060 --> 00:48:25.340
There's a lot of higher, like sort of inference and higher level things you want to bring into

00:48:25.340 --> 00:48:28.800
that rather than just 30% of the time they make it fourth down.
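
NOTE A rough sketch of why game context matters more than the raw conversion rate: compare the expected win probability of each option instead of looking only at how often a fourth down is converted. All numbers and names here are made up for illustration; this is not how nfl4th computes its estimates.
```python
from dataclasses import dataclass
@dataclass
class Option:
    name: str
    p_success: float        # chance the play works
    wp_if_success: float    # win probability afterwards if it works
    wp_if_failure: float    # win probability afterwards if it fails
    def expected_wp(self) -> float:
        """Weight both outcomes by how likely they are."""
        return self.p_success * self.wp_if_success + (1 - self.p_success) * self.wp_if_failure
def best_option(options: list[Option]) -> Option:
    """Pick the option with the highest expected win probability."""
    return max(options, key=lambda o: o.expected_wp())
# Made-up numbers: trailing late in the game, so punting almost concedes it.
late_and_trailing = [
    Option("go for it", p_success=0.30, wp_if_success=0.25, wp_if_failure=0.01),
    Option("punt", p_success=1.00, wp_if_success=0.03, wp_if_failure=0.03),
]
choice = best_option(late_and_trailing)
print(choice.name, round(choice.expected_wp(), 3))
```
With these invented inputs, a 30% conversion chance still beats punting, because failing costs almost nothing when the game is nearly lost anyway.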

00:48:28.800 --> 00:48:29.040
Right.

00:48:29.100 --> 00:48:29.260
Yeah.

00:48:29.260 --> 00:48:33.800
And that's, I think, also one of the reasons they just built an entire package around it

00:48:33.800 --> 00:48:35.560
to work with it.

00:48:35.560 --> 00:48:35.800
Yeah.

00:48:35.800 --> 00:48:36.460
That's pretty interesting.

00:48:36.460 --> 00:48:40.900
Now, before all the Python people say, I don't want to learn R, I don't care about R, it is

00:48:40.900 --> 00:48:44.280
also worth pointing out that you can call R from Python.

00:48:44.500 --> 00:48:48.840
I don't know how much like the visualization stuff still works super well or anything like

00:48:48.840 --> 00:48:51.680
that, but you can use, oh, what is it called?

00:48:51.680 --> 00:48:53.340
rpy2.

00:48:53.340 --> 00:48:54.140
Okay.

00:48:54.140 --> 00:48:59.280
And you can end up, you just pass it an R file and then you start calling functions or

00:48:59.280 --> 00:49:02.080
whatever, get a, get a function out of it and call that function.

00:49:02.080 --> 00:49:07.720
So it's worth, you know, if, if you really, really want to use some of these packages, maybe

00:49:07.720 --> 00:49:11.780
it's worth doing a quick little integration and then turn it into a data frame, a pandas data

00:49:11.780 --> 00:49:13.160
frame and running with it or something.

00:49:13.160 --> 00:49:14.060
It looks interesting.

00:49:14.300 --> 00:49:16.540
It's definitely worth a try.

00:49:16.540 --> 00:49:20.600
It's nothing I've ever used, but I can see, you know, if you really care about NFL data

00:49:20.600 --> 00:49:24.920
and you really care about Python, it might be worth, worth giving those, those combos a

00:49:24.920 --> 00:49:25.260
look there.

00:49:25.260 --> 00:49:30.020
I think there is one package to work with, with their data from Python.

00:49:30.020 --> 00:49:33.500
So if you look at the list there, there should be at least one.

00:49:33.500 --> 00:49:39.400
I think it's not on their website on their GitHub page, but I think there's another one

00:49:39.400 --> 00:49:41.540
that integrates well with it.

00:49:41.540 --> 00:49:41.880
Sure.

00:49:41.880 --> 00:49:42.280
Right.

00:49:42.280 --> 00:49:44.240
Not under the organization, but maybe somebody else.

00:49:44.240 --> 00:49:44.540
Yeah.

00:49:44.540 --> 00:49:44.940
Yeah.

00:49:44.940 --> 00:49:45.240
Yeah.

00:49:45.240 --> 00:49:45.920
That does.

00:49:45.920 --> 00:49:46.220
Yeah.

00:49:46.220 --> 00:49:46.920
That's cool.

00:49:46.920 --> 00:49:47.500
Excellent.

00:49:47.500 --> 00:49:49.640
Maybe they use this, this integration.

00:49:49.640 --> 00:49:50.540
I don't know.

00:49:50.540 --> 00:49:51.040
All right.

00:49:51.040 --> 00:49:55.160
And then the last thing I want to talk about here is interesting on two levels.

00:49:55.420 --> 00:49:57.440
So you've got a playground.

00:49:57.440 --> 00:50:03.040
So you've got playground.pysport.org, which is a hosted notebook to play with some examples,

00:50:03.040 --> 00:50:05.980
like in particular, kloppy and mplsoccer, right?

00:50:06.060 --> 00:50:11.480
I think one of the issues or challenges for a lot of people also working within the bigger

00:50:11.480 --> 00:50:15.080
clubs is that they don't always have a background in programming.

00:50:15.080 --> 00:50:21.200
So often they start as a video analyst or working as a performance analyst, and then they think,

00:50:21.200 --> 00:50:21.920
well, there's data.

00:50:21.920 --> 00:50:22.940
I want to work with it.

00:50:23.320 --> 00:50:28.920
And if you need to set up your Python environment for the first time, it can be a bit overwhelming.

00:50:28.920 --> 00:50:37.400
So that's why I, for, well, there is JupyterLite, which is a very cool project based on Pyodide.

00:50:37.700 --> 00:50:39.800
Let's see if, yeah, if you can use it.

00:50:39.800 --> 00:50:44.180
And it is just a start with the kloppy and the mplsoccer package.

00:50:44.180 --> 00:50:51.300
I just fetched the notebook from there, from their gallery, integrated it into this one, into the

00:50:51.300 --> 00:50:55.960
playground, and you can just start playing around with it.

00:50:55.960 --> 00:50:56.180
Yeah.

00:50:56.180 --> 00:51:00.920
And so here's a proper Jupyter notebook using all of their libraries and stuff.

00:51:00.920 --> 00:51:05.960
But what's awesome about this, as you said, based on Pyodide, I'm not sure it necessarily

00:51:05.960 --> 00:51:08.080
actually stuck in people's minds.

00:51:08.080 --> 00:51:11.600
Like, this is running in WebAssembly on our front end, right?

00:51:11.600 --> 00:51:13.100
Which is pretty epic.

00:51:13.100 --> 00:51:18.680
It makes it really convenient for people to just start playing around with it without installing

00:51:18.680 --> 00:51:21.480
Python and working with virtual environments.

00:51:21.480 --> 00:51:23.060
You know how it works.

00:51:23.060 --> 00:51:23.760
Yeah.

00:51:23.760 --> 00:51:28.020
It makes it super easy for you to host it because all you're doing is serving up static files.

00:51:28.020 --> 00:51:32.380
You're not hosting, you're not running a Kubernetes cluster or anything like that, right?

00:51:32.380 --> 00:51:34.480
Trying to prevent abuse of it and so on.

00:51:34.480 --> 00:51:34.660
Yeah.

00:51:34.800 --> 00:51:34.980
Yeah.

00:51:34.980 --> 00:51:36.840
So, yeah.

00:51:36.840 --> 00:51:41.160
Multiple sides make it good for me and for the people using it.

00:51:41.160 --> 00:51:41.560
For sure.

00:51:41.560 --> 00:51:47.840
And it even does that wild, what's it called, pizza plot, that kind of style of plot that

00:51:47.840 --> 00:51:48.360
we're looking at.

00:51:48.360 --> 00:51:50.120
And it runs fast and great.

00:51:50.120 --> 00:51:50.340
Yeah.

00:51:50.340 --> 00:51:51.880
This is really, really nice.

00:51:51.880 --> 00:51:52.100
Yeah.

00:51:52.100 --> 00:51:54.600
Are you happy with Pyodide and JupyterLite?

00:51:54.600 --> 00:51:54.960
Yeah.

00:51:54.960 --> 00:51:54.960
Yeah.

00:51:54.960 --> 00:52:03.040
There were some issues with it, especially around working with fetching data because some

00:52:03.040 --> 00:52:09.200
of these try to fetch the open data from StatsBomb or also some fonts and stuff like that.

00:52:09.200 --> 00:52:09.560
Yeah.

00:52:09.560 --> 00:52:11.860
So we had to work around it.

00:52:11.860 --> 00:52:18.560
And that's also what you see on top here: the patching of the requests library to make

00:52:18.560 --> 00:52:19.920
it work in JupyterLite.

00:52:19.920 --> 00:52:20.200
Yeah.

00:52:20.860 --> 00:52:26.280
I think it's better to have a working version than not patching it.

00:52:26.280 --> 00:52:27.180
I think it's great.

00:52:27.540 --> 00:52:30.980
And then everything that uses requests can just do its thing.
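
NOTE A simplified sketch of the patching idea: swap one function on a shared object, and every caller written against the original API picks up the replacement without changes. The fake_requests object and both get functions are hypothetical stand-ins, not the real requests package and not the playground's actual patch code.
```python
import types
# Stand-in for an HTTP library; NOT the real requests package.
fake_requests = types.SimpleNamespace()
def socket_get(url: str) -> str:
    # Assumed failure mode: no raw sockets are available inside the browser sandbox.
    raise OSError("sockets are not available in the browser")
def browser_get(url: str) -> str:
    # Hypothetical replacement that would delegate to the browser's own networking.
    return f"fetched {url} via browser"
fake_requests.get = socket_get
def load_open_data(url: str) -> str:
    # Library code written against the original API, unaware of any patch.
    return fake_requests.get(url)
def patch_for_browser() -> None:
    # Swap the attribute once; every later caller picks up the replacement.
    fake_requests.get = browser_get
patch_for_browser()
print(load_open_data("https://example.org/events.json"))
```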

00:52:30.980 --> 00:52:31.320
Yeah.

00:52:31.320 --> 00:52:31.940
This is really cool.

00:52:31.940 --> 00:52:35.620
When I saw that you had this, I thought, oh, this is clever that it's based on Jupyter

00:52:35.620 --> 00:52:35.900
Lite.

00:52:35.900 --> 00:52:37.000
And it's really nice.

00:52:37.000 --> 00:52:37.280
Yeah.

00:52:37.280 --> 00:52:37.580
Yeah.

00:52:37.580 --> 00:52:39.280
So people can check that out.

00:52:39.280 --> 00:52:43.080
Maybe people out there listening maintain some of these packages and have notebooks.

00:52:43.080 --> 00:52:47.800
Like if they get them working here, could they submit them to you and have them added

00:52:47.800 --> 00:52:48.880
in this list?

00:52:48.880 --> 00:52:53.240
The entire playground is part of the PySport organization on GitHub.

00:52:53.740 --> 00:52:57.800
You can just watch, see the repository and make a pull request.

00:52:57.800 --> 00:53:01.920
And I will just review it and merge it.

00:53:01.920 --> 00:53:03.760
And then it will be available here.

00:53:03.760 --> 00:53:04.040
Yeah.

00:53:04.040 --> 00:53:04.680
That's awesome.

00:53:04.680 --> 00:53:06.860
So I'm really happy for more packages here.

00:53:06.860 --> 00:53:07.620
More examples.

00:53:07.620 --> 00:53:09.120
Yeah.

00:53:09.120 --> 00:53:10.420
More examples would be very welcome.

00:53:10.420 --> 00:53:10.920
Excellent.

00:53:10.920 --> 00:53:11.680
All right.

00:53:11.680 --> 00:53:17.400
Well, I think we're getting pretty much short on time for talking about sports analytics,

00:53:17.400 --> 00:53:19.100
but really, really good work there.

00:53:19.100 --> 00:53:22.140
Now, before you get out of here, I have the final two questions for you.

00:53:22.140 --> 00:53:23.040
I always ask.

00:53:23.500 --> 00:53:25.820
Notable PyPI package, something you've come across.

00:53:25.820 --> 00:53:27.200
You're like, oh, this library is awesome.

00:53:27.200 --> 00:53:28.540
People should check it out.

00:53:28.540 --> 00:53:30.860
I mean, it's kind of the whole topic of this show.

00:53:30.860 --> 00:53:32.800
So we talked about, you know, maybe a hundred.

00:53:32.800 --> 00:53:37.020
We didn't mention them all, but went through a list of a hundred different Python packages.

00:53:37.020 --> 00:53:39.580
But something you want to give a shout out to that you think is cool out there?

00:53:39.580 --> 00:53:43.420
I'm not really sure if the entire Python world already knows it.

00:53:43.420 --> 00:53:48.240
But at the last PySport meetup, I made an example using DuckDB.

00:53:48.740 --> 00:53:55.360
That was something that people didn't know about, especially the integration with pandas data frames,

00:53:55.360 --> 00:53:59.900
that you just build a data frame and run queries directly on top of it.

00:53:59.900 --> 00:54:00.360
Yeah.

00:54:00.360 --> 00:54:00.880
Interesting.

00:54:00.880 --> 00:54:05.200
I've heard of DuckDB, but I didn't realize the pandas kind of direct integration.

00:54:05.480 --> 00:54:07.340
It also has direct Parquet support.

00:54:07.340 --> 00:54:08.000
Interesting.

00:54:08.000 --> 00:54:08.680
Okay.

00:54:08.680 --> 00:54:14.080
That makes it quite easy to also play around with SQL queries.

00:54:14.080 --> 00:54:20.280
And I was very happy that I had a presentation at the last PyData Eindhoven conference.

00:54:20.720 --> 00:54:20.840
Yeah.

00:54:20.840 --> 00:54:28.560
I think it's a package that, well, not everyone, but it's really worth checking out because it can make your life easier.

00:54:28.560 --> 00:54:32.400
I think it's just a Swiss army knife for data engineering.

00:54:32.400 --> 00:54:34.900
And yeah, I think it's a nice one.

00:54:34.900 --> 00:54:35.160
Yeah.

00:54:35.160 --> 00:54:36.120
Great recommendation.

00:54:36.120 --> 00:54:39.940
And if you're going to write some Python code, what editor are you using these days?

00:54:39.940 --> 00:54:41.940
I'm using PyCharm.

00:54:41.940 --> 00:54:44.240
So I'm not, yeah, not sure if it's cool.

00:54:44.240 --> 00:54:45.020
I love PyCharm.

00:54:45.020 --> 00:54:45.640
PyCharm is awesome.

00:54:45.640 --> 00:54:45.860
Okay.

00:54:45.860 --> 00:54:46.400
Excellent one.

00:54:46.400 --> 00:54:46.700
Yeah.

00:54:46.700 --> 00:54:48.520
So I guess final call to action.

00:54:48.680 --> 00:54:51.940
People are interested in open source sports analytics.

00:54:51.940 --> 00:54:57.380
They're open and maybe interested in PySport, want to contribute back or, you know, be part of it in some way.

00:54:57.380 --> 00:54:57.900
What do you tell them?

00:54:57.900 --> 00:54:58.100
Yeah.

00:54:58.100 --> 00:55:05.280
You can reach out on Twitter or LinkedIn to see where you can contribute.

00:55:05.280 --> 00:55:15.760
And I think it's also, if you're not working in the sports domain and would like to contribute, please reach out because I think the knowledge from outside of sports is really useful within sports.

00:55:15.760 --> 00:55:22.520
So there are a lot of options to contribute and, yeah, make an even bigger and better community.

00:55:22.520 --> 00:55:22.840
Yeah.

00:55:22.840 --> 00:55:23.340
Absolutely.

00:55:23.340 --> 00:55:24.040
All right.

00:55:24.040 --> 00:55:27.700
Well, thank you so much for being here and sharing all these projects you've collected.

00:55:27.700 --> 00:55:29.940
Thanks a lot for being on the show.

00:55:29.940 --> 00:55:31.320
It's really, really nice.

00:55:31.320 --> 00:55:31.620
Yeah.

00:55:31.620 --> 00:55:32.040
Thank you.

00:55:32.040 --> 00:55:32.360
You're welcome.

00:55:32.360 --> 00:55:32.840
Bye.

00:55:32.840 --> 00:55:33.100
Bye.

00:55:34.420 --> 00:55:37.060
This has been another episode of Talk Python To Me.

00:55:37.060 --> 00:55:38.820
Thank you to our sponsors.

00:55:38.820 --> 00:55:40.480
Be sure to check out what they're offering.

00:55:40.480 --> 00:55:41.900
It really helps support the show.

00:55:41.900 --> 00:55:46.800
The folks over at JetBrains encourage you to get work done with PyCharm.

00:55:46.800 --> 00:55:52.360
PyCharm Professional understands complex projects across multiple languages and technologies,

00:55:52.360 --> 00:55:58.020
so you can stay productive while you're writing Python code and other code like HTML or SQL.

00:55:58.020 --> 00:56:03.160
Download your free trial at talkpython.fm/donewithpycharm.

00:56:03.920 --> 00:56:06.960
Influx Data encourages you to try InfluxDB.

00:56:06.960 --> 00:56:13.900
InfluxDB is a database purpose-built for handling time series data at a massive scale for real-time analytics.

00:56:13.900 --> 00:56:17.780
Try it for free at talkpython.fm/InfluxDB.

00:56:17.780 --> 00:56:19.720
Want to level up your Python?

00:56:19.720 --> 00:56:23.780
We have one of the largest catalogs of Python video courses over at Talk Python.

00:56:23.780 --> 00:56:28.940
Our content ranges from true beginners to deeply advanced topics like memory and async.

00:56:28.940 --> 00:56:31.620
And best of all, there's not a subscription in sight.

00:56:32.000 --> 00:56:34.520
Check it out for yourself at training.talkpython.fm.

00:56:34.520 --> 00:56:36.420
Be sure to subscribe to the show.

00:56:36.420 --> 00:56:39.200
Open your favorite podcast app and search for Python.

00:56:39.200 --> 00:56:40.500
We should be right at the top.

00:56:40.500 --> 00:56:45.660
You can also find the iTunes feed at /itunes, the Google Play feed at /play,

00:56:45.660 --> 00:56:49.860
and the direct RSS feed at /rss on talkpython.fm.

00:56:50.660 --> 00:56:53.300
We're live streaming most of our recordings these days.

00:56:53.300 --> 00:56:56.720
If you want to be part of the show and have your comments featured on the air,

00:56:56.720 --> 00:57:01.140
be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

00:57:01.140 --> 00:57:02.980
This is your host, Michael Kennedy.

00:57:02.980 --> 00:57:04.280
Thanks so much for listening.

00:57:04.280 --> 00:57:05.440
I really appreciate it.

00:57:05.700 --> 00:57:07.360
Now get out there and write some Python code.

