#482: Pre-commit Hooks for Python Devs Transcript
00:00 Do you struggle to make sure your code is always correct before checking it in?
00:03 What about your team member's code? That one person who never wants to run the linter?
00:08 Tired of dealing with tons of conflicts and spurious Git changes? You need Git pre-commit
00:13 hooks. Well, we're lucky to have Stefanie Molin on the show today, who has done a bunch of writing
00:18 and teaching of Git hooks. This is Talk Python to Me, episode 482, recorded October 24th, 2024.
00:27 Are you ready for your host? You're listening to Michael Kennedy on Talk Python to Me.
00:32 Live from Portland, Oregon, and this segment was made with Python.
00:36 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy.
00:44 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,
00:50 both accounts over at fosstodon.org, and keep up with the show and listen to over nine years of
00:56 episodes at talkpython.fm. If you want to be part of our live episodes, you can find the live streams
01:01 over on YouTube. Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified
01:07 about upcoming shows. This episode is brought to you by Sentry. Don't let those errors go unnoticed.
01:13 Use Sentry like we do here at Talk Python. Sign up at talkpython.fm/Sentry. And this episode is
01:19 brought to you by Bluehost. Do you need a website fast? Get Bluehost. Their AI builds your WordPress site
01:25 in minutes, and their built-in tools optimize your growth. Don't wait. Visit talkpython.fm/bluehost
01:31 to get started. Hey, everyone. Before we jump into the interview with Stefanie,
01:36 I want to tell you real quickly that I just released a blog for Talk Python. Now, we have had tons of RSS
01:43 over there because that's what powers podcasts. You can subscribe to the episodes. You can subscribe to
01:49 an RSS feed for new course announcements over at Talk Python Training. And I've had a personal blog
01:55 for some time over at mkennedy.codes, but no official Talk Python blog. And so I'm going to be posting
02:01 really cool things on there. I've already got a couple of articles posted, but I have plans for
02:06 some interesting series. And anytime there's some more interesting announcements or exciting news
02:11 I want to share with Talk Python, it's going to be over on the Talk Python blog. So if you're interested,
02:16 I would really, really appreciate it. If you go to talkpython.fm, click on blog, right in the
02:21 navigation or at the bottom and just subscribe to the RSS feed. That way we can stay in touch.
02:26 And with that, let's talk pre-commit hooks. Stefanie, welcome to Talk Python. It's awesome
02:32 to have you. Thanks for having me. Yeah, really looking forward to talking about pre-commit hooks.
02:38 You know, these are things that I'm sure a lot of people have heard of. I've certainly heard of,
02:42 but to be honest, it's not something I've done very much with. And I bet a lot of people out there
02:47 listening are like, yeah, that'd be a good idea. Just like continuous integration and writing tests.
Now let's get back to it. You know, something like that. So I think there's a lot for people to take away here. And we'll talk about what are these pre-commit
03:00 hooks, when to use them, how to build them, and a whole bunch of other things that you're up to.
03:05 So it should be a lot of fun. I'm looking forward to it. Me too.
03:08 Yeah. Now, before we get to that, how about your story? How do you get into programming Python and
03:14 pre-commit hooks and all these things? Hello everyone. I'm Stefanie Molin. I am a software engineer at
03:18 Bloomberg. And I would say, I guess I got into programming in Python. I initially was programming
03:25 in R and I was doing more data analysis while still building some things. And I needed to build a web
03:33 app. And one of my teammates had suggested that rather than battling with Shiny in R, that I just
03:38 learn Python. So I took a few weeks and just forced myself to do that. And I built something
03:43 with Flask. And that was how I got into it. Oh, that's really awesome. Yeah. You were doing work in
03:50 not finance, but in ads or something like that with R. What kind of work was that? Just generally,
03:56 you don't have to go into details. Yeah. So it was mainly reporting and doing analysis on how
04:03 client campaigns were going. But what really got me started with programming was more, I had gotten
04:08 involved with a hackathon team and we had built an alerting system. So just monitoring when something
04:13 weird went on with the campaigns. And I really enjoyed building more, more so than the analysis.
04:19 And so I had to find a way to, and I enjoy like a little bit of data and more on the coding side.
04:25 So I had to find something that would let me combine those two.
04:28 Yeah. Well, that sounds really fun. I definitely, I'm on the same wavelength as you with data analysis
04:33 is fun, but the building is, is really where things get interesting and, you know, look back and see
04:39 like, Oh, we built this thing. That's, that's a pretty awesome feeling.
04:42 Yeah. It was, it was a ton of fun and we ended up getting, I think third place on the hackathon,
04:47 but yeah, that was really that moment where it was like, I got a taste of something else.
04:52 And I was like, this is, this is what I want to be doing.
04:55 Yeah. Oh, that's fantastic. Was that at your company or was that someone?
04:58 That was at the, the previous, previous role. It was the ad tech company. And so that was actually
05:04 all built in R, the alerting system. And then, Oh no. Yeah. Okay. Yeah. And then, and then as we
05:10 worked more on it, certain things ended up moving into Python. So a lot easier to work with and to
05:16 automate things and not have like some laptop running R somewhere.
05:19 Yeah, exactly. It's, that's sort of the promise of Python over a lot of these things that at first
05:28 blush seem somewhat equivalent, right? Is that it's, it's a real programming language that can go on to do
05:35 all the stuff. You don't have to try to automate some weird thing. That's not really meant to be that
05:40 way. Right.
05:40 I know. And now, I mean, I could not write R if I, if I had to, I wouldn't, I don't think I would.
05:47 Yeah. Well, I was going to ask you now, which side of the fence do you spend more time on R or Python?
05:53 It sounds like.
05:54 I haven't touched R in maybe six plus years at this point. So I, yeah. Other than the arrows,
05:59 that's probably the only thing I could manage too.
06:01 Yeah. No more equal signs, just arrows. Okay. Got it. Awesome. Well, that's super fun. Let's talk
06:10 about pre-commit hooks, right? I've had Anthony Sottile on the show to talk about his pre-commit project.
06:16 It was a long time ago and I'm sure that project will get a bit of a shout out from your work as
06:21 well. But, you know, congrats, you put together a really nice series of articles and resources
06:28 teaching people what commit hooks are, how to debug them, how to build them, how to choose them. So I
06:35 think, you know, the stuff we're going to talk about, I'll link, of course, in the show notes.
06:38 It's a really nice resource for folks. So thank you. I appreciate that.
06:41 Yeah. Yeah. You bet. So let's talk about numpydoc docstring validation. This was your
06:51 entryway into this whole world of pre-commit hooks, right?
06:54 Yeah. So, I think in July 2022, I was at my first EuroPython and I decided to do the sprints
07:02 for the first time. I ended up working with the scikit-learn team and they wanted to make sure that
07:08 all of their docstrings were conforming to the numpydoc standard. They had a file in place, a test
07:14 file that you could run to validate that whatever changes you made were now passing
07:19 as far as docstrings. And I remember at one point, like I had, I think done 12 or so PRs in that sprint.
07:27 So I was very productive. And there was one early on, I think in the second or so, where it just wasn't
07:32 working and I couldn't figure out why it was telling me it wasn't valid. It was saying that it wasn't
07:37 ending in a period. And I had called over the, one of the maintainers and we both stared at it. To us,
07:43 it looked like a period. And I ended up just deleting the docstring and starting over. And it turned out
07:48 that it was a trailing space at the end. And so I had asked the maintainer, like, how do you not have
07:54 this happen to you? And the response was, you should install pre-commit. And by then I already
07:59 had to leave. So I made a note to myself: I need to research this when I get home.
08:04 And when I did, I was like, well, how did I not know about this before? And I set it up on things.
08:09 And then I went to look, does numpydoc have that? This seems like exactly what you would want.
08:15 As you're writing code, you want to make sure that it's going to check the docstring there. You don't
08:19 want to have to run some other thing later on and remember to run it. So I looked and there was no
08:24 pre-commit hook for numpydoc. And I had made something that initially we had just
08:29 used internally within my team. And then later on, I kind of wanted to use it for a personal project.
08:35 And so I set about seeing how we could actually open source it. And I had contacted the numpydoc team
08:42 and they were very, very interested in it because there was a reason there was no hook. It's because
08:46 no one knew how to do it. Right. And at that point I had the horrible realization that what I had written
08:52 would never work outside because it was relying on things being installed. So, and then I felt pretty
08:58 bad about promising that to them. So I managed to come up with an entirely new solution in a weekend
09:03 and figured out how to use the abstract syntax tree to work through. And so I built an entirely
09:10 new version of it. And that is what is currently available in numpydoc. And that actually led to
09:15 them inviting me to be a core developer for numpydoc. Congratulations. How cool is that?
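Editor's sketch of the abstract syntax tree approach described above. It is not numpydoc's actual implementation; the function name `find_summary_problems` and the single rule it checks (the summary line must end with a period) are assumptions made for illustration. The point it shows is that the source is only parsed, never imported, so nothing has to be installed for the check to run.

```python
import ast


def find_summary_problems(source: str) -> list[str]:
    """Report docstrings whose first line does not end with a period.

    The source text is parsed, not imported, so its dependencies
    do not need to be installed.
    """
    problems = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        # clean=False keeps the raw text, so a trailing space stays visible.
        docstring = ast.get_docstring(node, clean=False)
        if not docstring or not docstring.strip():
            continue
        summary = docstring.strip("\n").splitlines()[0]
        if not summary.endswith("."):
            problems.append(
                f"{node.name} (line {node.lineno}): summary should end with a period"
            )
    return problems


# The second docstring has a trailing space after the period,
# which is the kind of mistake described in the story above.
example = '''
def good():
    """Return nothing."""

def bad():
    """Return nothing. """
'''

for problem in find_summary_problems(example):
    print(problem)
```

Only `bad` is reported here, because the character after its period is a space.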
09:22 Yeah, I know. It's like the full spectrum, right? And just having heard about it and then just
09:27 seeing the connection between two things that weren't previously connected.
09:31 Yeah. Yeah. Well, I think your comment about the pre-commit hook not previously existing,
09:37 you know, for this project also is, it's pretty interesting, right? It's kind of like I hinted at,
09:41 I mean, a lot of people hear about this kind of stuff, but that doesn't mean they're putting it
09:45 into practice, right?
09:45 Yeah, for sure.
09:47 And so how do we, you know, let's, let's find our way over to pre-commit hooks in general. So how do we
09:53 encourage people or ensure that people follow coding rules, right? We've got tools like Black,
09:59 we've got tools like Ruff. Now those will work awesome if you give them a consistent config file
10:06 or config settings, not so much with Black, but Ruff. Anyway, they'll make those changes and do a lot of the
10:13 kind of stuff that we're talking about here, but that requires, like you said, people to have it
10:16 installed, people to run it and people to buy into the whole concept of the project in the first place,
10:23 right?
10:23 Yeah, that last bit.
10:24 We're all using these tools and we're all going to run them and we're going to remember to run them
10:28 until one person goes, I don't like these tools. I'm not doing it. And then their settings fight with
10:33 your settings or their spacing fights with your spacing or whatever, right?
10:36 Yeah. I think what really helped in my experience, when you incorporate these things,
10:41 even like going and approaching open source projects that didn't have pre-commit set up and just asking
10:46 if they were interested in it, is that you really see the value when you think about whether you've ever
10:51 reviewed something or gotten review comments about, you should start a new line here. I don't like this
10:56 space here. And then you think about how much time you waste at that stage. And then you still have
11:01 zero consistency because you did it one way, someone else does it another way. And even further than that,
11:08 it's just the time you waste in your code. Oh, I should put this on a new line and reformatting files
11:13 when you could actually be writing things and thinking about how should I design this algorithm.
11:18 Right. And so I think a big part of making sure that, once you find these tools that you're
11:23 going to use, people actually use them is making them easy to use. Like you said, yeah,
11:28 you can just run Black or Ruff, but you have to remember to run Black or Ruff. And that is the
11:32 key problem. And what's so great about pre-commit or even extensions in your IDE is that these things
11:38 become automatic and that's what you need to get towards for these things to actually stick.
11:43 Yeah. To make them automatic and not part of it. And to some degree, continuous integration can do
11:49 those kinds of things. But a lot of times it's too late at that point. It's already checked in,
11:54 it's already committed. And then you've got the back and forth of now it's a diff, but it's only a diff
11:59 because they spaced it differently when they hit save in their IDE than when you hit save in yours and
12:04 all that. So pre-commit hooks run prior to actually leaving your computer, right?
12:11 Yeah. So it's actually prior to even the commit. So when you do git commit and, let's say, you pass
12:16 your message and it's successful, you normally see the hash that gets generated. If you have
12:21 pre-commit hooks enabled, then if those checks don't pass, that commit never gets created in
12:28 the first place. So you still have the files staged, but nothing has made it to the commit.
12:33 Yeah. That's great.
12:34 Yeah. I was just going to explain maybe a little bit about how they work if you're curious.
12:38 Yeah. Yeah. Well, let's start with just like, what even are, are these pre-commit hooks?
12:42 Yeah. So pre-commit hooks, and I think the naming is, is quite overloaded and that leads to a lot of confusion.
12:49 So at the lowest level, a Git repository in general supports a hooks system. So there's a variety of
12:57 different types of actions for which Git will trigger a script on your behalf. And one of those actions
13:03 is pre-commit. So as I described before, as you run git commit, this gets triggered. Another thing
13:08 might be pushing. You can have Git wired to run some script when you push. Now that is Git's version
13:15 of a pre-commit hook, singular hook, because you can only have a single file, a single executable,
13:21 run.
13:21 Right. If you go to the .git folder, there's a hooks subfolder and it's got little
13:27 samples for all the different lifecycle things, right?
13:30 Yeah. And like I said, they have to be executable, but they can
13:35 be in any language that you have available on the machine. And so Git provides some examples. I do think
13:41 there are a few stages that don't have examples, but it's basically you take the name of the stage
13:46 that you're going to use and that's the name of the file. And that has to be an executable and Git
13:51 will run it at the designated moment.
13:53 Okay. So it could be a Python executable or it could be a Go executable or whatever, but it's just
14:00 one, right?
14:00 It's just one. Yeah. Cause it has to be named. And like in the case of pre-commit, it has to be called
14:05 pre-commit, nothing else.
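Editor's sketch of what a hand-written Git hook could look like in Python. It assumes the file is saved as `.git/hooks/pre-commit` and made executable; the check itself (rejecting staged `.py` files with trailing whitespace) and the helper names are illustrative choices, not something from the episode. Git aborts the commit when the hook exits with a non-zero status.

```python
#!/usr/bin/env python3
"""Illustrative .git/hooks/pre-commit script: block trailing whitespace."""
import subprocess
import sys
from pathlib import Path


def staged_files() -> list[str]:
    """Ask Git which files are staged (added, copied, or modified)."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def has_trailing_whitespace(text: str) -> bool:
    """Return True if any line ends with spaces or tabs."""
    return any(line != line.rstrip() for line in text.splitlines())


def main() -> int:
    """Return 1 (abort the commit) if any staged Python file fails the check."""
    failures = []
    for name in staged_files():
        # Simplification: this reads the working-tree copy, not the staged blob.
        if name.endswith(".py") and has_trailing_whitespace(
            Path(name).read_text(encoding="utf-8")
        ):
            failures.append(name)
    for name in failures:
        print(f"trailing whitespace: {name}", file=sys.stderr)
    return 1 if failures else 0


# In the actual hook file, the last line would be: sys.exit(main())
```

Because Git looks for exactly one file with that name, every additional check would have to be folded into this same script.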
14:07 This portion of Talk Python to Me is brought to you by Sentry. Code breaks. It's a fact of life.
14:12 With Sentry, you can fix it faster. As I've told you all before, we use Sentry on many of our apps
14:18 and APIs here at Talk Python. I recently used Sentry to help me track down one of the weirdest bugs I've
14:24 run into in a long time. Here's what happened. When signing up for our mailing list, it would crash
14:30 under non-common execution paths, like situations where someone was already subscribed or entered an
14:36 invalid email address or something like this. The bizarre part was that our logging of that
14:42 unusual condition itself was crashing. How is it possible for our log to crash? It's basically a
14:49 glorified print statement. Well, Sentry to the rescue. I'm looking at the crash report right now,
14:54 and I see way more information than you'd expect to find in any log statement. And because it's
14:59 production, debuggers are out of the question. I see the traceback, of course, but also the browser version,
15:06 client OS, server OS, server OS version, whether it's production or QA, the email and name of the person
15:13 signing up. That's the person who actually experienced the crash. Dictionaries of data on the call stack and so much
15:18 more. What was the problem? I initialized the logger with the string info for the level rather than the
15:25 enumeration.info, which was an integer-based enum. So the logging statement would crash, saying that I could not use
15:33 less than or equal to between strings and ints. Crazy town. But with Sentry, I captured it,
15:40 fixed it, and I even helped the user who experienced that crash. Don't fly blind. Fix code faster with
15:46 Sentry. Create your Sentry account now at talkpython.fm/Sentry. And if you sign up with the code
15:52 TALKPYTHON, all capital, no spaces, it's good for two free months of Sentry's business plan, which will give you up to
16:00 20 times as many monthly events as well as other features.
16:04 So if you want to run more, you basically have to, potentially, write a program which then itself
16:10 figures out all the things to do and then delegates to running them. Like if you want to run Ruff to
16:16 fix formatting issues and you want to run the checker for numpydoc docstrings and all those
16:23 things, you'd have to write a sort of orchestrating program for that, right?
16:27 Yeah, it's almost like you're writing in the case of like a bash script, like a giant bash script where you have
16:32 to decide, you know, do you fail early? How do you like and check, do I run this one and then this one? And then
16:38 even worse, you're actually, in that case, you're probably running everything, you know, you're running everything
16:43 sequentially. And if you don't do it carefully, then you know, maybe, maybe you want to fail early, maybe you don't.
16:49 So that becomes very, very challenging to configure and also to share because the thing about that file is that is not
16:55 included in version control. So that would be something that you would maybe have to store somewhere else and then do a
17:01 symbolic link. And then that becomes already a lot trickier for everyone to manage.
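Editor's sketch of the kind of orchestrating program being described, to make the fail-early decision concrete. The function name `run_checks` and its `fail_fast` parameter are hypothetical, and the demo commands are stand-ins that run anywhere; a real hook would list linters or formatters instead.

```python
import subprocess
import sys


def run_checks(commands: list[list[str]], fail_fast: bool = False) -> int:
    """Run each command in order; return 0 only if all of them succeed."""
    exit_code = 0
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            exit_code = 1
            if fail_fast:
                # Stop at the first failure instead of running the rest.
                break
    return exit_code


# Stand-in checks: the second one always fails.
demo = [
    [sys.executable, "-c", "print('check 1 passed')"],
    [sys.executable, "-c", "import sys; sys.exit(1)"],
    [sys.executable, "-c", "print('check 3 ran anyway')"],
]

print(run_checks(demo))
```

With `fail_fast=True` the third check never runs. Everything is sequential either way, and the script still has to be shared outside version control, which is the pain point raised above.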
17:05 Yeah, I was just doing that last night and that's an AI question. I don't remember how to do that.
17:09 I know you can do it. It's not that hard. It involves ln, but you know, ChatGPT, what do I do exactly?
17:16 ln -s. I've had to do that quite a bit.
17:19 It's burned into the brain, huh?
17:22 So one of the things that you recommend so we don't have to build this orchestration piece is actually pre-commit,
17:31 which is a Python project, right?
17:32 Yes. And it's not the only one. So again, that's where like the naming becomes challenging.
17:37 But pre-commit is built in Python, but it can run hooks in a variety of languages.
17:42 And it interfaces with Git's hook system for you. So it creates that executable and plants it there.
17:48 But that executable is then pointing back to pre-commit so that you can just define a simple YAML file like you can see part of it on the screen right now.
17:56 And it becomes very easy because essentially you're just configuring what you want to run.
18:01 You're not actually coding the logic of the checks and how they relate to each other.
18:05 Right. So let's assume that all the pre-commit hooks that you want to run somehow exist out there in the world, right?
18:12 You don't have to create them for the moment.
18:14 So what you can do with pre-commit is you can set up a YAML file.
18:20 I always get those crisscrossed.
18:22 A YAML file, a .pre-commit-config.yaml file, which then has a bunch of listings of here's a Git repository.
18:30 And if you install it as a Python package, here's a bunch of things that you can run on it, like check-toml, check-yaml and so on, right?
18:38 Well, it doesn't actually have to be a Python package, right?
18:40 So in that repo, and we're maybe jumping ahead, but there's a special file in that repo, which will tell pre-commit how it actually needs to install it.
18:48 So it could be anything.
18:49 Oh, that's interesting.
18:50 So the thing that integrates with the pre-commit project, it has to opt in in a sense in that it has to have a configuration file or a launch file or a setup file, something like that.
19:01 Yeah. So right now we're looking at .pre-commit-config.yaml.
19:04 There's also .pre-commit-hooks.yaml, and that one is kind of registering it with the pre-commit system.
19:09 So it tells pre-commit how to install it once it gets a hold of it.
19:13 And it also lists out these hooks that we see here under ID, but that will be defined over there so that pre-commit knows, well, what is check-toml?
19:21 What is check-yaml?
19:22 Okay. Yeah.
19:24 That's really cool.
19:25 And you can have more than one of these repositories in there, right?
19:30 Correct.
19:31 Yeah. So the repos section is a list of repo sections, and then each repo then has other config, like the individual hooks that you want to run from that repo.
19:42 Right, right.
19:43 So for the first example that you have in this, and this is your article, I guess, I don't know if I give this the proper announcement, but how to set up pre-commit hooks.
19:53 This is your, I perceive this as kind of your getting started article for this whole series.
19:57 I don't know if you see it that way.
19:58 Yeah.
19:59 Yeah, this was the first one.
20:01 I had gotten a lot of questions on how to do this.
20:04 And I think it's always interesting, especially when you think about speaking at conferences, which I do a lot of. I feel like a lot of what gets more hits in that sense is the advanced stuff, maybe more creating it.
20:15 But there's so much value in people just getting started and figuring out how do I even use this in the first place?
20:20 Because this saves you so much time.
20:23 So I really, this was where I got started for that reason.
20:25 I think a lot of people were able to benefit from this article.
20:28 Yeah, it seems like it.
20:30 I know it's fun to talk about the super advanced deep dive things, but most people, they just need to get started.
20:36 They just need some foundation, right?
20:37 And I think, I think that's actually where most of the benefit comes from, even though it is really fun to see some cool deep dive talk that people are going into, right?
20:47 So this next one is pretty interesting that we're adding here in this example, and that's ruff-pre-commit, straight from Astral, right?
20:57 So this is just github.com/astral-sh, which is the company behind Ruff and uv.
21:02 And this is ruff-pre-commit.
21:04 But what's interesting about this is, well, one, that it has nothing to do with the pre-commit project.
21:09 But two, that this one also takes special arguments that you can pass to it.
21:14 Yeah, so I think the ruff pre-commit one is just a smaller version so that it works faster with pre-commit.
21:20 Because pre-commit will have to install these at some point.
21:23 It will have a cache.
21:25 So if you don't change the version in this case, it will be able to reuse that.
21:28 But that first time, you do have a bit of a delay.
21:31 And that's not something you want.
21:33 It's something you have to be very careful of when you want to be using these.
21:36 And then the args thing is nice because you have a few options when you configure these tools, depending on what the tool supports.
21:43 In this case, Ruff supports, as I think we mentioned a little bit earlier, a configuration file.
21:47 So, for example, you could have stuff in your pyproject.toml.
21:50 But the key here is that maybe you're using Ruff in your IDE.
21:55 And maybe you don't want to do the same kind of changes that you want to do in pre-commit.
22:00 Maybe you want it to ask you if it's going to change something.
22:03 Whereas in the pre-commit stage, you definitely want it to be fixed.
22:06 So you can use the args here to provide stuff that you only want to happen when it's running in the context of pre-commit.
22:14 Yeah, and Ruff has an --exit-non-zero-on-fix option, which means if it goes through and you say to fix it, it will fix it.
22:21 But then it'll error out and say that wasn't a smooth transition or whatever, which is cool because that will then fail the commit itself.
22:30 Correct.
22:30 Give you the modified files and say basically have a look.
22:34 See if you like it now, right?
22:36 Before it actually just ships it off.
22:37 That's so important because sometimes you realize there was some rule that you hadn't reviewed before.
22:43 That's not quite doing what I want and let me tweak my setup.
22:45 So it's nice to have that bit where you can verify what was actually changed is what you want.
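Editor's sketch of the configuration discussed so far, written out from Python so it can be dropped into a repository. The two repository URLs, the hook ids, and the three Ruff arguments are the ones mentioned in the conversation; the `rev` values are example tags only and should be replaced with whichever release is current. The helper name `write_config` is hypothetical.

```python
import tempfile
from pathlib import Path

# The rev values below are example tags, not recommendations.
CONFIG = """\
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: check-toml
      - id: check-yaml
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.7.0
    hooks:
      - id: ruff
        # Only applied when Ruff runs through pre-commit, not in the editor.
        args: [--fix, --exit-non-zero-on-fix, --show-fixes]
"""


def write_config(directory: Path) -> Path:
    """Write the config to the file name pre-commit looks for by default."""
    path = directory / ".pre-commit-config.yaml"
    path.write_text(CONFIG, encoding="utf-8")
    return path


# Demo in a scratch directory; in practice this would be the repository root.
with tempfile.TemporaryDirectory() as scratch:
    written = write_config(Path(scratch))
    written_name = written.name
    print(written.read_text(encoding="utf-8"))
```

The `repos` key is a list, so more repositories can be appended, each with its own `hooks` list.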
22:50 Yeah, I guess it's a little bit dangerous to just say change it and then commit it.
22:54 I've had people.
22:55 So I did a workshop on pre-commit both on setting it up and then making your own hooks at EuroPython this year.
23:03 And I did have a few people actually.
23:05 One very insistent asking me why wasn't there a hook or why don't they support just fixing it
23:12 and then automatically adding it and committing it on your behalf.
23:14 And to me, as a person who works in security, that just sounds very scary.
23:18 I don't want things doing that.
23:20 I want to see what is being changed and whether or not I agree with it or not.
23:23 Yeah.
23:24 Why doesn't it just go ahead and push it as well?
23:28 Come on.
23:28 Yeah.
23:28 Well, I think that was part of the suggestion.
23:30 I was like, I certainly don't want that running on my machine.
23:34 Yeah, it does skip out on some of the benefits of the multi-stage aspects of Git, I suppose.
23:39 But it is efficient.
23:40 You just get it done all at once.
23:41 That's pretty cool.
23:42 Yeah, but you don't know what else it's grabbing, which is the scary part.
23:44 No, of course not.
23:45 I know.
23:45 Super bad.
23:47 So this example that we're talking about here where we've got a pre-commit hook that we're grabbing
23:53 and then it takes these arguments, I think this is an interesting point of discussion.
23:57 So the example you have in your article just says, what we're going to tell Ruff is --fix,
24:01 --exit-non-zero-on-fix, and --show-fixes, which is all good.
24:07 But Ruff can be pretty complex in its configuration, right?
24:11 You can say, disable flake8, turn this one on.
24:14 These are warnings.
24:15 These are errors.
24:16 And there's a whole, you know, here's how many line columns I want and all of this stuff, right?
24:21 So you can either do this argument thing, or if it's supported, you could also potentially have,
24:27 say, a ruff.toml, right?
24:28 Yeah.
24:29 So I tend to want to minimize the amount of configuration files I have.
24:34 So in my case, I think below I talk about having it in the pyproject.toml.
24:38 Yeah, exactly.
24:38 So you just add a ruff section in there and then you configure things.
24:42 And this is stuff that you'd want to use both in your editor as well as in the pre-commit stage,
24:46 because you want them to agree.
24:47 And there's nothing worse than one telling you the line's too long and the other one like,
24:51 nope, that's good.
24:51 Go ahead.
24:52 Or put a space after the comma in parameters and then take away the space and put the space and take away the space.
24:59 Exactly.
24:59 You don't want them fighting.
25:00 You want them in agreement.
25:01 No, no, you don't.
25:02 So I suppose that's a massive bonus of having either the tool.ruff settings in your pyproject or just a ruff.toml,
25:09 however you go about that, it doesn't really matter.
25:11 Because then no matter how you're using Ruff, via pre-commit or for your project, it'll be the same thing, right?
25:17 Exactly.
25:17 Yeah.
25:18 Okay.
25:18 Yeah.
25:19 That's pretty awesome.
25:20 Now, I guess maybe we got a bit ahead of ourselves.
25:24 If I want to somehow install a pre-commit hook or pre-commit so that when I then give it one of these YAML files,
25:32 it'll go subsequently grab them and do the things.
25:34 How do you get started with that?
25:36 I think I need a rephrasing of that question.
25:39 Yeah.
25:40 Sorry.
25:40 So if I have just a plain GitHub repository and I want to have pre-commit manage the hooks for that repository,
25:49 like what do I do?
25:50 Okay.
25:51 So the first thing is you have to actually install pre-commit.
25:54 And that's not the command that's on the screen.
25:56 This is more of a pip install.
25:58 So make sure you have the Python library in place.
26:02 And then you need to have this configuration file,
26:05 with at least one hook in there so that you have a valid file.
26:09 And then you can run pre-commit install.
26:12 And I omitted it here, but what I talk about in a different article, when you run this command, pre-commit actually tells you that it created the .git/hooks/pre-commit file.
26:21 And if you open that up, and I have an example on that other article, it's very simple and it's just calling pre-commit the tool itself.
26:28 So in all cases, you need to have it installed in your environment.
26:33 And a single time you run pre-commit install, which then does the wiring on the git side.
26:39 And this is something that everyone in your project has to run on any machine that they are using.
26:45 Because it lives inside the local .git directory of the repository itself, that file needs to exist there.
26:50 And that can only happen if you run this command.
26:52 Yeah.
26:53 So there's a .pre-commit-config.yaml file.
26:56 That's what you put into GitHub at the root of your project or something like this.
27:01 But then to actually configure Git itself, you've got to run this pre-commit install.
27:07 And it basically wires up the hooks to make that happen, right?
27:11 Correct.
27:11 So yeah, when you run this, that file gets created on your behalf.
27:14 And then you don't have to worry about wiring that up.
27:17 And then it's transparent.
27:18 All you have to do is tweak your config and then the changes happen.
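Editor's sketch of a quick way to check whether `pre-commit install` has been run in a given clone. It only looks for the `.git/hooks/pre-commit` file mentioned above; the function name is hypothetical, and the check assumes a standard layout (it does not account for worktrees or a custom `core.hooksPath`).

```python
from pathlib import Path


def precommit_hook_installed(repo_root: Path) -> bool:
    """Return True if .git/hooks/pre-commit exists in this clone."""
    return (repo_root / ".git" / "hooks" / "pre-commit").is_file()


# Check the current directory, assuming it is the repository root.
print(precommit_hook_installed(Path.cwd()))
```

A fresh clone returns False until someone runs `pre-commit install` there, because the hook file is not part of version control.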
27:22 Nice.
27:23 I don't know how much to believe the naming.
27:26 Can it do things other than pre-commit?
27:28 Yes.
27:28 Can it do pre-push and those kinds of things?
27:32 They don't support every single one.
27:35 But there are quite a few that they do support.
27:38 For example, I once configured an open source project with a pre-push because it was a slower
27:44 check.
27:45 And that's something you definitely don't want running on each commit.
27:48 But it might be something where you want to make sure when you push the files that you've
27:51 addressed something that's maybe a little bit longer.
27:54 And that is really not any different than configuring with the pre-commit config YAML.
28:00 There's just a separate item that goes in there that says which stage to run.
28:03 By default, it's pre-commit.
28:04 So you don't see it.
28:06 But if you needed to change it, you can.
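A minimal sketch of what that separate item can look like. The hook here is hypothetical (a local hook with the made-up id `slow-check` that runs pytest), and the stage name `pre-push` assumes a recent pre-commit release; older releases spelled it `push`. The YAML is kept in a Python string so the example can write the file out.

```python
from pathlib import Path

# Hypothetical config: one local hook restricted to the pre-push stage.
# The hook id, name, and entry are illustrative, not from the episode.
SAMPLE_CONFIG = """\
repos:
  - repo: local
    hooks:
      - id: slow-check
        name: slow check (runs on push only)
        entry: python -m pytest
        language: system
        pass_filenames: false
        stages: [pre-push]
"""


def write_sample_config(directory: str = ".") -> Path:
    """Write the sample config into `directory` and return its path."""
    path = Path(directory) / ".pre-commit-config.yaml"
    path.write_text(SAMPLE_CONFIG, encoding="utf-8")
    return path
```

Git only runs hook scripts that are installed, so a pre-push stage also needs `pre-commit install --hook-type pre-push` in addition to the plain `pre-commit install`.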
28:07 Yeah.
28:07 I figured that was the case.
28:08 But I'd never tried.
28:09 And given that it's named pre-commit, you know, it's kind of named after one of the hooks,
28:13 right?
28:14 But of course.
28:15 I think it's probably named after the most useful one.
28:17 I would.
28:18 Yeah, I would think so.
28:19 I think a very popular example would perhaps be the commit message hook.
28:25 So there's a lot of tools that work on, you know, making sure your commits are following
28:30 a certain standard.
28:30 I think one of them is called Commitizen.
28:32 And so that runs on, my guess is, the commit message hook.
28:36 Commitizen?
28:37 Yes.
28:38 Okay.
28:38 What is this Commitizen about?
28:39 I haven't heard of this.
28:40 I don't think their example uses that.
28:42 But I think they do have a pre-commit hook.
28:45 And I believe it works that way.
28:46 Yeah.
28:47 Yeah.
28:47 Interesting.
28:48 Okay.
28:48 What's this thing?
28:49 A release management tool for teams.
28:51 Yeah, sure.
28:52 That makes sense that you want to kind of be a little bit careful about what your commit
28:56 messages are.
28:57 Maybe you want to grab certain commit messages and add them to your changelog or something
29:01 like that, right?
29:01 Yeah.
29:02 I think there's been quite a bit of talk about this one at conferences I've been to lately.
29:07 I think it's gotten a lot of traction.
29:09 Yeah.
29:09 2.5 thousand GitHub stars.
29:11 That's pretty good.
29:12 I'll check it out.
29:13 This is news to me.
30:19 All right.
30:21 What other takeaways should we talk about in this first one?
30:23 I think we maybe have pretty much covered it.
30:26 Let's see.
30:26 I guess, you know, we mentioned before, but if people want to see sort of examples of pre-commit
30:32 hooks failing or succeeding or failing because they changed something, which is not exactly
30:37 a failure, but stopping and starting over, you have a nice example of what that's like
30:42 there.
30:42 So one thing that I guess might be useful is sometimes maybe you don't want to run the
30:49 pre-commit hooks.
30:50 Maybe you need to check something in a certain way to fix things because the server's down, right?
30:57 We have to check this in.
30:58 I can't fix this hook, whatever this hook is upset about right now.
31:01 It needs to go in right away.
31:03 Just let me commit it, right?
31:05 You can do that.
31:05 I mean, I think there are probably several use cases for something like this.
31:10 Maybe you're going to be squashing things later,
31:13 or maybe you don't even know what the API for what you're doing is going to
31:17 look like.
31:17 It could be, and this kind of ties back to what we talked about earlier, perhaps Ruff's
31:22 doing something that you don't agree with, but you need to check with the rest of
31:25 your team to make sure that everyone's in agreement with, let's remove this rule.
31:29 Right.
31:29 So I definitely don't encourage always doing this.
31:34 That defeats the purpose, right?
31:35 But there is kind of a break-glass solution here where, let's say you first run git
31:40 commit and something fails and it's not something that you either want to fix at the moment or
31:45 really can fix.
31:45 Then you can just pass in --no-verify.
31:48 And none of the checks run at that point.
31:51 So it's like, as if the checks were never there in the first place.
31:54 Right.
31:55 Right.
31:55 Right.
31:55 Okay.
31:55 That's pretty interesting.
31:56 Like you say, hopefully people don't run that all the time.
32:00 At that point, just remove the pre-commit setup, save yourself.
32:03 Yeah.
32:03 Like what are you, what are you even doing?
32:05 Right.
32:05 I suppose there's an interesting interplay between pre-commit hooks and continuous integration,
32:11 right?
32:12 Like in a sense, they are often checking some of the same things.
32:16 What do you think?
32:17 So I think it's probably not quite a Venn diagram.
32:22 The circle for pre-commit is probably entirely contained within the circle for CI/CD.
32:29 The difference is there are certain things where you can get immediate feedback, quick
32:33 feedback locally, and those are the things you can put in pre-commit, like linting,
32:37 formatting, et cetera.
32:38 And then CI/CD may be running your test suite.
32:42 That's definitely not something you want to be doing in a commit.
32:44 Imagine you have a test suite that takes three minutes to run, even maybe three minutes isn't
32:48 that bad, but every commit waiting three minutes is definitely not something you want to do.
32:52 No.
32:53 But it's still a check that you should definitely be running.
32:55 So in CI/CD, I would run everything.
32:57 Do the linting, do the formatting.
32:58 That's your final, that's your last layer of defense and you need to be checking everything.
33:03 And this just allows developers to get that feedback sooner.
33:06 Right.
33:07 So what you're actually checking in and you finally approve is much closer to what CI/CD
33:12 would kind of want in the first place, right?
33:14 Yeah.
33:14 Yeah.
33:14 Okay.
33:15 And it's also a much faster feedback, right?
33:17 So like if the thing has to run all the way through the linting, the formatting, the testing,
33:20 the type checking, whatever, you might be waiting 10, 15 minutes for all the things to run when
33:25 you could have had, you know, under a minute, hopefully way under a minute feedback instantly that
33:30 your file wasn't formatted correctly.
33:31 It should be near instantaneous, right?
33:34 I mean, instant maybe is asking too much, but some of that Astral stuff is kind of ridiculous.
33:40 Yeah.
33:41 I think you have to be very careful, right?
33:43 Because there's all these checks and I think you had up on the screen maybe earlier, like
33:47 the pre-commit hooks, the general ones provided by the pre-commit organization.
33:53 Yeah.
33:53 There's tons of things in there, but you do have to be careful, right?
33:56 Because if you're like, oh, this could be good and this could be good and this could
33:59 be good.
33:59 Each check is adding time.
34:02 Assuming, like I say, assuming they're all running on Python files, you're adding time
34:05 to how long.
34:06 So you do have to be mindful of what you actually need.
34:09 And if you go to the point where you end up making the whole process take too long, people
34:15 are going to stop using it.
34:16 And then that defeats the...
34:17 Yeah.
34:17 Yeah, exactly.
34:18 As soon as it becomes a point where people go, I'm not using this thing, then you've kind
34:22 of lost unless you can just say, no, you have to use it.
34:25 But then you just have unhappy teammates.
34:27 Exactly.
34:28 Either way, it's not a real great outcome, is it?
34:29 I mean, if there's something that maybe only runs on a few files every once in a while, then
34:35 if you are having problems with speed, then you can also consider moving that to CI/CD.
34:39 And I am definitely a big fan of Ruff, as you said. Just switching from Black, Flake8,
34:45 all that onto Ruff, you do save a significant amount of time on these checks and it's a huge
34:50 benefit.
34:50 Yeah, it's pretty ridiculous.
34:51 Now, this is not a Git pre-commit thing.
34:54 This is a pre-commit-the-project thing.
34:57 But you can, if you're using this pre-commit project we've been talking about, you can say
35:01 pre-commit run and do kind of a test without actually doing a commit, right?
35:07 Correct.
35:07 Yeah.
35:07 So there are a few nuances.
35:09 So if you just do pre-commit run, it's going to run all of your hooks, but on the staged
35:14 changes, because it's thinking essentially you're doing like a dry run.
35:17 If you, let's say, are adding a new hook and you want to make sure all of your files are
35:22 compatible with that new hook, then you might want to do something like pre-commit run
35:26 --all-files.
35:27 So look through your entire repository, regardless of whether you have changes in place.
35:31 So if you say pre-commit run, it only works on your, basically your changed files, not the
35:37 stuff that's already there and accepted.
35:38 Correct.
35:39 And another neat thing is in the case I mentioned where you add a new hook, you might just want
35:44 to run that hook.
35:44 So you can say pre-commit run and then the hook ID, and then you would just run that hook
35:49 and then you can define either a certain set of files or the staged ones, whatever.
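The variants described above can be sketched as a small helper that assembles the command line. The function names are hypothetical; the only pre-commit features assumed are the `run` subcommand, an optional hook ID, and the `--all-files` and `--files` options.

```python
import subprocess
from typing import List, Optional, Sequence


def build_run_command(
    hook_id: Optional[str] = None,
    all_files: bool = False,
    files: Sequence[str] = (),
) -> List[str]:
    """Build the argument list for a `pre-commit run` invocation."""
    if all_files and files:
        raise ValueError("choose either all_files or an explicit file list")
    command = ["pre-commit", "run"]
    if hook_id is not None:
        command.append(hook_id)  # run just this one hook
    if all_files:
        command.append("--all-files")  # whole repository, not only staged files
    elif files:
        command += ["--files", *files]  # only the listed files
    return command


def run_hooks(**kwargs) -> int:
    """Run pre-commit and return its exit code (non-zero means a hook failed)."""
    return subprocess.run(build_run_command(**kwargs)).returncode
```

With no arguments this builds the plain `pre-commit run`, which only looks at staged changes; `build_run_command("my-hook", all_files=True)` would target a single (hypothetical) hook across the entire repository.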
35:52 Yeah.
35:53 That sounds pretty useful when you're building your own pre-commit hook, right?
35:56 So yeah, depending on how you build it, you can either use that or they also have a
36:01 try-repo command.
36:03 Right.
36:03 Got it.
36:04 Got it.
36:04 Well, let's see.
36:06 Maybe we could jump over and talk a bit through your hook creation guide, a step-by-step guide
36:12 to developing your own pre-commit hook.
36:14 I thought this was really, like I said, a good article.
36:17 And maybe one of the first things we talk about is just what makes a good hook in the first
36:23 place, right?
36:24 You said that they can't be too long or people will go crazy and turn them off or skip them
36:29 or whatever.
36:30 But what else?
36:31 So I think another big thing is if you're able to fix something, then you should fix it.
36:36 In the case of formatting and you're saying, oh, this should have a trailing comma, then
36:41 that's easy enough.
36:42 You can add the trailing comma.
36:43 You don't make more work for the user.
36:44 If you can't do that, then you should be very specific saying this file.
36:48 And if you have a line number saying exactly where it is, because just saying there's something
36:53 wrong in this file and someone has to hunt it is also not a good user experience.
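As a rough illustration of both points, here is a sketch of a check that reports the exact file and line, plus an optional fixer. Trailing whitespace is only a stand-in problem chosen for brevity, and both function names are made up for this example.

```python
from pathlib import Path
from typing import List


def find_trailing_whitespace(path: str) -> List[str]:
    """Return one 'path:line: message' string per offending line."""
    problems = []
    lines = Path(path).read_text(encoding="utf-8").split("\n")
    for number, line in enumerate(lines, start=1):
        if line != line.rstrip(" \t"):
            problems.append(f"{path}:{number}: trailing whitespace")
    return problems


def fix_trailing_whitespace(path: str) -> bool:
    """Strip trailing spaces and tabs in place; return True if the file changed."""
    file = Path(path)
    original = file.read_text(encoding="utf-8")
    fixed = "\n".join(line.rstrip(" \t") for line in original.split("\n"))
    if fixed != original:
        file.write_text(fixed, encoding="utf-8")
        return True
    return False
```

A hook built this way can either fix the file and stop the commit so the change gets reviewed and staged, or, when a safe fix is not possible, print the `path:line` messages so nobody has to hunt for the problem.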
36:57 No, that's going to be frustrating and super, super quick.
37:00 Yeah.
37:01 So be really descriptive about it.
37:03 And then also, maybe choose not to make it a pre-commit hook, right?
37:06 Not necessarily everything needs to run on every commit.
37:09 Yeah, I think that the speed thing is a huge factor.
37:12 And in general, I think one big thing that is key to note here is that even
37:18 though, let's say, you change a Python file, a Markdown file,
37:23 and an image file,
37:24 if you're making a hook that only runs on a certain type of file, and you're careful to
37:30 specify that, then it's not necessarily a bad thing to include that in there because it will
37:34 only get triggered on those certain types of files.
37:36 And so an example I have is the EXIF stripper,
37:40 which I created when I was building my website.
37:44 Your EXIF stripper is super interesting.
37:47 I'm starting to think maybe I want this as well.
37:48 Yeah, I was just very paranoid at one point about just working with images.
37:53 And so they come with, what's up here?
37:57 So exchangeable image file format data, or EXIF as it's commonly called.
38:02 It's metadata that is in the image that you might not realize is there.
38:06 And so in this article, I talk about a picture of me presenting that I was given from a conference.
38:12 And this was something that was stored, I think, in a Google Drive.
38:15 So you have access to all the metadata that was available.
38:18 So I never met the photographer.
38:20 And yet I know the photographer's name, the camera they use, what type of computer they have,
38:24 how they edited it, all kinds of information.
38:27 And the dangerous part is the exact location of where this was.
38:31 Now, conference, not a big deal.
38:33 But you have to think about maybe you're blogging about something you did in your house or your apartment.
38:38 And now you have a photo up on your website where anyone can potentially see it that has the GPS coordinates for where you live.
38:47 Yeah, that wouldn't be great, no.
38:48 So I was very paranoid about this.
38:50 And I don't like the idea of, oh, I'm going to add a new image.
38:54 Let me go through my checklist of what I need to do because I know at some point I'm going to mess something up or forget it.
39:00 And so this is a perfect use case for the pre-commit, right?
39:03 Because you want something that is going to stop you and tell you, nope, you can't do this, right?
39:08 And in this case, it can also remove the metadata because I am being super conservative and saying no metadata,
39:14 which has the nice side benefit of shrinking files, which is good for serving them.
39:19 Yeah.
39:20 Well, what value is it to have all that metadata in there for a blog?
39:26 Most of the time, most people are not, they just want to see, they want to read the blog.
39:29 They're not going to dissect your image, right?
39:31 I think it depends. I mean, maybe you have a travel blog and you want people to know, like, here's that location.
39:36 And then you have a one-off post where you introduce yourself and oops, you know?
39:40 Yeah.
39:41 There's so many ways.
39:42 And I think even just thinking, oh, I'm only going to be doing this.
39:46 There's always going to be something that later on happens.
39:48 So you have to be very careful just upfront that everything is going to go through this track.
39:53 Sure.
39:54 Can your EXIF thing, can it be selective about the metadata?
39:58 That's something I do want to do in the future.
40:00 Just remove the location if you say.
40:03 But the thing is, there's like, looking through all of that, it's hard to tell if there might be something in one subset of images you take that might be sensitive.
40:11 You can even think of certain situations where you might not want someone to know what kind of device you were using.
40:16 Right.
40:16 Because maybe they're like, oh, that device is vulnerable to something and I know they have it.
40:20 Right.
40:21 The worst of these, and I think it's happened multiple times, I'm pretty sure it was Samsung, but one of the Android companies posted a picture promoting the new phone.
40:34 And, you know, the EXIF information had the picture as being from an iPhone or something like that.
40:38 Oh, no, it was the other way around, I think.
40:40 Oh, the other way around.
40:41 I think I remember hearing that, yeah.
40:41 Well, it was like one phone company was posting it from, but the picture was actually, even though it was about the phone, it was, you know, implying this picture comes from or something.
40:49 It was like, nope.
40:50 Whoever is on the marketing team just happens to have the other kind of phone and there it goes.
40:54 Right.
40:54 And it's a huge scandal.
40:55 I mean, for those companies that talk about how awesome they are, how much better their cameras are or whatever.
41:00 Well, I see that's also the thing, right?
41:01 Because you never know who's going to look at the metadata either.
41:03 So, and it's interesting because certain things will, certain platforms will remove it.
41:09 So I mentioned Google Drive, where everything is preserved.
41:12 But the thing is, you have to know ahead of time.
41:15 So you'd have to say, I'm planning to put this image here.
41:18 Let me upload a dummy image
41:20 I don't care about and check if the metadata is still there.
41:23 Yeah, exactly.
41:24 Yeah.
41:25 I think, I think Mastodon might remove it.
41:27 There's some certain platforms that will take away that metadata.
41:30 I think Facebook might.
41:31 It's been a long time.
41:33 I mean, it's a huge security concern.
41:35 So I imagine more and more places are, but I just wanted to have an abundance of caution and not risk anything happening.
41:41 Well, yeah.
41:42 And you're putting it on the internet as well, which there's, it goes straight from your computer through some sort of static website process.
41:49 And then it's downloaded, right?
41:50 There's very, there's no, nothing in between those two steps.
41:53 Exactly.
41:53 At least not in terms of image processing.
41:55 Yeah.
41:55 Yeah.
41:56 Cool.
41:57 Yeah, this is nice.
41:58 I'm thinking about grabbing it and trying it out.
42:01 What file types does it work on?
42:03 Does it work on just JPEGs or does it do like WebP and all that?
42:07 Any image, anything that's classified as an image on pre-commit, the way pre-commit runs.
42:12 And it has to work with, I'm using Pillow.
42:15 So if Pillow can't read it, then it's not going to work.
42:17 Right.
42:18 Then I'll just skip over it or whatever.
42:20 Yeah.
42:21 Yeah.
42:21 So really quick, while we're talking about stuff on your website, your website's super nice.
42:26 Did you build this yourself?
42:28 Like, how is this thing built?
42:29 I did.
42:29 I did build it myself.
42:32 I took a couple months in the beginning of the year and I had before a single page where
42:38 it was just like some boxes.
42:39 And then I was like, this needs to be revisited.
42:42 So it's built with Next.js and so React and TypeScript.
42:47 And then I use Tailwind CSS.
42:50 And yeah, it was kind of just like, I mean, a lot of these things are for me because sometimes,
42:54 you know, I like seeing all in one place where I'm speaking next or like stats about where
43:00 I've spoken, like a map and stuff.
43:02 And I went through, so kind of my process would be, you know, on my iPad, I would sketch out
43:08 what I kind of envisioned a page looking like and then I would prototype it in React and
43:13 see, okay, maybe this doesn't fully work, or tweak things and iterate a few times,
43:17 and bit by bit the pages formed.
43:20 The latest thing I added was this timeline functionality.
43:24 At EuroPython this year, I had this idea for a timeline and I kind of got really, really into
43:31 it.
43:31 So it was funny.
43:31 I was at a Python conference,
43:32 and I was doing tons of React.
43:34 But if you scroll down a tiny bit, there's actually too much.
43:38 This one, right?
43:39 Yeah, yeah.
43:39 Versus the little text.
43:41 Oh, the complete upcoming.
43:42 Yeah, I got you.
43:43 So I built this.
43:44 Oh, that's beautiful.
43:45 I love it.
43:45 It's like a little infographic of your upcoming events.
43:48 Yeah.
43:49 So I was like very inspired and I did this in a few days.
43:53 But it's nice because, you know, going from the sketch to the React components, it's become
43:59 very natural, which it takes a bit to get there.
44:03 But it was nice because I did have to learn TypeScript for some changes in my team.
44:08 We were going to be starting moving to TypeScript.
44:10 So this was great to work on something that, you know, fit in my head as far as what needed
44:15 to be done.
44:16 And it was very, very helpful.
44:17 But yeah, so I'm very proud of this.
44:20 There's still more, tons more to do.
44:22 I have massive lists.
44:24 But yeah, I remember looking at Google.
44:25 This is a nice static site.
44:26 Very cool.
44:27 And I didn't even see this feature.
44:28 This is great.
44:29 Broadvon out in the audience says fire emoji for it.
44:31 Very good.
44:32 Thank you.
44:33 And also, thanks.
44:36 I see you put the podcast appearance on here as well.
44:38 That's cool.
44:38 So that's happening today.
44:40 Watch the live stream now.
44:41 If you're not watching now, then you've probably missed it.
44:43 But the recording will be there, of course.
44:45 But the reason I say that is you maybe want to give a shout out to some of your upcoming
44:49 events.
44:50 Yeah, why not?
44:51 So I'm going to be in San Francisco next week talking about my Data Morph project.
44:57 And I'll also be doing a book signing there for my Hands-On Data Analysis with Pandas book,
45:02 second edition.
45:03 And then after that, I'm off to France to give a workshop on Pandas and then also talk about
45:09 getting started in open source contributions.
45:12 And then a couple of weeks after that, I will be at the final conference of the year in Australia.
45:18 And I will be talking about Data Morph once again.
45:21 And I'm hoping to run my third development sprint on Data Morph while I'm there.
45:26 Oh, that's cool.
45:27 Yeah, we'll talk about Data Morph in a second.
45:28 That's some interesting stuff.
45:30 But this is quite the agenda.
45:33 You got a full trip coming up.
45:34 No, I'm excited.
45:35 It's nice to see different cultures.
45:39 It definitely does land different, you know, the topics and just reactions.
45:43 Some people are over-the-top excited.
45:46 Some of them are just straight face.
45:48 You're like, I enjoy it.
45:49 I think it really comes into play as far as giving workshops.
45:54 I was in Portugal last week and I did the data analysis workshop.
46:00 And I think that was one of the best ones I've ever had.
46:03 It was very, very highly interactive and it was a really fun time for me.
46:07 And hopefully everyone else thought so as well.
46:09 Yeah, that's fantastic.
46:11 How did you get into public speaking?
46:12 Yeah, so I wrote the Hands-On Data Analysis with Pandas book in 2019.
46:20 And at that time, if you had told me, go do some public speaking, I'm like, please no.
46:25 You're going to France and Australia and Portugal recently.
46:29 So I'm like, no, no, no.
46:30 Yeah.
46:30 And then, well, during pandemic times, a conference reached out to me about doing a workshop on pandas
46:38 because I had written the book and doing it virtually.
46:41 And to me, that felt like a good stepping stone to get over that fear of public speaking and
46:47 the fact that it would be virtual.
46:48 I wouldn't really have to look at anyone.
46:50 And I was still absolutely terrified when it came to actually delivering that talk.
46:55 And when you think about it, it wasn't a talk, right?
46:57 So it was my first thing was a four-hour workshop.
47:01 And now I'm at the point where a virtual thing is much less desirable because it's so hard when
47:08 you can't see people. You can't see: are things landing, are they confused, are they with me?
47:12 Are they even still there?
47:14 So, and then after I did, you know, I made it to the end and I was like, okay, that's
47:19 definitely something I want to work on and do it again.
47:22 So I did, I came up with a second workshop on data visualization.
47:26 And then I think I did two or three more virtual sessions.
47:31 And then it became that some conferences were now in person.
47:35 And I was like, okay, I think I should try this.
47:37 And again, it was still a long one.
47:40 It may have even been a six-hour session that time.
47:42 So it's like crazy, right?
47:43 And then I did that in person.
47:45 And I was like, okay, I survived.
47:47 And then it kind of just felt like something, if I kept doing it, I would get over it or
47:52 at least get to the point where, you know, I could do it without being terrified for a
47:56 month ahead of time.
47:57 Right.
47:57 And I am at that point now.
47:59 It is like, I enjoy doing it because I enjoy, I'm very passionate about knowledge sharing and
48:04 just teaching people and getting that interaction that, oh, people are really like getting value
48:09 out of this.
48:10 And that to me is very nice.
48:11 Yeah.
48:12 It's super rewarding.
48:13 So, but yeah, this is quite impressive.
48:15 So just, I got the sense you kind of got started pretty soon.
48:18 You said 2019.
48:19 So you haven't been doing it for that long.
48:21 And this is great.
48:22 So maybe, you know, you brought it up, maybe we could talk a bit about your book as well.
48:27 I don't know what to say about this one.
48:29 Just that it exists and people should check it out.
48:33 It's giant.
48:33 It's giant.
48:34 As you can see, 788 pages.
48:36 Holy moly.
48:37 That is giant.
48:38 Yeah.
48:40 So this is the second edition.
48:42 If you scroll down, there's also the covers for the Korean and Chinese editions.
48:46 Oh, awesome.
48:47 And I do not read either of those, but I do have copies.
48:52 It's an act of faith to put your name on them.
48:53 You know what?
48:54 I've been told by people that read both of those languages that the name is not quite translated
48:59 correctly, but you know, I'll forget about that.
49:02 It's cool to have the copies.
49:03 Yeah.
49:04 So this book covers obviously pandas working through the basics of data analysis.
49:10 We also talk about data visualization.
49:13 And then there is a little bit towards the end about like actually applying this stuff
49:19 to use cases and also a little bit of machine learning.
49:21 Cool.
49:22 Yeah.
49:23 So I'll put a link in the show notes.
49:24 People can check it out if they would like to.
49:26 All right.
49:27 I feel like there's a few things.
49:28 We didn't make it very far in our creation guide.
49:31 So let's talk about the recipe.
49:33 All right.
49:34 What are the four steps?
49:35 At least Stefanie's recipe for a pre-commit hook.
49:39 Yeah.
49:39 This is definitely my recipe.
49:41 I mean, I think I've made two that are published ones and then obviously a few others
49:46 for trainings and explanation purposes.
49:48 And this, this is something that works well for me.
49:50 And I think makes sense as far as thinking about the pieces.
49:53 So the first thing, the hardest thing is actually to figure out what are you checking and how do
49:59 you actually code that up?
50:00 And if you want to do this in Python, this is just, okay, code your logic.
50:03 Yeah.
50:04 Right.
50:04 Yeah.
50:04 Well, and if it has a --fix, maybe that's even harder than just trying to
50:09 understand, right?
50:10 Because now you've got to not break somebody's code, or all sorts of things like that.
50:13 Yeah.
50:13 But this would be where you start at the basic level, probably first, you know, find,
50:18 figure out, can you find the issue and show people where it is?
50:22 And then you can look into fixing it.
50:23 But yeah, you have to be very careful, especially if you're going to be touching things.
50:27 So I guess it's pretty straightforward, but the magic of Python is not just the language
50:32 and the standard library, but the 500,000 external packages, right?
50:37 There's probably a ton of external packages that understand code, check different things.
50:41 And you could, you can use those in your hook implementation, right?
50:44 Just like a standard Python package, it can have dependencies and stuff.
50:47 Yes.
50:48 And so I talk about this in the third step, but I do like to make it as a package just
50:54 because you know that that's going to work and grab the dependencies as long as you follow
50:57 what you already know.
50:59 And you will tell pre-commit in the fourth step, in that .pre-commit-hooks.yaml
51:04 file, how it should be installed.
51:06 So when you say this is, this is Python, then it will know, okay, so I should be using, for
51:11 example, pip to install this.
51:12 And if you have, for example, pyproject.toml and you specify how it should be built, then
51:16 all of that just happens as it normally would.
51:18 It's just that pre-commit is doing it instead of you.
51:20 Yeah.
51:20 Yeah.
51:21 That's kind of, instead of you doing a pip install -e . or whatever, it's
51:25 kind of figuring that out.
51:26 And I guess we haven't really talked too much about it, but when you pre-commit install, it
51:31 looks at this hooks YAML file and then it creates the environment and it downloads
51:36 all the packages the first time to kind of set it up.
51:39 Then it just runs over and over after that.
51:41 Right.
51:41 Yeah.
51:41 Unless you change something in your pre-commit config file, then it won't need to rebuild the
51:47 environment for this.
51:48 So if you keep the same version, then it's kind of like you said.
51:51 I installed this version of the package.
51:52 And as long as you don't say you need to update the package and it's kind of like a virtual
51:56 environment.
51:56 Okay.
51:57 You already have that.
51:57 There's no need to.
51:58 Yeah.
51:59 Yeah.
51:59 Excellent.
51:59 So your recipe is one, design the check function; two, turn it into a CLI, which there's some interesting
52:07 stuff in that one as well.
52:08 That's.
52:08 And I think that's kind of where the --fix comment comes into play.
52:13 Right.
52:13 So your logic, that check function, you should be able to say this was successful.
52:18 This was not successful as in stop the commit.
52:21 And then the CLI provides a very easy way to plug into that.
52:26 Maybe you want to say --fix or dash dash, you know, leave this type of file alone,
52:31 whatever kind of modification you want to do.
52:33 You can expose that in a CLI.
52:36 And that's also a quicker way to get started versus trying to, let's say,
52:42 find the pyproject.toml, read it in, parse out things.
52:46 That's all stuff that can come later once you figure out exactly how you want your tool to
52:51 be configured.
52:52 Yeah.
52:52 Especially if it just has one or two arguments, it might not be necessary to be too, too over
52:57 the top with all the configuration.
52:58 And then you make it installable.
53:00 Basically, like you said, make it a package and then create the pre-commit hooks.
53:05 Yeah.
53:05 Well, those are the steps.
53:06 So I think write the function, that's pretty straightforward.
53:09 You just, whatever you want it to do, you just write a function that does it.
53:12 You do have an example in here about checking for valid file names and snake cased file names.
53:19 So things like it can't be just one letter and it has to be snake cased and so on.
53:25 Right.
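A check along those lines might look like the sketch below. This is not the code from the article; the function name, the regular expression, and the default minimum length of 3 are assumptions made for illustration.

```python
import re
from pathlib import Path

# Lowercase letters and digits in groups separated by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_valid_filename(filename: str, min_length: int = 3) -> bool:
    """Return True if the name (without extension) is long enough and snake_case."""
    stem = Path(filename).stem
    if len(stem) < min_length:
        return False
    return SNAKE_CASE.fullmatch(stem) is not None
```

Names such as `__init__.py` fail this simplified pattern, so a real hook would need an exception for them or an exclude rule in its configuration.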
53:25 But then to turn that into a CLI, there's a lot of options in Python these days, right?
53:31 You can use Click, you can use Typer, but if you want something built in, yeah, if you want something
53:37 built in, argparse is pretty straightforward, right?
53:39 Yeah.
53:40 And I think also, I mean, if you look at the pre-commit hooks repo provided by pre-commit org,
53:45 a lot of them, or maybe all of them are just using argparse.
53:49 Because for most hooks, all you'll need to say is, I have an argument parser and it accepts
53:54 file names.
53:55 And at that point you have this boilerplate that you can just copy and you don't even
53:58 need to worry about configuring multiple, you know, different arguments.
54:02 It doesn't have to be too advanced with like sub commands and all that kind of stuff necessarily.
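That boilerplate can be sketched roughly as follows. The program name `validate-filename`, the `--min-length` option, and the simplified check are illustrative assumptions; the one convention relied on is that pre-commit passes the file names as positional arguments and treats a non-zero exit code as a failure.

```python
import argparse
import re
from pathlib import Path
from typing import Optional, Sequence

SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_valid_filename(filename: str, min_length: int) -> bool:
    """Simplified stand-in check: long enough and snake_case."""
    stem = Path(filename).stem
    return len(stem) >= min_length and SNAKE_CASE.fullmatch(stem) is not None


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse the file names and return 0 on success, 1 if any file fails."""
    parser = argparse.ArgumentParser(prog="validate-filename")
    parser.add_argument("filenames", nargs="*", help="files passed in by pre-commit")
    parser.add_argument("--min-length", type=int, default=3)
    args = parser.parse_args(argv)

    exit_code = 0
    for filename in args.filenames:
        if not is_valid_filename(filename, args.min_length):
            print(f"{filename}: name is not snake_case or is too short")
            exit_code = 1  # a non-zero exit stops the commit
    return exit_code
```

Taking `argv` as a parameter keeps `main` easy to call from tests, and an entry point can reference `main` directly so its return value becomes the process exit code.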
54:06 Yeah.
54:07 Yeah.
54:07 And then make it installable.
54:09 This is, you recommend a pyproject.toml, which yeah, for packages these days, that seems
54:15 pretty much the de facto standard, right?
54:17 Yeah.
54:18 And then what's nice is, yeah, you're using current things.
54:21 You're not relying on setup.py.
54:23 And also in there, there's a way to expose an entry point.
54:27 And that's line 24.
54:28 Yeah.
54:29 Yeah.
54:29 Yeah.
54:29 Yeah.
54:30 That's really nice.
54:30 I love entry points.
54:31 I think it's, I think they're massively underused in Python.
54:35 You know, people talk about how do I create a script that I can give it to somebody so they
54:40 can run something.
54:41 And that so often involves like, where is it?
54:44 Where is its associated files?
54:46 Where is its Python?
54:47 And where are its dependencies?
54:48 All of that stuff you, if you just create a package and it has an entry point, you can
54:52 pipx install it or uv tool install it.
54:55 Or, and now you just have all these commands and people don't have to mess with all the Python
54:58 stuff.
54:59 Even if you know how to do it, you don't necessarily want to do that all the time.
55:01 Right?
55:02 Yeah.
55:02 And then it's just easy to, you can kind of call it from anywhere at that point.
55:05 Yeah, exactly.
55:06 So in this example, you give, you put a, a validate dash file name command and you
55:12 just point to, you know, what module and then what function to call.
55:16 And that's the CLI.
55:17 Yeah.
55:17 That's really nice.
55:18 And then of course that function in there is built and backed with argparse.
55:22 So it all, it kind of all comes through a circle right there.
55:24 Yeah.
55:24 Yeah.
55:25 So it's like you, it's almost like you had created, you know, some command line utility,
55:29 like bash wise or something.
55:31 And you just have that available and it hooks into your CLI.
55:35 I also want to call out line 21, because we talked about dependencies, right?
55:39 So anything you put in there will automatically get grabbed when pre-commit installs.
55:44 So in this case, there's nothing.
55:46 And then in the case of the exif stripper I mentioned, like we need to install Pillow, right?
55:50 So this is how you can configure how pre-commit will grab everything.
55:55 And I also see it has, yeah, I see there's a requires Python version.
55:58 Does pre-commit help you get Python in any way?
56:01 Or is it just assume that there's a...
56:03 You need to have whatever languages you're relying on, you do need to have them installed
56:06 already.
56:07 Okay.
56:07 So in order for you to use this pre-commit hook on your machine, you'd have to have, for
56:12 example, Python 3.10, 3.11, 3.12, something like that installed, given that it says 3.10
56:16 or greater.
56:17 So for example, like if you saw some hook that sounded interesting, but it's written in Go
56:21 and you don't have Go on your computer, you have to figure that out first.
56:24 That's a no-go.
56:25 It's a no-go.
56:28 All right.
56:28 Let's see.
56:28 Yeah.
56:30 And then the last thing to do is, you say, create the .pre-commit-hooks.yaml file.
56:36 And is this the thing that goes into your repo?
56:39 So when pre-commit sees it, it knows what to do?
56:42 Yeah.
56:42 So for example, in the exif stripper repo, this file exists.
56:47 So if someone uses exif stripper, they point to that repository.
56:51 And then when pre-commit goes and grabs it, it looks for this file, right?
56:54 And then the key things here, for one being language.
56:58 So language tells pre-commit, how does it try to install that?
57:02 So in this case, it says, oh, this is Python.
57:04 So then it knows, okay, pip.
57:06 The ID at the top, that's the name that you reference in the pre-commit config.
57:12 Like when you want to, like we saw check-toml, check-yaml in the beginning, those correspond
57:17 to entries in the .pre-commit-hooks.yaml of that repository that they were being referenced
57:23 from.
57:23 So pre-commit can, so first finds this file, it can install, then it can see, oh, which
57:28 hook do you want?
57:29 Validate file name in this case.
57:30 And then how do I call this?
57:32 And that's entry.
57:33 And this is pointing to the entry point that we made, but it can be anything, right?
57:38 You could call Ruff and then add, you know, 20 different command line flags if you want.
57:43 And that can be your hook.
57:44 And that would be fine as well.
57:46 And what's very interesting here is it's optional, but it's the types one at the bottom.
57:51 So I talked before about exif stripper only running on images, right?
57:55 It'd be wasteful to have it look at toml and markdown, right?
57:58 If it's not going to do anything with it.
57:59 Can't find any EXIF information in the TOML.
58:01 Yeah.
58:02 So this controls that.
58:04 So for example, this hook will only run on Python files.
58:08 And this logic, I'm blanking on the name of the tool that pre-commit uses to figure this
58:14 out.
58:14 But this is handled elsewhere.
58:15 So there's like certain names that you can use.
58:17 Right.
58:18 Some sort of category mapping over to these file extensions or these BOMs at the beginning
58:23 of the file or whatever mean that it's this thing.
58:26 Exactly.
58:26 There is a very dangerous thing with this, and that is that types is an AND.
58:31 So if you say, if you wanted to do like this should run on Python and markdown, you can't
58:35 use this because it will look for files that are both Python and markdown and will not end
58:41 well.
58:41 Not too many of those exist.
58:43 Yeah.
58:43 There's a separate types_or that you have to use.
58:46 That's like a little gotcha.
58:47 It's like an ORM sort of instead of a SQL statement.
58:51 Kind of you got to.
58:52 Yeah.
58:53 Those things always get weird.
58:54 Like import the or operator.
58:56 Like, okay.
58:57 Yeah.
58:58 Cool.
58:59 Okay.
58:59 That's actually that that is very good to know because it looks like a list of options.
59:03 It is.
59:05 Yeah.
59:05 But they combine.
59:05 So you might have something like it is a file and it's Python.
59:08 That might be one thing I've seen.
59:10 Right.
59:10 Okay.
59:11 Yeah.
59:12 Cool.
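The AND-versus-OR gotcha discussed above can be modeled in a few lines of plain Python. This is only a sketch of the filtering idea, not pre-commit's actual code: pre-commit's documentation points to the identify library for tagging files, whereas the tag sets below are hand-written approximations and the helper names `matches` and `select` are made up for illustration.

```python
# Hand-written stand-ins for the tags a file classifier might assign.
FILE_TAGS = {
    "tool.py": {"file", "text", "python"},
    "README.md": {"file", "text", "markdown"},
    "logo.png": {"file", "binary", "image", "png"},
}


def matches(tags, types=("file",), types_or=()):
    """A file is selected only if it carries ALL of `types`
    and, when `types_or` is given, AT LEAST ONE of `types_or`."""
    tags = set(tags)
    has_all = set(types) <= tags
    has_any = not types_or or bool(tags & set(types_or))
    return has_all and has_any


def select(types=("file",), types_or=()):
    """Return the file names from FILE_TAGS that pass the filter."""
    return [
        name for name, tags in FILE_TAGS.items()
        if matches(tags, types, types_or)
    ]


# `types` is an AND: no file is both Python and Markdown, so nothing runs.
print(select(types=["python", "markdown"]))

# `types_or` is an OR: Python files and Markdown files are both selected.
print(select(types_or=["python", "markdown"]))

# Combining tags with AND is still useful, e.g. "a file AND Python".
print(select(types=["file", "python"]))
```

The first call selects nothing, which is the silent failure mode described in the conversation: the hook is configured, but it never sees a file.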
59:12 So if I wanted to have more than one hook, I could put it into one.
59:16 I could have more than one here.
59:18 Is that possible?
59:18 Yeah.
59:19 So this looks like a list.
59:20 Yeah, exactly.
59:21 It's structured as a YAML list.
59:22 So you just kind of could copy that block, paste the new one, and then just change whatever
59:27 fields you want.
59:28 And then that's now the second hook that you expose.
59:31 Right.
59:31 And working backwards, I suppose you just expose a different entry point potentially and then
59:36 just call it out or whatever you want.
59:38 Well, I mean, you could like maybe you have a validate file name and maybe you have another
59:41 one that's like validate long file names or something where you're like, now they have
59:44 to be this long.
59:45 And then it's just a shortcut for something else.
59:47 So it doesn't have to be a different thing.
59:49 Oh, yeah.
59:49 You just put an argument in there as a default kind of for people.
59:52 So we talked about args earlier and that was something the user could tweak.
59:57 Anything you put in here is essentially like it will always run with these.
01:00:01 So you could bake in certain things that have to happen.
01:00:04 Yeah.
01:00:04 Awesome.
01:00:05 I love it.
01:00:06 Okay.
01:00:06 We're pretty much out of time, but let's talk about one final thing.
01:00:12 Not this one.
01:00:13 Your Data Morph project.
01:00:15 Give a quick shout out to that before we wrap things up.
01:00:19 What do you think?
01:00:19 Sure.
01:00:19 So this project started related to the pandas workshop I had mentioned.
01:00:26 I wanted to have a visual to really drive home the point that we needed to visualize our
01:00:31 data because pandas is very much data wrangling.
01:00:35 And after talking to people for two hours about data wrangling and statistics you can calculate
01:00:40 on tabular data,
01:00:41 Some people just feel like, oh, okay, we're done.
01:00:44 I mean, you know, we're done.
01:00:45 And that's definitely not the case.
01:00:47 And I was thinking about, and you had it on the screen before, the Datasaurus Dozen.
01:00:52 So yeah.
01:00:53 So there was research in 2017 by Autodesk where they took the idea of Anscombe's Quartet, which
01:01:01 is, sorry, just a little bit above that, which is just a set of four, yeah, four data sets.
01:01:08 They share the same summary statistics.
01:01:10 So the mean in X and Y, the standard deviation in X and Y, and the Pearson correlation coefficient.
01:01:16 And they look very different.
01:01:17 And if you think of, naively, you think, well, I know the average and maybe how spread out
01:01:24 things are.
01:01:25 So I can kind of get a sense of what this data probably means.
01:01:28 But in reality, outliers and other weird things could just completely blow up those ideas,
01:01:33 right?
01:01:34 Yeah.
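To make the point about shared summary statistics concrete, here is a small sketch using the first two of Anscombe's four data sets. The helper names `pearson` and `summary` are introduced here for illustration, and the statistics are rounded to two decimal places, the same precision mentioned in the conversation.

```python
import math
import statistics

# First two data sets of Anscombe's Quartet (they share the same x values).
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def pearson(xs, ys):
    """Pearson correlation coefficient, written out by hand."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def summary(xs, ys, ndigits=2):
    """Mean and standard deviation in X and Y plus the correlation."""
    values = (
        statistics.mean(xs),
        statistics.mean(ys),
        statistics.stdev(xs),
        statistics.stdev(ys),
        pearson(xs, ys),
    )
    return tuple(round(value, ndigits) for value in values)


# Both lines print the same tuple even though the y values differ.
print(summary(X, Y1))
print(summary(X, Y2))
```

Plotted, the first set looks like a noisy straight line and the second like a smooth curve, which is exactly what the numbers alone cannot show.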
01:01:34 And so in 2017, they had developed this algorithm using simulated annealing.
01:01:40 So if you scroll down once more, where they take the dinosaur at the top and they use
01:01:47 simulated annealing to push the points.
01:01:48 Let me describe this really quick for just people listening.
01:01:50 So there's a matplotlib looking graph of some data points, and it has a certain standard
01:01:56 deviation, certain mean, et cetera.
01:01:58 But if you actually look at it, it looks like a T-Rex, right?
01:02:02 Something like this?
01:02:03 Yes.
01:02:03 Is that a decent enough description?
01:02:05 That's a perfect description.
01:02:07 Yeah.
01:02:07 So what the researchers have done is they use this simulated annealing algorithm to push
01:02:12 the points around.
01:02:13 So starting from that dinosaur and just moving the points ever so slightly in such a way where
01:02:18 the summary statistics are unchanged, at least to the two decimal places where they're currently
01:02:23 shown, and tried to make other shapes.
01:02:26 So some of the other shapes they have are a bullseye, a circle, lines slanted vertically
01:02:32 or a star.
01:02:33 And all of these can be formed from that dinosaur, some to varying degrees of success.
01:02:38 But they're visually recognizable, which is the point that is pretty important here, right?
01:02:45 So you cannot, as we said, rely on those summary statistics because you don't know.
01:02:48 Is it the star?
01:02:49 Is it the dinosaur?
01:02:50 Is it a line?
01:02:51 It could be anything.
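The idea of nudging points while keeping the summary statistics fixed can be sketched as follows. This is a heavily simplified, assumption-laden illustration in the spirit of simulated annealing, not the researchers' algorithm and not Data Morph's implementation: the function names, the target shape (two vertical lines at hypothetical x positions), the step size, and the linear cooling schedule are all choices made here.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient, written out by hand."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def summary(points, ndigits=2):
    """Summary statistics of (x, y) points, rounded to `ndigits` places."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    values = (statistics.mean(xs), statistics.mean(ys),
              statistics.stdev(xs), statistics.stdev(ys), pearson(xs, ys))
    return tuple(round(value, ndigits) for value in values)


def morph(points, target_lines=(6.0, 12.0), steps=2000, seed=0):
    """Nudge points toward vertical lines without changing summary()."""
    rng = random.Random(seed)
    current = list(points)
    baseline = summary(current)
    for step in range(steps):
        # Linear cooling: worse moves become less likely over time.
        temperature = 0.4 * (1 - step / steps)
        i = rng.randrange(len(current))
        x, y = current[i]
        new_x = x + rng.gauss(0, 0.02)
        new_y = y + rng.gauss(0, 0.02)
        old_dist = min(abs(x - line) for line in target_lines)
        new_dist = min(abs(new_x - line) for line in target_lines)
        # Keep moves that get closer to the target shape; occasionally
        # keep a worse one so the search does not get stuck.
        if new_dist >= old_dist and rng.random() >= temperature:
            continue
        trial = current[:i] + [(new_x, new_y)] + current[i + 1:]
        # The key constraint: the rounded statistics must not change.
        if summary(trial) == baseline:
            current = trial
    return current


START = list(zip(
    [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
))
morphed = morph(START)
print(summary(START) == summary(morphed))
```

Because a move is only kept when the rounded statistics are unchanged, every intermediate arrangement shares them too, which is why the animations between shapes are possible.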
01:02:52 And they also had animation that they included.
01:02:56 So basically, you could start from the dinosaur and then turn it into a circle.
01:03:00 And that's even more impactful because you realize at that point that it's not just the
01:03:06 dinosaur and the circle that have something in common, but it's the infinite number of
01:03:10 points arrangements that you can make between them that actually share that.
01:03:14 And so I wanted to explore if I could extend that to working for arbitrary data sets and also
01:03:20 different shapes.
01:03:21 So I found the research code and spent quite a bit hacking at it and even just trying to
01:03:27 get it to work for their example.
01:03:29 And that took quite a bit of time.
01:03:30 And then I had this idea of being that it was for a pandas workshop to take a panda and
01:03:35 turn it.
01:03:36 Initially, I wanted to turn it into the dinosaur.
01:03:38 I still have not found a good way to do that yet, but I also haven't been trying at all this
01:03:44 year on that, to be honest.
01:03:45 But I figured out how to, and by adding a lot of other things that didn't exist in the initial
01:03:51 algorithm, things like calculating bounds of the data and different metrics that I figured
01:03:56 out a way to get it to work regardless.
01:03:59 So I can give it a panda data set or a soccer ball and it can perform these transformations
01:04:04 and move the points around.
01:04:06 So on the screen, we have the first time I shared this publicly, what I had been working on,
01:04:11 it happened to be Easter.
01:04:12 So I made a bunny holding an Easter egg with the words, happy Easter off the side.
01:04:17 And it turns into two vertical lines all while preserving the summary statistics.
01:04:22 This is something I think makes it for a very good teaching tool in say like an introductory
01:04:28 statistics course to encourage people that they need to visualize.
01:04:32 There's an interesting study, I think called "A hypothesis is a liability."
01:04:38 And they talked about taking students in a statistical analysis course and they split them into two.
01:04:44 And one set of students were just given the data set and say, here, explore, see what you find.
01:04:49 And then the other set were given a set of hypotheses to test.
01:04:53 And it turns out that the data is shaped like a gorilla.
01:04:56 And the students who were told here, test these hypotheses were five times less likely to even
01:05:02 realize that it was shaped like a gorilla because they never plotted it.
01:05:05 Yeah.
01:05:06 This is such a huge thing to like get people learning this early.
01:05:10 And the more shocking these visuals are, the better.
01:05:14 Yeah.
01:05:14 And I think these are super shocking, right?
01:05:17 Having T-Rexes and bunnies and go, you know, that bunny is, you know, equivalent.
01:05:22 And there's a continuous transformation from bunny to blob of dots with one outside dot, right?
01:05:28 That kind of stuff kind of surprise you, I think.
01:05:30 And one thing I see, especially when the dinosaur came out, but even when I posted some of my first
01:05:37 examples is you see people comment right away, wow, that there's something that's so cool that
01:05:41 that dinosaur is possible to do that with.
01:05:44 Like, no, no, no.
01:05:44 It's not, it's not just the dinosaur or just the panda.
01:05:47 It's really like anything.
01:05:48 And so the way this also works is that people can use their own data sets or they can add
01:05:53 something new.
01:05:53 And that's what I've done this year in the two previous development
01:05:59 sprints that I had. I did one at EuroPython and one at PyCon Taiwan earlier
01:06:07 this year.
01:06:07 And hopefully in Australia, we'll do some more.
01:06:11 But I had people add, for example, a target shape.
01:06:15 So what the, for example, the panda would turn into, we have a club, like the card suit,
01:06:21 which was quite a challenge, and the spade.
01:06:24 And I had already had the heart.
01:06:25 The heart is actually a trigonometric equation, which, you know, blew my mind at first.
01:06:30 There's actually a page I found on, I think, Wolfram Alpha, which was like, I want to say
01:06:35 like 10 or 15 different equations, trigonometric equations for different types of hearts.
01:06:40 And you can pick the exact type of heart you wanted.
01:06:43 Social media heart, the emoji heart, what are we talking about?
01:06:45 No, no, it was just like, this is longer, this is more curved.
01:06:48 Yeah, yeah, yeah, that's awesome.
01:06:49 But these are all now math problems when you think about that side of it.
01:06:53 So this could then be used maybe in a course where they want to focus on math, but also
01:06:57 some more coding.
01:06:58 So there's lots of different use cases, like just giving it the data.
01:07:02 And that's very much more just pure statistics.
01:07:04 But, you know, I think, and I've heard from a few teachers that, from what I presented,
01:07:09 that they're, it sounds like this would be something that they would like to use.
01:07:13 So hopefully that does happen.
01:07:14 If not, it's a fun thing to put in my slides.
01:07:16 And I did enjoy getting it to work.
01:07:18 Yeah, I didn't pull up any good videos for the YouTube video, but there's some really nice
01:07:23 animations of actually seeing it go from one to the other that you got.
01:07:27 And this is, you're doing a talk at PyCon Australia, and then you're doing a sprint on
01:07:32 this as well, right?
01:07:33 Coming up in November 22nd, about a month from now.
01:07:36 Correct.
01:07:37 So cool.
01:07:37 People can check that out if they happen to be at PyCon Australia and want to...
01:07:41 Well, I'll also be talking about it in San Francisco next week.
01:07:45 There won't be a sprint, but I will be talking about that.
01:07:48 So people can...
01:07:48 Okay.
01:07:49 It's not a PyCon.
01:07:49 Sure.
01:07:50 It's still cool.
01:07:51 All right.
01:07:52 Well, Stefanie, thank you so much for being here.
01:07:54 Let's wrap things up.
01:07:56 But I guess, you know, give us a final call to action for people maybe interested in pre-commit
01:08:00 hooks or other stuff that you're doing.
01:08:02 Yeah, you can find everything that we mentioned here and the projects on my website.
01:08:06 I'm putting much more effort into putting stuff on there this year now that I've rebuilt it.
01:08:12 So definitely check there and sign up for my newsletter.
01:08:15 Follow me on socials.
01:08:17 There's no links down here, but you can find them.
01:08:19 There'll be links on the episode page.
01:08:21 So we'll put them there.
01:08:22 All right.
01:08:23 Well, thanks.
01:08:24 Thanks for being here.
01:08:24 It's great to talk to you.
01:08:25 Thanks for coming on and sharing.
01:08:26 Thanks for having me.
01:08:27 Yeah.
01:08:27 Bye-bye.
01:08:28 This has been another episode of Talk Python to Me.
01:08:32 Thank you to our sponsors.
01:08:34 Be sure to check out what they're offering.
01:08:35 It really helps support the show.
01:08:37 Take some stress out of your life.
01:08:39 Get notified immediately about errors and performance issues in your web or mobile applications with Sentry.
01:08:45 Just visit talkpython.fm/Sentry and get started for free.
01:08:50 And be sure to use the promo code talkpython, all one word.
01:08:53 This episode is brought to you by Bluehost.
01:08:56 Do you need a website fast?
01:08:58 Get Bluehost.
01:08:58 Their AI builds your WordPress site in minutes and their built-in tools optimize your growth.
01:09:04 Don't wait.
01:09:05 Visit talkpython.fm/Bluehost to get started.
01:09:08 Want to level up your Python?
01:09:10 We have one of the largest catalogs of Python video courses over at Talk Python.
01:09:14 Our content ranges from true beginners to deeply advanced topics like memory and async.
01:09:19 And best of all, there's not a subscription in sight.
01:09:22 Check it out for yourself at training.talkpython.fm.
01:09:25 Be sure to subscribe to the show.
01:09:27 Open your favorite podcast app and search for Python.
01:09:30 We should be right at the top.
01:09:31 You can also find the iTunes feed at /itunes, the Google Play feed at /play,
01:09:36 and the direct RSS feed at /rss on talkpython.fm.
01:09:40 We're live streaming most of our recordings these days.
01:09:43 If you want to be part of the show and have your comments featured on the air,
01:09:47 be sure to subscribe to our YouTube channel at talkpython.fm/youtube.
01:09:51 This is your host, Michael Kennedy.
01:09:53 Thanks so much for listening.
01:09:55 I really appreciate it.
01:09:56 Now get out there and write some Python code.
01:09:57 I'll see you next time.