#482: Pre-commit Hooks for Python Devs Transcript

Recorded on Thursday, Oct 24, 2024.

00:00 Do you struggle to make sure your code is always correct before checking it in?

00:03 What about your team member's code? That one person who never wants to run the linter,

00:08 tired of dealing with tons of conflicts and spurious Git changes? You need Git pre-commit

00:13 hooks. Well, we're lucky to have Stefanie Molin on the show today, who has done a bunch of writing

00:18 and teaching of Git hooks. This is Talk Python to Me, episode 482, recorded October 24th, 2024.

00:27 Are you ready for your host? You're listening to Michael Kennedy on Talk Python to Me.

00:32 Live from Portland, Oregon, and this segment was made with Python.

00:36 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy.

00:44 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,

00:50 both accounts over at fosstodon.org, and keep up with the show and listen to over nine years of

00:56 episodes at talkpython.fm. If you want to be part of our live episodes, you can find the live streams

01:01 over on YouTube. Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified

01:07 about upcoming shows. This episode is brought to you by Sentry. Don't let those errors go unnoticed.

01:13 Use Sentry like we do here at Talk Python. Sign up at talkpython.fm/Sentry. And this episode is

01:19 brought to you by Bluehost. Do you need a website fast? Get Bluehost. Their AI builds your WordPress site

01:25 in minutes, and their built-in tools optimize your growth. Don't wait. Visit talkpython.fm/Bluehost

01:31 to get started. Hey, everyone. Before we jump into the interview with Stefanie,

01:36 I want to tell you real quickly that I just released a blog for Talk Python. Now, we have had tons of RSS

01:43 over there because that's what powers podcasts. You can subscribe to the episodes. You can subscribe to

01:49 an RSS feed for new course announcements over at Talk Python Training. And I've had a personal blog

01:55 for some time over at mkennedy.codes, but no official Talk Python blog. And so I'm going to be posting

02:01 really cool things on there. I've already got a couple of articles posted, but I have plans for

02:06 some interesting series. And anytime there's some more interesting announcements or exciting news

02:11 I want to share with Talk Python, it's going to be over on the Talk Python blog. So if you're interested,

02:16 I would really, really appreciate it. If you go to talkpython.fm, click on blog, right in the

02:21 navigation or at the bottom and just subscribe to the RSS feed. That way we can stay in touch.

02:26 And with that, let's talk pre-commit hooks. Stefanie, welcome to Talk Python. It's awesome

02:32 to have you. Thanks for having me. Yeah, really looking forward to talking about pre-commit hooks.

02:38 You know, these are things that I'm sure a lot of people have heard of. I've certainly heard of,

02:42 but to be honest, it's not something I've done very much with. And I bet a lot of people out there

02:47 listening are like, yeah, that'd be a good idea. Just like continuous integration and writing tests.

02:51 Now let's get back to it. You know, something like that. So I think there's a lot for people to take on, take away here. And we'll talk about what are these pre-commit

03:00 hooks, when to use them, how to build them, and a whole bunch of other things that you're up to.

03:05 So it should be a lot of fun. I'm looking forward to it. Me too.

03:08 Yeah. Now, before we get to that, how about your story? How did you get into programming, Python, and

03:14 pre-commit hooks and all these things? Hello everyone. I'm Stefanie Molin. I am a software engineer at

03:18 Bloomberg. And I would say, I guess I got into programming in Python. I initially was programming

03:25 in R and I was doing more data analysis while still building some things. And I needed to build a web

03:33 app. And one of my teammates had suggested that rather than battling with Shiny in R, that I just

03:38 learn Python. So I took a few weeks and just forced myself to do that. And I built something

03:43 with Flask. And that was how I got into it. Oh, that's really awesome. Yeah. You were doing work in

03:50 not finance, but in ads or something like that with R. What kind of work was that? Like, just generally

03:56 ads? You don't have to go into details. Yeah. So it was, it was mainly reporting and doing analysis on how

04:03 client campaigns were going. But what really got me started with programming was more, I had gotten

04:08 involved with a hackathon team and we had built an alerting system. So just monitoring when something

04:13 weird went on with the campaigns. And I really enjoyed building more, more so than the analysis.

04:19 And so I had to find a way to, and I enjoy like a little bit of data and more on the coding side.

04:25 So I had to find something that would let me combine those two.

04:28 Yeah. Well, that sounds really fun. I definitely, I'm on the same wavelength as you with data analysis

04:33 is fun, but the building is, is really where things get interesting and, you know, look back and see

04:39 like, Oh, we built this thing. That's, that's a pretty awesome feeling.

04:42 Yeah. It was, it was a ton of fun and we ended up getting, I think third place on the hackathon,

04:47 but yeah, that was, that was really that moment where it was like, I got a taste of something else.

04:52 And I was like, this is, this is what I want to be doing.

04:55 Yeah. Oh, that's fantastic. Was that at your company or was that someone?

04:58 That was at the, the previous, previous role. It was the ad tech company. And so that was actually

05:04 all built in R, the alerting system. And then, Oh no. Yeah. Okay. Yeah. And then, and then as we

05:10 worked more on it, certain things ended up moving into Python. So a lot easier to work with and to

05:16 automate things and not have like some laptop running R somewhere.

05:19 Yeah, exactly. It's, that's sort of the promise of Python over a lot of these things that at first

05:28 blush seem somewhat equivalent, right? Is that it's, it's a real programming language that can go on to do

05:35 all the stuff. You don't have to try to automate some weird thing. That's not really meant to be that

05:40 way. Right.

05:40 I know. And now, I mean, I could not write R if I, if I had to, I wouldn't, I don't think I would.

05:47 Yeah. Well, I was going to ask you now, which side of the fence do you spend more time on R or Python?

05:53 It sounds like.

05:54 I haven't touched R in maybe six plus years at this point. So I, yeah. Other than the arrows,

05:59 that's probably the only thing I could manage too.

06:01 Yeah. No more equals signs, just arrows. Okay. Got it. Awesome. Well, that's super fun. Let's talk

06:10 about pre-commit hooks, right? I've had Anthony Sottile on the show to talk about his pre-commit project.

06:16 It was a long time ago and I'm sure that project will get a bit of a shout out from your work as

06:21 well. But, you know, congrats, you put together a really nice series of articles and resources

06:28 teaching people what commit hooks are, how to debug them, how to build them, how to choose them. So I

06:35 think, you know, the stuff we're going to talk about, I'll link, of course, in the show notes.

06:38 It's a really nice resource for folks. So thank you. I appreciate that.

06:41 Yeah. Yeah. You bet. So let's talk about numpydoc doc string validation. This is, this was your entry

06:51 way into what this whole world of pre-commit hooks is, right?

06:54 Yeah. So, I think in July 2022, I was at my first EuroPython and I decided to do the sprints

07:02 for the first time. I ended up working with the scikit-learn team and they wanted to make sure that

07:08 all of their doc strings were conforming to the numpydoc standard. They had a file in place or a test

07:14 file that you could run and just validate that whatever changes you made were now being validated

07:19 as far as doc strings. And I remember at one point, like I had, I think done 12 or so PRs in that sprint.

07:27 So I was very productive. And there was one early on, I think in the second or so, where it just wasn't

07:32 working and I couldn't figure out why it was telling me it wasn't valid. It was saying that it wasn't

07:37 ending in a period. And I had called over the, one of the maintainers and we both stared at it. To us,

07:43 it looked like a period. And I ended up just deleting the doc string and starting over. And it turned out

07:48 that it was a trailing space at the end. And so I had asked the maintainer, like, how do you not have

07:54 this happen to you? And the response was, you should install pre-commit. And by then I had, I was already,

07:59 I had to leave. So I was like, make, I made a note to myself. I need to research this when I get home.

08:04 And when I did, I was like, well, how did I not know about this before? And I set it up on things.

08:09 And then I went to look, does numpydoc have that? This seems like exactly what you would want.

08:15 As you're writing code, you want to make sure that it's going to check the doc string there. You don't

08:19 want to have to run some other thing later on and remember to run it. So I looked and there was no

08:24 pre-commit hook for numpydoc. And I had made something, something that initially we had just

08:29 used internally within my team. And then later on, I kind of wanted to use it for a personal project.

08:35 And so I set about seeing how we could actually open source it. And I had contacted the numpydoc team

08:42 and they were very, very interested in it because there was a reason there was no hook. It's because

08:46 no one knew how to do it. Right. And at that point I had the horrible realization that what I had written

08:52 would never work outside because it was relying on things being installed. So, and then I felt pretty

08:58 bad about promising that to them. So I managed to come up with an entirely new solution in a weekend

09:03 and figured out how to use the abstract syntax tree to work through. And so I built an entirely

09:10 new version of it. And that is what is currently available in numpydoc. And that actually led to

09:15 them inviting me to be a core developer for numpydoc. Congratulations. How cool is that?
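
The idea of checking doc strings through the abstract syntax tree, without importing or installing the code being checked, can be sketched roughly as follows. This is only an illustration of the technique, not numpydoc's actual implementation; the function name and the single "ends with a period" rule are assumptions chosen to mirror the trailing-space story above.

```python
import ast


def docstrings_missing_period(source: str) -> list[str]:
    """Return names of functions/classes whose summary line does not end in a period."""
    problems = []
    # Parsing only reads the text; nothing in the checked code is executed.
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # clean=False keeps the raw text, so a trailing space stays visible.
            doc = ast.get_docstring(node, clean=False)
            if not doc:
                continue
            lines = doc.lstrip().splitlines()
            if lines and not lines[0].endswith("."):
                problems.append(node.name)
    return problems


EXAMPLE = '''
def good():
    """Do a thing."""

def bad():
    """Do a thing. """
'''

print(docstrings_missing_period(EXAMPLE))
```

Because the check works on source text alone, a hook built this way can run in an isolated environment that has none of the project's dependencies installed.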

09:22 Yeah, I know. It's like the full spectrum, right? And just having heard about it and then just

09:27 seeing the connection between two things that weren't previously connected.

09:31 Yeah. Yeah. Well, I think your comment about the pre-commit hook not previously existing,

09:37 you know, for this project also is, it's pretty interesting, right? It's kind of like I hinted at,

09:41 I mean, a lot of people hear about this kind of stuff, but that doesn't mean they're putting it

09:45 into practice, right?

09:45 Yeah, for sure.

09:47 And so how do we, you know, let's, let's find our way over to pre-commit hooks in general. So how do we

09:53 encourage people or ensure that people follow coding rules, right? We've got tools like Black,

09:59 we've got tools like Ruff. Now those will work awesome if you give them a consistent config file

10:06 or config settings, not so much with Black, but Ruff. Anyway, they'll make those changes and do a lot of the

10:13 kind of stuff that we're talking about here, but that requires, like you said, people to have it

10:16 installed, people to run it and people to buy into the whole concept of the project in the first place,

10:23 right?

10:23 Yeah, that last bit.

10:24 We're all using these tools and we're all going to run them and we're going to remember to run them

10:28 until one person goes, I don't like these tools. I'm not doing it. And then their settings fight with

10:33 your settings or their spacing fights with your spacing or whatever, right?

10:36 Yeah. I think what has, what really helped in my experience, when you incorporate these things,

10:41 even like going and approaching open source projects that didn't have a pre-commit set up and just asking

10:46 if they were interested in it, it's, you really see the value when you've, you think if you've ever

10:51 reviewed something or gotten review comments about, you should start a new line here. I don't like this

10:56 space here. And then you think about how much time you waste at that stage. And then you still have

11:01 zero consistency because you did it one way, someone else does it another way. And even further than that,

11:08 it's just the time you waste in your code. Oh, I should put this on a new line and reformatting files

11:13 when you could actually be writing things and thinking about how should I design this algorithm?

11:18 Right. And so I think a big part of making sure that, once you find these tools that you're

11:23 going to use, people actually use them is making them easy to use. Like you said, yeah,

11:28 you can just run Black or Ruff, but you have to remember to run Black or Ruff. And that is the

11:32 key problem. And what's so great about pre-commit or even extensions in your IDE is that these things

11:38 become automatic and that's what you need to get towards for these things to actually stick.

11:43 Yeah. To make them automatic and not part of it. And to some degree, continuous integration can do

11:49 those kinds of things. But a lot of times it's too late at that point. It's already checked in,

11:54 it's already committed. And then you've got the back and forth of now it's a diff, but it's only a diff

11:59 because they spaced it differently when they hit save in their IDE than when you hit save in yours and

12:04 all that. So pre-commit hooks run prior to actually leaving your computer, right?

12:11 Yeah. So it's actually prior to even the commit. So when you do git commit and you, let's say you pass

12:16 your message and if it's successful, you normally, you see the hash that gets generated. If you have

12:21 pre-commit hooks enabled, then if that, those checks don't pass, then that commit never gets created in

12:28 the first place. So you still have the files staged, but nothing has made it to the commit.

12:33 Yeah. That's great.

12:34 Yeah. I was just going to explain maybe a little bit about how they work if you're curious.

12:38 Yeah. Yeah. Well, let's start with just like, what even are, are these pre-commit hooks?

12:42 Yeah. So pre-commit hooks, and I think the naming is, is quite overloaded and that leads to a lot of confusion.

12:49 So at the lowest level, a Git repository in general supports a hooks system. So there's a variety of

12:57 different types of actions for which Git will trigger a script on your behalf. And one of those actions

13:03 is pre-commit. So as I described before, as you run git commit, this gets triggered. Another thing

13:08 might be pushing. You can have Git wired to run some script when you push. Now that is Git's version

13:15 of a pre-commit hook, singular hook, because you can only have a single file run; a single executable

13:21 can run.

13:21 Right. If you go to your, it's in the .git folder, there's a hooks subfolder and it's got little

13:27 samples for all the different lifecycle things, right?

13:30 Yeah. And yeah, they provide some. They have to, like I said, be executable, but they can

13:35 be in any language that you have available on the machine. And so Git provides some examples. I do think

13:41 there are a few stages that don't have examples, but it's basically you take the name of the stage

13:46 that you're going to use and that's the name of the file. And that has to be an executable and Git

13:51 will run it at the designated moment.

13:53 Okay. So it could be a Python executable or it could be a Go executable or whatever, but it's just

14:00 one, right?

14:00 It's just one. Yeah. Cause it has to be named. And like in the case of pre-commit, it has to be called

14:05 pre-commit, nothing else.
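
As a rough illustration of what such a single hook file could contain, here is a minimal sketch of a hand-written `.git/hooks/pre-commit` script in Python. The trailing-whitespace rule and all function names are assumptions picked for the example; for simplicity it reads the working-tree copy of each file rather than the exact staged content.

```python
#!/usr/bin/env python3
"""Hypothetical contents of .git/hooks/pre-commit (the file must be executable)."""
import subprocess


def lines_with_trailing_whitespace(text: str) -> list[int]:
    """Return 1-based numbers of lines that end in spaces or tabs."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip(" \t")
    ]


def staged_files() -> list[str]:
    """Ask Git which files are added, copied, or modified in the index."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def main() -> int:
    failed = False
    for path in staged_files():
        try:
            with open(path, encoding="utf-8") as handle:
                text = handle.read()
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number in lines_with_trailing_whitespace(text):
            print(f"{path}:{number}: trailing whitespace")
            failed = True
    # A non-zero exit status makes Git abandon the commit.
    return 1 if failed else 0


# When saved as the hook file, end the script with:
#     raise SystemExit(main())
```

Git decides whether to create the commit purely from the script's exit status, which is why the language of the script does not matter.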

14:07 This portion of Talk Python to Me is brought to you by Sentry. Code breaks. It's a fact of life.

14:12 With Sentry, you can fix it faster. As I've told you all before, we use Sentry on many of our apps

14:18 and APIs here at Talk Python. I recently used Sentry to help me track down one of the weirdest bugs I've

14:24 run into in a long time. Here's what happened. When signing up for our mailing list, it would crash

14:30 under non-common execution paths, like situations where someone was already subscribed or entered an

14:36 invalid email address or something like this. The bizarre part was that our logging of that

14:42 unusual condition itself was crashing. How is it possible for our log to crash? It's basically a

14:49 glorified print statement. Well, Sentry to the rescue. I'm looking at the crash report right now,

14:54 and I see way more information than you'd expect to find in any log statement. And because it's

14:59 production, debuggers are out of the question. I see the traceback, of course, but also the browser version,

15:06 client OS, server OS, server OS version, whether it's production or QA, the email and name of the person

15:13 signing up. That's the person who actually experienced the crash. Dictionaries of data on the call stack and so much

15:18 more. What was the problem? I initialized the logger with the string info for the level rather than the

15:25 enumeration.info, which was an integer-based enum. So the logging statement would crash, saying that I could not use

15:33 less than or equal to between strings and ints. Crazy town. But with Sentry, I captured it,

15:40 fixed it, and I even helped the user who experienced that crash. Don't fly blind. Fix code faster with

15:46 Sentry. Create your Sentry account now at talkpython.fm/Sentry. And if you sign up with the code

15:52 TALKPYTHON, all capital, no spaces, it's good for two free months of Sentry's business plan, which will give you up to

16:00 20 times as many monthly events as well as other features.
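
The kind of bug described in this segment, a string level compared against an integer-based enum, can be reproduced with a few lines. This is a hypothetical stand-in, not the actual Talk Python logging code; the `Level` enum and `should_emit` function are invented for the illustration.

```python
from enum import IntEnum


class Level(IntEnum):  # hypothetical stand-in for the logger's level enum
    DEBUG = 10
    INFO = 20
    ERROR = 40


def should_emit(configured_level, message_level: Level) -> bool:
    """Emit a message only when it meets the configured threshold."""
    return configured_level <= message_level


print(should_emit(Level.INFO, Level.ERROR))  # enum vs enum compares as integers

try:
    should_emit("info", Level.ERROR)  # string vs enum cannot be ordered
except TypeError as error:
    print(f"TypeError: {error}")
```

The failure only appears on the code path that actually performs the comparison, which matches the description of a crash limited to uncommon execution paths.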

16:04 So if you want to run more, you basically have to, potentially, write a program which then itself

16:10 figures out all the things to do and then delegates to running them. Like if you want to run Ruff to

16:16 fix formatting issues and you want to run the checker/fixer for NumPy-style doc strings and all those

16:23 things, you'd have to write a sort of orchestrating program for that, right?

16:27 Yeah, it's almost like you're writing in the case of like a bash script, like a giant bash script where you have

16:32 to decide, you know, do you fail early? How do you like and check, do I run this one and then this one? And then

16:38 even worse, you're actually, in that case, you're probably running everything, you know, you're running everything

16:43 sequentially. And if you don't do it carefully, then you know, maybe, maybe you want to fail early, maybe you don't.

16:49 So that becomes very, very challenging to configure and also to share because the thing about that file is that is not

16:55 included in version control. So that would be something that you would maybe have to store somewhere else and then do a

17:01 symbolic link. And then that becomes already a lot trickier for everyone to manage.
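
A hand-rolled orchestrator of the kind being described might look something like this sketch. The `run_checks` helper and the `CHECKS` list are assumptions for illustration; the two Ruff commands are just placeholders for whatever tools a team would actually run.

```python
import subprocess
import sys


def run_checks(commands: list[list[str]], fail_fast: bool = False) -> int:
    """Run each command in order; return 0 only if every command succeeded."""
    exit_code = 0
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            exit_code = 1
            if fail_fast:
                break  # stop at the first failure instead of running the rest
    return exit_code


# Hypothetical tool list; every check runs sequentially, one after another.
CHECKS = [
    ["ruff", "check", "--fix", "."],
    ["ruff", "format", "."],
]

# When used as a hook, the script would end with:
#     sys.exit(run_checks(CHECKS, fail_fast=False))
```

Even this small version already forces the decisions mentioned above: ordering, whether to fail early, and how to get the same script onto every teammate's machine.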

17:05 Yeah, I was just doing that last night and that's an AI question. I don't remember how to do that.

17:09 I know you can do it. It's not that hard. It involves ln, but you know, ChatGPT, what do I do exactly?

17:16 ln -s. I've had to do that quite a bit.

17:19 It's burned into the brain, huh?
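
For reference, the symbolic-link approach mentioned here (the equivalent of `ln -s`) could be sketched in Python as below. The `link_hook` helper and the idea of keeping the script in a tracked `hooks/` folder are assumptions for the example, not something prescribed in the episode.

```python
from pathlib import Path


def link_hook(repo_root: Path, tracked_hook: Path, hook_name: str = "pre-commit") -> Path:
    """Symlink .git/hooks/<hook_name> to a script kept under version control."""
    target = repo_root / ".git" / "hooks" / hook_name
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() or target.exists():
        target.unlink()  # replace any hook that is already installed
    # The tracked script itself still needs its executable bit set.
    target.symlink_to(tracked_hook.resolve())
    return target
```

Each contributor would still have to run something like this once per clone, which is part of why this route gets harder to manage as a team grows.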

17:22 So one of the things that you recommend so we don't have to build this orchestration piece is actually pre-commit,

17:31 which is a Python project, right?

17:32 Yes. And it's not the only one. So again, that's where like the naming becomes challenging.

17:37 But pre-commit is built in Python, but it can run hooks in a variety of languages.

17:42 And it interfaces with Git's hook system for you. So it creates that executable and plants it there.

17:48 But that executable is then pointing back to pre-commit so that you can just define a simple YAML file like you can see part of it on the screen right now.

17:56 And it becomes very easy because essentially you're just configuring what you want to run.

18:01 You're not actually coding the logic of the checks and how they relate to each other.

18:05 Right. So let's assume that all the pre-commit hooks that you want to run somehow exist out there in the world, right?

18:12 You don't have to create them for the moment.

18:14 So what you can do with pre-commit is you can set up a TOML file.

18:20 I always get those crisscrossed.

18:22 A YAML file, a .pre-commit-config.yaml file, which then has a bunch of listings of here's a Git repository.

18:30 And if you install it as a Python package, here's a bunch of things that you can run on it, like check-toml, check-yaml and so on, right?

18:38 Well, it doesn't actually have to be a Python package, right?

18:40 So in that repo, and we're maybe jumping ahead, but there's a special file in that repo, which will tell pre-commit how it actually needs to install it.

18:48 So it could be anything.

18:49 Oh, that's interesting.

18:50 So the thing that integrates with the pre-commit project, it has to opt in in a sense in that it has to have a configuration file or a launch file or a setup file, something like that.

19:01 Yeah. So right now we're looking at .pre-commit-config.yaml.

19:04 There's also .pre-commit-hooks.yaml, and that one is kind of registering it with the pre-commit system.

19:09 So it tells pre-commit how to install it once it gets a hold of it.

19:13 And it also lists out these hooks that we see here under id, but that will be defined over there so that pre-commit knows, well, what is check-toml?

19:21 What is check-yaml?

19:22 Okay. Yeah.

19:24 That's really cool.

19:25 And you can have more than one of these repositories in there, right?

19:30 Correct.

19:31 Yeah. So the repos section is a list of repo sections, and then each repo then has other config, like the individual hooks that you want to run from that repo.
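
The two files discussed here can be sketched side by side. Both YAML fragments below are illustrative: the `rev` tag is an assumed pin, and the manifest excerpt is an abridged approximation of what a hook repository declares. The small `write_config` helper is hypothetical and only exists to make the sketch runnable.

```python
from pathlib import Path

# Consumer side: lives at the root of your own repository.
PRE_COMMIT_CONFIG = """\
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0  # assumed tag; pin whichever release you actually want
    hooks:
      - id: check-toml
      - id: check-yaml
  # further repositories are added as more items in this list
"""

# Provider side: an abridged, illustrative .pre-commit-hooks.yaml entry that
# tells pre-commit what the id "check-toml" means and how to install it.
HOOKS_MANIFEST_EXCERPT = """\
- id: check-toml
  name: check toml
  entry: check-toml
  language: python
  types: [toml]
"""


def write_config(repo_root: Path) -> Path:
    """Write the starter config where pre-commit looks for it."""
    path = repo_root / ".pre-commit-config.yaml"
    path.write_text(PRE_COMMIT_CONFIG, encoding="utf-8")
    return path
```

The `language` key in the manifest is what lets pre-commit install hooks that are not Python packages at all.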

19:42 Right, right.

19:43 So for the first example that you have in this, and this is your article, I guess, I don't know if I give this the proper announcement, but how to set up pre-commit hooks.

19:53 This is your, I perceive this as kind of your getting started article for this whole series.

19:57 I don't know if you see it that way.

19:58 Yeah.

19:59 Yeah, this was the first one.

20:01 I had gotten a lot of questions on how to do this.

20:04 And I think it's always interesting, especially when you think about, you know, speaking at conferences, I feel like, and which I do a lot of, and I feel like a lot of what gets more hits in that sense is like the advanced stuff, maybe more creating it.

20:15 But there's so much value in people just getting started and figuring out how do I even use this in the first place?

20:20 Because this saves you so much time.

20:23 So I really, this was where I got started for that reason.

20:25 I think a lot of people were able to benefit from this article.

20:28 Yeah, it seems like it.

20:30 I know it's fun to talk about the super advanced deep dive things, but most people, they just need to get started.

20:36 They just need some foundation, right?

20:37 And I think, I think that's actually where most of the benefit comes from, even though it is really fun to see some cool deep dive talk that people are going into, right?

20:47 So this next one is pretty interesting that we're adding here in this example, and that's ruff-pre-commit, straight from Astral, right?

20:57 So this is just github.com/astral-sh, which is the company behind Ruff and uv.

21:02 And this is ruff-pre-commit.

21:04 But what's interesting about this is, well, one, that it has nothing to do with the pre-commit project.

21:09 But two, that this one also takes special arguments that you can pass to it.

21:14 Yeah, so I think the ruff pre-commit one is just a smaller version so that it works faster with pre-commit.

21:20 Because pre-commit will have to install these at some point.

21:23 It will have a cache.

21:25 So if you don't change the version in this case, it will be able to reuse that.

21:28 But that first time, you do have a bit of a delay.

21:31 And that's not something you want.

21:33 It's something you have to be very careful of when you want to be using these.

21:36 And then the args thing is nice because you have a few options when you configure these tools, depending on what the tool supports.

21:43 In this case, Ruff supports, as I think we mentioned a little bit earlier, a configuration file.

21:47 So, for example, you could have stuff in your pyproject.toml.

21:50 But the key here is that maybe you're using Ruff in your IDE.

21:55 And maybe you don't want to do the same kind of changes that you want to do in pre-commit.

22:00 Maybe you want it to ask you if it's going to change something.

22:03 Whereas in the pre-commit stage, you definitely want it to be fixed.

22:06 So you can use the args here to provide stuff that you only want to happen when it's running in the context of pre-commit.

22:14 Yeah, and Ruff has an --exit-non-zero-on-fix, which means if it goes through and you say to fix it, it will fix it.

22:21 But then it'll error out and say that wasn't a smooth transition or whatever, which is cool because that will then fail the commit itself.

22:30 Correct.

22:30 Give you the modified files and say basically have a look.

22:34 See if you like it now, right?

22:36 Before it actually just ships it off.

22:37 That's so important because sometimes you realize there was some rule that you hadn't reviewed before.

22:43 That's not quite doing what I want and let me tweak my setup.

22:45 So it's nice to have that bit where you can verify what was actually changed is what you want.
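
The effect of those hook-only arguments can be sketched as the command the hook would roughly end up running. This is an approximation rather than the hook's exact invocation; `build_ruff_command` and `commit_may_proceed` are hypothetical helper names. In the config file, the same flags would sit under the hook as `args: [--fix, --exit-non-zero-on-fix, --show-fixes]`.

```python
import subprocess

# Flags supplied only in the pre-commit context, not in the editor setup.
RUFF_HOOK_ARGS = ["--fix", "--exit-non-zero-on-fix", "--show-fixes"]


def build_ruff_command(files: list[str]) -> list[str]:
    """Assemble the Ruff invocation for the files being committed."""
    return ["ruff", "check", *RUFF_HOOK_ARGS, *files]


def commit_may_proceed(files: list[str]) -> bool:
    """Run Ruff; a non-zero exit (issues remain, or fixes were applied) blocks the commit."""
    return subprocess.run(build_ruff_command(files)).returncode == 0


print(build_ruff_command(["app.py"]))
```

Because applied fixes also produce a non-zero exit, the modified files are left in the working tree for review instead of being committed silently.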

22:50 Yeah, I guess it's a little bit dangerous to just say change it and then commit it.

22:54 I've had people.

22:55 So I did a workshop on pre-commit both on setting it up and then making your own hooks at EuroPython this year.

23:03 And I did have a few people actually.

23:05 One very insistent asking me why wasn't there a hook or why don't they support just fixing it

23:12 and then automatically adding it and committing it on your behalf.

23:14 And to me, as a person who works in security, that just sounds very scary.

23:18 I don't want things doing that.

23:20 I want to see what is being changed and whether or not I agree with it or not.

23:23 Yeah.

23:24 Why doesn't it just go ahead and push it as well?

23:28 Come on.

23:28 Yeah.

23:28 Well, I think that was part of the suggestion.

23:30 I was like, I certainly don't want that running on my machine.

23:34 Yeah, it does skip out on some of the benefits of the multi-stage aspects of Git, I suppose.

23:39 But it is efficient.

23:40 You just get it done all at once.

23:41 That's pretty cool.

23:42 Yeah, but you don't know what else it's grabbing, which is the scary part.

23:44 No, of course not.

23:45 I know.

23:45 Super bad.

23:47 So this example that we're talking about here where we've got a pre-commit hook that we're grabbing

23:53 and then it takes these arguments, I think this is an interesting point of discussion.

23:57 So the example you have in your article just says, what we're going to tell Ruff is --fix,

24:01 --exit-non-zero-on-fix, and --show-fixes, which is all good.

24:07 But ruff can be pretty complex in its configuration, right?

24:11 You can say, disable Flake8, turn this one on.

24:14 These are warnings.

24:15 These are errors.

24:16 And there's a whole, you know, here's how many line columns I want and all of this stuff, right?

24:21 So you can either do this argument thing, or if it's supported, you could also potentially have,

24:27 say, a ruff.toml, right?

24:28 Yeah.

24:29 So I tend to want to minimize the amount of configuration files I have.

24:34 So in my case, I think below I talk about having it in the pyproject.toml.

24:38 Yeah, exactly.

24:38 So you just add a ruff section in there and then you configure things.

24:42 And this is stuff that you'd want to use both in your editor as well as in the pre-commit stage,

24:46 because you want them to agree.

24:47 And nothing worse than one telling you the line's too long and the other one like,

24:51 nope, that's good.

24:51 Go ahead.

24:52 Or put a space after the comma in parameters and then take away the space and put the space and take away the space.

24:59 Exactly.

24:59 You don't want them fighting.

25:00 You want them in agreement.

25:01 No, no, you don't.

25:02 So I suppose that's a massive bonus of having either the tool.ruff settings in your pyproject or just a ruff.toml,

25:09 however you go about that, it doesn't really matter.

25:11 Because then no matter how you're using Ruff via the pre-commit or for your project, it'll be the same thing, right?

25:17 Exactly.

25:17 Yeah.

25:18 Okay.

25:18 Yeah.

25:19 That's pretty awesome.

25:20 Now, I guess maybe we got a bit ahead of ourselves.

25:24 If I want to somehow install a pre-commit hook or pre-commit so that when I then give it one of these YAML files,

25:32 it'll go subsequently grab them and do the things.

25:34 How do you get started with that?

25:36 I think I need a rephrasing of that question.

25:39 Yeah.

25:40 Sorry.

25:40 So if I have just a plain GitHub repository and I want to have pre-commit manage the hooks for that repository,

25:49 like what do I do?

25:50 Okay.

25:51 So the first thing is you have to actually install pre-commit.

25:54 And that's not the command that's on the screen.

25:56 This is more of a pip install.

25:58 So make sure you have the Python library in place.

26:02 And then you need to have this configuration file.

26:05 At least one hook in there so that you have a valid file.

26:09 And then you can run pre-commit install.

26:12 And I omitted it here, but what I talk about in a different article, when you run this command, pre-commit actually tells you that it created the .git/hooks/pre-commit file.

26:21 And if you open that up, and I have an example on that other article, it's very simple and it's just calling pre-commit the tool itself.

26:28 So in all cases, you need to have it installed in your environment.

26:33 And a single time you run pre-commit install, which then does the wiring on the git side.

26:39 And this is something that everyone in your project has to run on any machine that they are using.

26:45 Because it's part of the repository itself, that file needs to exist there.

26:50 And that can only happen if you run this command.
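
The setup steps just described could be captured in a small bootstrap script along these lines. The `set_up_hooks` helper is hypothetical; the two commands themselves are the ones named in the conversation, and the script assumes a `.pre-commit-config.yaml` already exists in the repository.

```python
import subprocess
import sys

# Step 1 installs the tool into the current environment;
# step 2 writes .git/hooks/pre-commit for this clone of the repository.
SETUP_COMMANDS = [
    [sys.executable, "-m", "pip", "install", "pre-commit"],
    ["pre-commit", "install"],
]


def set_up_hooks(run=subprocess.check_call) -> None:
    """Run the one-time setup that every contributor needs on each machine."""
    for command in SETUP_COMMANDS:
        run(command)
```

Calling `set_up_hooks()` from the repository root performs both steps; the `run` parameter exists only so the sketch can be exercised without touching a real environment.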

26:52 Yeah.

26:53 So there's a .pre-commit-config.yaml file.

26:56 That's what you put into GitHub at the root of your project or something like this.

27:01 But then to actually configure git itself, you've got to run this pre-commit space install.

27:07 And it basically wires up the hooks to make that happen, right?

27:11 Correct.

27:11 So yeah, when you run this, that file gets created on your behalf.

27:14 And then you don't have to worry about wiring that up.

27:17 And then it's transparent.

27:18 All you have to do is tweak your config and then the changes happen.

27:22 Nice.

27:23 I don't know if the naming, how much to believe the naming.

27:26 Can it do things other than pre-commit?

27:28 Yes.

27:28 Can it do pre-push and those kinds of things?

27:32 They don't support every single one.

27:35 But there are quite a few that they do support.

27:38 For example, I once configured an open source project with a pre-push because it was a slower

27:44 check.

27:45 And that's something you definitely don't want running on each commit.

27:48 But it might be something where you want to make sure when you push the files that you've

27:51 addressed something that's maybe a little bit longer.

27:54 And that is really not any different than configuring with the .pre-commit-config.yaml.

28:00 There's just a separate item that goes in there that says which stage to run.

28:03 By default, it's pre-commit.

28:04 So you don't see it.

28:06 But if you needed to change it, you can.

28:07 Yeah.

28:07 I figured that was the case.

28:08 But I'd never tried.

28:09 And given that it's named pre-commit, you know, it's kind of named after one of the hooks,

28:13 right?

28:14 But of course.

28:15 I think it's probably named after the most useful one.

28:17 I would.

28:18 Yeah, I would think so.

28:19 I think a very popular example would perhaps be the commit message hook.

28:25 So there's a lot of tools that work on, you know, making sure your commits are following

28:30 a certain standard.

28:30 I think one of them is called Commitizen.

28:32 And so that runs on, my guess is on the commit message hook.

28:36 Commitizen?

28:37 Yes.

28:38 Okay.

28:38 What is this Commitizen about?

28:39 I haven't heard of this.

28:40 I don't think their example uses that.

28:42 But I think they do have a pre-commit hook.

28:45 And I believe it works that way.

28:46 Yeah.

28:47 Yeah.

28:47 Interesting.

28:48 Okay.

28:48 What's this thing?

28:49 A release management tool for teams.

28:51 Yeah, sure.

28:52 That makes sense that you want to kind of be a little bit careful about what your commit

28:56 messages are.

28:57 Maybe you want to grab certain commit messages and add them to your changelog or something

29:01 like that, right?

29:01 Yeah.

29:02 I think there's been quite a bit of talk about this one at conferences I've been to lately.

29:07 I think it's gotten a lot of traction.

29:09 Yeah.

29:09 2.5 thousand GitHub stars.

29:11 That's pretty good.

29:12 I'll check it out.

29:13 This is news to me.

29:14 This portion of Talk Python to Me is brought to you by Bluehost.

29:18 Got ideas, but no idea how to build a website?

29:22 Get Bluehost.

29:23 With their AI design tool, you can quickly generate a high-quality, fast-loading WordPress

29:28 site instantly.

29:29 Once you've nailed the look, just hit enter and your site goes live.

29:33 It's really that simple.

29:34 And it doesn't matter whether you're a hobbyist, entrepreneur, or just starting your side hustle.

29:39 Bluehost has you covered with built-in marketing and e-commerce tools to help you grow and scale

29:44 your website for the long haul.

29:46 Since you're listening to my show, you probably know Python, but sometimes it's better to focus

29:50 on what you're creating rather than a custom-built website and add another month until you launch

29:55 your idea.

29:56 When you upgrade to Bluehost cloud, you get 100% uptime and 24-7 support to ensure your

30:02 site stays online through heavy traffic.

30:04 Bluehost really makes building your dream website easier than ever.

30:08 So what's stopping you?

30:09 You've already got the vision.

30:10 Make it real.

30:11 Visit talkpython.fm/Bluehost right now and get started today.

30:16 And thank you to Bluehost for supporting the show.

30:19 All right.

30:21 What other takeaways should we talk about in this first one?

30:23 I think we maybe have pretty much covered it.

30:26 Let's see.

30:26 I guess, you know, we mentioned before, but if people want to see sort of examples of pre-commit

30:32 hooks failing or succeeding or failing because they changed something, which is not exactly

30:37 a failure, but stopping and starting over, you have a nice example of what that's like

30:42 there.

30:42 So one thing that I guess might be useful is sometimes maybe you don't want to run the

30:49 pre-commit hooks.

30:50 Maybe you need to check in something in a certain way to fix things because the server's down, right?

30:57 We have to check this in.

30:58 I can't fix this hook, whatever this hook is upset about right now.

31:01 It needs to go in right away.

31:03 Just let me commit it, right?

31:05 You can do that.

31:05 I mean, I think there are probably several use cases for something like this.

31:10 Maybe you're going to be squashing things later, or

31:13 maybe you don't even know what the API for what you're doing is going to

31:17 look like.

31:17 It could be, and this kind of ties back to what we talked about earlier, perhaps Ruff's

31:22 doing something that you don't agree with, but you need to check with the rest of

31:25 your team to make sure that everyone's in agreement with, let's remove this rule.

31:29 Right.

31:29 So I definitely don't encourage always doing this.

31:34 That defeats the purpose, right?

31:35 But there is kind of a break glass solution here where, let's say you first run git

31:40 commit and something fails and it's not something that you either want to fix at the moment or

31:45 really can fix.

31:45 Then you can just pass in --no-verify.

31:48 And none of the checks run at that point.

31:51 So it's like, as if the checks were never there in the first place.

31:54 Right.

31:55 Right.

31:55 Right.

31:55 Okay.

31:55 That's pretty interesting.

31:56 Like you say, hopefully people don't run that all the time.

32:00 At that point, just remove the pre-commit setup, save yourself.

32:03 Yeah.

32:03 Like what are you, what are you even doing?

32:05 Right.

32:05 I suppose there's an interesting interplay between pre-commit hooks and continuous integration,

32:11 right?

32:12 Like in a sense, they are often checking some of the same things.

32:16 What do you think?

32:17 So I think it's probably not quite a Venn diagram.

32:22 The circle for pre-commit is probably entirely contained within the circle for CI/CD.

32:29 The difference is there are certain things where you can get immediate feedback, quick

32:33 feedback locally, and that should be something that you can put in pre-commit, things like linting,

32:37 formatting, et cetera.

32:38 And then CICD may be running your test suite.

32:42 That's definitely not something you want to be doing in a commit.

32:44 Imagine you have a test suite that takes three minutes to run, even maybe three minutes isn't

32:48 that bad, but every commit waiting three minutes is definitely not something you want to do.

32:52 No.

32:53 But it's still a check that you should definitely be running.

32:55 So in CICD, I would run everything.

32:57 Do the linting, do the formatting.

32:58 That's your final, that's your last layer of defense and you need to be checking everything.

33:03 And this just allows developers to get that feedback sooner.

33:06 Right.

33:07 So what you're actually checking in and you finally approve is much closer to what CICD

33:12 would kind of want in the first place, right?

33:14 Yeah.

33:14 Yeah.

33:14 Okay.

33:15 And it's also a much faster feedback, right?

33:17 So like if the thing has to run all the way through the linting, the formatting, the testing,

33:20 the type checking, whatever, you might be waiting 10, 15 minutes for all the things to run when

33:25 you could have had, you know, under a minute, hopefully way under a minute feedback instantly that

33:30 your file wasn't formatted correctly.

33:31 It should be near instantaneous, right?

33:34 I mean, instant maybe is asking too much, but some of that Astral stuff is kind of ridiculous.

33:40 Yeah.

33:41 I think you have to be very careful, right?

33:43 Because there's all these checks and I think you had up on the screen maybe earlier, like

33:47 the pre-commit hooks, the general ones provided by the pre-commit organization.

33:53 Yeah.

33:53 There's tons of things in there, but you do have to be careful, right?

33:56 Because if you're like, oh, this could be good and this could be good and this could

33:59 be good.

33:59 Each check is adding time.

34:02 Assuming, like I say, assuming they're all running on Python files, you're adding time

34:05 to how long each commit takes.

34:06 So you do have to be mindful of what you actually need.

34:09 And if you go to the point where you end up making the whole process take too long, people

34:15 are going to stop using it.

34:16 And then that defeats the...

34:17 Yeah.

34:17 Yeah, exactly.

34:18 As soon as it becomes a point where people go, I'm not using this thing, then you're kind

34:22 of lost unless you can just say, no, you have to use it.

34:25 But then you just have unhappy teammates.

34:27 Exactly.

34:28 Either way, it's not a real great outcome, is it?

34:29 I mean, if there's something that maybe only runs on a few files every once in a while, then

34:35 if you are having problems with speed, then you can also consider moving that to the CICD.

34:39 And I am definitely a big fan of Ruff, as you said. Just switching from Black, Flake8,

34:45 all that onto Ruff, you do save a significant amount of time on these checks and it's a huge

34:50 benefit.

34:50 Yeah, it's pretty ridiculous.

34:51 Now, this is not a git pre-commit thing.

34:54 This is a pre-commit the project thing.

34:57 But you can, if you're using this pre-commit project we've been talking about, you can say

35:01 pre-commit run and do kind of a test without actually doing a commit, right?

35:07 Correct.

35:07 Yeah.

35:07 So there's a bit of nuance.

35:09 So if you just do pre-commit run, it's going to run all of your hooks, but on the staged

35:14 changes, because it's thinking essentially you're doing like a dry run.

35:17 If you, let's say, are adding a new hook and you want to make sure all of your files are

35:22 compatible with that new hook, then you might want to do something like pre-commit run

35:26 --all-files.

35:27 So look through your entire repository, regardless of whether you have changes in place.

35:31 So if you say pre-commit run, it only works on your, basically your changed files, not the

35:37 stuff that's already there and accepted.

35:38 Correct.

35:39 And another neat thing is in the case I mentioned where you add a new hook, you might just want

35:44 to run that hook.

35:44 So you can say pre-commit run and then the hook ID, and then you would just run that hook

35:49 and then you can define either a certain set of files or the staged changes, whatever.
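
The variations just described (all hooks on staged changes, every file in the repository, or a single hook on chosen files) can be sketched as a small Python helper that builds the command line. This is only an illustration: the function name build_run_command and the hook ID my-hook are hypothetical, while run, --all-files, and --files are options of the pre-commit command-line tool.

```python
from typing import List, Optional, Sequence


def build_run_command(
    hook_id: Optional[str] = None,
    all_files: bool = False,
    files: Sequence[str] = (),
) -> List[str]:
    """Build an argument list for `pre-commit run`.

    With no arguments, pre-commit runs every hook on the staged changes.
    """
    command = ["pre-commit", "run"]
    if hook_id:
        # Limit the run to one hook, e.g. a hook that was just added.
        command.append(hook_id)
    if all_files:
        # Check the whole repository, not only what is staged.
        command.append("--all-files")
    elif files:
        # Check an explicit set of files instead.
        command.extend(["--files", *files])
    return command
```

The resulting list could be handed to subprocess.run, for example subprocess.run(build_run_command("my-hook", all_files=True)), to try a newly added hook against the entire repository before anyone commits with it.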

35:52 Yeah.

35:53 That sounds pretty useful when you're building your own pre-commit hook, right?

35:56 So yeah, depending on how you build it, you can either use that or they also have a

36:01 try-repo command.

36:03 Right.

36:03 Got it.

36:04 Got it.

36:04 Well, let's see.

36:06 Maybe we could jump over and talk a bit through your hook creation guide, a step-by-step guide

36:12 to developing your own pre-commit hook.

36:14 I thought this was really, like I said, a good article.

36:17 And maybe one of the first things we talk about is just what makes a good hook in the first

36:23 place, right?

36:24 You said that they can't be too long or people will go crazy and turn them off or skip them

36:29 or whatever.

36:30 But what else?

36:31 So I think another big thing is if you're able to fix something, then you should fix it.

36:36 In the case of formatting and you're saying, oh, this should have a trailing comma, then

36:41 that's easy enough.

36:42 You can add the trailing comma.

36:43 You don't make more work for the user.

36:44 If you can't do that, then you should be very specific saying this file.

36:48 And if you have a line number saying exactly where it is, because just saying there's something

36:53 wrong in this file and someone has to hunt it is also not a good user experience.
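
As a rough illustration of being specific about the file and the line, here is a minimal Python sketch of a check that reports each problem as path:line: message. The rule it enforces (no trailing whitespace) and the function name are stand-ins chosen for the example, not something taken from the article being discussed.

```python
from typing import List


def find_trailing_whitespace(path: str) -> List[str]:
    """Return one 'path:line: message' entry per offending line."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            content = line.rstrip("\n")
            if content != content.rstrip():
                # Point at the exact location so nobody has to hunt for it.
                problems.append(f"{path}:{line_number}: trailing whitespace")
    return problems
```

Printing these entries gives the person committing a precise place to look, rather than a bare statement that something is wrong somewhere in the file.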

36:57 No, that's going to get frustrating super, super quick.

37:00 Yeah.

37:01 So be really descriptive about it.

37:03 And then also, maybe choose not to make it a pre-commit hook, right?

37:06 Not necessarily everything needs to run on every commit.

37:09 Yeah, I think that the speed thing is a huge factor.

37:12 And in general, I think one big thing that is key to note here is this:

37:18 let's say you change a Python file, a Markdown file

37:23 and an image file.

37:24 If you're making a hook that only runs on a certain type of file, if you're careful and

37:30 specify that, then it's not necessarily a bad thing to include that in there because it will

37:34 only get triggered on those certain types of files.

37:36 And so an example I have is the EXIF stripper,

37:40 which I created when I was building my website.

37:44 Your EXIF stripper is super interesting.

37:47 I'm starting to think maybe I want this as well.

37:48 Yeah, I was just very paranoid at one point about just working with images.

37:53 And so they come with, what's up here?

37:57 So exchangeable image file format data, or EXIF as it's commonly called.

38:02 It's metadata that is in the image that you might not realize is there.

38:06 And so in this article, I talk about a picture of me presenting that I was given from a conference.

38:12 And this was something that was stored, I think, in a Google Drive.

38:15 So you have access to all the metadata that was available.

38:18 So I never met the photographer.

38:20 And yet I know the photographer's name, the camera they use, what type of computer they have,

38:24 how they edited it, all kinds of information.

38:27 And the dangerous part is the exact location of where this was.

38:31 Now, conference, not a big deal.

38:33 But you have to think about maybe you're blogging about something you did in your house or your apartment.

38:38 And now you have a photo up on your website where anyone can potentially see it that has the GPS coordinates for where you live.

38:47 Yeah, that wouldn't be great, no.

38:48 So I was very paranoid about this.

38:50 And I don't like the idea of, oh, I'm going to add a new image.

38:54 Let me go through my checklist of what I need to do because I know at some point I'm going to mess something up or forget it.

39:00 And so this is a perfect use case for the pre-commit, right?

39:03 Because you want something that is going to stop you and tell you, nope, you can't do this, right?

39:08 And in this case, it can also remove the metadata because I am being super conservative and saying no metadata,

39:14 which has the nice side benefit of shrinking files, which is good for serving them.

39:19 Yeah.

39:20 Well, what value is it to have all that metadata in there for a blog?

39:26 Most of the time, most people are not, they just want to see, they want to read the blog.

39:29 They're not going to dissect your image, right?

39:31 I think it depends what you, I mean, maybe you have a travel blog and you want to know like, here's that location.

39:36 And then you have one off post where you introduce yourself and oops, you know?

39:40 Yeah.

39:41 There's so many ways.

39:42 And I think even just thinking, oh, I'm only going to be doing this.

39:46 There's always going to be something that later on happens.

39:48 So you have to be very careful just upfront that everything is going to go through this track.

39:53 Sure.

39:54 Can your EXIF thing, can it be selective about the metadata?

39:58 That's something I do want to do in the future.

40:00 Just remove the location if you say.

40:03 But the thing is, there's like, looking through all of that, it's hard to tell if there might be something in one subset of images you take that might be sensitive.

40:11 You can even think of certain situations where you might not want someone to know what kind of device you were using.

40:16 Right.

40:16 Because maybe they're like, oh, that device is vulnerable to something and I know they have it.

40:20 Right.

40:21 The worst of these, and I think it's happened multiple times, I'm pretty sure it was Samsung, but one of the Android companies posted a picture promoting the new phone.

40:34 And, you know, the EXIF information had the picture as being from an iPhone or something like that.

40:38 Oh, no, it was the other way around, I think.

40:40 Oh, the other way around.

40:41 I think I remember hearing that, yeah.

40:41 Well, it was like one phone company was posting it from, but the picture was actually, even though it was about the phone, it was, you know, implying this picture comes from or something.

40:49 It was like, nope.

40:50 Whoever is on the marketing team just happens to have the other kind of phone and there it goes.

40:54 Right.

40:54 And it's a huge scandal.

40:55 I mean, for those companies that talk about how awesome they are, how much better their cameras are or whatever.

41:00 Well, I see that's also the thing, right?

41:01 Because you never know who's going to look at the metadata either.

41:03 So, and it's interesting because certain things will, certain platforms will remove it.

41:09 So, like I mentioned with Google Drive, everything is preserved.

41:12 But the thing is, you have to know ahead of time.

41:15 So you'd have to say, I'm planning to put this image here.

41:18 Let me upload a dummy image

41:20 I don't care about and check if the metadata is still there.

41:23 Yeah, exactly.

41:24 Yeah.

41:25 I think, I think Mastodon might remove it.

41:27 There's some certain platforms that will take away that metadata.

41:30 I think Facebook might.

41:31 It's been a long time.

41:33 I mean, it's a huge security concern.

41:35 So I imagine more and more places are, but I just wanted to have an abundance of caution and not risk anything happening.

41:41 Well, yeah.

41:42 And you're putting it on the internet as well, which there's, it goes straight from your computer through some sort of static website process.

41:49 And then it's downloaded, right?

41:50 There's nothing in between those two steps.

41:53 Exactly.

41:53 At least not in terms of image processing.

41:55 Yeah.

41:55 Yeah.

41:56 Cool.

41:57 Yeah, this is nice.

41:58 I'm thinking about grabbing it and trying it out.

42:01 What file types does it work on?

42:03 Does it work on just JPEGs or does it do like WebP and all that?

42:07 Any image, anything that's classified as an image on pre-commit, the way pre-commit runs.

42:12 And it has to work with, I'm using Pillow.

42:15 So if Pillow can't read it, then it's not going to work.

42:17 Right.

42:18 Then I'll just skip over it or whatever.

42:20 Yeah.

42:21 Yeah.

42:21 So really quick, while we're talking about stuff on your website, your website's super nice.

42:26 Did you build this yourself?

42:28 Like, how is this thing built?

42:29 I did.

42:29 I did build it myself.

42:32 I took a couple months in the beginning of the year and I had before a single page where

42:38 it was just like some boxes.

42:39 And then I was like, this needs to be revisited.

42:42 So it's built with Next.js and so React and TypeScript.

42:47 And then I use Tailwind CSS.

42:50 And yeah, it was kind of just like, I mean, a lot of these things are for me because sometimes,

42:54 you know, I like seeing all in one place where I'm speaking next or like stats about where

43:00 I've spoken, like a map and stuff.

43:02 And I went through, so kind of my process would be, you know, on my iPad, I would sketch out

43:08 what I kind of envisioned a page looking like and then I would prototype it in React and

43:13 see, okay, maybe this doesn't fully work, or tweak things and iterate a few times,

43:17 and bit by bit the pages formed.

43:20 The latest thing I added was this timeline functionality.

43:24 At EuroPython this year, I had this idea for a timeline and I kind of got really, really into

43:31 it.

43:31 So it was funny.

43:31 I was at a Python conference.

43:32 I was doing tons of React.

43:34 But if you scroll down a tiny bit, there's actually too much.

43:38 This one, right?

43:39 Yeah, yeah.

43:39 Versus the little text.

43:41 Oh, the complete upcoming.

43:42 Yeah, I got you.

43:43 So I built this.

43:44 Oh, that's beautiful.

43:45 I love it.

43:45 It's like a little infographic of your upcoming events.

43:48 Yeah.

43:49 So I was like very inspired and I did this in a few days.

43:53 But it's nice because, you know, going from the sketch to the React components, it's become

43:59 very natural, which it takes a bit to get there.

44:03 But it was nice because I did have to learn TypeScript for some changes in my team.

44:08 We were going to be starting moving to TypeScript.

44:10 So this was great to work on something that, you know, fit in my head as far as what needed

44:15 to be done.

44:16 And it was very, very helpful.

44:17 But yeah, so I'm very proud of this.

44:20 There's still more, tons more to do.

44:22 I have massive lists.

44:24 But yeah, I remember looking at Google.

44:25 This is a nice static site.

44:26 Very cool.

44:27 And I didn't even see this feature.

44:28 This is great.

44:29 Broadvon out in the audience says fire emoji for it.

44:31 Very good.

44:32 Thank you.

44:33 And also, thanks.

44:36 I see you put the podcast appearance on here as well.

44:38 That's cool.

44:38 So that's happening today.

44:40 Watch the live stream now.

44:41 If you're not watching now, then you've probably missed it.

44:43 But the recording will be there, of course.

44:45 But the reason I say that is you maybe want to give a shout out to some of your upcoming

44:49 events.

44:50 Yeah, why not?

44:51 So I'm going to be in San Francisco next week talking about my Data Morph project.

44:57 And I'll also be doing a book signing there for my Hands-On Data Analysis with Pandas book,

45:02 second edition.

45:03 And then after that, I'm off to France to give a workshop on Pandas and then also talk about

45:09 getting started in open source contributions.

45:12 And then a couple of weeks after that, I will be at the final conference of the year in Australia.

45:18 And I will be talking about Data Morph once again.

45:21 And I'm hoping to run my third development sprint on Data Morph while I'm there.

45:26 Oh, that's cool.

45:27 Yeah, we'll talk about Data Morph in a second.

45:28 That's some interesting stuff.

45:30 But this is quite the agenda.

45:33 You got a full trip coming up.

45:34 No, I'm excited.

45:35 It's nice to see different cultures.

45:39 It definitely does land different, you know, the topics and just reactions.

45:43 Some people are over-the-top excited.

45:46 Some of them are just straight face.

45:48 You're like, I enjoy it.

45:49 I think it really comes into play as far as giving workshops.

45:54 I was in Portugal last week and I did the data analysis workshop.

46:00 And I think that was one of the best ones I've ever had.

46:03 It was very, very highly interactive and it was a really fun time for me.

46:07 And hopefully everyone else thought so as well.

46:09 Yeah, that's fantastic.

46:11 How did you get into public speaking?

46:12 Yeah, so I wrote the Hands-On Data Analysis with Pandas book in 2019.

46:20 And at that time, if you had told me, go do some public speaking, I'm like, please no.

46:25 You're going to France and Australia and Portugal recently.

46:29 So I'm like, no, no, no.

46:30 Yeah.

46:30 And then, well, during pandemic times, a conference reached out to me about doing a workshop on pandas

46:38 because I had written the book and doing it virtually.

46:41 And to me, that felt like a good stepping stone to get over that fear of public speaking and

46:47 the fact that it would be virtual.

46:48 I wouldn't really have to look at anyone.

46:50 And I was still absolutely terrified when it came to actually delivering that talk.

46:55 And when you think about it, it wasn't a talk, right?

46:57 So it was my first thing was a four-hour workshop.

47:01 And now I'm at the point where a virtual thing is much less desirable because it's so hard when

47:08 you can't see people. You can't see, are things landing, are they confused, are they with me?

47:12 Are they even still there?

47:14 So, and then after I did, you know, I made it to the end and I was like, okay, that's

47:19 definitely something I want to work on and do it again.

47:22 So I did, I came up with a second workshop on data visualization.

47:26 And then I think I did two or three more virtual sessions.

47:31 And then it became that some conferences were now in person.

47:35 And I was like, okay, I think I should try this.

47:37 And again, it was still a long one.

47:40 It may have even been a six-hour session that time.

47:42 So it's like crazy, right?

47:43 And then I did that in person.

47:45 And I was like, okay, I survived.

47:47 And then it kind of just felt like something, if I kept doing it, I would get over it or

47:52 at least get to the point where, you know, I could do it without being terrified for a

47:56 month ahead of time.

47:57 Right.

47:57 And I am at that point now.

47:59 It is like, I enjoy doing it because I enjoy, I'm very passionate about knowledge sharing and

48:04 just teaching people and getting that interaction that, oh, people are really like getting value

48:09 out of this.

48:10 And that to me is very nice.

48:11 Yeah.

48:12 It's super rewarding.

48:13 So, but yeah, this is quite impressive.

48:15 So just, I got the sense you kind of got started pretty soon.

48:18 You said 2019.

48:19 So you haven't been doing it for that long.

48:21 And this is great.

48:22 So maybe, you know, since you brought it up, maybe we could talk a bit about your book as well.

48:27 I don't know what to say about this one.

48:29 Just that it exists and people should check it out.

48:33 It's giant.

48:33 It's giant.

48:34 As you can see, 788 pages.

48:36 Holy moly.

48:37 That is giant.

48:38 Yeah.

48:40 So this is the second edition.

48:42 If you scroll down, there's also the covers for the Korean and Chinese editions.

48:46 Oh, awesome.

48:47 And I do not read either of those, but I do have copies.

48:52 It's an act of faith to put your name on them.

48:53 You know what?

48:54 I've been told by people that read both of those languages that the name is not quite translated

48:59 correctly, but you know, I'll forget about that.

49:02 It's cool to have the copies.

49:03 Yeah.

49:04 So this book covers obviously pandas working through the basics of data analysis.

49:10 We also talk about data visualization.

49:13 And then there is a little bit towards the end about like actually applying this stuff

49:19 to use cases and also a little bit of machine learning.

49:21 Cool.

49:22 Yeah.

49:23 So I'll put a link in the show notes.

49:24 People can check it out if they would like to.

49:26 All right.

49:27 I feel like there's a few things.

49:28 We didn't make it very far in our creation guide.

49:31 So let's talk about the recipe.

49:33 All right.

49:34 What are the four steps?

49:35 At least Stefanie's recipe for a pre-commit hook.

49:39 Yeah.

49:39 This is definitely my recipe.

49:41 I mean, I think I've made two that are published ones and then obviously a few others

49:46 for trainings and explanation purposes.

49:48 And this, this is something that works well for me.

49:50 And I think makes sense as far as thinking about the pieces.

49:53 So the first thing, the hardest thing is actually to figure out what are you checking and how do

49:59 you actually code that up?

50:00 And if you want to do this in Python, this is just, okay, code your logic.

50:03 Yeah.

50:04 Right.

50:04 Yeah.

50:04 Well, and if it has a --fix, maybe that's even harder than just trying to

50:09 understand, right?

50:10 Because now you got to not break somebody's code or sorts of things like that.

50:13 Yeah.

50:13 But this would be where you start at the basic level, probably first, you know, find,

50:18 figure out, can you find the issue and show people where it is?

50:22 And then you can look into fixing it.

50:23 But yeah, you have to be very careful, especially if you're going to be touching things.

50:27 So I guess it's pretty straightforward, but the magic of Python is not just the language

50:32 and the standard library, but the 500,000 external packages, right?

50:37 There's probably a ton of external packages that understand code, check different things.

50:41 And you could, you can use those in your hook implementation, right?

50:44 Just like a standard Python package, it can have dependencies and stuff.

50:47 Yes.

50:48 And so I talk about this in the third step, but I do like to make it as a package just

50:54 because you know that that's going to work and grab the dependencies as long as you follow

50:57 what you already know.

50:59 And you will tell pre-commit in the fourth step, in that .pre-commit-hooks.yaml

51:04 file, how it should be installed.

51:06 So when you say this is, this is Python, then it will know, okay, so I should be using, for

51:11 example, pip to install this.

51:12 And if you have, for example, pyproject.toml and you specify how it should be built, then

51:16 all of that just happens as it normally would.

51:18 It's just that pre-commit is doing it instead of you.

51:20 Yeah.

51:20 Yeah.

51:21 That's kind of, instead of you doing a pip install -e . or whatever, it's

51:25 kind of figuring that out.

51:26 And I guess we haven't really talked too much about it, but when you pre-commit install, it

51:31 looks at the, this hooks YAML file and then it, it creates the environment and it downloads

51:36 all the packages the first time to kind of set it up.

51:39 Then it just runs over and over after that.

51:41 Right.

51:41 Yeah.

51:41 Unless you change something in your pre-commit config file, it won't need to rebuild the

51:47 environment for this.

51:48 So if you keep the same version, then it's kind of like you said.

51:51 I installed this version of the package.

51:52 And as long as you don't say you need to update the package and it's kind of like a virtual

51:56 environment.

51:56 Okay.

51:57 You already have that.

51:57 There's no need to.

51:58 Yeah.

51:59 Yeah.

51:59 Excellent.

51:59 So your recipe is one, design the check function; two, turn it into a CLI, which there's some interesting

52:07 stuff in that one as well.

52:08 That's.

52:08 And I think that's kind of where the --fix comment comes into play.

52:13 Right.

52:13 So your logic, that check function, you should be able to say this was successful.

52:18 This was not successful as in stop the commit.

52:21 And then the CLI provides a very easy way to plug into that.

52:26 Maybe you want to say --fix or dash dash, you know, leave this type of file alone,

52:31 whatever kind of modification you want to do.

52:33 You can expose that in a CLI.
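
The idea of a check function that reports success or failure, with an optional fix, might look like the following minimal Python sketch. The rule (files should end with a newline) and the name check_final_newline are stand-ins for illustration. The sketch assumes that a file modified by the hook should still count as a failure, matching the earlier point that a hook which changes something stops the commit so the work can be staged again.

```python
def check_final_newline(path: str, fix: bool = False) -> bool:
    """Return True when the file passes and the commit may proceed."""
    with open(path, "rb") as handle:
        data = handle.read()

    # Empty files and files that already end with a newline are fine.
    if not data or data.endswith(b"\n"):
        return True

    if fix:
        # Repair the file in place by appending the missing newline.
        with open(path, "ab") as handle:
            handle.write(b"\n")

    # Fixed or not, report failure so the commit stops and the
    # change can be reviewed and staged.
    return False
```

A CLI layer would then only need to map a --fix flag onto the fix parameter and turn a False result into a nonzero exit code.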

52:36 And that's also a quicker way to get started versus trying to, let's say,

52:42 find the pyproject.toml, read it in, parse out things.

52:46 That's all stuff that can come later once you figure out exactly how you want your tool to

52:51 be configured.

52:52 Yeah.

52:52 Especially if it just has one or two arguments, it might not be necessary to be too, too over

52:57 the top with all the configuration.

52:58 And then you make it installable.

53:00 Basically, like you said, make it a package and then create the pre-commit hooks.

53:05 Yeah.

53:05 Well, those are the steps.

53:06 So I think write the function, that's pretty straightforward.

53:09 You just, whatever you want it to do, you just write a function that does it.

53:12 You do have an example in here about checking for valid file names and snake cased file names.

53:19 So things like it can't be just one letter and it has to be snake cased and so on.

53:25 Right.
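
A minimal sketch of the kind of check function described here. The function name `is_valid_filename`, the regular expression, and the `min_length` parameter are illustrative assumptions, not the code from the article being discussed.

```python
import re
from pathlib import Path

# Assumed definition of "snake case": lowercase letters and digits,
# separated by single underscores, starting with a letter.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_valid_filename(filename: str, min_length: int = 2) -> bool:
    """Return True if the file's stem is snake_case and long enough."""
    stem = Path(filename).stem  # "src/data_loader.py" -> "data_loader"
    if len(stem) < min_length:
        return False  # rejects one-letter names such as "a.py"
    return SNAKE_CASE.fullmatch(stem) is not None
```

Note that this sketch would also reject names like `__init__.py`; a real hook would probably want to special-case those.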

53:25 But then to turn that into a CLI, there's a lot of options in Python these days, right?

53:31 You can use Click, you can use Typer, but if you want something built in, yeah, if you want something

53:37 built in, argparse is pretty straightforward, right?

53:39 Yeah.

53:40 And I think also, I mean, if you look at the pre-commit hooks repo provided by pre-commit org,

53:45 a lot of them, or maybe all of them are just using argparse.

53:49 Because for most hooks, all you'll need to say is, I have an argument parser and it accepts

53:54 file names.

53:55 And at that point you have this boilerplate that you can just copy and you don't even

53:58 need to worry about configuring multiple, you know, different arguments.

54:02 It doesn't have to be too advanced with like sub commands and all that kind of stuff necessarily.
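
The argparse boilerplate mentioned above might look roughly like this. It is a sketch under assumptions: the `main` function name, the `validate-filename` program name, and the simplified snake_case check are placeholders rather than the code from any particular hook.

```python
import argparse
import re
from pathlib import Path
from typing import Optional, Sequence

SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_snake_case(filename: str) -> bool:
    """Simplified stand-in for the real check function."""
    return SNAKE_CASE.fullmatch(Path(filename).stem) is not None


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse file names from the command line and check each one."""
    parser = argparse.ArgumentParser(prog="validate-filename")
    # pre-commit passes the staged file names as positional arguments.
    parser.add_argument("filenames", nargs="*", help="files to check")
    args = parser.parse_args(argv)

    failures = [name for name in args.filenames if not is_snake_case(name)]
    for name in failures:
        print(f"{name}: file name is not snake_case")

    # A non-zero return value tells pre-commit to stop the commit.
    return 1 if failures else 0
```

Returning an integer instead of calling `sys.exit` inside `main` keeps the function easy to test; an entry point that points at `main` uses the return value as the exit status.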

54:06 Yeah.

54:07 Yeah.

54:07 And then make it installable.

54:09 This is, you recommend a pyproject.toml, which yeah, for packages these days, that seems

54:15 pretty much the de facto standard, right?

54:17 Yeah.

54:18 And then what's nice is, yeah, you're using current things.

54:21 You're not relying on setup.py.

54:23 And also in there, there's a way to expose an entry point.

54:27 And that's line 24.

54:28 Yeah.

54:29 Yeah.

54:29 Yeah.

54:29 Yeah.

54:30 That's really nice.

54:30 I love entry points.

54:31 I think it's, I think they're massively underused in Python.

54:35 You know, people talk about how do I create a script that I can give it to somebody so they

54:40 can run something.

54:41 And that so often involves like, where is it?

54:44 Where is its associated files?

54:46 Where is its Python?

54:47 And where are its dependencies?

54:48 All of that stuff you, if you just create a package and it has an entry point, you can

54:52 pipx install it or uv tool install it.

54:55 Or, and now you just have all these commands and people don't have to mess with all the Python

54:58 stuff.

54:59 Even if you know how to do it, you don't necessarily want to do that all the time.

55:01 Right?

55:02 Yeah.

55:02 And then it's just easy to, you can kind of call it from anywhere at that point.

55:05 Yeah, exactly.

55:06 So in this example, you put a validate-filename command and you

55:12 just point to, you know, what module and then what function to call.

55:16 And that's the CLI.

55:17 Yeah.

55:17 That's really nice.

55:18 And then of course that function in there is built and backed with argparse.

55:22 So it kind of all comes full circle right there.

55:24 Yeah.

55:24 Yeah.

55:25 So it's like you, it's almost like you had created, you know, some command line utility,

55:29 like bash wise or something.

55:31 And you just have that available and it hooks into your CLI.

55:35 I also want to call out line 21, because we talked about dependencies, right?

55:39 So anything you put in there will automatically get grabbed when pre-commit installs.

55:44 So in this case, there's nothing.

55:46 And then in the case of the exif stripper I mentioned, like we need to install Pillow, right?

55:50 So this is how you can configure how pre-commit will grab everything.

55:55 And I also see it has, yeah, I see there's a requires Python version.

55:58 Does pre-commit help you get Python in any way?

56:01 Or is it just assume that there's a...

56:03 You need to have whatever languages you're relying on, you do need to have them installed

56:06 already.

56:07 Okay.

56:07 So in order for you to use this pre-commit hook on your machine, you'd have to have, for

56:12 example, Python 3.10, 3.11, 3.12, something like that installed, given that it says 3.10

56:16 or greater.

56:17 So for example, like if you saw some hook that sounded interesting, but it's written in Go

56:21 and you don't have Go on your computer, you have to figure that out first.

56:24 That's a no-go.

56:25 It's a no-go.
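
The pyproject.toml pieces discussed above (the entry point, the dependencies list, and the required Python version) could be sketched as the fragment below. The package name `my-hooks` and the module path `my_hooks.cli` are hypothetical, and the build-system table is omitted for brevity.

```toml
[project]
name = "my-hooks"              # hypothetical package name
version = "0.1.0"
requires-python = ">=3.10"     # Python itself must already be installed
dependencies = []              # anything listed here is installed along with the hook

[project.scripts]
# command name = "module:function"
validate-filename = "my_hooks.cli:main"
```

Once the package is installed, `validate-filename` is available as a command that calls the `main` function in the named module.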

56:28 All right.

56:28 Let's see.

56:28 Yeah.

56:30 And then the last thing to do is, you say, create the .pre-commit-hooks.yaml file.

56:36 And is this the thing that goes into your repo?

56:39 So when pre-commit sees it, it knows what to do?

56:42 Yeah.

56:42 So for example, in the exif stripper repo, this file exists.

56:47 So if someone uses exif stripper, they point to that repository.

56:51 And then when pre-commit goes and grabs it, it looks for this file, right?

56:54 And then the key things here, for one being language.

56:58 So language tells pre-commit, how does it try to install that?

57:02 So in this case, it says, oh, this is Python.

57:04 So then it knows, okay, pip.

57:06 The ID at the top, that's the name that you reference in the pre-commit config.

57:12 Like when you want to, like we saw check-toml, check-yaml in the beginning, those correspond

57:17 to entries in the .pre-commit-hooks.yaml of that repository that they were being referenced

57:23 from.

57:23 So pre-commit first finds this file, it can install, then it can see, oh, which

57:28 hook do you want?

57:29 Validate file name in this case.

57:30 And then how do I call this?

57:32 And that's entry.

57:33 And this is pointing to the entry point that we made, but it can be anything, right?

57:38 You could call Ruff and then add, you know, 20 different command line flags if you want.

57:43 And that can be your hook.

57:44 And that would be fine as well.

57:46 And what's very interesting here is it's optional, but it's the types one at the bottom.

57:51 So I talked before about exif stripper only running on images, right?

57:55 It'd be wasteful to have it look at toml and markdown, right?

57:58 If it's not going to do anything with it.

57:59 Can't find any EXIF information in the toml.

58:01 Yeah.

58:02 So this controls that.

58:04 So for example, this hook will only run on Python files.

58:08 And this logic, I'm blanking on the name of the tool that pre-commit uses to figure this

58:14 out.

58:14 But this is handled elsewhere.

58:15 So there's like certain names that you can use.

58:17 Right.

58:18 Some sort of category mapping over to these file extensions or these BOMs at the beginning

58:23 of the file or whatever mean that it's this thing.

58:26 Exactly.
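
Putting the fields just described together, a .pre-commit-hooks.yaml entry might look something like this. It is a sketch: the `name` wording is an assumption, and `entry` assumes the package exposes a `validate-filename` command.

```yaml
# .pre-commit-hooks.yaml (lives in the repository that provides the hook)
- id: validate-filename        # the name users reference in their pre-commit config
  name: Validate file names    # human-readable label (assumed wording)
  entry: validate-filename     # the command to run; here, the package's entry point
  language: python             # tells pre-commit to install the repo as a Python package
  types: [python]              # optional: only pass Python files to the hook
```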

58:26 There is a very dangerous thing with this, and that is that types is an AND.

58:31 So if you say, if you wanted to do like this should run on Python and markdown, you can't

58:35 use this because it will look for files that are both Python and markdown and will not end

58:41 well.

58:41 Not too many of those exist.

58:43 Yeah.

58:43 There's a separate types_or that you have to use.

58:46 That's like a little gotcha.

58:47 It's like an ORM sort of instead of a SQL statement.

58:51 Kind of you got to.

58:52 Yeah.

58:53 Those things always get weird.

58:54 Like import the or operator.

58:56 Like, okay.

58:57 Yeah.

58:58 Cool.

58:59 Okay.

58:59 That's actually that that is very good to know because it looks like a list of options.

59:03 It is.

59:05 Yeah.

59:05 But they combine.

59:05 So you might have something like it is a file and it's Python.

59:08 That might be one thing I've seen.

59:10 Right.

59:10 Okay.

59:11 Yeah.

59:12 Cool.
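
The AND versus OR behavior can be modeled with plain sets. This is a simplified illustration, not pre-commit's actual code: the `matches` function and the hand-written tag sets are assumptions made for the example.

```python
# Hand-written tag sets standing in for what pre-commit detects per file.
PY_FILE = {"file", "text", "python"}
MD_FILE = {"file", "text", "markdown"}


def matches(file_tags, types=(), types_or=()):
    """Model of the filter: every tag in `types` AND at least one in `types_or`."""
    tags = set(file_tags)
    has_all = set(types) <= tags                          # types: all must match
    has_any = not types_or or bool(tags & set(types_or))  # types_or: any may match
    return has_all and has_any


# "It is a file and it's Python" works because both tags apply to one file.
print(matches(PY_FILE, types=["file", "python"]))

# No file is both Python and Markdown, so this filter selects nothing.
print(matches(PY_FILE, types=["python", "markdown"]))
print(matches(MD_FILE, types=["python", "markdown"]))

# types_or selects files that are Python or Markdown.
print(matches(PY_FILE, types_or=["python", "markdown"]))
print(matches(MD_FILE, types_or=["python", "markdown"]))
```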

59:12 So if I wanted to have more than one hook, I could put it into one.

59:16 I could have more than one here.

59:18 Is that possible?

59:18 Yeah.

59:19 So this looks like a list.

59:20 Yeah, exactly.

59:21 It's structured as a YAML list.

59:22 So you just kind of could copy that block, paste the new one, and then just change whatever

59:27 fields you want.

59:28 And then that's now the second hook that you expose.

59:31 Right.

59:31 And working backwards, I suppose you just expose a different entry point potentially and then

59:36 just call it out or whatever you want.

59:38 Well, I mean, you could like maybe you have a validate file name and maybe you have another

59:41 one that's like validate long file names or something where you're like, now they have

59:44 to be this long.

59:45 And then it's just a shortcut for something else.

59:47 So it doesn't have to be a different thing.

59:49 Oh, yeah.

59:49 You just put an argument in there as a default kind of for people.

59:52 So we talked about args earlier and that was something the user could tweak.

59:57 Anything you put in here is essentially like it will always run with these.

01:00:01 So you could bake in certain things that have to happen.

01:00:04 Yeah.

01:00:04 Awesome.
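
Exposing a second hook that is just a shortcut with a baked-in argument could be sketched like this. The second hook's id and the `--min-length` flag are hypothetical; they assume the CLI accepts such an option.

```yaml
- id: validate-filename
  name: Validate file names
  entry: validate-filename
  language: python
  types: [python]

# Same entry point, but always run with a stricter default.
- id: validate-long-filename       # hypothetical second hook
  name: Validate long file names
  entry: validate-filename
  args: [--min-length, "10"]       # hypothetical flag baked in as the default
  language: python
  types: [python]
```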

01:00:05 I love it.

01:00:06 Okay.

01:00:06 We're pretty much out of time, but let's talk about one final thing.

01:00:12 Not this one.

01:00:13 Your Data Morph project.

01:00:15 Give a quick shout out to that before we wrap things up.

01:00:19 What do you think?

01:00:19 Sure.

01:00:19 So this project started related to the pandas workshop I had mentioned.

01:00:26 I wanted to have a visual to really drive home the point that we needed to visualize our

01:00:31 data because pandas is very much data wrangling.

01:00:35 And after talking to people for two hours about data wrangling and statistics you can calculate

01:00:40 on tabular data.

01:00:41 Some people just feel like, oh, okay, we're done.

01:00:44 I mean, you know, we're done.

01:00:45 And that's definitely not the case.

01:00:47 And I was thinking about, and you had it on the screen before, the Datasaurus Dozen.

01:00:52 So yeah.

01:00:53 So there was research in 2017 by Autodesk where they took the idea of Anscombe's Quartet, which

01:01:01 is, sorry, just a little bit above that, which is just a set of four, yeah, four data sets.

01:01:08 They share the same summary statistics.

01:01:10 So the mean in X and Y, the standard deviation in X and Y, and the Pearson correlation coefficient.

01:01:16 And they look very different.

01:01:17 And if you think of, naively, you think, well, I know the average and maybe how spread out

01:01:24 things are.

01:01:25 So I can kind of get a sense of what this data probably means.

01:01:28 But in reality, outliers and other weird things could just completely blow up those ideas,

01:01:33 right?

01:01:34 Yeah.
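
To make the point concrete, here is a small sketch that computes those summary statistics for the first two data sets of Anscombe's Quartet using only the standard library. The helper names `pearson` and `summary` are assumptions for the example.

```python
import math
import statistics

# First two data sets of Anscombe's Quartet (they share the same x values).
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return num / den


def summary(xs, ys, ndigits=2):
    """Means, sample standard deviations, and correlation, rounded."""
    values = (
        statistics.fmean(xs),
        statistics.fmean(ys),
        statistics.stdev(xs),
        statistics.stdev(ys),
        pearson(xs, ys),
    )
    return tuple(round(v, ndigits) for v in values)


# The y values differ, yet the rounded summaries agree.
print(summary(X, Y1))
print(summary(X, Y2))
```

Plotting the two data sets shows roughly a noisy line for the first and a smooth curve for the second, which the summary numbers alone do not reveal.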

01:01:34 And so in 2017, they had developed this algorithm using simulated annealing.

01:01:40 So if you scroll down once more, where they take the dinosaur at the top and they use

01:01:47 simulated annealing to push the points.

01:01:48 Let me describe this really quick for just people listening.

01:01:50 So there's a matplotlib looking graph of some data points, and it has a certain standard

01:01:56 deviation, certain mean, et cetera.

01:01:58 But if you actually look at it, it looks like a T-Rex, right?

01:02:02 Something like this?

01:02:03 Yes.

01:02:03 Is that a decent enough description?

01:02:05 That's a perfect description.

01:02:07 Yeah.

01:02:07 So what the researchers have done is they use this simulated annealing algorithm to push

01:02:12 the points around.

01:02:13 So starting from that dinosaur and just moving the points ever so slightly in such a way where

01:02:18 the summary statistics are unchanged, at least to the two decimal places where they're currently

01:02:23 shown, and tried to make other shapes.

01:02:26 So some of the other shapes they have are a bullseye, a circle, lines slanted vertically

01:02:32 or a star.

01:02:33 And all of these can be formed from that dinosaur, some to varying degrees of success.

01:02:38 But they're visually recognizable, which is the point that is pretty important here, right?

01:02:45 So you cannot, as we said, rely on those summary statistics because you don't know.

01:02:48 Is it the star?

01:02:49 Is it the dinosaur?

01:02:50 Is it a line?

01:02:51 It could be anything.
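
The idea of nudging points toward a target shape while keeping the rounded statistics fixed can be sketched as follows. This is a loose illustration of the approach as described here, not the research code or Data Morph's implementation; the circle target, the function names, and every parameter value are assumptions.

```python
import math
import random
import statistics


def summary(points, ndigits=2):
    """Means, standard deviations, and Pearson correlation, rounded."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points) / (len(points) - 1)
    corr = cov / (std_x * std_y)
    return tuple(round(v, ndigits) for v in (mean_x, mean_y, std_x, std_y, corr))


def distance_to_circle(point, center, radius):
    """How far a point is from the outline of the target circle."""
    return abs(math.dist(point, center) - radius)


def morph(points, center, radius, steps=5000, start_temp=0.4, jitter=0.1, seed=0):
    """Move points toward a circle without changing the rounded statistics."""
    rng = random.Random(seed)
    points = list(points)
    target_stats = summary(points)
    for step in range(steps):
        temp = start_temp * (1 - step / steps)  # cools linearly toward zero
        i = rng.randrange(len(points))
        old = points[i]
        new = (old[0] + rng.gauss(0, jitter), old[1] + rng.gauss(0, jitter))
        closer = distance_to_circle(new, center, radius) < distance_to_circle(
            old, center, radius
        )
        # Annealing step: sometimes allow a worse move while the temperature is high.
        if not (closer or rng.random() < temp):
            continue
        points[i] = new
        if summary(points) != target_stats:
            points[i] = old  # undo any move that changes the rounded statistics
    return points
```

Because every accepted move is checked against the starting statistics, each intermediate arrangement shares the same rounded summary, which is what makes the animations possible.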

01:02:52 And they also had animation that they included.

01:02:56 So basically, you could start from the dinosaur and then turn it into a circle.

01:03:00 And that's even more impactful because you realize at that point that it's not just the

01:03:06 dinosaur and the circle that have something in common, but it's the infinite number of

01:03:10 points arrangements that you can make between them that actually share that.

01:03:14 And so I wanted to explore if I could extend that to working for arbitrary data sets and also

01:03:20 different shapes.

01:03:21 So I found the research code and spent quite a bit hacking at it and even just trying to

01:03:27 get it to work for their example.

01:03:29 And that took quite a bit of time.

01:03:30 And then I had this idea, being that it was for a pandas workshop, to take a panda and

01:03:35 turn it.

01:03:36 Initially, I wanted to turn it into the dinosaur.

01:03:38 I still have not found a good way to do that yet, but I also haven't been trying at all this

01:03:44 year on that, to be honest.

01:03:45 But by adding a lot of other things that didn't exist in the initial

01:03:51 algorithm, things like calculating bounds of the data and different metrics, I figured

01:03:56 out a way to get it to work regardless.

01:03:59 So I can give it a panda data set or a soccer ball and it can perform these transformations

01:04:04 and move the points around.

01:04:06 So on the screen, we have the first time I shared this publicly, what I had been working on,

01:04:11 it happened to be Easter.

01:04:12 So I made a bunny holding an Easter egg with the words "happy Easter" off to the side.

01:04:17 And it turns into two vertical lines all while preserving the summary statistics.

01:04:22 This is something I think makes for a very good teaching tool in, say, an introductory

01:04:28 statistics course to encourage people that they need to visualize.

01:04:32 There's an interesting study, I think called the hypothesis is a liability.

01:04:38 And they talked about taking students in a statistical analysis course and they split them into two.

01:04:44 And one set of students were just given the data set and say, here, explore, see what you find.

01:04:49 And then the other set were given a set of hypotheses to test.

01:04:53 And it turns out that the data is shaped like a gorilla.

01:04:56 And the students who were told here, test these hypotheses were five times less likely to even

01:05:02 realize that it was shaped like a gorilla because they never plotted it.

01:05:05 Yeah.

01:05:06 This is such a huge thing to like get people learning this early.

01:05:10 And the more shocking these visuals are, the better.

01:05:14 Yeah.

01:05:14 And I think these are super shocking, right?

01:05:17 Having T-Rexes and bunnies and go, you know, that bunny is, you know, equivalent.

01:05:22 And there's a continuous transformation from bunny to blob of dots with one outside dot, right?

01:05:28 That kind of stuff kind of surprises you, I think.

01:05:30 And one thing I see, especially when the dinosaur came out, but even when I posted some of my first

01:05:37 examples is you see people comment right away, wow, that there's something that's so cool that

01:05:41 that dinosaur is possible to do that with.

01:05:44 Like, no, no, no.

01:05:44 It's not, it's not just the dinosaur or just the panda.

01:05:47 It's really like anything.

01:05:48 And so the way this also works is that people can use their own data sets or they can add

01:05:53 something new.

01:05:53 And that's what I've done this year in the two previous development

01:05:59 sprints that I had. I did one at EuroPython and one at PyCon Taiwan earlier

01:06:07 this year.

01:06:07 And hopefully in Australia, we'll do some more.

01:06:11 But I had people add, for example, a target shape.

01:06:15 So what the, for example, the panda would turn into, we have a club, like the card suit,

01:06:21 which was quite a challenge, and the spade.

01:06:24 And I had already had the heart.

01:06:25 The heart is actually a trigonometric equation, which, you know, blew my mind at first.

01:06:30 There's actually a page I found on, I think, Wolfram Alpha, which was like, I want to say

01:06:35 like 10 or 15 different equations, trigonometric equations for different types of hearts.

01:06:40 And you can pick the exact type of heart you wanted.

01:06:43 Social media heart, the emoji heart, what are we talking about?

01:06:45 No, no, it was just like, this is longer, this is more curved.

01:06:48 Yeah, yeah, yeah, that's awesome.
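
As an example of a heart defined by trigonometric equations, here is one widely circulated parametric heart curve. Whether this is the specific equation Data Morph uses is not stated here, so treat the choice of curve and the function name `heart_points` as assumptions.

```python
import math


def heart_points(n=100):
    """Sample n points along a parametric heart curve."""
    points = []
    for k in range(n):
        t = 2 * math.pi * k / n
        x = 16 * math.sin(t) ** 3
        y = (
            13 * math.cos(t)
            - 5 * math.cos(2 * t)
            - 2 * math.cos(3 * t)
            - math.cos(4 * t)
        )
        points.append((x, y))
    return points
```

Changing the coefficients stretches or rounds the shape, which matches the "this is longer, this is more curved" variations mentioned above.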

01:06:49 But these are all now math problems when you think about that side of it.

01:06:53 So this could then be used maybe in a course where they want to focus on math, but also

01:06:57 some more coding.

01:06:58 So there's lots of different use cases, like just giving it the data.

01:07:02 And that's very much more just pure statistics.

01:07:04 But, you know, I've heard from a few teachers, from what I presented,

01:07:09 that it sounds like this would be something that they would like to use.

01:07:13 So hopefully that does happen.

01:07:14 If not, it's a fun thing to put in my slides.

01:07:16 And I did enjoy getting it to work.

01:07:18 Yeah, I didn't pull up any good videos for the YouTube video, but there's some really nice

01:07:23 animations of actually seeing it go from one to the other that you got.

01:07:27 And this is, you're doing a talk at PyCon Australia, and then you're doing a sprint on

01:07:32 this as well, right?

01:07:33 Coming up on November 22nd, about a month from now.

01:07:36 Correct.

01:07:37 So cool.

01:07:37 People can check that out if they happen to be at PyCon Australia and want to...

01:07:41 Well, I'll also be talking about it in San Francisco next week.

01:07:45 There won't be a sprint, but I will be talking about that.

01:07:48 So people can...

01:07:48 Okay.

01:07:49 It's not a PyCon.

01:07:49 Sure.

01:07:50 It's still cool.

01:07:51 All right.

01:07:52 Well, Stefanie, thank you so much for being here.

01:07:54 Let's wrap things up.

01:07:56 But I guess, you know, give us a final call to action for people maybe interested in pre-commit

01:08:00 hooks or other stuff that you're doing.

01:08:02 Yeah, you can find everything that we mentioned here and the projects on my website.

01:08:06 I'm putting much more effort into putting stuff on there this year now that I've rebuilt it.

01:08:12 So definitely check there and sign up for my newsletter.

01:08:15 Follow me on socials.

01:08:17 There's no links down here, but you can find them.

01:08:19 There'll be links on the episode page.

01:08:21 So we'll put them there.

01:08:22 All right.

01:08:23 Well, thanks.

01:08:24 Thanks for being here.

01:08:24 It's great to talk to you.

01:08:25 Thanks for coming on and sharing.

01:08:26 Thanks for having me.

01:08:27 Yeah.

01:08:27 Bye-bye.

01:08:28 This has been another episode of Talk Python to Me.

01:08:32 Thank you to our sponsors.

01:08:34 Be sure to check out what they're offering.

01:08:35 It really helps support the show.

01:08:37 Take some stress out of your life.

01:08:39 Get notified immediately about errors and performance issues in your web or mobile applications with Sentry.

01:08:45 Just visit talkpython.fm/Sentry and get started for free.

01:08:50 And be sure to use the promo code talkpython, all one word.

01:08:53 This episode is brought to you by Bluehost.

01:08:56 Do you need a website fast?

01:08:58 Get Bluehost.

01:08:58 Their AI builds your WordPress site in minutes and their built-in tools optimize your growth.

01:09:04 Don't wait.

01:09:05 Visit talkpython.fm/Bluehost to get started.

01:09:08 Want to level up your Python?

01:09:10 We have one of the largest catalogs of Python video courses over at Talk Python.

01:09:14 Our content ranges from true beginners to deeply advanced topics like memory and async.

01:09:19 And best of all, there's not a subscription in sight.

01:09:22 Check it out for yourself at training.talkpython.fm.

01:09:25 Be sure to subscribe to the show.

01:09:27 Open your favorite podcast app and search for Python.

01:09:30 We should be right at the top.

01:09:31 You can also find the iTunes feed at /itunes, the Google Play feed at /play,

01:09:36 and the direct RSS feed at /rss on talkpython.fm.

01:09:40 We're live streaming most of our recordings these days.

01:09:43 If you want to be part of the show and have your comments featured on the air,

01:09:47 be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:09:51 This is your host, Michael Kennedy.

01:09:53 Thanks so much for listening.

01:09:55 I really appreciate it.

01:09:56 Now get out there and write some Python code.

01:09:57 I'll see you next time.
