WEBVTT

00:00:00.020 --> 00:00:02.440
Today, we're turning tiny tips into big wins.

00:00:02.980 --> 00:00:10.420
Khuyen Tran, creator of CodeCut.ai, has shipped hundreds of bite-sized Python and data science snippets across four years.

00:00:11.220 --> 00:00:18.080
We dig into open source tools you can use right now, cleaner workflows, and why notebooks and scripts don't have to be enemies.

00:00:19.020 --> 00:00:25.100
If you want faster insights with fewer yak shaves, this one's packed with takeaways you can apply before lunch.

00:00:25.400 --> 00:00:26.040
Let's get into it.

00:00:26.620 --> 00:00:32.759
This is Talk Python To Me, Episode 522, recorded September 4th, 2025.

00:00:50.900 --> 00:00:55.300
Welcome to Talk Python To Me, the number one podcast for Python developers and data scientists.

00:00:55.820 --> 00:00:57.200
This is your host, Michael Kennedy.

00:00:57.660 --> 00:01:00.920
I'm a PSF fellow who's been coding for over 25 years.

00:01:01.600 --> 00:01:02.760
Let's connect on social media.

00:01:03.180 --> 00:01:06.480
You'll find me and Talk Python on Mastodon, Bluesky, and X.

00:01:06.780 --> 00:01:08.500
The social links are all in the show notes.

00:01:09.080 --> 00:01:13.220
You can find over 10 years of past episodes at talkpython.fm.

00:01:13.680 --> 00:01:17.000
And if you want to be part of the show, you can join our recording live streams.

00:01:17.480 --> 00:01:17.840
That's right.

00:01:17.890 --> 00:01:21.580
We live stream the raw, uncut version of each episode on YouTube.

00:01:22.020 --> 00:01:26.680
Just visit talkpython.fm/youtube to see the schedule of upcoming events.

00:01:27.240 --> 00:01:31.520
And be sure to subscribe and press the bell so you'll get notified anytime we're recording.

00:01:32.700 --> 00:01:34.300
This episode is brought to you by Sentry.

00:01:34.820 --> 00:01:36.080
Don't let those errors go unnoticed.

00:01:36.230 --> 00:01:37.880
Use Sentry like we do here at Talk Python.

00:01:38.400 --> 00:01:41.300
Sign up at talkpython.fm/sentry.

00:01:41.700 --> 00:01:43.660
And it's brought to you by Agntcy.

00:01:44.240 --> 00:01:46.080
Discover agentic AI with Agntcy.

00:01:46.580 --> 00:01:49.340
Their layer lets agents find, connect, and work together.

00:01:49.690 --> 00:01:50.660
Any stack, anywhere.

00:01:51.120 --> 00:01:57.260
Start building the Internet of Agents at talkpython.fm/agntcy, spelled A-G-N-T-C-Y.

00:01:58.260 --> 00:02:02.040
I'm super excited to tell you that I just released my first solo book.

00:02:02.200 --> 00:02:04.420
It's called Talk Python in Production.

00:02:05.200 --> 00:02:10.080
It's the inside look at how we host all the Talk Python sites, APIs, and more.

00:02:10.740 --> 00:02:18.140
The core idea is that I believe most hosting stories sold to developers and data scientists are way overcomplicated and overpriced.

00:02:18.460 --> 00:02:24.020
You've heard me say that you're not Google and you're not Facebook, so you shouldn't run your infrastructure the way they do.

00:02:24.620 --> 00:02:26.080
But if not that, then what?

00:02:26.880 --> 00:02:38.400
This book is both a blueprint for what I chose at Talk Python and a story arc of 10 years of running our own infrastructure from complete newbie to pretty neat infrastructure as code DevOps style.

00:02:38.920 --> 00:02:40.960
You'll find the book right on talkpython.fm.

00:02:41.140 --> 00:02:42.760
Just click book in the nav bar.

00:02:43.360 --> 00:02:46.900
I've made the first third of the book available for free for everyone to read online.

00:02:47.520 --> 00:02:52.260
After that, you can grab the DRM-free EPUB and Kindle editions from that same page.

00:02:52.610 --> 00:02:54.020
I hope this book resonates with you.

00:02:54.400 --> 00:03:00.380
People have asked me to share the details of how I run our sites at Talk Python, and now here it is in detail.

00:03:01.060 --> 00:03:04.020
If you're interested, grab the ebook at talkpython.fm.

00:03:04.180 --> 00:03:06.920
Today, I'm working on the paperback version as well.

00:03:07.130 --> 00:03:07.780
It should be out soon.

00:03:08.240 --> 00:03:10.460
Getting the book is a great way to support the podcast.

00:03:11.440 --> 00:03:12.480
Khuyen, welcome to Talk Python.

00:03:13.020 --> 00:03:14.060
Great to have you here.

00:03:14.360 --> 00:03:15.420
Happy to be here.

00:03:15.580 --> 00:03:40.480
Yes, I'm happy to have you here. It's going to be a super fun data science topic. You've got a really cool project over at CodeCut.ai. One of our listeners reached out and said, you should have Khuyen on because she's doing amazing stuff over at CodeCut.ai, and I'm really getting a lot of value out of it. And I'd love to hear more about this project and maybe dive into some of the topics there.

00:03:40.540 --> 00:03:42.240
So we're going to have a really great time talking about that.

00:03:42.440 --> 00:03:45.700
But before we get to those, really quick introduction for everyone.

00:03:46.540 --> 00:03:46.840
Who are you?

00:03:47.200 --> 00:03:47.340
Yeah.

00:03:47.860 --> 00:03:48.300
Hi, everybody.

00:03:48.800 --> 00:03:49.700
I'm Khuyen Tran.

00:03:49.940 --> 00:03:52.600
I'm a developer advocate at Nixtla.

00:03:52.760 --> 00:04:02.940
I am also the founder of CodeCut, where I share daily tips both on LinkedIn and through my newsletter.

00:04:02.960 --> 00:04:16.660
I send out short tips about data science and Python in the form of code snippets that are very easy to digest in about two minutes, three times a week.

00:04:17.130 --> 00:04:17.880
Three times a week.

00:04:18.200 --> 00:04:18.320
Yeah.

00:04:18.459 --> 00:04:18.859
That's a lot.

00:04:19.180 --> 00:04:24.460
Yes, it's a lot of work, but I have been doing it for four years, I think.

00:04:24.780 --> 00:04:27.440
So I have been writing tips,

00:04:28.000 --> 00:04:34.760
writing Python snippets basically every weekday for four years.

00:04:35.700 --> 00:04:36.640
Yeah, very fun.

00:04:36.800 --> 00:04:38.760
And you also have longer form articles there.

00:04:39.200 --> 00:04:39.980
Yes, that is correct.

00:04:40.220 --> 00:04:46.820
I also enjoy writing long form articles that dive deeper into open source.

00:04:47.440 --> 00:04:55.080
So majority of the time, like 95% of the time, I write about open source, Python, data science, so very specific.

00:04:55.760 --> 00:05:06.840
And I like to really explore, you know, why data scientists should use this tool and how they use it.

00:05:07.180 --> 00:05:12.280
And my assumption for every article is data scientists are very busy.

00:05:12.720 --> 00:05:22.320
And if they only have five minutes to read this article, they will be able to get some takeaway from this tool and be able to apply right away.

00:05:22.720 --> 00:05:30.020
So you will see that as a common theme for snippets, micro snippets, as for a newsletter, like very short, very easy to digest.

00:05:30.740 --> 00:05:38.420
As well as the articles: even though they're long form, if they want to, they can skim them and get something out of them.

00:05:38.470 --> 00:05:47.340
Or if they really want to dive deeper, they can sit down and code along and get the most out of it.

00:05:47.500 --> 00:05:48.200
Yeah, I agree.

00:05:48.270 --> 00:05:54.800
I read a bunch of your articles and I think you can certainly get a lot out of them even if you only have time to skim them.

00:05:54.950 --> 00:05:57.280
I learned a couple of extra new tools.

00:05:57.390 --> 00:05:59.840
I think that we're going to have a lot of fun to talk about as well.

00:05:59.980 --> 00:06:03.160
So how do people get your code snippets?

00:06:03.160 --> 00:06:05.080
Is that through your newsletter, or how do you do that?

00:06:05.340 --> 00:06:09.460
Yeah, people get my code snippets through my newsletter.

00:06:10.140 --> 00:06:15.300
And if you go to the front page, you can see what it looks like.

00:06:15.540 --> 00:06:16.660
Yes, that's how it looks.

00:06:17.120 --> 00:06:28.740
So just for the audience who's listening, the format of my newsletter is that I extract one specific feature out of the tool, right?

00:06:28.880 --> 00:06:37.380
Because one tool can have many features, and you don't want to talk about all of them in one code snippet. It would be really difficult to digest.

00:06:37.920 --> 00:06:39.020
So I pick one feature.

00:06:39.330 --> 00:06:42.160
I compare it with something that people already know.

00:06:42.680 --> 00:06:50.380
For example, on the screen we see we are comparing the regular expression (regex) library and difflib.

00:06:50.740 --> 00:07:30.580
I want to compare something people already know with something that is less well known, to highlight: okay, here's the typical problem with the tool that people already know, and here's a solution, and this tool offers the solution. And that's all through a code snippet. My philosophy when it comes to teaching people is it's better to show than to tell. So I put a lot of effort into making it very easy for people to understand when they look at the code snippet. But of course there's supporting text. So if you scroll down a bit more, you can see the format.

00:07:32.490 --> 00:07:37.440
Yeah, we see the problem solutions. The problem is, I'll just read it out loud here.

00:07:37.960 --> 00:07:59.660
Regex preprocessing achieves exact matching but fails completely on typos like iPhone 14 Pro, with, like, a double R, Max. Solution: difflib provides similarity scoring that tolerates typos and character variations, enabling approximate matching where regex fails.
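
NOTE
Editor's sketch of the problem/solution just read aloud, assuming made-up product strings and helper names (normalize, exact_match, similarity) that are not taken from the newsletter. It contrasts exact matching after regex normalization with a difflib similarity ratio, using only the standard library.
```python
import re
from difflib import SequenceMatcher
# Hypothetical product name, echoing the "iPhone 14 Pro Max" typo example.
CANONICAL = "iphone 14 pro max"
def normalize(text):
    """Lowercase and collapse punctuation/whitespace (the regex preprocessing step)."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
def exact_match(a, b):
    """Exact comparison after regex normalization; any typo makes it fail."""
    return normalize(a) == normalize(b)
def similarity(a, b):
    """Similarity ratio in [0, 1] from difflib; tolerant of small typos."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
typo = "iPhone 14 Prro Max"
print(exact_match(typo, CANONICAL))           # exact matching rejects the typo
print(round(similarity(typo, CANONICAL), 2))  # the ratio stays close to 1
```
A threshold on the ratio (what counts as "close enough") is a judgment call that depends on the data.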

00:08:00.220 --> 00:08:02.460
And you can view the full article if you want.

00:08:02.700 --> 00:08:06.020
But it needs to be digestible within the newsletter.

00:08:06.420 --> 00:08:11.680
Yeah, I think you could probably read that and get a good bit of information out of that in like one minute.

00:08:11.880 --> 00:08:12.420
That's really nice.

00:08:12.740 --> 00:08:18.900
And the idea you're highlighting here is like, sure, you can search if you're doing data science, text, NLP type of stuff.

00:08:19.260 --> 00:08:21.800
If you search with regular expressions, you might find things.

00:08:22.540 --> 00:08:32.640
But what if the spacing is different between them or somebody puts a comma or they put iPhone 14 without a space between the 14 and the iPhone, right?

00:08:32.740 --> 00:08:35.719
Like all those things are really tricky to catch every variation.

00:08:35.979 --> 00:08:41.099
So there's better tools like difflib, which we'll probably talk about again later in an article, right?
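
NOTE
Editor's sketch of the spacing and punctuation variations described above, assuming a made-up catalog and a hypothetical helper named best_match. It uses difflib.get_close_matches from the standard library; the 0.8 cutoff is an arbitrary illustration, not a recommended value.
```python
from difflib import get_close_matches
# Hypothetical catalog and messy user-entered variants (not from the episode).
CATALOG = ["iphone 14", "iphone 14 pro", "iphone 14 pro max"]
def best_match(raw, cutoff=0.8):
    """Return the closest catalog entry, or None if nothing clears the cutoff."""
    hits = get_close_matches(raw.lower().strip(), CATALOG, n=1, cutoff=cutoff)
    return hits[0] if hits else None
for raw in ["iPhone14", "iphone 14, pro", "Galaxy S23"]:
    print(raw, best_match(raw))
```
A single regular expression would need a separate alternative for each of these variants; the similarity cutoff covers them with one rule.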

00:08:41.280 --> 00:08:44.280
And in the article, I highlight a lot more tools.

00:08:44.420 --> 00:08:46.840
So it goes from, like, regex to difflib.

00:08:47.080 --> 00:08:50.680
And then if you want even better, then fuzzy matching, right?

00:08:51.400 --> 00:08:55.920
And then if you even want to capture semantics, that's another tool.

00:08:56.880 --> 00:09:04.180
And then, of course, you can go down all the spaCy route and into LLMs and like the whole spectrum, right?

00:09:04.820 --> 00:09:09.060
But just knowing about this, because sometimes you don't want full machine learning.

00:09:09.240 --> 00:09:11.280
You just want kind of like regular expressions,

00:09:11.440 --> 00:09:12.140
but not so hard.

00:09:12.380 --> 00:09:12.760
That's correct.

00:09:12.840 --> 00:09:14.480
Like something that gets the job done.

00:09:15.500 --> 00:09:24.560
To me, the ideal tool is something that works right out of the box without a lot of boilerplate code and without too many dependencies.

00:09:25.200 --> 00:09:30.420
Yeah, that's really, really important. So let's talk about the origins of CodeCut.ai.

00:09:30.980 --> 00:09:31.540
Why did you start it?

00:09:31.940 --> 00:11:17.680
So I started when I was in college, so four years ago. I didn't start with the website; I started sharing my tips first, on LinkedIn. Initially, I wrote a lot of articles. I pushed out two to three articles on Towards Data Science every week, and I made the commitment that I was going to do that every week. I was not very active on LinkedIn, and then I read a book called Show Your Work, and it says that no matter how messy your work is, you should just share it. I was very into open source tools, and I often would just send a message to my friends and say, hey, check this out, this is so cool. And I was like, what if I could share with more people? So I started to put out some of the things that I was excited about on LinkedIn. Initially, I was so afraid people would be like, what are you talking about? You don't know what you're talking about. But people were also very excited about it. And I was like, oh great, I shared something I'm excited about, and it helped some other people, and everybody's excited about the tool. I love that. So I kept doing it. Initially, I did it every day, so seven days per week. Later, I was like, okay, I'll give myself the weekend off. But I had done, I think, over 500 tips, and then people were like, where can I find the old ones? Because if you are interested in, let's say, machine learning tools or data processing tools, how do you find them on LinkedIn? There's no way for you to categorize.

00:11:17.980 --> 00:11:55.439
So I started to go back through most of my posts. There are so many, so I put as many as I could into a website. At the time, I just found some domain, it was called Math Data Simplified, and I put a bunch there and sent out some newsletters. So it was not polished at all, but I just wanted to have a place to capture it. Later, I figured out that a lot of the time people made typos when they typed it, because it's a very long URL. So I tried to make it something very short that actually captures what I

00:11:55.660 --> 00:12:01.780
we do in CodeCut. So it became CodeCut. Yeah, I like the aesthetic of it. It looks nice. It's got

00:12:01.920 --> 00:12:08.380
these soft, warm colors. Thank you. Yeah, I try to, you know, keep everything kind of like a

00:12:08.510 --> 00:12:19.659
cotton candy, so like blue and pink. Yeah, it does. Yeah, cotton candy definitely comes through. I can see that for sure. What platform did you use to build it? I use WordPress. But for articles,

00:12:19.720 --> 00:12:26.420
I'm creating a workflow where basically I write all my articles in Quarto, like doc.qmd.

00:12:26.420 --> 00:12:26.920
Yeah, yeah, yeah.

00:12:27.220 --> 00:12:31.320
I realized that, so I write and also run my code in Quarto.

00:12:32.060 --> 00:12:39.020
And then I use the WordPress API to push it to the articles on my blog.

00:12:39.380 --> 00:12:40.580
So I do a lot.

00:12:42.840 --> 00:12:48.280
It's easier than, I don't use a WordPress editor to create the article.

00:12:48.500 --> 00:12:52.040
I use VS Code with .qmd files.
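
NOTE
Editor's sketch of what pushing rendered HTML to WordPress over its REST API can look like. The guest's actual script is not described in the episode, so the function name build_post_request, the example.com site, and the credentials are all placeholders. It assumes the standard /wp-json/wp/v2/posts endpoint with an application password sent as HTTP Basic auth, and it only builds the request without sending it.
```python
import base64
import json
import urllib.request
def build_post_request(site_url, user, app_password, title, html, status="draft"):
    """Build (but do not send) a POST to the WordPress REST API posts endpoint."""
    token = base64.b64encode(f"{user}:{app_password}".encode()).decode()
    payload = json.dumps({"title": title, "content": html, "status": status})
    return urllib.request.Request(
        url=f"{site_url.rstrip('/')}/wp-json/wp/v2/posts",
        data=payload.encode("utf-8"),
        headers={"Authorization": f"Basic {token}", "Content-Type": "application/json"},
        method="POST",
    )
# Placeholder site and credentials; sending would be urllib.request.urlopen(req).
req = build_post_request("https://example.com", "editor", "app-password", "Hello", "<p>Rendered from a .qmd file</p>")
print(req.get_method(), req.full_url)
```
Defaulting to a draft status keeps a half-finished article from going live by accident.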

00:12:52.660 --> 00:12:56.820
But for the aesthetics, I like WordPress because I can drag and drop.

00:12:57.080 --> 00:13:00.320
I'm not a front-end developer, but it works for me.

00:13:00.360 --> 00:13:00.420
Yeah.

00:13:00.940 --> 00:13:01.660
You know, it's interesting.

00:13:01.710 --> 00:13:11.120
I think a lot of people who are developers get hung up feeling like they have to create their website in the same language or technology that they're an expert in.

00:13:12.120 --> 00:13:18.980
If you're a Python developer, you're like, well, how do I create a website in Python for my blog or whatever language, right?

00:13:19.300 --> 00:13:24.360
But I think there's a lot of value in just saying, it's just a tool, I'm just gonna pick it and it's gonna be great.

00:13:24.980 --> 00:13:31.820
Like for example, I had for a long time, I had my stuff in WordPress under my own domain for my blog and other things.

00:13:31.980 --> 00:13:38.500
And I finally decided to move on to Hugo, but Hugo is also not Python, it's Go.

00:13:39.080 --> 00:13:40.460
And it just builds static sites.

00:13:40.460 --> 00:13:46.680
And I think it's really, it's good if you know the technology, but it's certainly not something I think people should get overly.

00:13:47.140 --> 00:13:47.440
No.

00:13:47.620 --> 00:13:51.460
Hung up on because you lose out on good tools that way, right?

00:13:51.760 --> 00:13:52.540
Yeah, exactly.

00:13:52.720 --> 00:13:56.260
And I mean, you can learn, like, you can learn any language, right?

00:13:56.300 --> 00:14:00.680
If you know one language, you can learn, or kind of guess at, another, especially now with AI.

00:14:01.639 --> 00:14:03.380
Yeah, that's such an interesting topic.

00:14:03.540 --> 00:14:06.820
I do think that's actually really changed a lot of things.

00:14:07.040 --> 00:14:11.260
It's like, well, I could work with this, but I know I'll get stuck on something.

00:14:12.120 --> 00:14:15.340
And now you can just ask an agentic AI, like, I'm stuck on this.

00:14:15.860 --> 00:14:16.840
Okay, here you go.

00:14:17.040 --> 00:14:22.780
And even if it's not perfect, it really, really helps handle, juggle different types of technology.

00:14:22.780 --> 00:14:23.700
Yes, exactly.

00:14:24.120 --> 00:14:24.800
Yeah, super cool.

00:14:25.160 --> 00:14:26.660
So tell me more about this Quarto thing.

00:14:26.820 --> 00:14:29.780
I didn't intend to talk about this, but I find it really interesting.

00:14:30.220 --> 00:14:31.580
Do you write in Markdown?

00:14:31.780 --> 00:14:32.560
Do you write in HTML?

00:14:32.820 --> 00:14:34.340
What do you write in and then publish?

00:14:34.720 --> 00:14:35.180
Yeah, okay.

00:14:35.280 --> 00:14:36.440
So I write in Quarto.

00:14:37.520 --> 00:14:38.620
It is in Markdown, right?

00:14:38.800 --> 00:14:43.940
So, as some people know, it's Markdown; it's basically really similar to .md.

00:14:44.480 --> 00:14:47.860
The only difference is you can execute the code.

00:14:48.180 --> 00:14:54.660
So my really favorite stack is in VS Code.

00:14:55.300 --> 00:15:01.640
I have Quarto on the left side, and then I open a doc.qmd file.

00:15:01.860 --> 00:15:04.340
People can just imagine it's like a doc.md file.

00:15:05.120 --> 00:15:07.900
But you can click the button, and you can execute.

00:15:08.460 --> 00:15:12.120
So it's very similar to a notebook, right?

00:15:12.880 --> 00:15:15.320
But I like it.

00:15:15.570 --> 00:15:17.580
I really enjoy writing in Markdown.

00:15:17.940 --> 00:15:20.580
And it's just very clean, right?

00:15:20.810 --> 00:15:21.580
Compared to a notebook.

00:15:22.080 --> 00:15:23.800
It's a .qmd.

00:15:23.850 --> 00:15:26.280
You can think of it like a Markdown notebook.

00:15:26.570 --> 00:15:27.360
Just Markdown.

00:15:27.760 --> 00:15:29.060
And you can execute the code.

00:15:29.560 --> 00:15:36.480
And then after I have the image, I have the code, everything.

00:15:37.080 --> 00:15:41.760
I will create some Python functions to clean it up a bit.

00:15:42.120 --> 00:15:53.940
For example, if there's some cell that I wrote that I don't want to show on WordPress, on my website, I will say echo: false, right?

00:15:54.040 --> 00:16:00.840
It's just like a comment, a Python comment at the top of the cell, and it will not show on the website, which is great.
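
NOTE
Editor's sketch of the "echo: false" idea. In Quarto the cell option is written as a comment line, #| echo: false, at the top of a code cell, and Quarto hides that cell's source when rendering. The helper below, hide_echo_false_cells, is a hypothetical stand-in for the kind of cleanup function mentioned here, not the guest's code: it drops such cells from .qmd text entirely.
```python
FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block
def hide_echo_false_cells(qmd_text):
    """Drop code cells whose options include '#| echo: false' (hypothetical helper)."""
    kept, cell, in_cell = [], [], False
    for line in qmd_text.splitlines():
        if not in_cell and line.startswith(FENCE + "{"):
            in_cell, cell = True, [line]
        elif in_cell:
            cell.append(line)
            if line.strip() == FENCE:
                hidden = any(opt.replace(" ", "") == "#|echo:false" for opt in cell)
                if not hidden:
                    kept.extend(cell)
                in_cell = False
        else:
            kept.append(line)
    if in_cell:
        kept.extend(cell)  # keep an unclosed cell rather than silently losing it
    return "\n".join(kept)
sample = "\n".join([
    "Intro text.",
    FENCE + "{python}",
    "#| echo: false",
    "import pandas as pd",
    FENCE,
    FENCE + "{python}",
    "print('shown')",
    FENCE,
])
print(hide_echo_false_cells(sample))
```
Unlike Quarto itself, this sketch never executes the hidden cell, so it only illustrates which source lines readers would no longer see.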

00:16:01.100 --> 00:16:04.760
And then I can just run publish to WordPress and it's published to WordPress.

00:16:05.640 --> 00:16:11.500
What is nice about it is, let's say later I want to change some of the content.

00:16:11.860 --> 00:16:18.220
Let's say in three articles, I mentioned something, like a link to something, but now it's broken.

00:16:18.550 --> 00:16:25.340
So what I can do is I can just, you know, search for all the instances of it and update them.

00:16:25.390 --> 00:16:29.960
And then I run the function sync to WordPress and then it will sync.

00:16:30.320 --> 00:16:33.920
And hey, I don't need to manually go through each of them.
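
NOTE
Editor's sketch of the bulk link fix described above, assuming the articles live as .qmd files under one folder. The function name replace_link and the example URLs are hypothetical; the returned list is what a separate sync step (not shown, and not described in detail in the episode) would then push to WordPress.
```python
import tempfile
from pathlib import Path
def replace_link(articles_dir, old_url, new_url):
    """Rewrite old_url to new_url in every .qmd file; return the files that changed."""
    changed = []
    for path in sorted(Path(articles_dir).rglob("*.qmd")):
        text = path.read_text(encoding="utf-8")
        if old_url in text:
            path.write_text(text.replace(old_url, new_url), encoding="utf-8")
            changed.append(path)
    return changed
# Demo on throwaway files so nothing real is modified.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.qmd").write_text("See https://old.example/docs", encoding="utf-8")
    (Path(tmp) / "b.qmd").write_text("No links here", encoding="utf-8")
    changed = replace_link(tmp, "https://old.example/docs", "https://new.example/docs")
    print([p.name for p in changed])
```
Returning only the changed files means the sync step can skip articles that did not need an update.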

00:16:34.640 --> 00:16:35.580
Yeah, that's really nice.

00:16:35.630 --> 00:16:38.140
And it'll just fix all the articles that needed to be changed.

00:16:38.460 --> 00:16:38.660
Yeah.

00:16:38.830 --> 00:16:38.940
Okay.

00:16:40.100 --> 00:16:42.580
This portion of Talk Python To Me is brought to you by Sentry.

00:16:43.520 --> 00:16:51.560
Over at Talk Python, Sentry has been incredibly valuable for tracking down errors in our web apps, our mobile apps, and other code that we run.

00:16:52.280 --> 00:17:00.880
I've told you the story how more than once I've learned that a user was encountering a bug through Sentry and then fixed the bug and let them know it was fixed before they contacted me.

00:17:01.340 --> 00:17:02.160
That's pretty incredible.

00:17:02.880 --> 00:17:09.100
Let me walk you through the few simple steps that you need to add error monitoring and distributed tracing to your Python web app.

00:17:09.740 --> 00:17:17.339
Let's imagine we have a Flask app with a React front end, and we want to make sure there are no errors during the checkout process for some e-commerce page.

00:17:18.120 --> 00:17:22.600
I don't know about you, but anytime money and payments are involved, I always get a little nervous writing code.

00:17:23.540 --> 00:17:25.980
We start by simply instrumenting the checkout flow.

00:17:26.380 --> 00:17:32.900
To do that, you enable distributed tracing and error monitoring in both your Flask backend and your React frontend.

00:17:33.920 --> 00:17:40.320
Next, we want to make sure that you have enough context that the front-end and back-end actions can be correlated into a single request.

00:17:41.520 --> 00:17:43.860
So we enrich a Sentry span with data context.

00:17:44.440 --> 00:17:49.800
In your React checkout.jsx, you'd wrap the submit handler in a Sentry start span call.

00:17:50.280 --> 00:17:52.320
Then it's time to see the request live in a dashboard.

00:17:52.720 --> 00:17:54.340
We build a real-time Sentry dashboard.

00:17:55.080 --> 00:18:04.340
You spin up one using span metrics to track key attributes like cart size, checkout duration, and so on, giving you one pane for both performance and error data.

00:18:05.340 --> 00:18:05.700
That's it.

00:18:05.900 --> 00:18:12.940
When an error happens, you open the error on Sentry and you get end-to-end request data and error tracebacks to easily spot what's going on.

00:18:13.920 --> 00:18:18.920
If your app and customers matter to you, you definitely want to set up Sentry like we have here at Talk Python.

00:18:19.520 --> 00:18:24.840
Visit talkpython.fm/sentry and use the code TALKPYTHON, all caps, just one word.

00:18:25.320 --> 00:18:29.540
That's talkpython.fm/sentry, code TALKPYTHON.

00:18:29.880 --> 00:18:31.620
Thank you to Sentry for supporting the show.

00:18:32.900 --> 00:18:33.800
Notebooks are awesome.

00:18:33.940 --> 00:18:37.880
And the idea of notebooks are, you know, they really changed the game, I think, a lot.

00:18:38.180 --> 00:18:44.800
But you don't necessarily want to see all of that when your goal is mostly writing, right?

00:18:44.980 --> 00:18:50.980
Like you don't necessarily need to see all the import statements if you're only going to focus on one cell worth of code.

00:18:51.500 --> 00:18:52.840
It's not about the import, right?

00:18:53.120 --> 00:18:53.300
Right.

00:18:53.700 --> 00:18:58.800
Especially when my goal is to make it as easy as possible for data scientists to skim.

00:18:59.120 --> 00:19:03.780
And, you know, when you skim and you hit a long block of code, you're like, okay, I skip, I'm done.

00:19:03.970 --> 00:19:05.940
Like, I'll read this the next time.

00:19:06.280 --> 00:19:11.700
But my goal is always a very small code snippet that highlights the core features of the tool.

00:19:12.040 --> 00:19:18.880
And a lot of time, like, you need to hide the unnecessary code, right, you know, for that to happen.

00:19:19.320 --> 00:19:20.800
Yeah, yeah, I totally agree.

00:19:21.460 --> 00:19:26.600
People can get hung up on completeness, and it really takes away from the essence of just, like, skimming it.

00:19:26.620 --> 00:19:38.800
And also, I think that this workflow, it sounds like this is one of the things that made it possible for you to do this frequently, like three, four times a week instead of getting overwhelmed, right?

00:19:39.060 --> 00:19:44.540
Yeah, I really need to learn so many tricks in order to, I guess it's a good thing, right?

00:19:44.540 --> 00:19:45.960
It pushed me to be more productive.

00:19:46.240 --> 00:19:51.860
Like I learned a bunch of shortcuts because, you know, without shortcuts, I cannot get things done quickly.

00:19:52.190 --> 00:19:53.940
So I know a lot of VS Code shortcuts.

00:19:55.400 --> 00:19:58.120
And like I use a text expansion tool.

00:19:58.520 --> 00:19:59.620
It's called Espanso.

00:19:59.860 --> 00:20:01.180
And it's game changer.

00:20:01.320 --> 00:20:03.680
I can like, you know, like view the code.

00:20:03.840 --> 00:20:04.600
This article, right?

00:20:04.740 --> 00:20:10.160
You can just type colon, doc, something, something, like two words.

00:20:10.200 --> 00:20:12.020
And then it expands the whole text.

00:20:12.400 --> 00:20:13.980
And then you can fill in the blank.

00:20:14.740 --> 00:20:17.520
Yeah, I need to learn multiple tricks.

00:20:17.600 --> 00:20:21.860
And also like just automate things is how I can get it out.

00:20:21.940 --> 00:20:38.440
But yeah, what I still find is, so by doing that, right, I can actually focus on the essence of the blog, which is to make it as easy as possible to digest as well as researching, which is very important.

00:20:38.900 --> 00:20:39.680
Yeah, absolutely.

00:20:39.920 --> 00:20:41.280
I love your philosophy here.

00:20:41.400 --> 00:20:44.580
I can see why the site is popular.

00:20:45.040 --> 00:20:48.620
And, you know, the show your work, I think that's a great philosophy as well.

00:20:48.680 --> 00:20:52.820
It's like get it out there even if it's messy and get it out there even if it's incomplete, right?

00:20:53.020 --> 00:20:53.460
That's the idea?

00:20:53.780 --> 00:20:54.580
Yes, exactly.

00:20:54.740 --> 00:20:59.240
And I really do recommend it for especially people who try to find a job, right?

00:21:00.200 --> 00:21:07.000
Another side effect of it is you show people what you know.

00:21:07.180 --> 00:21:13.060
Because so many people, they know a lot, but they don't share, right?

00:21:13.380 --> 00:21:27.220
And if you make employers find you by guessing what you know or by looking at your resume, it's very hard, because now with AI, everybody has a very polished, very nice resume.

00:21:27.340 --> 00:21:30.060
So why should they choose you over another person?

00:21:30.420 --> 00:21:37.100
But by showing your work, by showing even the messy one, they will be able to see, okay, this is what they're interested in.

00:21:37.320 --> 00:21:38.980
This is what they're excited about.

00:21:39.000 --> 00:21:40.220
And this is what they know.

00:21:40.480 --> 00:21:44.700
then they will be able to imagine how they can use you in their company.

00:21:45.140 --> 00:21:56.700
And that's how, because of me sharing my articles as well as LinkedIn posts early on when I was in college, I was able to get multiple internships in data science.

00:21:57.100 --> 00:21:58.080
Yeah, super neat.

00:21:58.300 --> 00:22:03.980
And, you know, you're in developer relations, and I think there's a strong communication aspect of that as well.

00:22:04.060 --> 00:22:09.560
And this is just another example of like, yeah, you want to hire me for this thing, I'm already doing it, right?

00:22:09.620 --> 00:22:15.220
Like one of my first important jobs I got, I was doing a lot of speaking at user groups and meetups.

00:22:15.410 --> 00:22:27.060
They weren't called meetups then because meetup.com didn't exist, but you know, they were meetups and I got a message from a company and they said, Hey, we saw you're talking to this place and that place and you're doing that for free.

00:22:27.180 --> 00:22:29.340
We'd pay you if you want to do this for a job.

00:22:29.490 --> 00:22:30.280
I'm like, great.

00:22:31.180 --> 00:22:31.920
Where do I sign up?

00:22:32.000 --> 00:22:33.000
I didn't even have to apply.

00:22:33.110 --> 00:22:41.800
They just reached out to me, and the timing was perfect because I had quit my longtime job a week before. And I'm like, okay, well, I was about to start looking for a new job, but I guess

00:22:41.960 --> 00:22:52.860
Wow, you're so brave. You look for a new job after you quit your job? That's very brave. Were you, like, intending to take a break or something? No, I was not as brave as it

00:22:53.060 --> 00:23:09.220
sounded. My wife got her PhD and finished, and she was getting her first professor job on the other side of the country. And so I was like, well, I'll wait till we get there and look for a job, because it was San Diego to New Jersey level of travel.

00:23:09.300 --> 00:23:10.360
It was far, right?

00:23:10.520 --> 00:23:12.020
So I figured I'll find something there.

00:23:12.300 --> 00:23:12.640
I see.

00:23:12.720 --> 00:23:13.740
And then you got a new job.

00:23:14.060 --> 00:23:14.440
Yeah, yeah.

00:23:14.500 --> 00:23:21.640
I think I'd quit my job, but I was still working for a month because I knew I was going to be moving in a while.

00:23:21.760 --> 00:23:25.460
And so like, okay, guys, I'm leaving in a month, six weeks or whatever.

00:23:25.660 --> 00:23:27.100
And about that time, people reached out.

00:23:27.220 --> 00:23:28.700
So it was a perfect, perfect deal.

00:23:28.980 --> 00:23:29.260
I see.

00:23:29.380 --> 00:23:33.480
But just like you said, it's because of the stuff I was doing out in public, right?

00:23:33.740 --> 00:23:34.000
Right.

00:23:34.360 --> 00:23:34.440
Yeah.

00:23:34.500 --> 00:24:11.820
Yeah. Okay. So let's dive in. Let's dive into CodeCut here. And the idea I thought is maybe we could just go through a handful of articles. We picked 10. I don't know how much time we have to go through all of them. We probably can get through all of them, but I think this will give people an idea of what they can get from CodeCut. But also I just think there's a bunch of interesting data science ideas and tools. Yeah. Kind of like that difflib that we talked about. So I think it'll be a good blend of that type of stuff to talk about. So let's start with an article called Goodbye pip and Poetry, Why uv Might Be All That You Need. So tell us about this and then we can dive

00:24:11.980 --> 00:25:32.060
in a bit. Yeah, so I think a lot of people have heard of uv, which is a dependency management tool. But for people who don't know about it, let me just give some quick reasons why uv might be something that you want to look into. A lot of data scientists know conda and know pip, right? I think Poetry is a little bit less common. For a while, Poetry has been the modern tool that allows you to manage so many things, right? Like dependency management and virtual environments. Anyway, there are a lot of things in Poetry. But something that I didn't like about Poetry is that it's pretty slow for dependency management. uv is quick, and it replaces many things. And even if you don't want to learn a new tool, you can basically just use uv pip install something, something, and you have a speed boost. So I wrote an entire article about uv, and I also compared it with the existing tools to give people some motivation for switching to uv.

00:25:32.400 --> 00:25:34.300
It's so easy to switch, so why not?
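
NOTE
Editor's sketch of the "just use uv pip install" point, assuming a hypothetical helper named install_command. It builds the install command with uv when uv is found on PATH and falls back to plain pip otherwise; nothing is executed here. Note that uv pip install normally expects a virtual environment to install into, unless told otherwise.
```python
import shutil
import sys
def install_command(packages, uv_path=None):
    """Return the argv for installing packages, preferring uv when it is on PATH."""
    uv = uv_path or shutil.which("uv")
    if uv:
        return [uv, "pip", "install", *packages]
    return [sys.executable, "-m", "pip", "install", *packages]
# Running it would be: subprocess.run(install_command([...]), check=True)
print(install_command(["pandas", "polars"]))
```
Because the arguments after "pip install" are unchanged, existing scripts can switch with very little editing.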

00:25:34.780 --> 00:25:36.480
Yeah, I think I am a huge fan of uv.

00:25:36.870 --> 00:25:40.400
I actually just interviewed Charlie Marsh for his next project, pyx.

00:25:40.900 --> 00:25:41.320
Oh, nice.

00:25:41.610 --> 00:25:42.620
Two days ago, yeah.

00:25:42.920 --> 00:25:44.900
So uv is going to get even better.

00:25:45.520 --> 00:25:51.560
I just pulled up the performance story for uv compared to Poetry and pip-sync.

00:25:51.610 --> 00:25:53.320
And yeah, it's, what is that?

00:25:53.600 --> 00:25:57.020
20 times faster than Poetry and way faster than everything else.

00:25:57.680 --> 00:25:57.920
Insane.

00:25:58.160 --> 00:25:58.260
Yeah.

00:25:58.520 --> 00:25:59.880
Because of Rust.

00:26:00.590 --> 00:26:01.320
I think Rust.

00:26:01.370 --> 00:26:07.020
And then also they just rethought the caching and other.

00:26:07.110 --> 00:26:11.960
They just, it's Rust plus a bunch of new ideas that came together to make it work really, really well.

00:26:12.070 --> 00:26:15.640
And honestly, I feel a little bit, I feel a little bit bad for the other tools.

00:26:15.940 --> 00:26:18.640
Like I'm a big fan of pip-tools and pipx.

00:26:19.580 --> 00:26:19.820
Me too.

00:26:20.000 --> 00:26:22.260
And, you know, those things are really great.

00:26:22.460 --> 00:26:25.060
And we'll come back and talk about them a bit, I think, as well.

00:26:25.220 --> 00:26:29.660
But I think those tools laid the foundation for what uv is doing.

00:26:30.240 --> 00:26:32.480
And uv sort of naturally brought these together.

00:26:32.700 --> 00:26:39.480
But whenever I think of the contributors and maintainers of those projects, I feel a little bit bad because they're not getting as much love as they used to.

00:26:39.620 --> 00:26:40.500
But they're still great.

00:26:40.680 --> 00:26:42.860
But uv is definitely making a splash.

00:26:43.360 --> 00:26:45.100
Yeah, that's a good way to think about it.

00:26:45.820 --> 00:26:48.160
I get excited about new tools very quickly.

00:26:48.480 --> 00:26:54.540
I don't often think about how the maintainers of the older tools feel when those tools become less popular.

00:26:55.400 --> 00:27:00.820
Yeah, it's the natural way of things, I think, but it's also tough.

00:27:01.280 --> 00:27:02.640
So the speed is important.

00:27:02.760 --> 00:27:14.100
One thing I think is interesting, while I think this is especially interesting for data science, and you touched on this already, is I come more from the web API side of Python, I would say.

00:27:14.460 --> 00:27:16.240
And over there, it's been pip all day long.

00:27:16.500 --> 00:27:17.520
It's been pip the whole time.

00:27:17.980 --> 00:27:26.760
But in the data science side, it's really largely been Conda, Anaconda distributions, that kind of thing, Conda environments.

00:27:27.440 --> 00:27:30.140
There's been kind of this separation.

00:27:30.750 --> 00:27:33.160
And obviously, those are all pulling from the same projects.

00:27:33.340 --> 00:27:35.180
They're all on GitHub and so on.

00:27:35.260 --> 00:27:37.180
But there has been a little bit of a difference.

00:27:37.410 --> 00:27:44.160
And I feel like uv has made a really big splash in data science and kind of bringing those two worlds back together.

00:27:44.600 --> 00:27:45.040
What do you think?

00:27:45.300 --> 00:27:45.960
Yeah, I agree.

00:27:45.980 --> 00:27:48.680
I actually have an article a long time ago.

00:27:49.190 --> 00:27:51.020
I think it's probably on CodeCut still.

00:27:51.500 --> 00:27:54.340
It's called pip vs. Conda vs. Poetry.

00:27:54.820 --> 00:27:58.600
So basically, I was trying to give a very fair comparison.

00:27:58.730 --> 00:28:00.560
I was in favor of Poetry.

00:28:01.180 --> 00:28:03.480
But, you know, I compared everything.

00:28:04.420 --> 00:28:08.840
And, to be honest, I was not a fan of Conda.

00:28:10.020 --> 00:28:16.500
And after the comparison, I was even less of a fan of Conda because I felt like it's very heavy.

00:28:16.920 --> 00:28:19.160
You can use, I think, Miniconda, so it's quicker.

00:28:19.920 --> 00:28:25.740
But at the same time, it installs many dependencies that you didn't ask for.

00:28:25.810 --> 00:28:30.560
So it's kind of like a solution where you have a lot of dependencies out of the box, right?

00:28:30.920 --> 00:28:42.480
But I find that a lot of times, you didn't ask for some of the dependencies that get installed, so you don't really have control over what's being installed or over the way it resolves dependencies.

00:28:43.600 --> 00:28:44.920
I find it's very slow.

00:28:45.890 --> 00:28:50.740
So Poetry was better because, you know, it's cleaner.

00:28:50.870 --> 00:28:52.580
You can see like what you put in.

00:28:52.580 --> 00:28:55.040
You can see like everything in a file.

00:28:55.350 --> 00:28:57.260
Like there's a separate file, right?

00:28:57.380 --> 00:29:02.200
One is pyproject.toml and one is a lock file,

00:29:02.560 --> 00:29:04.320
where you can see all the sub-dependencies.

00:29:04.810 --> 00:29:08.240
And now uv is, to me, it's like that.

00:29:08.450 --> 00:29:10.380
It's like Poetry, but better.

00:29:11.240 --> 00:29:22.600
One reason is that it runs very fast. One drawback of Poetry was that it sometimes thinks very hard and takes a long time to resolve and install new dependencies.

00:29:24.060 --> 00:29:44.940
And also, it replaces other tools, you know, it's kind of an all-in-one tool, which is, I think, very necessary for everybody who uses Python, but especially for data scientists. They already have so many things going on, and then you tell them to learn good practices.

00:29:45.240 --> 00:29:54.220
Let's say they want to upgrade from Python 3.5 to 3.6 to 3.7, and they have to use other tools to do it.

00:29:54.440 --> 00:29:55.480
It seems like a lot.

00:29:55.640 --> 00:30:04.260
If you can just use one tool and do things like upgrading with one command, then why not?

00:30:04.540 --> 00:30:05.840
Yeah, 100%.

00:30:05.880 --> 00:30:06.900
I think it's great.

00:30:07.160 --> 00:30:10.860
It has a lot of compatibility with previous workflows.

00:30:11.210 --> 00:30:19.400
For example, you can use uv pip install instead of using its project management thing, like with add and sync, right?

00:30:19.620 --> 00:30:22.660
So you don't necessarily have to adopt its style.

00:30:22.770 --> 00:30:25.980
You can just say, instead of pip install, you can say uv pip install.

00:30:26.420 --> 00:30:34.420
One of the things that used to bother me, and I honestly just haven't looked for a while, about Conda was, coming from a non-data science side,

00:30:34.480 --> 00:30:40.100
some of the libraries, say, like requests or Flask or something like that.

00:30:40.340 --> 00:30:40.460
Yeah.

00:30:40.620 --> 00:30:41.080
Pyramid.

00:30:41.680 --> 00:30:47.740
Those were held back quite a bit because, you know, they try to make sure there's like a compatibility

00:30:48.180 --> 00:30:50.220
between the versions that they ship through Conda,

00:30:50.520 --> 00:30:54.000
which is, I think that's good for like data science or reproducibility.

00:30:54.300 --> 00:31:01.320
But when it comes to web frameworks, you've got to be super on top of updates in case there's a security vulnerability.

00:31:01.980 --> 00:31:05.120
You know, you want to release that thing, the day the vulnerability is announced.

00:31:05.470 --> 00:31:13.400
You want to upgrade your website because as soon as it's announced, people start scanning the internet for any website that could possibly be susceptible to that problem, right?

00:31:13.610 --> 00:31:16.660
And so like the web tools were held back a lot.

00:31:16.710 --> 00:31:19.200
So that's one of the reasons I never really embraced Conda.

00:31:19.650 --> 00:31:29.700
Yeah, now that you mention it, another thing I didn't like about Conda is, let's say, a data science library like pandas, right?

00:31:30.000 --> 00:31:35.980
If a given pandas version is available, sometimes it's not available in Conda.

00:31:36.100 --> 00:31:36.600
That's one thing.

00:31:37.160 --> 00:31:41.440
But another thing is data scientists, they collaborate with other engineers.

00:31:42.080 --> 00:31:48.840
So at some point, they need to hand their code over to engineers, like machine learning engineers or data engineers, to use.

00:31:49.380 --> 00:31:54.860
And I think in my previous company, they didn't use Conda.

00:31:55.040 --> 00:32:00.820
So they would need to install Conda. There's a way to go from Conda to pip.

00:32:01.280 --> 00:32:07.080
But then the dependencies look very messy, and it's not like a one-to-one translation.

00:32:07.320 --> 00:32:10.940
Sometimes you don't get the same environment.

00:32:11.640 --> 00:32:12.240
Sure, interesting.

00:32:12.560 --> 00:32:15.720
A lot of trouble for handing over, right?

00:32:15.940 --> 00:32:22.640
And reproducibility is a big thing when it comes to data science, and dependencies play a big part in it.

00:32:22.640 --> 00:32:29.220
So if you make it hard to reproduce the dependencies, then there will be some bugs.

00:32:29.560 --> 00:32:33.880
there will be some errors, and even worse, silent errors.

00:32:36.260 --> 00:32:38.760
This portion of Talk Python To Me is brought to you by Agency.

00:32:39.270 --> 00:32:44.420
Build the future of multi-agent software with Agency spelled A-G-N-T-C-Y.

00:32:44.760 --> 00:32:49.160
Now an open source Linux Foundation project, Agency is building the internet of agents.

00:32:49.690 --> 00:32:55.860
Think of it as a collaboration layer where AI agents can discover, connect, and work across any framework.

00:32:56.720 --> 00:32:58.020
Here's what that means for developers.

00:32:58.660 --> 00:33:04.300
The core pieces engineers need to deploy multi-agent systems now belong to everyone who builds on agency.

00:33:04.660 --> 00:33:10.760
You get robust identity and access management, so every agent is authenticated and trusted before it interacts.

00:33:11.240 --> 00:33:23.800
You get open, standardized tools for agent discovery, clean protocols for agent-to-agent communication, and modular components that let you compose scalable workflows instead of wiring up brittle glue code.

00:33:24.260 --> 00:33:25.700
Agency is not a walled garden.

00:33:26.240 --> 00:33:34.800
You'll be contributing alongside developers from Cisco, Dell Technologies, Google Cloud, Oracle, Red Hat, and more than 75 supporting companies.

00:33:35.360 --> 00:33:36.220
The goal is simple.

00:33:36.680 --> 00:33:43.900
Build the next generation of AI infrastructure together in the open so agents can cooperate across tools, vendors, and runtimes.

00:33:44.580 --> 00:33:48.420
Agency is dropping code, specs, and services with no strings attached.

00:33:49.080 --> 00:33:49.520
Sound awesome?

00:33:50.060 --> 00:33:53.620
Well, visit talkpython.fm/agency to contribute.

00:33:54.180 --> 00:33:57.680
That's talkpython.fm/A-G-N-T-C-Y.

00:33:58.140 --> 00:34:01.000
The link is in your podcast player's show notes and on the episode page.

00:34:01.620 --> 00:34:04.380
Thank you, as always, to Agency for supporting Talk Python To Me.

00:34:06.380 --> 00:34:13.240
Next article, let's talk about, I think this is a big one, reproducibility, maintainability, that kind of thing.

00:34:13.560 --> 00:34:16.240
I mean, what you talked about there was kind of a good lead into it, right?

00:34:16.340 --> 00:34:25.500
It's like the production engineers need to make sure that what they're running runs the same as what the data scientists have tested and come up with.

00:34:25.600 --> 00:34:32.600
And if the situation is different, right, the runtime environment is different, well, maybe it's going to give different answers because of different versions or something.

00:34:33.660 --> 00:34:39.500
Yeah, exactly. And dependency is a big thing, but there's also other things that come into it, right?

00:34:39.659 --> 00:34:40.700
And we can talk about it too.

00:34:41.020 --> 00:34:44.899
Yeah, so what are some of the core ideas of reproducibility?

00:34:45.300 --> 00:35:09.480
Yeah, so at the heart of reproducibility, especially for data scientists, is this: when they create code, when they create machine learning models, the result depends on a lot of things. It depends on package dependencies, but it also depends on the parameters that you use.

00:35:09.920 --> 00:35:18.040
And even if you keep everything the same from experiment to experiment, it can produce slightly different accuracy.

00:35:18.500 --> 00:35:24.600
So let's say one data scientist says, I got 0.9 accuracy.

00:35:25.060 --> 00:35:28.440
Then they hand it over to their teammate to deploy.

00:35:28.800 --> 00:35:31.820
And then the performance degrades.

00:35:32.300 --> 00:35:34.060
It no longer has 0.9 accuracy.

00:35:34.200 --> 00:35:36.800
It could be, you know, around 0.8.

00:35:37.300 --> 00:35:51.280
And so reproducibility is very important in terms of, you know, you want people, if you produce some result, you want other people to be able to reproduce the same results so that your company can profit out of it.

00:35:51.500 --> 00:36:07.060
But another thing is trust, right? Let's say you're a data scientist and you create something that is good, but then you hand it over and it's bad. Then it reduces the trust from your co-workers.

00:36:07.680 --> 00:36:19.080
The third thing is, let's say data scientists create some project and later they want to reuse it for another project. If they cannot reproduce the same results,

00:36:19.980 --> 00:36:27.220
then they kind of throw it away and create everything from scratch.

00:36:27.740 --> 00:36:29.320
So it's also time consuming.

00:36:30.440 --> 00:36:35.260
And another big theme that I think goes hand in hand with reproducibility is maintainability.

00:36:36.140 --> 00:36:37.000
But yeah, you can go ahead.

00:36:37.120 --> 00:36:38.440
Yeah, no, I totally agree.

00:36:38.540 --> 00:36:42.000
The maintainability part is super important as well.

00:36:42.120 --> 00:36:44.500
And so what are some of the pieces that go into that?

00:36:44.640 --> 00:37:03.180
I mean, the first thing that comes to mind for me is pinning your dependencies, either through using something like uv init, uv add, uv sync, or just a requirements.txt file or pinning your requirements, you know, through some other means like with poetry, but somehow clearly documenting what's used.

00:37:03.220 --> 00:37:12.500
I think it can be easy for data scientists to just import some things in a notebook and then do a magic, well, uv pip install that right here and let's just keep going.

00:37:12.580 --> 00:37:15.040
But then like, why won't it run later, you know?

00:37:15.300 --> 00:37:17.880
Yeah, are you talking about maintainability or reproducibility?

00:37:18.140 --> 00:37:22.520
Well, I was thinking maintainability, but both, yeah.

00:37:22.980 --> 00:37:29.620
So maintainability, another thing is, another aspect that comes into maintainability is readability, right?

00:37:30.400 --> 00:37:32.960
And I think this is a big thing for data scientists.

00:37:33.480 --> 00:37:38.400
A lot of data scientists don't write readable code and modular code.

00:37:38.700 --> 00:39:08.920
What you will see is a big chunk of code with nested if-else, or global variables used inside a function, or no functions at all, and there's a lot of duplication in the code. I mean, that is great for quick experimentation, when you want to get something out quickly, but at the same time it creates a lot of technical debt, because these things all lead to code that is not readable, right? I can go deeper, I just want to list them out: readability and then reproducibility. You want your code to be readable because data scientists collaborate with other engineers. Another problem I saw is that data scientists cannot read their own old code after a couple of months. They forget. They're really like, what did I write? And they spend quite a bit of time trying to understand what they wrote. And if they cannot understand it, they don't trust it, right? If you don't understand it fully, you cannot trust it and you cannot use it in the next project, so a lot of times they write things from scratch again. Readability also ties into reproducibility: if your code is not written well, it's harder to reproduce the same result. Let's say you use global variables and they affect your code, then it's hard.

00:39:09.940 --> 00:39:16.240
You might introduce some external factors that make the code different the next time you run.
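
NOTE
Editor's sketch, not from the episode: a minimal illustration of the point about global variables and reproducibility.
All names here (THRESHOLD, label_with_global, label, sample_scores) are hypothetical and chosen only for this example.
It contrasts a function that reads hidden global state with one that takes everything as arguments, and uses a seeded generator so sampled data repeats.
```python
import random
THRESHOLD = 0.5  # module-level global that any notebook cell could silently change
def label_with_global(scores):
    """Depends on hidden state: the result changes whenever THRESHOLD is reassigned."""
    return [s >= THRESHOLD for s in scores]
def label(scores, threshold):
    """Depends only on its arguments, so the same inputs always give the same output."""
    return [s >= threshold for s in scores]
def sample_scores(n, seed):
    """Uses a local, seeded generator so the sampled data is identical on every run."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]
scores = [0.2, 0.7, 0.95]
first = label_with_global(scores)   # evaluated while THRESHOLD is 0.5
THRESHOLD = 0.9                     # e.g. an unrelated cell edits the global
second = label_with_global(scores)  # same call, same data, different answer
stable = label(scores, 0.5)         # explicit parameter: unaffected by the edit above
```
Passing the threshold and the seed explicitly is one possible way to keep a rerun from depending on what happened to execute earlier in the session.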

00:39:16.580 --> 00:39:18.900
So, maintainability really is very important

00:39:19.340 --> 00:39:32.960
and that comes with writing readable code, version control, dependency management, and configuration management, meaning you don't want to hard-code values into your code.

00:39:33.120 --> 00:39:37.920
Yeah, this is a really great article here because it shows you so many tools and techniques, I think.

00:39:38.180 --> 00:39:38.680
Yeah, thank you.

00:39:39.000 --> 00:39:40.200
And I also,

00:39:40.780 --> 00:39:47.540
so writing a maintainable data science project has been my passion for a long period of time.

00:39:48.220 --> 00:39:50.820
And it actually has become a book like this.

00:39:51.040 --> 00:39:56.880
This is the article form, but I also recently wrote a book about this, finished a book about this.

00:39:57.040 --> 00:39:58.860
It's called Production Ready Data Science.

00:39:59.440 --> 00:40:00.200
Yeah, congratulations.

00:40:00.700 --> 00:40:01.660
That's amazing.

00:40:01.900 --> 00:40:02.840
When did you finish it?

00:40:02.900 --> 00:40:03.780
When did you publish it?

00:40:04.460 --> 00:40:05.740
Two months ago, I think.

00:40:05.960 --> 00:40:06.180
Okay.

00:40:06.500 --> 00:40:07.520
So really recently, yeah.

00:40:07.900 --> 00:40:08.540
Very recently.

00:40:08.740 --> 00:40:10.100
It took me quite a while.

00:40:10.360 --> 00:40:13.120
I have other materials because I've been writing about this.

00:40:14.300 --> 00:40:21.220
I had so many materials to put together, and I also wanted to make it less of an article style.

00:40:21.580 --> 00:40:28.220
Explaining more of the why before the how is also what I focus on a lot in this book.

00:40:28.420 --> 00:40:29.920
Yeah, that's a super important part.

00:40:30.320 --> 00:40:37.240
Now, I am a huge fan of thinking a lot about readability and spending a lot of time on readability.

00:40:37.340 --> 00:40:38.800
And I agree with you.

00:40:39.060 --> 00:40:41.900
I think we're on the same page for sure on that one.

00:40:42.020 --> 00:40:45.640
Because if your code is not readable, that doesn't mean it can't be read.

00:40:45.640 --> 00:40:47.040
It just means it's harder to read, right?

00:40:47.560 --> 00:40:50.940
If your code is not readable, it's harder to maintain it.

00:40:51.000 --> 00:40:52.180
It's harder to see the bugs.

00:40:52.360 --> 00:40:52.980
It's harder to read.

00:40:53.160 --> 00:40:54.660
All the things that you've pointed out.

00:40:54.980 --> 00:41:00.280
And I think it's a little bit ironic that the notebook style of development

00:41:00.300 --> 00:41:01.940
that is meant to communicate.

00:41:02.340 --> 00:41:06.760
Like it's supposed to be more communication compared to regular programming, right?

00:41:06.840 --> 00:41:13.080
It's got the markdown cells, it's got the pictures, it's got little tables for like df.head, that kind of stuff, right?

00:41:13.720 --> 00:41:19.840
Its natural way of programming leads to less maintainable, less reusable code, right?

00:41:19.940 --> 00:41:21.400
Because it does not encourage functions.

00:41:22.000 --> 00:41:23.780
So like it's really hard to do reuse

00:41:24.420 --> 00:41:29.800
if you don't think, oh, maybe I need to write a function in this cell, collapse the cell and then use the function later, right?

00:41:29.860 --> 00:41:31.940
That happens, but not very much.

00:41:31.980 --> 00:41:36.340
I was doing some research for, or just looking around, really.

00:41:36.560 --> 00:41:41.000
Research is a little bit high, putting it in a high position compared to what it was.

00:41:41.040 --> 00:41:53.620
I was looking around for some examples, and I found this really cool JPL collection of notebooks for studying, I think, galaxies, like studying the brightness of galaxies and trying to understand things about them.

00:41:53.940 --> 00:41:59.640
They had a bunch of cool notebooks in there, and they were like 1,500 lines of code.

00:41:59.700 --> 00:42:17.160
and stuff in there. One function. One. I'm like, oh, that is probably, I mean, maybe this is necessary, but it's probably not necessary. And this is JPL, right? This is like a polished published thing, not just a random notebook I ran across, right? The Jet Propulsion Laboratory

00:42:17.860 --> 00:42:24.100
in Pasadena, which is, I consider them to be pretty amazing. So maybe speak to a little bit

00:42:24.220 --> 00:42:28.280
about that, like this maintainability thing in juxtaposition with like notebook style.

00:42:28.680 --> 00:42:44.580
Yeah, definitely. That's the thing. I think five years ago, I wrote an article that got very popular, but also a lot of controversy. It's called, I think, something about seven reasons why I switched from notebook to Python script or something.

00:42:46.780 --> 00:42:52.660
So anyway, the thing about notebooks is a lot of people argue, yeah, you can write functions.

00:42:53.100 --> 00:42:54.220
It's just Python, right?

00:42:54.220 --> 00:42:54.980
You can write functions.

00:42:55.020 --> 00:42:59.460
You can write whatever you want, just like a Python script.

00:42:59.820 --> 00:43:04.740
But the interactive style of it kind of discourages you from doing it.

00:43:04.740 --> 00:43:10.860
Like, why should I write a function when I can just see the result and then I build upon it?

00:43:12.300 --> 00:43:19.100
Because in a notebook, what is so great about it is you can see the results and then you can build upon it in the next cell, next cell, next cell.

00:43:19.330 --> 00:43:25.340
So people are less likely to write a function, because if you create a function, then the code is hidden inside the function.

00:43:25.540 --> 00:43:26.800
It doesn't execute the code.

00:43:27.960 --> 00:43:28.760
So that's one thing.

00:43:29.160 --> 00:43:34.740
Another problematic thing about notebooks is that they encourage hard-coding.

00:43:35.070 --> 00:43:41.700
You know, like a lot of time, because if you want to see the result, you want to put your value in there to see the result.

00:43:42.380 --> 00:43:53.980
Another very big problem is cell execution, right? The fact that it gives you the ability to execute cells in different

00:43:54.380 --> 00:43:56.000
orders is nice, but it's

00:43:56.300 --> 00:44:07.940
also a disadvantage, in that if you run them in the wrong order, you can get a result that is different from the result you wanted.

00:44:08.240 --> 00:44:13.920
Right, if you forget to run a particular cell, you go back up, run that to refresh it, and then you skip back down.

00:44:14.420 --> 00:44:18.300
Something in the middle was important and you don't rerun it. That's probably not good.

00:44:18.620 --> 00:44:30.480
Yeah, exactly. I remember when I worked in notebooks before, it just gave me a headache because I would run something and say, oh, did I rerun this? Like, did I run that cell or did I not?

00:44:30.760 --> 00:44:40.060
Just to make sure I'll rerun it. So it's something, and then later when I come back to notebook, I was like, I'm not going to reuse this notebook again because I don't know what's going on here.

00:44:40.340 --> 00:44:40.460
Yeah.

00:44:41.020 --> 00:44:44.380
I don't want people to think that I'm completely anti-notebook.

00:44:44.560 --> 00:44:44.980
I'm not.

00:44:45.360 --> 00:44:49.000
But I do think that there are some techniques and some tips that you can use.

00:44:49.180 --> 00:45:02.040
So when I was riffing on functions, one thing I think you could do, and you touched on it as well, is you could take the parts of your code that you want to reuse and put that in a Python script next to your notebook and import it, right?

00:45:02.220 --> 00:45:03.020
And then you can use it.

00:45:03.280 --> 00:45:10.900
A lot of times, I think there's a bunch of cells in notebooks that don't really have to do with the presentation or the important part.

00:45:11.380 --> 00:45:15.640
They're just there like, well, I need to load this thing or I need to like clean up that or something.

00:45:15.730 --> 00:45:18.600
And it has to go before what you actually want to show people.

00:45:19.000 --> 00:45:19.980
You could put that in a script

00:45:20.620 --> 00:45:22.320
and you could just import it

00:45:22.400 --> 00:45:24.000
and then show the important part.

00:45:24.460 --> 00:45:26.320
It's more readable, more reusable.

00:45:26.640 --> 00:45:27.280
It's more modular.

00:45:27.480 --> 00:45:29.420
Like in all ways, almost, it's better.
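
NOTE
Editor's sketch, not from the episode: one way the "put it in a script next to your notebook and import it" tip could look.
The module name prep.py, the function names, and the sample data are all hypothetical; the CSV text stands in for a real file.
```python
# prep.py  (hypothetical helper module saved next to the notebook)
import csv
import io
def load_rows(text):
    """Parse CSV text into a list of dicts (a stand-in for loading a real file)."""
    return list(csv.DictReader(io.StringIO(text)))
def clean_rows(rows):
    """Drop rows with a missing price and convert the remaining prices to float."""
    return [{**r, "price": float(r["price"])} for r in rows if r.get("price")]
# In the notebook, a single cell would then be enough:
#     from prep import load_rows, clean_rows
#     rows = clean_rows(load_rows(raw_text))
raw_text = "item,price\napple,1.5\npear,\nplum,2.0\n"
rows = clean_rows(load_rows(raw_text))
```
The loading and cleaning steps stay out of the way of the presentation cells, and because they are plain functions in a module, they can also be reused by other scripts.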

00:45:29.800 --> 00:45:30.900
Yes, I agree.

00:45:31.560 --> 00:45:34.860
Yeah, same, notebooks are great.

00:45:35.160 --> 00:45:36.720
It's very good for presentation.

00:45:37.480 --> 00:45:44.440
But I think the problem is a lot of the time people use it for more than what it was designed for, which is, I mean, prototyping, right?

00:45:44.840 --> 00:45:51.500
It's to showcase something, to showcase it with either non-technical or technical people.

00:45:51.900 --> 00:45:53.680
It's not used for like everything.

00:45:54.070 --> 00:45:55.320
And I think that's a problem.

00:45:55.470 --> 00:46:00.840
Like a lot of data scientists use it for everything, including writing production code.

00:46:01.140 --> 00:46:03.100
And again, that's another argument, right?

00:46:03.240 --> 00:46:06.700
Like, can you ship your production-ready code in notebooks?

00:46:07.000 --> 00:46:08.220
And Netflix did that.

00:46:08.360 --> 00:46:12.220
Netflix created production-ready code in notebooks, and that's great.

00:46:12.580 --> 00:46:14.060
Papermill and all that stuff, yeah.

00:46:14.260 --> 00:46:16.360
Yeah, that technology allows you to do it.

00:46:17.050 --> 00:46:18.920
I feel like it's still...

00:46:19.540 --> 00:46:24.540
But in the end, it's a tool, and if you use it right, it will serve you.

00:46:24.820 --> 00:46:26.040
Yeah, it's interesting.

00:46:26.340 --> 00:46:30.800
Speaking of tools, I know you've talked a little bit about Marimo.

00:46:31.400 --> 00:46:32.440
What do you think about that?

00:46:32.600 --> 00:46:34.100
I think this is a really interesting one.

00:46:34.220 --> 00:46:40.140
I just did a sort of, I did a course called Just Enough Python and it's software engineering for data scientists.

00:46:40.700 --> 00:46:40.900
Nice.

00:46:41.020 --> 00:46:42.000
And I was trying, thanks.

00:46:42.200 --> 00:46:48.380
Yeah, and it's like, okay, a lot of themes that you talk about as well, like you should probably learn Git, you know, and so on.

00:46:48.920 --> 00:46:58.000
But one of the things I really struggled with was I really wanted to use Marimo as the foundation, but, you know, 90% or more data scientists use Jupyter.

00:46:58.380 --> 00:47:09.460
So I ended up using Jupyter, But I think Marimo actually might be a little bit better in terms of software engineering and a little bit better in terms of this order of operations you talked about.

00:47:10.080 --> 00:47:12.620
Because Marimo tracks like, hey, this cell depends upon that cell.

00:47:12.760 --> 00:47:15.040
So if you rerun this one, we've got to rerun that one first.

00:47:15.260 --> 00:47:16.340
It'll catch those kind of things.

00:47:16.560 --> 00:47:18.640
But you want to maybe talk about this real quick?

00:47:18.760 --> 00:47:20.080
Because I know you wrote about it a little bit.

00:47:20.280 --> 00:47:21.880
Yeah, I really like Marimo.

00:47:22.020 --> 00:47:22.940
I think it's great.

00:47:23.480 --> 00:47:24.020
It has been.

00:47:24.340 --> 00:47:26.580
So I was actually in the middle of writing the book.

00:47:26.780 --> 00:47:26.880
Right.

00:47:27.020 --> 00:47:33.160
And I know that data scientists love notebooks, and I didn't want it to be like, hey, let's just write Python scripts.

00:47:33.520 --> 00:47:39.720
But at the same time, I really think that it's really hard to write production-ready code in a notebook.

00:47:39.990 --> 00:47:42.200
And, you know, my book is about production-ready code.

00:47:42.430 --> 00:47:48.760
So how do you find the middle ground between, you know, the interactive style of a notebook and production-ready code?

00:47:49.070 --> 00:47:52.240
So Marimo came out and I dug deep into it.

00:47:52.240 --> 00:47:53.520
It solved a lot of problems.

00:47:53.980 --> 00:47:57.640
And first is the cell execution order, right?

00:47:57.780 --> 00:48:01.980
So the problem in a notebook is, let's say you have cell 1, cell 2, cell 3.

00:48:02.180 --> 00:48:06.600
You change something in cell 1, for example.

00:48:06.800 --> 00:48:09.040
You run them all, but then you change something in cell 1.

00:48:09.280 --> 00:48:16.600
If you forgot to rerun cell 3, then the result is not going to reflect what you have changed in cell 1.

00:48:17.000 --> 00:48:20.180
But with Marimo, it can detect the dependencies.

00:48:20.740 --> 00:48:31.120
And if you change cell 1 and cell 3 depends on cell 1, on a variable in cell 1, then it will execute cell 3 automatically, which is great.
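
NOTE
Editor's sketch, not from the episode: a toy model of the dependency-aware rerun idea described above.
This is illustrative only and is not Marimo's actual implementation; the Cell class, its reads and writes fields, and run_with_dependents are hypothetical names.
It assumes cells are listed in an order where each cell comes after the cells it reads from.
```python
class Cell:
    def __init__(self, name, func, reads, writes):
        self.name, self.func, self.reads, self.writes = name, func, reads, writes
def run_with_dependents(cells, changed, state, log):
    """Run the changed cell, then every later cell that reads a variable that was rewritten."""
    dirty = set()
    for cell in cells:
        if cell.name == changed or dirty & set(cell.reads):
            state[cell.writes] = cell.func(state)
            dirty.add(cell.writes)
            log.append(cell.name)
cells = [
    Cell("cell1", lambda s: 10, reads=[], writes="x"),
    Cell("cell2", lambda s: "unrelated", reads=[], writes="note"),
    Cell("cell3", lambda s: s["x"] * 2, reads=["x"], writes="y"),
]
state, log = {}, []
for c in cells:  # first full run, top to bottom
    state[c.writes] = c.func(state)
cells[0].func = lambda s: 50  # the user edits cell 1
run_with_dependents(cells, "cell1", state, log)  # cell 3 reruns, cell 2 does not
```
In a classic notebook the user would have to remember to rerun cell 3 by hand; here the stale value of y cannot survive the edit to cell 1.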

00:48:31.600 --> 00:48:39.140
Another big disadvantage is that, you know, a notebook is JSON, right?

00:48:39.540 --> 00:48:47.720
So it's very big, unreadable, and hard to hand over or reuse.

00:48:48.020 --> 00:48:48.480
It's really hard.

00:48:48.540 --> 00:48:55.720
Right. It's so common to get a Git conflict when you try to pull a new version just because the output has changed.

00:48:55.840 --> 00:48:57.900
And that shouldn't even be taken into account, right?

00:48:58.300 --> 00:49:00.340
And yeah, that shouldn't be taken into account.

00:49:00.740 --> 00:49:02.400
And also the output can be big.

00:49:02.900 --> 00:49:07.720
But with Marimo, the notebook is just a Python script under the hood.

00:49:07.920 --> 00:49:09.880
So you have the...

00:49:09.980 --> 00:49:14.020
And the great thing about a Python script is you can reuse it.

00:49:14.020 --> 00:49:16.340
You can import it in another Python script.

00:49:16.460 --> 00:49:18.800
but also for CI/CD, right?

00:49:20.059 --> 00:49:22.860
And you can write unit tests.

00:49:23.150 --> 00:49:28.460
You know, unit tests are important, but in a notebook it's hard to write unit tests.

00:49:28.640 --> 00:49:31.080
There's a workaround, but it's a hack.

00:49:32.030 --> 00:49:38.440
And sometimes people need to put their tests inside Notebook, which is not modular, right?

00:49:38.560 --> 00:49:44.160
It makes it very messy, with code and tests in the same place.

00:49:45.000 --> 00:49:52.740
But if you write a notebook that is a Python script under the hood, then you can write unit tests and you can run them with pytest.

00:49:53.900 --> 00:49:54.360
Yeah, very cool.

00:49:54.520 --> 00:49:56.240
I think this is a neat one.

00:49:56.400 --> 00:50:02.920
I genuinely want to use it, but at the same time, I've probably got to speak the same language as the people doing that.

00:50:03.320 --> 00:50:03.440
Exactly.

00:50:03.920 --> 00:50:04.480
Yeah, yeah.

00:50:05.140 --> 00:50:11.460
Another thing I want to talk about here, actually this maintainability article has just got so much good stuff in it.

00:50:11.600 --> 00:50:16.140
Something I think that might be new to folks is Hydra, right?

00:50:16.560 --> 00:50:21.380
Let's talk about this idea of managing configuration files with Hydra.

00:50:21.700 --> 00:50:23.460
Yeah, I love Hydra.

00:50:23.620 --> 00:50:30.120
So Hydra is developed by Facebook Research, and I think I use it.

00:50:30.280 --> 00:50:31.420
It has been my go-to tool.

00:50:31.660 --> 00:50:33.120
So I have some core tools.

00:50:33.220 --> 00:50:37.520
I go back to for every data science project, and Hydra is one of them.

00:50:40.420 --> 00:50:53.120
I guess you can think of Hydra as a configuration management tool: you can write your configuration inside a YAML file, and then you can call things.

00:50:53.480 --> 00:50:59.960
You can access the variables inside the YAML file inside a Python script.

00:51:00.380 --> 00:51:01.300
That is nothing new.

00:51:01.490 --> 00:51:08.240
People know there's a YAML library in Python that allows you to work with YAML.

00:51:08.620 --> 00:51:15.080
But something I like a lot about Hydra is that it reduces the boilerplate code.

00:51:15.480 --> 00:51:21.040
So you can just add a decorator to your Python function.

00:51:21.390 --> 00:51:27.280
And then it will be able to access everything underneath inside the Python function.

00:51:27.780 --> 00:51:37.960
And with bracket notation, for example, to access a nested value, you might need multiple nested brackets.

00:51:38.360 --> 00:51:42.140
But instead, you can use dot notation, which is very clean.

00:51:43.010 --> 00:51:48.720
Another thing is, as your project gets bigger, right?

00:51:48.890 --> 00:51:51.720
I mean, for a small project, maybe a YAML is enough.

00:51:52.020 --> 00:52:02.560
As your project gets bigger, especially in data science experimentation, you want to be able to use different values.

00:52:02.960 --> 00:52:05.160
You want to experiment with different ways of processing.

00:52:05.720 --> 00:52:10.540
You want to experiment with different parameters for your models, and that can get messy.

00:52:10.670 --> 00:52:12.240
Your YAML file becomes very big.

00:52:12.700 --> 00:52:18.860
So a good way to handle this is you want to break it into smaller configuration.

00:52:20.040 --> 00:52:31.760
But if you want to do that without Hydra, then you need to write Python functions to access it, to compose it, to put the pieces of the puzzle together.

00:52:32.720 --> 00:52:36.540
With Hydra, basically everything is handled for you.

00:52:36.780 --> 00:52:41.500
So you can break it apart and then you can refer to the different pieces.

00:52:41.960 --> 00:52:47.120
You can, like in one YAML file, you can refer to the other part of the YAML file.

00:52:47.240 --> 00:52:48.460
It's kind of like an import, right?

00:52:48.680 --> 00:52:49.300
Import in Python.

00:52:49.840 --> 00:52:51.380
And you can override it.

00:52:51.820 --> 00:52:56.200
Another thing I realized, you can override the parameter from the command line.

00:52:56.580 --> 00:52:57.620
Yeah, that's a cool feature.

00:52:57.960 --> 00:53:00.179
Yeah, so you don't need to go back into the...

00:53:00.200 --> 00:53:10.120
Because a lot of time you want to experiment with different parameters or inside the production environment and the development environment, you might use different parameters.

00:53:10.700 --> 00:53:20.040
And if you can override from the command line, which is used a lot in, you know, different platforms, then that will be amazing.

00:53:20.480 --> 00:53:21.420
Yeah, it's super neat.

00:53:21.560 --> 00:53:24.700
Basically, you provide a bunch of default values instead of hard coding them.

00:53:25.100 --> 00:53:30.800
And then if you need to override them, you can do that through the command line or you could just do it in code or whatever, right?
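
The idea of "defaults that can be overridden from the command line, then read with dot notation" can be sketched with the standard library alone. This is not Hydra's API. The `DEFAULTS` dictionary, the helper functions, and the config keys are all hypothetical, and the sketch handles only two levels of nesting. With Hydra itself, the defaults would live in YAML files and an override would be passed on the command line in the same `key.subkey=value` form.

```python
from types import SimpleNamespace

# Hypothetical defaults that would normally live in a YAML file.
DEFAULTS = {
    "data": {"path": "data/raw.csv", "test_size": 0.2},
    "model": {"name": "ridge", "alpha": 1.0},
}


def parse_value(text: str):
    """Interpret an override value as int, float, or plain string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_overrides(defaults: dict, overrides: list[str]) -> dict:
    """Return a copy of defaults with 'section.key=value' overrides applied."""
    config = {section: dict(values) for section, values in defaults.items()}
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        section, _, key = dotted_key.partition(".")
        if section not in config or key not in config[section]:
            raise KeyError(f"unknown config key: {dotted_key}")
        config[section][key] = parse_value(raw)
    return config


def to_namespace(config: dict) -> SimpleNamespace:
    """Allow cfg.model.alpha instead of cfg['model']['alpha']."""
    return SimpleNamespace(**{k: SimpleNamespace(**v) for k, v in config.items()})


# In a real script the overrides would come from sys.argv[1:].
cfg = to_namespace(apply_overrides(DEFAULTS, ["model.alpha=0.5"]))
print(cfg.model.name, cfg.model.alpha, cfg.data.test_size)
```

The defaults stay untouched, so the same script can run with production values by default and with experimental values when an override is supplied.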

00:53:31.140 --> 00:53:31.580
Yeah, exactly.

00:53:31.940 --> 00:53:32.640
Yeah, excellent.

00:53:32.940 --> 00:53:36.300
And you bring it all together here in your article, uv run process.

00:53:38.540 --> 00:53:42.520
So you got your Hydra, you got your uv run coming together, all that stuff.

00:53:42.740 --> 00:53:43.900
So you actually have a whole article.

00:53:44.400 --> 00:53:47.020
I originally started talking about it because it was in the maintainability section.

00:53:47.120 --> 00:53:50.400
But you have a whole article called Hydra for Python Configuration, right?

00:53:50.480 --> 00:53:57.560
Yeah, the other article is more high-level, but then I have smaller articles that talk about each tool.

00:53:58.500 --> 00:54:02.140
Super. Okay. Well, we're not making very good progress through this, but that's okay.

00:54:03.260 --> 00:54:04.500
Because these are really interesting topics.

00:54:05.760 --> 00:54:09.160
Another one that you talk a lot about is doing proper logging.

00:54:09.700 --> 00:54:13.800
And you recommend Loguru, which I love. I think it's really great.

00:54:13.810 --> 00:54:16.580
I used Logbook for a long time, from Armin Ronacher.

00:54:16.980 --> 00:54:19.500
But then I've been slowly moving towards Loguru.

00:54:19.940 --> 00:54:20.140
Nice.

00:54:20.500 --> 00:54:21.500
Yeah, I really like it.

00:54:21.800 --> 00:54:27.480
So another thing is, you know, a lot of data scientists use print, and there's nothing wrong with print.

00:54:28.480 --> 00:54:33.580
Like, in fact, in a notebook you shouldn't use logging; I think it's overkill.

00:54:33.680 --> 00:54:35.300
You use print in a notebook, right?

00:54:35.640 --> 00:54:42.880
But when it comes to running production-ready code, if you have a bunch of prints, because a lot of times you have debugging prints, right?

00:54:43.360 --> 00:54:46.820
And if, yeah, you can go back and delete them and everything.

00:54:47.360 --> 00:54:48.680
But wouldn't it be great?

00:54:48.800 --> 00:54:52.980
But then if you go back to the development environment, then you need to put it back.

00:54:53.340 --> 00:55:01.920
Or if you don't do it at all, if you keep all the prints inside your code and go into production, now you have a bunch of noise when you see the output.

00:55:02.500 --> 00:55:13.940
So with logging, I think a big advantage is that you can use different levels like debug, info, warning, and error.

00:55:14.040 --> 00:55:22.360
So you can set the level based on the environment: let's say if you're in the development environment, you want to show the debug level.

00:55:22.440 --> 00:55:28.860
If you're going to the production environment, then you might want to do like the info level only.
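
A minimal sketch of switching log levels by environment with the standard library `logging` module. The `APP_ENV` environment variable, the `pipeline` logger name, and the `configure_logging` helper are assumptions made for illustration. The debug messages stay in the code and are simply filtered out when the level is set to info.

```python
import logging
import os


def configure_logging(environment: str) -> logging.Logger:
    """Show DEBUG and above in development, INFO and above elsewhere."""
    level = logging.DEBUG if environment == "development" else logging.INFO
    logger = logging.getLogger("pipeline")  # hypothetical logger name
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding a second handler on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter(
                "%(asctime)s | %(levelname)s | %(funcName)s:%(lineno)d | %(message)s"
            )
        )
        logger.addHandler(handler)
    return logger


# APP_ENV is an assumed variable name; default to the quieter setting.
logger = configure_logging(os.environ.get("APP_ENV", "production"))
logger.debug("loaded 1000 rows")   # hidden unless APP_ENV=development
logger.info("training started")    # shown in both environments
```

The handler and formatter setup is the kind of configuration code the conversation turns to next.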

00:55:29.280 --> 00:55:36.120
But one drawback of logging is, I think it's quite a bit of boilerplate code.

00:55:37.040 --> 00:55:37.800
That's the first thing.

00:55:37.960 --> 00:55:43.780
So really, even if you can copy and paste, it's still like one barrier,

00:55:44.000 --> 00:55:45.880
you know, like one barrier to using logging.

00:55:46.120 --> 00:55:49.580
Why should I do all of this code when I can just use print?

00:55:49.880 --> 00:55:54.400
Right, it's got like, you got to register handlers and all sorts of funky stuff, yeah.

00:55:55.040 --> 00:56:02.720
But what if you can have the advantage of print, like so easy to use and the benefit of logging?

00:56:03.240 --> 00:56:12.580
That's why I'm such a big fan of Loguru, which has the best of both worlds. And another thing is that it gives you beautiful output out of the box.

00:56:12.980 --> 00:56:17.260
So the logging output looks colorful.

00:56:17.780 --> 00:56:19.000
It's very easy to see.

00:56:20.140 --> 00:56:24.260
With standard logging, you need to configure a lot. You know, what do you want to show?

00:56:24.600 --> 00:56:25.360
You want to show this.

00:56:25.420 --> 00:56:27.480
You want to show the lines of code.

00:56:27.480 --> 00:56:31.520
You want to show the function where the output is coming from.

00:56:32.460 --> 00:56:40.580
Instead of all of that configuration, with Loguru you can just say from loguru import logger and then logger.info something, something.

00:56:40.940 --> 00:56:42.880
and the output is very beautiful.

00:56:43.580 --> 00:56:44.180
Yeah, I really like it.

00:56:44.190 --> 00:56:52.040
Basically, if you wanted to keep your print style, you can just import it and use it like print, right?

00:56:52.320 --> 00:56:52.540
Yeah.

00:56:52.860 --> 00:57:05.160
The other thing, like you pointed out, is super big for people who are primarily data scientists, who haven't worked with logging a lot or are new to programming: instead of just saying logger print, you say logger.debug or logger.info or logger.error.

00:57:05.290 --> 00:57:09.340
And then you can say, production, just show me the errors or hey, something's going wrong.

00:57:09.560 --> 00:57:13.160
Actually, show me all the warnings and the info.

00:57:13.670 --> 00:57:16.160
And you can dial it up and down without changing the code.

00:57:16.480 --> 00:57:20.000
So you can leave your effective print statements there and just turn them off with a configuration.

00:57:20.380 --> 00:57:21.360
Maybe with Hydra, who knows?

00:57:21.680 --> 00:57:21.920
Yeah.

00:57:22.200 --> 00:57:22.780
Yeah, really nice.

00:57:22.810 --> 00:57:23.220
And the color.

00:57:23.860 --> 00:57:25.420
Yeah, the color is very nice as well.

00:57:25.520 --> 00:57:26.180
It's very colorful.

00:57:26.620 --> 00:57:26.860
Yeah.

00:57:27.740 --> 00:57:30.740
And the color persists through files, by the way, right?

00:57:30.900 --> 00:57:37.260
So if you have Loguru logging to a file and then you tail the file in your terminal, you see the color still, which is awesome.

00:57:37.460 --> 00:57:37.840
Oh, wow.

00:57:37.910 --> 00:57:38.560
I didn't know that.

00:57:38.680 --> 00:57:39.680
That's a new point.

00:57:40.380 --> 00:57:40.800
Yeah, yeah.

00:57:41.260 --> 00:57:46.200
Because some of my stuff is all logging with Loguru, and I tail it, and the color comes through.

00:57:46.220 --> 00:57:47.980
And it's so much easier.

00:57:49.160 --> 00:57:54.540
Like, for example, on the website, one of the things that I'm logging to a separate file is what are the requests?

00:57:54.960 --> 00:57:56.100
What is the response time?

00:57:56.520 --> 00:57:57.920
What is the status code?

00:57:58.020 --> 00:57:58.640
Is it a 500?

00:57:58.880 --> 00:57:59.600
Is it 200?

00:58:00.000 --> 00:58:00.640
Is it a 300?

00:58:00.820 --> 00:58:01.440
You know, whatever, right?

00:58:01.480 --> 00:58:01.780
All that.

00:58:02.020 --> 00:58:06.020
And so I'll actually have it change the color based on what happens.

00:58:06.220 --> 00:58:09.480
So if the message is a 500, it's a different color than if it's a 200.

00:58:09.760 --> 00:58:16.020
Or if the response time is over 500 milliseconds, it changes color to like a warning sign versus if it's under.

00:58:16.520 --> 00:58:17.820
And all that stuff comes through.

00:58:17.830 --> 00:58:19.480
And it's like so nice with Loguru.

00:58:19.480 --> 00:58:19.580
So nice.

00:58:19.980 --> 00:58:20.040
Yeah.

00:58:20.340 --> 00:58:21.760
Especially when you have a lot of logs.

00:58:22.000 --> 00:58:23.420
And that is very useful.

00:58:23.860 --> 00:58:24.080
Yeah.

00:58:24.400 --> 00:58:24.520
Absolutely.

00:58:24.840 --> 00:58:24.960
Okay.

00:58:25.580 --> 00:58:25.660
Git.

00:58:26.860 --> 00:58:28.060
Let's talk about source control.

00:58:28.840 --> 00:58:29.020
Yeah.

00:58:29.200 --> 00:58:31.620
Version control, I think, is huge.

00:58:31.840 --> 00:58:32.580
It's very useful.

00:58:33.540 --> 00:58:38.260
And it's very important for data scientists, but I don't see it being used enough.

00:58:38.650 --> 00:58:47.400
Yeah, the last thing I did, actually: I was a machine learning engineer at Accenture, working with clients.

00:58:47.890 --> 00:58:59.220
So I was building kind of like a data science environment, data science tool that integrated with AWS SageMaker.

00:58:59.280 --> 00:59:04.000
And it's funny how many data scientists are not using Git.

00:59:04.330 --> 00:59:07.400
And there are multiple data science teams.

00:59:07.650 --> 00:59:09.300
And they want to communicate with each other.

00:59:09.680 --> 00:59:10.880
They want to share their work.

00:59:11.030 --> 00:59:19.980
But then every time they want to share their work, it's so hard for them because they don't have version control, which surprised me.

00:59:20.640 --> 00:59:22.040
It's a big client.

00:59:22.450 --> 00:59:23.440
I mean, it's a big question.

00:59:24.070 --> 00:59:25.080
Why don't they use Git?

00:59:26.040 --> 00:59:29.980
And if they do, they don't follow very good practices.

00:59:30.700 --> 00:59:35.660
So that's why I want to educate data scientists more about Git.

00:59:36.600 --> 00:59:41.380
And to me, Git is so important for several reasons.

00:59:42.380 --> 00:59:47.280
First, let's say if you do something, right, you mess up some code.

00:59:49.780 --> 00:59:51.620
Let me give an example of data scientists.

00:59:51.700 --> 00:59:57.600
They produce, they write a model, they write code that produces a model that works great.

00:59:57.900 --> 01:00:08.420
Then later, they want to tweak the parameters, maybe experiment with different types of processing, to see if the model performance can increase.

01:00:08.960 --> 01:00:10.140
And they mess it up.

01:00:10.820 --> 01:00:15.960
The model decreases in performance and they have edited two or three files.

01:00:16.090 --> 01:00:18.320
Now they need to, okay, what did I edit?

01:00:18.540 --> 01:00:19.500
Now I need to revert it.

01:00:19.700 --> 01:00:28.080
And if you didn't already save the last version that was working, you cannot just revert to it, right?

01:00:28.360 --> 01:00:30.960
But with Git, you can, there's multiple things.

01:00:31.150 --> 01:00:35.740
If you're just a solo data scientist, then you can revert it to the last working version.

01:00:37.400 --> 01:00:43.520
But if you work with other data scientists, then you can share, you can also collaborate with them.

01:00:43.600 --> 01:01:00.720
Like, different data scientists can work on different versions, on different features, and when they're ready to publish, they can merge them together, which is so much better with Git than manually going file to file.

01:01:01.120 --> 01:01:02.860
I think, I agree with you.

01:01:02.860 --> 01:01:07.440
I think data scientists don't use source control or version control enough.

01:01:07.760 --> 01:01:14.740
And I think that even when people work alone on their own project, they should use Git or something else.

01:01:14.820 --> 01:01:20.280
And honestly, they should just use Git these days because it's the lingua franca of everything that people are doing.

01:01:21.140 --> 01:01:24.300
But it lets you do exactly what you're talking about.

01:01:24.560 --> 01:01:26.920
And it's not just, oh no, I have to go back.

01:01:27.220 --> 01:01:29.180
But I think a really important part of people,

01:01:29.480 --> 01:01:31.780
speaking to the people who don't really use it that much,

01:01:32.020 --> 01:01:39.780
or they only use it in their professional settings in the team maybe, I think it lets you be fearless in exploring new ideas, right?

01:01:40.100 --> 01:01:41.780
Like, all right, I'm going to commit this version.

01:01:42.200 --> 01:01:52.220
And then let me just terrorize my code and, like, try something crazy. And, oh my gosh, it worked? Great, we'll keep it. If it doesn't work, nothing lost. You can 100% go back. And if people don't, if they're

01:01:52.320 --> 01:02:09.780
not, I mean, that's a good reason even by yourself, right? Yeah, yeah, exactly. And I feel like it's so easy. You can do git add, git commit. I mean, there are a lot of things you can do with Git, but I think you can just learn some basics: git add, git commit, git push. That's it. Yeah, yeah, there's not,

01:02:10.320 --> 01:02:50.500
you don't have to learn much. There's maybe five or six commands you've got to learn. Mostly it's the tools that you're working with. You mentioned VS Code; it's got awesome Git (oh, yeah) integration, and PyCharm and others. Yeah, yeah, absolutely. I think that's really, really nice. Let me give you two more ideas and see how these land with you, okay, for why I think Git is now more important than ever. Yeah. Number one: if you're doing any agentic coding (and here's another reason to use a proper editor instead of a notebook as well, potentially), you can go and say, hey, AI agent, I would like you to change this and add this feature and do this thing. And it'll just go and go for 10, 15 minutes.

01:02:51.740 --> 01:03:10.400
What I find a lot of times with these agentic coding things is they make great progress. You're like, wow, this is amazing. It makes more great progress. Like, how is this even possible? I live in the future. And then the next step is, oh no, it broke it all. You know what I mean? Yeah. Oh, yeah. You have the same experience? Yes. And I love that you go into this topic because

01:03:10.640 --> 01:03:36.860
that's kind of, I think, the message I have been mentioning: it's more important than ever now that you're working with AI, because of two different things, right? A lot of the time you wanted it to do something and then it broke, so you want to go back to the previous version. That's one thing. Another thing is, if you commit very frequently, then you can see what it changed and you can catch it before you commit or push.

01:03:37.320 --> 01:03:43.520
Because, I think, it so frequently does something that is not optimal.

01:03:43.700 --> 01:03:45.780
For example, multiple nested.

01:03:46.680 --> 01:03:51.080
Like with AI, a lot of time, it overcomplicates the solution.

01:03:51.240 --> 01:03:57.620
It adds so many try/except and if/else blocks, or it hard-codes things.

01:03:57.900 --> 01:03:58.800
It's just bad practices.

01:03:59.500 --> 01:04:09.460
But if you use version control, you can see the difference between the last code and the new code, and you can catch it before you commit.
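
To make "seeing the difference before you commit" concrete, here is a small standard-library sketch that produces the same kind of unified diff that `git diff` prints. The two versions of the `load` function are invented for illustration: the proposed version wraps the original in a broad `try/except`, which is the sort of over-complication a review of the diff would surface.

```python
import difflib

# Hypothetical last committed version of a function.
previous = """def load(path):
    with open(path) as f:
        return f.read()
"""

# Hypothetical edited version that silently swallows every error.
proposed = """def load(path):
    try:
        with open(path) as f:
            return f.read()
    except Exception:
        return ""
"""

diff = list(
    difflib.unified_diff(
        previous.splitlines(),
        proposed.splitlines(),
        fromfile="last commit",
        tofile="working copy",
        lineterm="",
    )
)
print("\n".join(diff))

# Collect only the added lines (skip the '+++' file header).
added = [
    line[1:] for line in diff
    if line.startswith("+") and not line.startswith("+++")
]
```

Lines prefixed with `+` are additions and lines prefixed with `-` are removals, so the new `except Exception` branch stands out at a glance.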

01:04:10.150 --> 01:04:12.640
And yeah, it's more important than ever.

01:04:13.480 --> 01:04:14.740
Yeah, you're 100% right.

01:04:14.840 --> 01:04:15.640
You're right on it.

01:04:16.040 --> 01:04:25.320
I find that when I'm doing the agentic coding side of things, I actually have the source control tab open in my editor, not the files tab, but the source control.

01:04:25.410 --> 01:04:29.460
And as it's going, I'll see what files are changed and I'll start like looking at the diffs.

01:04:29.720 --> 01:04:32.080
Even while it's still working, I'm like, oh, it's on a good path.

01:04:32.120 --> 01:04:34.080
oh no, it shouldn't be doing this.

01:04:34.120 --> 01:04:37.540
And either you ask it to fix it or you just go, no, I'm reverting it.

01:04:37.820 --> 01:04:39.980
Yeah, I added a feature to the website.

01:04:40.560 --> 01:04:42.920
Not a super big one, but kind of touched a lot of files.

01:04:43.080 --> 01:04:50.260
I think there was 15 commits to achieve that change because every time a major, I'm like, okay, the AI has gotten something right.

01:04:50.540 --> 01:04:51.740
I'm like, save that to Git.

01:04:51.860 --> 01:04:52.120
Yes.

01:04:52.320 --> 01:04:55.240
Because it's exactly like you described.

01:04:55.340 --> 01:04:55.580
It's awesome.

01:04:55.940 --> 01:04:57.620
Yeah, I experienced the same.

01:04:57.760 --> 01:05:22.520
Like, I would have so many commits for one feature change. Because also, something I really like to do is go through planning very, very thoroughly in advance, before it even writes code, to ensure that everything is what I expected.

01:05:23.070 --> 01:05:24.080
And then I let it run.

01:05:24.250 --> 01:05:24.500
Right.

01:05:25.580 --> 01:05:29.420
And what I did depends on what I do.

01:05:29.660 --> 01:05:35.440
But recently I tried to develop a workflow to go from, you know, my code to a newsletter.

01:05:36.080 --> 01:05:55.540
So I wanted to implement something, and I want to be able to walk away, come back, and be able to go back to each phase. So I break it into phases, let's say five phases, right? The easiest phase first, and then the more complicated phases, like the nice-to-have features.

01:05:56.020 --> 01:05:59.540
Now I want to go back into each phase and see if it breaks.

01:05:59.600 --> 01:06:00.960
I want to know when it breaks.

01:06:02.420 --> 01:06:13.520
So for every phase, I ask it to commit, so that if it breaks somewhere, I can go back to the previous version. That's another thing: you can ask it to commit.

01:06:14.660 --> 01:06:15.320
Yeah, yeah.

01:06:15.500 --> 01:06:15.820
That's interesting.

01:06:15.940 --> 01:06:17.400
I haven't thought about asking it to commit.

01:06:17.520 --> 01:06:17.800
That's cool.

01:06:18.100 --> 01:06:23.200
I'll give you one final reason to use Git in the agentic world.

01:06:23.320 --> 01:06:24.740
And then I think we're out of time.

01:06:24.860 --> 01:06:25.540
We're going to have to call it.

01:06:25.900 --> 01:06:27.220
but this has been a great conversation.

01:06:28.080 --> 01:06:34.060
One other thing you talked about, like, oh, it's made good progress and then something didn't work and you can actually tell it.

01:06:34.360 --> 01:06:35.260
You gave your example.

01:06:35.660 --> 01:06:36.800
Let me make this more concrete.

01:06:36.810 --> 01:06:40.520
You gave your example of, oh, I made some changes as, forget the AI, right?

01:06:40.580 --> 01:06:41.240
I made some changes.

01:06:41.740 --> 01:06:44.260
It turns out that the performance fell apart.

01:06:44.560 --> 01:06:45.480
I edited three files.

01:06:45.530 --> 01:06:47.100
I couldn't quite remember what was wrong.

01:06:47.360 --> 01:06:49.700
You could go to an agentic AI and say,

01:06:49.980 --> 01:06:57.460
please look at my local Git history from this commit forward and try to understand why the performance fell apart.

01:06:57.570 --> 01:07:01.560
And I bet you it will have some really good ideas, if not know exactly why.

01:07:01.710 --> 01:07:04.560
And it can just, it will do what we described for us in reverse.

01:07:04.800 --> 01:07:07.020
Like it'll just look at the Git history and go, this changed, this changed.

01:07:07.380 --> 01:07:09.000
Oh, if this changed, I bet it had this effect.

01:07:09.090 --> 01:07:09.780
And then, right.

01:07:09.920 --> 01:07:13.740
So you can leverage it in reverse and feed it to the AI to get understanding too.

01:07:14.060 --> 01:07:14.460
Yeah, yeah.

01:07:14.540 --> 01:07:17.720
That's such a great use case that you just talked about. A lot of the time,

01:07:18.000 --> 01:07:19.780
You're like, okay, use git diff.

01:07:20.440 --> 01:07:22.820
Look at the difference and tell me what's wrong.

01:07:23.500 --> 01:07:27.420
That's another thing that I use very often.

01:07:28.320 --> 01:07:36.000
Also, another thing on top of it is, let's say it doesn't code the way you like. Everybody has their own preferred style of doing things, right?

01:07:36.290 --> 01:07:52.980
So sometimes you can edit it and tell it, okay, look at your implementation and my implementation and just take note of the differences, so that next time it will be able to create something closer to the style you're looking for.

01:07:53.360 --> 01:08:13.500
Yeah. Yeah, that's a great idea. I feel like we honestly could talk about this for hours, but we're out of time. We're out of time, Khuyen. So thank you for being here. I think let's leave it with this. Let's leave it with two things. First of all, maybe final call to action. People are interested in CodeCut.ai, all of the stuff you've created. How do they get started with all of your resources?

01:08:13.800 --> 01:08:20.480
Yes. So just go to my website, CodeCut, so C-O-D-E-C-U-T dot A-I.

01:08:20.859 --> 01:08:24.240
And then you will see how to sign up for newsletter.

01:08:24.819 --> 01:08:31.460
There's also a book tab, so if you're interested in Production Ready Data Science, you can buy it there.

01:08:31.680 --> 01:08:32.839
It's currently on sale.

01:08:33.560 --> 01:08:35.980
It's Labor Day week, so 20% off.

01:08:36.460 --> 01:08:38.080
And yeah, that's it.

01:08:38.500 --> 01:08:40.420
And if you want to read my blog, go to blog.

01:08:40.660 --> 01:08:44.359
If you want to see a sample of my code snippets, go to short posts.

01:08:45.120 --> 01:08:46.620
Everything is very easy to navigate.

01:08:47.240 --> 01:08:48.720
Just go to CodeCut.ai.

01:08:49.120 --> 01:08:49.480
Fantastic.

01:08:49.740 --> 01:08:49.920
All right.

01:08:49.950 --> 01:08:52.960
Well, thank you so much for being here and sharing all your experience and your work.

01:08:53.240 --> 01:08:55.299
Really, congratulations on CodeCut.

01:08:55.380 --> 01:08:56.299
It's a cool project.

01:08:56.720 --> 01:08:57.359
Thank you so much.

01:08:57.660 --> 01:08:57.839
You bet.

01:08:58.359 --> 01:08:58.640
See you later.

01:08:58.940 --> 01:08:59.359
We'll see you.

01:08:59.560 --> 01:08:59.720
Bye.

01:09:00.859 --> 01:09:03.240
This has been another episode of Talk Python To Me.

01:09:03.680 --> 01:09:04.480
Thank you to our sponsors.

01:09:04.880 --> 01:09:06.259
Be sure to check out what they're offering.

01:09:06.480 --> 01:09:07.859
It really helps support the show.

01:09:08.560 --> 01:09:09.960
Take some stress out of your life.

01:09:10.180 --> 01:09:15.720
Get notified immediately about errors and performance issues in your web or mobile applications with Sentry.

01:09:16.259 --> 01:09:20.640
Just visit talkpython.fm/sentry and get started for free.

01:09:21.040 --> 01:09:24.299
And be sure to use the promo code talkpython, all one word.

01:09:25.020 --> 01:09:27.799
AGNTCY. Discover agentic AI with AGNTCY.

01:09:28.299 --> 01:09:32.359
Their layer lets agents find, connect, and work together, any stack, anywhere.

01:09:33.020 --> 01:09:39.060
Start building the internet of agents at talkpython.fm/agency, spelled A-G-N-T-C-Y.

01:09:39.720 --> 01:09:52.460
If you or your team needs to learn Python, we have over 270 hours of beginner and advanced courses on topics ranging from complete beginners to async code, Flask, Django, HTML, and even LLMs.

01:09:52.900 --> 01:09:55.100
Best of all, there's not a subscription in sight.

01:09:55.540 --> 01:09:57.360
Browse the catalog at talkpython.fm.

01:09:57.780 --> 01:09:59.180
Be sure to subscribe to the show.

01:09:59.620 --> 01:10:01.360
Open your favorite podcast player app.

01:10:01.800 --> 01:10:02.500
Search for Python.

01:10:02.760 --> 01:10:03.660
We should be right at the top.

01:10:04.060 --> 01:10:07.560
If you enjoy the Geeky Rap theme song, you can download the full track.

01:10:07.880 --> 01:10:09.420
The link is in your podcast player's show notes.

01:10:10.140 --> 01:10:11.340
This is your host, Michael Kennedy.

01:10:11.800 --> 01:10:12.920
Thank you so much for listening.

01:10:13.100 --> 01:10:13.940
I really appreciate it.

01:10:14.420 --> 01:10:16.040
Now get out there and write some Python code.

01:10:29.040 --> 01:10:35.740
We'll be right back.