Software Supply Chain Security with Phylum

Episode #457, published Fri, Apr 19, 2024, recorded Wed, Jan 24, 2024

We've spoken previously about security and software supply chains, and we are back at it this episode. We're diving in again with Charles Coggins. Charles works at a software supply chain security company and is on to give us an insider's and a defender's perspective on how to keep our Python apps and infrastructure safe.

Episode Deep Dive

This conversation covers a wide range of topics around Python packaging, supply chain security threats, and best practices to keep your environment safe. Below are the key topics and takeaways.

Guest Background

In this episode, our guest is Charles “Charlie” Coggins, a seasoned Python developer who works at Phylum, a company focused on software supply chain security. Charlie originally started his career in a non-traditional programming path, worked for the US government on cybersecurity initiatives, and later transitioned into Python development. He has spent the past couple of years at Phylum working on Python integrations and helping defend against modern threats in open-source software ecosystems.

1. Software Supply Chain Security Concerns

Why it matters: Software developers have significant power to impact many users; a single compromised dependency may affect thousands of downstream projects.
Multiplicative effect: A malicious or vulnerable library can propagate through transitive dependencies, making early detection and safe practices crucial.

2. Lock Files and Dependency Management

Importance of pinning dependencies: Pinning versions (e.g., with lock files) ensures reproducibility and can prevent malicious updates from being unwittingly pulled in.
pip-tools and pip-compile
- A popular way to generate a “lock file” style requirements output that includes the complete transitive dependency list.
- GitHub: pip-tools
Other tools: Mentions of poetry, hatch (with hatchling), and the now-rejected PEP 665 proposal to standardize Python lock files.
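
As a rough illustration of what "pinned" means in practice, the sketch below scans requirements-style text and reports entries that are not locked to one exact version. The function name, the regular expression, and the sample packages are assumptions made for this example; this is not how pip-tools itself works, only a quick way to spot loose entries.

```python
import re

# Matches a requirement pinned to one exact version, e.g. "requests==2.31.0".
# Extras ("pkg[extra]==1.0") are allowed; wildcards such as "==3.0.*" are not.
PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[^\s;*]+(?=\s|;|$)"
)


def find_unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):
            continue  # skip blanks and pip options such as -r or --hash
        line = line.rstrip("\\").strip()  # drop a line-continuation backslash
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Hypothetical requirements file content.
sample = """\
requests==2.31.0
urllib3>=1.26
pydantic
idna==3.6  # via requests
"""

print(find_unpinned(sample))  # ['urllib3>=1.26', 'pydantic']
```

A check like this could run in CI to fail a build when a loose requirement slips in; the full transitive list still has to come from a lock-file tool.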

3. PEPs around Python Packaging

PEP 517 & 518: Discussed as the mechanism behind pyproject.toml and build backends.
- PEP 517 defines a build-system independent format for source trees.
- PEP 518 specifies minimum build system requirements and the structure of pyproject.toml.
pyproject.toml: Modern approach for declaring build systems and dependencies, reducing the need for setup.py.

4. Common Supply Chain Attacks

Typosquatting: Malicious packages use a name very close to a well-known package (e.g., missing or swapped letters in requests).
Starjacking: Attackers link their PyPI package to a popular project's repository so that the borrowed metadata (like GitHub stars) makes the package appear legitimate.
Dependency Confusion: A private/internal package name is hijacked when a higher-versioned package of the same name is published to PyPI.
Repo Jacking & Expired Domains: Taking over old GitHub handles or domains tied to a package’s original author to push compromised updates.
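
The typosquatting idea can be sketched with a simple name-similarity check. The example below uses the standard-library difflib module to flag a package name that is close to, but not the same as, a name you already trust. The allow-list, the 0.8 threshold, and the function name are assumptions for illustration; real scanners such as Phylum use far more signals than string similarity.

```python
from difflib import SequenceMatcher
from typing import Optional

# Illustrative allow-list; in practice this could come from your own lock file.
KNOWN_PACKAGES = ["requests", "pydantic", "numpy"]


def possible_typosquat(
    name: str, known: list[str] = KNOWN_PACKAGES, threshold: float = 0.8
) -> Optional[str]:
    """Return the known package that `name` closely resembles, or None."""
    candidate = name.lower()
    if candidate in known:
        return None  # exact match: this is the real package
    for good in known:
        # ratio() is 1.0 for identical strings and drops as they diverge.
        if SequenceMatcher(None, candidate, good).ratio() >= threshold:
            return good
    return None


for name in ["requests", "reqeusts", "requets", "flask"]:
    print(name, "->", possible_typosquat(name))
```

A swapped-letter name like "reqeusts" and a missing-letter name like "requets" are both reported as resembling "requests", while an unrelated name is not.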

5. Phylum’s Approach and Tooling

Phylum CLI and Integrations
- Phylum CLI is written in Rust but installable via Python (pipx install phylum or pip install phylum).
- Phylum can be integrated into CI/CD (e.g., GitHub Actions, GitLab pipelines) to block or warn on malicious dependencies.
- Free community edition allows up to five projects, while paid tiers accommodate larger teams and organizations.
- Website: phylum.io

Key Takeaways

Lock Down Your Dependencies
Use strict version pinning (via tools like pip-tools or poetry) to ensure you’re only installing known, trusted versions.
Monitor and Update Regularly
Attackers rely on unmaintained or outdated environments. Periodically review and update your pinned dependencies.
Verify Source Integrity
Watch out for unverified or direct Git installations and ensure you’re referencing the correct package repos.
Implement Automated Security Checks
Tools like Phylum or other CI/CD scanners help detect malicious updates or suspicious package behavior before it hits production.
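
For the "Verify Source Integrity" point, one lightweight check is to list requirements that install straight from a URL or a version control system rather than from a package index. The sketch below is a heuristic under stated assumptions: the function name and sample entries are hypothetical, and it only recognizes common prefixes and the spaced "name @ url" direct-reference form.

```python
def find_direct_references(requirements_text: str) -> list[str]:
    """Return requirement lines that install from a URL or VCS, not an index."""
    url_prefixes = ("git+", "hg+", "svn+", "bzr+", "http://", "https://")
    flagged = []
    for raw in requirements_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Editable installs wrap the same kind of reference.
        for prefix in ("-e ", "--editable "):
            if line.startswith(prefix):
                line = line[len(prefix):].strip()
        # VCS or bare URLs, plus "name @ url" direct references.
        if line.startswith(url_prefixes) or " @ " in line:
            flagged.append(line)
    return flagged


# Hypothetical requirements; the URLs are placeholders.
sample_refs = """\
requests==2.31.0
git+https://github.com/example/project.git@main
-e git+https://github.com/example/other.git#egg=other
mypkg @ https://example.com/mypkg-1.0.tar.gz
"""

for entry in find_direct_references(sample_refs):
    print("review:", entry)
```

Flagged entries are not necessarily malicious; they are simply the ones that bypass the index and deserve a manual look.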

Overall Takeaway

Software supply chain attacks on Python projects are becoming more common, but there are concrete steps you can take to mitigate risk:

Pin dependencies with lock files.
Regularly audit and monitor your code and its third-party packages.
Employ CI/CD security tools to catch issues early.

By investing in these practices, you’ll drastically reduce the chance of inadvertently shipping malicious or compromised Python applications.

Links from the show

Series: How Malicious Python Code Gains Execution: blog.phylum.io

Pick a Python Lockfile and Improve Security: blog.phylum.io
Bad Beat Poetry: blog.phylum.io
PEP 665 – A file format to list Python dependencies for reproducibility of an application: peps.python.org
PEP 517 – A build-system independent format for source trees: peps.python.org
PEP 518 – Specifying Minimum Build System Requirements for Python Projects: peps.python.org
Lockfiles should be committed on all projects: classic.yarnpkg.com
An Overview of Software Supply Chain Security: tldrsec.com
Typosquatting: docs.phylum.io
Common Attack Pattern Enumeration and Classification: capec.mitre.org
Dependency Confusion: docs.phylum.io
Expired Author Domains: docs.phylum.io
Unverifiable Dependency: docs.phylum.io
Repo Jacking: Hidden Danger in Broken Links: blog.phylum.io
Software Libraries Are Terrifying: medium.com
phylum 0.43.0: pypi.org
linguist: github.com
rich-codex ⚡️📖⚡️: ewels.github.io
Phylum Community Discord: discord.gg
The dream is dead?: mastodon.social
When "Everything" Becomes Too Much: The npm Package Chaos of 2024: socket.dev
pip-tools: github.com
Watch this episode on YouTube: youtube.com
Episode #457 deep-dive: talkpython.fm/457
Episode transcripts: talkpython.fm

Episode Transcript

00:00 We've spoken previously about security and software supply chains, and we're back at it on this episode.

00:05 We're diving in again with Charlie Coggins. Charlie works at a software supply chain company

00:10 and is on the episode to give us an insider's look and a defender's perspective on how to keep

00:16 our Python apps and infrastructure safe. This is Talk Python To Me, episode 457, recorded January 24th, 2024.

00:25 Welcome to Talk Python To Me, a weekly podcast on Python. This is your host, Michael Kennedy.

00:45 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,

00:50 both on fosstodon.org. Keep up with the show and listen to over seven years of past

00:55 episodes at talkpython.fm. We've started streaming most of our episodes live on YouTube. Subscribe to

01:02 our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and

01:07 be part of that episode. This episode is brought to you by Sentry. Don't let those errors go unnoticed.

01:13 Use Sentry like we do here at Talk Python. Sign up at talkpython.fm/sentry. And it's brought to

01:20 you by Mailtrap, an email delivery platform that developers love. Use their email sandbox to inspect

01:27 and debug emails in staging, dev, and QA environments before sending them to recipients in

01:32 production. Try Mailtrap for free at talkpython.fm/Mailtrap. Hey, Charlie, welcome to Talk Python To Me.

01:40 Hi, Michael.

01:40 Great to have you here. We have corresponded back and forth about security things. And now,

01:48 you're scared. It's going to seem that way. There are threats everywhere, especially when you start

01:54 looking. And that's the problem. You look, you'll find them. And if you're not looking, you might get

02:00 affected without even knowing it.

02:03 Yeah, but that's true. But we're also going to come with some tools and techniques and tips on

02:07 how to avoid security problems with your Python code.

02:11 Yes, absolutely.

02:12 Yeah, I think it's especially concerning. That certainly catches my attention that if you mess

02:21 with somebody's software, like the software builders, the developers, it gets shipped to however many

02:27 users are on the other side of that equation, right? It's not like I just took over some

02:31 teenagers gaming PC, and now what can I do? It's like, I took over, you know, name your big web app,

02:39 and now we're going to start shipping some stuff around. All right, that's where the sort of

02:44 multiplicative aspect of this gets more concerning than just standard personal computer safety, right?

02:50 Oh, absolutely. You know, a single developer can have very broad impacts. You know, maybe they

02:58 publish one package, but that one package could be included in hundreds, thousands of other packages

03:04 as a dependency. And then everyone using those packages could be affected, whether the code is

03:10 good and works as intended or poorly written and has bugs and vulnerabilities.

03:15 Yeah, this is not to say there's any chance of there being a problem with Pydantic. But just to make

03:21 your point, if you go to like Pydantic or Requests or something like that, a lot of these have "used by"

03:27 projects, right? And this Pydantic is used by 315,000 people, not people, software projects that

03:35 themselves have users, right? And so that's the kind of stuff that I'm thinking about when I said that

03:40 multiplicative effect, right? It's a big multiplier, not just a couple.

03:44 Oh, yeah. Yeah, for sure.

03:45 Yeah. Now, before we dive into our main topic, of course, you know, tell people a bit about yourself.

03:51 Hi, well, my name is Charles Coggins. I usually go by Charlie. And I'm a Python developer. I'm a

03:59 software developer, but not through the traditional sense. I don't have a computer science degree. I

04:03 didn't come to this, you know, straight out of school. I got my first taste of programming long

04:11 enough ago, back in the 80s, in 1987. My dad got a computer for us. And, you know, I was messing

04:18 around on there with some games, always with games, right? You know, at the time, it was basic.

04:25 You know, it was this bowling game that my brother and I would play. And I saw that I could look at the

04:30 code. I could look at the source. And I went in there and modified it a bit to make it so that I would

04:34 always win whenever I played him.

04:38 How long did it take him to catch on?

04:39 Oh, he figured it out pretty quickly. And he was in there changing, you know, ball speed and,

04:44 you know, how often he could get a gutter or get a gutter. But yeah, I, you know, took a class or two in high

04:52 school and college. But I was an electrical engineering major and then went to work for the government

04:59 doing something that wasn't even really that. So I spent 10 years working for the government before

05:07 they stood up the U.S. Cyber Command and decided or figured out that they needed to hire 6,000 new

05:17 developers to fill the positions. And there weren't that many available in the industry, let alone those

05:24 who could, you know, pass the clearances and work in that environment. So they looked to people already

05:30 working in the government. And I raised my hand. I said, yes, yes, I want to cross train. I'll be a

05:35 developer. And so they trained me.

05:38 Excellent. What did they teach you for language in that program?

05:41 We started with C, C++, and then there was some Python. So I went through a couple of boot camps and

05:49 a lot of self-learning, self-teaching. Python's the one that really clicked for me. It just made sense

05:55 in my head.

05:56 Yeah, of course. If you're learning to do cybersecurity stuff, you know, a lot of times I'd be happy to tell

06:02 people like, ah, you don't really need to learn C or Rust or Java. If you just know Python, you're

06:09 probably 90% of the time golden. But if you're trying to do cybersecurity, a lot of times it's about

06:15 like the machine level stuff, right? Understanding things like C and pointers and buffer overflows and

06:21 all of that kind of stuff is where you actually kind of need to be.

06:24 And they taught us all that as well. In fact, we learned assembly language as well. And that one

06:30 really didn't fit in my brain.

06:33 You're not like, I didn't want to become an assembly language programmer.

06:38 I mean, yeah, that's a whole different breed.

06:41 Yeah, it sure is. And, you know, it used to be, I remember when I first got into programming,

06:47 I was doing some C, C++ and inline assembly was something people would do a lot to optimize. A

06:53 lot like people might do Cython or Numba or something like that to make Python fast. Like we'll find this

06:59 little part and we'll rewrite it in this way and be like, we're just going to do inline assembly. I'm like,

07:03 that just doesn't seem like worthwhile. I don't need that much performance. We're going to not do that.

07:09 Yeah. Yeah.

07:10 Fun. So now you're working at Phylum. Is it Python focused or just software security?

07:20 It's not Python focused. In fact, the company primarily develops with Rust, as you were mentioning.

07:28 Okay. Yeah.

07:29 Yeah. Yeah. We've got some excellent Rust developers at our company. And I think that's what attracted

07:36 a lot of them is that that is the primary language we use. But we also have some elements in Python. And

07:43 when I came on board, I got assigned to work on our integrations. So like GitHub,

07:51 integrations, GitLab, pre-commit hooks, things like that. And so I was able to kind of architect it the

07:59 way I thought best. And because I love Python, I made it all in Python and exposed it through Docker

08:06 containers.

08:07 Are you doing direct integration with Rust, like PyO3? Or is it more issuing commands out?

08:17 The Rust elements that our company works on, like our API, the command line interface, a lot of the

08:25 backend, it's just written straight Rust. And then the Python is just plain Python. There's no

08:31 interface between the two, really.

08:33 Yeah. Okay. Consuming APIs and Docker containers and stuff like that.

08:37 Right. Right. Right. Although I am interested in PyO3. And I think there's room to

08:45 bridge the two languages at our company.

08:47 I mean, for sure, people are adopting Rust for the performance foundations of Python. It's pretty

08:56 interesting.

08:57 Yeah. Yeah. I've been at the company almost two years now. I keep saying it's what I'm going to

09:03 learn next is Rust. And I felt like I would just kind of absorb it by going through code reviews and

09:09 the people on my team. It hasn't happened yet. I can kind of understand what's going on by reading it,

09:14 but I just, yeah, I need to jump in.

09:17 It's in depth. Okay. Got it. Those are the same. Okay. Got it.

09:19 Yeah. Yeah.

09:20 No, it's interesting. Okay. Well, we're not here to talk about Rust, although I do think

09:26 it's becoming one of those things that is sort of, I don't know if you need to be a little one level

09:32 deeper in the Python space that used to be C and now it's, I think it's pretty solidly moving to be

09:38 Rust, right? There's a lot of popular things. Pydantic, for example, I pulled up earlier where

09:43 that's the foundation, but that also seems to be where the momentum is.

09:46 Yeah. The oxidation of Python libraries is a real thing. I mean, you know, look at Ruff, right?

09:53 Yeah. Ruff. I just heard about how Granian, I think it was, which is a new, similar,

10:01 similar to Gunicorn and uWSGI, is a Rust-based async server. You know, it goes on and on.

10:08 This portion of Talk Python is brought to you by OpenTelemetry support at Sentry.

10:16 In the previous two episodes, you heard how we use Sentry's error monitoring at Talk Python and how

10:22 distributed tracing connects errors, performance, and slowdowns, and more across services and tiers.

10:27 But you may be thinking, our company uses OpenTelemetry, so it doesn't make sense for us to switch to

10:34 Sentry. After all, OpenTelemetry is a standard and you've already adopted it, right? Well, did you know,

10:40 with just a couple of lines of code, you can connect OpenTelemetry's monitoring and reporting

10:45 to Sentry's backend. OpenTelemetry does not come with a backend to store your data, analytics on top

10:51 of that data, a UI, or error monitoring. And that's exactly what you get when you integrate Sentry with

10:58 your OpenTelemetry setup. Don't fly blind. Fix and monitor code faster with Sentry. Integrate your

11:05 OpenTelemetry systems with Sentry and see what you've been missing. Create your Sentry account at

11:09 talkpython.fm/sentry-telemetry. And when you sign up, use the code TALKPYTHON,

11:16 all caps, no spaces. It's good for two free months of Sentry's business plan, which will give you

11:20 20 times as many monthly events as well as other features. My thanks to Sentry for supporting Talk

11:27 Python To Me. All right, well, let's talk about software security though. You know, like we touched

11:34 on it a little bit with the multiplicative aspect of like why software developers should care.

11:39 But maybe let's start with some ways in which viruses might get on a computer from a software

11:46 perspective. Not from like, oh, you know, I found this cool app on BitTorrent and normally it's paid,

11:51 but this one's free. It's like, maybe don't install that. But, you know, not that kind of advice,

11:56 right? But, you know, specifically for software developers.

11:59 Right, right. So for software developers, I think the primary vector, you know, for malicious code

12:07 running in your environment or really any developer environment along the way, it doesn't just have to

12:12 be your system. It could be your CI/CD servers and your runners. It's going to be software

12:19 dependencies, third-party code, right? Code from strangers on the internet, right? That's really what it boils down to.

12:26 They just, Charlie, they're just here to help out. They're just giving you the code to help out.

12:32 They have no bad intentions. Right, right.

12:35 Except for that one. That one over there, don't take that one.

12:39 Yeah. And it's hard to tell, you know, what's good, what's bad. And I think we all rely on third-party code.

12:49 I mean, I think it's a rare company, rare project that writes everything from scratch on their own without any dependencies.

12:57 Yeah.

12:59 So that's a vector for sure, is allowing code from strangers on the internet to run. I think, like, the name of the game, right,

13:07 for attackers and threat actors is arbitrary code execution. Like, that's the key phrase.

13:16 Arbitrary code execution. If I can get arbitrary code execution with this vulnerability, then I've won.

13:21 I can attack your stuff.

13:22 You're going to get a CVE score of nine or above. It's right there.

13:26 Yeah, exactly. And that's for vulnerabilities. That's just, you know, poorly written code or code with bugs.

13:32 But forget about vulnerabilities. I mean, if you're an attacker, you're a threat actor, you've already got the perfect means to run arbitrary code,

13:41 to gain arbitrary code execution on a developer system. That's with third-party dependencies.

13:46 Open source software is just the perfect target for writing malware or slipping malware into packages.

13:55 Now, when people hear this, we've talked about it enough. It actually came as quite a surprise a few years ago.

14:02 People theoretically knew that it could happen, but that it was happening is that packages on package stores like PyPI and NPM and so on

14:12 got published vulnerabilities that people could then install and make part of theirs.

14:16 But there's a whole software supply chain, right?

14:19 Maybe talk us through some of the different elements that make that up.

14:21 Only one of which is these libraries, right?

14:24 That's right. That's right.

14:25 So the software supply chain is really it's using third-party code securely, as well as securing the end-to-end development process.

14:35 So that process is very broadly broken into three phases.

14:39 You've got the source phase.

14:43 That's source control management systems and then actually coding.

14:49 Developers coding on their systems, committing to repositories.

14:55 Yeah.

14:55 And you mentioned the dependencies like pip install this or that.

15:01 There's also, for many of the really popular IDEs and editors, there's a whole massive array of varying levels of trusted plugins or extensions, right?

15:13 As well.

15:14 That's right.

15:14 Yeah.

15:15 Like Visual Studio Code, that's what I use for my IDE.

15:19 It's got an extensive extension ecosystem.

15:22 Just about anything you want to do.

15:26 I get a little pop-up when I open a new project and it says, oh, I recognize you're using a YAML file.

15:30 Do you want to download this extension that will lint YAML files, right?

15:35 Yeah.

15:35 I got one for CSVs.

15:37 It was like rainbow CSV syntax highlighter or something.

15:42 I'm like, you know what?

15:43 That's not really made by a trusted company.

15:46 It's probably fine.

15:47 It's probably fine.

15:49 But I don't need my CSV files highlighted so much so that I'm willing to just run arbitrary code from a stranger on the internet.

15:56 That's right.

15:57 Right?

15:58 Yep.

15:58 And, you know, I use both PyCharm and VS Code and they both, especially PyCharm, has sort of a warning that says, this is untrusted.

16:07 It's a third-party thing.

16:08 Are you sure you want it?

16:10 Right.

16:10 Just saying no.

16:11 That's a pretty light warning.

16:12 Yeah.

16:13 And also, they're not the same, right?

16:15 Is it installed by a million people used every day?

16:18 Or are you the fourth person to use it?

16:21 And it hasn't, you know, hasn't had the experience of people going, why is it opening a network socket?

16:27 What's it doing?

16:28 You know, something like that.

16:31 Yeah.

16:32 That's another entry point you got to be careful about.

16:36 All right.

16:37 Well, I cut you off.

16:38 We're only in like square one of maybe nine.

16:40 Yeah.

16:41 Square one, source code.

16:42 And then there's the build phase.

16:45 That's where you take the code.

16:48 You take the commits that have gone into source control.

16:50 And you build something with it, right?

16:54 This usually happens in, you know, your CI/CD systems, the GitHubs and GitLabs of the world.

17:02 And it's at that point where, you know, your third party dependencies get included and wrapped up into your artifacts, right?

17:13 Which brings us to the third stage of the software supply chain, which is the package and deploy phase.

17:20 That's where you're creating your artifacts and making them available to the world to use.

17:27 Could be anything.

17:28 Could be a wheel for a library that other parts of your company use to build software.

17:34 Yep.

17:34 Could be some app you ship.

17:37 It could actually be a website, an API.

17:39 Who knows, right?

17:40 Yeah.

17:41 Docker container.

17:42 Docker container.

17:43 Yeah.

17:44 Yeah, exactly.

17:44 And then by the time you get to that, you know, the end of the supply chain and, you know, the products or the packaged product that people are going to see and use and work with, you know,

17:57 you've baked in so many elements at that point, you know, from your third party dependencies to, you know, any other external resources that are getting called.

18:11 So there's lots of points along the way that it's possible to.

18:17 Yeah.

18:18 One of the things that can be sneaky is, you know, it doesn't happen that often in Python, but you're shipping like a Windows or a Mac app.

18:26 There's a digital signature proof of we're going to sign this with our trusted certificate.

18:32 So it doesn't even give you any warnings.

18:34 Like, look, this is it's signed by the company.

18:36 It is trusted.

18:37 Here you go.

18:38 Pick it.

18:39 Right.

18:39 And somewhere upstream from that, there's an issue like with packages or other things.

18:45 Well, that issue is now that that problem is signed and verified as well.

18:50 Yeah.

18:51 Yeah.

18:51 You know, so you mentioned code signing. The research team at our company,

18:57 I mean, they're amazing, amazing group there.

19:00 They're always finding new and novel attacks.

19:04 And one they found just this past week involves something kind of cool where the attacker had bundled up a valid Microsoft binary.

19:15 It had been signed by Microsoft.

19:17 But they bundled it with a DLL that was malicious.

19:22 It was named something to be expected.

19:25 Right.

19:26 So when you run the executable, the binary, you know, you could see that there's this Microsoft-signed application looking for permissions, looking to continue.

19:38 And you're like, oh, yeah, great.

19:39 Signed by Microsoft.

19:39 No problem.

19:40 But then it uses this technique called like DLL search order hijacking.

19:45 Okay.

19:47 That technique.

19:47 Right.

19:47 So if you have a DLL that's being called by the application more locally than not, that's what it's called.

19:56 So it's looking for something in like.

19:57 Yeah.

19:58 It'll look for the name of the DLL in the same directory first, basically, is what's happening.

20:04 Right.

20:05 They had shipped their bad DLL with a good binary.

20:09 So you pick something in System32 that's got like a real common name like vcruntime-whatever.dll or, you know, some of the standard ones.

20:20 But then you completely reprogram it.

20:22 Yeah.

20:23 And stick it in there with that app.

20:24 Or maybe not completely because you need the app to not crash.

20:26 But you give it some extra boost when it does something.

20:30 Right.

20:30 Yeah.

20:31 Yeah.

20:31 In this case, they had just copied all the files needed for execution into a new directory, including the known good binary, the known bad DLL.

20:40 And then, you know, it had everything it needed in that directory to run.

20:44 And it looked like it was legitimate.

20:45 Right.

20:46 Because a lot of the OS dependent, a lot of these OS checks are on the executable, the system libraries that they use.

20:53 Right?

20:53 Right.

20:54 Right.

20:55 You'll see like this, this executable is downloaded from the internet.

20:57 Are you sure you want to run it?

20:59 Like that doesn't say this executable, what you trust is maybe possibly using a library that you downloaded.

21:05 Like it doesn't say that.

21:06 Right.

21:06 Yeah.

21:07 Cause we could never get work done if there was that level of checking all over the place.

21:11 This is what updated somewhere.

21:12 This portion of Talk Python To Me is brought to you by Mailtrap.

21:18 We're going to keep this super short.

21:20 So please pay attention or you'll miss it.

21:22 Mailtrap is an email delivery platform that developers love.

21:26 An email sending solution with industry-best analytics, SMTP, and email APIs and SDKs for major programming languages with 24/7 human support.

21:36 What makes them unique is their email sandbox.

21:38 Use email sandbox to inspect and debug emails in staging, dev, and QA environments before sending them to recipients in production.

21:46 Try Mailtrap for free at talkpython.fm/mailtrap.

21:53 That's kind of the space that we're talking about, right?

21:56 We've got editors.

21:57 We've got libraries that you use.

22:00 CI/CD pipelines.

22:01 Containers are super interesting as well.

22:04 And all the tools to go with those.

22:07 So let's talk through some of the posts that you've written and also just selected about some of these things.

22:14 And maybe starting at the front of that list there with lock files.

22:17 Yeah.

22:18 Okay.

22:18 So, yes, I wrote a blog post.

22:22 I guess it's looking at the date on your screen.

22:24 It looks like it was over a year ago now.

22:26 And probably seems like yesterday, but no.

22:28 Yeah, that's right.

22:29 That's right.

22:31 So I'm sure the landscape has changed since then a bit.

22:34 And maybe there's some new players out there.

22:36 But, yeah.

22:37 Yeah.

22:38 I think one thing you can do as a developer, a big one I would recommend, is use lock files for your dependencies, right?

22:47 And, you know, what's a lock file?

22:51 Well, it's the fully resolved set of dependencies that are used by your application, your package.

23:01 And, you know, if nothing else, like, you should know what's going into your code, right?

23:09 Like, what?

23:10 Right.

23:10 Well, one of the ways this helps.

23:12 Yeah, exactly.

23:13 That's a really, that's a bit of a challenge, right?

23:15 And I think I'll admit when I first got into Python, I didn't do this that well.

23:20 And, you know, to me, it felt like probably the biggest issue I might run into is instability in my app, right?

23:27 Like, for example, if I don't pin a dependency, some new thing comes out, I reinstall it on a new computer.

23:33 Maybe it gets an upgraded version.

23:35 And there's some library that doesn't work, right?

23:37 I mean, there's been certainly popular libraries that just said we're having a major version change and we're fixing the mistakes we made 10 years ago.

23:44 And these three functions are changing or whatever, right?

23:47 That would break it.

23:48 But it could also be there's now a malicious version of library X and that's version two.

23:55 But if you pinned it on version one, even though it's bad, you're still not getting the bad one, at least for a while, right?

24:01 Absolutely.

24:02 Yes.

24:02 So I think I got to look it up.

24:05 I always forget.

24:06 PEP 665.

24:08 Okay.

24:09 Yeah.

24:10 PEP 665.

24:11 665.

24:12 It's a rejected PEP.

24:14 Unfortunately, but it was written by Brett Cannon, some others.

24:18 I know you've had Brett on the show a number of times.

24:20 I love the stuff he does.

24:22 Yeah, he does excellent work.

24:24 And it's kind of a shame this was rejected, but this PEP tried to create a standard lock file format for Python.

24:35 And, you know, if you look into the PEP a little bit, you know, there's some motivation about like why you'd want to do this and, you know, four big reasons.

24:43 And the third one is the one I really key on, which is that, you know, lock files allow for reproducibility.

24:48 And reproducibility is just more secure.

24:51 Because when, you know, I'm quoting here from the PEP, it says, when you control exactly what files are installed, you can make sure no malicious actor is attempting to slip nefarious code into your application.

25:01 I.e., some supply chain attacks.

25:03 By using a lock file, which always leads to reproducible installs, we can avoid certain risks entirely.

25:09 And, I mean, that's the name of the game.

25:12 That's like, that's what our company focuses on, which is avoiding those risks by ensuring you know which dependencies you're using and you're knowing that those dependencies are benign or good, you know, doing no harm.

25:27 Even if there's something that happens, usually it's going to happen to a popular library because you're using it, hence probably other people are using it, other than typosquatting, which we can talk about.

25:40 But, you know, if you pin your dependencies, chances are it's, these things only stick around for a little while.

25:47 It's not like, oh, they discovered it had been there for eight months.

25:50 It's like, oh my gosh, we heard about it.

25:52 A few people got it and then we got rid of it, right?

25:55 Yes.

25:55 The folks at PyPI are pretty excellent.

25:57 So it's, to some degree, a timing issue as well.

26:00 Yes.

26:00 Yeah.

26:01 Vulnerabilities are different, right?

26:03 Where that's what a lot of people focus on.

26:06 A lot of the tooling exists to, you know, discover vulnerabilities in your dependencies, which is good to know about those.

26:13 But those exist for a long time, right?

26:16 You have CVEs for known vulnerabilities and they end up in these databases and they're there for years.

26:21 And if you're using old dependencies or maybe transitive dependencies or using old ones and you're stuck on it, then you're going to be exposed to those vulnerabilities.

26:31 But what's different about that?

26:33 Examples of those include the WebP library not too long ago, right?

26:38 That was baked into Python and then also OpenSSL, right?

26:43 So people discovered issues in those.

26:45 Those are baked into different aspects of Python or some of the libraries.

26:48 And it's like, well, all of a sudden there's this fire drill.

26:51 Yes.

26:52 Which is different than somebody going, I'm going to sneak a thing into the library.

26:55 Right.

26:56 And then it is a timing matter.

26:58 So malicious dependencies, that's a whole other story.

27:01 Because if a malicious package is discovered, there's not a CVE created for it.

27:06 The package is just taken off of the registry.

27:09 You know, you report it to the good people at PyPI and, you know, they'll review the submission and take it down.

27:17 I've done a few of those myself and they're really fast.

27:21 But there's still a window of time where that malicious package, that malicious dependency is up and available.

27:28 And that's, you know, often all that's needed.

27:32 Yeah, exactly.

27:33 I do think having a pinned dependency there is worthwhile.

27:36 Because if you make a commit, your CI runs, et cetera, et cetera, right?

27:39 Like the chances that you just bump the version to this malicious thing is pretty low.

27:44 Yeah, exactly.

27:45 So, yeah.

27:46 Yeah.

27:47 And having version ranges is not enough.

27:50 You know, you need to have explicit versions, you know.
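
The difference between a version range and an explicit version can be checked mechanically. The following is a minimal sketch, assuming a simplified pattern rather than a full PEP 508 parser; the function name `unpinned_requirements` and the sample entries in the usage note are made up for illustration.

```python
import re

# Simplified pattern for "name==1.2.3", optionally with extras and an
# environment marker. Illustrative only; it is not a full PEP 508 parser.
PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?\s*==\s*[^\s;*]+(\s*;.*)?$"
)


def unpinned_requirements(lines):
    """Return requirement lines that are not pinned to one exact version."""
    loose = []
    for raw in lines:
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#") or line.startswith("-"):
            continue  # skip blanks, comments, and pip options such as -r
        if not PINNED.match(line):
            loose.append(line)
    return loose
```

Given entries such as `requests==2.31.0`, `httpx>=0.25`, and `pydantic`, only the first counts as pinned; the range and the bare name are reported back as loose.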

27:53 Let's talk more about these lock files then, right?

27:55 So there's actually a bunch of choices these days.

27:59 You know, Brett's PEP tried to make it less of a choice.

28:03 Say, well, it doesn't matter if you use hatch or pip or poetry or whatever.

28:07 The outcome is the same.

28:09 And for reasons that I haven't learned enough about, I don't know why that didn't work.

28:14 But let's talk about what's out there now.

28:16 Because there's a couple options at this point.

28:18 Sure.

28:19 I think the, yeah.

28:21 So most Python developers are going to be most familiar with pip, right?

28:25 That's the standard.

28:29 And pip has requirements files.

28:31 And, you know, they're unique in the lock file world because they can be named anything, right?

28:40 Most other lock files have a defined name.

28:43 We were talking about Rust earlier.

28:44 You know, they're the gold standard for a lot of this stuff.

28:47 And, you know, they're very clear.

28:49 They have Cargo.lock.

28:50 That's their lock file.

28:51 You can't name it anything else.

28:53 Its contents are well defined.

28:55 It is what it is.

28:57 But in Python with pip, I mean, you could name it whatever you want.

29:01 You know, dev-requirements.txt.

29:03 You could name it Cargo.lock.

29:05 But it can contain Python dependencies in it.

29:08 Surprise.

29:09 I'm not Rust.

29:11 Basically, you can just put more or less arbitrary commands that are sent to pip in a text file, right?

29:18 Yes.

29:18 Which is more or less what it is.

29:19 Yeah.

29:19 Yep.

29:20 Yep.

29:21 Any command line option you can feed to pip, you can put in a requirements file.

29:25 It's cool because you can have an import by saying -r and some other file.

29:29 Yes.

29:30 Yes.

29:31 But it's also not super structured.

29:33 You can get a hierarchy that way.

29:33 Mm-hmm.

29:34 Yeah.
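
The hierarchy you get from `-r` includes can be flattened with a few lines of standard-library Python. This is a rough sketch, not how pip itself parses these files: it only follows `-r <file>` and `--requirement <file>` lines written with a space, resolves them relative to the including file, and returns every other non-comment line untouched. The function name `collect_requirements` is an assumption for illustration.

```python
from pathlib import Path


def collect_requirements(path, _seen=None):
    """Flatten a requirements file, following -r / --requirement includes."""
    seen = _seen if _seen is not None else set()
    path = Path(path).resolve()
    if path in seen:  # guard against include cycles
        return []
    seen.add(path)
    entries = []
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        for prefix in ("-r ", "--requirement "):
            if line.startswith(prefix):
                # Included paths are taken relative to the including file.
                included = path.parent / line[len(prefix):].strip()
                entries.extend(collect_requirements(included, seen))
                break
        else:
            entries.append(line)
    return entries
```

For example, a `dev.txt` that contains `-r base.txt` followed by `pytest` would come back as the contents of `base.txt` and then `pytest`.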

29:34 So there are some tools available to turn those, like, loose requirements files, the pip requirements files, into strict lock files, right?

29:47 Where every entry is pinned to a specific version.

29:51 And pip itself can do it with the pip freeze command.

29:54 So that's the one most people know about.

29:57 But that one's kind of not so great because it only freezes the packages for the environment that you ran pip freeze in.

30:05 You know?

30:06 And maybe you're trying to publish your lock file for users of a different platform or system.

30:13 The other thing that I don't like about it is you want to put just the things you actually use into your requirements file.

30:19 Like, I'm using HTTPX and Pydantic.

30:21 That's it.

30:22 But what it really installs when you run that is the transitive closure of all those things.

30:27 Yes.

30:28 Which is fine.

30:28 But you're not necessarily expressing that with just your requirements.txt, right?

30:35 Right.

30:36 Yeah.

30:37 Yeah.
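
The "transitive closure" mentioned here is just everything reachable from the packages you asked for. A minimal sketch, assuming a hand-written dependency graph: the package names in `GRAPH` are invented for illustration and do not describe any real project's dependencies.

```python
from collections import deque


def transitive_closure(direct, graph):
    """Return every package reachable from the direct dependencies."""
    resolved = set()
    queue = deque(direct)
    while queue:
        name = queue.popleft()
        if name in resolved:
            continue
        resolved.add(name)
        queue.extend(graph.get(name, ()))
    return resolved


# Hypothetical graph: each key lists the packages it depends on.
GRAPH = {
    "web-client": ["tls-certs", "http-core"],
    "http-core": ["tls-certs", "event-loop"],
    "data-models": ["type-helpers"],
}

if __name__ == "__main__":
    # Two direct dependencies expand into a larger installed set.
    print(sorted(transitive_closure(["web-client", "data-models"], GRAPH)))
```

A loose requirements file names only the two starting points; a lock file records the whole resolved set.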

30:37 Your two packages could balloon to, you know, 100 dependencies.

30:41 And that's not uncommon.

30:43 It's not even that bad.

30:44 Like, in the JavaScript ecosystem, you know, the same handful of top-level dependencies could have two orders of magnitude explosion where you end up with thousands.

30:54 There's a really...

30:56 Oh, gosh.

30:56 I can't find out.

30:57 You know what?

30:57 I think it's on...

30:59 I think I put it on Python Bytes.

31:00 But there's a really funny...

31:01 I want to be able to pull this up for people so they can find it.

31:03 There's a funny, funny thing that somebody did.

31:07 Well, for some definition of funny.

31:11 They put...

31:12 Somebody created an NPM package called Everything.

31:16 Yes.

31:17 And there's an article called Everything Becomes Too Much, the NPM Package Chaos of 2024.

31:23 Yeah.

31:23 An NPM user named PatrickJS launched a troll campaign with a package called Everything, which depends on every package in NPM.

31:31 Yeah.

31:32 Yeah.

31:32 And that's...

31:33 I think it's the...

31:34 NPM's the largest package registry out there.

31:37 So it's...

31:38 I mean, it's already massive.

31:40 I remember your early episodes, you would recount how many packages were on PyPI.

31:46 And then we got to that...

31:47 I don't even know.

31:47 Are we past half a million?

31:49 ...6-figure number?

31:49 Well, yeah.

31:50 I remember it was a big deal.

31:51 It got up to 100,000.

31:52 And now it's probably, what?

31:53 400,000?

31:55 500,000?

31:55 Over 500,000.

31:57 509 by rounding.

31:58 Yeah.

31:59 Half a million.

32:00 Congratulations, world.

32:01 Amazing.

32:03 Yeah.

32:04 Yeah.

32:04 I just added two new ones last week.

32:06 So I guess I've made a huge difference in that number.

32:08 Nice.

32:10 Yeah.

32:11 So basically, pip is awesome and it does a bunch of great stuff.

32:14 And one of the things I really like about working with pip is I don't need to teach people anything if they want to work with my project.

32:20 Right.

32:21 I don't need to teach them like, oh, I know you love poetry, but I'm using a combination of the hatch build back end with PDM.

32:27 You're like, what?

32:28 I don't even know what those are.

32:29 Right?

32:29 Like, there's a lot of, like, ways in which you work that are brought in with a lot of these tools here.

32:36 So pip is kind of like, you know, it just kind of works, right?

32:39 Yes.

32:40 But having this transitive closure managed is not part of what it does, but it's super important because if I need to upgrade something, I can't just change my version number in my requirements.

32:50 Because that doesn't affect its dependency possibly, right?

32:53 Like, it depends what it's said.

32:55 So I'm a huge fan of pip-tools.

32:57 This is actually what I do most of the time.

32:59 Yes.

32:59 pip-tools is another one.

33:01 You can, it's great.

33:04 I think it has this pip-compile command that will take as an input, I think, just about any Python manifest type that's out there.

33:15 So you can do setup.py, requirements.txt.

33:20 I'm forgetting the other ones.

33:24 The pipenv lock, maybe.

33:27 setup.cfg, pyproject.toml.

33:31 It just recognizes all the different ways people could express their loose requirements, you know, the manifest files.

33:41 Yeah.

33:41 So, yeah.

33:42 Yeah.

33:42 I really like it.

33:43 And you can say pip-compile --upgrade and it'll look at all the dependencies and upgrade them all as high as they can go.

33:50 But what's nice about that is you'll be working for a while.

33:52 Then you choose, like, well, let me just do a refresh on the dependencies right now and repin them and see how that works.

33:58 And then just carry on with your business for a while, right?

34:01 And it'll manage that transitive closure as well with, like, actually a really nice lock file where it described, like, these are all the things in the lock file.

34:09 And the reason that, you know, for example, in your blog post, you say there's certifi of this version.

34:13 And it's there because you asked for it and because requests needs it.

34:16 You know, if you're like, why is this in my virtual environment?

34:19 Why do I have this weird thing that I don't know?

34:21 Like, it'll tell you, here's why it's there.

34:23 Yeah.

34:24 Yeah.
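
The "here's why it's there" annotations are the `# via` comments that pip-compile writes under each pinned entry. The following sketch reads them back into a dictionary. It assumes the common layout of one pinned line followed by indented `# via` comments, ignores hashes and other options, and uses made-up version numbers in `SAMPLE`; the function name `parse_via` is hypothetical.

```python
def parse_via(lock_text):
    """Map each pinned package to the list of things that required it."""
    reasons = {}
    current = None
    for raw in lock_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("-"):
            continue  # skip blanks and pip options such as --index-url
        if not line.startswith("#"):
            current = line.split("==", 1)[0]
            reasons[current] = []
        elif current is not None:
            note = line.lstrip("#").strip()
            if note == "via" or note.startswith("via "):
                note = note[len("via"):].strip()
            if note:
                reasons[current].append(note)
    return reasons


# Illustrative pip-compile style text; the versions are placeholders.
SAMPLE = """\
# header comment written by the tool
certifi==1.0
    # via requests
requests==2.0
    # via -r requirements.in
idna==3.0
    # via
    #   anyio
    #   requests
"""

if __name__ == "__main__":
    for package, why in parse_via(SAMPLE).items():
        print(package, "<-", ", ".join(why))
```

An entry marked `-r requirements.in` is one you asked for directly; anything else arrived through the package named in its annotation.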

34:25 One of the downsides, though, I think pip-tools has this issue.

34:29 I know pip does, is that in determining that transitive dependency resolution, it is very possible.

34:39 In fact, it usually happens that you have arbitrary code execution on your system, right?

34:43 Like, if you start with the two top-level dependencies, like you mentioned, and it lists dependencies, well, then it'll pull those in and it acquires the metadata.

34:51 From the wheel, if that exists.

34:54 But if it doesn't, it'll build the package just to get the metadata file, just to figure out which dependencies that needs.

35:00 Are you saying I should set up a Docker container to execute this?

35:05 I mean, that's, yeah, that's kind of what's happening.

35:07 Maybe I should, yeah.

35:08 Maybe I should.

35:09 And, you know, yeah, running in a sandbox is another option, right?

35:15 Where that's what my company, Phylum, that's one of the solutions we offer.

35:19 You know, we have extensions for our CLI where you can wrap pip by just calling phylum pip.

35:27 And then everything runs in a sandbox.

35:29 So that's another solution.

35:31 Yeah.

35:32 Yeah.

35:32 So good.

35:33 Because, I mean, pip is a funny one because they even have a command line option called dry run, --dry-run, which you would think, oh, nothing's going to happen on my system.

35:43 It's just separate.

35:43 Running code from strangers on the internet.

35:45 But it does.

35:46 Yes.

35:47 Dry run, even using dry run for pip install and pip download commands will or has the possibility of downloading and running arbitrary code from strangers on the internet.

35:59 If we had, oh, wheels came along far after pip, right?

36:03 And we've got the source distributions and setup.py and all that kind of stuff.

36:07 And so if wheels existed from day one, it very well would be the case that this is not a problem, right?

36:13 But, you know, what is pip supposed to do?

36:15 Like, it has to evaluate this dynamic thing to figure out what it wants.

36:18 Yes.

36:19 Yes.

36:20 Yeah.

36:20 Wheels are great because, you know, they have a metadata file in there that clearly lays out what the dependencies are.

36:29 And there's no arbitrary code running when you install a wheel.

36:32 It's just extracting and copying, you know?

36:35 Yeah.

36:35 A wheel is just a zip file.

36:37 You extract that zip file and then copy the contents to various locations.
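
Because a wheel is a zip archive with a metadata file inside, its declared dependencies can be read without executing anything from the package. A minimal sketch using only the standard library: `make_demo_wheel` builds a tiny in-memory stand-in for a wheel (the package `demo-pkg` is hypothetical), and `wheel_requirements` reads the `Requires-Dist` headers from its `.dist-info/METADATA` file.

```python
import io
import zipfile
from email.parser import Parser


def wheel_requirements(wheel_file):
    """Read Requires-Dist entries from a wheel without running any code."""
    with zipfile.ZipFile(wheel_file) as zf:
        metadata_name = next(
            n for n in zf.namelist() if n.endswith(".dist-info/METADATA")
        )
        text = zf.read(metadata_name).decode("utf-8")
    # METADATA uses an email-header style format, so the email parser works.
    return Parser().parsestr(text).get_all("Requires-Dist") or []


def make_demo_wheel():
    """Build a tiny in-memory stand-in for a wheel (hypothetical package)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "demo_pkg-1.0.dist-info/METADATA",
            "Metadata-Version: 2.1\nName: demo-pkg\nVersion: 1.0\n"
            "Requires-Dist: requests\nRequires-Dist: idna>=3\n",
        )
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(wheel_requirements(make_demo_wheel()))
```

A source distribution offers no such guarantee, since its metadata may only exist after a build step has run.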

36:41 But, yes, as you said, because we've had source distributions, tarballs, and then even eggs before that,

36:49 and we're probably never going to fully get rid of those.

36:53 It just takes one.

36:55 One dependency anywhere in your chain that is only distributed as a source distribution.

37:00 And now you're downloading and building a package just to get metadata to continue.

37:07 Yeah.

37:07 And maybe you didn't actually choose it, right?

37:09 It's the dependency of a dependency of a dependency.

37:12 Absolutely.
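
One way to spot the "it just takes one" case is to look at which files a release actually offers. This is a simplified sketch that assumes you already have the list of file names for a release; the file names in the usage note are invented, and the function name `only_source_distributions` is hypothetical.

```python
def only_source_distributions(filenames):
    """True when a release offers no wheel, so installing means building."""
    return bool(filenames) and not any(f.endswith(".whl") for f in filenames)
```

A release listing only `demo_pkg-1.0.tar.gz` would be flagged, while one that also ships `demo_pkg-1.0-py3-none-any.whl` would not. pip also has an `--only-binary :all:` option that refuses source distributions outright, at the cost of failing on packages that publish no wheel.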

37:13 Yeah.

37:14 Yeah.

37:14 That's, yeah.

37:15 Yeah.

37:15 You know, people often respond to some of the findings our company has where we'll, you know,

37:23 we'll post these malicious packages with all sorts of crazy names.

37:26 And people will respond to say, like, you know, why would I install that?

37:30 Like, why would I ever install this, you know, random package that no one's heard of?

37:36 It's like, well, you wouldn't.

37:37 But it could be included in, you know, the transitive dependencies, right?

37:43 If it gets, if it gets added to a slightly more legitimate package or, you know, worked up the chain that way, then, then yes, eventually, you know, you'll be running it unknowingly.

37:54 Yeah.

37:55 I think there's two important things we should talk about this before we move on, because there are some interesting ways in which you might unknow it.

38:02 You might even try to do the right thing and you might actually shoot yourself in the foot by doing so.

38:07 So number one, these like super strict lock files are awesome when you're building an application.

38:14 I want to ship Talk Python training out.

38:16 It's got its strict APIs.

38:17 It runs on this version.

38:19 It uses that version of Pydantic, that version of Beanie and whatever.

38:22 Yeah.

38:22 I want that to be fixed, fixed, zero flexibility until I decide through maybe a pip-compile --upgrade or whatever.

38:29 I want a new one.

38:30 However, if I was building a library that someone else was using, I would do them many headaches and a disservice to say, I depend on Pydantic 2.7.0.

38:41 You're like, well, my other library needs Pydantic 2.8.

38:45 Right.

38:46 And I can't use it and your library together.

38:48 Right.

38:49 So you need the, it's, it's a different story when you're building a library that others are going to consume than it is when you're building an application.

38:56 And there was some, some disagreement, I guess, about the recommendation of pipenv for a while.

39:01 And it's because I believe that pipenv is really focused on the application side.

39:05 And it, I don't think it was made super clear that maybe it doesn't make as much sense for libraries.

39:10 Right.

39:10 So you want to speak to that a little?

39:11 Yeah.

39:12 Yeah.

39:12 I'm, I'm an advocate for lock files for everyone.

39:16 Right.

39:16 Applications for sure, but also libraries and their developers.

39:20 Right.

39:21 Cause you know, if when you, when you, when you distribute a library, sure.

39:27 You know, loose dependencies is, is probably the way to go there.

39:31 But library developers, people who want to contribute to your projects, the developers themselves, maybe you work on a team.

39:38 Having, having a lock file alongside your library is still going to be useful.

39:45 Right.

39:46 Like.

39:46 Yeah.

39:46 Cause that way you can say everyone, if somebody makes a change or they report a bug or whatever.

39:50 Yeah.

39:50 They're not bringing in a change from a different version of a dependency or like maybe something changed.

39:55 Right.

39:55 Yes.

39:56 Yes.

39:56 Yeah.

39:57 and then, and it, plus it still allows you to, start from a known good spot.

40:03 And then, maybe, maybe if you, if you know, you want to get the latest, then you can do it in a controlled environment, you know, like a sandbox or maybe a, on CI, you know, in a, in a throwaway runner that has no access to any, any secrets.

40:20 Or, sensitive.

40:22 sensitive.

40:23 That's interesting.

40:24 I hadn't really thought about having a specific requirements lock file type of thing for the libraries that I've been working on for the developers, right?

40:34 For people who want to contribute.

40:35 because it's just been like a loose requirement so that people that built against it aren't pinned into some very specific thing.

40:42 But yeah, that makes a lot of sense.

40:43 I think.

40:43 Yeah.

40:43 There's a, there's a link in that blog post.

40:46 It's kind of dated now, but it's from the folks who built yarn, you know, JavaScript ecosystem.

40:51 But, they had, they say it a lot more eloquently than I can.

40:55 yeah, that's the one, lock files should be committed on all projects.

41:00 Yeah.

41:00 It's, I mean, it's a bit old now, but, but they, they go down the lists and spell it out a lot more clearly than me.

41:06 And that's why, libraries even can benefit from, from publishing a lock file.

41:12 Yeah.

41:12 People can check that out.

41:13 That's cool.

41:14 Yeah.

41:14 And yarn, that's the JavaScript package manager.

41:16 So in JavaScript years, like a hundred years or something that's been a couple of years.

41:20 That's right.

41:20 Yeah.

41:21 You got dog years.

41:22 You got JavaScript years.

41:23 JavaScript years just tick by like second, the second hand.

41:26 Yeah.

41:27 All right.

41:27 cool.

41:28 So I see we're making great progress through our list of things to talk about here.

41:32 I've gone through three and have like 15 left.

41:35 We'll have plenty of time.

41:36 so yeah, let's see.

41:41 So another one, another PEP, I think we're talking about here is 517, a build system independent

41:48 format for source trees.

41:49 I have no idea what this is.

41:51 What is this?

41:51 Yeah.

41:51 PEP 517 and 518 kind of go together.

41:54 This is, this was like the transition away from setup.py towards pyproject.toml.

42:00 518 is the one that specifies pyproject.toml.

42:04 kind of things that go in it.

42:07 And then five, 517 is all about, build systems and build backends.

42:12 so, so like in your pyproject.toml and your, in your, in your, build system

42:19 key, you know, you'll often see things like, poetry core or flit or hatchling or,

42:26 these kinds of things.

42:26 And, and so it's five, PEP 517 is, is specifying what it means to be one of those build backends.

42:33 it's really just defining two mandatory hooks.

42:37 What does it mean to build_wheel and build_sdist?

42:40 there's three optional hooks as well.

42:43 And I think there's even another PEP that followed on from this that talks about, building

42:47 editable, packages or, or, right.

42:51 The dash, the dash E equivalents.

42:54 Yeah.

42:54 Yeah, exactly.

42:56 but really it just boils down to, defining a way to build a wheel and build

43:01 a source distribution.

43:02 Yeah.
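
The two mandatory hooks and three optional hooks are plain functions that a backend module exposes, using the names given in PEP 517. The sketch below checks whether an object provides the mandatory ones. The toy backend is a stand-in assembled for illustration; its hooks only raise `NotImplementedError`, and `missing_mandatory_hooks` is a hypothetical helper, not part of any packaging tool.

```python
import types

# Hook names as defined in PEP 517.
MANDATORY_HOOKS = ("build_wheel", "build_sdist")
OPTIONAL_HOOKS = (
    "get_requires_for_build_wheel",
    "prepare_metadata_for_build_wheel",
    "get_requires_for_build_sdist",
)


def missing_mandatory_hooks(backend):
    """List the mandatory PEP 517 hooks that a backend does not provide."""
    return [h for h in MANDATORY_HOOKS if not callable(getattr(backend, h, None))]


def build_wheel(wheel_directory, config_settings=None, metadata_directory=None):
    # A real backend writes a .whl into wheel_directory and returns its name.
    raise NotImplementedError("toy backend for illustration only")


def build_sdist(sdist_directory, config_settings=None):
    # A real backend writes a .tar.gz into sdist_directory and returns its name.
    raise NotImplementedError("toy backend for illustration only")


# Stand-in for a backend module such as the one named in pyproject.toml.
toy_backend = types.SimpleNamespace(build_wheel=build_wheel, build_sdist=build_sdist)

if __name__ == "__main__":
    print(missing_mandatory_hooks(toy_backend))
```

Any tool that can call these functions can build any project whose backend provides them, which is what lets front ends and backends be mixed freely.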

43:03 And this is part of what opened up all the different choices we now have for package management and

43:09 things like that.

43:10 Right.

43:10 Cause now there's a common way they can all work together a little bit like WSGI.

43:14 Yes.

43:15 Yeah.

43:16 I've been using hatchling for my build backend recently and it's been working real nicely.

43:20 Okay.

43:20 Yeah.

43:21 I was just looking at hatchling the other day and they've got, yeah, yeah.

43:27 They, they're one of the, they're one of the build backends that offers, build hooks,

43:31 which, you know, so prior to pyproject.toml and wheels

43:41 and bdist wheels, you go back to the source distributions and your setup.py files where

43:46 it's just Python code.

43:47 You can be, you can be doing anything in your setup.py file, which runs when you

43:53 install the package.

43:53 well now we're starting to see, you know, methods to do the same thing in these, in these

43:59 more modern packaging or build backends.

44:01 So like hatch has their, build hooks, build system hooks where you can, you can, you

44:07 can, you can, point it to think, yeah, just Python code and have it, have it run as

44:13 part of the build.

44:14 Yeah.

44:15 At least it only runs at build time, not install time.

44:17 right.

44:19 Yeah.

44:20 I'm looking at the documentation now.

44:22 I, I, I, yeah, this is still new to me, but there might be hooks for, for install as well.

44:29 Okay.

44:29 While you're thinking about it, one of the things I got a couple of questions I want to highlight

44:34 from the audience here, but also one of the, one of the things that I think maybe was considered,

44:41 I have no awareness of this, but if it wasn't would be excellent is what if the people at

44:46 PyPI just pre-computed all that metadata from, at least for the common platforms that you would

44:53 get that pip needs to download, run setup.py, and then throw it away just to get that

44:58 data.

44:59 Like for Mac, Windows, and Linux, you know, if it would just go, okay, we're just going

45:03 to like, as you upload it, it would just kick off a job that does that on those three platforms

45:08 and put it in a JSON blob.

45:09 Yeah.

45:10 It seems like that would be worthwhile.

45:11 I, I, I'm fairly certain there's discussions already around that type of a solution and

45:17 maybe even a PEP for proposal, for it, but, yeah, getting away from having to build

45:23 a package just to get metadata.

45:24 yeah.

45:25 You got packages that are downloaded billions of times with a B it's insane.

45:32 And if somebody could do that three times instead of a billion times, it would make

45:37 it work faster and it would also make it safe.

45:39 Right.

45:39 I think it'd be great.

45:40 Yeah.

45:40 All right.

45:41 A couple of questions here.

45:42 this one.

45:45 So Tony in the audience says pip-compile is great for finding your transitive dependencies.

45:50 One interesting thing that they've done is package up code with pants build, which supports

45:56 lock files just to look through what code gets packaged up.

45:59 Is this anything you've explored?

46:02 I've heard of pants.

46:03 I haven't looked into it myself yet.

46:05 Mm-hmm.

46:06 okay.

46:07 Yeah.

46:07 So just use it like, go, okay, you're going to have to build this thing and give me a little

46:11 manifest and whatnot.

46:13 And then we can just look at that.

46:14 That's cool.

46:14 Yeah.

46:14 And then Tamir says, do you have a solution for taking already locked dependencies with you

46:19 when you start a new app?

46:21 I'm guessing, you know, maybe, yeah, I don't know.

46:24 I guess maybe you've already got a project you're working on.

46:26 You want to say like, I want this project to use that.

46:28 Probably you could just copy the lock file.

46:30 Right.

46:31 Yeah.

46:32 Yeah.

46:32 If you, I mean, if you really, I mean, really you're going to, if you start a new project,

46:36 or new application, you're going to, you're going to have new, manifest file, you know,

46:42 pyproject.toml, maybe you have the same dependencies, the top level dependencies or not,

46:46 but the fully resolved set of dependencies that makes up your lock file, that, that, that,

46:51 that can very easily be different.

46:52 So I'm, I'm not exactly sure how you'd just port over one to another.

46:58 One more bit from Tony.

46:59 This is, something that I now remember from pants is this, if it just looks through

47:05 your code and if you use the import statement, regardless of whether you've put it in your

47:09 requirements files, it'll figure out what your requirements file should have been.

47:13 If you were a bad developer, basically.

47:16 That's kind of cool.

47:18 Just to see what it uses.

47:19 Yeah.

47:19 Nice.

47:20 All right.

47:20 onto the next thing.

47:22 PEP 518, specifying minimum build system requirements for Python projects.

47:28 Yeah.

47:28 I'm guessing related.

47:29 This is pyproject.toml.

47:31 This is the, this is the PEP for that.

47:33 Okay.

47:34 There's not much to it other than to say that they've settled on that name, rejecting a bunch

47:39 of other possibilities.

47:40 And then they've got the, you know, the few entries that are required, like for your,

47:45 defining your build system.

47:46 Excellent.

47:46 Yeah.

47:47 Yeah.

47:47 You don't have to have a pyproject.toml for Python, but.

47:50 No.

47:51 Yeah.

47:51 If you're building a Python library and you don't want to use setup.py, then you're much

47:56 better off having a pyproject.toml, right?

47:58 Yes.

47:59 Yeah.

48:00 Yeah.

48:00 It's more in the library side that it, I mean, it's not that you can't use it on an application,

48:04 but it's more required on the library side.

48:07 Yeah.

48:07 That's the thing.

48:09 All right.

48:09 So let's talk about some of the ways in which your packages might go wrong.

48:14 We've already talked about typosquatting and we also talked about everything that's different.

48:18 Yeah.

48:19 But yeah, typosquatting is, it is tricky.

48:23 I think it's pretty well understood at this, this point, but maybe just tell people real

48:27 quick to cover that base, you know?

48:29 Sure.

48:29 Typosquatting is, is, you know, publishing a package with a name that's similar, but not

48:36 the same as an existing known good package.

48:39 Right.

48:40 So like, instead of requests, maybe you, you get request without the S or, you know, one

48:48 that gets me cause I, cause I make the typo all the time was, is the cryptography package.

48:52 Like, like if I, you know, if I put you on the spot, would you know how to spell cryptography?

48:57 I always get the first couple of letters, you know, jumbled up a bit and, and there have

49:02 been malicious packages published and then taken down with, with the, you know, spelled C-R-P-Y

49:10 instead of C-R-Y-P cryptography.

49:12 Right.

49:13 Yeah.
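
A name that is almost, but not exactly, a well-known package can be flagged with a string similarity check. This is a rough sketch using the standard library's `difflib`; the short `KNOWN_GOOD` list and the 0.8 cutoff are assumptions chosen for illustration, and real tooling would rely on much more than edit similarity.

```python
import difflib

# Tiny allowlist for illustration; a real check would use a far larger set.
KNOWN_GOOD = ["requests", "cryptography", "pytest", "pydantic", "httpx"]


def possible_typosquat(name, known=KNOWN_GOOD, cutoff=0.8):
    """Return the known package that `name` closely resembles, if any."""
    if name in known:
        return None  # an exact match is the real package, not a lookalike
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for candidate in ("crpytography", "request", "requests"):
        print(candidate, "->", possible_typosquat(candidate))
```

The same check is useful in code review, where a lookalike name in a dependency list is easy to read past.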

49:13 But, but the idea is that, you know you, you, you can overlook a package cause it looks like

49:20 a, it looks like a good one.

49:21 No, it's not necessarily that you're going to, you're going to install it because you type

49:25 it wrong.

49:26 although that is, that is, you know, one technique, right?

49:30 The drive-by installs where someone just fat-fingers the package name.

49:34 but really having a typosquatted package is going to allow these threat actors to,

49:43 be a little more stealthy in their inclusion of that package in, in legitimate, code

49:48 reviews and commits and, dependencies of dependencies.

49:52 Right.

49:52 And so the other, the other thing that goes with typosquatting, I don't know if I had a link

49:57 for you there yet, is starjacking.

50:00 So, a lot of times if you're going to typosquat on a known good package, okay, there,

50:05 there it is.

50:06 you know, these, these, these threat actors, they just, they just straight up copy the

50:12 known good project, right?

50:14 They just clone the repository and then change the package name.

50:19 and, and then when they, when they post the package to, PyPI, for instance,

50:26 the metadata that goes with the package, still exists, right?

50:31 So, on PyPI for a given package, you can see on the left-hand side, it shows like some,

50:36 some statistics.

50:37 If, if the, URL was given to like a GitHub.

50:42 hosted project, for instance, it'll go in there and tell you how many stars.

50:48 Right, right, right.

50:50 You know, how many downloads.

50:51 That's actually a signal that it seems like it should be good, right?

50:53 It'll have.

50:54 Yeah.

50:54 A lot.

50:55 And that's what starjacking is doing, is just copying the metadata of a known good package.

51:01 so that on first look.

51:04 Yeah, there you go.

51:05 You can see.

51:05 Like I did pull up pytest and it says statistics, GitHub statistics, 11,000 stars, 2,000 forks.

51:12 Okay, this is legit.

51:12 Let's install it.

51:13 Right.

51:14 So I could go clone pytest repository right now, change the name to pytest spelled P-I-T-E-S-T.

51:20 Mm-hmm.

51:21 And then, and then push that to PyPI.

51:23 The math version of testing, yeah.

51:24 And you're going to get these same statistics and you're going to get the same, maintainers

51:28 that you see if you scroll down a little bit, in the, the, metadata.

51:34 Yeah.

51:34 So you get the maintainers list.

51:35 All of that metadata that you, you, you enter in your pyproject.toml or setup.py file,

51:42 gets read here on PyPI and just, just published.

51:45 So you can, you can fake people out.

51:48 Yeah.

51:48 That's actually really, okay.

51:51 Well, there's a new terrifying thing that I hadn't thought about.

51:53 Yeah.

51:53 So, so starjacking and typosquatting where you just take a known good package, clone it,

51:58 and then maybe you, you make a change to, you know, existing function, you know, the

52:04 function does what it's supposed to do, but it also does some other stuff like ship off

52:07 secrets from your, your CI server or, you know, it could lay dormant and wait for,

52:14 some sort of production environment and grab some SSH keys or something terrible.

52:19 Yeah.

52:20 Yeah.

52:20 That's, that's, that's the other, the other dependency confusion.

52:24 Okay.

52:24 That's the next one you've got up.

52:26 Yeah.

52:26 This is the one that we kind of talk, it's similar to what we talked about, before

52:30 with, I can't remember, but I said there's, there's going to come back to this.

52:34 So here, here it is again, this is a dependency confusion where, if you get the wrong version

52:39 or the wrong name, it could actually, you try to be safe by having a white listed list or

52:45 say, well, it's, it's, so this is one where it's the same, same package name, different source

52:51 of where you acquire that package.

52:53 Yes.

52:53 So this is, you'll, these attacks are mostly like, companies, enterprises.

52:59 This is the enterprise attack.

53:01 Yeah.

53:02 Yeah.

53:02 So it's in Artifactory and we, we only put our stuff there and we're, we're going to call

53:08 it like, you know, international company underscore data access.

53:11 That's right.

53:12 And, and it's, and it's, and it's tricky because if you don't know, like if you don't

53:17 have your build system set up in a way and then, your CI server set up in a way to

53:22 install your dependencies in the proper order, like excluding public registries first and only

53:28 looking for packages in your private registry, then it's very easy, especially with pip, which

53:34 defaults to looking on PyPI, the public registry first, and then only falling back to your, your

53:40 extra index URL specifications.

53:42 Secondly, that if you, if someone had the knowledge or just guessed at the package

53:50 name that you had published on your internal registry, and then they made their own package

53:54 with the same name, but put it on PyPI, that's the one that's going to get installed.

53:58 and there was like a whole series of, you know, bug bounties that were claimed over this

54:05 back a few years ago, because people just went around, you know, guessing at internal package

54:10 names or maybe they used to work there or knew people.

54:13 Yeah.

54:14 Yeah.

54:14 I'll pay a hundred bucks just to share your requirements.txt with me.

54:18 Right.

54:19 Right.
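
Since the risk comes from mixing a private registry with the public one, a first step is simply finding where an extra index is configured. The sketch below scans the lines of a requirements file or pip configuration for `--extra-index-url` (or the `extra-index-url` config key). It is a simple text check under that assumption, not a full audit, and the URL in the usage note is a placeholder. The exact selection order pip applies across indexes is not modeled here; the point is only that any extra index leaves the public registry in play for every package name.

```python
def flag_extra_indexes(lines):
    """Return (line_number, text) pairs where an extra index is configured."""
    flagged = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Covers the requirements-file option and the pip config key.
        if line.startswith(("--extra-index-url", "extra-index-url")):
            flagged.append((number, line))
    return flagged
```

A file containing `--extra-index-url https://pypi.internal.example/simple` would be reported with its line number, which gives a reviewer a place to ask whether a single `--index-url` pointing at a controlled mirror would be safer.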

54:20 You know, it's, it's kind of, it's extra sneaky because it only affects people.

54:27 It only affects people who are going out of their way to be more secure, right?

54:32 They're going out of their way to say, we're only going to, we're going to actually set up

54:35 a whole server and we're going to whitelist a bunch of stuff.

54:39 You can only ask for the names of the things on this server.

54:41 And, ah, yes.

54:43 Yes.

54:43 And that, that might still work if you limit it to your internal registry only or a mirror

54:49 perhaps of, of the public registries.

54:53 but it's pretty easy to create your own internal copy, download a bunch of external

55:00 ones and mirror them locally and say like, these are the ones that are pre-approved at

55:04 our company.

55:04 Nothing else.

55:05 Yeah.

55:06 Yeah.

55:06 I, I, I've worked in an environment where that's exactly what we did.

55:10 And, I think there is merit to that.

55:13 You just have to know that anything you're mirroring to the trusted internal network is in

55:19 fact secure.

55:20 You know?

55:21 Yeah.

55:22 For sure.

55:22 I think it doesn't really make sense except for a few very rare cases to say you cannot

55:29 use external dependencies.

55:30 Right.

55:31 Right.

55:31 You're just saying what we want is to not build software while the rest of the world

55:36 does, you know?

55:38 Yeah.

55:38 Because that's part of the magic.

55:39 We just saw there's over half a million libraries you can choose from.

55:42 When you say we, we have zero of those, you're really, really constraining the type of software

55:48 and the velocity at which you can build.

55:51 Yeah.

55:52 Yeah.

55:53 Yeah.

55:53 It reminds me of, there's that line, you know, like why, why do you rob banks?

55:58 Cause they have the money.

56:00 Cause that's where the money is.

56:01 Right.

56:01 It's like, well, why do attackers, why are attackers going after open source software now?

56:05 Like, well, that's, that's where it's easiest to get arbitrary code to run.

56:11 That's where developers are.

56:12 That's what, to be fair though.

56:14 It's not only, it's not only right.

56:16 There's SolarWinds, which really had almost nothing to do with open source, but it had

56:20 to do with CI/CD systems and other sneakiness.

56:23 Right.

56:23 Yeah.

56:24 Yeah.

56:24 And got into places that, you know, instead of getting into libraries, you get into the

56:29 build system and you just give it a little extra, a little extra include tag there.

56:33 Bringing that deal out, like you said, right.

56:37 So dependency confusion is sneaky because you're asking for a local version off a local

56:42 server.

56:42 It doesn't exist on PyPI, but if it could be made to exist on PyPI, all of a sudden that

56:47 gets installed.

56:48 That's potentially, that's not good.

56:50 Potentially.

56:50 Yeah.

56:51 It's, it's, that's, that's how it works in all the, in all the default cases.

56:54 And it's, it's pretty tricky actually to, to exclude, to do it in the correct order and

56:59 exclude those public registries.

57:01 Yeah.

57:01 What I do to help this is I just, I just run the UUID command to get one of those

57:08 16 digit arbitrary hex things.

57:10 And I just name all my libraries that.

57:12 And so it's like, oh, you have the F3DC.

57:16 Yeah.

57:16 That's the, that's the API one.

57:18 That's right.

57:18 That's important.

57:19 That, right.

57:20 No one is going to do this.

57:21 It's such a safe space.

57:23 I tell you.

57:23 All right.

57:25 On to the next one.

57:26 That, that would work.

57:28 Expired author domains.

57:30 This is super sneaky.

57:32 Yeah.

57:33 Yeah.

57:33 So this is one, you know, it, it might be less of a factor now.

57:41 I think, I think it was just earlier this month that PyPI enforced two factor authentication

57:46 for all their users.

57:48 But a lot of sites and, you know, even PyPI, I think before this month have, you know,

57:59 password reset features where if, if you lose access to your account or you forget your password,

58:04 just, you know, send me an email and reset your password.

58:07 But it's, it's, it's very possible that people, you know, years ago submitted a package.

58:13 They, they don't maintain it anymore.

58:15 They submitted it under an old email account that has expired.

58:20 Right.

58:20 Maybe they, they had some domain.

58:22 Yeah.

58:23 Special.

58:23 It doesn't work that well for Gmail or Outlook.

58:26 Right.

58:27 Right.

58:27 If you had.

58:28 Custom domain.

58:28 If you had a custom domain and, as would be awesome, have your own, you know, Michael at

58:34 talkpython.fm, that kind of thing.

58:36 Yeah.

58:37 Say you, you win the lottery and, and, you know, decide to quit your day job.

58:42 Yeah.

58:42 And then you let your domain expire and, well, maybe there's still a linkage for the talkpython

58:48 domain to PyPI.

58:50 And then I go and buy that domain and, you know, request a password.

58:54 Set up the server.

58:54 Yeah.

58:55 Account reset.

58:56 Set up email.

58:56 Yeah.

58:56 And then now I, now I can publish new versions of, of the, of the packages there.

59:02 Yeah.

59:03 It's not good.

59:03 Yeah.

59:04 Yeah.

59:04 So I don't really know what to do about that one, but there's an amazing, there's an amazing

59:09 joke that I found on Mastodon.

59:11 Somebody posted, sit here.

59:13 It's two big red buttons.

59:17 Think Ren and Stimpy or whatever.

59:18 And one of the red buttons says, admit to yourself that your dream is dead.

59:22 The other one says pay $12 for domain renewal.

59:26 Right.

59:27 I mean, it's funny, but there's plenty of people who will get a domain and I totally go in and

59:33 then it's like, you know what?

59:33 I haven't done anything with it in five years.

59:35 I'm not paying another 12 bucks.

59:36 But if they had set up an account under that, right?

59:39 This is what you're talking about.

59:40 Yeah.

59:41 Yeah.

59:41 Yeah, exactly.

59:42 Yeah.

59:43 That's why you got to buy your domains for that hundred year renewal period.

59:47 Exactly.

59:48 Take out that loan.

59:50 You get to the loan.

59:51 All right.

59:54 We're getting short on time here.

59:55 I want to, let me, let's just go through, I'll just list off a few real quick.

59:59 Maybe we do lightning round.

01:00:00 Okay.

01:00:00 Unverifiable dependency.

01:00:01 Okay.

01:00:02 These are for specifying dependencies that are not necessarily published to PyPI, right?

01:00:11 So that maybe you're pointing to a GitHub repository.

01:00:15 You know, pip calls these VCS project URLs, you know, if you look in their help output.

01:00:22 Yeah.

01:00:22 It's like pip install git+https to a thing that has a pyproject.toml.

01:00:27 Yeah.

01:00:27 And that thing, it can point to a repository.

01:00:31 Maybe it points to a tag.

01:00:32 Maybe it points to a branch.

01:00:34 None of that is stable, right?

01:00:37 Like you, the tag could change out from under you.

01:00:40 The code that's related to that tag could change out from under you.

01:00:44 The code at the branch you're pointing to could change while the name remains the same.

01:00:49 So, you know, those are risky for that reason, right?

01:00:52 If you're not pinning to a very specific version or a very specific hash, right?

01:00:56 If you're going to point to a repository or a git URL.
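
The point about unstable tags and branches can be made concrete with a small check on a VCS requirement string. This is a rough sketch under assumptions: the function names `vcs_ref` and `is_pinned_to_commit` are hypothetical, the `github.com/example/project` URLs are placeholders, and "pinned" is taken to mean that the ref after `@` is a full 40-character hexadecimal commit SHA.

```python
import re
from typing import Optional

# A full Git commit SHA-1 is 40 hexadecimal characters.
_FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def vcs_ref(requirement: str) -> Optional[str]:
    """Return the ref after '@' in a VCS URL requirement, or None if absent."""
    # Drop a leading "name @ " (direct reference form) and any "#..." fragment.
    url = requirement.split(" @ ", 1)[-1].strip()
    url = url.split("#", 1)[0]
    _scheme, sep, rest = url.partition("://")
    if not sep:
        return None
    # Look only at the path so that "git@host" in ssh URLs is not mistaken for a ref.
    path = rest.split("/", 1)[1] if "/" in rest else ""
    if "@" not in path:
        return None
    return path.rsplit("@", 1)[1]


def is_pinned_to_commit(requirement: str) -> bool:
    """True only when the requirement points at a full commit SHA."""
    ref = vcs_ref(requirement)
    return ref is not None and bool(_FULL_SHA.match(ref.lower()))


for req in (
    "git+https://github.com/example/project.git",            # no ref: default branch
    "git+https://github.com/example/project.git@main",       # branch: can move
    "git+https://github.com/example/project.git@v1.0",       # tag: can be re-pointed
    "git+https://github.com/example/project.git@" + "a" * 40,  # full commit SHA
):
    print(is_pinned_to_commit(req), req)
```

Only the last form refers to a fixed commit; the others name something whose contents can change while the name stays the same.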

01:00:59 Interesting.

01:01:00 Yeah.

01:01:00 Make sure it's true.

01:01:01 I've gotten to feel a lot of times like the hash is maybe a little bit redundant given the immutability of PyPI.

01:01:06 But if you're pointing at something like this, then maybe all of a sudden you really do want that, right?

01:01:11 Yes.

01:01:11 For sure.

01:01:11 Yeah.

01:01:12 Okay.

01:01:12 Repo jacking?

01:01:15 Yeah.

01:01:15 This is similar to the expired author domain, right?

01:01:20 So if someone was pointing to one of those git dependencies, a VCS project URL, as pip calls it, and that account went dormant or expired, relinquished, whatever, and someone else took it over, then yeah, they can now dictate.

01:01:39 What's there, yeah.

01:01:40 Yeah, exactly.

01:01:41 A lot of people are acquiring.

01:01:42 All right.

01:01:44 And then maybe last bit, get a chance to talk a bit about your Phylum CI project.

01:01:50 I do want to point out really quick, though, that Phylum was a sponsor of the show.

01:01:54 Yes.

01:01:55 A while ago.

01:01:55 But this is not a sponsored episode.

01:01:57 This is just you and I had been talking prior to that, actually, and decided to put this show together.

01:02:03 So just to be clear, but let's talk about this project you guys got anyway.

01:02:06 Yeah.

01:02:07 Yeah.

01:02:08 So you can pip install phylum right now, or like I prefer, pipx install phylum.

01:02:14 Yeah.

01:02:14 I love pipx.

01:02:15 It's awesome.

01:02:16 Yeah, me too.

01:02:17 Yeah.

01:02:17 I think I heard about it from you, actually.

01:02:19 So the circle goes.

01:02:23 Yes, yes.

01:02:23 So this package, it does two main things.

01:02:26 One is it can, it'll expose two entry points.

01:02:30 One of them is called phylum-init, and that'll get you the Phylum command line interface written in Rust, but installed with Python.

01:02:40 It'll get you the Phylum CLI locally.

01:02:45 And then the other one is called phylum-ci.

01:02:48 That's just a catch-all entry point, the thing that gets exposed through our Docker container

01:02:53 to handle almost all of our integrations.

01:02:56 So if you want to monitor your PRs on GitHub, for instance, we've got an integration for that.

01:03:03 Nice.

01:03:04 So the idea is basically that I could set this up in GitHub.

01:03:07 A PR comes in, I could set up an action.

01:03:09 Phylum will scan it for known mischievousness.

01:03:13 That's right.

01:03:14 And make that part of the PR, maybe even block it out, right?

01:03:17 Yeah, exactly.

01:03:18 It'll fail your build if you don't pass your default policy or established policy on any of your given lock files or manifests.

01:03:28 We deal with manifests as well.

01:03:30 And you mentioned GitHub.

01:03:31 So even with GitHub, we went a step further.

01:03:33 We have an app as well.

01:03:34 So you don't even have to modify a workflow.

01:03:36 You could just install a GitHub app and automatically monitor your repositories.

01:03:42 But a lot of the other ecosystems don't have that.

01:03:46 So we just provide Docker containers.

01:03:49 I love the Docker container.

01:03:51 So use Docker run against your code or whatever.

01:03:54 So yeah.

01:03:56 And then there's even a pre-commit hook we expose as well.

01:04:02 Nice.

01:04:03 I genuinely don't know the answer to this question.

01:04:06 Does this cost money?

01:04:07 No.

01:04:08 We have anyone.

01:04:10 Anyone can sign up for free.

01:04:12 There's a community edition where you can have up to five projects.

01:04:17 Okay, cool.

01:04:18 You guys have to eat.

01:04:19 There must be some way you charge for something.

01:04:21 Oh, exactly.

01:04:21 Yeah, yeah.

01:04:22 So there's the paid version, right?

01:04:24 Which, you know, unlimited projects.

01:04:26 You get access to group-based management.

01:04:29 You know, there's a few extra features.

01:04:30 It's a freemium model.

01:04:32 More of a Teams enterprise-y angle.

01:04:34 Yeah, yeah.

01:04:35 But for this audience, I mean, I would love if everyone just went that little extra step of securing their open source software and, you know, go with the free option.

01:04:46 I'm not trying to sell you anything here.

01:04:47 You know, monitor your manifest, your lock files.

01:04:52 Make sure that you remain secure.

01:04:55 You're not exposing your secrets because that's what we're finding now is that developers are the new high-value targets.

01:05:01 Yeah.

01:05:03 That's what attackers want to go after because we know that developers, they have the secrets.

01:05:07 They've got the keys, you know.

01:05:09 We write the code that then gets run on the production server inside the firewalls.

01:05:15 Yeah.

01:05:16 We have all the access, all the secrets, all the keys.

01:05:20 So, you know, if you can find a way to get arbitrary code from strangers to run on developer systems, you're going to have a much better chance.

01:05:28 We have a good time.

01:05:29 Yeah.

01:05:29 We have a good time.

01:05:30 By that, I mean having a bad time.

01:05:32 Right.

01:05:33 Yeah.

01:05:33 Doing bad things.

01:05:35 Okay.

01:05:35 Let's not do that.

01:05:36 Awesome.

01:05:37 Well, excellent work.

01:05:38 I think probably we'll kind of just leave it there.

01:05:40 We're pretty much out of time for the rest of the stuff.

01:05:42 But close it out for us, Charlie.

01:05:44 People maybe have a few new tools to work with, and also techniques, but are maybe also a little freaked out.

01:05:51 What do you tell them?

01:05:51 I recommend everyone to restrict their use of dependencies to lock files and then carefully gate or guard the inclusion of new lock files or updates of existing ones.

01:06:04 Or sorry, dependencies in those lock files with careful analysis.
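
One way to gate lock-file changes along these lines is a simple check that every entry in a requirements-style lock file is pinned to an exact version and carries a hash. This is a minimal sketch, not a full requirements parser: the function name `unpinned_or_unhashed` is hypothetical, the hash values in the sample are placeholders, and the format assumed is the `name==version --hash=sha256:...` style that pip accepts in hash-checking mode.

```python
import re


def unpinned_or_unhashed(requirements_text: str) -> list:
    """List entries that lack an exact '==' pin or a '--hash=' option."""
    # Join backslash line continuations so each requirement is one line.
    joined = requirements_text.replace("\\\n", " ")
    problems = []
    for raw in joined.splitlines():
        line = raw.split(" #", 1)[0].strip()
        # Skip blanks, comments, and option lines such as --index-url.
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        name = re.split(r"[=<>!~;\s\[]", line, maxsplit=1)[0]
        # Only look at the version specifier, not markers or hash options.
        spec = line.split("--hash", 1)[0].split(";", 1)[0]
        if "==" not in spec:
            problems.append(f"{name}: not pinned with ==")
        elif "--hash=" not in line:
            problems.append(f"{name}: no --hash")
    return problems


sample = (
    "requests==2.31.0 \\\n"
    "    --hash=sha256:placeholder\n"
    "flask>=2.0\n"
    "urllib3==2.0.7\n"
)
for problem in unpinned_or_unhashed(sample):
    print(problem)
```

A check like this could run in CI or a pre-commit hook so that a loosely specified dependency is caught before it reaches the lock file; it says nothing about whether the pinned packages themselves are trustworthy.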

01:06:08 Don't allow arbitrary code to run anywhere in your development process and give Phylum a try.

01:06:13 We've got the free community edition.

01:06:15 We will provide that analysis and ensure that you don't have malware running on your system through bad dependencies.

01:06:22 Awesome.

01:06:23 All right.

01:06:23 Well, it's been very interesting and a lot of new things to think about.

01:06:27 So thanks for being here.

01:06:28 Thank you, Michael.

01:06:28 Yep.

01:06:29 See you later.

01:06:29 This has been another episode of Talk Python To Me.

01:06:33 Thank you to our sponsors.

01:06:35 Be sure to check out what they're offering.

01:06:36 It really helps support the show.

01:06:38 Take some stress out of your life.

01:06:40 Get notified immediately about errors and performance issues in your web or mobile applications with Sentry.

01:06:45 Just visit talkpython.fm/sentry and get started for free.

01:06:50 And be sure to use the promo code talkpython, all one word.

01:06:54 Mailtrap, an email delivery platform that developers love.

01:06:58 Use their email sandbox to inspect and debug emails in staging, dev, and QA environments before sending them to recipients in production.

01:07:06 Try Mailtrap for free at talkpython.fm/Mailtrap.

01:07:10 Want to level up your Python?

01:07:11 We have one of the largest catalogs of Python video courses over at Talk Python.

01:07:15 Our content ranges from true beginners to deeply advanced topics like memory and async.

01:07:20 And best of all, there's not a subscription in sight.

01:07:23 Check it out for yourself at training.talkpython.fm.

01:07:26 Be sure to subscribe to the show.

01:07:28 Open your favorite podcast app and search for Python.

01:07:31 We should be right at the top.

01:07:32 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

01:07:42 We're live streaming most of our recordings these days.

01:07:45 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:07:52 This is your host, Michael Kennedy.

01:07:54 Thanks so much for listening.

01:07:56 I really appreciate it.

01:07:57 Now get out there and write some Python code.

01:07:59 I'll see you next time.