Typosquatting and Supply Chain Vulnerabilities

Episode #319, published Sun, Jun 6, 2021, recorded Wed, May 26, 2021

One of the true superpowers of Python is the libraries over at the Python Package Index. They are all just a "pip install" away. Yet, like all code that you run on your system, it is done with some degree of trust. How do we know that all of those useful packages are trustworthy?

That's the topic of this episode. Bentz Tozer and John Speed Meyers are here to share their research into typosquatting on PyPI and other sneaky deeds. But we also discuss some potential solutions and fixes.

Episode Deep Dive

Guests Introduction and Background

Bentz (Benz) Tozer and John Speed Meyers join this episode from In-Q-Tel (IQT), a nonprofit that invests in and fosters leading-edge tech with a focus on cybersecurity. Bentz has a strong background as a software developer and systems engineer, spending 20 years in the defense industry before turning his focus to cybersecurity. John has a blend of data science, economics, and programming skills and works at IQT Labs researching open-source security and other high-tech solutions. Together, they share insights into how the Python package ecosystem can become vulnerable to attacks—especially via typosquatting and malicious software supply-chain threats.

What to Know If You're New to Python

If this is your first foray into Python, here are a few essentials from the conversation to help you get more out of it:

Package management in Python usually happens with pip. When you run pip install some_package, it executes code under your user permissions.
The Python Package Index (PyPI) is the official software repository where you’ll find most libraries and frameworks, but verifying you’re installing the correct package is crucial.
Creating virtual environments or using Docker is a recommended best practice to isolate and protect your system when exploring new packages.
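
As a minimal sketch of the isolation advice above, the standard-library venv module can create a throwaway environment for trying out an unfamiliar package. The helper name make_scratch_env and the directory name scratch-env are illustrative assumptions, not something from the episode. Keep in mind that a virtual environment separates installed packages from the rest of your system but does not sandbox code: anything that runs at install time still runs with your user permissions, which is why containers or VMs are the stronger quarantine.

```python
import tempfile
import venv
from pathlib import Path


def make_scratch_env(parent: Path) -> Path:
    """Create an empty virtual environment under `parent` and return its path."""
    env_dir = parent / "scratch-env"  # hypothetical name, chosen for illustration
    # with_pip=False keeps creation fast and offline; pass True to bootstrap pip.
    venv.EnvBuilder(with_pip=False).create(env_dir)
    return env_dir


workdir = Path(tempfile.mkdtemp())
env_path = make_scratch_env(workdir)
print(env_path / "pyvenv.cfg")  # marker file that identifies a virtual environment
```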

Key Points and Takeaways

Supply Chain Superpowers and Blind Trust
The Python Package Index offers a huge array of libraries just one pip install away, part of Python’s “superpower.” But with hundreds of thousands of packages available, developers typically trust code blindly. As Bentz and John highlight, it’s vital to remember that installing dependencies can execute arbitrary code.
- Links / Tools:
  - PyPI main site
  - Python Security Response Team (PSRT)
Typosquatting: A Quiet Attack Vector
Attackers exploit simple spelling mistakes by uploading packages with near-identical names (e.g., “pandar” vs. “pandas”) to trick developers. This is one of the most common forms of malicious package abuse, and a single typosquatted package can occasionally collect thousands of downloads from unsuspecting users.
- Links / Tools:
  - Issue #9527 (“social distancing” for package names)
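
One way to picture the "social distancing" idea is a check that flags any name sitting within a small edit distance of a popular package. The sketch below is an assumption about how such a check could look, not the mechanism proposed in the issue: the POPULAR list is a tiny made-up stand-in for a real most-downloaded list, and too_close is a hypothetical helper name.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from `a`
                current[j - 1] + 1,      # insert a character into `a`
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


# Hypothetical short list standing in for the most-downloaded packages.
POPULAR = ["requests", "pandas", "numpy", "urllib3"]


def too_close(candidate: str, max_distance: int = 1) -> list[str]:
    """Return popular names within `max_distance` edits of `candidate`, excluding exact matches."""
    name = candidate.lower()
    return [p for p in POPULAR if p != name and edit_distance(name, p) <= max_distance]


print(too_close("pandar"))
print(too_close("request"))
```

Plain Levenshtein distance counts two swapped neighboring letters as two edits, so a real check would likely need a higher threshold or a transposition-aware distance to catch that kind of typo.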
Malicious Packages on PyPI
Several examples, including requests vs. request and “pandar” vs. “pandas,” show how easy it is to slip in a bad library. In March 2021, PyPI removed 3,653 malicious packages in a single wave. Even if only 40 or so malicious examples are known, the potential damage is large, especially when one is installed in a corporate or government setting.
Importance of Responsible Disclosure
Developers who discover malicious packages or vulnerabilities should use responsible disclosure: quietly notify maintainers or email the PSRT before going public. This helps quickly remove threats and minimizes damage. There are also projects like the Backstabber’s Knife Collection (an open repository of malware samples) that researchers can use to study malicious code.
- Links / Tools:
  - PSRT contact info
Scanning Tools: PyPI Scan, Aura, and Aura Borealis
Bentz and John’s research led to creating scanning tools. Simple approaches (like comparing the name and metadata of lesser-known packages to popular ones) already reveal suspicious libraries. Aura Borealis aims to be a front-end for deeper static analysis (e.g., scanning all PyPI packages).
- Links / Tools:
  - pypi-scan (open-source CLI tool)
  - Aura & Aura Borealis (open-source scanning & front-end for analyzing metadata across PyPI)
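
The name-scanning approach described above can be sketched as follows: generate simple misspellings of each popular name, then look for them among the names that actually exist in a registry. This is only an illustration under stated assumptions and not how pypi-scan or Aura are implemented. The functions typo_variants and scan are hypothetical, and the registry set is made-up data standing in for a snapshot of real package names.

```python
def typo_variants(name: str) -> set[str]:
    """Generate simple misspellings: dropped, doubled, and swapped characters."""
    variants = set()
    for i in range(len(name)):
        variants.add(name[:i] + name[i + 1:])        # drop one character
        variants.add(name[:i] + name[i] + name[i:])  # double one character
        if i < len(name) - 1:
            # swap two neighboring characters
            variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
    variants.discard(name)  # swapping identical neighbors reproduces the original
    return variants


def scan(popular: list[str], registry: set[str]) -> dict[str, list[str]]:
    """Map each popular name to the registry names that look like typos of it."""
    hits = {}
    for name in popular:
        found = sorted(typo_variants(name) & registry)
        if found:
            hits[name] = found
    return hits


# Made-up registry snapshot; a real scan would pull names from the package index.
registry = {"requests", "request", "reqeusts", "pandas", "numpy", "flask"}
print(scan(["requests", "pandas"], registry))
```

A hit from a scan like this is only a lead. As the episode notes, some similarly named packages are legitimate, so each candidate still needs its metadata and code reviewed.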
Practical Security Steps
Developers should adopt speed bumps when installing: double-check the spelling of package names or copy/paste from official documentation. Using private PyPI repositories, pinned versions, or Docker-based “quarantine” environments (like local containers or VMs) helps control what code ends up in your production environment.
- Links / Tools:
  - Docker official site
  - Virtualenv documentation
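
Pinning can go one step past version numbers. pip has a hash-checking mode (the --require-hashes option) that rejects any downloaded file whose SHA-256 digest differs from the value recorded in the requirements file. The sketch below shows that underlying comparison in plain Python rather than pip's own code. The file name and contents are invented for the demonstration, and sha256_of and matches_pin are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pin(path: Path, pinned_hex: str) -> bool:
    """True only if the file's digest equals the digest recorded earlier."""
    return sha256_of(path) == pinned_hex.lower()


# Stand-in artifact; the name and bytes are made up for this example.
artifact = Path(tempfile.mkdtemp()) / "example_pkg-1.0.tar.gz"
artifact.write_bytes(b"pretend this is a source distribution")
pin = sha256_of(artifact)  # recorded once, when the dependency was reviewed

ok_before = matches_pin(artifact, pin)
artifact.write_bytes(b"pretend this is a tampered upload")
ok_after = matches_pin(artifact, pin)
print(ok_before, ok_after)
```

A digest check confirms that you received the same bytes you reviewed. It does not say whether those bytes were trustworthy in the first place, so it complements the name checks rather than replacing them.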
Hardened pip and Namespace Protection
The episode highlights the need for pip “safeguards,” such as blocking package names that are suspiciously close to those of the most-downloaded libraries. Namespacing (like “ownername/package”) could make it clearer who published a package and reduce confusion with sound-alike libraries.
OpenSSF and Ecosystem-Wide Solutions
The Open Source Security Foundation (OpenSSF) is a Linux Foundation project tackling these challenges across multiple language ecosystems. It encourages standard best practices and might help fund or unify scanning solutions, policy frameworks, and developer education.
- Links / Tools:
  - OpenSSF website
Examples of Supply Chain Attacks
While the episode focuses on Python, it references bigger incidents like SolarWinds and XcodeGhost for Apple iOS. These highlight the scale and impact of hijacked developer tools. Even if Python’s pip ecosystem is smaller by comparison, it’s not immune to large-scale exploits.
Responsible AI in Security Tooling?
Though not deeply explored in the episode, they hint that some scanning approaches may eventually incorporate machine learning or AI to detect suspicious packages in real time. However, even “simple” code checks can yield big benefits.

Interesting Quotes and Stories

“There’s a little bit of paranoia that comes with working in cybersecurity, it’s true.” — Bentz Tozer

“We realized you can do a lot by just scanning names and metadata. We found ‘pandar’ that way, which was doing keylogging!” — John Speed Meyers

“You’ll never pip install the same way after listening to this episode.” — Michael Kennedy

Key Definitions and Terms

Typosquatting: Creating package names nearly identical to popular ones so unsuspecting users install the malicious “near-clone.”
Supply Chain Attack: Targeting a developer tool, library, or repository with the goal of reaching many downstream users.
PSRT (Python Security Response Team): The team at the Python Software Foundation that handles security vulnerabilities.
Social Distancing for Packages: A proposal to block or warn about similarly named packages to reduce typosquatting.
Aura Borealis: A front-end system for analyzing output from “Aura,” which performs static checks on the entire PyPI set of packages.

Learning Resources

If you’re new to Python or want to deepen your understanding of foundational coding practices (including safe coding habits), these courses from Talk Python Training can help:

Python for Absolute Beginners: Ideal for those just starting out, covering essential Python concepts and coding fundamentals before jumping into advanced topics like security.

Overall Takeaway

Typosquatting and broader supply chain risks present a significant threat to the open, highly collaborative nature of Python’s ecosystem. Being vigilant—double-checking spelling, using private repositories, scanning new dependencies, and reporting malicious packages—can go a long way toward keeping our projects safe. As the community rallies around new tooling and more robust infrastructure, the hope is that these attacks will become both rarer and easier to stop.

Links from the show

Overview topics
SolarWinds: csoonline.com
XCodeGhost: macrumors.com
Python Package Index nukes 3,653 malicious libraries uploaded: theregister.com
Dependency confusion: medium.com
Typosquatting Is About More Than Typos: iqt.org
Approaches to Protecting the Software Supply Chain: iqt.org
A Quant’s View of Software Supply Chain Security: usenix.org

Organizations
Open Source Security Foundation (OpenSSF): openssf.org
Python Security Response Team: python.org

Proposed solutions and tools
pypi-scan: github.com
AuraBorealis App: github.com
Project Aura: aura.sourcecode.ai
Aura source code: github.com
Reduce Typosquatting Harm via Social Distancing for Top PyPI Packages: github.com
Have I Been Pwned: haveibeenpwned.com
Snyk Package Advisor: snyk.io
Backstabbers-Knife-Collection: dasfreak.github.io
NetworkML Package: github.com

Misc
Google as a Visionary Sponsor: pyfound.blogspot.com
Watch this episode on YouTube: youtube.com
Episode #319 deep-dive: talkpython.fm/319
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Episode Transcript

00:00 One of the true superpowers of Python is the libraries over at the Python Package Index.

00:04 They're all just a pip install away. And yet, like all code that we run on our systems,

00:10 it is done with some degree of trust. How do we know that all those useful packages are

00:15 trustworthy? That's the topic of this episode. Bentz Tozer and John Speed Meyers are here to share

00:21 their research into typosquatting on PyPI and other sneaky deeds. And we also get a chance to

00:26 discuss some potential solutions, fixes, and tools to help solve this problem.

00:31 This is Talk Python To Me, episode 319, recorded May 26, 2021.

00:36 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem,

00:54 and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm at

00:59 mkennedy. And keep up with the show and listen to past episodes at talkpython.fm. And follow the show

01:05 on Twitter via at Talk Python. This episode is brought to you by Square and us over at Talk Python

01:10 Training. Please check out what we're offering during our segments. It really helps support the show.

01:14 Hey, all. I have a quick announcement. We've had transcripts for all of our episodes for a long

01:19 time. But recently, we put more time and effort into making them more useful for you.

01:24 Now, every show has a link to the transcripts right in your podcast player. And that transcript

01:29 page lets you filter, search, and playback audio from exact moments within the transcript. Hope you

01:34 enjoy the richer experience around using our episodes as reference materials. I'm also happy to announce

01:39 a new sponsor of the show, Assembly AI. Assembly AI is a top-rated API for automatic speech-to-text.

01:46 You can transcribe videos and audio files with human-level accuracy in just a few lines of code.

01:52 To help us keep leveling up our transcript game, they're sponsoring the transcripts for our episodes

01:57 going forward. So thank you to Assembly AI for higher quality transcripts and supporting the podcast.

02:02 Check them out at talkpython.fm/assemblyai. Now, on to that conversation.

02:07 Bentz, John, welcome to Talk Python To Me.

02:10 Thank you. Thanks for having us.

02:12 Thank you.

02:12 Yeah, it's great to have you both here. It's going to be exciting, unnerving, I might say,

02:17 a little bit to have this conversation. But I think it's certainly high time.

02:21 You'll never pip install the same way.

02:23 Exactly, exactly. You just kind of hold your breath as you do at each time. And you know,

02:29 I'm also, this is not a challenge that just the Python world faces.

02:34 This is anyone that has a package manager.

02:37 Yep.

02:37 And the more open, the bigger the difficulties, I suppose, right? So NPM,

02:41 gems, like you name it, right?

02:44 Yep. If you're a software developer these days, it probably affects you.

02:47 Absolutely. So before we get into the typo squatting, the supply chain issues and all the

02:54 stuff in history and current problems and, you know, on the positive side, there are solutions

03:00 and tools and things that we can talk about as well. Before we get into all that, let's start

03:04 your story, maybe abbreviated version since there's a couple of you. Bentz, how did you get

03:08 into programming in Python?

03:09 Programming I got into just as a kid. Got a computer when I was, I don't know, nine or

03:14 10 and tinkered around with it, enjoyed it, ended up actually taking programming classes in

03:18 high school, stuck with it in college, majoring in computer engineering, and was a software developer

03:22 and systems engineer, that sort of stuff, in the defense industry for 20 years.

03:27 Yeah. What languages did you start in or mainly use?

03:30 Originally started in C, actually originally in Pascal, then started in C, C++, and transitioned

03:35 over to doing more active Python development. I just needed a web scraper, needed to collect

03:40 some data and analyze it, and Python was the right tool for the job.

03:43 You didn't want to do that in C++?

03:44 I did not, no. And now, you know, Python is my preferred language for tinkering or back-end

03:51 web development. Pretty much as much as I can use it for, I use Python.

03:53 Yeah, fantastic. John, how about you?

03:55 I don't have quite the classic story. I learned programming through statistics classes in

04:01 undergrad, specialized language called Stata that a lot of economists use. Good for legal

04:05 trials, well-tested, but I didn't learn Python until I, in grad school, I took more data science

04:11 classes and learned the typical NumPy, pandas, scikit-learn sort of stuff.

04:17 Right. They're like, let us introduce you to probably, it was called IPython at the time.

04:20 Exactly. And now, of course, Jupyter Notebooks, that sort of thing.

04:24 Yeah. Fantastic. It's really interesting just to see all the broad and diverse ways that Python

04:30 is growing and people are coming into it, you know? It's not that, well, I learned it for

04:34 programming, you know, building an operating system and on I went. There's a lot of languages that are

04:38 fairly, you know, or JavaScript. I built it to work on a website, right?

04:41 Yeah.

04:42 It draws people in from all sorts of things, which is awesome.

04:44 It's a meeting ground.

04:45 Yeah. Yeah. And I think that's one of the strengths, actually, kind of a sidebar is that

04:49 we have all these people with different backgrounds and different motivations and interests and things

04:53 they're trying to build rather than being more like, well, here's how I build my web app. How do

04:57 you build your web app? Yeah. Very cool. And how about now? Bentz, what are you up to day-to-day?

05:03 So day-to-day, you know, kind of put down the keyboard, at least from the programming perspective.

05:07 And I work as a cybersecurity subject matter expert for In-Q-Tel, which, so I guess my job

05:13 there is to search for and then work with companies we find in the cybersecurity industry that have a

05:18 high impact on national security, as well as providing kind of advisory services to our customers in the

05:23 U.S. government.

05:24 Okay, cool. So what's In-Q-Tel? Sorry, In-Q-Tel, I guess it is?

05:27 Yes.

05:28 Yeah. What's the company story there? Because you both are from the same company.

05:31 Yeah. So it's a nonprofit, 501c3, stood up a little over 20 years ago by the CIA to basically help

05:37 you know, originally the CIA, but now it's seen most of the intelligence community and

05:41 elements of the DOD basically acquire and adopt and use cutting edge technology. They realized a

05:47 little while ago, you know, around that time that a lot of innovation was moving into Silicon Valley

05:51 and into other places in industry and startups. And the traditional acquisition model that federal

05:55 government uses doesn't play well with those people. They don't understand it. So we kind of

05:59 helped as a bridge working with startups, identifying them, and then helping them interact with the

06:04 government and conversely helping the government, you know, adopt said technologies and support their mission.

06:09 So maybe, let me see if I can run a scenario by you. Maybe there's some Silicon Valley company

06:13 that's created like a cool ML thing that identifies deforestation or something like that. And the

06:21 government decides, oh, this might be really helpful for us for, I have no reason why. I have no idea

06:26 why, but let me imagine there's a reason, right? You might help that company like work with the

06:31 request for proposals and the whole crazy government side of things and get them more in line with what's

06:36 needed there. Is that the story? Yeah, that's to an extent. Yeah. I mean, we actually invest in them,

06:39 take equity, and then do help them learn how to interact with the government and also help them

06:44 shape their product and meet our customer needs.

06:46 Yeah. Okay, cool. Interesting. I had no idea such a company exists.

06:50 John, how about yourself?

06:51 I'm also at IQT. I work in what's called IQT Labs. It's an open source applied research and development

06:58 lab where we do hands-on research, mostly in the open source, largely on GitHub.

07:02 Cool. Sounds very, very fun. Now, let's talk about the supply chain issue, I guess, at a real broad level,

07:10 right? And I don't know how you all feel. I suspect that you have a little more hesitancy

07:16 or whatever as you interact with the computers and software and the internet and so on. You know,

07:21 when you, oh, there's a cool new app, maybe I'll try that. Like, you might think a little more

07:25 carefully about this than the average, you know, say, teenager or whatever.

07:29 There's a little bit of paranoia that comes with this. It's true.

07:32 Yeah, exactly. Exactly. That's what I'm getting at. And I feel like one of the more insidious

07:37 aspects of this has been the supply chain side of things, right? Because it's one thing to say,

07:42 that app looks shady. That site looks shady. Let me just not go there. Let me not click that link.

07:47 Let me not install that. But if I were to install, you know, Office Suite X and I completely trust the

07:54 company that makes that, but there's some library that they got from a third party and that third party

08:01 had been hacked and they somehow Trojan'd that third party thing and no one's found out yet.

08:07 I don't know. That's super scary. And that's kind of along the lines of some of the things that we're

08:11 touching on. And so I think the most broad one of those in the recent times has got to be SolarWinds,

08:16 right? That's certainly what's making the headlines these days. Still even, what,

08:20 five, four or five months later. It's, yeah, still a topic of discussion around this theme.

08:26 And yeah, I mean, that was a pretty challenging attack to pull off. I mean,

08:30 it took nation state actors months, maybe years to plan, you know, laying the groundwork,

08:35 getting things in place, you know, basically infiltrating SolarWinds development infrastructure.

08:40 Pretty impressive, honestly, that they were able to do it. And obviously the impact was enormous.

08:44 It was wildly successful.

08:46 I think one thing that Bentz and I have been interested in, though, while this sort of attack

08:51 is very serious and obviously has rightly gathered a lot of attention, there are a number of other

08:56 types of attacks, often focusing on open source software that are actually more numerous.

09:01 How serious they are is actually open to debate. But we are still talking many people affected and can

09:08 still have grave consequences, especially if you're the one that's hacked.

09:12 So it's gotten less front of the newspaper attention, but Bentz and I still think it's serious.

09:18 Yeah, I think it's very serious. I started with this one because I feel like everyone has heard about this.

09:22 Everyone can relate to this, right?

09:24 And here's an example of a company that supplies network gear to many of the largest companies and government

09:33 organizations around the world. And this was basically a way to get, you know, access to all of those.

09:38 They think it's Russia's Cozy Bear crew, but who knows, right? And it almost doesn't matter.

09:44 Another one that I think also is in the news really quick before we jump into the open source stuff.

09:49 This is not open source at all, but was called Xcode Ghost. Have you two heard of this?

09:54 Yeah. Yeah. So, yeah, I mean, basically what happened here was, you know, app developers,

09:59 iOS developers in China don't like to download or can't download stuff from the Apple official

10:04 Apple version of Xcode. Someone, you know, put a compromised version of Xcode up on some.

10:10 So let's get it off BitTorrent or something.

10:12 Yeah. I mean, also some Chinese file sharing site that app developers over there like to use

10:16 because it's more convenient and they, they, it was compromised. There was a, basically a,

10:20 something that would bake a backdoor into, you know, the ultimate compiled app that would go into

10:25 the app store or variant of the app store.

10:27 Yeah. So every app that was built and published to the app store with Xcode Ghost, which looked

10:33 exactly like Xcode injected a backdoor malware type of thing into it. So there was something like

10:39 2,500 applications, the iOS app store that yeah, affected like 128 million people. So that,

10:46 that's bad kinds of things, right?

10:48 Very bad.

10:49 Not ideal.

10:49 I mean, I guess attacking a compiler, I mean, developers trust their compiler, I would say,

10:54 I mean, not being able to rely on that or feel like you have, and it's very hard to

10:57 vet your compiler, especially for a closed-source product like Xcode.

11:02 It's very hard to see is my compiler compromised or not.

11:07 Yeah. Yeah. And I think this actually is closer to the open source side of things,

11:12 right? Because if you can start to infect the tools of the developers building the things,

11:16 that's a problem. Yeah. So let's talk about the open source side, John, you pointed out,

11:21 there's many known attacks over there.

11:23 That's right.

11:23 Set the stage. What's going on?

11:25 There's actually a range of attacks, but I'll mention a couple here and we'll get into typosquatting.

11:29 So there is certainly a typosquatting attack, which we'll discuss extensively today, which

11:33 just like domain names, you might've heard someone is trying to go to a website and mistypes it

11:41 a little, or somehow gets confused about how it's spelled, maybe switching the order of words,

11:45 and then ends up someplace that's malicious, either on the web, or if you're downloading a package,

11:50 you download a package you think you want, but it's not actually.

11:54 And sometimes not always, sometimes that contains malware and does things to your computer that you

12:00 don't want.

12:01 That's bad, right?

12:01 Bad. Especially if there's arbitrary code execution, meaning they can do what they want because perhaps

12:08 you've installed it as root.

12:09 Right. And well, I think a lot of people who are getting into Python don't realize that when you

12:14 pip install something, unless it's being installed as a wheel, as a binary wheel, it's running a setup.py

12:21 as your account. So whatever your current account is able to do, like you said, if you run it as

12:27 sudo, it's, it can do more, but even if it can just completely wreck your account and get your

12:31 information for many people, that's plenty bad on your computer. You don't want to.

12:35 Yeah. And it could be your computer. It could be your, your corporation's computer where you work

12:39 or your company's computer. And this setup.py, you're exactly right. It is a key attack vector.

12:44 For many people, it's simply a helpful way to install software. But unfortunately,

12:48 some people abuse that specific resource.

12:51 Yeah. I think it's been critical in the growth of how software is built. I remember,

12:55 you know, Bentz, you were talking about doing C++ programming. I remember back in 97, 98,

13:02 99 doing C++ programming then. And it felt like whatever you wanted to do, almost everything you

13:07 had to build from scratch. You want a library that does this kind of UI widgets? Well, how do I build

13:12 that? You want a library that has this kind of data structure? Where do I either find or build that?

13:17 Right. And now it's just pip install this thing, pip install that thing. And the building blocks

13:22 that we have to compose are so much more effective, right? I can take a couple of libraries here and

13:28 click them together. And all of a sudden I've got something absolutely incredible, like pulling data

13:33 from different sources, creating amazing graphs. I mean, that is the power of modern software

13:38 development, right? And yet, you know, I guess Corey Atkins out in the live stream has a nice

13:43 sort of comment on this. Like he said, I didn't realize how naive I was thinking packages were

13:48 vetted. You're not alone, Corey. And so you're not alone. Join the club.

13:52 This portion of Talk Python To Me is brought to you by Square. Payment acceptance can be one of the

13:59 most painful parts of building a web app for a business. When implementing checkout, you want

14:04 it to be simple to build, secure, and slick to use. Square's new web payment SDK raises the bar

14:10 in the payment acceptance developer experience and provides a best in class interface for merchants

14:16 and buyers. With it, you can build a customized branded payment experience and never miss a sale.

14:22 Deliver a highly responsive payments flow across web and mobile that integrates with credit cards and

14:28 debit cards, digital wallets like Apple Pay and Google, ACH bank payments, and even gift cards.

14:33 For more complex transactions, follow-up actions by the customer can include completing a payment

14:39 authentication step, filling in a credit line application form, or doing background risk checks

14:44 on the buyer's device. And developers don't even need to know if the payment method requires

14:49 validation. Square hides the complexity from the seller and guides the buyer through the necessary

14:54 steps. Getting started with a new web payment SDK is easy. Simply include the web payment SDK

14:59 JavaScript, flag an element on the page where you want the payment form to appear, and then attach

15:04 hooks for your custom behavior. Learn more about integrating with Square's web payments SDK at

15:09 talkpython.fm/square, or just click the link in your podcast player's show notes.

15:14 That's talkpython.fm/square. These incredible building blocks, these Lego pieces, there's a lot of faith

15:22 out there that these are good building blocks. Not good in the sense they don't have bugs, but good in that

15:27 they have a good intent.

15:28 Well, I think that's one thing that's the key is that, and one of the things that's a challenge here is you have to

15:32 kind of figure out the intent of the people you're trusting, and you are trusting them ultimately, and you have to

15:38 hope they do not have malicious intent. Because inferring that is very challenging.

15:41 It's a double-edged sword. I mean, I agree. It is a powerful change that you can download a couple

15:47 libraries and have an amazing app, potentially in a few minutes, maybe an hour or two. And this is the

15:53 dream of code reuse, come alive, finally. And it just so happens that there are sometimes downsides.

16:00 They can be mitigated, but unfortunately to the unaware user, which it's all too easy to be unaware,

16:06 it's difficult, actually. There are serious, there can be risks.

16:10 Yeah, there definitely can. Kim Van Wick out of the live stream has an example. A benign example would

16:16 be attr, A-T-T-R, versus attrs. Both are legitimate packages, but completely different.

16:22 Another example would be if I want to install requests, but I actually just type request.

16:27 I mean, even auditorily, they sound like requests.

16:31 It's easy to do.

16:32 It even sounds like very similar with the S versus no S there. And if somebody says,

16:38 go install requests, you're like, oh, request, pip install request. God, I did it. Like, wait,

16:42 no, no, no, no, don't do that one.

16:43 Yeah. And it actually happened. You can find that that attack truly happened, affected,

16:47 at least according to the article published about it, 20,000 users. So I don't know how many of them

16:52 were actually affected. I haven't, we don't, this is unfortunately part of the problem. It's hard to

16:57 track this data, but the example you brought up, I know you brought it up on purpose. It's serious.

17:04 Yeah. And requests with the S is installed millions of times a week or a month. Many,

17:11 many, many, many, many times, right?

17:12 We'll talk about this later, but we found one called Pandar, like Pandas, but with an R.

17:17 And, you know, it's not hard to imagine just by either confusion or a mistake typing this.

17:24 Yeah, absolutely. So another area I think that is a little bit interesting before we dive completely

17:30 into the package management type of squatting and related type of issues has to do with a trusted

17:37 open source thing becoming untrusted. And what I mean by that is there were some examples of things

17:44 like Google Chrome extensions being put out there as proper extensions, and then someone taking over

17:49 that project and then putting something maybe more adware in it, or something somewhat nefarious,

17:55 if not actually malicious, or, you know, somebody who is running the request is not a great example

18:01 because it's under the PSF organization, but many of the projects are under an individual, right? On

18:08 their GitHub project. And so if somebody was able to break into that person's GitHub repo, and then they

18:13 somehow sneak something into the code, well, does it look wrong? No, the person who made that change

18:19 is the trusted benevolent person who runs this project, right? They are, if, you know, Guido

18:25 van Rossum comes in and makes a change, well, who's going to look at that and go, oh, this is, this guy's

18:30 sketchy. We better really, like, it's probably going to be fine, right? So if someone takes over an account,

18:35 like, not only do they have access to the code and then how it gets pushed out to, you know, potentially

18:39 gets into the stream that goes to PyPI. It's also done by the person who looks like they should be most

18:45 trusted, right? So things like two-factor auth and just securing your GitHub and things along those

18:51 lines seems extremely important as well. Absolutely. I mean, what you're describing with account takeovers

18:56 has happened numerous times. And there's variants on it too, where there's some single developer who's

19:01 overworked, tired, doesn't use the project they create anymore. They just hand it over to someone

19:05 who ends up, you know, putting a backdoor in it or some sort of malicious payload. I mean, that, that's

19:10 happened. And then also people take advantage of the fact that not only do you have to keep your GitHub

19:13 profile secure, but you also have to secure your PyPI or RubyGems account or, you know, wherever you actually

19:18 publish the packages people run. So there's kind of two areas for potential attack. And also people

19:23 take advantage of the, you know, most people, at least me anyway, when I would vet software, I would go

19:28 look at GitHub and then I would download, I wouldn't download it from GitHub. I would download it using

19:32 pip or whatever. And that kind of, dissonance or whatever you want to call it,

19:37 there's another opportunity for, for confusion and malfeasance.

19:43 Yeah. And so these things are hard to detect, but I guess the area that you all have done a lot of

19:49 research in, you built some tools around and probably the biggest area is around the package

19:55 management side of things, right? That's right.

19:57 So we've talked about typosquatting and some of the challenges where people might mistype things.

20:03 And you talked about some examples where you found packages that look like they were intended to be

20:09 installed by accident, you know, to catch those. If there's 7 million people type, you know, 7 million

20:15 times pip install requests is typed. Chances that a couple of those are misspelled or enough of those

20:20 are misspelled is pretty high, but there were actually quite a few examples. Like for example,

20:25 The Register had an article. When was this? This was, this is recent, March, 2021. The title is Python

20:33 Package Index nukes 3,653 malicious libraries uploaded soon after security shortcomings highlighted.

20:41 That's right. This is, there's really a longer historical narrative too, to include this.

20:45 I'll call this a political activism, anti-typosquatting activism, where this,

20:51 you could call it an attack, is really about drawing attention to this risk.

20:55 Yeah. And I feel like a lot of these were people like, look, I'm proving to you this could actually

20:59 happen. That's right. I'm proving by creating this thing that uploads as requests with the S and T.

21:06 That's right.

21:06 Swapped. But were there actually viruses put up there? Like what is the actual harm been?

21:12 Yeah. So not all of these are. This one and a number of others, we can discuss those if we have time,

21:19 were largely benevolent, but demonstrated the risk. But yes, there have been, at least by our

21:25 calculations, 40 known malicious typosquatters on the Python Package Index, affecting thousands of

21:31 users. We actually published a blog post on this, something like Python typosquatting is about more

21:37 than typos. So yes, this has happened. I don't know the exact persons that it has affected. We just

21:43 don't have that data. Sorry if it affected you. And we published this and got some debate on Hacker News.

21:49 And this is the point where Bentz and I said, oh, there's really something here. There's a broad

21:53 audience that hasn't had a voice that cares about this.

21:56 Yeah. I mean, it could have been nothing, right? If I'm a student at a university and I install it on

22:02 a lab computer.

22:03 No big deal.

22:03 No big deal. Like who trusts those lab computers, right?

22:06 You shouldn't.

22:07 I mean, not just because like somebody could have installed something bad on it, but there's,

22:13 there are college students.

22:15 Oh yeah.

22:15 Who could be installing all sorts of just, you know, pranks and other kinds of stuff. So you

22:20 should just treat those things with.

22:22 Contaminated.

22:23 Yes, they're fully connected. But on the other hand, if this is a data scientist working at like

22:29 a major corporation or an agency and that happened to them, it could be the thing that opens the door

22:35 to, you know, access to the entire network and all sorts of lateral movement, right?

22:40 That's right. There's even one of the earliest pieces of anti-typosquatting activism comes from

22:45 Nikolai Tschacher, who was writing his undergraduate thesis at the time in Europe. And he showed that

22:53 over a few weeks, he got over 17,000 downloads of a series of typosquat packages, including .mil,

23:00 the military addresses of the United States military. So it is certainly possible that people in a more

23:06 secure organization that really value security could accidentally be the victim of

23:11 typosquatting.

23:11 Yeah, absolutely. And the fact that it came out of a .mil domain shows that, yeah, that bad example

23:17 could also happen. And also his thesis got covered on Ars Technica.

23:21 That's right. Coolest undergrad thesis ever.

23:23 Exactly. That's way better than anything I did in college.

23:27 Oh yeah.

23:28 Yeah. Fantastic. And then there was this project called Pytosquatting.

23:33 Yeah.

23:34 Pytosquatting.

23:34 Yeah. It's a play on...

23:36 Which actually has been... Yeah, like a play on typosquatting.

23:39 It's a play on typosquatting. It's a clever one. And Benjamin Bach and Hanno Böck,

23:43 who are open source software activists, developers, also a journalist, they've really had a multi-year

23:51 effort pointing out the dangers here. Not simply criticizing, but trying to help the Python Software Foundation

23:56 and the Warehouse, or PyPI, crew raise money and build a consensus around trying to make

24:04 this infrastructure safer.

24:05 Yeah. Yeah. So they had this project called Pytosquatting, but that actually got closed down.

24:11 That's right.

24:11 Yeah. Because they said that the PS... What do they call it?

24:16 The PSRT, Python Security Response Team.

24:20 That's it. PSRT. And I'm like, wait, there's a Python Security Response Team?

24:26 That's cool.

24:26 And they respond to emails too. They're good.

24:28 Yeah. Okay. So this is an organization, a group of people under the PSF banner that basically

24:34 triage these types of concerns, right?

24:37 That's right. That's right.

24:38 Okay. Yeah. I'll link to their page on python.org and they have their email there. They also have

24:46 rules for different types of disclosure, like whether you should email them, do other things.

24:51 That's right. And if you find a malicious package or even a package that you think is very suspicious,

24:56 this is who to contact. And they're diligent and timely.

25:01 So what do you two think about how this should be disclosed? People out there listening, they find

25:06 something. Should they go to Hacker News and say, look, this horrible thing I found on PyPI or on

25:12 NPM or whatever. Should they quietly disclose that to the security response team and then talk about it

25:21 after it's been removed and fixed? What's the flow for disclosure?

25:25 Seems like it would follow any other responsible disclosure process for traditional bugs, exploitable

25:31 bugs that are with vulnerabilities, where it would be nice if you find a problem, contact maybe the

25:39 Python security team, get in contact with the developer, get it fixed, probably get the package

25:43 pulled down if in fact it is malicious. And then, yeah, it'd be nice to have some sort of reporting

25:48 mechanism so that everyone who uses it could be identified. And the first part, John Speed was

25:53 saying, you know, the Python Software Foundation and the PSRT do a good job or great job of

25:57 being on top of it, being timely, being responsible. It's much harder to notify, you know, there's no

26:03 authentication when you download one of these packages. So it's very hard to know who's been

26:06 affected. So maybe just promoting that more would be helpful. But then people have to know where to

26:10 look and that they need to look at all. It becomes challenging quickly.

26:13 Well, it's like the XcodeGhost thing, you know, there were 2,500 apps that were backdoored.

26:20 And I think only the top 25 were even disclosed. And it's like, if there was a list of 2,500 apps,

26:26 are you going to go cross compare? No, you know, no normal person is going to cross compare that

26:31 announcement with their phone.

26:33 Right.

26:34 And it's just such a challenge. And I feel like, you know, here we had the same thing,

26:37 right? We had 3,653 packages removed. Well, are you going to go check if you had those? It's

26:43 extra hard because it's, you didn't intend to ever have them. You didn't intend to swap the S and the

26:50 T when you type requests, but you did. And you accidentally, almost unknowingly got it most

26:55 likely. Right. And so I do think it's really hard to push this out as an awareness thing and like,

27:02 hey, you should know that this happened. And so just go check, right? The checking,

27:06 I think it's really tricky.

27:07 Yeah. I mean, like many software problems, you need to solve it with more software.

27:11 You got to solve it with AI probably.

27:13 No, you definitely have to solve it with AI. I think one thing that's helpful and could be part

27:18 of that process, but isn't always, unfortunately, is also taking a collection or taking that artifact

27:26 that you found, let's say a Python package that was malicious and making sure it gets to somewhere

27:31 where it can be studied and hopefully future attacks prevented. And so for Python and a couple

27:36 of other languages, there is actually an interesting project. It has a very colorful name. It's called

27:41 Backstabber's Knife Collection. Sounds very scary and malicious, but it is actually yet another

27:46 enterprising grad student trying to collect malware samples, especially of interpreted languages.

27:51 Python is one of them. And so that there can be a community of researchers and hopefully then

27:57 companies that can fight these packages. So that would be another thing I would add to the list.

28:02 Yeah, there you go. Marc Ohm is the main person associated with that and has written some

28:07 interesting papers and great stuff. And so I urge you, if you come upon this and you say,

28:12 how do I act responsibly here? Do the things Bentz says and also maybe grab a sample and give it to

28:17 the Backstabber's Knife Collection or another similar repository.

28:20 Yeah, interesting. Okay. Have I just messed up my computer by visiting this webpage as well? I wonder.

28:25 I don't think so, but there is a...

28:27 I'm just teasing. I'm just teasing.

28:28 I mean, I can't guarantee anything though, but...

28:31 No, of course, of course, of course. Before we get too far on, Corey Adkins also asked,

28:35 when we were talking about messing up your computer, the lab computer, so on, he asked,

28:39 could installing these types of things also affect shared server space? On my IaaS land,

28:46 where I have a shared server running for however much someone else does something bad.

28:51 I mean, theoretically, sure. It depends on the permissions, I would think. Yeah. If you install

28:55 some dependency that has keylogger baked into it or, I don't know, or, you know, some sort of file,

29:00 you know, collector, and it has permission to traverse all directories, then yeah, I mean,

29:05 I could certainly see a scenario where that was possible. I mean, I haven't, you know,

29:08 I haven't heard of that happening specifically, but there's nothing preventing it theoretically.

29:12 Yeah. If you had a series of virtual machines, you know, it's pretty tricky from one virtual machine

29:17 to escape to another, but I believe there have been examples, but those are exceedingly rare,

29:22 those sorts of vulnerabilities, right? That's right. So while we're on this topic,

29:26 I want to throw out an idea and then we'll talk about some of the tools you built, but I feel like

29:29 we're right in the middle of this notification thing. Like we've got all these packages, they've

29:34 been identified, they have been downloaded. We can see that we probably even have IP addresses,

29:39 which you can reverse look up to DNS names as probably how those attributions were given,

29:44 but it happens so often in so many different places, right? Like if I've got a continuous

29:51 integration story that builds a Docker container that pushes to a Docker hub and then my production

29:57 grabs that from that container, the place where the problem happened is not the place where the

30:03 problem is, right? It's probably GitHub or some other CI pool. We have a really nice thing for this

30:09 in the account space. Have I been pwned by Troy Hunt, which is a really nice project. I definitely

30:15 recommend people go there and enter their email address.

30:19 And prepare to be horrified.

30:20 Yeah. And prepare to be horrified. There's 11.2, 11.3 billion accounts that have been breached,

30:27 which is odd because it's more than all the humans, but we have more than one account. So there it is.

30:31 But yeah, so you put your email in there and then in the future, well, historically as well,

30:37 but then in the future, you say, if something has happened and your email appears in some kind of

30:42 password dump, password breach or account informational breach, you'll get an email saying,

30:47 Hey, we found something that should be concerning to you. Check it out. I would love to see something

30:51 like this for pip, right? Something that says, I pip installed this thing and it just has a record of,

30:57 here's my account. These are the things I've pip installed. If there turns out to be a problem

31:02 with one of those, notify me that that had happened.

31:05 That does sound really useful. I don't think it exists.

31:07 I don't think it exists either. And it shouldn't just be a pip thing. It should be an NPM thing.

31:11 It should be a gem thing. It should be a crate thing. It should be something that like a,

31:15 just a little bit of a wrapper that says, I would like to opt in to saying, here's my UUID.

31:22 Here's my email address. And here's the list of things that I've installed. If it turns out that

31:27 one of them is horrible, just let me know.
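
Editor's sketch of the idea above. As the speakers say, no such notification service exists, but the local half of it (comparing what is installed against a list of flagged names) can be approximated with the standard library. Everything here is an assumption for illustration: `KNOWN_BAD` is a made-up advisory set (only `pandar` comes from the episode), and `find_flagged` and `installed_names` are hypothetical helper names.

```python
import re
from importlib import metadata


def normalize(name):
    """Lowercase and collapse runs of -, _ and . the way PEP 503 compares project names."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_flagged(installed, known_bad):
    """Return the installed names that also appear in the advisory set."""
    bad = {normalize(name) for name in known_bad}
    return sorted(name for name in installed if normalize(name) in bad)


def installed_names():
    """List the distribution names visible to the current interpreter."""
    names = (dist.metadata.get("Name") for dist in metadata.distributions())
    return [name for name in names if name]


if __name__ == "__main__":
    # Hypothetical advisory list; a real service would have to publish one.
    KNOWN_BAD = {"pandar", "reqeusts"}
    print(find_flagged(installed_names(), KNOWN_BAD))
```

This only covers the checking step. The opt-in registration and email notification described in the conversation would need a hosted service on top.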

31:30 It's so sensible. It makes me laugh.

31:32 Yeah.

31:33 Well, there's an idea out there as well, but this is way far down the line, right? This is,

31:38 oh, we know this has happened. We know who's done it. We've, we know who's been affected and so on,

31:43 but starting a little bit further back, you all have built some tools to go and start at the

31:50 beginning and say, well, let's look and see what might be out there. That is bad, right? This is

31:54 the tool you used to find pandar instead of pandas.

31:58 That's right. I mean, I think the first idea you had is the crucial one, which is that

32:03 you need to know that there's been a compromise in order to report it. And right now it's surprisingly

32:08 hard to know that. So we're not the only one to have devised a tool or approach to finding

32:14 malicious packages on the Python package index, but we took a particularly simple one and we said,

32:19 can we use simply the metadata, especially the name, but some other information too,

32:23 of packages. And then look at just the most downloaded packages and check who has names

32:29 that are very similar to those packages. This is where AI comes in.

32:32 Need crazy AI at this point. You do get a lot of false positives. People have similar names just

32:37 because the packages are related. It's fine. There's no problem inherently with having a similar name,

32:42 but we cracked open those packages too. This was some very boring Saturday mornings of mine

32:49 and simply scanned through the code looking for anything that's suspicious. And lo and behold,

32:54 we found one called Pandar that was actually doing key logging. It was a proof of concept. It's unlikely

32:59 that it actually would have worked, but we reported it to the Python security response team, security at

33:05 python.org. They said, yep, not good. Yanked it. And it was just an example of it's not that hard

33:12 to find them. And we were showing yet again with a pretty simple demonstration that it's not that hard.
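
Editor's sketch of the name-similarity step described above: take the most downloaded package names and look for other names within a small edit distance. This is not the speakers' actual tool, only a minimal illustration that assumes plain Levenshtein distance as the similarity measure. The function names and the sample lists are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lookalikes(popular, all_names, max_distance=1):
    """Map each popular name to other, non-popular names within max_distance edits."""
    popular_set = set(popular)
    hits = {}
    for target in popular:
        near = sorted(
            name for name in all_names
            if name not in popular_set
            and 0 < edit_distance(target, name) <= max_distance
        )
        if near:
            hits[target] = near
    return hits


if __name__ == "__main__":
    # Illustrative inputs only; a real scan would use the full index listing.
    print(lookalikes(["pandas", "requests"], ["pandar", "numpy", "requests"]))
```

Plain Levenshtein counts swapping two adjacent letters as two edits, so catching transposition typos would need `max_distance=2` or a distance that treats a swap as one edit. As the speakers note, name matches produce many false positives, so a hit is a prompt for manual review, not a verdict.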

33:18 Interesting. That's really cool. So basically the tool is about finding given popular packages,

33:24 finding ones that are oddly similar. And then there's like a, let me go and see what this one's about.

33:29 That's right. And so there's a couple additional checks to help anyone using it. And you can find

33:33 it's an open source written in Python tool, command line tool. It also checks things like, for instance,

33:39 is the description of the package on PyPI, is it very similar? So that what you are witnessing is

33:46 someone who's trying to not only typosquat the name, but in some sense, like squat the broader

33:52 metadata or almost like the copyright of the package.
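
Editor's sketch of the description check mentioned above. The episode does not say how the tool compares descriptions, so this assumes a simple word-level comparison with the standard library's `difflib.SequenceMatcher`. The function name and the 0.8 threshold are arbitrary choices for illustration.

```python
from difflib import SequenceMatcher


def description_similarity(first, second):
    """Return a 0.0 to 1.0 ratio of how closely two descriptions match, word by word."""
    words_a = first.lower().split()
    words_b = second.lower().split()
    return SequenceMatcher(None, words_a, words_b).ratio()


def looks_copied(first, second, threshold=0.8):
    """Flag a pair of descriptions as suspiciously similar (threshold is arbitrary)."""
    return description_similarity(first, second) >= threshold


if __name__ == "__main__":
    print(looks_copied("Python HTTP for Humans.", "python http for humans."))
```

A package whose name is one edit away from a popular one and whose description is nearly identical is a stronger signal than either check alone.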

33:55 Right. Because you want it to look as similar as possible.

33:57 Exactly. Like camouflage.

33:58 Something that comes to mind. Yeah. Are you guys familiar with Snyk?

34:01 Yes.

34:02 Yes.

34:02 Is it package? There's a project. And geez, I'm forgetting.

34:06 Are you thinking of the advisor project?

34:08 Yes. Snyk Package Advisor.

34:10 Yeah. It's neat.

34:11 That's it. It's super neat. Yeah. That's what I was looking for. No, that's not how I want to spell it.

34:15 Yeah. And so that thing is pretty cool. The reason I bring this thing up is you can come over here and I

34:21 can type in a project like requests or whatever, and it'll tell us eventually,

34:26 it'll tell us the package health score. Yeah.

34:29 And it'll tell us things like there's this many PRs that have been open and closed. There's this many

34:33 contributors. There's this many people participating that the maintenance looks like so on. One thing

34:38 that I think would be cool would be to take this number plus a misspelling and say, if that number

34:45 is really, really low for a package that should be really, really high, that's a challenge, right?

34:50 If you look at the GitHub repo that is delivering this thing and it doesn't look right, if it's not

34:56 associated with something that seems kind of hard to replicate, like a GitHub repo with many people

35:00 participating over a long period of time, that seems like that could be a good flag as well.

35:04 Yeah, certainly. It certainly seems like there's an abundant opportunity to build something into the

35:10 actual download client to the pip or a wrapper around pip where it checks these sorts of things and

35:17 create speed bumps for you as you are trying to download something or use a package so that

35:23 says, hey, this looks suspicious. Have you thoroughly checked this? And I think your idea could contribute to

35:29 exactly such a tool or tools. Yeah. Yeah. Very neat. You've been working on this and Martin

35:36 Carnogursky created this thing called Aura and also reached out and said, hey, I'm also working on this.

35:43 And so, yeah, tell us about this thing called Aura. Yeah. So we got an email last fall after publishing

35:49 this blog post and he said, hey, I've been working on a similar tool. Not only does it check this metadata,

35:55 but we even do static analysis of the entire Python Package Index. And we said, Martin, that's awesome.

36:01 Let's work together. And so over the past six months, roughly now, in an open source collaboration

36:08 between a number of us at IQT Labs and Martin Carnogursky, we have further refined Aura, which

36:14 truly is designed to do a static analysis of the entire Python Package Index. It's an open source tool. You can find it.

36:20 He releases his data, or tries as best as he can to release it regularly. We've also built a tool

36:26 called Aura Borealis. The key thing is, Aura produces 50 gigs of output when it's done scanning

36:33 the entire Python package index. No human can wade through that. So.

36:38 And I suspect also the PyPA, the Python Packaging Authority folks probably don't want everyone

36:44 downloading that much data all the time. No, it's exhausting and creates so many database issues and

36:50 other things. So we've been working on a tool called Aura Borealis that you've pulled up that is a front

36:55 end that makes it easier to use the data set that Martin creates with this tool, Aura. This wouldn't

37:02 necessarily be part of PyPI, though, of course, it could be. But we imagine this as a tool for

37:08 organizations or persons that need to have global knowledge about either global knowledge about the

37:15 entire Python package index and to rank and assess potential threats and go look, look for those,

37:20 and then take appropriate action or even individual developers that are really curious about packages.

37:25 This makes it easy. Aura Borealis isn't yet live, but we hope to make it live this summer.

37:30 Aura is an in-production tool. It works. So go check it out.

37:34 Talk Python To Me is partially supported by our training courses.

37:39 When you need to learn something new, whether it's foundational Python, advanced topics like async or web apps

37:45 and web APIs, be sure to check out our over 200 hours of courses at Talk Python.

37:51 And if your company is considering how they'll get up to speed on Python, please recommend they give our content a look. Thanks.

37:57 This looks like it's really handy. You know, so the idea is basically it's going to run forever and that's going to generate tremendous amounts of data.

38:05 Maybe just put a web front end on top of that static data for everyone to generate it over and over.

38:11 Exactly. Instead of having generated over and over now having 50 gigs and having to write your own custom,

38:16 probably Python script that's, you know, you'll have to optimize and blah, blah, blah.

38:20 Yeah. So came out in the live stream just says I accidentally typed sync instead of Snyk,

38:26 which also is hard to spell anyway, because it's like a non-common spelling.

38:30 So which is an excellent way to demonstrate making a typo of getting the wrong package.

38:34 I have no idea what that's going to return. I'm not going to pull it up.

38:37 Podcast imitates life, imitates art, imitates compromise.

38:43 Exactly. All right. Well, this is really neat. How would I use the Aura data and the Aura Borealis project?

38:50 I guess also we should talk about this from different angles, right?

38:53 Maybe I'm a CISO at a company and I'm concerned that all my people are psyched about data science and Python

38:59 or NPM and web front ends and they just make me nervous all day and I want to get on top of it.

39:06 So I want, as somebody who is concerned about, I would like to know what's happening in my software

39:11 supply chain, or maybe I run, I maintain pandas and I'm really upset that pandar exists and I want to now be able to defend my package.

39:20 Like it seems like there's different use cases and people out there.

39:24 That's right. I think if you're a company and you have a group of software developers and

39:29 you have the, let's say a security team that helps vet packages.

39:34 So perhaps you put those packages in an internal repository so that the developers know that they're

39:39 cleared to use. Aura Borealis will help you do that.

39:42 We're glad to set up pilots and discuss.

39:44 You can email me, jmeyers at iqt.org.

39:47 But there's also other angles too.

39:50 There's just, you're a developer and you want to make an informed choice.

39:53 The static analysis tool and its output can help you with that, or Aura Borealis.

39:57 And I think there is also, you're right, there's a maintainer angle and also a PyPI administrator

40:02 angle where you want to either protect a set of namespaces close to your package or you care

40:09 about the health of the entire ecosystem.

40:11 And those are all possible user types.

40:14 Yeah.

40:15 And we could probably use your PyPI scan to go and say, look, can I say, look for things

40:22 similar to my package name?

40:23 Yeah, that's right.

40:24 And we built that into Aura Borealis too now.

40:27 So in some ways, PyPI scan was a demo and still useful as a command line tool, but Aura Borealis

40:33 and Aura has that now built in.

40:36 Are you all going to put an API on top of this?

40:38 Good question.

40:38 That would be cool.

40:40 The thing that's tricky, like everything in life, is it costs money and, you know,

40:45 engineering resources and time.

40:47 I certainly have a vision.

40:48 And, you know, if I don't do it, someone else should do it.

40:52 Go make a lot of money.

40:53 of creating a technical infrastructure that every single package and every single new version

41:00 of every package, PyPI, NPM, et cetera, gets scanned, a variety of scans, static analysis,

41:06 dynamic analysis, metadata analysis.

41:08 And that gets stored in a database that where you and I can go make API calls and get that

41:14 information that we should on these packages.

41:16 That could be, you know, there could be a free tier.

41:18 And then if you really need to make a lot of calls, a paid tier.

41:21 But someone should do it, I think.

41:23 Yeah, it would be neat to know, like you said, integrate into, say, pip even.

41:28 So if I pip install something, it could even flag it and say, hey, no, actually, we're going

41:33 to block that.

41:34 That's right.

41:35 Preemptively, because it's got some low score, unless you do like a --force.

41:39 Like, no, really, I mean, yeah, exactly.

41:41 It's something that'll sort of slow it down, as you call them speed bumps.

41:44 I hope someone does something similar to that.

41:46 We have plans, but no active development underway.

41:50 All right.

41:50 So that sets the stage that some of the tools out there, at least to identify that there

41:55 are potentially bad packages.

41:57 And it's also, I guess, you know, worth pointing out that if we go over, say, to PyPI, there's

42:02 over 300,000 packages over there.

42:05 And if there are 40 actually malicious ones, right, the chances are low.

42:10 They're not very high.

42:12 But so people shouldn't be, you know, running for the hills and complete panic or anything,

42:17 I don't think, from this.

42:18 But at the same time, we should be careful.

42:20 We should be cautious.

42:21 So, you know, what can we do?

42:23 That's the tough question.

42:24 Bentz, do you want to start?

42:25 And you want me to go?

42:26 Sure.

42:26 I mean, there's a lot of things that we can do.

42:29 I mean, John Speed's hit on a few of them about just kind of being more deliberate, you know,

42:33 checking your work before you download something.

42:36 And also, you know, when you're considering dependencies, I mean, you mentioned C++ and,

42:40 you know, the late 90s.

42:41 I vaguely remember those times.

42:43 I remember when Boost came out, it was a big deal.

42:44 Oh, yeah.

42:44 You actually had a dependency that was...

42:46 I remember reading more books.

42:47 Right.

42:48 Less internet, more books to make things...

42:50 So, yeah, we moved on from that.

42:52 But ultimately, you know, it is worth considering, do you actually need this dependency?

42:55 You know, LeftPad in NPM is a funny, you know, canonical example.

42:58 Broke the internet because people didn't feel like typing one line of their own code.

43:01 They wanted to import a LeftPad-ing dependency.

43:04 I do feel that's a really good example.

43:06 And certainly, LeftPad came to mind, not as a malicious thing, but just as a supply chain

43:12 Jenga tower type of thing.

43:14 And somebody pulled too much on a part of the Jenga tower and it came down.

43:17 I feel that the JavaScript community has way smaller Lego pieces than the Python community.

43:25 The blocks that you click together here are larger.

43:28 So, I feel like there's just fewer in number external dependencies on average in my Python

43:35 experience than my JavaScript experience.

43:37 Yeah, I think that's accurate.

43:37 I mean, numbers vary.

43:39 I've seen NPM, people who use NPMs are JavaScript developers.

43:43 The average package in NPM has like 94% dependency.

43:47 You know, other dependencies, only 6% is your actual code you've written.

43:51 Most of the modern languages, meaning JavaScript, Python, and some others are in like the 90-ish

43:56 range.

43:57 And then you see C and C++ are much lower.

43:59 Java is somewhere in the middle.

44:00 So, yeah.

44:01 Python is, I would say, lower than JavaScript, but much higher than the kind of legacy languages

44:06 that are historically used.

44:08 So, be deliberate means things like don't just, as fast as you can, type pip install, whatever.

44:13 Type pip install and then carefully type out the package name.

44:17 Maybe give it a quick read before you hit go.

44:19 Yeah, or just copy and paste.

44:20 Don't type it at all.

44:22 Yeah.

44:23 So, for example, if I'm over here on PyPI, there's a copy button I can click and it'll

44:29 do exactly that.

44:30 Right.

44:31 That's an option.

44:31 Yeah.

44:31 So, yeah.

44:32 Just being a little more thoughtful and kind of, you know, looking at the dependency chain

44:36 as well before you download something, which is much harder than it should be, to be

44:40 completely fair.

44:41 That's helpful to know that, you know, maybe the top level you are using Joski.

44:46 That's how you pronounce that.

44:47 I have no idea what Joski.

44:49 I don't know.

44:49 You should pip install that right now.

44:50 Let's see what happens.

44:51 Let's just see what happens.

44:52 Here's an example of one of those that should rank lower.

44:55 No offense if this is your project, but it literally has zero stars, zero forks.

45:00 Its features are to do.

45:01 Its requirements are to do.

45:02 Its PyPI version banner is not found.

45:06 And I mean, it is only four minutes old.

45:08 They may be working.

45:09 Yeah.

45:09 Sure.

45:10 Yeah.

45:10 If it has dependencies, this one probably doesn't.

45:12 But, you know, take a look at those two.

45:13 Just make sure there's nothing egregiously wrong at a minimum.

45:16 Yeah.

45:17 That's one of the things that makes it kind of insidious and hard to see is the thing I

45:22 directly look at may be fine.

45:24 But the person who maintained that, did they make a mistake in the things that they depend

45:28 upon?

45:29 Or maybe, you know, transit, like follow that chain that graph down far enough.

45:33 Right.

45:34 There's a lot of layers that could be happening along the way.

45:37 It ends up looking like a web.

45:38 And not surprisingly, just because of that, most vulnerabilities inside of packages like

45:43 this are in the transitive dependencies, the ones below the first layer, the dependencies

45:48 of your dependencies.
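
Editor's sketch of what "transitive" means here: starting from the packages you install directly, follow each package's own requirements until nothing new turns up. The dependency graph below is a hand-written illustration, not real metadata, and `transitive_dependencies` is a hypothetical helper name.

```python
from collections import deque


def transitive_dependencies(graph, root):
    """Return every package reachable from root, whether direct or indirect."""
    seen = set()
    queue = deque(graph.get(root, ()))
    while queue:
        name = queue.popleft()
        if name in seen or name == root:
            continue  # already visited, or a cycle back to the start
        seen.add(name)
        queue.extend(graph.get(name, ()))
    return seen


if __name__ == "__main__":
    # Illustrative graph: package name -> names it requires.
    graph = {
        "my-app": ["requests"],
        "requests": ["urllib3", "idna"],
        "urllib3": [],
    }
    everything = transitive_dependencies(graph, "my-app")
    direct = set(graph["my-app"])
    print("direct:", sorted(direct))
    print("indirect:", sorted(everything - direct))
```

The indirect set is the part a developer never typed or reviewed, which is why problems there are easy to miss.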

45:48 Interesting.

45:49 So you can pip install the thing.

45:51 What about pinning the version?

45:52 I know there were some issues about having a private PyPI server, which I think is a good

45:58 idea where you whitelist packages in.

46:00 You say, we approve these things and only these things get installed.

46:03 And if you want to use a new one, we've got to opt it in.

46:06 And then now it's part of the organization.

46:08 That seems like something you could do, right?

46:10 There's PyPI server that you could set up that is a sort of pass-through layer there.

46:15 But then there was also the vulnerability of the version mismatch.

46:19 Like if there's a higher version of that thing on the public PyPI, than your local one.

46:23 So people were putting in like data layer version 70, you know, and then it's like, oh, there's

46:29 a newer version out there for me to go get.

46:31 I'll get that, even though it was internal, meant to be internal only, right?

46:35 So there's these challenges.
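
Editor's sketch of the version mismatch just described. It is a deliberately simplified model, not pip's real resolver: it assumes the installer gathers candidates for a name from every configured index and keeps the highest version. The package name `datalayer`, the index labels, and both helper functions are hypothetical.

```python
def parse_version(text):
    """Turn a purely numeric dotted version like '1.2.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def pick_candidate(candidates):
    """Pick the (index, version) pair with the highest version, whichever index it came from."""
    return max(candidates, key=lambda item: parse_version(item[1]))


if __name__ == "__main__":
    # The same name is visible on a private index and on the public one.
    candidates = [
        ("internal", "1.2.0"),  # the intended in-house datalayer package
        ("public", "70.0.0"),   # an attacker's upload under the same name
    ]
    print(pick_candidate(candidates))
```

Under this model the public copy wins only because 70.0.0 is greater than 1.2.0, which is why the attack used such a large version number.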

46:37 But what do you think about a private whitelist server?

46:39 It certainly seems valuable and seems like it's another speed bump, as John Speed was calling

46:45 them.

46:45 But yeah, I mean, then you run into scenarios like the one you described, where it's kind

46:50 of, I guess, that's undefined behavior, potentially, or at least not well-known behavior that maybe

46:55 isn't necessarily most intuitive.

46:56 So even that might not be enough.

46:58 So then, yeah, the pinning could help.

47:00 Then, of course, there's the challenge of maintaining your pin at the proper level, which adds more

47:05 effort on the developers to maintain up-to-date dependencies.
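
As a rough illustration of the pinning discussed here, the sketch below flags lines in a requirements file that are not pinned to one exact version. The regular expression and the sample text are assumptions made for this example; the check is deliberately simplified and would also flag lines that carry environment markers.

```python
import re

# Matches "name==version" with optional extras, e.g. "requests[socks]==2.25.1".
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[^=\s*]+$")

def unpinned_requirements(text):
    """Return requirement lines that are not pinned to one exact version."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems

# Hypothetical requirements file contents.
SAMPLE = """
requests==2.25.1
pandas>=1.2     # a range, so a newer release can slip in
flask
numpy==1.20.*
"""
print(unpinned_requirements(SAMPLE))
```

A pin only fixes the version number. pip also offers a hash-checking mode (`--require-hashes`) that additionally ties each requirement to the exact file contents.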

47:10 At least publicly.

47:11 Yeah.

47:12 Publicly, we have Dependabot on GitHub, which is way more of a pain than it should be to use.

47:17 Because if you've got 10 updates, it'll issue 10 PRs, which conflict with themselves.

47:22 Anyway, that's a long story.

47:24 But it's still at least some automation that says, hey, there's a new version of this.

47:27 Here's the change log.

47:28 And we also have the CVE security checks of Dependabot, which are really good.

47:33 Yeah.

47:33 Unfortunately, most of these typosquatting or just general supply chain attacks don't end

47:38 up in the NVD as a CVE.

47:40 Yeah.

47:40 Who's going to study this one and then not just say, take it down, right?

47:43 Like it's living under the, in the shadows, right?

47:46 Of being unnoticed.

47:48 To an extent, yeah.

47:48 NPM does a good job with their advisory service of like saying, this is a malicious package

47:53 and this is why we removed it.

47:55 But not all package managers do that.

47:57 And even so, then you have to go to all of them.

48:00 Most developers are developing in multiple languages these days.

48:03 So it's hard to keep track.

48:04 Yeah.

48:04 What about having isolated environments for trying out new packages?

48:09 So for example, one of the things I'm trying to do is if I'm checking out any new package,

48:13 I have to pip install.

48:14 And maybe that happens in a Docker container.

48:16 And then I throw away the container or possibly a VM with snapshotting on.

48:20 And then I roll back the snapshot periodically.

48:22 Yeah.

48:23 Those both sound like great ways to have good hygiene and isolate the potential blast radius

48:29 of a potentially malicious package.
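
One way to picture the throwaway-container approach is the sketch below, which only builds the command line and does not run it. The helper name and the package name are hypothetical, and `python:3.9-slim` is simply one image choice. A container narrows the blast radius but is not a complete security boundary.

```python
import shlex

def throwaway_install_command(package, image="python:3.9-slim"):
    """Build a docker command that installs a package in a disposable container."""
    return [
        "docker", "run",
        "--rm",   # delete the container as soon as the command exits
        image,
        "python", "-m", "pip", "install", package,
    ]

cmd = throwaway_install_command("some-new-package")
print(shlex.join(cmd))
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)` on a machine where Docker is available.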

48:33 Yeah.

48:33 It's one thing to say, here's a thing we want you to check out and it's on PyPI and it's

48:38 really well known, but it's, you know, you got to explore new things that aren't super well

48:42 known yet.

48:42 Right.

48:43 And so how do you install that?

48:44 Right.

48:45 So I think some kind of blast shield, like you said, like Docker, like a VM is not

48:49 a terrible idea.

48:50 Yeah.

48:50 It's a good one.

48:51 What else?

48:51 There's the Open Source Security Foundation.

48:55 Yeah, that's right.

48:56 This is OpenSSF.

48:58 OpenSSF.

48:59 Clearly a reference to OpenSSL.

49:01 Yeah.

49:01 Another well-known software supply chain compromise that had widespread impact.

49:06 It's worth pointing out that this group, for anyone who is very enthusiastic about

49:12 open source software supply chain security in particular, has become a meeting ground where

49:17 both companies and also persons interested in the sorts of topics we've been discussing

49:21 today and more have set up a series of working groups.

49:24 There are roughly six, and they meet every few weeks. It's an open community of fun, interesting people,

49:30 either interested in the topic or actively working to give back and contribute.

49:35 It's run by the Linux Foundation, and we would highly recommend it as a place to

49:40 find other like-minded persons,

49:41 if you care about these sorts of topics.

49:43 Yep.

49:43 Fantastic.

49:44 And then there's the further on down the road, which we've touched on a couple of times,

49:49 but maybe we can encourage some enterprising person, people group out there to go after

49:54 it, like a hardened pip or, you know, we have things that are sort of on top of pip, like pip-tools.

50:00 We've got pipenv.

50:02 We've got pipx.

50:03 I'm a big fan of pipx, the isolation.

50:05 And then that gives us kind of need.

50:07 And I can see, like, a pipsec or something along those lines, or PIPs, maybe a plural

50:14 PIPs.

50:14 I don't know for pip security, but something like that that incorporates some of these ideas.

50:18 Maybe it, it checks in.

50:20 You say like, I don't want to install any package that is not in the top 1000.

50:25 Sure.

50:26 Or a popular package, except for what I whitelist in on top of that or something or check with

50:32 Aurora Borealis about the score or check with the, have I been PIP?

50:37 Whatever that thing ever would become, right?

50:39 So talk about like where you might see things going.
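
The "top 1000 plus my own allowlist" policy described above could look roughly like the sketch below. The function names and both name lists are hypothetical. A real tool would load an actual popularity ranking, which this example does not attempt. Name normalization follows the rule described in PEP 503.

```python
import re

def normalize(name):
    """Normalize a project name the way PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()

def may_install(name, popular, allowlist):
    """Allow a package only if it is popular or explicitly approved."""
    key = normalize(name)
    return key in popular or key in allowlist

# Hypothetical data standing in for a real popularity list and a team allowlist.
POPULAR = {normalize(n) for n in ["requests", "pandas", "numpy"]}
ALLOWLIST = {normalize(n) for n in ["my_team.utils"]}

for candidate in ["Pandas", "pandar", "My-Team_Utils"]:
    print(candidate, may_install(candidate, POPULAR, ALLOWLIST))
```

Normalizing first matters because `My-Team_Utils` and `my_team.utils` refer to the same project as far as the index is concerned.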

50:42 Yeah.

50:42 Well, there's been a couple, I'll call them starter projects in the hardened pip area.

50:47 There even was one called pipsec.

50:49 You can find it on PyPI, but it's really, there's nothing there, unfortunately, at least yet.

50:55 That namespace is claimed.

50:56 The maintainers who we mentioned, Benjamin Balderbach, especially are interested in doing something,

51:01 just haven't had time, other busy priorities.

51:04 And I think there is a lot of potential to build out that idea and create something that

51:10 could be useful to the average developer.

51:12 JavaScript has a tool that has at least some moderate popularity called npq that does this.

51:18 And I think it's time for the Python community to see if there's something similar.

51:23 I would love to see something like that.

51:24 Another thing is Google.

51:26 Thank you, Google.

51:27 Has become a visionary sponsor of the PSF.

51:31 And specifically, they want their funds to go towards critical supply chain security improvements,

51:37 developing productized malware detection for PyPI, for a type of dynamic analysis infrastructure.

51:44 So this sort of hints at, maybe there's something that the PyPA and PyPI.org could do on their end without even necessarily changing pip, right?

51:55 pip's going to go talk to some API there.

51:57 And it goes, yeah, no, not this one.

51:59 That's right.

51:59 Or you're going to upload it, like with upload a new package.

52:02 It goes, no, we don't want to accept it.

52:03 And Dustin Ingram of the Python Software Foundation at PyCon just recently devoted his talk to talking about Python and the software supply chain issues that we've discussed today and writ large to include typosquatting.

52:15 And it's clear that there is energy and willingness from even core members of Warehouse and Python Software Foundation to tackle these issues.

52:24 So we're glad to see that.

52:26 It'd be great to see something like that happening.

52:28 I think layers as well, right?

52:29 That's how you talk about security often is it's not just, well, you have a strong password and you're fine.

52:34 Like, well, and maybe you have two-factor authentication.

52:36 And maybe you run as lower permissions and, and, and, and, right?

52:39 Yeah.

52:39 Layers.

52:39 So this could be one of the layers, but not necessarily all of them.

52:43 Yeah.

52:43 I should note that a couple of us at IQT Labs even put in an issue recently on Warehouse that might interest some parties here.

52:52 It's issue 9527.

52:54 You can also find it at short.iqt.org/issue, just a redirect.

52:59 And we essentially call for something like social distancing for the top Python package indexes.

53:05 So that for very popular package names, the package names that are close by are blocked off.

53:11 We're not saying that anybody who chooses those names is malicious, but just so malicious people can't choose them.

53:17 Feel free to upvote that.

53:18 We've been discussing this with some of the members of the Warehouse team.

53:22 Yeah.

53:22 So your proposal is that Pandar should not have even been allowed, right?

53:28 That's right.

53:28 Given that the package Pandas is so popular, minor variations on its spelling should basically be blocked or maybe redirect to Pandas and say with a warning, like, you tried to install Pandar.

53:41 Did you mean to install Pandas?

53:42 That's right.

53:43 Something like that.

53:44 That's right.

53:44 So it's a way to build in guardrails so that the unwary don't fall prey to this.
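
A guardrail like the one proposed here can be sketched with a plain edit-distance check. This is an illustration only, not how Warehouse or pip works: the list of protected names and the distance threshold of 1 are assumptions chosen for the example.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

POPULAR = ["pandas", "requests", "numpy"]  # hypothetical protected names

def too_close(candidate, popular=POPULAR, max_distance=1):
    """Return the popular name that candidate nearly collides with, if any."""
    name = candidate.lower()
    for known in popular:
        if name != known and edit_distance(name, known) <= max_distance:
            return known
    return None

for name in ["pandar", "request", "requester", "flask"]:
    print(name, "->", too_close(name))
```

With a threshold of 1, `pandar` and `request` are flagged while `requester` is not, which shows the tradeoff: a wider radius catches more lookalikes but also blocks more legitimate names.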

53:51 Yeah.

53:51 Personally, my first impression is that that's a good idea.

53:54 It's worth it that we don't need request and requests and requester.

53:58 And, you know, the potential harm is higher than the value of, you know, reusing very, very similar names.

54:05 Yeah, we agree.

54:06 And there's obviously tradeoffs.

54:08 Yeah.

54:08 Bentz, what do you think?

54:09 You must agree with this, I suspect.

54:11 I do agree with that.

54:12 I definitely supported this.

54:13 And I know one other thing that's under consideration that's relevant is namespacing.

54:16 So you can, you know, Kenneth Reitz is the requests guy.

54:21 He has his namespace.

54:22 You go to his namespace, and you're less likely to mistype the namespace and land on someone who has claimed the same package within their own namespace.

54:31 So possible, but, you know, it's another layer, I guess, as you were describing it.

54:35 Yeah, it makes the commands

54:36 you've got to type a little bit longer.

54:37 But it makes it really clear where it's coming from.

54:40 I mean, that's what the point of namespaces in programming is.

54:42 It's really clear what library it comes from or what part of your code it comes from.

54:47 And who?

54:47 Grouped together in namespace.

54:48 As well.

54:49 I know Go has done, you know, used that to great success.

54:52 Yeah, Kim Benwick out there put a cool comment that's sort of related to that, talking about the private PyPI server that's, you know, redirecting out.

55:01 It would help if private PyPI servers had an option to prevent uploading or pulling packages with a certain prefix.

55:09 For example, if everybody named their packages ABC something at the company, you could say ABC is private, ABC star is private, and never, ever, you know, go look beyond here for that type of thing.
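
The prefix rule from that comment can be sketched as a small routing function. The pattern `abc-*` and the internal index URL are placeholders invented for this example. The point is that a name matching the private pattern is never looked up on the public index, even if the private server does not have it.

```python
import re
from fnmatch import fnmatchcase

PRIVATE_PATTERNS = ["abc-*"]                            # hypothetical company prefix
PRIVATE_INDEX = "https://pypi.internal.example/simple"  # placeholder URL
PUBLIC_INDEX = "https://pypi.org/simple"

def normalize(name):
    """Normalize a project name the way PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()

def index_for(package):
    """Pick the only index a package name is allowed to come from."""
    name = normalize(package)
    if any(fnmatchcase(name, pattern) for pattern in PRIVATE_PATTERNS):
        return PRIVATE_INDEX  # never fall through to the public index
    return PUBLIC_INDEX

for pkg in ["abc_datalayer", "ABC.reports", "requests"]:
    print(pkg, "->", index_for(pkg))
```

Under this rule, a higher-numbered `abc-datalayer` published publicly would simply never be consulted, which addresses the version-mismatch scenario described earlier.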

55:22 I think that that's pretty interesting.

55:23 Yeah, it's a good idea.

55:24 Yeah, I think it seems super simple and a good idea.

55:27 I agree.

55:28 All right, gentlemen.

55:29 Well, very cool to talk about this stuff.

55:31 Like I said, it's going to make all of us a little bit more nervous, I suspect.

55:35 You know, for example, Corey Adkins out there said, I also just found an article on malicious Docker images.

55:39 Now I am paranoid, which.

55:41 I'm sorry.

55:42 Welcome.

55:43 Yeah, yeah.

55:45 I've been there for a while.

55:46 All right.

55:47 Before I let you two out of here, though, real quickly, let's answer.

55:49 I'll ask you the two questions at the end of the show, of course.

55:52 So if you're going to write some Python code, what editor do you use?

55:55 I use Vim if I'm in the command line.

55:57 But if I have the fortune to be outside of it, I use Sublime.

56:01 Right on.

56:01 I suspect JupyterLab is also in there.

56:03 Definitely Jupyter is in there.

56:04 Yeah.

56:05 And Bentz?

56:05 Probably PyCharm.

56:07 Yeah, I mean, I'll use Vim if I'm already in a command line.

56:11 But yeah, that's not as often these days.

56:13 So PyCharm is just my IDE of choice.

56:15 Right on.

56:17 And then notable PyPI package, something that's like, oh, people should know about.

56:21 Check out one called NetworkML.

56:22 It's a package related to machine learning and network traffic.

56:25 The lead maintainer is Charlie Lewis of IQT Labs.

56:28 You can go find it on PyPI.

56:30 Yeah.

56:30 Fantastic.

56:31 So machine learning plugins for network traffic.

56:34 Yeah.

56:34 So it identifies like anomalies and other weirdnesses like that?

56:38 Yeah.

56:38 It parses network traffic.

56:39 And one of the cool things it does is it helps identify what sort of device is being observed.

56:45 So is this thing a printer?

56:46 Or is this thing a personal computer?

56:47 Is it an active directory controller?

56:49 Et cetera.

56:49 Is it a canary?

56:51 Is it a canary?

56:52 Who knows?

56:54 Awesome.

56:54 All right.

56:55 Well, thank you both for shedding the light on lots of what's happening, some of the things

57:00 that are being done and what might also be done as well.

57:03 So final call to action.

57:04 People that want to get involved, maybe do more, become more aware.

57:07 What do you all say?

57:08 Yeah.

57:08 I mean, there's plenty of work to be done.

57:10 OpenSSF is a very welcoming, relatively new organization that has a nice list of stuff

57:15 to do.

57:15 Python Software Foundation also actually has an active list of items they would like

57:21 to work on, some of which are relevant to this topic.

57:23 So that'd be two great places to start.

57:25 I'll point you back towards that GitHub issue.

57:27 Feel free to chime in.

57:30 And I think there's definitely potential over the next few months.

57:32 Additionally, we're actually working on a survey at IQT Labs on secure code reuse.

57:38 So if you want to help build the research foundations for this, you can find this survey at

57:42 short.iqt.org/survey.

57:45 And we're trying to understand the developer, data scientist, or other programming professional's

57:51 experience with package reuse.

57:52 So that's another way.

57:54 So hopefully this survey informs future tools.

57:56 Yeah.

57:57 Fantastic.

57:57 Well, thanks for the work that you all are doing.

57:59 And thanks for being on the show.

58:00 Thanks for having us.

58:01 Thanks for having us.

58:02 Bye.

58:02 This has been another episode of Talk Python To Me.

58:06 Our guests on this episode were Bentz Tozer and John Speed Meyers.

58:10 It was brought to you by Square, us over at Talk Python Training, and the transcripts are

58:14 brought to you by Assembly AI.

58:16 With Square, your web app can easily take payments, seamlessly accept debit and credit cards, as

58:22 well as digital wallet payments.

58:23 Get started building your own online payment form in three steps with Square's Python SDK

58:29 at talkpython.fm/square.

58:32 Want to level up your Python?

58:34 We have one of the largest catalogs of Python video courses over at Talk Python.

58:38 Our content ranges from true beginners to deeply advanced topics like memory and async.

58:43 And best of all, there's not a subscription in sight.

58:46 Check it out for yourself at training.talkpython.fm.

58:49 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.

58:53 We should be right at the top.

58:55 You can also find the iTunes feed at /itunes, the Google Play feed at /play,

59:00 and the direct RSS feed at /rss on talkpython.fm.

59:04 We're live streaming most of our recordings these days.

59:07 If you want to be part of the show and have your comments featured on the air,

59:11 be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

59:16 This is your host, Michael Kennedy.

59:17 Thanks so much for listening.

59:18 I really appreciate it.

59:20 Now get out there and write some Python code.
