#319: Typosquatting and Supply Chain Vulnerabilities Transcript
00:00 One of the true superpowers of Python is the libraries over at the Python Package Index.
00:04 They're all just a pip install away. And yet, like all code that we run on our systems,
00:10 it is done with some degree of trust. How do we know that all those useful packages are
00:15 trustworthy? That's the topic of this episode. Bentz Tozer and John Speed Meyers are here to share
00:21 their research into typosquatting on PyPI and other sneaky deeds. And we also get a chance to
00:26 discuss some potential solutions, fixes, and tools to help solve this problem.
00:31 This is Talk Python to Me, episode 319, recorded May 26, 2021.
00:36 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem,
00:54 and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm
00:59 @mkennedy. And keep up with the show and listen to past episodes at talkpython.fm. And follow the show
01:05 on Twitter via @talkpython. This episode is brought to you by Square and us over at Talk Python
01:10 Training. Please check out what we're offering during our segments. It really helps support the show.
01:14 Hey, all. I have a quick announcement. We've had transcripts for all of our episodes for a long
01:19 time. But recently, we put more time and effort into making them more useful for you.
01:24 Now, every show has a link to the transcripts right in your podcast player. And that transcript
01:29 page lets you filter, search, and playback audio from exact moments within the transcript. Hope you
01:34 enjoy the richer experience around using our episodes as reference materials. I'm also happy to announce
01:39 a new sponsor of the show, AssemblyAI. AssemblyAI is a top-rated API for automatic speech-to-text.
01:46 You can transcribe videos and audio files with human-level accuracy in just a few lines of code.
01:52 To help us keep leveling up our transcript game, they're sponsoring the transcripts for our episodes
01:57 going forward. So thank you to AssemblyAI for higher quality transcripts and supporting the podcast.
02:02 Check them out at talkpython.fm/assemblyai. Now, on to that conversation.
02:07 Bentz, John, welcome to Talk Python to Me.
02:10 Thank you. Thanks for having us.
02:12 Thank you.
02:12 Yeah, it's great to have you both here. It's going to be exciting, unnerving, I might say,
02:17 a little bit to have this conversation. But I think it's certainly high time.
02:21 You'll never pip install the same way.
02:23 Exactly, exactly. You just kind of hold your breath as you do it each time. And you know,
02:29 I'm also, this is not a challenge that just the Python world faces.
02:34 This is anyone that has a package manager.
02:37 Yep.
02:37 And the more open, the bigger the difficulties, I suppose, right? So NPM,
02:41 gems, like you name it, right?
02:44 Yep. If you're a software developer these days, it probably affects you.
02:47 Absolutely. So before we get into the typo squatting, the supply chain issues and all the
02:54 stuff in history and current problems and, you know, on the positive side, there are solutions
03:00 and tools and things that we can talk about as well. Before we get into all that, let's start with
03:04 your story, maybe the abbreviated version since there's a couple of you. Bentz, how did you get
03:08 into programming and Python?
03:09 Programming I got into just as a kid. Got a computer when I was, I don't know, nine or
03:14 10 and tinkered around with it, enjoyed it, ended up actually taking programming classes in
03:18 high school, stuck with it in college, majoring in computer engineering, and was a software developer
03:22 and systems engineer, that sort of stuff, in the defense industry for 20 years.
03:27 Yeah. What languages did you start in or mainly use?
03:30 Originally started in C, actually originally in Pascal, then started in C, C++, and transitioned
03:35 over to doing more active Python development. I just needed a web scraper, needed to collect
03:40 some data and analyze it, and Python was the right tool for the job.
03:43 You didn't want to do that in C++?
03:44 I did not, no. And now, you know, Python is my preferred language for tinkering or back-end
03:51 web development. Pretty much as much as I can use it for, I use Python.
03:53 Yeah, fantastic. John, how about you?
03:55 I don't have quite the classic story. I learned programming through statistics classes in
04:01 undergrad, a specialized language called Stata that a lot of economists use. Good for legal
04:05 trials, well-tested, but I didn't learn Python until grad school, when I took more data science
04:11 classes and learned the typical NumPy, pandas, scikit-learn sort of stuff.
04:17 Right. They're like, let us introduce you to probably, it was called IPython at the time.
04:20 Exactly. And now, of course, Jupyter Notebooks, that sort of thing.
04:24 Yeah. Fantastic. It's really interesting just to see all the broad and diverse ways that Python
04:30 is growing and people are coming into it, you know? It's not that, well, I learned it for
04:34 programming, you know, building an operating system and on I went. There's a lot of languages that are
04:38 fairly, you know, or JavaScript. I built it to work on a website, right?
04:41 Yeah.
04:42 It draws people in from all sorts of things, which is awesome.
04:44 It's a meeting ground.
04:45 Yeah. Yeah. And I think that's one of the strengths, actually, kind of a sidebar is that
04:49 we have all these people with different backgrounds and different motivations and interests and things
04:53 they're trying to build rather than being more like, well, here's how I build my web app. How do
04:57 you build your web app? Yeah. Very cool. And how about now? Bentz, what are you up to day-to-day?
05:03 So day-to-day, you know, kind of put down the keyboard, at least from the programming perspective.
05:07 And I work as a cybersecurity subject matter expert for In-Q-Tel, which, so I guess my job
05:13 there is to search for and then work with companies we find in the cybersecurity industry that have a
05:18 high impact on national security, as well as providing kind of advisory services to our customers in the
05:23 U.S. government.
05:24 Okay, cool. So what's In-Q-Tel? Sorry, In-Q-Tel, I guess it is?
05:27 Yes.
05:28 Yeah. What's the company story there? Because you both are from the same company.
05:31 Yeah. So it's a nonprofit, a 501(c)(3), stood up a little over 20 years ago by the CIA to basically help,
05:37 you know, originally the CIA, but now it's most of the intelligence community and
05:41 elements of the DOD basically acquire and adopt and use cutting edge technology. They realized a
05:47 little while ago, you know, around that time that a lot of innovation was moving into Silicon Valley
05:51 and into other places in industry and startups. And the traditional acquisition model that federal
05:55 government uses doesn't play well with those people. They don't understand it. So we kind of
05:59 helped as a bridge working with startups, identifying them, and then helping them interact with the
06:04 government and conversely helping the government, you know, adopt said technologies and support their mission.
06:09 So maybe, let me see if I can run a scenario by you. Maybe there's some Silicon Valley company
06:13 that's created like a cool ML thing that identifies deforestation or something like that. And the
06:21 government decides, oh, this might be really helpful for us for, I have no reason why. I have no idea
06:26 why, but let me imagine there's a reason, right? You might help that company like work with the
06:31 request for proposals and the whole crazy government side of things and get them more in line with what's
06:36 needed there. Is that the story? Yeah, that's to an extent. Yeah. I mean, we actually invest in them,
06:39 take equity, and then do help them learn how to interact with the government and also help them
06:44 shape their product and meet our customer needs.
06:46 Yeah. Okay, cool. Interesting. I had no idea such a company exists.
06:50 John, how about yourself?
06:51 I'm also at IQT. I work in what's called IQT Labs. It's an open source applied research and development
06:58 lab where we do hands-on research, mostly in the open source, largely on GitHub.
07:02 Cool. Sounds very, very fun. Now, let's talk about the supply chain issue, I guess, at a real broad level,
07:10 right? And I don't know how you all feel. I suspect that you have a little more hesitancy
07:16 or whatever as you interact with the computers and software and the internet and so on. You know,
07:21 when you, oh, there's a cool new app, maybe I'll try that. Like, you might think a little more
07:25 carefully about this than the average, you know, say, teenager or whatever.
07:29 There's a little bit of paranoia that comes with this. It's true.
07:32 Yeah, exactly. Exactly. That's what I'm getting at. And I feel like one of the more insidious
07:37 aspects of this has been the supply chain side of things, right? Because it's one thing to say,
07:42 that app looks shady. That site looks shady. Let me just not go there. Let me not click that link.
07:47 Let me not install that. But if I were to install, you know, Office Suite X and I completely trust the
07:54 company that makes that, but there's some library that they got from a third party and that third party
08:01 had been hacked and they somehow Trojan'd that third party thing and no one's found out yet.
08:07 I don't know. That's super scary. And that's kind of along the lines of some of the things that we're
08:11 touching on. And so I think the most broad one of those in the recent times has got to be SolarWinds,
08:16 right? That's certainly what's making the headlines these days. Still even, what,
08:20 five, four or five months later. It's, yeah, still a topic of discussion around this theme.
08:26 And yeah, I mean, that was a pretty challenging attack to pull off. I mean,
08:30 it took nation state actors months, maybe years to plan, you know, laying the groundwork,
08:35 getting things in place, you know, basically infiltrating SolarWinds development infrastructure.
08:40 Pretty impressive, honestly, that they were able to do it. And obviously the impact was enormous.
08:44 It was wildly successful.
08:46 I think one thing that Bentz and I have been interested in, though, while this sort of attack
08:51 is very serious and obviously has rightly gathered a lot of attention, there are a number of other
08:56 types of attacks, often focusing on open source software that are actually more numerous.
09:01 How serious they are is actually open to debate. But we are still talking many people affected and can
09:08 still have grave consequences, especially if you're the one that's hacked.
09:12 So it's gotten less front-of-the-newspaper attention, but Bentz and I still think it's serious.
09:18 Yeah, I think it's very serious. I started with this one because I feel like everyone has heard about this.
09:22 Everyone can relate to this, right?
09:24 And here's an example of a company that supplies network gear to many of the largest companies and government
09:33 organizations around the world. And this was basically a way to get, you know, access to all of those.
09:38 They think it's Russia's Cozy Bear crew, but who knows, right? And it almost doesn't matter.
09:44 Another one that I think also is in the news really quick before we jump into the open source stuff.
09:49 This is not open source at all, but it was called XcodeGhost. Have you two heard of this?
09:54 Yeah. Yeah. So, yeah, I mean, basically what happened here was, you know, app developers,
09:59 iOS developers in China don't like to download, or can't download, the official
10:04 Apple version of Xcode. Someone, you know, put a compromised version of Xcode up on some...
10:10 So let's get it off BitTorrent or something.
10:12 Yeah. I mean, also some Chinese file sharing site that app developers over there like to use
10:16 because it's more convenient and they, they, it was compromised. There was a, basically a,
10:20 something that would bake a backdoor into, you know, the ultimate compiled app that would go into
10:25 the app store or variant of the app store.
10:27 Yeah. So every app that was built and published to the App Store with XcodeGhost, which looked
10:33 exactly like Xcode, injected a backdoor malware type of thing into it. So there was something like
10:39 2,500 applications in the iOS App Store that, yeah, affected like 128 million people. So that,
10:46 that's bad kinds of things, right?
10:48 Very bad.
10:49 Not ideal.
10:49 I mean, I guess attacking a compiler, I mean, developers trust their compiler, I would say,
10:54 I mean, not being able to rely on that or feel like you can, and it's very hard to
10:57 vet your compiler, especially for a closed-source product like Xcode.
11:02 It's very hard to see is my compiler compromised or not.
11:07 Yeah. Yeah. And I think this actually is closer to the open source side of things,
11:12 right? Because if you can start to infect the tools of the developers building the things,
11:16 that's a problem. Yeah. So let's talk about the open source side, John, you pointed out,
11:21 there's many known attacks over there.
11:23 That's right.
11:23 Set the stage. What's going on?
11:25 There's actually a range of attacks, but I'll mention a couple here and we'll get into typosquatting.
11:29 So there is certainly a typosquatting attack, which we'll discuss extensively today, which
11:33 just like domain names, you might've heard someone is trying to go to a website and mistypes it
11:41 a little, or somehow gets confused about how it's spelled, maybe switching the order of words,
11:45 and then ends up someplace that's malicious, either on the web, or if you're downloading a package,
11:50 you download a package you think you want, but it's not actually.
11:54 And sometimes not always, sometimes that contains malware and does things to your computer that you
12:00 don't want.
12:01 That's bad, right?
12:01 Bad. Especially if there's arbitrary code execution, meaning they can do what they want because perhaps
12:08 you've installed it as root.
12:09 Right. And well, I think a lot of people who are getting into Python don't realize that when you
12:14 pip install something, unless it's being installed as a wheel, as a binary wheel, it's running a setup.py
12:21 as your account. So whatever your current account is able to do, like you said, if you run it as
12:27 sudo, it can do more, but even if it can just completely wreck your account and get your
12:31 information, for many people that's plenty bad on your computer. You don't want that.
12:35 Yeah. And it could be your computer. It could be your, your corporation's computer where you work
12:39 or your company's computer. And this setup.py, you're exactly right. It is a key attack vector.
12:44 For many people, it's simply a helpful way to install software. But unfortunately,
12:48 some people abuse that specific resource.
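To make the setup.py point concrete, here is a minimal sketch of inspecting a setup.py without running it. It parses the source with Python's standard ast module and reports calls that appear on a small watch list. The watch list and the function names (SUSPICIOUS_CALLS, dotted_name, suspicious_calls) are assumptions made up for this illustration, not part of pip or of any tool mentioned in the episode, and a determined attacker could easily evade a check this simple.

```python
import ast

# Hypothetical watch list for illustration only; a real scanner needs far more rules.
SUSPICIOUS_CALLS = {
    "exec", "eval", "os.system",
    "subprocess.run", "subprocess.Popen", "urllib.request.urlopen",
}


def dotted_name(node):
    """Turn an ast.Name / ast.Attribute chain into 'a.b.c', or return None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def suspicious_calls(setup_source: str):
    """Parse (never execute) setup.py source; list watch-listed calls as (line, name)."""
    hits = []
    for node in ast.walk(ast.parse(setup_source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in SUSPICIOUS_CALLS:
                hits.append((node.lineno, name))
    return sorted(hits)
```

A hit is only a prompt to read the file more closely, since legitimate build scripts sometimes shell out too. Separately, pip's --only-binary :all: option tells pip to refuse source distributions, which avoids running a setup.py at install time, although a wheel's code still runs once you import it.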
12:51 Yeah. I think it's been critical in the growth of how software is built. I remember,
12:55 you know, Bentz, you were talking about doing C++ programming. I remember back in 97, 98,
13:02 99 doing C++ programming then. And it felt like whatever you wanted to do, almost everything you
13:07 had to build from scratch. You want a library that does this kind of UI widgets? Well, how do I build
13:12 that? You want a library that has this kind of data structure? Where do I either find or build that?
13:17 Right. And now it's just pip install this thing, pip install that thing. And the, the building blocks
13:22 that we have to compose are so much more effective, right? I can take a couple of libraries here and
13:28 click them together. And all of a sudden I've got something absolutely incredible, like pulling data
13:33 from different sources, creating amazing graphs. I mean, that is the power of modern software
13:38 development, right? And yet, you know, I guess Corey Adkins out in the live stream has a nice
13:43 sort of comment on this. Like he said, I didn't realize how naive I was thinking packages were
13:48 vetted. You're not alone, Corey. And so you're not alone. Join the club.
13:52 This portion of Talk Python to Me is brought to you by Square. Payment acceptance can be one of the
13:59 most painful parts of building a web app for a business. When implementing checkout, you want
14:04 it to be simple to build, secure, and slick to use. Square's new web payment SDK raises the bar
14:10 in the payment acceptance developer experience and provides a best in class interface for merchants
14:16 and buyers. With it, you can build a customized branded payment experience and never miss a sale.
14:22 Deliver a highly responsive payments flow across web and mobile that integrates with credit cards and
14:28 debit cards, digital wallets like Apple Pay and Google, ACH bank payments, and even gift cards.
14:33 For more complex transactions, follow-up actions by the customer can include completing a payment
14:39 authentication step, filling in a credit line application form, or doing background risk checks
14:44 on the buyer's device. And developers don't even need to know if the payment method requires
14:49 validation. Square hides the complexity from the seller and guides the buyer through the necessary
14:54 steps. Getting started with a new web payment SDK is easy. Simply include the web payment SDK
14:59 JavaScript, flag an element on the page where you want the payment form to appear, and then attach
15:04 hooks for your custom behavior. Learn more about integrating with Square's web payments SDK at
15:09 talkpython.fm/square, or just click the link in your podcast player's show notes.
15:14 That's talkpython.fm/square. These incredible building blocks, these Lego pieces, there's a lot of faith
15:22 out there that these are good building blocks. Not good in the sense they don't have bugs, but good in that
15:27 they have a good intent.
15:28 Well, I think that's one thing that's the key is that, and one of the things that's a challenge here is you have to
15:32 kind of figure out the intent of the people you're trusting, and you are trusting them ultimately, and you have to
15:38 hope they do not have malicious intent. Because inferring that is very challenging.
15:41 It's a double-edged sword. I mean, I agree. It is a powerful change that you can download a couple
15:47 libraries and have an amazing app, potentially in a few minutes, maybe an hour or two. And this is the
15:53 dream of code reuse, come alive, finally. And it just so happens that there are sometimes downsides.
16:00 They can be mitigated, but unfortunately to the unaware user, which it's all too easy to be unaware,
16:06 it's difficult, actually. There are serious, there can be risks.
16:10 Yeah, there definitely can. Kim van Wyk out in the live stream has an example. A benign example would
16:16 be attr, A-T-T-R, versus attrs. Both are legitimate packages, but completely different.
16:22 Another example would be if I want to install requests, but I actually just type request.
16:27 I mean, even auditorily, they sound like requests.
16:31 It's easy to do.
16:32 It even sounds like very similar with the S versus no S there. And if somebody says,
16:38 go install requests, you're like, oh, request, pip install request. God, I did it. Like, wait,
16:42 no, no, no, no, don't do that one.
16:43 Yeah. And it actually happened. You can find that that attack truly happened, affected,
16:47 at least according to the article published about it, 20,000 users. So I don't know how many of them
16:52 were actually affected. I haven't, we don't, this is unfortunately part of the problem. It's hard to
16:57 track this data, but the example you brought up, I know you brought it up on purpose. It's serious.
17:04 Yeah. And requests with the S is installed millions of times a week or a month. Many,
17:11 many, many, many, many times, right?
17:12 We'll talk about this later, but we found one called pandar, like pandas, but with an R.
17:17 And, you know, it's not hard to imagine just by either confusion or a mistake typing this.
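The lookalike names discussed here (request versus requests, pandar versus pandas, a swapped S and T) can be caught with a simple edit-distance check. The following is a minimal sketch, assuming a tiny hand-picked POPULAR set in place of real download statistics. The function names are hypothetical and this is not the tooling the guests built. It uses the optimal string alignment distance, so an adjacent-character swap counts as a single edit.

```python
# Tiny hand-picked sample for illustration; not real PyPI download data.
POPULAR = {"requests", "pandas", "numpy", "attrs"}


def edit_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insert, delete, substitute,
    or swap two adjacent characters, each at cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


def possible_typosquat_targets(name: str, popular=POPULAR, max_distance: int = 1):
    """Return popular names within max_distance of name, excluding exact matches."""
    name = name.lower()
    return sorted(
        p for p in popular
        if p != name and edit_distance(name, p) <= max_distance
    )
```

Note that a check like this also flags attr as close to attrs even though both are legitimate, so a match is a reason to look twice rather than proof of malice.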
17:24 Yeah, absolutely. So another area I think that is a little bit interesting before we dive completely
17:30 into the package management type of squatting and related type of issues has to do with a trusted
17:37 open source thing becoming untrusted. And what I mean by that is there were some examples of things
17:44 like Google Chrome extensions being put out there as proper extensions, and then someone taking over
17:49 that project and then putting something maybe more adware in it, or something somewhat nefarious,
17:55 if not actually malicious, or, you know, somebody who is running, well, Requests is not a great example
18:01 because it's under the PSF organization, but many of the projects are under an individual, right? On
18:08 their GitHub project. And so if somebody was able to break into that person's GitHub repo, and then they
18:13 somehow sneak something into the code, well, does it look wrong? No, the, the person who made that change
18:19 is the trusted benevolent person who runs this project, right? They are, if, you know, Guido
18:25 van Rossum comes in and makes a change, well, who's going to look at that and go, oh, this is, this guy's
18:30 sketchy. We better really, like, it's probably going to be fine, right? So if someone takes over an account,
18:35 like, not only do they have access to the code and then how it gets pushed out to, you know, potentially
18:39 gets into the stream that goes to PyPI. It's also done by the person who looks like they should be most
18:45 trusted, right? So things like two-factor auth and just securing your GitHub and things along those
18:51 lines seems extremely important as well. Absolutely. I mean, what you're describing with account takeovers
18:56 has happened numerous times. And there's variants on it too, where there's some single developer who's
19:01 overworked, tired, doesn't use the project they created anymore. They just hand it over to someone
19:05 who ends up, you know, putting a backdoor in it or some sort of malicious payload. I mean, that's
19:10 happened. And then also people take advantage of the fact that not only do you have to have your GitHub
19:13 profile secure, but you also have to secure your PyPI or RubyGems account or, you know, wherever you actually
19:18 publish the packages people run. So there's kind of two areas for potential attack. And also people
19:23 take advantage of the, you know, most people, at least me anyway, when I would vet software, I would go
19:28 look at GitHub and then I would download, I wouldn't download it from GitHub. I would download it using
19:32 pip or whatever. And that kind of dissonance, or whatever you want to call it,
19:37 there's another opportunity for, for confusion and malfeasance.
19:43 Yeah. And so these things are hard to detect, but I guess the area that you all have done a lot of
19:49 research in, you built some tools around and probably the biggest area is around the package
19:55 management side of things, right? That's right.
19:57 So we've talked about typosquatting and some of the challenges where people might mistype things.
20:03 And you talked about some examples where you found packages that look like they were intended to be
20:09 installed by accident, you know, to catch those. If there's 7 million people type, you know, 7 million
20:15 times pip install requests is typed. Chances that a couple of those are misspelled or enough of those
20:20 are misspelled is pretty high, but there were actually quite a few examples. Like for example,
20:25 The Register had an article. When was this? This is recent, March 2021. The title is "Python
20:33 Package Index nukes 3,653 malicious libraries uploaded soon after security shortcomings highlighted."
20:41 That's right. This is, there's really a longer historical narrative too, to include this.
20:45 I'll call this political activism, anti-typosquatting activism, where this,
20:51 you could call it an attack, is really about drawing attention to this risk.
20:55 Yeah. And I feel like a lot of these were people like, look, I'm proving to you this could actually
20:59 happen. That's right. I'm proving by creating this thing that uploads as requests with the S and T
21:06 That's right.
21:06 swapped. But were there actually viruses put up there? Like what has the actual harm been?
21:12 Yeah. So not all of these are. This one and a number of others, we can discuss those if we have time,
21:19 were largely benevolent, but demonstrated the risk. But yes, there have been, at least by our
21:25 calculations, 40 known malicious typosquatters on the Python Package Index, affecting thousands of
21:31 users. We actually published a blog post on this, something like "Python typosquatting is about more
21:37 than typos." So yes, this has happened. I don't know the exact persons that it has affected. We just
21:43 don't have that data. Sorry if it affected you. And we published this and got some debate on Hacker News.
21:49 And this is the point where Bentz and I said, oh, there's really something here. There's a broad
21:53 audience that hasn't had a voice that cares about this.
21:56 Yeah. I mean, it could have been nothing, right? If I'm a student at a university and I install it on
22:02 a lab computer.
22:03 No big deal.
22:03 No big deal. Like who trusts those lab computers, right?
22:06 You shouldn't.
22:07 I mean, not just because like somebody could have installed something bad on it, but there's,
22:13 there are college students.
22:15 Oh yeah.
22:15 Who could be installing all sorts of just, you know, pranks and other kinds of stuff. So you
22:20 should just treat those things with.
22:22 Contaminated.
22:23 Yes, they're fully connected. But on the other hand, if this is a data scientist working at like
22:29 a major corporation or an agency and that happened to them, it could be the thing that opens the door
22:35 to, you know, access to the entire network and all sorts of lateral movement, right?
22:40 That's right. There's even one of the earliest pieces of anti-typosquatting activism, which comes from
22:45 Nikolai Tschacher, who was writing his undergraduate thesis at the time in Europe. And he showed that
22:53 over a few weeks, he got over 17,000 downloads of a series of typosquatted packages, including from .mil,
23:00 the addresses of the United States military. So it is certainly possible that people in a more
23:06 secure organization that really values security could accidentally be the victim of
23:11 typosquatting.
23:11 Yeah, absolutely. And the fact that it came out of a .mil domain shows that, yeah, that bad example
23:17 could also happen. And also his thesis got covered on Ars Technica.
23:21 That's right. Coolest undergrad thesis ever.
23:23 Exactly. That's way better than anything I did in college.
23:27 Oh yeah.
23:28 Yeah. Fantastic. And then there was this project called Pytosquatting.
23:33 Yeah.
23:34 Pytosquatting.
23:34 Yeah. It's a play on...
23:36 Which actually has been... Yeah, like a play on typosquatting.
23:39 It's a play on typosquatting. It's a clever one. And Benjamin Balder Bach and Hanno Böck,
23:43 who are open source software activists, developers, also a journalist, they've really had a multi-year
23:51 effort pointing out the dangers here. Not simply criticizing, but trying to help the Python Software Foundation
23:56 and the Warehouse, or PyPI, crew raise money and build a consensus around trying to make
24:04 this infrastructure safer.
24:05 Yeah. Yeah. So they had this project called Pytosquatting, but that actually got closed down.
24:11 That's right.
24:11 Yeah. Because they said that the PS... What do they call it?
24:16 The PSRT, Python Security Response Team.
24:20 That's it. PSRT. And I'm like, wait, there's a Python Security Response Team?
24:26 That's cool.
24:26 And they respond to emails too. They're good.
24:28 Yeah. Okay. So this is an organization, a group of people under the PSF banner that basically
24:34 triage these types of concerns, right?
24:37 That's right. That's right.
24:38 Okay. Yeah. I'll link to their page on python.org and they have their email there. They also have
24:46 rules for different types of disclosure, like whether you should email them, do other things.
24:51 That's right. And if you find a malicious package or even a package that you think is very suspicious,
24:56 this is who to contact. And they're diligent and timely.
25:01 So what do you two think about how this should be disclosed? People out there listening, they find
25:06 something. Should they go to Hacker News and say, look, this horrible thing I found on PyPI or on
25:12 NPM or whatever. Should they quietly disclose that to the security response team and then talk about it
25:21 after it's been removed and fixed? What's the flow for disclosure?
25:25 Seems like it would follow any other responsible disclosure process for traditional bugs, exploitable
25:31 bugs or vulnerabilities, where it would be nice if you find a problem, contact maybe the
25:39 Python security team, get in contact with the developer, get it fixed, probably get the package
25:43 pulled down if in fact it is malicious. And then, yeah, it'd be nice to have some sort of reporting
25:48 mechanism so that everyone who uses it could be identified. And the first part, John Speed was
25:53 saying, you know, the Python Software Foundation and the PSRT do a good job or great job of
25:57 being on top of it, being timely, being responsible. It's much harder to notify, you know, there's no
26:03 authentication when you download one of these packages. So it's very hard to know who's been
26:06 affected. So maybe just promoting that more would be helpful. But then people have to know where to
26:10 look and that they need to look at all. It becomes challenging quickly.
26:13 Well, it's like the XcodeGhost thing, you know, there were 2,500 apps that were backdoored.
26:20 And I think only the top 25 were even disclosed. And it's like, if there was a list of 2,500 apps,
26:26 are you going to go cross compare? No, you know, no normal person is going to cross compare that
26:31 announcement with their phone.
26:33 Right.
26:33 Right.
26:33 Right.
26:34 And it's just such a challenge. And I feel like, you know, here we had the same thing,
26:37 right? We had 3,653 packages removed. Well, are you going to go check if you had those? It's
26:43 extra hard because it's, you didn't intend to ever have them. You didn't intend to swap the S and the
26:50 T when you type requests, but you did. And you accidentally, almost unknowingly got it most
26:55 likely. Right. And so I do think it's really hard to push this out as an awareness thing and like,
27:02 hey, you should know that this happened. And so just go check, right? The checking,
27:06 I think it's really tricky.
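The "go check" step described here can be automated in a few lines. This is a minimal sketch, assuming you already have a list of removed or known-bad package names from some advisory. The names used in the example further down are invented for illustration. It relies on the standard library's importlib.metadata (Python 3.8 and later) to list installed distributions and normalizes names in the style of PEP 503 so that case and separators do not hide a match.

```python
import re
from importlib import metadata


def normalize(name: str) -> str:
    """Normalize a project name PEP 503 style: lowercase, runs of -_. become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def installed_names():
    """Normalized names of distributions visible to the current interpreter."""
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            names.add(normalize(name))
    return names


def flagged(installed, known_bad):
    """Return the sorted overlap between installed names and a known-bad list."""
    bad = {normalize(n) for n in known_bad}
    return sorted({normalize(n) for n in installed} & bad)
```

For example, flagged(installed_names(), ["requetss", "pandar"]) would report any of those hypothetical names found in the current environment. This only covers one interpreter's environment, so each virtual environment and machine needs its own check.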
27:07 Yeah. I mean, like many software problems, you need to solve it with more software.
27:11 You got to solve it with AI probably.
27:13 No, you definitely have to solve it with AI. I think one thing that's helpful and could be part
27:18 of that process, but isn't always, unfortunately, is also taking a collection or taking that artifact
27:26 that you found, let's say a Python package that was malicious and making sure it gets to somewhere
27:31 where it can be studied and hopefully future attacks prevented. And so for Python and a couple
27:36 of other languages, there is actually an interesting project. It has a very colorful name. It's called
27:41 Backstabber's Knife Collection. Sounds very scary and malicious, but it is actually yet another
27:46 enterprising grad student trying to collect malware samples, especially of interpreted languages.
27:51 Python is one of them. And so that there can be a community of researchers and hopefully then
27:57 companies that can fight these packages. So that would be another thing I would add to the list.
28:02 Yeah, there you go. Marc Ohm is the main person associated with that and has written some
28:07 interesting papers and great stuff. And so I urge you, if you come upon this and you say,
28:12 how do I act responsibly here? Do the things Bentz says and also maybe grab a sample and give it to
28:17 the Backstabber's Knife Collection or another similar repository.
28:20 Yeah, interesting. Okay. Have I just messed up my computer by visiting this webpage as well? I wonder.
28:25 I don't think so, but there is a...
28:27 I'm just teasing. I'm just teasing.
28:28 I mean, I can't guarantee anything though, but...
28:31 No, of course, of course, of course. Before we get too far on, Corey Adkins also asked,
28:35 when we were talking about messing up your computer, the lab computer, so on, he asked,
28:39 could installing these types of things also affect shared server space? On my IaaS land,
28:46 where I have a shared server running for however much someone else does something bad.
28:51 I mean, theoretically, sure. It depends on the permissions, I would think. Yeah. If you install
28:55 some dependency that has a keylogger baked into it or, I don't know, or, you know, some sort of file,
29:00 you know, collector, and it has permission to traverse all directories, then yeah, I mean,
29:05 I could certainly see a scenario where that was possible. I mean, I haven't, you know,
29:08 I haven't heard of that happening specifically, but there's nothing preventing it theoretically.
29:12 Yeah. If you had a series of virtual machines, you know, it's pretty tricky from one virtual machine
29:17 to escape to another, but I believe there have been examples, but those are exceedingly rare,
29:22 those sorts of vulnerabilities, right? That's right. So while we're on this topic,
29:26 I want to throw out an idea and then we'll talk about some of the tools you built, but I feel like
29:29 we're right in the middle of this notification thing. Like we've got all these packages, they've
29:34 been identified, they have been downloaded. We can see that we probably even have IP addresses,
29:39 which you can reverse look up to DNS names as probably how those attributions were given,
29:44 but it happens so often in so many different places, right? Like if I've got a continuous
29:51 integration story that builds a Docker container that pushes to Docker Hub and then my production
29:57 grabs that from that container, the place where the problem happened is not the place where the
30:03 problem is, right? It's probably GitHub or some other CI tool. We have a really nice thing for this
30:09 in the account space. Have I Been Pwned by Troy Hunt, which is a really nice project. I definitely
30:15 recommend people go there and enter their email address.
30:19 And prepare to be horrified.
30:20 Yeah. And prepare to be horrified. There's 11.2, 11.3 billion accounts that have been breached,
30:27 which is odd because it's more than all the humans, but we have more than one account. So there it is.
30:31 But yeah, so you put your email in there and then in the future, well, historically as well,
30:37 but then in the future, you say, if something has happened and your email appears in some kind of
30:42 password dump, password breach or account informational breach, you'll get an email saying,
30:47 Hey, we found something that should be concerning to you. Check it out. I would love to see something
30:51 like this for pip, right? Something that says, I pip installed this thing and it just has a record of,
30:57 here's my account. These are the things I've pip installed. If there turns out to be a problem
31:02 with one of those, notify me that that had happened.
31:05 That does sound really useful. I don't think it exists.
31:07 I don't think it exists either. And it shouldn't just be a pip thing. It should be an NPM thing.
31:11 It should be a gem thing. It should be a crate thing. It should be something that like a,
31:15 just a little bit of a wrapper that says, I would like to opt in to saying, here's my UUID.
31:22 Here's my email address. And here's the list of things that I've installed. If it turns out that
31:27 one of them is horrible, just let me know.
31:30 It's so sensible. It makes me laugh.
31:32 Yeah.
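The notification idea discussed above could start with a very small local check. The following is only a sketch under stated assumptions: no such service exists according to the speakers, the list of removed package names is hypothetical, and the helper names `normalize`, `flagged_installs`, and `installed_names` are made up for illustration. It compares what is installed in the current environment against a published list of removed packages.

```python
import re
from importlib import metadata


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def flagged_installs(installed, flagged):
    """Return the installed names that also appear in the flagged list."""
    flagged_set = {normalize(n) for n in flagged}
    return sorted({normalize(n) for n in installed} & flagged_set)


def installed_names():
    """Names of every distribution visible to the current interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            names.append(name)
    return names


if __name__ == "__main__":
    # Hypothetical advisory list; a real service would publish and update it.
    removed = ["pandar", "reqeusts"]
    print(flagged_installs(installed_names(), removed))
```

A real opt-in service would also need to cover CI machines and container builds, since, as noted in the conversation, the place where the install happened is often not the place where the problem shows up.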
31:33 Well, there's an idea out there as well, but this is way far down the line, right? This is,
31:38 oh, we know this has happened. We know who's done it. We've, we know who's been affected and so on,
31:43 but starting a little bit further back, you all have built some tools to go and start at the
31:50 beginning and say, well, let's look and see what might be out there. That is bad, right? This is
31:54 the tool you used to find pandar instead of pandas.
31:58 That's right. I mean, I think the, the first idea you had is the crucial one, which is that
32:03 you need to know that there's been a compromise in order to report it. And right now it's surprisingly
32:08 hard to know that. So we're not the only one to have devised a tool or approach to finding
32:14 malicious packages on the Python package index, but we took a particularly simple one and we said,
32:19 can we use simply the metadata, especially the name, but some other information too,
32:23 of packages. And then look at just the most downloaded packages and check who has names
32:29 that are very similar to those packages. This is where AI comes in.
32:32 Need crazy AI at this point. You do get a lot of false positives. People have similar names just
32:37 because the packages are related. It's fine. There's no problem inherently with having a similar name,
32:42 but we cracked open those packages too. This was some very boring Saturday mornings of mine
32:49 and simply scanned through the code looking for anything that's suspicious. And lo and behold,
32:54 we found one called Pandar that was actually doing key logging. It was a proof of concept. It's unlikely
32:59 that it actually would have worked, but we reported it to the Python security response team, security at
33:05 python.org. They said, yep, not good. Yanked it. And it was just an example of it's not that hard
33:12 to find them. And we were showing yet again with a pretty simple demonstration that it's not that hard.
33:18 Interesting. That's really cool. So basically the tool is about finding given popular packages,
33:24 finding ones that are oddly similar. And then there's like a, let me go and see what this one's about.
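The name-similarity step described here can be sketched in a few lines. This is not the speakers' actual tool; it is a minimal illustration that assumes plain Levenshtein edit distance as the similarity measure, with hypothetical function names `edit_distance` and `lookalikes`. Every hit is only a candidate for manual review, since many similar names are legitimate.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lookalikes(popular, candidates, max_distance=1):
    """Pair each candidate with any popular name it nearly matches."""
    hits = []
    for cand in candidates:
        for target in popular:
            if cand != target and edit_distance(cand, target) <= max_distance:
                hits.append((cand, target))
    return hits


if __name__ == "__main__":
    top = ["pandas", "requests"]
    print(lookalikes(top, ["pandar", "numpy", "pandas"]))
```

Note that a swapped pair of letters counts as distance 2 under this measure, so catching transposition typos would need either `max_distance=2` or a distance that treats a swap as a single edit.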
33:29 That's right. And so there's a couple additional checks to help anyone using it. And you can find
33:33 it, it's an open source command line tool written in Python. It also checks things like, for instance,
33:39 is the description of the package on PyPI, is it very similar? So that what you are witnessing is
33:46 someone who's trying to not only typosquat the name, but in some sense, like squat the broader
33:52 metadata or almost like the copyright of the package.
33:55 Right. Because you want it to look as similar as possible.
33:57 Exactly. Like camouflage.
33:58 Something that comes to mind. Yeah. Are you guys familiar with Snyk?
34:01 Yes.
34:02 Yes.
34:02 Is it package? There's a project. And geez, I'm forgetting.
34:06 Are you thinking of the advisor project?
34:08 Yes. Snyk package advisor.
34:10 Yeah. It's neat.
34:11 That's it. It's super neat. Yeah. That's what I was looking for. No, that's not how I want to spell it.
34:15 Yeah. And so that thing is pretty cool. The reason I bring this thing up is you can come over here and I
34:21 can type in a project like requests or whatever, and it'll tell us eventually,
34:26 it'll tell us the package health score. Yeah.
34:29 And it'll tell us things like there's this many PRs that have been open and closed. There's this many
34:33 contributors. There's this many people participating, what the maintenance looks like, and so on. One thing
34:38 that I think would be cool would be to take this number plus a misspelling and say, if that number
34:45 is really, really low for a package that should be really, really high, that's a challenge, right?
34:50 If you look at the GitHub repo that is delivering this thing and it doesn't look right, if it's not
34:56 associated with something that seems kind of hard to replicate, like a GitHub repo with many people
35:00 participating over a long period of time, that seems like that could be a good flag as well.
35:04 Yeah, certainly. It certainly seems like there's an abundant opportunity to build something into the
35:10 actual download client, to pip or a wrapper around pip, where it checks these sorts of things and
35:17 creates speed bumps for you as you are trying to download something or use a package, so that it
35:23 says, hey, this looks suspicious. Have you thoroughly checked this? And I think your idea could contribute to
35:29 exactly such a tool or tools. Yeah. Yeah. Very neat. You've been working on this and Martin
35:36 Carnogursky created this thing called Aura and also reached out and said, hey, I'm also working on this.
35:43 And so, yeah, tell us about this thing called Aura. Yeah. So we got an email last fall after publishing
35:49 this blog post and he said, hey, I've been working on a similar tool. Not only does it check this metadata,
35:55 but we even do static analysis of the entire Python package index. And we said, Martin, that's awesome.
36:01 Let's work together. And so over the past six months, roughly now, in an open source collaboration
36:08 between a number of us at IQT Labs and Martin Carnogursky, we have further refined Aura, which
36:14 truly is designed to do a static analysis of the entire Python package index. It's an open source tool. You can find it.
36:20 He tries as best as he can to release his data regularly. We've also built a tool
36:26 called Aura Borealis. The key thing is that Aura produces 50 gigs of output when it's done scanning
36:33 the entire Python package index. No human can wade through that. So.
36:38 And I suspect also the PyPA, the Python Packaging Authority folks, probably don't want everyone
36:44 downloading that much data all the time. No, it's exhausting and creates so many database issues and
36:50 other things. So we've been working on a tool called Aura Borealis that you've pulled up that is a front
36:55 end that makes it easier to use the data set that Martin creates with this tool, Aura. This wouldn't
37:02 necessarily be part of PyPI, though, of course, it could be. But we imagine this as a tool for
37:08 organizations or persons that need to have global knowledge about the
37:15 entire Python package index and to rank and assess potential threats and go look for those,
37:20 and then take appropriate action or even individual developers that are really curious about packages.
37:25 This makes it easy. Aura Borealis isn't yet live, but we hope to make it live this summer.
37:30 Aura is an in-production tool. It works. So go check it out.
37:34 Talk Python to Me is partially supported by our training courses.
37:39 When you need to learn something new, whether it's foundational Python, advanced topics like async or web apps
37:45 and web APIs, be sure to check out our over 200 hours of courses at Talk Python.
37:51 And if your company is considering how they'll get up to speed on Python, please recommend they give our content a look. Thanks.
37:57 This looks like it's really handy. You know, so the idea is basically it's going to run forever and that's going to generate tremendous amounts of data.
38:05 Maybe just put a web front end on top of that static data instead of everyone generating it over and over.
38:11 Exactly. Instead of having it generated over and over, now having 50 gigs and having to write your own custom,
38:16 probably Python script that's, you know, you'll have to optimize and blah, blah, blah.
38:20 Yeah. So came out in the live stream just says I accidentally typed sync instead of Snyk,
38:26 which also is hard to spell anyway, because it's like a non-common spelling.
38:30 So which is an excellent way to demonstrate making a typo of getting the wrong package.
38:34 I have no idea what that's going to return. I'm not going to pull it up.
38:37 Podcast imitates life, imitates art, imitates compromise.
38:43 Exactly. All right. Well, this is really neat. How would I use the Aura data and the Aura Borealis project?
38:50 I guess also we should talk about this from different angles, right?
38:53 Maybe I'm a CISO at a company and I'm concerned that all my people are psyched about data science and Python
38:59 or NPM and web front ends and they just make me nervous all day and I want to get on top of it.
39:06 So I want, as somebody who is concerned about it, I would like to know what's happening in my software
39:11 supply chain, or maybe I maintain pandas and I'm really upset that pandar exists and I want to now be able to defend my package.
39:20 Like it seems like there's different use cases and people out there.
39:24 That's right. I think if you're a company and you have a group of software developers and
39:29 you have the, let's say a security team that helps vet packages.
39:34 So perhaps you put those packages in an internal repository so that the developers know that they're
39:39 cleared to use. Aura Borealis will help you do that.
39:42 We're glad to set up pilots and discuss.
39:44 You can email me, jmeyers at iqt.org.
39:47 But there's also other angles too.
39:50 There's just, you're a developer and you want to make an informed choice.
39:53 The static analysis tool and its output can help you with that, or Aura Borealis.
39:57 And I think there is also, you're right, there's a maintainer angle and also a PyPI administrator
40:02 angle where you want to either protect a set of namespaces close to your package or you care
40:09 about the health of the entire ecosystem.
40:11 And those are all possible user types.
40:14 Yeah.
40:15 And we could probably use your pypi-scan to go and say, look, can I say, look for things
40:22 similar to my package name?
40:23 Yeah, that's right.
40:24 And we built that into Aura Borealis too now.
40:27 So in some ways, pypi-scan was a demo and still useful as a command line tool, but Aura Borealis
40:33 and Aura have that now built in.
40:36 Are you all going to put an API on top of this?
40:38 Good question.
40:38 That would be cool.
40:40 The thing that's tricky, like everything in life, is it costs money and, you know,
40:45 engineering resources and time.
40:47 I certainly have a vision.
40:48 And, you know, if I don't do it, someone else should do it.
40:52 Go make a lot of money.
40:53 of creating a technical infrastructure where every single package and every single new version
41:00 of every package, PyPI, npm, et cetera, gets scanned, a variety of scans, static analysis,
41:06 dynamic analysis, metadata analysis.
41:08 And that gets stored in a database where you and I can go make API calls and get the
41:14 information that we should have on these packages.
41:16 That could be, you know, there could be a free tier.
41:18 And then if you really need to make a lot of calls, a paid tier.
41:21 But someone should do it, I think.
41:23 Yeah, it would be neat to know, like you said, integrate into, say, pip even.
41:28 So if I pip install something, it could even flag it and say, hey, no, actually, we're going
41:33 to block that.
41:34 That's right.
41:35 Preemptively, because it's got some low score, unless you do like a --force.
41:39 Like, no, really, I mean, yeah, exactly.
41:41 It's something that'll sort of slow it down, as you call them speed bumps.
41:44 I hope someone does something similar to that.
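The speed bump idea can be made concrete with a small decision function. This is a sketch of the concept only: pip has no such score check, the `--force` override is the hypothetical one mentioned in the conversation, and the score table, the threshold, and the names `Decision` and `gate_install` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    allowed: bool
    reason: str


def gate_install(name, scores, threshold=50, force=False):
    """Decide whether an install should proceed, given a trust score table."""
    score = scores.get(name)
    if score is not None and score >= threshold:
        return Decision(True, f"{name}: score {score} meets threshold {threshold}")
    if force:
        # The user explicitly accepted the risk, as with a --force style flag.
        return Decision(True, f"{name}: allowed only because of force override")
    shown = "unknown" if score is None else score
    return Decision(False, f"{name}: score {shown} is below threshold {threshold}")


if __name__ == "__main__":
    # Hypothetical scores; a real wrapper would fetch them from a scanning service.
    scores = {"requests": 95, "pandar": 3}
    for pkg in ("requests", "pandar", "never-seen-before"):
        print(gate_install(pkg, scores))
```

Treating an unknown package the same as a low-scoring one is a deliberate choice in this sketch: a brand-new name is exactly the case where a pause is cheap and useful.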
41:46 We have plans, but no active development underway.
41:50 All right.
41:50 So that sets the stage that some of the tools out there, at least to identify that there
41:55 are potentially bad packages.
41:57 And it's also, I guess, you know, worth pointing out that if we go over, say, to PyPI, there's
42:02 over 300,000 packages over there.
42:05 And if there are 40 actually malicious ones, right, the chances are low.
42:10 They're not very high.
42:12 But so people shouldn't be, you know, running for the hills in complete panic or anything,
42:17 I don't think, from this.
42:18 But at the same time, we should be careful.
42:20 We should be cautious.
42:21 So, you know, what can we do?
42:23 That's the tough question.
42:24 Bentz, do you want to start?
42:25 And you want me to go?
42:26 Sure.
42:26 I mean, there's a lot of things that we can do.
42:29 I mean, John Speed's hit on a few of them about just kind of being more deliberate, you know,
42:33 checking your work before you download something.
42:36 And also, you know, when you're considering dependencies, I mean, you mentioned C++ and,
42:40 you know, the late 90s.
42:41 I vaguely remember those times.
42:43 I remember when Boost came out, it was a big deal.
42:44 Oh, yeah.
42:44 You actually had a dependency that was...
42:46 I remember reading more books.
42:47 Right.
42:48 Less internet, more books to make things...
42:50 So, yeah, we moved on from that.
42:52 But ultimately, you know, it is worth considering, do you actually need this dependency?
42:55 You know, left-pad in npm is a funny, you know, canonical example.
42:58 Broke the internet because people didn't feel like typing one line of their own code.
43:01 They wanted to import a left-padding dependency.
43:04 I do feel that's a really good example.
43:06 And certainly, left-pad came to mind, not as a malicious thing, but just as a supply chain
43:12 Jenga tower type of thing.
43:14 And somebody pulled too much on a part of the Jenga tower and it came down.
43:17 I feel that the JavaScript community has way smaller Lego pieces than the Python community.
43:25 The blocks that you click together here are larger.
43:28 So, I feel like there's just fewer in number external dependencies on average in my Python
43:35 experience than my JavaScript experience.
43:37 Yeah, I think that's accurate.
43:37 I mean, numbers vary.
43:39 I've seen NPM, people who use NPMs are JavaScript developers.
43:43 The average package in NPM has like 94% dependency.
43:47 You know, other dependencies, only 6% is your actual code you've written.
43:51 Most of the modern languages, meaning JavaScript, Python, and some others are in like the 90-ish
43:56 range.
43:57 And then you see C and C++ are much lower.
43:59 Java is somewhere in the middle.
44:00 So, yeah.
44:01 Python is, I would say, lower than JavaScript, but much higher than the kind of legacy languages
44:06 that are historically used.
44:08 So, be deliberate means things like don't just, as fast as you can, type pip install, whatever.
44:13 Type pip install and then carefully type out the package name.
44:17 Maybe give it a quick read before you hit go.
44:19 Yeah, or just copy and paste.
44:20 Don't type it all.
44:22 Yeah.
44:23 So, for example, if I'm over here on PyPI, there's a copy button I can click and it'll
44:29 do exactly that.
44:30 Right.
44:30 Right.
44:31 That's an option.
44:31 Yeah.
44:31 So, yeah.
44:32 Just being a little more thoughtful and kind of, you know, looking at the dependency chain
44:36 as well before you download something, which is much harder than it should be, to be
44:40 completely fair.
44:41 That's helpful to know that, you know, maybe the top level you are using Joski.
44:46 That's how you pronounce that.
44:47 I have no idea what Joski.
44:49 I don't know.
44:49 You should pip install that right now.
44:50 Let's see what happens.
44:51 Let's just see what happens.
44:52 Here's an example of one of those that should rank lower.
44:55 No offense if this is your project, but it literally has zero stars, zero forks.
45:00 Its features are to do.
45:01 Its requirements are to do.
45:02 Its PyPI version banner is not found.
45:06 And I mean, it is only four minutes old.
45:08 They may be working.
45:09 Yeah.
45:09 Sure.
45:10 Yeah.
45:10 If it has dependencies, this one probably doesn't.
45:12 But, you know, take a look at those two.
44:13 Just make sure there's nothing egregiously wrong at a minimum.
45:16 Yeah.
45:17 That's one of the things that makes it kind of insidious and hard to see is the thing I
45:22 directly look at may be fine.
45:24 But the person who maintained that, did they make a mistake in the things that they depend
45:28 upon?
45:29 Or maybe, you know, transitively, like follow that chain, that graph, down far enough.
45:33 Right.
45:34 There's a lot of layers that could be happening along the way.
45:37 It ends up looking like a web.
45:38 And not surprisingly, just because of that, most vulnerabilities inside of packages like
45:43 this are in the transitive dependencies, the ones below the first layer, the dependencies
45:48 of dependencies.
45:48 Interesting.
45:49 So you can pip install the thing.
45:51 What about pinning the version?
45:52 I know there were some issues about having a private PyPI server, which I think is a good
45:58 idea where you whitelist packages in.
46:00 You say, we approve these things and only these things get installed.
46:03 And if you want to use a new one, we've got to opt it in.
46:06 And then now it's part of the organization.
46:08 That seems like something you could do, right?
46:10 There's PyPI server that you could set up that is a sort of pass-through layer there.
46:15 But then there was also the vulnerability of the version mismatch.
46:19 Like if there's a higher version of that thing on the public PyPI than your local one.
46:23 So people were putting in like data layer version 70, you know, and then it's like, oh, there's
46:29 a newer version out there for me to go get.
46:31 I'll get that, even though it was internal, meant to be internal only, right?
46:35 So there's these challenges.
46:37 But what do you think about a private whitelist server?
46:39 It certainly seems valuable and seems like it's another speed bump, as John Speed was calling
46:45 them.
46:45 But yeah, I mean, then you run into scenarios like the one you described, where it's kind
46:50 of, I guess, that's undefined behavior, potentially, or at least not well-known behavior that maybe
46:55 isn't necessarily most intuitive.
46:56 So even that might not be enough.
46:58 So then, yeah, the pinning could help.
47:00 Then, of course, there's the challenge of maintaining your pin at the proper level, which adds more
47:05 effort on the developers to maintain up-to-date dependencies.
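One low-effort way to keep pinning honest is to check a requirements file for entries that are not pinned to an exact version. The sketch below is an illustration, not a full requirements parser: it assumes simple `name==version` lines, skips option lines that start with a dash, ignores trailing `--hash` options, and does not handle URL or editable requirements. The function name `unpinned_requirements` is made up for this example.

```python
import re

# name, optional [extras], "==", an exact version, optional environment marker
PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?\s*==\s*[^\s;*]+\s*(;.*)?$"
)


def unpinned_requirements(lines):
    """Return requirement lines that are not pinned with an exact '=='."""
    result = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()   # drop comments
        line = line.split(" --", 1)[0].strip()  # drop per-line options
        if not line or line.startswith("-"):
            continue  # blank lines and option lines such as "-r other.txt"
        if not PINNED.match(line):
            result.append(line)
    return result


if __name__ == "__main__":
    sample = [
        "requests==2.25.1",
        "pandas>=1.0",
        "# a comment",
        "numpy",
        "flask==2.*",
    ]
    print(unpinned_requirements(sample))
```

Pinning alone does not remove the maintenance burden raised here; it only makes version changes explicit, so each bump becomes a reviewable change.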
47:10 At least publicly.
47:11 Yeah.
47:12 Publicly, we have Dependabot on GitHub, which is way more of a pain than it should be to use.
47:17 Because if you've got 10 updates, it'll issue 10 PRs, which conflict with themselves.
47:22 Anyway, that's a long story.
47:24 But it's still at least some automation that says, hey, there's a new version of this.
47:27 Here's the change log.
47:28 And we also have the CVE security checks of Dependabot, which are really good.
47:33 Yeah.
47:33 Unfortunately, most of these typosquatting or just general supply chain attacks don't end
47:38 up in the NVD as a CVE.
47:40 Yeah.
47:40 Who's going to study this one and then not just say, take it down, right?
47:43 Like it's living under the, in the shadows, right?
47:46 Of being unnoticed.
47:48 To an extent, yeah.
47:48 NPM does a good job with their advisory service of like saying, this is a malicious package
47:53 and this is why we removed it.
47:55 But not all package managers do that.
47:57 And even so, then you have to go to all of them. Most developers
48:00 these days are developing in multiple languages.
48:03 So it's hard to keep track.
48:04 Yeah.
48:04 What about having isolated environments for trying out new packages?
48:09 So for example, one of the things I'm trying to do is if I'm checking out any new package,
48:13 I have to pip install.
48:14 And maybe that happens in a Docker container.
48:16 And then I throw away the container or possibly a VM with snapshotting on.
48:20 And then I roll back the snapshot periodically.
48:22 Yeah.
48:23 Those both sound like great ways to have good hygiene and isolate the potential blast radius
48:29 of a potentially malicious package.
48:33 Yeah.
48:33 It's one thing to say, here's a thing we want you to check out and it's on PyPI and it's
48:38 really well known, but it's, you know, you got to explore new things that aren't super well
48:42 known yet.
48:42 Right.
48:43 And so how do you install that?
48:44 Right.
48:45 So I think some kind of blast blast store, like you said, like Docker, like a VM is not
48:49 a terrible idea.
48:50 Yeah.
48:50 It's a good one.
48:51 What else?
48:51 There's the Open Source Security Foundation.
48:55 Yeah, that's right.
48:56 This is OpenSSF.
48:58 OpenSSF.
48:59 Clearly a reference to OpenSSL.
49:01 Yeah.
49:01 Another well-known software supply chain compromise that had widespread impact.
49:07 It's worth pointing out that this group, for anyone who is very enthusiastic about
49:12 open source software supply chain security in particular, has become a meeting ground where
49:17 both companies, but also persons interested in the sorts of topics we've been discussing
49:21 today and more, have set up a series of working groups.
49:24 There's six roughly and they meet every few weeks, open community, fun, interesting people,
49:30 either interested in the topic or actively working to give back and contribute.
49:35 It's run by the Linux Foundation and we would highly recommend it as a place to
49:40 find other like-minded persons.
49:41 If you care about these sorts of topics.
49:43 Yep.
49:43 Fantastic.
49:44 And then there's the further on down the road, which we've touched on a couple of times,
49:49 but maybe we can encourage some enterprising person, people group out there to go after
49:54 it like a, a hardened pip or, you know, we have things that are sort of on top of pip-tools.
50:00 We've got pipenv.
50:02 We've got pipx.
50:03 I'm a big fan of pipx, the isolation.
50:05 And then that gives us kind of need.
50:07 And just, I can see like, like a pipsec or something along those lines, or PIPs, maybe a plural
50:14 PIPs.
50:14 I don't know for pip security, but something like that that incorporates some of these ideas.
50:18 Maybe it, it checks in.
50:20 You say like, I don't want to install any package that is not in the top 1000.
50:25 Sure.
50:26 Or a popular package, except for what I whitelist in on top of that or something or check with
50:32 Aura Borealis about the score or check with the, have I been PIP?
50:37 Whatever that thing ever would become, right?
50:39 So talk about like where you might see things going.
50:42 Yeah.
50:42 Well, there's been a couple, I'll call them starter projects in the hardened pip area.
50:47 There even was one called pipsec.
50:49 You can find it on PyPI, but it's really, there's nothing there, unfortunately, at least yet.
50:55 That namespace is claimed.
50:56 The maintainers who we mentioned, Benjamin Balderbach, especially are interested in doing something,
51:01 just haven't had time, other busy priorities.
51:04 And I think there is a lot of potential to build out that idea and create something that
51:10 could be useful to the average developer.
51:12 JavaScript has a tool that has at least some moderate popularity called npq that does this.
51:18 And I think it's time for the Python community to see if there's something similar.
51:23 I would love to see something like that.
51:24 Another thing is Google.
51:26 Thank you, Google.
51:27 Has become a visionary sponsor of the PSF.
51:31 And specifically, they want their funds to go towards critical supply chain security improvements,
51:37 developing productized malware detection for PyPI, for a type of dynamic analysis infrastructure.
51:44 So this sort of hints at, maybe there's something that the PyPA and PyPI.org could do on their end without even necessarily changing pip, right?
51:55 pip's going to go talk to some API there.
51:57 And it goes, yeah, no, not this one.
51:59 That's right.
51:59 Or you're going to upload it, like with upload a new package.
52:02 It goes, no, we don't want to accept it.
52:03 And Dustin Ingram of the Python Software Foundation at PyCon just recently devoted his talk to talking about Python and the software supply chain issues that we've discussed today and writ large to include typosquatting.
52:15 And it's clear that there is energy and willingness from even core members of Warehouse and Python Software Foundation to tackle these issues.
52:24 So we're glad to see that.
52:26 It'd be great to see something like that happening.
52:28 I think layers as well, right?
52:29 That's how you talk about security often is it's not just, well, you have a strong password and you're fine.
52:34 Like, well, and maybe you have two-factor authentication.
52:36 And maybe you run as lower permissions and, and, and, and, right?
52:39 Yeah.
52:39 Layers.
52:39 So this could be one of the layers, but not necessarily all of them.
52:43 Yeah.
52:43 I should note that we even, a couple of us at IQT Labs, even put in an issue recently on Warehouse that might interest some parties here.
52:52 It's issue 9527.
52:54 You can also find it at short.iqt.org slash issue, just a redirect.
52:59 And we essentially call for something like social distancing for the top Python package indexes.
53:05 So that for very popular package names, the package names that are close by are blocked off.
53:11 So that not saying that anybody who chooses those names is malicious, but just so malicious people can't choose them.
53:17 Feel free to upvote that.
53:18 We've been discussing this with some of the members of the Warehouse team.
53:22 Yeah.
53:22 So your proposal is that Pandar should not have even been allowed, right?
53:28 That's right.
53:28 Given that the package Pandas is so popular, minor variations on its spelling should basically be blocked or maybe redirect to Pandas and say with a warning, like, you tried to install Pandar.
53:41 Did you mean to install Pandas?
53:42 That's right.
53:43 Something like that.
53:44 That's right.
53:44 So it's a way to build in guardrails so that the unwary don't fall prey to this.
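The "social distancing" proposal can be illustrated by generating every name one typo away from a protected name and refusing new registrations that land in that set. This is a sketch of the idea only, not what Warehouse does or what the issue specifies: the protected list, the notion of "one typo" (a deletion, an adjacent swap, a substitution, or an insertion), and the function names `neighbors` and `check_new_name` are assumptions.

```python
import string

ALPHABET = string.ascii_lowercase + string.digits + "-"


def neighbors(name):
    """All names one typo away: a deletion, adjacent swap, substitution, or insertion."""
    splits = [(name[:i], name[i:]) for i in range(len(name) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return (deletes | swaps | replaces | inserts) - {name}


def check_new_name(new_name, protected):
    """Return a warning if new_name sits next to a protected name, else None."""
    new_name = new_name.lower()
    for popular in protected:
        if new_name in neighbors(popular):
            return f"'{new_name}' is too close to '{popular}'. Did you mean '{popular}'?"
    return None


if __name__ == "__main__":
    top = ["pandas", "requests"]
    for candidate in ("pandar", "reqeusts", "numpy"):
        print(candidate, "->", check_new_name(candidate, top))
```

The same message could serve either side of the tradeoff mentioned in the conversation: blocking the upload on the index, or warning the person typing the install command.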
53:51 Yeah.
53:51 Personally, my first impression is that that's a good idea.
53:54 It's worth it that we don't need request and requests and requester.
53:58 And, you know, the potential harm is higher than the value of, you know, reusing very, very similar names.
54:05 Yeah, we agree.
54:06 And there's obviously tradeoffs.
54:08 Yeah.
54:08 Bentz, what do you think?
54:09 You must agree with this, I suspect.
54:11 I do agree with that.
54:12 I definitely supported this.
54:13 And I know one other thing that's under consideration that's relevant is namespacing.
54:16 So you can, you know, Kenneth Reitz is the Requests guy.
54:21 He has his namespace.
54:22 You go to his namespace, you're less likely to mistype the namespace and have someone who has claimed the same package within their own namespace.
54:31 So possible, but, you know, it's another layer, I guess, as you were describing it.
54:35 Yeah, it makes the commands.
54:36 You got to type a little bit longer.
54:37 But it makes it really clear where it's coming from.
54:40 I mean, that's what the point of namespaces and programming is.
54:42 It's really clear what library it comes from or what part of your code it comes from.
54:47 And who?
54:47 Grouped together in namespace.
54:48 As well.
54:49 I know Go has done, you know, used that to great success.
54:52 Yeah, Kim Benwick out there put a cool comment that's sort of related to that, talking about the private PyPI server that's, you know, redirecting out.
55:01 It would help if the private PyPIs had an option to prevent uploading or pulling packages with a certain prefix.
55:09 For example, if everybody named their packages ABC something at the company, you could say ABC is private, ABC star is private, and never, ever, you know, go look beyond here for that type of thing.
55:22 I think that that's pretty interesting.
55:23 Yeah, it's a good idea.
55:24 Yeah, I think it seems super simple and a good idea.
55:27 I agree.
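The prefix rule suggested in that comment can be sketched as a small routing function: names matching a private pattern are resolved only against the internal index and never fall through to the public one. This is a minimal sketch under stated assumptions. The `abc*` pattern comes from the example above, while `PRIVATE_INDEX` and the `allowed_indexes` helper are hypothetical and do not correspond to any real private index server's configuration.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical configuration, for illustration only.
PRIVATE_PATTERNS = ("abc*",)  # "ABC star is private"
PRIVATE_INDEX = "https://pypi.internal.example/simple"  # placeholder URL
PUBLIC_INDEX = "https://pypi.org/simple"


def normalize(name: str) -> str:
    """Normalize a project name the way PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()


def allowed_indexes(package: str) -> list[str]:
    """Return the indexes that may be consulted for `package`.

    Names matching a private pattern are pinned to the private index,
    so a lookalike uploaded to the public index is never considered.
    """
    name = normalize(package)
    if any(fnmatchcase(name, pattern) for pattern in PRIVATE_PATTERNS):
        return [PRIVATE_INDEX]
    return [PRIVATE_INDEX, PUBLIC_INDEX]
```

Under these assumptions, `allowed_indexes("ABC_utils")` returns only the private index, while `allowed_indexes("requests")` may still fall back to the public one.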
55:28 All right, gentlemen.
55:29 Well, very cool to talk about this stuff.
55:31 Like I said, it's going to make all of us a little bit more nervous, I suspect.
55:35 You know, for example, Corey Adkins out there said, I also just found an article on malicious Docker images.
55:39 Now I am paranoid, which.
55:41 I'm sorry.
55:42 Welcome.
55:43 Yeah, yeah.
55:45 I've been there for a while.
55:46 All right.
55:47 Before I let you two out of here, though, real quickly, let's answer.
55:49 I'll ask you the two questions at the end of the show, of course.
55:52 So if you're going to write some Python code, what editor do you use?
55:55 I use Vim if I'm in the command line.
55:57 But if I have the fortune to be outside of it, I use Sublime.
56:01 Right on.
56:01 I suspect JupyterLab is also in there.
56:03 Definitely Jupyter is in there.
56:04 Yeah.
56:05 And Bentz?
56:05 Probably PyCharm.
56:07 Yeah, I mean, I'll use Vim if I'm already in a command line.
56:11 But yeah, that's not as often these days.
56:13 So PyCharm is just my IDE of choice.
56:15 Right on.
56:17 And then notable PyPI package, something that's like, oh, people should know about.
56:21 Check out one called NetworkML.
56:22 It's a package related to machine learning and network traffic.
56:25 The lead maintainer is Charlie Lewis of IQT Labs.
56:28 You can go find it on PyPI.
56:30 Yeah.
56:30 Fantastic.
56:31 So machine learning plugins for network traffic.
56:34 Yeah.
56:34 So it identifies like anomalies and other weirdnesses like that?
56:38 Yeah.
56:38 It parses network traffic.
56:39 And one of the cool things it does is it helps identify what sort of device is being observed.
56:45 So is this thing a printer?
56:46 Or is this thing a personal computer?
56:47 Is it an Active Directory controller?
56:49 Et cetera.
56:49 Is it a canary?
56:51 Is it a canary?
56:52 Who knows?
56:54 Awesome.
56:54 All right.
56:55 Well, thank you both for shedding the light on lots of what's happening, some of the things
57:00 that are being done and what might also be done as well.
57:03 So final call to action.
57:04 People that want to get involved, maybe do more, become more aware.
57:07 What do you all say?
57:08 Yeah.
57:08 I mean, there's plenty of work to be done.
57:10 OpenSSF is a very welcoming, relatively new organization that has a nice list of stuff
57:15 to do.
57:15 Python Software Foundation also actually has an active list of items they would like
57:21 to work on, some of which are relevant to this topic.
57:23 So that'd be two great places to start.
57:25 I'll point you back towards that GitHub issue.
57:27 Feel free to chime in.
57:30 And I think there's definitely potential over the next few months.
57:32 Additionally, we're actually working on a survey at IQT Labs on secure code reuse.
57:38 So if you want to help build the research foundations for this, you can find this survey at
57:42 short.iqt.org slash survey.
57:45 And we're trying to understand the developer, data scientist, or other programming professional's
57:51 experience with package reuse.
57:52 So that's another way.
57:54 So hopefully this survey informs future tools.
57:56 Yeah.
57:57 Fantastic.
57:57 Well, thanks for the work that you all are doing.
57:59 And thanks for being on the show.
58:00 Thanks for having us.
58:01 Thanks for having us.
58:02 Bye.
58:02 This has been another episode of Talk Python to Me.
58:06 Our guests on this episode were Bentz Tozer and John Speed Meyers.
58:10 It was brought to you by Square, us over at Talk Python Training, and the transcripts are
58:14 brought to you by Assembly AI.
58:16 With Square, your web app can easily take payments, seamlessly accept debit and credit cards, as
58:22 well as digital wallet payments.
58:23 Get started building your own online payment form in three steps with Square's Python SDK
58:29 at talkpython.fm/square.
58:32 Want to level up your Python?
58:34 We have one of the largest catalogs of Python video courses over at Talk Python.
58:38 Our content ranges from true beginners to deeply advanced topics like memory and async.
58:43 And best of all, there's not a subscription in sight.
58:46 Check it out for yourself at training.talkpython.fm.
58:49 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.
58:53 We should be right at the top.
58:55 You can also find the iTunes feed at /itunes, the Google Play feed at /play,
59:00 and the direct RSS feed at /rss on talkpython.fm.
59:04 We're live streaming most of our recordings these days.
59:07 If you want to be part of the show and have your comments featured on the air,
59:11 be sure to subscribe to our YouTube channel at talkpython.fm/youtube.
59:16 This is your host, Michael Kennedy.
59:17 Thanks so much for listening.
59:18 I really appreciate it.
59:20 Now get out there and write some Python code.