
#319: Typosquatting and Supply Chain Vulnerabilities Transcript

Recorded on Wednesday, May 26, 2021.

00:00 One of the true superpowers of Python is the libraries over at the Python Package Index. They're all just a pip install away. And yet, like all code that we run on our systems, it is done with some degree of trust. How do we know that all those useful packages are trustworthy? That's the topic of this episode. Bentz Tozer and John Speed Meyers are here to share their research into typosquatting on PyPI and other sneaky deeds. And we also get a chance to discuss some potential solutions, fixes, and tools to help solve this problem. This is Talk Python to Me, Episode 319, recorded May 26, 2021.

00:50 Welcome to Talk Python to Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy, keep up with the show and listen to past episodes at "talkpython.fm", and follow the show on Twitter via "@talkpython". This episode is brought to you by "Square" and us over at Talk Python Training. Please check out what we're offering during our segments. It really helps support the show. Y'all, I have a quick announcement. We've had transcripts for all of our episodes for a long time. But recently, we put more time and effort into making them more useful for you. Now every show has a link to the transcripts right in your podcast player. And that transcript page lets you filter, search, and play back audio from exact moments within the transcript. I hope you enjoy the richer experience around using our episodes as reference materials. I'm also happy to announce a new sponsor of the show, "AssemblyAI".

00:50 "AssemblyAI" is a top-rated API for automatic speech-to-text. You can transcribe videos and audio files with human-level accuracy in just a few lines of code. To help us keep leveling up our transcript game, they're sponsoring the transcripts for episodes going forward. So thank you to "AssemblyAI" for higher quality transcripts and for supporting the podcast. Check them out at "talkpython.fm/assemblyAI". Now, on to that conversation.

02:08 Bentz and John, welcome to Talk Python to Me. Thank you. Thanks for having us. Thank you. Yeah, it's great to have you both here. It's gonna be exciting, and a little bit unnerving, I might say, to have this conversation. But I think it's certainly high time. You'll never 'pip' install the same way. Exactly, exactly. You just kind of hold your breath as you do it each time. And you know, this is not a challenge that just the Python world faces. This is anyone that has a package manager. Yeah. And the more open, the bigger the difficulties, I suppose. Right? So NPM, yes. Gems. Like, you name it, right? Yep. If you're a software developer these days, it probably affects you. Absolutely. So before we get into the typosquatting, the supply chain issues, and all the history and current problems (and, you know, on the positive side, there are solutions and tools and things that we can talk about as well), let's start with your story, maybe the abbreviated version, since there's a couple of you. Bentz, how did you get into programming and into Python? Just as a kid, I got a computer when I was, I don't know, 9 or 10, tinkered around with it, and enjoyed it. I ended up actually taking programming classes in high school, stuck with it in college, majored in computer engineering, and was a software developer, systems engineer, that sort of stuff, in the defense industry for 20 years. Yeah, what languages did you start with, mainly? I originally started in C. Actually, originally Pascal, then C and C++, and I transitioned over to doing more active Python development when I just needed a web scraper, needed to collect some data and analyze it. And Python was the right tool for the job. You didn't want to do that in C++? I did not, no.

03:47 And now Python is my preferred language for tinkering and back-end web development. Pretty much as much as I can use it for, I use Python. Yeah, fantastic. John, how about you? I don't have quite the classic story. I learned programming through statistics classes in undergrad, in a specialized language called 'STATA' that a lot of economists use. It's good for legal trials, well tested. But I didn't learn Python until grad school, where I took more data science classes and learned the typical 'NumPy', 'Pandas', and 'Scikit-Learn' sort of stuff. Right, they're like, let us introduce you to... probably it was called 'IPython'? Yes, exactly. And now, of course, Jupyter notebooks, that sort of thing. Yeah. Fantastic. It's really interesting just to see all the broad and diverse ways that Python is growing and people are coming into it. You know, it's not like, well, I learned it for programming, you know, building an operating system. There's a lot of languages that are fairly, you know... or JavaScript: I learned it to work on a website, right? Yeah. It draws people in from all sorts of backgrounds. Yeah. Which is awesome. It's a meeting ground. Yeah. Yeah. And I think that's one of the strengths, actually, as kind of a sidebar: we have all these people with different backgrounds and different motivations and interests and things they're trying to build, rather than it being more like, well, here's how I build my web app, how do you build your web app? Yeah, very cool. And

How about now? Bentz, what are you up to day to day? So day to day, I've kind of put down the keyboard, at least from a programming perspective, and I work as a cybersecurity subject matter expert for In-Q-Tel. So I guess my job there is to search for and then work with companies in the cybersecurity industry that could have a high impact on national security, as well as providing advisory services to our customers in the US Government. Okay, cool. So what's In-q-tel? Inqtel, I'm sorry, In-Q-Tel, I guess it is? Yes. Yeah. What's the company's story there? Because you both are from the same company. Yeah. So it's a nonprofit, a 501(c)(3), stood up over 20 years ago by the CIA to basically help (originally the CIA, but now most of the intelligence community and elements of the DOD) acquire, adopt, and use cutting-edge technology. They realized a while ago, around that time, that a lot of innovation was moving into Silicon Valley and other places in industry and startups. And the traditional acquisition model that the federal government uses doesn't play well with those people. They don't understand it. So we kind of help bridge that: working with startups, identifying them, and then helping them interact with the government, and conversely helping the government adopt said technologies. Right, right, that's the mission. So maybe let me see if I can run a scenario by you. Maybe there's some Silicon Valley company that's created a cool ML thing that identifies deforestation or something like that. And the government decides, oh, this might be really helpful for us, for, I have no idea why, but imagine there's a reason, right? You might help that company work with the requests for proposals and the whole crazy government side of things, and get them more in line with what's needed there. Is that the story? Yeah, to an extent. Yeah. 
I mean, we actually invest in them, take equity, and, yeah, we do help them learn how to interact with the government and also help them shape their product to meet our customers' needs. Okay, cool. Interesting. I had no idea such a company existed. John, how about yourself? I'm also at 'iqt.org'. I work in what's called IQT Labs. It's an open source applied research and development lab, where we do hands-on research, mostly in the open, largely on GitHub. Sounds very, very fun. Now, let's talk about the supply chain issue, I guess, at a really broad level, right? And I don't know how you all feel, but I suspect that you have a little more hesitancy, or whatever, as you interact with computers and software and the Internet, and so on. You know, you go, oh, there's a cool new app, maybe I'll try that. You might think a little more carefully about this than the average, you know, teenager, whatever. There's a little bit of paranoia that comes with this. It's true. Yeah, exactly, exactly. That's what I'm getting at. And I feel like one of the more insidious aspects of this has been the supply chain side of things, right? Because it's one thing to say, that app looks shady, that site looks shady, let me just not go there. Let me not click that link. Let me not install that. But what if I were to install, you know, Office Suite X, and I completely trust the company that makes that, but there's some library that they got from a third party, and that third party had been hacked, and someone somehow trojaned that third-party thing, and no one's found out yet? I don't know. That's super scary. And that's kind of along the lines of some of the things that we're touching on. And so I think the most broad one of those in recent times has got to be SolarWinds. Right? It's certainly what's making the headlines these days, still even

08:21 five months later? That's still a topic of discussion around this theme. Yeah, I mean, that was a pretty challenging attack to pull off. I mean, it took nation-state actors months, maybe years, to plan, you know, laying the groundwork, getting things in place. They were basically infiltrating SolarWinds' development infrastructure. It's pretty impressive, honestly, that they were able to do it. And obviously the impact was enormous. It was wildly successful. I think one thing that Bentz and I have been interested in, though, is that while this sort of attack is very serious, and obviously it has rightly gathered a lot of attention, there are a number of other types of attacks, often focusing on open source software, that are actually more numerous. How serious they are is actually open to debate. But we are still talking about many people affected, and they can still have grave consequences, especially if you're the one that's hacked. Yeah. So it's gotten less front-of-the-newspaper attention. But Bentz and I still think it's serious. Yeah, I think it's very serious. I started with this one because I feel like everyone has heard about it, so everyone can relate to this. Right? Yeah. And here's an example of a company that supplies network gear to many of the largest companies and government organizations around the world. And this was basically a way to get, you know, access to all of those, right? They think it's Russia's Cozy Bear crew, but who knows, right? And it almost doesn't matter. Another one that I think was also in the news, real quick before we jump into the open source stuff (this is not open source at all), is what's called XcodeGhost. Have you heard of this? Yeah, yeah. So yeah, I mean, basically what happened here was, you know, app developers, iOS developers in

China don't like to, or can't, download the official Apple version of Xcode. So someone, you know, put a compromised version of Xcode up on some... let's say, get it off BitTorrent or something. Or some Chinese file sharing site that app developers over there like to use because it's more convenient. And lo and behold, it was compromised. There was basically something that would bake a backdoor into, you know, the ultimate compiled app that would go into the App Store or a variant of the App Store. Yeah, so every app that was built and published to the App Store with XcodeGhost, which looked exactly like Xcode, had a backdoor, malware type of thing injected into it. So there were something like 2,500 applications in the iOS App Store that, yeah, affected like 128 million people. So those are bad kinds of things. Right. Very bad, not ideal. And on top of that, I mean, it's attacking a compiler. Developers trust their compiler, I would say. I mean, not being able to rely on that... and it's very hard to vet your compiler, especially for a closed source product like Xcode. It's very hard to see: is my compiler compromised or not? Yeah, yeah. And I think this actually is closer to the open source side of things, right? Because if you can start to infect the tools of the developers building the things, that's a problem. Yeah. So let's talk about the open source side. John, you pointed out there are many known attacks over there. That's right. Set the stage: what's going on? There's actually a range of attacks. But I'll mention a couple here, and we'll get into typosquatting. 
So there's certainly the typosquatting attack, which we'll discuss extensively today. It's just like with domain names, where you might have heard of someone trying to go to a website who mistypes it a little or somehow gets confused about how it's spelled, maybe switching the order of words, and then ends up someplace that's malicious. That can happen on the web, or, if you're downloading a package, you download a package you think you want, but it's not actually the one, and sometimes, not always, that contains malware and does things to your computer that you don't want. That's bad, especially if there's arbitrary code execution, meaning they can do what they want, because perhaps you've installed it as root. Right? And, well, I think a lot of people who are getting into Python don't realize that when you pip install something, unless it's being installed as a wheel, a binary wheel, it's running a 'setup.py' as your account. So whatever your current account is able to do, it can do. Like you said, if you run it as sudo, it can do more, but even if it can just completely wreck your account and get your information, for many people that's plenty bad. On your computer, you don't want that. Yeah, and it could be your computer, or it could be your corporation's computer where you work, or your company's computer. And this 'setup.py', you're exactly right, is a key attack vector. For many people, it's simply a helpful way to install software. But unfortunately, some people abuse that specific resource. Yeah, I think it's been critical in the growth of how software is built. I remember, you know, Bentz, you were talking about doing C++ programming. I remember back in '97, '98, '99, doing C++ programming then. And it felt like, whatever you wanted to do, almost everything you had to build from scratch. You want a library that does this kind of UI widget? Well, how do I build that? You want a library that has this kind of data structure? Where do I either find or build that, right? 
And now it's just pip install this thing, pip install that thing. And the building blocks that we have to compose are so much more effective, right? I can take a couple of libraries here and click them together, and all of a sudden I've got something absolutely incredible, like pulling data from different sources and creating amazing graphs. I mean, that is the power of modern software development, right? And yet, you know, I guess Cory Atkins out in the livestream has a nice sort of comment. He said, "I didn't realize how naive I was thinking packages were vetted. pip, npm..." You're not alone, Cory. Join the club.
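The point above, that installing anything other than a wheel runs the package's 'setup.py' with your account's permissions, can be sketched in code. This is a minimal illustration, not something the speakers describe: it assumes you want pip to refuse source distributions so that no 'setup.py' runs at install time. The `--only-binary :all:` option is a real pip flag; the helper name `wheel_only_install_command` is made up for this example. It does not make a package safe, since code inside a wheel can still run when the package is imported.

```python
from __future__ import annotations

import sys


def wheel_only_install_command(package: str) -> list[str]:
    """Build a pip command that refuses source distributions.

    With --only-binary :all:, pip installs only wheels, so the package's
    setup.py is not executed during installation. Code shipped in the
    wheel can still run later, when the package is imported.
    """
    return [
        sys.executable, "-m", "pip", "install",
        "--only-binary", ":all:",
        package,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; hand the list to
    # subprocess.run(...) if you actually want to install.
    print(" ".join(wheel_only_install_command("requests")))
```

If a project publishes no wheel, pip will fail with this option rather than fall back to running 'setup.py', which is the intended trade-off here.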

13:54 This portion of Talk Python to Me is brought to you by Square. Payment acceptance can be one of the most painful parts of building a web app for a business. When implementing checkout, you want it to be simple to build, secure, and slick to use. Square's new Web Payments SDK raises the bar in the payment acceptance developer experience and provides a best-in-class interface for merchants and buyers. With it, you can build a customized, branded payment experience and never miss a sale. Deliver a highly responsive payments flow across web and mobile that integrates with credit cards and debit cards, digital wallets like Apple Pay and Google Pay, ACH bank payments, and even gift cards. For more complex transactions, follow-up actions by the customer can include completing a payment authentication step, filling in a credit line application form, or doing background risk checks on the buyer's device. And developers don't even need to know if the payment method requires validation. Square hides the complexity from the seller and guides the buyer through the necessary steps. Getting started with the new Web Payments SDK is easy. Simply include the Web Payments SDK JavaScript,

15:00 flag an element on the page where you want the payment form to appear, and then attach hooks for your custom behavior. Learn more about integrating with Square's Web Payments SDK at 'talkpython.fm/square', or just click the link in your podcast player show notes. That's "talkpython.fm/square".

15:18 These incredible building blocks, these Lego pieces. Yeah, there's a lot of faith out there that these are good building blocks, not good in the sense that they don't have bugs, but good in that they have a good intent. Well, I think that's the key thing. The challenge here is you have to kind of figure out the intent of the people you're trusting, and where you're trusting them, ultimately, and you have to hope they do not have malicious intent, because inferring that is very challenging. Yeah, it's a double-edged sword. I mean, I agree, it is a powerful change that you can download a couple of libraries and have an amazing app, potentially in a few minutes, maybe an hour or two. And this is the dream of code reuse come alive, finally. And it just so happens that there are sometimes downsides that can be mitigated. But unfortunately, for the unaware user, and it's all too easy to be unaware, there can be serious risks. Yeah, there definitely can. Kim Van Wyk out in the live stream has an example. You know, a benign example would be "attr" vs "attrs". Both are legitimate packages, but completely different. Now, another example would be if I want to install requests, but I actually just type request. I mean, even auditorily, they sound alike: requests, request. It's easy to do. It even sounds very similar, with the 'S' vs. no 'S' there. And if somebody says go install requests, you're like, oh, request, pip install request. God, I did it. Like, wait, no, no, don't do that. And it actually happened. You can find that that attack truly happened and affected, at least according to the article published about it, 20,000 users. So I don't know how many of them were actually affected. We haven't... this is, unfortunately, part of the problem. It's hard to track this data. But the example you brought up, and I know you brought it up on purpose, is serious. Yeah. 
And requests with the S is installed millions of times a week or month. Many, many, many, many. Yes, right. We'll talk about this later. But we found one called pandar, like pandas, but with an R. And, you know, it's not hard to imagine typing this, just by either confusion or mistake. Yeah, absolutely. So another area, I think, that is a little bit interesting, before we dive completely into the package management, typosquatting, and related types of issues, has to do with a trusted open source thing becoming untrusted. What I mean by that is, there were some examples of things like Google Chrome extensions being put out there as proper extensions, and then someone taking over that project and putting something, maybe more adware, in it, or something somewhat nefarious, if not actually malicious. Or, you know, somebody who is running... requests is not a great example, because it's under a PSF organization. But many of the projects are under an individual, right, on their GitHub profile. And so if somebody was able to break into that person's GitHub repo, and then they somehow sneak something into the code, would it look wrong? No, the person who made that change is the trusted, benevolent person who runs this project, right? If, you know, Guido van Rossum comes in and makes a change, well, who's gonna look at that and say, oh, this guy's sketchy, we've got to really look at it? Like, it's probably going to be fine, right? So if someone takes over an account, not only do they have access to the code and how it gets pushed out, you know, potentially into the stream that goes to PyPI, it's also done by the person who looks like they should be most trusted. Right? So things like two-factor auth, and just securing your GitHub and things along those lines, seem extremely important as well. Absolutely. I mean, what you've described with account takeovers has happened numerous times. 
And there are variants on it too, where there's some single developer who's overworked, tired, and doesn't use the product they created anymore, and they hand it over to someone who ends up, you know, putting a backdoor in it or some sort of malicious payload. I mean, that's happened. And then also, people take advantage of the fact that not only do you have to have your GitHub profile secured, you also have to secure your PyPI or RubyGems account, where you actually publish the packages people run. So there are kind of two areas for potential attack. And also, people take advantage of the fact that most people, at least me anyway, when I would vet software, would go look at GitHub, and then I wouldn't download it from GitHub, I would download it using pip or whatever. And that kind of disconnect, or whatever you want to call it, is another opportunity for confusion and malfeasance.

19:43 Yeah, so these things are hard to detect. But I guess the area where you all have done a lot of research, and that you've built some tools around, and probably the biggest area, is the package management side of things. Right? That's right. So we've talked about typosquatting and

some of the challenges where people might mistype things. And you talked about some examples where you found packages that look like they were intended to be installed by accident, you know, to catch those. If pip install requests is typed 7 million times, the chance that a couple of those are misspelled, or enough of those are misspelled, is pretty high. But there were actually quite a few examples. Like, for example, The Register had an article. When was this? This is recent, March 2021. The title is "Python Package Index nukes 3,653 malicious libraries uploaded soon after security shortcomings highlighted". That's right. There's really a longer historical narrative that includes this. I call this political activism, anti-typosquatting activism, where this, you could call it an attack, is really about drawing attention to this risk. Yeah, I feel like a lot of these were people saying, look, I'm proving to you this could actually happen. That's right. I'm proving it by creating this thing that uploads as requests with the S and T swapped. That's right. But were there actually viruses put up there? Like, what is the actual harm? Yeah, so not all of these are harmful. This one, and a number of others (we can discuss those if we have time), were largely benevolent but demonstrate the risks. But yes, there have been, at least by our calculations, 40 known malicious typosquatters on the Python Package Index, affecting thousands of users. We actually published a blog post on this, something like "Python typosquatting is about more than typos". So yes, this has happened. I don't know the exact persons that it has affected. We just don't have that data. Sorry if it affected you. And we published this and got some debate on Hacker News. And this is the point where Bentz and I said, oh, there's really something here. There's a broad audience that hasn't had a voice that cares about this. Yeah. 
I mean, it could have been nothing, right? If I'm a student at a university, and I install it on a lab computer, no big deal. Like, who trusts those lab computers? Right?

22:08 I mean, not just because somebody could have installed something bad on it. But there are college students, oh yeah, who could be installing all sorts of just, you know, pranks and other kinds of stuff. So you should just treat those things as contaminated. Yes, they fully get connect. But on the other hand, if this is a data scientist working at, like, a major corporation or an agency, and that happened to them, it could be the thing that opens the door to, you know, access to the entire network and all sorts of lateral movement, right? Right. One of the earliest pieces of anti-typosquatting activism comes from Nikolai Tschacher, who was writing his undergraduate thesis at the time in Europe. And he showed that over a few weeks, he got over 17,000 downloads of a series of typosquatting packages, including from '.mil' addresses, the domain of the United States military. So it is certainly possible that people in a more secure organization that really values security could accidentally be the victim of typosquatting. Yeah, absolutely. And the fact that it came out of a '.mil' domain shows that, yeah, that bad example could also happen. And his thesis was also covered on Ars Technica. That's right. Coolest undergrad thesis ever. Exactly. That's way better than anything I did in

college. Oh, yeah. Yeah. Fantastic. And then there was this project called Pytosquatting. Yeah, like a play on typosquatting. It's a clever one. And Benjamin Bach and Hanno Böck, who are open source software activists and developers, and one also a journalist, really had a multi-year effort pointing out the dangers here, not simply criticizing, but trying to help the Python Software Foundation and the Warehouse and PyPI crew raise money and build a consensus around trying to make this infrastructure safer. Yeah, yes. So they had this project called Pytosquatting, but that actually got closed down. Right. Yeah. Because they said that the PS... what do they call it, the PSRT, the Python Security Response Team, responds. That's the PSRT. And I'm like, wait, there's a Python Security Response Team? That's cool. And they respond to emails too. They are good. Yeah. Okay. So this is an organization, a group of people under the PSF banner, that basically triages these types of concerns. Right? That's right. Okay. Yeah, we'll link to their page on 'python.org'. And they have their email there. They also have rules for different types of disclosure, like whether you should email them or do other things. That's right. And, you know, if you find a malicious package, or even a package that you think is very suspicious, this is who to contact, and then they're

diligent and timely. So what do you think about how this should be disclosed? People out there listening, they find something. Should they go to Hacker News and say, look at this horrible thing I found on PyPI or on NPM, or whatever? Should they quietly disclose that to the security response team, and then talk about it after it's been removed and fixed? What's the flow for disclosure? It seems like it would follow, you know, any other responsible disclosure process for traditional bugs, exploitable bugs, or vulnerabilities, where it'd be nice if, when you find a problem, you contact, you know, maybe the Python security team, get in contact with the developer, and get it fixed. They'll probably get the package pulled down if, in fact, it is malicious. And then, yeah, it'd be nice to have some sort of reporting mechanism so that everyone who uses that package could be identified. On the first part, as John Speed was saying, you know, the Python Software Foundation and the PSRT do a good job, a great job, of being on top of it, being timely, being responsible. It's much harder to notify people. You know, there's no authentication when you download one of these packages, so it's very hard to know who's been affected, right? So maybe just promoting that more would be helpful, but then people need to know where to look, and that they need to look at all. Challenging. Well, it's like the XcodeGhost thing. You know, there were 2,500 apps that were backdoored. And I think only the top 25 were even disclosed. And even if there was a list of 2,500 apps, are you going to go cross-compare? No, you know, no normal person is going to cross-compare that announcement with their phone. Right. Right. Right. And it's just such a challenge. And I feel like, you know, here we had the same thing, right? We had 3,653 packages removed. Well, are you going to go check if you had those? It's extra hard, because you didn't intend to ever have them.

You didn't intend to swap the S and the T when you typed requests, but you did. And you accidentally, almost unknowingly, got it, most likely, right? And so I do think it's really hard to push this out as an awareness thing, like, hey, you should know that this happened, so just go check. Right, the checking, I think, is really tricky. Yeah. I mean, like many software problems, you need to solve it with more software. You've got to solve it with AI, you definitely have to solve it with AI. I think one thing that's helpful and could be part of that process, but isn't always, unfortunately, is taking that artifact that you found, let's say a Python package that was malicious, and making sure it gets to somewhere where it can be studied, and hopefully future attacks prevented. And so for Python, and a couple of other languages, there is actually an interesting project that has a very colorful name. It's called "Backstabber's Knife Collection". It sounds very scary and malicious. But it is actually yet another enterprising grad student trying to collect malware samples, especially of interpreted languages. Python is one of them. And the idea is that there can be a community of researchers, and hopefully then companies, that can fight these packages. So that would be another thing I would add to the list. Yeah, there you go. Marc Ohm is the main person associated with that, and he has written some interesting papers and great stuff. And so I urge you, if you come upon this and you say, how do I act responsibly here, do the things Bentz says, and also maybe grab a sample and give it to the Backstabber's Knife Collection or another similar repository. Yeah, interesting. Okay. Have I just messed up my computer by visiting this web page as well, I wonder? I don't think so. But there is that. I just

I can't guarantee anything, though. But no, of course, of course. Before we get too far on, Cory Atkins also asked, when we were talking about messing up your computer, the lab computer, and so on: could installing these types of things also affect shared server space on my IAZ land, where I have a shared server running, if someone else does something bad? I mean, theoretically, sure. It depends on the permissions, I would think. Yeah, if you install some dependency that has a keylogger baked into it, or, I don't know, some sort of file collector, and it has permission to traverse all directories, then yeah, I could certainly see a scenario where that was possible. I mean, I have no idea. I haven't heard of that happening specifically. But there's nothing preventing it, theoretically. Yeah. If you had a series of virtual machines, you know, it's pretty tricky to escape from one virtual machine to another, but I believe there have been examples. Those sorts of vulnerabilities are exceedingly rare, though. Right, right. So while we're on this topic, I want to throw out an idea. And then we'll talk about some of the tools you built. But I feel like we're right in the middle of this notification thing. Like, we've got all these packages, and they've been identified. They have been downloaded, we can see that. We probably even have IP addresses, which you can reverse look up to DNS names, which is probably how those attributions were given. But it happens so often in so many different places, right? Like, if I've got a continuous integration story that builds a Docker container that pushes to Docker Hub, and then my production grabs that from that container.

30:00 The place where the problem happened is not the place where the problem is, right? It's probably GitHub or some other CI tool. We have a really nice thing for this in the account space, Have I Been Pwned by Troy Hunt, which is a really nice project. I definitely recommend people go there and enter their email addresses. Prepare to be horrified. Yeah. And prepare to be horrified. There's 11.2, 11.3 billion accounts that have been breached, which is odd, because it's more than all the humans. But we all have more than one account, so there it is. But yeah, so you put your email in there. And then in the future, well, historically as well, but then in the future, if something has happened and your email appears in some kind of password dump, password breach, or account or informational breach, you'll get an email saying, hey, we found something that should be concerning to you, check it out. I would love to see something like this for pip, right? Something that says, I pip installed this thing, and it just has a record of, here's my account, these are the things I've pip installed; if there turns out to be a problem with one of those, notify me that that had happened. That would be really useful. I don't think it exists. I don't think it exists either. And it shouldn't just be a pip thing. It should be an NPM thing, it should be a gem thing, it should be a great thing, it should be something that's like just a little bit of a wrapper that says, I would like to opt in to saying, here's my UUID, here's my email address, and here's the list of things that I've installed. If it turns out that one of them is horrible, just let me know.
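The notification idea above, a record of what has been pip installed that gets checked against a list of known-bad packages, could be sketched roughly as follows. This is only an illustration under stated assumptions: the `advisories` mapping and the `flagged` helper are hypothetical, and the conversation itself says no such service is known to exist. The only real API used is the standard library's `importlib.metadata`, which lists the distributions installed in the current environment.

```python
import re
from importlib import metadata


def normalize(name):
    """Normalize a project name the way PEP 503 describes (case, -, _, .)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def installed_packages():
    """Return {normalized_name: version} for the current environment."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            found[normalize(name)] = dist.version
    return found


def flagged(installed, advisories):
    """Return sorted (name, version) pairs that match an advisory.

    `advisories` is a hypothetical mapping of package name to the set of
    bad versions; an empty set means every version is considered bad.
    """
    hits = []
    for name, version in installed.items():
        bad_versions = advisories.get(name)
        if bad_versions is not None and (not bad_versions or version in bad_versions):
            hits.append((name, version))
    return sorted(hits)
```

In use, `flagged(installed_packages(), advisories)` would be run periodically, with `advisories` fetched from wherever removals of malicious packages end up being published.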

31:30 It's so sensible, it makes me laugh. Yeah. Well, there's an idea out there as well. But this is way far down the line, right? This is, oh, we know this has happened, we know who's done it, we know who has been affected, and so on. But starting a little bit further back, you all have built some tools to go and start at the beginning and say, well, let's look and see what might be out there that is bad, right? This is the tool used to find Pandar instead of pandas. That's right. Tell us about it. I mean, I think the first idea you had is the crucial one, which is that you need to know that there has been a compromise in order to report it. And right now, it's surprisingly hard to know that. So we're not the only ones to devise a tool or approach to finding malicious packages on the Python Package Index. But we took a particularly simple one. And we said, can we simply use the metadata of packages, especially the name, but some other information too, and then look at just the most downloaded packages, and check who has names that are very similar to those packages? This is where AI comes in, you crazy AI at this point. You do get a lot of false positives; people have similar names just because the packages are related. It's fine. There's no problem inherently with having a similar name. But, right, we cracked open those packages, this was some very boring Saturday mornings of mine, and simply scanned through the code looking for anything that's suspicious. And lo and behold, we found one called Pandar that was actually doing keylogging. And it was a proof of concept. It's unlikely that it actually would have worked, but we reported it to the Python Security Response Team, security@python.org. They said, yep, not good, and yanked it. And it was just an example that it's not that hard to find them. And we were showing yet again, with a pretty simple demonstration, that it's not that hard. Interesting. That's really cool.
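The name-similarity check described above (compare every package name against the most downloaded names and keep the near misses) might look something like this minimal sketch. It is not the code of the tool discussed in the episode; the function names and the `max_distance` threshold are assumptions. The distance used here is optimal string alignment, which counts an adjacent swap such as the S and T in requests as a single edit.

```python
def osa_distance(a, b):
    """Edit distance allowing insert, delete, substitute, and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent characters typed in the wrong order.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def lookalikes(popular, candidates, max_distance=1):
    """Return (candidate, popular_name) pairs that are suspiciously close."""
    popular_set = set(popular)
    hits = []
    for candidate in candidates:
        if candidate in popular_set:
            continue  # the real package, not a lookalike
        for target in popular:
            if osa_distance(candidate, target) <= max_distance:
                hits.append((candidate, target))
    return hits
```

As the conversation notes, a check like this produces false positives, so its output is a list of packages to open up and read, not a verdict.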
So basically, the tool is about, given popular packages, finding ones that are oddly similar. And then there's like a, let me go and see what this one's about. That's right. So there's a couple of additional checks to help anyone using it. You can find it; it's an open source command line tool written in Python. It also checks things like, for instance, is the description of the package on PyPI very similar, so that what you are seeing is someone who is trying to not only typosquat the name, but in some sense squat the broader metadata, or almost like the copyright of the package. Right, because you want it to look as similar as possible. Exactly, like camouflage is something that comes to mind. Yeah. Are you guys familiar with Snyk? Yes, yes. There's a project, and I'm forgetting the name. The Advisor project? Yes, Snyk Package Advisor. Yeah, that's it. It's super neat. Yeah, that's what I was looking for. No, that's not how I spell it. Yeah. And so that thing is pretty cool. The reason I bring this thing up is you can come over here, and I can type in a project like requests or whatever, and eventually it'll tell us the package health score. Yeah. And it'll tell us things like, there's this many PRs that have been opened and closed, there's this many contributors, there's this many people participating, what the maintenance looks like, and so on. One thing that I think would be cool would be to take this number plus a misspelling, and say, if that number is really, really low for a package where it should be really, really high, that's a challenge, right? If you look at the GitHub repo that is delivering this thing, and it doesn't look right, if it's not associated with something that seems kind of hard to replicate, like a GitHub repo with many people

35:00 participating over a long period of time. Yeah, that seems like that could be a good flag as well. Yeah, it certainly seems like there's an abundant opportunity to build something into the actual download client, into pip, or a wrapper around pip, where it checks these sorts of things and creates speed bumps for you as you are trying to download something or use a package. So it says, hey, this looks suspicious, have you thoroughly checked this? And I think your idea could contribute to exactly such a tool or tools. Yeah, very neat. So you've been working on this, and Martin Carnogursky created this thing called Aura and also reached out and said, hey, I'm also working on this. And so yeah, tell us about this thing called Aura. Yeah. So we got an email last fall after publishing this blog post. And he said, hey, I've been working on a similar tool; not only does it check this metadata, but we even do static analysis of the entire Python Package Index. And we said, Martin, that's awesome, let's work together. And so over the past six months, roughly, in an open source collaboration between a number of us at IQT Labs and Martin Carnogursky, we have further refined Aura, which truly is designed to do a static analysis of the entire Python Package Index. And that's an open source tool, you can find it, and he releases his data and tries as best as he can to release it regularly. We've also built a tool called 'AuraBorealis'. The key thing is Aura produces 50 gigs of output when it's done scanning the entire Python Package Index. No human can wade through that. And I suspect also the PyPA, the Python Packaging Authority, folks probably don't want everyone downloading that much data all the time. No, it's exhausting, and creates so many database issues and other things. So we've been working on a tool called 'AuraBorealis', which you've pulled up, that is a front end that makes it easier to use the data set that Martin creates with this tool, Aura.
This wouldn't necessarily be part of PyPI, though, of course, it could be. But we imagine this as a tool for organizations or persons that need to have global knowledge about the entire Python Package Index, to rank and assess potential threats, go look for those, and then take appropriate action, or even for individual developers that are really curious about packages. This makes it easy. AuraBorealis isn't yet live, but we hope to make it live this summer. Aura, as a production tool, works. So go check it out.

37:36 Talk Python To Me is partially supported by our training courses. When you need to learn something new, whether it's foundational Python, advanced topics like async, or web apps and web APIs, be sure to check out our over 200 hours of courses at Talk Python. And if your company is considering how they'll get up to speed on Python, please recommend they give our content a look. Thanks.

37:59 This looks like it's really handy, you know. So the idea is basically, it's going to run forever, and that's going to generate tremendous amounts of data. Maybe just put a web front end on top of that static data, rather than have everyone generate it over and over. Exactly. Instead of generating it over and over, or having 50 gigs and having to write your own custom, probably Python, script that, you know, you'll have to optimize and blah, blah, blah. Yeah, so Kim in the livestream just says, I accidentally typed 'synk' instead of 'snyk', which also is hard to spell anyway, because it's like an uncommon spelling, which is an excellent way to demonstrate making a typo and getting the wrong package. I have no idea what that's going to return. I'm not going to pull it up. Podcast imitates life imitates art imitates compromise.

38:43 Hey, exactly. Alright. Well, how would I use the Aura data and the Aura-Borealis project? I guess we should talk about this from different angles, right? Maybe I'm a CSO at a company, and I'm concerned that all my people are psyched about data science and Python, or NPM and web front ends, and they just make me nervous all day. And I want to get on top of it. I would like to know what's happening in my software supply chain. Or maybe I maintain pandas, and I'm really upset that Pandar exists, and I want to now be able to defend my package. It seems like there's different use cases and people out there. That's right. I think if you're a company and you have a group of software developers, and you have, let's say, a security team that helps vet packages, so perhaps you put those packages in an internal repository so that developers know that they're clear to use, Aura Borealis will help you do that. And we're glad to set up pilots and discuss; you can email me at 'jmyers@itt.org'. But there's also other angles too. There's just, you're a developer and you want to make an informed choice. The static analysis tool and its output can help you with that, and Aura Borealis too. And I think, you're right, there is also a maintainer

40:00 angle, and also a PyPI administrator angle, where you want to either protect a set of namespaces close to your package, or you care about the health of the entire ecosystem. And those are all possible user types. Yeah. And we could probably use your PyPI scan to go and say, look for things similar to my package name? Yeah, that's right. And we built that into Aura Borealis too now. So in some ways PyPI scan was a demo, and it's still useful as a command line tool, but Aura Borealis and Aura have that now built in. Are you going to put an API on top of this? Good question. That would be cool. The tricky thing, like everything in life, is that it costs money and engineering resources and time. I certainly have a vision. And, you know, if I don't do it, someone else should do it and go make a lot of money creating a technical infrastructure where every single package and every single new version of every package on PyPI, NPM, etc. gets scanned with a variety of scans, static analysis, dynamic analysis, metadata analysis, and that gets stored in a database where you and I can go make API calls and get the information that we should have on these packages. That could be, you know, a free tier, and then if you really did make a lot of calls, a paid tier. But someone should do it, I think. Yeah, it would be neat to, like you said, integrate it into, say, pip, so that if I pip install something, it could even flag it and say, hey, no, actually, we're going to block that. That's right, preemptively. Because it's got some low score, unless you do like a '--force', like, no, really. I mean, yeah, exactly. It's something that'll sort of slow it down, as you call them, speed bumps. I hope someone does something similar. We have plans, but no active development underway. Right. So that sets the stage for some of the tools out there, at least to identify that there are potentially bad packages.
And it's also, I guess, you know, worth pointing out that if we go over, say, to PyPI, there's over 300,000 packages over there. And if there are 40 actually malicious ones, right, the chances are low, they're not very high. So people shouldn't be, you know, running for the hills in complete panic or anything, I don't think, from this. But at the same time, we should be careful, we should be cautious. So what can we do? That's the tough question. Bentz, do you want to start, or do you want me to go? Sure. I mean, there's a lot of things that we can do. I mean, John's hit on a few of them, about just kind of being more deliberate, you know, checking your work before you download something. And also, you know, when you're considering dependencies, and you mentioned C++ in the late 90s, I vaguely remember those times, when Boost came out as a big deal. Oh, yeah. You actually had a dependency that way. I remember reading more books, right, less internet, more books to make things work. We moved on from that. But ultimately, it is worth considering, do you actually need this dependency? left-pad in NPM is a funny, canonical example: it broke the internet because people didn't feel like typing one line of their own code and imported a left-padding dependency. I do feel that's a really good example. And certainly left-pad came to mind, not as malicious, no, just as a supply chain Jenga tower type of thing, and somebody pulled too much on a part of the Jenga tower and it came down, right? I feel that the JavaScript community has way smaller Lego pieces than the Python community; the blocks that you click together here are larger. So I feel like there's just fewer external dependencies, on average, in my Python experience than in my JavaScript experience. Yeah, I think that's accurate. I mean, numbers vary. I've seen, for people who use NPM,
JavaScript developers, the average package in NPM has like 94% dependency code, other dependencies, and only 6% is your actual code you've written. Most of the modern languages, meaning JavaScript, Python, and some others, are in like the 90-ish range. And then you see C and C++ are much lower, Java somewhere in the middle. Yeah, Python is obviously lower than JavaScript, but much higher than the kind of legacy languages that are historically used. So be deliberate means things like, don't just type pip install whatever as fast as you can; type pip install, and then carefully type out the package name. Maybe give it a quick read before you hit go. Or just copy and paste.

44:22 I'll type it all. Yeah. So for example, if I'm over here on PyPI, there's a copy button I can click, and it'll do exactly that. Right. That's an option. Yeah. So it's being a little more thoughtful and kind of, you know, looking at the dependency chain as well before you download something, which is hard. It's much harder than it should be. But that's helpful to know that, you know, maybe at the top level you are using Joseki, however you pronounce that. You should pip

44:49 install that right now, let's see what happens. Let's just see what happens. Here's an example of one of those that should rank lower. No offense if this is your project, but it literally has zero stars 04

45:00 Its features are 'to do', its requirements are 'to do'. Its PyPI version banner is not found, and, I mean, it's only four minutes old. They may be worth, yeah, sure. Yeah. It has dependencies. This one probably doesn't. But take a look at those too, just to make sure there's nothing egregiously wrong, at a minimum. Yeah, that's one of the things that makes it kind of insidious and hard to see: the thing I directly look at may be fine. But the person who maintains that, did they make a mistake in the things that they depend upon? Or maybe, you know, transitively, like, follow that chain, that graph, down far enough, right? There's a lot of layers that could be happening along the way. It ends up looking like a web. And not surprisingly, just because of that, most vulnerabilities inside of packages like this are in the transitive dependencies, the ones below the first layer, the dependencies' dependencies. Interesting. So you can pip install the thing. What about pinning the version? I know there were some issues about having a private PyPI server, which I think is a good idea, where you whitelist packages, and you say, we approve these things, and only these things get installed. And if you want a nice new one, we've got to opt it in, and then now it's part of the organization. That seems like something you could do, right? There's a PyPI server that you could set up that is a sort of pass-through layer there. But then there was also the vulnerability of the version mismatch, like if there's a higher version of that thing on the public PyPI than your local one. So people were putting in, like, data layer version 70, you know, and then it's like, oh, there's a newer version out there for me to go get, I'll get that, even though it was meant to be internal. And, right, so there's these challenges. But what do you think about a private whitelist server?
It certainly seems valuable, and it seems like it's another speed bump, as John was calling them. Yeah, I mean, then you run into scenarios like the one you described, where it's kind of, I guess, undefined behavior, potentially, or at least not well-known behavior that maybe isn't the most intuitive. So even that might not be enough. So then, yeah, the pinning could help. Then, of course, there's the challenge of maintaining your pin at the proper level, which adds more and more effort on the

47:07 developers to maintain up-to-date dependencies, at least publicly. Yeah, publicly, we have Dependabot on GitHub, which is way more of a pain than it should be to use, because if you've got 10 updates, it'll issue 10 PRs, which conflict with themselves. Anyway, that's a long story. But it's still at least some automation that says, hey, there's a new version of this, here's the change log. And we also have the CVE security checks of Dependabot, which are really good. Yeah, unfortunately, most of these typosquatting or just general supply chain attacks don't end up in the NVD as a CVE. Yeah, who's going to study this one, and then not just say, take it down? Right. Like, it's living in the shadows, right, being unnoticed to an extent. Yeah. And NPM does a good job with their advisory service of saying, this is a malicious package, and this is why we removed it. But not all package managers do that. And even so, then you have to go to all of them; most developers are building in multiple languages these days, so it's hard to keep track. Yeah. What about having isolated environments for trying out new packages? So for example, one of the things I'm trying to do is, if I'm checking out a new package and I have to pip install, maybe that happens in a Docker container, then I throw away that container, or possibly a VM with snapshotting on, and then I roll back the snapshot periodically. Yeah, those both sound like great ways to have good hygiene and isolate the potential blast radius of

48:31 a potentially malicious package. Yeah, it's one thing to say, here's a thing we want you to check out, and it's on PyPI, and it's really well known. But, you know, you've got to explore new things that aren't super well known yet, right? And so how do you install that? So I think some kind of blast containment, like you said, like Docker, like a VM, is not a terrible idea. Yeah, it's a good one. What else? There's the Open Source Security Foundation. Yeah, that's right. This is OpenSSF. OpenSSF is clearly a reference to OpenSSL, yet another well-known software supply chain compromise that had widespread impact. It's worth pointing out that this group, for anyone who is very enthusiastic about open source software supply chain security in particular, has become a meeting ground where both companies and persons interested in the sorts of topics we've been discussing today, and more, have set up a series of working groups. There's six, roughly, and they meet every few weeks. Open community, fun, interesting people either interested in the topic or actively working to give back and contribute. It's run by the Linux Foundation, and we would highly recommend it as a place to find other like-minded persons, if you care about these sorts of topics. Yeah, fantastic. And then there's the further-on-down-the-road stuff, which we've touched on a couple of times, but maybe we can encourage some enterprising people out there to go after it, like a hardened pip. We have things that are sort of on top of pip,

50:00 tools. We've got pipenv, we've got pipx. I'm a big fan of pipx; the isolation that it gives us is neat. And I can just see, like, a pip sec, or something along those lines, or pips, maybe a plural pips, I don't know, for pip security. But something like that, that incorporates some of these ideas. Maybe it checks in, and you say, like, I don't want to install any package that is not in the top 1000 popular packages, except for what I whitelist in on top of that, or something. Or check with Aura-Borealis about the score, or check with the Have I Been Pip'd, or whatever that thing we were talking about would become. Right? Where might you see things going? Yeah, well, there's been a couple of, I'll call them starter projects, in the hardened pip area. There even was one called pip sec, you can find it on PyPI. But really, there's nothing there, unfortunately, at least yet. That namespace is claimed. The maintainers, who we mentioned, Benjamin baller back especially, are interested in doing some work, they just haven't had time, other busy priorities. And I think there is a lot of potential to build out that idea and create something that could be useful to the average developer. JavaScript has a tool that has at least some moderate popularity called npq that does this. And I think it's time for the Python community to see if there's something similar. I would love to see something like that. Another thing is Google. Thank you, Google. Google has become a visionary sponsor of the PSF. And specifically, they want their funds to go towards critical supply chain security improvements: developing productized malware detection for PyPI, prototyping dynamic analysis infrastructure. So this sort of gets at the idea that maybe there's something that the PyPA and 'PyPI.org' could do on their end without even necessarily changing pip, right? pip's going to go straight to some API there, and it goes, yeah, no, not this one.
That's right. Or where you're going to upload it, like you upload a new package, and it goes, no, we don't want to accept it. And Dustin Ingram of the Python Software Foundation at PyCon just recently devoted his talk to Python and the software supply chain issues that we've discussed today, writ large, including typosquatting. It's clear that there's energy and willingness from even core members of Warehouse and the Python Software Foundation to tackle these issues. And so we're glad to see that. It'd be great to see something like that happening. I think in layers as well, right? That's how you talk about security often: it's not just, well, you have a strong password and you're fine. It's like, well, maybe have two-factor authentication, and maybe you run with lower permissions, and, and, and, right, the layers. So this could be one of the layers, but not necessarily all of them. Yeah, I should note that a couple of us at IQT Labs even put in an issue recently on Warehouse that might interest some parties here. It's issue 9527; you can also find it at 'short.iqt.org/issue', just a redirect. And we essentially call for something like social distancing for the top packages on the Python Package Index, so that for very popular package names, the package names that are close by are blocked off. We're not saying that anybody who chooses those names is malicious, but just so malicious people can't choose them. Feel free to upvote that. We've been discussing this with some of the members of the Warehouse team. Yeah, so your proposal is that Pandar should not have even been allowed, right? That's right. Given that the package pandas is so popular, minor variations on its spelling should basically be blocked, or maybe redirect to pandas with a warning, like, you tried to install Pandar, did you mean to install pandas? That's right, or something like that. So it's a way to build in guardrails so that the unwary don't fall prey to this.
Yeah. Personally, my first impression is that that's a good idea. It's worth it; we don't need request and requests and requester. And, you know, the potential harm is higher than the value of, you know, reusing very, very similar names. Yeah, we agree. And there's obviously trade-offs. Yeah. Bentz, what do you think? You must agree with this, I suspect. I do agree with that. I definitely support this. And one other thing that's under consideration that's relevant is namespacing. So you can kind of, Reitz is the requests guy; he has his namespace, you go to his namespace, and you're less likely to mistype that and have someone who has claimed that namespace, since it's going to be packaged within their own namespace. So possibly that is another layer, I guess, as you were describing it. It makes the command you've got to type a little bit longer. Yep. It makes it really clear where it's coming from. I mean, that's what the point of namespaces in programming is. It's really clear what library it comes from, or what part of your code it comes from, and who grouped it together. And, as well, you know, Go has done that. Yeah, used that to great success. So yeah, Kim benwick out there put in a comment that's sort of related to that, talking about the private PyPI server that's
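The hardened-pip and "did you mean pandas?" ideas above could be combined into a small pre-install gate. What follows is a sketch under assumptions, not an existing tool or a pip feature: `check_install`, the `popular` set, the `allowlist`, and the `force` override are all hypothetical names. The only library call is `difflib.get_close_matches` from the standard library.

```python
import difflib


def check_install(name, popular, allowlist=frozenset(), force=False, cutoff=0.8):
    """Decide whether an install should proceed; return (allowed, message).

    `popular` stands in for something like the top-1000 package names and
    `allowlist` for names an organization has explicitly approved.
    """
    name = name.lower()
    if name in popular or name in allowlist:
        return True, "ok"
    # Not a known name: see whether it is a near miss of a popular one.
    close = difflib.get_close_matches(name, sorted(popular), n=1, cutoff=cutoff)
    if close:
        message = f"'{name}' looks like '{close[0]}'; did you mean that?"
    else:
        message = f"'{name}' is not in the popular list or the allowlist"
    # The speed bump: only proceed if the caller explicitly forces it.
    return force, message
```

A wrapper script could call this before handing the name to pip, printing the message and stopping unless the user repeats the command with an explicit override.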

55:00 you know, redirecting out: it would help if the private PyPIs had an option to prevent uploading images from, or pulling images with, a certain prefix. For example, if everybody named their packages ABC-something at the company, you could say, ABC is private, ABC-star is private, never ever, you know, go look beyond here, or that type of thing. I think that's pretty interesting. Yeah. Good idea. Yeah, I think it seems super simple and a good idea. I agree. Alright, gentlemen, well, very cool to talk about this stuff. Like I said, it's going to make all of us a little bit more nervous, I suspect. You know, for example, Cory Atkins out there says, I also just found an article on malicious Docker images. Now I am paranoid. For which, I'm sorry. Welcome.
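The prefix rule suggested in that comment (anything named ABC-something is private and must never be looked up on the public index) can be sketched as a tiny routing function. The names `index_for` and `private_prefixes` are hypothetical, and this is not a configuration option of any particular private index server; it only illustrates the decision such a server or wrapper would make.

```python
import re


def index_for(name, private_prefixes):
    """Return "private" if the name carries a reserved prefix, else "public".

    Names are normalized first (lowercase; runs of -, _ and . become a single
    hyphen) so that "ABC_datalayer" and "abc-datalayer" are treated the same.
    """
    normalized = re.sub(r"[-_.]+", "-", name).lower()
    for prefix in private_prefixes:
        if normalized.startswith(prefix):
            return "private"
    return "public"
```

With a rule like this, a higher version number appearing on the public index for a reserved name would never be considered, which addresses the version-mismatch problem raised earlier in the conversation.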

55:44 Yeah, yeah, I've been there for a while. Alright. Before I let you two out of here, though, real quickly, I'll ask you the two questions at the end of the show, of course. So if you're going to write some Python code, what editor do you use? I use Vim if I'm in the command line, but if I have the fortune to be outside of it, Sublime. Right on. I suspect JupyterLab is also an IDE; definitely Jupyter's in there. Yeah. Bentz? Apparently PyCharm. Yeah, I use Vim if I'm already in a command line, but that's not as often these days. So PyCharm is my ideal choice. Right on. And then a notable PyPI package, something that's like, oh, people should know about this? Check out one called 'Network ML'. It's a package related to machine learning and network traffic. The lead maintainer is Charlie Lewis of IQT Labs. Go find it on PyPI. Yeah. Fantastic. So machine learning plugins for network traffic. Yeah, it identifies, like, anomalies and other weirdnesses like that? Yeah, it parses network traffic. And one of the cool things it does is it helps identify what sort of device is being observed. So is this thing a printer? Is this thing a personal computer? Is it an Active Directory controller, etc.? Is it a canary? Is it a canary?

56:54 Awesome. All right. Well, thank you both for shedding light on lots of what's happening, some of the things that are being done, and what might also be done as well. So, final call to action: people want to get involved, maybe do more, become more aware, what do you say? Yeah, I mean, there's plenty of work to be done. OpenSSF is a very welcoming, relatively new organization that has a nice list of stuff to do. The Python Software Foundation also actually has an active list of items they would like to work on, some of which are relevant to this topic. So those may be two great places to start. I'll point you back towards that GitHub issue, feel free to chime in. And I think there's definitely potential over the next few months. Additionally, we're actually working on a survey at IQT Labs on secure code reuse. So if you want to help build the research foundations for this, you can find this survey at "short.iqt.org/survey". And we're trying to understand the developer's, or data scientist's, or other programming professional's experience with package reuse. So that's another way, and hopefully this survey informs future tools. Yeah. Fantastic. Well, thanks for the work that y'all are doing. And thanks for being on the show. Thanks for having us. Right.

58:03 This has been another episode of Talk Python To Me. Our guests in this episode were Bentz Tozer and John Speed Meyers. It was brought to you by 'Square' and us over at Talk Python Training, and the transcripts are brought to you by 'AssemblyAI'. With 'Square', your web app can easily take payments and seamlessly accept debit and credit cards as well as digital wallet payments. Get started building your own online payment form in three steps with Square's Python SDK at 'talkpython.fm/square'.

58:33 Want to level up your Python? We have one of the largest catalogs of Python video courses over at Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async. And best of all, there's not a subscription in sight. Check it out for yourself at 'training.talkpython.fm'. Be sure to subscribe to the show: open your favorite podcast app and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at 'talkpython.fm/youTube'. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.
