#64: Inside the Python Package Index Transcript
00:00 What is the most powerful part of the Python ecosystem? Well, the ability to say "pip install magic_library" has to be right near the top. But do you know what powers the Python Package Index and the people behind it? Did you know it does over 300 TB of traffic each month these days?
00:00 Join me as we chat with Donald Stufft to look inside Python's package infrastructure.
00:00 This is Talk Python To Me, episode 64, recorded Wednesday, June 22nd, 2016.
00:00 [music intro]
00:00 Welcome to Talk Python To Me, a weekly podcast on Python- the language, the libraries, the ecosystem and the personalities.
00:00 This is your host, Michael Kennedy, follow me on Twitter where I am at @mkennedy, keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython.
00:00 This episode is brought to you by Snap CI and Rollbar.
00:00 Hey everyone. I have a really fabulous look inside the Python package index on deck for you. Donald and I just scratched the surface but there are a ton of fascinating topics that we cover. Before we get to our talk with Donald, I have one announcement for you: I have released my second online course called Write Pythonic Code Like A Seasoned Developer. It's over 4 hours and 50 concrete examples of how you can write more Pythonic code, and it's jam packed with tips that you can incorporate into your projects immediately. Topics covered include the expansive use of dictionaries, hacking Python's memory usage via slots, using generators, comprehensions and generator expressions, creating subsets of collections via slices, all the way to the database and much more. Several of these are Python 3 only features and so you'll have even more reasons to adopt Python 3 for your next project. The response in the first two days has been super positive, I hope you'll take a moment to see what the course is all about at
00:00 https://talkpython.fm/pythonic
00:00 Now, let's get to Donald.
02:09 Michael: Donald, welcome to the show.
02:12 Donald: Hi.
02:12 Michael: Hey. Thanks for coming today, it's going to be really fun to talk about packaging and PyPi and all these sorts of things, but of course, like always before we get into the details, let's hear your story, how did you get started programming in Python?
02:25 Donald: Yeah, so programming in general: I was playing a video game back in high school, EverQuest, and I started to get into hacking that video game, and there was a tool called MacroQuest that I started working on, and that sort of got me involved in programming at all. And then I started to get jobs in programming. My first job was with PHP, and when I was working with Drupal at the time, I kept hearing about this cool framework called Django, but it was written in Python, so I picked up a Python book and sort of taught myself Python in a week or so, so that I could then use Django to do websites instead of Drupal. I was feeling constrained by the sort of CMS aspects of Drupal.
03:18 Michael: Yeah, ok, that's cool. Drupal is PHP, right?
03:20 Donald: Yes, Drupal is a PHP CMS that's sort of got some frameworky aspects, at least it did back then, that was 2007 or so, I haven't looked at it recently.
03:33 Michael: Yes, it's definitely still going. So was it refreshing to get into Python from PHP?
03:39 Donald: Yes, I found Python to be incredibly concise and great to work with. I liked the enforced white space coming from somewhere where PHP- particularly at the time a lot of the examples you would find online didn't always have the greatest formatting, so that sort of enforced formatting helped a lot with me reading other people's code, to figure out what they were doing, how it worked, and sort of just general grokking of the code bases to help me learn over time.
04:15 Michael: Yeah, yeah, that's really cool, I am sure it was nice. So, I think maybe a good place to start this conversation would be to talk about what you do for your day job?
04:26 Donald: Yes, so my day job is, I am employed by Hewlett Packard Enterprise, full time employee, and my sort of mandate is basically make Python packaging better. So from there, I work on PyPi, I work on pip, I work on a project called Warehouse, which is essentially PyPi 2.0, and the Warehouse name is not really exposed to end users, other than if you start to work on the back end, but we call it Warehouse just to distinguish it. Then there are other really small tools like Twine, and sort of a lot of the background efforts, writing PEPs and coordinating things and whatnot; it all falls under the banner of what I do for my day job.
05:14 Michael: Yes, so that's really awesome that HP is making that investment to more or less fund a developer to continuously work on Python packaging.
05:23 Donald: Yes, and I think it's great, they are already big contributors to the OpenStack community, and OpenStack is- I think it's not entirely written in Python now, but it's largely written in Python, so they are heavy users of Python and PyPi and pip and whatnot. So HP felt that since they are depending on these things, it made sense to invest in these things to make sure that they continue to be running and working.
05:52 Michael: Yes, that's cool. So is this a position that you somehow managed to create or was there like a job announcement, hiring packaging support person?
06:06 Donald: It was created for me. At the time there was Monty Taylor inside HPE, who sort of led the effort to convince the higher-ups that this was something that they should do, and he is the one that reached out to me and offered this to me. Prior to this I was working at Rackspace, where they gave me half time, 50% of my time to work on packaging and 50% of my time to work on their own projects. So he sort of took that idea and just took it up to the next level and really pushed for getting that done in HP.
06:45 Michael: Yeah that's awesome. I think we'll dig in a little bit more later on, how companies are supporting PyPi and so on. But let's keep it high level for a little bit; I've had a lot of interesting conversations with people, basically around the pronunciation of P-Y-P-I, and I've heard a lot of people say [pai pai] and I've heard a fair number of people say [pai pi ai]. What's the official way to say it?
07:16 Donald: The official way to say it is [pai pi ai], it's Python package index so you put the emphasis on the pai pi ai; a lot of people do call it [pai pai], I've heard [pi pi] and [pee pee] and all sorts of pronunciations. I really don't like the name, I think it's a confusing name that has this big sort of problem of pronunciation: everyone pronounces it a little bit differently, and one of the most common pronunciations, [pai pai], clashes with [pai pai] as PyPy, the alternative Python implementation. I have thus far been unsuccessful at convincing people that it's worth it to change the name, since PyPi is so ingrained everywhere across the ecosystem.
08:10 Michael: Yeah, it's deeply ingrained but it's also an insider term, right, like you can't walk up to somebody who is just barely familiar with Python and say PyPi and have them know, "Oh yeah of course", right, but you know, that's true I think with lots of the packaging systems, like if you said npm to somebody they wouldn't know, if you said gems they wouldn't know, there are all these systems, they all seem to have poor names but, PyPi ok, that's great. So, I am glad to hear that I've been saying it right, but I've been following Guido's lead and I figured that is a pretty safe lead to follow.
08:50 Donald: Yeah. And, PyPi is what it is, although I do slip up every once in a while and pronounce it wrong, but-
08:58 Michael: Yeah, ok, cool. So, PyPi, one thing I always wondered about, maybe you can explain this for me, if I type pypi.python.org and I hit enter, I get pypi.python.org/pypi, like why does it appear in both ends, is it just like really beloved?
09:15 Donald: There is a long history there, and I don't fully know it because a lot of it comes from before my time, but I believe PyPi was originally deployed under just python.org/pypi and then they eventually moved it to its own domain name, as pypi.python.org, and they just kept the /pypi because that prefix was baked into scripts and it was easier to just change the domain name rather than changing the whole path.
09:47 Michael: Right, so, it's just a compatibility thing.
09:49 Donald: Yeah. And it was actually just recently, I know that because we just recently, within the last year, broke compatibility for people who were still using python.org/pypi, and apparently there were still people who had scripts and whatnot pointing to that, because when we broke that we got people yelling at us for-
10:13 Michael: Yeah, you heard about it?
10:15 Donald: Yes.
10:15 Michael: Well that's the quickest, easiest way to tell if a service or a database or something like that is required, it's like turn it off and see if anyone screams, right?
10:26 Donald: Yes. And with PyPi that's one of the main ways we figure out what's needed or not, because it's never been documented well what you can depend on in PyPi, and sort of the entire history of PyPi has been someone depending on something weird in it, some sort of implementation detail, and suddenly it became the API. So with a lot of things in PyPi we have no idea what people are depending on, so we just change things and pray.
10:57 Michael: Yeah, so, you step very lightly, making changes I suspect?
11:01 Donald: Yeah. Particularly on what we call legacy PyPi, which is what's running now, because it's a very old code base, it's like 15 years old or so, and it's got no tests, it doesn't run locally very easily, like to actually run it locally you have to modify the code and comment things out to get it to start up, so it's fun.
11:26 Michael: Yeah. I am sure this is a really interesting experience in long-lived Python code. So maybe it's a good place to start talking about the history, like PyPi is not as old as Python, right, and so there was a while where Python was just a thing, there was no packaging, right, and then PyPi came along, what's the story there?
11:47 Donald: So, again, this predates me so this is what my understanding is from what people have told me. Python did originally have no packaging story, so people started sort of making their own story with makefiles. Makefiles obviously are not super repeatable, and you had to write a whole new makefile every time you switched to a different project, they don't run by default on Windows, etc, etc. So someone whose name escapes me right now came up with Distutils to replace all these makefiles with a Python script that will run everywhere and will sort of do all that stuff for you.
12:29 Which at the time, great, it was a really big step forward for Python and packaging in general. And I think roughly around that same time, maybe a little bit after, there were several sort of efforts to create sort of a CPAN but for Python. One name that I know of was Vaults of Parnassus; I don't fully know all the names of what their specific implementations were. But then Richard Jones I believe came up with the idea of PyPi and sort of implemented a proof of concept of what PyPi could be, and they deployed it to python.org, and we're still running that today. That proof of concept was originally designed to be replaced swiftly with the real thing.
13:15 Michael: Yeah, I'm always a little suspicious or leery of proof of concepts showing them to people who are like, "oh this is amazing, let's use this, we don't even have to- we're almost done, let's go", right. It can be a shifting soft foundation I suppose.
13:34 Donald: Yes, an important thing to realize about PyPi is code quality; it's not great, it was written 15 years ago with people hacking on it since then, and it predates much of what makes up the modern web stack: Django didn't exist, Pyramid didn't exist, Flask didn't exist, I think WSGI might have existed barely or they were like contemporaries, so the original PyPi didn't use WSGI, it just sort of wrote its own Python handler, I think it used CGI. So a lot of that code is still there, and a lot of what we've done to try and somehow modernize it has been: how do we hack WSGI support into this thing that expects CGI, how do we deploy it using a modern web server instead of this little script that just happens to sit inside. A lot of it is open because it was big, 15 years ago.
14:37 Michael: Right, of course.
14:37 Donald: You know, a lot of its problems really just stem from the fact it was written before we knew how to do good websites.
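For readers who have not met the CGI/WSGI distinction Donald mentions, here is a minimal sketch of what a WSGI application looks like, using only the standard library's `wsgiref`. The `simple_index` function, the `call_app` helper, and the response text are hypothetical illustrations; none of this is taken from PyPi's actual code.

```python
from wsgiref.util import setup_testing_defaults


def simple_index(environ, start_response):
    """Hypothetical WSGI app: a callable taking (environ, start_response)."""
    body = f"You asked for {environ['PATH_INFO']}\n".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app, path):
    """Invoke a WSGI app in-process, the way a server or a test would."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the keys WSGI requires
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_app(simple_index, "/pypi")
```

Under CGI, by contrast, the web server starts a fresh process per request and the script reads environment variables and writes the response to standard output. With WSGI the server simply calls a Python function, which is why retrofitting it onto CGI-era code takes some adapting.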
14:44 Michael: Yeah, what year was that when it was created?
14:44 Donald: I believe it was 2002.
14:50 Michael: Ok, so there were some examples of serious web activity, serious websites, right, that's dot-com, just post dot-com days, but still, Python was not nearly as polished around the web story.
15:06 Donald: Correct.
15:06 Michael: Cool, yeah, ok so that's really interesting. So what's the relationship with EasyInstall these days? EasyInstall lets you install packages, pip lets us install packages, it's all pip these days, right, is there still a reason to use EasyInstall?
15:18 Donald: EasyInstall still exists, it does a few things that pip doesn't do, and pip largely doesn't do them on purpose. EasyInstall supports multi version installs where you have a single Python environment and you can install multiple versions of say requests into that single Python environment, and then at runtime, it will generate a sys.path that has the correct version for whatever thing you are running on that sys.path; pip doesn't do that, which I believe is a good thing, pip prefers to use-
15:58 Michael: Is that because pip really assumes the presence of virtual environments and it's like if that's what you want just create a virtual environment with the different version?
16:04 Donald: Yes, correct, you know, pip says use virtual environments, create explicit environments and install things in there. EasyInstall, while it works with virtual environments, because of the way virtual environments work, sort of says: you declare in your script, through your script wrappers etc, what you depend on, and we'll sort of create a dynamic virtual environment in memory by managing the sys.path. With pip, you've created an explicit named environment, named by your file system path, and then pip will install things flat, just a single version, into that environment.
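To make the "dynamic virtual environment in memory" idea concrete, here is a deliberately simplified sketch of choosing between two installed versions of a module by editing `sys.path` at runtime. It is an illustration only, not how EasyInstall is actually implemented; the `demo_pkg` module and the `make_fake_install` and `activate` helpers are hypothetical names invented for the example.

```python
import sys
import tempfile
from pathlib import Path


def make_fake_install(root: Path, name: str, version: str) -> Path:
    """Create a directory holding one version of a tiny, made-up module."""
    location = root / f"{name}-{version}"
    location.mkdir()
    (location / f"{name}.py").write_text(f"__version__ = {version!r}\n")
    return location


def activate(location: Path) -> None:
    """Put one version's directory at the front of the import search path."""
    sys.path.insert(0, str(location))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Two versions live side by side in the same "environment".
    installs = {v: make_fake_install(root, "demo_pkg", v) for v in ("1.0", "2.0")}

    # The running script declares which version it wants; only that
    # directory is added to sys.path, so that is the one imported.
    activate(installs["2.0"])
    import demo_pkg

    selected_version = demo_pkg.__version__
    sys.path.remove(str(installs["2.0"]))
```

The pip and virtual environment approach avoids this runtime bookkeeping: each environment directory holds exactly one version of each package, so the import path never needs to be computed per script.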
16:49 Michael: That's a cool feature, that you could have multiple versions together, but it seems like it's almost more trouble than it's worth.
16:56 Donald: Yes, so it's certainly an interesting feature, I don't think one is necessarily better than the other, they each have their tradeoffs. We've sort of settled around using virtualenv, the single flat install system, and I don't think there is enough of a reason to go back to the EasyInstall sort of multi version install system, just because- I think everyone sort of figured out how to use virtual environments and yeah, I think trying to change gears now would just be a big disruption for not a whole lot of benefit. But if we had gone that way, I think we would be in a perfectly fine place now, we would just have a different mechanism for doing things.
17:38 Michael: Ok, cool. So what's the relationship between the Python Packaging Authority and the Python Software Foundation?
17:47 Donald: Yeah, so the PyPa is sort of not a real thing, sort of is a real thing, you know, it's not an official organization, we don't have the 501(c)(3) or whatever they are called.
17:59 Michael: So Python packaging authority is like a shadow organization, pulling the strings with Python packaging?
18:07 Donald: Yes. You can think of it that way. It sort of came around because pip was putting itself on GitHub and it needed an organization name, so they just called it PyPi- sorry, PyPa- too many things in my head with P Y P something, I confuse them. So yes, it's called PyPa, and then, with this sort of push in the past couple of years to really standardize around these things and fix things and move forward, we just sort of started to take that name and say ok, we are going to really take this name, it started out as a sort of joke, and we are going to use this for a real thing. And so it's just an umbrella organization where if you are working on something that's packaging in Python you can bring your project into the PyPa. We don't have that many rules; I think the only rule that we really have is that your project has to be governed by the code of conduct, and beyond that, you know, you can run your project however you want etc etc. But it sort of provides a central location for someone who says, ok, I want to work on packaging stuff: go to the PyPa pages and you can see the list of projects. And then, since there is a lot of cross pollination between people who work on these projects, it kind of makes managing permissions and stuff easier. But like funding, we are starting to try and get funding and stuff available for that, so that's largely going through the PSF. So the PSF is sort of the legal entity that we hang our stuff off of, but the PyPa is sort of an unofficial organization, 19:43 [indiscernible] to Python-dev, which manages CPython, but you know, the legal trademarks sort of hang off of the PSF.
19:54 Michael: Right, ok, so since the packaging authority is just like a loose group of people who worked together on packaging, there is no legal component there, like for example, I made a donation to PyPi, when you guys announced like hey we need some supporters and what not, small one but still, to do that I actually went through the PSF website and I sort of donated it to them and then they forwarded on, right, so for things like that, where you need an entity, PSF is kind of there to support you?
20:27 Donald: Yeah, exactly, it frees us up from having to deal with getting a board of directors, the PSF already does that, and I believe we call it fiscal sponsorship or something along those lines, but it's basically they manage being the legal entity behind this stuff, and we manage actually doing the code work and, you know, how it's going to go forward and making these sorts of decisions.
20:27 [music]
20:27 Snap CI is a continuous delivery tool from ThoughtWorks that lets you reliably test and deploy your code through multi stage pipelines in the cloud, without the hassle of managing hardware. Automate and visualize your deployments with ease, and make pushing to production an effortless item on your to do list.
20:27 Snap also supports Docker and in browser debugging and they integrate with AWS and Heroku.
20:27 Thank Snap CI for sponsoring this episode by trying them with no obligation for 30 days by going to snap.ci/talkpython
20:27 [music]
21:55 Michael: So I kind of want to focus on three themes around PyPi: one is looking inside the infrastructure and the traffic and all that, one is about this new version called Warehouse, and then also the funding and support and that kind of thing. So let's start with the traffic and infrastructure. You wrote a cool blog post called "Powering the Python package index", and in there you really laid out a lot of the underlying technology, you talked about some of the bandwidth requirements. I felt like I had bandwidth requirements, until I saw what you guys were doing; you had a comment where you said we had 293 TB of traffic serving 3 billion requests in April, for example. Can you talk a little bit about that? That's pretty darn impressive, like you've got to go talking to the video places, the Vimeos, the YouTubes and the Netflixes, to get more traffic than that, or more bandwidth than that, in some sense, right?
22:55 Donald: Yeah, and the vast bulk of that bandwidth is taken up by the package files themselves, although a not insignificant amount of that is taken up by API requests and stuff. I was actually curious, I looked up May's numbers, and May's numbers are 343 TB and 3 billion and change HTTP requests again-
23:21 Michael: That's really amazing, so it sounds like May had, what is that, like 10% more traffic in terms of bandwidth than it did in April. So does that mean- how do you interpret it, does that mean that we have more popularity, like the popularity and usage of Python is growing? Or does that mean people are spinning up more little tiny VMs and doing pip install requirements.txt more often? Like is this a use case variation or is this an adoption variation you think, as these numbers are going up?
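As a quick check on the two bandwidth figures quoted in this exchange (293 TB for April and 343 TB for May), the month-over-month growth can be worked out directly. The variable names are just for illustration; the result comes out somewhat higher than the rough 10% estimated on air.

```python
# Monthly bandwidth figures as quoted in the conversation, in terabytes.
april_tb = 293
may_tb = 343

# Relative growth from April to May, expressed as a percentage.
growth = (may_tb - april_tb) / april_tb
growth_percent = round(growth * 100, 1)

print(f"April to May bandwidth growth: {growth_percent}%")
```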
23:56 Donald: I actually don't have a whole lot of insight into exactly what it is, it is something that I am trying to get some insight into but it's kind of hard. I will say one thing is that as of pip 6, which was released at the end of 2014 I think, don't quote me on that though, pip sort of aggressively caches locally. So you type pip install requests and you get requests version 3.0 or 2.0 or whatever, and it only downloads a file once per computer basically, using HTTP caching. We have ten-year-long cache headers on those, so it's basically once per ten years per file.
24:41 Michael: It's basically once per machine, right, like probably the machine dies before the cache does.
24:46 Donald: Yeah, unless you blow away the cache or something along those lines. So we're definitely, I believe, not seeing an increase in people doing pip install on their own machine multiple times, or rather should we say we don't know if that is increasing or not, because they download once and then we never see that download again from them.
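The effect Donald describes, where a long cache lifetime means the server sees each file requested only once per machine, can be sketched with a toy cache. This is a hypothetical illustration and not pip's actual caching code; `DownloadCache`, `fake_fetch`, and the example URL are all made up for the sketch.

```python
# Ten years expressed in seconds, ignoring leap days for simplicity.
TEN_YEARS_SECONDS = 10 * 365 * 24 * 60 * 60


class DownloadCache:
    """Toy local cache that honors a max-age returned by the server."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._entries = {}  # url -> (expires_at, content)

    def get(self, url, now):
        entry = self._entries.get(url)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh, so no network request is made
        content, max_age = self._fetch(url)
        self._entries[url] = (now + max_age, content)
        return content


requests_seen = []


def fake_fetch(url):
    """Stand-in for the server: records each request it actually receives."""
    requests_seen.append(url)
    return b"package bytes", TEN_YEARS_SECONDS


cache = DownloadCache(fake_fetch)
url = "https://example.invalid/packages/demo-1.0.tar.gz"
cache.get(url, now=0)
cache.get(url, now=365 * 24 * 60 * 60)  # one year later, still cached
```

From the server's side, only the first call shows up in the logs, which is why repeat installs on the same machine are invisible in the traffic numbers.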
25:10 Michael: Right. It's probably not reflected in those numbers, yeah.
25:13 Donald: Yeah, so this is going to be either new users or new machines, cloud machines. I think probably more people switching to cloud based workflows is having an impact on this, because each time you bring up a new cloud machine, Docker containers, yes, each time you bring up a new Docker container or new cloud machine, you start over with a fresh cache and it downloads again. I think people are doing CI more, and particularly with things like Travis and such, unless you go out of your way to cache things between runs, Travis again gives you a brand new cache each time, so I think a lot of it is things like that.
25:51 Michael: Yeah, I use Snap CI and every time I check in something it definitely pip installs a bunch of stuff. I am not sure if it has the cache populated before that or not, but yeah, you are right, every check in in some sense triggers that kind of behavior.
26:07 Donald: Yeah. And plus I do think we are- I don't know if Python itself is growing in users, but I think pip is, particularly since there's been a lot more push from the Python docs side of things, saying hey, here is the thing that you can use to install other packages. I think particularly for Python 2, that's 2.6 and 2.7, they are getting kind of long in the tooth, and people are wanting some of the new features from Python 3 without actually switching to Python 3, so they are installing backports and stuff, or they are reaching out for things that aren't included in the standard library much more than they used to. I also think- previously PyPi wasn't very reliable, it went down fairly regularly, sometimes even in the middle of file downloads, and so in the past 3 to 5 years I believe it's gotten to a point where you can pip install something and be pretty confident it will go and download something and install it. I think just usability overall has made people more willing to use PyPi than they were in the past.
27:15 Michael: Yeah, and kudos to you guys for that, right?
27:18 Donald: Yeah, although I mean, I admit that a vast amount of that has been Fastly. I say Fastly is our secret scaling sauce, because we really would not be able to do nearly the amount of traffic with the skeleton crew we have without 27:39 [indiscernible] CDN.
27:39 Michael: Right, you guys don't have like a massive data center in San Antonio or something that's like in a bunker that you manage, you are pushing this to the cloud like all the modern companies, right?
27:48 Donald: Yes. And where PyPi legacy actually runs, we have three web nodes for all of that, we have a Heroku database server and we store our files in S3, and that's really about it for the infrastructure for PyPi itself.
28:08 Michael: Ok, excellent, and can you give us a sense of what it costs to run PyPi?
28:15 Donald: Yes, so for us the cost is zero.
28:18 Michael: Besides the people's time, but like- right of course.
28:23 Donald: Yeah, because we are lucky to have all these companies donate services, you know, I think I counted that we are somewhere around 35 000 dollars a month. Some of these numbers are a little fuzzy because like Fastly the billing numbers we have are for all of our use of Fastly in Python.
28:40 Michael: Right, but you probably represent the vast majority of traffic-
28:47 Donald: Yes, in May we had 340 TB for PyPi; for Fastly in general we had 399 TB, so our Fastly bill in May was 33 000 dollars. We have six thousand some dollars in Rackspace for all of Python, where we have DNS and such, so not counting people time we are roughly 35 000, maybe a little bit more, a month for just PyPi.
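Two quick calculations put the quoted figures in perspective: PyPi's share of the PSF's total Fastly traffic in May, and what roughly 35 000 dollars a month adds up to over a year. The numbers are the ones stated in the conversation; the variable names are illustrative, and the annual figure simply assumes the monthly estimate holds for twelve months.

```python
# Figures as quoted in the conversation.
pypi_tb = 340              # PyPi traffic through Fastly in May, in TB
all_fastly_tb = 399        # all Python-related Fastly traffic in May, in TB
monthly_cost_usd = 35_000  # rough monthly value of donated services for PyPi

# PyPi's share of the total Fastly bandwidth, as a percentage.
pypi_traffic_share = round(pypi_tb / all_fastly_tb * 100, 1)

# Assumption: the monthly estimate stays flat across the year.
annual_cost_usd = monthly_cost_usd * 12

print(f"PyPi share of Fastly traffic: {pypi_traffic_share}%")
print(f"Approximate annual cost: ${annual_cost_usd:,}")
```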
29:18 Michael: Ok. This is not something that you could just run if you wanted to, right, this takes the support of the community and companies like HP that are really backing it, and Fastly. Because there is no person that is going to go you know, I really believe in this project, so here is 35 000 dollars a month to pay for bandwidth.
29:39 Donald: Yeah, you know, I mean-
29:41 Michael: I guess there could be people [laugh]
29:44 Donald: Maybe you'll find your Bill Gates or someone 29:47 [indiscernible] almost 400 000 dollars a year, that's more than most people make.
29:51 Michael: Outside that, yeah. Let's talk about the people involved, so we know that HP is basically supporting you to do what needs to be done for Python packaging, mostly around PyPi but in the general sense. How many people work on these projects and how many people would you say are responsible for keeping pip install a thing, a possibility?
30:17 Donald: Ernest Durbin helps a lot with PyPi itself. He is not paid for that, so he is completely a volunteer, and he is sort of the ops side of things. I do a little bit of ops, I am not great at it, so he does a lot of the operations stuff and he is super helpful with that. Richard Jones does things on a volunteer basis also; he has largely stepped back lately, just because his time has been taken up by other things, but he still comes around and helps with support requests and stuff, which I don't have the time to do. Then we start getting to things like setuptools, which you need for pip install to do source dists; that's largely Jason Coombs, @Jaraco, that's largely his baby now. On pip as well, there is Paul Moore, and Markus is there, and a few other people, you know. But their time is limited because they are doing it in their spare time, which I do in my spare time as well on top of the HP time, but most people outside of myself have a lot more limitations on the time they can spend on it, just because they are completely volunteer based.
31:39 Michael: Yeah, so just the handful of people full time. And, everyone else is just a little here and a little there as they can. Do you feel like PyPi is underfunded?
31:52 Donald: Yeah, I do believe PyPi is underfunded, you know, it's sort of a tragedy of the 32:01 [indiscernible]: it's used by a lot of people, as evidenced by our traffic numbers, they are using it a lot. While we do get support from a number of companies, you know, realistically, less than 10 companies support us, and I am pretty sure more than that use PyPi.
32:24 Michael: They could be just installing a lot of stuff, of course. I mean, everybody uses it, it's absolutely one of the foundational things, right, it is the thing that facilitates "batteries included" in the broad sense of Python, right.
32:41 Donald: Yeah, I absolutely believe so. It's debatable, but you know, I think it's one of the most important things that the PSF infrastructure does provide, and certainly when it goes down people are very quick to notice that. You know, I get notifications through Twitter and email and IRC before our monitoring even knows that it's down. People are telling me it's down. A lot of people depend on it.
33:23 Michael: Is it stressful to you to be responsible for it?
33:25 Donald: So there is definitely stress, over the years I am getting better at dealing with it. A lot of people respond to these things where when it's working people don't think about it and when it's not working, they are out to tell you it's not working, and I've never actually been technically on call for PyPi. Ernest had been technically on call but I say it's basically impossible for me to not be on call, unless I completely unplug myself from the internet.
34:03 Michael: You have to go into the forest or skyriding airplanes in PyPi.
34:11 Donald: Yes, because you know, I am so publicly known and associated with PyPi that people are reaching out to me as soon as it's down, and to their credit, they are mostly trying to be helpful, to let me know it's down, but you know, it soon becomes a flood of communication if it suffers serious downtime. We've sort of engineered it so that pip install very very rarely goes down, but upload and that stuff can go down more often than that, though that affects a much smaller percentage of the people. So that doesn't get as bad.
34:49 Michael: Absolutely.
34:49 [music]
34:49 I am really excited to tell you about the new sponsor of the show- Rollbar. One of the frustrating things about being a developer is dealing with errors, relying on users to report errors, digging through log files trying to debug issues, or a million alerts just flooding your inbox and ruining your day.
34:49 With Rollbar's full stack error monitoring you get the context, insights and control you need to find and fix bugs faster. It's easy to install, start tracking production errors and deployments in eight minutes or less. Rollbar works with all the major languages and frameworks including the Python ones- Django, Flask, Pyramid, as well as Ruby, Javascript, Node, iOS and Android. You can integrate Rollbar into your existing workflow, send error alerts to Slack or HipChat or automatically create new issues in Jira, Pivotal Tracker and lots more.
34:49 We have a special offer for Talk Python listeners- visit rollbar.com/talkpythontome and get the bootstrap plan for free for 90 days, that's 300 000 errors tracked for free. But hey, just between you and me, I really hope you don't have that many errors. Rollbar is loved by developers at awesome companies like Heroku, Twilio, Kayak, Instacart, Zendesk, Twitch and more. Give them a try today, go to Rollbar.com/talkpythontome.
34:49 [music]
36:25 Michael: If we had more people working on it, and more people whose time was more seriously dedicated to it, like yours is, we could probably do some amazing stuff, right. So there is a way to donate to this, you guys actually have donate.pypi.io as like the landing domain, because it redirects, and that allows people to donate. But if people are working for a company and they are like, yeah, our whole company that does 100 million dollars in revenue depends upon this thing, and we've actually not even helped support it, maybe we should, how would they get involved? Maybe that would be the biggest bang for the buck, if we could get some big companies to step up?
37:08 Donald: Yeah, so obviously they can donate, if that's the way that they want to get involved; donations are easy and they are a tax write-off in the US. But if they want to donate developer time, one of the easiest things to do would be to contribute to Warehouse, which is on GitHub, at github.com/pypa/warehouse. That is deployed now to pypi.io, that's where it's going to be in the future. My big push right now is trying to get Warehouse to the point where we can say ok, this is running well enough, let's start directing people here by default. Because PyPi is sort of an old code base, it's slowly falling apart, uploads get sort of a base 10% error rate right now, it's just falling apart, and so it's a full time job to keep it from falling apart, which doesn't give me any time to work on other things, so I've sort of-
38:08 Michael: All your fingers are plugging holes in the dam rather than like building a new dam. Yeah?
38:13 Donald: Yes. So I have sort of ignored some of the holes, willfully, to try and free up time to get Warehouse ready to launch, because the idea is that hopefully Warehouse will get slotted into place, and a lot of things will be better. I am sure it's going to break things for a lot of people-
38:32 Michael: But it will become more joyous to work on, and maybe you'll get more contributors using stuff that people understand, right?
38:38 Donald: Yeah. Absolutely, right. PyPI legacy is largely two files, both over 4,000 lines long, and I think I am the only person left that actually understands it. Richard hasn't touched it recently enough, and things have changed enough, that he would have to get back up to speed. So far I have personally had a 100% failure rate on getting new people involved in that, just because they take one look at the code and then they just kind of disappear. It's not a fun code base to work on, particularly because you can't run it locally, so a lot of times changes are: make a change, push it to production, or maybe to Test PyPI, which is a sort of staging-sandbox instance we have, depending on what the problem is, and then just sort of pray that 39:28 [indiscernible] doesn't start yelling at us.
39:30 Michael: Yeah. All right, so Warehouse is a huge topic that people are interested in, and I want to get to that, but before we move off this topic: you also wrote another blog post called A Year of PyPI Downloads, and there is a lot of insight that can be gleaned from those numbers. Can you talk about that really quick?
39:47 Donald: Yes, in January of 2014 we started saving and archiving all our download logs. In 2014, and again in 2015 I believe it was, I pulled these all down and crunched some numbers on them and pulled out what I felt were interesting numbers, largely around what versions of Python are being used, what versions of packaging tools are being used, things like that. And those blog posts were very well received. So what I have done since then is improve our metrics stream to the point where now every single download generates a row in a BigQuery database, which has got all sorts of information like what version of Python downloaded it, what tool downloaded it, what country it came from, stuff like that. As of a month ago, that data is completely public for anyone to go in and query, as long as they have a Google account.
40:52 Michael: Oh nice. Give me the link to where that is and I'll put it in the show notes.
40:58 Donald: So I am not really that great at data visualization or pulling information from data; I tried to do my best with my Year of PyPI Downloads posts. But I am hoping that by making that completely public, people can make data-driven decisions about what versions of Python they support, because you can very easily use an SQL-like language to ask what versions of Python are downloading x package over some period of time. I am hoping that people who are actually good at data visualization and pulling meaning from data can take a look at it and come up with some interesting information, particularly maybe as it relates to correlating that with GitHub data or Bitbucket data or some other kind of data.
41:47 Michael: Yeah, that sounds really cool, and if you give me a link to the details I'll be sure to put it in the show notes.
41:51 Donald: Yep, it's not greatly documented right now, it's just basically a post to Distutils-SIG telling people about it. The historical data is not all in there yet; the data starts, I want to say, in late March, and I am backfilling from January of 2016 right now. Then I have to backfill back beyond that, but that takes a bit more effort because the logs aren't in the correct format. I think I have got to come up with something to 42:24 [indiscernible] these logs into the correct format.
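The kind of question Donald describes (which Python versions are downloading a given package) is a plain group-and-count query. The sketch below is only an illustration: it runs against an in-memory SQLite table rather than BigQuery, and the table name `downloads`, its columns, and the sample rows are all hypothetical, not the schema of the real public dataset.

```python
import sqlite3

# Hypothetical, simplified schema: one row per download.
# The real dataset's table and column names may differ.
SAMPLE_ROWS = [
    ("requests", "2.7", "pip", "US"),
    ("requests", "2.7", "pip", "DE"),
    ("requests", "3.5", "pip", "US"),
    ("requests", "3.4", "setuptools", "FR"),
    ("flask", "3.5", "pip", "US"),
]


def build_demo_db():
    """Create an in-memory database holding the made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE downloads ("
        "project TEXT, python_version TEXT, installer TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO downloads VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def downloads_by_python_version(conn, project):
    """Return (python_version, count) pairs for one project, largest first."""
    cursor = conn.execute(
        """
        SELECT python_version, COUNT(*) AS downloads
        FROM downloads
        WHERE project = ?
        GROUP BY python_version
        ORDER BY downloads DESC, python_version ASC
        """,
        (project,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for version, count in downloads_by_python_version(demo, "requests"):
        print(version, count)
```

Restricting the query to a time window, as Donald mentions, would only need a timestamp column and an extra `WHERE` condition.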
42:27 Michael: Yeah, ok, that sounds really promising. There are a lot of data scientists and data visualization professionals that listen to my show, and when I put data out they seem to do amazing stuff with it in a surprisingly short amount of time, so maybe someone will come up with something cool for that. Let's talk about Warehouse, because that's a really interesting project. I sent some messages out on Twitter asking, hey, what should I ask Donald while we are talking, and everybody came back with some variation of talking about Warehouse. So, Warehouse can be found at pypi.io; what is it?
43:03 Donald: Yeah, so pypi.io is the production-ish (I say ish because we are not monitoring it or anything) deployment of Warehouse, which is PyPI 2.0. It's backed by the same PostgreSQL database, the same S3 instance, etc., so anything that changes on PyPI changes on Warehouse, and the reverse is true.
43:28 Michael: Right, so they are watching the same data stores; it's more the front end and API implementation, right?
43:34 Donald: Correct. So right now, the read-only portions of it, which are 90-some percent of our traffic, are pretty much done; there are a few 43:44 UI or CFU to do stuff that we have to either finish or comment out before we make it official. A lot of the author UI stuff is not done yet.
43:56 Michael: Ok. The client side, the read-only UI bit, is really nice. I feel like I am going from SourceForge to GitHub in terms of the experience; it's much better than the current story, so I think that's going to be delightful. That's cool.
44:15 Donald: Yeah, and Nicole Harris has done the design for that, and I think she has done a phenomenal job on it so far. Particularly going from what we had to what this is, it really does feel like you are jumping forward an era or two to what modern design looks like.
44:37 Michael: Yeah, absolutely.
44:41 Donald: And she has put a big focus on trying to get the usability of PyPI to be a whole lot nicer, surfacing the information people care about and hiding, or maybe eliminating completely, the information people really don't care about, that's just confusing, or that nobody really needs to know.
45:01 Michael: Yeah. Nice. So, can you talk a little bit about what frameworks and internals you used to create it?
45:07 Donald: Yes, so Warehouse is written in Pyramid, now it is. Warehouse has got sort of a sordid history where it started out using just Werkzeug and my own framework, then went to Flask, then I went to Django, and now it has finally settled on Pyramid-
45:27 Michael: Why did you make those changes, like what kept moving you along?
45:32 Donald: Yeah, so one of the things I wanted to do with Warehouse was to have 100% test coverage across the board. Coming from PyPI, where we have 0 test coverage, one of the things that was very painful for me when making any change was figuring out what the impact was going to be, where I was going to break code that I thought was unrelated but was really related, so I really wanted to make sure we had great test coverage. Unfortunately, a lot of the web frameworks out there tend to use a fair amount of globals. I don't remember whether thread locals technically count as globals or not, but I call them globals too-
46:11 Michael: I think they are globals.
46:12 Donald: Yeah. And that makes it more difficult to test things, so I created my own framework on top of Werkzeug that was heavily influenced by Gary Bernhardt's Boundaries talk, where there were very few of these sort of bag-of-items objects, like a user class that was a data model and had all sorts of things you could get to hanging off of it. That makes it hard to test because you have to pass this huge interface around; it's sort of like a miniature global. To actually test things you have to provide an object that implements all those things, or your tests aren't really testing things. And so we didn't have an ORM; it was using SQLAlchemy's expression layer to some degree, and it was also using just raw SQL in some places. One of the things that I discovered doing that was that I was inventing a lot of things, which, while fun, was not necessarily the best use of my time. It brought us back to the same problem we have with PyPI, where other people found it hard to contribute because it was using something that, while it fit my head space quite well, didn't necessarily fit other people's head spaces, and-
47:35 Michael: Right, nobody had experience with it, right?
47:36 Donald: Yeah, and a lot of decisions were bottlenecked on me, because it was, how do we do x thing in this completely custom framework? Nobody knows except Donald, because Donald has to invent it. So a lot of things were bottlenecked on me, and Richard and I sort of talked about it and we decided, you know, we need to move to something more standard. I can't recall exactly if it was Flask or Django that came first. I know it was ported to Flask, and I wasn't really too fond of that because one of Flask's API things was thread locals as part of the API, and my experience with that is it becomes hard to then do anything in Flask without adding increasing amounts of thread locals, and so I kind of mix that pledged in my 48:26 locals.
48:26 I went to Django, and Django is a great code base; I've used Django a lot, I got started with Django. I started porting to that and I discovered that the Django ORM is not really powerful enough to handle all of the things that PyPI does in its database that it's sort of accumulated over time. We have tables with composite keys, with composite foreign keys, we have tables without any primary keys, we have a number of things. I did get the user things ported over to Django, and then I just sort of gave up and said, you know, this is too much work to get this slotted into the shape that Django wants it to be, and I said I will just have to use SQLAlchemy for this, which is another great ORM.
49:16 But then, once you throw away the Django ORM, you lose a lot of the power of Django. I also wasn't using the Django template language, because prior to all of this happening, before I was a PyPI administrator, I had worked on another alternative front end to PyPI called crate.io, which was written in Django. I used the Django templating language in that, and it became a serious bottleneck because of how big some of our HTML pages were to list all the packages. So we used Jinja2. So then I am suddenly looking at it: ok, we have Django, but it's kind of hard to fit our database into Django's ORM, so we can't use that.
50:00 The DTL was too slow at the time (I don't know if it is now) to really use, so we'd sort of pared Django down to a glorified request router, and we'd thrown away a lot of the power of Django; there are no third-party apps, there is no admin. So then I took a look at Pyramid, and Pyramid had a lot of the things I liked about Django. There were really no thread locals; there is a thread local, but it's optional whether you use it or not. And it had enough flexibility in it to use whatever tools we wanted to, so I can bring in SQLAlchemy and use that, and I can change things around to suit my purposes better than I could with Django.
50:44 Now the flip side of that is it doesn't do as much out of the box; you have to configure it and make it do what you want it to do. But given the long history of PyPI and all the weird things that have grown over time, that fit what we needed a lot more nicely than Django does, because while Django is great, once you get out of the Django workflow you start fighting the framework a lot more than you do with Pyramid.
51:13 Michael: Yeah, ok, that makes sense to me. The testing part with Flask I didn't know about, but I can imagine the stuff with Django. I think you and I settled on, for very different projects, the same technology stack more or less; for all my web properties I am using SQLAlchemy and Pyramid and whatnot, so yeah, very interesting. Let's see, we have just a few minutes left and I have a couple of questions from the listeners and maybe one more thing I wanted to make sure we touch on. Mahmoud Hashemi, who has also been a guest on the show, asked if real-time download counters are coming back?
51:48 Donald: Yes, they are. So the old metrics stack sort of fell apart and died because it was hacked onto the side of PyPI, like a lot of PyPI has been. The new metrics stack, which is based on BigQuery and such, is designed to allow us to bring that back, as well as to give us this archival ability to query all sorts of things. I disabled the counters just because they were zero all the time, because the thing had broken and I didn't have enough time to fix it, but we are planning- I am bringing those back, and hopefully they will actually work this time, and it will be nice.
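For readers unfamiliar with the schema features mentioned here, the sketch below shows a composite primary key, a composite foreign key, and a table with no primary key at all. It uses SQLite through the standard library purely for illustration; the `releases` and `release_files` tables are hypothetical and are not PyPI's actual schema.

```python
import sqlite3


def build_demo_db():
    """Create two hypothetical tables linked by a two-column key."""
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign key enforcement off unless asked.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        """
        CREATE TABLE releases (
            name TEXT NOT NULL,
            version TEXT NOT NULL,
            PRIMARY KEY (name, version)  -- composite primary key
        )
        """
    )
    conn.execute(
        """
        CREATE TABLE release_files (  -- note: no primary key at all
            name TEXT NOT NULL,
            version TEXT NOT NULL,
            filename TEXT NOT NULL,
            FOREIGN KEY (name, version)  -- composite foreign key
                REFERENCES releases (name, version)
        )
        """
    )
    return conn


def add_file(conn, name, version, filename):
    """Insert a file row; return False if no matching release exists."""
    try:
        conn.execute(
            "INSERT INTO release_files VALUES (?, ?, ?)",
            (name, version, filename),
        )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    db = build_demo_db()
    db.execute("INSERT INTO releases VALUES (?, ?)", ("warehouse", "1.0"))
    print(add_file(db, "warehouse", "1.0", "warehouse-1.0.tar.gz"))
    print(add_file(db, "warehouse", "2.0", "warehouse-2.0.tar.gz"))
```

A row is identified here by the pair (name, version) rather than by a single id column, which is the shape Donald says was hard to fit into the Django ORM at the time.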
52:27 Michael: Yeah, will they re-appear in the Warehouse time frame or will you bring them back to the legacy version?
52:30 Donald: It will probably be in the Warehouse timeframe, and it might not be until after the launch of Warehouse unless someone feels like coming around and figuring out how to do all that beforehand. I'm new to BigQuery; so far I really enjoy using it, it seems to work really great, but one of its constraints is that queries can take a couple of seconds to a minute or two to run, which is fine if you are just looking at it for data archival, but not great to do in the middle of a request-response cycle.
53:05 Michael: [laugh] Yeah, that's for sure.
53:06 Donald: So we need to do some more work around figuring out, ok, how do we take this data in BigQuery and put it into a format that we can look at in Warehouse in the request-response cycle, and that just takes someone to sit down and figure that out. I just haven't had the time to do it.
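One common way to approach the problem described here is to run the slow query on a schedule, outside any web request, and have request handlers read only the precomputed result. The sketch below shows that general pattern; it is not how Warehouse does or will do it, and `DownloadCountCache` and `fake_slow_query` are hypothetical names.

```python
import time


class DownloadCountCache:
    """Holds precomputed per-project counts for fast lookups."""

    def __init__(self):
        self._counts = {}
        self.refreshed_at = None  # time of the last successful refresh

    def refresh(self, run_slow_query):
        """Meant for a background job: may take seconds or minutes."""
        self._counts = dict(run_slow_query())
        self.refreshed_at = time.time()

    def get(self, project):
        """Meant for request handlers: a dictionary lookup, no query."""
        return self._counts.get(project, 0)


def fake_slow_query():
    """Stand-in for an analytics query; the numbers are made up."""
    return [("requests", 1200), ("flask", 300)]


if __name__ == "__main__":
    cache = DownloadCountCache()
    cache.refresh(fake_slow_query)  # done periodically, not per request
    print(cache.get("requests"))
    print(cache.get("unknown-project"))
```

The trade-off is freshness: the counts are only as current as the last refresh, which is why `refreshed_at` is kept alongside the data.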
53:24 Michael: Sure. So, final thing on Warehouse Nicola Kentar asks what's the timeframe for shipping that?
53:32 Donald: So I have given a few dates before and we have passed them all so far. I haven't totally been great at estimating how long until it's ready to go. I think we are pretty close though. I just recently told people on Distutils-SIG to start switching their uploads to using Warehouse, largely because we still have a 10% standard failure rate on uploads to legacy, and so far everyone who has done that said it's worked great. We just recently committed a change to Twine, which is a tool to replace setup.py upload: we switched the default in the master branch to using Warehouse. It's not been released yet, but I am hoping that will be released in the next couple of weeks, and if that works great, hopefully we'll get CPython itself switched to Warehouse for uploads, and then that will hopefully propagate out and people will use that, and that will solve the big problem. As far as actually redirecting the old domain, I think it's soon. I don't have a good target for exactly when, but I am really hoping the release will be in 2016, because I am tired of the old code base; I want it to die a thousand deaths.
54:51 Michael: Yeah, I can imagine. That sounds like a really cool new version and it sounds like it's going to be great for everybody when it's working and it's the default. Certainly my playing around with it, you know, it seems like a nice place to be. So, I have a bunch of other questions I'd love to talk to you about, but we are just running out of time so maybe we'll have to leave it there. Maybe when you guys do ship it, maybe we can come back and do some kind of celebratory show.
55:20 Donald: Sure.
55:21 Michael: To celebrate the actual flipping of the DNS, or the redirect. Cool, so a couple of questions I always ask people before the end of the show, and I think this question is particularly interesting for you, but I ask this of all my guests: there are over 80,000 packages on PyPI these days, and like we talked about, there are so many amazing little packages people can grab to make their programs awesome. Which one do you think is amazing that you'd like to call attention to, that's maybe not requests, which everybody knows, you know, something like that?
55:56 Donald: Yeah, I would have to say bpython. It's sort of an alternative REPL for Python; it's got syntax highlighting, it's got autocomplete as you type things out, it really works well. I install it in my virtualenvs because it just works nicely. I use it a lot.
56:21 Michael: Ok, that's awesome. And when you write Python code, what editor do you use?
56:26 Donald: So lately I've been using Atom, I have been trying it out for the past 3 or 3 months, so far I've liked it. Previously I was using Sublime Text. Lately it's been Atom.
56:37 Michael: Ok, cool, yeah. It seems like Sublime and Atom are working at the same level, they appeal to a similar group of people, that's cool. All right, any final call to action while you've got the mic?
56:50 Donald: Yeah, I would love it if anyone who uses PyPi could come and contribute to that or to pip or you know, talk to your companies, see if they can contribute in developers or even just some money, anything helps and hopefully we can keep moving things forward and everyone would be happy, and stop yelling at me when things go down.
57:14 Michael: That would be amazing, and I just want to second that as well: try to convince your companies, if they depend heavily on Python, to contribute just a little bit. Imagine a world where pip install basically didn't work and you had to go piece that all back together; that is something we don't want to see happen, and it would be really great if we could make it a much more stable, supported, active thing instead of putting all the weight on you, Donald, and the few other guys that we mentioned, right?
57:40 Donald: Yeah. And remember, donations are tax deductible in the US.
57:46 Michael: Awesome. So there are just so many more things we could talk about, but we are going to have to leave it here just for the sake of time. So Donald, thanks for being on the show, it was great to talk to you.
57:54 Donald: Yeah, thanks for having me.
57:55 Michael: Yeah, bye bye.
57:55 This has been another episode of Talk Python To Me.
57:55 Today's guest was Donald Stufft and this episode has been sponsored by Snap CI and Rollbar. Thank you guys for supporting the show!
57:55 Snap CI is modern continuous integration and delivery. Build, test, and deploy your code directly from GitHub, all in your browser with debugging, Docker, and parallelism included. Try them for free at snap.ci/talkpython
57:55 Are you or a colleague trying to learn Python? Have you tried books and videos that left you bored by just covering topics point-by-point? Check out my online course Python Jumpstart by Building 10 Apps at talkpython.fm/course to experience a more engaging way to learn Python. If you're looking for something a little more advanced, try my Write Pythonic Code course at talkpython.fm/pythonic
57:55 You can find the links from the show at talkpython.fm/episodes/show/64
57:55 Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, Google Play feed at /play and direct RSS feed at /rss on talkpython.fm.
57:55 Our theme music is Developers Developers Developers by Cory Smith, who goes by Smixx. You can hear the entire song at talkpython.fm/music.
57:55 This is your host, Michael Kennedy. Thanks for listening!
57:55 Smixx, take us out of here.