#16: Python at Netflix Transcript

Recorded on Wednesday, Jun 10, 2015.

00:00 Right now there is a chaos monkey running through AWS knocking over Netflix servers. But don't be alarmed! It's all part of the plan. This is Talk Python to Me with Roy Rapoport, recorded Wednesday, June 10th, 2015.

00:00 [music]

00:00 Welcome to Talk Python to Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpythontome.com and follow us on Twitter where we are @talkpython.

00:00 This episode we'll be talking to Roy Rapoport about Python at Netflix.

00:00 This episode is brought to you by Codeship and Hired. Thank them for supporting the show on Twitter via @codeship and @hired_hq.

00:00 The topic of this show is Python and cloud computing at Netflix. A while ago I had the amazing opportunity of teaching a Python course to some of the developers and data scientists there. I want to give a quick shoutout to my past students. Thanks for listening, guys. Netflix is an amazing place, and after this show with Roy I am even more convinced of this.

00:00 You're going to love this interview. And with that, let me introduce Roy.

00:00 Roy is currently managing the Insight Engineering organization at Netflix, where they write the powerful telemetry platform, and the graphing, alerting, and analytics systems on top of it, that allow Netflix to have complete real-time visibility into its operations and systems: in the cloud, on customer devices, and anywhere else Netflix operates.

01:57 Roy, welcome to the show.

02:01 Thank you so much, I am happy to be here.

02:02 I'm really beside myself with excitement to talk to you about some of the stuff that you guys are doing at Netflix. So, thanks so much. I think when people think about software projects and deployment and managing large software at scale, there is almost no other company that comes to mind doing the kinds of things that you are doing at Netflix. I mean, there are a few, Google and maybe someone else, I am not sure, but what you guys are doing is really, I think, pushing the limits of what you can do with software in the cloud these days. So I am really excited to talk about that.

02:40 Yeah, I actually love talking about this stuff too, so... it works.

02:44 That's great. So, since this is listened to all over the world, I know Netflix is absolutely a household name in America. But maybe you could just say really briefly, for folks who might not know what Netflix is, what you guys do.

02:57 Oh sure, I think we are becoming to some degree a household name in a bunch of other countries as well. We now serve somewhere between, I think, 45 and 50 countries in the world. And we are a subscription video streaming service. So the idea is you pay us a small amount of money (in the US I think it is about $8 a month) and you get unlimited access to a pretty wide catalog of TV shows and movies. And you know, you watch as much of that as you want in any given month.

03:29 That's fantastic. Yeah, you just pay $8 and basically all movies and TV shows, not all, but almost all the ones that matter, are yours to just watch on demand. It's great. You know, at my house, I have 3 kids, and about 6 or 7 years ago we just decided that having television ads constantly on was not really helpful. We had 600 channels, 2 of them were good. And we just literally canceled our cable, and said, "Look, we are going with Netflix, and a few other things here and there, like YouTube and so on," and it's made an actual real difference in my kids' lives. I think you go and ask them, "What do you want for Christmas?" and they are like, "I'm not really sure, I don't know." They are just not overwhelmed with all these ads and commercials, and I think Netflix is really a positive force.

04:21 Thank you, and thanks by the way for paying my salary.

04:25 Yeah, I'm happy to give a very small part. So, speaking of salary, what do you get paid for at Netflix?

04:33 Oh boy, so my job at Netflix is manager of Insight Engineering. Insight Engineering at Netflix is a software development group responsible for building real-time operational insight systems. So, originally we thought of it as monitoring, but really the goal is to help people figure out what is going on and what they should do about it, and in the best cases actually automate that sort of process of analysis, discovery, decision, and then application.

05:00 That's cool. So, who would you consider like your internal client, if you will, like are you helping the developers, are you helping DevOps, are you helping like the business folks that decide, "Hey, we are working with this big data and we are trying to figure out what the next data driven move we are going to create is," what are you working with there, who rather?

05:19 Good question. Well, it is worth knowing that we don't really have DevOps people. We have developers. And every developer at Netflix writes code, tests code, and then deploys it into production and is responsible for it working well, and at 2 o'clock in the morning, if it doesn't work well, that developer wakes up to deal with it. So every developer at Netflix is our customer.

05:40 Now, the interesting thing that ended up happening was that, while we are focused on the real-time and operational domain, so we don't typically work to serve the business people who want longer-term insight and sort of big data queries, all of our data ends up being stored in Hive and ends up being really useful for a whole bunch of that sort of strategic-view analysis that we hadn't originally expected it to be useful for.

06:07 That is excellent. I know Hive is related to Hadoop. Is Hive kind of like the front end to your Hadoop cluster?

06:13 Yeah, pretty much.

06:15 Yeah, that's the simple uninitiated version. Awesome. So, I've been watching you guys for a while, you know, just from a software architecture perspective, and I think that what you are doing is really interesting. Can you speak a little bit of like the scale of how you guys are using the cloud, what cloud are you using, that kind of stuff?

06:38 Sure, so when your device talks to us to figure out what movies it could watch, and lets you browse our catalog, it talks to a whole bunch of systems in the AWS (Amazon Web Services) cloud. I think the last official public number that we disclosed is that we run more than 15,000 servers in the AWS cloud, across several production regions. When you actually stream the movie, that actually comes from our in-house content distribution network, which we call the Open Connect network, and so we've actually deployed content caches in a whole bunch of different internet peering points, and in some cases inside ISPs, to minimize the bandwidth that they need to dedicate for Netflix. So, yeah, so that's the number of servers in the cloud. And I'm sure we have thousands of Open Connect servers, but I'm not exactly sure about the number.

07:47 Right, ok. You guys have kind of been at the center of the whole net neutrality thing, as well because it is so critical to get that bandwidth to so many places, right?

07:57 Yes, net neutrality is something that is near to Netflix's heart.

08:02 I'm sure you paid a little attention to it, at least as an organization.

08:05 Yes.

08:08 That's really cool. I heard some really interesting statistics about how much bandwidth as a percentage of the internet you guys represent. Do you know that number? Can you speak to it?

08:18 I think the last open number is that- was it Widevine who reported this? The last number that I heard was something on the order of 33% of internet traffic, theoretically.

08:36 Yeah, that's amazing, I mean, just stop and think about that. There are millions of websites. And you guys represent a third of all bandwidth, that's amazing. But, like I said, you know, you are more or less becoming the new television for the world, right?

08:47 Yes, and actually, I'm sorry, that was Sandvine, and in November they said we had topped 35%.

08:54 Wow.

08:55 Yeah.

08:55 So all this stuff you are talking about, scaling, and CDNs, and all this extra work is really needed.

09:06 Yeah, it's interesting how much doing things well ends up working well for you at very large scale. And how your ability to sort of take shortcuts that probably would be a pretty sane approach at a smaller scale ends up not being very much the right approach when you are looking at our scale. We end up being a lot more, I think, allergic to technical debt than we would be if we were a lot smaller.

09:36 So, I can certainly see the more maintainable and sort of understandable the code is, because it has to be at such a scale, that you can't have these little hidden problems that people don't understand or don't want to go fix because you have to fix it, right? I can see how Python would really help in that space because it is a simpler language, it's easy to understand, it has these great libraries. What role does that play?

10:05 Well, I think this is where we get into potentially issues of sort of personal preferences, almost. I have some strong opinions about Python; it's quite literally my favorite language, it's been my favorite language since I started working in Python in 2004. I think there is a whole bunch of other people at Netflix who do great work with other languages, many of them JVM languages, whether it is Java, Clojure, or Scala, and of course, we also do a bunch of work with JavaScript. I have been privileged to see some really wonderfully written Python code, I've been privileged to see some really nicely written Java code, and I think we've all frankly seen some pretty terrible code irrespective of the language.

10:05 [music]

10:05 Codeship is a hosted Continuous Delivery Service focusing on speed, security, and customizability. You can set up Continuous Integration in a matter of seconds and automatically deploy when your tests have passed. Codeship supports your GitHub and Bitbucket projects. You can get started with Codeship’s free plan today! Should you decide to go with a premium plan, Talk Python listeners can save 20% off of any plan for the next 3 months by using the code: TALKPYTHON

10:05 Check them out now at codeship.com, and tell them thanks for sponsoring the show on Twitter where they are @codeship.

10:05 [music]

11:52 Yeah, you can write bad code anywhere, can you?

11:54 Yes, the only thing that Python really saves you from is bad indentation, I suppose.

11:59 Yeah. It won't run. If you have bad indentation, it just won't run. Fantastic. So, when I originally reached out to you, that was because you were a co-author on a really amazing blog post, just very humbly entitled "Python at Netflix". And I'll put that as a link in the show notes, so everyone can go check it out. But you kind of went through all the different uses of Python that you guys have throughout this great cloud system that you guys have built. You said that you use something called "Boto", and that's super central of course, that's the AWS Python SDK, right? So what parts of Boto are super important, what is a really common thing that you guys do with AWS across developers?

12:51 Oh boy, howdy, so, for those of your listeners who don't know, "Boto" is a Python interface to the AWS API. It was written by a guy named Mitch Garnaat, who eventually ended up working for AWS for a while. And we use "Boto" pretty much across the board, both to talk to services like SQS and S3, but also frankly to get a bunch of information out of EC2 and almost any other part of AWS. It is the way for Python developers to talk to AWS; I'm not actually familiar with any other option that you might have.

13:34 Sure. How much auto scaling do you guys do, like for evenings in the US, I'm just trying to get a sense of like how many machines are coming online, going offline, what's that like?

13:47 Well, I'll give you a perspective. I think we've shown some public graphs that show that the traffic we get at trough, in other words the lowest quarter of our day, which is somewhere between about 1 am and 5 am, is somewhere between a third and a half of the traffic we get at peak times in the day. So that means that if you've got application servers running in clusters that are, let's say, 1000 servers at trough, they can be up to 2000 or 3000 servers at peak. Not all of our systems autoscale, of course, it doesn't always make sense depending on the kind of traffic load you've got, but we bring up thousands of servers every day to deal with traffic, and then, when traffic goes away, those servers go away as well, which is why the typical sort of half-life of an AWS instance for us is measured in the 2 - 3 day range.
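
The trough-to-peak arithmetic in this answer can be checked with a few lines of Python. This is only a sketch: it assumes the number of servers scales linearly with traffic, and `peak_servers` is a hypothetical helper name, not anything from Netflix's tooling.

```python
from fractions import Fraction
import math


def peak_servers(trough_servers, trough_fraction_of_peak):
    """Estimate peak cluster size from the trough cluster size.

    Assumes (for illustration only) that the number of servers needed
    is directly proportional to traffic.
    """
    return math.ceil(trough_servers / trough_fraction_of_peak)


# Trough traffic is "somewhere between a third and a half" of peak traffic.
smallest_peak = peak_servers(1000, Fraction(1, 2))
largest_peak = peak_servers(1000, Fraction(1, 3))
print(smallest_peak, largest_peak)
```

Under that linear assumption, a cluster of 1000 servers at trough grows to between 2000 and 3000 servers at peak, which matches the range quoted above.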

14:48 That's amazing. So, after 2 or 3 days it is likely that it was the one selected when you sort of downgraded your size for that trough?

14:58 Yeah. Now, there are of course a bunch of systems, like our Cassandra systems, that are more stateful, and that means that they don't autoscale and they have a much easier time if we don't randomly sort of recycle their instances. But our front-end systems, the systems that deal with direct customer traffic, are stateless, and there is a whole bunch of them that come up all the time and a whole bunch of them that go away all the time.

15:24 Yeah, that makes sense. Can you talk a little bit about the architecture, if this is something you know about? Like, are you using lots of microservices, are you using containers, can you speak to any of that?

15:35 I can. So we have not yet started looking- we have not yet started deploying containers across our environment; we are a classic microservices, service-oriented architecture environment without a centralized service bus. Last I looked, and unfortunately I'm not on the VPN right now so I can't confirm the number, we had more than about 1200 or so services in our production environment.

16:08 Wow, 1200 distinct services, not instances of the servers running those services, right?

16:15 Exactly. Yes. So, obviously some of those services might have thousands of instances running, some of those services may only have, you know, 2 or 3 instances running.

16:23 How do you guys manage, like, "Hey, there is this functionality that exists as this other service, don't go write your own," and just sort of keep people knowledgeable in discovering these things, at that scale?

16:36 Yes, and that is an interesting question when it comes down to company culture. We try to be as agile, with a lowercase "a", as possible. And as decentralized as possible, which means we don't want to send people through some sort of centralized approval or, you know, information distribution system. And that means that in fact it is a little harder to make sure that if you need to get something done, you will know if anybody else has done it. We count on a lot of informal communication between teams, and we count on the fact that we are all geographically co-located, so every engineer at Netflix working on our cloud ecosystem is either working in the building I work in, or the building to my left, or the building to my right.

17:21 Right, is that Los Gatos, California?

17:23 Yeah, yeah. And then, we basically try to make it so that if you know that you need something, ideally and hopefully before you start building it, you might at least know who else to ask, who might know whether or not somebody else has already built it. And sometimes we will just have duplication, and we tend to hire people who are reasonable enough about this kind of stuff that they are not going to become overly invested in their own solution rather than in the right solution. So when we find duplication, the people who own the duplicate code or function can sit together and figure out, "Well, what do we want to do with this now?"

18:02 Right, try to narrow it down to just one, maybe bring the features everybody needed into that one service, right?

18:08 Potentially, or have a better understanding of why you need two. So, I think, because I've been in a bunch of these conversations, when that happens, really the goal is to either understand why you should have two, and that will help you clarify what these two things do differently from each other, or decide, "No, that doesn't really make sense, so we'll just have one."

18:30 That makes a lot of sense. It seems like a wonderful place to work if you can just go out and have a lot of freedom and it is not very top down, it's great.

18:38 Yeah, and actually I mean that ends up being really relevant to this whole conversation about Python, because when I started using Python in the engineering side of the house at Netflix, at the time there wasn't really a lot of appetite for Python in engineering. I think it felt like I was the first one proposing to build production services with Python. And my boss frankly really didn't like this. I think my boss would have much preferred that I spent 2 or 3 months learning Java, because we had basically everything else working in Java, and we had a whole bunch of infrastructural libraries making it really easy for Java developers to run in the Netflix cloud ecosystem. And, you know, every week or two, I would be sitting down with my boss giving him an update on how my project was going, and in maybe every other one of those conversations he would say, "So you really think Python is the right way to do this?" And I would say, "Yes, yes, yes." And, you know, because of the way we tend to think about where the engineering decisions need to be made at Netflix, namely with engineers, he gave me all the rope I needed to, in this case, validate that that was a good idea.

19:48 It's really cool when you can bring a new idea and actually show, "Hey this is not a bad idea" and see how it works. I think one of the best ways to prove that is not to have a meeting, but to just build something that works and say, "Look what I did, look how great this is and how easy this is, and we can do more of this..." I think, you know, in programming a lot of times the best way to show something works is to just do it and look back on it, you know.

20:15 Yeah, the best way to show something works is to show something works.

20:18 Yeah, show something working, right, exactly. One of the things that it seems like you guys are really into is open source at Netflix. You've got a lot of cool stuff you are doing, and you seem to be open sourcing these libraries. Can you talk about some of the more popular ones?

20:32 Sure, I think we've gotten a lot of interest out of the Java work that we are actually doing, which, of course, is not so much Python, but our data science side, which is a big fan of Python, has open sourced a bunch of work using Python. One thing that we haven't yet open sourced, that we have talked about and that we would like to open source as soon as possible, is something that my team has been doing over the last year or so, which is a RESTful service to do anomaly and outlier detection, and that RESTful service uses a whole bunch of scikit-learn and pandas algorithms to help us drive automated operational decisions. My real-time analytics engineers are big fans of Python in that space; one of them is in fact speaking about it right now-

21:27 Wow, where is that?

21:28 New York.

21:29 Ok, fantastic.

21:29 Yeah. We've got another tutorial coming up at PyData in Seattle, so we would love to actually share all of that code with the community as soon as we have some time to get that done.

21:42 Wow, amazing. So, you guys are using scikit-learn and machine learning to monitor some of these cloud instances and services.

21:50 Yeah, but not just cloud instances and services, right? Cloud services and instances are to some degree the smallest domain of data that we have. It's to some degree the most publicly visible one, but if you think about it, we have millions of pieces of content. Millions of TV shows and movies, and each of them is encoded into a whole bunch of different formats and a bunch of different bitrates. We can't set some sort of artificial static thresholds for what it looks like for any one of them to be successful. So you have to really learn in production what the expected sort of viewing rate is for any one of these pieces of content, if you want to be able to notice when one of them is sort of going wrong.

22:34 Same thing about devices. There are millions of devices, spread across thousands of device families, thousands of device models, so we can't monitor any one of them sort of manually. We have to sort of deduce what the right behavior is in real time, so we can notice when one of them starts going wrong.

22:55 That's really cool. Just the scale is so big that, trying to individually test them all, by the time you got through the whole list you would probably have to start back at the beginning, because they will have new versions and new settings and so on, right? So, just build the system that watches it, huh?

23:10 Yeah exactly, and can deduce correctness from the historical patterns.
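
The idea of deducing "normal" from historical patterns instead of setting static thresholds can be sketched in plain Python. This is a stand-in for illustration, not the method Roy's team uses (he mentions scikit-learn and pandas, and the service itself is not public). It applies a modified z-score based on the median and the median absolute deviation; the function name, the sample numbers, and the 3.5 cutoff are all assumptions.

```python
from statistics import median


def is_outlier(history, value, threshold=3.5):
    """Flag `value` if it sits far from the historical median.

    Uses the modified z-score: 0.6745 * (value - median) / MAD, where
    MAD is the median absolute deviation of the history.
    """
    center = median(history)
    mad = median(abs(x - center) for x in history)
    if mad == 0:
        # History is flat, so any deviation at all is unusual.
        return value != center
    modified_z = 0.6745 * (value - center) / mad
    return abs(modified_z) > threshold


# Hypothetical plays-per-minute readings for one title (made-up numbers).
history = [100, 98, 103, 101, 99, 102, 100, 97]
print(is_outlier(history, 101))
print(is_outlier(history, 40))
```

With this made-up history, a reading of 101 is treated as normal and a reading of 40 is flagged. Because the baseline is computed per series, each title or device model gets its own notion of normal without anyone choosing a threshold by hand.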

23:15 I wonder how many organizations are actually applying machine learning to the monitoring of their software. And this thing you are going to OpenSource?

23:23 Yeah, we'd like to.

23:24 Will it have a name?

23:27 Well yes, everything has a name.

23:29 Yeah, sorry, do you know the name that you are planning to give it so people would know to look for it?

23:29 [music]

23:29 This episode is brought to you by Hired. Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.

23:29 Each offer you receive has salary and equity presented right up front and you can view the offers to accept or reject them before you even talk to the company.

23:29 Typically candidates receive 5 or more offers in just the first week and there are no obligations ever.

23:29 Sounds awesome, doesn't it? Well did I mention the signing bonus? Everyone who accepts a job from Hired gets a $2,000 signing bonus. And, as Talk Python listeners, it gets way sweeter! Use the link hired.com/talkpythontome and Hired will double the signing bonus to $4,000!

23:29 Opportunity is knocking, visit hired.com/talkpythontome and answer the call.

23:29 [music]

24:44 You know, there is a time-honored tradition at Netflix that if you are a developer working on a product, you get to name that product when we open source it, and I haven't actually asked the developers whether or not they've picked a name for the open source product. We have an internal name for it, which is Caplar, but I'm not sure that it'll end up being the public name for it.

25:10 Ok. Wow, excellent. So you have some other things that are kind of related; one of the things I read about when you wrote your blog was that you have a large number of alerts that get sent for various reasons and you have this whole central alert gateway thing that is written in Python.

25:30 Well, yes and no. So this is perhaps the bad news from the Python perspective. When I came over to engineering in 2011, I wrote a bunch of really useful stuff. One of them was a central alert gateway, another was Howler Monkey, and another was Security Monkey. The interesting thing that happened over time is that back in 2013 I became a manager, which means I pretty much get paid not to code, and so some of the stuff that I wrote that we haven't publicly talked about ended up being kind of a bust. You know, one of the things that we try to do is take a bunch of bets. We try to minimize the cost of the bets, and some of them will go great, some of them will not go great and we'll just kill them quickly.

26:19 Yeah, but the key to being successful is experimenting, right?

26:22 Yeah, exactly. So the central alert gateway was highly successful. It is incredibly useful; it is at the core of knowing that something's gone wrong in our environment. And we have a long-term commitment to it. However, it turned out that after doing maybe about two years of sort of organic development on it, it really needed a refresh. And it also needed ownership by software developers who were being paid to be software developers. And so it ended up moving over to my team, and the developer who is responsible for it ended up re-implementing it in Scala.

26:56 Ok.

26:58 So, the CAG these days is no longer Python based. I think I still, in my spare time, maintain a client in Python so you can talk to the CAG directly, but the same thing frankly happened with Security Monkey and Howler Monkey: as they moved into other groups, those other groups made their own decisions as to how to keep them organized. I think maybe to some degree you could argue that this is a nice example of how personally I was able to use Python to very rapidly iterate over what needed to be done, and then once it reached a stable point, I mean frankly, I don't know if the language mattered all that much, and the developer who preferred to use Scala ended up re-implementing it in Scala.

27:46 Sure, and that is part of that developer freedom you guys talked about, which we will come back to again. I think it's easy to- it could be easy to look at this and go, "Oh, well they tried it in Python and they failed," but I think that is exactly the wrong message to take from it. Like you are saying, it's almost the success of Python that this thing came into existence, it was written well and quickly, and then it evolved since then, but you know, the reason it is here is because it was easy to do this in Python to get started.

28:14 Yeah, and again, it wasn't like the company made a decision that it wasn't going to be in Python, frankly because the company doesn't make those sorts of decisions. The company made a decision that it was really worth investing in and having as a prime sort of first-class member of our ecosystem, and the company knows that developers make the best decisions when it comes to this sort of engineering and implementation kind of decision, and so we didn't really have an investment in having it be in any given language, and the developer who ended up doing it decided to do it in the language that she was most comfortable with.

28:55 Right, and that makes sense as well. You don't want to force somebody to write in a language they are not familiar with; it's not going to end up as good a product. So, one of the things I think Netflix is really known for, besides the 35% bandwidth story, huge AWS usage and things like that, is data-driven decisions and that type of stuff.

29:17 Yes, especially and I would say mostly for our public product.

29:22 So, can you talk a little bit about how you guys are using data science there, maybe some of the tools like IPython or things like that if you are using them?

29:29 Sure, we are big fans of IPython. I think it was two years ago, at PyCon 2013, that I discovered the IPython notebook, and I feel like my life has never been the same since then. Because that thing is just glorious.

29:47 Yeah, it really is amazing.

29:48 Yeah. And in fact, we use an IPython notebook now as the format for our take-home tech assignment for candidates for the real-time analytics group under me. But that's neither here nor there, so largely when you look at the science driving product decisions, we are big fans of A/B tests. And should I define A/B tests?

30:16 Yeah sure, go ahead, not everyone is sort of front end facing folks maybe.

30:21 Yeah, so basically if you think about it, you've got a bunch of customers who use your product and you think about whether or not a given feature might be useful, and first of all you've got to define what useful means: what is it that you are trying to actually change? You need to have a good enough understanding of your product that you can basically tie everything into some sort of key performance indicator, KPI. And then you actually implement the feature and you subject part of your customer population to a test where some people see the original behavior, some people see the new behavior, and you see whether or not there is a difference in the KPI between those two groups. Now, I am massively simplifying this, because in our environment, for example, any given A/B test might have different potention functions or different test cells, and of course we don't just run one A/B test. I think the last time I looked, as a Netflix customer, I was randomly allocated to something like either 12 or 15 or so different A/B tests.
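
One common way to allocate customers to test cells, shown here only as an assumption about how such a system could work and not as a description of Netflix's experimentation platform, is to hash the customer ID together with the test name. The same customer always lands in the same cell for a given test, while different tests split the population independently. `assign_cell`, the test names, and the customer ID below are hypothetical.

```python
import hashlib


def assign_cell(customer_id, test_name, num_cells):
    """Deterministically map a customer to one of `num_cells` test cells."""
    key = f"{test_name}:{customer_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the hash as a large integer and fold it into the cell range.
    return int(digest, 16) % num_cells


# Hypothetical test names and customer ID, made up for this example.
for test in ("box_art_image", "row_ordering"):
    print(test, assign_cell("customer-42", test, num_cells=3))
```

Including the test name in the hashed key is the design choice that lets one customer sit in a dozen or more tests at once without the cell assignments being correlated across tests.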

31:28 So, that's really cool. How might that show up? Would that be, like, what's in the recommendations, or maybe other parts of the UI?

31:37 Yeah, it can be a whole bunch of different things. It can come down to, for example, what image we show you for a given movie. So for example, if you look at "Orange Is The New Black", there is a little picture that we show you for "Orange Is The New Black". Well, does one picture actually get people to want to watch that more than another picture? One easy way to look is to basically divide your customers into a bunch of different groups and show each of them a different picture, and then actually see if one of those pictures ended up resulting in a higher click-through rate, right? So that is a relatively trivial small difference. You can look at bigger things, like for example what order we show you recommendations in, what language we may use, and in some cases, complete overhauls of the UI. So, I saw an article recently that suggested that sometime in June we are going to completely overhaul the computer-based UI to make it sort of much niftier and prettier, and that's a complete overhaul that obviously would have been tested on a small group first to see if it actually causes people to engage with us more or less.
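
Deciding whether one picture really produced a higher click-through rate usually comes down to a significance test. The sketch below uses a standard two-proportion z-test written with only the standard library; it is a generic statistical technique, not something the interview says Netflix uses, and the click and view counts are invented.

```python
import math


def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rate."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that both images perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: image A got 100 clicks in 1000 views, image B got 150.
z, p_value = two_proportion_z(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between the two images is unlikely to be noise. With very large user populations, even small differences in the KPI can reach significance quickly.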

32:46 Right, and you only choose the ones that make them engage with you more of course.

32:49 Well, ideally or at least you know, assuming that that is the KPI that we wanted to actually affect, right. So again, this all comes down to what am I actually trying to accomplish with this particular feature.

33:02 Sure. I saw, I think it was an article in "Wired", and I'll put the link in the show notes, that some of the guys, it was from a Hadoop conference, were looking at the different things you can change in an A/B test, and they were actually looking at the color spectrum of the image and comparing those across other movies that may be more popular or less popular, and it sounds like there are a lot of dimensions that you guys look at.

33:27 Oh my God, yes, but the beautiful thing is, if you have already built essentially the infrastructure to do A/B tests relatively cheaply, then that means that you open the door to doing a lot more experimentation, and you don't need to necessarily have a conjecture that you are going to significantly move the needle. So it is all about cost-benefit analysis, and if you think about it, if it costs you a lot to try a test, then boy howdy, you've got to conjecture that that test is really going to change some KPI significantly. But if implementing A/B tests is relatively inexpensive and can be thought of as almost free, then you can be a lot more experimental than you would be otherwise. Which is why, by the way, the group responsible for building our A/B test infrastructure is not called the A/B test group; they are called the experimentation platform group, because this is really how you enable cheap and easy experimentation.

34:32 Yeah, and you have so many users that you get very easily statistically significant data, right?

34:38 Yeah. That does make it easier.

34:41 Very, very cool. One of the things that I heard that really intrigued me about the architecture and things you guys do is the thing called "The chaos monkey" and it's been around for a while but can you talk about that?

34:51 Sure, Chaos Monkey is a relatively proven concept; we started doing this as we went into the cloud maybe about 4 years ago, and the idea is that servers will die. You know, this is a fact, this is true in every environment I've ever seen, whether you are in the cloud or not, your servers will die. And trying to avoid it or deny it does not lead to happiness or more robust systems. So let's try to actually make sure that our systems are resilient to server death. And so we built this small component that we ended up calling Chaos Monkey that goes around every day and kills one of your servers for every application group that we have. So, we talked about those 1200-

35:37 And this is in production, right?

35:37 Oh yeah, that is I would argue the most useful way to have Chaos Monkey run.

35:43 Amazing, ok. Keep going, sorry.

35:45 Ok, so basically if you think about those 1200 applications running in production, we kill at least one server out of each of these applications every day, which we can do because we are running on multiple servers, to make sure that we don't have a problem when that happens.
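The daily selection described here can be sketched in a few lines of Python. This is an illustration only, not Netflix's Chaos Monkey: the `pick_victims` function and the group and instance names are hypothetical, and skipping groups that have only one instance is an assumption added for the example.

```python
import random


def pick_victims(groups, rng=random):
    """Choose one instance to terminate from every application group.

    `groups` maps a group name to a list of instance ids (hypothetical data).
    `rng` is anything with a `choice` method, so tests can pass a seeded
    `random.Random` instance.
    """
    victims = {}
    for name, instances in groups.items():
        if len(instances) < 2:
            # Assumption: leave groups without redundancy alone.
            continue
        victims[name] = rng.choice(instances)
    return victims


groups = {
    "api": ["i-a1", "i-a2", "i-a3"],
    "search": ["i-s1", "i-s2"],
    "batch": ["i-b1"],  # single instance, skipped in this sketch
}
victims = pick_victims(groups, rng=random.Random(16))
```

A real tool would then call the cloud provider's terminate API for each chosen instance; that step is left out here on purpose.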

35:59 So, when people start building their servers they have to from the very beginning plan for the fact that these machines are going to die which is reality, right? But it is really neat that you have put these systems together to test it and that's pretty leading edge I think, but what I think really takes it to the next level is you leave it running in production. That's amazing.

36:20 Yeah. And you know, the interesting thing is, it feels leading edge when you are outside this environment and it feels leading edge when you start doing it, but you know, off the top of my head I don't think Chaos Monkey has actually exposed a problem for us in the last, I want to say, three years, and that is perfect, right, because that is exactly what it is there for, and it is verification that despite the fact that we kill thousands of instances every day as part of Chaos Monkey, we've gotten to the point where it is pretty much normal for us that we can lose a server without any noticeable impact.

36:53 Yeah. That is excellent. I mean, you really need to make sure that the servers do not go down and I can't remember the last time Netflix wasn't available. And it was your fault, you may remember?

37:04 Yes, so one of the fun parts about being involved in every major incident post mortem at Netflix is that I do in fact get to be aware of every glitch in the system and every significant outage. It is worth noting that thankfully for us the definition of significant outage is smaller than for our customers. We tend to think of problems being significant below the level at which they would likely be noticeable to most of our customers-

37:35 Right, exactly. Maybe these machines are using too much CPU or something like that, is that a possibility?

37:41 Not so much, because we tend to think about production impact purely based on whether or not customers are impacted. The threshold for us, for example, for major production incidents is whether or not 10% or more of our customers are impacted; that is a really big deal for us. It happens not that often but it still happens every once in a while. But it does mean that, for example, if 10% of our customers are impacted, the good news is that 90% of them aren't.

38:06 Yeah, sure. It's just 10% is really high.

38:08 Yeah, and it's been a really long time, and in fact I can't exactly remember when we have seen a 100% service impact. When that graph of actual current usage of Netflix goes to zero.

38:24 So, I can see how you build your systems and your services so they can take this, but you are built on multiple data centers spanning AWS cloud availability zones and all that kind of stuff. So, you also need to be able to deal with it if AWS goes down.

38:43 Yes, in theory we need to be able to do that and I think we built a pretty good environment because of that. So for example in the US we run out of two major regions in AWS, US West 2 and US East 1, and we can fail over between those two regions, so we can evacuate a region, and we do that I think basically right now at least once a month or thereabouts to validate that that mechanism still works. The good news is that frankly, AWS to the best of my recollection hasn't had a significant regional outage in a reasonably long time. You do watch them have one right now as we speak.

39:20 Watch what you say... Yeah, of course, I agree they are absolutely rock solid and I think they are the best place in the world to be including your own data center. Because those things go down too, and then that is your problem.

39:34 Yeah, I've been there before and I spent most of my life in data centers; but you know, going back to Chaos Monkey for a second, there is a point that is worth making, which is we didn't so much build Chaos Monkey because AWS wasn't reliable enough. We built Chaos Monkey because AWS was too reliable. What I mean by that is, if AWS was causing us to lose machines for example on a daily basis, then they would have already given us the requirement that we needed to be resilient to that. Everybody running in our environment would have known that they need to build that way because they would have lost machines just by the natural sort of unreliability of AWS. AWS's reliability even down to the machine level is actually high enough that we needed to simulate machine failures because we wanted it to happen at the rate we wanted it to happen.

40:22 Right, it almost never happens so how do you prepare for it?

40:24 Right, exactly.

40:26 That's amazing. So you guys built a tool that is related to Chaos Monkey called Chaos Gorilla.

40:33 Yes, exactly, we have sort of taken the simian army motif and we've run with it. So, we actually have Chaos Monkey, Chaos Gorilla and Chaos Kong, believe it or not. Chaos Monkey is about simulating a single machine failure, as I mentioned before. Chaos Gorilla was something we built to allow us to evacuate availability zones, within a region, so should we briefly talk about regions versus availability zones with AWS?

41:02 Yeah, absolutely, sure.

41:02 So an Amazon region is a geographical location that actually can contain multiple availability zones. An availability zone can be thought of as, you know, let's say for example a data center, and a region may contain several data centers that are pretty close to each other. And so, we wanted to be able to tolerate a single machine failure or a single availability zone failure within a region, and then we built Chaos Kong to be able to tolerate the loss of a region. Lately, I think we've found that we've gotten good enough at evacuating regions that if a single AZ (availability zone) within a given region is going to fail, we would probably prioritize just leaving that region temporarily rather than trying to re-balance within that region.

41:48 So, we practice regional evacuation and regional re-balancing pretty regularly, at least once a month, with no adverse effects, and every time we don't so much look at whether or not we've impacted our customers anymore, because we really don't, but rather at how quickly we were able to effect an impact-free evacuation.
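As a rough model of what evacuating a region means for traffic, the sketch below drains one region and spreads its share over the others in proportion to what they already carry. Everything here is hypothetical: the `evacuate` function, the traffic shares, and the proportional redistribution rule are assumptions for illustration, and the interview does not describe the actual mechanism Netflix uses. The region identifiers are the standard AWS names for the two US regions mentioned earlier.

```python
def evacuate(traffic_share, failed_region):
    """Return new traffic shares with `failed_region` drained to zero.

    `traffic_share` maps region name -> fraction of total traffic.
    The drained region's share is spread over the remaining regions in
    proportion to the traffic they already carry (an assumed policy).
    """
    remaining = {r: s for r, s in traffic_share.items() if r != failed_region}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no healthy region left to absorb the traffic")
    shares = {r: s / total for r, s in remaining.items()}
    shares[failed_region] = 0.0
    return shares


before = {"us-west-2": 0.4, "us-east-1": 0.6}
after = evacuate(before, "us-east-1")
```

In this two-region example the surviving region simply takes all of the traffic, which is why the time to complete the shift, rather than customer impact, is the interesting measurement.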

42:10 That's really amazing that you guys built, that's great. So, keeping on the monkey theme, you guys also have Security Monkey and Howler Monkey.

42:17 Yeah. Because everybody loves monkeys.

42:20 Yes.

42:22 So, Security Monkey and Howler Monkey were both written by me; actually I would say Howler Monkey was the first project I wrote when I moved over to engineering from IT at Netflix. And at the time we were trying to solve a problem: basically, you know, we have dozens of SSL certificates spread around different parts of our environment, and when those certificates expire, as they are wont to do, you end up with a production event.

42:49 And so we built the system to automatically discover certificates and alert us when they are getting close to expiration. That ended up being at the heart of Security Monkey, and from that, what we were finding at the time was that we were growing at a fast enough rate that we kept running into the limits imposed on us by Amazon. Amazon as a cloud environment is infinitely scalable, but they need to protect themselves, so there are essentially logical limits to the resources you use.
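The certificate expiry check mentioned at the start of this passage might look something like the following sketch. It is not the Netflix tool: the function names, the 30-day warning window, and the example hostnames are all hypothetical. The only library call it relies on is `ssl.cert_time_to_seconds` from the Python standard library, which parses the `notAfter` string format that `ssl` reports for a certificate.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days until a certificate expires.

    `not_after` uses the certificate date format handled by the ssl
    module, for example 'Jun 20 00:00:00 2015 GMT'.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (expires - now).days


def expiring_soon(certs, warn_days=30, now=None):
    """Return the names of certificates that expire within `warn_days`.

    `certs` maps a name to its notAfter string (hypothetical inventory).
    """
    return sorted(
        name
        for name, not_after in certs.items()
        if days_until_expiry(not_after, now) <= warn_days
    )


inventory = {
    "api.example.com": "Jun 20 00:00:00 2015 GMT",
    "www.example.com": "Jun 10 00:00:00 2016 GMT",
}
recording_day = datetime(2015, 6, 10, tzinfo=timezone.utc)
to_renew = expiring_soon(inventory, warn_days=30, now=recording_day)
```

Discovery, meaning finding every certificate in the environment in the first place, is the harder half of the problem and is not covered by this sketch.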

43:17 For example, the number of autoscaling groups that you might be able to provision or the number of instances that you might be able to provision. At the time, we were increasing our footprint fast enough that we kept running into those limits and then we couldn't get any more until we asked them to increase them, and traditionally, of course, that would happen on Friday at about 5 o'clock in the afternoon, and that would spoil somebody's Friday evening. And so we built something that allowed us to monitor our usage compared to limits and alert ourselves when we get pretty close, around 80%, so we can talk to Amazon about increasing our thresholds.
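The usage-versus-limit alerting described here comes down to a simple ratio check. Below is a minimal sketch, assuming usage and limits are already available as plain dictionaries; the `near_limit` function, the resource names, and the numbers are hypothetical, and only the 80% threshold comes from the conversation.

```python
def near_limit(usage, limits, threshold=0.8):
    """Return (resource, used, limit) for resources at or above the threshold.

    `usage` and `limits` map a resource name to a count (hypothetical data).
    Resources with no recorded usage are treated as zero.
    """
    alerts = []
    for resource, limit in sorted(limits.items()):
        used = usage.get(resource, 0)
        if limit > 0 and used / limit >= threshold:
            alerts.append((resource, used, limit))
    return alerts


limits = {"instances": 1000, "autoscaling_groups": 100}
usage = {"instances": 850, "autoscaling_groups": 40}
alerts = near_limit(usage, limits)
```

Alerting at 80% rather than at 100% is what buys the time to request an increase before provisioning starts to fail.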

43:52 You guys must have a special relationship with AWS?

43:56 Yeah, you know, we are big fans of AWS; AWS seems to be big fans of us and we talk a whole lot.

44:03 Yeah, I'm sure that you guys are actually a topic of conversation in their engineering group fairly often, that's cool.

44:03 Nice. So, one of the phrases or quotes I saw from your blog post was, you said something to the effect of, "We found a formal change control system didn't work well within our culture of freedom and responsibility." That sounds really cool, and so you created this thing called Chronos. What is this culture of freedom and responsibility?

44:29 Netflix has done a lot of work to formally define the kind of culture it likes to have internally. So we have a culture slide deck; it's now publicly available on SlideShare, and I think googling for Netflix culture will find it within the first few links. If you look at how we think we are going to be successful as a company, it's really about decentralizing and maximizing speed of innovation. And if you want to maximize the speed of innovation, we believe that the best way to do this is to hire a bunch of people and to make it so innovation can happen at the lowest, most spread out parts of the organization, without the need for coupling between somebody who wants to do something and a whole bunch of other people just for sort of approval.

45:14 Which means, that we have a culture that emphasizes freedom but also means that for that you get responsibility for your actions, and examples of freedom mean for example that we don't have approvals on purchases. And I'll say that again because it can kind of sound a little crazy, there are no actual approval processes for making a hardware purchase.

45:42 Amazing.

45:43 Yeah. As another example, let's talk about expenses; Netflix's expense policy is very robustly documented, it is "act in Netflix's best interests." That sentence, there you go, is the entirety of our expense policy. And what that ends up meaning in reality is that, for example, when I submit an expense, there is no approval for that expense, it will get automatically paid, and then at the end of the month my manager will get a list of expenses that were automatically paid within her or his organization.

46:17 Sure, as long as you are behaving good for Netflix it is all good, right, if you take some clients or somebody out to dinner you think it is good for Netflix, and it's all right?

46:27 Right. Exactly. So, that also means that from an engineering perspective, engineers actually get to decide what's the right thing for Netflix, both in terms of how they solve and when for example they might push to production, which is also why I got to build Howler Monkey in Python originally, despite the fact that we didn't have a good belief that a whole bunch of other people would be interested in using Python. Now, it turned out that that ended up being wrong, because as soon as I built a bunch of infrastructure libraries in Python for working in the cloud, I started getting a bunch of people contacting me and formally going, "Hey, I heard you have this thing that will allow me to do Python more easily."

47:08 So that turned out to have enabled a lot of people I think at Netflix to use Python. But, from the change perspective, in a typical environment and I worked in a bunch of them, if a developer wants to do something in production, they have to submit a change control ticket, which will end up getting approved by somebody. It's not very consistent with our culture, and having worked in a bunch of other organizations that do that, I don't think it's particularly useful anyway. So we don't do that.

47:35 Right. It makes people feel like they have control, but it doesn't really help that much.

47:40 Yeah, I mean I think what happens is basically the people who are kind of far away from that change end up having to approve it but they don't actually know what that actually means. It's not like they are actually being any more responsible for it, and typically, what happens is when a change ends up not having been the right thing, then the company will react by saying, "Well, now you need the director to approve this kind of change" or "You need a VP to approve this kind of change," moving the locus of control over the decision further and further away from the people who are most qualified to make the decision.

48:10 Yeah, that is a really good point. The more approval it needs, the less likely that person is able to actually understand what they are approving. That's very interesting.

48:20 Yeah, and the other thing is, it actually takes away from your responsibility. So if you are a developer and you want to deploy at 5 o'clock in the afternoon on a Friday for example, which is kind of prime time for us, and generally speaking, to be perfectly honest, you know, I don't do that, my team generally speaking doesn't do that unless we have a very good reason to, then if you can basically submit change control that can say, "Well, you know, my boss or my director approved it so it's ok," it means you are not actually practicing a whole lot of responsible thinking as to whether or not this is a good idea right now. So, putting it entirely in your hands means you have complete freedom to do what you think is right, but also means you have the responsibility to make the best decision.

49:03 That's great. I think a lot of the best engineers and developers really value that freedom and responsibility, so I think it's great.

49:11 Yeah. It's worked reasonably well for us.

49:14 Cool. So Python 2 or Python 3 in Netflix?

49:18 Right now, my impression is, and remember, we are not exactly a very standards-driven company, there is no overall board that says this is how you should use Python at Netflix. My impression is that generally speaking we seem to be using Python 2 across the board, I don't know anybody using anything older than 2.7 thankfully, and I suspect that in the next year we'll see Python 3 increasing in its coverage.

49:46 Ok. Great. I also saw that you guys sponsored PyCon this year, that's great, thanks for that. Did you do any presentations, or did Netflix as a group do Python presentations?

49:57 I've got to tell you, I don't know off the top of my head whether or not Netflix did any presentations. This is the first year I think in about three years that I didn't attend PyCon, despite the fact that PyCon I think is my favorite conference, only because Montreal was a little far to go, for PyCon, for me.

50:17 Yeah, well, you can go a little bit North next year.

50:21 Where is it next year?

50:21 It's in Portland, my home town.

50:23 Oh, well perfect. I mean I am going to Portland next week so I'm a big fan.

50:26 Excellent. So I'll go look through some of the older ones, see if I can pull up something. So you guys have a lot of developers and a lot of cool stuff going on; are you hiring people right now? For people who are listening, are there interesting areas that they should maybe think about?

50:42 Are we hiring people right now? Maybe one day the answer to that will be no, but I don't suspect that is going to happen any time in the next few years. We are very actively hiring people right now, the only thing I wish is when I look at jobs.netflix.com there was some sort of counter that allowed me to see how many open positions there are, but there is a whole bunch of them.

51:05 Excellent, ok. So jobs.netflix.com people can check that out if they are interested. Very cool.

51:10 Yeah, and I would say this, there is a bunch of jobs that might actually call for Python experience, a bunch of jobs that might not. 2 things I would say: one, I know a whole bunch of areas within Netflix where Python is heavily used, both in terms of analytics so for example my real time analytics group is big fans of Python. But also in terms of sort of infrastructure management and development, so our Cassandra operations people are big fans of Python as well as some of our big data platform people.

51:43 And the other thing is that if you are not actually married to continuing to use Python, there is a whole bunch of other positions at Netflix that want great engineers irrespective of your language. We've come to believe that hiring for the language is less valuable to us than hiring for the aptitude; that is one of the things I like about the way we think about hiring compared to a lot of other companies. We tend to hire great engineers and then have them use the right language.

52:14 Excellent. Yes, so I am sure a lot of people are excited to hear that, that's cool. A few questions I always ask on the way out the door: first of all, what is your favorite editor?

52:24 So right now, after many, many years I have switched, I am not sure it is going to be a permanent switch, but I've switched to Atom, from GitHub.

52:34 Oh yeah, Atom is pretty nice.

52:36 Yeah, I tried Sublime for a while and it was ok but it wasn't sticky, I kept coming back to Vim every once in a while, and then I haven't tried anything heavier than Sublime or Atom, like any of these sort of PyCharm or all that stuff.

52:50 Exactly. Ok, cool, I know Sublime is very popular within Netflix. Atom is nice, it looks really good. How about some cool package on PyPI that people should know about?

53:05 Well, I mean technically IPython Notebook is on PyPI, right?

53:08 It is, yeah.

53:09 And I would say if there was one- actually for editing and for authoring and for experimentation IPython Notebook beats everything else out there for me, for actually getting work done- oh boy howdy, Requests.

53:26 Yes.

53:26 Requests, requests, requests.

53:28 Yeah, Requests is the most popular Python package. It has been downloaded an insane number of times, and we had Kenneth Reitz on show 6 to talk about it, that was very cool.

53:38 Yeah, I mean you know, at some point I wonder why, you know, at some point will it just go into the core-

53:45 You know, actually Kenneth talked about that. He was at the language summit there at the last PyCon and they talked a lot about that, specifically whether Requests will become part of the standard library. And they decided no, because they would like to release changes to Requests at a faster pace than Python itself. So there was actually quite a bit of discussion; I think that that was what the decision was, but they said it is going to become the official recommended way to do HTTP within Python instead of urllib2 or things like that, within the documentation.

54:22 All right, very cool, so definitely a popular one, that's great. IPython and Requests, people should check those out, that's great. Roy, any final shout out you want to give? Anything you want to bring the listeners' attention to?

54:33 You know, Monitorama is happening in Portland next week, so anybody should check out the presentations coming out of it if they have any interest in monitoring or operational insight. I'm going to be at the conference for the first time, I'm very excited about that, there is a whole bunch of very interesting people who are much smarter than me presenting and attending. Check that out and- I can't think of anything else off the top of my head.

54:59 All right. This has been such a great show, I really enjoyed the conversation and I learned a lot. Thanks for being here, Roy.

55:04 It's been my pleasure, thanks Michael.

55:07 Yeah, bye.

55:07 This has been another episode of Talk Python To Me.

55:07 Today's guest was Roy Rapoport and this episode has been sponsored by Codeship and Hired. Thank you guys for supporting the show!

55:07 Check out Codeship at codeship.com and thank them on twitter via @codeship. Don't forget the discount code for listeners, it's easy: TALKPYTHON

55:07 Hired wants to help you find your next big thing. Visit hired.com/talkpythontome to get 5 or more offers with salary and equity right up front and a special listener signing bonus of $4,000 USD.

55:07 Speaking of jobs, remember that Netflix is hiring! If what you heard on this show sounds amazing and you're able to work from Los Gatos, CA, check out jobs.netflix.com.

55:07 You can find the links from the show at talkpythontome.com/episodes/show/16.

55:07 Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes and direct RSS feeds in the footer on the website.

55:07 Our theme music is Developers Developers Developers by Cory Smith, who goes by Smixx. You can hear the entire song on our website.

55:07 This is your host, Michael Kennedy. Thanks for listening!

55:07 Smixx, take us out of here.

55:07 [music]
