#16: Python at Netflix Transcript
00:00 Right now, there is a chaos monkey running through AWS, knocking over Netflix servers.
00:04 But don't be alarmed. It's all part of the plan.
00:07 This is Talk Python to Me with Roy Rapoport, recorded Wednesday, June 10th, 2015.
00:12 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.
00:45 This is your host, Michael Kennedy.
00:47 Follow me on Twitter where I'm @mkennedy.
00:50 Keep up with the show and listen to past episodes at talkpython.fm
00:54 and follow us on Twitter where we're @talkpython.
00:57 This episode, we'll be talking to Roy Rapoport about Python at Netflix.
01:01 This episode is brought to you by Codeship and Hired.
01:06 Thank them for supporting the show on Twitter via @codeship and @Hired_HQ.
01:13 The topic of this show is Python and cloud computing at Netflix.
01:16 A while ago, I had the amazing opportunity to teach a Python training course
01:21 to some of the developers in data science there.
01:23 I want to give a quick shout out to my past students.
01:26 Thanks for listening, guys.
01:27 Netflix is an amazing place.
01:30 And after the show with Roy, I'm even more convinced of this.
01:32 You're going to love this interview.
01:34 And with that, let me introduce Roy.
01:36 Roy Rapoport is currently managing the Insight Engineering Organization at Netflix,
01:41 where they write the powerful telemetry platform and graphing, alerting, and analytics systems on top of it that allow Netflix to have complete real-time
01:50 visibility into its operations and systems in the cloud, on customer devices,
01:55 and anywhere else Netflix operates.
01:57 Roy, welcome to the show.
01:59 Oh, thank you so much.
02:00 I'm happy to be here.
02:01 I'm really beside myself with excitement to talk to you about some of the stuff that you guys are doing at Netflix.
02:09 So thanks so much.
02:10 I think when people think about software projects and deployment and managing large software at scale,
02:19 there's almost no other company that comes to mind doing the kinds of things that you're doing at Netflix.
02:26 I mean, there's a few, Google, and maybe someone else.
02:30 I'm not sure.
02:31 But what you guys are doing is really, I think, pushing the limits of what you can do with software in the cloud these days.
02:38 So I'm really excited to talk about that.
02:39 Yeah.
02:40 I actually love talking about this stuff, too.
02:42 So it works.
02:43 That's great.
02:44 So since this is listened to all over the world, I know Netflix is, like, absolutely a household name in America.
02:52 But maybe you could just say really briefly for folks who maybe don't know what Netflix is, what you guys do.
02:57 Oh, sure.
02:58 I think we're becoming, to some degree, a household name in a bunch of other countries as well.
03:02 We now serve somewhere between, I think, 45 and 50 countries or thereabouts.
03:07 In the world.
03:08 And we are a subscription video streaming service.
03:12 So the idea is you pay us a relatively small amount of money.
03:15 In the U.S., I think it's about $8.99 a month.
03:17 And you get unlimited access to a pretty wide catalog of TV shows and movies.
03:24 And, you know, you watch about as much of that as you want in any given month.
03:28 That's fantastic.
03:29 Yeah.
03:29 You just pay $8.
03:31 And basically all movies and TV shows, well, not all of them, but almost all the ones that matter, are yours to just watch on demand.
03:38 It's great.
03:39 You know, at my house, I have three kids.
03:42 And about six or seven years ago, we just decided that ads constantly being on television were not really helpful.
03:51 We have 600 channels.
03:53 Two of them are good.
03:54 And we just literally canceled our cable and said, look, we're going with Netflix and, you know, a few other things here and there like YouTube and so on.
04:03 And it's made an actual real difference in my kids' lives.
04:08 I think, you know, you go and ask them, what do you want for Christmas?
04:11 They're like, I'm not really sure.
04:13 I don't know.
04:13 I mean, they're just not overwhelmed with, like, all these ads and commercials.
04:18 And I think Netflix is really a positive force.
04:20 No, thank you.
04:22 And thanks, by the way, for paying my salary.
04:23 Yeah.
04:24 I'm happy to do my very small part.
04:26 So speaking of salary, what do they pay you for at Netflix?
04:30 Oh, boy.
04:33 So my job at Netflix is manager of Insight Engineering.
04:37 Insight Engineering at Netflix is a software development group responsible for building real-time operational insight systems.
04:44 So originally we thought of it as monitoring, but really the goal is to help people figure out what's going on and what they should do about it.
04:50 And in the best cases, actually automate that sort of process of analysis, discovery, decision, and then application.
04:59 That's cool.
05:00 So who would you consider, like, your internal client?
05:03 If you will, like, are you helping the developers?
05:06 Are you helping DevOps?
05:07 Are you helping, like, the business folks that decide, hey, we're working with this big data and we're trying to figure out what the next data-driven movie we're going to create is?
05:16 Or what are you working with there?
05:18 Who, rather?
05:20 Good question.
05:20 Good question.
05:20 Well, it's worth noting that we don't really have DevOps people.
05:23 We have developers.
05:24 And every developer at Netflix writes code, tests code, and then deploys it into production, and is responsible for it working well.
05:32 And at 2 o'clock in the morning, if it doesn't work well, that developer wakes up to deal with it.
05:38 So every developer at Netflix is our customer.
05:40 Now, the interesting thing that ended up happening was, while we're focused on the real-time, operational domain, so we don't typically actually work to serve the business people who want longer-term insight and sort of big data queries.
05:54 All of our data ends up being stored in Hive and ends up being really useful for a whole bunch of that sort of strategic view analysis that we hadn't originally expected it to be useful for.
06:06 Oh, that's excellent.
06:08 And I know Hive is related to Hadoop.
06:10 Is Hive kind of like the front end to your Hadoop cluster?
06:13 Yeah, pretty much.
06:15 Yeah, that's the simple uninitiated version.
06:19 Awesome.
06:21 So I've been watching you guys for a while, you know, just from a software architecture perspective.
06:28 And I think that's what you're doing is really interesting.
06:31 But could you speak to a little bit of, like, the scale of how you guys are using the cloud?
06:35 What cloud are you using?
06:36 That kind of stuff.
06:37 Sure.
06:38 So Netflix uses AWS for the control plane.
06:43 So in other words, when your device talks to us to figure out what movies it could watch and lets you browse our catalog,
06:50 it talks to a whole bunch of systems in the AWS Amazon Web Services cloud.
06:55 I think the last official public number that we've disclosed is that we run more than 50,000 servers in the AWS cloud across several production regions.
07:07 When you actually stream the movie, that stream actually comes from our in-house content distribution network.
07:19 We call this the Open Connect Network.
07:22 And so we've actually deployed content caches in a whole bunch of different internet peering points and, in some cases, in large ISPs to minimize the bandwidth that they need to dedicate for Netflix.
07:34 So, yeah.
07:37 So that's the number of servers in the cloud.
07:40 And I'm not sure.
07:42 We have thousands of Open Connect servers, but I'm not exactly sure about the number.
07:45 Right.
07:47 Okay.
07:47 Yeah.
07:48 You guys have kind of been at the center of the whole net neutrality thing as well because it's so critical to get that bandwidth to so many places, right?
07:58 Yes, net neutrality is something that's near and dear to Netflix's heart.
08:01 I'm sure you paid a little attention to it, at least as an organization.
08:04 Yes.
08:05 That's really cool.
08:08 I heard some really interesting statistics about how much bandwidth as a percentage of the internet you guys represent.
08:14 Do you know that number?
08:15 Can you speak to it?
08:16 I think the last open numbers that were reported... was it Widevine who reported this?
08:25 The last numbers that I heard were something on the order of 33% of internet traffic, theoretically.
08:34 Yeah.
08:35 That's amazing.
08:36 I mean, just stop and think about that.
08:38 There's millions of websites, and you guys represent a third of all bandwidth.
08:42 That's amazing.
08:43 But like I said, you're more or less becoming the new television for the world, right?
08:47 And actually, I'm sorry.
08:48 That was Sandvine.
08:50 And in November, they said we had topped 35%.
08:53 Wow.
08:54 Yeah.
08:55 So all this stuff you're talking about, scaling and CDNs and all this extra work is really needed.
09:04 Yeah.
09:06 It's interesting how much doing things well ends up working well for you at very large scale.
09:12 And how your ability to sort of take shortcuts that would probably be a pretty sane approach at a smaller scale ends up not being the right approach when you're looking at our scale.
09:28 We end up being a lot more, I think, allergic to technical debt than we would be if we were a lot smaller.
09:34 So I can certainly see that the code has to be more maintainable and sort of understandable at such a scale, because you can't have these little hidden problems that people don't understand or don't want to go fix, because you have to fix it, right?
09:53 I can see how Python would really help in that space because it's a simpler language.
10:01 It's easy to understand.
10:02 It has these great libraries.
10:03 What role does that play?
10:06 Well, I think this is where we get into potentially issues of sort of personal preferences almost.
10:12 I have some strong opinions about Python.
10:14 It's quite literally my favorite language.
10:17 It's been my favorite language since I started working in Python in, oh gosh, 04 or thereabouts, like 2004.
10:26 So about, you know, 10, 11 years.
10:28 I think there's a whole bunch of other people at Netflix who do great work with other languages, many of them JVM languages, whether it's Java, Clojure, or Scala.
10:37 And, of course, we also do a bunch of work with JavaScript.
10:40 I have been privileged to see some really wonderfully written Python code.
10:48 I've been privileged to see some really nicely written Java code.
10:52 And I think we've all, frankly, seen some pretty terrible code irrespective of the language.
10:56 Codeship is a hosted continuous delivery service focused on speed, security, and customizability.
11:16 You can set up continuous integration in a matter of seconds and automatically deploy when your tests have passed.
11:22 Codeship supports your GitHub and Bitbucket projects.
11:25 You can get started with Codeship's free plan today.
11:28 Should you decide to go with a premium plan, Talk Python listeners can save 20% off any plan for the next three months by using the code TALKPYTHON.
11:36 All caps, no spaces.
11:38 Check them out at codeship.com and tell them thanks for sponsoring the show on Twitter where they're @codeship.
11:51 Yeah, you can write bad code anywhere, can't you?
11:53 Yes.
11:54 The only thing that Python really saves you from is bad indentation, I suppose.
11:59 Yeah.
12:00 Yeah.
12:01 It won't run.
12:01 If you have bad indentation, it just won't run.
12:04 Fantastic.
12:06 So when I originally reached out to you, that's because you were a co-author on a really amazing blog post, just very humbly entitled Python at Netflix.
12:19 And I'll put that as a link in the show notes so everyone can go check it out.
12:23 But you kind of went through all the different uses of Python that you guys have throughout this great cloud system that you guys have built.
12:32 You said that you use something called boto, B-O-T-O, and that's super central.
12:38 And, of course, that's the AWS Python SDK, right?
12:42 So what parts of boto are, like, super important?
12:45 What's a really common thing that you guys do with AWS across developers?
12:50 Oh, boy, howdy.
12:52 Everything, huh?
12:54 So for those of your listeners who don't know, boto is a Python interface to the AWS API.
13:00 It was written by a guy named Mitch Garnaat, who eventually actually ended up working for AWS for a while.
13:10 And we use boto pretty much across the board, both to talk to various services like SQS and S3,
13:18 but also, frankly, to get a bunch of information out of EC2 and almost any other part of AWS.
13:25 It is the way for Python developers to talk to AWS.
13:30 I'm not actually familiar with any other option that you might have.
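To make that concrete, here is a minimal sketch of the kind of boto calls being described. It targets classic boto (the pre-boto3 library current at the time of this episode); the region, bucket, queue, and key names are all hypothetical.

```python
import boto.ec2
import boto.sqs
from boto.s3.connection import S3Connection

# Query EC2 for instances in a region.
ec2 = boto.ec2.connect_to_region("us-west-2")
reservations = ec2.get_all_reservations()
instances = [i for r in reservations for i in r.instances]
print("running instances: %d"
      % sum(1 for i in instances if i.state == "running"))

# Read an object out of S3 (bucket and key names are made up).
s3 = S3Connection()  # credentials come from the environment or an IAM role
bucket = s3.get_bucket("example-telemetry-bucket")
key = bucket.get_key("reports/latest.json")
if key is not None:
    body = key.get_contents_as_string()

# Pull a few messages off an SQS queue (queue name is made up).
sqs = boto.sqs.connect_to_region("us-west-2")
queue = sqs.get_queue("example-alert-queue")
if queue is not None:
    for message in queue.get_messages(num_messages=5):
        payload = message.get_body()  # hand off to real processing here
        queue.delete_message(message)
```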
13:33 Sure.
13:34 How much auto-scaling do you guys do, like, for evenings in the U.S.?
13:40 I'm just trying to get a sense of, like, how many machines are coming online, going offline.
13:46 What's that like?
13:48 Well, I'll give you a perspective.
13:50 I think we've shown some public graphs that show that the traffic we get at trough,
13:56 in other words, the lowest part of our day, which is somewhere between about 1 a.m. and 5 a.m.,
14:04 is somewhere between a third and a half of the traffic we get at peak times in the day.
14:11 So that means that if you've got application servers running in clusters that are, let's say, 1,000 servers at trough,
14:19 they could be up to 2,000 or 3,000 servers at peak.
14:23 Not all of our systems auto-scale, of course.
14:25 It doesn't always make sense, depending on the kind of traffic load you've got.
14:28 But we bring up thousands of servers every day to deal with traffic.
14:35 And then when traffic goes away, those servers go away as well, which is why the typical sort of half-life of an AWS instance for us is measured in the two- to three-day range.
14:48 That's amazing.
14:49 So after two or three days, it's likely that it was the one selected when you sort of downgraded your size for that trough.
14:58 Yeah.
14:59 Now, there are, of course, a bunch of systems like our Cassandra systems that are more stateful.
15:03 And that means that they don't auto-scale and they have a much easier time if we don't randomly sort of recycle their instances.
15:10 But our front-end systems, the systems that deal with direct customer traffic, are stateless.
15:17 And there's a whole bunch of them that come up all the time and a whole bunch of them that go away all the time.
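To picture the trough-to-peak scaling described here, below is a hedged sketch, with hypothetical names, of defining scale-up and scale-down policies through boto's classic autoscale API. The CloudWatch alarms that would actually trigger these policies on traffic metrics are omitted.

```python
import boto.ec2.autoscale
from boto.ec2.autoscale import ScalingPolicy

conn = boto.ec2.autoscale.connect_to_region("us-west-2")

# Grow the group by 10% as traffic climbs toward evening peak...
scale_up = ScalingPolicy(
    name="scale-up",
    as_name="example-edge-service-v042",  # hypothetical ASG name
    adjustment_type="PercentChangeInCapacity",
    scaling_adjustment=10,
    cooldown=300,
)

# ...and shrink it again as traffic falls toward the overnight trough.
scale_down = ScalingPolicy(
    name="scale-down",
    as_name="example-edge-service-v042",
    adjustment_type="PercentChangeInCapacity",
    scaling_adjustment=-10,
    cooldown=300,
)

conn.create_scaling_policy(scale_up)
conn.create_scaling_policy(scale_down)
```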
15:22 Yeah.
15:23 That makes sense.
15:24 Can you talk a little bit about the architecture?
15:26 Is this something you know about?
15:28 Are you using lots of microservices?
15:31 Are you using containers?
15:33 Can you speak to any of that?
15:34 I can.
15:35 I can.
15:36 So we have not yet started deploying containers across our environment.
15:44 We use a baked AMI approach to deal with microservice deployments.
15:50 We are a classic microservice, service-oriented architecture environment without a centralized service bus.
15:58 Last I looked, and unfortunately, I'm not on VPN right now, so I can't confirm the number.
16:03 We had something like 1,200 or so services in our production environment.
16:08 Wow.
16:08 1,200 distinct services, not instances of the servers running those services, right?
16:14 Exactly, yes.
16:15 So obviously, some of those services might have thousands of instances running.
16:19 Some of those services may only have two or three instances running.
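As a rough illustration of the "baked AMI" deployment idea mentioned above: freeze a fully configured build instance into an image, then launch the service's auto scaling group from that image. All IDs and names here are hypothetical, and Netflix's actual bake tooling (Aminator and friends) does far more than this sketch.

```python
import boto.ec2
import boto.ec2.autoscale
from boto.ec2.autoscale import AutoScalingGroup, LaunchConfiguration

ec2 = boto.ec2.connect_to_region("us-west-2")

# Bake: snapshot a configured build instance into an AMI.
ami_id = ec2.create_image("i-12345678", "example-service-v043")

# Deploy: point a launch configuration and an auto scaling group at it.
autoscale = boto.ec2.autoscale.connect_to_region("us-west-2")
lc = LaunchConfiguration(
    name="example-service-v043-lc",
    image_id=ami_id,
    instance_type="m3.large",
)
autoscale.create_launch_configuration(lc)

asg = AutoScalingGroup(
    group_name="example-service-v043",
    launch_config=lc,
    min_size=2,
    max_size=20,
    availability_zones=["us-west-2a", "us-west-2b", "us-west-2c"],
)
autoscale.create_auto_scaling_group(asg)
```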
16:24 How do you guys manage, like, hey, there's this functionality that exists, so as this other service, don't go write your own, and just sort of keep people knowledgeable in discovering these things at that scale?
16:35 Yeah, so that's an interesting question.
16:37 It comes down to a company culture.
16:39 We try to be as agile with a lowercase a as possible and decentralized as possible, which means we don't want to sort of shunt people through some sort of centralized approval or, you know, information distribution system.
16:56 And that means that, in fact, it is a little harder to make sure that if you need to get something done, that you'll know if anybody else has done it.
17:03 We count on a lot of informal communication between teams.
17:07 We count on the fact that we're all geographically co-located.
17:11 So every engineer at Netflix working on our cloud ecosystem is either working in the building I work in or the building to my left or the building to my right.
17:20 Right.
17:21 Is that Los Gatos, California?
17:22 Yeah, yeah.
17:23 And then we basically try to make it so if you know that you need something, ideally and hopefully before you start building it, you might at least know who else to ask who might know whether or not somebody else has already built it.
17:39 And sometimes we will just have duplication.
17:41 And, you know, we tend to hire people who are reasonable enough about this kind of stuff that they're not going to become overly invested in their own solution rather than the right solution.
17:52 So when we find duplication, then the people who own the duplicating code or function can sit together and figure out, well, what do we want to do with this now?
18:01 Right. Try to narrow it down to just one, maybe bring the features everybody needed into that one service, right?
18:08 Potentially, or have a better understanding of why you need two.
18:11 So I think, because I've been in a bunch of these conversations, when that happens, really the goal is to either understand why you should have two, and that will help you clarify what these two things do differently from each other.
18:26 Or decide, no, that doesn't really make sense, so we'll just have one.
18:30 That makes a lot of sense, and it seems like a wonderful place to work if you can just go out and have a lot of freedom and it's not very top-down.
18:37 That's great.
18:37 Yeah, and actually, I mean, that ends up being really relevant to this whole conversation about Python, because when I started using Python in the engineering side of the house at Netflix, at the time there wasn't really a lot of appetite for Python in engineering.
18:53 I think it felt like I was the first one proposing to build production services with Python.
19:00 And my boss, frankly, really didn't like this.
19:02 I think my boss would have much preferred that I spend the two or three months to learn Java, because we had basically everything else written in Java.
19:11 And we had a whole bunch of infrastructure libraries making it really easy for Java developers to run in the Netflix cloud ecosystem.
19:18 And, you know, every week or two, I'd be sitting down with my boss, giving him an update on how my project was going.
19:25 And maybe about every other one of those conversations, he'd say, so you really think Python's the right way to do this?
19:31 And I'd say, yes, yes, yes.
19:33 And, you know, because of the way we tend to think about where the engineering decisions need to be made at Netflix, namely in engineers, he let me sort of run with it and gave me all the rope I needed to, in this case, validate that that was a good idea.
19:49 It's really cool when you can bring a new idea and actually show, hey, this is not a bad idea and see how it works.
19:55 And I think one of the best ways to prove that is not to have a meeting, but to just build something that works and say, look what I did.
20:03 Look how great this is and how easy this is.
20:06 And we can do more of this.
20:07 And I think, you know, in programming, a lot of times the best way to show something works is to just do it and look back on it, you know?
20:14 Yeah, the best way to show something works is to show something works.
20:18 Yeah, show something working, right?
20:19 Exactly.
20:20 One of the things that seems like you guys are really into is open source at Netflix.
20:25 You've got a lot of cool stuff you're doing.
20:27 You seem to be open sourcing these libraries.
20:28 Can you talk about some of the more popular ones?
20:31 Sure.
20:32 I think we've gotten a lot of interest out of the RxJava work that we're actually doing, which, of course, is not so much Python.
20:41 But our data science side, which is a big fan of Python, has open sourced a bunch of work using Python.
20:50 One thing that we haven't yet open source that we've talked about and we'd like to open source as soon as possible, actually, is something that my team has been doing over the last year or so, which is a RESTful service to do anomaly and outlier detection.
21:06 And that RESTful service uses a whole bunch of scikit-learn and pandas algorithms to help us drive automated operational decisions.
21:18 My real-time analytics engineers are big fans of Python in that space.
21:23 One of them is, in fact, speaking about it right now.
21:25 Wow.
21:26 Where's that?
21:26 QCon, New York.
21:27 Okay.
21:28 Fantastic.
21:29 Yeah.
21:29 We've got another tutorial coming up in PyData in Seattle.
21:33 So we'd love to actually share all of that code with the community as soon as we have some time to breathe and get that done.
21:40 Wow.
21:41 Amazing.
21:42 So you guys are using scikit-learn and machine learning to monitor some of these cloud instances and services.
21:49 Yeah, but not just cloud instances and services, right?
21:52 Cloud services and instances, to some degree, are the smallest domain of data that we have.
21:58 It's, to some degree, the most public and visible one.
22:01 But if you think about it, we have millions of pieces of content, right?
22:05 You know, millions of TV shows and movies.
22:07 And each of them is encoded into a whole bunch of different formats and a bunch of different bit rates.
22:13 We can't set some sort of artificial static thresholds for what it looks like for any one of them to be successful.
22:21 So you have to really learn in production what the expected viewing rate is for any one of these pieces of content, if you want to be able to notice when one of them is going wrong.
22:34 Same thing about devices.
22:35 You know, there are millions of devices spread across thousands of device families, thousands of device models.
22:41 So we can't monitor any one of them sort of manually.
22:46 We have to sort of deduce what the right behavior is in real time so we can notice when one of them starts going wrong.
22:54 That's really cool.
22:55 Just the scale is so big that trying to individually test them all, you probably, by the time you got through the whole list, you'd have to start back at the beginning because they all have new versions and new settings and so on, right?
23:06 So just build a system that watches it, huh?
23:09 Yeah, exactly.
23:10 And can deduce correctness from, you know, the historical patterns.
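As a toy sketch of the general idea (learn what "normal" looks like from history, then flag departures), here is scikit-learn and pandas on synthetic play-rate data. This is not the Netflix service being described, which wraps much more sophisticated models behind a REST API.

```python
import numpy as np
import pandas as pd
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(42)

# Synthetic history: plays-per-minute for one title, one sample per minute.
history = pd.Series(1000 + 50 * rng.randn(10000))

# Learn the historical envelope of "normal" behavior.
detector = EllipticEnvelope(contamination=0.01)
detector.fit(history.values.reshape(-1, 1))

# Score fresh observations: predict() returns 1 for inliers, -1 for outliers.
fresh = np.array([[995.0], [1012.0], [400.0]])  # the last one has tanked
for value, label in zip(fresh.ravel(), detector.predict(fresh)):
    if label == -1:
        print("anomaly: plays/minute = %.0f" % value)
```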
23:15 I wonder how many organizations are actually applying machine learning to the monitoring of their software.
23:21 And this thing you're going to open source?
23:22 Yeah, we'd like to.
23:24 Will it have a name?
23:25 Well, yes.
23:28 Everything has a name.
23:29 Yeah.
23:29 Sorry, do you know the name that you're planning to give it so people would know and to look for it?
23:40 This episode is brought to you by Hired.
23:46 Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.
23:52 Each offer you receive has salary and equity presented right up front, and you can view the offers to accept or reject them before you even talk to the company.
24:02 Typically, candidates receive five or more offers in just the first week, and there are no obligations ever.
24:08 Sounds pretty awesome, doesn't it?
24:10 Well, did I mention there's a signing bonus?
24:13 Everyone who accepts a job from Hired gets a $2,000 signing bonus.
24:16 And as Talk Python listeners, it gets way sweeter.
24:20 Use the link hired.com/talkpythontome, and Hired will double the signing bonus to $4,000.
24:28 Opportunity's knocking.
24:30 Visit hired.com/talkpythontome and answer the call.
24:33 You know, there's a time-honored tradition at Netflix that if you're a developer working on a product,
24:52 you get to name that product when we open source it.
24:55 And I haven't actually asked the developers whether or not they've picked a name for the open source product.
25:03 We have an internal name for it, which is Kepler, but I'm not sure that it'll end up being the public name for it.
25:09 Okay.
25:10 Wow.
25:11 Excellent.
25:11 So you have some other things that are kind of related.
25:15 One of the things I read about when you wrote your blog was that you have a large number of alerts that get sent for various reasons.
25:23 And you have this whole central alert gateway thing that's written in Python.
25:28 Well, yes and no.
25:32 So this is actually, this is perhaps maybe the bad news from a Python perspective.
25:39 When I came over to engineering in 2011, I wrote a bunch of really useful stuff.
25:46 One of them was the central alert gateway.
25:48 Another thing was Howler Monkey and Security Monkey.
25:52 The interesting thing that's happened over time is back in 2013, I became a manager, which means I pretty much get paid not to code.
26:00 Exactly.
26:02 And so some of the stuff that I wrote that we haven't publicly talked about ended up being kind of a bust.
26:09 You know, one of the things that we try to do is we take a bunch of bets, we try to minimize the cost of the bets, and some of them will go great, some of them will not go great, and we'll just kill them quickly.
26:18 Yeah, but the key to be successful is experimenting, right?
26:22 Yeah, exactly.
26:22 So the central alert gateway was highly successful.
26:25 Incredibly useful.
26:28 It is at the core of knowing that something's gone wrong in our environment, and we have a long-term commitment to it.
26:34 However, it turned out that after doing maybe about two years of sort of organic iterative development on it, it really needed a refresh.
26:42 And it also needed ownership by software developers who were being paid to be software developers.
26:48 And so it ended up moving over to my team formally, and the developer who's responsible for it ended up reimplementing it in Scala.
26:56 Okay.
26:57 So the CAG these days is no longer Python-based.
27:01 I think I still, in my spare time, still maintain a client in Python, so you can talk to the CAG directly.
27:08 But the same thing, frankly, happened with Security Monkey and Howler Monkey as they moved into other groups.
27:14 Those other groups made their own decisions as to how to keep them organized.
27:20 I think maybe to some degree you could argue that this is a nice example of how, personally, I was able to use Python to very rapidly iterate over what needed to be done.
27:33 And then once it reached a stable point, then, I mean, frankly, I don't know that the language mattered all that much.
27:40 And the developer who preferred to use Scala ended up re-implementing it in Scala.
27:46 Sure.
27:46 And that's part of that developer freedom that you guys talked about, which we can come back to again.
27:50 I think it's easy to, well, it could be easy to look at this and go, oh, well, they tried it in Python and it failed.
27:57 But I think that's exactly the wrong message to take from it.
28:00 Like you were saying, it's almost the success of Python is this thing came into existence.
28:05 It was written well and quickly.
28:07 And then it's evolved since then.
28:10 But, you know, the reason it's here is because it was easy to do this in Python to get started.
28:14 Yeah.
28:14 And again, it wasn't like the company made a decision that it wasn't going to be in Python, frankly, because the company doesn't make those sort of decisions.
28:23 The company made a decision that it was really worth investing in and having as a prime sort of first class member of our ecosystem.
28:31 And then the company knows that developers make the best sort of the best decisions when it comes to this sort of like engineering and implementation kind of decision.
28:42 And so we didn't really have an investment in having it be in any in any given language.
28:48 And the developer who ended up doing it decided to do it in the language that she was most comfortable with.
28:53 Right.
28:54 And that makes sense as well.
28:55 You don't want to force somebody to write in a language they're not familiar with, or you're not going to end up with as good of a product.
29:01 So one of the things I think Netflix is really known for, besides the 35% bandwidth story and, you know, huge AWS usage and things like that, is data driven sort of decisions and that type of stuff.
29:17 Yes, especially and I would say mostly for our public product.
29:21 So can you talk a little bit about how you guys are using data science there, maybe some of the tools like IPython or things like that if you're using them?
29:28 Sure.
29:29 We're big fans of IPython.
29:31 And boy, howdy, can I tell you, I think it was two years ago that I did PyCon.
29:36 Yeah, I think it was PyCon 2013 when I discovered the IPython notebook.
29:41 But I feel like my life's never been the same since then because that thing is just glorious.
29:46 Yeah, it really is amazing.
29:48 Yeah.
29:48 And in fact, we use an IPython notebook now as the format for our take-home tech assignment for candidates for the real-time analytics group under me.
30:00 But that's neither here nor there.
30:02 So largely, when you look at data science driving product decisions, we're big fans of A-B tests.
30:10 And should I define A-B tests or is that relatively well known?
30:15 Yeah, sure.
30:16 Go ahead.
30:16 Not everyone's sort of front-end customer-facing folks, maybe.
30:20 Yeah, so basically, if you think about it, you know, you've got a bunch of customers who use your product and you're thinking about whether or not a given feature might be useful.
30:28 And first of all, you've got to define what useful means.
30:31 What is it that you're trying to actually change?
30:33 You need to have a good enough understanding of your product that you can basically tie everything into some sort of key performance indicator, KPI.
30:40 And then you actually implement the feature and you subject part of your customer population to a test where some people see the original behavior, some people see the new behavior, and you see whether or not there's a difference in the KPI between those two groups.
30:56 Now, I'm massively simplifying this because in our environment, for example, any given A-B test might have upwards of a dozen or so different test cells.
31:11 And, of course, we don't just run one A-B test.
31:14 I could be subject to, I think the last time I looked as a Netflix customer, I was randomly allocated to something like either 12 or 15 or so different A-B tests.
31:27 So, that's really cool.
31:28 How might those show up?
31:29 Would that be like what's in the recommendations or maybe other parts of the UI?
31:36 Yeah, it can be a whole bunch of different things.
31:38 So, you know, it can come down to, for example, what image do we show you for a given movie?
31:42 So, for example, if you look at Orange is the New Black, there's a little, you know, picture that we show you for the Orange is the New Black movie.
31:50 Well, does one picture actually get people to want to watch that more than another picture?
31:57 One easy way to look at it is to, you know, basically divide your customers into a bunch of different groups and show each of them a different picture and then actually see if one of those pictures ended up resulting in more click-through rates, right?
32:10 So, that's a relatively trivial, small difference.
32:13 You can look at bigger things, like, for example, at what order do we show you recommendations?
32:17 What language we may use?
32:20 And in some cases, complete overhauls of the UI.
32:23 So, I saw an article recently that suggested that sometime in June, we're going to completely overhaul the computer-based UI to make it sort of much niftier and, you know, prettier.
32:36 And that's a complete overhaul that obviously would have been tested on a small group first to see if it actually causes people to engage with us more or less.
32:46 Right.
32:46 And you only choose the ones that make them engage with you more, of course.
32:50 Well, ideally, or at least, you know, assuming that that's the KPI that we wanted to actually affect, right?
32:57 So, again, this all comes down to what am I actually trying to accomplish with this particular feature?
33:02 Sure.
33:03 I saw, I think it was an article in Wired, and I'll put the link in the show notes, that some of the guys, it was from a Hadoop conference.
33:10 They were looking at the different things you can change in A-B test, and they were actually looking at the color spectrum of the image and comparing those across other movies that may be more popular or less popular.
33:24 And it sounds like there's a lot of dimensions that you guys look at.
33:27 Oh, my gosh, yes.
33:28 But the beautiful thing is, if you've already built essentially the infrastructure to do A-B tests relatively cheaply, then that means that you open the door to doing a lot more experimentation.
33:39 And you don't need to necessarily have a conjecture that you're going to significantly move the needle.
33:45 So, it's all about cost-benefit analysis, right?
33:49 And if you think about it, if it costs you a lot to try a test, then, boy howdy, you've got to conjecture that that test is really going to change some KPI significantly.
34:01 But if implementing A-B tests is relatively inexpensive and can be thought of as almost free, then you can be a lot more experimental than you would be otherwise.
34:13 Which is why, by the way, for example, the group responsible for building our A-B test infrastructure is not called the A-B test group.
34:24 They're called the experimentation platform group.
34:26 Because this is really how you enable cheap and easy experimentation.
34:30 Yeah, and you have so many users that you get very easily statistically significant data, right?
34:38 Yeah.
34:38 That does make it easier.
34:40 Very, very cool.
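As a worked toy version of the artwork comparison above, with made-up numbers: given impressions and clicks for two images, a standard chi-squared test tells you whether the difference in click-through is likely to be more than chance.

```python
from scipy.stats import chi2_contingency

# rows: image A, image B; columns: clicked, did not click (made-up data)
table = [
    [4400, 95600],  # image A: 4.4% click-through on 100,000 impressions
    [4900, 95100],  # image B: 4.9% click-through on 100,000 impressions
]
chi2, p_value, dof, expected = chi2_contingency(table)

# A tiny p-value means the difference is very unlikely to be chance.
print("p-value: %.3g" % p_value)
```

With samples this large, even a half-point difference in click-through comes out overwhelmingly significant, which is part of why a huge subscriber base makes cheap experimentation so effective.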
34:41 One of the things that I heard that really intrigued me about the architecture and things you guys do is this thing called the Chaos Monkey.
34:48 And it's been around for a while, but can you talk about that?
34:51 Sure.
34:51 Chaos Monkey is a relatively proven concept.
34:55 We started doing this as we went into the cloud maybe about four years ago or thereabouts.
35:00 And the idea is that given that servers will die, you know, this is a fact.
35:07 This is true in every environment I've ever seen, whether you're in the cloud or not.
35:11 Your servers will die.
35:13 And trying to avoid it or deny it does not lead to, you know, happiness or more robust systems.
35:19 So let's try to actually make sure that our systems are resilient to server death.
35:24 And so we built this small component that we ended up calling Chaos Monkey that goes around every day and kills one of your servers for every application group that we have.
35:33 So we talked about those 1,200.
35:35 And this is in production, right?
35:37 Oh, yeah.
35:37 That's, I would argue, the most useful way to have Chaos Monkey run.
35:42 Amazing.
35:43 Okay.
35:43 Okay.
35:43 Keep going.
35:44 Sorry.
35:44 Yeah.
35:45 So basically, you know, if you think about those 1,200 applications running in production, we kill at least one server out of each of these applications because they're running on multiple servers every day to make sure that we don't have a problem when that happens.
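The core loop is simple enough to sketch in a few lines of boto. The names here are hypothetical, and the real, open-sourced Chaos Monkey is far more careful than this, with schedules, opt-outs, and business-hours windows.

```python
import random
import boto.ec2
import boto.ec2.autoscale

autoscale = boto.ec2.autoscale.connect_to_region("us-west-2")
ec2 = boto.ec2.connect_to_region("us-west-2")

for group in autoscale.get_all_groups():
    instance_ids = [i.instance_id for i in group.instances]
    if len(instance_ids) < 2:
        continue  # don't take down a service's only instance
    victim = random.choice(instance_ids)
    print("chaos: terminating %s from %s" % (victim, group.name))
    ec2.terminate_instances(instance_ids=[victim])
```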
35:59 So when people start building their service, they have to, from the very beginning, plan for the fact that these machines are going to die, which is reality, right?
36:07 But it's really neat that you've put this system together to test it.
36:11 And that's pretty leading edge, I think.
36:14 But what I think really takes it to the next level is you leave it running in production.
36:19 That's amazing.
36:20 Yeah.
36:20 And, you know, the interesting thing is it feels leading edge when you're outside of this environment and it feels leading edge when you start doing it.
36:28 But, you know, off the top of my head, I don't think Chaos Monkey has actually exposed a problem for us in the last, I want to say, three years.
36:36 And that's perfect, right?
36:37 Because that's exactly what it's there for.
36:39 And it's a verification that despite the fact that we kill thousands of instances every day as part of Chaos Monkey, we've gotten to the point where it's pretty much normal for us that we can lose a server without any noticeable impact.
36:53 Yeah.
36:53 That's excellent.
36:54 I mean, you really need to make sure the service doesn't go down.
36:57 And I can't remember the last time that Netflix wasn't available and it was your fault.
37:02 You may remember.
37:03 Yes.
37:05 So one of the fun parts about being involved in every major incident postmortem at Netflix is that I do, in fact, get to be aware of every glitch in the system and every significant outage.
37:18 It's worth noting that, thankfully for us, our definition of a significant outage is actually smaller than it would be for our customers.
37:26 We tend to think of problems being significant below the level at which they would likely be noticeable to most of our customers.
37:34 Right.
37:35 Exactly.
37:35 Maybe these machines are using too much CPU or the latency is too high or something like that.
37:40 Is that a possibility?
37:41 Not so much because we tend to think about production impact purely based on whether or not customers are impacted.
37:47 The threshold for us, for example, for major production incidents is whether or not 10% or more of our customers are impacted.
37:54 That's a really big deal for us.
37:56 It happens not that often, but it still happens every once in a while.
38:00 But it does mean that, for example, if 10% of our customers are impacted, the good news is that 90% of them aren't.
38:05 Yeah, sure.
38:06 It's just 10% is really high.
38:08 Yeah.
38:09 And it's been a really long time.
38:12 And in fact, I can't exactly remember off the top of my head when we've seen 100% service impact.
38:18 When that graph of actual current usage of Netflix goes to zero.
38:23 So I can see how you build your systems and your services so they can take this.
38:29 But you're built on multiple data centers spanning AWS cloud availability zones and all that kind of stuff.
38:36 So you also need to be able to deal with if AWS goes down.
38:42 Yes, in theory, we need to be able to do that.
38:44 And I think we've built a pretty good environment because of that.
38:48 So, for example, in the US, we run out of two major regions in AWS, US West 2 and US East 1.
38:55 And we can fail over between those two regions so we can evacuate a region.
38:59 And we do that, I think, basically right now, at least once a month or thereabouts to validate that all that mechanism still works.
39:07 The good news is that, frankly, AWS, to the best of my recollection, hasn't had a significant regional outage in a reasonably long time.
39:16 Now, you know, watch them have one right now because of that.
39:19 Watch what you say.
39:20 No, of course.
39:21 I agree.
39:22 They are absolutely rock solid.
39:23 And, you know, I think they're the best place in the world to be, including your own data center, right?
39:29 Because those things go down, too.
39:32 And then that's your problem.
39:32 Yeah, I've been there before.
39:34 I've spent most of my life in data centers.
39:36 But, you know, going back to Chaos Monkey for a second, there's a point that's worth making, which is we didn't so much build Chaos Monkey because AWS wasn't reliable enough.
39:45 We built Chaos Monkey because AWS was too reliable.
39:49 What I mean by that is if AWS was causing us to lose machines, for example, on a daily basis, then they would have already given us the requirement that, you know, we needed to be resilient to that.
40:02 Everybody running in our environment would have known that they need to build that way because they would have lost machines just by the natural sort of unreliability of AWS.
40:10 But AWS's reliability, even down to the machine level, is actually high enough that we needed to simulate machine failures because we wanted it to happen at the rate we wanted it to happen.
40:21 Right.
40:21 It almost never happens.
40:23 So how do you prepare for it?
40:24 Right.
40:25 Exactly.
40:25 It's amazing.
40:26 So you guys built a tool that's related to Chaos Monkey called Chaos Gorilla.
40:32 Yes, exactly.
40:33 We've sort of taken the Simian Army motif and we've run with it.
40:36 So we actually have Chaos Monkey, Chaos Gorilla, and Chaos Kong.
40:43 Believe it or not, Chaos Monkey is about simulating a single machine failure, as I mentioned.
40:50 Chaos Gorilla was something we built to allow us to evacuate availability zones within a region.
40:56 So should we briefly talk about regions versus availability zones within AWS?
41:01 Yeah, absolutely.
41:02 Sure.
41:02 So an Amazon region is a geographical location that actually can contain multiple availability zones.
41:10 An availability zone can be thought of as, let's say, for example, a data center.
41:15 And a region might contain several data centers that are pretty close to each other.
41:19 And so we wanted to be able to tolerate a single machine failure or a single availability zone failure within a region.
41:27 And then we built Chaos Kong to be able to tolerate the loss of a region.
41:31 Lately, I think we found that we've gotten good enough at evacuating regions that if a single AZ within a given region is going to fail,
41:41 we would probably prioritize just leaving that region temporarily rather than trying to rebalance within that region.
41:48 So we practice regional evacuation and regional rebalancing pretty regularly, at least once a month with no adverse effects.
41:57 And every time we don't so much look at whether or not we've impacted our customers anymore because we really don't,
42:04 but rather how quickly we were able to effect an impact-free evacuation.
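For readers who want to see the region and availability-zone structure just described, classic boto can list the zones inside a region directly:

```python
import boto.ec2

# Each region contains several availability zones (roughly, data centers).
for region in ("us-west-2", "us-east-1"):
    conn = boto.ec2.connect_to_region(region)
    print(region, "->", [zone.name for zone in conn.get_all_zones()])
    # e.g. us-west-2 -> ['us-west-2a', 'us-west-2b', 'us-west-2c']
```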
42:09 That's really amazing that you guys built that.
42:11 That's great.
42:12 So keeping on the monkey theme, you guys also have Security Monkey and Howler Monkey.
42:16 Yeah.
42:17 Because everybody loves monkeys.
42:20 Yes.
42:21 So Security Monkey and Howler Monkey were both written by me.
42:25 Actually, I would say Howler Monkey was the first project I wrote when I moved over to engineering from IT at Netflix.
42:33 And at the time, we were trying to solve a problem where basically, you know, we have dozens of SSL certificates spread around different parts of our environment.
42:42 And when those certificates expire, as they are wont to do, you end up with a production event.
42:49 And so we built a system to automatically discover certificates and alert us when they're getting close to expiration.
42:55 That ended up being at the heart of Security Monkey.
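The expiry check at the heart of what Roy describes can be sketched with the standard library alone; the hostname below is hypothetical, and the real system also discovers certificates across the environment rather than reading a fixed list.

```python
import socket
import ssl
import time

def days_until_expiry(host, port=443):
    """Fetch a host's TLS certificate and return days until it expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - time.time()) / 86400.0

for host in ("example.netflix.com",):  # hypothetical host list
    remaining = days_until_expiry(host)
    if remaining < 30:
        print("ALERT: %s certificate expires in %.0f days" % (host, remaining))
```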
42:58 And separately from that, what we were finding at the time was that we were growing at fast enough rates that we kept running into the limits imposed on us by Amazon.
43:08 Amazon, as a cloud environment, is infinitely scalable.
43:11 But they need to protect themselves.
43:13 So there are essentially logical limits to the resources you use.
43:16 For example, the number of autoscaling groups that you might be able to provision or the number of instances you might be able to provision.
43:23 At the time, we were increasing our footprint fast enough that we kept running into those limits.
43:28 And then we couldn't get any more until we asked them to increase them.
43:32 And traditionally, of course, that would happen on Friday at about 5 o'clock in the afternoon.
43:36 And that would spoil somebody's Friday evening.
43:38 And so we built something that allowed us to monitor our usage compared to limits.
43:43 And allowed us to alert ourselves when we get pretty close, around 80%, so we can talk to Amazon about increasing our thresholds.
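Here is a hedged sketch of that usage-versus-limits idea: count what's in use and complain at 80% of the limit. The limit value lives in local config, and notify() is a hypothetical stand-in for the real alerting path.

```python
import boto.ec2.autoscale

ASG_LIMIT = 500  # the AWS-granted account limit, kept in local config

def notify(message):
    """Hypothetical hook into the alerting system."""
    print("ALERT: %s" % message)

conn = boto.ec2.autoscale.connect_to_region("us-west-2")
# Note: get_all_groups paginates; a real check would walk next_token.
usage = len(conn.get_all_groups())
if usage >= 0.8 * ASG_LIMIT:
    notify("autoscaling groups at %d/%d (%.0f%%) of limit"
           % (usage, ASG_LIMIT, 100.0 * usage / ASG_LIMIT))
```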
43:52 You guys must have a special relationship with AWS.
43:55 Yeah, you know, we're big fans of AWS.
43:58 AWS seems to be big fans of us, and we talk a whole lot.
44:02 Yeah, I'm sure that you guys are actually a topic of conversation in their engineering group fairly often.
44:07 That's cool.
44:08 Nice.
44:09 So one of the phrases or quotes I saw from your blog post was you said something to the effect of,
44:18 we found a formal change control system didn't work well within our culture of freedom and responsibility.
44:23 That sounds really cool.
44:25 And so you created this thing called Chronos.
44:26 What's this culture of freedom and responsibility?
44:29 Netflix has done a lot of work to formally define the kind of culture it likes to have internally.
44:35 So we have a culture slide deck.
44:37 It's now publicly available on SlideShare.
44:40 You know, I think Googling for Netflix culture will find it within the first few links.
44:47 If you look at how we think we're going to be successful as a company, it's really about decentralizing and maximizing speed of innovation.
44:55 And if you want to maximize the speed of innovation, we believe that the best way to do that is to hire a bunch of people and to make it so innovation can happen at the lowest, most spread out parts of the organization without the need for coupling between somebody who wants to do something and a whole bunch of other people just for sort of approval.
45:15 Which means that we have a culture that emphasizes freedom, but also means that for that you get responsibility for your actions.
45:24 And examples of freedom mean, for example, that we don't have approvals on purchases.
45:31 And I'll say that again, because it can kind of sound a little crazy.
45:35 There are no actual approval processes for making a hardware purchase.
45:41 Yeah.
45:43 As another example, let's talk about expenses.
45:46 Netflix's expense policy is very robustly documented.
45:50 It is act in Netflix's best interests.
45:53 That sentence, there you go, is the entirety of our expense policy.
46:00 And what that ends up meaning in reality is that, for example, when I submit an expense, there's no approval for that expense.
46:08 It'll get automatically paid.
46:09 And then at the end of the month, my manager will get a list of expenses that were automatically paid within her or his organization.
46:16 Sure.
46:17 And as long as you're behaving well for Netflix, it's all good, right?
46:20 If you take some client or somebody out to dinner and you think it's good for Netflix, then it's all right, huh?
46:26 Right.
46:27 Exactly.
46:28 So that also means that from an engineering perspective, engineers actually get to decide what's the right thing for Netflix, both in terms of how they solve problems and when, for example, they might push to production.
46:40 Now, it turned out that that ended up being wrong because as soon as I built a bunch of infrastructure libraries in Python for working in the cloud, I started getting a bunch of people contacting me informally and going, hey, I heard you have this thing that would allow me to do Python more easily.
47:08 So that turned out to have enabled, I think, a lot of other people in Netflix to use Python.
47:13 But from the change perspective, in a typical environment, and I've worked in a bunch of them, if a developer wants to do something in production, they have to submit a change control ticket, which will then end up getting approved by somebody.
47:26 It's not very consistent with our culture.
47:29 And having worked in a bunch of other organizations that do that, I don't think it's particularly useful anyway.
47:33 So we don't do that.
47:35 Right.
47:35 It makes people feel like they have control, but it doesn't really help that much.
47:40 Yeah.
47:40 I mean, I think what happens is basically the people who are kind of far away from that change end up having to approve it, but they don't actually know what that actually means.
47:48 It's not like they're actually being any more responsible for it.
47:51 And typically what happens is when a change ends up not having been the right thing, then the company will react by saying, well, now you need a director to approve this kind of change, or you need a VP to approve this kind of change.
48:01 Moving the locus of control over the decision further and further away from the people who are most qualified to make the decision.
48:09 Yeah, that's a really good point.
48:11 The more approval it needs, the less likely that person is able to actually understand what they're approving.
48:18 That's very interesting.
48:19 Yeah.
48:20 And the other thing is it actually takes away from your responsibility.
48:23 So if you're a developer and you want to deploy at five o'clock in the afternoon on a Friday, for example, which is kind of prime time for us, and generally speaking, to be perfectly honest, you know, I don't do that.
48:34 My team, generally speaking, doesn't do that unless we have a very good reason to.
48:39 Then if you can basically submit a change control ticket and say, well, you know, my boss or my director approved it, so it's okay.
48:46 It means you're not actually practicing a whole lot of responsible thinking as to whether or not this is a good idea right now.
48:52 So putting it entirely in your hands means you have complete freedom to do what you think is right, but also means you have the responsibility to make the best decision.
49:01 That's great.
49:03 I think a lot of the best engineers really and developers really, you know, value that freedom and responsibility.
49:10 So I think it's great.
49:11 Yeah, it's worked reasonably well for us.
49:13 Cool.
49:14 So Python 2 or Python 3, Netflix?
49:20 Right now, my impression is, and remember, we're not exactly a very standards-driven company.
49:26 There's no, you know, overall board that says this is how thou shalt use Python and Netflix.
49:32 My impression is that generally speaking, we seem to be using Python 2 across the board.
49:36 I don't know anybody using anything older than 2.7, thankfully.
49:41 And I suspect that in the next year, we'll see Python 3 increasing in its coverage.
49:46 Okay, great.
49:47 I also saw that you guys sponsored PyCon this year.
49:51 That's great.
49:51 Thanks for that.
49:52 Did you do any presentations or Netflix as a group do Python presentations?
49:56 Boy howdy, you know.
49:57 I got to tell you, I don't know off the top of my head whether or not Netflix did any presentations.
50:01 This is the first year, I think, in about three years that I didn't attend PyCon, despite the fact that PyCon, I think, is one of my favorite conferences, only because Montreal was a little far to go for PyCon for me.
50:15 Yeah, well, you can go a little bit north next year.
50:18 Where is it next year?
50:21 It's in Portland, my hometown.
50:22 Oh, well, perfect.
50:23 I mean, I'm going to Portland next week, so I'm a big fan.
50:25 Excellent.
50:26 So I'll go look through some of the older ones, see if I can pull up something.
50:30 So you guys have a lot of developers and a lot of cool stuff going on.
50:35 Are you hiring people right now?
50:36 If people are listening, are there interesting areas that they should maybe think about?
50:41 Are we hiring people right now?
50:44 Maybe one day the answer to that will be no.
50:48 But I don't suspect that's going to happen anytime in the next few years.
50:51 We are very actively hiring people right now.
50:54 The only thing I wish is that jobs.netflix.com had some sort of counter that let me see how many open positions there are.
51:03 But there's a whole bunch of them.
51:05 Excellent.
51:05 Okay, so jobs.netflix.com.
51:07 People can check that out if they're interested.
51:08 Very cool.
51:09 Yeah.
51:10 And I'll say this.
51:12 There's a bunch of jobs that might actually call for Python experience, a bunch of jobs that might not.
51:18 Two things I would say.
51:20 One, I know a whole bunch of areas within Netflix where Python is heavily used, both in terms of analytics.
51:29 So, for example, my real-time analytics group is big fans of Python.
51:33 But also in terms of sort of infrastructure management and development.
51:36 So, our Cassandra operations people are big fans of Python as well as some of our big data platform people.
51:43 And the other thing is that if you're not actually married to continuing to use Python, there's a whole bunch of other positions at Netflix that want great engineers irrespective of your language.
51:55 We've come to believe that hiring for the language is less valuable to us than hiring for the aptitude.
52:04 That's one of the things that I like about the way we think about hiring compared to a lot of other companies.
52:09 We tend to hire great engineers and then have them use the right language.
52:14 Excellent.
52:15 Yeah, so I'm sure a lot of people are excited to hear that.
52:17 That's cool.
52:17 A few questions I always ask on the way out the door is, first of all, what's your favorite editor?
52:23 So, right now, after many, many, many years, I have switched.
52:30 And I'm not sure if it's going to be a permanent switch, but I've switched to Atom from GitHub.
52:33 Oh, yeah.
52:34 Atom is pretty nice.
52:35 Yeah.
52:36 I tried Sublime for a while and it was okay, but it wasn't sticky.
52:40 I kept coming back to Vim every once in a while.
52:44 And then I haven't tried anything heavier than Sublime or Atom, like, you know, any of the sort of PyCharm or all of that stuff.
52:50 Exactly.
52:50 Okay, cool.
52:51 I know Sublime is very popular within Netflix from my exposure that I've had.
52:56 So, that's cool.
52:57 Yeah, Atom is nice.
52:58 It looks really good.
52:59 How about some cool package on PyPI that people should know about?
53:04 Well, I mean, technically, IPython Notebook is on PyPI, right?
53:08 It is, yeah.
53:09 And I would say, if there was one, well, actually, so for editing and for authoring and for experimentation, IPython Notebook beats everything else out there for me.
53:20 For actually getting work done, oh, boy, howdy, requests.
53:24 Yes.
53:25 Requests, requests, requests.
53:27 Yeah, requests is the most popular Python package, period.
53:32 It's been downloaded an insane amount of times.
53:34 And on show six, we had Kenneth Reitz here to talk about it.
53:37 That was very cool.
53:37 Yeah, I mean, you know, at some point, I wonder, will it just go into the core?
53:44 You know, actually, Kenneth talked about that.
53:46 He was at the language summit there at the last PyCon, and they talked a lot about that, specifically, will requests become part of the standard library?
53:55 And they decided no, because they would like to release changes to requests at a faster pace than Python itself.
54:07 So there was actually quite a bit of discussion.
54:10 I think that that was what their decision was.
54:12 But they said it's going to become the official recommended way to do HTTP within Python, like, instead of urllib2 or things like that, within the documentation.
54:19 All right.
54:22 Very cool.
54:22 So that's definitely a popular one.
54:24 That's great.
54:25 Yeah, IPython and requests.
54:26 People should check those out.
54:27 That's great.
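For anyone who hasn't used it, the appeal of requests over urllib2 fits in a few lines; the URL here is just an example.

```python
import requests

response = requests.get("https://httpbin.org/get", params={"q": "python"})
response.raise_for_status()  # raise if the server returned an error status
data = response.json()       # parsed JSON body as a Python dict
```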
54:27 Roy, any final shout-out you want to give?
54:30 Anything you want to bring the listeners' attention to?
54:32 You know, Monitorama is happening in Portland next week.
54:37 Okay.
54:37 So anybody should check out the presentations coming out of them if they have any interest in monitoring or operational insight.
54:45 I'm going to be at the conference for the first time.
54:47 I'm very excited about that.
54:48 There's a whole bunch of really interesting people who are much smarter than me presenting and attending.
54:53 Check that out.
54:55 I can't think of anything else off the top of my head.
54:58 All right.
54:58 This has been such a great show.
55:00 I really enjoyed the conversation, and I learned a lot.
55:03 Thanks for being here, Roy.
55:04 It's been my pleasure.
55:06 Thanks, Michael.
55:07 Yeah.
55:07 Bye.
55:09 This has been another episode of Talk Python to Me.
55:12 Today's guest was Roy Rapoport, and this episode has been sponsored by Codeship and Hired.
55:17 Thanks, guys, for supporting the show.
55:19 Check out Codeship at codeship.com and thank them on Twitter via @codeship.
55:24 Don't forget the discount code for listeners.
55:25 It's easy.
55:26 TALKPYTHON, all caps, no spaces.
55:28 Hired wants to help you find your next big thing.
55:31 Visit hired.com/talkpythontome and get five or more offers with salary and equity presented right up front and a special listener signing bonus of $4,000.
55:43 Speaking of jobs, remember that Netflix is hiring.
55:45 If what you heard on this show sounds amazing and you're able to work from Los Gatos, California, check out jobs.netflix.com.
55:52 You can find the links from the show at talkpython.fm/episodes/show/16.
55:58 Also, be sure to subscribe to the show.
56:01 Open your favorite podcatcher and search for Python.
56:04 We should be right there at the top.
56:05 You'll also find the iTunes and direct RSS feeds in the footer of the website.
56:10 Our theme music is Developers, Developers, Developers by Cory Smith, who goes by Smixx.
56:15 You can hear the entire song on our website.
56:17 This is your host, Michael Kennedy.
56:20 Thanks for listening.
56:21 Smixx, take us out of here.
56:40 developers, developers, developers, developers, developers, developers, developers, developers.