00:00 Michael Kennedy: In this episode, we'll dive deep into one of the foundations of modern data science, Bayesian algorithms and Bayesian thinking. Join me along with guest Max Sklar as we look at the algorithmic side of data science. This is Talk Python To Me, Episode 239, recorded November 10th, 2019. Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @MKennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @TalkPython. This episode is brought to you by Linode and Tidelift. Please check out what they're offering during their segments. It really helps support the show. Max, welcome to Talk Python To Me.
00:55 Max Sklar: Thanks for having me, Michael. It's very great to be on.
00:57 Michael Kennedy: It's good to have you on as well. You've been on PythonBytes before but never Talk Python To Me.
01:03 Max Sklar: That was a lot of fun. I actually got, someone reached out to me on Twitter the other day saying, "Hey, I saw you on PythonBytes." So, that was really exciting.
01:10 Michael Kennedy: Right on, right on. That's super cool. Well...
01:11 Max Sklar: I heard you on PythonBytes. I always say saw you, when it's really heard you. But anyway.
01:15 Michael Kennedy: That's all good. So now they can say they saw you on Talk Python To Me as well. Now, we're going to talk about some of the foundational ideas behind data science and machine learning, and that's going to be a lot of fun. But before we get to them, let's set the stage and give people a sense of where you're coming from. How'd you get into programming in Python?
01:34 Max Sklar: That is a really interesting question, because I think I started in Python a very long time ago, like 10 years ago, maybe. I was working on kind of a side project called stickymap.com. The website's still up. It barely works. But it was basically a, it was like, my senior project as an undergrad. So I really, I started this in 2005. And what it was was, y'know, Google Maps had just come out with their API, where you can like, you know, include a Google map on your site. And so I was like, "Okay, this is cool. What can I do with this? Let's add markers all over the map, and it could be user generated." We would call them emojis now, and people could leave little messages, and little locations and things like that. This is before there was Foursquare, which is where I worked, which is location intelligence. This was just me messing around, trying to make something cool, and being inspired by the whole host of like, you know, social media startups that were happening at the time. And I was using, what was I using at the time? I was using PHP and MySQL to put that together. I knew nothing about web development. So I went to the Barnes and Noble, I got a book, "PHP and MySQL." I got it, but then sometime around, like 2008, 2009, I realized, you know, a lot of people were talking about Python at work and I realized like, sometimes I need, this is kind of when I was winding down on the project, but I realized, you know, I had all this data, and I realized I needed a way to like clean the data, I needed a way to like write good scripts that would clear up certain fit. Like if I have a flat file of like, here's the latitude and longitude, they're separated by tabs. And here's a, you know, here's some text that someone wrote that needs to be cleaned up, et cetera, et cetera.
Yeah, I can write some scripts in like Python or Java, believe it or not, which I knew at the time, but then, or sorry, a PHP or Python, which I knew at the time, but like, wait, wait, not Java. Sorry. I was trying to do it in PHP and Java, which is a really bad idea.
03:35 Michael Kennedy: Yeah, especially PHP sounds tricky, yeah.
03:37 Max Sklar: Yes, yes, yes. And then I was like, well, I'm just learning this Python, I need something, so let me try to do it with Python. And it worked really well. And then I had, you know, to deal a lot more with CSVs, and stuff like that, tab separated files. And it really was just a way to like, save time at work. And it was like a trick to say, "Hey, that thing that you're doing manually? I can do that in like 10 minutes." And it's not 10 minutes, maybe a couple of hours to write a script, and it's going to take you like one week. Like, I saw someone at work trying to change something manually. And so, this is all a very long time ago, so I don't remember exactly what it was, but it was kind of like a good trick to save time, and it had nothing to do with data science and machine learning at the time. It was more like writing scripts to clean up files.
04:34 Max Sklar: Yeah, the one thing that I was really impressed with was like, how easy it was at the time. Now, when I wanted to do more complicated Python packages in like, 2012, 2013, I realized, oh well actually, some of these packages in Python are complicated to install. But like, I was so impressed with how easy it was to just import the CSV package, and just be like, okay, now we understand your CSV. If you have some stuff in quotes, no problem. You want to clean up the quotes, no problem. Like it was all just like, just happened very fast.
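As a rough illustration of the kind of cleanup script Max describes, here is a minimal sketch using Python's standard `csv` module on tab-separated data. The sample rows, the column layout (latitude, longitude, free text), and the `clean_rows` helper are assumptions made up for this example; they are not taken from the Stickymap project.

```python
import csv
import io

# Hypothetical tab-separated data: latitude, longitude, and a free-text note.
# The second note is quoted because it contains a tab and escaped quotes.
raw = (
    "40.7411\t-73.9897\t  Great coffee here  \n"
    '40.7306\t-73.9866\t"Try the ""special""\tslice"\n'
)


def clean_rows(text):
    """Parse tab-separated text and return (lat, lon, note) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quotechar='"')
    rows = []
    for lat, lon, note in reader:
        # Collapse runs of whitespace (including the embedded tab) to one space.
        rows.append((float(lat), float(lon), " ".join(note.split())))
    return rows


rows = clean_rows(raw)
for row in rows:
    print(row)
```

The `csv` reader handles the quoted field and the doubled quotes inside it, so the script only has to decide what "clean" means for the text column.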
05:02 Michael Kennedy: Yeah, you don't have to statically link to some library or add a reference to some other thing and none of that right, it's all good. It's all right there.
05:09 Max Sklar: Yeah. I mean, that was, those were the days when like, I was still programming in C++ for work. So, you could imagine how big of a jump that was. I mean, that seems so ancient. I used to have to program in C++ for the Palm Pilot. That was my first job out of school, which is crazy.
05:27 Michael Kennedy: Oh, wow, that sounds interesting. Yeah. Yeah, coming from C++. I think people have two different reactions. One like, "Wow, this is so easy. I can't believe I did this in so few lines." Or, "This is cheating. It's not real programming. It's not for me," you know? But I think even people who disagree, like, "Oh, this is not for me", eventually find their way over, they're pulled in.
05:49 Max Sklar: I never had a phase where it was like, oh, this is not for me, but I did have a phase where it was like, I don't see, this is just another language, and I don't see why it's better or worse than any other. I think that's the phase that you go through when you learn any new language where it's like, "Okay, I see all the features. I don't see what this brings me." It was only through doing this specific project where it was like, ah hah! No one could have convinced me.
06:10 Michael Kennedy: Yeah, also, you know, if you come from another language, right, if you come from C++, you come from Java, whatever, you know how to solve problems super well in that language, and you're comfortable, and when you sit down to work, you say, File, New Project, and files, New File, start typing, and say, "Okay, well what do I want to do?" I want to call this website or talk to this database, I'm going to create this and I'm going to do this and bam, like, you can just do it. You don't have to just pound on every little step, like how do I run the code? How do I use another library? What libraries are there? Is there like, every, you know, it's just that transition is always tricky, and it takes a while before you, you get over that and you feel like okay, I really actually do like it over here. I'm going to put the effort into learn it properly because I don't care how amazing it is, you're still going to feel incompetent, at first.
07:03 Max Sklar: The switching costs are so tough. And that's why they say, "Oh, if you're going to build a new product it has to be like 10x better than the one that exists" or something like that. I don't know if that's, you know, literally true, but like, it's true with languages too, because it's really hard to like pick up a new language when everyone's busy at work and busy doing all the tasks they need to do every day. For me, frankly, it was helpful to take that "time off" when I was going to grad school, time off from working full time as a software engineer, to actually pick some of this stuff up.
07:33 Michael Kennedy: Absolutely. Alright, so you had mentioned earlier that you do stuff at Foursquare, and it sounds like your early programming experience with Stickymap is not that different from Foursquare, honestly. Tell people about what you do. I'm pretty sure everyone knows what Foursquare is, what you guys do. But tell them what you do there.
07:51 Max Sklar: People might not be aware of where Foursquare is today, you know. There is, Foursquare is kind of known as that quirky, checkin app, find good places to go with your friends and eat app, you know, share where you are. And that's where we were in 2011, when I joined, up to, you know, a few years ago.
08:10 Michael Kennedy: Uh-huh.
08:12 Max Sklar: But ultimately, you know, the company kind of pivoted business models and sort of said, "Hey, we have this really cool technology that we built for the consumer apps, which is called Pilgrim, which essentially takes the data from your phone and translates that into stops, you know, you stopped at Starbucks in the morning, and then you stopped at this other place, then you stopped at work, et cetera, et cetera." And then, you know, that finds use cases like, you know, across the appo-sphere, I don't even know what to call it, but many apps would like that technology. And so we have this panel, and, you know, so for a few years, I was working on a product at Foursquare called Attribution, where companies who were our clients would say, "Hey, we want to know if our ads are working, our ads across the internet, not just on Foursquare." And we would say, "Well, we could tell you whether your ads are actually causing people to go into your stores more than they otherwise would." And I worked on that for a few years, which is a really cool problem to solve, a really cool data science problem to solve, because it's a causality problem. You can't just say, "Well, the people who saw the ads visited 10% more," because maybe you targeted people who would have visited 10% more.
09:21 Michael Kennedy: Exactly, I'm targeting my demographic, so they'd better visit more, or I got it wrong.
09:26 Max Sklar: That industry is a struggle, because the people that you're selling to often don't have the backgrounds to understand the difference, and sometimes don't have the incentives to understand the difference. But we did the best we could. And so that led to kind of an acquisition that Foursquare did earlier this year of Placed, which was an attribution company owned by Snap, but they sold it to us through this big deal. You can read about it online. And...
09:54 Michael Kennedy: Giant tech company trade.
09:56 Max Sklar: Yeah. And so, I had left Foursquare in the interim but then I recently went back to work with the founder Dennis Crowley and just kind of building new apps and trying to build cool apps based on location technology, which is really why I got into Foursquare, why I got into Stickymap, and I'm just having so much fun, so. That's, and we have some products coming along the way where it's not enterprise, it's not you know, measuring ads, it's not ad retargeting. It's just building cool stuff for people and I, I don't know how long this will last, but I couldn't be happier.
10:32 Michael Kennedy: Sounds really fun. I'm sure Squarespace is, sorry, square, Foursquare.
10:36 Max Sklar: You're not the first one though.
10:37 Michael Kennedy: Fan. Squarespace is right here. Foursquare is in New York where you are. Now I'm sure that that's a great place to be and they're doing a lot of stuff. They used something like Scala? There's some functional programming language that's primarily used there, right? Is it Scala?
10:53 Max Sklar: Yeah, it's primarily Scala. I've actually done a lot of data science and machine learning in Scala, and sometimes I'm kind of envious of Python 'cause there's better tools in Python. And we do some of our, we do some of our initial testing on data sets in Python sometimes, but there is a lot of momentum to go with Scala because all of our backend jobs are written in Scala, and so we often have to translate into Scala which has good tools, but not as good as Python.
11:20 Michael Kennedy: Yeah, so I was going to ask, what's the Python story there? Do you guys get to do much Python there?
11:25 Max Sklar: Yeah, so, I have done, if I can take you back in the, to the olden days of 2014. If that's, if that's allowed, because one of the things that I did at Foursquare that I'm pretty proud of is building a sentiment model, which is trying to take a Foursquare tip, which were like three sentences that people wrote in Foursquare on the Foursquare city guide app. And, that gets surfaced later, it was sort of compared to the Yelp reviews but except they're short and helpful, and not as negative. What we want to do is we want to take those tips and try to come up with the rating of the venue, 'cause we have this one to 10 rating that every venue receives. And so using the likes and dislikes explicitly wasn't good enough because there were so many people who would just click Like very casually. And so we realized at some point, "Hey, we have a labeled training set here." We can say, hey, the person who explicitly liked a place and also left a text tip, that is a label of positive, and someone who explicitly disliked a place, that's labeled a negative. And someone who left the middle option, which we call a meh or a mixed review, their tip is probably mixed. And so we have this tremendous data set on tips. And that allowed us to build a model, a pretty good model, and it wasn't very sophisticated. It was multi-logistic regression, based on sparse data, which was like what phrases are included in the, in the tip.
12:50 Michael Kennedy: Right, trying to understand the sentiment of the actual words, right?
12:53 Max Sklar: Yeah, there was logistic regression available in Python at the time, which is great, but I wanted something a little custom which is now available in Python, but back then, it was kind of hard to find these packages. And not just that, even when there were packages, sometimes it's difficult to say, okay, is this working? How do I test what's going on under the hood? It's not very... So I decided to build my own in Python, which was a multilogistic regression, it means we're trying to find out three categories, like positive review, negative review, or mixed review, based on the label data. And we were going to have a sparse data set, which means it's not like there are 20 words that we're looking for, no. There are like tens of thousands, I don't know the exact number, tens of thousands, hundreds of thousands of phrases that we're going to look for. And for most of the tips, most of the phrases are going to be zero, didn't see it, didn't see it, didn't see it. But every once in a while, you're going to have a one, did see it. That's when you have that matrix where most of them are zero, that's sparse. And then thirdly, we wanted to use elastic net, which meant that most of the weights are going to be set to exactly zero. So when we store our model, most words, it's going to say, hey, these words aren't sentiment. So these don't really affect it, we want to have it exactly zero, except what a traditional logistic regression would do is it would say, okay, we are going to come up with the optimal, but everything will be close to zero. And so you have to kind of store it, you have to store the like, 0.0001. So, that's a problem too. So I actually built that kind of open source and put that on my GitHub under BayesPy back in 2014. I don't think anyone uses it, but it was a lot of fun. I used Cython to make it go really fast. It's kind of a problem at Foursquare, because it's the only thing that runs in Python.
And every once in a while someone asks me like, "What's this doing here?"
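To make the three ideas in that description concrete (three-way logistic regression, sparse phrase features, and an elastic net penalty that drives uninformative weights to exactly zero), here is a small pure-Python sketch. It is not Max's implementation: the toy tips, the hyperparameters, and the choice of proximal gradient descent with soft-thresholding are all assumptions picked to keep the example short.

```python
import math

LABELS = ["negative", "mixed", "positive"]

# Hypothetical labeled tips, stored sparsely as the set of phrases each contains.
TIPS = [
    ({"great", "coffee"}, "positive"),
    ({"great", "service"}, "positive"),
    ({"terrible", "service"}, "negative"),
    ({"terrible", "coffee"}, "negative"),
    ({"okay", "coffee"}, "mixed"),
    ({"okay", "service"}, "mixed"),
]


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, phrases):
    """Class probabilities; only phrases present in the tip contribute."""
    scores = [sum(weights.get((p, k), 0.0) for p in phrases)
              for k in range(len(LABELS))]
    return softmax(scores)


def soft_threshold(value, amount):
    """Shrink toward zero; anything within `amount` of zero becomes exactly 0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def train(tips, l1=0.05, l2=0.01, step=0.5, epochs=200):
    """Proximal gradient descent on log loss with an L1 + L2 (elastic net) penalty."""
    vocab = sorted({p for phrases, _ in tips for p in phrases})
    weights = {}  # sparse store: only non-zero weights are kept
    for _ in range(epochs):
        grad = {}
        for phrases, label in tips:
            probs = predict(weights, phrases)
            for k, prob in enumerate(probs):
                error = prob - (1.0 if LABELS[k] == label else 0.0)
                for p in phrases:
                    grad[(p, k)] = grad.get((p, k), 0.0) + error / len(tips)
        updated = {}
        for p in vocab:
            for k in range(len(LABELS)):
                w = weights.get((p, k), 0.0)
                g = grad.get((p, k), 0.0) + l2 * w  # smooth part: loss + L2
                w = soft_threshold(w - step * g, step * l1)  # L1 part
                if w != 0.0:
                    updated[(p, k)] = w
        weights = updated
    return weights


weights = train(TIPS)
probs = predict(weights, {"great", "coffee"})
best = LABELS[probs.index(max(probs))]
print(best, sorted({p for p, _ in weights}))
```

In this toy data "coffee" and "service" appear equally often in every class, so their weights stay at exactly zero and are never stored, which is the storage benefit described above.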
14:42 Michael Kennedy: Exactly. How do I run this? I don't know, this doesn't fit into our world, right?
14:45 Max Sklar: Yeah, yeah, yeah.
14:45 Michael Kennedy: Cool. All right, well, Foursquare sounds really fun. Another thing that you do that I know you from, I don't know you through the Foursquare work that you're doing. I know you through your podcast, The Local Maximum.
14:57 Max Sklar: Oh, yeah.
14:57 Michael Kennedy: Which is pretty cool. You had me on back on Episode 73.
15:00 Max Sklar: 73.
15:00 Michael Kennedy: So that was cool.
15:02 Max Sklar: That is our most downloaded episode right now.
15:04 Michael Kennedy: Really? Wow, awesome. That's super cool to hear! More relevant for today's conversation though would be Episode 78. Which is all about Bayesian thinking and Bayesian analysis and those types of things. So people can check that out for a more high level, less technical, more philosophical view, I think, on what we're going to talk about if they want to go deeper, right?
15:27 Max Sklar: Absolutely. You can also ask me questions directly, 'cause I ramble a little bit in that, but I cover some pretty, pretty cool ideas, some pretty deep ideas there that I've been thinking about for many years.
15:37 Michael Kennedy: Yeah, for sure. So maybe tell people just really quickly what the Local Maximum is, just to give you a chance to tell them about it.
15:43 Max Sklar: Yeah. So I started this podcast about a year and a half ago in 2018. And it started with, you know, I started basically interviewing my friends at Foursquare being like, hey, this person's working on something cool, that person's working on something cool, but they never get to tell their story. So why not let these engineers tell their story about what they're working on. And since then I've kind of expanded it to cover, you know, current events and interesting topics in math and machine learning that people can kind of apply to their everyday life. Some episodes get more technical, but I kind of want to bring it back to the more general audience that it's like, hey, my guests tonight, we have this expertise. We don't just want to talk amongst ourselves. We want to actually engage with the current events, engage with the tech news and try to think, okay, how do we apply these ideas? And so that's sort of the direction that I've been going in. And it's been a lot of fun. I've expanded beyond tech several times. I've had a few historians on, I've had a few journalists on.
16:42 Michael Kennedy: That's cool. I like the intersection of tech and those things as well. It's pretty nice. This portion of Talk Python To Me is brought to you by Linode. Are you looking for hosting that's fast, simple, and incredibly affordable? Well look past that bookstore and check out Linode at talkpython.fm/linode. That's L-I-N-O-D-E. Plans start at just $5 a month for a dedicated server with a gig of RAM. They have 10 data centers across the globe, so no matter where you are or where your users are, there's a data center for you. Whether you want to run a Python web app, host a private Git server or just a file server, you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24/7 friendly support even on holidays, and a seven day money back guarantee. Need a little help with your infrastructure? They even offer professional services to help you with architecture, migrations, and more. Do you want a dedicated server for free for the next four months? Just visit talkpython.fm/linode. Let's talk about general data science before we get into the Bayesian stuff. So I think one of the misconceptions in general, is that you have to be a mathematician or be very good at math to be a programmer. I think that's a false statement.
17:59 Max Sklar: To be a programmer?
17:59 Michael Kennedy: Yes, yeah. Software Developer, straight up. I built like the checkout page on this e-commerce site, for example.
18:06 Max Sklar: I would agree. I think you need some abstract thinking.
18:10 Michael Kennedy: Yes.
18:10 Max Sklar: You know, you can't like escape letters and stuff and variables, but you don't need, well, in the case of data science to compare, like, you don't need, you don't need algebra, or you don't need maybe a little bit, but you don't really need calculus, and you don't need
18:23 Michael Kennedy: Geometry.
18:24 Max Sklar: Linear algebra, geometry, yeah. Sometimes as a UI engineer, you might need a little geometry.
18:29 Michael Kennedy: I mean, there's certain parts that you need that kind of stuff like video game development, for example. Everything is about multiplying something by a matrix, right? You put all your stuff on the screen, you even arrange it and rotate it by multiplying by matrices and there's some stuff happening there you got to know about but generally speaking, you don't. However, I feel like in data science, you do get a little bit closer to statistics. And you do need to maybe understand some of these algorithms, and I think that's where we can focus our conversation for this show is like, what do we need to know, in general, and then the idea of Bayes' theorem and things like that? What do we need to know if I wanted to go into say data science? Because like I said, I don't really think you need to know that to do like, you know, connecting to a database and like saving the user.
19:17 Max Sklar: Sure.
19:17 Michael Kennedy: You need, you absolutely need logical thinking, but not, not like stats. But for data science, what do you think you need to know?
19:24 Max Sklar: Well, for data science, it really depends on what you're doing and how far down the rabbit hole you really want to go. You don't necessarily need all of the philosophical background that I talk about, I just love thinking about it and it sort of helps me focus my thoughts when I do work on it, to kind of go back and think about the first principles. So, I get a lot of value out of that, but maybe not everyone does. There is sort of a surface level data science or machine learning that you can get away with, if you want to do simple things, which is like, hey, I want to understand the idea that I have a training set, you know, what a training set is. And this is what I want to predict. And here is roughly my mathematical function of how I know whether I'm predicting it well or not. But it could be something simple, like the squared distance, but already, you're introducing some math there. And basically, I'm going to take a look at some libraries, and I'm going to see if something works out of the box and gives me what I need. And if you do it that way, you need a little bit of understanding, but you don't need everything that like I would say, kind of a true data science or machine learning engineer needs. But if you want to go deeper, and kind of make it your profession, I would say you need kind of a background in calculus and linear algebra. And again, like, look, if I went back to grad school and I, like, if I went to a linear algebra final, and I took it right now, would I be able to get every question right? Probably not. But I know the basics, and I have a great understanding of how it works, and if I look at the equations, I can kind of break it down, you know, maybe with a little help from Google and all that.
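The "squared distance" Max mentions as a simple way to score predictions can be sketched in a few lines of Python. The numbers and the two candidate predictors below are invented for illustration only.

```python
def mean_squared_error(actual, predicted):
    """Average squared distance between what happened and what was predicted."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical targets and two sets of predictions for them.
actual = [3.0, 5.0, 8.0]
always_mean = [sum(actual) / len(actual)] * len(actual)  # naive baseline
closer_guess = [3.5, 5.0, 7.5]

baseline_score = mean_squared_error(actual, always_mean)
model_score = mean_squared_error(actual, closer_guess)
print(baseline_score, model_score)
```

A lower score means the predictions sit closer to the observed values, which is all a library needs in order to compare one model against another.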
21:06 Michael Kennedy: But I think there's a danger of using these libraries to make predictions and other stuff, when you're like, well, the data goes in here to this function, and I call it and then out comes the answer. Maybe there's some conditionality versus independence requirement that you didn't understand and it's not met or you know, whatever, right?
21:26 Max Sklar: That's why I said it's really surface level and you can get away with it sometimes, but only for so long. And I think understanding where these things go wrong outside the, you know, when you take these black box functions requires both kind of a theoretical understanding of how they work and then also just like experience of seeing things going wrong in the past.
21:46 Michael Kennedy: Yeah, that experience sounds hard to get but at the same time, you just, you got to get out there and do it.
21:52 Max Sklar: Right, well, here's a good example. One time I was trying to predict how likely someone is to visit a store. This was part of working on Foursquare's attribution product, right? And someone was using a random forest algorithm. Or maybe it was just a simple decision tree, I'm not sure. But basically, it creates a tree structure and puts people into buckets and determines whether or not, you know, and for each bucket it says, okay, in this bucket, everyone visited, in this bucket, everyone didn't, or maybe this bucket is 90/10, and this bucket is 10/90. And so I can give good predictions on the probability someone will visit based on where they fall on the leaves of the tree. And we were using it and something just wasn't making sense to me. Somehow the numbers were just, something was wrong. And then I said, "Okay, let's make more leaves." And then I made more leaves, like I made, I made the tree deeper, right? And then they're like, "See, when you make the tree deeper, it gets better." That makes sense because it's more fine-grained. I'm like, "Yeah, but something doesn't make sense. It shouldn't be getting this good." And then as I realized what was happening...
22:59 Michael Kennedy: What was it?
22:59 Max Sklar: What was happening was some of the leaves had nobody visited in this leaf. That makes a lot of sense because most days you don't visit any particular chain. And when it went to zero and it saw someone visited, well, the log likelihood loss, it basically predicted zero percent for an event that did happen. And so we do log likelihood loss, or negative log likelihood loss, the score is like the negative log of that. So essentially, you should be penalized infinitely for that, because there was no smoothing. But the language we were using, which I think was Spark or something like that, and it was probably some library inspired, I probably shouldn't throw Spark under the bus, it was probably some library or something was changing that infinity to a zero. So the thing that was infinitely bad, it was saying was infinitely good. And so that works. And that took, oh God, that took us so long to figure out, like it's embarrassing how long that one took to figure out, but that's a good example of when experience will get you something. I don't think I've ever talked about this one publicly.
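The failure mode in that story can be reproduced in a few lines. This is a sketch under assumptions: the leaf counts are invented, the additive (Laplace) smoothing is one common fix rather than necessarily what the team used, and the "infinity becomes zero" step only imitates the bug as Max describes it.

```python
import math


def leaf_probability(visits, total, alpha=0.0):
    """Estimated visit probability in a leaf, with optional additive smoothing."""
    return (visits + alpha) / (total + 2 * alpha)


def negative_log_likelihood(prob, visited):
    """Loss for one held-out person; infinite if the observed outcome had probability 0."""
    p_outcome = prob if visited else 1.0 - prob
    return math.inf if p_outcome == 0.0 else -math.log(p_outcome)


# Hypothetical leaf: 0 of 50 training people visited, then a held-out person
# who lands in that leaf does visit.
unsmoothed = negative_log_likelihood(leaf_probability(0, 50), visited=True)
smoothed = negative_log_likelihood(leaf_probability(0, 50, alpha=1.0), visited=True)

# Imitating the bug: something downstream silently turns infinity into 0,
# so the worst possible prediction is scored as a perfect one.
buggy = 0.0 if math.isinf(unsmoothed) else unsmoothed

print(unsmoothed, smoothed, buggy)
```

With smoothing the leaf predicts a small but non-zero probability, so the loss is large and finite instead of infinite, and deeper trees with more empty leaves no longer look artificially good.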
22:59 Michael Kennedy: Yeah, well, you just got to know that, you know, that's not what we're expecting, right?
22:59 Max Sklar: Yeah. But you know, theoretically, hey, if I more fine grained my tree, if I make my groups smaller, maybe it works better. But I was like something, I was like, something's not right. It's working a little too good. There was nothing specifically that got me but it was just like...
22:59 Michael Kennedy: There's probably a lot of stuff out there that people are actually taking actions on and spending money on. But it's,
22:59 Max Sklar: Yeah!
22:59 Michael Kennedy: It's like that, right? Yeah, so let's see. So we talked about some of the math stuff. If you really want to understand the algorithms, you know, statistics, calculus, linear algebra, you obviously need calculus to understand like real statistics, right? Continuous statistics and stuff. What else though? Like, do you need to know machine learning? What kind of algorithms do you need to know? Like, what, what in the computer sciencey side of things do you think you got to know?
22:59 Max Sklar: Bread and butter for the data scientists that I work with is machine learning algorithms. So I think that is very helpful to know. And I think that, you know, some of the basic algorithms in machine learning are good to know, which is like k-nearest neighbor, k-means, logistic regression, decision trees, and then some kind of random forest algorithm, whether it's just random forest, which is a mixture of trees or gradient boosted trees we've had a lot of luck with. And then, a lot of this deep learning stuff is, well, neural networks is one of them. Maybe you don't need to be an expert in neural networks, but it's certainly one to be aware of. And based on these neural networks, deep learning is becoming very popular. And I've been hearing and kind of looking into reading about deep learning for many years, but I have to say, I haven't actually implemented one of these algorithms myself, but I just interviewed a guy on my show, Mark Ryan, and he came out with a book called "Machine Learning For Structured Data", which means, hey, you don't just, this doesn't just work for like images or audio recognition, you could actually use it for regular marketing data, like use everything else for, so it's like, all right, that's interesting, maybe I'll work on that now. But I don't think at this point you need to know machine learning to be a good, or deep learning to be a good data scientist or machine learning engineer. I think the basics are really good to know, because in many problems, you know, the basics will get you very far and there's a lot less that can go wrong.
22:59 Michael Kennedy: Yeah, a lot of those algorithms you talked about as well, like k-nearest neighbor and so on, there are several books that seem to cover all of those. I can't think of any off the top of my head, but I feel like I've looked through a couple and they all seem to have, like, here are the main algorithms you need to know to kind of learn data science. So not too hard to pick them up.
22:59 Max Sklar: His last name's Bishop, the book that I read for grad school, but that's already 10 years old. Certainly had all that stuff. That was very deep on math. I can send you a link if you want.
22:59 Michael Kennedy: Sure.
22:59 Max Sklar: I think kind of any intro book to machine learning will have all of that stuff. And basically, it's not, it's not an order of like hard to easy, it's just sort of, hey, these are things that have helped in the past and that statisticians and machine learning engineers have relied on in the past to get started, and it's worked for them, so, maybe it'll work for you.
22:59 Michael Kennedy: Cool. Well, a lot of machine learning and data science is about making predictions. We have some data, what does that tell us about the future, right?
22:59 Max Sklar: Right.
22:59 Michael Kennedy: That's where the Bayesian inference comes from in that world, right?
22:59 Max Sklar: Yeah, it's trying to form beliefs, which could be a belief about something that already happened that you don't know about, but you'll find out in the future or be affected by in the future, or it could be a belief about something that will happen in the future. So something either will happen in the future or you'll learn about in the future. But Bayesian inference is more about, you know, forming beliefs, and I kind of call it, like it's a quantification of the scientific method. So, in the basic form, Bayes' rule is very easy. You start with your current beliefs, and you codify that in a special mathematical way. And then you say, okay, here's some new data I received on this topic, and then it gives you a framework to update your beliefs within the same framework that you began with.
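For reference, the update Max describes is usually written as follows, where H is a hypothesis, D is the new data, and the H_i range over all competing hypotheses. The symbols are the standard textbook ones, not notation used in the episode.

```latex
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)},
\qquad
P(D) = \sum_{i} P(D \mid H_i)\, P(H_i)
```

Here P(H) is the prior (the current belief), P(D | H) is how likely the data would be if the hypothesis were true, and P(H | D) is the updated belief, which can serve as the prior for the next piece of data.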
22:59 Michael Kennedy: Right. And so like an example that you gave would be, say, a fire alarm, right?
22:59 Max Sklar: We know from like life experience that most fire alarms are false alarms. You know, one example is, what is your prior belief that there is a fire right now, without seeing the alarm? The alarm is the data. The prior is, what's the probability that you know, my building is on fire and I need to get the F out right now? You know, it's very low, actually, yeah. I mean, for most of us,
22:59 Michael Kennedy: Yeah, it better be low.
22:59 Max Sklar: It hasn't really happened in our life. Maybe we've seen one or two fires, but they weren't that big of a deal. I'm sure there's some people in the audience who have seen bad fires and for them, maybe their prior's a little higher.
22:59 Michael Kennedy: I once in my entire life have had to escape a fire. Only once, right?
22:59 Max Sklar: Were you in like real danger or?
22:59 Michael Kennedy: Oh, yeah, probably. It was a car and the car actually caught on fire.
22:59 Max Sklar: Oh, yeah, that sounds pretty bad.
22:59 Michael Kennedy: It had been worked on by some mechanics and they put it back together wrong and it like, shot oil over something that caught fire. And so we're like, oh, the car's on fire, we should get out of it.
22:59 Max Sklar: Yeah. But yeah, sitting in your building at work, your prior's going to be much lower than in a car that you just worked on. So when the alarm goes off, okay, that's your data. The data is that we received an alarm today. And so then you have to think about, okay, I still have two hypotheses, right? Hypothesis one is that there is a fire and I have to escape, and hypothesis two is that there is no fire. And so once you hear the alarm, you still have those two hypotheses. One is that the alarm is going off, and there's a real fire, and two is that there is no fire, but this is a false alarm. And so what ends up happening is that, because there's a significant probability of a false alarm, at the beginning there is a very low probability of a fire, and after you hear the alarm, there's still a pretty low probability of fire, because the probability of a false alarm still overwhelms that. Now, I'm not saying that you should ignore fire alarms all the time. 'Cause that's a case where the action that you take is important, regardless of the belief. So, you know, hey, there is a very low cost to checking into it, at least checking into it, or leaving the building, if you have a fire alarm. But there's a very high cost.
22:59 Michael Kennedy: Alright, the consequence of failure is so high.
22:59 Max Sklar: Exactly, exactly. But in terms of just forming beliefs, which is a good reason not to panic, you shouldn't put a lot of probability on the idea that there's definitely a fire.
22:59 Michael Kennedy: Okay, yeah, so that's basically Bayesian inference.
22:59 Max Sklar: Right.
22:59 Michael Kennedy: I know how likely a fire is, I have all of a sudden, I have this piece of data that now there is a fire, I have a set, a space of hypotheses that could apply. Try to figure out which hypothesis, start testing and figure out which one is the right one, maybe.
22:59 Max Sklar: Yeah, so you take your prior, so let's say there's like a, I don't know, one in a hundred thousand chance that there's a fire in the building today and a 99,999 in 100,000 chance there isn't, then you take that, that's your prior, then you multiply it by your likelihood, which is, okay, what is the likelihood of seeing the data given that the hypothesis is true? So what's the likelihood that the alarm would go off, if there is a fire? Maybe that's pretty high, maybe that's close to one or a little bit lower than one. And then, on the second hypothesis, there's no fire. What's the likelihood of a false alarm today, which could actually be pretty high? Could be like one in 1,000, or even one in 100 in some buildings, and then you multiply those together, and then you get an un-normalized posterior, and that is your answer. So it's really just multiplication.
22:59 Michael Kennedy: Yeah, it's like simple fractions once you have all the pieces, right? So, it's a pretty simple algorithm.
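The multiplication Max describes can be sketched in a few lines of Python. The prior (1 in 100,000) and the false-alarm rate (1 in 1,000) are the rough figures from the conversation; the 0.99 chance that the alarm sounds during a real fire is an assumption standing in for "close to one".

```python
# Two competing hypotheses and their prior probabilities.
prior = {"fire": 1 / 100_000, "no_fire": 99_999 / 100_000}

# Likelihood of hearing the alarm under each hypothesis.
# 0.99 is an assumed value; 1/1000 is the false-alarm rate from the conversation.
likelihood = {"fire": 0.99, "no_fire": 1 / 1_000}

# Un-normalized posterior: prior times likelihood for each hypothesis.
unnormalized = {h: prior[h] * likelihood[h] for h in prior}

# Normalize so the posterior probabilities sum to one.
total = sum(unnormalized.values())
posterior = {h: unnormalized[h] / total for h in unnormalized}

print(posterior)
```

With these numbers the posterior probability of a real fire comes out at roughly 1%, which is far higher than the prior but still small next to the false-alarm hypothesis.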
22:59 Max Sklar: It's very hard to describe through audio, but it's much better visually If you want to check it out. I've been struggling to describe it through audio for, you know, for the last year and a half, but I did the best I can.
22:59 Max Sklar: Right.
22:59 Michael Kennedy: Who came up with this idea in the 1700s. But for a long time, it wasn't really respected, right? And then it actually found some pretty powerful, it solved some pretty powerful problems that matter a lot to people recently.
22:59 Max Sklar: Yeah, I mean, I can't do the whole history justice in just a few minutes. But I'll try to give my highlights, which was this Reverend who was sort of, he was a, you know, he was into theology, and he was also into mathematics. So he was probably like pondering big questions. And he wrote down notes, and he was trying to figure out the validity of various arguments. His notes were found after he died, so he never published that. And so, this was taken by Pierre Laplace, who was a more well-known mathematician, and kind of formalized. But when the basis of statistical thinking was built in the late 19th, early 20th century, it really went in a more frequentist direction, where it's like, no, probability is actually a fraction of a repeatable experiment, kind of like, over time, what fraction does it end up as? And so they consider probability as sort of an objective property of the system. So for example, a die roll. Well, each side is 1/6th, that's like kind of an objective property of the die, whereas, no, Bayesian statistics is, quote, sort of based on belief. And because belief kind of seems unscientific, and the frequentists had very good methods for coming up with answers and more objective ways of doing it, they sort of had the upper hand, but as kind of the focus got into more complex issues, and we had the rise of computers and that sort of thing, and the rise of more data and that sort of thing, Bayesian inference started taking a bigger and bigger role, until now, I think most machine learning engineers and most data scientists think as a Bayesian. And so it's like some examples in history. Most people are probably aware of Alan Turing at Bletchley Park, along with many other people, you know, building these machines that broke the German codes during World War II, someone wrote a movie about it,
22:59 Michael Kennedy: Right, that's trying to break the Enigma machine, and the Enigma code
22:59 Max Sklar: Right.
22:59 Michael Kennedy: And that those were some important problems to solve, but also highly challenging.
22:59 Max Sklar: Yeah, and so they incorporated a form of Bayes rule into this, well, what are my relative beliefs as to the setting of the machine because, you know, the machine could have had quadrillions of settings, and they're trying to distinguish between which one is likely to have and which one's not likely to have. But after the war, that stuff was classified. So nobody could say, "Oh, yeah, Bayesian inference was used in that problem." And one interesting application that I found, even as it wasn't accepted in academia for many years, was life insurance, because they're kind of on the hook for determining, if the actuaries get the answer wrong as to how likely people are to live and die, then they're on the hook for lots and lots of money or like the continuation of their company, if they get it wrong. And so...
22:59 Michael Kennedy: Right, right. Or how likely is it to flood here? How likely is it for there to be a hurricane
22:59 Max Sklar: Exactly.
22:59 Michael Kennedy: That wipes this part of the world off the map, right?
22:59 Max Sklar: And a lot of these were one off problems. You know, one problem is, you know, what's the likelihood of two commercial planes flying into each other? It hadn't happened. But they wanted to estimate the probability of that, and you can't do repeated experiments on that. So they really had to use priors, which was sort of like expert data. And then, you know, more recently, as we had the rise of kind of machine learning algorithms and big data, you know, Bayesian methods have become more and more relevant. But also a big problem was, you know, the problems that we just mentioned, which are you know, fire alarms, and figuring out whether or not you have a disease and things like that. That's the two hypothesis problem. But a lot of times you have an infinite space, you have an infinite hypothesis problem, where you're trying to determine between an infinite set of possible hypotheses. And that becomes extremely difficult to do without a computer, and even with a computer it becomes difficult to do. And so, you know, there's been a lot of research into how do you search that space of hypotheses to find the ones that are most likely? And so if you've heard the term Markov Chain Monte Carlo, that is the most common algorithm used for that purpose, and there's even current research into making that faster and finding the hypothesis you want more quickly. Andrew Gelman at Columbia has a lot of stuff out about this, and he has like a new thing that's called like the NUTS, which is like a No U-Turn Sampler, which is based off a very complicated version of MCMC. And so that's what's used in a framework that Python has called PyMC3 to come up with your most likely hypothesis very, very quickly.
22:59 Michael Kennedy: So let's take this over to the Python world. Like, there's a lot of stuff that works that obviously, like you said, the machine learning, deep down uses some of these techniques. But this PyMC3 library is pretty interesting. Let's talk about it. So its subtitle is Probabilistic Programming in Python.
22:59 Max Sklar: If I could start with some alternatives, which I've used, because I haven't, I've been diving into reading about PyMC3, but I haven't used it personally. So even when I was doing things in 2014, just on my own, basically, without libraries, I was able to use Python very, very easily to kind of put in these equations for Bayesian inference, whether it's multi-logistic regression or another one I did was a Dirichlet prior calculator, which, if I can kind of describe that, it's sort of thinking well, how, what should I believe about a place before I've seen any reviews? Should I believe it's good? Should I believe it's bad? You know, if I have very few reviews, what should I believe about it? Which was an important question to ask for something like Foursquare city guide, in many cases. Because we didn't have a lot of data. And so that was a good application of Bayesian inference. And I was able to just use the equations straight up and kind of from first principles, apply algorithms directly in Python. And it actually was not that hard to do, because when searching the space, there was a single global maximum, didn't have to worry about the local maximum in these equations, so it was just a hill climbing. Hey, I'm going to start with this hypothesis in this n-dimensional space, and I'm going to find the gradient, I'm going to go a little higher, a little higher, a little higher. Gradient ascent is what I described although it's usually called gradient descent. So that's sort of an easy one to understand. Then if you want to do MCMC directly, because you have some space that you want to search and you have the equations of the probability on each of the points in that space, I used PyEMCEE, which is spelled E-M-C-E-E, which is a simple program that only does MCMC. And so I had a lot of success with that when I wanted to do some one off sampling of, you know, non standard probability distributions. 
So, those are ones that I've actually used and had success with in the past. But PyMC3 seems to be like the full, you know, we do everything sort of a thing. And basically, what you do is you program probabilistically. So you say, "Hey, I imagine that this is how the data is generated. So I'm just going to basically put that in code and then I'm going to let you, the algorithm, work backwards and tell me what the parameters originally were." So if I could do a specific here, let's say I'm doing logistic regression, which is like every item has a score or, you know, in the case that I was working on every word has a score, the scores get added up, that's then a real number, then it's transformed using a sigmoid into a number between zero and one. And that's the probability that's a positive review. And so basically, you'll just say, "Hey, I have this vector that describes the words this has, then I'm going to add these parameters, which I'm not going to tell you what they are. And then I'm going to get this result. And then I'm going to give you the final data set at the end." And it kind of works backwards and tells you, "Okay, this is what I think the parameters were." And what's really interesting about something like PyMC3, which I would like to use in the future is when you do a linear regression or logistic regression, in kind of standard practice, you get one model at the end, right? This is the model that we think is best, and this is the model that has the highest probability, and this is the model that we're going to use. Great, you know that works for a lot of cases. But, what PyMC3 does is that instead of picking a model at the end, it says, "Well, we still don't know exactly which model produced this data. But because we have the data set, we have a better idea of which models are now more likely and less likely. So we now have a probability distribution over models, and we're going to let you pull from that." 
So it kind of gives you a better sense of what the uncertainty is over the model. So, for example, if you have a word in your data set, let's say the word delicious, and we know it's a positive word. But let's say for some reason, there is not a lot of data on it, then it can say, well, I don't really know what the weight of delicious should be, and so...
22:59 Michael Kennedy: It could be these are rock concerts, we don't know why, what does it mean?
22:59 Max Sklar: Yeah, yeah, yeah. And so we're going to give you a few possible models, and you know, and you can keep sampling from that. And you'll see that the deviation, the discrepancy, the variance of that model is going to be very high, of that weight is going to be very high 'cause we just don't have a lot of data on it. And that's something that standard regressions just don't do.
22:59 Michael Kennedy: That's pretty cool. And the way you work with it is you basically code out the model in like a really nice Python API. You kind of say, well, this, I think it's a linear model. I think it's this type of thing. And then like you said, it'll go back and solve it for you. That's, that's pretty awesome. I think it's nice,
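The idea of sampling many plausible weights for a word, rather than reporting one best value, can be sketched without any library. This is not PyMC3 and not NUTS: it is a plain random-walk Metropolis sampler, swapped in because it fits in a few lines. The one-word model, the normal prior with standard deviation 3, and the review counts are all assumptions made up for illustration.

```python
import math
import random
import statistics

def log_posterior(w, positives, total, prior_sd=3.0):
    """Log of (prior * likelihood) for one word weight w, up to a constant.

    Assumed model: a review containing the word is positive with
    probability sigmoid(w); the prior on w is Normal(0, prior_sd).
    """
    log_p = -math.log1p(math.exp(-w))   # log sigmoid(w)
    log_q = -math.log1p(math.exp(w))    # log (1 - sigmoid(w))
    log_likelihood = positives * log_p + (total - positives) * log_q
    log_prior = -0.5 * (w / prior_sd) ** 2
    return log_likelihood + log_prior

def sample_weight(positives, total, steps=20_000, seed=0):
    """Random-walk Metropolis samples from the posterior over w."""
    rng = random.Random(seed)
    w = 0.0
    current = log_posterior(w, positives, total)
    samples = []
    for _ in range(steps):
        proposal = w + rng.gauss(0.0, 0.5)
        candidate = log_posterior(proposal, positives, total)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            w, current = proposal, candidate
        samples.append(w)
    return samples[steps // 4:]  # discard the first quarter as burn-in

few_reviews = sample_weight(positives=3, total=4)
many_reviews = sample_weight(positives=300, total=400)

print("spread with 4 reviews:  ", statistics.stdev(few_reviews))
print("spread with 400 reviews:", statistics.stdev(many_reviews))
```

Both data sets have the same 75% positive rate, but the sampled weights are spread much more widely when there are only four reviews, which is the extra information a single fitted coefficient does not carry.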
22:59 Max Sklar: Right. A good way to think about it is in terms of just a standard linear regression, like, what's the easiest example I can think of? Trying to find someone's weight from their height, for example. And so you think there might be an optimal coefficient on there given the data, but if you use PyMC3, it will say, "No, we don't know exactly what the coefficient is given your data. You don't have a lot of data. But we're going to give you several possibilities. We're going to give you a probability distribution over it." And as I say in The Local Maximum, you shouldn't make everything probabilistic, because there is a cost in that. But oftentimes, rather than considering one single truth, by considering multiple truths probabilistically you can unlock a lot of value. In this case, you can kind of determine your variance a little better.
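For the height and weight example, a probability distribution over the coefficient can be written down in closed form if a few simplifying assumptions are made. The sketch below is not PyMC3: it assumes a normal prior on the slope, a known noise level, and data centered on its sample means so that no intercept is needed. The three data points and the values of `noise_sd` and `prior_sd` are invented for illustration.

```python
import math
import random

def slope_posterior(heights, weights, noise_sd=5.0, prior_sd=10.0):
    """Posterior mean and standard deviation of the slope b in the model
    (weight - mean weight) = b * (height - mean height) + Normal(0, noise_sd),
    with prior b ~ Normal(0, prior_sd)."""
    mean_h = sum(heights) / len(heights)
    mean_w = sum(weights) / len(weights)
    xs = [h - mean_h for h in heights]
    ys = [w - mean_w for w in weights]
    # Normal prior and normal likelihood combine into a normal posterior.
    precision = 1.0 / prior_sd**2 + sum(x * x for x in xs) / noise_sd**2
    mean = (sum(x * y for x, y in zip(xs, ys)) / noise_sd**2) / precision
    return mean, math.sqrt(1.0 / precision)

heights_cm = [160, 170, 180]   # made-up data
weights_kg = [55, 68, 80]      # made-up data

mean, sd = slope_posterior(heights_cm, weights_kg)

# Instead of one coefficient, draw several plausible ones from the posterior.
rng = random.Random(0)
plausible_slopes = [rng.gauss(mean, sd) for _ in range(5)]

print(f"slope is about {mean:.2f} kg per cm, give or take {sd:.2f}")
print(plausible_slopes)
```

Feeding in more data shrinks `sd`, so the range of plausible coefficients narrows as evidence accumulates.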
22:59 Michael Kennedy: Yeah, that's super cool. I hadn't really thought about it, and like I said, the API is super clean for doing this. So it's, it's great. Where does this Bayesian inference, like, where do you see this solving problems today? Where do you see like stuff going? What's the world look like now?
22:59 Max Sklar: I've been using it to solve problems, basically, as soon as I started working as a machine learning engineer at Foursquare, basically using Bayes rule as kind of my first principles whenever I approach a problem, and it's never driven me in the wrong direction. So I think it's one of those timeless things that you can always use. For me, especially after working with our attribution product a lot, I think that the future is trying to figure out causality a lot better. And I think that's where some of these more sophisticated ideas come in. Because it's one thing to say, this variable is correlated with that and I can have a model, but it's like, well, what's the probability that changing this variable actually causes this other variable to change? In the case of ads, where you could see where it's going to unlock a lot of value for companies where, you know, there might be a lot of investment in this, it's, what is the probability that this ad affects someone's likelihood to visit my place? Or to buy something from me, more generally? Or what is my probability distribution over that? And so can I estimate that? And I think that that whole industry of online ads is, it's very frustrating for an engineer, 'cause it's so inefficient, and there's so many people in there that don't know what they're doing. And it can be very frustrating at times. But I think that means also that there's a lot of opportunity to like unlock value, if you have a lot of patience.
22:59 Michael Kennedy: Sure. Well, so much of it is just, they looked for this keyword, so they must be interested, right? It doesn't take very much into account.
22:59 Max Sklar: Yeah, but the question is, okay, maybe they look for that keyword, and now they're going to buy it no matter what I do. So don't send them the ad, send the ad to someone who didn't search the keyword. Or, maybe they need that extra push. And that extra push is very valuable. It's hard to know, unless you measure it. And you measure it, you don't get a whole lot of data, so you really, it really has to be a Bayesian model. Whoever uses these, these Bayesian models is going to get way ahead. But you know, right now, it goes through several layers. I kept saying, when we were working on this problem, and people weren't getting what we were doing, I was like, you know, I wish the people who are writing the check for these ads could get in touch with us, because I know they care. But, you know, oftentimes you're working through sales and someone on the other side...
22:59 Michael Kennedy: Yeah, there's just too many layers between, right? Yeah, for sure. Earlier, you spoke about having your code go fast. And you talked about Cython. What's your experience with Cython?
22:59 Max Sklar: I used that for the multi-logistic regression and all I can say is, it took a little getting used to, but you know, I got an order of magnitude speed up, which we needed to launch that thing in our one off Python job at Foursquare. So it took only a few hours versus all day. So it was kind of a helpful tool to get that thing launched. And I haven't used it too much since. But I kind of keep that in the back of my mind as part of my toolkit.
22:59 Michael Kennedy: Yeah, it's great to have in the toolkit. I feel like it doesn't get that much love. But I know people talk about Python speed and oh, it's fast here, it's slow there. First, people just think it's slow, because it's not compiled. But then you're like, oh, but what about the C extension. Should go, actually, yeah, that's actually faster than Java, or something like that. So interesting.
22:59 Max Sklar: Yeah, I've also had a big speed up just by taking, you know, a dictionary or matrix I was using and then using numpy instead of the, or NumPy. I don't know how you pronounce it, but instead of using NumPy,
22:59 Michael Kennedy: I think NumPy, but yeah, we'll have to see.
22:59 Max Sklar: Okay. NumPy instead of the standard like, you know, Python tools, I've also got a big speed up there.
22:59 Michael Kennedy: Yeah, for sure. And that's pushing it down into the C layer, right? But a lot of times you have your algorithm in Python, and one option is to go write that C layer, 'cause you're like, well, we kind of need it. So here we go down the rabbit hole of writing C code instead of Python. But Cython is sweet, right? Especially the latest one, you can just put the regular type annotations, the Python 3 type annotations.
22:59 Max Sklar: Oh, yeah.
22:59 Michael Kennedy: On the types and then, you know, magic happens.
22:59 Max Sklar: I definitely, I just started with Python, and it was like, you know, we're in these three functions 90% of the time, just fix that.
22:59 Michael Kennedy: Is usually the slow part is like really focused. Most of your code, it doesn't even matter what happens to it, right? It's just there's like that little bit where you loop around a lot. And that matters.
22:59 Max Sklar: Yeah, yeah. It's funny how we over optimize and you can't escape it. Like even when I'm creating, you know, I see like a bunch of doubles, I'm like, oh, but these are only one and zero. Can we like change them to Boolean? But like in the end, it doesn't care, it doesn't matter. For most of the code it really has no effect.
22:59 Michael Kennedy: For sure.
22:59 Max Sklar: Except in that one targeted place.
22:59 Michael Kennedy: Yeah, so, the trick is to use the tools to find it, right? cProfile or something like that. The other major thing, one thing you can do to speed up stuff like this, these algorithms is just to say, "Well, I wrote it in Python, or I use this data structure, and maybe if I rewrote it differently, or I wrote it in C, or I applied Cython, it'll go faster." But it could be that you're speeding up the execution of a bad algorithm. And if you had a better algorithm, it might go 100 times faster or something, right? Like so, how do you think about that with your problems?
22:59 Max Sklar: That's what I did for the, back in 2014, with the Dirichlet prior calculator, and that was an interesting problem to solve. Because to recap on that, it's one of the use cases we had, okay, what's my prior on a venue before I've gotten any reviews? What's my prior on a restaurant before I've gotten any reviews? And I'm using the experience of the data on all the other restaurants I've seen. So we know what the variance is and let me try to come up with an equation that can calculate that value from the data. And it turned out there were some algorithms available. But as I dug into the math, I noticed there was like a math trick that I could make use of. In other words, it was something like, certain logs were being taken, of the same number, were being taken over and over again. And it's like, okay, just store how many times we took the log. And then when I dug into the math, they kind of combined into one term and multiplied that together. So essentially, I used a bunch of factoring and refactoring, whether you think of it as factoring code, or factoring math, to get kind of an exponential speed-up in that algorithm. And so that's why I published a paper on it. I was very proud of that. It was a very satisfying thing to do. It might not have mattered in terms of our product, but I think a lot of people used it though to be like, I, rather than just taking an average of what I've seen in the past, no, I want to do something that is based on good principles. And so I want to use the Dirichlet prior calculator. And so some people have used that, it's my Python code online, and the algorithm has proven very fast and like almost instantaneous. Basically, as soon as you load all the data and it gives you the answer, which I like. Now, my next step to that is to use PyMC3. Rather than giving you an answer it should give you a probability distribution over answers.
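Finding the few functions where a job spends most of its time can be done with the standard library's `cProfile` and `pstats` modules. In this minimal sketch the functions `slow_inner_loop`, `cheap_setup`, and `job` are hypothetical stand-ins for real code.

```python
import cProfile
import io
import pstats

def slow_inner_loop(n):
    """Hypothetical hot spot: a tight loop doing repeated arithmetic."""
    total = 0.0
    for i in range(n):
        total += (i % 7) ** 0.5
    return total

def cheap_setup():
    """Hypothetical code that runs once and barely matters for speed."""
    return list(range(100))

def job():
    cheap_setup()
    return slow_inner_loop(200_000)

# Profile only the call we care about.
profiler = cProfile.Profile()
profiler.enable()
job()
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists each function with its call count and time, so the loop that dominates the run stands out before any rewriting in C or Cython is attempted.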
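The "store how many times we took the log" idea can be illustrated on the Dirichlet-multinomial log-likelihood, where the same term log(alpha + j) shows up once for every row whose count exceeds j. The sketch below is an illustration of that general idea only, not the code from the paper or from BayesPy; the function names and the toy data are assumptions, and the constant multinomial coefficient is left out.

```python
import math
from collections import Counter

def log_likelihood_naive(rows, alphas):
    """Dirichlet-multinomial log-likelihood (constant term omitted),
    taking every log separately for every row."""
    alpha_total = sum(alphas)
    result = 0.0
    for row in rows:
        for alpha, count in zip(alphas, row):
            for j in range(count):
                result += math.log(alpha + j)
        for j in range(sum(row)):
            result -= math.log(alpha_total + j)
    return result

def compress(rows):
    """One pass over the data: how many times is each log term needed?"""
    per_category = [Counter() for _ in rows[0]]
    totals = Counter()
    for row in rows:
        for k, count in enumerate(row):
            for j in range(count):
                per_category[k][j] += 1
        for j in range(sum(row)):
            totals[j] += 1
    return per_category, totals

def log_likelihood_compressed(per_category, totals, alphas):
    """Same value as the naive version, but each distinct log is taken once."""
    alpha_total = sum(alphas)
    result = 0.0
    for alpha, counter in zip(alphas, per_category):
        for j, times in counter.items():
            result += times * math.log(alpha + j)
    for j, times in totals.items():
        result -= times * math.log(alpha_total + j)
    return result

# Toy data: each row is (positive reviews, negative reviews) for one venue.
rows = [[3, 1], [2, 2], [3, 0], [1, 1]]
alphas = [1.5, 0.5]

per_category, totals = compress(rows)
naive = log_likelihood_naive(rows, alphas)
fast = log_likelihood_compressed(per_category, totals, alphas)
print(naive, fast)
```

The compression runs once, and every later evaluation for a new choice of `alphas` (as a hill-climbing search would need) costs time proportional to the largest count rather than to the number of venues.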
22:59 Michael Kennedy: Yeah, that's right.
22:59 Max Sklar: Haven't done that yet, didn't know about that. Yeah, didn't know about that at the time. I think my speed-up would still apply.
22:59 Michael Kennedy: Yeah, that's cool. Well, that definitely takes it up a notch. What about learning more about Bayesian analysis and inference, and where should people go for more resources?
22:59 Max Sklar: Oh, okay. Well, a kind of a history book that I read that I really like on Bayesian inference is one called "The Theory That Would Not Die" by Sharon McGrayne, a few years old, but it's really good if you're interested in the history on that. I have a book about PyMC3, kind of a tech book that does go into the basics of Bayesian inference that has a really good title. It's called "Bayesian Analysis with Python."
22:59 Michael Kennedy: Oh, yeah,
22:59 Max Sklar: Yeah, yeah, so that's a good one to look at. And then I have a bunch of episodes on my show that are related to Bayesian analysis. So, Episodes 0 and 1 on my show were basically just starting out trying to describe Bayes rule to everyone. I sort of attempted to do the description in Episode 0. And then in Episode 1, I applied it to the news story that was happening that day, which was kind of the fire alarm at the bigger scale, which was everyone in Hawaii getting this message that there's an ICBM missile coming their way, because of a mistake someone made. And then...
22:59 Michael Kennedy: Yeah, because of some terrible UI decision on like the tooling, because they had to press
22:59 Max Sklar: Yeah, is that what it was?
22:59 Michael Kennedy: There was some analysis about what had happened and not probabilistically, but there was some, there's some really old crummy UI and they have to press some button to like acknowledge a test or treat as real, and somehow they look like almost identical, or there, there's some weird thing about the UI that had tricked the operator into saying, no, it's real.
22:59 Max Sklar: Yeah, yeah. And then another couple of episodes I want to highlight is Episode 21 and 22, which was sort of, kind of, 21 is the philosophy of probability, and 22, we talk about the problem of p-hacking, which is when people try their experiments over and over, until they get something that works with p-values, which is a frequentist idea. Which works if you're using it properly. But the problem is, most people don't. And then we did an episode, I think it was 65, on probability, how to estimate the probability of something that's never happened. And then 78, the one that you mentioned, which was on the history of Bayes and a little more philosophy. So, I've talked about that a lot. You could probably go to localmaxradio.com or localmaxradio.com/archive and find the ones that you want.
22:59 Michael Kennedy: That's really cool. So yeah, I guess we'll leave it there, for now.
22:59 Max Sklar: Awesome!
22:59 Michael Kennedy: That's quite interesting 'cause this will look into some of the algorithms and math we got to know for our data science. Now before you get out of here, though, I got the two questions I always ask everyone.
22:59 Max Sklar: Uh, oh.
22:59 Michael Kennedy: You're going to write some Python code, what editor do you use?
22:59 Max Sklar: I just use Sublime.
22:59 Michael Kennedy: Alright, alright.
22:59 Max Sklar: Or, yeah, TextMate, also on Mac, but I'm sure I could do something a little better than that. I just picked one and never really looked back.
22:59 Michael Kennedy: Sounds good. And then notable PyPI package?
22:59 Max Sklar: Notable...
22:59 Michael Kennedy: Maybe, maybe not the most popular, but you're like, "Oh, you should totally know about this."
22:59 Max Sklar: Hmm.
22:59 Michael Kennedy: I mean, you already threw out there PyMC3, if you want to claim that one or if there's something else yeah, pick that.
22:59 Max Sklar: Yeah, well, hmm. I have BayesPy, which is the one that's like in github.com/maxsklar/bayespy, which has all the stuff I talked about. It's not actively developed, but it does have my kind of one off algorithms, which, if you're in the, if you're in the market for multinomial models or Dirichlet, or you want some kind of interesting new way to do multi-logistic regression, you could certainly give that a try, but, you know, most people probably want to, again, use the standard tooling. Yeah, why don't I go with that one? Why don't I go with the one I wrote a long time ago?
22:59 Michael Kennedy: Yeah, right on, sounds good. All right, final call to action. People are excited about this stuff. What do you tell them? What do they do?
22:59 Max Sklar: Check out the books I mentioned, and check out my website, localmaxradio.com. And also subscribe to The Local Maximum. It should be on all of your pod catchers. If it's not on one, please let me know. But it should be on all of your pod catchers, you know, Local Maximum is just every week. And we have a lot of fun. So definitely check it out.
22:59 Michael Kennedy: Man, that's cool, you spend a lot of time talking about these types of things. Super. All right, well, Max, thanks for being on the show.
22:59 Max Sklar: Michael, thank you so much. I really enjoyed this conversation.
22:59 Michael Kennedy: Yeah, same here. Bye, bye.
22:59 Max Sklar: Bye.
22:59 Michael Kennedy: This has been another episode of Talk Python To Me. Our guest on this episode was Max Sklar. And it's been brought to you by Linode and Tidelift. Linode is your go to hosting for whatever you're building with Python. Get four months free at talkpython.fm/linode. That's L-I-N-O-D-E. If you run an open source project Tidelift wants to help you get paid for keeping it going strong. Just visit talkpython.fm/tidelift, search for your package and get started today. Want to level up your Python? If you're just getting started, try my Python Jumpstart By Building 10 Apps course. Or, if you're looking for something more advanced, check out our new Async course that digs into all the different types of Async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.