#40: Top 10 Data Science Stories from 2015 Transcript
00:00 It's the end of the year and many of you are probably kicking back and taking it easy without a TPS report to be seen. So we'll keep this fun and lighthearted this week. We're running down the top 10 data science stories of 2015 on episode 40 of Talk Python To Me with guest Jonathan Morgan, recorded December 13th 2015.
00:00 Welcome to Talk Python to Me. A weekly podcast on Python-the language, the libraries, the ecosystem, and the personalities.
00:00 This is your host, Michael Kennedy, follow me on twitter where I'm @mkennedy
00:00 Keep up with the show and listen to past episodes at talkpython.fm and follow the show on twitter via @talkpython.
00:00 This episode is brought to you by Hired and Digital Ocean. Thank them for supporting the show on twitter via @hired_hq and @digitalocean
00:00 Hi everyone. This episode is a little unique. I've partnered up with a great data science podcast called Partially Derivative. We're doing a joint show multicast to both podcasts. If you like this kind of thing, be sure to visit partiallyderivative.com and subscribe to their show.
00:00 Also I wanted to let you know I am not releasing the show next week- I'm on vacation! So I am going to take next week off, do a little resting and relaxation, hang out with the family and be ready to get back and do a ton of awesome shows for you in 2016!
00:00 Now let's get on to the co-hosted episode I did with Jonathan Morgan.
01:46 Hey Jonathan, welcome to the show.
01:50 Hey Michael, thanks so much for having me.
01:50 Yeah, I'm really excited to do this joint Talk Python- Partially Derivative podcast about the end of the year in data science and all that awesome stuff.
01:59 Yeah, I'm super excited too! I think when our powers combine it is the ultimate resource for Python data, Python stuff in the universe. Absolutely.
02:09 Yeah, it's going to be great! So, for those of you who don't know me, my name is Michael Kennedy, I'm the host of Talk Python To Me, sort of developer focused Python podcast, and this week I am teaming up with The Partially Derivative guys.
02:23 And for those of you who don't know me, I'm Jonathan Morgan, one of the co-hosts of The Partially Derivative podcast, a podcast about data science- kind of, but also sort of about screwing around and drinking beer.
02:36 Yeah, that's really the most common combination of any two things, I would say it's probably drinking beer and data science right?
02:43 All of the best data science is done on a two beer buzz, it's the secret nobody tells you.
02:48 That's right. You only learn that in grad school.
02:51 Yeah, exactly.
02:53 I'd kind of like to hear about your company- the last I heard is you guys were starting a data science company, and that's about all I heard, it's called Popily, right?
03:02 Yeah, that's right. So it's called Popily, and my two co-hosts of Partially Derivative, Chris Albon and Vidya Spandana, and I started this data science company. And so, basically, we were realizing that there is a whole bunch of data science that is super hard, and everybody I think is familiar with this kind of artificial intelligence and complicated machine learning, but then there is a lot of data science that's actually pretty straightforward, like it kind of boils down to inference and making charts out of data that you just have sitting around, so you have like a better everyday idea about what is happening. And it turns out that that is super hard, like actually even for pretty technical people it's super hard, you know, like I meet a lot of developers who ask me a question like, "So I've got like thousands of rows of data in my database and they all have, like, a time, but then how do I like look at it, you know, over time?" And it's super technically competent people who are still like, "I just don't really understand how the data thing works," and kind of turning that first corner was really important to us. We were like, "Ok, cool, we could actually empower a lot of people to do data stuff if we could make that first step automatic, like if we could just go from some raw data to charts to give you an idea about what is happening inside your data, we should definitely do that for people." And so that seems easy, but it took us a little bit longer than we thought.
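The "thousands of rows with a time, how do I look at it over time" question Jonathan describes usually comes down to bucketing rows by a time period and counting. The following is a minimal sketch of that first step using only the standard library. It is not Popily's approach; the `rows` data, the `count_per_day` helper, and the timestamp format are all made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows as they might come back from a database query:
# (timestamp string, event name). The values are invented for this sketch.
rows = [
    ("2015-12-01 09:15:00", "signup"),
    ("2015-12-01 17:40:12", "signup"),
    ("2015-12-02 08:03:55", "signup"),
    ("2015-12-03 10:20:00", "signup"),
    ("2015-12-03 11:45:30", "signup"),
    ("2015-12-03 23:59:59", "signup"),
]


def count_per_day(rows):
    """Bucket rows by calendar day and count how many fall in each bucket."""
    counts = Counter()
    for timestamp, _event in rows:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
        counts[day] += 1
    # Sort by date so the result reads as a time series.
    return dict(sorted(counts.items()))


per_day = count_per_day(rows)

# A crude text "chart": one '#' per row on that day.
for day, n in per_day.items():
    print(day, "#" * n)
```

Swapping `.date()` for a week or month key changes the granularity; the counting step stays the same.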
04:30 All those details that keep sneaking in there.
04:33 Exactly. Exactly. But yeah, so we've released the product, it's in private beta so Talk Python To Me listeners you should definitely be part of the private beta, I'm sure there will be some contact information, it's Popily.com, you can go, request an invite or just email me and we'll get you in the private beta. Yeah, and we are releasing publicly early next year, so it's super fun, we are having a really good time.
04:57 That's great. You guys are actually using Python quite a bit there, right?
05:01 Oh yeah, up and down the stack, it's all Python. Some of the data people out there might know that there are a couple of other languages that people use to do data science, one is called R, people use a language called Scala, none of them live up to the awesome power and flexibility of Python, which is why we use it in almost everything we do, I mean, from the web app that people interact with when they are actually using the system, all of that is in Python. It's actually a Django app, I hadn't coded in Django for a while, it's super fun. And then the back end uses SciPy, like the whole SciPy stack for some of the machine learning and data processing stuff that we are doing on the back end. So, we are a Python shop all the way.
05:45 That sounds really fun to be putting all that together, I'm sure you guys like it.
05:50 Yeah, it's actually the coolest thing about Python from my perspective, is that like you can do kind of complicated scientific computing, and then plug it right into the web app that you had already built because the language is so flexible. So it's been fun.
06:04 Yeah, very cool. So, I'll put the link in the show notes, but definitely if you guys think that sounds interesting, check out Popily.com, is that right?
06:11 Yeah, that's it, Popily.com. We thought about Popily.io but popilyio- it was just weird, it was too much.
06:19 Get one of those Libyan domains, those are always good for the start up. The ly.
06:25 Yeah. Popily.ly? Missed opportunity.
06:31 Well, you can always change the name. All right, so do you want to talk about this year? I mean, this show is going to come out on Talk Python on the 29th and I suspect at the same time on Partially Derivative? It's perfect, right at the end of the year to talk about sort of what has happened in the Python world intersecting with data science I guess.
06:53 Yeah, absolutely, it's been a big year. A big year for Python and a big year for data science.
06:59 The first pick, maybe not the most important one, is probably most relevant to people while they are listening to this show, like if it comes out on the 29th, you know you've got some vacation, maybe you'll pick it up around the 31st or the 1st, which is typically when we make our New Year's resolutions, right?
07:16 It is. It is, and the first story is all about how you pretty much are going to fail at this. I know, the numbers are, the probability is that you are not going to stick with that New Year's resolution, which is funny, because I think it's something that we all know, but this is actually from last year, Mona Chalabi who is not at FiveThirtyEight any more, she is doing data journalism for the Guardian, but she was at FiveThirtyEight at the time, and this was actually kind of the larger story of the last year that I think kind of data journalism also really came into the public mindset and she did a really interesting piece where she broke down the stats of how often people fail at their New Year's resolutions and how long they keep them, and I guess it's something like 70% fail within- the first two weeks.
08:10 I'm going to change my life, I promise this year will be different. Maybe. But it's a Tuesday, and you know my friends are going out and- whatever, right?
08:20 Yeah, yeah, totally. I really like the aspiration of New Year's resolutions and I am just not going to think about the cold, hard reality of their eventual failure, until later this month.
08:30 Yeah. Exactly, you are going to wait at least two weeks.
08:33 Yeah, some of the most common ones, well the most common one by quite a measure was lose weight and closely related to that was exercise more, and then the third one really in terms of popularity really puts a sort of a challenge on being able to determine whether or not you've achieved it, which is to be a better person. How do you analytically answer that, right?
08:56 That's true. It's tough, I mean, I guess you probably could say, there is a little bit of wiggle room, if 70% of people fail at that resolution, that's actually really worrying, I need to be just an incrementally better person, like, "no I tried but I'm still a jerk".
09:14 "No, I yelled at the neighbor again, and kicked at.... ", yeah, exactly.
09:19 Exactly. [laugh]
09:22 Nice. Ok, so you and I are both big fans of podcasts, we listen to a bunch and obviously we produce some that we are very passionate about, and the next one, the next item is actually about the most popular podcast of all time, Serial.
09:40 Yeah, in fact I think it was the only thing in 2015 that was more popular than Talk Python To Me and data science. The only thing, everything else you know, sort of pales in comparison with these two giants of media dominance but of course, there is Serial.
09:57 Yeah, Serial is if you guys haven't heard of it it's a podcast that goes back and looks at a person accused maybe convicted of murder, I can't remember-
10:10 He was convicted.
10:10 Convicted, yeah, so I thought. It goes through this high school guy and sort of rehashes, reevaluates the evidence, redoes the interviews, and it's like an investigative journalism look but through the eyes of a podcast rather than maybe through Time magazine or whatever, and it was downloaded something like 5 million times a week, it completely broke all the records of all time.
10:34 Yeah, it was pretty amazing. I mean and by the way, spoiler alert, the dude is totally guilty, I mean, this is not really a spoiler, because that is not how the show ends. But, and this is going to be cool, this is going to be really divisive for your listeners, because half I think are going to write angry emails and the other half are going to be like- totally. So that's where I stand, I feel the responsibility to get it off my chest.
10:59 Well, the second item, also at FiveThirtyEight.com, is sort of a journey of some data science folks to actually go through and try to apply data science to this journalism to answer statistically or sort of through data science is he guilty or not.
11:17 Yeah, and this was actually interesting because it's not really something that can be quantified, in fact that was really a big theme in the show, that all of the information that we had at the time and that we have now to assess whether or not this man was guilty of the crime that they think that he committed, this murder, is really suspect. It's hard to say conclusively what did and what didn't happen, because it relied so much on personal accounts of the events of the day. That said, there was some information that they could point to that definitely happened, and the big argument that everybody who believes he is guilty was making was that none of the events in particular made it certain that he was guilty, but that all of them together being coincidence was really unlikely, that was sort of the intuition that everybody had. But the interesting part was that- so FiveThirtyEight interviewed a couple of people who are data scientists and they went through kind of a Bayesian process for assessing the likelihood that each of the events could happen in concert, like whether or not he basically just had super bad luck. And actually, when you look at this, it's something called the multiple testing problem, where you test a hypothesis in lots of different ways, and looking at it through that lens makes it seem a little bit more probable than you might first assume that all of these different things could happen to him.
So basically- he asked the victim of the crime for a ride, he lent his car and his cellphone to somebody else who was also accused of the murder, and then his phone was in a location that was within a small distance from where the body was found, and then his cellphone records seemed to [inaudible 13:01] with a bunch of other, like, of the prosecution's testimony- so basically, there were like 4 or 5 things that made it like, dude, that's impossible, like if all of those things were true you were definitely guilty, even though none of them as an individual piece of information is super [inaudible 13:15]. But it turns out, according to these data scientists, well, you know, maybe we should give this a second look, it's actually not that unlikely that he could have had that much bad luck. So, we'll see.
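The multiple testing intuition Jonathan mentions can be shown with a little arithmetic. The sketch below is not the analysis FiveThirtyEight's interviewees performed, and every number in it is invented. It compares two questions: "what is the chance that these exact four unlikely things all happen?" versus "what is the chance that at least four suspicious-looking things happen, out of the many things someone might go looking at?" The function `prob_at_least` and the assumption that events are independent with equal probability are simplifications for illustration.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability that at least k of n independent events occur,
    where each event has probability p (a binomial tail sum)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Made-up assumption: each "suspicious" coincidence has a 10% chance
# of happening to an innocent person.
p = 0.1

# Exactly four things examined, and all four happen.
all_four_of_four = prob_at_least(4, 4, p)

# Twenty things examined, and at least four of them look suspicious.
four_of_twenty = prob_at_least(4, 20, p)

print(f"all 4 of 4:        {all_four_of_four:.4f}")
print(f"at least 4 of 20:  {four_of_twenty:.4f}")
```

The second probability is far larger than the first, which is the point: the more things you examine, the less surprising it is that several of them line up badly by chance.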
13:25 I think it's really interesting to take a hard science, like data science that is working with numbers and apply it to something soft like interviews and likelihood that someone is telling the truth and these kind of things, so I think even in that regard alone, it's really interesting.
13:41 Yeah, totally, I mean, ultimately, it was pretty subjective, but it was, I mean, and it's hard because these are like real people's lives that we are talking about, it was a true story, you know, it's not like a murder mystery but it was really hard not to get into it and take sides and think about it like, you know, a whodunit. So, you know- apologies to all involved that we are talking about this in an insensitive way.
14:04 Yeah, absolutely, it's definitely a harsh reality that something bad happened to somebody. They are having a second season, I haven't listened to it yet, what's the story of the second season, do you know? Just came out.
14:14 No, I don't know, I think they are probably trying to keep up with the times because as we know, you've got to keep putting out content, or people get bored. But I guess all I know is that I think it's not about the same guy. So if you are sick of hearing about this guy's story then you are in luck because, I am assuming it's about another unsolved murder.
14:38 Yeah, it's got to be. All right. Moving on to the next item, is something very near and dear to the Talk Python listeners' hearts, I am sure, and that's Jupyter and IPython and IPython notebooks.
14:53 Yeah, and they've had a huge year, and I am assuming a lot of your audience will be pretty familiar with IPython and Jupyter. Although, that said, I guess there are like two camps of Python developers, I feel like, sort of web developers and software engineers and then kind of data and stats folks-
15:15 Did you come at Python from a computer science perspective or did you have to find some language to do your specialty and kind of grow into programming? I suspect that second category is very familiar with IPython stuff. Or maybe I should just tell everyone- IPython notebooks are these sort of interactive documents you can load as web pages, and you can write a little bit of Python code and then you can actually execute it in real time, right there, so you could pull some data from a database and then do some sort of science on it and a graph pops up, and then you write a little bit more code and another bit of data pops up, and these things are sort of live research documents, very powerful. And this has been sort of generalized out of the Python world through this Jupyter project to apply to many different languages. So this Jupyter project is an open source project run by Fernando Perez and some other guys, I am actually working on having them on the show shortly, so a couple of people asked if they can be on the show and- yes, we'll have somebody from IPython and Jupyter soon. But this is an open source project that has just received six million dollars in funding.
16:27 Yeah which is kind of amazing, I mean it's a massive amount of funding for a project like this that's effectively a developer tool but it's so useful, I think anybody- I was a little bit sceptical of it at first because I came at Python first from more of a computer science and software engineering background and it was later that I got into data stuff, but to have something where you can basically record all of your actions, because when you are doing a data project so often you like need to explain, you get to a number or you get to a chart or you get to a discovery or whatever, and then you need to communicate why that matters to somebody else and it only really matters if you can give the context. So well like first I took this slice of the data and then I manipulated it in this way and then I extracted these features and then once I had those features I decided to clean some of them up by doing xyz and then once I have done all of those things obviously if you look at this chart it's really meaningful, but without those steps, it's like, that's awesome, all I can see is like three bars. And to play that back for somebody, and capture it, is really awesome. And it's nice too when you are not a great coder, which I am not, and you have to step through and go, "wait, why didn't that work, let me run it again, why didn't that work- let me run it again".
17:40 Yeah, yeah, they are very cool, if I think of writing sort of science based academic papers, it seems crazy to not do them as something like this, where rather than just saying, "oh I did something on my own here is the chart, believe it" right, here is the actual code and here is the data and you can just have the whole thing and run it if you like, right that's amazing.
18:02 Yeah, totally. To be able to reproduce research by literally running the code that led to the discoveries that informed the paper is huge. I mean Chris, my co-founder and co-host on Partially Derivative, got a PhD and he talked about that all the time, often in post graduate work you get assigned a task to reproduce the research in somebody else's paper and he was always stunned at how difficult that was because even when you have access to the raw data, trying to work with it in such a way that produces the same results is just really rough, and so being able to capture that process where like every little decision that you made to manipulate the data in a particular way and when I say manipulate in this context I mean like a legitimate transformation of the data not like a shady manipulation of the data.
18:59 Not like witness tampering type manipulate, but like we are going to make some assumptions about the underlying physics or statistics and then get a better answer, yeah?
19:08 Yeah, totally. So much of this kind of work is a little bit about, it takes some creativity, it takes a little bit of intuition, especially if you are working with natural language and you are trying to extract some, you are trying to make sense of unstructured data, you know, you make a lot of little choices on the way and those do impact the results that you see. And that's why being able to reproduce it is so important and for everybody to understand the assumptions that you made, and so I mean, I can only assume that was part of why they received this like massive bucket of funding, it's a super popular product.
19:40 Yeah, it's going to definitely be a really important foundation of science. But if you just think of any OpenSource project- like what other OpenSource project do you know that hasn't got a company behind it, that somebody gave 6 million dollars like this is a big deal.
19:40 This episode is brought to you by Hired. Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.
19:40 Each offer you receive has salary and equity presented right up front and you can view the offers to accept or reject them before you even talk to the company. Typically, candidates receive 5 or more offers in just the first week and there are no obligations, ever.
19:40 Sounds pretty awesome, doesn't it? Well did I mention the signing bonus? Everyone who accepts a job from Hired gets a $2,000 signing bonus. And, as Talk Python listeners, it gets way sweeter! Use the link hired.com/talkpythontome and Hired will double the signing bonus to $4,000!
19:40 Opportunity is knocking, visit hired.com/talkpythontome and answer the call.
21:04 Ok, so number 4 on our list is Artificial Intelligence. There have been some cool shows about artificial intelligence like Ex Machina and that actually happened to feature a little Python code in the show as well which is cool. But people are freaking out about it.
21:21 Yeah, this is kind of a new thing this year, I don't know, maybe it's because machine learning and AI are becoming more a part of mainstream conversations or, like you said, there was an entire movie made about it where they referenced the Turing test specifically, like in a major Hollywood movie, that blew my mind. Because that's pretty nerdy stuff in normal circumstances, right. It's super cool. But yeah, and now all of a sudden you have these super prominent leaders of the technology community coming out like for and against, like speculatively against the idea of AI because you know, the Matrix or whatever, so I don't know, it was really interesting to see that debate happening in public, I don't know if you tracked that very much, like Bill Gates was warning us and Stephen Hawking warning us about artificial intelligence, and then of course other people on the other side giving really nuanced defenses of the way that artificial intelligence helps humans by making better decisions, it's a super fascinating conversation.
22:23 Yeah, it's very fascinating. The person that came out that sort of in my mind carry the most weight honestly was Elon Musk.
22:30 Yeah. I mean, Elon Musk is always, when Elon Musk speaks we listen.
22:37 That's right. Because, Bill Gates, I am actually a fan of his, I think he did some really cool stuff, but I kind of feel like he has a certain world view that's sort of already here, and Stephen Hawking has some amazing views of the universe but at the same time, I'm not entirely sure how practical his actual interaction with AI and programming is, but Elon Musk was- "I'm going to build an electric car that's amazing"- he built it, "I'm going to build things that go to space", and somehow he does that- I mean, he could actually build AIs if he wanted, if anybody can, I would say.
23:14 I think that that's the only way that he is really going to be a Bond villain, because at the moment, he's got like super cool international technology companies that are doing amazing things, like you said going to space, but it's not quite super villain status yet and I think if he built a robot that like seamlessly works its way into society and obviously initially for good but you know, it gets out of control, that's where the plot of the movie really starts to get thick, I am excited about this.
23:49 Yes, absolutely, the AIs had to be created to man the supercharging stations up and down the West Coast and then it just went all wrong, they got in the cars and spread.
23:58 Exactly, they decided that they deserved more than the menial tasks that they had been assigned.
24:04 Yeah, so we'll link to a really cool article from this project that was sort of getting respected leaders in the AI space to sign sort of a pledge to proceed with caution, but one of the follow up articles that we'll also list that I really liked was this thing about Mario as in like Super Mario brothers, that was really cool.
24:27 Yeah. It was pretty funny, there were some researchers that like basically made Mario sentient. I mean, maybe not quite, that's a stretch, but they empowered Mario, the character, with his own intelligence and then like would let him loose in the original game to see how well he could defeat the Goombas and all the other dangers of Mario characters, it was super fun. So like on the flip side there are these like really super accessible fun artificial intelligence projects that aren't necessarily a threat to humanity.
24:59 Yeah, maybe it threatens our high scores on Mario Brothers but maybe not humanity at large.
25:06 So you know, those guys are, they are German, they are at the university there which is actually 35 minutes away from here now, that's cool. They made a video, and we'll link to it, it's on mashable.com, and they do all sorts of cool stuff, there is a bunch of different intersections, it's not just like one part of AI but there is a lot of sort of understanding the world, understanding language, learning, and so they would do things like they would ask Mario, like they can speak to him in English, and he would answer in English, it sounded a whole lot like WarGames, it's like you know, that really sort of choppy text to speech, so it's kind of funny how Super Mario Brothers speak that way. But they would say things like, "If you jump on Goomba, he will die." And then they would say, "Jump on Goomba" and of course the character dies and they say, "What do you know about Goomba?" and he says, "If I jump on him he will certainly die." And then later they reset his mind and they tell him to go on and jump on Goomba, but they don't tell him anything. First they ask him, "What do you know about him?" and he says, "I don't know anything about him" and they say, "Jump on him", the guy dies, "Now what do you know about him?" He goes, "I know that he may die if I jump on him." And it was really interesting, it has all like different emotions, and knowledge, it's cool. Check it out.
26:25 Yeah it is super cool, actually that's probably the best part of the project, it's like Mario's sort of existential ennui.
26:37 Yeah, absolutely. So another thing that happened this year had to do with the New England Patriots, and for those of you who maybe don't follow American football super closely, this was sort of a big deal. The New England Patriots, I don't really care one way or the other, but they have kind of been seen as a team that, I'll put it nicely, takes as much advantage of the situation as they can by filming things they shouldn't, or whatever, and it had come to a head where around Super Bowl time they had actually been accused of deflating the footballs for their team. And for a while I didn't know what that meant, like ok, well, maybe it hurts a little less to catch it, I don't know, but one of the things with a deflated football is that you can hold on to it much tighter, it's not as slippery, and so you won't fumble the ball and make some of these game-losing mistakes. If you can, like, change the physics so it's not a problem, well, that's easier than being better at football. So the story is, actually some data science folks came in and looked at it, and I don't know, what is your opinion after looking at all the charts and graphs they built?
27:51 Well, I think it boils down, in a simple way, to the fact that the New England Patriots were like a massive outlier, and so you are right that if you deflate the football a little bit, apparently, it makes it easier to hold on to. And so for those of you who don't watch American football, it's a lot like rugby, if you are familiar with rugby; more or less you have a ball that you are holding on to and you are running through a large group of very strong men who are all trying to take that ball from you. And so you often lose it, because it's a large group of really strong, big men trying to take it, and pretty much anything that you have in that scenario is not going to be yours for very long. But the trend overall in the league, the NFL, the league in which the Patriots play, was that the teams were getting more successful at holding on to the football, they had more plays that they could run, so the ratio between the number of attempts that they made and the number of times that they lost the ball was going up, so you could run plays and fumble less. In general. That said, the Patriots improved their ratio sort of exponentially more than all of the other teams in the league, and so when you see this one outlier way at the top right of the graph, you tend to go, "well, that's not right." So something doesn't make sense, something is different about that one data point, and so of course it sparked a lot of speculation, like why is it that the Patriots fumble so infrequently, and you know, where there is smoke there is fire, in this case at least, and they found that the Patriots were deflating footballs just a little bit all of the time, and by doing that they were able to maintain better control of the ball, because it was easier to grip with less air inside.
And so that was the whole thing, but basically the way that that was determined was through just doing some relatively simple data science, relatively simple analysis. I mean, it's not that simple in the aggregate, gathering the data was complex, connecting the dots, understanding the consequences of the things that we were learning, but if you actually look at the analysis that detected that outlier that I just described, it's pretty straightforward. So anyway, it was kind of cool, and it was a super funny story to follow that ultimately was followed at such depth, because oh my gosh, sports in America, that it got a little tedious, but at the beginning it was cool, it was like a wonderful little scandal.
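The kind of "one data point way off from the rest" check Jonathan describes can be sketched with a plain z-score: how many standard deviations each team sits from the league average. This is only an illustration of the idea, not the analysis discussed in the episode. The team names and plays-per-fumble numbers below are entirely made up, and `find_outliers` and its threshold are assumptions chosen for the example.

```python
from statistics import mean, stdev

# Hypothetical plays-per-fumble figures; higher means fewer fumbles.
# These values are invented for illustration, not real NFL statistics.
plays_per_fumble = {
    "Team A": 42, "Team B": 45, "Team C": 39, "Team D": 47, "Team E": 44,
    "Team F": 41, "Team G": 46, "Team H": 43, "Team I": 40, "Team J": 80,
}


def find_outliers(values, threshold=2.5):
    """Return the names whose value lies more than `threshold` sample
    standard deviations away from the mean of all values."""
    mu = mean(values.values())
    sigma = stdev(values.values())
    return [name for name, v in values.items() if abs(v - mu) / sigma > threshold]


outliers = find_outliers(plays_per_fumble)
print(outliers)
```

A z-score is a crude tool on a sample this small, since the outlier itself inflates the standard deviation. It flags that something is different about a data point; it says nothing about why.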
30:31 Yeah, it was an interesting scandal and I think it really shows the power of data science because these sports guys they can go back and forth and they are talk radio, and always- down. Look at that graph, there is something going on here, that's it. The question is what is going on. It's more likely something sort of sneaky is going on rather than they are just dramatically better than even second place, right, so very interesting use of data science there.
30:31 So, speaking of US things, the United States now has a person called the chief data scientist- that's pretty awesome.
31:10 Yeah. It's kind of amazing, right, like data science- wow, ok, it's something like a bunch of geeks do and now there is like somebody who has a title, like over the whole, the domain of the entire United States, doing data science, they are chief.
31:30 Yeah that means that's almost like a politician has got data science, this is crazy. I think it's a really positive move, you think of places that have lots of data, sports, we just talked about that, they've got a lot of data, CERN at the Large Hadron Collider, those guys generate a lot of data, but the United States- we have so many different things that we track about people and the answers to those questions really matter, right, we make policy based on these numbers so having somebody in charge of doing that right, makes a lot of sense.
32:00 Yeah, and actually what is super cool about the stuff that DJ Patil, he is the guy who is the Chief Data Scientist, the first one, the stuff that he is focused on is really awesome, like a big initiative of his is that he wants to open up a lot of the data that the US government collects, so to keep government agencies more transparent than they have been before and just to like advocate for the open data movement in general and get the general public interacting with the data that these agencies produce, in a way almost like as a means of civic engagement, it's really awesome and so this whole idea that like sort of the general public's relationship with government can be more modern, can be more technology driven, can be more part of the 21st century, it's really cool, so the open data projects in particular have been really fascinating. And then I know health care has been a big focus of DJ Patil’s office, they have been really looking at how to encourage the public to improve the way that the healthcare system works by investigating the data that's been made available, and by making the data that we do produce easier to transport, easier to access, easier to sort of combine, and investigate, it's making data accessible and kind of top of mind for an entire generation of kind of early career Americans, it's really fascinating.
33:27 Yeah, that's cool, and maybe the United States government themselves has access to things you wouldn't otherwise share right, like you mentioned health data. I think one of the problems with analyzing health data in the large is people don't want to just give away every single thing about themselves for good reason and so it's hard to talk broadly about that, right? But there is extra data and there is somewhere that it can help make people healthy, that's cool.
33:54 Yeah, totally. And in general I think this is part of the whole idea of quantified social science. For the longest time, social science done about the behavior of populations has been kind of anecdotal: a lot of people doing qualitative research where they are using their intuition and their knowledge to connect the dots and tell a good story. I'm sorry, that probably sounds super dismissive of a lot of social science research, and I only kind of mean it to be. But there is a move toward saying, "Hey, if you can't back up your statements with data, even if you have to be creative about how that data is collected or creative about how that data is interpreted, that's fine, but if you are not backing it up with any data then it doesn't really mean anything." And I think that that's very encouraging for people of my persuasion.
34:47 Yeah, I totally agree. The other thing that's cool is the open data project you talked about. I mean, when you live in a democracy, you are supposed to be sort of the boss of the government, so you should be able to have access to those things, and that's a really positive trend. Also on that note, I want to recommend a video by Catherine Devlin. She did the keynote at PyOhio this year and she works for a group of programmers within the US government that is basically an open source wing, and they will go into other places and say, "We will help you with this project but only if we get to open source the results." They are trying to spread open source within the US government as well. So those are tied together nicely.
35:42 Yeah, absolutely. Everybody involved with the CTO's office, Megan Smith is the current CTO of the US, and almost all the initiatives they are doing, it's like a totally different way of interacting with the government than I think ever existed before. So if you get a chance, definitely check out what the CTO's office is doing. There are a lot of cool ways to get involved, a lot of cool projects that they do, and a lot of interesting open source projects like you just mentioned, and also analytical data sets. By the way, some of those data sets are really great for learning. If you are just kind of cutting your teeth on data science, there are a lot of really interesting things about your community and your state and the country in general that might be more fun than working with a data set about advertising clicks, or whatever.
35:42 This episode is brought to you by Digital Ocean. DigitalOcean offers simple cloud infrastructure, built for developers.
35:42 Over half a million developers deploy to DigitalOcean because it's easy to get started, flexible for scale, and just plain awesome.
35:42 In fact, Digital Ocean provides key infrastructure for delivering Talk Python episodes every day. When you (or your podcast client) download an episode, it comes straight out of a custom Flask app on Digital Ocean and it's been bulletproof.
35:42 On release days, the measured bandwidth on the single $10 a month server jumps to 900 Mbit/sec for sustained periods with no trouble. That's because they provide great servers on great hardware at a great price.
35:42 Head on over to digitalocean.com today and use the promo code TALKPYTHON to get started with a $10 credit.
37:49 Let's move to number 7. It's going to be late December, winter is coming, probably.
37:55 Yeah, maybe. It's hard to know.
37:58 What's the story with this one?
38:00 So, there is a handful of things on here that I was really attracted to because I just love this idea that data science, and data analysis in general, is becoming so much more mainstream. This post was about Game Of Thrones, and it's actually a shame that my co-host Chris isn't on the show right now, because Chris produced a data set from A Song Of Ice And Fire, the books, not necessarily the HBO TV show Game Of Thrones. Obviously the two are related, although I hear the TV show is kind of going off the rails and moving away from the books. Anyway, there is a lot of dying in the TV show Game Of Thrones and in the books, A Song Of Ice And Fire, and my co-host made a data set of all of the battles where he tallied the number of deaths. Obviously there is some estimation in here because the books don't go into detail all of the time.
38:57 But let me take a step back. What I should probably tell everybody is, if you are not familiar with Game Of Thrones and the series of books called A Song Of Ice And Fire, it's basically medieval fantasy type of stuff. So there are dragons, and there are houses, and the houses fight with each other over land, and there are different families that are at war with one another, and kind of the whole thing. And of course, because they fight all the time, there is a lot of dying, and if you were of a more data statsy persuasion you might want to know quantitatively which house is the best: who wins the most battles, who kills the most soldiers, who has the biggest army. These are the sorts of questions that Chris's data set answers. So, you should definitely go find that. I think it's just Game Of Thrones battles on GitHub; if you Google for it, I bet you'll find it. But there is other work, other very interesting work, that has been done about the likelihood that you will die if you are a character in the books or the TV shows.
39:56 And I just loved it, right, because it's a fictional world, it's made up. The likelihood that you will die is whether or not the author decides that you should die, that's really what's going on here. But based on the constructs of the world that the author created, this data scientist went through, and there is a whole Wikipedia clone called the Wiki Of Ice And Fire that catalogues and documents the deaths of every major character, every major death in the books. So he went through and basically calculated the likelihood that any given character will survive based on their characteristics: are they from a particular family, are they high born, are they low born, are they a man, are they a woman, how old are they, all of the things that you might say are a feature of a character. Once you understand the features of a character, you can calculate the likelihood that they will die.
40:51 Anyway, it's fascinating, and I am sure we will link to the blog post so you can go check it out. It pans out kind of like you would expect: you have a much higher chance of dying if you are not of the noble class, I think men die more often than women, and so on and so on. You should go check it out. But really the reason to bring it up is just to say that when doing data analysis about characters in a fictional universe is a possibility, we are at some kind of peak, we are at peak data science awesomeness.
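[Editor's sketch] To make the idea of estimating a character's chance of dying from their features concrete, here is a minimal illustration. It is not the method from the blog post being discussed: it only computes an observed death rate per feature value, and the records, field names, and the `death_rate_by` helper are all made up for this example.

```python
from collections import defaultdict

# Hypothetical, made-up character records; NOT taken from the books or the wiki.
characters = [
    {"name": "A", "noble": True,  "male": True,  "died": False},
    {"name": "B", "noble": True,  "male": False, "died": False},
    {"name": "C", "noble": True,  "male": True,  "died": True},
    {"name": "D", "noble": False, "male": True,  "died": True},
    {"name": "E", "noble": False, "male": True,  "died": True},
    {"name": "F", "noble": False, "male": False, "died": False},
]


def death_rate_by(records, feature):
    """Return {feature value: fraction of characters with that value who died}."""
    totals = defaultdict(int)
    deaths = defaultdict(int)
    for record in records:
        key = record[feature]
        totals[key] += 1
        deaths[key] += int(record["died"])
    return {key: deaths[key] / totals[key] for key in totals}


# Compare death rates across one feature at a time.
print(death_rate_by(characters, "noble"))
print(death_rate_by(characters, "male"))
```

A real analysis would more likely fit a model over many features at once; grouping by a single feature is just the simplest version of the same question.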
41:23 Yeah, that's really cool, and you could also run this algorithm on your favorite character to know whether or not you should get attached to them, right?
41:31 Yes, I think to be fair with this story the answer is probably always no.
41:38 That's right.
41:39 Yeah, I mean this is not a spoiler or anything, I am not giving anything away; if you like a character and you hope for them, if you have any hope that the character will survive, I think that's probably the best indicator that they are going to die.
41:54 It's the kiss of death, yeah. So speaking of dying, we are all getting older, and Microsoft has decided that machines can know really well how old we are. So if you go to how-old.net, Microsoft is using machine learning to guess how old you are.
42:13 I know, isn't that how old- do you want to say, you don't have to if you don't want to.
42:17 I don't mind. I have always kind of looked a little younger than I actually am; people were always like, "What, you have kids?" My whole life has been kind of like this. It was a problem in high school because it wasn't cool to look younger than everybody else, but that's only a 4 year stretch, and, you know, when I get older it looks better. So, I uploaded a picture of myself, the one I have on my main website, and it did actually say I was 47. I'm like, "What!? I'm only 42." This is crazy, but I had my glasses on in it, right, so I tried one without the glasses. I uploaded it and it said I'm 42. It hit it exactly right.
43:01 Wow, what does that say about the algorithm, that you got aged 5 years just because of your glasses?
43:09 Yeah, I think the glasses they may trend upwards, I would- if you don't want to be old, take your glasses off. How accurate was it for you?
43:20 I was similarly offended actually. I uploaded a photo of myself and it guessed that I was in my early 40s which I am not. Which is fine, early 40s are great, I'm happy for you-
43:41 But you shouldn't be in them when you are not.
43:43 I'm in my early 30s, and I was like, "Wait a second- " so I thought the same thing, maybe there was something about my appearance in the photo that is cause for concern, right?
43:58 Like search for ties, or like a collar maybe, like a hoody.
44:04 Exactly, like I put up a photo of myself without the beard, this is like super vain, I was like wait, I can't live with this and I found an old photo of myself without the beard, I put it into the system and it guessed that I was like 37, which is also older than I am, so I had to accept- moving in the right direction but I had to accept that either it's just like everybody got offended, because if you were Microsoft and you were training this algorithm, like wouldn't you want to skew young, like let's be honest.
44:32 Yeah. Ok, but let's say it skews old or I'm just old looking, maybe I am aged beyond my years. I have a kid, and you know this, kids do this to you. I bet it would have guessed I was like 25 until the day she was born, and then it was like 40.
44:49 One week later you age like 10 years like that.
44:55 So number nine on our list here is actually a little bit of a sad story. There was this really cool study about how much a reasonable argument- not argument as a fight but a logical argument or discussion around a political position- may or may not change someone's opinion. The study was about trying to change people's opinions, to open them up to being more favorable towards same sex marriage. And they found in the study that if they had people canvass the location, go in, knock on the door and talk to people, then after talking to them they could actually make them more open to this idea. This flew in the face of a lot of the political science, which says that if you argue with somebody from the opposite perspective on something political, they typically dig in and they are like, "Ha, I am way against you, man." It hardens them against your argument rather than bringing them over. So this was a big deal, until it was- a fraud.
45:58 Yeah, exactly. I think one of the big parts of the study that turned out not to be true was that if the person who was having a discussion with you about your views on same sex marriage was themselves gay, then that would increase the likelihood that your opinion would change; you would be more favorable of same sex marriage if you were talking to a gay person. Basically saying that the reason most of us might hold a prejudice is simply that we are not exposed to people of a certain group, and as soon as we are, our prejudices start to melt away, which is kind of a nice idea, right? So everybody was really excited about this survey, but it turns out that it was just made up. What is also interesting was the way in which some additional researchers discovered the flaw. As often happens, they were going to do an extension of the survey, they were going to build on the research that had already been published, which is pretty common. But when they looked into it, they discovered that the responses in the survey were pretty normal, pretty consistent. Actually they were too consistent, the pattern was too clean, and what they found was that there were irregularities in the data, and those irregularities looked like the sort of thing that a human would do if they were trying to be random. Isn't that amazing? Actual randomness is hard to produce and human beings just aren't good at producing things randomly because that's just not what we do. We go, "Oh, I picked 3 last time so I have to pick an 8 this time, and I have already picked an 8, so randomness says that I wouldn't pick an 8 two times in a row, I should pick a 12," and this is actually not how randomness works.
So that's super intentional, you are layering your own bias on top of it, and so what they saw was a very uniform noise in the data set and they were like, "Wait a second- this actually looks like you just made it up." And it's true, I guess he took the results from a previous study and then tried to apply them to this one, basically apply a previous study's results in a new context and then sprinkle some noise on top in a way that he felt would look random, and these other researchers basically had to say this like huge study-
48:28 It was covered in a ton of high profile news, the New York Times, places like that, so it got a lot of traction, until it crashed.
48:41 Yeah, absolutely. And on so many levels it's like- on the one hand you look at this and you go, "Oh, what a horrible fraud this person perpetrated on the public," and on the other you go, "Well, I guess the process kind of works." I mean, it shouldn't have been published in the first place, peer review should have caught that, but ultimately the academic community discovered the fraud and outed it, which I think was good. So it's interesting in two ways. It's interesting how the mechanics of the research industry work, and it was interesting because there was this kind of novel idea of randomness, which is not normally part of mainstream discussion, that everybody had to start talking about in order to explain why the story worked out the way that it did.
49:24 Yeah, how did they catch him making fake random, I don't understand, what does that mean?
49:28 Yeah exactly, what does that mean. It was cool.
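[Editor's sketch] One way to see what "fake random" can look like is the habit described above: a person trying to look random tends to avoid picking the same number twice in a row, while real independent draws repeat fairly often. The snippet below is only an illustration of that one point and is not the procedure the researchers used on the study's data. The function names and the simulated "human" are assumptions made up for this example.

```python
import random


def repeat_rate(sequence):
    """Fraction of adjacent pairs in which the same value appears twice in a row."""
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a == b) / len(pairs)


def expected_repeat_rate(num_choices):
    """For independent uniform draws, a value matches its predecessor 1/num_choices of the time."""
    return 1 / num_choices


rng = random.Random(0)  # fixed seed so the run is reproducible


def humanlike_sequence(length, num_choices):
    """Simulate a person who refuses to pick the same number twice in a row."""
    picks = [rng.randrange(num_choices)]
    while len(picks) < length:
        candidate = rng.randrange(num_choices)
        if candidate != picks[-1]:
            picks.append(candidate)
    return picks


truly_random = [rng.randrange(10) for _ in range(10_000)]
faked = humanlike_sequence(10_000, 10)

print("expected:", expected_repeat_rate(10))
print("random  :", repeat_rate(truly_random))
print("faked   :", repeat_rate(faked))
```

The faked sequence never repeats, so its repeat rate sits far below what independent draws would give, which is exactly the kind of "too clean" pattern that can give a fabricated data set away.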
49:31 That's great. And on this note, I was seeing this story in a really positive light, like hey, maybe we could sit down and talk to each other and help evolve each other's opinions one way or another, we could kind of come to an understanding. But it turns out that- no, probably we can’t. There is a really interesting book that I think is related to data science, it just has an insane amount of data analysis in it. It's called The Big Sort: Why The Clustering Of Like-Minded America Is Tearing Us Apart. It really just goes through 20, 40 years of data on people's opinions and working together, and it would not support this study's claim that we can just talk about things and agree more. But check it out.
50:19 That's interesting. I'll have to go check out the book.
50:21 Yeah, it's an amazing book. One of my favorites. So, number 10, which I am going to nominate to be my favorite of this year, is that Python is at an all-time high as a programming language. The TIOBE index is one of the more respected sort of "how popular is my technology in the technology fight" indexes, and Python is now number 4 of all programming languages. It jumped from 8th to 4th in one year, so it's one of the very few with a double up arrow; this thing is growing super fast. For a language that was created in the late 80s, came out in the early 90s, and has been around for a long time, that's a massive jump. I think it's interesting to ask where it is coming from. It's coming from academia somewhat, Python is now the most popular language for first year college students studying computer science, but I think it has a lot to do with data science as well.
51:21 Yeah, I mean, it's hard to see how it couldn't be, actually. I think Python's always been great along the same lines as PHP, and I can only assume all the software engineers who listen to this podcast like Python so much, so don't get me wrong you guys- I hate PHP, those guys suck. But Python is a pretty good general purpose programming language, as is PHP, for building web applications and all of the sorts of products that we have seen being released on the internet in the past 10 years, that huge boom in web based products. Python is great for that, it's great for writing software that gets released on the internet. But at the same time, it has also overtaken another statistical programming language called R and become, I would say, the de facto language for doing data science and analysis. Which is pretty cool, because now it's a powerhouse, now you can do two things in the same language that used to be totally separate from one another. We talked about this before: there is statistical computing, which is basically, "I already know how to do the stats but I need to script it, and what language will help me do that," almost like an Excel power user, and then that goes all the way down through complicated machine learning and artificial intelligence and neural networks and all that stuff. And you can couple that with "how do I respond to an HTTP request, hit a database and return some content." Those two worlds used to be different, but now we can build smarter and smarter applications that are seamlessly integrated with one another thanks to Python. So it's no surprise to me, it's awesome.
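[Editor's sketch] As a small illustration of the two worlds living in one language, the snippet below reads numbers from a database, computes statistics on them, and returns the result from a web-style request handler, using only the Python standard library. The `clicks` table, its values, and the `summarize` and `app` names are made up for this example, and a real application would probably use a web framework and a proper data stack rather than this bare WSGI function.

```python
import json
import sqlite3
import statistics

# Hypothetical in-memory database holding made-up numbers, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (value REAL)")
conn.executemany("INSERT INTO clicks (value) VALUES (?)", [(2.0,), (4.0,), (6.0,)])
conn.commit()


def summarize(connection):
    """The 'stats' half: read values from the database and summarize them."""
    values = [row[0] for row in connection.execute("SELECT value FROM clicks")]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


def app(environ, start_response):
    """The 'web' half: a tiny WSGI application that returns the summary as JSON."""
    body = json.dumps(summarize(conn)).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


# Call the handler directly with a stand-in for the server's start_response.
response = app({}, lambda status, headers: None)
print(response[0].decode("utf-8"))
```

To serve it over HTTP, the same `app` function could be passed to `wsgiref.simple_server.make_server` from the standard library.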
52:58 Yeah, I think there must be a huge boost coming from that direction. And generally, people are really jumping into it but thanks to data science for making our language look even more popular and awesome like it should be, right?
53:11 I know, the only languages I can think of topping it on the list are really lame ones that nobody wants to write in. They are probably forced to by their boss, like Java- I mean, come on. Let's be real.
53:21 Yeah, a little tear just appeared in my eye. But seriously, Java literally is number one and C is up there as well, I believe. It's pretty interesting: C and Java are fighting for first place, and then it's C++, and then it's Python, right? So that's beautiful.
53:43 Yeah, honestly, with any dynamic language being as popular as that, I think there are a lot of old school software engineers who are crying in their coffee at the release of this report: "That's a toy language, that'll never be useful for anything in production." And yeah, now here we are, Python is running some pretty massive products, so it's very cool. I am sure all your listeners know, they listen to your podcast every week.
54:10 Yeah, I'm sure a lot of them are building some pretty amazing stuff out there. But yeah, I think this is great news for everybody who wants to get into Python. It seems like the job prospects in regular programming as well as data science are just going up, so if I were betting on my career, I would consider Python one of the top choices.
54:32 Yeah, absolutely. And it makes it so much easier to start to explore new 54:35 so when you are ready to start bringing machine learning, or any kind of statistical data processing, into your applications, it's complicated, but it's a lot less complicated when you are already familiar with the language that you are writing in. So I think it really gives you a leg up when you are trying to make that transition from one to the other, or vice versa. For the Partially Derivative listeners, if you have been scripting a lot in Python to do some research or to work on some analyses, and you want to start building applications to make the products that you are building accessible to the world, there is a whole ecosystem waiting for you, and the water is warm, my friends.
55:20 The water is warm and shallow.
55:23 Yeah, exactly, so yeah it's fun. And it's a fun language to program in, let's be honest.
55:29 Yeah, absolutely.
55:29 So Jonathan, that's our top ten list for the big news in data science this year. So I've got to ask you, what is your resolution you are going to break this year, you are going to give up on in two weeks? Have you already decided what you are not going to hold up?
55:45 I already decided what I am not going to do- let's see, the other people on my engineering team would probably appreciate it if my New Year's resolution was to write more unit tests, or better comment my code, and I am almost certain that neither of those things will happen. So let's- I'll do a dual resolution this year- unit tests and documentation.
56:15 That's a pretty safe one to throw out there, I suspect. I think I'm going to make mine to actually put proper comments on my git check-ins. Those could use some improvement; I kind of get in a hurry and they are not so good sometimes.
56:30 Yeah, totally, like "fixing stupid bug."
56:38 It does express your emotional feel about the code in the check in but it doesn't really help them figure out what I meant.
56:45 Yeah, that's fair, that's a good resolution. But you don't think you'll stick to it? I mean, it's a hard habit to break.
56:51 I'm going to stick to it until I get in a super big hurry and there are some production bugs to fix, and I'm probably going to skip it. Not that I think that's a good idea, I'm not recommending it, I'm just telling you these are resolutions you make and you break.
57:05 I love it. This is the recommendation, so all of you out there who are learning Python and software programming engineering for the first time, here is what to do: talk a big game and then ultimately just write a bunch of spaghetti code that nobody can read. That's how the pros do it you guys.
57:21 That's how the pros do it. I think we should probably call it a show. Jonathan, this has been really fun, thanks for teaming up to put together an end of the year show for us.
57:33 Absolutely, this has been a blast, thank you so much for suggesting it and having me on the show. This has been a blast.
57:39 You bet, thanks. Bye.
57:39 This has been another episode of Talk Python To Me.
57:39 This is a joint episode with Partially Derivative and Jonathan Morgan and it has been sponsored by Hired and Digital Ocean. Thank you guys for supporting the show!
57:39 Hired wants to help you find your next big thing. Visit hired.com/talkpythontome to get 5 or more offers with salary and equity right up front and a special listener signing bonus of $4,000 USD.
57:39 Digital Ocean is amazing hosting blended with simplicity and crazy affordability. Create an account and within 60 seconds, you can have a Linux server with a 30 GB SSD at your command. Seriously, I do it all the time. Remember the discount code – TALKPYTHON
57:39 You can find the links from the show at talkpython.fm/episodes/show/40
57:39 Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes and direct RSS feeds in the footer on the website.
57:39 Our theme music is Developers Developers Developers by Cory Smith, who goes by Smixx. You can hear the entire song on our website.
57:39 This is your host, Michael Kennedy. Thanks for listening!
57:39 Smixx, take us out of here.