#193: Data Science Year in Review 2018 Edition Transcript
00:00 Michael Kennedy: This year, 2018, is the year that the number of data scientists doing Python equals or maybe even exceeds the number of web developers doing Python. That's why I've invited Jonathan Morgan to join me to count down the top ten stories in the data science space. You'll find many accessible and interesting stories mixed in with a bunch of laughs. We hope you enjoy it as much as we did. This is Talk Python to Me, recorded November 25th, 2018. Welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython. Jonathan, welcome back to Talk Python.
00:59 Jonathan Morgan: Hey, Michael. Thank you so much, it's super awesome to be back.
01:01 Michael Kennedy: It is great to have you back. Where have you been?
01:03 Jonathan Morgan: I have been like a thousand places. I've not been podcasting. I feel a little bit guilty about it. Very much miss the Partially Derivative audience. Very much miss being on the show. I don't think we did this last year so I'm super pumped to be doing this again, but it's because I've been doing two things. I have a company, New Knowledge. We're focused on disinformation defense so a lot of data science. Ultimately we like to think that we're protecting public discourse and improving democracy, big things like that. Not at all pretentious.
01:32 Michael Kennedy: That is a really awesome goal and I suspect you've probably been busy, right?
01:36 Jonathan Morgan: It's...
01:37 Michael Kennedy: I'm just saying.
01:36 Jonathan Morgan: It's a lot, it's a lot. I mean, perhaps we've bitten off more than we can chew. It's like every month we expect this to be fading from public consciousness, like, all right, this is the month that people are going to get tired of talking about online manipulation and Facebook and Twitter and everything, and then every month it gets worse. It's a thing, it's a thing. But it's great, and then you know there's also Data for Democracy, which is the non-profit that I work on as well, and that's a community of about four thousand technologists and data scientists now who are working on social impact projects. So, also kind of mission aligned, you know, democracy is cool. But yeah, between those two things I haven't had as much time to podcast as I would've liked. But I'm glad, you know, to be back in the saddle.
02:16 Michael Kennedy: Well, I'll say if you're going to hang up your headphones and your microphone, you've hung them up for pretty good reasons. Like those are pretty awesome projects.
02:22 Jonathan Morgan: Thanks man, thanks, yeah. It's exciting times, lots of good stuff to work on.
02:26 Michael Kennedy: Yeah. Well, you know what else is exciting? I would say data science is as popular as ever, wouldn't you?
02:30 Jonathan Morgan: I agree. I feel like data science is coming into its own a little bit. It's actually, it's been interesting to see some of the transition towards just some more, like, established workflow-driven, team-based data science. A lot of things that I think software engineers are probably super familiar with and comfortable with, that were still new since the last time that we talked even a couple of years ago, so yeah, it seems it might be here to stay. I don't know, I'm not going to call it, but I think it's possible that it will be here for a while.
02:57 Michael Kennedy: I definitely think it's a thing. It's starting to show up as, like, full-on courses at Berkeley and things like that, which is pretty awesome. We'll come back to that, but I found a cool definition of a data scientist, an individual who does data science, and I thought I'd throw that out there and just see what you thought in light of what we're about to talk about. This guy named Josh Wills, I don't know him, but Twitter knows him.
03:18 Jonathan Morgan: He's very popular on Twitter.
03:19 Michael Kennedy: He said data science is defined... At least in this tweet, it is. He said that a data scientist is defined as a person who is better at statistics than any software engineer and better at software engineering than any statistician.
03:30 Jonathan Morgan: Oh.
03:30 Michael Kennedy: What do you say?
03:31 Jonathan Morgan: Just like burn left and right.
03:35 Michael Kennedy: I think those are both good qual... I think it's a positive representation of a data scientist.
03:38 Jonathan Morgan: I think that that's actually true. I don't think that Josh is trying to demean anybody with that tweet, although it kind of sounds like it. You know, like, take that, statisticians, my software engineering skills totally blow yours out of the water. But I think that's right because it is like this weird hybrid where you're producing software, ultimately. I think software engineers might disagree that data scientists are producing software, and that's fair, that's fair.
04:04 Michael Kennedy: You can debate whether a notebook is a piece of software. I think it's a very interesting debate.
04:08 Jonathan Morgan: That's a good point, and actually, it's interesting how much people are trying to turn notebooks into runtime, like actual production, execution environments but that's probably a whole, we could go down that rabbit hole all day. But it's true, I think that's actually really captured, it gets that blend right in the middle of the Venn diagram between stats and software engineering.
04:28 Michael Kennedy: Yeah, it's cool. So, with that very precise definition, what we have done, mostly you, Jonathan, is gathered up a bunch of topics that represent some of the bigger pieces of news from 2018, and we're going to go through them for all the data science fans out there.
04:46 Jonathan Morgan: Yes, I mean it was a joint effort. It was a collaboration. I feel like we touched on some important things here on our list. I don't know if you were thinking along these lines, but I had a little bit of a theme with a lot of the stories that I was choosing. There's kind of a, AI may be coming to kill us all or save us all, it's like, it's one or the other right now. It's like, dark times for machine learning.
05:07 Michael Kennedy: It's very messy. It might accidentally kill us while trying to save us.
05:11 Jonathan Morgan: Right, it's like, really well intentioned. Like, oh, I can see what you're trying to do there, AI. I mean, I'm dead but I can, you know. You had my best interest in mind.
05:20 Michael Kennedy: All right, thanks for thinking of me, man.
05:23 Jonathan Morgan: Good try, good try. A for effort.
05:26 Michael Kennedy: That's all right. So would you say that the AIs need a babysitter? Maybe the other way around.
05:30 Jonathan Morgan: Whoa, look at that! That was smooth like butter. Speaking of babysitters, I'm not sure how many people have seen this, it's kind of an odd little story. But I wanted to, it perfectly encapsulates the well-intentioned but perhaps, air quotes, unforeseen consequences of AI. So a software company called Predictim thought to themselves, you know, it's really tough to find a babysitter. And it's true, you know, you're a parent, I'm a parent, when you're trying to find somebody to watch your kids, you're like, well, maybe my friends use somebody who worked out, or maybe I use some online service and they do background checks or whatever. But it's tough to feel comfortable and confident that the person who is going to come into your home and be responsible for your child is a good person, or at least not somebody who will put them in danger. Like, somewhere in that sweet spot.
06:20 Michael Kennedy: Especially when it's a baby. When the baby doesn't speak, it can't report to you, yeah, the babysitter beat me, or the boyfriend came, like, you don't even, right? It's just a baby, it doesn't know.
06:31 Jonathan Morgan: Exactly.
06:31 Michael Kennedy: It can't say, right? So how do you know?
06:33 Jonathan Morgan: Well, the old fashioned way is to use some of the social signals that I mentioned before, but the new AI way is to use social signals from social media, so this gets into kind of creepy territory, I think. Parents have started to turn to this application, and what the system does is crawl the social media history of potential babysitters and rank them on a risk scale. It gives them a risk rating. And so it rates them on all sorts of things like drug abuse, bullying, harassment, disrespectfulness, bad attitude, all sorts of things that I guess you could get from social media, although I think for most of us, we would get a 5 for all of those things because that's just how social media works. That's what we want out of it, you know? But nevertheless, relatively speaking, I guess you get some kind of rating, so it's caused a little bit of controversy for reasons that you might expect, like, how would you classify these things? So I should take a step back and say I'm not sure most of your listeners will be familiar with how an AI system might even go about determining these things, so how would I read your social media content and make a judgment that you were at risk of bullying or harassment or disrespect? But there's no way of knowing, because the system doesn't explain. So it might be something really simple, like it just looks for quote-unquote bullying words, some keywords that they pulled out of a dictionary that are related to bullying, so really simple. Or it might be that they've trained a machine learning classifier, that they've somehow got a hold of a bunch of example bullying tweets or bullying Facebook posts or whatever it is and said, oh, I can now recognize bullying content versus non-bullying content, and they're trying to use that as a rating system, who knows? Nobody knows, but the idea basically is that it's scoring any potential babysitters and it's giving these to parents on a scale of one to five. So somehow there's a difference between a risk assessment of 1 on the bullying scale and a risk assessment of 2 on the bullying scale, and so as parents we'll have to decide, on this kind of arbitrary scale, am I comfortable with a disrespect score of 3 but a bad attitude score of 1? I'm not really sure, what are you teaching my child? What kind of disrespect are you teaching my child, that an AI system has warned me, based on your social media, that maybe, you know, you're not like a Boy Scout or whatever it is? So in any case, it kind of gets at this idea that, I think there's this dream that we can look at people's digital exhaust on the internet, what they say on social media, how they spend their money, or the places that they've been, and get some kind of picture about who they are as a person. So that's the big leap. Like, you could probably make a guess about how people will behave on social media based on how they behave on social media. Or you can probably get a sense of what people are likely to buy in the future based on what they've purchased in the past. But that leap to say, now I know something about you as a person, like how you'll behave in other environments where I've never observed you before, that's what this application's doing, and I think it's generally a trend in AI. And I'm not sure anybody believes that's actually possible. Which is where it gets kind of tricky.
Should we use these kinds of tools in hiring and recruiting, or other types of assessments that we're making about really delicate, sensitive things like babysitters, or maybe less delicate and sensitive things, like, should you be my tax accountant if you get a four out of five for bullying on social media? I don't know.
10:00 Michael Kennedy: Maybe it's a good thing, maybe you want an aggressive tax accountant.
10:02 Jonathan Morgan: Exactly. What is a 4, relatively speaking, to a 3 for bullying? I don't know, could you give me something, like, you know, who's the most harassing person on social media, and we'll just call them a 5 and then everybody else is ranked on that person's scale? I'm trying not to call out anybody in particular. I'm really, really trying not to call out anybody in particular, but I don't know. Anyway, that's the debate that this has stirred. So Predictim, I'm sure, is well-intentioned. I'd love, you know, to know more about...
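To make the "really simple" keyword idea Jonathan describes above concrete, here's a minimal sketch of what a keyword-based scorer could look like. To be clear, Predictim has never disclosed how its ratings actually work, so the word list, the 1-to-5 mapping, and the logic here are all invented for illustration:

```python
# Hypothetical keyword lexicon; Predictim's real method was never disclosed.
BULLYING_WORDS = {"loser", "idiot", "hate you", "shut up"}

def bullying_risk(posts):
    """Rate a person's posts on the arbitrary 1 (low) to 5 (high) risk
    scale by counting keyword hits, the simplest possible approach."""
    hits = sum(
        1
        for post in posts
        for phrase in BULLYING_WORDS
        if phrase in post.lower()
    )
    return min(5, 1 + hits)  # clamp to the 1-5 scale parents see

print(bullying_risk(["I hate you, Mondays", "Great day at the park!"]))  # 2
```

Notice how one sarcastic post moves the score: exactly the kind of brittleness being debated here.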
10:32 Michael Kennedy: It sounds so Orwellian and creepy, but let me read you just the quote from one of the founders, or people who work there, 'cause it sounds so much like something you would want. This is one of the people from Predictim speaking: "If you search for abusive babysitters on Google, you'll see hundreds of results right now. There are people out there who either have mental illness or just born evil. Our goal is to do anything we can to stop them." And when you think of that description and your brand-new baby, you really don't want to put them together, so it sounds so good, but it's also got this really dark side, right?
11:09 Jonathan Morgan: It does, I agree. I think the Predictim folks are pretty well-intentioned. But it does highlight the unintended consequences of maybe giving too much weight to the output of a data science model. Just because we can package it in a data science workflow, just because it kind of walks and talks like something that is algorithmically decided and therefore objective, it's not necessarily objective at all. It's just as easy for me to encode my subjective bias into a machine learning model as it is for me to act on that bias in real life. So it's difficult for people who aren't familiar with data science to recognize that, and it has these really strange implications. What about all those, not the babysitters who are actually abusive, for sure, let's get rid of those. But what about the people who are otherwise excellent babysitters but at one point said something mean about a TV show they didn't like on social media, and their bullying ranking went through the roof? And so now they can't get jobs as babysitters anymore. I think we need to think about how we're being transparent about the algorithms that we're using to make these sorts of choices. And, of course, that's a difficult thing for those developing those algorithms, 'cause you want to, you know, it's your sort of secret sauce for your product. There's a real tension there, and I'm not sure we're getting it right just yet.
12:35 Michael Kennedy: Yeah, it's pretty crazy. I guess a few parting thoughts. For babysitters, that's usually a part-time, small thing for most folks. If your babysitting career goes a little sideways, you can do something else. But a lot of this is applied to all sorts of jobs. They talked about a company called Fama that uses AI to police all of the workers at these companies. Or how Amazon actually canceled what became a clearly biased algorithm for hiring across all of Amazon. Then it starts to have more serious social effects.
13:07 Jonathan Morgan: Yeah, and I think this is coming at a moment when it's like a reckoning for Silicon Valley and technology and machine learning in general. We've been almost naively assuming that everybody that uses our technology will use it with the best intentions, or the intentions that we had when we developed it, and I think the theme of 2018 is, actually, no. It's actually possible for things to go horribly wrong in ways that you didn't intend, and I think it's good. I think it's a good time to have this debate about how close we actually are to encoding our real life values into our digital technologies. Maybe we're not that good at it yet.
13:48 Michael Kennedy: Yeah, we're definitely new at it. And I feel like a lot of the goodwill has kind of left that space a little bit, so we've got to be careful. Something that is not new, something invented in the 1600s, in the Renaissance, right, we're talking the scientific paper, you know.
14:08 Jonathan Morgan: What? That old chestnut.
14:10 Michael Kennedy: Kepler's orbits of the planets and all those things, right, written up. So actually the scientific paper turned out to be a big invention. It used to be you had to write books, or you just wouldn't write your discoveries down. Or they'd be private letters, like Einstein wrote to Niels Bohr, or something like that. We'd just have to run across those papers, right? But it turns out that the scientific paper has been around and basically unchanged for like 400 years. I would say science has changed. Certainly the dependence upon computing has changed, right?
14:42 Jonathan Morgan: Yeah, absolutely. I mean, I'm not in most of the quote-unquote hard sciences, but even in social science, I think almost the majority of the work, I have no basis for that, that's probably fake news. But a lot of it is data driven and requires some type of computation, for sure. And it's difficult to reproduce papers without having access to that data and computation.
15:03 Michael Kennedy: Right. And a lot of times the computations were not published, just the results. I think that's starting to give way. In one of the articles, this was published in The Atlantic and it's a serious research piece, really, it's called The Scientific Paper Is Obsolete.
15:20 Jonathan Morgan: Okay, that's a big statement, bold.
15:22 Michael Kennedy: That is a big statement, right? And the graphic is super intense, right, on that home page?
15:28 Jonathan Morgan: Yes, yes.
15:28 Michael Kennedy: So it's this sort of traditional scientific paper with lots of, there's probably 15 authors and all sorts of stuff on there and it's literally on fire. And it turns out to be a really interesting historical analysis, like sort of a storytelling of how do we go from scientific paper to closed-source software with egomaniac sort of folks, really, like, Mathematica, to Python and open source and Jupyter and all that, right?
15:57 Jonathan Morgan: Yeah, and I mean, I think it's interesting to see even the shift away from the traditional publishing model, for anybody who's gotten into any type of research before. It used to be that research happened primarily in academic institutions and everything went through peer reviewed journals, but almost in the tradition of open source, what's starting to creep in to the research community is people posting on something called arXiv. So they'll write in the style of a peer reviewed paper, but before it's peer reviewed, partially because the technology changes so quickly, but also because they want to be open and transparent about their work, they are uploading it, basically, to this website called arXiv, where you can search academic papers prior to them being published in some type of peer reviewed journal or at a conference, or whatever, which is super interesting, and I think it gives people a lot more access to a lot more techniques. It feels kind of like posting your code to GitHub, and this is anecdotal, but at least what I find is that the code that sits behind some of these papers is usually available on GitHub. Like, the same types of authors who post to arXiv I find are the ones that also say, oh, by the way, here's the GitHub repository so you can go run it yourself, and here's the sample dataset that I used. Which really gets at this idea that there's something to be said about reproducing these findings, especially for these complicated, and I'm thinking mostly from the perspective of new machine learning developments, but these complicated new modeling techniques have to be replicable to be usable, in the same way that your open source project is kind of only as good as it is usable. So you might have the best new JavaScript MVC framework, those probably aren't cool anymore, but they were when I was a software developer, but like, whatever, you might have the coolest new JavaScript app, but if nobody uses it, it doesn't really matter how good it is. The ones that ultimately the community rallies around are the ones that are the most usable, the most accessible, the most transparent. And I think it's interesting to see that creeping into research as well. So I really dig the idea that perhaps it's a model that needs to change.
17:59 Michael Kennedy: Yeah, they talk a lot, this is a pretty deep research piece here, it's quite the article. It's not just a few pages. And it really digs into the history of Mathematica. How it became really important for computation in research, but it just didn't really get accepted. And then along come Pérez and Granger and those folks with their IPython and their open source and their not centralized way of doing things. It's just really interesting how that's become embraced by science and data science, and I think how it's actually influencing science. I hear all these scientists who I speak to talking about embracing open source principles and styles and more engineering, and I think all of that is being brought to them from this.
18:41 Jonathan Morgan: Oh, yeah, and I think it's even being embedded in the way that students are educated now, which is totally different. I think in the article they talk about one of the authors of one of the open source Python projects having a faculty appointment in the stats department at Berkeley now. And I know Brian Granger and his team, focused on the Jupyter notebook, are, I think, also embedded inside a university department, they're not at Stanford, but they do some teaching as well, and they're starting to teach courses that are entirely based around this open source, open science workflow. It's Python as a programming language and then Jupyter notebooks, which, if you're not familiar with Jupyter notebooks, it's basically a way to execute snippets of code sequentially, but in a web-based environment. So you can write a little bit of code, you can run it and see what the output is, and then you can build on that, and then you can share your notebooks as you go. It basically captures the rough draft process of finding your way towards a data science solution, or really any type of programming solution, but often they're used by data scientists. And these sorts of tools, I think, have made it possible to share your entire thought process, which is a really important part, it's kind of showing your work on the way to getting the results that you need, which I think is more specific to data science and machine learning than it is to most types of software engineering. In software engineering, it works or it doesn't. And if it works fast enough, then like, all right, dope. Let's move on, like, check that box and then run the next task.
20:10 Michael Kennedy: Pushed the button, it did the thing.
20:11 Jonathan Morgan: Yeah, all right, fantastic. I'm going to close that Jira ticket and march right along. That is true to a certain extent in data science. Your code runs or it doesn't. But often, we're trying to evaluate what's the quality of the findings, or what's the quality of the predictions that you're making, and what trade-offs is your model making? All these more fine-grained decisions. It really matters what's behind the curtains with most of the data science work and most of these research papers, and so I think that's why these tools have basically figured out how to capture all of that in a way that makes it really useful, really usable, really easy to share, really open and transparent, which is why I think they've caught on. They've caught on because they're usable, and they have this great by-product, this great knock-on effect: they make all of our work more transparent.
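For listeners who haven't used a notebook, the incremental, show-your-work flow described above looks roughly like this in practice. Each comment marks what would be a separate notebook cell; the file name and the temperature column are made up for illustration, and you'd need pandas and matplotlib installed:

```python
# Cell 1: load the data and take a peek (experiment.csv is a made-up file).
import pandas as pd
df = pd.read_csv("experiment.csv")
df.head()

# Cell 2: having seen the columns, summarize the numeric ones.
df.describe()

# Cell 3: build on that with a quick look at one (hypothetical) column.
df["temperature"].plot(kind="hist")
```

Each cell's output stays inline under the code, so a reader retraces the whole investigation, not just the final answer.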
20:54 Michael Kennedy: I see a lot of promise and I definitely see things going this way. I think it's really good. One of the things that you're describing that really resonates with me is, I feel like a lot of science is dull and boring to people who are not super into it because it's been massively sterilized down to its essence, right? Here's the formula, you plug this in and you get the volume of gas by temperature or whatever, and you're like, well, who cares about that, that's so boring, right? But if the story is embedded with it, and the thinking process that led up to it, it's so interesting. One of the most interesting classes I ever had was this combination of the history of mathematics and something called real analysis, for the math people out there, where we basically recreate calculus from thinking about the building blocks. And it was just so interesting because it had all the history of the people who created it in there, whereas if you just learned the formulas, you're like, well, this is boring, forget this. This has the possibility of keeping more of the thinking and the exploration in the paper and in the reporting.
21:54 Jonathan Morgan: Oh, yeah. Absolutely. Which I agree is fascinating, because it is an investigation at the end of the day. And one of the authors of Jupyter, one of the leads of the Jupyter team, it's kind of meta, because his development of Jupyter is now part of this upper level data science course at Berkeley in which all the students use Jupyter for all of the data science work that they're doing. It's cool, it's notebooks all the way down. Although I didn't know, until I noticed that you pulled this out of the article, which I thought was really spot on, that the name Jupyter is actually in honor of Galileo. It's like going back to an early scientist, going way back into history, so it's like we haven't forgotten where we came from. The scientific method, and how that gets encapsulated in this structure that we've all accepted, research papers, means we're standing on the shoulders of giants, we're just moving forward into this new, rapidly iterative, open source, more transparent era, which is cool. Like, why shouldn't research be democratized like all of the other types of information? Like, thank you, internet. And we're not forgetting where we came from, which I think is really important. We don't want to throw the baby out with the bath water.
23:01 Michael Kennedy: Yeah, there's so many interesting parallels between early scientific discovery and open source versus closed source. We'll come back to this, actually, but part of his quote that you pointed out is that Galileo couldn't go anywhere and buy a telescope. He had to build his own. It's sort of like before we could just put it on GitHub, we had to just make it, it wasn't there. It's awesome.
23:20 Jonathan Morgan: Yeah, you got to scratch your own itch, you know.
23:23 Michael Kennedy: Like, fine, I'll build it myself. I was going to take the weekend off, but you know, whatever. World, come on, we're doing this. This portion of Talk Python to Me is brought to you by us. Have you heard that Python is not good for concurrent programming problems? Whoever told you that is living in the past, because it's prime time for Python's asynchronous features. With the widespread adoption of async methods and the async and await keywords, Python's ecosystem has a ton of new and exciting frameworks based on async and await. That's why we created a course for anyone who wants to learn all of Python's async capabilities: Async Techniques and Examples in Python. Just visit talkpython.fm/async and watch the intro video to see if this course is for you. It's only $49 and you own it forever, no subscriptions. And there are discounts for teams as well. So this next one, we were speaking about advanced machine learning and AI and analyzing social media and sentiment analysis. This next one is more about algorithms and less about AI. It probably could be implemented with if-statements but is actually pretty evil. The focus on algorithms at the core is pretty interesting.
24:34 Jonathan Morgan: I thought so, although I would like to point out that I think half the time that people are talking about AI in air quotes, they're talking about a thing that could have been implemented with an if statement. And in fact, I would argue, I know that there's technical definitions of AI, but what most people mean is software making a decision for me and it's like, well, there is a way that software does that.
24:51 Michael Kennedy: Yes.
24:52 Jonathan Morgan: It is an if statement. It's AI. If you've written an if statement in your code in your freshman programming class, you've written AI.
25:00 Michael Kennedy: I could go crazy and do a switch statement. But, yeah, one of these branching things.
25:07 Jonathan Morgan: The decisions are endless, you know. Theoretically endless. As many decisions as you want to spend time programming by hand.
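In the spirit of the joke, here's a complete "AI system" by that definition, a toy sketch where every name and decision is made up:

```python
def artificial_intelligence(user_is_hungry: bool) -> str:
    """'Software making a decision for me,' per the definition above."""
    if user_is_hungry:          # congratulations, you've written AI
        return "order pizza"
    return "keep coding"

print(artificial_intelligence(True))   # order pizza
```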
25:13 Michael Kennedy: Yeah, and so this one comes back to the sort of, everything's going to be used for good, right?
25:18 Jonathan Morgan: I'm not sure how anybody thought this could have been used for good, actually. This algorithm was just intentionally designed to screw everybody over. So this episode is going to come out around the holidays and people will be traveling.
25:27 Michael Kennedy: People will be traveling.
25:30 Jonathan Morgan: And if you are traveling with your family, you may have had this experience where you're trying to buy cheap plane tickets because it's you and all your kids and you're traveling across the country at an expensive time to travel and you're trying to get to your parents' house in time for Christmas and it's really stressful, and you're annoyed, and then you book your tickets and you try and get seats together, and you can't do it. And you're like, wait a second, come on, airline. I booked all these tickets at the same time, I'm clearly traveling with a couple kids under 10 years old. Come on, this is hard enough as it is, shakes fists.
26:04 Michael Kennedy: Surely they know they should put the kids with the parents, right?
26:06 Jonathan Morgan: Right, that's like a common, why is this system not smart enough to recognize that? But it turns out, it is smart. It's just smart in exactly the evil opposite way that you don't want it to be. Because it turns out that at least in the UK some airlines are using algorithms not to put families together which is what we all assume they would be doing, but to intentionally split us up and you might ask yourself...
26:31 Michael Kennedy: Why?
26:35 Jonathan Morgan: That doesn't make sense, are you just, like, you know, like a sadist? I don't really, what's the, somewhere there's like a developer with this algorithm at the back of every airplane twiddling his thumbs like Mr. Burns and laughing maniacally. But it's because that way they can charge people more money so that they pay to sit together. They're like, oh, do you not want the inconvenience of asking 47 people whether or not they're going to switch with you so that you can sit with your child, who may or may not have the emotional fortitude and maturity to not freak out by themselves on an airplane going across the country? Yes, so, apparently this is a common practice with a number of airlines. They are algorithmically looking at people who share last names, so if you share a surname with someone on your booking, you will not be seated with each other when the seats are assigned, which seems really uncool. I just, come on.
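The airlines haven't published any of this code, so the following is only a guess at the kind of logic being described: a toy sketch that scatters a booking whose passengers share a surname unless they paid for seat selection. Everything here, names, seat layout, and the scattering rule, is invented:

```python
def assign_seats(passengers, open_seats, paid_for_selection=False):
    """Crude sketch of the reported behavior: spread same-surname
    bookings across the cabin unless the booking paid to sit together."""
    surnames = {name.split()[-1] for name in passengers}
    traveling_together = len(surnames) < len(passengers)
    if traveling_together and not paid_for_selection:
        # Take every Nth open seat so the group ends up rows apart.
        step = max(1, len(open_seats) // len(passengers))
        chosen = open_seats[::step][:len(passengers)]
    else:
        chosen = open_seats[:len(passengers)]  # an adjacent block
    return dict(zip(passengers, chosen))

seats = [f"{row}{letter}" for row in range(10, 40) for letter in "ABC"]
print(assign_seats(["Ann Smith", "Bob Smith", "Cal Smith"], seats))
# {'Ann Smith': '10A', 'Bob Smith': '20A', 'Cal Smith': '30A'}
```

A handful of lines, no machine learning required, which is exactly the point about "algorithms" versus "AI" above.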
27:27 Michael Kennedy: So you could pay for the reserved seat the extra $27 per traveler, or whatever.
27:32 Jonathan Morgan: Right, how much do you care about your kids? Or how uncomfortable are you with asking a bunch of strangers during the holidays to switch with you? It becomes that really complicated calculus problem where you're like, well, my wife's sitting 7 seats up and she's at the window, but we're also traveling with my son who's 4 seats up on the aisle, so if you switch with her and this person in row 47 switches with him, then I actually think that we can get approximately close to each other. It's absurd. So if you don't want to go through that, then yeah, you can just pay an extra $50, but wouldn't the decent thing be to just put them together?
28:04 Michael Kennedy: It's so evil to split the families apart so they'll pay to get back together. Although, to be clear, I'm well out of the range where this matters, right? My kids could sit alone and it wouldn't be that big of a deal. But my thought is, all right, evil airline and your algorithms, I see your play, and I raise you a lone 3-year-old in the back by the business traveler. I'm going to bed now, thank you very much.
28:33 Jonathan Morgan: I love it.
28:34 Michael Kennedy: Probably shouldn't implement it, but it seems like, you know, you could just turn it around on them.
28:38 Jonathan Morgan: I love it, it's like, oh yeah, you know, we skipped naps, and I just dumped three Pop Rocks sugar packets down their throat before I got on the plane, because you know in 30 minutes they're going to freak out, yeah. Exactly, I like it. Let's see how many screaming children we can stack up on an airplane.
28:58 Michael Kennedy: It's really bad, it's really bad. Something like this actually happened to my daughter. Not from algorithms, just other bad, bad stuff. So it's not great. And I do think it's really evil of the airlines to do this.
29:08 Jonathan Morgan: It is, it is. But what was interesting was that, in the UK, it actually got referred to a new government organization called the Centre for Data Ethics and Innovation. They actually have...
29:18 Michael Kennedy: That's crazy.
29:20 Jonathan Morgan: It's cool, right, it's cool.
29:21 Michael Kennedy: Yeah, it's very cool.
29:22 Jonathan Morgan: Yeah, yeah, and like, we have similar types of stuff here in the US. There's an Office of Science and Technology Policy that's in the administration, it's part of the White House. And for anybody who kind of follows that sort of stuff or is interested in data science, that's where our first US Chief Data Scientist sat, in the OSTP. I don't think there is a chief data scientist in the current administration yet. But there's a chief technology officer. It's really decent. But in the UK there's the Centre for Data Ethics and Innovation, and this case was actually referred to them. So they've just formed, and they got handed, here, this is the most offensive thing that algorithms have done, good luck with it, Centre for Data Ethics and Innovation. Fix this mess.
30:06 Michael Kennedy: They're like, great, this is why we exist, oh my God.
30:09 Jonathan Morgan: It is, it is. I guess, because it's like a real softball. It's like, what would the ethical thing be to do in this situation? I don't know, gosh, I'm spent. I could not come up with a better, more ethical alternative than splitting parents up from their children.
30:25 Michael Kennedy: Even bureaucrats could totally solve this, for sure.
30:29 Jonathan Morgan: Yeah, so, you know, that's a thing. I mean, it seems like one of those open and shut cases. I think government ministers are calling it "exploitative". So that's usually not a good sign for your business practices.
30:43 Michael Kennedy: Yeah, no, that's a bad start.
30:44 Jonathan Morgan: It's a bad start. But they've got nowhere to go but up. We can say that.
30:49 Michael Kennedy: That's right.
30:50 Jonathan Morgan: That's kind of like a pun, I guess, because they're airplanes that go up in the air, but anyway. But it brings me to something that I was really excited to talk about with all of your listeners, because it's something that's important to me personally and something that I'm involved with, which is an ethics project for data scientists, so that hopefully we can prevent these kinds of mishaps in the future. So as I mentioned on the show, I'm involved with an organization called Data for Democracy. It's a non-profit, and we have recently launched what we're calling our Ethical Principles for Data Practitioners, the Global Data Ethics Principles.
31:30 Michael Kennedy: So this is like the Hippocratic Oath like doctors take, but for data.
31:33 Jonathan Morgan: Exactly, exactly. And so, like we mentioned, this has been kind of a bad year, I would say, for technology in general, and technologists, and Silicon Valley culture, and data science and machine learning and AI, and everybody's wondering, well, is this a good thing for society, is it not, how did we get here? How did we kind of stumble into this dystopia where our minds are being manipulated by propaganda on Facebook, and in China they're doing social credit, like that terrible Black Mirror episode that I saw, that we'll get to later. What's happening to us? And fundamentally, as the people who are implementing this technology, we have a real opportunity to think about our values, think about our ethics, think about the way that our technology might be used in ways that we haven't intended, because we're a pretty optimistic group, technologists. I think we assume that, like, we want to put something really useful and meaningful out into the world, maybe not this family splitting algorithm, that was probably...
32:28 Michael Kennedy: That was probably in the business department. They said, hey, guys, can you...
32:31 Jonathan Morgan: Yeah, this whole, you know, we'll improve revenue by like 17% in Q4, like, that'll be good for the world. And there's nothing wrong with improving revenue. Great, businesses are fantastic.
32:42 Michael Kennedy: Not on the back of three-year-olds, maybe.
32:44 Jonathan Morgan: Right, maybe not in this way, you know. But I think also, like, the stuff that we do is actually pretty complicated. People don't really understand, at a deep level, what the software is doing and all of the potential ways that it might be used other than the intended use case. That's something that really only we think about, or are in a good position to think about, as technologists. So anyway, that was the idea behind this Data for Democracy project. It's a global initiative, and it's basically like, what's a framework for thinking through putting ethics into your process? So, how do you incorporate these principles into your everyday data and technology work?
33:20 Michael Kennedy: What does it look like for data science? I know what it looks like for doctors. You won't do any harm and that kind of thing. How about for data scientists?
33:28 Jonathan Morgan: Yeah, exactly. So we have, we call it FORTS. There's the high level, which is kind of like do no harm, so you think about fairness, openness, reliability, trust, and social benefit. There's a handful of principles, it's kind of like a checklist, and so you can go through this checklist as you're developing a new feature, or maybe developing a new model if you're a data scientist, or some system for processing data, like anything that touches data, any of the technology work that you're doing, you can go through it. And you know, it may be that you don't have to check every box every single time, but it's a nice framework for catching potential blind spots. So what's your intention when you're building this feature or developing this model? Have you made your best effort to guarantee the security of any data that you're going to use? I mean, that seems like a no-brainer, but you know, it's easy to forget when you're moving fast and you're maybe thinking about the deadline, or the fact that you're trying to ship this really important feature because your customer really needs it, you know. Take a second to remind yourself that data security's important. Have you made your best effort to protect the anonymity of data subjects, which is really important in a lot of data science research? Sometimes if we don't think about this, we can inadvertently leak private data to the public when they consume our research, even though that was never our intention, and that's potentially irresponsible. You can practice transparency, which is a lot of times about understanding how our algorithms work. So in this case, there was no transparency around how the algorithm that chooses seat assignments was functioning, and then after an investigation, it was revealed that it was actually examining whether or not you shared a last name, and if you did, it was splitting you up. And if that was a transparent algorithm, we would've said, wait a second, this is totally uncool.
35:12 Michael Kennedy: You can't explain this algorithm transparently and have people still accept it. That's kind of like sunlight being the best disinfectant, right, in politics, and all that, and business.
35:23 Jonathan Morgan: Yeah, absolutely. Another principle that maybe would have mitigated this problem was to communicate responsibly. If the engineer who was responsible for implementing that had said, hey, this is going to split up families, and like, during holiday travel. You can respect the relevant intentions of stakeholders, which I also think is really interesting, because this is exactly, probably, what happened here. The business team said, hey, it'll improve revenue by 17% throughout the year, or whatever the number is, I'm making that up. So it's a set of principles, and you can sign onto it. If you go to datafordemocracy.org you will find a link to sign the ethics pledge. If you think that ethics are important, then you should totally sign it, and here's why you should sign the pledge. A) I think it's a cool thing to do, if you think that ethics are important and you want to have this kind of mental checklist. But it's also important because we all work for organizations, or we're students at academic institutions, or we're otherwise doing this as our profession, and the organizations that we work for are starting to adopt more ethical practices as companies and academic institutions and governments, this is becoming more prominent. And the way that we can make sure that our values are encoded in these larger business processes and these larger institutions is by making ourselves heard, by showing our numbers. And so we show up as a technology community and we sign a pledge, or we communicate to our manager, or whatever it is. Ultimately, this isn't going to come from the top down. I don't think we want it to come from the top down. I don't think we want this to come from people who aren't doing the work every day, who hear about this ethics thing and want it as a stamp of approval on whatever products they're making. That's fine, but ultimately the system that they design won't actually accommodate the technology work that we have to do every day. So I think for those of us who are writing software, for those of us who are developing models, who are doing data science and software engineering every day, I think we need to make our voices heard about the ethical principles that we want to see applied to the work that we're doing. So anyway, that's why I think it's important. Datafordemocracy.org. Don't be the developer who splits up families at the holidays, you're better than that.
37:33 Michael Kennedy: That's right, that's awesome. So I think that's great, and I think it ties in really well to a lot of the themes that seem to be happening around tech companies. It used to just be, oh, tech companies are amazing! Of course we want to just encourage them. And now there's some real skepticism around Facebook and Uber and all these types of companies, and they kind of have to earn it. And this oath is part of earning it, I think. That's cool.
37:56 Jonathan Morgan: Yeah, I think so too. I think so too. And it's a good opportunity I think for, like I was saying, the actual, the doers, those of us who are doing the work to participate in that conversation because it's happening, like you're saying. It's happening whether we might want it to or not. The kind of attitude has changed. People are thinking about legislating Silicon Valley which would have been a totally foreign, bizarre idea almost, even a year or two years ago.
38:21 Michael Kennedy: Yes, that's exactly what I was thinking of is that kind of stuff. It's like, wait, what? What do you mean?
38:25 Jonathan Morgan: Right, and so, we're just over here doing good, making cool stuff, come on in, pull out that iPhone, buddy! Let's play some Angry Birds!
38:33 Michael Kennedy: That's right.
38:34 Jonathan Morgan: Times have changed and more people are aware of the potential negative consequences and so now's our time to have a conversation and make the industry that we want to be a part of.
38:44 Michael Kennedy: So some of the things we've covered have been a little creepy.
38:46 Jonathan Morgan: Indeed.
38:47 Michael Kennedy: Like the babysitter one. But this next one is, I think it's just pure goodness.
38:53 Jonathan Morgan: 100%, I couldn't agree more.
38:55 Michael Kennedy: So we know that Python is being used more in general scientific work and it's probably being used more in the hard sciences. Would you consider economics hard science?
39:06 Jonathan Morgan: Ooh, oh.
39:08 Michael Kennedy: I'd put it right on the edge. I mean, there's a lot of math in there. There's numbers.
39:13 Jonathan Morgan: There's a lot of math in there. I guess because I never really took economics, I've only listened to a couple of economics textbooks in adulthood to try to learn something I feel like I should know a little bit about. I feel like they were behavioral economics. It seems like a lot of correlation without causation. Personally, no offense, economists out there. But it's really data-driven, which I think is really cool. So there is a huge amount of information to consume when you're doing good economics. So in any case, yeah, let's do it. I'm on board, I'm going to go one step further. Let's call it a hard science. I'm in.
39:46 Michael Kennedy: All right, right on. I certainly think some aspects of it are. Now, we talked about the scientific paper being obsolete, the move from just PDF or written paper over to Mathematica over to Jupyter, and that sort of high level conversation of a trend, but this year, the Nobel prize in economics was basically won with Jupyter and Python, which is awesome.
40:10 Jonathan Morgan: That is amazing and not surprising. I mean, if you're going to do some scientific research, what other programming language would you choose?
40:18 Michael Kennedy: Yeah, absolutely, especially if you're an economist. So there were a bunch of folks who did a bunch of this work, and included in them were these two guys named Nordhaus and Romer. I think they're both American university professors, and, let me see if I can mess this up, poorly describe what their Nobel prize was generally about. But it was basically looking at how do you create long term sustainable growth in a global economy that improves the world for everybody, through things like capitalism and whatnot that mostly focus on very, very narrow self interests. They think they cracked that nut, which is pretty interesting, and they cracked it with Jupyter.
41:03 Jonathan Morgan: I mean, that sounds Nobel prize worthy to me. And also the economic discovery is probably useful. I think it's interesting, because we talked about this in one of the previous stories, that Romer in particular was a Python user and he wanted to make his research transparent and open. That was a key part of the research, so that people could understand how he was reaching his conclusions. So, all of my joking about behavioral economists a minute ago aside, this is actually an important part of the work, so that you can understand his assumptions, and at least understand the choices that he's making and the data that he's choosing. And he tried to do this with Mathematica, which is another tool people use to perform computations for their research, and apparently that just made it too difficult to share his work, in that anybody who wanted to try and understand his work would have to also use this proprietary software, which is a really high bar. It's really expensive, not everybody knows how to use it, and it's not as simple, intuitive, open and transparent as Python and Jupyter notebooks. So it was really core to the work that he did that ultimately won him the Nobel prize, which is pretty awesome.
42:10 Michael Kennedy: Yeah, it's super cool, and I do think that if your goal is to share your work, having it in these super expensive proprietary systems is not amazing, right? I mean, we're talking about Mathematica here. My experience was I did a bunch of stuff in MATLAB at one point, and we worked with some extra toolkits, and these toolkits were like $2,000 a user just to run the code, and if you wanted to run the code and check it out, you also paid $2,000. Like, that's really prohibitive.
42:37 Jonathan Morgan: It really is, and I think that was my experience with proprietary software too. Not only do those programming languages or applications make it difficult to collaborate, and I think this will be of no surprise, I would think that this is something that everybody amongst all of your listeners accepts as fact, but the open source community has made this current software revolution possible. The fact that we are able to collaborate at this scale and share ideas and share code and build on top of each other's work, I think, is the reason that we've had this explosion in entrepreneurship, this explosion in the kind of energy and excitement that comes out of Silicon Valley, even though we were just talking about it maybe being a bad thing. But this level of rapid innovation that we've been going through is because programming is just so much more accessible than it's ever been before, and I think that's largely because we've transitioned away from these proprietary software models.
43:32 Michael Kennedy: Yeah, and you think about the people who that benefits. Obviously it benefits everyone, but if you live in a country where the average monthly income is a tenth of what it is in the US and you have to pay $2,000, it doesn't mean it's expensive, it means that you can't have it, right? It's just inaccessible to you. And so this opening up of research and these capabilities, I think it benefits the people who need the most benefiting as well.
43:56 Jonathan Morgan: Yeah, absolutely, and it distributes creativity much more broadly than we would have been able to before. We can tap sources of creativity and innovation that, like you're saying, would have been taken off the board, because proprietary software is only accessible to the small portion of the global population that can afford to spend thousands of dollars annually on licensing it.
44:21 Michael Kennedy: Yeah, absolutely. So there's a couple more thoughts I just want to share really quickly from this before we move on to the next one. In 2018, you've got to put this story up there, positive or not. So one thing I think is really awesome about this is this guy is 62. He transitioned into Python recently. You hear a lot of people say, oh, I'm 32, I couldn't possibly learn programming or get into this. This guy made the change to Python programming and the data science tools for his Nobel research in his late 50s, early 60s. That's awesome.
44:54 Jonathan Morgan: It is awesome and what's interesting is that he's now been exploring how software works and you pulled out a really great quote where he says, "The more I learn about proprietary software, the more I worry that," wait for it, "objective truth might perish from the earth." Which is...
45:10 Michael Kennedy: Insanely powerful statement.
45:12 Jonathan Morgan: That is a powerful statement from a guy who won the Nobel prize, so he's probably right.
45:16 Michael Kennedy: Yeah, he definitely sees a change in the world, and this is a little bit of what I was thinking when I said, look, we've got open source being used by scientists but also changing science, right?
45:28 Jonathan Morgan: Right, absolutely, because it's not only making the work more approachable and accessible, but it's making it more repeatable, especially now that so much research is based on computation, like we've been saying. It would be impossible to come to a consensus about whether or not something is correct, acceptable, whether we can say that this is an established fact, if you can't show your work. And in this case, showing your work means the computation and the data.
45:57 Michael Kennedy: Right. It's not just writing the final number down in calculus. You're like, no, no, you've got to show your work.
46:02 Jonathan Morgan: Exactly, exactly. It's very important. Like, science and objective truth depends on it. Show your work.
46:08 Michael Kennedy: That's right, show your work. All right, so the next one is more like a mundane day to day thing, but actually makes a big difference. What's the story with Waze and how it's reducing crashes?
46:20 Jonathan Morgan: This is pretty cool. So we've talked about using quote-unquote AI to predict things.
46:24 Michael Kennedy: Sorry. Maybe we should tell people what Waze is. I don't know how global Waze is. Maybe they don't have experience. Just give us a real quick summary of what it is.
46:31 Jonathan Morgan: That is a good point. So Waze is a mapping application that helps you find the shortest route from one point to another in your car. So there's a whole thing about Waze. It's a community of people, and so you're kind of collaborating to point out things on the road. So I'm stuck in traffic and you can tap a button, or there's a police car trying to catch people for speeding and you can tap a button, or there's been an accident, you can tap a button. And so it's this way for drivers to communicate with each other on the road, and that community is so large that as it routes you from one point to another in the city, it helps you avoid obstacles that might be pretty dynamic and changing all the time. So it's kind of like Google Maps.
47:15 Michael Kennedy: It's more of a two way street, for sure, right? The users send info back a lot more, too. Okay, cool, with that in mind, how does Waze play into this story?
47:23 Jonathan Morgan: Well, as you can imagine, it captures a ton of data about driving patterns. So not only does it know what's happening in real time, but all that data is stored and so you start to get a sense about how people move through cities in general. And once you have data that captures a behavior in general, in data science, you can start to make predictions using that data. So you can train a model and say generally this is how things work and then maybe you could make some predictions about how things work in the future. And assuming that historical data is accurate, then you can usually make pretty decent predictions. And so the cool thing about an application like Waze that captures not only traffic patterns but also events like accidents or car crashes is that you could predict when and where car crashes are likely to occur, which is kind of mind blowing. I think traffic seems like this total, super complex, impossible to understand, messy event.
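As a rough illustration of what "train a model on the past and predict the future" means here, this is a minimal sketch using scikit-learn. The features, data, and labels are all invented; it's not Waze's or any city's actual model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented historical records per road segment: [hour of day, raining?, rush hour?]
X = np.array([
    [8, 1, 1],   # 8am, rain, rush hour
    [14, 0, 0],  # 2pm, clear, off-peak
    [17, 1, 1],
    [11, 0, 0],
    [18, 0, 1],
    [2, 0, 0],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = a crash occurred on that segment

model = LogisticRegression().fit(X, y)

# Estimated crash probability for a rainy 5pm rush hour tomorrow.
print(model.predict_proba([[17, 1, 1]])[0][1])
```

A real system would use far richer features and far more data, but the shape of the workflow, historical observations in, risk scores out, is the same.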
48:18 Michael Kennedy: If you ever thought of chaos, it should totally apply to this, right? Like, if a butterfly can affect weather, like, people just be crazy in cars.
48:27 Jonathan Morgan: They can be crazy in cars, and I don't know if everybody will be familiar with this, but there is this famous experiment where scientists set up a circular track and had cars drive around it, all equidistant from one another. So in theory, they could all maintain their speed. But because there were human beings driving the cars, they would occasionally make these little choices, like they'd feel like they got a little bit too close to the car in front of them, or they'd give it a little bit too much gas and get too close, and so they'd brake, and then as soon as they put their foot on the brake, the car behind them would put their foot on the brake, and it was just this cascading effect, and no matter what, even though there was enough room for all of the cars on the road, they wound up in a traffic jam driving around the circle.
49:05 Michael Kennedy: That's awesome.
49:05 Jonathan Morgan: Yeah. It's like a beautifully designed system that humans cannot participate in fully because we're just too human. So we cause traffic.
49:14 Michael Kennedy: We're flawed.
49:14 Jonathan Morgan: We're flawed. We're flawed. And all those flaws are captured by Waze. So Waze has all this data from connected cars and road cameras and apps and whatever, and they have an overview of how the city works. They shared that data with local authorities, who developed models that predict when and where accidents are going to occur, and so the city's traffic and safety management agencies were able to take that data and say, oh, there's likely to be an accident in this area at this time, and then go to those areas to take preventative measures. They basically identified the highest-risk areas for accidents at certain times of day or in certain conditions, and if you can then take action to make those areas safer, of course you can reduce the number of crashes. And they reduced crashes by 20%. So if on a normal day there's 100 car crashes, now there's only 80, which is pretty amazing. Not only that, but if you're in an area where an accident's likely to occur, all of the other services that respond to that accident are more readily available, so you can get faster treatment for anybody who is injured, and you can more quickly clear the scene and restore normal traffic flow. So in addition to the public health benefits of having fewer car crashes, you can actually make it easier for people to get around the city by dealing quickly with accidents as they occur, because you had a pretty strong indication that the accident was going to happen in advance. So it's like Minority Report, but for car crashes.
50:53 Michael Kennedy: It's like a pre-crash.
50:53 Jonathan Morgan: It is, it is. So they've got the pre-crash. I'll be surprised if that system is not called the pre-crash.
50:59 Michael Kennedy: Sir, if you don't mind pulling over, you were in a crash up there. What?
51:04 Jonathan Morgan: Right, you were the cause of an accident that was about to occur. Like, what? Actually, I kind of like this idea. I mean, we're getting into territory that sounds like the unintended uses of AI we've been talking about this entire episode, but I kind of love it. I would love for the AI to tell me in advance, even if I got a ticket, like, you were going to cause an accident, you didn't, because we told you, and we've changed the future, but you still have a $25 ticket. I'd pay that. I would pay that $25 ticket.
51:36 Michael Kennedy: That's so interesting. That's like the 2025 edition. The 2018 edition is, they were able to cause fewer crashes, address the ones that happened better, and basically just improve life for all the drivers. That's awesome.
51:48 Jonathan Morgan: Yeah, it's totally cool. And I think this is one of those almost classic data science problems now. It's weird to say that, but there are such things as classic data science problems, or maybe an archetype of a data science problem, where you have a bunch of data about the past, and whatever system you're interested in understanding is so predictable and repeatable that all you really need to do is understand how it behaved in the past, and then you can have a pretty good idea about how it's going to behave in the future, if only you can crunch enough numbers. It's kind of like we could have a grand theory of the universe if only we could compute the universe. Sorry, not to get too philosophical, but the systems that we have the computing power to understand are increasingly large and complex. A city's traffic flow is pretty large and complex, and it was probably too much data to really process in any meaningful way until now. Now we not only have the data, but we have the ability to process it and make predictions, and so it's a real, significant, tangible improvement on human life. Pretty awesome.
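That "learn from the past, predict the future" pattern is easy to see in miniature. Here's a minimal sketch on entirely synthetic data; it has nothing to do with Waze's actual models, and every feature and number is made up for illustration: train a classifier on historical road-segment conditions, then score a new set of conditions for crash risk.

```python
# Minimal sketch of the pattern on synthetic data; not Waze's system.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Hypothetical hourly observations for road segments.
hour = rng.integers(0, 24, n)        # hour of day
rain = rng.integers(0, 2, n)         # was it raining?
volume = rng.normal(500, 150, n)     # vehicles per hour
speed = rng.normal(60, 15, n)        # average speed, km/h

# Synthetic ground truth: crashes more likely in rain, heavy traffic,
# and at rush hour.
risk = 0.02 + 0.04 * rain + 0.00006 * volume + 0.03 * np.isin(hour, (8, 17, 18))
crashed = rng.random(n) < risk

X = np.column_stack([hour, rain, volume, speed])
X_train, X_test, y_train, y_test = train_test_split(X, crashed, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)

# Score a hypothetical segment: 5 p.m., raining, heavy, slow traffic.
# (Crashes are rare, so a real system would care about ranking risk
# and recall, not raw accuracy.)
p = model.predict_proba([[17, 1, 800, 35]])[0, 1]
print(f"predicted crash probability: {p:.3f}")
```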
52:45 Michael Kennedy: Yeah, that's really awesome. I think another thing that would make an improvement on life is if more people participated in speaking about what they would like to happen in their country and what they would not like to happen in their country. What do you think?
52:59 Jonathan Morgan: That seems like a good idea. You know, participating in the process. I think that'd be a win all around.
53:04 Michael Kennedy: You know way better than I do. I feel like the average voting turnout is something like 65% in the US? Maybe 70 in a crazy year? But then that's just among registered voters. What about all the people who aren't even registered? Or they've become unregistered because they moved and they forgot to update it.
53:21 Jonathan Morgan: Right. As we've probably heard about in the 2018 election cycle that we just had, this is a big part of the process. You've probably met somebody who was standing in front of your office or your church or your school, or walked around your neighborhood, saying, hey, are you registered to vote? There's these "Get out the vote" campaigns that are really important because of what...
53:42 Michael Kennedy: Twitter asked me at least five times if I was registered.
53:44 Jonathan Morgan: Twitter did. I think I Googled something the day before the election and they were like, hey, here's your local polling location where you can go get registered to vote right now. It's kind of cool. People have recognized that there's low voter turnout, and so there's all these interesting initiatives to try to get people to go out, make sure that they follow the appropriate procedures, and actually cast their vote, because ultimately, you know, democracy dies in darkness. It's important that we're participating in the process. So technologists are of course trying to help, and there's a guy called Jeff Jonas, a prominent data scientist, who is interested in the integrity of voter rolls. So this is an interesting aspect of it. In most places, I think in all places, you can't just show up to the polling place and vote, you have to be registered to vote. And there's lots of reasons you might get unregistered, like you were just talking about. I recently moved from one place to another in Austin, and I actually didn't know that this impacted my ability to vote here where I live. People were coming around my neighborhood and said, hey, are you registered to vote? I said, of course I am, and I mentioned offhand, oh, just moved to the neighborhood, love it so much. And they were like, oh, then you're not registered to vote, because you just changed your address. So there's all these little details that are important and sometimes hard to understand. It's a big, bureaucratic government process at the end of the day. So this guy Jeff Jonas used his software for a multi-state project called the Electronic Registration Information Center. It basically uses machine learning to identify eligible voters and clean up the existing voter rolls. The reason you might need machine learning for this, and not just normal string matching, is that people's names are sometimes slightly different on one form or another, or maybe their address has changed, or maybe multiple people have the same name. There are all sorts of ways it can be difficult to know whether a record in one database represents the same human being as a record in another database, even if the names are identical or similar. So it gets a little bit complicated, and that's why machine learning and AI are useful here: they can recognize these kinds of subtle variations and patterns that ultimately let you triangulate a bunch of different data pointing at the same human being. And this non-profit, the Electronic Registration Information Center, identified 26 million people who are eligible to vote but unregistered, and then 10 million registered voters who have moved.
56:12 Michael Kennedy: Who maybe became ineligible because of that, right?
56:15 Jonathan Morgan: Right, so they somehow became ineligible. They moved, they appear on more than one list, for whatever reason, this does happen. Or, they died. After your death, you remain registered to vote, because, I mean, you just died. It's not the thing you think of on your deathbed. It's like, wait! Take me off the voter roll, the end is near! So this is actually pretty common, but matching a record of death with a name on a voter registration roll is much more difficult than it sounds. Anyway, super interesting project, because I think in every sense we want our voter rolls to be authentic. We want them to have integrity because we want our democracy to have integrity. We should make sure it's one person, one vote. But at the same time, we want to make sure that as many people as can vote, do, because that's how the process works. We want to make sure that the people are represented in their elected leaders. So yeah, really, really interesting project, and I think it's hard to imagine this going wrong.
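To make the matching problem concrete, here's a minimal fuzzy-matching sketch using only Python's standard library. It's purely illustrative: ERIC's real entity resolution is far more sophisticated, and the fields, weights, and threshold below are assumptions chosen for the example.

```python
# Toy record linkage: exact comparison says these records differ, but a
# similarity score says they are probably the same person. Illustrative
# only; real systems use many more fields and learned match weights.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """How alike two strings are, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def same_person(rec_a: dict, rec_b: dict, threshold: float = 0.75) -> bool:
    # Weight the name more heavily than the address, since people move.
    score = (0.6 * similarity(rec_a["name"], rec_b["name"])
             + 0.4 * similarity(rec_a["address"], rec_b["address"]))
    return score >= threshold

voter_roll = {"name": "Jonathan Q. Morgan",
              "address": "500 Congress Ave, Austin TX"}
dmv_record = {"name": "Jon Morgan",
              "address": "500 Congress Avenue, Austin, TX"}

print(voter_roll["name"] == dmv_record["name"])   # False: exact match fails
print(same_person(voter_roll, dmv_record))        # True: fuzzy match succeeds
```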
57:16 Michael Kennedy: Yeah, it's really good. There was some discussion in the article about this project about trying to thread the needle of not looking partisan. We just want people to vote, how do we do that? And that's an interesting challenge.
57:32 Jonathan Morgan: Right, because of course that's become a polarized topic of conversation. There's a lot of concern about whether or not voting rolls are authentic. There's concern on one side of the aisle about voter fraud, and there's concern from the other side of the aisle about voter suppression. So how do we get to some common ground? Because I think ultimately the common ground that everybody shares is that we want an authentic process where as many people who are eligible to vote, can. I think we all believe in democracy at the end of the day, but there are these two opposing points of view on what to prioritize when making sure that our process has integrity. So it's great to thread the needle and find a technology-driven, non-partisan approach to making sure the system functions as well as it can.
58:14 Michael Kennedy: All right, so take this technology, apply it to a problem that is inherently political, and try to make our political system better without raising the hackles and anger of any particular group. It's pretty good.
58:27 Jonathan Morgan: Yeah, it's a challenge. So it's almost like these days it's hard to find anything that AI can do that makes everybody happy. However, I think this next story that you found actually is something that I think universally we can all agree on.
58:42 Michael Kennedy: I think it's pretty awesome.
58:43 Jonathan Morgan: Apolitical and universally beneficial, sort of unequivocally beneficial.
58:47 Michael Kennedy: Yeah, so I think there's two really interesting angles to this. Let's take the super positive one first. So this is about using machine learning and computer vision to fight breast cancer. There have been several projects around this, and they're universally good, I think, right? You have these mammograms with pictures of potentially cancerous regions. But the key with cancer is to catch it early, and catching it early means seeing things that are utterly subtle, right? So they were saying the radiologists, or whatever that group of doctors that looks at these is called, were doing a 38% catch rate on these really super early cancers. And Google basically took an AI off the shelf and applied it, something ridiculous, like, we just put the images in and asked it some questions, and apparently it found the cancerous regions 99% of the time.
59:52 Jonathan Morgan: Yeah, it is, and for exactly the reasons you mentioned: it's capable of seeing much subtler patterns than human beings can, which is exactly what machine vision is good for.
01:00:04 Michael Kennedy: Yeah, so first it started out like, we're just going to show the AI all the pictures and get its opinions, like, forget the doctors. And then they said, well, what if we have doctors, and I think there always will be doctors, but what if we give the doctors a stethoscope, we give them a tongue depressor, we give them a camera, and what if we gave them AI as one of the tools they could point at people, and let them analyze the results coming out of the machine, right? So the second study, I guess I should have said this in the first part, took six pathologists and had them complete a diagnostic test with and without the algorithm's assistance, and they said they found it easier to detect small cancerous regions, and it only took half the time.
01:00:48 Jonathan Morgan: At least for me, this is where AI can be helpful in healthcare, especially in these life-saving moments, when the earlier you detect cancer, for example, the more likely you are to be able to treat it. It's so complicated. I feel like in the medical profession, the amount that doctors are expected to hold in their minds in order to recognize relevant pieces of information and put together a theory of the case, you know, what might this be based on all the pieces of information that I'm seeing, it is really a marvel. Nothing but respect for doctors, who go through the amount of training that they go through, the amount of information that they're able to retain, and the kind of creativity that's required to recognize these different symptoms and put together some likely candidates for what might be ailing the patient. However, that type of pattern matching, assuming that you can capture accurate data, is exactly what machines have become really, really, really good at.
01:01:42 Michael Kennedy: Yes.
01:01:42 Jonathan Morgan: And so I do think this is an area where, again, good data has to be available, and I think when it comes to something that's more binary to determine, like whether or not cancerous tissue is present, we can take an image of a region of the body and reliably provide that to the algorithm that's trying to make the determination. In examples like that, where the data is readily available, I think there's no reason AI can't be a really powerful assistant, so that doctors, with all their knowledge and creativity, can augment that and recognize where they might have blind spots or where there might be technical limitations to their abilities as human beings. Like, our eyes only work so well. And so if we can augment and improve that with technology, it seems like a win-win: doctors can move more quickly, they can see more patients, they can be more accurate, and they can save more lives. It's not as if we are overflowing with medical attention and ability. I think it's a pretty scarce resource still, and so being able to make those folks more efficient is a great use for this type of technology.
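For a rough sense of what pointing a model at medical images looks like in code, here's a minimal Keras sketch: a tiny convolutional network classifying image patches as suspicious or normal. It runs on random stand-in data; the real clinical systems are vastly larger and trained on actual pathology images, so every shape and layer choice here is an illustrative assumption.

```python
# Tiny CNN sketch on random stand-in data; nothing like the real
# clinical-grade models, just the shape of the approach.
import numpy as np
from tensorflow import keras

# Stand-in data: 200 random 64x64 grayscale "patches", random labels.
X = np.random.rand(200, 64, 64, 1).astype("float32")
y = np.random.randint(0, 2, size=200)

model = keras.Sequential([
    keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 1)),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # P(patch is suspicious)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=32, verbose=0)

# In an assistive workflow, high-probability patches would be flagged
# for a pathologist's review, not acted on automatically.
print(model.predict(X[:1], verbose=0)[0, 0])
```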
01:02:47 Michael Kennedy: Yeah, and we talked about the democratization of creativity and whatnot with open source earlier. In a sense this is kind of the same with healthcare. If you live in a small town and they've just got a general practitioner and your only way to get really dealt with is to travel three hours each way, right, maybe if the world is more based on these types of things, well, you just had a scan done, they upload it to the cloud, and magic happens, and then you get an answer right back. It'd be great.
01:03:16 Jonathan Morgan: Yeah, I think so. It seems like that way the sort of sum total, like the aggregate knowledge of the medical community can be available to any practitioner, which is kind of amazing.
01:03:27 Michael Kennedy: Yeah, that's pretty amazing. You spoke about computers detecting these really subtle, nuanced details in images. I would say China is doing a whole lot of interesting stuff like that to sort of assess people. So they've got crazy facial recognition stuff going on over there; they have a system that'll detect your gait and identify you by the way you walk.
01:03:50 Jonathan Morgan: That's, yeah, yeah.
01:03:50 Michael Kennedy: And all of these types of things are generally around the idea of social credit: can we just, like, turn cameras and machines on our population and find the good ones and the bad ones?
01:04:04 Jonathan Morgan: That's true, I mean, it's a really interesting...
01:04:06 Michael Kennedy: It sounds crazy, right, like you're laughing, but it's kind of like you're going to laugh or you're going to cry, pick one.
01:04:12 Jonathan Morgan: Well, it's true. So, A) I think China is a really interesting counterpoint to the debate that we're having here in the US where here our values are very much privacy, individual freedom, and whenever technology encroaches on that privacy, we're very suspicious. So even the fact that advertisers can target me based on my browsing history you know, makes a lot of people uncomfortable. Maybe that's a violation of my privacy even though the advertiser doesn't know who I am as an individual, it's possible to make a pretty good guess about who I might be based on my behavior.
01:04:46 Michael Kennedy: They know you want a Leatherman pocket knife, but they just don't know your name.
01:04:49 Jonathan Morgan: Right, right, and they know that people like me also like other products that people who like Leatherman knives like. We're outside of my comfort zone, though.
01:04:58 Michael Kennedy: I'm just, I'm sorry, I'm just making this up. But you get the idea, right?
01:05:01 Jonathan Morgan: Exactly, but here a lot of the conversation is like, is that okay? Whereas in China they've gone the other direction. They were like, I mean, it's a way that we can find the best people and maybe maintain social order. I mean, I don't know what they're thinking. Just quickly, because not everybody will have seen this, there's a Netflix TV show called Black Mirror. If you are into kind of a dystopian technology future, or a technology-enabled dystopian future, I highly recommend Black Mirror. There is an entire episode dedicated to a social credit system, and everybody in the episode is ranked. Every social interaction you have, you can basically rate the person you had the interaction with. So did you get a smile from the barista at your favorite coffee shop? Five stars. Did you tip? You get five stars back from the barista. Did you take an Uber and the person was friendly? Five stars. And some of these things we are already doing. We rate our Uber drivers and our Uber drivers rate us.
01:05:55 Michael Kennedy: But they're usually kept within the little space of Uber or Starbucks or whatever. It's not, like, your global rating. You don't get a mortgage based on how you treated your Uber driver or not.
01:06:06 Jonathan Morgan: Right, well, I mean, you get your babysitting gig depending on what you post to social media. But again, we're debating whether it's okay to judge potential babysitters based on the rage they may have posted about the ending of the Black Mirror episode that they thought was stupid. China's gone in the other direction. Exactly like you were saying, they have all of these ways in which they are scoring and rating people based on the social behaviors they can observe, either through surveillance or people's behavior online, their social media behavior. And so Shanghai, a city in China, you may be familiar with it, is going to pool data from several departments and reward and punish about 22 million citizens based on their actions and reputations by the end of 2020. So social...
01:06:57 Michael Kennedy: Like a year and a half.
01:06:58 Jonathan Morgan: Yeah, exactly.
01:06:58 Michael Kennedy: Not in the future very far.
01:07:01 Jonathan Morgan: Right, this is not some far-future dystopia. This is today, people. And so pro-social behaviors are rewarded. So if you do volunteer work or if you donate blood, which actually sounds cool, like, I'd like to be rewarded for those things, this doesn't seem that bad. But people who violate traffic laws or charge under-the-table fees are punished. Okay, in a way we're already socially rewarded for doing good things, and we're socially harmed for doing mean things, I guess, or things that aren't cool. So in a way it's like China wants to use technology to optimize the social systems that we already have in place. Like, if I'm rude to you, then I suffer social consequences for that, but it's not really captured anywhere. That damage is pretty localized.
01:07:50 Michael Kennedy: It doesn't go on your permanent record.
01:07:52 Jonathan Morgan: Exactly. This is everybody's permanent record is now captured digitally and then managed by artificial intelligence. That's, welcome to Shanghai.
01:08:03 Michael Kennedy: Yeah, so, yeah. It's pretty interesting. I mean, on the one hand I can see this benefiting society, but it just seems Black Mirror-esque to me as well. And you might be wondering, well, what the heck is a punishment, right? So they say another city, Hangzhou, rolled out a credit system earlier this year, rewarding pro-social behaviors such as volunteer work or blood donations, and punishing those who violate traffic laws or charge under-the-table fees and whatnot. And they said that by the end of May they had blocked more than 11 million flights and 4 million high-speed train trips for the bad guys.
01:08:43 Jonathan Morgan: Right, right. Not bad guys as in, broke a law. Bad guys as in, doesn't volunteer enough, doesn't seem like their heart's in the right place, so we're not going to let you take this train trip, buddy.
01:08:59 Michael Kennedy: Just denied, I'm sorry. You need four stars, you've got three and a half.
01:09:03 Jonathan Morgan: Yeah, totally like I was going to go visit my family. I wasn't going to be able to sit with them on the plane, but at least I was going to take a trip home for the holidays or whatever. And the trip was thwarted, like actual real life consequences for things that we've all come to accept as pretty much being a right if we can afford to pay for them. Kind of bizarre.
01:09:20 Michael Kennedy: Well, it's going to be a very interesting social experiment. I don't want to be part of it.
01:09:26 Jonathan Morgan: I'm glad that I won't be part of it either although I wonder if everybody will just be super friendly and well-behaved. Maybe that's the outcome that we're just not seeing. We'll go to Shanghai and be like, man, people are just, this is like, this is like a Leave It to Beaver episode.
01:09:41 Michael Kennedy: Exactly. Well, how much of that would be disingenuous, you know, oh, bless your heart, that type of stuff.
01:09:48 Jonathan Morgan: Then we just need to optimize the algorithm to have downvotes for disingenuous social pleasantry.
01:09:54 Michael Kennedy: It's a double downvote. You were mean and you were disingenuous about your niceness, boom! So the last one is, I thought it'd be fun to leave people with something practical on this one. Have you seen this new dataset search from Google?
01:10:07 Jonathan Morgan: Yes, I am so into this partially because it's useful and partially because I have worked on a project where we tried to accomplish something similar and it is super freakin' hard.
01:10:20 Michael Kennedy: So props to Google. Of course, if someone's going to get search right, Google.
01:10:24 Jonathan Morgan: Yeah, for sure.
01:10:24 Michael Kennedy: So tell people what this is.
01:10:26 Jonathan Morgan: Well, I mean, it's just like it sounds. It's a dataset search. Sometimes the data is an actual dataset that got captured somewhere as, like, a CSV, or it's data that was extracted from less structured material, like a table on a webpage or something. The way that you can search for it is by topic, and as somebody who has suffered through this very difficult problem, I just want people to understand what it actually means to know what a dataset is "about", in air quotes. A dataset in this case, if you're not a data person, is like a spreadsheet with column names at the top and a bunch of rows, in the simplest case. And so maybe you can look at the column headers and perform some machine learning or something and get a sense of what's in there, if the column headers are really well named. But they never are, trust me. They never are. They always have some random name, like the name of the column in the 1970s database table that this thing ultimately spit out of in the first place, and there's random characters in there that have no place in any dataset ever. There's a bunch of random missing values, and everything is a number, so you can't actually make any guesses about what's in there, because it's just a bunch of random numbers. And so you end up looking at weird things like the structure of the dataset, and you go to some strange places, my friend. But the fact that Google figured out how to catalog this type of information and make it accessible to people, this almost semantic search for datasets, really is a feat of machine learning and engineering.
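A small taste of the problem, as a hypothetical example unrelated to Google's actual pipeline: when the headers tell you nothing, profiling the structure is about all you have.

```python
# A hypothetical export with cryptic 1970s-style column names: profiling
# structure is about the only way to guess what it contains.
import io
import pandas as pd

raw = io.StringIO(
    "CUST_NO,DT_CRT,AMT_X,FLG_7\n"
    "10234,19991231,42.5,1\n"
    "10235,,13.0,0\n"
    "10236,20010401,,1\n"
)
df = pd.read_csv(raw)

for col in df.columns:
    print(f"{col}: dtype={df[col].dtype}, "
          f"missing={df[col].isna().sum()}, distinct={df[col].nunique()}")

# Structural clues: FLG_7 has two distinct values (probably a flag),
# DT_CRT looks like yyyymmdd dates, CUST_NO is probably an identifier.
```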
01:11:53 Michael Kennedy: Yeah, it's awesome. So you just go to the dataset search page, which is toolbox.google.com/datasetsearch at least for the time being, and you just type in a search and it'll give you a list, like, here are all the places we found it. Some of them will be legitimate raw data, and sometimes it's just embedded tables in a webpage and all sorts of stuff. It's really well done.
01:12:13 Jonathan Morgan: Yeah. It kind of lets you look for data in a way that you would look for any other content on the internet. Just the way that you would, well the way that you would search normal Google which is pretty impressive because there's usually not that much context around datasets on the internet. Like that kind of semantic real world context.
01:12:30 Michael Kennedy: Yeah, it's great. So if people out there are looking for datasets, definitely drop in on Google dataset search and throw some searches in there; it gives great answers. I also really like the 538 data, have you seen that?
01:12:41 Jonathan Morgan: Oh, yeah, totally. So the website 538, for folks who aren't familiar with it, is kind of a data journalism-focused website. A lot about sports, a lot about politics, but most of the work is driven by some type of data analysis.
01:12:53 Michael Kennedy: Yeah, and over at github.com/fivethirtyeight/data they have all of the datasets they use to drive their journalism which is like, hundreds of different datasets, so I love to go and grab data there for various things I'm looking into.
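Pulling one of those datasets into pandas takes a couple of lines. The file path below (the airline-safety dataset) is one example that has lived in the repo; browse github.com/fivethirtyeight/data for what's currently there.

```python
# Load a FiveThirtyEight dataset straight from GitHub; the exact path
# is an example and may change as the repo evolves.
import pandas as pd

url = ("https://raw.githubusercontent.com/fivethirtyeight/data/"
       "master/airline-safety/airline-safety.csv")
df = pd.read_csv(url)
print(df.shape)
print(df.head())
```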
01:13:06 Jonathan Morgan: Yeah, absolutely, it's a great resource, especially because then you can go see how they used the data and what questions they asked. It's a great way to learn, especially if you're toying around with data for the first time and just getting comfortable with some of the amazing data exploration and modeling tools in Python, that software language. Take that, R users.
01:13:27 Michael Kennedy: That's right!
01:13:27 Jonathan Morgan: Stick it!
01:13:32 Michael Kennedy: So if you're looking for datasets, where do you go?
01:13:33 Jonathan Morgan: Now I go to dataset search. But there's also a lot of great, it depends on what you're looking for, but there's a company based in Austin called Data.World that is trying to do kind of a similar thing to Google's dataset search, but it's more curated, kind of community-based. People are sharing interesting datasets that they found and contextualizing them, so a lot of Data for Democracy volunteers, when they're working on projects, will go to Data.World to make the data available. Then there are also a lot of open data portals, both at the national level, so there's still data.gov, I believe that's still up. There were some rumors that it might all come down eventually, but it's still around, I think. And a lot of cities that have had open data projects or initiatives have collected all of those into a portal that's specific to where you live. So if you want to find out how many dogs and cats are in your local animal shelter from one month to the next, or whatever the analysis is that you're curious about that's super relevant to your city, you can probably find a portal for the city closest to where you live. I know there's one here in Austin, and New York, Chicago, LA, San Francisco; a lot of cities have this now, and it's a great way to find data that answers interesting questions and just toy around a little bit and learn more about where you live.
01:14:44 Michael Kennedy: Yeah, that's cool. And Data.World is cool as well. I haven't seen that. All right, Jonathan, well, that's the ten items for the year in review, and I definitely think there's an interesting trend, and it's been fun to talk to you about it.
01:14:56 Jonathan Morgan: Yeah, of course, thanks so much for having me on the show. And I hope everybody had a wonderful 2018 and is looking forward to an exciting year ahead.
01:15:02 Michael Kennedy: Absolutely. So I have a few quick questions before you get to go, though. First, when are you coming back to podcasting? Or are you just going to make these guest appearances? Do you have any plans to come back, or are you just going to keep working on your projects?
01:15:14 Jonathan Morgan: You know, I'm thinking maybe towards the end of 2019 I'll try and find a year in review podcast where I can sit in.
01:15:18 Michael Kennedy: Well, you can definitely come back in 2019, that'd be good.
01:15:22 Jonathan Morgan: That was pre-positioning for next year. I love doing it, it's really fun, but I have no timetable for actually returning, though, because even the thought of committing to the amount of work it takes to put on an episode...
01:15:34 Michael Kennedy: Yeah, it's a crazy amount of work per episode, that's for sure. Well, I'm really glad you came out of retirement to do this one.
01:15:40 Jonathan Morgan: Oh, thanks, man. I really appreciate it. I'm super happy to come on the show and do it. I mean, especially because I won't be responsible for any of the work once this is recorded. It's a dream.
01:15:49 Michael Kennedy: Exactly. All right, last two questions which I know you answered a couple of years ago, but could have changed, so if you write some Python code, data science or otherwise, what editor do you use?
01:16:01 Jonathan Morgan: Well, I use Jupyter notebooks pretty heavily, and actually, more recently than not, I write much less software. Almost everything I write now is some type of exploratory data analysis, so it's almost entirely in Python. But when I do code, I still use Sublime Text a lot, which I feel like is kind of old school. I hear that the world has moved on; there's, like, Python-specific quasi-IDEs, and I'm out of the game, what can I tell ya?
01:16:29 Michael Kennedy: It's all right, that's all good. It's a great one. And then a notable package on PyPI.
01:16:35 Jonathan Morgan: Aw, man, well, I mean, given what we've been talking about, I would strongly encourage everybody to check out a package called Keras, K-E-R-A-S, or TensorFlow; both are very popular machine learning libraries, or kind of machine learning frameworks. Keras is more high level. It's a very accessible way to get into exploring neural networks, basically. There's lots of popular machine learning techniques that are more traditional and super effective, there's nothing wrong with them, but the world has kind of moved into deep learning, and AI is largely based on neural networks now. Keras is a great way to explore those without getting too much into the weeds, and then once you get into the weeds and find them kind of interesting, and you want to get down to a lower level and really play around with some of the network structures yourself, you can get into TensorFlow, which is one of the underlying libraries that Keras is built on top of. So I highly recommend both of those.
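As a taste of how accessible Keras is, here's a complete neural network for the classic XOR toy problem, a minimal sketch with TensorFlow doing the work underneath.

```python
# The "hello world" of neural nets: learn XOR with a tiny Keras model.
import numpy as np
from tensorflow import keras

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype="float32")
y = np.array([0, 1, 1, 0], dtype="float32")  # XOR truth table

model = keras.Sequential([
    keras.layers.Dense(8, activation="tanh", input_shape=(2,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=2000, verbose=0)

print(model.predict(X, verbose=0).round(3))  # should approach 0, 1, 1, 0
```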
01:17:31 Michael Kennedy: Right on, yeah, I've definitely heard nothing but good things about them. All right, final call to action, maybe especially around this whole data-for-democracy pledge stuff. People heard all your stories, they're interested, what can they do?
01:17:43 Jonathan Morgan: If you're like, I don't want to live in a Black Mirror episode, know that you have the power to change it, Python programmer and podcast listener. The power is in your hands. First step, go to datafordemocracy.org, where you can sign the ethics pledge and let the world know that you believe in ethical technology, and together, we'll make our voices heard. We'll make sure that practitioners have a voice in this whole thing. So, datafordemocracy.org. We really would love to have you sign the pledge, but then also participate in the conversation, because there's a whole community there that's really hashing out these ethical principles, making sure they work for real-world technologists who have real-world jobs. You can contribute to that process; it's open source, it's happening on GitHub. We'd love to have you participate.
01:18:25 Michael Kennedy: All right, it's a great project. Hopefully people go and check it out. Jonathan, thanks for being on the show. It's been great to have you back, if just for now.
01:18:33 Jonathan Morgan: Thanks, Michael, I really appreciate the opportunity, it was super fun.
01:18:35 Michael Kennedy: You bet.
01:18:35 Jonathan Morgan: All right, bye, bud.
01:18:37 Michael Kennedy: This has been another episode of Talk Python to Me. Our guest on this episode was Jonathan Morgan, and it's been brought to you by us over at Talk Python Training. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course, or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle. It's like a subscription that never expires. Be sure to subscribe to the show: open your favorite podcatcher and search for Python, we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening, I really appreciate it. Now get out there and write some Python code.