Transcript for Episode #56:
Data Science from Scratch
You likely know that Python is one of the fastest growing languages for data science.
This is a discipline that combines the scientific inquiry of hypotheses and tests, the mathematical intuition of probability and statistics, the AI foundations of machine learning, a fluency in big data processing, and the Python language itself. That is a very broad set of skills we need to be good data scientists and yet each one is deep and often hard to understand.
That's why I'm excited to speak with Joel Grus, a data scientist from Seattle. He wrote a book to help us all understand what's actually happening when we employ libraries such as scikit-learn or numpy. It's called Data Science from Scratch and that's the topic of this week's episode.
This is Talk Python To Me, episode number 56, recorded April 19th, 2016.
Welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I am @mkennedy, keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython.
This episode is brought to you by Snap CI and Hired. Thank them for supporting the show on Twitter via @snap_ci and @hired_hq.
Hey everyone, it's great to be back with you. I have something to share before we get to the main part of the show. I had a chance to be on the Partially Derivative podcast this week where we did a short segment on programming tips for data scientists. Relevant to this episode, wouldn't you say? I talked about using generator methods and the yield keyword to dramatically improve performance while processing lots of data in the pipeline. If you want to hear more about that kind of stuff, check out the link to the Partially Derivative episode in the show notes.
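As a rough illustration of the generator idea mentioned above, here is a minimal sketch of a two-stage pipeline built with yield. The function names and the sample data are hypothetical and are not taken from the Partially Derivative segment; the point is only that each stage handles one record at a time, so the whole data set never has to sit in memory at once.

```python
def read_records(lines):
    """Lazily parse 'name,value' lines, yielding one record at a time."""
    for line in lines:
        name, value = line.strip().split(",")
        yield name, float(value)


def only_large(records, threshold):
    """Pass through only the records whose value exceeds the threshold."""
    for name, value in records:
        if value > threshold:
            yield name, value


# Hypothetical in-memory data; an open file object could be used the same way,
# since iterating over a file also produces one line at a time.
raw_lines = ["a,1.5", "b,20.0", "c,7.25", "d,42.0"]

# Nothing is parsed or filtered until sum() starts pulling items through.
pipeline = only_large(read_records(raw_lines), threshold=5.0)
total = sum(value for _, value in pipeline)
print(total)  # 69.25
```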
Now, let's talk to Joel.
2:01 Michael: Joel, welcome to the show.
2:03 Joel: Thanks, I am glad to be here.
2:05 Michael: Yeah, I am really looking forward to doing data science from scratch today.
2:10 Joel: That's why I was brought in; I am the right person to be here for that topic.
2:13 Michael: Fantastic. So, we are going to dig into data science and I think your idea of taking it from the fundamentals and building something that's not so complicated or so well polished and optimized that you can really understand what you are doing is great, but before we get into that let's just sort of start from the beginning, how did you get into programming in Python?
2:32 Joel: So, if you go back about ten years or so, I was in grad school for economics. I took a class called probability modeling, which was a lot of simulating Markov chains and doing simulations of things like that. The class was actually taught in Matlab, and the school had a site license for Matlab that only worked on campus, so that meant that if I went home and worked at home, which I liked to do, I couldn't use Matlab. So as an alternative, I found that I could use Python and NumPy to basically do all the Matlab things without the site license. That's actually why I started using Python; then I just kind of liked it and stuck with it.
3:11 Michael: Yeah, I can definitely see how you would like working in Python more than Matlab, I've spent my fair share of time in the .m files and it's all right, but it's not Python. You were studying economics, and that's kind of how you got interested in data science, trying to answer these big questions sort of the subject of economics or some other way?
3:30 Joel: I followed sort of a convoluted path. My background is math and economics. I actually started off doing quantitative finance: options pricing, the mathematics of financial risk. I worked at a hedge fund, and then when the hedge fund went out of business, I just kind of walked into this job at a startup called Farecast, where I was doing a lot of BI work, writing SQL queries, building dashboards. Over the years I just sort of moved more and more in the data science direction and eventually became a data scientist and software developer. Certainly the math and economics training was helpful for that, but it was a career that I sort of ended up in by accident, rather than through a deliberate plan.
4:12 Michael: That's a little bit like my career path as well, studied math and just sort of had to learn the programming and stumbled into it, that's cool. What do you do day to day now?
4:22 Joel: So I just started a new job. I work at a non-profit called the Allen Institute for Artificial Intelligence. It was founded by Paul Allen, who was one of the Microsoft founders, and we are basically doing fundamental AI research. Actually, I can't tell you much about what I do because it's really my second week and I am still just learning what goes on and learning my way around, but I work on a team called Aristo, which is basically building AI to take science quizzes and to understand science.
4:53 Michael: Very interesting. That sounds like a fascinating place to work.
4:57 Joel: It's really neat. Smart people, interesting problems.
5:00 Michael: Would you say it's different than working in a hedge fund or is it surprisingly similar?
5:04 Joel: No, it's totally different. It's similar in that there are a lot of smart people, but it's not really similar in any other way.
5:09 Michael: Yeah, the goals are not necessarily the same, are they?
5:12 Joel: The goals, the incentives, the sort of day to day stress levels, it's all different.
5:16 Michael: Yeah. Sounds great. So, let's talk about your book a little bit. It's called Data Science from Scratch. I think that's a really cool way to approach it. We have all these super polished data science tools, and we have them in Python, we have them in R, in other languages as well. Why from scratch, rather than just grabbing one of these libraries or a set of these libraries and talking about how to use them?
5:39 Joel: There are a couple of reasons. One is, as I mentioned, I come from a math background, and the math way of doing things is you can't really use something until you can prove it. I once had a teacher who came in and looked at the syllabus for the previous semester and said, oh good, you proved this theorem, that means that I can use it in this class. So there is this real rigor around not using things unless you understand them, and that approach always resonated with me, so that's the approach I took in the book. At the same time, you also have all these really powerful, really easy to use libraries like scikit-learn, where you can go in and basically copy and paste five lines of code and you've built a decision tree or a regression model, and it's really easy to not know what you are doing. You can pick up books and they will tell you what commands to type, but you can type those commands and again not know what you are doing. So I thought the book that I would have found valuable would be: here is what these models are actually doing behind the scenes. Instead of just applying them, you understand the principles; you understand where they go wrong, where they go right, what they are good for, and what they are actually doing.
6:57 Michael: Yeah, I think that's really interesting, and that is very much the mathematical way: if you can't prove it, you can't use it. It's a way of thinking which is not that common in programming. You know, if there is an API, hey, just grab it and use it. But I think it's really important in data science because you have all these different disciplines and backgrounds coming together to make it work, right? You are not just a pure programmer, and you are not just a mathematician or some sort of domain expert, right, you kind of have to blend these together.
7:30 Joel: Yes, and the other thing that I think ended up working pretty nicely about this approach was that there are a lot of really mathy books about data science and machine learning that are like: here is an equation, here is an equation, here is an equation. What I ended up doing was a little bit of math, but then when it came time to show how something works, I did it in working Python code, so that it's rigorous in terms of here are all the steps laid out, but it's also code you can run and sort of follow along with.
7:57 Michael: Yeah. Absolutely. Why Python and not something like R or some other language?
8:02 Joel: So the answer is that for whatever reason R is not sympathetic with the way my brain works, and whenever I try to sit down and start writing things in R, I just find that it doesn't work the way I expect it to. The joke is that R is the language designed by statisticians, and there is some truth to that, but for whatever reason Python is much friendlier to the way my brain works and the way that I solve problems. Actually, the first draft of my book was a lot harsher against R, and then some of the early reviewers didn't like that, they really didn't like that, so I revised it to be more gentle.
8:39 Michael: Yeah, I think one of the things that's nice about Python is that if you invest in data science through Python, or the Python data science tools rather, and then you want to build a more complete working application, you don't have to convert that to some other language, right? You are already in Python; it's sort of a full stack thing you keep working with, right?
8:59 Joel: That's one important aspect. I think another important aspect is that, from my perspective, Python code, if written well, is super readable, so you don't necessarily have to be a Python expert to take a well written piece of Python code and understand what it's doing, whereas other languages can be a lot more impenetrable. So I find it a nice teaching language from that perspective as well.
9:21 Michael: Yeah. I would say most of the universities in the US came to that same conclusion, right, with Python being the most common 101 course these days, as opposed to like Java or Scheme, which I took.
9:36 Joel: Yeah, I also did Scheme back in the day, and I actually thought that was a really nice language for learning computer science, but I can see why Python would be a better language for learning programming as opposed to computer science.
9:51 Michael: So when you started off your book, you actually have a few sections that are just about I am going to teach you just enough Python to get started, so that you can understand the data science tools and what we are doing, what kind of stuff did you put in there like what do you consider fundamental for doing data science and what can be sort of learned later?
10:10 Joel: The most important things I put in were obviously functions, if you want to do any kind of analytic work in Python. The other thing I really tried to emphasize was writing clean, Pythonic code, so I also did a lot of list comprehensions, as well as understanding all the basic data structures, so lots of dicts and sets, and some of the less common ones like defaultdicts and Counters, which I also make quite heavy use of. So basically, anything you would need to understand here is how to write out an algorithm in Python, here is how to use the correct data structures in Python, I baked into that initial introduction. Whereas more advanced things like really specialized data structures like use I just kind of introduced as needed later in the book, and some of the more esoteric math functions I just introduced when they came up.
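To make those building blocks concrete, here is a small sketch of a Counter, a defaultdict and a list comprehension working together. The word list is made up for illustration and is not an example from the book.

```python
from collections import Counter, defaultdict

# Hypothetical sample data.
words = ["data", "science", "from", "scratch", "data", "python", "data"]

# Counter: tally how often each word appears.
word_counts = Counter(words)
most_common_word, count = word_counts.most_common(1)[0]

# defaultdict: group words by length without checking whether a key exists yet.
words_by_length = defaultdict(list)
for word in words:
    words_by_length[len(word)].append(word)

# List comprehension: build a filtered, transformed list in one readable line.
# Iterating over a Counter visits each distinct word once, in first-seen order.
long_words = [word.upper() for word in word_counts if len(word) > 4]

print(most_common_word, count)
print(dict(words_by_length))
print(long_words)
```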
11:05 Michael: I think obviously, nicer dictionaries, things like defaultdict and so on, play a really important role in the subsequent pieces where you are trying to make concise, clean statements. I thought your use of list comprehensions and generator expressions along with other functional pieces like zip and Counter and so on was really interesting. You write some very concise code without being unreadable. I thought that was nice.
11:36 Joel: That's one of the things I aimed for and as much time as you spend sort of revising the text of the book, I also spent probably just as much time revising the code of the book, I went back over it many times thinking how can I make this simpler, can I make it cleaner, can I make it easier to understand. And, I try and do that, I am kind of a stickler for code elegance.
11:59 Michael: But also, you came out with very polished code; I found it very easy to follow. And I thought the focus on this functional style of programming was really neat. So, one of the things you talk about in the beginning is visualizing data, because you have all these numbers, and obviously to understand them, putting a picture to it is very nice. And what is that, Matplotlib?
12:19 Joel: Yes, in the book I only use Matplotlib, I was just having a conversation with some of my data science buddies about a week ago and they were as always complaining that the data visualization story in Python is still not very good.
12:34 Michael: Yeah, so you said in your book that Matplotlib was starting to show its age. What did you mean by that?
12:39 Joel: When I first started learning Python, more than years ago, as like a Matlab replacement, Matplotlib was around back then, and it had the same features and the same interface and it produced the same not very attractive 12:54 so it hasn't kind of evolved as the rest of the language has evolved. There have been some projects like seaborn that try to put some prettiness on top of it, and there have been some other attempts to kind of bring the R style 13:09 into Python, but from my perspective, none of them has really become the solution, so I wanted kind of something different.
13:17 Michael: Yeah, you also talked about Bokeh, which is from the Continuum guys, who I had on show 34. Could you maybe talk about that really quickly? Just what is it, what's it used for, that sort of thing?
13:28 Joel: It's another visualization library, and I'll be honest with you, I haven't really checked it out since I was writing the book more than a year ago. I seem to remember that it had some facility for building interactive 13:40 and things like that.
13:43 Michael: It's kind of d3 fancy graphics in the browser type stuff right?
13:48 Joel: I believe so, but I didn't spend much time playing with it. I've probably been doing more d3 than Python visualization recently.
13:56 Michael: Right, right, tell everyone what d3 is, I don't think we've talked too much about that on this show.
14:34 Michael: Yeah, and whenever I go to the d3 website, I'm like, wow, I want to use this. I don't have a use for it, but it's fantastic. Like, how can I just, you know, build stuff that looks like this?
14:43 Joel: Yeah, so I have a joke, which is that good data scientists copy from the d3 gallery and great data scientists steal from the d3 gallery. And they give you the code for all of them, so I would say probably close to 100% of the d3 visualizations I have ever built have been something I found in the d3 gallery and tweaked until it fit my data.
15:03 Michael: Yeah, yeah, nice. So, the next section you talked about was like the fundamentals of the math and science that you need to know in order to be a data scientist, and I like the way that you've put it, you said these are like the cartoon versions of big mathematical ideas. How much math do you need to know to call yourself a data scientist, like if I study just programming I am not a data scientist so if I study just math I am not a data scientist, like- what's the story there?
15:28 Joel: Data science is a funny thing in that there are many different jobs that are called data scientist. There are people who call themselves data scientists and they write SQL queries all day. There are people who call themselves data scientists and they do cutting edge machine learning research all day. You could know almost no math and still get a job where you call yourself a data scientist, and you could know very little programming and get a job where you call yourself a data scientist. It's a field that's still kind of figuring itself out, and there is just such a breadth of different roles that call themselves data scientist. In the ideal world, you would know the algebra, you would know probability, you would know statistics, but among people who are data scientists, some people know the stuff really well and some people know the stuff not so well.
16:17 Michael: Yeah, I can imagine. It's really about: you've got a lot of data, or some specific type of data, and you want to answer questions about it, or even discover the questions you could ask that nobody's asked, right?
Snap CI is a continuous delivery tool from ThoughtWorks that lets you reliably test and deploy your code through multi-stage pipelines in the cloud, without the hassle of managing hardware. Automate and visualize your deployments with ease, and make pushing to production an effortless item on your to-do list.
Snap also supports Docker and in-browser debugging, and they integrate with AWS and Heroku.
Thank Snap CI for sponsoring this episode by trying them with no obligation for 30 days by going to snap.ci/talkpython
17:22 Joel: Sometimes it's actually the opposite, which is: I have a question I want to ask, where can I find some data that will allow me to answer that question? So, you know, sometimes you start with the data and go to the questions, and sometimes you start with the questions and then you've got to find the data.
17:35 Michael: Right, ok, very interesting. So, that actually brings us to the next section that you talked about in your book, which was getting data. What are some of the common ways that you talk about there?
17:46 Joel: Nowadays there are a lot of people who just post interesting data sets. If you go to Kaggle, which is a site that does data science competitions, every one of their competitions has a data set which might be interesting in ways that are not related to the competition. The government websites publish all sorts of data, some of which is potentially interesting: if you like weather, they publish weather data; if you like economics, they publish economic data. There are a lot of Python libraries for scraping websites, so even if a data set is not available, you can always go out and try to scrape it, collect the data yourself and clean it. My go-to source is always Twitter; I am always building things using Twitter data. Twitter and a lot of other sites have APIs where you can just make RESTful calls or even use libraries. Python has some Twitter libraries that abstract all that away, and you can collect tweets on a given topic, or collect tweets from certain users, or collect your own tweets, and do analyses on those.
18:41 Michael: Yeah, you had a section about that in the book, what was the package you were using for that?
18:46 Joel: In Python the package I usually use is called twython, there is a bunch of them, that's the one I got to work the easiest.
18:58 Michael: Ok, cool, yeah I think twitter is fairly special among the social networks, to me twitter is the social network of ideas, more than it is of friends or family or whatever. So yeah, very cool. I love to get data from twitter as well.
19:11 Joel: Yeah, it's just, they won't give you the firehose, but for most things they'll give you more tweets than you can handle anyway, so if you want to find out what the people are saying on a certain topic or just what people are documenting in general, it's an awesome source for that.
19:25 Michael: Yeah, I saw some study, or heard about some study, a few years ago where people were trying to do sentiment and mood analysis on Twitter, and then trying to tie how people were feeling on Twitter to near-term changes, like in the next hour, in the stock market, which I thought was a pretty interesting project.
19:49 Joel: The ideas you can come up with to use that data are pretty much endless. Every month I think I am done with it, and then I think of something else to do.
19:55 Michael: Nice. So, then you get into the topics that I think of as traditional data science: machine learning, network analysis, that kind of stuff. I thought maybe we could go through each section and you could tell me what it is that you need to fundamentally understand, what the from-scratch basic pieces of each part of this data science are, and then maybe some examples of problems you might solve with it?
20:23 Joel: Sure.
20:23 Michael: Yeah, so the first one you talk about is machine learning, and when you said machine learning, and later neural networks, from scratch, I was like, wow, that's a pretty big thing to take on from scratch. What's the story there?
20:35 Joel: Machine learning at a high level is just learning some kind of model from data, rather than sitting down and writing up the model yourself by hand. And so, if you have a small amount of data, then coming up with an algorithm that's going to learn a model from it is actually pretty reasonable thing to do if you go with the simple model; it's only when you want to you know, start producing recommendations at netflix or when you want to start building something to recognize speech patterns from audio files that your beautiful hand crafted Python is probably not going to be up to the task.
21:11 Michael: No, I am sure it's not. Yeah, so I suppose the types of things you solve with machine learning are fairly unbounded; there are a lot of problems that machine learning answers. What are some of your favorite examples?
21:23 Joel: Everything out there is machine learning these days, but if you want to talk about projects that I worked on for fun, one time I built a classifier to identify Hacker News articles that I would or would not be interested in. It would take the feed of new Hacker News stories and come up with a score between 0 and 1 for how interested it thought I would be, based on some initial seed values.
21:49 Michael: Nice, and how did you teach it, what did it consider?
21:53 Joel: A couple of things. One is the words, and I think the bigrams, in the titles. It also looked at the site that the story linked to, so I think the New York Times got a positive signal and some sites got a negative signal. It also looked for a few other idiosyncratic features, like: is it an Ask Hacker News, is it a Show Hacker News, does it have a dollar amount in there, because there are a lot of bad stories that are like "here is how to make $5000", so that was a negative signal.
22:23 Michael: That's sort of a spammy signal.
22:25 Joel: Yeah. And then I turned it into a blog post, and someone posted the story to Hacker News, and then the Hacker News community got very angry that someone would think that not every story there was worth reading, because of course every story there was worth reading, so why would I want to filter some of them out? They accused me of wanting to live in a bubble and-
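The kinds of features described here can be sketched as a small extraction function. This is a hypothetical illustration, not Joel's actual code: the function name, the feature names and the example story are all assumptions, and a real classifier would still need to learn a weight for each feature from labeled stories.

```python
import re
from urllib.parse import urlparse


def story_features(title, url):
    """Turn a story's title and link into simple features for a classifier."""
    words = re.findall(r"[a-z0-9']+", title.lower())
    # Bigrams are pairs of adjacent words, e.g. "machine learning".
    bigrams = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return {
        "words": words,
        "bigrams": bigrams,
        "domain": urlparse(url).netloc,  # the site the story links to
        "is_ask_hn": title.lower().startswith("ask hn"),
        "has_dollar_amount": re.search(r"\$\d", title) is not None,
    }


# Hypothetical story, made up for illustration.
features = story_features("How to make $5000 a month", "https://example.com/post")
print(features)
```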
22:42 Michael: Yeah, they can be fairly critical on there but that's funny. So the big question is did your system like your article, your blog post?
22:52 Joel: That's a good question, I didn't actually check, I should go back and try and find that.
22:56 Michael: It would be funny if it recommended it to you or not. Either way, it would be funny to know.
23:00 Joel: Well, you know, it's funny, speaking of that: I have an Android phone, and Google Now will sometimes recommend me articles and things I'd be interested in, and among the things it recommends to me are my own blog posts. Another thing I did with machine learning: I have a five year old daughter and I do her clothes shopping, and I noticed that little boys' clothes are very interesting and little girls' clothes tend to be really boring. The boys' clothes have dinosaurs and rockets and robots, and the girls' clothes have hearts and flowers. So I built a machine learning model to take an image of a children's T-shirt and predict whether it was a boy shirt or a girl shirt.
23:38 Michael: How awesome. Yeah and I agree with the classification there as well, I have three daughters and we still buy a fair number of boy clothes for them.
23:47 Joel: Yeah, I do the same.
23:49 Michael: Nice. So the next topic that you covered was nearest neighbours. Conceptually, I know what nearest neighbours are pretty easily, but it's a computationally hard problem, especially in higher dimensions. So what kind of stuff do you do, what kind of problems do you solve with this?
24:03 Joel: One thing you can do is when you don't have like a great 24:07 it's going on. For instance, one place where I have seen this applied is when I have some kind of time series type signal and I want to know if it represents some kind of bad anomaly. It might be thousands of points in some weird shape, and I don't have a good way to classify it. One thing you can do, if you have a bunch of labeled signals, some good and some bad, is just say: I am going to take my set of reference signals and figure out which are the ones it's closest to. That way I don't need to have a model of what's a good signal and what's a bad signal; I can just say I have some labeled data, and that's enough. So I think that's the kind of situation where it can be useful.
24:46 Michael: Yeah, nice. When you are still trying to explore things and maybe do not have a model you are trying to fit it to, you are just trying to understand what's there, right?
24:54 Joel: Yeah, or think about where you have some weird kind of shapes, weird kind of patterns that you can't really put math behind, what's a good pattern and what's a bad pattern, but you have some labels. Then you could use nearest neighbours to basically classify without having to have a mathematical model behind it.
25:11 Michael: Ok, very cool. So, then you get to another topic called Bayesian analysis, which in the context I know is around determining spam and filtering and stuff like that. What's the story with Bayesian analysis?
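The idea of classifying a signal by its closest labeled references can be sketched in a few lines. This is a minimal illustration under stated assumptions: the signals are short made-up lists of numbers, distance is plain Euclidean distance, and the function names are hypothetical rather than taken from the book.

```python
import math
from collections import Counter


def distance(a, b):
    """Euclidean distance between two equal-length signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(k, labeled_signals, new_signal):
    """Label a new signal by majority vote among its k closest references."""
    by_distance = sorted(
        labeled_signals, key=lambda pair: distance(pair[0], new_signal)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical reference signals: "good" ones are flat, "bad" ones spike.
reference = [
    ([1.0, 1.1, 0.9, 1.0], "good"),
    ([0.9, 1.0, 1.0, 1.1], "good"),
    ([1.1, 0.9, 1.0, 1.0], "good"),
    ([1.0, 5.0, 0.2, 4.8], "bad"),
    ([0.8, 4.7, 0.1, 5.1], "bad"),
]

spiky_label = knn_classify(3, reference, [1.0, 4.9, 0.3, 5.0])
flat_label = knn_classify(3, reference, [1.0, 1.0, 1.0, 1.0])
print(spiky_label, flat_label)  # bad good
```

No model of what makes a signal good or bad is ever written down; the labeled examples do all the work, which is the appeal Joel describes.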
25:26 Joel: So it's based on Bayes' rule, which is a theorem in statistics having to do with ways of reversing conditional probabilities. At a high level, if you have some knowledge about the probability of seeing certain features in an email given that the email is spam, you can use that data to produce an estimate of the probability that an email is spam given that you see those features. Typically, if you have a lot of spam and non-spam, you can make estimates like: given that an email is spam, it's likely that I'll see "viagra". And then naive Bayes is just the classifier that turns that around and allows you to look for these features and, in a mathematical way, come up with the probability that this is spam.
26:13 Michael: Ok, yeah. Very interesting. What other types of problems, spam is certainly easy to understand, but you know hopefully that's solved by gmail and other people for us, right?
26:24 Joel: I think it was Paul Graham who did a pretty influential article about this approach about 15 years ago now, and so I think a lot of the tools, I don't know if it's what Gmail uses, but a lot of these 26:34 and some of the other mail providers use, are still based on this principle of naive Bayes. Basically, anywhere you have big clumps of, say, text and you want to classify it into two or more classes, and you are willing to make this really big technical assumption that the words are independent of each other, so you treat it like a bag of words rather than as a sequence of words, then that's where you would use this kind of model.
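Here is a minimal sketch of a bag-of-words naive Bayes spam classifier along the lines described above. It is an illustration under stated assumptions, not the book's or any mail provider's implementation: the training messages are made up, the function names are hypothetical, and the smoothing constant k is an added assumption that keeps a word never seen in one class from forcing a probability of zero.

```python
import math
from collections import Counter


def train(messages):
    """messages is a list of (text, is_spam) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_spam in messages:
        class_counts[is_spam] += 1
        # Bag of words: only which words appear matters, not their order.
        word_counts[is_spam].update(set(text.lower().split()))
    return word_counts, class_counts


def spam_probability(text, word_counts, class_counts, k=0.5):
    """Estimate P(spam | words) using Bayes' rule and word independence."""
    words = set(text.lower().split())
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    total_messages = sum(class_counts.values())
    log_prob = {}
    for is_spam in (True, False):
        n = class_counts[is_spam]
        log_p = math.log(n / total_messages)  # prior P(class)
        for word in vocabulary:
            # Smoothed estimate of P(word appears | class).
            p_word = (word_counts[is_spam][word] + k) / (n + 2 * k)
            log_p += math.log(p_word if word in words else 1 - p_word)
        log_prob[is_spam] = log_p
    p_spam = math.exp(log_prob[True])
    p_ham = math.exp(log_prob[False])
    return p_spam / (p_spam + p_ham)


# Hypothetical training data.
messages = [
    ("cheap viagra now", True),
    ("viagra discount offer", True),
    ("meeting notes attached", False),
    ("lunch meeting tomorrow", False),
]
word_counts, class_counts = train(messages)

p_spammy = spam_probability("viagra offer", word_counts, class_counts)
p_normal = spam_probability("meeting tomorrow", word_counts, class_counts)
print(round(p_spammy, 3), round(p_normal, 3))
```

Working in log probabilities avoids multiplying many small numbers together, which matters once the vocabulary is larger than this toy one.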
27:06 Michael: Ok, yeah cool. So maybe the opposite of that where you treat words sort of as having more meaning is in natural language processing which is another thing you talk about, right?
27:16 Joel: Natural language processing is a huge field; there are textbooks about it, and in fact there are Python books about it, so cramming it into a chapter of a book is really just a high-level overview. Natural language processing is actually relevant to the sorts of problems that get into artificial intelligence. It gets used a lot by all the voice recognition in your phone, and by a lot of these new systems that are basically reading text and having a computer extract information from it, so it's a pretty important, pretty rich field.
27:55 Michael: Yeah, there are some pretty decent Python libraries for doing that, right?
27:59 Joel: Yeah, so the big one is called NLTK, Natural Language Toolkit I think it stands for. It has its own book just about natural language processing with that library, and I think it's actually free on the web. That would be a great place to start if you want to understand more deeply. There is also a nice Coursera course on natural language processing; I took it several years back and I liked it.
28:23 Michael: Ok, yeah, excellent. So let's see, another topic you covered was decision trees. What's the story with these?
28:30 Joel: A decision tree is kind of what it sounds like. Intuitively, you might have a model making decisions based on a tree. Say I want to know whether to buy a certain car or not, so I might start by asking a question: ok, is it an American car or a Japanese car? American car, ok, go to the next question: does it have four doors or two doors? It has two doors. Ok, is it gas or diesel? Diesel. So you just have a sequence of questions, and you classify it based on these questions: ok, it's diesel, that means yes, I want to buy it. Conceptually, a tree like that is not that complicated. The interesting part is, given some set of data, how do I build such a tree? There is a variety of algorithms, but the one that we talk about in the book is a pretty simple one, which is basically: I have a bunch of data, it has a bunch of attributes, and I can split my data on each attribute. At this stage I could look at two doors versus four doors, I could look at Japanese versus American, I could look at gas versus diesel. Which of those choices is going to allow me to really separate the good buys from the bad buys to make the sale? If gas versus diesel totally splits, where diesel is the buys and gas is the don't buys, that's a great thing to choose. If gas versus diesel splits where, you know, on gas I want to buy half and not buy half, that's not a good split to choose because it doesn't really help me at all. So the mathematics here are just around making this precise using entropy, and building these trees from the data.
30:09 Michael: Ok, yeah. That's very cool. The way where you build them knowing the data intimately, you know, that makes a lot of sense, and you don't really necessarily need data science to do that, right? That's just, you know, asking a few questions and making a decision. But the reverse I think is pretty interesting.
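The split-choosing step can be sketched with entropy, which is low when a group is almost all one label and high when it is an even mix. This is a minimal illustration with made-up car data and hypothetical function names; it picks only the first split, whereas building a full tree would repeat the same choice recursively on each resulting group.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Uncertainty in a list of labels: 0 if all the same, 1 for a 50/50 mix."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def partition_entropy(rows, attribute):
    """Weighted average entropy of the groups produced by splitting on attribute."""
    groups = defaultdict(list)
    for features, label in rows:
        groups[features[attribute]].append(label)
    total = len(rows)
    return sum(len(g) / total * entropy(g) for g in groups.values())


# Hypothetical data: here diesel cars are always bought, gas cars never.
cars = [
    ({"origin": "American", "doors": 2, "fuel": "diesel"}, "buy"),
    ({"origin": "Japanese", "doors": 4, "fuel": "diesel"}, "buy"),
    ({"origin": "American", "doors": 4, "fuel": "gas"}, "don't buy"),
    ({"origin": "Japanese", "doors": 2, "fuel": "gas"}, "don't buy"),
]

# The best attribute to split on is the one that leaves the least uncertainty.
best = min(["origin", "doors", "fuel"], key=lambda a: partition_entropy(cars, a))
print(best)  # fuel
```

In this data, splitting on origin or doors leaves every group half buy and half don't buy (entropy 1), while splitting on fuel separates the labels completely (entropy 0), matching the gas versus diesel example above.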
30:27 Joel: They build some of these expert systems in this kind of way, I think they applied them sometimes in like medical diagnosis, and sometimes they do better than doctors.
30:35 Michael: Wow, ok, sometimes you just let the data talk rather than intuition.
This episode is brought to you by Hired. Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.
Each offer you receive has salary and equity presented right up front and you can view the offers to accept or reject them before you even talk to the company.
Typically, candidates receive 5 or more offers in just the first week and there are no obligations ever.
Sounds awesome, doesn't it? Well did I mention the signing bonus? Everyone who accepts a job from Hired gets a $1,000 signing bonus. And, as Talk Python listeners, it gets way sweeter! Use the link hired.com/talkpythontome and Hired will double the signing bonus to $2,000! Opportunity is knocking, visit hired.com/talkpythontome and answer the call.
31:38 Michael: Somewhat related to that maybe is you talked about neural networks as well.
31:43 Joel: Yes, so neural networks are a super hot topic right now. As I am sure you have probably heard, all the buzz is about deep learning, right, and deep learning tends to have neural networks at the root of it. So neural networks are basically a way of building up layers and layers of representations. This book doesn't really go into deep learning, because a single neural network is hard enough to do by hand or from scratch. But basically it's a way to build a classifier that works similar to how a toy model of a brain might work. So you have artificial neurons, and each neuron has a bunch of inputs that go into it with weights. If the weighted sum of the inputs exceeds some threshold, the neuron fires, and if it doesn't exceed that threshold, the neuron doesn't fire. So you present an input, which can be like an image, so basically a bunch of zeros and ones, and that causes some neurons to fire, and that propagates through the network, and in the end it will spit out "I think you showed me an image of a cat," right, or "I think you showed me an image of a dog." And you train it by showing it a lot of labeled images and adjusting the weights based on how you got them wrong.
32:56 Michael: Yeah, interesting. So you basically just say these are the inputs and the decisions. You feed it known data, you say here is a cat, here is a cat, that's a dog, and then you ask that question: is this a cat or a dog? One of the problems I sort of remember from neural networks is that as you try to get them more accurate they get overtrained to only do little bits of stuff. So how do things like decision forests and this deep learning deal with that?
33:26 Joel: So random forests are basically taking multiple decision trees and combining them. There is this kind of general principle where a lot of times, rather than building one model that's super predictive, I take a lot of much less predictive models and just let them vote on things or average their results or whatever. So you are right that if I build a decision tree and I include all hundred features in my data set, there is a good chance I am going to overfit and over-learn my training data and not generalize outside of that. So one thing that happens with decision trees is people often don't use single decision trees, they use a random forest, where they will build a bunch of smaller decision trees, each of which is restricted to a small subset of features, so that each one is individually less powerful, but when you combine them they do well in the aggregate, and they don't necessarily have the same overfitting problem that a single decision tree with lots of features would. In neural networks, especially deep learning, there are not exactly the same techniques, but there are other techniques where you will zero out some of your weights sometimes and train without them to make sure that they don't learn too much. And there are a lot of other techniques that get used to make sure you are not overfitting. That's something that data scientists and machine learning people worry about a lot.
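The single artificial neuron described here can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the training loop uses the classic perceptron update rule for one neuron as a stand-in for "adjusting the weights based on how you got them wrong". It is not the multi-layer training a real network would need.

```python
def fires(inputs, weights, threshold):
    """A neuron fires when the weighted sum of its inputs exceeds the threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return weighted_sum > threshold

def train(examples, n_inputs, epochs=20):
    """Perceptron rule: nudge weights and threshold after every wrong answer."""
    weights = [0] * n_inputs
    threshold = 0
    for _ in range(epochs):
        for inputs, target in examples:
            # error is +1 (should have fired), -1 (should not have), or 0 (correct)
            error = target - int(fires(inputs, weights, threshold))
            weights = [w + error * x for w, x in zip(weights, inputs)]
            threshold -= error
    return weights, threshold

# Labeled examples for a logical AND: fire only when both inputs are 1.
and_examples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, threshold = train(and_examples, n_inputs=2)

for inputs, target in and_examples:
    print(inputs, fires(inputs, weights, threshold))
```

A single neuron like this can only learn patterns that a straight line can separate, which is one reason real networks stack many of them in layers.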
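The "let them vote" principle can be shown with a small Python sketch. Only the voting step is illustrated: the three one-question "trees" below are hand-written and hypothetical, whereas a real random forest would learn each tree from a random sample of the data and a random subset of the features.

```python
from collections import Counter

def majority_vote(models, row):
    """Ask every model for a prediction and return the most common answer."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]

# Three deliberately weak models, each restricted to a single feature.
def by_fuel(row):
    return row["fuel"] == "diesel"

def by_doors(row):
    return row["doors"] == 2

def by_origin(row):
    return row["origin"] == "American"

forest = [by_fuel, by_doors, by_origin]

car = {"origin": "Japanese", "doors": 2, "fuel": "diesel"}
print(majority_vote(forest, car))  # True: two of the three models vote to buy
```

Using an odd number of models with a yes/no answer avoids ties. Averaging, the other option mentioned, would apply when the models return numbers rather than classes.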
34:48 Michael: It's a really hot field like you said right, that's awesome. So one more section I'd like to touch on is you talked about this thing called recommender systems?
34:59 Joel: Anyone who has used the internet has pretty good experience with these. You've got Netflix and it says "movies for Joel," and it's trying to predict what the things I'd like are. You go to Amazon and it will recommend items for me. And you can go a level further, where you have all these startups like Stitch Fix, a very hot data science startup where you tell them what kind of clothes you like and they will send you a box of clothes every month that they think you will like a lot. So generating these kinds of recommendations is a pretty popular task within data science, and a lot of data scientists work on these kinds of problems. There are a lot of data scientists working in- they've got to predict, to sell you stuff, and they always want to sell you stuff they think you'll like.
35:44 Michael: You convert better if you figure out what people actually might want, right?
35:48 Joel: Yeah, I mean well, if Amazon sends me an email hey Joel, these five things are on sale and they are five things that I really want to buy then it's much better for them and potentially for me than if they send me a random like these are the five most popular things on Amazon today.
36:05 Michael: Yeah, it definitely is better. I really like going to my Kindle and looking at the recommended things based on what I've been reading, but the Netflix recommendations I know they do really great work but it just doesn't work for me because my seven year old daughter watches Strawberry Shortcake and other random things and I get a lot of kids shows recommended to me.
36:24 Joel: They have a "who is watching" button you can click and say a kid is watching.
36:31 Michael: I know, but my daughter won't use it, she'll just randomly pick one.
36:34 Joel: Yeah, so you know what's funny, I use that button pretty well, but my Netflix recommendations are also not that good, but I think it's just because I don't like anything on there so no matter what they recommend me I am not going to like it.
36:47 Michael: Yeah, yeah I hear you. I feel like you covered a pretty wide swath of data science from scratch, and some of the topics were really accessible, some of them required more math and more thinking but you know, it was all a really nice presentation, what do you feel like you left out?
37:06 Joel: There aren't any topics that- if anything, I kept adding stuff when I was writing it, so I don't feel there are any huge topics that I necessarily left out. But in terms of coverage, probably my biggest regret is that I used Python 2 instead of Python 3.
37:21 Michael: Yeah, ok. I saw in your GitHub repo you had something about Python 3 in there, is that right?
37:27 Joel: Yeah, so pretty much as soon as the book came out, I got emails from people saying, hey, why didn't you use Python 3? And then I also got an email saying, I would like to use Python 3, would the code work? And so I wrote them back and said, you know, I feel like the code should work, give it a try, I bet it works with probably a few changes to some parentheses and print statements and so on. And then eventually one guy wrote me back and he said, it doesn't work. I said ok, so I sat down and said I am going to convert the code to Python 3. And it took me about, let's say, 4 to 5 hours, and that was with me knowing the code intimately, so it would have taken someone who hadn't written the code in the first place a lot longer than that. And then I felt kind of guilty that I had been telling all these people that it was so easy to do, but it wasn't-
38:15 Michael: Just spend a week, it will be fine.
38:17 Joel: Yeah, but so yeah, I have the Python 3 version of the code up on GitHub. I sort of regret not having just done it that way in the first place.
38:25 Michael: Yeah. Sure. So at the end you talked about data science not from scratch and you pointed out a lot of the libraries you might actually use, like numpy and so on, do you want to talk about that really quick, like what's the real data science versus from scratch data science?
38:42 Joel: Yeah, so I would say that numpy is pretty fundamental, that's basically the linear algebra library for Python. So it provides you matrices, matrix algebra, high performance arrays, things like that, which you don't get just built in. And you might not use it directly, but a lot of the other libraries are really built on top of it. So kind of the most broadly accessible machine learning library for Python is called scikit-learn, and it has really nice documentation and really nice tutorials and a fairly standard API for building machine learning models. Any time you want to build a regression model, or a random forest model, or any kind of classifier, that's probably the place to go. There is also pandas, which is the data frame library, which is good if you are working with tabular data. So not necessarily the machine learning side of data science, but more of the "I have kind of a spreadsheet data set and I want to clean it and aggregate it and pivot it and look for data-analysis type insights."
39:50 Michael: Yeah, if you are exploring, like you said, tabular data and you've got to kind of load it up and clean it, pandas seems really fantastic for that.
39:58 Joel: It's a really nice library. And then kind of the new kid on the block is TensorFlow, which is Google's deep learning library. It was only released a few months ago and it's not 1.0 yet, but it seems people are sort of converging around it as how they are going to do deep learning in Python. Now, there are other, earlier libraries that some people used and still use, but TensorFlow seems to be gaining a lot of mind share.
40:24 Michael: Ok, yeah, that's really cool, and I've definitely seen TensorFlow talked about a lot in this context. But it's just a library you can download, it's not like a cloud type thing, right?
40:35 Joel: Yes, so currently it's a local library, and I am not sure if they have come out with the version where you can kind of do your own cloud, but I know they are going in that direction, because deep learning models take a really long time to train. So if you want to use it for anything serious, you want to distribute them and run them on high-powered machinery and not just run them on your laptop.
40:59 Michael: Yeah, of course, if you have tons of data maybe it's better to get a bunch of machines for an hour. So what do you think about these cloud learning or data science platforms, I am thinking like Azure machine learning or I just had sigopt on the show a few shows ago, and I am not really sure what else is out there in terms of like go out to the cloud and grab some data sciency stuff. What do you think about that?
41:22 Joel: I haven't spent much time looking at any of those. I think they can add value either, one, if you don't have the data science or machine learning expertise in house to do whatever you need to do, or two, if you have some kind of model that you've built but you need help either putting it into production or operationalizing it somehow. They can play a pretty good role there. But my sense is that most people doing data science are relying more on running the libraries and running the models themselves, but that could be just my sample of people who I have talked to.
42:00 Michael: Yeah, of course. Another thing that you said you are into is taking some of the ideas from Haskell and thinking about how those might manifest in Python, where have you gone up to there?
42:14 Joel: Haskell, if you are not familiar with it, is kind of the purest of the pure functional languages, with strong types and lazy evaluation and things like that, and so I spent a lot of time thinking about how I can bring some of these concepts back into Python. And in Python 3 lazy evaluation plays a much bigger role, in the sense that range is a generator instead of a list, and map and filter and things like that also are generators instead of lists. But I started getting into the itertools library, which gives you tools for generating basically infinite sequences, and just trying to see how far I could go using pure functions and infinite sequences and avoiding mutable variables and other things that you try not to do when you are working in a Haskell-like language.
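As a rough illustration of the lazy, infinite-sequence style described here, the sketch below uses `itertools` from the standard library. The `iterate` helper is a hypothetical name borrowed from Haskell and is not part of `itertools`. Strictly speaking, `range` in Python 3 is a lazy sequence object and `map` and `filter` return iterators rather than generators, but the effect is the same: nothing is computed until it is asked for.

```python
from itertools import count, islice, takewhile

def iterate(f, x):
    """Yield x, f(x), f(f(x)), ... forever, like Haskell's iterate."""
    while True:
        yield x
        x = f(x)

# An infinite sequence; islice pulls out only the first five values.
powers_of_two = iterate(lambda n: n * 2, 1)
first_five = list(islice(powers_of_two, 5))
print(first_five)  # [1, 2, 4, 8, 16]

# map and filter are lazy in Python 3, so they can wrap the endless count(1).
squares = map(lambda n: n * n, count(1))
even_squares = filter(lambda s: s % 2 == 0, squares)
small_even_squares = list(takewhile(lambda s: s < 50, even_squares))
print(small_even_squares)  # [4, 16, 36]
```

Calling `list()` directly on `squares` or `powers_of_two` would never finish, so some bounding step such as `islice` or `takewhile` is always needed before materializing the values.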
43:11 Michael: Yeah, and how did you feel like it came out in the end, did you feel like you were able to do bring a lot of those ideas over?
43:16 Joel: I was, and I ended up producing code that was really neat and really impenetrable. The lazy infinite sequences stuff was almost academic, in terms of like, yes, I managed to do it, this is mathematically interesting and it works well, but it's not readable at all. So, imagine you wanted to represent, let's say, a binary tree in Python. Kind of the two, I would say, obvious approaches would be, one, to make some kind of class where it has a value element and a left element and a right element, and then you also might just use a dictionary to represent it where it had those 43:55 and in a language with algebraic data types like Haskell you would just basically represent that kind of tree as a product type where it just has like three elements in it. And so I said, what if I just represented a tree as a tuple with three elements, where the first element is the left subtree, the second element is the value, and the third element is the right subtree? And certainly if you want to do linked lists in Python, which you probably don't want to do, but if you did want to do linked lists in Python, you just treat them as a tuple, where the first element is the element and the second element is the 44:26 list. And so I actually found that I was able to write some pretty nice code using those kinds of ideas, and I did some coding interviews that way. I am not sure that interviewers appreciated it.
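Here is one way the tuple representation could look in Python. It is a sketch under assumptions: using `None` for the empty tree and the empty list, and every function name below, are illustrative choices rather than the code from those interviews.

```python
from functools import reduce

def insert(tree, value):
    """Return a new (left, value, right) search tree containing value."""
    if tree is None:
        return (None, value, None)
    left, root, right = tree
    if value < root:
        return (insert(left, value), root, right)
    return (left, root, insert(right, value))

def in_order(tree):
    """List the values of a tuple tree from smallest to largest."""
    if tree is None:
        return []
    left, root, right = tree
    return in_order(left) + [root] + in_order(right)

# Build a tree without mutating anything: each insert returns a new tuple.
tree = reduce(insert, [5, 2, 8, 1], None)
print(in_order(tree))  # [1, 2, 5, 8]

def cons(head, rest):
    """A linked list cell is a (head, rest) pair; None is the empty list."""
    return (head, rest)

def to_list(linked):
    """Walk a tuple linked list and collect its elements."""
    items = []
    while linked is not None:
        head, linked = linked
        items.append(head)
    return items

print(to_list(cons(1, cons(2, cons(3, None)))))  # [1, 2, 3]
```

Because tuples are immutable, `insert` has to rebuild the path from the root down to the new node, which is the same trade-off persistent data structures make in functional languages.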
44:38 Michael: [laugh] What is this talking about? Cool, so one question given your work at the AI institute and your background in data science and so on I wanted to ask is last year or so there has been a lot of news items and people coming out saying that artificial intelligence is a danger to humanity, what is your thought, should we be super excited about it or is it something we should be maybe cautious about?
45:10 Joel: I would say I probably fall in the middle, I mean I am excited about it because it's my job, but I don't go around encouraging everyone else that they have to be excited about it, but at the same time I don't spend a lot of time worrying about how dangerous it is I think we are pretty far off from the time when we have to worry about it and I do have some friends who think we should worry about it now before it's too late, but I think there is a lot more important things to worry about in the world when I read the news, so...
45:40 Michael: Yeah, I kind of agree with you on that, and I think there is two certainly ways to look at that, on one hand if you think about things like self driving cars, let's just take that as an example, like I believe one of the biggest job categories for men in the United States is some form of driving, like driving a truck, driving a taxi, those types of things, delivery vehicles, and so on, and if self driving cars were to remove all that, like that would have large social effects, but I think that's not so much the way that people at least recently in the news are talking about, it was more like terminator style, right, and so that I am not too worried about this personally, who knows.
46:26 Joel: Yeah, I mean I personally would pay a lot of money for a self driving car because I don't like driving that much, and I would rather be able to read while I am going somewhere.
46:36 Michael: Yeah, driving is fun until you get stuck on I5 for half an hour, then you are like I don't like driving anymore.
46:42 Joel: Exactly.
46:43 Michael: Awesome, well, we are kind of coming up near the end of the show, let me ask you just a few closing questions I always ask all my guests- if you are going to write some Python code, what editor do you open up?
46:51 Joel: So these days I tend to use Atom for pretty much everything, memory leaks and all.
46:56 Michael: Nice, yeah, that's from github right?
46:59 Joel: Yeah, it's pretty similar to Sublime but it doesn't nag you for money, so...
47:02 Michael: Yeah, it's pretty nice and I think it's atom.io, they have a really cool little video about how it's the editor of the future. It's nice.
47:11 Joel: Yeah, I didn't get that far but it's the editor of the present at least.
47:15 Michael: Yeah, it's like a George Jetson sort of like a promo video, it's pretty funny. It is pretty nice, isn't it, does it have good Python support?
47:22 Joel: You know, I am not someone who leans on IDE functionality a lot, so if good Python support counts as syntax 47:29 yes, but that's all I tend to use at work, so...
47:33 Michael: Yeah, yeah, ok, cool. And, if you look at the Python package index there is 75+ thousand packages, and we all have experience with different parts of it, there is things we love and would recommend, like what is your favorite one you would recommend that people maybe don't know about?
47:52 Joel: So the one I recommend that people don't necessarily know about is Beautiful Soup. It's basically an HTML parsing library, and so if you start scraping data from web pages, you are going to get a big mess of ugly HTML that's probably not even well formed most of the time, because most people do not bother to well-form their HTML.
48:13 Michael: Are you telling me that I can't just load that up as an XML document or something like this? I am just kidding, of course, it's terrible trying to work directly on the web, right? Beautiful Soup is really- I really like it as well.
48:25 Joel: It's really nice. I mean, you have to spend a little time to get used to it, its API and interface and everything, but it's super handy for getting data out of web pages and doing anything where you have to do a bunch of scraping.
48:36 Michael: And you cover that in your book, right, in the getting data chapter? You used Beautiful Soup a little bit and showed people how to use it.
48:43 Joel: Yeah, I covered it a little bit, it's a nice addition to the data scientist's toolkit.
48:46 Michael: Nice, so Joel, how do people find your book? Amazon, Google search for data science from scratch?
48:53 Joel: Yeah, it's on Amazon, and you can buy it from O'Reilly, if you Google data science from scratch you will find it.
48:58 Michael: Cool, and I'll put a link to the github repo where you have all the code examples and so on as well. It's been really fun to talk about data science and I think you have a really interesting way of teaching people to appreciate the tools that we are all fairly familiar with by showing them how to build it from scratch, so thanks for that.
49:19 Joel: My pleasure.
This has been another episode of Talk Python To Me.
Today's guest was Joel Grus and this episode has been sponsored by Snap CI and Hired. Thank you guys for supporting the show!
Snap CI is modern continuous integration and delivery. Build, test, and deploy your code directly from GitHub, all in your browser with debugging, Docker, and parallelism included. Try them for free at snap.ci/talkpython
Hired wants to help you find your next big thing. Visit hired.com/talkpythontome to get 5 or more offers with salary and equity right up front and a special listener signing bonus of $2,000 USD.
Are you or a colleague trying to learn Python? Have you tried books and videos that left you bored by just covering topics point-by-point? Check out my online course Python Jumpstart by Building 10 Apps at talkpython.fm/course to experience a more engaging way to learn Python.
You can find the links from the show at talkpython.fm/episodes/show/56
Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, Google Play feed at /play and direct RSS feed at /rss on talkpython.fm.
Our theme music is Developers Developers Developers by Cory Smith, who goes by Smixx. You can hear the entire song at talkpython.fm/music.
This is your host, Michael Kennedy. Thanks for listening!
Smixx, take us out of here.