#56: Data Science from Scratch Transcript
00:00 You likely know that Python is one of the fastest growing languages for data science.
00:03 This is a discipline that combines the scientific inquiry of hypotheses and tests,
00:07 the mathematical intuition of probability and statistics, the AI foundations of machine learning,
00:12 a fluency in big data processing, and the Python language itself. That's a very broad set of skills
00:18 you'll need to be a good data scientist, and yet each one is deep and often hard to understand.
00:23 That's why I'm excited to speak with Joel Grus, a data scientist from Seattle. He wrote a book to
00:28 help us all understand what's actually happening when we employ libraries such as scikit-learn or
00:32 NumPy. It's called Data Science from Scratch, and that's the topic of this week's episode.
00:36 This is Talk Python To Me, episode number 56, recorded April 19th, 2016.
00:56 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the
01:11 ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm
01:16 @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on
01:22 Twitter via @talkpython. This episode is brought to you by SnapCI and Hired. Thank them for supporting
01:28 the show on Twitter via @snap_CI and @Hired_HQ. Hey, everyone. It's great to be
01:36 back with you. I have something to share before we get to the main part of the show. I had a chance to be on the
01:40 Partially Derivative podcast this week, where we did a short segment on programming tips for data scientists,
01:46 relevant to this episode, wouldn't you say? I talked about using generator methods and the yield
01:51 keyword to dramatically improve performance while processing lots of data in a pipeline.
01:55 If you want to hear more about that kind of stuff, check out the link to the Partially Derivative episode
01:59 in the show notes. Now, let's talk to Joel. Joel, welcome to the show. Thanks. I'm glad to be here.
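[Editor's sketch] The generator tip mentioned above can be illustrated roughly as follows. This is a minimal sketch, not the code from the Partially Derivative segment: the function names and the sample input are made up for illustration. Each stage uses `yield` to hand records downstream one at a time, so the full data set never has to sit in memory at once.

```python
def read_records(lines):
    """Lazily parse 'name,value' lines into (name, float) tuples."""
    for line in lines:
        name, value = line.strip().split(",")
        yield name, float(value)


def only_positive(records):
    """Pass through records whose value is positive, one at a time."""
    for name, value in records:
        if value > 0:
            yield name, value


def total(records):
    """Consume the pipeline and sum the values."""
    return sum(value for _, value in records)


# Hypothetical in-memory input; a real pipeline might iterate over a large file.
raw_lines = ["a,1.5", "b,-2.0", "c,3.0"]
pipeline_total = total(only_positive(read_records(raw_lines)))
print(pipeline_total)
```

Nothing is parsed until `total` starts pulling values, which is what makes this style attractive for large inputs.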
02:04 Yeah, I'm really looking forward to doing data science, but from scratch today.
02:09 I know. That's my book. I'm the right person to be here for that topic.
02:12 Fantastic. So we're going to dig into data science. And I think your idea of
02:16 taking it from the fundamentals and building something that's not so complicated or so well
02:22 polished and optimized that you can really understand what you're doing is great. But before we get into
02:26 that, let's just sort of start from the beginning. How did you get into programming in Python?
02:31 So if you go back about 10-ish years or so, I was in grad school for economics. I took a class called
02:39 probability modeling, which was a lot of simulating Markov chains and doing Monte Carlo simulations and
02:45 things like that. And the class was actually taught in MATLAB. School had a site license for MATLAB that
02:50 only worked on campus. And so that meant that if I went home and worked at home, which I like to do,
02:55 I couldn't use MATLAB. And so as an alternative, I found that I could use Python and NumPy to basically
03:02 do all the MATLAB-y things without that site license. So that's actually why I started being
03:08 into Python. And then I just kind of liked it and stuck with it.
03:10 Yeah, I can definitely see how you would like working in Python more than MATLAB. I've
03:14 spent my fair share of time in the .m files and it's all right, but it's not Python.
03:20 You were studying economics and that's kind of how you got interested in data science,
03:25 trying to answer these big questions. The subject of economics or some other way?
03:30 I followed sort of a convoluted path. My background was math and economics.
03:34 And I actually started off doing quantitative finance, options, pricing, mathematics of financial
03:39 risk. I worked at a hedge fund. And then when the hedge fund went out of business, I just kind of
03:43 lucked into this job at a startup called Farecast, where I was doing a lot of kind of BI work,
03:50 writing SQL queries, building dashboards. And over the years, I just sort of moved more and more in a
03:58 data science kind of direction and eventually became a data scientist and software developer.
04:02 And certainly the math and economics training was helpful for that. But
04:06 it was a career that I sort of ended up in by accident rather than through a deliberate plan.
04:11 That's a little bit like my career path as well. Studied math and just sort of had to learn the
04:17 programming and stumbled into it. That's cool. What do you do day to day now?
04:21 So I just started a new job. I work at a nonprofit. It's called the Allen Institute for Artificial
04:27 Intelligence. It was founded by Paul Allen, who was one of the Microsoft founders. And we're
04:33 basically doing kind of fundamental AI research. I can't tell you that much about what I do. So
04:38 it's really my second week and I'm still kind of just learning what goes on. I'm learning my way
04:43 around. But I work on a team called Aristo, which is basically building AI to take science quizzes and
04:51 to understand science. Oh, very interesting. Yeah, that sounds like a fascinating place to work.
04:56 It's really neat. Yeah. A lot of smart people, interesting problems.
05:00 Would you say it's different than working in a hedge fund or is it surprisingly similar?
05:03 No, it's totally different. I mean, it's similar in that there's a lot of smart people,
05:07 but it's not really similar in any other way. Yeah. The goals are not necessarily the same,
05:11 are they? The goals, the incentives, the sort of day-to-day stress levels, it's all different.
05:16 Yeah. Yeah. Sounds great. So let's talk about your book a little bit. It's called Data Science
05:21 from Scratch. And I think that's a really cool way to approach it. We have all these super polished
05:28 data science tools. You know, we have them in Python, we have them in R and other languages as well.
05:34 Why from scratch, rather than just grabbing one of these libraries or a set of these libraries and
05:38 talking about how to use them? So a couple of reasons. One is, as I mentioned,
05:42 I come from a math background and the math way of doing things is you can't really use something
05:49 until you can prove it. I once had a teacher who came in and he looked at the syllabus for the
05:58 previous semester and he said, oh, good, you proved this theorem. That means I can use it in this class.
06:03 And so there's this real rigor around, you can't use things unless you understand them.
06:06 And that approach always kind of resonated with me. And so that, to a large degree,
06:11 that's the approach I took in the book. At the same time, you also have all these really powerful,
06:17 really easy to use libraries like scikit-learn, where you can go in and basically copy and paste,
06:24 you know, five lines of code. And you've built a decision tree or you've built a regression model.
06:29 And it's very easy to not know what you're doing. And you can pick up books and they'll tell you what
06:34 commands to type. But you can also type those commands and again, not know what you're doing.
06:39 And so I thought the book I would like, the book that I would have found valuable would be,
06:44 here's what these models are actually doing behind the scenes. And so then when it's time you apply
06:50 them, you kind of understand the principles, you understand where they go wrong, where they go right,
06:54 what they're good for, and what they're actually doing.
06:56 Yeah, I think that's, that's really interesting. And that is very much the mathematical way of,
07:01 if you can't prove it, you can't use it, sort of a way of thinking, which is not that common
07:07 in programming, you know, if there's an API, hey, just grab it and use it. But I think it's really
07:12 important in data science, because you have all these different disciplines and backgrounds coming
07:18 together to make it work, right? You're not just a pure programmer, and you're not, you know, a
07:24 mathematician or some sort of domain expert, right? You kind of have to blend these together.
07:28 Yep. And the other thing that I think ended up working pretty nicely about this approach was,
07:34 there's a lot of really mathy books about data science and machine learning, that's like,
07:38 here's an equation, here's an equation, here's an equation. But what I ended up doing was a little bit
07:43 of math. But then when it came time to, here's how something works, do it in working Python code.
07:49 So that it's rigorous in terms of here's all the steps laid out. But it's also, that's the code you
07:54 can run and sort of follow along.
07:57 Yeah, absolutely. Why Python and not something like R or some other language?
08:02 So the short answer is that, for whatever reason, R is not sympathetic with the way my brain works.
08:08 And whenever I try to sit down and start writing things in R, I just find that it doesn't work the way
08:13 I expect it to. The joke is that R is a language designed by statisticians for statisticians.
08:18 There's some truth to that, some unfairness to it. But for whatever reason, Python is much
08:24 friendlier to the way that my brain works and the way that I solve problems. Actually, the first draft
08:29 of my book was a lot harsher against R. And then some of the earlier reviewers didn't like that.
08:35 They really don't like that. So I kind of revised it to be more gentle.
08:39 Yeah, I think one of the things that's nice about Python is if you invest to learn data science
08:44 through Python, or the Python data science tools, rather, and then you want to build sort of a more
08:50 working application, you don't have to, you know, convert that to some other language, right? You're
08:55 already in Python. It's sort of a full stack thing you keep working with, right?
08:59 That's one important aspect. I think another important aspect is that from my perspective,
09:03 Python code, if written well, is super readable. And so you don't necessarily have to be a Python
09:11 expert to take a well written piece of Python code and understand what it's doing. Whereas other
09:16 languages can be a lot more impenetrable. And so I find it a nice teaching language from that
09:21 perspective as well.
09:22 Yeah, I would say, you know, most of the universities in the US came to that same conclusion,
09:27 right? With Python being the most common CompSci 101 course these days, as opposed to like Java or
09:33 Scheme, which I took.
09:34 Yeah, I mean, when I took, I also did Scheme when I took CompSci 101 way back in the day. And I actually
09:40 thought that was a really nice language for learning computer science. But I can see why Python would be
09:46 a better language for, you know, learning programming as opposed to computer science.
09:50 So when you start off your book, you actually have a few sections that are just about,
09:55 I'm going to teach you just enough Python to get started so that you can understand the data
10:02 science tools and what we're doing. What kind of stuff do you put in there? Like, what do you
10:06 consider fundamental for doing data science? And what can be sort of learned later?
10:10 The most important things I put in were obviously functions; if you want to do any kind of
10:16 analytic work, you need Python functions. The other thing I really tried to emphasize
10:20 was writing kind of clean Pythonic code. So I also did a lot of list comprehensions,
10:27 as well as understanding all the basic data structures. So lots of dicts and sets and some
10:34 of the less common ones like defaultdicts and Counters, I also make quite heavy use of.
10:40 So basically, anything you would need to understand, here's how to write out an algorithm in Python,
10:46 here's how to use the correct data structures in Python, I kind of baked into that initial
10:52 introduction. Whereas more advanced things like really specialized data structures like queues,
10:58 I just kind of introduced as needed later in the book, some of the more esoteric math functions,
11:03 I just kind of introduced when they came up.
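[Editor's sketch] The building blocks named above (list comprehensions, sets, `defaultdict`, `Counter`) can be seen together in a few lines. This is a small sketch on made-up data, not an excerpt from the book.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: (user, interest) pairs.
interests = [
    ("alice", "python"),
    ("bob", "statistics"),
    ("alice", "statistics"),
    ("carol", "python"),
    ("dave", "python"),
]

# List comprehension: build a new list by filtering an existing one.
python_fans = [user for user, topic in interests if topic == "python"]

# Set comprehension: the unique topics.
topics = {topic for _, topic in interests}

# defaultdict: group values without first checking whether the key exists.
topics_by_user = defaultdict(list)
for user, topic in interests:
    topics_by_user[user].append(topic)

# Counter: tally how often each topic appears (fed by a generator expression).
topic_counts = Counter(topic for _, topic in interests)

print(python_fans)
print(dict(topics_by_user))
print(topic_counts.most_common(1))
```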
11:05 I think obviously, nicer dictionaries, things like defaultdict and so on,
11:09 played a really important role in the subsequent pieces where you're trying to make concise,
11:15 clean statements. I thought your use of sort of list comprehensions and generator expressions,
11:23 along with the other functional pieces like zip and Counter and so on, were really,
11:30 really interesting. You write some very concise code without being unreadable. I thought that was
11:35 nice.
11:35 That's one of the things I aim for. And as much time as you spend sort of revising the text of the book,
11:41 I also spent probably just as much time revising the code of the book. I went back over it many,
11:46 many times thinking, can I make this simpler? Can I make it cleaner? Can I make it easier to
11:50 understand? And I try and do that. I'm kind of a stickler for code elegance. It's kind of a
11:57 personal fault, maybe.
11:58 But also, you come out with very polished code. I found it very easy to follow. And I thought the
12:04 focus on this functional style programming was really neat. So one of the things you talk about
12:10 in the beginning is about visualizing data, because you have all these numbers. And obviously,
12:15 to understand them, putting a picture to it is very nice. And what is that? Matplotlib?
12:19 Yes. In the book, I only use Matplotlib. I was just having a conversation with some of my data
12:25 science buddies about a week ago, where they were, as always, complaining that the data visualization
12:31 story in Python is still not very good.
12:33 Yeah. So you said that Matplotlib was starting to show its age in your book. What did you mean by
12:38 that?
12:39 When I first started learning Python, like I said, more than 10 years ago, as a MATLAB replacement,
12:44 like Matplotlib was around back then, and it had the same features and the same interface, and it
12:51 produced the same not particularly attractive plots. So it hasn't kind of evolved as the rest of the
12:57 language has evolved, say. And there's been some projects like Seaborn that try and put some prettiness
13:03 on top of it. And there's been some other attempts to kind of bring the R-style ggplot into Python.
13:09 But from my perspective, none of them has really like won the mindshare and become the solution.
13:14 So everyone kind of does something different.
13:16 Yeah. You also talked about Bokeh, which is from the Continuum guys. I had them on show 34. Can you
13:23 maybe talk about that really quickly? Just what is it? What's it used for? That sort of thing.
13:28 It is another visualization library. And I have to be honest with you, I haven't really
13:32 checked it out since I was writing the book like more than a year ago. So I seem to remember that it
13:38 seemed to have some facility for building in interactive plots and things like that.
13:42 It's kind of D3, fancy graphics in the browser type stuff, right?
13:47 I believe so. But I haven't spent much time playing with it.
13:49 Yeah.
13:49 I haven't been doing a lot of, probably doing actually more D3 than Python visualization recently.
13:55 Right, right. Tell everyone what D3 is. I don't think we've talked too much about that on the show.
13:59 Oh, sure. So D3 is actually a JavaScript library for building interactive data visualizations.
14:06 It brings sort of an interesting model where you bind your data to DOM elements, which are quite often
14:12 SVG elements. And so what that makes it easy to do is to have a data set. And when you add new data to it,
14:21 your plot updates, and it allows a lot of really interesting interactivity. If you check out the
14:26 D3 website, they have a gallery of all sorts of amazing visualizations where you look at and think,
14:32 how on earth did they do that?
14:33 Yeah. Whenever I go to the D3 website, I'm like, wow, I want to use this. I don't have a use for it,
14:39 but it's fantastic. How can I just build stuff that looks like this?
14:42 Well, I mean, so I have a joke, which is that good data scientists copy from the D3 gallery and great
14:48 data scientists steal from the D3 gallery. And they give you the code for all of them. So I would say
14:53 probably close to 100% of the D3 visualizations I've ever built have been something I found in the D3
15:00 gallery and kind of tweaked until it fit my data.
15:02 Yeah. Yeah. Nice. So the next section you talked about was like the fundamentals of the math and
15:08 science that you need to know in order to be a data scientist. And I like the way that you put it.
15:14 You said these are like the cartoon versions of big mathematical ideas. How much math do you need to
15:19 know to call yourself a data scientist? Like if I studied just programming, I'm not a data scientist.
15:25 If I studied just math, I'm not a data scientist. Like what's the story there?
15:28 Data science is a funny thing in that there are as many different jobs called data science as there
15:34 are data scientists. So there are people who will call themselves data scientists and they write SQL
15:39 queries all day. There are people who call themselves data scientists and they do cutting edge machine
15:44 learning research all day. There are people who call themselves data scientists who convince people to click on ads.
15:49 Like you could know almost no math and still get a job where you call yourself a data scientist.
15:53 You could know very little programming and get a job where you call yourself a data scientist.
15:58 It's a field that's still kind of figuring itself out. And there's just such a breadth of the different
16:03 roles that are all calling themselves data science. In an ideal world, you would know, you know,
16:08 linear algebra, you would know probability, you would know statistics. But among people who are data
16:14 scientists, some people know that stuff really well. Some people know that stuff not so well.
16:17 Yeah, I can imagine. It's really about you've got a lot of data or some specific type of data and you
16:23 want to answer questions about it or even discover the questions you could ask that nobody's asked, right?
16:29 SnapCI is a continuous delivery tool from ThoughtWorks that lets you reliably test and deploy your code
16:48 through multi-stage pipelines in the cloud without the hassle of managing hardware. Automate and visualize
16:55 your deployments with ease and make pushing to production an effortless item on your to-do list.
16:59 Snap also supports Docker and in-browser debugging, and they integrate with AWS and Heroku.
17:05 Thank SnapCI for sponsoring this episode by trying them with no obligation for 30 days by going to
17:12 snap.com/talkpython. Sometimes it's actually the opposite, which is, I have a question I want to ask. Where can I find some
17:26 data that will allow me to answer that question? So, you know, sometimes you start with the data and
17:31 go to the questions. Sometimes you start with the questions and then you got to find the data.
17:34 Right. Okay. Very interesting. So that actually brought you to your next section that you talked
17:39 about in your book, which was getting data. And so what are the common ways that you talk about there?
17:45 Nowadays, there's a lot of people who just like post interesting data sets. So if you go to like
17:50 Kaggle, which is a site that does data science competitions, every one of their competitions has
17:55 a data set, which might be interesting in ways that are not related to the competition. The government websites
18:01 publish all sorts of data, some of which is potentially interesting. If you like weather,
18:05 they publish weather data. If you like economics, they publish economic data. There's a lot of
18:10 Python libraries for scraping websites. So even if a data set is not available, you can always go out and
18:15 try and scrape it and collect it yourself and clean it. My go-to source is always Twitter. I always build
18:21 things using Twitter data. So Twitter and all the other sites will have APIs where you can just make
18:27 RESTful calls or even use libraries. Python has some Twitter libraries that abstract all that away. And you can,
18:33 you know, collect tweets on a given topic or collect tweets from certain users and
18:37 collect your own tweets and do analysis on those.
18:41 Yeah. And you had a section about that in the book. What was the package you were using for that?
18:45 In Python, the package I usually use is called Twython. There's a bunch of them. That's the one that I
18:50 got to work the easiest, but I think highly of it.
18:53 I don't think I believe any other ones.
18:56 Yeah. Yeah. Okay, cool. Yeah. I think Twitter is fairly special among the social networks.
19:01 To me, Twitter is the social network of ideas more than it is of friends or family or whatever. So
19:08 yeah, pretty cool. I love to get data from Twitter as well.
19:11 Yeah. It's just that they won't give you the fire hose, but for most things, they'll give you
19:15 more tweets than you can handle anyway. So if you want to find out what people are saying on a certain
19:19 topic or just what people are talking about in general, it's an awesome source for that.
19:24 Yeah. I saw some study or heard about some study a few years ago where people were studying, they were
19:30 trying to do sentiment, mood analysis on Twitter, and then trying to tie that back to the stock market
19:38 and predict how people were feeling on Twitter to near-term changes like in the next hour in the
19:45 stock market, which I thought was a pretty interesting project.
19:48 So ideas you can come up with using that data are pretty much endless. Every month, I think I'm done
19:53 with it and then I think of something else to do.
19:54 Nice. So then you kind of get into the topics that I think of as traditional data science,
20:01 machine learning, neural networks, network analysis, that kind of stuff. And I thought maybe we could go
20:07 through each section and you could tell me what it is that you need to sort of fundamentally
20:13 understand what are the from scratch basic pieces of each part of this data science and then maybe
20:20 some examples of problems you might solve with it.
20:22 Sure.
20:22 Yeah. So the first one you talk about is machine learning. And when you said machine learning
20:27 and later neural networks from scratch, I'm like, wow, that's a pretty big thing to take on from
20:33 scratch. What's the story there?
20:34 Machine learning at a high level is just learning some kind of model from data rather than sitting
20:41 down and writing out the model yourself by hand. And so if you have a small amount of data, then
20:47 you know, coming up with an algorithm that's going to learn a model from it is actually a pretty
20:51 reasonable thing to do if you go with a simple model and don't add too many bells and whistles.
20:56 It's only when you want to, you know, start producing recommendations at Netflix scale
21:01 or when you want to start, you know, building something to recognize speech patterns from audio files
21:06 that your beautiful handcrafted Python is probably not going to be up to the task.
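[Editor's sketch] "Learning a model from data" with a small data set and a simple model can be as short as the following. This is an illustrative least-squares line fit in plain Python; the data points are invented and the function names are not taken from the book.

```python
def mean(xs):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(xs) / len(xs)


def fit_line(xs, ys):
    """Least-squares fit of y = alpha + beta * x for small in-memory data."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    beta = covariance / variance
    alpha = y_bar - beta * x_bar
    return alpha, beta


# Made-up observations that roughly follow y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]
alpha, beta = fit_line(xs, ys)
print(f"learned model: y = {alpha:.2f} + {beta:.2f} * x")
```

Nobody wrote the slope or intercept by hand; both come out of the data, which is the point being made in the conversation.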
21:10 Yeah. So I suppose the types of things you solve with machine learning is fairly unbounded. There's
21:19 a lot of problems that machine learning answers. What are some of your favorite examples?
21:22 Everything out there is machine learning these days. But if you want to talk about like projects that
21:29 I've worked on for fun, one time I built a classifier to predict or to identify hacker news articles that I would be interested in or not
21:37 interested in. So it would take kind of the feed of new hacker news stories and come up with a score between zero and one about how interested, you know, it thought I would be based on some initial seed values that I leave.
21:49 How did you teach it?
22:06 There were a few other kind of idiosyncratic features I threw in, like, you know, is it an Ask Hacker News? Is it a Show Hacker News? Does it have a dollar
22:15 amount in there? Because there are a lot of kind of bad stories that are like, here's how to make $5,000, I think. So that was a good negative signal.
22:23 It's sort of a spammy signal.
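[Editor's sketch] The hand-picked features described above could be computed along these lines. The conversation does not say which model produced the score between zero and one, so the logistic squashing and the weights below are assumptions chosen purely for illustration; in practice the weights would be learned from labeled stories.

```python
import math
import re


def title_features(title):
    """Turn a story title into a few 0/1 features like those mentioned above."""
    lowered = title.lower()
    return {
        "is_ask_hn": int(lowered.startswith("ask hn")),
        "is_show_hn": int(lowered.startswith("show hn")),
        "has_dollar_amount": int(bool(re.search(r"\$\d", title))),
    }


# Hypothetical weights, not learned from any real data.
WEIGHTS = {"is_ask_hn": 0.5, "is_show_hn": 1.0, "has_dollar_amount": -2.0}
BIAS = 0.0


def interest_score(title):
    """Squash a weighted sum of the features into a score between 0 and 1."""
    features = title_features(title)
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))


print(interest_score("Show HN: A tiny plotting library"))
print(interest_score("Here's how to make $5,000 a week"))
```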
22:24 Yeah. So, and then it turned out I wrote a blog post about it and someone posted the story to Hacker News and then
22:31 the Hacker News community got very angry that someone would think that not every story there was worth
22:36 reading. Because of course every story there is worth reading. So why would I want to filter some of them out?
22:39 They accused me of wanting to live in a bubble.
22:41 And yeah, they can be fairly critical on there. But that's funny. So the big question is, did your system
22:49 like your article, your blog post?
22:51 That's a good question. I don't know that I actually checked. I should go back and try and find that.
22:55 It'd be funny if it recommended it to you. Or not. Either way, it would be funny to know.
22:59 Well, you know, it's funny speaking of this. So, you know, I have an Android phone and so
23:03 Google Now will sometimes recommend me articles of things I'll be interested in.
23:07 And occasionally it will recommend to me like my own blog posts, which I guess means it's doing a good job.
23:13 You know, one of the things I did with machine learning was I have a five year old daughter and I do her clothes shopping.
23:19 And I noticed that little boys' clothes were very interesting and little girls' clothes tend to be
23:23 kind of like really boring. The boys' clothes have like dinosaurs and rockets and robots and
23:28 girls' clothes have like hearts and flowers. And so I built a machine learning model to take an image of a
23:33 children's t-shirt and predict whether it was a boy's shirt or a girl's shirt.
23:38 Awesome. Yeah. And I agree with the classification there as well. I have three daughters and we still
23:44 bought a fair number of boy clothes for them.
23:46 Yeah, I did the same.
23:47 Nice. So the next topic that you covered was nearest neighbors. And I mean, conceptually,
23:53 I kind of know what nearest neighbors are pretty easily, but it's a computationally hard problem,
23:58 especially in higher dimensions. And so what kind of stuff do you do? What kind of problems do you solve with
24:03 this?
24:03 One thing you can do is when you don't have like a great parametric model for what's going on.
24:08 So for instance, one place where I've seen this applied is when I have some kind of like
24:12 time series type signal, and I want to know if it represents some kind of bad anomaly. And it might be,
24:19 you know, like thousands of points in some weird shape. And I don't have a great way to classify it.
24:22 But one thing you can do is if you have a bunch of labeled signals, you know, some are good or some
24:28 are bad, you just say, I'm just going to take my set of reference signals and figure out which are
24:33 the ones it's closest to. And that way, I don't need to necessarily have a model of what's a good
24:38 signal or what's a bad signal. I can just say, I have some labeled data, that's enough. And so I
24:43 think that's kind of the situation where it can be useful.
24:46 Yeah, nice. When you're still trying to explore things, and maybe you don't, you don't have a model
24:51 you're trying to fit it to, you're just trying to understand it, right?
24:53 Yeah. Or think about where like, you have some weird kind of shapes, or weird kind of patterns
24:59 that you can't really put math behind. What's a good pattern? What's a bad pattern? But you have
25:04 some labels, then you could use nearest neighbors to basically classify without having to have a
25:10 mathematical model behind it.
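[Editor's sketch] The idea of classifying a signal by looking at the labeled reference signals it is closest to can be sketched as a k-nearest-neighbors vote. The reference signals below are invented, and Euclidean distance is just one reasonable choice of "closeness".

```python
import math
from collections import Counter


def distance(a, b):
    """Euclidean distance between two equal-length signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(k, labeled_signals, new_signal):
    """Label new_signal by majority vote among its k closest references."""
    by_distance = sorted(labeled_signals,
                         key=lambda pair: distance(pair[0], new_signal))
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]


# Made-up reference signals: flat ones are "good", spiky ones are "bad".
references = [
    ([1.0, 1.0, 1.0, 1.0], "good"),
    ([1.1, 0.9, 1.0, 1.1], "good"),
    ([0.9, 1.0, 1.1, 1.0], "good"),
    ([1.0, 5.0, 1.0, 6.0], "bad"),
    ([0.8, 6.0, 1.2, 5.5], "bad"),
]

print(knn_classify(3, references, [1.0, 1.1, 0.9, 1.0]))
print(knn_classify(3, references, [1.0, 5.5, 1.0, 5.8]))
```

There is no model of what makes a signal good or bad here, only labeled examples and a distance function, which matches the situation described above.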
25:11 Okay, yeah, very cool. So then you get to another topic called Bayesian analysis, which I know in the context of determining spam and
25:21 filtering and stuff like that. But what's, what's the story of Bayesian analysis?
25:26 So it's named after Bayes' rule, which is just a theorem in statistics, having to do with ways of
25:33 reversing conditional probabilities. So at a high level, if you have some knowledge about
25:38 what is the probability of seeing certain features in email, given that an email is spam, you can use that
25:45 data to produce estimates of what is the probability that an email is spam, given that you see those
25:50 features. And so typically, if you have a lot of, you know, spam and non-spam, you can make estimates
25:56 of, okay, given that an email is spam, it's likely that I'll see Viagra, and it's likely that I'll see
26:02 "get rich quick." And then it's basically just a classifier that turns that around and allows you to now look for
26:07 these features and, in a mathematically rigorous way, come up with the probability, okay, here's the
26:11 probability that this is spam. Okay. Yeah. Very interesting. What other types of problems?
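[Editor's sketch] Turning "probability of these words given spam" into "probability of spam given these words" can be sketched as a tiny naive Bayes classifier. The training messages are invented, and the pseudocount `k` (a smoothing term so unseen words do not produce zero probabilities) is an addition not mentioned in the conversation. Words are treated as independent of each other, which is the "naive" assumption.

```python
import math
from collections import Counter


def train(messages):
    """Count, per class, how many messages contain each word.

    messages is a list of (text, is_spam) pairs.
    """
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_spam in messages:
        class_counts[is_spam] += 1
        for word in set(text.lower().split()):
            word_counts[is_spam][word] += 1
    return word_counts, class_counts


def spam_probability(text, word_counts, class_counts, k=1.0):
    """P(spam | words) via Bayes' rule, treating words as independent."""
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    message_words = set(text.lower().split())
    log_score = {}
    for is_spam in (True, False):
        total = class_counts[is_spam]
        # Start from the prior P(class).
        score = math.log(total / sum(class_counts.values()))
        for word in vocabulary:
            # Smoothed estimate of P(word appears | class); k is a pseudocount.
            p = (word_counts[is_spam][word] + k) / (total + 2 * k)
            score += math.log(p if word in message_words else 1 - p)
        log_score[is_spam] = score
    spam, ham = math.exp(log_score[True]), math.exp(log_score[False])
    return spam / (spam + ham)


# Made-up training data.
training = [
    ("cheap viagra now", True),
    ("get rich quick", True),
    ("meeting notes attached", False),
    ("lunch tomorrow", False),
]
word_counts, class_counts = train(training)
print(spam_probability("viagra cheap", word_counts, class_counts))
print(spam_probability("lunch meeting", word_counts, class_counts))
```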
26:15 Spam is certainly easy to understand, but you know, hopefully that's solved by Gmail and other people
26:22 for us, right? I think it was Paul Graham who wrote a pretty influential article about this approach,
26:27 probably about 15 years ago now. And so I think a lot of the tools, I don't know if it's what
26:33 Gmail uses, but the ones that SpamAssassin and some of the other mail providers used, are still
26:37 based on this principle of naive Bayes. But basically, anything where you have kind of big clumps of
26:44 say text, and you want to classify it into one or more classes, like two or more classes, and you're
26:52 willing to make this really big kind of technical assumption that the words in it are kind of independent
26:58 of each other. So you're treating them kind of like a bag of words rather than as a sequence of words,
27:02 then that's where you would use this kind of model. Okay. Yeah, cool. So maybe the opposite of that,
27:08 where you treat words sort of as, as having more meaning is in natural language processing, which is
27:14 another thing you talk about, right? Natural language processing is, is a huge field. Like people,
27:18 there are textbooks about it. In fact, there are Python books about it. So cramming it into a chapter of a
27:25 book is really just kind of, here's a really high level overview. And here's, you know, a couple of
27:31 examples to give you a flavor. Natural language processing is actually relevant to the sort of
27:37 problems that I'm getting into in artificial intelligence. It gets used a lot by all the voice recognition in
27:43 your phone, and a lot of these new, basically reading texts, and having a computer extract information from it. So it's a
27:52 pretty important, pretty, pretty rich field. Yeah. And there's some pretty decent Python libraries for
27:57 doing that, right? Yeah. So there's a, so the big one is called NLTK, the natural language toolkit,
28:04 I think it stands for. And so it has its own book just about natural language processing in that library.
28:10 And I think that's actually free on the web. But yeah, that would be a great place to start if you
28:15 want to come to understand it more deeply. There's also a nice Coursera course on natural language processing
28:20 I took several years back that I liked. Oh, okay. Yeah, excellent. So let's see. Another topic you
28:25 covered was decision trees. And what's the story of these? A decision tree is kind of what it sounds
28:31 like. Yeah. You know, intuitively, you might have a model of making decisions based on a tree. So I have,
28:39 I want to know whether to buy a certain car or not. So, you know, I might start asking a question. Okay,
28:44 is it an American car or a Japanese car? American car. Okay. Go to the next question. Does it have
28:49 four doors or two doors? It has two doors. Okay. Is it gas or diesel? Diesel. And then you just have a
28:55 sequence of questions to ask and you classify it based on those questions. Okay. It's diesel. That
29:00 means, yes, I want to buy it. And so conceptually, you know, a tree like that is not that complicated.
29:06 The interesting part is, given some set of data, how do I build such a tree? And there, there's a variety of
29:13 algorithms, but the one that we talk about in the book is a pretty simple one, which is basically
29:18 around, okay, I have a bunch of data. It has a bunch of attributes. I can split my data on each
29:24 attribute. So I could say, okay, at this stage, I want to look at two door versus four door. I want to
29:28 look at Japanese versus American, or I want to look at gas versus diesel and which of those choices is
29:35 going to allow me to kind of really separate the good buys from the bad buys the most. So if gas versus
29:43 diesel totally splits where diesel is the buys and gas is the don't buys, that's a great thing to choose.
29:50 If gas versus diesel splits where, you know, on gas, I want to buy half and not buy half, and on diesel,
29:56 I want to buy half and not buy half, that's not a good split to choose because it doesn't really
29:59 help me at all. And so the mathematics here are just around kind of making this precise with this
30:05 entropy and building these trees from the data. Okay. Yeah, that's very cool. The way where you
30:11 build them knowing the data intimately, you know, that, that makes a lot of sense. And you don't really
30:17 necessarily need data science to do that, right? That's just helping, you know, asking a few questions
30:23 and making a decision. But the reverse, I think is pretty interesting.
30:27 They build some of these expert systems in this kind of way. I think they apply them sometimes in like,
30:31 medical diagnosis and sometimes they do better than doctors.
30:35 Wow. Okay. Sometimes just let the data talk, huh? Rather than intuition.
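The split-choosing idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the car data, the `buy` label, and the function names are all made up for the example. It computes the entropy of the labels left over after splitting on each attribute and picks the attribute with the lowest value.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def partition_entropy(rows, attribute, label_key="buy"):
    """Weighted average entropy of the labels after splitting on one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[label_key])
    total = len(rows)
    return sum(len(group) / total * entropy(group) for group in groups.values())

def best_split(rows, attributes, label_key="buy"):
    """Pick the attribute whose split leaves the least uncertainty."""
    return min(attributes, key=lambda a: partition_entropy(rows, a, label_key))

# Hypothetical data: diesel cars are buys, gas cars are not.
cars = [
    {"origin": "American", "doors": 2, "fuel": "diesel", "buy": True},
    {"origin": "Japanese", "doors": 4, "fuel": "diesel", "buy": True},
    {"origin": "American", "doors": 4, "fuel": "gas", "buy": False},
    {"origin": "Japanese", "doors": 2, "fuel": "gas", "buy": False},
]

print(best_split(cars, ["origin", "doors", "fuel"]))
```

In this toy data set, splitting on fuel separates the buys from the don't-buys completely, while origin and doors each leave a half-and-half mix, so fuel is the attribute chosen.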
30:40 This episode is brought to you by Hired. Hired is a two-sided, curated marketplace that connects the
30:56 world's knowledge workers to the best opportunities. Each offer you receive has salary and equity presented
31:01 right up front, and you can view the offers to accept or reject them before you even talk to the company.
31:07 Typically candidates receive five or more offers within the first week, and there are no obligations
31:11 ever. Sounds awesome, doesn't it? Well, did I mention the signing bonus? Everyone who accepts a job
31:16 from Hired gets a thousand-dollar signing bonus. And as Talk Python listeners, it gets way sweeter.
31:22 Use the link hired.com/talkpythontome and Hired will double the signing bonus to $2,000.
31:29 Opportunity is knocking. Visit hired.com/talkpythontome and answer the call.
31:32 Somewhat related to that maybe is, you talked about neural networks as well.
31:42 Yeah. So neural networks are a super hot topic right now, because I'm sure you've probably heard
31:47 all the buzz about deep learning, right? And deep learning tends to have neural networks
31:54 at the root of it. So neural networks are basically a way of building up kind of layers and layers of
32:01 representations. And this book doesn't really go into deep learning because
32:06 a single neural network is hard enough to do by hand or from scratch. But basically it's a way to
32:12 kind of build a classifier that works similar to how a toy model of a brain might work. So you have
32:20 artificial neurons, and each neuron has a bunch of inputs that go into it with weights. If the weighted
32:26 sum of the inputs exceeds some certain threshold, the neuron fires. And if it doesn't exceed that
32:30 threshold, the neuron doesn't fire. So you present an input, which could be like a, an image. So basically
32:36 a bunch of zeros and ones, and that causes some neurons to fire. And that propagates through this
32:42 network. And in the end it will spit out, you know, I think you showed me an image of a cat,
32:46 or I think you showed me an image of a dog. And you train it by showing it a lot of labeled images
32:52 and adjusting the weights based on how you got them wrong.
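A single artificial neuron of the kind described here can be sketched as follows. This is an illustrative toy, not the book's implementation: it uses the classic perceptron update rule (nudge each weight in proportion to the error) rather than the sigmoid neurons and backpropagation a real network would use, and the AND-gate training data and function names are made up for the example.

```python
def fires(weights, bias, inputs):
    """Return 1 if the weighted sum of the inputs plus the bias exceeds zero, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train(examples, n_inputs, epochs=20, rate=1.0):
    """Adjust weights from labeled examples using the perceptron rule."""
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            # error is +1, 0, or -1 depending on how the neuron got it wrong
            error = target - fires(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

# Hypothetical labeled data: the neuron should fire only when both inputs are 1.
and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(and_examples, n_inputs=2)

for inputs, target in and_examples:
    print(inputs, fires(weights, bias, inputs), target)
```

The bias plays the role of the threshold: firing when the weighted sum exceeds some threshold is the same as firing when the weighted sum plus a negative bias exceeds zero.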
32:55 Yeah. Interesting. So you basically just say, these are the inputs and the decisions, and you feed it
33:02 known data. You say, you know, like, here's a cat, here's a cat, that's a dog. It's just a cat or a dog,
33:07 right? And then you ask that question. I was gonna say, well, one of the problems that I sort
33:11 of remember from neural networks is that they, as you try to get them more accurate,
33:15 they get over-trained to just only do little bits of stuff. And so how do things like decision
33:21 forests or these things and this deep learning, how does it deal with that?
33:26 So random forests are basically taking multiple decision trees and combining them. So there's this
33:33 kind of general principle where a lot of times rather than building one model that's like super predictive,
33:40 I take a lot of much less predictive models and just kind of let them vote on things or average
33:47 the results or whatever. So you're right that if I build a decision tree and I include like all
33:52 hundred features in my data set, there's a good chance I'm going to like overfit and
33:57 overlearn my training data and not generalize outside of that. So one thing that happens with decision
34:04 trees is people often don't use the bare decision trees. They use the random forest where they'll build
34:10 a bunch of smaller decision trees, each of which is really restricted to a small subset of features
34:16 so that each one is individually less powerful. But then when you combine them, they do well in the aggregate
34:22 and they don't necessarily have the same overfitting problem that a single decision tree with lots of features would.
34:28 So in neural networks, especially in deep learning, there are not exactly the same techniques, but other
34:34 techniques where you will zero out some of your weights sometimes and train without them to make
34:40 sure that they don't learn too much. And there's a lot of other techniques that get used in order to
34:44 make sure you're not overfitting. That's something that, you know, data scientists and machine learning
34:47 people worry about a lot.
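The "let them vote" principle can be shown with a very small sketch. The three "trees" below are hypothetical one-question rules, each restricted to a single feature; a real random forest would learn many such trees from random subsets of the data and features rather than hard-coding them.

```python
from collections import Counter

def majority_vote(models, row):
    """Ask every model for a prediction and return the most common answer."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical weak models, each looking at only one feature of a car.
def by_fuel(car):
    return car["fuel"] == "diesel"

def by_doors(car):
    return car["doors"] == 2

def by_origin(car):
    return car["origin"] == "American"

forest = [by_fuel, by_doors, by_origin]  # odd count, so a yes/no vote cannot tie

car = {"origin": "Japanese", "doors": 2, "fuel": "diesel"}
print(majority_vote(forest, car))
```

Here two of the three rules say "buy" and one says "don't buy," so the combined prediction is "buy" even though no single rule is very reliable on its own.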
34:48 It's a really hot field, like you said right now. That's awesome. So one more section I'd like to touch
34:54 on is you talked about this thing called recommender systems.
34:58 So they're just what they sound like. Anyone who's used the internet has had the experience
35:02 where you go to Netflix and it says movies for Joel and it's trying to predict what it thinks I'll like.
35:09 Or you go to Amazon and it will recommend items for me. And you can go a little further where you have
35:16 all these startups like Stitch Fix is a very hot data science startup where you tell what kind of clothes
35:22 you like and they'll send you a box of clothes every month that they think you'll like a lot.
35:28 And so generating these kind of recommendations is a pretty popular task within data science. A lot of data scientists
35:36 work on these kind of problems. There's a lot of data scientists work in this job. So they have to sell you stuff
35:41 and they always want to sell you stuff that they think you'll like.
35:44 You convert better if you figure out what people actually might want, right?
35:47 Yeah, I mean, well, if Amazon sends me an email, "Hey, Joel, you know, these five things are on sale,"
35:54 and they're five things that I really want to buy, then that's much better for them and potentially
35:58 for me than if they send me a random email that's like, "These are the five most popular things on
36:03 Amazon today." Yeah, it definitely is better. I really like going to my Kindle and looking at
36:08 the recommended things based on what I've been reading. But the Netflix recommendations, I know they
36:14 do really great work, but it just doesn't work for me because my seven-year-old daughter watches
36:19 Strawberry Shortcake and other random things. I get a lot of kid shows recommended to me.
36:25 They have a "Who's watching" button that you can click and say "A kid is watching."
36:30 I know, but my daughter won't use it. She'll just randomly pick one.
36:33 Yeah, so you know, it's funny. I use that button pretty well, but my Netflix recommendations are also
36:40 not that good. But I think it's just because I don't like anything on there. So no matter what they're
36:45 telling me, I'm not going to like it. Yeah, I hear you. I feel like you covered a pretty
36:49 wide swath of data science from scratch. Some of the topics were really accessible. Some of them
36:57 required more math and more thinking, but it was all a really nice presentation. What do you feel like
37:04 you left out? There aren't any topics that, if anything, I kept adding stuff while I was writing it.
37:11 So I don't feel like there were any huge topics that I necessarily left out. But in terms of coverage,
37:17 probably my biggest regret is that I used Python 2 instead of Python 3.
37:20 Yeah. Okay. And I saw on your GitHub repo, you had something about Python 3 in there. Is that right?
37:27 Yes. So pretty much as soon as the book came out, one, I got a lot of emails from people saying,
37:32 "Hey, why do you not use Python 3?" And then I also got a number of emails saying,
37:37 "I would like to use Python 3. Will the code work?" And so I wrote them back and I said,
37:41 "Yeah, you know, I don't feel like the code shouldn't work. Give it a try. I bet it works
37:46 with probably a few changes, add some parentheses to print statements and so on." And then eventually one
37:51 guy wrote me back and he said, "It doesn't work." I said, "Okay." So I sat down and I said, "I'm going to
37:57 convert the code to Python 3." And it took me about, I'd say four to five hours. And that's with me
38:04 knowing the code intimately. So it would have taken someone who hadn't written the code in the first
38:09 place a lot longer than that. And then I felt kind of guilty that I've been telling all these people that
38:12 it was so easy to do when it wasn't.
38:14 Just spend a week. It'll be fine.
38:16 Yeah. But so yeah, I have the Python 3 versions of the code up on the GitHub. I sort of regret not
38:23 having just done it that way in the first place.
38:25 Yeah, sure. So at the end, you talked about data science, not from scratch, and you pointed out
38:31 a lot of the libraries you might actually use, like NumPy and so on. Do you want to talk about that
38:37 really quick? Like what's the real data science versus the from scratch data science comparison?
38:43 Yeah. So I would say that NumPy is pretty fundamental. That's basically the linear algebra
38:51 library for Python. So it provides you matrices, matrix algebra, high performance arrays, things
38:58 like that that you don't get just built in. And you might not use it directly, but a lot of the other
39:05 libraries are really built on top of it. So kind of the most broadly accessible machine learning
39:11 library for Python is called scikit-learn. And it has really nice documentation and really nice tutorials
39:16 and a fairly standard API for building machine learning models. Anytime you want to build a
39:21 regression model or a random forest model or any kind of classifier, that's probably the place to go.
39:27 There's also Pandas, which is the data frame library, which is good if you're working with tabular data. So
39:34 not necessarily the machine learning side of data science, but more of the, I have kind of a
39:39 spreadsheet data set and now I want to clean it and aggregate it and pivot it and look for kind of data
39:48 analysis type insights.
39:49 Yeah. If you're exploring, like you said, tabular data and you want to kind of load it up and clean it,
39:56 Pandas seems really fantastic for that.
39:57 It's a really nice library. And then kind of the new kid on the block is TensorFlow, which is
40:03 Google's deep learning library. And it was only released, you know, a few months ago and it's gone
40:09 through. It's not 1.0 yet, but it seems like people are sort of converging around it as that's how they're
40:15 going to do deep learning in Python. Now there are other sort of previous libraries that some people
40:20 used and still use, but TensorFlow seems to be gaining a lot of mindshare.
40:24 Okay. Yeah, that's really cool. And I've definitely seen TensorFlow talked about a lot in this context,
40:30 but it's just a library. You can download that and run it locally. It's not like a cloud type thing,
40:35 right?
40:35 Yes. So currently it's a local library and I'm not sure if they've come out with a version that you can
40:42 kind of do your own cloud, like on various AWS instances. But, but I know they're definitely going
40:47 in that direction because a lot of this stuff, these deep learning models take a really long time to
40:52 train. And so if you want to use them for anything serious, you want to distribute them and
40:56 throw them on high-powered machinery and not just run them on your laptop.
40:59 Yeah, of course. If you have tons of data, maybe it's better to get a bunch of machines
41:03 for an hour. So what do you think about these cloud learning or data science platforms? I'm thinking like
41:09 Azure machine learning, or, you know, I just had SigOpt on the show a few shows ago. I'm not really sure
41:15 what else is out there in terms of like go out to the cloud and grab some data science stuff.
41:20 What do you think about that?
41:21 I haven't spent much time looking at any of those. I think they can add value in terms of either one,
41:31 if you don't have the data science or machine learning expertise in house to do whatever you need to do.
41:37 Or two, if you have some kind of model that you've built, but you need help either putting it into
41:43 production or operationalizing it somehow, they can fill a pretty good role there. But my sense is
41:50 that most people doing data science are aligned more on running the libraries and running the models
41:56 themselves. But that could be just my biased sample of the people who I talk to.
42:00 Yeah, of course, of course. Cool. So another thing that you said you're into is taking some of the
42:07 ideas from Haskell and thinking about how those might manifest in Python. What are you up to there?
42:12 Haskell, if you're not familiar with it, is kind of the purest of the pure functional
42:19 languages, with strong types and lazy evaluation and things like that. And so once I spent some
42:28 time in that world, I spent a lot of time thinking about how can I bring some of these concepts back
42:33 into Python. And in Python 3, lazy evaluation plays a much bigger role in the sense that like range
42:42 is a generator instead of a list and all the map and filter and things like that also are generators
42:48 instead of lists. But I started getting into the itertools library, which starts giving you tools for
42:55 generating basically infinite sequences and just trying to see how far I could go using
43:01 pure functions and infinite sequences and avoiding mutable variables and other things that you try not
43:07 to do when you're working in a Haskell-like language.
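To give a flavor of that style, here is a small sketch using `itertools`. It is an illustration, not code from the book or from Joel's experiments, and the `iterate` helper is a made-up name borrowed from the Haskell function of the same name. Strictly speaking, in Python 3 `range` returns a lazy sequence object and `map` and `filter` return iterators, but the effect is the same: nothing is computed until you ask for it.

```python
from itertools import count, islice

# An infinite, lazy sequence of squares: 1, 4, 9, 16, ...
squares = map(lambda n: n * n, count(1))

# Still infinite and still lazy: only the even squares.
even_squares = filter(lambda s: s % 2 == 0, squares)

# Nothing has been computed yet; islice pulls just the first five values.
first_five = list(islice(even_squares, 5))
print(first_five)  # [4, 16, 36, 64, 100]

def iterate(f, x):
    """Yield x, f(x), f(f(x)), ... forever."""
    while True:
        yield x
        x = f(x)

# Fibonacci numbers as an infinite sequence of (current, next) pairs.
fib_pairs = iterate(lambda pair: (pair[1], pair[0] + pair[1]), (0, 1))
fibs = (current for current, _ in fib_pairs)
first_eight = list(islice(fibs, 8))
print(first_eight)  # [0, 1, 1, 2, 3, 5, 8, 13]
```

The pipeline reads a bit like Haskell, but as noted in the conversation, longer versions of this style can get hard to follow in Python.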
43:10 Yeah. And how did you feel like it came out in the end? Do you feel like you were able to bring a lot
43:14 of those ideas over?
43:15 I was, and I ended up producing code that was really neat and really impenetrable. The lazy
43:23 infinite sequences stuff, that was more almost academic in terms of like, yes, I managed to do it. Like,
43:29 this is mathematically interesting and it works well, but it's not readable at all.
43:34 So imagine you wanted to represent, let's say a binary tree in Python. Kind of the two,
43:41 I would say obvious approaches would be one to make some kind of like class where it has a,
43:47 you know, a value element and a left element and a right element. And then you also might just use
43:53 a dictionary to represent it where it had those keys. In a language with algebraic data types,
43:57 like Haskell, you would just basically represent that kind of tree as a product type where it just
44:04 has like three elements. And so I said, you know, what if I just represented a tree as a tuple with
44:09 three elements where the first element is the left subtree, the second element is the value,
44:13 the third element is the right subtree. And similarly, if you want to do like linked lists in Python,
44:17 which you probably don't want to do, but if you did want to do linked lists in Python, you just treat
44:21 them as a tuple. First element is the element and the second element is the tail linked list. And so,
44:28 I actually found that I was able to write some pretty nice code using those kinds of ideas.
44:32 And I did some coding interviews that way. I'm not sure the interviewers appreciated it.
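The tuple representations described here might look something like the following. This is a sketch under assumptions: using `None` for the empty tree and the empty list, and the helper names `insert`, `in_order`, `cons`, and `to_list`, are choices made for the example rather than anything stated in the conversation.

```python
from functools import reduce

# A tree is either None (empty) or a tuple (left_subtree, value, right_subtree).
def insert(tree, value):
    """Return a new binary search tree with value added; the old tree is untouched."""
    if tree is None:
        return (None, value, None)
    left, v, right = tree
    if value < v:
        return (insert(left, value), v, right)
    return (left, v, insert(right, value))

def in_order(tree):
    """List the values of the tree from smallest to largest."""
    if tree is None:
        return []
    left, v, right = tree
    return in_order(left) + [v] + in_order(right)

tree = reduce(insert, [5, 2, 8, 1], None)
print(tree)            # (((None, 1, None), 2, None), 5, (None, 8, None))
print(in_order(tree))  # [1, 2, 5, 8]

# A linked list is either None (empty) or a tuple (element, tail).
def cons(element, tail):
    return (element, tail)

def to_list(linked):
    """Walk the linked list and collect its elements into a Python list."""
    result = []
    while linked is not None:
        element, linked = linked
        result.append(element)
    return result

linked = cons(1, cons(2, cons(3, None)))
print(to_list(linked))  # [1, 2, 3]
```

Because tuples are immutable, every `insert` builds a new tree that shares the unchanged subtrees with the old one, which is the same trade-off you get with algebraic data types in Haskell.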
44:38 What is this guy talking about? Yeah, cool. So one question, given your work at the AI Institute and
44:47 your background in data science and so on, I wanted to ask: in the last year or so, there's been a lot of
44:54 news items and people coming out saying that artificial intelligence is a danger to humanity.
45:01 What's your thought? Is like AI something we should be super excited about? Or is it something we should
45:07 be maybe cautious about? I would say I probably fall in the middle. I mean, I'm excited about it because
45:12 this is my job, but I don't go around encouraging everyone else that they have to be excited about it
45:18 because I don't know that that's necessarily warranted. But by the same token, I don't spend a lot of time
45:23 worrying about how dangerous it is. I think we're pretty far off from the time when we have to worry
45:29 about that. And I do have some friends who think we should worry about it now before it's too late. But
45:35 I think there's a lot more important things to worry about in the world when I read the news.
45:39 Yeah, I kind of agree with you on that. And I think there's two, certainly two ways to look at that. On
45:47 one hand, if you think about things like self-driving cars, let's just take that as an example. Like,
45:53 I believe one of the biggest job categories for men in the United States is some form of driving,
46:02 like driving a truck, driving a taxi, those types of things, delivery vehicles and so on. And if self-driving
46:08 cars were to like remove all that, like that would have large social effects. But I think, you know,
46:13 that's not so much the way that people, at least recently in the news, were talking about it. It's
46:19 more like Terminator style, right? And so that I, I'm not too worried about this personally. Who knows?
46:26 Yeah. I mean, I personally would pay a lot of money for a self-driving car because I,
46:30 I don't like driving that much. And I'd much rather be able to read while I'm going somewhere.
46:35 Yeah. Driving is fun until you get stuck on I-5 for half an hour inching along. Then, you know,
46:40 I don't like driving anymore. Exactly.
46:42 Awesome. Well, we're kind of coming up near the end of the show. Let me ask you just a
46:46 few closing questions. I always ask my guests: if you're going to write some Python code,
46:50 what editor do you open up? So these days I tend to use Atom for pretty much everything,
46:55 memory leaks and all. Nice. Yeah. That's from GitHub, right?
46:58 Yeah. It's pretty similar to Sublime, but it doesn't nag you for money. So yeah, it's,
47:03 it's pretty nice. And I think it's atom.io. They have a really cool little video about how
47:08 it's the editor of the future. It's nice.
47:10 Yeah. I don't know if I go that far, but it's the editor of the present at least.
47:14 Yeah. It's like a George Jetson sort of like a promo video. It's, it's pretty funny.
47:18 It is pretty nice. Isn't it? Does it have good Python support?
47:21 You know, I'm never, I'm not someone who leans on like IDE functionality a lot. So if good Python
47:28 support counts as syntax highlighting, yes, but that's all I tend to use it for. So yeah.
47:32 Yeah. Yeah. Okay. Cool. And if you look on the Python Package Index, there's,
47:38 you know, 75 plus thousand packages and we all have experience with, you know, different parts
47:46 of it and there's things that we love and would recommend, like, what is your favorite one you
47:49 might recommend that people maybe don't know about?
47:51 So the one I recommend that people don't necessarily know about is called Beautiful Soup. It's,
47:57 basically an HTML parsing library. And so if you start scraping data from webpages, you're going to get
48:04 a big mess of ugly HTML that's probably not even well formed. Most of the time,
48:10 most people don't bother to well form their HTML.
48:12 Are you telling me that I can't just like load that up as an XML document or something like this?
48:17 No, I'm just kidding. Of course it's, it's terrible trying to work directly on the web,
48:21 right? And Beautiful Soup is really, I really like it as well.
48:25 It's really nice. I mean, you have to spend a little bit of time getting used to
48:28 its API and interface and everything, but it's super handy for getting data out of webpages and
48:34 doing anything where you have to do a bunch of scraping.
48:37 And you cover that in your book, right? In the getting data, I use Beautiful Soup a little bit and show people how to use it.
48:42 Yeah, I cover it a little bit. It's a nice addition to the data scientist toolkit.
48:46 Nice. So Joel, how do people find your book? Amazon? Just a Google search for data science
48:52 from scratch? Yeah, it's on Amazon and you can buy it from O'Reilly.com. But yeah,
48:57 if you Google data science from scratch, you'll find it.
48:59 Cool. And I'll put a link to the GitHub repo where you have all the code examples and so on as well.
49:04 It's been really fun to talk about data science. And I think you have a really interesting way of teaching
49:10 people to appreciate the tools that we're all fairly familiar with by showing you how to build it from
49:16 scratch. So thanks for that.
49:19 My pleasure.
49:21 This has been another episode of Talk Python to Me. Today's guest was Joel Grus and this episode
49:25 has been sponsored by SnapCI and Hired. Thank you guys for supporting the show. SnapCI is modern,
49:30 continuous integration and delivery. Build, test and deploy your code directly from GitHub,
49:34 all in your browser with debugging, Docker and parallelism included. Try them for free at snap.ci
49:39 slash talk.com. Hired wants to help you find your next big thing. Visit hired.com/talkpythontome to get five or
49:45 more offers with salary and equity presented right up front and a special listener signing bonus of $2,000.
49:50 Are you or a colleague trying to learn Python? Have you tried books or videos that left you bored by just
49:56 covering the topics point by point? Check out my new online course, Python Jumpstart by Building 10 Apps, at
50:01 talkpython.fm/course to experience a more engaging way to learn Python. You can find links
50:07 from this show at talkpython.fm/episodes/show/56. Be sure to subscribe to the show. Open
50:14 your favorite podcatcher and search for Python. We should be right at the top. You can also find the
50:18 iTunes feed at /itunes, the Google Play feed at /play and the direct RSS feed at /rss on
50:25 talkpython.fm. Our theme music is Developers, Developers, Developers by Corey Smith, who goes by Smix.
50:31 You can hear his entire song at talkpython.fm/music. This is your host, Michael Kennedy.
50:36 Thank you so much for listening. Smix, take us out of here.
50:39 Stating with my voice. There's no norm that I can feel within. Haven't been sleeping. I've been using lots of rest. I'll pass the mic back to who rocked it best.
50:49 I'm first developers.
50:51 I'm first developers.
50:58 Developers, developers, developers, developers.