Data Science from Scratch
This is a discipline that combines the scientific inquiry of hypotheses and tests, the mathematical intuition of probability and statistics, the AI foundations of machine learning, a fluency in big data processing, and the Python language itself. That's a very broad set of skills you'll need to be a good data scientist, and yet each one is deep and often hard to understand.
That's why I'm excited to speak with Joel Grus, a data scientist from Seattle. He wrote a book to help us all understand what's actually happening when we employ libraries such as scikit-learn or numpy. It's called Data Science from Scratch and that's the topic of this week's episode.
Links from the show:
Book: Data Science from Scratch: amzn.to/1rhcbdT
Joel on Twitter: @joelgrus
Joel on the web: joelgrus.com
Partially Derivative Episode: partiallyderivative.com
Allen Institute for Artificial Intelligence: allenai.org
Data Science Libraries
numpy: numpy.org
Numpy episode: #34: Continuum: Scientific Python and The Business of Open Source: talkpython.fm/episodes/show/34
pandas: pandas.pydata.org
scikit-learn: scikit-learn.org
scikit-learn episode: #31: Machine Learning with Python and scikit-learn: talkpython.fm/episodes/show/31
matplotlib: matplotlib.org
Google's TensorFlow: tensorflow.org
Episode Transcript
Collapse transcript
00:00 You likely know that Python is one of the fastest growing languages for data science.
00:03 This is a discipline that combines the scientific inquiry of hypotheses and tests,
00:07 the mathematical intuition of probability and statistics, the AI foundations of machine learning,
00:12 a fluency in big data processing, and the Python language itself. That's a very broad set of skills
00:18 you'll need to be a good data scientist, and yet each one is deep and often hard to understand.
00:23 That's why I'm excited to speak with Joel Grus, a data scientist from Seattle. He wrote a book to
00:28 help us all understand what's actually happening when we employ libraries such as scikit-learn or
00:32 numpy. It's called Data Science from Scratch, and that's the topic of this week's episode.
00:36 This is Talk Python to me, episode number 56, recorded April 19th, 2016.
00:56 Welcome to Talk Python to me, a weekly podcast on Python, the language, the libraries, the
01:11 ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm at
01:16 mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on
01:22 Twitter via at talkpython. This episode is brought to you by SnapCI and Hired. Thank them for supporting
01:28 the show on Twitter via at snap underscore CI and at Hired underscore HQ. Hey, everyone. It's great to be
01:36 back with you. I have something to share before we get to the main part of the show. I had a chance to be on the
01:40 Partially Derivative podcast this week, where we did a short segment on programming tips for data scientists,
01:46 relevant to this episode, wouldn't you say? I talked about using generator methods and the yield
01:51 keyword to dramatically improve performance while processing lots of data in a pipeline.
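That generator-pipeline idea can be sketched in a few lines (the stage names and sample data here are invented for illustration, not from the episode):

```python
# Each stage is a generator, so rows stream through the pipeline
# one at a time instead of all being loaded into memory at once.
def read_rows(lines):
    for line in lines:
        yield line.strip().split(",")

def keep_valid(rows):
    for row in rows:
        if len(row) == 2:   # drop malformed rows
            yield row

def to_floats(rows):
    for name, value in rows:
        yield name, float(value)

lines = ["a,1.5", "bad line", "b,2.5"]
print(list(to_floats(keep_valid(read_rows(lines)))))  # [('a', 1.5), ('b', 2.5)]
```

Because nothing runs until the final `list` (or a `for` loop) pulls values through, the same code handles three lines or three million without changing memory use.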
01:55 If you want to hear more about that kind of stuff, check out the link to the Partially Derivative episode
01:59 in the show notes. Now, let's talk to Joel. Joel, welcome to the show. Thanks. I'm glad to be here.
02:04 Yeah, I'm really looking forward to doing data science, but from scratch today.
02:09 I know. That's my book. I'm the right person to be here for that topic.
02:12 Fantastic. So we're going to dig into data science. And I think your idea of
02:16 taking it from the fundamentals and building something that's not so complicated or so well
02:22 polished and optimized that you can really understand what you're doing is great. But before we get into
02:26 that, let's just sort of start from the beginning. How did you get into programming in Python?
02:31 So if you go back about 10-ish years or so, I was in grad school for economics. I took a class called
02:39 probability modeling, which was a lot of simulating Markov chains and doing Monte Carlo simulations and
02:45 things like that. And the class was actually taught in MATLAB. School had a site license for MATLAB that
02:50 only worked on campus. And so that meant that if I went home and worked at home, which I like to do,
02:55 I couldn't use MATLAB. And so as an alternative, I found that I could use Python and NumPy to basically
03:02 do all the MATLAB-y things without that site license. So that's actually why I started being
03:08 into Python. And then I just kind of liked it and stuck with it.
03:10 Yeah, I can definitely see how you would like working in Python more than MATLAB. I've
03:14 spent my fair share of time in the .m files and it's all right, but it's not Python.
03:20 You were studying economics and that's kind of how you got interested in data science,
03:25 trying to answer these big questions. The subject of economics or some other way?
03:30 I followed sort of a convoluted path. My background was math and economics.
03:34 And I actually started off doing quantitative finance, options, pricing, mathematics of financial
03:39 risk. I worked at a hedge fund. And then when the hedge fund went out of business, I just kind of
03:43 lucked into this job at a startup called Farecast, where I was doing a lot of kind of BI work,
03:50 writing SQL queries, building dashboards. And over the years, I just sort of moved more and more in a
03:58 data science kind of direction and eventually became a data scientist and software developer.
04:02 And certainly the math and economics training was helpful for that. But it was
04:06 it was a career that I sort of ended up in by accident rather than through a deliberate plan.
04:11 That's a little bit like my my career path as well. Studied math and just sort of had to learn the
04:17 programming and stumbled into it. That's cool. What do you do day to day now?
04:21 So I just started a new job. I work at a nonprofit. It's called the Allen Institute for Artificial
04:27 Intelligence. It was founded by Paul Allen, who was one of the Microsoft founders. And we're
04:33 basically doing kind of fundamental AI research. I can't tell you that much about what I do. So
04:38 it's really my second week and I'm still kind of just learning what goes on. I'm learning my way
04:43 around. But I work on a team called Aristo, which is basically building AI to take science quizzes and
04:51 to understand science. Oh, very interesting. Yeah, that sounds like a fascinating place to work.
04:56 It's really neat. Yeah. A lot of smart people, interesting problems.
05:00 Would you say it's different than working in a hedge fund or is it surprisingly similar?
05:03 No, it's totally different. I mean, it's similar in that there's a lot of smart people,
05:07 but it's not really similar in any other way. Yeah. The goals are not necessarily the same,
05:11 are they? The goals, the incentives, the sort of day-to-day stress levels, it's all different.
05:16 Yeah. Yeah. Sounds great. So let's talk about your book a little bit. It's called Data Science
05:21 from Scratch. And I think that's a really cool way to approach it. We have all these super polished
05:28 data science tools. You know, we have them in Python, we have them in R and other languages as well.
05:34 Why from scratch, rather than just grabbing one of these libraries or a set of these libraries and
05:38 talking about how to use them? So a couple of reasons. One is, as I mentioned,
05:42 I come from a math background and the math way of doing things is you can't really use something
05:49 until you can prove it. I once had a teacher who came in and he looked at the syllabus for the
05:58 previous semester and he said, oh, good, you proved this theorem. That means I can use it in this class.
06:03 And so there's this real rigor around, you can't use things unless you understand them.
06:06 And that approach always kind of resonated with me. And so that, to a large degree,
06:11 that's the approach I took in the book. At the same time, you also have all these really powerful,
06:17 really easy to use libraries like scikit-learn, where you can go in and basically copy and paste,
06:24 you know, five lines of code. And you've built a decision tree or you've built a regression model.
06:29 And it's very easy to not know what you're doing. And you can pick up books and they'll tell you what
06:34 commands to type. But you can also type those commands and again, not know what you're doing.
06:39 And so I thought the book I would like, the book that I would have found valuable would be,
06:44 here's what these models are actually doing behind the scenes. And so then when it's time you apply
06:50 them, you kind of understand the principles, you understand where they go wrong, where they go right,
06:54 what they're good for, and what they're actually doing.
06:56 Yeah, I think that's, that's really interesting. And that is very much the mathematical way of,
07:01 if you can't prove it, you can't use it, sort of a way of thinking, which is not that common
07:07 in programming, you know, if there's an API, hey, just grab it and use it. But I think it's really
07:12 important in data science, because you have all these different disciplines and backgrounds coming
07:18 together to make it work, right? You're not just a pure programmer, and you're not, you know, a
07:24 mathematician or some sort of domain expert, right? You kind of have to blend these together.
07:28 Yep. And the other thing that I think ended up working pretty nicely about this approach was,
07:34 there's a lot of really mathy books about data science and machine learning, that's like,
07:38 here's an equation, here's an equation, here's an equation. But what I ended up doing was a little bit
07:43 of math. But then when it came time to, here's how something works, do it in working Python code.
07:49 So that it's rigorous in terms of here's all the steps laid out. But it's also, that's the code you
07:54 can run and sort of follow along.
07:57 Yeah, absolutely. Why Python and not something like R or some other language?
08:02 So the short answer is that, for whatever reason, R is not sympathetic with the way my brain works.
08:08 And whenever I try to sit down and start writing things in R, I just find that it doesn't work the way
08:13 I expect it to. The joke is that R is a language designed by statisticians for statisticians.
08:18 There's some truth to that, some unfairness to it. But for whatever reason, Python is much
08:24 friendlier to the way that my brain works and the way that I solve problems. Actually, the first draft
08:29 of my book was a lot harsher against R. And then some of the earlier reviewers didn't like that.
08:35 They really don't like that. So I kind of revised it to be more gentle.
08:39 Yeah, I think one of the things that's nice about Python is if you invest to learn data science
08:44 through Python, or the Python data science tools, rather, and then you want to build sort of a more
08:50 working application, you don't have to, you know, convert that to some other language, right? You're
08:55 already in Python. It's sort of a full stack thing you keep working with, right?
08:59 That's one important aspect. I think another important aspect is that from my perspective,
09:03 Python code, if written well, is super readable. And so you don't necessarily have to be a Python
09:11 expert to take a well written piece of Python code and understand what it's doing. Whereas other
09:16 languages can be a lot more impenetrable. And so I find it a nice teaching language from that
09:21 perspective as well.
09:22 Yeah, I would say, you know, most of the universities in the US came to that same conclusion,
09:27 right? With Python being the most common CompSci 101 course these days, as opposed to like Java or
09:33 Scheme, which I took.
09:34 Yeah, I mean, when I took, I also did Scheme when I took CompSci 101 way back in the day. And I actually
09:40 thought that was a really nice language for learning computer science. But I can see why Python would be
09:46 a better language for, you know, learning programming as opposed to computer science.
09:50 So when you start off your book, you actually have a few sections that are just about,
09:55 I'm going to teach you just enough Python to get started so that you can understand the data
10:02 science tools and what we're doing. What kind of stuff do you put in there? Like, what do you
10:06 consider fundamental for doing data science? And what can be sort of learned later?
10:10 The most important things I put in were obviously functions; if you want to do any kind of
10:16 analytic work, you're going to write Python functions. The other thing I really tried to emphasize,
10:20 was writing kind of clean Pythonic code. So I also did a lot of list comprehensions,
10:27 as well as understanding all the basic data structures. So lots of dicts and sets and some
10:34 of the less common ones like defaultdicts and Counters, I also make quite heavy use of.
10:40 So basically, anything you would need to understand, here's how to write out an algorithm in Python,
10:46 here's how to use the correct data structures in Python, I kind of baked into that initial
10:52 introduction. Whereas more advanced things like really specialized data structures like queues,
10:58 I just kind of introduced as needed later in the book, some of the more esoteric math functions,
11:03 I just kind of introduced when they came up.
11:05 I think obviously, nicer dictionaries, things like defaultdict and so on,
11:09 played a really important role in the subsequent pieces where you're trying to make concise,
11:15 clean statements. I thought your use of sort of list comprehensions and generator expressions,
11:23 along with the other functional pieces like zip and Counter and so on, were really,
11:30 really interesting. You write some very concise code without being unreadable. I thought that was
11:35 nice.
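For a flavor of the structures being discussed, here is a minimal sketch (the example data is mine, not from the book):

```python
from collections import defaultdict, Counter

words = ["spam", "eggs", "spam", "ham", "spam"]

# Counter tallies items in one line
counts = Counter(words)                # Counter({'spam': 3, 'eggs': 1, 'ham': 1})

# defaultdict removes the "does this key exist yet?" boilerplate
by_first_letter = defaultdict(list)
for word in words:
    by_first_letter[word[0]].append(word)

# zip pairs sequences; comprehensions keep transformations concise
pairs = dict(zip(["a", "b"], [1, 2]))  # {'a': 1, 'b': 2}
lengths = sorted(len(w) for w in set(words))
print(counts["spam"], by_first_letter["e"], pairs, lengths)
```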
11:35 That's one of the things I aim for. And as much time as you spend sort of revising the text of the book,
11:41 I also spent probably just as much time revising the code of the book. I went back over it many,
11:46 many times thinking, can I make this simpler? Can I make it cleaner? Can I make it easier to
11:50 understand? And I try and do that. I'm kind of a stickler for code elegance. It's kind of a
11:57 personal fault, maybe.
11:58 But also, you come out with very polished code. I found it very easy to follow. And I thought the
12:04 focus on this functional style programming was really neat. So one of the things you talk about
12:10 in the beginning is about visualizing data, because you have all these numbers. And obviously,
12:15 to understand them, putting a picture to it is very nice. And what is that? Matplotlib?
12:19 Yes. In the book, I only use matplotlib. I was just having a conversation with some of my data
12:25 science buddies about a week ago, where they were, as always, complaining that the data visualization
12:31 story in Python is still not very good.
12:33 Yeah. So you said that matplotlib was starting to show its age in your book. What did you mean by
12:38 that?
12:39 When I first started learning Python, like I said, more than 10 years ago, as a MATLAB replacement,
12:44 matplotlib was around back then, and it had the same features and the same interface, and it
12:51 produced the same not particularly attractive plots. So it hasn't kind of evolved as the rest of the
12:57 language has evolved, say. And there's been some projects like Seaborn that try and put some prettiness
13:03 on top of it. And there's been some other attempts to kind of bring the R-style ggplot into Python.
13:09 But from my perspective, none of them has really like won the mindshare and become the solution.
13:14 So everyone kind of does something different.
13:16 Yeah. You also talked about Bokeh, which is from the Continuum guys. I had them on show 34. Can you
13:23 maybe talk about that really quickly? Just what is it? What's it used for? That sort of thing.
13:28 It is another visualization library. And I have to be honest with you, I haven't really
13:32 checked it out since I was writing the book like more than a year ago. So I seem to remember that it
13:38 seemed to have some facility for building in interactive plots and things like that.
13:42 It's kind of D3, fancy graphics in the browser type stuff, right?
13:47 I believe so. But I haven't spent much time playing with it.
13:49 Yeah.
13:49 I haven't been doing a lot of, probably doing actually more D3 than Python visualization recently.
13:55 Right, right. Tell everyone what D3 is. I don't think we've talked too much about that on the show.
13:59 Oh, sure. So D3 is actually a JavaScript library for building interactive data visualizations.
14:06 It brings sort of an interesting model where you bind your data to DOM elements, which are quite often
14:12 SVG elements. And so what that makes it easy to do is to have a data set. And when you add new data to it,
14:21 your plot updates, and it allows a lot of really interesting interactivity. If you check out the
14:26 D3 website, they have a gallery of all sorts of amazing visualizations where you look at and think,
14:32 how on earth did they do that?
14:33 Yeah. Whenever I go to the D3 website, I'm like, wow, I want to use this. I don't have a use for it,
14:39 but it's fantastic. How can I just build stuff that looks like this?
14:42 Well, I mean, so I have a joke, which is that good data scientists copy from the D3 gallery and great
14:48 data scientists steal from the D3 gallery. And they give you the code for all of them. So I would say
14:53 probably close to 100% of the D3 visualizations I've ever built have been something I found in the D3
15:00 gallery and kind of tweaked until it fit my data.
15:02 Yeah. Yeah. Nice. So the next section you talked about was like the fundamentals of the math and
15:08 science that you need to know in order to be a data scientist. And I like the way that you put it.
15:14 You said these are like the cartoon versions of big mathematical ideas. How much math do you need to
15:19 know to call yourself a data scientist? Like if I studied just programming, I'm not a data scientist.
15:25 Or if I studied just math, I'm not a data scientist. Like what's the story there?
15:28 Data science is a funny thing in that there are as many different jobs called data science as there
15:34 are data scientists. So there are people who will call themselves data scientists and they write SQL
15:39 queries all day. There are people who call themselves data scientists and they do cutting edge machine
15:44 learning research all day. There are people who are data scientists who convince people to click on ads.
15:49 Like you could know almost no math and still get a job where you call yourself a data scientist.
15:53 You could know very little programming and get a job where you call yourself a data scientist.
15:58 It's a field that's still kind of figuring itself out. And there's just such a breadth of the different
16:03 roles that are all calling themselves data science. In an ideal world, you would know, you know,
16:08 linear algebra, you would know probability, you would know statistics. But among people who are data
16:14 scientists, some people know that stuff really well. Some people know that stuff not so well.
16:17 Yeah, I can imagine. It's really about you've got a lot of data or some specific type of data and you
16:23 want to answer questions about it or even discover the questions you could ask that nobody's asked, right?
16:29 SnapCI is a continuous delivery tool from ThoughtWorks that lets you reliably test and deploy your code
16:48 through multi-stage pipelines in the cloud without the hassle of managing hardware. Automate and visualize
16:55 your deployments with ease and make pushing to production an effortless item on your to-do list.
16:59 Snap also supports Docker and in-browser debugging, and they integrate with AWS and Heroku.
17:05 Thank SnapCI for sponsoring this episode by trying them with no obligation for 30 days by going to
17:12 snap.ci/talkpython. Sometimes it's actually the opposite, which is, I have a question I want to ask. Where can I find some
17:26 data that will allow me to answer that question? So, you know, sometimes you start with the data and
17:31 go to the questions. Sometimes you start with the questions and then you got to find the data.
17:34 Right. Okay. Very interesting. So that actually brought you to your next section that you talked
17:39 about in your book, which was getting data. And so what are the common ways that you talk about there?
17:45 Nowadays, there's a lot of people who just like post interesting data sets. So if you go to like
17:50 Kaggle, which is a site that does data science competitions, every one of their competitions has
17:55 a data set, which might be interesting in ways that are not related to the competition. The government websites
18:01 publish all sorts of data, some of which is potentially interesting. If you like weather,
18:05 they publish weather data. If you like economics, they publish economic data. There's a lot of
18:10 Python libraries for scraping websites. So even if a data set is not available, you can always go out and
18:15 try and scrape it and collect it yourself and clean it. My go-to source is always Twitter. I always build
18:21 things using Twitter data. So Twitter and all the other sites will have APIs where you can just make
18:27 restful calls or even use libraries. Python has some Twitter libraries that abstracts all that away. And you can,
18:33 you know, collect tweets on a given topic or collect tweets from certain users and
18:37 collect your own tweets and do analysis on those.
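The usual pattern with these APIs is that a call returns JSON and you filter it in a few lines. The field names below are invented for illustration, not Twitter's actual schema:

```python
import json

# A made-up API response shape; a real service (e.g. Twitter via
# Twython) defines its own fields.
raw = ('{"statuses": [{"user": "alice", "text": "I love #python"},'
       ' {"user": "bob", "text": "weather data is fun"}]}')

data = json.loads(raw)
python_tweets = [t["text"] for t in data["statuses"] if "#python" in t["text"]]
print(python_tweets)  # ['I love #python']
```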
18:41 Yeah. And you had a section about that in the book. What was the package you were using for that?
18:45 In Python, the package I usually use is called Twython. There's a bunch of them. That's the one that I
18:50 got to work the easiest, and I think highly of it.
18:53 I don't think I've really tried any of the other ones.
18:56 Yeah. Yeah. Okay, cool. Yeah. I think Twitter is fairly special among the social networks.
19:01 To me, Twitter is the social network of ideas more than it is of friends or family or whatever. So
19:08 yeah, pretty cool. I love to get data from Twitter as well.
19:11 Yeah. It's just that they won't give you the fire hose, but for most things, they'll give you
19:15 more tweets than you can handle anyway. So if you want to find out what people are saying on a certain
19:19 topic or just what people are talking about in general, it's an awesome source for that.
19:24 Yeah. I saw some study or heard about some study a few years ago where people were studying, they were
19:30 trying to do sentiment, mood analysis on Twitter, and then trying to tie that back to the stock market
19:38 and predict how people were feeling on Twitter to near-term changes like in the next hour in the
19:45 stock market, which I thought was a pretty interesting project.
19:48 So ideas you can come up with using that data are pretty much endless. Every month, I think I'm done
19:53 with it and then I think of something else to do.
19:54 Nice. So then you kind of get into the topics that I think of as traditional data science,
20:01 machine learning, neural networks, network analysis, that kind of stuff. And I thought maybe we could go
20:07 through each section and you could tell me what it is that you need to sort of fundamentally
20:13 understand what are the from scratch basic pieces of each part of this data science and then maybe
20:20 some examples of problems you might solve with it.
20:22 Sure.
20:22 Yeah. So the first one you talk about is machine learning. And when you said machine learning
20:27 and later neural networks from scratch, I'm like, wow, that's a pretty big thing to take on from
20:33 scratch. What's the story there?
20:34 Machine learning at a high level is just learning some kind of model from data rather than sitting
20:41 down and writing out the model yourself by hand. And so if you have a small amount of data, then
20:47 you know, coming up with an algorithm that's going to learn a model from it is actually a pretty
20:51 reasonable thing to do if you go with a simple model and don't add too many bells and whistles.
20:56 It's only when you want to, you know, start producing recommendations at Netflix scale
21:01 or when you want to start, you know, building something to recognize speech patterns from audio files
21:06 that your beautiful handcrafted Python is probably not going to be up to the task.
21:10 Yeah. So I suppose the types of things you solve with machine learning is fairly unbounded. There's
21:19 a lot of problems that machine learning answers. What are some of your favorite examples?
21:22 Everything out there is machine learning these days. But if you want to talk about like projects that
21:29 I've worked on for fun, one time I built a classifier to predict or to identify hacker news articles that I would be interested in or not
21:37 interested in. So it would take kind of the feed of new hacker news stories and come up with a score between zero and one about how interested, you know, it thought I would be based on some initial seed values that I leave.
21:49 How did you teach it?
22:06 I was for a few other kind of idiosyncratic features I threw in, like, you know, is it an ask hacker news? Is it a show hacker news? Does it have a dollar
22:15 amount in there? Because there are a lot of kind of bad stories that's like, here's how to make $5,000, I think. So that was a good negative signal.
22:23 It's sort of a spammy signal.
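Those idiosyncratic features boil down to simple predicates over a story title. A toy sketch (my simplification, not Joel's actual code):

```python
# Turn a Hacker News title into boolean signals a classifier
# could score; the feature names here are invented for illustration.
def extract_features(title):
    return {
        "is_ask_hn": title.startswith("Ask HN"),
        "is_show_hn": title.startswith("Show HN"),
        "has_dollar_amount": "$" in title,  # the negative signal mentioned above
    }

print(extract_features("Here's how I made $5,000 in a week"))
```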
22:24 Yeah. So, and then it turned out I wrote a blog post about it and someone posted the story to hacker news and then
22:31 the hacker news community got very angry that someone would think that not every story there was worth
22:36 reading. Because of course every story there is worth reading. So why would I want to filter some of them out?
22:39 They accused me of wanting to live in a bubble.
22:41 And yeah, they can be fairly critical on there. But that's funny. So the big question is, did your system
22:49 like your article, your blog post?
22:51 That's a good question. I don't know that I actually checked. I should go back and try and find that.
22:55 It'd be funny if it recommended it to you. Or not. Either way, it would be funny to know.
22:59 Well, you know, it's funny speaking of this. So, you know, I have an Android phone and so
23:03 Google now will sometimes recommend me articles of things I'll be interested in.
23:07 And occasionally it will recommend to me like my own blog posts, which I guess means doing a good job.
23:13 You know, one of the things I did with machine learning was I have a five year old daughter and I do her clothes shopping.
23:19 And I noticed that little boys clothes were very interesting and little girls clothes tend to be
23:23 kind of like really boring. The boys clothes have like dinosaurs and rockets and robots and
23:28 girls clothes have like hearts and flowers. And so I built a machine learning model to take an image of a
23:33 children's t-shirt and predict whether it was a boy shirt or a girl shirt.
23:38 Awesome. Yeah. And I agree with the classification there as well. I have three daughters and we still
23:44 bought a fair number of boy clothes for them.
23:46 Yeah, I did the same.
23:47 Nice. So the next topic that you covered was nearest neighbors. And I mean, conceptually,
23:53 I kind of know what nearest neighbors are pretty easily, but it's a computation hard problem,
23:58 especially in higher dimensions. And so what kind of stuff do you do? What kind of problems do you solve with
24:03 this?
24:03 One thing you can do is when you don't have like a great parametric model for what's going on.
24:08 So for instance, one place where I've seen this applied is when I have some kind of like
24:12 time series type signal, and I want to know if it represents some kind of bad anomaly. And it might be,
24:19 you know, like thousands of points in some weird shape. And I don't have a great way to classify it.
24:22 But one thing you can do is if you have a bunch of labeled signals, you know, some are good or some
24:28 are bad, you just say, I'm just going to take my set of reference signals and figure out which are
24:33 the ones it's closest to. And that way, I don't need to necessarily have a model of what's a good
24:38 signal or what's a bad signal. I can just say, I have some labeled data, that's enough. And so I
24:43 think that's kind of the situation where it can be useful.
24:46 Yeah, nice. When you're still trying to explore things, and maybe you don't, you don't have a model
24:51 you're trying to fit it to, you're just trying to understand it, right?
24:53 Yeah. Or think about where like, you have some weird kind of shapes, or weird kind of patterns
24:59 that you can't really put math behind. What's a good pattern? What's a bad pattern? But you have
25:04 some labels, then you could use nearest neighbors to basically classify without having to have a
25:10 mathematical model behind it.
24:53 Okay, yeah, very cool. So then you get to another topic, Bayesian analysis, which in the context I know it is around kind of determining spam and
25:21 filtering and stuff like that. But what's the story of Bayesian analysis?
25:26 So it's named after Bayes' rule, which is just a theorem in statistics, having to do with ways of
25:33 reversing conditional probabilities. So at a high level, if you have some knowledge about
25:38 what is the probability of seeing certain features in email, given that an email is spam, you can use that
25:45 data to produce estimates of what is the probability that an email is spam, given that you see those
25:50 features. And so typically, if you have a lot of, you know, spam and non-spam, you can make estimates
25:56 of, okay, given that an email is spam, it's likely that I'll see Viagra, and it's likely that I'll see
26:02 Gary Twick. And then basically just a classifier that turns that around and allows you to now look for
26:07 these features. And in a mathematically rigorous way, come up with the probability, okay, here's the
26:11 probability that this is spam. Okay. Yeah. Very interesting. What other types of problems?
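A bare-bones sketch of the naive Bayes idea just described (the messages and word counts are toy data I made up):

```python
# Estimate P(word | spam) and P(word | ham) from labeled examples,
# then flip them around with Bayes' rule to score a new message.
spam = ["cheap viagra now", "viagra deal cheap"]
ham = ["meeting at noon", "lunch deal at noon"]

vocab = set(" ".join(spam + ham).split())

def word_probs(messages, vocab):
    words = " ".join(messages).split()
    # +1 smoothing so unseen words don't zero out the product
    return {w: (words.count(w) + 1) / (len(words) + len(vocab)) for w in vocab}

p_w_spam = word_probs(spam, vocab)
p_w_ham = word_probs(ham, vocab)

def spam_score(message):
    p_spam = p_ham = 0.5  # prior: assume half of mail is spam
    for w in message.split():
        if w in vocab:  # the "naive" assumption: words are independent
            p_spam *= p_w_spam[w]
            p_ham *= p_w_ham[w]
    return p_spam / (p_spam + p_ham)  # Bayes' rule, normalized

print(spam_score("cheap viagra"))  # ≈ 0.91 — probably spam
```

Treating each word independently is the bag-of-words assumption Joel mentions a moment later; it is technically wrong but works surprisingly well for this kind of text classification.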
26:15 Spam is certainly easy to understand, but you know, hopefully that's solved by Gmail and other people
26:22 for us, right? I think it was Paul Graham who wrote a pretty influential article about this approach,
26:27 probably about 15 years ago now. And so I think a lot of the tools that, I don't know if it's what
26:33 Gmail uses, but that SpamAssassin and some of the other mail providers used are still
26:37 based on this principle of naive base. But basically the, anything where you have kind of big clumps of
26:44 say text, and you want to classify it into one of two or more classes, and you're
26:52 willing to make this really big kind of technical assumption that the words in it are kind of independent
26:58 of each other. So you're treating them kind of like a bag of words rather than as a sequence of words,
27:02 then that's where you would use this kind of model. Okay. Yeah, cool. So maybe the opposite of that,
27:08 where you treat words sort of as, as having more meaning is in natural language processing, which is
27:14 another thing you talk about, right? Natural language processing is, is a huge field. Like people,
27:18 there are textbooks about it. So in fact, there are Python books about it. So cram it into a chapter of a
27:25 book is really just kind of, here's a really high level overview. And here's, you know, a couple of
27:31 examples to give you a flavor. Natural language processing is actually relevant to the sorts of
27:37 problems that I'm getting into in artificial intelligence. It gets used a lot by all the voice recognition in
27:43 your phone, and a lot of these new systems for basically reading text and having a computer extract information from it. So it's a
27:52 pretty important, pretty, pretty rich field. Yeah. And there's some pretty decent Python libraries for
27:57 doing that, right? Yeah. So there's a, so the big one is called NLTK, the natural language toolkit,
28:04 I think it stands for. And so it has its own book just about natural language processing in that library.
28:10 And I think that's actually free on the web. But yeah, that would be a great place to start if you
28:15 want to come to understand it more deeply. There's also a nice Coursera course on natural language processing
28:20 that I took several years back that I liked. Oh, okay. Yeah, excellent. So let's see. Another topic you
28:25 covered was decision trees. And what's the story of these? A decision tree is kind of what it sounds
28:31 like. Yeah. You know, intuitively, you might have a model of making decisions based on a tree. So I have,
28:39 I want to know whether to buy a certain car or not. So, you know, I might start asking a question. Okay,
28:44 is it an American car or a Japanese car? American car. Okay. Go to the next question. Does it have
28:49 four doors or two doors? It has two doors. Okay. Is it gas or diesel? Diesel. And then you just have a
28:55 sequence of questions to ask and you classify it based on those questions. Okay. It's diesel. That
29:00 means, yes, I want to buy it. And so conceptually, you know, a tree like that is not that complicated.
29:06 Interesting part is given some set of data, how do I build such a tree? And there, there's a variety of
29:13 algorithms, but the one that we talk about in the book, it is a pretty simple one, which is basically
29:18 around, okay, I have a bunch of data. It has a bunch of attributes. I can split my data on each
29:24 attribute. So I could say, okay, at this stage, I want to look at two door versus four door. I want to
29:28 look at Japanese versus American, or I want to look at gas versus diesel and which of those choices is
29:35 going to allow me to kind of really separate the good buys from the bad buys the most. So if gas versus
29:43 diesel totally splits where diesel is the buys and gas is the don't buys, that's a great thing to choose.
29:50 If gas versus diesel splits where, you know, on gas, I want to buy half and not buy half, and on diesel,
29:56 I want to buy half and not buy half, that's not a good split to choose, because it doesn't really
29:59 help me at all. And so the mathematics here are just around kind of making this precise with this notion of
30:05 entropy and building these trees from the data. Okay. Yeah, that's very cool. The way where you
30:11 build them knowing the data intimately, you know, that makes a lot of sense. And you don't really
30:17 necessarily need data science to do that, right? That's just, you know, asking a few questions
30:23 and making a decision. But the reverse, I think, is pretty interesting.
30:27 They build some of these expert systems in this kind of way. I think they apply them sometimes in like,
30:31 medical diagnosis and sometimes they do better than doctors.
30:35 Wow. Okay. Sometimes just let the data talk, huh? Rather than intuition.
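The entropy-based split selection described above can be sketched from scratch. The car rows and attribute names below are hypothetical, echoing the gas-versus-diesel example in the conversation:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def split_entropy(rows, attribute, target="buy"):
    """Weighted average entropy of the target after splitting rows on
    attribute; lower means the split separates the classes better."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    total = len(rows)
    return sum(len(g) / total * entropy(g) for g in groups.values())

# Hypothetical car data echoing the example in the conversation.
cars = [
    {"origin": "american", "doors": 2, "fuel": "diesel", "buy": True},
    {"origin": "american", "doors": 2, "fuel": "gas",    "buy": False},
    {"origin": "japanese", "doors": 4, "fuel": "diesel", "buy": True},
    {"origin": "japanese", "doors": 4, "fuel": "gas",    "buy": False},
]

# Fuel perfectly separates buys from don't-buys here, so it has the
# lowest split entropy and is the attribute to split on first.
best = min(["origin", "doors", "fuel"], key=lambda a: split_entropy(cars, a))
```

Building the whole tree is then just applying this choice recursively to each resulting subset.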
30:40 This episode is brought to you by Hired. Hired is a two-sided, curated marketplace that connects the
30:56 world's knowledge workers to the best opportunities. Each offer you receive has salary and equity presented
31:01 right up front, and you can view the offers to accept or reject them before you even talk to the company.
31:07 Typically candidates receive five or more offers within the first week, and there are no obligations
31:11 ever. Sounds awesome, doesn't it? Well, did I mention the signing bonus? Everyone who accepts a job
31:16 from Hired gets a $1,000 signing bonus. And for Talk Python listeners, it gets way sweeter.
31:22 Use the link hired.com/talkpythontome and Hired will double the signing bonus to $2,000.
31:29 Opportunity is knocking. Visit hired.com/talkpythontome and answer the call.
31:32 Somewhat related to that maybe is, you talked about neural networks as well.
31:42 Yeah. So neural networks are a super hot topic right now, because I'm sure you've probably heard
31:47 all the buzz about deep learning, right? And deep learning tends to have neural networks
31:54 at the root of it. So neural networks are basically a way of building up kind of layers and layers of
32:01 representations. As for deep learning, this book doesn't really go into it, because
32:06 a single neural network is hard enough to do by hand or from scratch. But basically it's a way to
32:12 kind of build a classifier that works similar to how a toy model of a brain might work. So you have
32:20 artificial neurons, and each neuron has a bunch of inputs that go into it with weights. If the weighted
32:26 sum of the inputs exceeds some certain threshold, the neuron fires. And if it doesn't exceed that
32:30 threshold, the neuron doesn't fire. So you present an input, which could be like a, an image. So basically
32:36 a bunch of zeros and ones, and that causes some neurons to fire. And that propagates through this
32:42 network. And in the end it will spit out, you know, I think this is an image of a cat,
32:46 or I think this is an image of a dog. And you train it by showing it a lot of labeled images
32:52 and adjusting the weights based on how it got them wrong.
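The toy neuron and weight-adjustment idea described here can be sketched as the simple perceptron rule, not full backpropagation; the AND-gate training data and all the names below are illustrative:

```python
def neuron_output(weights, inputs, threshold):
    """A toy neuron: fire (1) if the weighted input sum exceeds the
    threshold, otherwise don't fire (0)."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > threshold else 0

def perceptron_train(examples, lr=0.1, epochs=50):
    """Nudge the weights based on how the neuron got each labeled
    example wrong (the classic perceptron learning rule)."""
    n = len(examples[0][0])
    weights, threshold = [0.0] * n, 0.0
    for _ in range(epochs):
        for inputs, label in examples:
            error = label - neuron_output(weights, inputs, threshold)
            # shift each weight toward reducing the error on this example
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            threshold -= lr * error
    return weights, threshold
```

For something linearly separable like AND, this converges quickly; a real network stacks many such units in layers and trains them with backpropagation instead.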
32:55 Yeah. Interesting. So you basically just say, these are the inputs and the decisions: you feed it
33:02 known data. You say, you know, like, here's a cat, here's a cat, that's a dog. It's just a cat or a dog,
33:07 right? And then you ask it that question. I was gonna say, well, one of the problems that I sort
33:11 of remember from neural networks is that they, as you try to get them more accurate,
33:15 they get over-trained to just only do little bits of stuff. And so how do things like decision
33:21 forests or these things and this deep learning, how do they deal with that?
33:26 So random forests are basically taking multiple decision trees and combining them. So there's this
33:33 kind of general principle where a lot of times, rather than building one model that's, like, super predictive,
33:40 I take a lot of much less predictive models and just kind of let them vote on things or average
33:47 the results or whatever. So you're right that if it's really a big decision tree, and I included, like, all
33:52 hundred features in my data set, there's a good chance I'm going to, like, overfit and
33:57 overlearn my training data and not generalize outside of that. So one thing that happens with decision
34:04 trees is people often don't use the bare decision trees. They use the random forest where they'll build
34:10 a bunch of smaller decision trees, each of which is really restricted to a small subset of features
34:16 so that each one is individually less powerful. But then when you combine them, they do well in the aggregate
34:22 and they don't have necessarily the same overfitting problem that single decision tree with lots of features would.
34:28 So in neural networks, especially in deep learning, there are not exactly the same techniques, but other
34:34 techniques where you will zero out some of your weights sometimes and train without them to make
34:40 sure that they don't learn too much. And there's a lot of other techniques that get used in order to
34:44 make sure you're not overfitting. That's something that, you know, data scientists and machine learning
34:47 people worry about a lot.
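The vote-combining idea behind random forests boils down to a few lines. The stub "trees" in the usage below are hypothetical hand-written rules standing in for trees each learned on a random subset of rows and features:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine many weak classifiers by letting them vote."""
    return Counter(predictions).most_common(1)[0][0]

def forest_predict(trees, row):
    """Each 'tree' here is just any function row -> label; a real random
    forest would train each tree on a random subset of the data."""
    return majority_vote(tree(row) for tree in trees)
```

With three simple rules voting, two out of three carry the decision, which is exactly how the ensemble smooths out any single tree's overfitting:

```python
trees = [lambda row: row["fuel"] == "diesel",
         lambda row: row["doors"] == 2,
         lambda row: row["origin"] == "american"]
```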
34:48 It's a really hot field, like you said right now. That's awesome. So one more section I'd like to touch
34:54 on is you talked about this thing called recommender systems.
34:58 So, I mean, they're just what they sound like. Anyone who's used the internet has a pretty good experience
35:02 with them, where you go to Netflix and it says "Movies for Joel" and it's trying to predict what it thinks I'll like.
35:09 Or you go to Amazon and it will recommend items for me. And you can go a little further where you have
35:16 all these startups like Stitch Fix is a very hot data science startup where you tell what kind of clothes
35:22 you like and they'll send you a box of clothes every month that they think you'll like a lot.
35:28 And so generating these kinds of recommendations is a pretty popular task within data science. A lot of data scientists
35:36 work on these kinds of problems, because a lot of data scientists work at companies whose job is to sell you stuff,
35:41 and they always want to sell you stuff that they think you'll like.
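A toy version of such a recommender can be built on simple set overlap between users' liked items. The users, items, and choice of similarity below are all illustrative; real systems use far richer data and models:

```python
def jaccard(a, b):
    """Overlap between two users' sets of liked items, from 0 to 1."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def recommend(user_likes, other_users_likes, top_n=3):
    """Score items liked by similar users that this user hasn't seen yet:
    a toy user-based collaborative filter."""
    scores = {}
    for other in other_users_likes:
        similarity = jaccard(user_likes, other)
        for item in other:
            if item not in user_likes:
                scores[item] = scores.get(item, 0) + similarity
    # highest-scoring unseen items first
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Items liked by the most similar users accumulate the highest scores and float to the top of the recommendations.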
35:44 You convert better if you figure out what people actually might want, right?
35:47 Yeah, I mean, well, if Amazon sends me an email, "Hey, Joel, you know, these five things are on sale
35:54 and they're five things that I really want to buy," then that's much better for them and potentially
35:58 for me than if they send me a random email that's like, "These are the five most popular things on
36:03 Amazon today." Yeah, it definitely is better. I really like going to my Kindle and looking at
36:08 the recommended things based on what I've been reading. But the Netflix recommendations, I know they
36:14 do really great work, but it just doesn't work for me because my seven-year-old daughter watches
36:19 Strawberry Shortcake and other random things. I get a lot of kid shows recommended to me.
36:25 They have a "Who's watching" button that you can click and say "A kid is watching."
36:30 I know, but my daughter won't use it. She'll just randomly pick one.
36:33 Yeah, so you know, it's funny. I use that button pretty well, but my Netflix recommendations are also
36:40 not that good. But I think it's just because I don't like anything on there. So no matter what they're
36:45 telling me, I'm not going to like it. Yeah, I hear you. I feel like you covered a pretty
36:49 wide swath of data science from scratch. Some of the topics were really accessible. Some of them
36:57 required more math and more thinking, but it was all a really nice presentation. What do you feel like
37:04 you left out? There aren't really any topics that I wish I'd covered. If anything, I kept adding stuff while I was writing it.
37:11 So I don't feel like there were any huge topics that I necessarily left out. But in terms of coverage,
37:17 probably my biggest regret is that I used Python 2 instead of Python 3.
37:20 Yeah. Okay. And I saw on your GitHub repo, you had something about Python 3 in there. Is that right?
37:27 Yes. So pretty much as soon as the book came out, one, I got a lot of emails from people saying,
37:32 "Hey, why do you not use Python 3?" And then I also got a number of emails saying,
37:37 "I would like to use Python 3. Will the code work?" And so I wrote them back and I said,
37:41 "Yeah, you know, I don't see why the code shouldn't work. Give it a try. I bet it works
37:46 with probably a few changes: add some parentheses to print statements and so on." And then eventually one
37:51 guy wrote me back and he said, "It doesn't work." I said, "Okay." So I sat down and I said, "I'm going to
37:57 convert the code to Python 3." And it took me about, I'd say four to five hours. And that's with me
38:04 knowing the code intimately. So it would have taken someone who hadn't written the code in the first
38:09 place a lot longer than that. And then I felt kind of guilty that I'd been telling all these people that
38:12 it was so easy to do when it wasn't.
38:14 Just spend a week. It'll be fine.
38:16 Yeah. But so yeah, I have the Python 3 versions of the code up on the GitHub. I sort of regret not
38:23 having just done it that way in the first place.
38:25 Yeah, sure. So at the end, you talked about data science, not from scratch, and you pointed out
38:31 a lot of the libraries you might actually use, like NumPy and so on. Do you want to talk about that
38:37 really quick? Like what's the real data science versus the from scratch data science comparison?
38:43 Yeah. So I would say that NumPy is pretty fundamental. That's basically the linear algebra
38:51 library for Python. So it provides you matrices, matrix algebra, high-performance arrays, things
38:58 like that that you don't just get built in. And you might not use it directly, but a lot of the other
39:05 libraries are really built on top of it. So kind of the most broadly accessible machine learning
39:11 library for Python is called scikit-learn. And it has really nice documentation and really nice tutorials
39:16 and a fairly standard API for building machine learning models. Any time you want to build a
39:21 regression model or a random forest model or any kind of classifier, that's probably the place to go.
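That standard API looks roughly like the following sketch. The toy feature rows and labels are made up, and the specific model settings are arbitrary choices for the example:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy feature rows and labels following an AND-like rule (made up).
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 2
y = [0, 0, 0, 1] * 2

# The standard scikit-learn pattern: construct, fit, predict.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X, y)
predictions = model.predict([[1, 1], [0, 0]])
```

Swapping in a different estimator (a regression, an SVM, and so on) keeps the same construct/fit/predict shape, which is a big part of why the library is so approachable.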
39:27 There's also Pandas, which is the data frame library, which is good if you're working with tabular data. So
39:34 not necessarily the machine learning side of data science, but more of the, I have kind of a
39:39 spreadsheet data set and now I want to clean it and aggregate it and pivot it and look for kind of data
39:48 analysis type insights.
39:49 Yeah. If you're exploring, like you said, tabular data and you kind of load it up and clean it,
39:56 Pandas seems really fantastic for that.
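A small sketch of that load/clean/aggregate workflow; the city and sales figures below are invented for illustration:

```python
import pandas as pd

# A hypothetical spreadsheet-style dataset with one missing value.
df = pd.DataFrame({
    "city":  ["Seattle", "Seattle", "Portland", "Portland"],
    "sales": [100.0, None, 80.0, 120.0],
})

df["sales"] = df["sales"].fillna(0)          # clean: fill the missing value
totals = df.groupby("city")["sales"].sum()   # aggregate: total sales per city
```

Pivoting, joining, and the other spreadsheet-style operations follow the same pattern of chaining DataFrame methods.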
39:57 It's a really nice library. And then kind of the new kid on the block is TensorFlow, which is
40:03 Google's deep learning library. It was only released, you know, a few months ago.
40:09 It's not 1.0 yet, but it seems like people are sort of converging around it as how they're
40:15 going to do deep learning in Python. Now there are other sort of previous libraries that some people
40:20 used and still use, but TensorFlow seems to be gaining a lot of mindshare.
40:24 Okay. Yeah, that's really cool. And I've definitely seen TensorFlow talked about a lot in this context,
40:30 but it's just a library. You can download that and run it locally. It's not like a cloud type thing,
40:35 right?
40:35 Yes. So currently it's a local library, and I'm not sure if they've come out with a version that you can
40:42 kind of run in your own cloud, like on various AWS instances. But I know they're definitely going
40:47 in that direction, because a lot of this stuff, these deep learning models, take a really long time to
40:52 train. And so if you want to use them for anything serious, you want to distribute them and
40:56 throw them on high-powered machinery and not just run them on your laptop.
40:59 Yeah, of course. If you have tons of data, maybe it's better to get a bunch of machines
41:03 for an hour. So what do you think about these cloud learning or data science platforms? I'm thinking like
41:09 Azure machine learning, or, you know, I just had SigOpt on the show a few shows ago. I'm not really sure
41:15 what else is out there in terms of like go out to the cloud and grab some data science stuff.
41:20 What do you think about that?
41:21 I haven't spent much time looking at any of those. I think they can add value in terms of either one,
41:31 if you don't have the data science or machine learning expertise in house to do whatever you need to do.
41:37 Or two, if you have some kind of model that you've built, but you need help either putting it into
41:43 production or operationalizing it somehow, then they can fill a pretty good role there. But my sense is
41:50 that most people doing data science are inclined more toward running the libraries and running the models
41:56 themselves. But that could be just my biased sample of the people who I talk to.
42:00 Yeah, of course, of course. Cool. So another thing that you said you're into is taking some of the
42:07 ideas from Haskell and thinking about how those might manifest in Python. What did you get up to there?
42:12 Haskell, if you're not familiar with it, is kind of the purest of the pure functional
42:19 languages, with strong types and lazy evaluation and things like that. And so once I spent some
42:28 time in that world, I spent a lot of time thinking about how can I bring some of these concepts back
42:33 into Python. And in Python 3, lazy evaluation plays a much bigger role in the sense that like range
42:42 is a generator instead of a list and all the map and filter and things like that also are generators
42:48 instead of lists. But I started getting into the itertools library, which starts giving you tools for
42:55 generating basically infinite sequences and just trying to see how far I could go using
43:01 pure functions and infinite sequences and avoiding mutable variables and other things that you try not
43:07 to do when you're working in a Haskell-like language.
43:10 Yeah. And how did you feel like it came out in the end? Do you feel like you were able to bring a lot
43:14 of those ideas over?
43:15 I was, and I ended up producing code that was really neat and really impenetrable. The lazy
43:23 infinite sequences stuff, that was more almost academic in terms of like, yes, I managed to do it. Like,
43:29 this is mathematically interesting and it works well, but it's not readable at all.
43:34 So imagine you wanted to represent, let's say a binary tree in Python. Kind of the two,
43:41 I would say obvious approaches would be one to make some kind of like class where it has a,
43:47 you know, a value element and a left element and a right element. And then you also might just use
43:53 a dictionary to represent it where it had those keys. In a language with algebraic data types,
43:57 like Haskell, you would just basically represent that kind of tree as a product type where it just
44:04 has, like, three elements. And so I said, you know, what if I just represented a tree as a tuple with
44:09 three elements, where the first element is the left subtree, the second element is the value,
44:13 and the third element is the right subtree. And similarly, if you want to do, like, linked lists in Python,
44:17 which you probably don't want to do, but if you did want to do linked lists in Python, you just treat
44:21 them as a tuple: the first element is the head element and the second element is the tail linked list. And so,
44:28 I actually found that I was able to write some pretty nice code using those kinds of ideas.
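The tuple-as-tree idea can be sketched like this; the function names are illustrative, and this version happens to keep the tree ordered so an in-order walk comes out sorted:

```python
# A binary search tree as plain tuples: (left, value, right), with None
# for the empty tree. No classes, no dictionaries, and no mutation.
def insert(tree, value):
    """Return a new tree containing value; the old tree is never modified."""
    if tree is None:
        return (None, value, None)
    left, node, right = tree
    if value < node:
        return (insert(left, value), node, right)
    return (left, node, insert(right, value))

def to_list(tree):
    """In-order traversal, which yields the stored values in sorted order."""
    if tree is None:
        return []
    left, node, right = tree
    return to_list(left) + [node] + to_list(right)
```

Because `insert` builds a new tuple instead of mutating anything, old versions of the tree remain valid, which is exactly the persistent-data-structure flavor of the Haskell approach.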
44:32 And I did some coding interviews that way. I'm not sure the interviewers appreciated it.
44:38 What is this guy talking about? Yeah, cool. So one question I wanted to ask, given your work at the AI institute and
44:47 your background in data science and so on: over the last year or so, there's been a lot of
44:54 news items and people coming out saying that artificial intelligence is a danger to humanity.
45:01 What's your thought? Is like AI something we should be super excited about? Or is it something we should
45:07 be maybe cautious about? I would say I probably fall in the middle. I mean, I'm excited about it because
45:12 this is my job, but I don't go around encouraging everyone else that they have to be excited about it
45:18 because I don't know that that's necessarily warranted. But by the same token, I don't spend a lot of time
45:23 worrying about how dangerous it is. I think we're pretty far off from the time when we have to worry
45:29 about that. And I do have some friends who think we should worry about it now before it's too late. But
45:35 I think there's a lot more important things to worry about in the world when I read the news.
45:39 Yeah, I kind of agree with you on that. And I think there's two, certainly two ways to look at that. On
45:47 one hand, if you think about things like self-driving cars, let's just take that as an example. Like,
45:53 I believe one of the biggest job categories for men in the United States is some form of driving,
46:02 like driving a truck, driving a taxi, those types of things, delivery vehicles and so on. And if self-driving
46:08 cars were to like remove all that, like that would have large social effects. But I think, you know,
46:13 that's not so much the way that people, at least recently in the news, were talking about it. It's
46:19 more like Terminator style, right? And so that I, I'm not too worried about this personally. Who knows?
46:26 Yeah. I mean, I personally would pay a lot of money for a self-driving car because I,
46:30 I don't like driving that much. And I'd much rather be able to read while I'm going somewhere.
46:35 Yeah. Driving is fun until you get stuck on I-5 for half an hour inching along. Then you know,
46:40 I don't like driving anymore. Exactly.
46:42 Awesome. Well, we're kind of coming up near the end of the show. Let me ask you just a
46:46 few closing questions. I always ask on my guests, if you're going to write some Python code,
46:50 what editor do you open up? So these days I tend to use Atom for pretty much everything,
46:55 memory leaks and all. Nice. Yeah. That's from GitHub, right?
46:58 Yeah. It's pretty similar to Sublime, but it doesn't nag you for money. So yeah, it's
47:03 pretty nice. And I think it's atom.io. They have a really cool little video about how
47:08 it's the editor of the future. It's nice.
47:10 Yeah. I don't know if I go that far, but it's the editor of the present at least.
47:14 Yeah. It's like a George Jetson sort of like a promo video. It's, it's pretty funny.
47:18 It is pretty nice. Isn't it? Does it have good Python support?
47:21 You know, I'm not someone who leans on, like, IDE functionality a lot. So if good Python
47:28 support counts as syntax highlighting, yes, but that's all I tend to use it for. So yeah.
47:32 Yeah. Yeah. Okay. Cool. And if you look at on the Python package index, there's,
47:38 you know, 75 plus thousand packages and we all have experience with, you know, different parts
47:46 of it and there's things that we love and would recommend, like, what is your favorite one you
47:49 might recommend that people maybe don't know about?
47:51 So the one I recommend that people don't necessarily know about is called Beautiful Soup. It's
47:57 basically an HTML parsing library. And so if you start scraping data from webpages, you're going to get
48:04 a big mess of ugly HTML that's probably not even well-formed. Most of the time,
48:10 most people don't bother to well-form their HTML.
48:12 Are you telling me that I can't just like load that up as an XML document or something like this?
48:17 No, I'm just kidding. Of course it's, it's terrible trying to work directly on the web,
48:21 right? And Beautiful Soup is really, I really like it as well.
48:25 It's really nice. I mean, you have to spend a little bit of time getting used to
48:28 its API and interface and everything, but it's super handy for getting data out of webpages and
48:34 doing anything where you have to do a bunch of scraping.
48:37 And you cover that in your book, right? In the getting-data chapter?
48:42 Yeah, I cover it a little bit. I use Beautiful Soup a little bit and show people how to use it. It's a nice addition to the data scientist's toolkit.
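A minimal Beautiful Soup sketch along those lines; the HTML snippet below is invented and deliberately not well-formed (note the unclosed second anchor tag):

```python
from bs4 import BeautifulSoup

# A small, sloppy chunk of HTML of the kind scraping tends to produce.
html = ("<p>Links: <a href='http://example.com'>one</a> "
        "<a href='http://example.org'>two")

soup = BeautifulSoup(html, "html.parser")
# Pull every link's href, even though the markup never closed properly.
urls = [a.get("href") for a in soup.find_all("a")]
```

The parser tolerates the broken markup and still recovers both links, which is exactly what makes it so handy for scraping real pages.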
48:46 Nice. So Joel, how do people find your book? Amazon? Just a Google search for data science
48:52 from scratch? Yeah, it's on Amazon and you can buy it from O'Reilly.com. But yeah,
48:57 if you Google data science from scratch, you'll find it.
48:59 Cool. And I'll put a link to the GitHub repo where you have all the code examples and so on as well.
49:04 It's been really fun to talk about data science. And I think you have a really interesting way of teaching
49:10 people to appreciate the tools that we're all fairly familiar with by showing you how to build it from
49:16 scratch. So thanks for that.
49:19 My pleasure.
49:21 This has been another episode of Talk Python to Me. Today's guest was Joel Grus and this episode
49:25 has been sponsored by SnapCI and Hired. Thank you guys for supporting the show. SnapCI is modern,
49:30 continuous integration and delivery. Build, test and deploy your code directly from GitHub,
49:34 all in your browser with debugging, Docker and parallelism included. Try them for free at
49:39 snap.ci/talkpython. Hired wants to help you find your next big thing. Visit hired.com/talkpythontome to get five or
49:45 more offers with salary and equity presented right up front and a special listener signing bonus of $2,000.
49:50 Are you or a colleague trying to learn Python? Have you tried books or videos that left you bored by just
49:56 covering the topics point by point? Check out my new online course, Python Jumpstart by building 10 apps at
50:01 talkpython.fm/course to experience a more engaging way to learn Python. You can find links
50:07 from this show at talkpython.fm/episodes/show/56. Be sure to subscribe to the show. Open
50:14 your favorite podcatcher and search for Python. We should be right at the top. You can also find the
50:18 iTunes feed at /itunes, the Google Play feed at /play and the direct RSS feed at /rss on
50:25 talkpython.fm. Our theme music is Developers, Developers, Developers by Corey Smith, who goes by Smix.
50:31 You can hear his entire song at talkpython.fm/music. This is your host, Michael Kennedy.
50:36 Thank you so much for listening. Smix, take us out of here.
50:39 Stating with my voice. There's no norm that I can feel within. Haven't been sleeping. I've been using lots of rest. I'll pass the mic back to who rocked it best.
50:49 I'm first developers.
50:51 I'm first developers.
50:58 Developers, developers, developers, developers.
51:01 .
51:01 Thank you.
51:01 Thank you.