#56: Data Science from Scratch Transcript

Recorded on Tuesday, Apr 19, 2016.

00:00 You likely know that Python is one of the fastest growing languages for data science.

00:03 This is a discipline that combines the scientific inquiry of hypotheses and tests,

00:07 the mathematical intuition of probability and statistics, the AI foundations of machine learning,

00:12 a fluency in big data processing, and the Python language itself. That's a very broad set of skills

00:18 you'll need to be a good data scientist, and yet each one is deep and often hard to understand.

00:23 That's why I'm excited to speak with Joel Grus, a data scientist from Seattle. He wrote a book to

00:28 help us all understand what's actually happening when we employ libraries such as scikit-learn or

00:32 NumPy. It's called Data Science from Scratch, and that's the topic of this week's episode.

00:36 This is Talk Python to me, episode number 56, recorded April 19th, 2016.

00:56 Welcome to Talk Python to me, a weekly podcast on Python, the language, the libraries, the

01:11 ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm

01:16 @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on

01:22 Twitter via @talkpython. This episode is brought to you by SnapCI and Hired. Thank them for supporting

01:28 the show on Twitter via @snap_ci and @hired_hq. Hey, everyone. It's great to be

01:36 back with you. I have something to share before we get to the main part of the show. I had a chance to be on the

01:40 Partially Derivative podcast this week, where we did a short segment on programming tips for data scientists,

01:46 relevant to this episode, wouldn't you say? I talked about using generator methods and the yield

01:51 keyword to dramatically improve performance while processing lots of data in a pipeline.
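
The generator idea mentioned here can be sketched roughly as follows. This is only an illustration, not the code from the Partially Derivative segment: the function names, the column names, and the in-memory sample data are all assumptions made up for the example.

```python
import csv
import io


def read_rows(lines):
    """Yield one parsed CSV row at a time instead of building a full list."""
    for row in csv.DictReader(lines):
        yield row


def only_large(rows, threshold):
    """Pass through only the rows whose 'amount' exceeds the threshold."""
    for row in rows:
        if float(row["amount"]) > threshold:
            yield row


def amounts(rows):
    """Extract the numeric amount from each row."""
    for row in rows:
        yield float(row["amount"])


# Hypothetical in-memory data standing in for a large file on disk.
SAMPLE = io.StringIO("name,amount\na,5\nb,50\nc,500\n")

# Nothing is read until sum() starts pulling values through the pipeline,
# and only one row is held in memory at a time.
total = sum(amounts(only_large(read_rows(SAMPLE), threshold=10)))
print(total)
```

Each stage only asks the previous stage for the next item, so the same pipeline would work unchanged on an open file handle with millions of lines.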

01:55 If you want to hear more about that kind of stuff, check out the link to the Partially Derivative episode

01:59 in the show notes. Now, let's talk to Joel. Joel, welcome to the show. Thanks. I'm glad to be here.

02:04 Yeah, I'm really looking forward to doing data science, but from scratch today.

02:09 I know. That's my book. I'm the right person to be here for that topic.

02:12 Fantastic. So we're going to dig into data science. And I think your idea of

02:16 taking it from the fundamentals and building something that's not so complicated or so well

02:22 polished and optimized that you can really understand what you're doing is great. But before we get into

02:26 that, let's just sort of start from the beginning. How did you get into programming in Python?

02:31 So if you go back about 10-ish years or so, I was in grad school for economics. I took a class called

02:39 probability modeling, which was a lot of simulating Markov chains and doing Monte Carlo simulations and

02:45 things like that. And the class was actually taught in MATLAB. School had a site license for MATLAB that

02:50 only worked on campus. And so that meant that if I went home and worked at home, which I like to do,

02:55 I couldn't use MATLAB. And so as an alternative, I found that I could use Python and NumPy to basically

03:02 do all the MATLAB-y things without that site license. So that's actually why I started being

03:08 into Python. And then I just kind of liked it and stuck with it.

03:10 Yeah, I can definitely see how you would like working in Python more than MATLAB. I've

03:14 spent my fair share of time in the .m files and it's all right, but it's not Python.

03:20 You were studying economics and that's kind of how you got interested in data science,

03:25 trying to answer these big questions. The subject of economics or some other way?

03:30 I followed sort of a convoluted path. My background was math and economics.

03:34 And I actually started off doing quantitative finance, options, pricing, mathematics of financial

03:39 risk. I worked at a hedge fund. And then when the hedge fund went out of business, I just kind of

03:43 lucked into this job at a startup called Farecast, where I was doing a lot of kind of BI work,

03:50 writing SQL queries, building dashboards. And over the years, I just sort of moved more and more in a

03:58 data science kind of direction and eventually became a data scientist and software developer.

04:02 And certainly the math and economics training was helpful for that. But it was

04:06 a career that I sort of ended up in by accident rather than through a deliberate plan.

04:11 That's a little bit like my career path as well. Studied math and just sort of had to learn the

04:17 programming and stumbled into it. That's cool. What do you do day to day now?

04:21 So I just started a new job. I work at a nonprofit. It's called the Allen Institute for Artificial

04:27 Intelligence. It was founded by Paul Allen, who was one of the Microsoft founders. And we're

04:33 basically doing kind of fundamental AI research. I can't tell you that much about what I do. So

04:38 it's really my second week and I'm still kind of just learning what goes on. I'm learning my way

04:43 around. But I work on a team called Aristo, which is basically building AI to take science quizzes and

04:51 to understand science. Oh, very interesting. Yeah, that sounds like a fascinating place to work.

04:56 It's really neat. Yeah. A lot of smart people, interesting problems.

05:00 Would you say it's different than working in a hedge fund or is it surprisingly similar?

05:03 No, it's totally different. I mean, it's similar in that there's a lot of smart people,

05:07 but it's not really similar in any other way. Yeah. The goals are not necessarily the same,

05:11 are they? The goals, the incentives, the sort of day-to-day stress levels, it's all different.

05:16 Yeah. Yeah. Sounds great. So let's talk about your book a little bit. It's called Data Science

05:21 from Scratch. And I think that's a really cool way to approach it. We have all these super polished

05:28 data science tools. You know, we have them in Python, we have them in R and other languages as well.

05:34 Why from scratch, rather than just grabbing one of these libraries or a set of these libraries and

05:38 talking about how to use them? So a couple of reasons. One is, as I mentioned,

05:42 I come from a math background and the math way of doing things is you can't really use something

05:49 until you can prove it. I once had a teacher who came in and he looked at the syllabus for the

05:58 previous semester and he said, oh, good, you proved this theorem. That means I can use it in this class.

06:03 And so there's this real rigor around, you can't use things unless you understand them.

06:06 And that approach always kind of resonated with me. And so that, to a large degree,

06:11 that's the approach I took in the book. At the same time, you also have all these really powerful,

06:17 really easy to use libraries like scikit-learn, where you can go in and basically copy and paste,

06:24 you know, five lines of code. And you've built a decision tree or you've built a regression model.

06:29 And it's very easy to not know what you're doing. And you can pick up books and they'll tell you what

06:34 commands to type. But you can also type those commands and again, not know what you're doing.

06:39 And so I thought the book I would like, the book that I would have found valuable would be,

06:44 here's what these models are actually doing behind the scenes. And so then when it's time to apply

06:50 them, you kind of understand the principles, you understand where they go wrong, where they go right,

06:54 what they're good for, and what they're actually doing.

06:56 Yeah, I think that's, that's really interesting. And that is very much the mathematical way of,

07:01 if you can't prove it, you can't use it, sort of a way of thinking, which is not that common

07:07 in programming, you know, if there's an API, hey, just grab it and use it. But I think it's really

07:12 important in data science, because you have all these different disciplines and backgrounds coming

07:18 together to make it work, right? You're not just a pure programmer, and you're not, you know, a

07:24 mathematician or some sort of domain expert, right? You kind of have to blend these together.

07:28 Yep. And the other thing that I think ended up working pretty nicely about this approach was,

07:34 there's a lot of really mathy books about data science and machine learning, that's like,

07:38 here's an equation, here's an equation, here's an equation. But what I ended up doing was a little bit

07:43 of math. But then when it came time to, here's how something works, do it in working Python code.

07:49 So that it's rigorous in terms of here's all the steps laid out. But it's also, that's the code you

07:54 can run and sort of follow along.

07:57 Yeah, absolutely. Why Python and not something like R or some other language?

08:02 So the short answer is that, for whatever reason, R is not sympathetic with the way my brain works.

08:08 And whenever I try to sit down and start writing things in R, I just find that it doesn't work the way

08:13 I expect it to. The joke is that R is a language designed by statisticians for statisticians.

08:18 There's some truth to that, some unfairness to it. But for whatever reason, Python is much

08:24 friendlier to the way that my brain works and the way that I solve problems. Actually, the first draft

08:29 of my book was a lot harsher against R. And then some of the earlier reviewers didn't like that.

08:35 They really don't like that. So I kind of revised it to be more gentle.

08:39 Yeah, I think one of the things that's nice about Python is if you invest to learn data science

08:44 through Python, or the Python data science tools, rather, and then you want to build sort of a more

08:50 working application, you don't have to, you know, convert that to some other language, right? You're

08:55 already in Python. It's sort of a full stack thing you keep working with, right?

08:59 That's one important aspect. I think another important aspect is that from my perspective,

09:03 Python code, if written well, is super readable. And so you don't necessarily have to be a Python

09:11 expert to take a well written piece of Python code and understand what it's doing. Whereas other

09:16 languages can be a lot more impenetrable. And so I find it a nice teaching language from that

09:21 perspective as well.

09:22 Yeah, I would say, you know, most of the universities in the US came to that same conclusion,

09:27 right? With Python being the most common CompSci 101 course these days, as opposed to like Java or

09:33 Scheme, which I took.

09:34 Yeah, I mean, when I took, I also did Scheme when I took CompSci 101 way back in the day. And I actually

09:40 thought that was a really nice language for learning computer science. But I can see why Python would be

09:46 a better language for, you know, learning programming as opposed to computer science.

09:50 So when you start off your book, you actually have a few sections that are just about,

09:55 I'm going to teach you just enough Python to get started so that you can understand the data

10:02 science tools and what we're doing. What kind of stuff do you put in there? Like, what do you

10:06 consider fundamental for doing data science? And what can be sort of learned later?

10:10 The most important things I put in were obviously functions. If you want to do any kind of

10:16 analytic work, you need Python functions. The other thing I really tried to emphasize

10:20 was writing kind of clean Pythonic code. So I also did a lot of list comprehensions,

10:27 as well as understanding all the basic data structures. So lots of dicts and sets and some

10:34 of the less common ones like defaultdicts and Counters, I also make quite heavy use of.

10:40 So basically, anything you would need to understand, here's how to write out an algorithm in Python,

10:46 here's how to use the correct data structures in Python, I kind of baked into that initial

10:52 introduction. Whereas more advanced things like really specialized data structures like queues,

10:58 I just kind of introduced as needed later in the book, some of the more esoteric math functions,

11:03 I just kind of introduced when they came up.
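
As a rough illustration of the data structures named above (this is not an excerpt from the book, and the user and interest data is invented for the example), a defaultdict, a Counter, and a couple of comprehensions might be combined like this:

```python
from collections import Counter, defaultdict

# Hypothetical (user, interest) pairs, made up for illustration.
interests = [
    ("ann", "python"), ("bob", "statistics"), ("ann", "statistics"),
    ("cat", "python"), ("bob", "python"),
]

# defaultdict(list) creates the empty list the first time a key is seen,
# so there is no need to check whether the key already exists.
users_by_interest = defaultdict(list)
for user, interest in interests:
    users_by_interest[interest].append(user)

# Counter tallies how often each interest appears.
interest_counts = Counter(interest for _, interest in interests)

# A list comprehension with a filter: interests shared by at least three users.
popular = [name for name, count in interest_counts.items() if count >= 3]

# A set comprehension: the distinct users.
users = {user for user, _ in interests}

print(popular)
```

With a plain dict, the grouping loop would need an explicit "is this key present yet?" check, which is the kind of noise these structures remove.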

11:05 I think obviously, nicer dictionaries, things like defaultdict and so on,

11:09 played a really important role in the subsequent pieces where you're trying to make concise,

11:15 clean statements. I thought your use of sort of list comprehensions and generator expressions,

11:23 along with the other functional pieces like zip and Counter and so on, were really,

11:30 really interesting. You write some very concise code without being unreadable. I thought that was

11:35 nice.
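
To give a flavor of the concise functional style being praised here, the following sketch pairs zip with generator expressions. It is written for this transcript under the assumption that vectors are plain Python lists; the function names are illustrative and not necessarily the book's.

```python
import math


def dot(v, w):
    """Sum of componentwise products, pairing elements with zip."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))


def distance(v, w):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))


print(dot([1, 2, 3], [4, 5, 6]))
print(distance([0, 0], [3, 4]))
```

The generator expression inside sum() never builds an intermediate list, which keeps each function to a single readable line.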

11:35 That's one of the things I aim for. And as much time as you spend sort of revising the text of the book,

11:41 I also spent probably just as much time revising the code of the book. I went back over it many,

11:46 many times thinking, can I make this simpler? Can I make it cleaner? Can I make it easier to

11:50 understand? And I try and do that. I'm kind of a stickler for code elegance. It's kind of a

11:57 personal fault, maybe.

11:58 But also, you come out with very polished code. I found it very easy to follow. And I thought the

12:04 focus on this functional style programming was really neat. So one of the things you talk about

12:10 in the beginning is about visualizing data, because you have all these numbers. And obviously,

12:15 to understand them, putting a picture to it is very nice. And what is that? matplotlib?

12:19 Yes. In the book, I only use matplotlib. I was just having a conversation with some of my data

12:25 science buddies about a week ago, where they were, as always, complaining that the data visualization

12:31 story in Python is still not very good.

12:33 Yeah. So you said that matplotlib was starting to show its age in your book. What did you mean by

12:38 that?

12:39 When I first started learning Python, as I said, more than 10 years ago, as a MATLAB replacement,

12:44 matplotlib was around back then, and it had the same features and the same interface, and it

12:51 produced the same not particularly attractive plots. So it hasn't kind of evolved as the rest of the

12:57 language has evolved, say. And there's been some projects like Seaborn that try and put some prettiness

13:03 on top of it. And there's been some other attempts to kind of bring the R-style ggplot into Python.

13:09 But from my perspective, none of them has really like won the mindshare and become the solution.

13:14 So everyone kind of does something different.

13:16 Yeah. You also talked about Bokeh, which is from the Continuum guys. I had them on show 34. Can you

13:23 maybe talk about that really quickly? Just what is it? What's it used for? That sort of thing.

13:28 It is another visualization library. And I have to be honest with you, I haven't really

13:32 checked it out since I was writing the book like more than a year ago. So I seem to remember that it

13:38 seemed to have some facility for building in interactive plots and things like that.

13:42 It's kind of D3, fancy graphics in the browser type stuff, right?

13:47 I believe so. But I haven't spent much time playing with it.

13:49 Yeah.

13:49 I haven't been doing a lot of, probably doing actually more D3 than Python visualization recently.

13:55 Right, right. Tell everyone what D3 is. I don't think we've talked too much about that on the show.

13:59 Oh, sure. So D3 is actually a JavaScript library for building interactive data visualizations.

14:06 It brings sort of an interesting model where you bind your data to DOM elements, which are quite often

14:12 SVG elements. And so what that makes it easy to do is to have a data set. And when you add new data to it,

14:21 your plot updates, and it allows a lot of really interesting interactivity. If you check out the

14:26 D3 website, they have a gallery of all sorts of amazing visualizations where you look at and think,

14:32 how on earth did they do that?

14:33 Yeah. Whenever I go to the D3 website, I'm like, wow, I want to use this. I don't have a use for it,

14:39 but it's fantastic. How can I just build stuff that looks like this?

14:42 Well, I mean, so I have a joke, which is that good data scientists copy from the D3 gallery and great

14:48 data scientists steal from the D3 gallery. And they give you the code for all of them. So I would say

14:53 probably close to 100% of the D3 visualizations I've ever built have been something I found in the D3

15:00 gallery and kind of tweaked until it fit my data.

15:02 Yeah. Yeah. Nice. So the next section you talked about was like the fundamentals of the math and

15:08 science that you need to know in order to be a data scientist. And I like the way that you put it.

15:14 You said these are like the cartoon versions of big mathematical ideas. How much math do you need to

15:19 know to call yourself a data scientist? Like if I studied just programming, I'm not a data scientist.

15:25 If I studied just math, I'm not a data scientist. Like what's the story there?

15:28 Data science is a funny thing in that there are as many different jobs called data science as there

15:34 are data scientists. So there are people who will call themselves data scientists and they write SQL

15:39 queries all day. There are people who call themselves data scientists and they do cutting edge machine

15:44 learning research all day. There are people who call themselves data scientists who convince people to click on ads.

15:49 Like you could know almost no math and still get a job where you call yourself a data scientist.

15:53 You could know very little programming and get a job where you call yourself a data scientist.

15:58 It's a field that's still kind of figuring itself out. And there's just such a breadth of the different

16:03 roles that are all calling themselves data science. In an ideal world, you would know, you know,

16:08 linear algebra, you would know probability, you would know statistics. But among people who are data

16:14 scientists, some people know that stuff really well. Some people know that stuff not so well.

16:17 Yeah, I can imagine. It's really about you've got a lot of data or some specific type of data and you

16:23 want to answer questions about it or even discover the questions you could ask that nobody's asked, right?

16:29 SnapCI is a continuous delivery tool from ThoughtWorks that lets you reliably test and deploy your code

16:48 through multi-stage pipelines in the cloud without the hassle of managing hardware. Automate and visualize

16:55 your deployments with ease and make pushing to production an effortless item on your to-do list.

16:59 Snap also supports Docker and in-browser debugging, and they integrate with AWS and Heroku.

17:05 Thank SnapCI for sponsoring this episode by trying them with no obligation for 30 days by going to

17:12 snap.com/talkpython.

Sometimes it's actually the opposite, which is, I have a question I want to ask. Where can I find some

17:26 data that will allow me to answer that question? So, you know, sometimes you start with the data and

17:31 go to the questions. Sometimes you start with the questions and then you got to find the data.

17:34 Right. Okay. Very interesting. So that actually brought you to your next section that you talked

17:39 about in your book, which was getting data. And so what are the common ways that you talk about there?

17:45 Nowadays, there's a lot of people who just like post interesting data sets. So if you go to like

17:50 Kaggle, which is a site that does data science competitions, every one of their competitions has

17:55 a data set, which might be interesting in ways that are not related to the competition. The government websites

18:01 publish all sorts of data, some of which is potentially interesting. If you like weather,

18:05 they publish weather data. If you like economics, they publish economic data. There's a lot of

18:10 Python libraries for scraping websites. So even if a data set is not available, you can always go out and

18:15 try and scrape it and collect it yourself and clean it. My go-to source is always Twitter. I always build

18:21 things using Twitter data. So Twitter and all the other sites will have APIs where you can just make

18:27 RESTful calls or even use libraries. Python has some Twitter libraries that abstract all that away. And you can,

18:33 you know, collect tweets on a given topic or collect tweets from certain users and

18:37 collect your own tweets and do analysis on those.

18:41 Yeah. And you had a section about that in the book. What was the package you were using for that?

18:45 In Python, the package I usually use is called Twython. There's a bunch of them. That's the one that I

18:50 got to work the easiest, so I think highly of it.

18:53 I don't think I've tried any other ones.

18:56 Yeah. Yeah. Okay, cool. Yeah. I think Twitter is fairly special among the social networks.

19:01 To me, Twitter is the social network of ideas more than it is of friends or family or whatever. So

19:08 yeah, pretty cool. I love to get data from Twitter as well.

19:11 Yeah. It's just that they won't give you the fire hose, but for most things, they'll give you

19:15 more tweets than you can handle anyway. So if you want to find out what people are saying on a certain

19:19 topic or just what people are talking about in general, it's an awesome source for that.

19:24 Yeah. I saw some study or heard about some study a few years ago where people were studying, they were

19:30 trying to do sentiment, mood analysis on Twitter, and then trying to tie that back to the stock market

19:38 and predict how people were feeling on Twitter to near-term changes like in the next hour in the

19:45 stock market, which I thought was a pretty interesting project.

19:48 So ideas you can come up with using that data are pretty much endless. Every month, I think I'm done

19:53 with it and then I think of something else to do.

19:54 Nice. So then you kind of get into the topics that I think of as traditional data science,

20:01 machine learning, neural networks, network analysis, that kind of stuff. And I thought maybe we could go

20:07 through each section and you could tell me what it is that you need to sort of fundamentally

20:13 understand what are the from scratch basic pieces of each part of this data science and then maybe

20:20 some examples of problems you might solve with it.

20:22 Sure.

20:22 Yeah. So the first one you talk about is machine learning. And when you said machine learning

20:27 and later neural networks from scratch, I'm like, wow, that's a pretty big thing to take on from

20:33 scratch. What's the story there?

20:34 Machine learning at a high level is just learning some kind of model from data rather than sitting

20:41 down and writing out the model yourself by hand. And so if you have a small amount of data, then

20:47 you know, coming up with an algorithm that's going to learn a model from it is actually a pretty

20:51 reasonable thing to do if you go with a simple model and don't add too many bells and whistles.

20:56 It's only when you want to, you know, start producing recommendations at Netflix scale

21:01 or when you want to start, you know, building something to recognize speech patterns from audio files

21:06 that your beautiful handcrafted Python is probably not going to be up to the task.

21:10 Yeah. So I suppose the types of things you solve with machine learning is fairly unbounded. There's

21:19 a lot of problems that machine learning answers. What are some of your favorite examples?

21:22 Everything out there is machine learning these days. But if you want to talk about like projects that

21:29 I've worked on for fun, one time I built a classifier to predict or to identify Hacker News articles that I would be interested in or not

21:37 interested in. So it would take kind of the feed of new Hacker News stories and come up with a score between zero and one about how interested, you know, it thought I would be based on some initial seed values that I leave.

21:49 How did you teach it?

22:06 There were a few other kind of idiosyncratic features I threw in, like, you know, is it an Ask Hacker News? Is it a Show Hacker News? Does it have a dollar

22:15 amount in there? Because there are a lot of kind of bad stories that are like, here's how to make $5,000, I think. So that was a good negative signal.

22:23 It's sort of a spammy signal.
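
The hand-picked signals mentioned here (Ask HN, Show HN, a dollar amount in the title) could be turned into classifier inputs along these lines. This is a guess at the general shape, not Joel's actual code: the function name, the feature names, and the regular expression are all assumptions.

```python
import re

# Assumption: a "dollar amount" means a dollar sign directly followed by a digit.
DOLLAR_AMOUNT = re.compile(r"\$\d")


def title_features(title):
    """Turn a story title into a few yes/no features for a classifier."""
    lowered = title.lower()
    return {
        "is_ask_hn": lowered.startswith("ask hn"),
        "is_show_hn": lowered.startswith("show hn"),
        "has_dollar_amount": bool(DOLLAR_AMOUNT.search(title)),
    }


print(title_features("Ask HN: How do you learn statistics?"))
print(title_features("How to make $5,000 from home"))
```

A real classifier would combine features like these with the labeled examples to produce the score between zero and one described above; that training step is not shown here.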

22:24 Yeah. So, and then it turned out I wrote a blog post about it and someone posted the story to Hacker News and then

22:31 the Hacker News community got very angry that someone would think that not every story there was worth

22:36 reading. Because of course every story there is worth reading. So why would I want to filter some of them out?

22:39 They accused me of wanting to live in a bubble.

22:41 And yeah, they can be fairly critical on there. But that's funny. So the big question is, did your system

22:49 like your article, your blog post?

22:51 That's a good question. I don't know that I actually checked. I should go back and try and find that.

22:55 It'd be funny if it recommended it to you. Or not. Either way, it would be funny to know.

22:59 Well, you know, it's funny speaking of this. So, you know, I have an Android phone and so

23:03 Google Now will sometimes recommend me articles of things I'll be interested in.

23:07 And occasionally it will recommend to me like my own blog posts, which I guess means it's doing a good job.

23:13 You know, one of the things I did with machine learning was I have a five year old daughter and I do her clothes shopping.

23:19 And I noticed that little boys' clothes were very interesting and little girls' clothes tend to be

23:23 kind of like really boring. The boys' clothes have like dinosaurs and rockets and robots and

23:28 girls clothes have like hearts and flowers. And so I built a machine learning model to take an image of a

23:33 children's t-shirt and predict whether it was a boy shirt or a girl shirt.

23:38 Awesome. Yeah. And I agree with the classification there as well. I have three daughters and we still

23:44 bought a fair number of boy clothes for them.

23:46 Yeah, I did the same.

23:47 Nice. So the next topic that you covered was nearest neighbors. And I mean, conceptually,

23:53 I kind of know what nearest neighbors are pretty easily, but it's a computationally hard problem,

23:58 especially in higher dimensions. And so what kind of stuff do you do? What kind of problems do you solve with

24:03 this?

24:03 One thing you can do is when you don't have like a great parametric model for what's going on.

24:08 So for instance, one place where I've seen this applied is when I have some kind of like

24:12 time series type signal, and I want to know if it represents some kind of bad anomaly. And it might be,

24:19 you know, like thousands of points in some weird shape. And I don't have a great way to classify it.

24:22 But one thing you can do is if you have a bunch of labeled signals, you know, some are good or some

24:28 are bad, you just say, I'm just going to take my set of reference signals and figure out which are

24:33 the ones it's closest to. And that way, I don't need to necessarily have a model of what's a good

24:38 signal or what's a bad signal. I can just say, I have some labeled data, that's enough. And so I

24:43 think that's kind of the situation where it can be useful.

24:46 Yeah, nice. When you're still trying to explore things, and maybe you don't, you don't have a model

24:51 you're trying to fit it to, you're just trying to understand it, right?

24:53 Yeah. Or think about where like, you have some weird kind of shapes, or weird kind of patterns

24:59 that you can't really put math behind. What's a good pattern? What's a bad pattern? But you have

25:04 some labels, then you could use nearest neighbors to basically classify without having to have a

25:10 mathematical model behind it.
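
The labeled-signals idea described above can be sketched as a small k-nearest-neighbors classifier in plain Python. The reference signals, the labels, and the function names below are invented for illustration, and Euclidean distance is only one possible notion of "closest".

```python
import math
from collections import Counter


def distance(v, w):
    """Euclidean distance between two equal-length signals."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))


def knn_classify(k, labeled_signals, new_signal):
    """labeled_signals is a list of (signal, label) pairs.

    Returns the most common label among the k reference signals
    closest to new_signal.
    """
    by_distance = sorted(
        labeled_signals,
        key=lambda pair: distance(pair[0], new_signal),
    )
    k_nearest_labels = [label for _, label in by_distance[:k]]
    winner, _ = Counter(k_nearest_labels).most_common(1)[0]
    return winner


# Hypothetical reference set: flat signals are "good", spiky ones are "bad".
reference = [
    ([1.0, 1.1, 0.9, 1.0], "good"),
    ([0.9, 1.0, 1.0, 1.1], "good"),
    ([1.0, 1.0, 1.1, 0.9], "good"),
    ([1.0, 5.0, 1.0, 5.0], "bad"),
    ([5.0, 1.0, 5.0, 1.0], "bad"),
]

print(knn_classify(3, reference, [1.0, 0.9, 1.1, 1.0]))
```

No model of what makes a signal good or bad is written down anywhere; the labels alone do the work. The cost is that every prediction compares the new signal against the whole reference set, which is part of why the problem gets expensive at scale and in high dimensions.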

25:11 Okay, yeah, very cool. So then you get to another topic called Bayesian analysis, which the context I know this is around kind of determining spam and

25:21 filtering and stuff like that. But what's, what's the story of Bayesian analysis?

25:26 So it's named after Bayes' rule, which is just a theorem in statistics, having to do with ways of

25:33 reversing conditional probabilities. So at a high level, if you have some knowledge about

25:38 what is the probability of seeing certain features in email, given that an email is spam, you can use that

25:45 data to produce estimates of what is the probability that an email is spam, given that you see those

25:50 features. And so typically, if you have a lot of, you know, spam and non-spam, you can make estimates

25:56 of, okay, given that an email is spam, it's likely that I'll see Viagra, and it's likely that I'll see

26:02 get rich quick. And then basically just a classifier that turns that around and allows you to now look for

26:07 these features. And in a mathematically rigorous way, come up with the probability, okay, here's the

26:11 probability that this is spam. Okay. Yeah. Very interesting. What other types of problems?
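
Bayes' rule, as described above, turns P(features | spam) into P(spam | features): P(spam | features) = P(features | spam) P(spam) / (P(features | spam) P(spam) + P(features | not spam) P(not spam)). A minimal naive Bayes sketch of that reversal follows. The training messages are made up, the smoothing constant k and the 0.5 prior are assumptions, and the function names are illustrative rather than taken from the book.

```python
import math
from collections import Counter


def train(messages, k=0.5):
    """messages is a list of (text, is_spam) pairs.

    Returns {word: (P(word | spam), P(word | not spam))}, smoothed by k
    so that no probability is exactly 0 or 1.
    """
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in messages:
        words = set(text.lower().split())
        if is_spam:
            n_spam += 1
            spam_counts.update(words)
        else:
            n_ham += 1
            ham_counts.update(words)
    vocabulary = set(spam_counts) | set(ham_counts)
    return {
        word: ((spam_counts[word] + k) / (n_spam + 2 * k),
               (ham_counts[word] + k) / (n_ham + 2 * k))
        for word in vocabulary
    }


def spam_probability(word_probs, text, prior_spam=0.5):
    """Apply Bayes' rule, treating each word as independent of the others."""
    words = set(text.lower().split())
    log_spam = math.log(prior_spam)
    log_ham = math.log(1 - prior_spam)
    for word, (p_spam, p_ham) in word_probs.items():
        if word in words:
            log_spam += math.log(p_spam)
            log_ham += math.log(p_ham)
        else:
            log_spam += math.log(1 - p_spam)
            log_ham += math.log(1 - p_ham)
    spam, ham = math.exp(log_spam), math.exp(log_ham)
    return spam / (spam + ham)


# Hypothetical training data.
training = [
    ("cheap viagra now", True),
    ("get rich quick", True),
    ("viagra get rich", True),
    ("meeting notes attached", False),
    ("lunch at noon", False),
    ("quick question about notes", False),
]
word_probs = train(training)
print(spam_probability(word_probs, "get viagra") > 0.5)
```

Working with sums of logarithms rather than products of many small probabilities avoids floating-point underflow when the vocabulary is large.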

26:15 Spam is certainly easy to understand, but you know, hopefully that's solved by Gmail and other people

26:22 for us, right? I think it was Paul Graham who wrote a pretty influential article about this approach,

26:27 probably about 15 years ago now. And so I think a lot of the tools that, I don't know if it's what

26:33 Gmail uses, but a lot of tools like SpamAssassin and some of the other mail providers used are still

26:37 based on this principle of naive Bayes. But basically, anything where you have kind of big clumps of

26:44 say text, and you want to classify it into one or more classes, like two or more classes, and you're

26:52 willing to make this really big kind of technical assumption that the words in it are kind of independent

26:58 of each other. So you're treating them kind of like a bag of words rather than as a sequence of words,

27:02 then that's where you would use this kind of model. Okay. Yeah, cool. So maybe the opposite of that,

27:08 where you treat words sort of as, as having more meaning is in natural language processing, which is

27:14 another thing you talk about, right? Natural language processing is, is a huge field. Like people,

27:18 there are textbooks about it. In fact, there are Python books about it. So cramming it into a chapter of a

27:25 book is really just kind of, here's a really high level overview. And here's, you know, a couple of

27:31 examples to give you a flavor. Natural language processing is actually relevant to the sort of

27:37 problems that I'm getting into in artificial intelligence. It gets used a lot by all the voice recognition in

27:43 your phone, and a lot of these new, basically reading texts, and having a computer extract information from it. So it's a

27:52 pretty important, pretty, pretty rich field. Yeah. And there's some pretty decent Python libraries for

27:57 doing that, right? Yeah. So there's a, so the big one is called NLTK, the natural language toolkit,

28:04 I think it stands for. And so it has its own book just about natural language processing in that library.

28:10 And I think that's actually free on the web. But yeah, that would be a great place to start if you

28:15 want to come to understand it more deeply. There's also a nice Coursera course on natural language processing

28:20 that I took several years back that I liked. Oh, okay. Yeah, excellent. So let's see. Another topic you

28:25 covered was decision trees. And what's the story of these? A decision tree is kind of what it sounds

28:31 like. Yeah. You know, intuitively, you might have a model of making decisions based on a tree. So I have,

28:39 I want to know whether to buy a certain car or not. So, you know, I might start asking a question. Okay,

28:44 is it an American car or a Japanese car? American car. Okay. Go to the next question. Does it have

28:49 four doors or two doors? It has two doors. Okay. Is it gas or diesel? Diesel. And then you just have a

28:55 sequence of questions to ask and you classify it based on those questions. Okay. It's diesel. That

29:00 means, yes, I want to buy it. And so conceptually, you know, a tree like that is not that complicated.

29:06 The interesting part is: given some set of data, how do I build such a tree? And there, there's a variety of

29:13 algorithms, but the one that we talk about in the book, it is a pretty simple one, which is basically

29:18 around, okay, I have a bunch of data. It has a bunch of attributes. I can split my data on each

29:24 attribute. So I could say, okay, at this stage, I want to look at two door versus four door. I want to

29:28 look at Japanese versus American, or I want to look at gas versus diesel and which of those choices is

29:35 going to allow me to kind of really separate the good buys from the bad buys the most. So if gas versus

29:43 diesel totally splits where diesel is the buys and gas is the don't buys, that's a great thing to choose.

29:50 If gas versus diesel splits where, you know, on gas, I want to buy half and not buy half, and on diesel,

29:56 I want to buy half and not buy half, that's not a good split to choose because it doesn't really

29:59 help me at all. And so the mathematics here are just around kind of making this precise with this

30:05 entropy and building these trees from the data. Okay. Yeah, that's very cool. The way where you

30:11 build them knowing the data intimately, you know, that, that makes a lot of sense. And you don't really

30:17 necessarily need data science to do that, right? That's just helping, you know, asking a few questions

30:23 and making a decision. But the reverse, I think is pretty interesting.

30:27 They build some of these expert systems in this kind of way. I think they apply them sometimes in like,

30:31 medical diagnosis and sometimes they do better than doctors.

30:35 Wow. Okay. Sometimes just let the data talk, huh? Rather than intuition.
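
The split-selection idea from the decision tree discussion above can be made concrete in a few lines of Python. This is a minimal sketch, not the book's code: the `cars` data, the attribute names, and the helper names `entropy` and `split_entropy` are all made up for illustration. It scores each candidate attribute by the weighted entropy of the buy/don't-buy labels after the split, and lower is better.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def split_entropy(rows, attribute, label_key="buy"):
    """Weighted average entropy of the labels after splitting rows on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[label_key])
    total = len(rows)
    return sum(len(group) / total * entropy(group) for group in groups.values())


# Hypothetical toy data: every diesel car is a buy, every gas car is not.
cars = [
    {"fuel": "diesel", "doors": 2, "buy": True},
    {"fuel": "diesel", "doors": 4, "buy": True},
    {"fuel": "gas", "doors": 2, "buy": False},
    {"fuel": "gas", "doors": 4, "buy": False},
]

# Pick the attribute whose split leaves the least uncertainty.
best = min(["fuel", "doors"], key=lambda attr: split_entropy(cars, attr))
```

Here fuel type separates the buys from the don't-buys perfectly (entropy 0), while door count leaves each group half and half (entropy 1 bit), so `fuel` is the split to choose.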

30:40 This episode is brought to you by Hired. Hired is a two-sided, curated marketplace that connects the

30:56 world's knowledge workers to the best opportunities. Each offer you receive has salary and equity presented

31:01 right up front, and you can view the offers to accept or reject them before you even talk to the company.

31:07 Typically candidates receive five or more offers within the first week, and there are no obligations

31:11 ever. Sounds awesome, doesn't it? Well, did I mention the signing bonus? Everyone who accepts a job

31:16 from Hired gets a $1,000 signing bonus. And as Talk Python listeners, it gets way sweeter.

31:22 Use the link hired.com/talkpythontome and Hired will double the signing bonus to $2,000.

31:29 Opportunity is knocking. Visit hired.com/talkpythontome and answer the call.

31:32 Somewhat related to that maybe is, you talked about neural networks as well.

31:42 Yeah. So neural networks are a super hot topic right now, because I'm sure you've probably heard

31:47 all the buzz about deep learning, right? And deep learning tends to have neural networks

31:54 at the root of it. So neural networks are basically a way of building up kind of layers and layers of

32:01 representations. This book doesn't really go into deep learning because

32:06 a single neural network is hard enough to do by hand or from scratch. But basically it's a way to

32:12 kind of build a classifier that works similar to how a toy model of a brain might work. So you have

32:20 artificial neurons, and each neuron has a bunch of inputs that go into it with weights. If the weighted

32:26 sum of the inputs exceeds some certain threshold, the neuron fires. And if it doesn't exceed that

32:30 threshold, the neuron doesn't fire. So you present an input, which could be like an image, so basically

32:36 a bunch of zeros and ones, and that causes some neurons to fire. And that propagates through this

32:42 network. And in the end it will spit out, you know, I think this should be an image of a cat,

32:46 or I think this should be an image of a dog. And you train it by showing it a lot of labeled images

32:52 and adjusting the weights based on how you got them wrong.
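
The threshold neuron described above fits in a few lines. This is a hedged sketch under simplifying assumptions: the function names are invented here, and the weight update shown is the classic perceptron rule for a single neuron, swapped in as the simplest way to "adjust the weights based on how you got them wrong". It is not the training procedure from the book or from any deep learning library.

```python
def neuron_fires(inputs, weights, threshold):
    """Return True if the weighted sum of the inputs exceeds the threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return weighted_sum > threshold


def perceptron_update(weights, inputs, target, predicted, learning_rate=0.1):
    """Nudge each weight in proportion to the error and to its input.

    target and predicted are 0 or 1; a correct prediction leaves weights unchanged.
    """
    error = target - predicted
    return [w + learning_rate * error * x for w, x in zip(weights, inputs)]


# With weights [1, 1] and threshold 1.5 the neuron behaves like a logical AND.
and_weights = [1, 1]
both_on = neuron_fires([1, 1], and_weights, 1.5)   # weighted sum 2 exceeds 1.5
one_on = neuron_fires([1, 0], and_weights, 1.5)    # weighted sum 1 does not

# The neuron should have fired (target 1) but did not (predicted 0),
# so the weights on the active inputs are increased.
updated = perceptron_update([0.0, 0.0], [1, 1], target=1, predicted=0)
```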

32:55 Yeah. Interesting. So you basically just say, these are the inputs and the decisions, and you feed it

33:02 known data. You say, you know, like, here's a cat, here's a cat, that's a dog. Is this a cat or a dog,

33:07 right? And then you ask that question. I was gonna say, well, one of the problems that I sort

33:11 of remember from neural networks is that they, as you try to get them more accurate,

33:15 they get over-trained to just only do little bits of stuff. And so how do things like decision

33:21 forests or these things and this deep learning, how do they deal with that?

33:26 So random forests are basically taking multiple decision trees and combining them. So there's this

33:33 kind of general principle where a lot of times rather than building one model that's like super predictive,

33:40 I take a lot of much less predictive models and just kind of let them vote on things or average

33:47 the results or whatever. So you're right that if I build a decision tree and I include like all

33:52 hundred features in my data set, there's a good chance I'm going to like overfit and

33:57 overlearn my training data and not generalize outside of that. So one thing that happens with decision

34:04 trees is people often don't use the bare decision trees. They use the random forest where they'll build

34:10 a bunch of smaller decision trees, each of which is really restricted to a small subset of features

34:16 so that each one is individually less powerful. But then when you combine them, they do well in the aggregate

34:22 and they don't have necessarily the same overfitting problem that a single decision tree with lots of features would.

34:28 So in neural networks, especially in deep learning, there are not exactly the same techniques, but other

34:34 techniques where you will zero out some of your weights sometimes and train without them to make

34:40 sure that they don't learn too much. And there's a lot of other techniques that get used in order to

34:44 make sure you're not overfitting. That's something that, you know, data scientists and machine learning

34:47 people worry about a lot.
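
The "let them vote" step of an ensemble can be sketched as follows. This only illustrates the voting; it is an assumption-laden toy in which each "tree" is a hand-written function that looks at a single feature. A real random forest would train each tree on a resampled data set with a random subset of features. The names `majority_vote`, `forest_predict`, and the car features are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common prediction among the individual models."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, row):
    """Ask every small model for a prediction and let them vote."""
    return majority_vote(tree(row) for tree in trees)


# Three deliberately weak models, each restricted to one feature.
trees = [
    lambda row: "buy" if row["fuel"] == "diesel" else "skip",
    lambda row: "buy" if row["doors"] == 2 else "skip",
    lambda row: "buy" if row["origin"] == "Japanese" else "skip",
]

car = {"fuel": "diesel", "doors": 4, "origin": "Japanese"}
decision = forest_predict(trees, car)  # two of the three models vote "buy"
```

Using an odd number of voters, as here, avoids ties in a two-class problem.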

34:48 It's a really hot field, like you said right now. That's awesome. So one more section I'd like to touch

34:54 on is you talked about this thing called recommender systems.

34:58 So, well, they're just what they sound like. Anyone who's used the internet has a pretty good experience

35:02 where you go to Netflix and it says movies for Joel and it's trying to predict what it thinks I'll like.

35:09 Or you go to Amazon and it will recommend items for me. And you can go a little further where you have

35:16 all these startups like Stitch Fix is a very hot data science startup where you tell them what kind of clothes

35:22 you like and they'll send you a box of clothes every month that they think you'll like a lot.

35:28 And so generating these kind of recommendations is a pretty popular task within data science. A lot of data scientists

35:36 work on these kind of problems. There are a lot of data scientists working in this kind of job, because these companies have to sell you stuff

35:41 and they always want to sell you stuff that they think you'll like.

35:44 You convert better if you figure out what people actually might want, right?

35:47 Yeah, I mean, well, if Amazon sends me an email, "Hey, Joel, you know, these five things are on sale

35:54 and they're five things that I really want to buy," then that's much better for them and potentially

35:58 for me than if they send me a random email that's like, "These are the five most popular things on

36:03 Amazon today." Yeah, it definitely is better. I really like going to my Kindle and looking at

36:08 the recommended things based on what I've been reading. But the Netflix recommendations, I know they

36:14 do really great work, but it just doesn't work for me because my seven-year-old daughter watches

36:19 Strawberry Shortcake and other random things. I get a lot of kid shows recommended to me.

36:25 They have a "Who's watching" button that you can click and say "A kid is watching."

36:30 I know, but my daughter won't use it. She'll just randomly pick one.

36:33 Yeah, so you know, it's funny. I use that button pretty well, but my Netflix recommendations are also

36:40 not that good. But I think it's just because I don't like anything on there. So no matter what they're

36:45 telling me, I'm not going to like it. Yeah, I hear you. I feel like you covered a pretty

36:49 wide swath of data science from scratch. Some of the topics were really accessible. Some of them

36:57 required more math and more thinking, but it was all a really nice presentation. What do you feel like

37:04 you left out? There aren't any topics that, if anything, I kept adding stuff while I was writing it.

37:11 So I don't feel like there were any huge topics that I necessarily left out. But in terms of coverage,

37:17 probably my biggest regret is that I used Python 2 instead of Python 3.

37:20 Yeah. Okay. And I saw on your GitHub repo, you had something about Python 3 in there. Is that right?

37:27 Yes. So pretty much as soon as the book came out, one, I got a lot of emails from people saying,

37:32 "Hey, why do you not use Python 3?" And then I also got a number of emails saying,

37:37 "I would like to use Python 3. Will the code work?" And so I wrote them back and I said,

37:41 "Yeah, you know, I don't feel like the code shouldn't work. Give it a try. I bet it works

37:46 with probably a few changes, add some parentheses, print statements and so on." And then eventually one

37:51 guy wrote me back and he said, "It doesn't work." I said, "Okay." So I sat down and I said, "I'm going to

37:57 convert the code to Python 3." And it took me about, I'd say four to five hours. And that's with me

38:04 knowing the code intimately. So it would have taken someone who hadn't written the code in the first

38:09 place a lot longer than that. And then I felt kind of guilty that I've been telling all these people that

38:12 it was so easy to do when it wasn't.

38:14 Just spend a week. It'll be fine.

38:16 Yeah. But so yeah, I have the Python 3 versions of the code up on the GitHub. I sort of regret not

38:23 having just done it that way in the first place.

38:25 Yeah, sure. So at the end, you talked about data science, not from scratch, and you pointed out

38:31 a lot of the libraries you might actually use, like NumPy and so on. Do you want to talk about that

38:37 really quick? Like what's the real data science versus the from scratch data science comparison?

38:43 Yeah. So I would say that NumPy is pretty fundamental. That's basically the linear algebra

38:51 library for Python. So it provides you matrices, matrix algebra, high performance arrays, things

38:58 like that, that you don't get just built in. And you might not use it directly, but a lot of the other

39:05 libraries are really built on top of it. So kind of the most broadly accessible machine learning

39:11 library for Python is called Scikit Learn. And it has really nice documentation and really nice tutorials

39:16 and a fairly standard API for building machine learning models. Any time you want to build a

39:21 regression model or a random forest model or any kind of classifier, that's probably the place to go.

39:27 There's also Pandas, which is the data frame library, which is good if you're working with tabular data. So

39:34 not necessarily the machine learning side of data science, but more of the, I have kind of a

39:39 spreadsheet data set and now I want to clean it and aggregate it and pivot it and look for kind of data

39:48 analysis type insights.

39:49 Yeah. If you're exploring, like you said, tabular data and you kind of load it up and clean it,

39:56 Pandas seems really fantastic for that.

39:57 It's a really nice library. And then kind of the new kid on the block is TensorFlow, which is

40:03 Google's deep learning library. And it was only released, you know, a few months ago and it's gone

40:09 through. It's not 1.0 yet, but it seems like people are sort of converging around it as that's how they're

40:15 going to do deep learning in Python. Now there are other sort of previous libraries that some people

40:20 used and still use, but TensorFlow seems to be gaining a lot of mindshare.

40:24 Okay. Yeah, that's really cool. And I've definitely seen TensorFlow talked about a lot in this context,

40:30 but it's just a library. You can download that and run it locally. It's not like a cloud type thing,

40:35 right?

40:35 Yes. So currently it's a local library and I'm not sure if they've come out with a version that you can

40:42 kind of do your own cloud, like on various AWS instances. But, but I know they're definitely going

40:47 in that direction because a lot of this stuff, these deep learning models take a really long time to

40:52 train. And so if you want to use them for anything serious, you want to distribute them and

40:56 throw them on high-powered machinery and not just run them on your laptop.

40:59 Yeah, of course. If you have tons of data, maybe it's better to get a bunch of machines

41:03 for an hour. So what do you think about these cloud learning or data science platforms? I'm thinking like

41:09 Azure machine learning, or, you know, I just had SigOpt on the show a few shows ago. I'm not really sure

41:15 what else is out there in terms of like go out to the cloud and grab some data science stuff.

41:20 What do you think about that?

41:21 I haven't spent much time looking at any of those. I think they can add value in terms of either one,

41:31 if you don't have the data science or machine learning expertise in house to do whatever you need to do.

41:37 Or two, if you have some kind of model that you've built, but you need help either putting it into

41:43 production or operationalizing it somehow, that they can fill a pretty good role there. But my sense is

41:50 that most people doing data science are aligned more on running the libraries and running the models

41:56 themselves. But that could be just my biased sample of the people who I talk to.

42:00 Yeah, of course, of course. Cool. So another thing that you said you're into is taking some of the

42:07 ideas from Haskell and thinking about how those might manifest in Python. What are you up to there?

42:12 Haskell, if you're not familiar with it, is kind of the purest of the pure functional

42:19 languages with strong types and lazy evaluation and things like that. And so once I spent some

42:28 time in that world, I spent a lot of time thinking about how can I bring some of these concepts back

42:33 into Python. And in Python 3, lazy evaluation plays a much bigger role in the sense that like range

42:42 is a generator instead of a list and all the map and filter and things like that also are generators

42:48 instead of lists. But I started getting into the itertools library, which starts giving you tools for

42:55 generating basically infinite sequences and just trying to see how far I could go using

43:01 pure functions and infinite sequences and avoiding mutable variables and other things that you try not

43:07 to do when you're working in a Haskell-like language.
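
A short sketch of the lazy, infinite-sequence style mentioned above, using only the standard library. Strictly speaking, in Python 3 `range` is a lazy sequence object and `map` and `filter` return iterators rather than generators, but the effect described is the same: nothing is computed until it is asked for. The helper name `naturals` is made up for this illustration.

```python
from itertools import count, islice, takewhile


def naturals():
    """An infinite, lazily produced sequence 1, 2, 3, ..."""
    return count(1)


# map is lazy in Python 3, so mapping over an infinite sequence is fine
# as long as only a finite prefix is ever consumed.
squares = map(lambda n: n * n, naturals())
first_five = list(islice(squares, 5))

# filter is lazy too; takewhile stops consuming once the predicate fails.
evens = filter(lambda n: n % 2 == 0, count(0))
small_evens = list(takewhile(lambda n: n < 10, evens))
```

`first_five` holds the first five square numbers and `small_evens` holds the even numbers below 10, even though both source sequences are conceptually infinite.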

43:10 Yeah. And how did you feel like it came out in the end? Do you feel like you were able to bring a lot

43:14 of those ideas over?

43:15 I was, and I ended up producing code that was really neat and really impenetrable. The lazy

43:23 infinite sequences stuff, that was more almost academic in terms of like, yes, I managed to do it. Like,

43:29 this is mathematically interesting and it works well, but it's not readable at all.

43:34 So imagine you wanted to represent, let's say a binary tree in Python. Kind of the two,

43:41 I would say obvious approaches would be one to make some kind of like class where it has a,

43:47 you know, a value element and a left element and a right element. And then you also might just use

43:53 a dictionary to represent it where it had those keys. In a language with algebraic data types,

43:57 like Haskell, you would just basically represent that kind of tree as a product type where it just

44:04 has like three elements. And so I said, you know, what if I just represented a tree as a tuple with

44:09 three elements where first element is the left subtree, the second element is the value,

44:13 third element is the right subtree. And similarly, if you want to do like linked lists in Python,

44:17 which you probably don't want to do, but if you did want to do linked lists in Python, you just treat

44:21 them as a tuple. First element is the element and the second element is the tail linked list. And so,

44:28 I actually found that I was able to write some pretty nice code using those kinds of ideas.

44:32 And I did some coding interviews that way. I'm not sure the interviewers appreciated it.
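
The tuple representations described above might look like this. It is a sketch of the idea, not the guest's actual code: an empty tree or empty list is represented by `None` (an assumption made here), a tree is `(left, value, right)`, and a linked list is `(head, tail)`. The function names are invented for the example, and nothing is mutated; `insert` returns a new tree.

```python
def insert(tree, value):
    """Return a new binary search tree (left, value, right) with value added."""
    if tree is None:
        return (None, value, None)
    left, node_value, right = tree
    if value < node_value:
        return (insert(left, value), node_value, right)
    return (left, node_value, insert(right, value))


def in_order(tree):
    """Return the values of the tree in sorted order."""
    if tree is None:
        return []
    left, node_value, right = tree
    return in_order(left) + [node_value] + in_order(right)


def cons(head, tail):
    """Build a linked list cell: the element plus the rest of the list."""
    return (head, tail)


def to_list(linked):
    """Walk a (head, tail) linked list and collect its elements."""
    items = []
    while linked is not None:
        head, linked = linked
        items.append(head)
    return items


tree = None
for v in [5, 2, 8, 1]:
    tree = insert(tree, v)

sorted_values = in_order(tree)
numbers = to_list(cons(1, cons(2, cons(3, None))))
```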

44:38 What is this guy talking about? Yeah, cool. So one question, given your work at the AI Institute and

44:47 your background in data science and so on, I wanted to ask is: in the last year or so, there's been a lot of

44:54 news items and people coming out saying that artificial intelligence is a danger to humanity.

45:01 What's your thought? Is like AI something we should be super excited about? Or is it something we should

45:07 be maybe cautious about? I would say I probably fall in the middle. I mean, I'm excited about it because

45:12 this is my job, but I don't go around encouraging everyone else that they have to be excited about it

45:18 because I don't know that that's necessarily warranted. But by the same token, I don't spend a lot of time

45:23 worrying about how dangerous it is. I think we're pretty far off from the time when we have to worry

45:29 about that. And I do have some friends who think we should worry about it now before it's too late. But

45:35 I think there's a lot more important things to worry about in the world when I read the news.

45:39 Yeah, I kind of agree with you on that. And I think there's two, certainly two ways to look at that. On

45:47 one hand, if you think about things like self-driving cars, let's just take that as an example. Like,

45:53 I believe one of the biggest job categories for men in the United States is some form of driving,

46:02 like driving a truck, driving a taxi, those types of things, delivery vehicles and so on. And if self-driving

46:08 cars were to like remove all that, like that would have large social effects. But I think, you know,

46:13 that's not so much the way that people, at least recently in the news, were talking about it. It's

46:19 more like Terminator style, right? And so that I, I'm not too worried about this personally. Who knows?

46:26 Yeah. I mean, I personally would pay a lot of money for a self-driving car because I,

46:30 I don't like driving that much. And I'd much rather be able to read while I'm going somewhere.

46:35 Yeah. Driving is fun until you get stuck on I-5 for half an hour inching along. Then you know,

46:40 I don't like driving anymore. Exactly.

46:42 Awesome. Well, we're kind of coming up near the end of the show. Let me ask you just a

46:46 few closing questions. I always ask my guests, if you're going to write some Python code,

46:50 what editor do you open up? So these days I tend to use Atom for pretty much everything,

46:55 memory leaks and all. Nice. Yeah. That's from GitHub, right?

46:58 Yeah. It's pretty similar to Sublime, but it doesn't nag you for money. So yeah, it's,

47:03 it's pretty nice. And I think it's atom.io. They have a really cool little video about how

47:08 it's the editor of the future. It's nice.

47:10 Yeah. I don't know if I go that far, but it's the editor of the present at least.

47:14 Yeah. It's like a George Jetson sort of like a promo video. It's, it's pretty funny.

47:18 It is pretty nice. Isn't it? Does it have good Python support?

47:21 You know, I'm never, I'm not someone who leans on like IDE functionality a lot. So if good Python

47:28 support counts as syntax highlighting, yes, but that's all I tend to use it for. So yeah.

47:32 Yeah. Yeah. Okay. Cool. And if you look on the Python Package Index, there's,

47:38 you know, 75 plus thousand packages and we all have experience with, you know, different parts

47:46 of it and there's things that we love and would recommend, like, what is your favorite one you

47:49 might recommend that people maybe don't know about?

47:51 So the one I recommend that people don't necessarily know about is called Beautiful Soup. It's

47:57 basically an HTML parsing library. And so if you start scraping data from webpages, you're going to get

48:04 a big mess of ugly HTML. That's probably not even well formed. Most of the time,

48:10 most people don't bother to well form their HTML.

48:12 Are you telling me that I can't just like load that up as an XML document or something like this?

48:17 No, I'm just kidding. Of course it's, it's terrible trying to work directly on the web,

48:21 right? And Beautiful Soup is really, I really like it as well.

48:25 It's really nice. I mean, you have to spend a little bit of time getting used to

48:28 its API and interface and everything, but it's super handy for getting data out of webpages and

48:34 doing anything where you have to do a bunch of scraping.

48:37 And you cover that in your book, right? In the getting data, I use Beautiful Soup a little bit and show people how to use it.

48:42 Yeah, I cover it a little bit. It's a nice addition to the data scientist toolkit.

48:46 Nice. So Joel, how do people find your book? Amazon? Just a Google search for data science

48:52 from scratch? Yeah, it's on Amazon and you can buy it from O'Reilly.com. But yeah,

48:57 if you Google data science from scratch, you'll find it.

48:59 Cool. And I'll put a link to the GitHub repo where you have all the code examples and so on as well.

49:04 It's been really fun to talk about data science. And I think you have a really interesting way of teaching

49:10 people to appreciate the tools that we're all fairly familiar with by showing you how to build it from

49:16 scratch. So thanks for that.

49:19 My pleasure.

49:21 This has been another episode of Talk Python to Me. Today's guest was Joel Grus and this episode

49:25 has been sponsored by SnapCI and Hired. Thank you guys for supporting the show. SnapCI is modern,

49:30 continuous integration and delivery. Build, test and deploy your code directly from GitHub,

49:34 all in your browser with debugging, Docker and parallelism included. Try them for free at

49:39 snap.ci/talkpython. Hired wants to help you find your next big thing. Visit hired.com/talkpythontome to get five or

49:45 more offers with salary and equity presented right up front and a special listener signing bonus of $2,000.

49:50 Are you or a colleague trying to learn Python? Have you tried books or videos that left you bored by just

49:56 covering the topics point by point? Check out my new online course, Python Jumpstart by Building 10 Apps, at

50:01 talkpython.fm/course to experience a more engaging way to learn Python. You can find links

50:07 from this show at talkpython.fm/episodes/show/56. Be sure to subscribe to the show. Open

50:14 your favorite podcatcher and search for Python. We should be right at the top. You can also find the

50:18 iTunes feed at /itunes, the Google Play feed at /play and the direct RSS feed at /rss on

50:25 talkpython.fm. Our theme music is Developers, Developers, Developers by Corey Smith, who goes by Smix.

50:31 You can hear his entire song at talkpython.fm/music. This is your host, Michael Kennedy.

50:36 Thank you so much for listening. Smix, take us out of here.

50:39 Stating with my voice. There's no norm that I can feel within. Haven't been sleeping. I've been using lots of rest. I'll pass the mic back to who rocked it best.

50:49 I'm first developers.

50:51 I'm first developers.

50:58 Developers, developers, developers, developers.

