Data Science from Scratch
This is a discipline that combines the scientific inquiry of hypotheses and tests, the mathematical intuition of probability and statistics, the AI foundations of machine learning, a fluency in big data processing, and the Python language itself. That's a very broad set of skills you'll need to be a good data scientist, and yet each one is deep and often hard to understand.
That's why I'm excited to speak with Joel Grus, a data scientist from Seattle. He wrote a book to help us all understand what's actually happening when we employ libraries such as scikit-learn or numpy. It's called Data Science from Scratch and that's the topic of this week's episode.
Links from the show:
Book: Data Science from Scratch: amzn.to/1rhcbdT
Joel on Twitter: @joelgrus
Joel on the web: joelgrus.com
Partially Derivative Episode: partiallyderivative.com
Allen Institute for Artificial Intelligence: allenai.org
Data Science Libraries
numpy: numpy.org
Numpy episode: #34: Continuum: Scientific Python and The Business of Open Source: talkpython.fm/episodes/show/34
pandas: pandas.pydata.org
scikit-learn: scikit-learn.org
scikit-learn episode: #31: Machine Learning with Python and scikit-learn: talkpython.fm/episodes/show/31
matplotlib: matplotlib.org
Google's TensorFlow: tensorflow.org
Episode Transcript
Collapse transcript
00:00 You likely know that Python is one of the fastest growing languages for data science.
00:03 This is a discipline that combines the scientific inquiry of hypotheses and tests,
00:07 the mathematical intuition of probability and statistics, the AI foundations of machine learning,
00:12 a fluency in big data processing, and the Python language itself. That's a very broad set of skills
00:18 you'll need to be a good data scientist, and yet each one is deep and often hard to understand.
00:23 That's why I'm excited to speak with Joel Grus, a data scientist from Seattle. He wrote a book to
00:28 help us all understand what's actually happening when we employ libraries such as scikit-learn or
00:32 numpy. It's called Data Science from Scratch, and that's the topic of this week's episode.
00:36 This is Talk Python to me, episode number 56, recorded April 19th, 2016.
00:56 Welcome to Talk Python to me, a weekly podcast on Python, the language, the libraries, the
01:11 ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm at
01:16 mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on
01:22 Twitter via at talkpython. This episode is brought to you by SnapCI and Hired. Thank them for supporting
01:28 the show on Twitter via at snap underscore CI and at Hired underscore HQ. Hey, everyone. It's great to be
01:36 back with you. I have something to share before we get to the main part of the show. I had a chance to be on the
01:40 Partially Derivative podcast this week, where we did a short segment on programming tips for data scientists,
01:46 relevant to this episode, wouldn't you say? I talked about using generator methods and the yield
01:51 keyword to dramatically improve performance while processing lots of data in a pipeline.
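That generator-pipeline idea can be sketched in a few lines (the stage names and sample data here are invented for illustration, not from the episode):

```python
# Each stage is a generator, so rows stream through the pipeline
# one at a time instead of all being loaded into memory at once.
def read_rows(lines):
    for line in lines:
        yield line.strip().split(",")

def keep_valid(rows):
    for row in rows:
        if len(row) == 2:   # drop malformed rows
            yield row

def to_floats(rows):
    for name, value in rows:
        yield name, float(value)

lines = ["a,1.5", "bad line", "b,2.5"]
print(list(to_floats(keep_valid(read_rows(lines)))))  # [('a', 1.5), ('b', 2.5)]
```

Because nothing runs until the final `list` (or a `for` loop) pulls values through, the same code handles three lines or three million without changing memory use.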
01:55 If you want to hear more about that kind of stuff, check out the link to the Partially Derivative episode
01:59 in the show notes. Now, let's talk to Joel. Joel, welcome to the show. Thanks. I'm glad to be here.
02:04 Yeah, I'm really looking forward to doing data science, but from scratch today.
02:09 I know. That's my book. I'm the right person to be here for that topic.
02:12 Fantastic. So we're going to dig into data science. And I think your idea of
02:16 taking it from the fundamentals and building something that's not so complicated or so well
02:22 polished and optimized that you can really understand what you're doing is great. But before we get into
02:26 that, let's just sort of start from the beginning. How did you get into programming in Python?
02:31 So if you go back about 10-ish years or so, I was in grad school for economics. I took a class called
02:39 probability modeling, which was a lot of simulating Markov chains and doing Monte Carlo simulations and
02:45 things like that. And the class was actually taught in MATLAB. School had a site license for MATLAB that
02:50 only worked on campus. And so that meant that if I went home and worked at home, which I like to do,
02:55 I couldn't use MATLAB. And so as an alternative, I found that I could use Python and NumPy to basically
03:02 do all the MATLAB-y things without that site license. So that's actually why I started being
03:08 into Python. And then I just kind of liked it and stuck with it.
03:10 Yeah, I can definitely see how you would like working in Python more than MATLAB. I've
03:14 spent my fair share of time in the .m files and it's all right, but it's not Python.
03:20 You were studying economics and that's kind of how you got interested in data science,
03:25 trying to answer these big questions. The subject of economics or some other way?
03:30 I followed sort of a convoluted path. My background was math and economics.
03:34 And I actually started off doing quantitative finance, options, pricing, mathematics of financial
03:39 risk. I worked at a hedge fund. And then when the hedge fund went out of business, I just kind of
03:43 lucked into this job at a startup called Farecast, where I was doing a lot of kind of BI work,
03:50 writing SQL queries, building dashboards. And over the years, I just sort of moved more and more in a
03:58 data science kind of direction and eventually became a data scientist and software developer.
04:02 And certainly the math and economics training was helpful for that. But it was
04:06 it was a career that I sort of ended up in by accident rather than through a deliberate plan.
04:11 That's a little bit like my my career path as well. Studied math and just sort of had to learn the
04:17 programming and stumbled into it. That's cool. What do you do day to day now?
04:21 So I just started a new job. I work at a nonprofit. It's called the Allen Institute for Artificial
04:27 Intelligence. It was founded by Paul Allen, who was one of the Microsoft founders. And we're
04:33 basically doing kind of fundamental AI research. I can't tell you that much about what I do. So
04:38 it's really my second week and I'm still kind of just learning what goes on. I'm learning my way
04:43 around. But I work on a team called Aristo, which is basically building AI to take science quizzes and
04:51 to understand science. Oh, very interesting. Yeah, that sounds like a fascinating place to work.
04:56 It's really neat. Yeah. A lot of smart people, interesting problems.
05:00 Would you say it's different than working in a hedge fund or is it surprisingly similar?
05:03 No, it's totally different. I mean, it's similar in that there's a lot of smart people,
05:07 but it's not really similar in any other way. Yeah. The goals are not necessarily the same,
05:11 are they? The goals, the incentives, the sort of day-to-day stress levels, it's all different.
05:16 Yeah. Yeah. Sounds great. So let's talk about your book a little bit. It's called Data Science
05:21 from Scratch. And I think that's a really cool way to approach it. We have all these super polished
05:28 data science tools. You know, we have them in Python, we have them in R and other languages as well.
05:34 Why from scratch, rather than just grabbing one of these libraries or a set of these libraries and
05:38 talking about how to use them? So a couple of reasons. One is, as I mentioned,
05:42 I come from a math background and the math way of doing things is you can't really use something
05:49 until you can prove it. I once had a teacher who came in and he looked at the syllabus for the
05:58 previous semester and he said, oh, good, you proved this theorem. That means I can use it in this class.
06:03 And so there's this real rigor around, you can't use things unless you understand them.
06:06 And that approach always kind of resonated with me. And so that, to a large degree,
06:11 that's the approach I took in the book. At the same time, you also have all these really powerful,
06:17 really easy to use libraries like scikit-learn, where you can go in and basically copy and paste,
06:24 you know, five lines of code. And you've built a decision tree or you've built a regression model.
06:29 And it's very easy to not know what you're doing. And you can pick up books and they'll tell you what
06:34 commands to type. But you can also type those commands and again, not know what you're doing.
06:39 And so I thought the book I would like, the book that I would have found valuable would be,
06:44 here's what these models are actually doing behind the scenes. And so then when it's time you apply
06:50 them, you kind of understand the principles, you understand where they go wrong, where they go right,
06:54 what they're good for, and what they're actually doing.
06:56 Yeah, I think that's, that's really interesting. And that is very much the mathematical way of,
07:01 if you can't prove it, you can't use it, sort of a way of thinking, which is not that common
07:07 in programming, you know, if there's an API, hey, just grab it and use it. But I think it's really
07:12 important in data science, because you have all these different disciplines and backgrounds coming
07:18 together to make it work, right? You're not just a pure programmer, and you're not, you know, a
07:24 mathematician or some sort of domain expert, right? You kind of have to blend these together.
07:28 Yep. And the other thing that I think ended up working pretty nicely about this approach was,
07:34 there's a lot of really mathy books about data science and machine learning, that's like,
07:38 here's an equation, here's an equation, here's an equation. But what I ended up doing was a little bit
07:43 of math. But then when it came time to, here's how something works, do it in working Python code.
07:49 So that it's rigorous in terms of here's all the steps laid out. But it's also, that's the code you
07:54 can run and sort of follow along.
07:57 Yeah, absolutely. Why Python and not something like R or some other language?
08:02 So the short answer is that, for whatever reason, R is not sympathetic with the way my brain works.
08:08 And whenever I try to sit down and start writing things in R, I just find that it doesn't work the way
08:13 I expect it to. The joke is that R is a language designed by statisticians for statisticians.
08:18 There's some truth to that, some unfairness to it. But for whatever reason, Python is much
08:24 friendlier to the way that my brain works and the way that I solve problems. Actually, the first draft
08:29 of my book was a lot harsher against R. And then some of the earlier reviewers didn't like that.
08:35 They really don't like that. So I kind of revised it to be more gentle.
08:39 Yeah, I think one of the things that's nice about Python is if you invest to learn data science
08:44 through Python, or the Python data science tools, rather, and then you want to build sort of a more
08:50 working application, you don't have to, you know, convert that to some other language, right? You're
08:55 already in Python. It's sort of a full stack thing you keep working with, right?
08:59 That's one important aspect. I think another important aspect is that from my perspective,
09:03 Python code, if written well, is super readable. And so you don't necessarily have to be a Python
09:11 expert to take a well written piece of Python code and understand what it's doing. Whereas other
09:16 languages can be a lot more impenetrable. And so I find it a nice teaching language from that
09:21 perspective as well.
09:22 Yeah, I would say, you know, most of the universities in the US came to that same conclusion,
09:27 right? With Python being the most common CompSci 101 course these days, as opposed to like Java or
09:33 Scheme, which I took.
09:34 Yeah, I mean, when I took, I also did Scheme when I took CompSci 101 way back in the day. And I actually
09:40 thought that was a really nice language for learning computer science. But I can see why Python would be
09:46 a better language for, you know, learning programming as opposed to computer science.
09:50 So when you start off your book, you actually have a few sections that are just about,
09:55 I'm going to teach you just enough Python to get started so that you can understand the data
10:02 science tools and what we're doing. What kind of stuff do you put in there? Like, what do you
10:06 consider fundamental for doing data science? And what can be sort of learned later?
10:10 The most important things I put in were obviously functions; if you want to do any kind of
10:16 analytic work, you're going to write Python functions. The other thing I really tried to emphasize,
10:20 was writing kind of clean Pythonic code. So I also did a lot of list comprehensions,
10:27 as well as understanding all the basic data structures. So lots of dicts and sets and some
10:34 of the less common ones like defaultdicts and Counters, I also make quite heavy use of.
10:40 So basically, anything you would need to understand, here's how to write out an algorithm in Python,
10:46 here's how to use the correct data structures in Python, I kind of baked into that initial
10:52 introduction. Whereas more advanced things like really specialized data structures like queues,
10:58 I just kind of introduced as needed later in the book, some of the more esoteric math functions,
11:03 I just kind of introduced when they came up.
11:05 I think obviously, nicer dictionaries, things like defaultdict and so on,
11:09 played a really important role in the subsequent pieces where you're trying to make concise,
11:15 clean statements. I thought your use of sort of list comprehensions and generator expressions,
11:23 along with the other functional pieces like zip and Counter and so on, were really,
11:30 really interesting. You write some very concise code without being unreadable. I thought that was
11:35 nice.
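For a flavor of the structures being discussed, here is a minimal sketch (the example data is mine, not from the book):

```python
from collections import defaultdict, Counter

words = ["spam", "eggs", "spam", "ham", "spam"]

# Counter tallies items in one line
counts = Counter(words)                # Counter({'spam': 3, 'eggs': 1, 'ham': 1})

# defaultdict removes the "does this key exist yet?" boilerplate
by_first_letter = defaultdict(list)
for word in words:
    by_first_letter[word[0]].append(word)

# zip pairs sequences; comprehensions keep transformations concise
pairs = dict(zip(["a", "b"], [1, 2]))  # {'a': 1, 'b': 2}
lengths = sorted(len(w) for w in set(words))
print(counts["spam"], by_first_letter["e"], pairs, lengths)
```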
11:35 That's one of the things I aim for. And as much time as you spend sort of revising the text of the book,
11:41 I also spent probably just as much time revising the code of the book. I went back over it many,
11:46 many times thinking, can I make this simpler? Can I make it cleaner? Can I make it easier to
11:50 understand? And I try and do that. I'm kind of a stickler for code elegance. It's kind of a
11:57 personal fault, maybe.
11:58 But also, you come out with very polished code. I found it very easy to follow. And I thought the
12:04 focus on this functional style programming was really neat. So one of the things you talk about
12:10 in the beginning is about visualizing data, because you have all these numbers. And obviously,
12:15 to understand them, putting a picture to it is very nice. And what is that? Matplotlib?
12:19 Yes. In the book, I only use matplotlib. I was just having a conversation with some of my data
12:25 science buddies about a week ago, where they were, as always, complaining that the data visualization
12:31 story in Python is still not very good.
12:33 Yeah. So you said that matplotlib was starting to show its age in your book. What did you mean by
12:38 that?
12:39 When I first started learning Python, like I said, more than 10 years ago, as a MATLAB replacement,
12:44 matplotlib was around back then, and it had the same features and the same interface, and it
12:51 produced the same not particularly attractive plots. So it hasn't kind of evolved as the rest of the
12:57 language has evolved, say. And there's been some projects like Seaborn that try and put some prettiness
13:03 on top of it. And there's been some other attempts to kind of bring the R-style ggplot into Python.
13:09 But from my perspective, none of them has really like won the mindshare and become the solution.
13:14 So everyone kind of does something different.
13:16 Yeah. You also talked about Bokeh, which is from the Continuum guys. I had them on show 34. Can you
13:23 maybe talk about that really quickly? Just what is it? What's it used for? That sort of thing.
13:28 It is another visualization library. And I have to be honest with you, I haven't really
13:32 checked it out since I was writing the book like more than a year ago. So I seem to remember that it
13:38 seemed to have some facility for building in interactive plots and things like that.
13:42 It's kind of D3, fancy graphics in the browser type stuff, right?
13:47 I believe so. But I haven't spent much time playing with it.
13:49 Yeah.
13:49 I haven't been doing a lot of, probably doing actually more D3 than Python visualization recently.
13:55 Right, right. Tell everyone what D3 is. I don't think we've talked too much about that on the show.
13:59 Oh, sure. So D3 is actually a JavaScript library for building interactive data visualizations.
14:06 It brings sort of an interesting model where you bind your data to DOM elements, which are quite often
14:12 SVG elements. And so what that makes it easy to do is to have a data set. And when you add new data to it,
14:21 your plot updates, and it allows a lot of really interesting interactivity. If you check out the
14:26 D3 website, they have a gallery of all sorts of amazing visualizations where you look at and think,
14:32 how on earth did they do that?
14:33 Yeah. Whenever I go to the D3 website, I'm like, wow, I want to use this. I don't have a use for it,
14:39 but it's fantastic. How can I just build stuff that looks like this?
14:42 Well, I mean, so I have a joke, which is that good data scientists copy from the D3 gallery and great
14:48 data scientists steal from the D3 gallery. And they give you the code for all of them. So I would say
14:53 probably close to 100% of the D3 visualizations I've ever built have been something I found in the D3
15:00 gallery and kind of tweaked until it fit my data.
15:02 Yeah. Yeah. Nice. So the next section you talked about was like the fundamentals of the math and
15:08 science that you need to know in order to be a data scientist. And I like the way that you put it.
15:14 You said these are like the cartoon versions of big mathematical ideas. How much math do you need to
15:19 know to call yourself a data scientist? Like if I studied just programming, I'm not a data scientist.
15:25 Or if I studied just math, I'm not a data scientist. Like what's the story there?
15:28 Data science is a funny thing in that there are as many different jobs called data science as there
15:34 are data scientists. So there are people who will call themselves data scientists and they write SQL
15:39 queries all day. There are people who call themselves data scientists and they do cutting edge machine
15:44 learning research all day. There are people who are data scientists who convince people to click on ads.
15:49 Like you could know almost no math and still get a job where you call yourself a data scientist.
15:53 You could know very little programming and get a job where you call yourself a data scientist.
15:58 It's a field that's still kind of figuring itself out. And there's just such a breadth of the different
16:03 roles that are all calling themselves data science. In an ideal world, you would know, you know,
16:08 linear algebra, you would know probability, you would know statistics. But among people who are data
16:14 scientists, some people know that stuff really well. Some people know that stuff not so well.
16:17 Yeah, I can imagine. It's really about you've got a lot of data or some specific type of data and you
16:23 want to answer questions about it or even discover the questions you could ask that nobody's asked, right?
16:29 SnapCI is a continuous delivery tool from ThoughtWorks that lets you reliably test and deploy your code
16:48 through multi-stage pipelines in the cloud without the hassle of managing hardware. Automate and visualize
16:55 your deployments with ease and make pushing to production an effortless item on your to-do list.
16:59 Snap also supports Docker and in-browser debugging, and they integrate with AWS and Heroku.
17:05 Thank SnapCI for sponsoring this episode by trying them with no obligation for 30 days by going to
17:12 snap.ci/talkpython. Sometimes it's actually the opposite, which is, I have a question I want to ask. Where can I find some
17:26 data that will allow me to answer that question? So, you know, sometimes you start with the data and
17:31 go to the questions. Sometimes you start with the questions and then you got to find the data.
17:34 Right. Okay. Very interesting. So that actually brought you to your next section that you talked
17:39 about in your book, which was getting data. And so what are the common ways that you talk about there?
17:45 Nowadays, there's a lot of people who just like post interesting data sets. So if you go to like
17:50 Kaggle, which is a site that does data science competitions, every one of their competitions has
17:55 a data set, which might be interesting in ways that are not related to the competition. The government websites
18:01 publish all sorts of data, some of which is potentially interesting. If you like weather,
18:05 they publish weather data. If you like economics, they publish economic data. There's a lot of
18:10 Python libraries for scraping websites. So even if a data set is not available, you can always go out and
18:15 try and scrape it and collect it yourself and clean it. My go-to source is always Twitter. I always build
18:21 things using Twitter data. So Twitter and all the other sites will have APIs where you can just make
18:27 restful calls or even use libraries. Python has some Twitter libraries that abstracts all that away. And you can,
18:33 you know, collect tweets on a given topic or collect tweets from certain users and
18:37 collect your own tweets and do analysis on those.
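The usual pattern with these APIs is that a call returns JSON and you filter it in a few lines. The field names below are invented for illustration, not Twitter's actual schema:

```python
import json

# A made-up API response shape; a real service (e.g. Twitter via
# Twython) defines its own fields.
raw = ('{"statuses": [{"user": "alice", "text": "I love #python"},'
       ' {"user": "bob", "text": "weather data is fun"}]}')

data = json.loads(raw)
python_tweets = [t["text"] for t in data["statuses"] if "#python" in t["text"]]
print(python_tweets)  # ['I love #python']
```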
18:41 Yeah. And you had a section about that in the book. What was the package you were using for that?
18:45 In Python, the package I usually use is called Twython. There's a bunch of them. That's the one that I
18:50 got to work the easiest, and I think highly of it.
18:53 I don't think I've really tried any of the other ones.
18:56 Yeah. Yeah. Okay, cool. Yeah. I think Twitter is fairly special among the social networks.
19:01 To me, Twitter is the social network of ideas more than it is of friends or family or whatever. So
19:08 yeah, pretty cool. I love to get data from Twitter as well.
19:11 Yeah. It's just that they won't give you the fire hose, but for most things, they'll give you
19:15 more tweets than you can handle anyway. So if you want to find out what people are saying on a certain
19:19 topic or just what people are talking about in general, it's an awesome source for that.
19:24 Yeah. I saw some study or heard about some study a few years ago where people were studying, they were
19:30 trying to do sentiment, mood analysis on Twitter, and then trying to tie that back to the stock market
19:38 and predict how people were feeling on Twitter to near-term changes like in the next hour in the
19:45 stock market, which I thought was a pretty interesting project.
19:48 So ideas you can come up with using that data are pretty much endless. Every month, I think I'm done
19:53 with it and then I think of something else to do.
19:54 Nice. So then you kind of get into the topics that I think of as traditional data science,
20:01 machine learning, neural networks, network analysis, that kind of stuff. And I thought maybe we could go
20:07 through each section and you could tell me what it is that you need to sort of fundamentally
20:13 understand what are the from scratch basic pieces of each part of this data science and then maybe
20:20 some examples of problems you might solve with it.
20:22 Sure.
20:22 Yeah. So the first one you talk about is machine learning. And when you said machine learning
20:27 and later neural networks from scratch, I'm like, wow, that's a pretty big thing to take on from
20:33 scratch. What's the story there?
20:34 Machine learning at a high level is just learning some kind of model from data rather than sitting
20:41 down and writing out the model yourself by hand. And so if you have a small amount of data, then
20:47 you know, coming up with an algorithm that's going to learn a model from it is actually a pretty
20:51 reasonable thing to do if you go with a simple model and don't add too many bells and whistles.
20:56 It's only when you want to, you know, start producing recommendations at Netflix scale
21:01 or when you want to start, you know, building something to recognize speech patterns from audio files
21:06 that your beautiful handcrafted Python is probably not going to be up to the task.
21:10 Yeah. So I suppose the types of things you solve with machine learning is fairly unbounded. There's
21:19 a lot of problems that machine learning answers. What are some of your favorite examples?
21:22 Everything out there is machine learning these days. But if you want to talk about like projects that
21:29 I've worked on for fun, one time I built a classifier to predict or to identify hacker news articles that I would be interested in or not
21:37 interested in. So it would take kind of the feed of new hacker news stories and come up with a score between zero and one about how interested, you know, it thought I would be based on some initial seed values that I leave.
21:49 How did you teach it?
22:06 I was for a few other kind of idiosyncratic features I threw in, like, you know, is it an ask hacker news? Is it a show hacker news? Does it have a dollar
22:15 amount in there? Because there are a lot of kind of bad stories that's like, here's how to make $5,000, I think. So that was a good negative signal.
22:23 It's sort of a spammy signal.
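Those idiosyncratic features boil down to simple predicates over a story title. A toy sketch (my simplification, not Joel's actual code):

```python
# Turn a Hacker News title into boolean signals a classifier
# could score; the feature names here are invented for illustration.
def extract_features(title):
    return {
        "is_ask_hn": title.startswith("Ask HN"),
        "is_show_hn": title.startswith("Show HN"),
        "has_dollar_amount": "$" in title,  # the negative signal mentioned above
    }

print(extract_features("Here's how I made $5,000 in a week"))
```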
22:24 Yeah. So, and then it turned out I wrote a blog post about it and someone posted the story to hacker news and then
22:31 the hacker news community got very angry that someone would think that not every story there was worth
22:36 reading. Because of course every story there is worth reading. So why would I want to filter some of them out?
22:39 They accused me of wanting to live in a bubble.
22:41 And yeah, they can be fairly critical on there. But that's funny. So the big question is, did your system
22:49 like your article, your blog post?
22:51 That's a good question. I don't know that I actually checked. I should go back and try and find that.
22:55 It'd be funny if it recommended it to you. Or not. Either way, it would be funny to know.
22:59 Well, you know, it's funny speaking of this. So, you know, I have an Android phone and so
23:03 Google now will sometimes recommend me articles of things I'll be interested in.
23:07 And occasionally it will recommend to me like my own blog posts, which I guess means doing a good job.
23:13 You know, one of the things I did with machine learning was I have a five year old daughter and I do her clothes shopping.
23:19 And I noticed that little boys clothes were very interesting and little girls clothes tend to be
23:23 kind of like really boring. The boys clothes have like dinosaurs and rockets and robots and
23:28 girls clothes have like hearts and flowers. And so I built a machine learning model to take an image of a
23:33 children's t-shirt and predict whether it was a boy shirt or a girl shirt.
23:38 Awesome. Yeah. And I agree with the classification there as well. I have three daughters and we still
23:44 bought a fair number of boy clothes for them.
23:46 Yeah, I did the same.
23:47 Nice. So the next topic that you covered was nearest neighbors. And I mean, conceptually,
23:53 I kind of know what nearest neighbors are pretty easily, but it's a computation hard problem,
23:58 especially in higher dimensions. And so what kind of stuff do you do? What kind of problems do you solve with
24:03 this?
24:03 One thing you can do is when you don't have like a great parametric model for what's going on.
24:08 So for instance, one place where I've seen this applied is when I have some kind of like
24:12 time series type signal, and I want to know if it represents some kind of bad anomaly. And it might be,
24:19 you know, like thousands of points in some weird shape. And I don't have a great way to classify it.
24:22 But one thing you can do is if you have a bunch of labeled signals, you know, some are good or some
24:28 are bad, you just say, I'm just going to take my set of reference signals and figure out which are
24:33 the ones it's closest to. And that way, I don't need to necessarily have a model of what's a good
24:38 signal or what's a bad signal. I can just say, I have some labeled data, that's enough. And so I
24:43 think that's kind of the situation where it can be useful.
24:46 Yeah, nice. When you're still trying to explore things, and maybe you don't, you don't have a model
24:51 you're trying to fit it to, you're just trying to understand it, right?
24:53 Yeah. Or think about where like, you have some weird kind of shapes, or weird kind of patterns
24:59 that you can't really put math behind. What's a good pattern? What's a bad pattern? But you have
25:04 some labels, then you could use nearest neighbors to basically classify without having to have a
25:10 mathematical model behind it.
24:53 Okay, yeah, very cool. So then you get to another topic, Bayesian analysis, which in the context I know it is around kind of determining spam and
25:21 filtering and stuff like that. But what's the story of Bayesian analysis?
25:26 So it's named after Bayes' rule, which is just a theorem in statistics, having to do with ways of
25:33 reversing conditional probabilities. So at a high level, if you have some knowledge about
25:38 what is the probability of seeing certain features in email, given that an email is spam, you can use that
25:45 data to produce estimates of what is the probability that an email is spam, given that you see those
25:50 features. And so typically, if you have a lot of, you know, spam and non-spam, you can make estimates
25:56 of, okay, given that an email is spam, it's likely that I'll see Viagra, and it's likely that I'll see
26:02 Gary Twick. And then basically just a classifier that turns that around and allows you to now look for
26:07 these features. And in a mathematically rigorous way, come up with the probability, okay, here's the
26:11 probability that this is spam. Okay. Yeah. Very interesting. What other types of problems?
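A bare-bones sketch of the naive Bayes idea just described (the messages and word counts are toy data I made up):

```python
# Estimate P(word | spam) and P(word | ham) from labeled examples,
# then flip them around with Bayes' rule to score a new message.
spam = ["cheap viagra now", "viagra deal cheap"]
ham = ["meeting at noon", "lunch deal at noon"]

vocab = set(" ".join(spam + ham).split())

def word_probs(messages, vocab):
    words = " ".join(messages).split()
    # +1 smoothing so unseen words don't zero out the product
    return {w: (words.count(w) + 1) / (len(words) + len(vocab)) for w in vocab}

p_w_spam = word_probs(spam, vocab)
p_w_ham = word_probs(ham, vocab)

def spam_score(message):
    p_spam = p_ham = 0.5  # prior: assume half of mail is spam
    for w in message.split():
        if w in vocab:  # the "naive" assumption: words are independent
            p_spam *= p_w_spam[w]
            p_ham *= p_w_ham[w]
    return p_spam / (p_spam + p_ham)  # Bayes' rule, normalized

print(spam_score("cheap viagra"))  # ≈ 0.91 — probably spam
```

Treating each word independently is the bag-of-words assumption Joel mentions a moment later; it is technically wrong but works surprisingly well for this kind of text classification.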
26:15 Spam is certainly easy to understand, but you know, hopefully that's solved by Gmail and other people
26:22 for us, right? I think it was Paul Graham who wrote a pretty influential article about this approach,
26:27 probably about 15 years ago now. And so I think a lot of the tools that, I don't know if it's what
26:33 Gmail uses, but that SpamAssassin and some of the other mail providers used are still
26:37 based on this principle of naive base. But basically the, anything where you have kind of big clumps of
26:44 say text, and you want to classify it into one of two or more classes, and you're
26:52 willing to make this really big kind of technical assumption that the words in it are kind of independent
26:58 of each other. So you're treating them kind of like a bag of words rather than as a sequence of words,
27:02 then that's where you would use this kind of model. Okay. Yeah, cool. So maybe the opposite of that,
27:08 where you treat words sort of as, as having more meaning is in natural language processing, which is
27:14 another thing you talk about, right? Natural language processing is, is a huge field. Like people,
27:18 there are textbooks about it. So in fact, there are Python books about it. So cram it into a chapter of a
27:25 book is really just kind of, here's a really high level overview. And here's, you know, a couple of
27:31 examples to give you a flavor. Natural language processing is actually relevant to the sorts of
27:37 problems that I'm getting into in artificial intelligence. It gets used a lot by all the voice recognition in
27:43 your phone, and a lot of these new systems for basically reading text and having a computer extract information from it. So it's a
27:52 pretty important, pretty, pretty rich field. Yeah. And there's some pretty decent Python libraries for
27:57 doing that, right? Yeah. So there's a, so the big one is called NLTK, the natural language toolkit,
28:04 I think it stands for. And so it has its own book just about natural language processing in that library.
28:10 And I think that's actually free on the web. But yeah, that would be a great place to start if you
28:15 want to come to understand it more deeply. There's also a nice Coursera course on natural language processing
28:20 that I took several years back that I liked. Oh, okay. Yeah, excellent. So let's see. Another topic you
28:25 covered was decision trees. And what's the story of these? A decision tree is kind of what it sounds
28:31 like. Yeah. You know, intuitively, you might have a model of making decisions based on a tree. So I have,
28:39 I want to know whether to buy a certain car or not. So, you know, I might start asking a question. Okay,
28:44 is it an American car or a Japanese car? American car. Okay. Go to the next question. Does it have
28:49 four doors or two doors? It has two doors. Okay. Is it gas or diesel? Diesel. And then you just have a
28:55 sequence of questions to ask and you classify it based on those questions. Okay. It's diesel. That
29:00 means, yes, I want to buy it. And so conceptually, you know, a tree like that is not that complicated.
29:06 Interesting part is given some set of data, how do I build such a tree? And there, there's a variety of
29:13 algorithms, but the one that we talk about in the book, it is a pretty simple one, which is basically
29:18 around, okay, I have a bunch of data. It has a bunch of attributes. I can split my data on each
29:24 attribute. So I could say, okay, at this stage, I want to look at two door versus four door. I want to
29:28 look at Japanese versus American, or I want to look at gas versus diesel and which of those choices is
29:35 going to allow me to kind of really separate the good buys from the bad buys the most. So if gas versus
29:43 diesel totally splits where diesel is the buys and gas is the don't buys, that's a great thing to choose.
29:50 If gas versus diesel splits where, you know, on gas, I want to buy half and not buy half, and on diesel,
29:56 I want to buy half and not buy half, that's not a good split to choose, because it doesn't really
29:59 help me at all. And so the mathematics here are just around kind of making this precise with this notion of
30:05 entropy and building these trees from the data. Okay. Yeah, that's very cool. The way where you
30:11 build them knowing the data intimately, you know, that makes a lot of sense. And you don't really
30:17 necessarily need data science to do that, right? That's just, you know, asking a few questions
30:23 and making a decision. But the reverse, I think, is pretty interesting.
30:27 They build some of these expert systems in this kind of way. I think they apply them sometimes in like,
30:31 medical diagnosis and sometimes they do better than doctors.
30:35 Wow. Okay. Sometimes just let the data talk, huh? Rather than intuition.
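The entropy-based split selection described above can be sketched from scratch. The car rows and attribute names below are hypothetical, echoing the gas-versus-diesel example in the conversation:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def split_entropy(rows, attribute, target="buy"):
    """Weighted average entropy of the target after splitting rows on
    attribute; lower means the split separates the classes better."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    total = len(rows)
    return sum(len(g) / total * entropy(g) for g in groups.values())

# Hypothetical car data echoing the example in the conversation.
cars = [
    {"origin": "american", "doors": 2, "fuel": "diesel", "buy": True},
    {"origin": "american", "doors": 2, "fuel": "gas",    "buy": False},
    {"origin": "japanese", "doors": 4, "fuel": "diesel", "buy": True},
    {"origin": "japanese", "doors": 4, "fuel": "gas",    "buy": False},
]

# Fuel perfectly separates buys from don't-buys here, so it has the
# lowest split entropy and is the attribute to split on first.
best = min(["origin", "doors", "fuel"], key=lambda a: split_entropy(cars, a))
```

Building the whole tree is then just applying this choice recursively to each resulting subset.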
30:40 This episode is brought to you by Hired. Hired is a two-sided, curated marketplace that connects the
30:56 world's knowledge workers to the best opportunities. Each offer you receive has salary and equity presented
31:01 right up front, and you can view the offers to accept or reject them before you even talk to the company.
31:07 Typically candidates receive five or more offers within the first week, and there are no obligations
31:11 ever. Sounds awesome, doesn't it? Well, did I mention the signing bonus? Everyone who accepts a job
31:16 from Hired gets a $1,000 signing bonus. And for Talk Python listeners, it gets way sweeter.
31:22 Use the link hired.com/talkpythontome and Hired will double the signing bonus to $2,000.
31:29 Opportunity is knocking. Visit hired.com/talkpythontome and answer the call.
31:32 Somewhat related to that maybe is, you talked about neural networks as well.
31:42 Yeah. So neural networks are a super hot topic right now, because I'm sure you've probably heard
31:47 all the buzz about deep learning, right? And deep learning tends to have neural networks
31:54 at the root of it. So neural networks are basically a way of building up kind of layers and layers of
32:01 representations. As for deep learning, this book doesn't really go into it, because
32:06 a single neural network is hard enough to do by hand or from scratch. But basically it's a way to
32:12 kind of build a classifier that works similar to how a toy model of a brain might work. So you have
32:20 artificial neurons, and each neuron has a bunch of inputs that go into it with weights. If the weighted
32:26 sum of the inputs exceeds some certain threshold, the neuron fires. And if it doesn't exceed that
32:30 threshold, the neuron doesn't fire. So you present an input, which could be like a, an image. So basically
32:36 a bunch of zeros and ones, and that causes some neurons to fire. And that propagates through this
32:42 network. And in the end it will spit out, you know, I think this is an image of a cat,
32:46 or I think this is an image of a dog. And you train it by showing it a lot of labeled images
32:52 and adjusting the weights based on how it got them wrong.
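The toy neuron and weight-adjustment idea described here can be sketched as the simple perceptron rule, not full backpropagation; the AND-gate training data and all the names below are illustrative:

```python
def neuron_output(weights, inputs, threshold):
    """A toy neuron: fire (1) if the weighted input sum exceeds the
    threshold, otherwise don't fire (0)."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > threshold else 0

def perceptron_train(examples, lr=0.1, epochs=50):
    """Nudge the weights based on how the neuron got each labeled
    example wrong (the classic perceptron learning rule)."""
    n = len(examples[0][0])
    weights, threshold = [0.0] * n, 0.0
    for _ in range(epochs):
        for inputs, label in examples:
            error = label - neuron_output(weights, inputs, threshold)
            # shift each weight toward reducing the error on this example
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            threshold -= lr * error
    return weights, threshold
```

For something linearly separable like AND, this converges quickly; a real network stacks many such units in layers and trains them with backpropagation instead.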
32:55 Yeah. Interesting. So you basically just say, these are the inputs and the decisions: you feed it
33:02 known data. You say, you know, like, here's a cat, here's a cat, that's a dog. It's just a cat or a dog,
33:07 right? And then you ask it that question. I was gonna say, well, one of the problems that I sort
33:11 of remember from neural networks is that they, as you try to get them more accurate,
33:15 they get over-trained to just only do little bits of stuff. And so how do things like decision
33:21 forests or these things and this deep learning, how do they deal with that?
33:26 So random forests are basically taking multiple decision trees and combining them. So there's this
33:33 kind of general principle where a lot of times, rather than building one model that's, like, super predictive,
33:40 I take a lot of much less predictive models and just kind of let them vote on things or average
33:47 the results or whatever. So you're right that if it's really a big decision tree, and I included, like, all
33:52 hundred features in my data set, there's a good chance I'm going to, like, overfit and
33:57 overlearn my training data and not generalize outside of that. So one thing that happens with decision
34:04 trees is people often don't use the bare decision trees. They use the random forest where they'll build
34:10 a bunch of smaller decision trees, each of which is really restricted to a small subset of features
34:16 so that each one is individually less powerful. But then when you combine them, they do well in the aggregate
34:22 and they don't have necessarily the same overfitting problem that single decision tree with lots of features would.
34:28 So in neural networks, especially in deep learning, there are not exactly the same techniques, but other
34:34 techniques where you will zero out some of your weights sometimes and train without them to make
34:40 sure that they don't learn too much. And there's a lot of other techniques that get used in order to
34:44 make sure you're not overfitting. That's something that, you know, data scientists and machine learning
34:47 people worry about a lot.
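The vote-combining idea behind random forests boils down to a few lines. The stub "trees" in the usage below are hypothetical hand-written rules standing in for trees each learned on a random subset of rows and features:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine many weak classifiers by letting them vote."""
    return Counter(predictions).most_common(1)[0][0]

def forest_predict(trees, row):
    """Each 'tree' here is just any function row -> label; a real random
    forest would train each tree on a random subset of the data."""
    return majority_vote(tree(row) for tree in trees)
```

With three simple rules voting, two out of three carry the decision, which is exactly how the ensemble smooths out any single tree's overfitting:

```python
trees = [lambda row: row["fuel"] == "diesel",
         lambda row: row["doors"] == 2,
         lambda row: row["origin"] == "american"]
```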
34:48 It's a really hot field, like you said right now. That's awesome. So one more section I'd like to touch
34:54 on is you talked about this thing called recommender systems.
34:58 So, I mean, they're just what they sound like. Anyone who's used the internet has a pretty good experience
35:02 with them, where you go to Netflix and it says "Movies for Joel" and it's trying to predict what it thinks I'll like.
35:09 Or you go to Amazon and it will recommend items for me. And you can go a little further where you have
35:16 all these startups like Stitch Fix is a very hot data science startup where you tell what kind of clothes
35:22 you like and they'll send you a box of clothes every month that they think you'll like a lot.
35:28 And so generating these kinds of recommendations is a pretty popular task within data science. A lot of data scientists
35:36 work on these kinds of problems, because a lot of data scientists work at companies whose job is to sell you stuff,
35:41 and they always want to sell you stuff that they think you'll like.
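A toy version of such a recommender can be built on simple set overlap between users' liked items. The users, items, and choice of similarity below are all illustrative; real systems use far richer data and models:

```python
def jaccard(a, b):
    """Overlap between two users' sets of liked items, from 0 to 1."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def recommend(user_likes, other_users_likes, top_n=3):
    """Score items liked by similar users that this user hasn't seen yet:
    a toy user-based collaborative filter."""
    scores = {}
    for other in other_users_likes:
        similarity = jaccard(user_likes, other)
        for item in other:
            if item not in user_likes:
                scores[item] = scores.get(item, 0) + similarity
    # highest-scoring unseen items first
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Items liked by the most similar users accumulate the highest scores and float to the top of the recommendations.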
35:44 You convert better if you figure out what people actually might want, right?
35:47 Yeah, I mean, well, if Amazon sends me an email, "Hey, Joel, you know, these five things are on sale
35:54 and they're five things that I really want to buy," then that's much better for them and potentially
35:58 for me than if they send me a random email that's like, "These are the five most popular things on
36:03 Amazon today." Yeah, it definitely is better. I really like going to my Kindle and looking at
36:08 the recommended things based on what I've been reading. But the Netflix recommendations, I know they
36:14 do really great work, but it just doesn't work for me because my seven-year-old daughter watches
36:19 Strawberry Shortcake and other random things. I get a lot of kid shows recommended to me.
36:25 They have a "Who's watching" button that you can click and say "A kid is watching."
36:30 I know, but my daughter won't use it. She'll just randomly pick one.
36:33 Yeah, so you know, it's funny. I use that button pretty well, but my Netflix recommendations are also
36:40 not that good. But I think it's just because I don't like anything on there. So no matter what they're
36:45 telling me, I'm not going to like it. Yeah, I hear you. I feel like you covered a pretty
36:49 wide swath of data science from scratch. Some of the topics were really accessible. Some of them
36:57 required more math and more thinking, but it was all a really nice presentation. What do you feel like
37:04 you left out? There aren't really any topics that I wish I'd covered. If anything, I kept adding stuff while I was writing it.
37:11 So I don't feel like there were any huge topics that I necessarily left out. But in terms of coverage,
37:17 probably my biggest regret is that I used Python 2 instead of Python 3.
37:20 Yeah. Okay. And I saw on your GitHub repo, you had something about Python 3 in there. Is that right?
37:27 Yes. So pretty much as soon as the book came out, one, I got a lot of emails from people saying,
37:32 "Hey, why do you not use Python 3?" And then I also got a number of emails saying,
37:37 "I would like to use Python 3. Will the code work?" And so I wrote them back and I said,
37:41 "Yeah, you know, I don't see why the code shouldn't work. Give it a try. I bet it works
37:46 with probably a few changes: add some parentheses to print statements and so on." And then eventually one
37:51 guy wrote me back and he said, "It doesn't work." I said, "Okay." So I sat down and I said, "I'm going to
37:57 convert the code to Python 3." And it took me about, I'd say four to five hours. And that's with me
38:04 knowing the code intimately. So it would have taken someone who hadn't written the code in the first
38:09 place a lot longer than that. And then I felt kind of guilty that I'd been telling all these people that
38:12 it was so easy to do when it wasn't.
38:14 Just spend a week. It'll be fine.
38:16 Yeah. But so yeah, I have the Python 3 versions of the code up on the GitHub. I sort of regret not
38:23 having just done it that way in the first place.
38:25 Yeah, sure. So at the end, you talked about data science, not from scratch, and you pointed out
38:31 a lot of the libraries you might actually use, like NumPy and so on. Do you want to talk about that
38:37 really quick? Like what's the real data science versus the from scratch data science comparison?
38:43 Yeah. So I would say that NumPy is pretty fundamental. That's basically the linear algebra
38:51 library for Python. So it provides you matrices, matrix algebra, high-performance arrays, things
38:58 like that that you don't just get built in. And you might not use it directly, but a lot of the other
39:05 libraries are really built on top of it. So kind of the most broadly accessible machine learning
39:11 library for Python is called scikit-learn. And it has really nice documentation and really nice tutorials
39:16 and a fairly standard API for building machine learning models. Any time you want to build a
39:21 regression model or a random forest model or any kind of classifier, that's probably the place to go.
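That standard API looks roughly like the following sketch. The toy feature rows and labels are made up, and the specific model settings are arbitrary choices for the example:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy feature rows and labels following an AND-like rule (made up).
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 2
y = [0, 0, 0, 1] * 2

# The standard scikit-learn pattern: construct, fit, predict.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X, y)
predictions = model.predict([[1, 1], [0, 0]])
```

Swapping in a different estimator (a regression, an SVM, and so on) keeps the same construct/fit/predict shape, which is a big part of why the library is so approachable.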
39:27 There's also Pandas, which is the data frame library, which is good if you're working with tabular data. So
39:34 not necessarily the machine learning side of data science, but more of the, I have kind of a
39:39 spreadsheet data set and now I want to clean it and aggregate it and pivot it and look for kind of data
39:48 analysis type insights.
39:49 Yeah. If you're exploring, like you said, tabular data and you kind of load it up and clean it,
39:56 Pandas seems really fantastic for that.
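A small sketch of that load/clean/aggregate workflow; the city and sales figures below are invented for illustration:

```python
import pandas as pd

# A hypothetical spreadsheet-style dataset with one missing value.
df = pd.DataFrame({
    "city":  ["Seattle", "Seattle", "Portland", "Portland"],
    "sales": [100.0, None, 80.0, 120.0],
})

df["sales"] = df["sales"].fillna(0)          # clean: fill the missing value
totals = df.groupby("city")["sales"].sum()   # aggregate: total sales per city
```

Pivoting, joining, and the other spreadsheet-style operations follow the same pattern of chaining DataFrame methods.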
39:57 It's a really nice library. And then kind of the new kid on the block is TensorFlow, which is
40:03 Google's deep learning library. It was only released, you know, a few months ago.
40:09 It's not 1.0 yet, but it seems like people are sort of converging around it as how they're
40:15 going to do deep learning in Python. Now there are other sort of previous libraries that some people
40:20 used and still use, but TensorFlow seems to be gaining a lot of mindshare.
40:24 Okay. Yeah, that's really cool. And I've definitely seen TensorFlow talked about a lot in this context,
40:30 but it's just a library. You can download that and run it locally. It's not like a cloud type thing,
40:35 right?
40:35 Yes. So currently it's a local library, and I'm not sure if they've come out with a version that you can
40:42 kind of run in your own cloud, like on various AWS instances. But I know they're definitely going
40:47 in that direction, because a lot of this stuff, these deep learning models, take a really long time to
40:52 train. And so if you want to use them for anything serious, you want to distribute them and
40:56 throw them on high-powered machinery and not just run them on your laptop.
40:59 Yeah, of course. If you have tons of data, maybe it's better to get a bunch of machines
41:03 for an hour. So what do you think about these cloud learning or data science platforms? I'm thinking like
41:09 Azure machine learning, or, you know, I just had SigOpt on the show a few shows ago. I'm not really sure
41:15 what else is out there in terms of like go out to the cloud and grab some data science stuff.
41:20 What do you think about that?
41:21 I haven't spent much time looking at any of those. I think they can add value in terms of either one,
41:31 if you don't have the data science or machine learning expertise in house to do whatever you need to do.
41:37 Or two, if you have some kind of model that you've built, but you need help either putting it into
41:43 production or operationalizing it somehow, then they can fill a pretty good role there. But my sense is
41:50 that most people doing data science are inclined more toward running the libraries and running the models
41:56 themselves. But that could be just my biased sample of the people who I talk to.
42:00 Yeah, of course, of course. Cool. So another thing that you said you're into is taking some of the
42:07 ideas from Haskell and thinking about how those might manifest in Python. What did you get up to there?
42:12 Haskell, if you're not familiar with it, is kind of the purest of the pure functional
42:19 languages, with strong types and lazy evaluation and things like that. And so once I spent some
42:28 time in that world, I spent a lot of time thinking about how can I bring some of these concepts back
42:33 into Python. And in Python 3, lazy evaluation plays a much bigger role in the sense that like range
42:42 is a generator instead of a list and all the map and filter and things like that also are generators
42:48 instead of lists. But I started getting into the itertools library, which starts giving you tools for
42:55 generating basically infinite sequences and just trying to see how far I could go using
43:01 pure functions and infinite sequences and avoiding mutable variables and other things that you try not
43:07 to do when you're working in a Haskell-like language.
43:10 Yeah. And how did you feel like it came out in the end? Do you feel like you were able to bring a lot
43:14 of those ideas over?
43:15 I was, and I ended up producing code that was really neat and really impenetrable. The lazy
43:23 infinite sequences stuff, that was more almost academic in terms of like, yes, I managed to do it. Like,
43:29 this is mathematically interesting and it works well, but it's not readable at all.
43:34 So imagine you wanted to represent, let's say a binary tree in Python. Kind of the two,
43:41 I would say obvious approaches would be one to make some kind of like class where it has a,
43:47 you know, a value element and a left element and a right element. And then you also might just use
43:53 a dictionary to represent it where it had those keys. In a language with algebraic data types,
43:57 like Haskell, you would just basically represent that kind of tree as a product type where it just
44:04 has, like, three elements. And so I said, you know, what if I just represented a tree as a tuple with
44:09 three elements, where the first element is the left subtree, the second element is the value,
44:13 and the third element is the right subtree. And similarly, if you want to do, like, linked lists in Python,
44:17 which you probably don't want to do, but if you did want to do linked lists in Python, you just treat
44:21 them as a tuple: the first element is the head element and the second element is the tail linked list. And so,
44:28 I actually found that I was able to write some pretty nice code using those kinds of ideas.
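The tuple-as-tree idea can be sketched like this; the function names are illustrative, and this version happens to keep the tree ordered so an in-order walk comes out sorted:

```python
# A binary search tree as plain tuples: (left, value, right), with None
# for the empty tree. No classes, no dictionaries, and no mutation.
def insert(tree, value):
    """Return a new tree containing value; the old tree is never modified."""
    if tree is None:
        return (None, value, None)
    left, node, right = tree
    if value < node:
        return (insert(left, value), node, right)
    return (left, node, insert(right, value))

def to_list(tree):
    """In-order traversal, which yields the stored values in sorted order."""
    if tree is None:
        return []
    left, node, right = tree
    return to_list(left) + [node] + to_list(right)
```

Because `insert` builds a new tuple instead of mutating anything, old versions of the tree remain valid, which is exactly the persistent-data-structure flavor of the Haskell approach.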
44:32 And I did some coding interviews that way. I'm not sure the interviewers appreciated it.
44:38 What is this guy talking about? Yeah, cool. So one question I wanted to ask, given your work at the AI institute and
44:47 your background in data science and so on: over the last year or so, there's been a lot of
44:54 news items and people coming out saying that artificial intelligence is a danger to humanity.
45:01 What's your thought? Is like AI something we should be super excited about? Or is it something we should
45:07 be maybe cautious about? I would say I probably fall in the middle. I mean, I'm excited about it because
45:12 this is my job, but I don't go around encouraging everyone else that they have to be excited about it
45:18 because I don't know that that's necessarily warranted. But by the same token, I don't spend a lot of time
45:23 worrying about how dangerous it is. I think we're pretty far off from the time when we have to worry
45:29 about that. And I do have some friends who think we should worry about it now before it's too late. But
45:35 I think there's a lot more important things to worry about in the world when I read the news.
45:39 Yeah, I kind of agree with you on that. And I think there's two, certainly two ways to look at that. On
45:47 one hand, if you think about things like self-driving cars, let's just take that as an example. Like,
45:53 I believe one of the biggest job categories for men in the United States is some form of driving,
46:02 like driving a truck, driving a taxi, those types of things, delivery vehicles and so on. And if self-driving
46:08 cars were to like remove all that, like that would have large social effects. But I think, you know,
46:13 that's not so much the way that people, at least recently in the news, were talking about it. It's
46:19 more like Terminator style, right? And so that I, I'm not too worried about this personally. Who knows?
46:26 Yeah. I mean, I personally would pay a lot of money for a self-driving car because I,
46:30 I don't like driving that much. And I'd much rather be able to read while I'm going somewhere.
46:35 Yeah. Driving is fun until you get stuck on I-5 for half an hour inching along. Then you know,
46:40 I don't like driving anymore. Exactly.
46:42 Awesome. Well, we're kind of coming up near the end of the show. Let me ask you just a
46:46 few closing questions. I always ask on my guests, if you're going to write some Python code,
46:50 what editor do you open up? So these days I tend to use Atom for pretty much everything,
46:55 memory leaks and all. Nice. Yeah. That's from GitHub, right?
46:58 Yeah. It's pretty similar to Sublime, but it doesn't nag you for money. So yeah, it's
47:03 pretty nice. And I think it's atom.io. They have a really cool little video about how
47:08 it's the editor of the future. It's nice.
47:10 Yeah. I don't know if I go that far, but it's the editor of the present at least.
47:14 Yeah. It's like a George Jetson sort of like a promo video. It's, it's pretty funny.
47:18 It is pretty nice. Isn't it? Does it have good Python support?
47:21 You know, I'm not someone who leans on, like, IDE functionality a lot. So if good Python
47:28 support counts as syntax highlighting, yes, but that's all I tend to use it for. So yeah.
47:32 Yeah. Yeah. Okay. Cool. And if you look at on the Python package index, there's,
47:38 you know, 75 plus thousand packages and we all have experience with, you know, different parts
47:46 of it and there's things that we love and would recommend, like, what is your favorite one you
47:49 might recommend that people maybe don't know about?
47:51 So the one I recommend that people don't necessarily know about is called Beautiful Soup. It's
47:57 basically an HTML parsing library. And so if you start scraping data from webpages, you're going to get
48:04 a big mess of ugly HTML that's probably not even well-formed. Most of the time,
48:10 most people don't bother to well-form their HTML.
48:12 Are you telling me that I can't just like load that up as an XML document or something like this?
48:17 No, I'm just kidding. Of course it's, it's terrible trying to work directly on the web,
48:21 right? And Beautiful Soup is really, I really like it as well.
48:25 It's really nice. I mean, you have to spend a little bit of time getting used to
48:28 its API and interface and everything, but it's super handy for getting data out of webpages and
48:34 doing anything where you have to do a bunch of scraping.
48:37 And you cover that in your book, right? In the getting-data chapter?
48:42 Yeah, I cover it a little bit. I use Beautiful Soup a little bit and show people how to use it. It's a nice addition to the data scientist's toolkit.
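A minimal Beautiful Soup sketch along those lines; the HTML snippet below is invented and deliberately not well-formed (note the unclosed second anchor tag):

```python
from bs4 import BeautifulSoup

# A small, sloppy chunk of HTML of the kind scraping tends to produce.
html = ("<p>Links: <a href='http://example.com'>one</a> "
        "<a href='http://example.org'>two")

soup = BeautifulSoup(html, "html.parser")
# Pull every link's href, even though the markup never closed properly.
urls = [a.get("href") for a in soup.find_all("a")]
```

The parser tolerates the broken markup and still recovers both links, which is exactly what makes it so handy for scraping real pages.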
48:46 Nice. So Joel, how do people find your book? Amazon? Just a Google search for data science
48:52 from scratch? Yeah, it's on Amazon and you can buy it from O'Reilly.com. But yeah,
48:57 if you Google data science from scratch, you'll find it.
48:59 Cool. And I'll put a link to the GitHub repo where you have all the code examples and so on as well.
49:04 It's been really fun to talk about data science. And I think you have a really interesting way of teaching
49:10 people to appreciate the tools that we're all fairly familiar with by showing you how to build it from
49:16 scratch. So thanks for that.
49:19 My pleasure.
49:21 This has been another episode of Talk Python to Me. Today's guest was Joel Grus and this episode
49:25 has been sponsored by SnapCI and Hired. Thank you guys for supporting the show. SnapCI is modern,
49:30 continuous integration and delivery. Build, test and deploy your code directly from GitHub,
49:34 all in your browser with debugging, Docker and parallelism included. Try them for free at
49:39 snap.ci/talkpython. Hired wants to help you find your next big thing. Visit hired.com/talkpythontome to get five or
49:45 more offers with salary and equity presented right up front and a special listener signing bonus of $2,000.
49:50 Are you or a colleague trying to learn Python? Have you tried books or videos that left you bored by just
49:56 covering the topics point by point? Check out my new online course, Python Jumpstart by building 10 apps at
50:01 talkpython.fm/course to experience a more engaging way to learn Python. You can find links
50:07 from this show at talkpython.fm/episodes/show/56. Be sure to subscribe to the show. Open
50:14 your favorite podcatcher and search for Python. We should be right at the top. You can also find the
50:18 iTunes feed at /itunes, the Google Play feed at /play and the direct RSS feed at /rss on
50:25 talkpython.fm. Our theme music is Developers, Developers, Developers by Corey Smith, who goes by Smix.
50:31 You can hear his entire song at talkpython.fm/music. This is your host, Michael Kennedy.
50:36 Thank you so much for listening. Smix, take us out of here.
50:39 Stating with my voice. There's no norm that I can feel within. Haven't been sleeping. I've been using lots of rest. I'll pass the mic back to who rocked it best.
50:49 I'm first developers.
50:51 I'm first developers.
50:58 Developers, developers, developers, developers.
51:01 .
51:01 Thank you.
51:01 Thank you.