#31: Machine Learning with Python and scikit-learn Transcript
00:00 Machine learning allows computers to find hidden insights without being explicitly programmed where to look or what to look for.
00:06 Thanks to the work of some dedicated developers, Python has one of the best machine learning platforms out there called Scikit-Learn.
00:14 In this episode, Alexander Gramfort is here to tell us about Scikit-Learn and machine learning.
00:19 This is Talk Python to Me, number 31, recorded Friday, September 25, 2015.
00:25 I'm a developer in many senses of the word, because I make these applications, but I also use these verbs to make this music.
00:37 I construct it line by line, just like when I'm coding another software design.
00:41 In both cases, it's about design patterns. Anyone can get the job done, it's the execution that matters.
00:47 I have many interests, sometimes conflict, but creativity can usually be a benefit.
00:53 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.
01:00 This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy.
01:04 Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython.
01:11 This episode is brought to you by Hired and Codeship.
01:15 Thank them for supporting the show on Twitter via @Hired_HQ and @codeship.
01:22 Hey, everyone. Thanks for listening today.
01:24 Let me introduce Alexander so we can get right to the interview.
01:27 Alexandre Gramfort is currently an assistant professor at Télécom ParisTech and scientific consultant for the CEA Neurospin Brain Imaging Center.
01:38 His work is on statistical machine learning, signal and image processing optimization, scientific computing, and software engineering with a primary focus in brain functional imaging.
01:50 Before joining Télécom ParisTech, he worked at the Martinos Center for Biomedical Imaging at Harvard in Boston.
01:56 He's also an active member of the Center for Data Science at Université Paris-Saclay.
02:02 Alexander, welcome to the show.
02:04 Thank you. Hi.
02:06 Hi. I'm really excited to talk about machine learning and Scikit-learn with you today.
02:11 It's something I know almost nothing about, so it's going to be a great chance for me to learn along with everyone else who's listening in.
02:17 So hopefully I'll be able to give relevant answers.
02:21 Yeah, I'm sure that you will.
02:23 All right, so we're going to talk all about machine learning, but before we get there, let's hear your story.
02:27 How did you get into programming in Python?
02:29 Well, I've done a lot of scientific computing and scientific programming over the last maybe 10 to 15 years.
02:35 I started my undergrad in computer science, doing a lot of signal and image processing.
02:40 Well, like most people in that field, I've done a lot of MATLAB in my previous life.
02:46 Yes, I've done a lot of MATLAB too. I know about the .m files.
02:49 And I switched teams for my postdoc.
02:56 Basically, I did a PhD in computer science applied to brain imaging.
02:59 And I switched to a different team where basically I was surrounded by people working with Python.
03:05 And basically, I got into it and switched.
03:08 In one week, MATLAB was gone from my life.
03:14 But it's been maybe five years now.
03:16 And yeah, that's kind of the historical part.
03:20 Do you miss MATLAB?
03:22 Not really.
03:23 Me either.
03:25 There are some cool things about it, but...
03:29 Yeah, I still have students that insist on working with me in MATLAB.
03:34 So I have to still do stuff in MATLAB for supervision.
03:38 But not really when I have the choice.
03:43 Yeah, if you get a choice, of course.
03:44 I think one of the things that's really a drawback about specialized systems like MATLAB is it's very hard to build production finished products.
03:53 You can do research.
03:55 You can learn.
03:56 You can write papers.
03:57 You can even test algorithms.
03:59 But if you want to get something that's running on data centers on its own, probably MATLAB is, you know, you could make it work, but it's not generally the right choice.
04:06 Definitely.
04:07 Yeah.
04:08 Yeah.
04:09 And so things like, you know, I think that explains a lot of the growth of Python in this whole data science, scientific computing world, along with great toolkits like scikit-learn, right?
04:21 Yes.
04:22 I mean, definitely the way scikit-learn is now used.
04:27 The fact that the Python stack allows you to make this production type of code is a clear win for everyone.
04:36 So before we get into the details of scikit-learn and how you work with it and all the features it has, let's just, you know, in a really broad way, talk about machine learning.
04:46 Like, what is machine learning?
04:47 I would say the simple example of machine learning is trying to predict something from previous data.
04:54 So what people would call supervised learning.
04:58 And there are plenty of examples of this in everyday life, like your mailbox that predicts for you whether an email is spam or ham.
05:07 And that's basically a system that learns from previous data how to make an informed choice and give you a prediction.
05:17 And that's basically the most simple way of seeing machine learning.
05:21 And basically you see machine learning problems framed this way in all contexts, from industry to academic science.
05:30 And, I mean, there are many examples.
05:33 And basically, the other class of problems that you see in machine learning is not really these prediction problems.
05:43 It's trying to make sense of raw data where you don't have labels like spam or ham, but you just have data and you want to figure out what's the structure, what types of input or insight you can get from it.
05:57 And that's, I would say, the other big class of problem that machine learning addresses.
06:02 Yeah, so there's that general classification.
06:06 I guess with the first category you were talking about, like spam filters and other things that maybe fall into that realm would be like credit card fraud, maybe trading stocks, these kind of binary, do it, don't do it, based on examples.
06:21 Is that something that's called structured learning, or what's the term?
06:26 The common name is supervised learning.
06:30 Supervised learning, that's right.
06:31 Yeah, so basically you have pairs of training observations that are the data and their corresponding labels.
06:38 So text and the label would be spam or ham.
06:41 Or you can also see, this is basically binary classification.
06:45 The other types of machine learning problems you have is, for example, regression.
06:49 You want to predict the price of a house and you know the number of square feet.
06:54 You know the number of rooms.
06:57 You know what's exactly the location.
06:59 And so you have a bunch of variables that describe your house or apartment.
07:03 And from this you want to predict the price.
07:05 And that's another example where now it seems the price is a continuous variable.
07:10 It's not binary.
07:11 This is what people call regression.
07:13 And this is another big class of supervised learning problem.
07:17 Right.
07:17 So you might know through the real estate data, all the houses in the neighborhood that have sold in the last two years, the ones that have sold last month, all their variables and dimensions, if you will, like number of bathrooms, number of bedrooms, square feet, or square meters.
07:34 You could feed it into the system to train it.
07:40 And then you could say, well, now I have a house with two bathrooms and three bedrooms.
07:44 And right here, what's it worth?
07:46 Right?
07:46 Exactly.
07:47 That's basically a typical example and also a typical data set that we use in Scikit-learn that basically illustrates the concept of regression with a similar problem.
07:55 Right.
07:56 We'll talk more about it later, but Scikit-learn comes with some pre-built data sets.
08:01 And one of them is the Boston house market, right?
08:03 Exactly.
08:04 That's the one.
08:05 Yeah.
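A minimal sketch of what that regression workflow looks like with the bundled Boston data (the linear model and the train/test split are illustrative choices, not from the episode; note the dataset has been removed from recent scikit-learn releases):

```python
from sklearn.datasets import load_boston              # bundled dataset (removed in recent scikit-learn)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split  # lived in sklearn.cross_validation in old versions

boston = load_boston()
X, y = boston.data, boston.target   # house features and their sale prices

# Hold out a quarter of the houses to check how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)           # learn from known houses and prices
print(model.predict(X_test[:5]))      # predicted prices for unseen houses
print(model.score(X_test, y_test))    # R^2 score on the held-out data
```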
08:05 How much data do you have to give it?
08:08 Like, suppose I want to try to estimate the value of my house, which, you know, at least in the United States, we have this service called Zillow.
08:14 So they're doing way more.
08:16 I'm sure they're running something like this, actually.
08:19 But suppose I wanted to take it upon myself to, like, grab the real estate data and try to estimate the value of my home.
08:25 How many houses would I have to give it before it would start to be reasonable?
08:30 Well, that's a tough question.
08:32 And I guess there's no simple answer.
08:35 I mean, there's this rule that you can see on the Scikit-learn cheat sheet that says if you have less than 50 observations, then go get more data.
08:45 But I guess it's also a simplified answer.
08:48 It depends on the difficulty of the task.
08:50 So at the end of the day, often for these types of problems, you want to know something.
08:55 And this can be easy or hard.
08:58 You cannot really know before trying.
09:00 And typically for regression you would say, okay, if I predict within plus or minus 10%, that's maybe good enough for my application.
09:07 And maybe you need less data.
09:08 If you want to be super accurate, you need more data.
09:12 But the question of how much data is really hard to answer without really trying and using actual data.
09:17 Yeah, I can imagine.
09:18 It probably also depends on the variability of the data, the accuracy of the data, how many variables you're trying to give it.
09:26 So if you just tried to base it on the square footage or square meters of your house, that one variable, maybe it's easier to predict than, you know, 20 components that describe your house, right?
09:40 So the thing is, the more variables you have, the more you can hope to get.
09:46 Now it's not as simple as this, because if variables are not informative, then they're basically adding noise to your problem.
09:53 So you want as many variables as possible to describe your data in order to capture the weak signals.
10:02 But sometimes variables are just not relevant or predictive.
10:06 And so you want to remove them from the prediction problem.
10:10 Okay, that makes sense.
10:11 So I was looking into what are some of the novel uses of machine learning in order to sort of have some things to ask you about and just see what's out there.
10:25 What are ones that come to mind for you?
10:27 And then I'll give you some that I found on my list.
10:29 Maybe I'm biased because I'm really into using machine learning for scientific data and academic problems.
10:36 But I guess the academic breakthroughs that are really reaching everybody are related to computer vision and NLP these days, and probably also speech.
10:47 So these types of systems that try to predict something from speech signals or from images like describing you what's the contents, what types of objects you can find.
10:58 And for NLP you have like machine translation.
11:01 We did a show with OpenCV and the whole Python angle there.
11:07 There was a lot of really cool stuff on medical imaging going on there.
11:11 Does that have to do with scikit-learn as well?
11:14 Well, you have people doing medical imaging using scikit-learn, basically extracting features from MR images, magnetic resonance images, or CT scanners, or also like EEG brain signals.
11:29 And they're using EEG – sorry, they're using scikit-learn as the prediction tool, deriving features from their raw data.
11:40 And that reaches, of course, clinical applications in some contexts.
11:45 Maybe automatic systems that say, hey, this looks like it could be cancer or it could be some kind of problem, bring the attention of an expert who could actually look at it and say, yes, no, something like this?
11:57 Yeah, exactly.
11:58 It's like helping diagnosis, like trying to help the clinician to isolate something that looks weird or suspicious in the data, to get the time of the physicist and the clinician onto this particular part of the data to see what's going on and whether the patient is suffering from something.
12:19 Right.
12:19 That's really cool.
12:20 I mean, maybe you could take previous biopsies and invasive things that have happened to other people and their pictures and their outcomes and say, look, you have basically the same features and we did this test and the machine believes that you actually don't have a problem.
12:36 So, you know, probably don't worry about it.
12:37 We'll just watch this or something like that, right?
12:39 Yeah, I mean, on this line of thought, there was recently a Kaggle competition using retina pictures.
12:45 So, like people suffering from diabetes usually have problems with retinas.
12:50 And so, you can take pictures of retinas from hundreds of people and see if you can build a system that predicts something about the patient and the state of the disease from these images.
13:05 And this is typically done by pooling data from multiple people.
13:08 That's really cool.
13:09 I've heard this Kaggle competition or challenges before in various places looking at it.
13:15 What is that?
13:15 So, it's basically a website that allows you to organize these types of supervised learning problems, where a company or an organization, an NGO, whatever, has data and is trying to build a system, a predictive system.
13:33 And they ask Kaggle to set this up, which basically means for Kaggle putting the training data set online and giving this to data scientists.
13:45 And they basically then spend time building a predictive system that is evaluated on new data to get a score.
13:52 And that allows you to see how the system works on new data and to rank basically the data scientists that are playing with the system.
14:01 It's kind of an open innovation approach in data science.
14:07 That's really cool.
14:08 So, that's just Kaggle.com.
14:10 Yes.
14:11 K-A-G-G-L-E.com.
14:13 Exactly.
14:13 Yeah.
14:14 Very nice.
14:14 Some of the other ones that I sort of ran across while I was looking around that were pretty cool was one is some guys at Cornell University built machine learning algorithms to listen for the sound of whales in the ocean and use them in real time to help ships avoid running into whales.
14:34 That's pretty awesome, right?
14:35 Yeah.
14:36 Yeah.
14:36 There was a Kaggle competition on these whale sounds maybe a couple of years ago.
14:43 And it was a – I mean, not many data scientists have experience, like, listening to whales.
14:49 So nobody really knows what type of data it is.
14:53 And I remember this presentation from the winner basically saying how to win a Kaggle competition without knowing anything about the data.
15:01 It's kind of a provocative talk.
15:03 That is cool.
15:04 But showing how you can basically build a predictive system by just looking at the data and trying to make sense out of it without really being an expert in the field.
15:13 Yeah.
15:13 That's probably a really valuable skill as a data scientist to have, right?
15:17 Because you can be an expert, but not in everything.
15:19 Another one that was interesting: IBM was working on something to look at the handwritten notes of physicians.
15:29 Uh-huh.
15:30 And then it would predict how likely the person those notes were about was to have a heart attack.
15:37 Yeah.
15:37 In the clinical world, it's true that a lot of information is actually raw text, like manually written notes, but also raw text in the system.
15:48 For machine learning, that's a particularly difficult problem because it's what we call unstructured data.
15:57 So you need to – typically for scikit-learn to work on these types of data, you need to do something extra to basically come up with a structure or come up with features that allow you to predict something.
16:08 Sure.
16:10 And so both of those two examples that I brought up have really interesting data origin problems.
16:15 So if I give you an MP3 of a whale or an audio stream of a whale, how do you turn that into numbers that go into the machine even to train it?
16:30 And then similarly with handwriting, how do you – you've got to do handwriting recognition.
16:36 You've got to then do sort of understanding what the handwriting means.
16:41 And there's a lot of levels.
16:42 How do you take this data and actually get it into something like scikit-learn?
16:46 So scikit-learn expects that every observation, we also call it a sample or a data point, is basically described by a vector, like a vector of values.
16:57 So if you take the sound of the whale, you can say, okay, the sound in the MP3 is just a set of floating point values, one per time sample, really time-domain signals that you get for a few seconds of data.
17:10 It's probably not the best way to get a predictive – a good predictive system.
17:15 You want to do some feature transformation, change the input to get something that brings features that are more powerful for scikit-learn and the learning system.
17:25 And you would typically do this with time-frequency transforms, things like spectrograms, trying to extract features that are, for example, invariant to some aspects of the data, like frequencies or time shifts.
17:38 So there's probably a bit of pre-processing to do on these raw signals.
17:43 And then once you have your vector, you can use the scikit-learn machinery to build your predictive system.
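A rough sketch of that kind of pre-processing, assuming a whale call is already loaded as a 1-D NumPy array at a known sampling rate (averaging the spectrogram over time is just one simple way to get a fixed-length vector):

```python
import numpy as np
from scipy import signal

def spectrogram_features(audio, fs):
    # Time-frequency transform of the raw signal
    freqs, times, Sxx = signal.spectrogram(audio, fs=fs, nperseg=256)
    # Average the power over time: every clip yields the same number of features,
    # and the result is insensitive to small time shifts of the call
    return Sxx.mean(axis=1)

# Hypothetical usage: one feature vector per clip, stacked into the
# (n_samples, n_features) matrix that scikit-learn estimators expect
# X = np.vstack([spectrogram_features(clip, fs) for clip in clips])
```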
17:48 How much of that pre-processing is in the tool set?
17:53 So it depends for what types of data.
17:55 Typically for signals, there's nothing really specific in scikit-learn.
17:59 You would probably use scipy signal or any types of signal processing Python code that you find online.
18:06 I would say for other types of data, like text, in scikit-learn there is something called the feature extraction module.
18:14 And in the feature extraction module, you have something for text – the biggest part of feature extraction there is really text processing.
18:22 And you have some stuff also for images, but it's quite limited.
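A small sketch of that text path with the feature extraction module (the documents and labels below are invented toy data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win a free prize now", "meeting moved to 3pm",
        "cheap pills online", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (toy data)

# Turn unstructured text into a numeric matrix of weighted word counts
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["claim your free prize"])))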
18:29 We should probably introduce what scikit-learn is and get into the details of that.
18:33 But I have one more sort of example to let people know about that I think is pretty cool.
18:38 On show 16, I talked to Roy Rappaport from Netflix.
18:42 And Netflix has a tremendously large cloud computing infrastructure to power all of their – you know, basically their movie system, right?
18:51 And everything behind the scenes there.
18:53 And they have so many virtual machine instances and services running on them, and then different types of devices accessing services on those machines, that they said it's almost impossible to manually determine if there's, you know, some edge case where there's a problem.
19:10 And so they actually set up machine learning to monitor their infrastructure and then tell them if there's some kind of problem in real time.
19:18 Yeah.
19:19 So I think that's really a cool use of it as well.
19:33 This episode is brought to you by Hired.
19:36 Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.
19:42 Each offer you receive has salary and equity presented right up front, and you can view the offers to accept or reject them before you even talk to the company.
19:51 Typically, candidates receive five or more offers in just the first week, and there are no obligations ever.
19:58 Sounds pretty awesome, doesn't it?
20:00 Well, did I mention there's a signing bonus?
20:02 Everyone who accepts a job from Hired gets a $2,000 signing bonus.
20:06 And as Talk Python listeners, it gets way sweeter.
20:10 Use the link hired.com/talkpythontome, and Hired will double the signing bonus to $4,000.
20:18 Opportunity's knocking.
20:20 Visit hired.com/talkpythontome and answer the call.
20:31 Yeah, that's a very cool thing to do.
20:36 And actually, many industries and many companies are looking for these types of systems that they call anomaly detection or failure prediction.
20:46 And it's becoming a big use case for machine learning, indeed.
20:52 The Netflix guys were actually using Scikit-learn, not just some other machine learning system.
20:56 So let's get to the details of that.
20:59 What's Scikit-learn?
20:59 Where did it come from?
21:00 So Scikit-learn is probably the biggest machine learning library that you can find in the Python world.
21:07 So it dates back almost 10 years, to when David Cournapeau was doing a Google Summer of Code to kickstart the Scikit-learn project.
21:16 And then for a few years, there was a French guy called Matthieu Brucher who took on the project.
21:24 But it was kind of a one-guy project for many years.
21:28 And in 2010, with colleagues at INRIA in France, we decided to basically try to start from this state of Scikit-learn and make it bigger and really try to build a community around this.
21:46 So these people are Gaël Varoquaux and Fabian Pedregosa, and also somebody you may have heard of in the machine learning world, Olivier Grisel.
21:56 And so that was pretty much 2010, so five years ago.
22:03 And basically it took on pretty quickly.
22:05 After, I would say, a year of Scikit-learn, we had more than 10 core developers way beyond the initial lab where it started.
22:17 That's really excellent.
22:18 Yeah, I mean, it's definitely an absolutely mainstream project that people are using in production these days.
22:24 So congratulations to everyone on that.
22:26 That's great.
22:26 Thank you.
22:27 Yeah.
22:28 And so the name Scikit-learn comes from the fact that it's basically an extension to the SciPy ecosystem, right?
22:37 So that's NumPy for numerical processing, SciPy for scientific stuff, Matplotlib, IPython, SymPy for symbolic math, and Pandas, right?
22:48 And then there's these extensions.
22:50 Yes.
22:51 So basically the kind of division is that you cannot put everything in SciPy.
22:55 SciPy is already a big project.
22:57 And the idea of the SciKits was to build extensions around SciPy that are more domain-specific.
23:03 Also, it's kind of easier to contribute to a smaller project.
23:08 So basically the barrier of entry for newcomers is much lower when you contribute to a Scikit than to SciPy, which is a fairly big project now.
23:16 Yeah, there's so much support for the whole SciPy system, right?
23:22 So it's much better to just build on that than try to duplicate anything and say NumPy or whatever.
23:27 Exactly.
23:28 I mean, there's a lot of efforts to see what could be NumPy 2.0 and what's going to be the future of it and how to extend it.
23:37 I mean, a lot of people are thinking of what's next because, I mean, NumPy is almost 10 years old, probably more than 10 years old now.
23:44 And, yeah, people are trying to see also how it can evolve.
23:49 Sure.
23:49 That makes a lot of sense.
23:51 So speaking of evolving and going forward, what are the plans with Scikit-learn?
23:57 Where is it going?
23:57 So I would say in terms of features, I mean, Scikit-learn is really in the consolidation stage.
24:04 Scikit-learn is five years old.
24:06 The API is pretty much settled.
24:09 There are a few things here and there that we have to deal with now, due to early decisions in terms of API that need to be fixed.
24:20 And I guess the big objective is to basically do Scikit-learn 1.0, like the first stable, fully stable release in terms of API because that's something that we've been talking about between the core developers for, I mean, more than two years now, coming with this 1.0 version that stabilizes every part of the API.
24:44 Right.
24:44 One final major cleanup, if you can, and then stabilizing it, yeah?
24:49 Exactly.
24:49 Exactly.
24:49 And in terms of new features, I mean, you always have a lot of cool stuff that are around and you see the number of pull requests that are coming on top of Scikit-learn.
25:01 It's pretty crazy.
25:02 And I would say a huge maintainer's effort and reviewing effort.
25:07 So features are coming in slowly now in Scikit-learn, much more slowly than it used to be, but I guess it's normal for a project that is getting big.
25:14 Yeah, it's definitely getting big.
25:16 It has 7,600 stars and 4,500 forks on GitHub, so that's pretty awesome.
25:22 Yeah.
25:23 It has 457 contributors.
25:24 Cool.
25:25 Yeah, I would say for every release we get, I mean, we try to release every six months.
25:30 And for every release we get a big number of contributors.
25:36 So maybe we could do like a survey of the modules of Scikit-learn, just the important ones that come to mind.
25:44 What are the moving parts in there?
25:46 So I would say maybe something I know the best, which is the part of the library that I maintain the most, which is the linear model module.
25:52 And recently the efforts on the linear models were to scale it up.
25:58 Basically, try to learn these linear models in an out-of-core fashion to be able to scale to data that does not fit in RAM.
26:07 And that's, I would say, part of the plan for the linear model module in Scikit-learn.
26:15 That's cool.
26:15 So what kind of problems do you solve with that?
26:17 The types of problems where you have a, like, humongous number of samples and potentially a large number of features.
26:24 So there are not so many applications where you get that many samples, but that's typically text or log files.
26:31 These types of industry problems where you collect a lot of samples on a regular basis.
26:38 You have these examples also if you monitor an industrial system, like if you want to do what we discussed before about, like, predictive maintenance.
26:46 That's probably a use case where this can be useful.
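A sketch of that out-of-core pattern using SGDClassifier and partial_fit; the batch generator below is a hypothetical stand-in for reading log files or text in chunks that fit in RAM:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()          # a linear model that can be trained incrementally
classes = np.array([0, 1])     # all possible labels must be declared up front

def iter_batches(n_batches=100):
    # Stand-in for streaming data off disk, one chunk at a time
    rng = np.random.RandomState(0)
    for _ in range(n_batches):
        X_batch = rng.rand(1000, 20)
        y_batch = (X_batch[:, 0] > 0.5).astype(int)
        yield X_batch, y_batch

for X_batch, y_batch in iter_batches():
    clf.partial_fit(X_batch, y_batch, classes=classes)
```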
26:50 Probably the other, like, module that also attracts a lot of effort these days is the Ensemble module, and especially the tree module.
27:00 So for models like Random Forest or Gradient Boosting, which are very popular models that have been helping people to win Kaggle competitions for the last few years.
27:13 Yeah, I've heard a lot about these forests and so on.
27:17 Can you talk a little bit about what that is?
27:19 So a random forest basically is a set of decision trees that you pool together to get a prediction that is more accurate.
27:32 More accurate because it has less variance in technical terms.
27:36 And the way it works is you try to basically build decision trees from a subset of data, a subset of samples, subset of features in a clever way.
27:46 And then you pool all these trees into one big predictive model.
27:50 And, for example, if you do binary classification and you train a thousand trees, you ask the thousand trees about a new observation.
27:59 What's the label?
27:59 Is it positive or negative?
28:01 And then you basically count the number of trees that are saying positive.
28:05 And if you have more trees saying positive, then you predict positive.
28:08 That's kind of the basic idea of random forest.
28:11 And it turns out to be super powerful.
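A sketch of that idea with the ensemble module (toy data, and the number of trees is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A toy binary classification problem
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# A thousand decision trees, each grown on a random subset of samples and features
forest = RandomForestClassifier(n_estimators=1000, random_state=0)
forest.fit(X, y)

print(forest.predict(X[:3]))        # the trees' majority decision
print(forest.predict_proba(X[:3]))  # averaged per-tree class probabilities
```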
28:13 That's really cool.
28:14 Well, it seems to me like it would bring in kind of different perspectives or taking different components or parts of a problem into account.
28:23 So some of the trees look at some features and maybe the other trees look at other features.
28:28 And then they can combine in some important way.
28:31 Exactly.
28:32 Yeah.
28:33 Another one that I see coming up is the SVM module.
28:36 What's that one do?
28:38 So SVM is a very popular machine learning approach that was, I mean, very big in the 90s and 10 years ago, and still gets some traction.
28:52 And basically, the idea of a support vector machine, which is what SVM stands for, is to be able to use kernels on the data and basically solve linear problems in an abstract space onto which you project your raw data.
29:10 Let me try to give an example.
29:11 If you take a graph or if you take a string, that's not naturally something that can be represented by a vector.
29:18 And when you do an SVM, you have a tool, which is a kernel that allows you to compare these observations, like a kernel between strings, a kernel between graphs.
29:27 And once you define this kernel, and this kernel needs to satisfy some properties that I'm going to skip, then you can use this SVM to do classification but also regression.
29:37 This is what you have in the SVM module of scikit-learn, which is basically a very clever and efficient binding of an underlying library called LibSVM.
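A sketch of that kernel idea with the SVM module, using a precomputed kernel on strings; the character-overlap kernel below is a deliberately simple toy choice (it is a valid kernel because it equals the inner product of binary character-indicator vectors):

```python
import numpy as np
from sklearn.svm import SVC

def string_kernel(a, b):
    # Toy kernel: number of distinct characters the two strings share
    return len(set(a) & set(b))

strings = ["spam spam spam", "ham sandwich", "cheap spam deal", "lunch plans"]
labels = [1, 0, 1, 0]

# Full kernel matrix between all pairs of training strings
K = np.array([[string_kernel(a, b) for b in strings] for a in strings], dtype=float)

clf = SVC(kernel="precomputed")
clf.fit(K, labels)

# To predict, we need the kernel between the new strings and the training strings
new = ["free spam offer"]
K_new = np.array([[string_kernel(a, b) for b in strings] for a in new], dtype=float)
print(clf.predict(K_new))
```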
29:47 Okay, excellent.
29:48 And is that used more in the unsupervised world?
29:51 It's completely supervised.
29:52 When you do SVM, it's classification or regression that's supervised.
29:55 There's one use case of SVM in an unsupervised setting, which is what we call the one-class SVM.
30:02 So you just have one class, which basically means that you don't have labels, you just have data, and you're trying to see which data points are the least like the others.
30:11 That's more like an anomaly detection problem, or we call it also novelty detection or outlier detection.
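A sketch of that one-class use: fit on only "normal" data, then flag new observations that do not look like it (the toy data and parameter values are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_normal = 0.3 * rng.randn(200, 2)   # unlabeled observations of normal behaviour

detector = OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1)
detector.fit(X_normal)

X_new = np.array([[0.1, -0.2],   # looks like the training data
                  [4.0, 4.0]])   # unlike anything seen before
print(detector.predict(X_new))   # +1 for inliers, -1 for outliers / novelties
```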
30:17 Maybe we could talk a little bit about some of the algorithms.
30:21 As a non-expert in sort of the data science machine learning field, I go in there and I see all these cool algorithms and graphs, but I don't really know what would I do with that.
30:31 On the site, it says there's all these algorithms it supports.
30:34 So, for example, it supports dimensionality reduction.
30:38 Like, what kind of problems would I bring that in for?
30:41 I guess it's hard to summarize.
30:44 There are hundreds of pages of documentation in Scikit-Learn trying to give you the big picture without too much technical detail: to tell you when these algorithms are useful, what they are useful for, what the hypotheses are, and what kind of output you can hope to get.
31:02 It's one of the strengths of the Scikit-Learn documentation, by the way.
31:05 And so to answer your question about dimensionality reduction, I would say the 101 way of doing it is principal component analysis, where you're trying to extract a subspace that captures the most variance in the data.
31:22 And that can be used to do visualization of the data in low dimension.
31:26 If you do a PCA in two or three dimensions, then you can look at your observations as a scatterplot in 2D or 3D.
31:33 And that's basically visualization.
31:35 But you can also use this to reduce the size of your data set, maybe without losing too much predictive power.
31:43 So you take your big data set, you run a PCA, and then you reduce the dimension.
31:48 And then suddenly you have a learning problem, which is on smaller data, because you basically reduce the number of features.
31:55 Those are kind of the standard approaches: visualization, or reducing the data set to get more efficient learning in terms of computing time, but also sometimes better prediction.
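A sketch of that PCA use on the Iris data, projecting four features down to two for a scatterplot (assuming matplotlib is available for the plotting part):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target   # four measurements per flower

# Keep the two directions that capture the most variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # share of variance each component keeps

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()
```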
32:08 Okay, that makes a lot of sense.
32:10 That's really cool.
32:10 So like if we went back to my house example, maybe I was feeding like the length of the driveway and the number of trees in the yard.
32:18 And it might turn out that neither of those have any effect on house prices.
32:22 So we could reduce it to a smaller problem by having this whole PCA go, look, those don't matter.
32:27 Throw that part out.
32:28 It's really about the number of bathrooms and the square footage or something.
32:32 Well, yes and no.
32:36 That's kind of the idea.
32:36 Okay, but in this example of Boston, the prediction of houses, you want to reduce the dimension in an informed way.
32:44 Because the number of trees in the yard can be informative for something, but maybe not to predict the price of the apartment or price of the house.
32:52 So when you do dimensionality reduction in the context of supervised learning, that can also be what you call feature selection, basically selecting the predictive features, which ultimately leads to a reduced data set because you remove features.
33:05 But that would be in a supervised context.
33:07 When you do PCA, you're really in an unsupervised way.
33:10 You don't know what are the labels.
33:11 You just want to figure out what's the variance.
33:14 Where is the variance in the data coming from?
33:16 On which axis and which direction should I look to see the structure?
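In the supervised setting he contrasts with PCA, a feature selection sketch could score each feature against the labels and keep only the most predictive ones (here again with the Boston housing data as the running example):

```python
from sklearn.datasets import load_boston   # bundled in scikit-learn of that era
from sklearn.feature_selection import SelectKBest, f_regression

boston = load_boston()
X, y = boston.data, boston.target

# Score every feature against the target price and keep the five best
selector = SelectKBest(score_func=f_regression, k=5)
X_reduced = selector.fit_transform(X, y)

kept = selector.get_support(indices=True)
print([boston.feature_names[i] for i in kept])   # the most predictive features
```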
33:20 Another thing that is in there is ensemble methods for combining the predictions of multiple supervised models.
33:29 What's the story there?
33:30 That sounds cool.
33:30 So random forest is an example of ensemble methods.
33:36 When you have an ensemble, it's basically saying that you're taking a lot of classifiers or a lot of regressors and you combine them into a bag of predictors, a bag of models, or an ensemble of models.
33:50 And then you make them collaborate in order to build a better prediction.
33:53 And random forest is basically an ensemble of trees.
33:57 But you can also do an ensemble of neural networks.
34:02 You can do an ensemble of whatever models you want to pool.
34:07 And that turns out to be in practice often a very efficient approach.
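Scikit-learn ships a VotingClassifier for exactly this kind of heterogeneous ensemble; a sketch, with an arbitrary choice of base models:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# Three very different models voting on each prediction (majority wins)
ensemble = VotingClassifier(estimators=[
    ("logreg", LogisticRegression()),
    ("forest", RandomForestClassifier(n_estimators=100)),
    ("svm", SVC()),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:3]))
```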
34:11 Yeah, like we were saying, the more perspectives, different models, it seems like that's a really good idea.
34:18 So you mentioned neural networks.
34:20 Yes.
34:21 So Scikit-Learn has support for neural networks as well?
34:23 Well, you have a multilayer perceptron, which is like the basic neural network.
34:29 I mean, these days in neural network, people talk about deep learning.
34:32 I've heard about it.
34:33 That's about the extent of it.
34:34 What's deep learning?
34:35 This episode is brought to you by Codeship.
34:53 Codeship has launched Organizations: create teams, set permissions for specific team members,
34:58 and improve collaboration in your continuous delivery workflow.
35:01 Maintain centralized control of your organization's projects and teams with Codeship's new organizations plan.
35:07 And as Talk Python listeners, you can save 20% off any premium plan for the next three months.
35:12 Just use the code TALKPYTHON, all caps, no spaces.
35:17 Check them out at codeship.com and tell them thanks for supporting the show on Twitter, where they're @codeship.
35:28 So deep learning is basically neural networks 2.0, where you take neural networks and you stack more layers.
35:35 So kind of the story there is that for many years, people were kind of stuck with networks of two or three layers.
35:44 So not very deep.
35:46 And part of the issue is that it was really hard to train something that would add more layers.
35:51 In terms of research, there were two things that came up: first, that we got access to more data,
35:57 which means that we can train bigger and more complex models.
36:01 But also there were some breakthroughs in learning these models that allowed people to avoid overfitting,
36:09 to be able to learn these big models because you have more data,
36:14 but also with clever ways to prevent overfitting.
36:17 And that basically led to deep learning these days.
36:19 Oh, very interesting.
36:20 Yeah, that's been one of the problems with neural networks, right?
36:23 Is that if you teach it too much, then it only knows, you know, just the things you've taught it or something, right?
36:27 Exactly.
36:28 It basically learns by heart what you provide as training observations and ends up being very bad when you provide new observations.
36:38 Want to talk a little bit about the datasets that come built in there?
36:41 Uh-huh.
36:42 We've talked a little bit about the Boston one, and that's the Boston house prices for regression.
36:47 What I hear coming up a lot is one called Iris.
36:50 Is that like your eye itself?
36:54 So Iris is the dataset that we use to illustrate all the classification problems.
37:00 It's really a very common dataset that turned out to have a license that allowed us to ship it with scikit-learn,
37:07 and basically we built most of the examples using this Iris dataset, which is also very much used in textbooks of machine learning.
37:15 So that was kind of the default choice, and it speaks to people because you understand the problem you're trying to solve,
37:23 and it's rich enough and not too big, so we can make all these examples run super fast and build a nice collection of examples.
37:30 That's very cool.
37:30 What is the dataset?
37:31 What exactly is it about?
37:33 So with the Iris dataset, you're trying to predict the type of plant using, for example, the sepal length and the sepal width.
37:44 So you have a number of features that describe the plant, and you're trying to predict which one among three it is.
37:51 So it's a three-label, three-class classification problem.
37:55 Yeah, that's cool.
37:56 Enough data to not just be a linear model or something, a single variable model, but not too much?
38:02 Exactly.
38:04 It's not completely linear, but not too hard at the same time.
38:10 Right.
38:10 If you get 20 variables, that's probably too much to deal with.
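A sketch of a first classifier on that Iris data (a nearest-neighbour model here, but almost any scikit-learn classifier slots into the same fit/predict pattern):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target   # sepal/petal measurements and the species label (0, 1 or 2)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)

# Predict the species of a new flower from its four measurements
print(iris.target_names[clf.predict([[5.1, 3.5, 1.4, 0.2]])])
```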
38:13 Then one is on diabetes.
38:14 What about diabetes does that dataset represent?
38:17 Do you know?
38:18 I'm actually not really sure what's the – no, it's a regression problem.
38:24 It's used a lot in the linear model module, especially for the sparse regression models, because part of what these sparse regression models do is try to extract the predictive features.
38:34 I guess in the diabetes dataset, you try to find something related to diabetes, and you're interested in finding the most predictive features.
38:41 What are the best features?
38:43 And then that's part of the reason I think we're using it.
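A sketch of that sparse regression use on the diabetes data, where the zeroed-out coefficients effectively discard the non-predictive features (the alpha value is arbitrary):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# A sparse linear model: many coefficients are driven exactly to zero,
# so the remaining non-zero ones point at the most predictive features
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)
```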
38:46 And then another one is digits, which is kind of meant to model images, right?
38:51 One of the early, I would say, breakthroughs of machine learning was this work in the 90s where Yann LeCun and other people were trying to build a system that could predict what digit was present on the screen or in the image.
39:10 So it's a very old machine learning problem where you start from a picture or an image of a digit that is handwritten, and you're trying to predict what it is from zero to nine.
39:20 And it's an example that basically people can easily grasp in order to understand what machine learning is.
39:26 You give me an image, and I'll predict something between zero and nine.
39:30 And historically, when we did the first version of the scikit-learn website, we had something like seven or eight lines of Python code that were running classification of digits.
39:41 So that was kind of the motivating example where we said, okay, scikit-learn is machine learning made easy.
39:46 And here it is, an example.
39:48 It's ten lines of code classifying digits.
39:51 And that was basically the punchline.
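The spirit of that front-page example, in roughly that many lines (the exact code on the old site may have differed slightly):

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
n = len(digits.images)
data = digits.images.reshape((n, -1))    # each 8x8 image becomes a vector of 64 values

clf = svm.SVC(gamma=0.001)
clf.fit(data[:n // 2], digits.target[:n // 2])        # train on the first half

predicted = clf.predict(data[n // 2:])                # classify the other half
print((predicted == digits.target[n // 2:]).mean())   # fraction predicted correctly
```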
39:53 Solving this old hard problem in a nice, simple way, right?
39:57 Yeah.
39:57 You know, lately, there's been a lot of talk about artificial intelligence, and especially from people like Elon Musk and Stephen Hawking,
40:08 saying that maybe we should be concerned about artificial intelligence and things like that.
40:14 So one of my first questions around this area is, is machine learning the same thing as artificial intelligence?
40:20 Depends who you ask.
40:23 Okay.
40:24 Sure.
40:25 No, I mean, AI was basically the early name of trying to teach a computer to do something.
40:34 I mean, it dates back to the 60s and 70s, where basically in the US, for example at MIT, you had labs that were called AI labs.
40:42 And machine learning is, I would say, a more restricted set of problems compared to AI.
40:53 Say, when you do AI and you want to work with text or linguistics, you want to build a system that understands language.
41:02 That would be an AI problem.
41:05 But machine learning is kind of a saying, okay, I've got a loss function.
41:08 I want to optimize my criteria.
41:10 I've got something that I want to train my system on.
41:15 And in a sense, you teach a system to learn.
41:17 And so you create some kind of intelligence.
41:21 But it's, I would say, a simpler thing to say than intelligence, which is kind of a hard concept.
41:28 That's maybe part of my personal answer to this.
41:31 Yeah, no, it's a great answer.
41:33 Just from my limited exposure to it, it seems like machine learning is more about classification and prediction,
41:39 whereas with the AI concept, there's a strong autonomous component that is just completely lacking in machine learning.
41:47 Yeah, I guess I would say, I would explain it simply like this, exactly.
41:52 What things have you seen people using Scikit-learn for that surprised you?
41:58 Or like, wow, you guys are doing that?
42:00 That's amazing.
42:03 So on Scikit-learn, we have this testimonial page where we typically ask companies or institutes that are using Scikit-learn to write a couple of sentences to say, okay, what they're using Scikit-learn for and why they think it's great.
42:23 I'm trying to find this.
42:25 And I remember there was this, I think, a dating website.
42:30 Saying that they were using Scikit-learn to optimize dates between people.
42:36 That was great.
42:38 That was like a funny one.
42:40 That is funny.
42:41 So there may be people out there who are married and maybe even babies who are born because of Scikit-learn.
42:46 Yeah, that would be great.
42:49 I'm going to add this to my resume.
42:52 It's awesome.
42:53 Matchmaker.
42:54 So if people want to get started with Scikit-learn, they're out there listening, they're like, this is awesome.
43:00 Where do I start?
43:01 What would you recommend for sort of getting into this whole world of machine learning and getting started with Scikit-learn in particular?
43:07 First, start with the Scikit-learn website, which is pretty extensive.
43:11 But you also have a lot of tutorials that have been given by core devs of Scikit-learn in different conferences like SciPy, EuroSciPy, or PyData events.
43:22 And you can find all these videos online.
43:24 I would just tell you: take some of them, sit down, listen, and try to do it yourself afterwards.
43:35 I mean, for example, in SciPy, you've got tutorials on Scikit-learn that are pretty much a whole day of tutorials, which is hands-on.
43:41 And all these are taped.
43:42 So you can really look and get the materials online from the tutorial and get started.
43:48 Oh, that's excellent.
43:49 Yeah, I think it's really amazing these days that there's so many of these videos online that you can...
43:56 There's some topic you imagine, like, hey, I want to know this thing in Python.
43:59 There's a very good chance that someone gave some kind of conference talk on it and it's online.
44:03 Yeah.
44:05 Anything you want to give sort of a shout-out to or a final call to action before we sort of wrap things up a bit?
44:10 So if you have free time, you like machine learning, come give us a hand to maintain this cool library.
44:18 Yeah, absolutely.
44:20 Yeah, like I said, there's 457 contributors, but, you know, you guys are looking to stabilize things and move forward.
44:26 So I'm sure there's a lot to be done around that.
44:28 Yeah.
44:29 I mean, basically, you also have two types of contributors.
44:32 You have these one-time contributors who are real experts in something and contribute something really specific and valuable that gets merged into the main code base.
44:42 And you have, I would say, fewer people investing their time to read the code from others and keep the library consistent in terms of API.
44:51 And that's really the big reviewing work that, I would say,
44:55 the historical core devs of scikit-learn are mostly doing these days, investing little time in really new stuff, which is basically left to the newcomers.
45:05 And I think, if I had to wish something for the future, it's that these one-time contributors also spend a bit of their time to help us maintain the entire library in the longer run.
45:21 Yeah, that makes sense.
45:22 I can see in something like scikit-learn where it's kind of a family of all these different algorithms and little techniques that if you want to add your technique, you just go in there and you do that little bit and you kind of stay out of the rest of the code.
45:36 And I can see how that would definitely lead to inconsistencies and so on.
45:40 Yeah, and in terms of policy for scikit-learn, that's maybe why fewer things are coming in these days: we're not trying to build a library that contains all the algorithms you can ever think of or that get published every year.
45:54 We're trying to keep the algorithms that are the best for some clear use case in the current state of the art.
46:04 And so we cannot implement everything, but at least if you have a particular type of problem, you should have something in scikit-learn that does a good job.
46:12 So before I let you go, I have two more final questions for you.
46:16 So if you're going to open, if you're going to write some Python code, what editor do you open up?
46:21 So I've been a big user of TextMate over the years.
46:27 And I have to admit, I switched to Sublime recently because I got convinced by my neighbor.
46:34 So no Vim or Emacs troll with me.
46:37 Yeah, that's cool.
46:39 Yeah, I like Sublime Text a lot.
46:41 Very nice.
46:41 And of all the cool machine learning and Python in general packages out on PyPI, what are some that you think people maybe don't know about that you're like, hey, this is awesome.
46:52 You should know about it.
46:53 Well, maybe I'm biased because I do a lot of machine learning for brain science.
46:57 And so, unrelated to scikit-learn per se, I've been working for the last five years on this project called MNE, which allows you to process brain waves and classify brain states.
47:09 Like, for example, build brain-computer interfaces or analyze clinical data of electrophysiology.
47:15 That's basically, if you want to play with brain waves, you can check it out.
47:18 That's really cool.
47:20 And when you say brain-machine interfaces, is it like EEGs and stuff like that?
47:24 Exactly.
47:25 EEG, MEG.
47:26 Yeah.
47:27 Okay.
47:27 Wow.
47:28 Very awesome.
47:28 Yeah, I hadn't heard of that one.
47:29 That's cool.
47:30 So again, I'm biased.
47:32 That's more my second baby.
47:34 Yeah, that's great.
47:36 So, Alexander, it's been really great to have you on the show.
47:40 And this has been a super interesting conversation.
47:42 Thanks.
47:42 Thank you very much.
47:44 You bet.
47:44 Talk to you later.
47:46 This has been another episode of Talk Python to Me.
47:49 Today's guest was Alexandre Gramfort.
47:51 And this episode has been sponsored by Hired and CodeShip.
47:55 Thank you guys for supporting the show.
47:57 Hired wants to help you find your next big thing.
48:00 Visit hired.com/talkpythontome to get five or more offers with salary and equity presented right up front.
48:06 And a special listener signing bonus of $4,000.
48:09 Codeship wants you to always keep shipping.
48:13 Check them out at codeship.com and thank them on Twitter via @codeship.
48:17 Don't forget the discount code for listeners.
48:19 It's easy.
48:20 Talk Python.
48:21 All caps.
48:22 No spaces.
48:24 You can find the links from today's show at talkpython.fm/episodes/show/31.
48:30 Be sure to subscribe to the show.
48:33 Open your favorite podcatcher and search for Python.
48:35 We should be right at the top.
48:36 You can also find the iTunes and direct RSS feeds in the footer of the website.
48:40 Our theme music is Developers, Developers, Developers by Corey Smith, who goes by Smix.
48:46 You can hear the entire song on talkpython.fm.
48:49 This is your host, Michael Kennedy.
48:52 Thank you very much for listening.
48:54 Smix, take us out of here.
48:56 Thank you.
48:57 Stating with my voice.
48:58 There's no norm that I can feel within.
49:00 Haven't been sleeping.
49:01 I've been using lots of rest.
49:03 I'll pass the mic back to who rocked it best.
49:06 First of all, first of all.
49:08 First of all, first of all.
49:15 developers, developers, developers, developers.