#31: Machine Learning with Python and scikit-learn Transcript

Recorded on Friday, Sep 25, 2015.

00:00 Machine learning allows computers to find hidden insights without being explicitly programmed where to look or what to look for. Thanks to the work of some dedicated developers, Python has one of the best machine learning platforms called scikit-learn. In this episode, Alexandre Gramfort is here to tell us all about scikit-learn and machine learning. This is Talk Python To Me number 31 recorded Friday, September 25th 2015.

00:00 [music intro]

00:00 Welcome to Talk Python to Me. A weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy.

00:00 Follow me on twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at and follow the show on twitter via @talkpython.

00:00 Sponsors

00:00 This episode is brought to you by Hired and Codeship. Thank them for supporting the show on twitter via @hired_hq and @codeship

00:00 Hi everyone. Thanks for listening. Let me introduce Alexandre so we can get right to the interview.

00:00 Alexandre Gramfort is currently an assistant professor at Telecom ParisTech and scientific consultant for the CEA Neurospin brain imaging center. His work is on statistical machine learning, signal and image processing, optimization, scientific computing and software engineering with primary applications in brain functional imaging.

00:00 Before joining Telecom ParisTech, he worked at the Martinos Center for Biomedical Imaging at Harvard in Boston. He is also an active member of the Center for Data Science at Université Paris-Saclay.

02:03 Alexandre, welcome to the show.

02:05 Thank you, hi.

02:06 Hi. I'm really excited to talk about machine learning and Scikit-learn with you today. It's something I know almost nothing about, so it is going to be a great chance for me to learn along with everyone else who is listening.

02:18 So hopefully I will be able to give answers.

02:22 Yeah, I'm sure that you will. All right, so we are going to talk all about machine learning, but before we get there, let's hear your story, how did you get into programming in Python?

02:30 Well, I've done a lot of scientific computing and scientific programming over the last maybe 10 to 15 years, having started my undergrad in computer science doing a lot of signal and image processing. Well, like these types of people, I have done a lot of Matlab in my previous life and-

02:46 Yes, I've done a lot of Matlab too, I know about the .m files.

02:52 And, I did a PhD in computer science applied to brain imaging and I switched to a different team where I got surrounded by people working with Python, and basically I got into it, and I switched in like one week; Matlab was gone from my life. And, it has been maybe five years now. And, that's kind of the historical part.

03:22 Do you miss Matlab?

03:23 Not really.

03:25 Me neither. And there are some cool things about it, but-

03:30 Yeah, so I have students insisting on working with me in Matlab, so I still have to do stuff in Matlab for supervision, but not really when I have the choice.

03:43 Yeah, when we have the choice, of course. I think one of the things that is really a drawback of specialized systems like Matlab is that it's very hard to build production-ready finished products. You can do research, you can learn, you can write papers, you can even test algorithms, but if you want to get something that is running in data centers on its own, probably Matlab is- you could make it work but it's not generally the right choice.

04:08 Definitely, yeah.

04:09 Yeah, I think that explains a lot of the growth of Python in this whole data science and scientific computing world, along with great toolkits like Scikit-learn, right?

04:21 Yes, that is definitely the way Scikit-learn is now used; just the fact that the Python stack allows you to make this [inaudible 04:31] type of code is a clear win for everyone.

04:37 So, before we get into the details of Scikit-learn and how you work with it and all the features it has, let's just in a really broad way talk about machine learning- like, what is machine learning?

04:47 I would say the simplest example of machine learning is trying to predict something from previous data. That's what people would call supervised learning, and there are plenty of examples of this in everyday life, like your mailbox that predicts for you if your email is spam or ham. That basically is a system that learns from previous data how to make an informed choice and give you a prediction. That's basically the simplest way of seeing machine learning, and you see machine learning problems framed this way in all contexts, from industry to academic science, and I mean, there are many examples. In terms of other classes of problems that you see in machine learning, there are those that are not really prediction problems: you try to make sense of raw data where you don't have labels like spam or ham, you just have data, and you want to figure out what the structure is, what types of insight you can get from it. That's, I would say, the other big class of problems that machine learning addresses.

06:03 Yeah, so there is that general classification. I guess with the first category you were talking about like spam filters and other things that maybe fall into that realm, like credit card fraud, maybe trading stocks, these kind of binary do it/ don't do it based on examples, that's something that is called structured learning, what's the-

06:29 The common name is supervised learning.

06:31 Supervised learning, that's right.

06:32 Yeah. So basically you have pairs of training observations that are [inaudible 06:39] their corresponding labels. So the text, and the label would be spam or ham. This is basically binary classification. There are other types of machine learning problems, for example regression: you want to predict the price of a house, and you know the number of square feet, you know the number of rooms, you know exactly what the location is, and so you have a bunch of variables that describe your house or apartment, and from this you want to predict the price. That is another example where, since the price is a continuous variable and not binary, this is what people call regression, and this is another class of supervised learning problem.

07:17 Right, so you might know through the real estate data all the houses in the neighborhood that were sold in the last two years, the ones that were sold last month, and there are variables, dimensions if you will, like number of bathrooms, number of bedrooms, square feet, or square meters, and you can feed it into the system to train it and then you can say, "Well now I have a house with two bathrooms and three bedrooms and right here- what's it worth?" Right?

07:47 Exactly, that is basically a typical example and also a typical data set that we use in Scikit-learn. It basically illustrates the concept of regression with a similar problem.
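Editor's note: a minimal sketch of the house-price regression idea discussed above. It uses made-up synthetic data rather than the Boston data set, and the pricing rule, variable names, and numbers are all assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, made-up housing data (an assumption, not real estate records).
rng = np.random.default_rng(0)
n_houses = 200
square_feet = rng.uniform(500, 3500, size=n_houses)
bedrooms = rng.integers(1, 6, size=n_houses)
bathrooms = rng.integers(1, 4, size=n_houses)

# Hypothetical pricing rule plus noise, so the model has something to learn.
price = (150 * square_feet + 10_000 * bedrooms + 15_000 * bathrooms
         + rng.normal(0, 20_000, size=n_houses))

# Each row is one house (one observation); each column is one variable.
X = np.column_stack([square_feet, bedrooms, bathrooms])

# Supervised learning: fit on (observation, label) pairs, here a continuous label.
model = LinearRegression().fit(X, price)

# "I have a house with two bathrooms and three bedrooms: what's it worth?"
new_house = np.array([[1800, 3, 2]])
predicted_price = model.predict(new_house)[0]
print(f"Predicted price: {predicted_price:,.0f}")
```

Because the label is a continuous number, this is regression; swapping the label for spam/ham and the model for a classifier would give the classification case.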

07:56 Right, we'll talk more about it, but Scikit-learn comes with some pre-built data sets and one of them is the Boston house market, right?

08:04 Exactly, that's the one.

08:05 Yeah. How much data do you have to give it? Suppose I want to try to estimate the value of my house. At least in the United States we have a service called Zillow, and they are doing way more; I am sure they are running something like this actually. But suppose I wanted to take it upon myself to grab the real estate data and try to estimate the value of my home. How many houses would I have to give it before it would start to be reasonable?

08:32 Well, that's a tough question, and I guess there is no simple answer. I mean, you have this rule that you can see on the cheat sheet of Scikit-learn that says if you have fewer than 50 observations then go get more data. But I guess there's also a simplified answer: it depends on the difficulty of the task. At the end of the day, for these types of problems you want to know something, and this can be easy or hard; you cannot really know before trying. Typically in regression you would say, OK, if I predict within plus or minus 10%, that's maybe good enough for my application, and maybe you need less data. If you want to be super accurate you need more data, but the question of how much is really hard to answer without really trying and using actual data.

09:17 Yeah. I can imagine. And it probably also depends on the variability of the data, the accuracy of the data, how many variables you are trying to give it, so if you just try to base it on square footage or square meters of your house, that one variable maybe it's easier to predict than, you know, 20 components that describe your house, right?

09:41 So the thing is, the more variables you have, the more you can hope to get. Now it's not as simple as this, because if variables are not informative, then they are basically not adding knowledge to your problem. So you want as many variables as possible to describe your data, in order to capture the weak signals, but sometimes the variables are just not relevant or predictive. And so you want to remove them from the prediction problem.

10:11 Ok. That makes sense. So, I was looking into what are some of the novel uses of machine learning. I sort of have some things to ask you about, and just see what is out there. What are ones that come to mind for you and then I'll give you some that I found on my list.

10:28 Maybe I'm biased because I'm really into using machine learning for scientific data and academic problems. I guess the things that are really academic breakthroughs that are reaching everybody are really computer vision and NLP these days, and probably also speech. So these types of systems that try to predict something from speech signals or from images, like describing what the contents are, what types of objects you can find, and for NLP you have like machine translation-

11:02 We did a show on OpenCV and the whole Python angle there, there is a lot of really cool stuff on medical imaging going on there, does that have to do with Scikit-learn as well?

11:14 Well, you have people doing medical imaging with Scikit-learn, basically extracting features from MR images (magnetic resonance images), or CT scans, also EEG brain signals, and they are using Scikit-learn as the prediction tool, deriving features from the raw data, and that of course reaches clinical applications in some contexts.

11:46 Maybe automatic systems that say, hey, this looks like it could be cancer, or it could be some kind of problem, bringing it to the attention of an expert who could actually look at it and say yes or no. Something like this?

11:57 Yeah exactly. It's like helping diagnosis, trying to help isolate something that looks weird or suspicious in the data, to get the time of the physician and their attention on this particular part of the data to see what is going on and if the patient is suffering from something.

12:20 Right, that's really cool, I mean maybe you can take previous biopsies and invasive things that have happened to other people, and there are pictures and the outcomes and say, look, you have basically the same features and we did this test and the machine believes that you actually don't have the problem, so probably no worry about it. Or something like that, right?

12:39 Yeah, I mean along these lines there was recently a competition using retina pictures. People suffering from diabetes usually have problems with their retinas, and so you can take pictures of retinas from hundreds of people and see if you can build a system that predicts something about the patient and the state of the disease from these images, and this is typically done by pooling data from multiple people.

13:08 That's really cool. I've heard of these Kaggle competitions or challenges before in various places, what is that?

13:16 So it's basically a website that allows you to organize these types of supervised learning problems, where a company or some structure, an NGO, whatever, has data and is trying to build a predictive system, and they ask Kaggle to set this up. For Kaggle that basically means putting the training data online and giving it to data scientists, who spend time building a predictive system that is evaluated on new data, on which they get a score. That allows you to see how the system works on new data, and basically to rank the data scientists who are applying these systems. It's kind of an open innovation approach to data science.

14:08 That's really cool. So that's just

14:11 Yes. Exactly.

14:14 Yeah, very nice. Some of the other ones that I sort of ran across while I was looking around that were pretty cool was, one is some guys at Cornell university built machine learning algorithms to listen for the sound of whales in the ocean and use them in real time to help ships to avoid running into whales. That's pretty awesome, right?

14:36 Yeah, there was a Kaggle competition on these whale sounds maybe a couple of years ago, and basically, I mean, not many data scientists have experience listening to whales. So nobody really knows this type of data, and I remember a presentation from the winner basically saying how to win a Kaggle competition without knowing anything about the data. It was kind of a provocative talk, but it showed how you can basically build a predictive system by just looking at the data and trying to make sense of it without really being an expert in the field.

15:13 Yeah, that's probably a really valuable skill as a data scientist to have, right, you can be an expert, but not in everything. Some other ones that were interesting was, IBM was working on something to look at the handwritten notes of physicians. And then it would predict how likely the person that those notes were about would have a heart attack.

15:39 Yeah, in the clinical world it's true that a lot of the information is actually raw text, like manual, just written notes, but also raw text in the system. For machine learning that's a particularly difficult problem because it's what we call unstructured data, so typically for Scikit-learn to work on this type of data you need to do something extra, to basically come up with the structure, come up with the features that allow you to predict something.

16:10 Sure, and so both of those two examples that I've brought up have really interesting data origin problems. So if I gave you an mp3 of a whale, or audio stream of a whale, how do you turn that into numbers that go into the machine even to train it, and it is similar with handwriting, how do you- you've got to do handwriting recognition, you've got to then do sort of understanding what the handwriting means and there is a lot of levels. How do you take this data, like and actually get it into something like Scikit-learn?

16:47 So, Scikit-learn expects that every observation, which we also call a sample or a data point, is basically described by a vector, a vector of values. So if you take the sound of the whale, you can say, OK, the sound in the mp3 is just a set of floating point values, one for every time sample, really the time-domain signal that you get for a few seconds of data. That's probably not the best way to get a good predictive system; you want to do some feature transformation, change the input to get features that are more powerful for Scikit-learn and the learning system. You would typically do this with time-frequency transforms, things like spectrograms, trying to extract features that are really, for example, [inaudible 17:36] to some aspects of the data like frequencies or time shifts. So there is probably a bit of pre-processing to do on these raw signals, and then once you have your vector, you can use the Scikit-learn machinery to build your predictive system.
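Editor's note: a rough sketch of turning an audio signal into a feature vector with a spectrogram, as described above. The synthetic chirp stands in for a whale call, and the sampling rate, window length, and the choice of averaging over time are assumptions for illustration, not a recommended recipe.

```python
import numpy as np
from scipy import signal

fs = 2000  # assumed sampling rate in Hz
t = np.arange(0, 2.0, 1 / fs)

# Synthetic rising tone plus noise, standing in for a real recording.
rng = np.random.default_rng(0)
audio = signal.chirp(t, f0=100, t1=2.0, f1=400, method="linear")
audio = audio + 0.1 * rng.normal(size=t.size)

# Time-frequency transform: rows are frequencies, columns are time windows.
freqs, times, Sxx = signal.spectrogram(audio, fs=fs, nperseg=256)

# One simple (assumed) way to get a fixed-length vector per recording:
# average the log power of each frequency bin over time.
features = np.log(Sxx + 1e-12).mean(axis=1)
print(Sxx.shape, features.shape)
```

Each recording then becomes one row of the data matrix handed to a Scikit-learn estimator.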

17:49 How much of that pre-processing is in the toolset itself?

17:53 So it depends on the type of data. Typically for signals there is nothing really specific in Scikit-learn; you would probably use SciPy signal, or any type of signal processing Python code that you find online. For other types of data like text, there is in Scikit-learn this thing called the feature extraction module, and in the feature extraction module you have something for text, which is probably the biggest part of the feature extraction. You have some stuff also for images, but it's quite limited.
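Editor's note: a small sketch of the text side of the feature extraction module, applied to the spam/ham example from earlier in the conversation. The tiny message list is invented, and pairing the vectorizer with logistic regression is just one plausible choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy messages; a real filter would need far more data.
texts = [
    "win free money now",
    "free prize claim now",
    "cheap pills free offer",
    "claim your free money",
    "meeting at noon tomorrow",
    "lunch with the team today",
    "project report attached",
    "see you at the meeting",
]
labels = ["spam"] * 4 + ["ham"] * 4

# The vectorizer turns each raw string into a numeric vector (one column
# per word); the classifier then learns from those vectors.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

n_messages, n_words = clf[0].transform(texts).shape
prediction = clf.predict(["free money offer"])
print(n_messages, n_words, prediction[0])
```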

18:29 So, we should probably introduce what Scikit-learn is and get into the details of that, but I have one more example to let people know about that I think is pretty cool. On show 16, I talked to Roy Rapoport from Netflix, and Netflix has a tremendously large cloud computing infrastructure to power their movie system and everything behind the scenes there. They have so many virtual machine instances and services running on them, and then different types of devices accessing services on those machines, that they said it's almost impossible to determine manually if there is some edge case where there is a problem, and so they actually set up machine learning to monitor their infrastructure and then tell them in real time if there is some kind of problem.

19:19 Yeah.

19:22 So I think that that is really a cool use of it as well.

19:22 [music]

19:22 This episode is brought to you by Hired. Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.

19:22 Each offer you receive has salary and equity presented right up front and you can view the offers to accept or reject them before you even talk to the company. Typically, candidates receive 5 or more offers in just the first week and there are no obligations, ever.

19:22 Sounds pretty awesome, doesn't it? Well did I mention the signing bonus? Everyone who accepts a job from Hired gets a $2,000 signing bonus. And, as Talk Python listeners, it gets way sweeter! Use the link and Hired will double the signing bonus to $4,000!

19:22 Opportunity is knocking, visit and answer the call.

19:22 [music]

20:33 Yeah, that's a very cool thing to do, and actually many industries and many companies are looking for these types of systems that do things like anomaly detection or failure prediction. That's becoming a big use case for machine learning indeed.

20:53 The Netflix guys were actually using Scikit-learn, not some other machine learning system, so let's get to the details of that. What's Scikit-learn, where did it come from?

21:01 So, Scikit-learn is probably the biggest machine learning library that you can find in the Python world. It dates back almost ten years, to when David Cournapeau was doing a Google Summer of Code to kickstart the Scikit-learn project. Then for a few years a friend, Matthieu Brucher, took on the project, but it was kind of a one-guy project for many years, and in 2010, with colleagues at INRIA and friends, we decided to basically start from this state of Scikit-learn and make it bigger and really try to build a community around it. These people are Gael Varoquaux and Fabian Pedregosa, and also somebody you may have heard of in the machine learning world, Olivier Grisel. So that was pretty much 2010, five years ago, and it basically took off pretty quickly; after I would say a year of Scikit-learn we had more than ten core developers, way beyond the initial lab where it started.

22:18 That's really excellent, yeah, I mean it's definitely, absolutely a mainstream project that people are using in production these days, so congratulations to everyone on that, that's great.

22:28 Thank you.

22:28 Yeah. And so the name Scikit-learn comes from the fact that it's basically an extension to the SciPy pieces, right? So SciPy is like NumPy for numerical processing, SciPy for scientific stuff, Matplotlib, IPython, SymPy for symbolic math, and Pandas. And then, there are these extensions.

22:52 Yes. So, basically the vision is that you cannot put everything in SciPy. SciPy is already a big project, and the idea of the scikits was to build extensions around SciPy that are more domain specific. Also, it's kind of easier to contribute to a smaller project, so basically the barrier of entry for newcomers is much lower when you contribute to a scikit than to SciPy, which is a fairly big project now.

23:17 Yeah, and there is so much support for the whole SciPy system, right, so it's much better to just build on that than to try to duplicate, say, NumPy or whatever.

23:27 Exactly. I mean, there is a lot of effort going into seeing what NumPy 2.0 could be, what its future could be, and how to extend it. A lot of people are thinking about what's next, because NumPy is almost ten years old, probably more than ten years old now, and yeah, people are trying to see also how it can evolve.

23:49 Sure. That makes a lot of sense. So speaking of evolving and going forward, what are the plans with Scikit-learn, where is it going?

23:58 So, I would say in terms of features, Scikit-learn is really in a consolidation stage. Scikit-learn is five years old, the API is pretty much settled, and there are a few things here and there that we have to deal with now, that are basically due to early decisions in terms of API and that need to be fixed. I guess the big objective is to get Scikit-learn to 1.0, the first fully stable release in terms of API, because that is something that we've been talking about among the core developers for more than two years now, coming up with this 1.0 version that stabilizes every part of the API.

24:45 Right. One final major cleanup if you can, and then stabilize, yeah?

24:49 Exactly. And in terms of new features, I mean you always have a lot of cool stuff around, and you see the number of pull requests that are coming in on top of Scikit-learn; it's pretty crazy, and I would say it's a huge maintenance and reviewing effort. The features are coming in slowly now in Scikit-learn, much more slowly than they used to, but I guess it's normal for a project that is getting big.

25:15 Yeah, it's definitely getting big, it has 7600 stars and 4500 forks on GitHub, so that's pretty awesome. 457 contributors, cool.

25:25 Yeah. I would say for every release we get to, we try to release every six months and for every release we get a big number of contributors.

25:37 So, maybe we could do like a survey of the modules of Scikit-learn, just the important ones that come to mind. What are the moving parts in there?

25:46 So maybe I'll start with something I know, the part of the modules that I [inaudible 25:50] the most, which is the linear model module. Recently, the efforts on the linear models were to scale them up: basically to try to learn these linear models in an out-of-core fashion, to be able to scale to data that does not fit in RAM, and that's part of the plan for this linear model module in Scikit-learn.
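Editor's note: one way to picture out-of-core learning of a linear model is incremental fitting, batch by batch, so the full data set never has to sit in RAM at once. This sketch assumes `SGDClassifier.partial_fit` as the mechanism; the `iter_batches` generator is a hypothetical stand-in for reading chunks from disk or a log stream.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
n_features = 20
true_weights = rng.normal(size=n_features)  # hidden rule, for the demo only

def iter_batches(n_batches=50, batch_size=200):
    """Hypothetical chunk reader: yields one small batch at a time."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X @ true_weights > 0).astype(int)
        yield X, y

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all labels must be declared for partial_fit

# Only one batch is in memory at any time.
for X_batch, y_batch in iter_batches():
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test, y_test = next(iter_batches(n_batches=1, batch_size=1000))
accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```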

26:16 That's cool, so what kind of problems do you solve with that?

26:18 The types of problems where you have a humongous number of samples and potentially a big number of features. There are not so many applications where you get that many samples, but that's typically text or log files, these types of industry problems where you collect a lot of samples on a regular basis. There are also examples if you monitor an industrial system, like if you want to do what we discussed before about predictive maintenance; that's probably a use case where this can be useful. The other module that also attracts a lot of effort these days is the ensemble module, and especially the tree module, so models like Random Forest or gradient boosting, which are very popular models that have been helping people win Kaggle competitions for the last few years.

27:14 Yeah, I've heard a lot about these forests and so on, can you talk a little bit about what that is?

27:19 So a Random Forest basically is a set of decision trees that you pool together to get a prediction that is more accurate. More accurate because it has less variance, in technical terms. The way it works is you try to build decision trees from subsets of the data, subsets of samples, subsets of features, in a clever way, then you pool all these trees into one big predictive model. For example, if you do binary classification and you train a thousand trees, for a new observation you ask the 1000 trees what the label is, is it positive or negative? Then you basically count the number of trees that are saying positive, and if you have more trees saying positive, then you predict positive. That's kind of the basic idea of random forest and it turns out to be super powerful.
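Editor's note: a sketch of the "ask every tree and count the votes" idea on synthetic data. The explicit vote counting below is written out by hand to mirror the description; Scikit-learn's own `RandomForestClassifier.predict` combines trees by averaging their predicted class probabilities, so the two can occasionally disagree on individual samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (labels are 0 and 1).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Ask each individual tree for its label on every test observation.
tree_votes = np.array([tree.predict(X_test) for tree in forest.estimators_])

# Count the trees saying "positive" and predict positive on a majority.
positive_votes = (tree_votes == 1).sum(axis=0)
majority_vote = (positive_votes > forest.n_estimators / 2).astype(int)

vote_accuracy = (majority_vote == y_test).mean()
forest_accuracy = forest.score(X_test, y_test)
print(f"hand-counted votes: {vote_accuracy:.2f}, forest: {forest_accuracy:.2f}")
```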

28:14 That's really cool. It seems to me like it would bring in kind of different perspectives, or take different components or parts of the problem into account, so, you know, some of the trees look at some features and maybe the other trees look at other features and then they can combine in some important way.

28:32 Exactly.

28:33 Yeah. Another one that I see coming up is the SVM module. What does that one do?

28:38 So, SVM is a very popular machine learning approach that was very big in the 1990s and ten years ago, and it still gets some traction. The idea of the Support Vector Machine, which is what SVM is the acronym for, is to be able to use kernels on the data and basically solve linear problems in an abstract space where you project your [inaudible 29:09]. Let me try to give an example. If you take a graph, or a text, or a string, that's not naturally something that can be represented by a vector. When you do an SVM you have a tool, which is a kernel, that allows you to compare these observations, like a kernel between strings or a kernel between graphs, and once you define this kernel, which needs to satisfy some properties that I'm going to skip, then you can use these SVMs to do classification but also regression. That is what you have in the SVM module of Scikit-learn, which is basically a very clever and efficient binding of the underlying library, which is called LIBSVM.
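Editor's note: a toy sketch of an SVM working directly on strings through a kernel. The bigram-counting kernel and the example strings are assumptions made up for illustration; the kernel is an inner product of count vectors, which is one simple way to satisfy the properties a kernel needs. The `kernel="precomputed"` option lets `SVC` consume the resulting similarity matrix.

```python
from collections import Counter

import numpy as np
from sklearn.svm import SVC

def bigram_counts(s):
    """Count overlapping two-character substrings of a string."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def string_kernel(a, b):
    """Similarity of two strings: inner product of their bigram counts."""
    counts_a, counts_b = bigram_counts(a), bigram_counts(b)
    return float(sum(counts_a[g] * counts_b[g] for g in counts_a))

def gram_matrix(rows, cols):
    """Kernel value for every (row string, column string) pair."""
    return np.array([[string_kernel(r, c) for c in cols] for r in rows])

# Hypothetical training strings: class 0 uses a/b, class 1 uses x/y.
train = ["abab", "baba", "aabb", "abba", "xyxy", "yxyx", "xxyy", "xyyx"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

clf = SVC(kernel="precomputed")
clf.fit(gram_matrix(train, train), labels)

# At prediction time the SVM needs kernel values against the training set.
test = ["ababab", "yxxy"]
predictions = clf.predict(gram_matrix(test, train))
print(predictions)
```

The strings are never converted to vectors by the user; the SVM only ever sees pairwise similarities.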

29:47 Ok. excellent. And is that used more in the unsupervised world?

29:52 It's completely supervised. When you do SVM it is classification or regression, and that's supervised. There is one use case of SVM in an unsupervised setting, which is what we call the one-class SVM. You just have one class, which basically means that you don't have labels, you just have data, and you are trying to see which data points are the least like the others. That's more like an anomaly detection problem, or we also call it novelty detection, or outlier detection.
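Editor's note: a minimal sketch of the one-class SVM used for anomaly detection. The Gaussian "normal behaviour" data, the `nu` value, and the two query points are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Unlabeled data only: points drawn around the origin stand in for
# "normal" observations.
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# nu roughly bounds the fraction of training points treated as outliers.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
detector.fit(X_normal)

# One point in the middle of the data, one far away from everything.
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])
predictions = detector.predict(X_new)  # +1 means inlier, -1 means outlier
print(predictions)
```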

30:19 Maybe we could talk a little bit about some of the algorithms. As a non-expert in the data science and machine learning field, I go in there and I see all these cool algorithms and graphs, but I don't really know what I would do with them. On the site it lists all these algorithms it supports. So for example, it supports dimensionality reduction; what kind of problem would I bring that in for?

30:41 I guess it's hard to summarize the hundreds and hundreds of pages that you have in the Scikit-learn documentation. We try to give you a big picture without too much technical detail, to tell you when these algorithms are useful and what they are useful for, what the hypotheses are, and what kind of output you can hope to get. That's one of the strengths of the Scikit-learn documentation, by the way. And so, to answer your question, the 101 way of doing dimensionality reduction is principal component analysis, where you are trying to extract a subspace that captures the most variance in the data, and that can be used to do visualization of the data in low dimension.

31:26 If you do a PCA in 2 or 3 dimensions then you can look at your observations as a scatter plot in 2D or 3D. That's basically visualization, but you can also use this to reduce the size of your data set, maybe without losing too much predictive power. So you take a big data set, you run a PCA, you reduce the dimension, and then suddenly you have a learning problem on smaller data, because you basically reduced the number of features. Those are kind of the standard uses: visualization, or reducing the size of the data set to have more efficient learning in terms of computing time, but also sometimes in prediction power.
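Editor's note: a short sketch of PCA as dimensionality reduction. The bundled digits data set (64 pixel features per image) is used here only because it ships with Scikit-learn; the choice of two components matches the 2D scatter-plot use case mentioned above.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 1797 small images, each described by 64 pixel intensities.
X, y = load_digits(return_X_y=True)

# Unsupervised: PCA never looks at the labels y.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the total variance kept by the two retained directions.
variance_kept = pca.explained_variance_ratio_.sum()
print(X.shape, "->", X_2d.shape, f"variance kept: {variance_kept:.2f}")
```

The two columns of `X_2d` can be plotted directly as a scatter plot, or a larger `n_components` can be used to shrink the data before training a model.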

32:09 Ok, that makes sense, that's really cool. So, if I went back to my house example, maybe I was feeding in the length of the driveway and the number of trees in the yard, and it might turn out that neither of those has any effect on house prices, so we could reduce it to a smaller problem by having PCA say, look, those don't matter, throw that part out. It's really about the number of bathrooms and the square footage or something.

32:33 Well, yes and no. That is kind of the idea, but in this example of predicting house prices, you want to reduce the dimension in an informed way, because the number of trees in the yard can be informative for something, but maybe not to predict the price of the apartment or the price of the house. So when you do dimensionality reduction in the context of supervised learning, that can also be feature selection, basically selecting the predictive features, which ultimately leads to a reduced data set because you remove features. But that would be in the supervised context. When you do PCA you are really working in an unsupervised way: you don't know what the labels are, you just want to figure out where the variance in the data is coming from, along which axis and in which direction you should look to see the structure.
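Editor's note: a sketch of supervised feature selection on the house example, in contrast to PCA. The data is synthetic, the pricing rule is invented (trees and driveway length deliberately have no effect on price), and `SelectKBest` with `f_regression` is one assumed choice among several selection methods.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n_houses = 1000
feature_names = ["square_feet", "bathrooms", "trees_in_yard", "driveway_length"]

square_feet = rng.uniform(500, 3500, size=n_houses)
bathrooms = rng.integers(1, 4, size=n_houses)
trees_in_yard = rng.integers(0, 10, size=n_houses)
driveway_length = rng.uniform(5, 50, size=n_houses)

# Invented rule: only the first two variables influence the price.
price = (150 * square_feet + 80_000 * bathrooms
         + rng.normal(0, 20_000, size=n_houses))

X = np.column_stack([square_feet, bathrooms, trees_in_yard, driveway_length])

# Supervised selection: scores each feature against the label (price).
selector = SelectKBest(score_func=f_regression, k=2).fit(X, price)
selected = selector.get_support(indices=True)
print([feature_names[i] for i in selected])

X_reduced = selector.transform(X)
```

Unlike PCA, the selector uses the prices to decide what to keep, so a variable with a lot of variance but no link to price gets dropped.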

33:21 Another thing that is in there are ensemble methods for predicting multiple supervised models. What's the story there, that sounds cool?

33:31 So, Random Forest is an example of an ensemble method. Having an ensemble basically means that you are taking a lot of classifiers or a lot of regressors, and you combine them in a bag of models or an ensemble of models. Then you make them collaborate in order to build a better prediction. Random Forest is basically an ensemble of trees. But you can also do an ensemble of neural networks, you can do an ensemble of whatever models you want to pool, and that turns out to be in practice often a very efficient approach.
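Editor's note: a sketch of an ensemble built from different kinds of models rather than only trees. `VotingClassifier` with hard (majority) voting is assumed here as one way to make models collaborate; the three member models and the synthetic data are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three different model families, each with its own perspective on the data.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",  # each model gets one vote; the majority label wins
)
ensemble.fit(X_train, y_train)

ensemble_accuracy = ensemble.score(X_test, y_test)
print(f"Ensemble accuracy: {ensemble_accuracy:.2f}")
```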

34:12 Yeah, like we are saying, the more perspectives. Different models, that seems like it's a really good idea. So you mentioned neural networks. So Scikit-learn has support for neural networks as well?

34:24 Well, you have a multi-layer perceptron, which is like the basic neural network. I mean, these days with neural networks people talk about deep learning.

34:33 I've heard about it, what's deep learning?

34:33 [music]

34:33 This episode is brought to you by Codeship. Codeship has launched organizations, create teams, set permissions for specific team members and improved collaboration in your continuous delivery workflow. Maintains centralized control over your organization's projects and teams with Codeship's new organization's plan.

34:33 And as Talk Python listeners, you can save 20% off any premium plan for the next 3 months. Just use the code TALKPYTHON.

34:33 Check them out at and tell them "thanks" for supporting the show on Twitter where they are at @codeship.

34:33 [music]

35:27 So deep learning is basically neural networks 2.0, where you take neural networks and you stack more layers. The story there is that for many years people were kind of stuck with networks of 2 or 3 layers, so not very deep. Part of the issue is that it was really hard to train something with more layers. In terms of research, there were two things that came up. First, we got access to more data, which means that we can train bigger and more complex models. But there were also some breakthroughs in learning these models that allowed people to avoid overfitting. We are able to learn these big models because we have more clever ways to prevent overfitting, and that basically led to deep learning these days.

36:19 Very interesting. Yeah, that's been one of the problems with neural networks, right, if you teach it too much then it only knows just the things you have taught it or something, right?

36:28 Exactly. It basically learns by heart what you provide as training observations, and is thus very bad when you provide new observations.
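
Learning "by heart" can be shown with any flexible model. The sketch below is an assumption-laden illustration: it uses an unpruned decision tree rather than a neural network, on synthetic data with deliberately noisy labels, and compares accuracy on the training observations with accuracy on new ones.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class problem; flip_y adds label noise that cannot be
# learned, only memorized.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# With no depth limit, the tree keeps splitting until it fits the
# training set perfectly, noise included.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print("accuracy on training observations:", train_accuracy)
print("accuracy on new observations:", test_accuracy)
```

The gap between the two numbers is the overfitting being described: the score on data the model has already seen says little about how it behaves on data it has not seen.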

36:39 I want to talk a little bit about the data sets that come built in there. We have talked a little bit about the Boston one, and that's the Boston house prices for regression. One I hear coming up a lot is the one called Iris. Is that like your eye itself?

36:54 So Iris is the data set that we use to illustrate all the classification problems. It's a very common data set that turned out to have a good license, so we could ship it with scikit-learn, and basically we build most of the examples using this Iris data set, which is also very much used in machine learning textbooks. So, that was kind of the default choice, and it speaks to people, because they understand what problem you are trying to solve. It's rich enough and not too big, so we can- make all these examples super fast and build a nice-

37:31 That's cool, what is the data set? Like, what exactly is it about?

37:34 So with the Iris data set you are trying to predict the type of plant, for example using the sepal length and the sepal width. So you have a number of features that describe the plant and you are trying to predict which one it is among 3, so it's a 3 label, 3 class classification problem.
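
For readers who want to look at the data set being described, here is a minimal sketch, assuming scikit-learn and NumPy are installed. It only loads the bundled Iris data and inspects its features and classes.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# One row per plant, one column per measured feature.
print("shape of the feature matrix:", X.shape)
print("features:", iris.feature_names)

# The target is one of three species, encoded as 0, 1, 2.
print("classes:", list(iris.target_names))
print("plants per class:", np.bincount(y))
```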

37:56 Yeah, that's cool. Enough data to not just be a linear model or something, a single variable model but not too much?

38:05 Exactly, it's not completely linear, a bit, but not too hard at the same time.

38:10 Right, if you get 20 variables that's probably too much to deal with. Then one is on diabetes. What about diabetes, what does that data set represent?

38:20 I am actually not really sure what's the- well, it's a regression problem. It's used a lot with linear models, especially sparse regression models, because part of what these sparse regression models do is try to extract the predictive features. I guess in the diabetes data set you try to predict something related to diabetes and you are interested in finding the most predictive features, and that's part of the reason I think we are using it.
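
A sparse regression model on the diabetes data can be sketched as follows. The Lasso is used here as one example of a sparse linear model, and `alpha=1.0` is an arbitrary regularization strength; neither choice comes from the episode. The point is that the fitted model sets some coefficients exactly to zero, which leaves a short list of predictive features.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
print("shape of the feature matrix:", X.shape)

# The L1 penalty in the Lasso pushes coefficients of weak features to
# exactly zero, so the fitted model doubles as a feature selector.
lasso = Lasso(alpha=1.0).fit(X, y)

selected = [
    name
    for name, coef in zip(diabetes.feature_names, lasso.coef_)
    if coef != 0
]
print("features kept by the model:", selected)
print("features dropped:", len(diabetes.feature_names) - len(selected))
```

A larger `alpha` keeps fewer features and a smaller one keeps more, so in practice the value is usually chosen by cross-validation.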

38:47 And then another one is digits, which kind of is to model images, right?

38:52 One of the early, I would say, breakthroughs of machine learning was this work in the 1990s where [inaudible 39:01] were trying to build a system that could predict what digit was present on the screen or in the image. So, it's a very old machine learning problem where you start from a picture or image of a handwritten digit and you try to predict what it is, from 0 to 9. And it's an example that people can easily grasp in order to understand what machine learning is. You give me an image and I'll predict something between 0 and 9. And historically, when we did the first version of the scikit-learn website, we had something like seven or eight lines of Python code that were running classification of digits. So, that was kind of the motivating example where we said ok, scikit-learn is machine learning made easy, and here is an example, ten lines of code classifying digits, and that was basically the punchline.
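
The original website snippet is not reproduced in the transcript, so the following is a reconstruction in the same spirit, not the historical code. It assumes a recent scikit-learn; the choice of a support vector classifier and the `gamma` value are assumptions made for the example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 8x8 images of handwritten digits, flattened to 64 features each.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0
)

# Train on one half of the images, then predict the other half.
clf = SVC(gamma=0.001)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print("predicted digit for the first test image:", clf.predict(X_test[:1])[0])
print("held-out accuracy:", accuracy)
```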

39:54 Solving this old, hard problem in a nice simple way, right?

39:57 Yeah.

39:58 You know, lately there's been a lot of talk about artificial intelligence, especially from people like Elon Musk and Stephen Hawking, saying that maybe we should be concerned about artificial intelligence and things like that. So one of my first questions around this area is: is machine learning the same thing as artificial intelligence?

40:22 Depends who you ask.

40:25 Sure.

40:27 I mean, yeah, AI was basically the early name for trying to teach a computer to do something. It dates back to the 1960s and 1970s when, in the US for example, MIT had labs that were basically called AI labs. And machine learning is, I would say, a more restricted set of problems compared to AI. Say you want to work with text or linguistics and you want to build a system that understands linguistics; that would be an AI problem. Machine learning is kind of saying: ok, I've got a loss function, I want to optimize my criterion, I've got something that I want to train my system on. In that sense you teach a system to learn, and so you create some kind of intelligence, but it is, I would say, a simpler thing to state than intelligence, which is kind of a hard concept. That's maybe my personal answer to this.

41:32 Yeah, and it's a great answer. Just from my limited exposure to it, it seems like machine learning is more about classification and prediction, whereas the AI concept is- there is a strong autonomous component that is just completely lacking for machine learning.

41:49 Yeah, I guess I would explain it simply like this exactly.

41:54 What things have you seen people using scikit-learn for that surprised you? Or you were like wow, you guys are doing that, that's amazing.

42:03 So, in scikit-learn we have this testimonials page where we typically ask companies or institutes that are using scikit-learn to write a couple of sentences about what they are using scikit-learn for and why they think it's great. And I remember there was this dating website saying that they are using scikit-learn to optimize dates between people, so that was a funny one.

42:41 And it is funny. So there may be people out there who are married, and maybe even babies were born, because of scikit-learn.

42:49 Yeah, that would be great, I'm going to add this to my resume.

42:53 Matchmaker. So, if people want to get started with scikit-learn, they are listening and they are like, wow this is awesome, where do I start? What would you recommend for getting into this whole world of machine learning and getting started with scikit-learn in particular?

43:07 The first place to start is the scikit-learn website, which is pretty extensive. But there are also a lot of tutorials that have been given by core devs of scikit-learn at different conferences like SciPy and EuroSciPy; you can find all these videos online. Just take some of them, sit down and listen, and try to do it yourself afterwards. I mean, for example at SciPy you get tutorials on scikit-learn that are pretty much a whole day, and they are hands-on, so you can get the materials online from the tutorial and get started.

43:48 Oh, that's excellent. Yeah, I think it's really amazing these days that there are so many of these videos online. If there is some topic you imagine, like hey, I want to know this thing in Python, there is a very good chance that someone gave some kind of conference talk on it and it's online.

44:05 Yeah.

44:05 Anything you want to give sort of shout out to or final call to action before we sort of wrap things up a bit?

44:11 So, if you have free time, you like machine learning, come give us a hand to maintain this cool library.

44:19 Yeah. Absolutely. Yeah, like I said, there are 457 contributors, but you know, you guys are looking to stabilize things and move forward so I'm sure there is a lot to be done around that.

44:29 I mean, there are two types of contributors. You have the one-time contributors who are real experts in something and contribute something really specific and valuable that gets merged into the main code base. And you have, I would say, fewer people who invest their time to read code from others and keep the library consistent in terms of API. That's really the big reviewing work that the historical core devs of scikit-learn are mostly doing these days; they invest little time in really new stuff, which is basically left to the newcomers. If I had to wish something for the future, it's that these one-time contributors also spend a bit of their time helping us maintain the entire library in the long run.

45:22 Yeah, that makes sense. I can see that in something like scikit-learn, where it's kind of a family of all these different algorithms and little techniques, if you want to add your technique you just go in there and do that little bit and you kind of stay out of the rest of the code, and I can see how that would definitely lead to inconsistencies and so on.

45:41 Yeah. And the policy, I mean, in terms of- in scikit-learn maybe there are fewer things coming in these days, and that is because we are not trying to build a library that contains all the algorithms that you can ever think of or that get published every year; we are trying to keep the algorithms that are best on some clear use case in the current state. So, we cannot implement everything, but at least if you have a particular type of problem, you should have something in scikit-learn that does a good job.

46:14 So, before I let you go, I have two more final questions for you. So if you are going to write some Python code, what editor do you open up?

46:21 So I've been a big user of TextMate over the years, and after that I switched to Sublime recently because I got convinced by my neighbors. So no Vim or Emacs.

46:39 Yeah, that's cool. I like Sublime Text a lot, very nice. And, of all of the cool machine learning and general Python packages out on PyPI, what are some that you think people maybe don't know about, where you are like hey, this is awesome, you should know about it?

46:53 Maybe I'm biased, because I do a lot of machine learning for brain science, and I've been working for the past 5 years on this project called MNE, which allows you to process brain waves and classify brain states, for example to build brain-computer interfaces or analyze clinical electrophysiology data. If you want to play with brain waves you can check it out.

47:20 That's really cool. And when you say brain machine interface, is that like EEGs and stuff like that?

47:26 Exactly. EEG, MEG.

47:27 Ok. Very awesome, I haven't heard of that one, that's cool.

47:33 That's more my second baby.

47:36 That's great. So, Alexandre, it's been really great to have you on the show, and this has been a super interesting conversation, thanks.

47:44 Thank you very much.

47:44 You bet. Talk to you later.

47:44 This has been another episode of Talk Python To Me.

47:44 Today's guest was Alexandre Gramfort and this episode has been sponsored by Hired and CodeShip. Thank you guys for supporting the show!

47:44 Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes and direct RSS feeds in the footer on the website.

47:44 Our theme music is Developers Developers Developers by Cory Smith, who goes by Smixx. You can hear the entire song on our website.

47:44 This is your host, Michael Kennedy. Thanks for listening!

47:44 Smixx, take us out of here.
