#131: Top 10 machine learning libraries Transcript
00:00 Michael Kennedy: Data science has been one of the major driving forces behind the explosion of Python in recent years. It's now used for AI research, it controls some of the most powerful telescopes in the world, it tracks and predicts crop growth, and so much more. But with all this growth, there's an explosion of data science and machine learning libraries. That's why I invited Pete Garcin onto the show. He's going to share his top ten machine learning libraries for Python. After this episode, you should be able to pick the right one for the job. This is Talk Python to Me, recorded July 20, 2017. Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host Michael Kennedy. Follow me on Twitter where I'm @MKennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is brought to you by DataCamp and us right here at Talk Python Training. Be sure to check out what we're both offering during our segments. It really helps support the show. Pete, welcome to Talk Python.
01:13 Pete Garcin: Thanks, I'm happy to be here.
01:15 Michael Kennedy: It's great to have you here, and I've done a few shows on machine learning and data science, but I'm really happy to do this one because I think it's really accessible to everyone. We're going to bring all these different libraries together and kind of just make people aware of all the cool things that are out there for data science and machine learning.
01:31 Pete Garcin: Yeah, it's really crazy, actually, how many libraries are out there, and how active the development is on all of them. There's new contributions, new developments all the time, and it seems like there's new projects popping up almost daily.
01:45 Michael Kennedy: Yeah, it's definitely tough to keep up with, but hopefully this adds a little bit of help for the reference there. But before we get into all these libraries, let's start with your story. How'd you get into programming and Python?
01:55 Pete Garcin: I started programming at a pretty young age, back before Stack Overflow and things like that existed, and I mostly made games. I started with BASIC, like most people probably from a certain age, then moved on to Pascal, and was making games for my BBS back in the day, making online games, utilities, and stuff like that. And then for Python: I worked in games for a long time, and when I worked in games, we were doing tool automation, build automation, workflow automation, build pipelines, all that kind of stuff. So Python was a tool that we used quite a lot there, and that was where I got my start with Python.
02:37 Michael Kennedy: Yeah, that's really cool. Python is huge in the workflow for games and movies, way more than people on the outside realize, I think.
02:44 Pete Garcin: Yeah, especially for artists. So, like, a lot of the tools have Python built into them, and so artists will use it for, like, automating model exports or rigging and that kind of stuff, so it's pretty popular in that sense, and then also still even just for like, building assets for games.
03:03 Michael Kennedy: Okay, I'm intrigued by your BBS stuff. It occurs to me, and it's kind of crazy, there may be younger people listening that don't actually know what a BBS is.
03:12 Pete Garcin: Okay, so a BBS is short for Bulletin Board System, and it was sort of, like in a way, the precursor to the internet, where you used to host what is effectively sort of a website on your home computer, and people would, like, call your phone number, and you'd have it hooked up to your modem, and they would like, call your phone number, connect to your home computer, so in my case, it was like, my computer that I played games on and did my homework on, and that kind of stuff, and they could connect and send messages to each other and download files and play games, very simple games, that kind of thing. So it was like a--
03:53 Michael Kennedy: Yeah, it was so fun.
03:54 Pete Garcin: Yeah, it was awesome, and I really really enjoyed it. And you had a thing called EchoMail back then, which was like this sort of way of transferring messages all over the world. So you know, somebody would send a message on your BBS, and then it would, like, call a whole bunch of others, like in this network. And then somebody in, like, Australia might answer it, and it would take days to get back because it would be like this chain of people's BBS's calling the next one.
04:18 Michael Kennedy: Yeah, there was no internet. It was the craziest thing. At our house, my brother and I had talked my dad into getting us multiple phone lines so we could work with BBS's in parallel, and you would send these mails, and at night there would be like a coordination of the emails across the globe as these things would sync up the emails that got queued up. It was the weirdest thing, but I loved, I don't know, was it Trade Wars, or Planet Wars, one of those games, I really loved it.
04:46 Pete Garcin: For sure, I'm a huge Trade Wars fan. You can actually play it now. There are people who have it set up on websites with simulated Telnet stuff, and you can play versions of Trade Wars, which I have done recently, just to like--
05:00 Michael Kennedy: Don't tell me that. You're going to ruin my productivity for the whole day.
05:04 Pete Garcin: Yeah, after this, you'll be like, "Oh Trade Wars 2002, it's still around, people still play it." But it's such a good game, it's fantastic.
05:13 Michael Kennedy: Yeah, it was fantastic, it's awesome. All right, so that's how you got into this whole thing. What do you do today? You work at ActiveState, right?
05:21 Pete Garcin: I do, yeah. I'm a Devangelist at ActiveState, so generally that means working with developers, language communities, trying to make our distributions better. So at ActiveState we do language distributions. Probably a lot of people in the Python space know us because we do ActivePython, and it's been around for a long time. We were a founding member of the Python Software Foundation, so ActiveState has a pretty long history in the Python community, and before that, people probably know us from Perl. Now we have a Go distribution and a Ruby beta coming out soon, so we're sort of expanding to all these different dynamic language ecosystems.
05:56 Michael Kennedy: Sure, that's awesome. So I know that maybe people are a little familiar with some of the advantages of these curated distributions for Python, but maybe give us a sense of what the value of these distributions is, over going and grabbing Go or grabbing Python and just running with that?
06:13 Pete Garcin: I think that you've got, obviously, this sense of curated packages. In the Python distribution, there are over 300 packages, so you know that they're going to build, you know they're going to play nice with each other, you know that they have current, stable versions, all that kind of stuff. And then additionally, you can buy commercial support. We have a lot of large enterprise customers, and they can't actually adopt a language distribution or a tool like that without commercial support. They need to know that somebody has their back, and that's something we offer on these language distributions for those large customers. But for the community and for individual developers, the value is having that curated set of packages that you know is going to work and play nice. And also, as a development team lead, you might want a unified install base so that all your developers have the same development environment and you know it's all going to play nice. So that's one of the advantages of those.
07:14 Michael Kennedy: That's really cool. Certainly the ability to install things that have weird compilation stuff. Do you guys ship the binaries prebuilt for that, so I don't have to have a Fortran compiler or something weird?
07:25 Pete Garcin: Exactly, yes, so they're pre-built, all pre-compiled. Depending on what platform you're on, like on Windows, a lot of people might not even have a C compiler installed, and a lot of packages are C-based, so they're prebuilt. And like you said, you don't need a Fortran compiler or some exotic build tool to actually make it work. It just works out of the box, yeah.
07:46 Michael Kennedy: Okay, that's really awesome. And ActivePython is free? If I'm, like, a random person, not a huge corporation that wants support?
07:54 Pete Garcin: Exactly, yeah, if you're just a developer, it's free to download and free to use. Even if you are, you know, a large corporation, it's free to use in non-production settings, so on your own. So it's, you know, you can go and just download it, try it out, see if it works for you.
08:09 Michael Kennedy: Okay, yeah, that sounds awesome. How many of the 10 libraries you want to talk about would come built-in? Do you know off the top of your head?
08:17 Pete Garcin: I think, actually, almost all of them. Caffe is on the list to be included, but it's not in the current one yet. So pretty much all of the other ones. Maybe CNTK as well.
08:32 Michael Kennedy: Yeah, yeah, that's really new as well.
08:34 Pete Garcin: That's really new. We are targeting to have as many of these as we possibly can, and so pretty much, most of them are included.
08:42 Michael Kennedy: That's awesome. So all the libraries that we're talking about, one really nice way to just get up to speed with them would be grab ActivePython, and you'd be ready to roll.
08:50 Pete Garcin: Exactly, yeah grab them, install them, you're ready to roll, right out of the gate.
08:56 Michael Kennedy: Cool, all right, so let's start with what I would consider the foundation of them, the first library that you picked, which is NumPy and SciPy.
09:05 Pete Garcin: Absolutely, and they are foundational, in the sense that a lot of other libraries either depend on them or are, in fact, built on top of them, right? So they are sort of the base of a lot of these other libraries. And most people might have worked with NumPy. Its main feature is that N-dimensional array structure that it includes, and a lot of the other libraries either accept data as a NumPy array or require that you format it that way. So, especially when you're doing machine learning, you're doing a lot of matrices and a lot of higher dimensional data. Depending on how many features you have, it's a really, really useful data structure to have in place.
09:51 Michael Kennedy: Yeah, so NumPy is this sort of multidimensional array-like thing that stores and does a lot of its processing down at the C level, but has of course its programming API in Python, right?
10:07 Pete Garcin: Yes, yeah, exactly, and a lot of these machine learning libraries do tend to have C-level, like, lowest-level implementations with a Python API, and that's predominantly for speed. So when you're doing tons and tons and tons of calculations, and you need them to be really, really lightning fast, that's the primary reason that they do these things, you know, sort of at the C level.
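To make that concrete, here is a minimal sketch of the kind of N-dimensional array work being described; the values are made up for illustration:

```python
import numpy as np

# A 2-D array (matrix); the heavy lifting happens in compiled C code.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(X.shape)         # (2, 2)
print(X * 10)          # elementwise math, no Python loop
print(X @ X)           # matrix multiplication
print(X.mean(axis=0))  # column means, a typical per-feature statistic
```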
10:32 Michael Kennedy: Right, absolutely. And so, related to this is SciPy. They're kind of grouped under the same organization, but they're not the same library exactly, are they?
10:41 Pete Garcin: No, so SciPy is more of a scientific, mathematical computing thing, and it has the more advanced linear algebra and Fourier transforms, image processing. It has physics calculation stuff built in. So most scientific numerical computing functionality is built into SciPy. I know that NumPy does have, like, linear algebra and stuff in it, but I think the preferred approach is that you use SciPy for all that kind of linear algebra crunching.
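And a quick sketch of the SciPy side of that split, touching the linear algebra and Fourier transform routines just mentioned; the matrix and signal are invented examples:

```python
import numpy as np
from scipy import linalg, fftpack

# Solve the linear system Ax = b and get the eigendecomposition.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
eigenvalues, eigenvectors = linalg.eig(A)

# Fourier transform of a toy signal.
signal = np.sin(np.linspace(0, 8 * np.pi, 64))
spectrum = fftpack.fft(signal)
```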
11:09 Michael Kennedy: Okay, yeah, so a lot of these things that we're going to talk about will have NumPy somewhere in them, as a dependency or an internal implementation, or even in their API, like the ability to pass these NumPy arrays between them and things like that.
11:24 Pete Garcin: Absolutely.
11:25 Michael Kennedy: Yeah, one other thing that's worth noting that's pretty interesting, and I think this is a trend that's growing, maybe you guys have more visibility into it than I do, but NumPy, on June 13, 2017, so about a month ago at the time of this recording, received a $645,000 grant for the next two years to grow it and evolve it and keep it going strong. That's pretty cool.
11:51 Pete Garcin: It is very cool, and I think that you're starting to see that these open source projects are really forming the backbone of most of the machine learning research, and actually the implementation, that you're seeing out there in the world. There's not a lot of closed-source, trade-secret stuff; a lot of the most bleeding-edge, active development is happening in these open source projects, so I think it's great to see them receiving funding and sponsorship like that.
12:17 Michael Kennedy: Yeah, I totally agree. And it's just going to mean more good things for the community and all these projects. It's really great to see. One thing I want to touch on for every one of these is to give you a sense of how popular they are, so for each one, we'll say the number of GitHub stars and forks. It's not necessarily the exact right measure of popularity because, obviously, NumPy is used across many of these other things which have more stars, but people don't necessarily contribute directly to NumPy, and so on. But NumPy has about 5,000 stars and 2,000 forks, to give you a sense of how popular it is. The next one up, Scikit-learn, has 20,000 stars and 10,000 forks. So tell us about Scikit-learn.
13:01 Pete Garcin: Scikit-learn, like we mentioned before, is built on top of SciPy and NumPy and is a very popular library for machine learning in Python. I think it was one of the first, if not the first, I'm not 100% sure, but it's been around for quite a long time, and it supports a lot of the most common algorithms for machine learning, so that's like classification, regression tools, all that kind of stuff. I actually just saw a blog post come up on my feed today where Airbnb was using Scikit-learn to do some kind of property value estimation using machine learning. So it's being used very, very widely, and in a lot of different scenarios.
13:45 Michael Kennedy: Yeah, that sounds really cool. It definitely is one of the early ones, and it's kind of simpler in the sense that it doesn't deal with all the GPUs and parallelization and all that kind of stuff. It talks about classification and regression, clustering, dimensionality reduction, and modeling, things like that, right?
14:05 Pete Garcin: Yes, that's right. It doesn't have GPU support, and that can make it a little bit easier to install. You know, sometimes the GPU stuff can have a lot more dependencies that you need to install to make it work, although that's getting better in the other libraries. And it's, like you say, it is made and sort of designed to be pretty accessible and pretty easy, because it has the sort of baked-in algorithms that you can just say, "Oh, I want to do this," and it will crunch out your results for you. So I think that that's sort of, the sort of ease of use and the sort of cleanliness of its API has contributed to its longevity as one of the most popular machine learning libraries.
14:43 Michael Kennedy: Yeah, absolutely. And obviously, Scikit-learn being part of the whole SciPy family, it's built on NumPy, SciPy, and matplotlib?
14:52 Pete Garcin: Yes, yes, so yeah, it includes interfaces for all that stuff, and for, like, graphing the output and using matplotlib and yeah, using NumPy for inputting your data and for getting your data results, all that kind of stuff.
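As an illustration of that baked-in-algorithm style, here is a minimal Scikit-learn classification sketch on the bundled iris data set; the choice of classifier and its parameters is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load a small built-in data set and hold some back for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# "I want to do classification" -- fit, predict, score.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```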
15:06 Michael Kennedy: Yeah, yeah, very cool. All right, next up is Keras, at 17,700 stars and 6,000 forks. So this one is for deep learning specifically, right?
15:18 Pete Garcin: Yeah, and so this is for doing rapid development of neural networks in Python. It's one of the newest ones, but it's really, really popular. I've had some experience working with it directly myself, and I was really blown away by how simple and straightforward it is. It creates a layer on top of lower-level libs like TensorFlow and Theano and lets you just define, "I want my network to look like this: I want it to have this many layers, and this many nodes per layer, and here are the activation functions, and here's the optimization method that I want to use." You sort of just define, effectively, a configuration, and then it will build all of the graph for you, depending on what backend you used. So it's very, very easy to experiment with the shape of your network and with the different activation functions, and it lets you really quickly build and test different models, to see which one works better, or whether one works at all. So it's really easy to use and really very effective. I used it to build a little game demo where I trained an AI to play against you, to determine when it could shoot at you.
16:45 Michael Kennedy: Was this the demo you had at PyCon?
16:47 Pete Garcin: It is, yeah, yeah. And so we had that demo at PyCon, and I since did a blog post about it. And then I actually just recently rewrote it in Go for GopherCon, too. So eventually it will be open-sourced so that people can see, but one of the things that you really notice is that the actual code for Keras to basically define the network and do that machine learning heavy-lifting part is very, very minimal, like a dozen lines of code or something like that. It's really surprising, because you think it's like a ton of work, but it makes it super easy.
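This isn't the code from that demo, but a generic sketch of what such a dozen-line Keras definition can look like; the layer sizes, activations, and input dimension here are placeholder choices:

```python
from keras.models import Sequential
from keras.layers import Dense

# Define the shape of the network: layers, nodes per layer, activations.
model = Sequential()
model.add(Dense(32, activation="relu", input_dim=4))  # hidden layer
model.add(Dense(32, activation="relu"))               # hidden layer
model.add(Dense(1, activation="sigmoid"))             # binary output

# Pick the loss and the optimization method, then train.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20, batch_size=32)  # hypothetical data
```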
17:23 Michael Kennedy: Yeah, that's really cool. And it sounds like its goal is to be very easy to get started with. I like the idea of the ability to switch out the back ends, from say TensorFlow to CNTK to Theano. How easy is it to do that? Could I run some machine learning algorithms and say, "Let's try it in TensorFlow and do some performance benchmarks. No, no, no, let's switch it over to Theano and try it here," and kind of experiment, rather than completely rewriting it in those various APIs?
17:54 Pete Garcin: Exactly. It's literally just a configuration thing. It's almost like a tick box, it's so easy. That is absolutely one of the driving key features of that library, that you can just pick whichever one suits your purpose or your platform, depending on what's available on the platform that you're building for. Because currently, there aren't TensorFlow versions for every platform on every version of Python, and all that kind of stuff.
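For a concrete picture of that configuration switch, here is one way to select the Keras backend, assuming the chosen backend is installed; Keras also reads the same setting from its ~/.keras/keras.json file:

```python
import os

# Select the backend before Keras is first imported.
os.environ["KERAS_BACKEND"] = "theano"  # or "tensorflow", "cntk"

import keras  # prints e.g. "Using Theano backend." on import
```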
18:23 Michael Kennedy: Right, okay. Well that's pretty cool. So there's two interesting things about this library. One is the fact that it does deep learning. So maybe tell people what deep learning is. How does that relate to standard neural networks, or other types of machine learning stuff?
18:39 Pete Garcin: I think the sort of simplest way to put it is the idea of adding these additional layers to your network to create a more sophisticated model. So that allows you to create things that can take more sophisticated feature domains and then map those to an output more reliably. And that's where you've seen a lot of advances, for instance, a lot of the image recognition stuff that leverages deep learning to be really, really good at identifying images. Or even doing things like style transfer on images where you have a photograph of some scene, and then you have some other photograph, and you're like, "I want to transfer the style of the evening to my daytime photograph", and it will just do it and it looks pretty normal. And those are the most, I guess popular, common, deep learning examples that you see cited.
19:40 Michael Kennedy: Yeah, that makes a lot of sense. And it's easy to think of these as being Snapchat-y, sort of superfluous examples of machine learning, putting the little cat face on, or switching faces or whatever. But there are real, meaningful things that can come out of this. Like, for example, the detection of tumors in radiology scans and things like that. These deep learning models can do the image recognition on that and go, "Yep, that's cancer," maybe better than even radiologists can already. And in the future, it's going to get crazy.
20:19 Pete Garcin: Exactly. And it's funny you mention that. Stanford Medical, about a month or a month and a half ago, actually released, I don't know how many, 500,000 radiology scans that are annotated and ready for training machine learning. So that exact use case is intended to be a deep learning problem, and there are all kinds of additional data sets like these coming. I just saw a post this week about a deep learning model that was reading heart monitor data and being more effective than cardiologists, that kind of thing.
20:55 Michael Kennedy: It's really crazy. You think of this AI and automation disrupting low-end jobs, right? Like at McDonald's, we might have robots making our hamburgers or something silly like that. But if they start cutting into radiology and cardiology, that's going to be a big deal.
21:14 Pete Garcin: It absolutely is going to be a big deal. I think people will probably need to start thinking about it. And I don't think it's necessarily a complete replacement thing; the radiologist AI can't talk to you yet, I guess. But it can definitely augment and lighten the load on professions like medicine that are perpetually overworked, and allow them to be more effective human doctors. I think as tools, these things are going to be absolutely, incredibly revolutionary.
21:46 Michael Kennedy: Yeah, it's going to be amazing. Do you want a second opinion? Let's ask the super machine.
21:53 Pete Garcin: Exactly. One of the strengths of all these machine learning models is that they're able to visualize higher dimensional, complex data sets in ways that humans can't really do, and they have this intense focus, I guess, these models. Whereas it might be pretty hard for a doctor to read every single paper ever written on subject X, or to look at 500,000 radiology images, even across the course of their career.
22:26 Michael Kennedy: I'm pretty optimistic about where this goes. It's going to be interesting to join all this stuff together. The other thing that we're just starting to touch on here, and it's going to appear in a bunch of these others, so it may be worth spending a moment on as well, is that Keras lets you basically seamlessly switch between CPU computation and GPU computation. Maybe not everyone knows the power of non-visual GPU programming. Maybe we could talk about that a bit?
22:55 Pete Garcin: For sure. So your GPU is your Graphics Processing Unit. So if you have a gaming PC at home and you have, you know, an NVIDIA graphics card or an ATI graphics-
23:05 Michael Kennedy: Can run the Unreal Engine like crazy or whatever, right?
23:08 Pete Garcin: Exactly. So if you play games and you have a dedicated graphics card, or even without a dedicated graphics card, you have a GPU, and there's this thing called general-purpose GPU programming. A GPU is a highly parallel computer; it has like 1,000 cores in it or some huge number of cores.
23:27 Michael Kennedy: Yeah, like 4,000 or 5,000 cores per GPU, right?
23:32 Pete Garcin: Exactly, yeah. The original intention was that it needs to process, in parallel, every pixel or every polygon that's going on the screen, and perform effects. That's why you can get blur and real-time lighting and all that kind of stuff in real time. So it processes all that stuff in parallel, but then people started to develop SDKs that let you, in addition to doing graphics programming, just run regular programs on these things. And they're really, really fast at doing math, so we can do that. So now a lot of these libraries support GPU processing, and it's literally just like a compile flag now; it's getting a lot easier. You still have to make sure that you have the drivers and a GPU that's reasonably powerful, especially if you're doing a lot of computation. And then you can basically run these giant ML models on your GPU. It's something that's pretty well-suited to being parallelized, so it's a really great use of the GPU, and that's why you're seeing it take off. Because these models are easily made parallel.
24:45 Michael Kennedy: Yeah, they're what I call embarrassingly parallel algorithms, right? Just throw them at these things with 4,000 cores and let them go crazy. In the early days, and still, I guess, when you're doing DirectX or OpenGL or these things, it's really all about, "I want to rotate the scene," so it's like a matrix multiplication against all of the vertices in there. It's really similar, actually, the type of work it has to do. The other thing, I guess, which I don't see appearing anywhere in here, but I suspect TensorFlow may have something to do with it, is the new stuff coming from Google where they're going beyond GPUs to AI-focused chips. Did you hear about this?
25:23 Pete Garcin: Yeah, so Google has a thing called a TPU, which is a Tensor Processing Unit, and it's like a Cloud-hosted special piece of hardware that's optimized for doing TensorFlow. I don't know the exact benchmarks in terms of how that compares to some gigantic GPU assembly, but obviously Google thinks that this is a worthwhile investment, to build these sort of hardware racks in the Cloud and give people access to run their models on there. I think you're probably going to see more and more specialized, ML-targeted hardware coming out. I don't know whether it will be consumer hardware, where you can go and buy something for your home computer, but especially in the Cloud, you definitely will.
26:10 Michael Kennedy: Yeah, definitely in the Cloud. It's very interesting. They were talking about real time training, not just real time answers, so that sounds pretty crazy. This portion of Talk Python To Me has been brought to you by DataCamp. They're calling all data scientists and data science educators. DataCamp is building the future of online data science education. They have over 1.5 million learners from around the world who have completed 70 million DataCamp exercises to date. Learn how to get real, hands-on experience by completing self-paced, interactive data science courses right in the browser. The best part is, these courses are taught by top data science experts from companies like Anaconda and Kaggle, and universities like Caltech and NYU. If you are a data science junkie with a passion for teaching, then you too can build data science courses for the masses and supplement or even replace your income while you're at it. For more information on becoming an instructor, just go to datacamp.com/create, and let them know that Michael sent you. So speaking of popular libraries and TPUs, the next up is TensorFlow. That originally came from Google, and it is crazy: it has 64,000 stars and 31,000 forks. So tell us about TensorFlow.
27:19 Pete Garcin: So TensorFlow, obviously, this is Google's machine learning library, and it sits at a slightly lower level than something like Keras, and obviously it's used as a backend there. You can use it directly as well. What it does is represent your model as a computation graph, so that's effectively a graph where the nodes are operations, and this is a way that they found is really, really effective to represent these models. It's a little bit more intimidating to get started with, mostly because you have to think about building this graph. But you can use it directly in Python; Python's actually the recommended language and workflow from Google. For example, when I rewrote the Go version of our little game there, I still had to train and export my model from Python. So I used Python to build that and export it. That's the recommended workflow currently from Google for many languages: use Python as the primary language binding.
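A minimal sketch of that graph-building style, using the TensorFlow 1.x API that was current at the time of this recording; the shapes and values are arbitrary:

```python
import tensorflow as tf

# Build the graph: nodes are operations, edges carry tensors.
a = tf.placeholder(tf.float32, shape=(None, 2))  # input to feed later
W = tf.Variable(tf.random_normal((2, 1)))
b = tf.Variable(tf.zeros((1,)))
y = tf.matmul(a, W) + b

# Nothing runs until a Session executes the graph
# (with tf.device("/gpu:0") you could pin ops to a GPU).
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={a: [[1.0, 2.0]]}))
```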
28:19 Michael Kennedy: Yeah, that's really interesting. And it's great to see Python appear in so many of these libraries as a primary way to do it. So there's some interesting stuff about this one, and it's obviously super popular. Google has so many use cases for machine learning, just up and down everything that they're doing, so having this developed internally is really cool. It has a flexible architecture that lets it run on CPUs or GPUs, obviously, or mobile devices, and it even lets it run on multiple GPUs and multiple CPUs. Do you have to do anything to make that happen, or do you know how it does that?
29:00 Pete Garcin: As far as I can tell, especially for switching between CPU and GPU, it's essentially a compile flag. So when you build the libraries or download one of the nightly builds, you have to get one of the versions that has GPU support enabled and built in. And I think that there are also, increasingly, CPU optimizations in there. For instance, Intel is doing hand-optimized math kernel stuff that's integrated directly into TensorFlow to make it even faster, so that's something you can also get in the latest version as well. I definitely think speed and performance, and making that stuff easily accessible depending on what your hardware is and where you're going to deploy, is a big focus for them.
29:50 Michael Kennedy: Yeah, that's really cool. So do you think this is running in the Waymo cars, the Google self driving cars?
29:56 Pete Garcin: I mean, I don't know for sure, but I would be almost positive of it. You know, from everything that I've read and people that I've talked to, Google built this to use. This is their platform for all of their deep learning and machine learning projects, and so I would assume that TensorFlow is powering that and it's running pretty much all of their stuff.
30:18 Michael Kennedy: Very, very cool. It's probably in Google Photos and some other things as well.
30:22 Pete Garcin: Yeah, Google Translate, all those things. Pretty much all of the projects that Google is running, when you start looking at them, are effectively AI projects. Just recently, Google Translate, which uses machine learning and statistical models to do the translations, is approaching human-level accuracy for translation between a lot of the popular languages, where they have huge, huge data sets to pull from.
30:52 Michael Kennedy: Yeah, it's crazy. Very, very cool. So up next, number five, is Theano at 6,000 stars and 2,000 forks. And this one is really kind of similar to TensorFlow, but really low level, right?
31:06 Pete Garcin: Yeah, so it is more low level, and it is very similar to TensorFlow in the sense that it's also a very high speed math library. And I believe it was originally made by a couple of the guys who then went on to Google to make TensorFlow, so it predates TensorFlow by a little bit. But it also has the things that we're talking about here: it has transparent GPU support, and you can do things like symbolic differentiation, a lot of mathematical operations that you want to be highly, highly performant. So it is actually pretty similar to what TensorFlow does, and serves a similar purpose, but depending on what you're comfortable with and what your existing projects are, that is probably going to dictate which one you're using. And if you're using something like Keras, then you can just choose this as the back end.
31:58 Michael Kennedy: Just flip the switch.
31:59 Pete Garcin: Just flip the switch and there you go.
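Here is a small sketch of the symbolic differentiation mentioned above, using Theano's tensor API; the expression itself is a made-up example:

```python
import theano
import theano.tensor as T

x = T.dscalar("x")
y = x ** 2 + 3 * x   # build a symbolic expression
dy = T.grad(y, x)    # symbolic differentiation: dy/dx = 2x + 3

f = theano.function([x], [y, dy])  # compile to fast native code
print(f(2.0))        # [array(10.0), array(7.0)]
```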
32:01 Michael Kennedy: Yeah, it's cool, it also says it has extensive unit testing and self-verification where it will detect and diagnose errors, maybe you've set up your model wrong or something like that. That's pretty cool.
32:11 Pete Garcin: That's pretty cool, for sure. All these libraries are built by super, super smart people, a lot of them at colleges, who are creating things that solve a real world problem for them and really push things forward. And I actually think it's great that there are so many libraries in this space, because it really is just making it better for everybody.
32:36 Michael Kennedy: Yeah, the competition is really cool, too. To see the different ways to do this, and probably the cross-pollination.
32:42 Pete Garcin: Exactly.
32:43 Michael Kennedy: So one of the things you have to do for these models is feed them data, and getting data can be a super messy thing. And the one library that stands out above all the others for taking, transforming, redoing, and cleaning up data is Pandas, right?
33:01 Pete Garcin: Absolutely, yeah. Pandas is one of those libraries that, if you are manipulating especially large sets of real world data, people repeatedly come back to. Pandas, for those that might not know, is a data munging, data analysis library that lets you transform your data. One of the hardest parts when you're doing machine learning is actually getting your data into a format that can be used effectively by your model. A lot of times, real world data is pretty messy, or it might have gaps in it, or it might not actually be in the right units, so it might not be normalized so that you're within the right ranges. And if you feed the models raw data that hasn't been either cleaned up or formatted correctly, then what you might find is that the model doesn't converge, or you get what seem like random results, or things that don't really make sense. So spending this time, and having a library like Pandas that makes manipulating especially very large sets of data easy, is super useful. Just for instance, when I was doing that little demo we talked about, originally when I started it, I was feeding it raw pixel values for positions and velocities and stuff, and it just wasn't working. And it wasn't until I really normalized the data, really cleaned it up, that I started getting good, consistent results. Doing large scale data sets and being able to manipulate them effectively is very important.
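A minimal sketch of the kind of gap-filling and normalization being described, on an invented toy data set:

```python
import pandas as pd

# Messy "raw" readings with gaps (None), in raw units.
df = pd.DataFrame({
    "speed": [120.0, None, 95.0, 210.0],
    "angle": [0.5, 1.2, None, 2.9],
})

df = df.fillna(df.mean())                     # fill gaps with column means
df = (df - df.min()) / (df.max() - df.min())  # min-max normalize to [0, 1]
print(df)
```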
34:43 Michael Kennedy: Yeah, at the heart of all these successful AI things, these machine learning algorithms and what not, it is a tremendous amount of data. It's why the companies that we talk about doing well are enormous data-sucking machines, like Google, and Microsoft and some of these other ones, right?
35:03 Pete Garcin: Exactly, and that's where the power comes from. Google has access to just a massive amount of data that we regular people don't have access to. Or, like we were talking about earlier with the radiology images, you need a fairly large set of annotated data, so these are case files that a doctor's already gone through and said, "This one was a cancer patient, this one wasn't." And without that kind of annotated data, the models can't really learn. They need to know what the answer is, right?
35:40 Michael Kennedy: Yeah, right.
35:42 Pete Garcin: So that's really, really important.
35:43 Michael Kennedy: For humans, we have the whole 10,000 hours to become an expert; this is kind of the equivalent for machines.
35:48 Pete Garcin: Yeah, I guess. I don't know what the number is, machines might need more. That's one of the things that's really interesting about humans: our neural networks can learn remarkably quickly, without having to walk into traffic 1,000 times or do something like that. I don't know, there's some magic going on there or something.
36:11 Michael Kennedy: Yeah, there sure is. Alright, next up is Caffe and Caffe2, and this originally started out as a vision project, right?
36:20 Pete Garcin: That's right, it started at Berkeley. So this was primarily a vision project, and then there's a successor, Caffe2, that is backed by Facebook, actually, and is more general purpose and optimized for web and mobile deployment. So obviously, if you want to have machine learning based apps on your phone, having a library that targets that is pretty important.
36:45 Michael Kennedy: Yeah, I'm sure we're going to see more of that. There are even rumors, I don't know how trustworthy they are, that the next iPhone will have a built-in AI chip. Or actually, did Apple announce this?
36:58 Pete Garcin: I remember that they did just announce something. Apple actually just announced their machine learning SDK, Core ML, at WWDC in June. So Apple is already targeting these sort of deployed ML models; in that library's case, you are effectively choosing a pre-made model. So, "I want image recognition," or, "I want language parsing in my app," and then you can just use these sort of pre-trained models. But it wouldn't surprise me, they've got the motion chip in your iPhone now.
37:29 Michael Kennedy: Yeah, they've got the motion chip, yeah.
37:32 Pete Garcin: So it wouldn't surprise me at all to start seeing phones deploying AI chips in there to assist with it, because most of these things, like Siri, are machine learning based, right?
37:41 Michael Kennedy: Yeah, and it doesn't make sense to go to the Cloud all the time. That's one of the super annoying things about Siri, is you ask it a question and it's like six seconds late. You ask it something simple like, "What time is it?", 10 seconds later it'll tell you. Is it really that hard? Yeah, it's got to go all the way to the Cloud and you're in some sketchy network area or something, right?
38:03 Pete Garcin: Exactly, and so I wouldn't be surprised to start seeing that stuff deployed onto mobile.
38:08 Michael Kennedy: I think even at Build, Microsoft's conference, they started talking about Edge machine learning, where the machine learning is getting pushed out to all these IoT devices that they're working on as well. So there are a lot of attempts in this area.
38:21 Pete Garcin: For sure, yeah. And that's the next big thing, right? Having IoT-based machine learning devices. Like, can your fridge learn your grocery consumption habits and just tell you, "You're going to run out of milk in two days, and you're going to the store today, maybe you should pick some up"?
38:38 Michael Kennedy: It's going to happen.
38:40 Pete Garcin: It sounds kind of crazy, but it totally will happen.
38:42 Michael Kennedy: Yeah, it doesn't sound as crazy as, "Let's just let a car go drive in a busy city on its own."
38:49 Pete Garcin: That's true. And yet, that's something that exists now, right? That's a thing. Maybe it's not fully autonomous, but you could go and buy one, like tomorrow you could buy a car that you can turn on autopilot and go.
39:05 Michael Kennedy: It's crazy, it's totally crazy.
39:07 Pete Garcin: It'll drive for you. The future is now.
39:11 Michael Kennedy: The future is here, it's just not evenly distributed. This portion of Talk Python To Me is brought to you by, well, us. As many of you know, I have a growing set of courses to help you go from Python beginner to novice to Python expert, and there are many more courses in the works. So please consider Talk Python Training for you and your team's training needs. If you're just getting started, I've built a course to teach you Python the way professional developers learn, by building applications. Check out my Python Jumpstart by Building 10 Apps at talkpython.fm/course. Are you looking to start adding services to your app? Try my brand new Consuming HTTP Services in Python. You'll learn to work with RESTful HTTP services as well as SOAP, JSON, and XML data formats. Do you want to launch an online business? Well, Matt Makai and I have built an entrepreneur's playbook with Python for Entrepreneurs. This 16 hour course will teach you everything you need to launch your web-based business with Python. And finally, there are a couple of new course announcements coming really soon, so if you don't already have an account, be sure to create one at training.talkpython.fm to get notified. And for all of you who have bought my courses, thank you so much, it really, really helps support the show. One little fact or quote from the Caffe webpage that I want to throw out there before we move on, because I thought it was pretty cool: they say speed makes Caffe perfect for research experiments and industry deployment. It can process 60 million images per day on a single GPU. That's one millisecond per image for inference and four milliseconds per image for learning. That's insane.
40:44 Pete Garcin: That's so fast. And 60 million images per day is just crazy. That's why, like we were talking about the data just a minute ago, the amount of data being poured into these models is just staggering. And I don't doubt that people are probably feeding these models that much data every day. I think they were saying that 90% of the world's data that's ever been created has been created in the last year. It's just one of these things where it accelerates and accelerates and builds on all this stuff. So I think these things are just going to get faster until they're effectively real time.
41:23 Michael Kennedy: Yeah, absolutely. Alright, I don't think we said the stars for that one: 20,000 stars and 11,000 forks. So up next is definitely one that data scientists in general just live on, and that's Jupyter.
41:37 Pete Garcin: For sure. This has just become the standard interchange format for sharing data science, whether it's papers or data sets or models; it has become the sort of standard lingua franca for exchanging this data. And it's effectively a tool for the thing called the Jupyter Notebook, which is kind of like a webpage with embedded programs and embedded data sets. I think that's probably a good way to describe it, for those who might not have used it before.
42:12 Michael Kennedy: Right. Instead of writing a blog post or a paper that's got a little bit of code, then a little bit of description, then a picture which is a graph, it's live, and you can re-execute it and tweak it. And it probably plugs into many of these other libraries and is using them somewhere behind the scenes to do that.
42:28 Pete Garcin: Exactly, yeah. It's built on the IPython kernel, the interactive Python kernel. And yeah, I'm sure that there are all kinds of specific tools that can run that Notebook code and use that stuff there.
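For a sense of how a notebook is just a document with embedded code, here is a sketch that builds a tiny .ipynb file programmatically with the nbformat package; the cell contents are placeholders:

```python
import nbformat
from nbformat.v4 import new_notebook, new_markdown_cell, new_code_cell

# A notebook is a list of cells: prose, code, and (after running) outputs.
nb = new_notebook()
nb.cells = [
    new_markdown_cell("# Results\nA short write-up next to live code."),
    new_code_cell("import numpy as np\nnp.random.rand(3)"),
]

with open("demo.ipynb", "w") as f:
    nbformat.write(nb, f)  # open it with: jupyter notebook demo.ipynb
```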
42:42 Michael Kennedy: Cool, next up is maybe one of the newer kids on the block, and this deep learning story is from Microsoft, actually: their Cognitive Toolkit, CNTK.
42:51 Pete Garcin: Yeah, and they just released, I think, the 2.0 version of it at the beginning of June or late May, and now it's open sourced and it's got Python bindings. Microsoft's been doing a lot of open source work lately, and they've been really, really pushing a lot of their own projects. And like we said earlier, it's available as a backend for Keras. So it's similar, again, to TensorFlow and Theano in that it's focused on that sort of low level computation as a directed graph. A similar model; that directed graph is obviously emerging as a popular and efficient way to represent machine learning models. And it's pretty popular, too, right? It's got a decent number of stars and forks, and obviously, as a Keras backend and a Microsoft-backed library, it's going to be pretty popular and pretty common up there.
43:47 Michael Kennedy: Yeah, absolutely. These days, with Satya Nadella and a lot of the changes at Microsoft, I feel like the open source stuff is really taking a new direction, a positive one. And I also think their philosophy is, if it's good for Azure, it's good for Microsoft. So this plugs into their hosted stuff in interesting ways, and they've got a lot of cognitive Cloud services and things like that.
44:09 Pete Garcin: Yeah, Azure is becoming pretty huge. It's starting to rival even AWS for a lot of those Cloud-hosted services, and especially around machine learning, Azure has so many different machine learning tools available; it's really clearly a pretty big focus for Microsoft. And again, it's great to see more of the big guns being more open about their development and sharing. I mean, it drives everybody forward and just accelerates development across the whole ecosystem.
44:44 Michael Kennedy: Yeah, they have a number of the Python core developers there: they have Brett Cannon, they have Steve Dower, they have Dino Viehland. There are some serious people back there working on the Python part.
44:54 Pete Garcin: Exactly, yeah. They've got a lot of the Python core team there. And I know a bunch of the guys from ActiveState were just at PyData in Seattle, and a huge number of the core team were there. Just a really, really great little conference there, talking about Python and data science.
45:12 Michael Kennedy: Yeah, I think they have some really interesting language stuff as well. So speaking of languages, certainly the longest running one probably, that's really still going strong, is NLTK with 5,000 stars and 1,500 forks?
45:25 Pete Garcin: Yeah. So NLTK is the Natural Language Toolkit, and obviously this is a thing for doing natural language parsing, which is, I guess, one of the holy grails of machine learning: to get it so you can just speak to your computer in completely natural language, and maybe even give it instructions in natural language, and have it follow your directions and understand what you're asking. So this is a really popular one in academia for research. They link to and include massive corpora, gigantic bodies of text in different languages and in different styles, to be able to train models. So there's also a pretty large open data component to this project as well. Obviously, the use case here for natural language is huge: for translation, like we mentioned earlier, and chat bots, which now are a huge thing for support. I mean, every website you go onto, it pops up, "Hey, I'm Bob, can I help you today?", and it's not really a person, it's a chat bot. There are just so many. And then, like we were saying, Siri and Cortana and all those sort of personal assistants, where you can ask a natural language question and it can come back to you. So this is an almost foundational library, still going strong, still tons of active development and research going on with it.
46:54 Michael Kennedy: Yeah, it's really cool. And especially with all the smart home speaker things, Google Home, HomePod, all that stuff, this is just going faster, not slower, in terms of acceleration, right? We're talking and interacting with them way more, and definitely the chat bots. And anytime you have text and you want a computer to understand it, this is like a first step: tokenization, stemming, tagging, parsing, semantic analysis, all that kind of stuff, right?
47:21 Pete Garcin: Yeah, and that's exactly what it outputs. So what it will do is generate parse trees and stem it all out, and then you use that kind of tokenized version to train your model, not the raw text characters. And we really are getting there. These days, for sure, just the recognition part, the tokenization part, is very, very good. It's more the kind of semantic meaning, what do you mean when you ask it something. You ask Siri, "What are the movie times for X?", or something like that; how specific do you have to be to get a reasonable answer from her?
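Here is a small illustration of that tokenize, tag, and stem pipeline in NLTK; the sentence is invented, and the commented-out downloads are one-time setup:

```python
import nltk
from nltk.stem import PorterStemmer

# One-time model downloads:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

text = "What are the movie times for the new release?"
tokens = nltk.word_tokenize(text)                  # tokenization
tags = nltk.pos_tag(tokens)                        # part-of-speech tagging
stems = [PorterStemmer().stem(t) for t in tokens]  # stemming

print(tags)   # e.g. [('What', 'WP'), ('are', 'VBP'), ...]
print(stems)  # e.g. ['what', 'are', 'the', 'movi', 'time', ...]
```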
47:59 Michael Kennedy: Yeah, it's got to go through speech-to-text, and then it probably hits something like this.
48:02 Pete Garcin: Exactly, yeah, that's going to hit a library like this. And we're getting there. It's not quite at the Star Trek computer, do this for me, but it's way closer than I ever thought we would be, it's really pretty impressive sometimes.
48:17 Michael Kennedy: Yeah, absolutely. It's fun to see this stuff evolve.
48:20 Pete Garcin: Absolutely.
48:21 Michael Kennedy: Alright, Pete, that's the 10 libraries, and I think these are all really great selections. Hopefully people have gotten a lot of exposure and maybe learned about some they didn't know about. And I'd encourage everyone to go out there and try these on, and if you've got an idea, play with it with one or more of them.
48:38 Pete Garcin: For sure. They're all so accessible now. You don't necessarily have to be an ML researcher or a math wizard to actually create something that's interesting, or to experiment or learn a little bit. These libraries all do a really, really great job of abstracting away some of the more complicated mathematical parts and, in the case of a lot of them, making it reasonably accessible. So that's where I think you're seeing this kind of democratization trend in machine learning now, where this stuff is becoming more accessible, it's becoming easier. And I think you're going to see a lot of creativity and a lot of innovation come out of people if they just give it a shot and try something out and learn something new.
49:25 Michael Kennedy: Yeah, that's awesome. I totally agree with the democratization of it. And that's also happening from a computational perspective. These are easier to use, but also with the GPUs and the Cloud and things like that, it's a lot easier. You don't need a super computer, you need $500 or something for a GPU.
49:42 Pete Garcin: Exactly. I think all these things feed in together: you have a democratization trend in the tools and the source code, so now you can have access to Google's years and years of AI research via TensorFlow on GitHub; you also, like you said, can go and buy a $500 GPU and have basically a supercomputer on your desktop. But there's also this open data component, where you can get access to massive data sets, like the Stanford image library and these huge NLTK language corpora, that you can then use to train your models. Previously, that was probably impossible to actually access.
50:27 Michael Kennedy: Yeah, that's a really good point. Because even though you have the machines and you have the algorithms, the data, the data really makes it work. Alright, so I think let's leave it there for the libraries, those were great. And I'll hit you with the final two questions. If you're going to write some Python code, what editor do you open up?
50:45 Pete Garcin: Well, obviously ActiveState has Komodo, so I tend to use that a lot for doing Python code, but to be totally fair, I've also used VS Code, which is getting increasingly popular. So I tend to cycle between them all, because we have an editor product, and it's great to keep up to date on what all the other ones are doing. But Komodo is sort of my go-to.
51:12 Michael Kennedy: Yeah, it's cool. It's definitely important to look and see what the trends are, what other people are doing, how can you bring this cool idea back into Komodo, things like that, right?
51:20 Pete Garcin: Yeah, for sure.
51:22 Michael Kennedy: Alright, and I think we've already hit 10, but do you have another notable PyPI package?
51:26 Pete Garcin: I don't know, there's so many. Since we're talking about machine learning, I would probably give a little bit of a shout out to Keras again. Because I do think, as an entry point to machine learning, it's so accessible, it's so easy to at least get started and get a result with. I think that if you're looking to get into this and you're looking to try it out, that's a really great place to start.
51:51 Michael Kennedy: Yeah, I totally agree with you, that's where I would start as well. Alright, well, it's very interesting to talk about all these libraries with you. I really appreciate you coming on the show and sharing this with everyone. Thanks for being here.
52:02 Pete Garcin: Thank you for having me.
52:04 Michael Kennedy: You bet, bye. This has been another episode of Talk Python To Me. Our guest has been Pete Garcin, and this episode has been brought to you by DataCamp and us, right here at Talk Python Training. Want to share your data science experience and passion? Visit datacamp.com/create and write a course for a million budding data scientists. Are you or a colleague trying to learn Python? Have you tried books and videos that just left you bored by covering topics point by point? Well, check out my online course, Python Jumpstart by Building 10 Apps, at talkpython.fm/course to experience a more engaging way to learn Python. And if you're looking for something a little more advanced, try my Write Pythonic Code course at talkpython.fm/pythonic. Be sure to subscribe to the show: open your favorite podcatcher and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening, I really appreciate it. Now get out there and write some Python code.