Top 10 machine learning libraries
But with all this growth, there is an explosion of data science and machine learning libraries. That's why I invited Pete Garcin onto the show. He's going to share his top 10 machine learning libraries. After this episode, you should be able to pick the right one for the job.
Links from the show
Pete on GitHub: github.com/rawktron
ActivePython: activestate.com/activepython
NeuroBlast AI Game: github.com/ActiveState/neuroblast
The 10 Machine Learning Libraries
Numpy/Scipy: numpy.org
Scikit-Learn: scikit-learn.org
Keras: keras.io
TensorFlow: tensorflow.org
Theano: deeplearning.net/software/theano
Pandas: pandas.pydata.org
Caffe/Caffe 2: caffe.berkeleyvision.org
Jupyter: jupyter.org
CNTK: microsoft.com/en-us/cognitive-toolkit
NLTK: nltk.org
Episode transcripts: talkpython.fm
--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
Episode Transcript
00:00 Data science has been one of the major driving forces behind the explosion of Python in recent
00:04 years. It's now used for AI research, it controls some of the most powerful telescopes in the world,
00:09 it tracks crop growth and prediction, and so much more. But with all this growth,
00:14 there's an explosion of data science and machine learning libraries. That's why I invited Pete
00:18 Garcin onto the show. He's going to share his top 10 machine learning libraries for Python.
00:23 After this episode, you should be able to pick the right one for the job.
00:27 This is Talk Python to Me, recorded July 20th, 2017.
00:31 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem,
00:50 and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy.
00:55 Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter
01:00 via at Talk Python. Talk Python to Me is partially supported by our training courses. Here's an
01:06 unexpected question for you. Are you a C-sharp or .NET developer getting into Python? Do you work at a
01:12 company that used to be a Microsoft shop, but is now finding their way over to the Python space?
01:18 We built a Python course tailor-made for you and your team. It's called Python for the .NET
01:23 developer. This 10-hour course takes all the features of C-sharp and .NET that you think you
01:29 couldn't live without. Entity Framework, Lambda Expressions, ASP.NET, and so on. And it teaches
01:34 you the Python equivalent for each and every one of those. This is definitely the fastest and clearest
01:39 path from C-sharp to Python. Learn more at talkpython.fm/.NET. That's talkpython.fm slash
01:46 D-O-T-N-E-T.
01:48 Pete, welcome to Talk Python.
01:50 Thanks. I'm happy to be here.
01:52 That's great to have you here. And I've done a few shows on machine learning and data science,
01:58 but I'm really happy to do this one because I think it's really accessible to everyone. We're
02:02 going to bring all these different libraries together and kind of just make people aware of
02:06 all the cool things that are out there for data science, machine learning.
02:09 Yeah, it's really crazy actually how many libraries are out there and how active the development is on
02:15 all of them. There's new contributions, new developments all the time. And it seems like
02:20 there's new projects popping up like almost daily.
02:22 Yeah, it's definitely tough to keep up with, but hopefully this adds a little bit of help
02:27 for the reference there. But before we get into all these libraries, let's start with your story.
02:30 How did you get into programming in Python?
02:32 I started programming at a pretty young age, like sort of back before Stack Overflow and
02:37 things like that existed. And I sort of mostly made games. I started with BASIC, like most people
02:43 probably from a certain age, and then moved on to Pascal and was making games for my BBS
02:50 back in the day, making online games, utilities and stuff like that. And then for Python, later I
02:56 worked in games for a long time. And when I worked in games, we were doing like tool automation, like build
03:04 automation, certain workflow automation, build pipelines, all that kind of stuff. So Python was something, a tool that we
03:10 used quite a lot there. So that was where I got my start with Python.
03:15 Oh, yeah, that's really cool. Python is huge in the workflow for games and movies, way more than people on the outside
03:21 realize, I think. Yeah, especially for artists. So like a lot of the tools have Python built into them.
03:26 And so artists will use it for like automating model exports or rigging and that kind of stuff. So it's pretty
03:35 popular in that sense. And then also still even just for like building assets for games.
03:40 Okay, I'm intrigued by your BBS stuff. It occurs to me and it's kind of crazy. There may be younger
03:46 people listening that don't actually know what a BBS is. Okay, so a BBS is short for bulletin board system.
03:54 And it was sort of like in a way the precursor to the internet where you used to host what is
04:02 effectively sort of a website on your home computer. And people would like call your phone number.
04:08 And you'd have it hooked up to your modem. And they would like call your phone number,
04:12 connect to your home computer. So in my case, it was like my computer that I played games on and
04:18 did my homework on and that kind of stuff. And they could connect and send messages to each other
04:24 and download files and play games, very simple games, that kind of thing. So it was like a...
04:30 Yeah, it was so fun.
04:31 Yeah, it was awesome. And like, I really, really enjoyed it. And you had a thing called Echomail
04:36 back then, which was like this sort of way of like transferring messages all over the world. So,
04:41 you know, somebody would send a message on your BBS, and then it would like call a whole bunch of
04:45 others like in this network. And then somebody in like Australia might answer it. And it would take
04:50 like days to get back because it would be like this chain of people's BBSes calling the next one. So
04:56 yeah, there was no internet. It was the craziest thing. Like, we at our house, my brother and I
05:01 had talked my dad into getting us multiple phone lines so we could work with BBSes like in parallel.
05:08 And you would send these mails and like at night, there would be like a, like a coordination of the
05:14 emails across the globe as these things would like sync up the emails they had queued up. It was the
05:19 weirdest thing. But I loved it. I don't know, what's it? Trade Wars or Planet Wars? One of those games. I
05:23 really loved it.
05:24 For sure. I'm like a huge Trade Wars fan. You can actually play it now. Like there are people who have
05:29 it set up on like websites that have like simulated Telnet stuff. And you can, you can play versions of Trade
05:35 Wars, which I have done recently just to like, don't tell me that you're going to ruin my productivity
05:40 for like the whole day. Yeah. You'll be after this. You'll be like, Oh, Trade Wars 2002. Is it still,
05:46 it's still around? People still play it, but it was such a good game. It's fantastic. Yeah, it was
05:51 fantastic. It's awesome. All right. So that's how you got into this whole thing. Like what do you do
05:56 today? You work at ActiveState, right?
05:58 I do. Yeah. I'm a dev evangelist at ActiveState. So generally that means working with developers,
06:03 language communities, trying to make our distributions better. So at ActiveState,
06:06 we do language distributions. Probably a lot of people in the Python space know us that we do
06:11 ActivePython and it's been around for a long time. We were a founding member of the Python Software
06:16 Foundation. And so ActiveState has a pretty long history in the Python community. And before that,
06:22 we were, people probably know us from Perl and now we have a Go distribution and Ruby beta coming out
06:28 soon. So we're sort of expanding to all these different dynamic language ecosystems.
06:34 Sure. That's awesome. So I know that maybe people are a little familiar with some of the advantages of
06:40 these higher order distributions for Python, but maybe give us a sense of like, what is the value of
06:46 these distributions over like going and grabbing Go or grabbing Python and just running with that?
06:50 I think that you've got obviously this sense of curated packages. So there are, you know, in the
06:58 Python distribution, there's like over 300 packages. And so you know that they're going to build, that they're
07:02 going to play nice with each other, that they have current stable versions, all that kind of stuff.
07:06 And then additionally, you can buy commercial support. So for a lot of our customers, so we have a lot of like
07:13 large enterprise customers, they can't actually adopt a language distribution or a tool like that
07:19 without commercial support. They need to know that somebody has their back. And so that's something
07:24 that we offer on these language distributions for those large customers. But for the community and for
07:29 individual developers, there's the value of having that curated set of packages that you know
07:36 is going to work, that you know is going to play nice. And also, maybe as a development team
07:40 lead, you might want a unified install base, so that all your developers have the same development
07:47 environment and you know it's all going to play nice. And so that's something that's one of
07:51 the advantages of those.
07:51 That's really cool. Certainly the ability to install things that have weird compilation stuff. Do you guys
07:57 ship the binaries like pre built for that? So I don't have to have like a Fortran compiler or
08:02 something weird?
08:03 Exactly. Yes. So they're all pre-built, all pre-compiled. So I mean, depending
08:08 on what platform you're on, like on Windows, you might not even have a C compiler installed, and a
08:12 lot of packages are C-based. And so they're pre-built, and like you said, you don't need a
08:17 Fortran compiler or some exotic build tool to actually make it work. It just works out of the box.
08:23 Yeah. Okay, that's really awesome. And ActivePython is free. If I'm like a random person, not a huge
08:29 corporate that wants support.
08:31 Exactly. Yeah. If you're just a, you know, a developer, it's free to download and free to use.
08:37 And even if you are, you know, a large corporation, it's free to use in non-production settings. So on your
08:43 own. So it's, you know, you can go and just download it, try it out, see if it works for you.
08:47 Okay, yeah, that sounds awesome.
08:49 How many of the 10 libraries we're going to talk about would come built in? Do you know off the top
08:54 of your head?
08:54 I think that actually almost all of them. I think Caffe is on the list. It's not in the current
09:03 one, but it is on the list to be included. So I think actually like pretty much all of the other
09:08 ones, except maybe CNTK, which is still really new. So but you know, we are targeting
09:15 to have as many of these as we possibly can. And so pretty much most of them are included.
09:20 That's awesome. So all the libraries that we're talking about, like one really nice way to just
09:23 get up to speed with them would be to grab ActivePython, and you'd be ready to roll.
09:28 Exactly. Yeah, awesome.
09:29 Grab them, install them, you're ready to roll right out of the gate.
09:33 Cool. All right. So let's start at what I would consider the foundation of them. The first library
09:40 that you picked, which is NumPy and SciPy.
09:42 Absolutely. And they are foundational in the sense that a lot of other libraries either
09:47 depend on them or are in fact built like on top of them. Right. So they are sort
09:54 of the base of a lot of these other libraries. And most people might have worked with
09:59 NumPy. Its main feature is that sort of n-dimensional array structure that it
10:05 includes. And a lot of the other libraries either support sending your data in as a NumPy array,
10:11 or require that you format it that way. So especially
10:17 when you're doing machine learning, you're doing a lot of matrices and a lot of like higher dimensional
10:22 data, depending on how many features you have. It's a really, really useful data structure to have
10:28 in place.
10:29 Yeah. So NumPy is this sort of multi-dimensional array-like thing that stores and
10:37 does a lot of its processing down at the C level, but has, of course, its programming API in Python,
10:44 right?
10:44 Yes. Yeah, exactly. And a lot of these machine learning libraries do tend to have C level,
10:51 like lowest level implementations with a Python API. And that's predominantly for speed. So when
10:59 you're doing tons and tons and tons of calculations, and you need them to be really, really lightning fast,
11:05 that's the primary reason that they do these things, you know, sort of at the C level.
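To make that concrete, here's a minimal sketch of the kind of thing NumPy gives you (the numbers are made up; nothing here is specific to any one ML library):

```python
import numpy as np

# A 2-D array (matrix) of 32-bit floats -- the kind of structure
# most ML libraries accept as input.
features = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]], dtype=np.float32)

print(features.shape)   # (2, 3) -- 2 samples, 3 features each

# Vectorized math runs in compiled C code, not a Python loop.
normalized = (features - features.mean(axis=0)) / features.std(axis=0)
print(normalized.shape)  # still (2, 3)

# Matrix multiplication, the workhorse of neural networks.
weights = np.ones((3, 1), dtype=np.float32)
output = features @ weights  # shape (2, 1)
print(output.ravel())        # [ 6. 15.]
```

That same `features` array could be handed directly to most of the libraries discussed later in the episode.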
11:09 All right, absolutely. And so related to this is SciPy. They're kind of grouped under the same
11:15 organization, but they're not the same library exactly, are they?
11:18 No. So SciPy is like a more scientific mathematical computing thing. And it has the more advanced like
11:26 linear algebra and like Fourier transforms, image processing, it has like a physics calculation
11:32 stuff built in. So most like scientific numerical computing functionality is built into SciPy.
11:38 I know that NumPy does have like linear algebra and stuff in it. But I think the preference is
11:43 that you use SciPy for all that kind of linear algebra crunching.
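For a taste of that, a tiny sketch using `scipy.linalg` to solve a linear system (a made-up example, not one from the show):

```python
import numpy as np
from scipy import linalg

# Solve the linear system  3x + 2y = 12,  x - y = 1  written as A @ v = b.
A = np.array([[3.0, 2.0],
              [1.0, -1.0]])
b = np.array([12.0, 1.0])

v = linalg.solve(A, b)
print(v)  # [2.8 1.8]
```

The same module also covers decompositions (LU, QR, SVD) and the other "linear algebra crunching" mentioned above.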
11:47 Okay, yeah. So a lot of these things that we're going to talk about will somewhere in them have as
11:52 a dependency or an internal implementation of some variation, or even in maybe in its API,
11:57 like the ability to pass between them, these NumPy arrays and things like that.
12:02 Absolutely.
12:02 Yeah. One other thing that's worth noting, that's pretty interesting. And I think this is a trend
12:09 that's growing. Maybe you guys have more visibility into it than I do. But NumPy, on June 13th, 2017,
12:16 so about a month ago at the time of the recording, received a $645,000 grant for the next two years to
12:24 grow it and evolve it and keep it going strong. That's pretty cool.
12:28 It is very cool. And I think that you're starting to see that these open source projects are really
12:34 forming the backbone of most of the machine learning research and actually implementation that you're
12:40 seeing out there in the world. There's not a lot of sort of more closed source behind trade secret
12:45 stuff. A lot of the most bleeding edge development and active development is happening in these open
12:50 source projects. So I think it's great to see them receiving funding and sponsorship like that.
12:55 Yeah, I totally agree. And it's just going to mean more good things for the community and all these
12:59 projects. It's really great to see. One thing I want to touch on for every one of these is to give
13:04 you a sense of how popular they are. And for each one, we'll say the number of GitHub stars and forks.
13:11 And that's not necessarily the exact right measure for the popularity because maybe this is you like
13:17 obviously NumPy is used across many of these other things which have more stars, but people don't
13:22 necessarily contribute directly to NumPy, and so on. But NumPy has about 5,000 stars and 2,000
13:30 forks to give you a sense of how popular it is. The next one up, scikit-learn has 20,000 stars and 10,000
13:37 forks. So tell us about scikit-learn. Scikit-learn, again, like we mentioned before, is a thing
13:44 that's built on top of SciPy and NumPy and is a very popular library for machine learning in Python.
13:50 And I think it was one of the first, if not the first, I'm not 100% sure, but it's been around for
13:56 quite a long time. And it supports a lot of the sort of most common algorithms for machine learning.
14:03 So that's like classification, regression tools, all that kind of stuff. I actually just saw like a blog post
14:08 come up in my feed today where Airbnb was using scikit-learn to do some kind of like property value
14:16 estimation or something using machine learning. So it's being used very, very widely in a lot of different
14:22 scenarios.
14:22 Oh yeah, that sounds really cool. It definitely is one of the early ones. And it's kind of simpler in the
14:30 sense that it doesn't deal with all the GPUs and parallelization and all that kind of stuff. It just,
14:36 it talks about classification, regression, clustering, dimensionality reduction, and modeling, things like that,
14:42 right?
14:42 Yes, that's right. It doesn't have GPU support. And that can make it a little bit easier to install if
14:48 you, you know, sometimes the GPU stuff can have a lot more dependencies that you need to install to make
14:52 it work. Although that's getting better in the other libraries. And it's like you say,
14:58 it is made and sort of designed to be pretty accessible and pretty easy, you know, because
15:02 it has the sort of baked in algorithms that you can just say, oh, I want to do this and it will
15:07 crunch out your results for you. So I think that the ease of use and the
15:13 cleanliness of its API has contributed to its longevity as one of the most popular
15:19 machine learning libraries.
15:21 Yeah, absolutely. And obviously scikit-learn, being part of the whole SciPy family, is built
15:27 on NumPy, SciPy, and matplotlib.
15:30 Yes. Yes. So yeah, it includes interfaces for all that stuff and for like graphing the output and
15:36 using matplotlib and yeah, using numpy for inputting your data and for getting your data results,
15:43 all that kind of stuff.
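A minimal sketch of that fit/predict workflow, with made-up toy data (not the Airbnb example):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: [weight_g, size_cm] for two made-up fruit classes.
X = np.array([[150, 7.0], [160, 7.5], [170, 8.0],     # class 0
              [300, 10.0], [320, 11.0], [310, 10.5]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# Pick a baked-in algorithm, fit it, and predict -- that's the whole API shape.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

print(clf.predict([[165, 7.2], [305, 10.8]]))  # [0 1]
```

Swapping `KNeighborsClassifier` for any other estimator (a regression, a clustering model, and so on) keeps the same fit/predict pattern, which is a big part of why the API feels so clean.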
15:44 Yeah. Very cool. All right. Next up is Keras at 17,700 stars and 6,000 forks.
15:52 So this one is for deep learning specifically, right?
15:56 Yeah. And so this is for doing rapid development of neural networks in Python. It's one of the
16:06 newest ones, but it's really, really popular. I've had some experience working with it directly
16:12 myself and I was sort of really, really blown away by how simple and straightforward it is.
16:18 So there's like, it creates a layer on top of lower level libs like TensorFlow and Theano and lets you
16:25 just sort of define, I want my network to look like this. So I want it to have this many layers and this
16:32 many nodes per layer. And here are the activation functions. And, you know, here's the optimization
16:37 method that I want to use. And you sort of just define this effectively a configuration,
16:42 and then it will build all of the graph for you, depending on what backend you used.
16:48 And so it's very, very easy to experiment with the like shape of your network and with the different
16:56 activation functions. So it lets you kind of really quickly reach and test, you know, different models
17:04 to see which one works better and to sort of see what one works at all. So it's really easy to use
17:10 and really very effective. I used it to build a little game demo where
17:18 I trained an AI to play against you, to determine when it could shoot at you.
17:22 Was this the demo you had at PyCon?
17:25 It is. Yeah. Yeah. And so we had that demo at PyCon. I since did a blog post about it a little bit.
17:31 And then I actually just recently rewrote it in Go for GopherCon too. So eventually it will be open sourced
17:38 so that people can see. But one of the things that you really notice is that the actual like code for
17:44 Keras to basically define the network and do the sort of machine learning heavy lifting part is very,
17:52 very minimal, like a dozen lines of code or something like that. It's really surprising because you think
17:57 it's like a ton of work, but it makes it super easy. Yeah, that's really cool. And it sounds like
18:02 its goal is to be very easy to get started with. I like the idea of the ability to switch out the
18:09 backend from say TensorFlow to CNTK to Theano. How easy is it to do that? Like if I'm, could I run some
18:18 machine learning algorithms and say, let's try it in TensorFlow and say, do some performance benchmarks
18:24 and stuff? No, no, let's switch it over to Theano and try it here and kind of experiment rather than
18:29 completely rewriting in those various APIs. Exactly. It's literally just a configuration
18:34 thing. It's almost like a tick box essentially, you know, like it's so easy.
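For flavor, here's roughly what that minimal Keras code looks like. The layer sizes, activations, and random training data below are all made up for illustration, and which backend actually runs it depends on how your Keras install is configured:

```python
import numpy as np
import keras

# A tiny fully connected network: 4 inputs -> 8 hidden units -> 1 output.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Random made-up data, just to show the shape of the API.
X = np.random.rand(32, 4).astype("float32")
y = np.random.randint(0, 2, size=(32, 1))

model.fit(X, y, epochs=1, verbose=0)
print(model.predict(X[:2], verbose=0).shape)  # (2, 1)
```

Changing the shape of the network really is just editing that list of layers, which is what makes the kind of rapid experimentation described above so easy.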
18:40 And so that is absolutely one of the, I think the driving key features of that library that you can
18:47 just pick whichever one suits your purpose or your platform, you know, depending on what's available
18:52 on the platform that you're building for. Cause currently there's not TensorFlow versions for
18:57 every platform on every version of Python and all that kind of stuff. Right. Okay. Well, that's,
19:01 that's pretty cool. So there's two interesting things about this library. One is the fact that it does
19:07 deep learning. So maybe tell people about what deep learning is. How does that relate to like
19:13 standard neural networks or other types of machine learning stuff?
19:17 Well, I think the sort of the simplest way to put it is the idea of like adding these additional layers
19:24 to your network to create a more sophisticated model. So that allows you to create things that can take
19:35 more sophisticated feature domains and then map those to an output more reliably. So, and that's where
19:44 you've seen a lot of advances, for instance, like in like a lot of the image recognition stuff that
19:48 leverages deep learning to be really, really good at identifying images or even doing things like
19:55 style transfer on images where you have a photograph of some scene and then you have some other photograph
20:03 and you're like, I want to transfer the style of the evening to my daytime photograph. And it will just
20:09 do it and it looks like pretty normal. And those are like the most, I guess, popular, common,
20:15 deep learning examples that you see cited.
20:18 Yeah, it makes a lot of sense. And you know, it's easy to think of these as being, like,
20:22 I don't know, Snapchatty, sort of superfluous types of examples. You know, machine learning
20:29 doing them, like putting the little cat face on or switching faces or whatever. But,
20:35 you know, there's real meaningful things that can come out of this. Like, for example,
20:40 the detection of tumors in radiology scans, and things like that. And these deep learning models
20:48 can do the image recognition on that and go, yep, that's cancer, you know, maybe better than even
20:53 radiologists can already. And then in the future, it's gonna get crazy.
20:57 Exactly. And it's funny you mention that. Stanford Medical, about a month ago,
21:01 month and a half ago, actually released like, I don't know how many, like 500,000 radiology scans
21:07 that are like annotated and ready for training machine learning. So that exact use case is intended
21:14 to be like a deep learning problem to be applied. And there are all kinds of additional
21:21 datasets like these that are coming out. I just saw a post this week about a deep learning model
21:27 that was measuring heart monitor data and being more effective than cardiologists, kind of thing. So
21:33 It's really crazy. You think of this AI and automation disrupting low end jobs, right? Like,
21:39 at McDonald's, we might have robots making our hamburgers or something silly like that. But if they start
21:46 cutting into radiology and cardiology, that's gonna be a big deal.
21:52 It absolutely is gonna be a big deal. I think people probably need to start thinking about it. I don't think
21:57 it's necessarily a complete replacement thing. It's not, you know, the radiologist AI can't talk to you
22:04 yet, I guess. Well, wait till we get to NLTK. But it can definitely augment and lighten the load
22:12 on professions like medicine that are, you know, perpetually overworked and allow them to be more
22:18 effective, you know, human doctors. So I think like as tools, these things are going to be absolutely
22:22 incredibly revolutionary. Yeah, it's gonna be amazing. You know, do you want a second opinion?
22:27 Let's ask, let's ask the super machine.
22:29 Exactly. But I mean, one of the strengths of all these machine learning models
22:35 is that they are able to visualize higher-dimensional complex data sets
22:43 in ways that like humans can't really do. And they have like just intense focus, I guess,
22:50 right? These models, whereas it might be, it's pretty hard for a doctor to read every single paper ever
22:56 written on subject X or to look at 500,000 radiology images even across the course of their career.
23:03 So I'm pretty optimistic about where this goes. It's going to be interesting to join all this stuff together.
23:08 The other thing that we're just starting to touch on here, and it's going to appear in a bunch of
23:13 these others, so maybe worth spending a moment on as well, is that Keras lets you basically seamlessly switch
23:21 between CPU computation and GPU computation. So maybe not everyone knows like the power of non-visual GPU
23:30 programming. Maybe talk about that a bit.
23:32 For sure. So your GPU, which is a graphics processing unit. So, you know, if you have a gaming PC at home,
23:38 and you have like, you know what I mean, an NVIDIA graphics card or an ATI card.
23:43 Can run the Unreal Engine like crazy or whatever, right?
23:46 Oh, exactly. So if you play games and you have a dedicated graphics card, well,
23:50 even without a dedicated graphics card, you have a GPU, and there's this thing called general purpose GPU
23:56 programming. So originally, like, a GPU is a highly parallel computer; it has like 1,000 cores in it,
24:03 or whatever, some huge number of cores. Yeah, like four or 5,000 cores per GPU, right?
24:09 Exactly. Yeah. And so like the intention there originally was that it needs to process in parallel
24:15 every pixel, or every polygon, that's going on the screen, right, and perform like effects. So that's why you can get
24:21 like blur and all this kind of stuff in real time, and real-time lighting and all that kind of stuff. So it processes
24:28 all that stuff in parallel. But then people started to develop SDKs that let you, well, in addition to
24:36 doing graphics programming, just run regular programs on these things. And they're really, really
24:41 fast at doing math programs. So we can do that. And so now, basically, a lot of these libraries
24:49 support GPU processing, and it's literally just like a compile flag. Now it's getting a lot easier, you know,
24:54 you still have to make sure you have the drivers and that you, you know, have a GPU that's reasonably
24:58 powerful, especially if you're doing a lot of computation. And so then you can basically run
25:05 these giant ML models on your GPU. And again, it's something that's pretty, pretty well suited to
25:13 being parallelized. So that is a really great use of the GPU. And that's why you're seeing it take off,
25:19 because these models are easily made parallel. Yeah, they're what are called embarrassingly parallel
25:25 algorithms, right? And just throw them at these things with 4,000 cores and let them go crazy.
25:29 Yeah, the early days, I mean, still, I guess, when you're doing DirectX or OpenGL, or these things,
25:35 like, it's really all about, I want to rotate the screen. So that's like a matrix multiplication
25:39 against all of the vector things. And it's really similar, actually, the type of work it has to do.
25:45 The other thing, I guess, which I don't see appearing anywhere in here, but I suspect
25:49 TensorFlow may have something to do with it, is the new stuff coming from Google, where they have
25:54 like going beyond GPUs for like, AI-focused chips. Did you hear about this?
26:00 Yes. So Google has a thing called a TPU, which is a tensor processing unit or whatever. And you can
26:07 that's like a cloud hosted, special piece of hardware that's optimized for doing TensorFlow.
26:12 And so I don't know the exact benchmarks in terms of how that compares to, you know, like some gigantic
26:20 GPU assembly. But obviously, Google thinks that this is a worthwhile investment to build these sort of
26:27 hardware racks in the cloud, and then give people access to run their models on there. So I think
26:32 you're probably going to see more and more specialized, ML-targeted hardware coming out. I
26:40 don't know whether it'll be consumer hardware, like you can go and buy
26:43 something for your home computer, but especially in the cloud, you definitely will.
26:48 Yeah, definitely in the cloud. Yeah, it's very interesting. They were talking about real time
26:52 training, not just real time answers. So that sounds pretty crazy.
26:55 This portion of Talk Python to me has been brought to you by DataCamp. They're calling all data science
27:02 and data science educators. DataCamp is building the future of online data science education.
27:07 They have over 1.5 million learners from around the world who have completed 70 million DataCamp
27:13 exercises to date. Learners get real hands-on experience by completing self-paced, interactive
27:19 data science courses right in the browser. The best part is these courses are taught by top data science
27:23 experts from companies like Anaconda and Kaggle and universities like Caltech and NYU. If you're a data
27:29 science junkie with a passion for teaching, then you too can build data science courses for the masses and
27:33 supplement or even replace your income while you're at it. For more information on becoming
27:37 an instructor, just go to datacamp.com slash create and let them know that Michael sent you.
27:42 So speaking of popular libraries and TPUs, the next up is TensorFlow. That originally came from
27:50 Google and it is crazy at 64,000 stars and 31,000 forks. So tell us about TensorFlow.
27:56 So TensorFlow, obviously, yeah, this is Google's machine learning library, and this forms the sort of
28:02 slightly lower level than something like Keras, and like obviously it's used as a backend.
28:07 You can use it directly as well. And what it does is it represents your model as a computation graph.
28:14 So that's effectively a graph where the nodes are like operations. And this is a way that they found
28:22 is really, really effective to represent these models. And it's a little bit more intimidating to
28:28 get started with mostly because you have to think about building this graph, but you can use it directly
28:34 in Python. Python is actually the recommended language and workflow from Google. So for example,
28:40 you know, when I rewrote the Go version of our little game there, I still had to train and export my model
28:46 from Python. So I use Python to build that, export it. So that's the sort of recommended workflow
28:52 currently from Google for many languages is to use Python as the primary language binding.
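The computation-graph idea he describes can be sketched in plain Python. To be clear, this is a toy illustration of the concept (nodes as operations, with evaluation as a separate step, like `session.run()` in classic TensorFlow), not TensorFlow's actual API:

```python
# Toy illustration of a computation graph: each node is an operation,
# and evaluating the graph is a separate step from building it.
class Node:
    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def run(self):
        # Recursively evaluate input nodes, then apply this node's operation.
        return self.op(*(n.run() for n in self.inputs))

def const(value):
    # A leaf node that just produces a value.
    return Node(lambda: value)

# Build the graph for (a * b) + c -- nothing is computed yet.
a, b, c = const(2.0), const(3.0), const(4.0)
mul = Node(lambda x, y: x * y, a, b)
add = Node(lambda x, y: x + y, mul, c)

# Running the graph is an explicit, separate step.
print(add.run())  # 10.0
```

Having the whole model expressed as a graph like this is what lets TensorFlow optimize it, distribute it, and run it on different hardware without changing your code.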
28:56 Yeah, that's really interesting, and it's great to see that Python appears in so many of
29:01 these libraries as a primary way to do it. So there's some interesting stuff about this one.
29:08 Obviously it's super popular. Google has so many use cases for machine learning, just up and down
29:15 their whole, you know, everything that they're doing. So having this like developed internally is really
29:20 cool. It has a flexible architecture that lets it run on CPUs or GPUs, obviously, or mobile devices.
29:28 And it even lets it run like on multiple GPUs and multiple CPUs. Do you have to do anything to make
29:35 that happen? Or do you know how it does that? As far as I can tell, especially for this
29:40 switching between CPU and GPU, it's essentially a compile flag. So when you build
29:45 the libraries, or download one of the nightly builds or whatever, you have to get one of the
29:51 versions that has GPU support built in. And I think that there are
29:57 also now increasingly like CPU optimizations in there. So like for instance, Intel is doing hand
30:04 optimized math kernel stuff that's integrated directly into TensorFlow to make it even faster. So that that's
30:12 something that you can also get in like the latest version as well. So I definitely think speed and
30:18 performance and making that stuff easily accessible to depending on what your hardware is and where
30:23 you're going to deploy it is a big focus for them. Yeah, that's really cool. So do you think this is
30:30 running in the Waymo cars, you know, the Google self driving cars?
30:33 Yeah, I mean, I don't know for sure, but I'd be almost positive of it, you know, from everything that
30:37 I've read and people that I've talked to. I mean, Google built this to be
30:42 the platform for all of their deep learning and machine learning
30:48 projects. And so I would assume that TensorFlow is powering that and it's running pretty
30:53 much all of their stuff. Very, very cool. It's probably in Google Photos and some other
30:58 things as well. Yeah, Google Translate, all those things. You know,
31:03 pretty much all of the projects, when you start looking at them, that Google is running are
31:08 effectively AI projects. And just recently,
31:16 Google Translate, which uses machine learning and statistical models to do the
31:22 translations, is approaching human-level accuracy for translation between a lot of the popular
31:27 languages where they have huge, huge data sets to pull from. Yeah, that's crazy. And very,
31:32 cool. So up next, number five is Theano at 6000 stars and 2000 forks. And this one is really kind
31:40 of similar to TensorFlow, but really low level, right? Yeah, so it is, you know, more low level,
31:46 and it is very similar to TensorFlow in the sense that it's also a very low-level, high-speed math
31:51 library. And I believe it's actually it was originally made by a couple of the guys who then
31:56 went on to Google to make TensorFlow. So it predates TensorFlow by a little bit. But it also has,
32:01 you know, the things that we're, we're talking about here, it has transparent GPU use. And you can do
32:08 things like symbolic differentiation, and a lot of like mathematical things, mathematical operations
32:13 that you want to be highly, highly performant. So it is actually pretty similar to what TensorFlow does,
32:21 and sort of serves a similar purpose. But depending on what you're comfortable with, and what your maybe
32:27 existing projects are, then that is probably going to dictate which one you're using. And if you're
32:32 using something like Keras, then you can just choose this as the backend. You just
32:36 flip the switch and there you go. Yeah, it's cool. It also says it has extensive unit testing and
32:42 self verification where it'll detect and diagnose errors, maybe you've set up your model wrong or
32:47 something like that. That's pretty cool. Yeah, for sure. I mean, all of these
32:51 libraries are built by super, super smart, accomplished people who are creating things
32:58 that are, you know, solving a real world problem for them and really, you know, sort of pushing things
33:04 forward. And I actually think it's great that there's so many, so many libraries in this space,
33:09 because it really is just making it better for everybody. Yeah, the competition is really cool
33:15 to see the different ways to do this and probably cross pollination. Exactly. Yeah. Yeah. So one of
33:22 the things you have to do for these models is feed them data. And getting data can be a super messy
33:27 thing. And the one library that stands out above all the others for taking, transforming,
33:34 and cleaning up data is pandas, right? Absolutely. Yeah. Pandas is one of
33:41 those libraries that if you're manipulating, especially large sets of data and real world data,
33:47 then this is the one that people, you know, repeatedly come back to. And yeah, so pandas,
33:54 for those that might not know, is like a, you know, data-munging and data-analysis library that lets you
34:00 transform it. One of the hardest parts when you're doing machine learning is actually getting your data
34:06 into a format that can be used effectively by your model. And so a lot of times real world data is
34:12 pretty messy, or it might have gaps in it, or it might not actually be formatted in the right units.
34:20 So it might not be sort of normalized so that you're within the right ranges. And if you feed the models,
34:26 just sort of raw data that hasn't really been either cleaned up or, or formatted correctly,
34:32 then what you might find is that the model doesn't converge or you get what seems like random results
34:40 or things that don't really make sense. And so, you know, spending this time and having a library that
34:46 makes manipulating, especially very large sets of data, very easy, like pandas is super useful.
34:53 And even just for instance, like when I was doing that, that little demo there that, that we talked
34:59 about originally, you know, when I started, I was feeding things raw, raw pixel values for positions
35:05 and velocities and stuff. And it just wasn't working. And it wasn't until I really normalized the data,
35:10 cleaned it up that I had started getting good consistent results. So it's, you know, dealing large
35:15 scale data sets and being able to manipulate them effectively is super important.
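As a tiny illustration of the kind of cleanup Pete describes, here's a hedged pandas sketch (the column names and values are made up, loosely echoing his pixel-position example): fill a gap in the data, then min-max normalize each column into the 0-1 range before feeding it to a model.

```python
import pandas as pd

# Made-up "raw" game data: a missing value and wildly different scales
df = pd.DataFrame({
    "x_position": [0.0, 320.0, None, 640.0],  # raw pixel values
    "velocity":   [-3.0, 1.5, 0.5, 2.0],
})

# Fill the gap with the column mean, then min-max normalize to [0, 1]
df = df.fillna(df.mean())
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized["x_position"].tolist())  # [0.0, 0.5, 0.5, 1.0]
```

With every feature on a comparable 0-1 scale, a model is far less likely to diverge or produce the seemingly random results described above.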
35:20 Yeah. At the heart of all these successful AI things, these machine learning algorithms and
35:28 whatnot is a tremendous amount of data. It's why the companies that we talk about doing well are like
35:33 enormous data sucking machines like Google and Microsoft and some of these other ones. Right.
35:41 Exactly. And that's where the power of them comes from is like, you know, Google has access to like just
35:46 massive amounts of data that regular people don't have access to. Or like we were talking about earlier
35:54 with the radiology images, you need a fairly large set of annotated data. And so that's data
36:01 where, you know, these are case files or whatever that, you know, a doctor has already gone through and said,
36:06 this one was a cancer patient, this one wasn't. And without that kind of annotated data, the models
36:13 can't really learn. They need to know what the answer is. Right. And so that's really, really important.
36:20 Yeah. We have the whole 10,000 hours to become an expert thing for humans. This is kind of the equivalent
36:25 for machines.
36:26 Yeah, I guess. Yeah, I don't know what the equivalent is. The machines might need
36:31 more. That's one of the things that is really interesting about humans is that our neural networks
36:38 can learn remarkably quickly without having to walk into traffic 1000 times or do something like that.
36:46 And so there's I don't know, there's some magic going on there or something.
36:49 Yeah, there sure is. All right. Next up is Caffe and Caffe 2. And this originally started out as a
36:57 vision project, right?
36:58 That's right. Yeah, from Berkeley. And so this was primarily a vision project. And then there's a sort of successor
37:05 that is backed by Facebook, actually, and is more general purpose and is sort of optimized for web and
37:13 mobile deployment. So obviously, you know, if you want to have machine learning based apps on your
37:18 phone, then having a library that sort of targets that is pretty important.
37:22 Yeah, I'm sure we're going to see more of that. I mean, there are even rumors. I don't know how
37:27 trustworthy they are, or maybe it's actually analysis, that the next iPhone will have a
37:34 built-in AI chip.
37:35 I remember that. Apple actually just announced a machine learning SDK, Core ML, at
37:42 WWDC in June. And so Apple is already targeting these sort of deployed ML models. So, you know, in
37:51 that library's case, you are effectively choosing a pre-made model. So, I want image recognition, or I want,
37:57 you know, language parsing in my app. And then you can just use these sort of pre-trained models.
38:01 But it wouldn't surprise me, you know, they've got, what is it, the motion chip in your iPhone now.
38:07 Yeah, they got the motion chip. Yeah.
38:08 So it wouldn't surprise me at all to start seeing phones deploying AI chips in there
38:13 to assist with this, because most of these sort of things, like Siri, are machine learning based things.
38:18 Right. So yeah. And it doesn't make sense to go to the cloud all the time. Like, that's one of
38:24 the super annoying things about Siri is you ask it a question and it's like six seconds later. Like you
38:29 ask it something simple like what time is it? 10 seconds later, it'll tell you it's such and such.
38:34 Like, is it really that hard? Yeah. Yeah. It's got to go all the way to the cloud and you're in some
38:38 sketchy network area or something. Right. Exactly. And so I wouldn't be surprised to start seeing
38:43 that stuff deployed onto mobile. I think even at Build, Microsoft's conference, they started
38:49 talking about edge machine learning, where the machine learning is getting pushed to all
38:55 these IoT devices that they're working on as well. So a lot of attempts in this area.
38:59 For sure. Yeah. And that's the next big thing, right? Is like having IOT based machine learning
39:04 devices. Like, can your fridge learn like your grocery consumption habits and, you know, suggest
39:10 tell you like you're going to run out of milk in two days and you're going to the store today. Maybe
39:13 you should pick some up. I mean, it's going to happen kind of crazy, but it totally will happen.
39:18 And yeah. Yeah. I mean, it doesn't sound as crazy as let's just let a car go drive in a busy
39:24 city on its own. That's true. And yet that's something that exists now, right? Like,
39:31 that's a thing. And maybe it's not fully autonomous, but I mean,
39:36 you could go and buy one like tomorrow, a car where you can turn on autopilot and
39:42 it will fully drive for you. It's crazy. So the future is here. It's just not
39:49 evenly distributed. This portion of Talk Python is brought to you by us. As many of you know,
39:56 I have a growing set of courses to help you go from Python beginner to novice to Python expert.
40:00 And there are many more courses in the works. So please consider Talk Python training for you and
40:05 your team's training needs. If you're just getting started, I've built a course to teach you Python
40:10 the way professional developers learn by building applications. Check out my Python jumpstart by
40:15 building 10 apps at talkpython.fm/course. Are you looking to start adding services to your app?
40:21 Try my brand new consuming HTTP services in Python. You'll learn to work with RESTful HTTP services,
40:27 as well as SOAP, JSON and XML data formats. Do you want to launch an online business? Well,
40:32 Matt McKay and I built an entrepreneur's playbook with Python for entrepreneurs. This 16 hour course will
40:38 teach you everything you need to launch your web-based business with Python. And finally,
40:42 there's a couple of new course announcements coming really soon. So if you don't already have an
40:46 account, be sure to create one at training.talkpython.fm to get notified. And for all of you who have bought
40:52 my courses, thank you so much. It really, really helps support the show. One little fact or a quote from
40:59 the Caffe webpage that I want to just throw out there because I thought it was pretty cool before we move
41:03 on. They say, speed makes Caffe perfect for research experiments and industry deployments. It can process 60 million
41:12 images per day on a single GPU. That's one millisecond per image for inference and four milliseconds per image for
41:20 learning. That's insane.
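That throughput claim is easy to sanity-check with a little arithmetic: 60 million images spread over the 86,400 seconds in a day works out to roughly 1.4 ms of wall-clock time per image, right in line with the quoted 1 ms inference / 4 ms learning figures.

```python
images_per_day = 60_000_000
seconds_per_day = 24 * 60 * 60            # 86,400 seconds in a day
ms_per_image = seconds_per_day / images_per_day * 1000
print(round(ms_per_image, 2))  # 1.44 ms of wall-clock budget per image
```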
41:21 So fast. And 60 million images per day is just like, it's crazy. And that's why we were talking
41:30 about the data just a minute ago. And the amount of data being poured into these models is just
41:36 staggering every day. And I don't doubt that people are probably feeding these models
41:42 that much data every day. And I think they were saying 90% of the world's data that's ever been
41:48 created has been created in the last year. And so it's just one of these things where it
41:53 accelerates and accelerates and builds on all this stuff. So I think these things are just going to
41:59 get faster until they're effectively real time.
42:01 Yeah, absolutely. All right. I don't think we said the stars for that one. 20,000 and 11,000 forks.
42:08 So up next is definitely one that data scientists in general just live on. And that's Jupyter.
42:15 For sure. And so this has just become like the standard interchange format for sharing data science,
42:24 whether it's papers or data sets or models. This has just become the sort of standard,
42:32 I don't know what you're going to call it, lingua franca for exchanging this data. And it's effectively
42:37 a tool for this thing called a Jupyter notebook, which is kind of like a web page with
42:43 embedded programs and embedded data sets. I think that's probably a good way to describe it for those
42:48 who might not have used it before.
42:49 Right. It's like instead of writing a blog post or a paper that's got a little bit of code,
42:54 then a little bit of description, then a picture, which is a graph, it's like live and you can re-execute
42:59 it and tweak it. And it probably plugs into many of these other libraries and it's using that
43:03 somewhere behind the scenes to do that.
43:06 Exactly. Yeah. It's built on the IPython kernel, that's the interactive Python kernel. Yeah. I'm
43:13 sure that there are all kinds of specific uses that can run that notebook code and use
43:19 that stuff there.
43:20 Cool. Next up is maybe one of the newer kids on the block in this deep learning story from Microsoft,
43:26 actually their Cognitive Toolkit, CNTK.
43:29 Yeah. And it's, they just released, I think the 2.0 version of it beginning of June or late May.
43:35 And, you know, now it's open source and it's, it's got the Python bindings and it's part of,
43:42 you know, Microsoft's been doing a lot of open source work lately and they've been, you know,
43:46 really, really pushing a lot of their own projects.
43:48 And, it's like we said earlier, it's available as a backend for Keras. So it's similar again to
43:56 TensorFlow and Theano in that it's, again, focused on that sort of low-level
44:00 computation as a directed graph. So, similar model. I think this is obviously emerging as a
44:07 popular and efficient way to represent machine learning models, using that directed graph.
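Choosing CNTK (or TensorFlow, or Theano) as the engine behind Keras really is just flipping a switch: in the Keras of that era you set the `backend` field in `~/.keras/keras.json`, or override it per run with the `KERAS_BACKEND` environment variable. A typical config file looks something like this; the other fields shown are the Keras defaults:

```json
{
    "backend": "cntk",
    "floatx": "float32",
    "epsilon": 1e-07,
    "image_data_format": "channels_last"
}
```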
44:11 So it's pretty popular too, right? It's got a decent number of stars and forks and obviously
44:17 as a Keras backend and Microsoft backed library, it's going to be pretty popular and pretty common
44:23 out there.
44:24 Yeah, absolutely. These days, you know, with, Satya Nadella and a lot of the changes at Microsoft,
44:29 I feel like this open source stuff is really taking a new direction, a positive one. And also I think
44:35 their philosophy is if it's good for Azure, it's good for Microsoft. And so this plugs into their
44:41 hosted stuff and interesting ways. And they've got a lot of like cognitive cloud services and things
44:47 like that.
44:47 Yeah. Azure is becoming pretty huge. It's like starting to rival maybe even AWS for, you know,
45:01 a lot of these cloud-hosted services. And especially around machine learning, Azure has so many
45:01 different machine learning tools available. And it's really clearly a pretty, pretty big focus for
45:08 Microsoft. And again, it's great to see, you know, more of the, you know, the sort of big guns being
45:14 more open about their development and sharing. I mean, it drives everybody forward and, and, you know,
45:19 just accelerates development across the whole ecosystem.
45:21 Yeah. And they have a number of the Python core developers there. They have Brett Cannon,
45:25 they have Steve Dower, they have, you know, Dino Viehland, like there's some serious people there working
45:30 on the Python part.
45:31 Exactly. Yeah. They've got a lot of the Python core team there. And, I know a bunch of the
45:37 guys from ActiveState were just at PyData in Seattle and, you know, a huge number of the core team
45:42 were there and, you know, it was just a really, really great little conference, talking about
45:48 Python and data science. Yeah. I think they have some really interesting language stuff as well.
45:52 So speaking of languages, certainly the longest-running one, probably, that's really
45:58 still going strong is NLTK, with 5,000 stars and 1,500 forks.
46:03 Yeah. And so NLTK is the Natural Language Toolkit. And, you know, obviously this is a thing
46:09 for doing natural language parsing, which is, I guess, one of the holy grails of machine
46:14 learning: to get it so you can just speak to your computer
46:19 in completely natural language, and maybe even give it instructions in natural language
46:23 and have it be able to follow your directions and understand what you're
46:28 asking. And so this is like a really popular one in academia for research. They link to and
46:35 include massive corpora of work. So that's like gigantic bodies of text in different languages
46:43 and in different styles to be able to train models. So there's also a pretty
46:48 large, like open data component to this project as well. And, obviously, you know, the use
46:54 case here for natural language is, you know, it's huge for translation. Like we mentioned earlier,
46:59 chatbots, which are now a huge thing for support. I mean, every website you go onto,
47:05 it pops up: hey, I'm Bob, can I help you today? And it's not really a
47:10 person. It's just a chatbot. And, you know, there's just so many. And then, like we were saying, Siri and
47:16 and Cortana and all those sort of personal assistants, where you can ask it a natural language question
47:22 and it can come back to you. So this is the sort of almost like foundational library still going
47:27 strong, still tons of active development and research going on with this. Yeah. It's really
47:32 cool. And especially with all the smart home speaker things, Google Home, HomePod, all that stuff.
47:37 This is just going faster, not slower, in terms of acceleration, right? We're
47:42 talking to them and interacting with them way more. Definitely the chatbots. And anytime you have
47:49 text and you want a computer to understand it, this is like a first step for tokenization,
47:55 stemming, tagging, parsing, semantic analysis, all that kind of stuff. Right? Yeah. And that's,
47:59 that's exactly what it outputs. So what it will do is generate parse trees and stem it all out, and
48:05 then you use the kind of tokenized version to train your model, not sort of raw text
48:12 characters. And we really are getting there. I mean, these days, for sure, just the
48:19 recognition part, you know, the tokenization part is very, very good. It's more like the kind of
48:24 semantic meaning. What do you mean when you ask Siri, what are the movie times for X, or
48:32 something like that? How specific do you have to be to get a reasonable answer from her? Yeah.
48:37 It's got to go speech to text and then it probably hits something like this. Exactly. Yeah, exactly.
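To make that pipeline concrete, here's a toy sketch in plain Python of the first two steps, tokenization and stemming. NLTK provides real, far more robust versions of both (e.g. `word_tokenize` and the Porter stemmer); this naive version is just an illustration of what those steps do.

```python
import re

def tokenize(text):
    # Break raw text into lowercase word tokens -- the first NLP step
    return re.findall(r"[a-z']+", text.lower())

def stem(token):
    # Naive suffix stripping; NLTK ships real stemmers like Porter
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

tokens = tokenize("What are the movie times showing tonight?")
print([stem(t) for t in tokens])
# ['what', 'are', 'the', 'movie', 'time', 'show', 'tonight']
```

A model trained on these normalized tokens sees "show" and "showing" as the same signal, which is exactly why you train on the tokenized, stemmed form rather than raw characters.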
48:41 That's going to hit a library like this and we're getting there. It's not quite at the Star Trek
48:46 computer do this for me, but it's like way closer than I kind of ever thought we would
48:51 be. It's really pretty impressive sometimes. Yeah, absolutely. It's, it's fun to see this
48:57 stuff evolve. Absolutely. All right, Pete, that's the 10 libraries. And I think these are all really
49:02 great selections and hopefully people have got a lot of exposure and maybe learned about some
49:09 they didn't know about. And I guess I'd encourage everyone to go out there and try these
49:13 out, and if you've got an idea, play with it in one of them or more. For sure. They're all so accessible
49:17 now. You know, you don't necessarily have to be an ML researcher or a math wizard to actually create
49:27 something that's interesting or experiment or learn a little bit. These libraries all do a really,
49:32 really great job of abstracting away some of the more complicated mathematical parts. And,
49:39 you know, in the case of a lot of them making it reasonably accessible. And so that's where I think
49:45 you're seeing this kind of like democratization trend in machine learning now where this stuff is
49:51 becoming more accessible. It's becoming easier. And I think you're going to see a lot of creativity and a
49:56 lot of innovation come out of people if they just sort of give it a shot and try something out and,
50:01 you know, learn something new.
50:03 Yeah, that's awesome. I totally agree with the democratization of it. And that's also happening
50:07 from a computational perspective, right? Like these are easier to use, but also with the GPUs
50:12 and the cloud and things like that, it's a lot easier. You don't need a supercomputer. You need 500
50:18 bucks or something for a GPU.
50:20 Exactly. I think all of these sort of things feed into that together, where you have a
50:25 democratization trend in the tools and the source code, so that now you can have access to Google's
50:33 years and years of AI research via TensorFlow on GitHub. You also, like you said, can go and buy a
50:40 $500 GPU and have basically a supercomputer on your desktop, but also this open data component where
50:48 you can get access to massive data sets like the Stanford image library and, you know, these huge
50:56 NLTK like language corpora that you can then use to train your models where previously that was probably
51:03 impossible to actually access.
51:05 Yeah, that's a really good point because even though you have the machines and you have the algorithms,
51:09 the data, data really makes it work. All right. So I think let's leave it there for the library.
51:14 So those were great. And now I'll hit you with the final two questions. When you're going to write some
51:20 Python code, what editor do you open up?
51:22 Well, obviously ActiveState has Komodo. So I tend to use that a lot for writing Python code, but I've also,
51:30 to be totally fair. I have used VS Code as well, which is getting increasingly popular. So I tend to like
51:36 to cycle between them all because we have an editor product. And so, you know, it's great to keep up to
51:42 date on what all the other ones are doing. So I tend to cycle around a little bit, but yeah, like
51:48 Komodo is sort of my go-to.
51:50 Yeah, that's cool. Yeah. It's definitely important to look and see what the trends are, what other
51:54 people are doing, how can you bring this cool idea back into Komodo, things like that, right?
51:57 Yeah, for sure.
51:58 Yeah.
51:58 All right. And I think we've already hit 10, but do you have another notable PyPI package?
52:03 I don't know. There's so many. I would again probably give a little bit of a shout
52:08 out, since we're talking about machine learning, to Keras, because I do think as an entry
52:14 point to machine learning, it's so accessible. It's so easy to at least get started and get a result with.
52:22 I would give a little shout out to that, that I think that if you're looking to get into this and
52:25 you're looking to try it out, that's a really great place to start.
52:28 Yeah, I totally agree with you. That's, that's where I would start as well.
52:31 All right. Well, it's very interesting to talk about all these libraries with you. I really
52:36 appreciate you coming on the show and sharing this with everyone. Thanks for being here.
52:39 Thank you for having me.
52:40 You bet. Bye.
52:41 This has been another episode of Talk Python to Me. Our guest has been Pete Garcin,
52:47 and this episode has been brought to you by DataCamp and us right here at Talk Python Training.
52:52 Want to share your data science experience and passion? Visit datacamp.com slash create
52:58 and write a course for a million budding data scientists.
53:01 Are you or a colleague trying to learn Python? Have you tried books and videos that just left
53:06 you bored by covering topics point by point? Well, check out my online course,
53:10 Python Jumpstart by Building 10 Apps at talkpython.fm/course to experience a more engaging way to
53:16 learn Python. And if you're looking for something a little more advanced, try my Write Pythonic Code
53:21 course at talkpython.fm/pythonic. Be sure to subscribe to the show. Open your favorite podcatcher
53:28 and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes,
53:33 Google Play feed at /play, and direct RSS feed at /rss on talkpython.fm.
53:39 This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it.
53:44 Now get out there and write some Python code.
53:46 Thank you.