
#131: Top 10 machine learning libraries Transcript

Recorded on Thursday, Jul 20, 2017.

00:00 Data science has been one of the major driving forces behind the explosion of Python in recent

00:04 years. It's now used for AI research, it controls some of the most powerful telescopes in the world,

00:09 it tracks crop growth and prediction, and so much more. But with all this growth,

00:14 there's an explosion of data science machine learning libraries. That's why I invited Pete

00:18 Garcin onto the show. He's going to share his top 10 machine learning libraries for Python.

00:23 After this episode, you should be able to pick the right one for the job.

00:27 This is Talk Python to Me, recorded July 20th, 2017.

00:31 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem,

00:50 and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy.

00:55 Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter

01:00 via at Talk Python. Talk Python to Me is partially supported by our training courses. Here's an

01:06 unexpected question for you. Are you a C-sharp or .NET developer getting into Python? Do you work at a

01:12 company that used to be a Microsoft shop, but is now finding their way over to the Python space?

01:18 We built a Python course tailor-made for you and your team. It's called Python for the .NET

01:23 developer. This 10-hour course takes all the features of C-sharp and .NET that you think you

01:29 couldn't live without. Entity Framework, lambda expressions, ASP.NET, and so on. And it teaches

01:34 you the Python equivalent for each and every one of those. This is definitely the fastest and clearest

01:39 path from C-sharp to Python. Learn more at talkpython.fm/.NET. That's talkpython.fm slash

01:46 D-O-T-N-E-T.

01:48 Pete, welcome to Talk Python.

01:50 Thanks. I'm happy to be here.

01:52 That's great to have you here. And I've done a few shows on machine learning and data science,

01:58 but I'm really happy to do this one because I think it's really accessible to everyone. We're

02:02 going to bring all these different libraries together and kind of just make people aware of

02:06 all the cool things that are out there for data science, machine learning.

02:09 Yeah, it's really crazy actually how many libraries are out there and how active the development is on

02:15 all of them. There's new contributions, new developments all the time. And it seems like

02:20 there's new projects popping up like almost daily.

02:22 Yeah, it's definitely tough to keep up with, but hopefully this adds a little bit of help

02:27 for the reference there. But before we get into all these libraries, let's start with your story.

02:30 How did you get into programming in Python?

02:32 I started programming at a pretty young age, like sort of back before Stack Overflow and

02:37 things like that existed. And I sort of mostly made games. I started with basic, like most people

02:43 probably from a certain age, and then worked into working on Pascal and was making games for my BBS

02:50 back in the day, making online games, utilities and stuff like that. And then for Python: later, I

02:56 worked in games for a long time. And when I worked in games, we were doing like tool automations, like build

03:04 automation, certain workflow automation, build pipelines, all that kind of stuff. So Python was something, a tool that we

03:10 used quite a lot there. So that was where I got my start with Python.

03:15 Oh, yeah, that's really cool. Python is huge in the workflow for games and movies, way more than people on the outside

03:21 realize, I think. Yeah, especially for artists. So like a lot of the tools have Python built into them.

03:26 And so artists will use it for like automating model exports or rigging and that kind of stuff. So it's pretty

03:35 popular in that sense. And then also still even just for like building assets for games.

03:40 Okay, I'm intrigued by your BBS stuff. It occurs to me and it's kind of crazy. There may be younger

03:46 people listening that don't actually know what a BBS is. Okay, so a BBS is short for bulletin board system.

03:54 And it was sort of like in a way the precursor to the internet where you used to host what is

04:02 effectively sort of a website on your home computer. And people would like call your phone number.

04:08 And you'd have it hooked up to your modem. And they would like call your phone number,

04:12 connect to your home computer. So in my case, it was like my computer that I played games on and

04:18 did my homework on and that kind of stuff. And they could connect and send messages to each other

04:24 and download files and play games, very simple games, that kind of thing. So it was like a...

04:30 Yeah, it was so fun.

04:31 Yeah, it was awesome. And like, I really, really enjoyed it. And you had a thing called like echo

04:36 mail back then, which was like this sort of way of like transferring messages all over the world. So,

04:41 you know, somebody would send a message on your BBS, and then it would like call a whole bunch of

04:45 others like in this network. And then somebody in like Australia might answer it. And it would take

04:50 like days to get back because it would be like this chain of people's BBSes calling the next one. So

04:56 yeah, there was no internet. It was the craziest thing. Like, we at our house, my brother and I

05:01 had talked my dad into getting us multiple phone lines so we could work with BBSes like in parallel.

05:08 And you would send these mails and like at night, there would be like a, like a coordination of the

05:14 emails across the globe as these things would like sync up the emails they got queued up. It was the

05:19 weirdest thing. But I loved it. I don't know, what's it? Trade Wars or Planet Wars? One of those games. I

05:23 really loved it.

05:24 For sure. I'm like a huge Trade Wars fan. You can actually play it now. Like there are people who have

05:29 it set up on like websites that have like simulated Telnet stuff. And you can, you can play versions of Trade

05:35 Wars, which I have done recently just to like, don't tell me that you're going to ruin my productivity

05:40 for like the whole day. Yeah. You'll be after this. You'll be like, Oh, Trade Wars 2002. Is it still,

05:46 it's still around? People still play it, but it was such a good game. It's fantastic. Yeah, it was

05:51 fantastic. It's awesome. All right. So that's how you got into this whole thing. Like what do you do

05:56 today? You work at ActiveState, right?

05:58 I do. Yeah. I'm a dev evangelist at ActiveState. So generally that means working with developers,

06:03 language communities, trying to make our distributions better. So at ActiveState,

06:06 we do language distributions. Probably a lot of people in the Python space know us that we do

06:11 ActivePython and it's been around for a long time. We were a founding member of the Python Software

06:16 Foundation. And so ActiveState has a pretty long history in the Python community. And before that,

06:22 we were, people probably know us from Perl and now we have a Go distribution and Ruby beta coming out

06:28 soon. So we're sort of expanding to all these different dynamic language ecosystems.

06:34 Sure. That's awesome. So I know that maybe people are a little familiar with some of the advantages of

06:40 these higher order distributions for Python, but maybe give us a sense of like, what is the value of

06:46 these distributions over like going and grabbing Go or grabbing Python and just running with that?

06:50 I think that you've got obviously this sense of curated packages. So there are, you know, in the

06:58 Python distribution, there's like over 300 packages. And so you know that they're going to build, know they're

07:02 going to play nice with each other, know that they have current stable versions, all that kind of stuff.

07:06 And then additionally, you can buy commercial support. So for a lot of our customers, so we have a lot of like

07:13 large enterprise customers, they can't actually adopt a language distribution or a tool like that

07:19 without commercial support. They need to know that somebody has their back. And so that's something

07:24 that we offer on these language distributions for those large customers. But for the community and for

07:29 individual developers, then that is something that having that curated set of packages that you know

07:36 is going to work, that you know is going to play nice. And also, maybe as a development team

07:40 lead, you might want a unified install base, so that all your developers have the same development

07:47 environment and, you know, it's all going to play nice. And so that's one of

07:51 the advantages of those.

07:51 That's really cool. Certainly the ability to install things that have weird compilation stuff. Do you guys

07:57 ship the binaries like pre built for that? So I don't have to have like a Fortran compiler or

08:02 something weird?

08:03 Exactly. Yes. So they're all pre built, all pre compiled. So I mean, a lot of people depending

08:08 on what platform you're on, like on Windows, you might not even have a C compiler installed and a

08:12 lot of packages are C based. And so they're pre-built, and like you said, you don't need a

08:17 Fortran compiler or some, some exotic build tool to actually make it work. It just works out of the box.

08:23 Yeah. Okay, that's really awesome. And ActivePython is free. If I'm like a random person, not a huge

08:29 corporate that wants support.

08:31 Exactly. Yeah. If you're just a, you know, a developer, it's free to download and free to use.

08:37 And it even if you are, you know, a large corporation, it's free to use in non production settings. So on your

08:43 own. So it's, you know, you can go and just download it, try it out, see if it works for you.

08:47 Okay, yeah, that sounds awesome.

08:49 How many of the 10 libraries we're going to talk about would come built in? Do you know off the top

08:54 of your head?

08:54 I think that actually almost all of them, but maybe I think Caffe is on the list. It's not in the current

09:03 one, but it is on the list to be included. So I think actually like pretty much all of the other

09:08 ones, except maybe CNTK, which is still really new. But you know, we are targeting

09:15 to have as many of these as we possibly can. And so pretty much most of them are included.

09:20 That's awesome. So all the libraries that we're talking about, like one really nice way to just

09:23 get up to speed with them would be grab ActivePython, and you'd be ready to roll.

09:28 Exactly. Yeah, awesome.

09:29 Grab them, install them, you're ready to roll right out of the gate.

09:33 Cool. All right. So let's start at what I would consider the foundation of them. The first library

09:40 that you picked, which is NumPy and SciPy.

09:42 Absolutely. And they are foundational in the sense that a lot of other libraries either

09:47 depend on them or are in fact built like on top of them. Right. So they're, they are sort

09:54 of the base of a lot of these other libraries. And most people might have worked with

09:59 NumPy. Its main feature is that sort of n-dimensional array structure that it

10:05 includes. And a lot of the data that is shipped to a lot of the other libraries is either supported

10:11 that you can send it a NumPy array, or it requires that you format it that way. So especially

10:17 when you're doing machine learning, you're doing a lot of matrices and a lot of like higher dimensional

10:22 data, depending on how many features you have. It's a really, really useful data structure to have

10:28 in place.

10:29 Yeah. So NumPy is this sort of multi-dimensional array-like thing that stores and

10:37 does a lot of its processing down at the C level, but of course has its programming API in Python,

10:44 right?

10:44 Yes. Yeah, exactly. And a lot of these machine learning libraries do tend to have C level,

10:51 like lowest level implementations with a Python API. And that's predominantly for speed. So when

10:59 you're doing tons and tons and tons of calculations, and you need them to be really, really lightning fast,

11:05 that's the primary reason that they do these things, you know, sort of at the C level.
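To make that concrete, here's a minimal sketch (the array values are made up for illustration) of the n-dimensional array and the vectorized, C-speed operations it enables:

```python
import numpy as np

# A tiny "dataset": 3 samples with 4 features each.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0],
              [9.0, 10.0, 11.0, 12.0]])

print(X.shape)  # (3, 4) -- an n-dimensional array knows its shape

# Vectorized math: one expression operates on every element at C speed,
# with no Python-level loop.
centered = X - X.mean(axis=0)    # subtract the per-feature mean
print(centered.mean(axis=0))     # approximately [0. 0. 0. 0.]
```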

11:09 All right, absolutely. And so related to this is SciPy. They're kind of grouped under the same

11:15 organization, but they're not the same library exactly, are they?

11:18 No. So SciPy is like a more scientific mathematical computing thing. And it has the more advanced like

11:26 linear algebra and like Fourier transforms, image processing, it has like a physics calculation

11:32 stuff built in. So most like scientific numerical computing functionality is built into SciPy.

11:38 I know that NumPy does have like linear algebra and stuff in it. But I think that the preferred is

11:43 that you use SciPy for all that kind of linear algebra crunching.
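As a small example of the kind of linear algebra crunching meant here (the matrix values are arbitrary), SciPy can solve a linear system directly:

```python
import numpy as np
from scipy import linalg

# Solve the linear system A @ x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = linalg.solve(A, b)
print(x)                      # [2. 3.]
print(np.allclose(A @ x, b))  # True -- the solution checks out
```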

11:47 Okay, yeah. So a lot of these things that we're going to talk about will somewhere in them have as

11:57 a dependency or an internal implementation of some variation, or even maybe in its API,

11:57 like the ability to pass between them, these NumPy arrays and things like that.

12:02 Absolutely.

12:02 Yeah. One other thing that's worth noting, that's pretty interesting. And I think this is a trend

12:09 that's growing. Maybe you guys have more visibility into it than I do. But NumPy, on June 13th, 2017,

12:16 so about a month ago at the time of the recording, received a $645,000 grant for the next two years to

12:24 grow it and evolve it and keep it going strong. That's pretty cool.

12:28 It is very cool. And I think that you're starting to see that these open source projects are really

12:34 forming the backbone of most of the machine learning research and actually implementation that you're

12:40 seeing out there in the world. There's not a lot of sort of more closed source behind trade secret

12:45 stuff. A lot of the most bleeding edge development and active development is happening in these open

12:50 source projects. So I think it's great to see them receiving funding and sponsorship like that.

12:55 Yeah, I totally agree. And it's just going to mean more good things for the community and all these

12:59 projects. It's really great to see. One thing I want to touch on for every one of these is to give

13:04 you a sense of how popular they are. And for each one, we'll say the number of GitHub stars and forks.

13:11 And that's not necessarily the exact right measure for the popularity, because, like,

13:17 obviously NumPy is used across many of these other things which have more stars, but people don't

13:22 necessarily contribute directly to NumPy, and so on. But NumPy has about 5,000 stars and 2,000

13:30 forks to give you a sense of how popular it is. The next one up, scikit-learn has 20,000 stars and 10,000

13:37 forks. So tell us about scikit-learn. Scikit-learn is, again, like we mentioned before, is a thing

13:44 that's built on top of scipy and NumPy and is a very popular library for machine learning in Python.

13:50 And I think it was one of the first, if not the first, I'm not 100% sure, but it's been around for

13:56 quite a long time. And it supports a lot of the sort of most common algorithms for machine learning.

14:03 So that's like classification, regression tools, all that kind of stuff. I actually just saw like a blog post

14:08 come up in my feed today where Airbnb was using scikit-learn to do some kind of like property value

14:16 estimation or something using machine learning. So it's being used very, very widely in a lot of different

14:22 scenarios.

14:22 Oh yeah, that sounds really cool. It definitely is one of the early ones. And it's kind of simpler in the

14:30 sense that it doesn't deal with all the GPUs and parallelization and all that kind of stuff. It just,

14:36 it talks about classification, regression, clustering, dimensionality reduction, and modeling, things like that,

14:42 right?

14:42 Yes, that's right. It doesn't have GPU support. And that can make it a little bit easier to install if

14:48 you, you know, sometimes the GPU stuff can have a lot more dependencies that you need to install to make

14:52 it work. Although that's getting better in the other libraries. And it's like you say,

14:58 it is made and sort of designed to be pretty accessible and pretty easy, you know, because

15:02 it has the sort of baked in algorithms that you can just say, oh, I want to do this and it will

15:07 crunch out your results for you. So I think that that's sort of the sort of ease of use and the sort

15:13 of cleanliness of its API has contributed to its sort of longevity as a, one of the most popular

15:19 machine learning libraries.

15:21 Yeah, absolutely. And obviously, scikit-learn being part of the whole SciPy family, it's built

15:27 on NumPy, SciPy, and matplotlib.

15:30 Yes. Yes. So yeah, it includes interfaces for all that stuff and for like graphing the output and

15:36 using matplotlib and yeah, using numpy for inputting your data and for getting your data results,

15:43 all that kind of stuff.
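A minimal sketch of that "baked-in algorithms" workflow (the dataset and classifier choices here are just illustrative, not from the Airbnb example):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a classic toy dataset and fit one of scikit-learn's
# built-in classification algorithms.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)            # "crunch out your results"
print(clf.score(X_test, y_test))     # accuracy on held-out data
```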

15:44 Yeah. Very cool. All right. Next up is Keras at 17.7 thousand stars and 6,000 forks.

15:52 So this one is for deep learning specifically, right?

15:56 Yeah. And so this is for doing rapid development of neural networks in Python. It's one of the

16:06 newest ones, but it's really, really popular. I've had some experience working with it directly

16:12 myself and I was sort of really, really blown away by how simple and straightforward it is.

16:18 So there's like, it creates a layer on top of lower level libs like TensorFlow and Theano and lets you

16:25 just sort of define, I want my network to look like this. So I want it to have this many layers and this

16:32 many nodes per layer. And here are the activation functions. And, you know, here's the optimization

16:37 method that I want to use. And you sort of just define this effectively a configuration,

16:42 and then it will build all of the graph for you, depending on what backend you used.

16:48 And so it's very, very easy to experiment with the like shape of your network and with the different

16:56 activation functions. So it lets you kind of really quickly reach and test, you know, different models

17:04 to see which one works better and to sort of see which one works at all. So it's really easy to use

17:10 and really very effective. I used it to build a little game demo where we like had an AI where

17:18 I trained an AI to play against you to determine when it could shoot at you.
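This is not the actual demo code, but a hypothetical sketch of the "dozen lines" style of network definition described above; the layer sizes and the `tensorflow.keras` import are my assumptions:

```python
from tensorflow import keras

# Define the network as a stack of layers -- effectively a configuration:
# how many layers, how many nodes per layer, which activation functions.
model = keras.Sequential([
    keras.Input(shape=(4,)),                       # 4 input features
    keras.layers.Dense(16, activation="relu"),     # hidden layer
    keras.layers.Dense(16, activation="relu"),     # another hidden layer
    keras.layers.Dense(1, activation="sigmoid"),   # e.g. shoot / don't shoot
])

# Pick the optimization method, and the backend builds the graph for you.
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```

From here, `model.fit(...)` on training data does the heavy lifting.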

17:22 Was this the demo you had at PyCon?

17:25 It is. Yeah. Yeah. And so we had that demo at PyCon. I since did a blog post about it a little bit.

17:31 And then I actually just recently rewrote it in Go for GopherCon too. So eventually it will be open sourced

17:38 so that people can see. But one of the things that you really notice is that the actual like code for

17:44 Keras to basically define the network and do the sort of machine learning heavy lifting part is very,

17:52 very minimal, like a dozen lines of code or something like that. It's really surprising because you think

17:57 it's like a ton of work, but it makes it super easy. Yeah, that's really cool. And it sounds like

18:02 its goal is to be very easy to get started with. I like the idea of the ability to switch out the

18:09 backend from say TensorFlow to CNTK to Theano. How easy is it to do that? Like, could I run some

18:18 machine learning algorithms and say, let's try it in TensorFlow and say, do some performance benchmarks

18:24 and stuff? No, no, let's switch it over to Theano and try it here and kind of experiment rather than

18:29 completely rewriting in those various APIs. Exactly. You literally, it's just a configuration

18:34 thing. It's almost like a tick box essentially, you know, like it's so easy.
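For example, with the Keras of that era, you could pick the backend through a single environment variable (or the `backend` field in `~/.keras/keras.json`), set before importing the library; the model-definition code itself doesn't change:

```python
import os

# Select the backend BEFORE keras is imported; "tensorflow", "theano",
# and "cntk" were the supported values around the time of this episode.
os.environ["KERAS_BACKEND"] = "theano"

# From here, `import keras` would build its graphs on Theano instead of
# TensorFlow, with no changes to the model code.
print(os.environ["KERAS_BACKEND"])
```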

18:40 And so that is absolutely one of the, I think the driving key features of that library that you can

18:47 just pick whichever one suits your purpose or your platform, you know, depending on what's available

18:52 on the platform that you're building for. Cause currently there's not TensorFlow versions for

18:57 every platform on every version of Python and all that kind of stuff. Right. Okay. Well, that's,

19:01 that's pretty cool. So there's two interesting things about this library. One is the fact that it does

19:07 deep learning. So maybe tell people about what deep learning is. How does that relate to like

19:13 standard neural networks or other types of machine learning stuff?

19:17 Well, I think the sort of the simplest way to put it is the idea of like adding these additional layers

19:24 to your network to create a more sophisticated model. So that allows you to create things that can take

19:35 more sophisticated feature domains and then map those to an output more reliably. So, and that's where

19:44 you've seen a lot of advances, for instance, like in like a lot of the image recognition stuff that

19:48 leverages deep learning to be really, really good at identifying images or even doing things like

19:55 style transfer on images where you have a photograph of some scene and then you have some other photograph

20:03 and you're like, I want to transfer the style of the evening to my daytime photograph. And it will just

20:09 do it and it looks like pretty normal. And those are like the most, I guess, popular, common,

20:15 deep learning examples that you see cited.

20:18 Yeah, it makes a lot of sense. And you know, it's easy to think of these as being like, I don't

20:22 know, Snapchatty, sort of superfluous type of examples, you know, machine learning

20:29 doing them, like putting the little cat face on or switching faces or whatever. But,

20:35 you know, there's real meaningful things that can come out of this. Like, for example,

20:40 the detection of tumors in radiology scans, and things like that. And these deep learning models

20:48 can do the image recognition on that and go, yep, that's cancer, you know, maybe better than even

20:53 radiologists can already. And then in the future, it's gonna get crazy.

20:57 Exactly. And it's funny you mention that. Stanford Medical, about a month ago,

21:01 month and a half ago, actually released like, I don't know how many, like 500,000 radiology scans

21:07 that are like annotated and ready for training machine learning. So that exact use case is intended

21:14 to be like a deep learning problem to be applied. And there are all kinds of additional

21:21 datasets like these coming out. I just saw a post this week about a deep learning model that was

21:27 measuring heart monitor data and being more effective than cardiologists, kind of thing. So

21:33 It's really crazy. You think of this AI and automation disrupting low end jobs, right? Like,

21:39 at McDonald's, we might have robots making our hamburgers or something silly like that. But if they start

21:46 cutting into radiology and cardiology, that's gonna be a big deal.

21:52 It absolutely is gonna be a big deal. I think people probably need to start thinking about it. I don't think

21:57 it's necessarily a complete replacement thing. It's not, you know, the radiologist AI can't talk to you

22:04 yet, I guess. Well, wait till we get to NLTK. But it can definitely augment and lighten the load

22:12 on professions like medicine that are, you know, perpetually overworked and allow them to be more

22:18 effective, you know, human doctors. So I think like as tools, these things are going to be absolutely

22:22 incredibly revolutionary. Yeah, it's gonna be amazing. You know, do you want a second opinion?

22:27 Let's ask, let's ask the super machine.

22:29 Exactly. But I mean, one of the strengths of all these machine learning models

22:35 is that they're able to visualize higher dimensional, complex data sets

22:43 in ways that like humans can't really do. And they have like just intense focus, I guess,

22:50 right? These models, whereas it might be, it's pretty hard for a doctor to read every single paper ever

22:56 written on subject X or to look at 500,000 radiology images even across the course of their career.

23:03 So I'm pretty optimistic about where this goes. It's going to be interesting to join all this stuff together.

23:08 The other thing that we're just starting to touch on here, and it's going to appear in a bunch of

23:13 these others, so maybe worth spending a moment on as well: Keras lets you basically seamlessly switch

23:21 from CPU computation to GPU computation. So maybe not everyone knows like the power of non-visual GPU

23:30 programming. Maybe talk about that a bit.

23:32 For sure. So your GPU, which is a graphics processing unit. So, you know, if you have a gaming PC at home,

23:38 and you have like, you know what I mean, an NVIDIA graphics card or an ATI card.

23:43 Can run the Unreal Engine like crazy or whatever, right?

23:46 Oh, exactly. So if you have if you play games, and you have a dedicated graphics card, you well,

23:50 even without a dedicated graphics card, but you have a GPU, and there's this thing called general purpose GPU

23:56 programming. So originally, like, a GPU is a highly parallel computer that has like 1,000 cores in it,

24:03 or whatever, some huge number of cores. Yeah, like one to four or five thousand cores per GPU, right?

24:09 Exactly. Yeah. And so like the intention there was originally that it's because it needs to, in parallel process

24:15 every pixel, or every polygon that's going on the screen, right, and perform like effects. So that's why you can get

24:21 like blur and all this kind of stuff in real time, and real time lighting and all that kind of stuff. So it process

24:28 all that stuff in parallel. But then people started to develop SDKs that let you, like, well, in addition to

24:36 doing graphics programming, we can just run regular programs on these things. And they're really, really

24:41 fast at doing math. So we can do that. And so now, basically, a lot of these libraries

24:49 support GPU processing, and it's literally just like a compile flag. Now it's getting a lot easier, you know,

24:54 you still have to make sure you have the drivers and that you, you know, have a GPU that's reasonably

24:58 powerful, especially if you're doing a lot of computation. And so then you can basically run

25:05 these giant ml models on your GPU. And again, it's something that's pretty, pretty well suited to

25:13 being parallelized. So that is a really great use of the GPU. And that's why you're seeing it take off,

25:19 because these models are easily made parallel. Yeah, they're what are called embarrassingly parallel

25:25 algorithms, right? And just throw them at this, these things with 4000 cores and let them go crazy.

25:29 Yeah, the early days, I mean, still, I guess, when you're doing DirectX or OpenGL, or these things,

25:35 like, it's really all about I want to rotate the screen. So that's like a matrix multiplication

25:39 against all of the vectors. And it's really similar, actually, the type of work it has to do.
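That "rotating the screen is a matrix multiplication against all of the vectors" idea can be sketched with NumPy (the points and angle here are arbitrary):

```python
import numpy as np

# Rotating 2D points by 90 degrees is one matrix multiplication applied
# to every vertex at once -- the same shape of work (big matrix ops over
# many elements) that ML workloads hand to the GPU.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

points = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])      # three vertices

rotated = points @ R.T               # rotate every vertex in one go
print(np.round(rotated, 6))          # (1,0)->(0,1), (0,1)->(-1,0), (1,1)->(-1,1)
```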

25:45 The other thing, I guess, which I don't see appearing anywhere in here, but I suspect

25:49 TensorFlow may have something to do with it, is the new stuff coming from Google, where they have

25:54 like going beyond GPUs for like, AI focused chips. Did you hear about this?

26:00 Yes. So Google has a thing called a TPU, which is a tensor processing unit. And

26:07 that's like a cloud-hosted, special piece of hardware that's optimized for doing TensorFlow.

26:12 And so I don't know the exact benchmarks in terms of how that compares to, you know, like some gigantic

26:20 GPU assembly. But obviously, Google thinks that this is a worthwhile investment to build these sort of

26:27 hardware racks in the cloud, and then give people access to run their models on there. So I think

26:32 you're probably going to see more and more specialized, ML-targeted hardware coming out. I

26:40 don't know whether it'll be consumer hardware, like something you can go and buy

26:43 for your home computer, but especially in the cloud, you definitely will.

26:48 Yeah, definitely in the cloud. Yeah, it's very interesting. They were talking about real time

26:52 training, not just real time answers. So that sounds pretty crazy.

26:55 This portion of Talk Python to Me has been brought to you by DataCamp. They're calling all data scientists

27:02 and data science educators. DataCamp is building the future of online data science education.

27:07 They have over 1.5 million learners from around the world who have completed 70 million DataCamp

27:13 exercises to date. Learners get real hands-on experience by completing self-paced, interactive

27:19 data science courses right in the browser. The best part is these courses are taught by top data science

27:23 experts from companies like Anaconda and Kaggle and universities like Caltech and NYU. If you're a data

27:29 science junkie with a passion for teaching, then you too can build data science courses for the masses and

27:33 supplement or even replace your income while you're at it. For more information on becoming

27:37 an instructor, just go to datacamp.com slash create and let them know that Michael sent you.

27:42 So speaking of popular libraries and TPUs, the next up is TensorFlow. That originally came from

27:50 Google and it is crazy at 64,000 stars and 31,000 forks. So tell us about TensorFlow.

27:56 So TensorFlow, obviously, yeah, this is Google's machine learning library, and it sits at a

28:02 slightly lower level than something like Keras; like, obviously it's used as a backend.

28:07 You can use it directly as well. And what it does is it represents your model as a computation graph.

28:14 So that's effectively a graph where the nodes are like operations. And this is a way that they found

28:22 is really, really effective to represent these models. And it's a little bit more intimidating to

28:28 get started with mostly because you have to think about building this graph, but you can use it directly

28:34 in Python. Python is actually the recommended language and workflow from Google. So for example,

28:40 you know, when I rewrote the Go version of our little game there, I still had to train and export my model

28:46 from Python. So I use Python to build that, export it. So that's the sort of recommended workflow

28:52 currently from Google for many languages is to use Python as the primary language binding.
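The computation-graph idea described above can be illustrated with a toy sketch in plain Python; to be clear, this is not TensorFlow's actual API, just the concept of nodes-as-operations, where the graph is built first and evaluated later:

```python
# Toy computation graph: nodes are operations; building the graph
# computes nothing, evaluation walks the dependencies.
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def eval(self):
        # Evaluate dependencies first, then apply this node's operation.
        return self.op(*(n.eval() for n in self.inputs))

def const(value):
    # A leaf node that just produces a constant.
    return Node(lambda: value)

# Build the graph for (2 + 3) * 4 -- nothing is computed yet.
graph = Node(lambda a, b: a * b,
             Node(lambda a, b: a + b, const(2), const(3)),
             const(4))

print(graph.eval())  # 20 -- computation happens only when the graph runs
```

In TensorFlow itself, that deferred-graph representation is what makes it possible to optimize the model and ship it to CPUs, GPUs, or mobile devices.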

28:56 Yeah, that's really interesting, and great to see Python appear in so many of

29:01 these libraries as a primary way to do it. So there's some interesting stuff about this one.

29:08 Obviously it's super popular. Google has so many use cases for machine learning, just up and down

29:15 their whole, you know, everything that they're doing. So having this like developed internally is really

29:20 cool. It has a flexible architecture that lets it run on CPUs or GPUs, obviously, or mobile devices.

29:28 And it even lets it run like on multiple GPUs and multiple CPUs. Do you have to do anything to make

29:35 that happen? Or do you know how it does that? As far as I can tell, especially for this

29:40 switching between CPU and GPU, it's essentially a compile flag. So when you build

29:45 the libraries or download one of the nightly builds or whatever, you have to get one of

29:51 the versions that has GPU support built in. And I think that there are

29:57 also now increasingly like CPU optimizations in there. So like for instance, Intel is doing hand

30:04 optimized math kernel stuff that's integrated directly into TensorFlow to make it even faster. So that that's

30:12 something that you can also get in like the latest version as well. So I definitely think speed and

30:18 performance and making that stuff easily accessible to depending on what your hardware is and where

30:23 you're going to deploy it is a big focus for them. Yeah, that's really cool. So do you think this is

30:30 running in the Waymo cars, you know, the Google self driving cars?

30:33 Yeah, I mean, I don't know for sure, but I'd be almost positive of it, you know, from everything that

30:37 I've read and people that I've talked to. I mean, Google built this to be

30:42 the platform for all of their deep learning and machine learning

30:48 projects. And so I would assume that TensorFlow is powering that and it's running pretty

30:53 much all of their stuff. Very, very cool. It's probably in Google Photos and some other

30:58 things as well. Yeah, Google Translate, all those things. You know,

31:03 pretty much all of the projects that Google is running, when you start looking at them, are

31:08 effectively AI projects. And just recently,

31:16 Google Translate, which uses machine learning and statistical models to do the

31:22 translations, has been approaching human-level accuracy for translation between a lot of the popular

31:27 languages where they have huge, huge data sets to pull from. Yeah, that's crazy. And very

31:32 cool. So up next, number five is Theano at 6000 stars and 2000 forks. And this one is really kind

31:40 of similar to TensorFlow, but really low level, right? Yeah, so it is, you know, more low level,

31:46 and it is very similar to TensorFlow in the sense that it's also a very high-speed math

31:51 library. And I believe it was originally made by a couple of the people who then

31:56 went on to Google to make TensorFlow. So it predates TensorFlow by a little bit. But it also has,

32:01 you know, the things we're talking about here: it has transparent GPU use. And you can do

32:08 things like symbolic differentiation, and a lot of like mathematical things, mathematical operations

32:13 that you want to be highly, highly performant. So it is actually pretty similar to what TensorFlow does,

32:21 and sort of serves a similar purpose. But depending on what you're comfortable with, and what your maybe

32:27 existing projects are, then that is probably going to dictate which one you're using. And if you're

32:32 using something like Keras, then you can just choose this as the backend and

32:36 just flip the switch. And there you go. Yeah, it's cool. It also says it has extensive unit testing and

32:42 self verification where it'll detect and diagnose errors, maybe you've set up your model wrong or

32:47 something like that. That's pretty cool. That's pretty cool. Yeah, for sure. I mean, all of these

32:51 libraries are built by super, super smart, accomplished people who are creating things

32:58 that are, you know, solving a real world problem for them and really, you know, sort of pushing things

33:04 forward. And I actually think it's great that there's so many, so many libraries in this space,

33:09 because it really is just making it better for everybody. Yeah, the competition is really cool

33:15 to see the different ways to do this and probably cross pollination. Exactly. Yeah. Yeah. So one of

33:22 the things you have to do for these models is feed them data. And getting data can be a super messy

33:27 thing. And the one library that stands out above all the others at taking, transforming,

33:34 and cleaning up data is pandas, right? Absolutely. Yeah. Pandas is one of

33:41 those libraries that if you're manipulating, especially large sets of data and real world data,

33:47 then this is the one that people, you know, repeatedly come back to. And yeah, so pandas,

33:54 for those that might not know, is like a, you know, data munging and data analysis library that lets you

34:00 transform it. One of the hardest parts when you're doing machine learning is actually getting your data

34:06 into a format that can be used effectively by your model. And so a lot of times real world data is

34:12 pretty messy, or it might have gaps in it, or it might not actually be formatted in the right units.

34:20 So it might not be sort of normalized so that you're within the right ranges. And if you feed the models,

34:26 just sort of raw data that hasn't really been either cleaned up or, or formatted correctly,

34:32 then what you might find is that the model doesn't converge or you get what seems like random results

34:40 or things that don't really make sense. And so, you know, spending this time and having a library that

34:46 makes manipulating, especially very large sets of data, very easy, like pandas is super useful.

34:53 And even just for instance, like when I was doing that, that little demo there that, that we talked

34:59 about originally, you know, when I started, I was feeding things raw, raw pixel values for positions

35:05 and velocities and stuff. And it just wasn't working. And it wasn't until I really normalized the data,

35:10 cleaned it up that I started getting good, consistent results. So, you know, dealing with large-

35:15 scale data sets and being able to manipulate them effectively is super important.
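A minimal sketch of the kind of cleanup being described, using pandas. The column names and values here are made up, standing in for the raw positions and velocities mentioned above: fill a gap in the data, then min-max normalize so every column lands in the 0-1 range.

```python
import pandas as pd

# Hypothetical messy readings: a gap (None) and columns on very
# different scales, like raw pixel positions and velocities.
df = pd.DataFrame({
    "position": [10.0, None, 30.0, 40.0],   # has a gap
    "velocity": [100.0, 200.0, 300.0, 400.0],
})

# Fill the gap by interpolating between its neighbors...
df["position"] = df["position"].interpolate()

# ...then min-max normalize each column into the 0-1 range
# so the model sees comparable scales.
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)
```

Feeding a model the normalized frame instead of the raw values is exactly the kind of preprocessing step that makes the difference between a model that converges and one that produces seemingly random results.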

35:20 Yeah. At the heart of all these successful AI things, these machine learning algorithms and

35:28 whatnot is a tremendous amount of data. It's why the companies that we talk about doing well are like

35:33 enormous data sucking machines like Google and Microsoft and some of these other ones. Right.

35:41 Exactly. And that's where the power of them comes from. Like, you know, Google has access to just

35:46 massive amounts of data that we regular people don't have access to. Or like we were talking about earlier

35:54 with the radiology images, you need a fairly large set of annotated data. And so that's data

36:01 where, you know, these are case files or whatever that, you know, a doctor has already gone through and said,

36:06 this one was a cancer patient, this one wasn't. And without that kind of annotated data, the models

36:13 can't really learn. They need to know what the answer is. Right. And so that's really, really important.

36:20 Yeah. We have the whole 10,000 hours to become an expert thing for humans. This is kind of the equivalent

36:25 for machines.

36:26 Yeah, I guess. Yeah, I don't know what the number is. The machines might need

36:31 more. That's one of the things that is really interesting about humans is that our neural networks

36:38 can learn remarkably quickly without having to walk into traffic 1000 times or do something like that.

36:46 And so there's I don't know, there's some magic going on there or something.

36:49 Yeah, there sure is. All right. Next up is Caffe and Caffe2. And this originally started out as a

36:57 vision project, right?

36:58 That's right. Yeah, Berkeley. And so this was primarily a vision project. And then there's a sort of successor

37:05 that is backed by Facebook, actually, and is more general purpose and is sort of optimized for web and

37:13 mobile deployment. So obviously, you know, if you want to have machine learning based apps on your

37:18 phone, then having a library that sort of targets that is pretty important.

37:22 Yeah, I'm sure we're going to see more of that. I mean, there are even rumors, I don't know how

37:27 trustworthy they are, maybe it's actually analysis at this point, that the next iPhone will have a

37:34 built-in AI chip.

37:35 I remember that. So Apple actually just announced a machine learning SDK, Core ML, at

37:42 WWDC in June. And so Apple is already targeting these sort of deployed ML models. So in

37:51 that library's case, you are effectively choosing a pre-made model. So I want image recognition, or I want,

37:57 you know, language parsing in my app. And then you can just use these sort of pre-trained models.

38:01 But it wouldn't surprise me, you know, they've got the, what is it, the motion chip in your iPhone now.

38:07 Yeah, they got the motion chip. Yeah.

38:08 So it wouldn't surprise me at all to start seeing phones deploying AI chips in there

38:13 to assist with this, because things like Siri are machine learning based.

38:18 Right. So yeah. Yeah. And it doesn't make sense to go to the cloud all the time. Like that's one of

38:24 the super annoying things about Siri is you ask it a question and it's like six seconds later. Like you

38:29 ask it something simple like what time is it? 10 seconds later, it'll tell you it's such and such.

38:34 Like, is it really that hard? Yeah. Yeah. It's got to go all the way to the cloud and you're in some

38:38 sketchy network area or something. Right. Exactly. And so I wouldn't be surprised to start seeing

38:43 that stuff deployed onto mobile. I think even at Build, Microsoft's conference, they started

38:49 talking about edge machine learning, where the machine learning is getting pushed to all

38:55 these IoT devices that they're working on as well. So a lot of attempts in this area.

38:59 For sure. Yeah. And that's the next big thing, right? Like having IoT-based machine learning

39:04 devices. Like, can your fridge learn your grocery consumption habits and, you know,

39:10 tell you you're going to run out of milk in two days, and you're going to the store today, so maybe

39:13 you should pick some up. I mean, it's going to happen kind of crazy, but it totally will happen.

39:18 And yeah. Yeah. I mean, it doesn't sound as crazy as let's just let a car go drive in a busy

39:24 city on its own. That's true. And yet that's, that's something that exists now, right? Like

39:31 that's a thing. And maybe it's not fully autonomous, but I mean,

39:36 you could go and buy one like tomorrow, you could buy a car where you can turn on autopilot and,

39:42 it's crazy, it'll just drive for you. So the future is here. It's just not

39:49 evenly distributed. This portion of Talk Python is brought to you by us. As many of you know,

39:56 I have a growing set of courses to help you go from Python beginner to novice to Python expert.

40:00 And there are many more courses in the works. So please consider Talk Python training for you and

40:05 your team's training needs. If you're just getting started, I've built a course to teach you Python

40:10 the way professional developers learn by building applications. Check out my Python jumpstart by

40:15 building 10 apps at talkpython.fm/course. Are you looking to start adding services to your app?

40:21 Try my brand new consuming HTTP services in Python. You'll learn to work with RESTful HTTP services,

40:27 as well as SOAP, JSON and XML data formats. Do you want to launch an online business? Well,

40:32 Matt McKay and I built an entrepreneur's playbook with Python for entrepreneurs. This 16 hour course will

40:38 teach you everything you need to launch your web-based business with Python. And finally,

40:42 there's a couple of new course announcements coming really soon. So if you don't already have an

40:46 account, be sure to create one at training.talkpython.fm to get notified. And for all of you who have bought

40:52 my courses, thank you so much. It really, really helps support the show. One little fact or a quote from

40:59 the Caffe webpage that I want to just throw out there because I thought it was pretty cool before we move

41:03 on. They say, speed makes Caffe perfect for research experiments and industry deployment. It can process 60 million

41:12 images per day on a single GPU. That's one millisecond per image for inference and four milliseconds per image for

41:20 learning. That's insane.
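A quick back-of-the-envelope check of that throughput figure:

```python
# 60 million images/day on one GPU, converted to milliseconds per image.
ms_per_day = 24 * 60 * 60 * 1000      # 86,400,000 ms in a day
images_per_day = 60_000_000
ms_per_image = ms_per_day / images_per_day
print(ms_per_image)  # 1.44
```

That works out to about 1.4 ms per image overall, which is in the same ballpark as the quoted ~1 ms/image for inference once batching and pipelining are taken into account.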

41:21 So fast. And 60 million images per day is just like, it's crazy. And that's why we were talking

41:30 about the data just a minute ago. And the amount of data being poured into these models is just

41:36 staggering. And I don't doubt that people are probably feeding these models

41:42 that much data every day. And I think they were saying 90% of the world's data that's ever been

41:48 created has been created in the last year. And so it's just one of these things where it

41:53 accelerates and accelerates and builds on all this stuff. So I think these things are just going to

41:59 get faster until they're effectively real time.

42:01 Yeah, absolutely. All right. I don't think we said the stars for that one. 20,000 and 11,000 forks.

42:08 So up next is definitely one that data scientists in general just live on. And that's Jupyter.

42:15 For sure. And so this has just become like the standard interchange format for sharing data science,

42:24 whether it's papers or data sets or models. This has just become the sort of standard,

42:32 I don't know what you're going to call it, lingua franca for exchanging this data. And it's effectively

42:37 a tool for the thing called a Jupyter notebook, which is kind of like a web page with

42:43 embedded programs and embedded data sets. I think that's probably a good way to describe it for those

42:48 who might not have used it before.

42:49 Right. It's like instead of writing a blog post or a paper that's got a little bit of code,

42:54 then a little bit of description, then a picture, which is a graph, it's like live and you can re-execute

42:59 it and tweak it. And it probably plugs into many of these other libraries and it's using that

43:03 somewhere behind the scenes to do that.

43:06 Exactly. Yeah. It's built on the IPython kernel, that's the interactive Python kernel. Yeah. I'm

43:13 sure that there are all kinds of specific tools that can run that notebook code and use

43:19 that stuff there.

43:20 Cool. Next up is maybe one of the newer kids on the block in this deep learning story from Microsoft,

43:26 actually their Cognitive Toolkit, CNTK.

43:29 Yeah. They just released, I think, the 2.0 version of it beginning of June or late May.

43:35 And, you know, now it's open source and it's got the Python bindings. And,

43:42 you know, Microsoft's been doing a lot of open source work lately and they've been, you know,

43:46 really, really pushing a lot of their own projects.

43:48 And, like we said earlier, it's available as a backend for Keras. So it's similar again to

43:56 TensorFlow and Theano in that it's, again, focused on that sort of low level

44:00 computation as a directed graph. So similar model, I think this is, you know, obviously emerging as a

44:07 popular and efficient way to represent machine learning models is using that directed graph.

44:11 So it's pretty popular too, right? It's got a decent number of stars and forks and obviously

44:17 as a Keras backend and Microsoft backed library, it's going to be pretty popular and pretty common

44:23 out there.

44:24 Yeah, absolutely. These days, you know, with, Satya Nadella and a lot of the changes at Microsoft,

44:29 I feel like this open source stuff is really taking a new direction, a positive one. And also I think

44:35 their philosophy is if it's good for Azure, it's good for Microsoft. And so this plugs into their

44:41 hosted stuff and interesting ways. And they've got a lot of like cognitive cloud services and things

44:47 like that.

44:47 Yeah. Azure is becoming pretty huge. It's like starting to rival maybe even AWS for, you know,

44:57 a lot of these cloud hosted services. And especially around machine learning, Azure has so many

45:01 different machine learning tools available. And it's really clearly a pretty, pretty big focus for

45:08 Microsoft. And again, it's great to see, you know, more of the, you know, the sort of big guns being

45:14 more open about their development and sharing. I mean, it drives everybody forward and, and, you know,

45:19 just accelerates development across the whole ecosystem.

45:21 Yeah. And they have a number of the Python core developers there. They have Brett Cannon,

45:25 they have Steve Dower, they have, you know, Dino Viehland, like there's some serious people back there working

45:30 on the Python part.

45:31 Exactly. Yeah. They've got a lot of the Python core team there. And I know a bunch of the

45:37 folks from ActiveState were just at PyData in Seattle and, you know, a huge number of the core team

45:42 were there. You know, just a really, really great little conference, all talking about

45:48 Python and data science. Yeah. I think they have some really interesting language stuff as well.

45:52 So speaking of languages, certainly the longest running one that's really

45:58 still going strong is NLTK, with 5,000 stars and 1,500 forks.

46:03 Yeah. So NLTK is the Natural Language Toolkit. And, you know, obviously this is a thing

46:09 for doing natural language parsing, which is, I guess, one of the holy grails of machine

46:14 learning, to get it so you can just speak to your computer

46:19 in completely natural language, and maybe even give it instructions in natural language

46:23 and have it be able to follow your directions and understand what you're

46:28 asking. And so this is like a really popular one in academia for research. They link to and

46:35 include massive corpora of work. So that's like gigantic bodies of text in different languages

46:43 and in different styles to be able to train models. So there's, there's also like a pretty

46:48 large, like open data component to this project as well. And, obviously, you know, the use

46:54 case here for natural language is, you know, it's huge for translation. Like we mentioned earlier,

46:59 chatbots, which are now a huge thing for support. I mean, every website you go onto,

47:05 it pops up, hey, I'm, you know, Bob, can I help you today? And it's not really a

47:10 person, it's just a chatbot. There's just so many. And then, like we were saying, Siri

47:16 and Cortana and all those sort of personal assistants, where you can ask it a natural language question

47:22 and it can come back to you. So this is the sort of almost like foundational library still going

47:27 strong, still tons of active development and research going on with this. Yeah. It's really

47:32 cool. And especially with all the smart home speaker things, Google home, home pod, all that stuff.

47:37 This is just going faster, not slower, in terms of acceleration, right? We're

47:42 talking more and interacting with them way more. Definitely the chatbots. And anytime you have

47:49 text and you want a computer to understand it, this is like a first step for tokenization,

47:55 stemming, tagging, parsing, semantic analysis, all that kind of stuff. Right? Yeah. And that's,

47:59 that's exactly what it outputs. So what it will do is generate parse trees and stem it all out, and

48:05 then you use that kind of tokenized version to train your model, not sort of raw text

48:12 characters. And, we really are getting there. I mean, like these days, like for sure, like just the

48:19 recognition part, you know, the tokenization part is very, very good. It's more like the kind of

48:24 semantic meaning. What do you mean when you ask it, you ask Siri for what are the movie times for X or

48:32 something like that? How specific do you have to be to get a reasonable answer from her? Yeah.

48:37 It's got to go speech to text and then it probably hits something like this. Exactly. Yeah, exactly.
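The tokenizing and stemming steps just described can be sketched in plain Python. This is not NLTK's actual implementation (its tokenizers and stemmers, like the Porter stemmer, are far more careful), just a toy version of the pipeline:

```python
import re

# Toy NLTK-style pipeline: tokenize raw text, then stem each token,
# so a model trains on tokens rather than raw characters.

def tokenize(text):
    # Lowercase and pull out runs of letters (a stand-in for a real tokenizer).
    return re.findall(r"[a-z']+", text.lower())

def stem(token):
    # Naive suffix stripping, standing in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = [stem(t) for t in tokenize("The bots are parsing questions")]
print(tokens)  # ['the', 'bot', 'are', 'pars', 'question']
```

NLTK would follow this with tagging, parsing, and semantic analysis on top of the tokenized stream.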

48:41 That's going to hit a library like this and we're getting there. It's not quite at the Star Trek

48:46 computer do this for me, but it's like way closer than I kind of ever thought we would

48:51 be. It's really pretty impressive sometimes. Yeah, absolutely. It's, it's fun to see this

48:57 stuff evolve. Absolutely. All right, Pete, that's the 10 libraries. And I think these are all really

49:02 great selections and hopefully people have got a lot of exposure and maybe learned about some

49:09 they didn't know about. And I guess I'd encourage everyone to go out there and try these

49:13 out, and if you've got an idea, play with it using one or more of them. For sure. They're all so accessible

49:17 now. You know, you don't necessarily have to be an ML researcher or a math wizard to actually create

49:27 something that's interesting or experiment or learn a little bit. These libraries all do a really,

49:32 really great job of abstracting away some of the more complicated mathematical parts. And,

49:39 you know, in the case of a lot of them making it reasonably accessible. And so that's where I think

49:45 you're seeing this kind of like democratization trend in machine learning now where this stuff is

49:51 becoming more accessible. It's becoming easier. And I think you're going to see a lot of creativity and a

49:56 lot of innovation come out of people if they just sort of give it a shot and try something out and,

50:01 you know, learn something new.

50:03 Yeah, that's awesome. I totally agree with the democratization of it. And that's also happening

50:07 from a computational perspective, right? Like these are easier to use, but also with the GPUs

50:12 and the cloud and things like that, it's a lot easier. You don't need a supercomputer. You need 500

50:18 bucks or something for a GPU.

50:20 Exactly. I think all of these sort of things feed into that together, where you have a

50:25 democratization trend in the tools and the source code, so that now you can have access to Google's

50:33 years and years of AI research via TensorFlow on GitHub. You also, like you said, can go and buy a

50:40 $500 GPU and have basically a supercomputer on your desktop, but also this open data component where

50:48 you can get access to massive data sets like the Stanford image library and, you know, these huge

50:56 NLTK like language corpora that you can then use to train your models where previously that was probably

51:03 impossible to actually access.

51:05 Yeah, that's a really good point because even though you have the machines and you have the algorithms,

51:09 the data, data really makes it work. All right. So I think let's leave it there for the library.

51:14 So those were great. And now I'll hit you with the final two questions. If you're going to write some

51:20 Python code, what editor do you open up?

51:22 Well, obviously ActiveState has Komodo. So I tend to use that a lot for doing Python code. But I've also,

51:30 to be totally fair, used VS Code as well, which is getting increasingly popular. So I tend to like

51:36 to cycle between them all because we have an editor product. And so, you know, it's great to keep up to

51:42 date on what all the other ones are doing. So I tend to cycle around a little bit, but yeah, like

51:48 Komodo is sort of my go-to.

51:50 Yeah, that's cool. Yeah. It's definitely important to look and see what the trends are, what other

51:54 people are doing, how can you bring this cool idea back into Komodo, things like that, right?

51:57 Yeah, for sure.

51:58 Yeah.

51:58 All right. And I think we've already hit 10, but do you have another notable PyPI package?

52:03 I don't know, there's so many. I would again probably give a little bit of a shout

52:08 out, since we're talking about machine learning, to Keras, because I do think as an entry

52:14 point to machine learning, it's so accessible. It's so easy to at least get started and get a result with.

52:22 I would give a little shout out to that. I think if you're looking to get into this and

52:25 you're looking to try it out, that's a really great place to start.
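One concrete reason Keras is such an easy entry point is the backend switch mentioned earlier in the episode. A minimal sketch, assuming the multi-backend Keras of this era: the backend comes either from the `backend` key in the `~/.keras/keras.json` config file or from the `KERAS_BACKEND` environment variable, set before the import.

```python
import os

# "Flipping the switch": multi-backend Keras reads KERAS_BACKEND at import
# time (the ~/.keras/keras.json config file is the other route).
os.environ["KERAS_BACKEND"] = "theano"  # or "tensorflow", "cntk"

# import keras  # uncommented, this would now load the Theano backend (if installed)
print(os.environ["KERAS_BACKEND"])
```

The model-building code stays the same either way; only the engine underneath changes.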

52:28 Yeah, I totally agree with you. That's, that's where I would start as well.

52:31 All right. Well, it's very interesting to talk about all these libraries with you. I really

52:36 appreciate you coming on the show and sharing this with everyone. Thanks for being here.

52:39 Thank you for having me.

52:40 You bet. Bye.

52:41 This has been another episode of Talk Python to Me. Our guest has been Pete Garson,

52:47 and this episode has been brought to you by DataCamp and us right here at Talk Python Training.

52:52 Want to share your data science experience and passion? Visit datacamp.com slash create

52:58 and write a course for a million budding data scientists.

53:01 Are you or a colleague trying to learn Python? Have you tried books and videos that just left

53:06 you bored by covering topics point by point? Well, check out my online course,

53:10 Python Jumpstart by Building 10 Apps at talkpython.fm/course to experience a more engaging way to

53:16 learn Python. And if you're looking for something a little more advanced, try my Write Pythonic Code

53:21 course at talkpython.fm/pythonic. Be sure to subscribe to the show. Open your favorite podcatcher

53:28 and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes,

53:33 Google Play feed at /play, and direct RSS feed at /rss on talkpython.fm.

53:39 This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it.

53:44 Now get out there and write some Python code.

53:46 Thank you.

