#103: Compiling Python through PyLLVM and MongoDB for Data Scientists Transcript
00:00 This episode, we have an optimization twofer.
00:02 We begin by looking at optimizing a subset of Python code for machine learning
00:07 using the LLVM compiler with a project called PyLLVM.
00:12 It takes plain Python code, compiles it to optimized machine instructions,
00:16 and distributes it across a cluster to do machine learning.
00:19 In the second half, we'll look at a fabulous new way to work with MongoDB
00:23 for Python-writing data scientists.
00:26 The project is called BSON-NumPy and provides direct connections between MongoDB and NumPy.
00:33 It's 10 times faster than working with PyMongo directly if you plan to end up in NumPy anyway.
00:39 You're about to meet the woman behind both of these projects, Anna Herlihy.
00:43 This is Talk Python to Me, episode 103, recorded February 6, 2017.
00:52 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.
01:21 This is your host, Michael Kennedy.
01:22 Follow me on Twitter where I'm @mkennedy.
01:24 Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython.
01:31 This episode is brought to you by Talk Python Training and Hired.
01:36 Be sure to check out what we both have to offer during our segments.
01:39 It helps support the show.
01:40 Anna, welcome to Talk Python.
01:43 Thank you.
01:44 Thank you for having me.
01:45 Yeah, it's great to have you here.
01:46 We got a couple of really cool things to talk about.
01:49 We're going to talk about PyLLVM, which is a really cool project that you worked on.
01:55 And we're also going to talk about a super high-performance sort of data science-y thing
01:59 with MongoDB that you're also working on right now.
02:02 So I'm looking forward to talking about both of those.
02:04 But before we get into those, what's your story?
02:06 How did you get into programming in Python?
02:08 I had my first real experience with programming.
02:12 I mean, I had done some basic when I was pretty small, but I thought it was not very fun.
02:18 So I kind of put that off for another five or six years.
02:20 But in university, I was kind of all over the place.
02:24 I didn't know if I wanted to be a writer or an engineer.
02:27 But I figured that computers are probably going to be relevant.
02:31 So I took a CS class because I didn't know the difference between computer science and computer
02:38 literacy.
02:38 But it ended up actually working out in my favor because I really, really enjoyed it.
02:43 And I ended up prioritizing my CS classes over all my other classes.
02:47 And then I kind of figured, maybe I should just do this full time.
02:50 That's really cool.
02:51 What was that first CS class?
02:52 That was Andy Van Dam's intro to computer science, which anybody who went to Brown will recognize,
02:59 that it's quite a popular course.
03:02 I mean, when I took it, it was maybe 150 people.
03:05 But I think just last year, they had 250 or something.
03:09 It's really, really blowing up.
03:11 Oh, that's excellent.
03:11 What language did you study?
03:13 Do you remember?
03:13 Yeah, it was a Java course.
03:15 We did a lot of Swing.
03:17 Okay, excellent.
03:18 Yeah.
03:19 And so you decided, hey, I kind of like this programming stuff.
03:22 Let's just do that as a job, right?
03:23 And how did you go from learning Java and CS 101 to working in Python?
03:29 So I really liked programming languages in general.
03:34 So I wrote some compilers, some interpreters.
03:38 My area in computer science was mostly looking at different programming languages.
03:44 And I found that Python was kind of the most elegant, in my opinion.
03:48 And I didn't feel bogged down by really complicated syntax.
03:53 But I also felt very powerful.
03:54 So I kind of felt that as opposed to being pure Python, I wanted to do more like Python and C stuff,
04:02 sort of more from the very beginning to how it gets down to the computer code at the end.
04:07 I see.
04:07 So you really were interested in like the internals and stuff and whatnot, huh?
04:11 Yes.
04:12 Okay.
04:12 Cool.
04:13 And so the first project we're going to talk about, which I want to ask you a question quick before then,
04:18 but PyLLVM, that was part of your university work, right?
04:23 Yes, exactly.
04:24 The project that it was based on, Tupleware, is a research project with some of the database professors at Brown.
04:32 And my project was a senior thesis that was built sort of on top of that.
04:36 Okay.
04:36 Wow.
04:36 Very cool.
04:37 Very cool.
04:37 Okay.
04:38 So today you work at MongoDB, right?
04:40 Yes.
04:40 Yeah.
04:41 What do you do there?
04:41 So I actually recently switched over from spending the majority of my time on Python work.
04:47 So like PyMongo, Mongo Connector, other sort of Python related projects to MongoDB Compass,
04:55 which is basically a user interface for MongoDB.
04:58 It's actually written in JavaScript, which was something new for me since I had done a pretty
05:04 excellent job of avoiding JavaScript until now.
05:06 It finally got you, huh?
05:08 Yeah, I know.
05:09 It was inevitable.
05:10 But now I really like it.
05:11 I can't say that I am converted and I still think that Python is the community and the language
05:16 that I like best.
05:17 I'm probably preaching to the choir here.
05:20 But I now split my time between working on Compass and working on this BSON NumPy package,
05:26 which I hope we'll get to talk about later.
05:28 Yeah, we'll definitely talk about it because it's very cool.
05:30 Okay.
05:31 Excellent.
05:32 So it sounds like you have a lot of cool projects going on.
05:35 Let's talk about the project that you did first with this LLVM thing.
05:39 So I suspect a number of people out there know what LLVM is, but the audience is diverse.
05:45 There's people from all sorts of places.
05:46 Let's start with just talking about what is LLVM?
05:49 Cool.
05:50 So despite the name, it doesn't really have too much to do with virtual machines.
05:54 It's basically just a collection of compiler technologies.
05:58 So the LLVM project itself is huge.
06:01 But what I worked with is LLVM IR, which stands for intermediate representation.
06:07 And LLVM IR is basically a way of representing code that's about halfway between a top level
06:15 language.
06:15 So for example, Python and some machine code.
06:19 So something that you wouldn't really want to read.
06:21 You could probably look at the assembler and know what was going on, but it's not something
06:25 you would ever really want to write.
06:26 But runs much, much faster than something that a human would be able to write.
06:30 Sure.
06:31 So does...
06:32 I don't know all that much about it, actually, myself.
06:34 So do you compile the source language into this LLVM IR and then do a further compilation
06:42 towards some final target?
06:43 Yes, exactly.
06:44 And that's actually what makes it so powerful, is that a lot of the time you have compilers
06:50 that have to be very specialized.
06:52 So they take a language and they take a platform.
06:54 And that is what the compiler does, is it compiles Python for a particular architecture.
07:01 What LLVM IR does, which is not unique to LLVM, it's true for all intermediate representations
07:07 or most of them, is that it's a way of making things both language and platform agnostic.
07:14 So you can take any popular language you want from Java, R, Python, and you can compile it
07:23 down to LLVM IR.
07:24 Then you can take that LLVM IR and you can ship it anywhere and it can be compiled down to most
07:30 platforms.
07:31 So it doesn't even matter what the original language was and it doesn't matter what platform
07:36 you're eventually going to run it on.
07:37 It makes the code really...
07:39 It makes it very cross-platform and it makes it fast.
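The counting argument behind that point can be sketched in a few lines of Python. The three languages are the ones named above; the platform names are hypothetical and chosen only for illustration. Without a shared intermediate representation, every language and platform pair needs its own compiler; with one, each language needs a single front end and each platform a single back end.

```python
# Languages are taken from the conversation; the platform list is a
# hypothetical example, not something named in the episode.
languages = ["Java", "R", "Python"]
platforms = ["x86-64", "ARM", "PowerPC"]

# Without a shared IR: one dedicated compiler per (language, platform) pair.
direct_compilers = [(lang, plat) for lang in languages for plat in platforms]

# With a shared IR: one front end per language (language -> IR) and
# one back end per platform (IR -> machine code).
front_ends = [(lang, "IR") for lang in languages]
back_ends = [("IR", plat) for plat in platforms]

print(len(direct_compilers))             # 9
print(len(front_ends) + len(back_ends))  # 6
```

The gap widens as either list grows, since one count is a product and the other is a sum.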
07:43 Yeah, sure.
07:44 So basically, if you can get something to compile down to LLVM IR, you can get it to
07:49 run quickly on many, many places, right?
07:52 Because there's existing infrastructure to turn LLVM IR into executable code on all sorts
07:59 of platforms, right?
08:00 Exactly.
08:01 And it's super powerful.
08:02 The optimizations that the LLVM IR compilers do make it so that it doesn't matter if you wrote
08:09 your code in C or if you wrote your code in Python.
08:12 I'm of the opinion that it's much easier to write your code in Python than it is to write
08:15 it in C.
08:16 And so it's pretty nice because you get the benefits of maybe a more difficult syntax with
08:21 pretty much whatever you're comfortable in.
08:23 Yeah, that's really cool.
08:24 What's the weirdest thing that you can execute this stuff on?
08:29 What's the weirdest platform?
08:31 That's a good question.
08:31 Actually, I don't know.
08:33 But I think it's pretty easy to if you just look up the LLVM IR or the LLVM compiler project,
08:39 I think it's a really good way to get to know compilers in general.
08:44 So there are a lot of proof of concept compilers that use LLVM because the infrastructure is so
08:51 good.
08:51 So I'm sure somebody has written something for an extremely obscure platform just to show
08:57 that they could do it.
08:58 Yeah, I'm sure.
08:59 Do you think that it makes a lot of sense for people to create from scratch compilers these
09:05 days?
09:05 Or should most people just be building on LLVM?
09:08 I think it depends on what the goal of your project is.
09:13 So because this was a project that was done in the world of academia, I pretty much just
09:19 picked whatever seemed like would be the most fun.
09:21 And I know that's not necessarily true for a lot of people who have needs and users and they
09:28 have to work quickly.
09:29 And so I think it just depends on if you want something to work faster or if you want to learn
09:35 a lot while you're doing it.
09:37 Sure.
09:37 Yeah.
09:37 And I guess also like you only had, it was like a senior project sort of thing, right?
09:41 So you couldn't start from scratch and do a ton of work.
09:45 You had a time frame and a limited amount of time and energy, right?
09:48 So I guess that's a-
09:49 Exactly.
09:49 I had to graduate.
09:49 Exactly.
09:50 So that's a pretty good testament to LLVM.
09:52 Cool.
09:53 And I know it's like used for Swift and some other things on the Apple platform and it's
09:58 pretty cool.
09:59 Okay.
10:00 So you also talked about a project to do with machine learning called Tupleware.
10:05 And this is a project at Brown, right?
10:07 Yes, it is.
10:08 So Tupleware's tagline is that it is a distributed analytical platform that leverages LLVM IR to
10:18 be totally platform agnostic and totally language agnostic.
10:23 So the way that PyLLVM fits into Tupleware, it was basically a proof of concept that you
10:31 could take Python code and you could compile it down to LLVM IR and then you could ship
10:38 it to your clusters and then have your code be automatically run on a
10:44 very large dataset.
10:45 Okay, cool.
10:46 So this is like a distributed machine learning type of system and your project PyLLVM basically
10:52 made it possible to feed Python instead of say C++ code or something like that to it, right?
10:57 Exactly.
10:57 And I chose Python because I didn't really want to write an R compiler, but also because at
11:04 the time I was working on this, which is about three, three and a half years ago now, the language
11:10 that most scientists were using for programming was still MATLAB.
11:14 I actually had a summer job where I converted MATLAB code to scientific Python code and it was
11:21 becoming more and more common that people who were not programmers would be programming in Python
11:26 instead of just inheriting these old MATLAB scripts that they would reuse.
11:29 So that's why I've picked Python.
11:32 Yeah, that's cool.
11:33 I've definitely seen that as well.
11:35 When I was in school, MATLAB was definitely the thing and people would have all these scripts
11:40 and one of my first programming jobs was to take a bunch of MATLAB code and turn it into like
11:47 graphical visualizations on Silicon Graphics supercomputers and I had to convert the MATLAB code to C++
11:54 but if it was today, I very well might have done that in Python instead, right?
11:57 Yeah, that's funny.
11:59 Sounds like the same job.
12:00 Yeah, exactly.
12:01 Exactly.
12:02 Awesome.
12:03 And so you said PyLLVM started from an abandoned open source project.
12:08 What's the story there?
12:09 Basically, once I had pinned down what exactly I wanted to do for my thesis,
12:15 I went on a search spree where I tried to figure out if somebody else had done it for me,
12:20 which would have been very convenient.
12:21 But at the time, I couldn't find any Python-to-LLVM-IR compilers that were already up and running or already published.
12:32 So what I did is I found this GitHub.
12:36 Actually, it wasn't even GitHub.
12:38 It was a Google Code project, right?
12:39 Yeah, it was a Google Code project, which had almost zero documentation.
12:46 There were some slides in Japanese that I found, which I tried to feed through Google Translate, which did not work.
12:52 So it was basically just some code.
12:55 And it ended up being almost exactly what I needed because it was a pretty simple compiler outline without any of it really implemented.
13:04 But the structural stuff was there.
13:06 So a lot of the scoping and a lot of the variable tracking had been started but not finished.
13:11 So it was kind of the perfect project to pick up because a lot of the design decisions that would have resulted in a lot of cost versus value arguments.
13:21 So, you know, we could implement it this way, but it would take a lot of time and we might not actually get that much out of it was sort of already decided for me.
13:28 So I had a pretty clear path in terms of getting the subset of Python it supported to be something that I thought a user could actually make use of.
13:38 Yeah, that seems really great that, you know, that's out there and you can pick it up.
13:42 And it's funny it was on Google Code because not only was the project kind of abandoned, like Google Code itself is kind of in archive mode, right?
13:49 I guess it's a real testament to GitHub.
13:51 I think it was in archive mode when I found it even.
13:55 It's definitely in archive mode now.
13:57 It's several layers.
13:58 Yeah, but that's helpful that it's like, well, we already kind of put the thinking into place, but hey, I need to – they just didn't take it to completion, right?
14:07 And so, like, okay, these are the final to-dos on this project or whatever until I can actually use it on Tupleware.
14:13 Yeah, I mean, when I found it, it wouldn't even compile.
14:15 So I have really a lot of respect for the person that wrote it because it really had everything – it was clear that this person didn't actually run it, but they had thought about it a lot.
14:28 And the skeleton was there, but it hadn't actually gone through the process of becoming a tool that could be run.
14:34 And that, to me, is very incredible.
14:36 I mean, I run my code every 10 seconds just to make sure that it's working.
14:39 Exactly.
14:40 Yeah, I'm with you on that.
14:43 Like, run it often.
14:43 So, yeah, I guess some people – you know, probably more in the early days when compiling took longer and stuff, but some people really sit there and work through it, and then they run it.
14:56 You know, hours after hours of work, that's just how I work.
14:59 I run it often.
15:00 Yeah, it's very impressive.
15:02 Yes, indeed.
15:03 It takes a subset of Python.
15:05 I can't feed it like Django and have it spit that out, right?
15:08 Because what happens is it takes the Python language and it turns it into LLVM IR, which then finally compiles down to machine instructions.
15:16 But there's no interpreter.
15:19 There's no standard library underneath there for it to run, right?
15:24 So, you've got to be pretty focused on what you give it.
15:26 Yes.
15:27 What it does have is sort of the C standard library.
15:31 So, for example, if you want to call, like, Printline, you can do some somewhat complicated gymnastics in order to get LLVM to make that external call.
15:44 So, that's really nice because getting Printline to work when you're writing a compiler is so important.
15:49 And when it does work, it's, like, really one of the best feelings ever.
15:53 But generally, no, you wouldn't be able to import a Python package, for example, since the goal of the project was really just to provide people an alternative to using MATLAB.
16:05 So, instead of writing your machine learning algorithms in MATLAB, you can write them in Python.
16:10 Or, you know, in this case specifically, it'd be instead of writing in C++, you're writing it in Python.
16:14 But the nature of those algorithms is that they're quite simple.
16:18 So, we didn't – I didn't spend a lot of time trying to get objects or more complicated data structures working because I didn't anticipate it would be that common.
16:27 Right.
16:28 So, you're working with, like, loops and join and map and stuff like that.
16:33 And then you feed it basic algorithms, huh?
16:36 So, like, Bayesian stuff or linear regression.
16:40 Yeah, exactly.
16:42 You feed those off, huh?
16:43 Okay.
16:43 And I'm not a machine learning – I had not really done much machine learning at that time.
16:49 So, I basically just went to my ML professor and was like, give me, like, your top five most common machine learning algorithms that you would expect to want to run on large data sets.
16:59 I'll make these run in Python through this thing, right?
17:02 Nice.
17:03 And so, yeah.
17:05 So, you basically – you write the algorithms.
17:06 They compile down and then finally get transformed to run basically on the C, C++ runtime, like you said.
17:12 And that goes through something called Boost Python, right?
17:15 What is Boost Python?
17:16 So, Boost Python – Boost is a bunch of packages for C++ that provide a lot of really, really powerful capabilities.
17:25 And so, Boost Python is actually just an easier way of calling the Python C API.
17:31 If I had done this now, I probably just would have skipped using Boost entirely because I'm pretty comfortable with the Python C API.
17:38 But at the time, it was sort of the first Python plus another language interfacing I had done before.
17:45 And this was pretty simple because it was already being used by Tupleware itself.
17:51 So, I didn't have to do a lot of – I didn't actually have to incorporate it into the ecosystem.
17:55 It was already there.
17:56 Yeah.
17:57 That's cool.
17:57 Yeah.
17:58 And by now, you've done so much with PyMongo and all the stuff in MongoDB.
18:02 Yeah.
18:02 Yeah.
18:02 Cool.
18:03 Okay.
18:03 So, give us, like, a short example of what this algorithm might look like.
18:08 I mean, code over audio is really hard.
18:11 But just, like, what kind of stuff would you feed off to this system?
18:14 So, I would expect there to be – so, you've got to assign to your variables.
18:19 You most likely have an array that you are assigning values to.
18:26 You need to iterate through the array or you need to iterate through the data that you've been given.
18:31 And you need to do a lot of – or at least a decent amount of error checking.
18:35 So, what that means in terms of syntax is you would want to have reassignable variables.
18:42 You would need to have conditionals, loops, and arithmetic.
18:47 Those are basically the most important things.
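As a rough illustration of that subset, here is a minimal sketch of a least-squares line fit written with nothing but reassignable variables, lists, loops, conditionals, and arithmetic. The function name, parameters, and data are hypothetical; the episode does not show any actual PyLLVM input, and this is written for modern Python 3 rather than the Python 2.7 the project targeted.

```python
def fit_line(xs, ys, learning_rate, steps):
    """Fit y = slope * x + intercept by batch gradient descent."""
    n = len(xs)
    if n == 0 or n != len(ys):
        # Basic error checking on the input data.
        return [0.0, 0.0]
    slope = 0.0
    intercept = 0.0
    step = 0
    while step < steps:
        grad_slope = 0.0
        grad_intercept = 0.0
        i = 0
        while i < n:
            # Prediction error for one data point.
            error = slope * xs[i] + intercept - ys[i]
            grad_slope = grad_slope + error * xs[i]
            grad_intercept = grad_intercept + error
            i = i + 1
        # Move both parameters against the mean-squared-error gradient.
        slope = slope - learning_rate * 2.0 * grad_slope / n
        intercept = intercept - learning_rate * 2.0 * grad_intercept / n
        step = step + 1
    return [slope, intercept]


# Points that lie exactly on y = 2x + 1.
result = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], 0.05, 2000)
print(result)
```

Every variable keeps one type for its whole lifetime, which is the kind of restriction discussed later in the conversation.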
18:49 This portion of Talk Python to Me is brought to you by us.
18:53 As many of you know, I have a growing set of courses to help you go from Python beginner to novice to Python expert.
18:59 And there are many more courses in the works.
19:01 So, please consider Talk Python training for you and your team's training needs.
19:05 If you're just getting started, I've built a course to teach you Python the way professional developers learn, by building applications.
19:12 Check out my Python Jumpstart by Building 10 Apps at talkpython.fm/course.
19:17 Are you looking to start adding services to your app?
19:20 Try my brand new Consuming HTTP Services in Python.
19:23 You'll learn to work with RESTful HTTP services as well as SOAP, JSON, and XML data formats.
19:29 Do you want to launch an online business?
19:30 Well, Matt Makai and I built an Entrepreneur's Playbook with Python for Entrepreneurs.
19:35 This 16-hour course will teach you everything you need to launch your web-based business with Python.
19:40 And finally, there's a couple of new course announcements coming really soon.
19:43 So, if you don't already have an account, be sure to create one at training.talkpython.fm to get notified.
19:49 And for all of you who have bought my courses, thank you so much.
19:53 It really, really helps support the show.
19:55 One of the things that seems a little challenging to me is like you're building something in Python
20:00 and it's being more or less just compiled into the results of C++, which is a typed system, right?
20:09 So, it expects, you know, here's a four-byte integer.
20:14 Here's a Boolean.
20:16 Here's a string pointer, right?
20:17 Whereas in Python, you don't really have that, right?
20:20 So, what do you have to do to make the type system sort of fit together there?
20:24 That is a problem that comes up sort of time and time again because no matter how many layers of abstraction you're talking about,
20:32 ultimately, machine code is pretty strictly typed.
20:36 And we don't really like working in strictly typed languages.
20:40 So, there's a lot of cost-benefit analysis going on there where you could say I demand that I can reassign my variable from an array to an integer
20:53 or between different types of numbers, for example.
20:56 Or you could say that's going to save me a ton of work if I just tell my user who I anticipate is probably used to MATLAB.
21:04 Don't do that.
21:06 This is Python, but it's not that Pythonic.
21:08 Like, you know, maybe down the line when this is no longer a proof of concept.
21:13 Ultimately, what we ended up doing was just sticking to pretty static types because whenever you write a compiler,
21:22 it's kind of a debate between how much do I want to reinvent the wheel and how much are my users going to be willing to sort of have a more limited experience
21:32 for the sake of, like, my sanity or how much time I'm willing to put in on this project.
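One way to picture the "don't do that" rule is a small check that rejects a variable being rebound to a different kind of value. The sketch below uses Python's standard `ast` module and only looks at top-level assignments of simple literals. It is an illustration of the idea, not how PyLLVM enforced its types, which the episode does not describe. The function names are hypothetical, and `ast.Constant` assumes Python 3.8 or later.

```python
import ast


def literal_type(node):
    """Return a type name for simple literals, or None if unknown."""
    if isinstance(node, ast.Constant):
        return type(node.value).__name__
    if isinstance(node, ast.List):
        return "list"
    return None


def find_type_changes(source):
    """Report top-level variables reassigned to a different literal type."""
    seen = {}       # variable name -> first type observed
    problems = []
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign):
            continue
        new_type = literal_type(node.value)
        if new_type is None:
            continue
        for target in node.targets:
            if not isinstance(target, ast.Name):
                continue
            old_type = seen.setdefault(target.id, new_type)
            if old_type != new_type:
                problems.append((target.id, old_type, new_type, node.lineno))
    return problems


program = "x = 1\nx = 2\ny = [1, 2]\ny = 3\n"
print(find_type_changes(program))  # [('y', 'list', 'int', 4)]
```

Reassigning `x` from one integer to another is accepted, while changing `y` from a list to an integer is flagged.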
21:37 Sure, that makes a lot of sense.
21:39 I feel like because it's such a limited set of types, really, that you can write your algorithm in,
21:45 it's not so bad to sort of restrict it and talk about the types, right?
21:51 You've got, like, the fundamental numerical types, strings, lists, and a few other things that they can really work with, right?
21:58 Yeah, and I definitely, that's what we're trying to sell here.
22:03 But I remember having this argument where I came up with what I thought was, like, a super clever solution for how we're going to do dynamic types.
22:11 And it was, like, twice as much code as the entire compiler was before that point.
22:16 And I brought it to my advisor.
22:17 I was like, this is such a good idea.
22:19 Like, I can't wait to implement this.
22:21 And he kind of looked at me and he was like, this is not what we need right now.
22:24 What we need is something that is working, which is, I guess, was my first experience with, you know, writing code that actually needs to do something.
22:33 And it needs to do something as soon as possible.
22:35 As opposed to code that's, like, beautiful and elegant.
22:38 And I've, like, prototyped it three times.
22:40 And I have all the time in the world.
22:41 So.
22:42 Yeah.
22:42 You know, shipping is a feature, right?
22:44 Yeah.
22:45 No one's going to use your code if you don't actually get it out there and get it working, I guess.
22:49 Yeah.
22:50 I mean, I still struggle with that now.
22:52 I definitely like to write code that is fully baked as opposed to just, like, getting stuff out of the door.
23:00 But I'm definitely getting used to that now that I'm working in JavaScript more.
23:04 Yeah, it's one benefit.
23:05 That's awesome.
23:06 Yeah.
23:07 Yeah, I guess I really appreciate getting something out there so that people can use it and give me feedback.
23:12 Like, this is working.
23:13 This is not working.
23:14 But, yeah, you just have to have some flexibility.
23:16 You can't get locked into, like, some early prototype API or something, right?
23:19 Yeah, exactly.
23:20 Yeah.
23:21 So you said that some of the LLVM IR features made your life easier and some made it harder.
23:25 How'd that work out?
23:26 So probably the most involved issue that comes up when converting from Python to an intermediate representation is like what you mentioned with types and reassigning.
23:41 LLVM IR is written in SSA, which means static single assignment.
23:46 And what that basically means is that you have your registers, which are your smallest unit of storage, and you can only assign to them once.
23:56 So if you assign a number to your register, even if you want to assign something of the same type, it is frozen for that function call.
24:07 So what I needed to do to get around that was basically move everything onto the stack.
24:13 And once everything is on the stack, then you need to keep track of stuff using what people call a symbol table, which is basically just a dictionary where you can look up.
24:23 I have a variable named X, and it lives at this memory address, and it's been around for this long and all sorts of other metadata like that.
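The workaround described here, keeping each variable in a stack slot and tracking the slots in a dictionary, can be sketched as a toy emitter. The class and method names are hypothetical. The emitted text imitates LLVM IR's `alloca`, `store`, and `load` instructions for a 32-bit integer, but it has not been run through LLVM and is not taken from PyLLVM.

```python
class SymbolTable:
    """Maps source variable names to the stack slot that holds them."""

    def __init__(self):
        self.slots = {}      # variable name -> name of its stack slot
        self.lines = []      # emitted IR-flavored instructions
        self.next_temp = 0

    def fresh_register(self):
        # Each register name is handed out exactly once (single assignment).
        self.next_temp += 1
        return "%t" + str(self.next_temp)

    def assign(self, name, value):
        # The first assignment reserves a slot; later ones reuse it, so the
        # variable can change even though registers cannot.
        if name not in self.slots:
            slot = "%" + name + ".addr"
            self.slots[name] = slot
            self.lines.append(slot + " = alloca i32")
        self.lines.append("store i32 " + str(value) + ", ptr " + self.slots[name])

    def read(self, name):
        # Reading loads the current value into a brand-new register.
        register = self.fresh_register()
        self.lines.append(register + " = load i32, ptr " + self.slots[name])
        return register


table = SymbolTable()
table.assign("x", 1)
table.assign("x", 2)   # reassignment: same slot, new store
register = table.read("x")
print("\n".join(table.lines))
```

The slot is written twice, but no register is ever assigned more than once, which is how a reassignable source variable can coexist with static single assignment.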
24:31 Okay.
24:32 So, yeah.
24:33 Does that make sense?
24:34 Yeah, yeah.
24:35 That's pretty interesting.
24:36 You can only assign to them once.
24:38 Okay.
24:38 Pretty interesting.
24:39 And then memory management, how did that work, actually?
24:42 Because everything was on the stack, for the most part, things basically just took care of themselves.
24:48 I didn't have to write a garbage collector, thankfully.
24:51 But there was one particular instance that was really awkward.
24:57 So if you are making a function call and you want to return something that can't fit into a register.
25:08 So say, for example, you have an array, or you have a function call that populates an array, and then you want it to return the array to the original caller.
25:17 The problem there is that you can no longer save that on the stack.
25:22 Because as anyone who's ever programmed in C or C++ knows, it'll go out of scope.
25:26 So the solution there is either, oh, do I have to keep track of scopes now?
25:32 Do I actually have to write a garbage collector?
25:34 Do I have to reinvent memory management in Python myself in two months?
25:40 And the answer to that was, okay, no, that doesn't make sense either.
25:45 So what ended up happening is I would just move the data that you put in your array onto the heap temporarily and then pass back a pointer to it.
25:56 And then either copy it into the stack again and free it, or just leave the stuff that you return from a function on the heap and chalk that up to a memory leak.
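The two options can be modeled with a toy memory simulation. Everything here is hypothetical and written only to show the trade-off: stack frames disappear when a function returns, while heap blocks persist until someone frees them. Real stack and heap management happens in the generated machine code, not in Python objects like these.

```python
class ToyMemory:
    """A toy model: stack frames vanish on return, heap blocks persist."""

    def __init__(self):
        self.frames = []       # one dict of local variables per active call
        self.heap = {}         # address -> stored values
        self.next_address = 1

    def allocate(self, values):
        address = self.next_address
        self.next_address += 1
        self.heap[address] = list(values)
        return address

    def free(self, address):
        del self.heap[address]


def make_squares(memory, count):
    """Callee: build an array locally, then return it via the heap."""
    memory.frames.append({})                      # enter the function
    local = memory.frames[-1]
    local["result"] = [i * i for i in range(count)]
    address = memory.allocate(local["result"])    # copy out before returning
    memory.frames.pop()                           # locals are gone from here on
    return address


def call_and_copy_back(memory, count):
    """Caller, option 1: copy into its own frame, then free the heap block."""
    memory.frames.append({})
    address = make_squares(memory, count)
    memory.frames[-1]["squares"] = list(memory.heap[address])
    memory.free(address)
    squares = memory.frames[-1]["squares"]
    memory.frames.pop()
    return squares


memory = ToyMemory()
copied = call_and_copy_back(memory, 4)   # heap is empty again afterwards
leaked = make_squares(memory, 3)         # option 2: never freed, so it leaks
print(copied, memory.heap)
```

Option 1 costs an extra copy on every return; option 2 costs memory that is never reclaimed.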
26:10 So that's sort of a lose-lose situation.
26:13 It was definitely one of those things that if I had had more time, I would have liked to dig into that.
26:18 That and dynamically reassigning variables are the two parts of the compiler that, I feel like, are unfinished.
26:24 Sure.
26:25 Okay.
26:26 Yeah, that's a big challenge in adding your own garbage collector.
26:29 That sounds like a lot of work.
26:31 Sounds like it could be really fun, but it sounds like a different thesis.
26:34 Exactly.
26:35 It's not my problem right now.
26:37 Okay.
26:37 Interesting.
26:38 So how was the performance?
26:40 Say, like, one of the options was I could write my code in C++ and I could give it to Tupleware, or I could write it in Python and give it to Tupleware.
26:48 What was the trade-off there in terms of performance?
26:50 Was it huge or pretty close?
26:52 So in terms of performance, I actually do have the numbers that I ran that I can dig up.
26:58 But the bottom line was that the LLVM compiler itself was much faster.
27:08 If you take C++ and you compile it down to LLVM IR using a compiler that I think is written in LLVM IR itself, or at least in C or C++, and compare that to a compiler that is written in pure Python that takes in Python code, builds the syntax tree, does the parsing, all the semantic error analysis, that kind of thing.
27:29 It takes longer, but it doesn't take an order of magnitude longer.
27:35 So if you're running ML algorithms on huge data sets, having the compilation of your algorithm, which only happens once, if it takes one second or if it takes five seconds, it doesn't really matter because we're talking about hours and hours of work.
28:01 So it's definitely slower, but for what the project needed, it wasn't prohibitively slower.
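To put that in numbers, here is a short worked calculation. The one-second and five-second compile times are the figures mentioned above; the two-hour run time is a hypothetical stand-in for "hours and hours of work".

```python
def total_seconds(compile_seconds, run_hours):
    """Total wall-clock time for one compilation plus one long run."""
    return compile_seconds + run_hours * 3600


fast = total_seconds(1, 2)   # 7201 seconds
slow = total_seconds(5, 2)   # 7205 seconds

# Share of extra time caused by the slower compiler.
extra_fraction = (slow - fast) / fast
print(fast, slow, extra_fraction)
```

Under these assumptions the slower compiler adds well under a tenth of a percent to the total, because compilation happens only once per algorithm.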
28:08 Yeah, yeah, sure.
28:08 And for the execution speed, like would the algorithms run about the same or was it really different?
28:13 So the algorithms themselves, once they got down to LLVM IR, pretty much ran the same.
28:20 There were a couple interesting cases there.
28:22 So what actually happens when the code gets put into LLVM IR and is then shipped to the other half of the distributed system where it gets actually compiled, analyzed, and run, is that is all handled by LLVM and C++.
28:40 There's no Python involved in that.
28:43 And so the optimization passes that the LLVM IR compilers actually do are incredibly powerful.
28:50 So the reason that there's not a lot of optimization happening in the compiler itself, I mean the compiler from Python to LLVM IR, is that it's pretty much going to get squashed no matter what with the LLVM passes themselves.
29:04 So that was a huge benefit because I didn't really have to sweat optimizations, which is another huge part of compiler writing.
29:12 Yeah, that's awesome.
29:12 Just let LLVM do its analysis on the intermediate representation, yeah?
29:17 Yeah, I mean it's a huge selling point of LLVM IR.
29:20 But there is some interesting stuff about optimizing function calls.
29:25 And basically if you have a recursive call, it becomes a lot more difficult.
29:31 I basically discovered that the LLVM IR compiler is not as good at unrolling these recursive calls when the IR comes from Python code because there's sort of, if you're doing...
29:45 It just basically doesn't like the recursion so much, huh?
29:47 Yeah, there's some things that you have to do in Python that you don't have to do in C++ because it's closer to the end result that ends up tripping up the optimizer.
30:00 And so for an algorithm that involves recursion, it will actually perform slower.
30:06 Okay, interesting.
30:08 Yeah, so this sounds like a really cool project if you have something super focused like Tupleware where you can take a really small subset of Python and execute it against that system, right?
30:21 There's a bunch of different implementations or runtimes out there.
30:25 So we've got things like IronPython and Jython that try to take a different take on Python.
30:29 There's PyPy, there's Pyjion, there's Cython, and Numba.
30:35 It sounds like you're much closer to something like Numba with this project than you would be, say, with Cython.
30:42 Yeah, I think that a lot of the sort of the line in the sand that gets drawn between projects is everybody is trying to get Python code to run really, really fast.
30:52 But the way that LLVM IR actually factors into it can vary a lot between projects.
30:59 So the goal of Numba is to run your Python code super fast.
31:04 LLVM IR is just one step in what is, I think, a six-step process.
31:10 And there's not actually, there wasn't actually a way to extract the LLVM IR directly from Numba, which has changed now.
31:19 But when I initially wrote it, they didn't have that ability.
31:23 So I would have had to basically go in and pick and choose bits from their code and then move it into a separate project because there was no elegant way to really pull it out.
31:36 Another huge difference is that a lot of these compilers are JITs, which work great for what they're trying to do and generally work faster.
31:48 But the thing about JITs is that they're lazy.
31:51 And if you have a lazy compiler, it won't actually compile anything unless it's run.
31:57 But for Tupleware, we're completely unconcerned with running the code.
32:02 We just want to compile it and then we want to take that compiled code and pass it off to somebody else.
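Editor's sketch: the difference between compiling ahead of time and compiling lazily can be shown with CPython's own bytecode compiler as an analogy. This is not LLVM, Numba, or PyLLVM code; `LazyFunction` is a hypothetical stand-in for a lazy compiler, and `marshal` is used only to show that a compiled artifact can be handed off without ever being run locally.

```python
import marshal

SOURCE = "def add(a, b):\n    return a + b\n"

# Ahead of time: compile the source without executing any of it.
code_obj = compile(SOURCE, "<example>", "exec")

# The compiled artifact can be serialized and passed to another process.
# (marshal output is specific to the CPython version that produced it.)
payload = marshal.dumps(code_obj)


class LazyFunction:
    """Hypothetical stand-in for a lazy compiler: compiles on first call."""

    def __init__(self, source, name):
        self.source = source
        self.name = name
        self.func = None  # nothing has been compiled yet

    def __call__(self, *args):
        if self.func is None:
            # Compilation happens only because the function is being run.
            namespace = {}
            exec(compile(self.source, "<lazy>", "exec"), namespace)
            self.func = namespace[self.name]
        return self.func(*args)


lazy_add = LazyFunction(SOURCE, "add")
```

Until `lazy_add` is called, its `func` attribute stays `None`, whereas `payload` already holds compiled code that was never executed here.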
32:07 I see.
32:07 So that's basically part of the mechanism for deploying to this distributed cluster, right?
32:14 Because you've got to give it the executable code.
32:16 I see.
32:16 So sort of ahead-of-time JIT compilation would be as close as you could get or something.
32:22 I did some performance comparisons between Numba and PyLLVM.
32:27 But the problem there is that there's no way to actually run code using my compiler.
32:33 But you have to run the code in order to compile stuff with Numba.
32:37 So if you're doing benchmarking, the actual cost of the algorithm itself, it doesn't negate the data, but it makes it a pretty big asterisk.
32:48 Like, by the way, we also had to run the algorithm.
32:51 Yeah, sure.
32:53 So it's hard to compare apples to apples.
32:55 Yeah.
32:55 Okay.
32:56 So I guess maybe we should kind of wrap it up on PyLLVM and talk about your MongoDB stuff.
33:02 But first, two quick questions.
33:05 Is this a Python 2 or Python 3 project or both?
33:09 This is Python 2.7 is what I wrote it in.
33:13 Yeah, sure.
33:13 Okay, cool.
33:15 And what's the future for this project?
33:18 Do you know if anyone's picking it up or, you know, people out there listening, if it sounds interesting, you can pick it up.
33:23 It's on GitHub, right?
33:24 Yes, it is on GitHub.
33:25 I would really recommend people to look at the project and to contribute.
33:30 But that is also from like a curiosity slash selfish interest.
33:35 I think that if you're going to do it, there are a lot of design decisions that were made in the interest of this specific project.
33:45 And so if it matches your use case, that's excellent.
33:47 And if it doesn't match your use case, because there are so many variables, I recommend just writing one because it's probably one of my favorite projects I've ever worked on.
33:58 Just because it's such a neat, like a very well defined and very satisfying problem to solve.
34:04 Nice.
34:05 Do you feel like you understand how a lot of these compilers and execution fits together better now?
34:10 Yeah, I think it's probably the best learning project anyone could have is to actually understand what goes on under the hood of the language they use.
34:18 It also made me much, much better at understanding how optimizations work.
34:26 Okay.
34:26 Yeah, very cool.
34:27 All right.
34:28 So let's talk about this, what you're up to these days at MongoDB.
34:32 And you said you'd worked on PyMongo.
34:34 And just for everyone listening, like that's the primary driver.
34:37 The primary way to speak to MongoDB is to, you know, pip install PyMongo and import it.
34:43 And then you just start talking.
34:44 And basically the data exchange is dictionaries, right?
34:48 You write a prototypical query sort of thing as a dictionary and you get back rows, which are documents in the forms of dictionaries, right?
34:55 Yes, that's correct.
34:56 So you don't actually have to get dictionaries anymore.
34:59 For a long time, PyMongo would just automatically read your data into dictionaries.
35:04 But now you can actually get raw BSON out of the driver.
35:07 And that opens up a lot of doors for what you can do with it.
35:10 Right.
35:10 So BSON is binary JSON, which is the actual in-memory on-the-wire representation that you get talking to MongoDB.
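Editor's sketch: to make "binary JSON" concrete, here is a minimal encoder for a flat document, following the layout in the public BSON specification (an int32 total length, a list of typed elements, and a trailing zero byte). It handles only doubles, strings, and int32 values; the function names are hypothetical and this is not PyMongo's encoder.

```python
import struct


def encode_element(name, value):
    """Encode one key/value pair as a BSON element (double, string, int32 only)."""
    key = name.encode("utf-8") + b"\x00"  # element names are C strings
    if isinstance(value, bool):
        raise TypeError("bool is not handled in this sketch")
    if isinstance(value, float):
        # Type 0x01: 64-bit little-endian IEEE 754 double.
        return b"\x01" + key + struct.pack("<d", value)
    if isinstance(value, str):
        # Type 0x02: int32 byte length (including the trailing zero), then bytes.
        data = value.encode("utf-8") + b"\x00"
        return b"\x02" + key + struct.pack("<i", len(data)) + data
    if isinstance(value, int):
        # Type 0x10: 32-bit little-endian integer.
        return b"\x10" + key + struct.pack("<i", value)
    raise TypeError(f"unsupported type: {type(value).__name__}")


def encode_document(doc):
    """Encode a flat dict: int32 total length, the elements, then 0x00."""
    body = b"".join(encode_element(k, v) for k, v in doc.items())
    total_length = 4 + len(body) + 1
    return struct.pack("<i", total_length) + body + b"\x00"


print(encode_document({"hello": "world"}))
```

Because every value carries a type byte and fixed-width numbers are stored in machine order, a consumer written in C can read the fields without building Python objects first.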
35:19 You had a cool talk about something called Monary, which is kind of getting superseded by the project that you're working on now.
35:27 But you had some interesting performance numbers about getting dictionaries back or just in terms of analysis in general.
35:35 Forget the database for a minute.
35:37 Like working with dictionaries versus working with lists versus something like NumPy, right?
35:42 Yes.
35:43 So that was pretty enlightening for me as a relatively new Python programmer to realize that Python dictionaries, which I considered kind of the most basic way of storing data in Python, were actually pretty slow.
36:00 And compared to like ndarrays, which are C-style arrays that come with the NumPy package, versus just something like a list, they are significantly slower.
36:10 Yeah.
36:11 You had some cool numbers.
36:12 You said something like for a certain algorithm, working with a bunch of Python dictionaries, you could do like 12 million a second.
36:18 Yes.
36:18 And with lists, it was close to 10 times as much, 110 million a second.
36:22 And with NumPy, you could do 500 million a second.
36:25 Yes.
36:25 And that's just a really simple like take a bunch of data that's in this form and just add them all together or like make pairs or something like that.
36:35 Right, right.
36:36 Like basically read through it or something to that effect.
36:38 Exactly.
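Editor's sketch: a rough way to reproduce this kind of comparison yourself. The data, the field name `x`, and the helper `items_per_second` are all hypothetical, and the numbers you get will depend on your machine; nothing here reproduces the exact benchmark from the talk.

```python
import timeit

import numpy as np

N = 100_000

# The same values stored three ways.
dict_rows = [{"x": float(i)} for i in range(N)]
list_vals = [float(i) for i in range(N)]
array_vals = np.arange(N, dtype=np.float64)


def sum_dicts():
    """Add up one field from a list of dictionaries."""
    return sum(row["x"] for row in dict_rows)


def sum_list():
    """Add up a plain list of floats."""
    return sum(list_vals)


def sum_array():
    """Add up a NumPy array in compiled code."""
    return float(array_vals.sum())


def items_per_second(fn, repeats=5):
    """Return items processed per second, using the best of several runs."""
    best = min(timeit.repeat(fn, number=1, repeat=repeats))
    return N / best


for fn in (sum_dicts, sum_list, sum_array):
    print(f"{fn.__name__}: {items_per_second(fn):,.0f} items/second")
```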
36:39 Okay.
36:39 And so you said, look, the BSON capability notwithstanding, basically the way that PyMongo works is you do a query, you get a bunch of dictionaries back in Python.
36:49 But if you're doing data science or something computational, you probably want to work in, say, NumPy, right?
36:57 But the workflow would be I make a query in my Python layer.
37:00 It goes to Mongo.
37:01 That comes back across the wire.
37:03 It gets turned into Python dictionaries and then serialized back down in the C layer into NumPy.
37:08 And that was a problem, right?
37:10 Yes, that's a huge problem because you're taking something that is pretty fast, namely MongoDB, and you have something on the other end of your line, which is also really fast, NumPy.
37:20 And then you have this bottleneck, which is Python dictionaries.
37:24 And it's kind of a shame that there hasn't been more stuff related to this until recently because MongoDB is an incredibly powerful database and it's very easy to use.
37:36 And NumPy also has a selling point that it's extremely powerful and pretty easy to use.
37:40 So you'd think that for a lot of data scientists or for people who don't love coding in MATLAB, they would want to put these two tools together.
37:48 But MongoDB hasn't been used in this context super often because of the limitation that in order to get the data out, you have to put it through this kind of clunky data structure before getting it back into your super fast arrays.
38:01 Right.
38:01 So you said basically with some tests you did, like going through PyMongo, you could read from MongoDB through Python into NumPy about 150,000 documents a second.
38:12 And there's this other project called Monary that you were talking about when I saw this presentation where it basically says, let's stay down in the C layer the entire time.
38:24 And even though we're calling it from a Python app, it sort of connects it directly to NumPy, right?
38:31 Exactly. So now we are taking this raw BSON format and basically moving it directly into C style arrays.
38:39 So you have two things that are in sort of natural machine order and you no longer have to take it out of that.
38:47 Nice. And that's something like 10 times faster, right? At least it was.
38:51 Yeah, it was 1.7 million reads per second compared to 150,000.
38:56 That's a big difference. You might.
38:58 That's like a difference between going and getting a cup of coffee while your algorithm runs on your data versus going to lunch or something.
39:07 Yes, exactly. And so it's really exciting to me that we can leverage these two really awesome technologies.
39:14 And Monary itself is actually not under active development anymore.
39:19 Monary is a project that actually was community based.
39:22 It was started by somebody who did not work at MongoDB, David Beach, who basically wrote it because he was sick of having to lose so much time to Python dictionaries.
39:33 And so now we are writing a codec that is 100% in C that takes this raw BSON, which before, I think maybe a year or two ago, you couldn't actually get raw BSON from PyMongo, but now you can.
39:49 So now you can take this raw BSON and you can have this super lightweight package that just converts it directly into NumPy and you are good to go.
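Editor's sketch: the two paths described here can be contrasted in a few lines. The byte stream below is a stand-in made of fixed-width records, not real BSON, and the functions `via_dicts` and `via_buffer` are hypothetical; this does not use or reproduce the BSON NumPy API. It only shows why skipping the per-document dictionary step matters.

```python
import struct

import numpy as np

RECORD = struct.Struct("<dd")  # two little-endian 64-bit floats per record
DTYPE = [("x", "<f8"), ("y", "<f8")]

# Stand-in for bytes arriving off the wire: NOT real BSON, just packed records.
N = 1000
raw = b"".join(RECORD.pack(float(i), float(i) * 2.0) for i in range(N))


def via_dicts(buf):
    """Slow path: build one Python dict per record, then copy into an array."""
    docs = [dict(zip(("x", "y"), rec)) for rec in RECORD.iter_unpack(buf)]
    return np.array([(d["x"], d["y"]) for d in docs], dtype=DTYPE)


def via_buffer(buf):
    """Fast path: reinterpret the bytes in place as a structured array."""
    return np.frombuffer(buf, dtype=DTYPE)


slow = via_dicts(raw)
fast = via_buffer(raw)
print(slow["y"][:3], fast["y"][:3])
```

Both calls yield the same array contents; the second never creates a Python object per record.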
39:58 That's really cool. So what's this thing called now?
40:00 So this is called BSON NumPy, which is a deeply uncreative name.
40:05 And I am definitely open to alternatives, but it is descriptive.
40:10 Basically, it is in beta.
40:11 So maybe closer to alpha because we stopped working on it about two weeks ago and we released, we actually released about two weeks ago.
40:20 So it's the project of both me and A. Jesse Jiryu Davis, who is my coworker who you spoke to twice, I think.
40:29 Yeah, yeah. Jesse's a friend of the show.
40:30 So hello, Jesse.
40:31 And nice work on this project as well.
40:33 We are both really eager to hear people's feedback.
40:36 We really want MongoDB to become more useful for data scientists.
40:43 And we think that this is really the tool that is going to make it happen.
40:46 Yeah, it seems like a really great idea.
40:48 Just we'll skip the serialization where it's slow and just flow the data straight into NumPy.
40:54 This portion of Talk Python is brought to you by Hired.
41:08 Hired is the platform for top Python developer jobs.
41:11 Create your profile and instantly get access to 3,500 companies who will work to compete with you.
41:16 Take it from one of Hired's users who recently got a job and said, I had my first offer on Thursday after going live on Monday and I ended up getting eight offers in total.
41:24 I've worked with recruiters in the past, but they've always been pretty hit and miss.
41:28 I tried LinkedIn, but I found Hired to be the best.
41:30 I really like knowing the salary up front.
41:33 Privacy was also a huge seller for me.
41:35 Sounds awesome, doesn't it?
41:37 Well, wait until you hear about the signing bonus.
41:39 Everyone who accepts a job from Hired gets a $1,000 signing bonus.
41:42 And as Talk Python listeners, it gets way sweeter.
41:45 Use the link Hired.com slash Talk Python to me and Hired will double the signing bonus to $2,000.
41:50 Opportunity's knocking.
41:52 Visit Hired.com slash Talk Python to me and answer the door.
42:03 I guess people can just go to MongoDB and find it.
42:05 How do they learn more about this project?
42:07 Because it's in its very, very first iteration.
42:11 It's not...
42:12 We released it on bson-numpy.readthedocs.io.
42:17 And you can find it on Read the Docs.
42:21 Basically, we...
42:22 Neither myself nor Jesse are data scientists, but we're both super familiar with Python and
42:29 we're super familiar with MongoDB.
42:31 But that does leave a lot of open questions that I wouldn't necessarily know how to answer.
42:37 So I'm not sure what the most common data type for NumPy would be.
42:43 I mean, there are all these really cool, complex data types, but for all I know, nobody ever
42:48 uses them.
42:49 So what I really want is to be able to reach out to the community and have people tell me
42:53 what they need.
42:54 Because a lot of the great features at MongoDB pretty much came out because somebody had a
42:59 need and they asked for it.
43:00 So we wrote them and then it became a huge selling point.
43:02 So what I want is just to hear what people who would use these technologies really want.
43:07 Yeah, absolutely.
43:08 So people can check it out.
43:09 It's bson-numpy on PyPI.
43:12 And also I'll put the link in the show notes.
43:14 So check it out and give both of you guys feedback on like, hey, this looks really cool, but it
43:20 doesn't do whatever, right?
43:22 Exactly.
43:23 Nice.
43:23 You also had in your presentation about Monary, it's still kind of the same type of question
43:29 and answer type thing, the same kind of analysis you can do
43:32 with BSON NumPy. You had an interesting analysis you did of taxi cabs and Times Square, right?
43:39 Do you want to tell people that story?
43:41 Sure.
43:41 So when I first joined MongoDB, I was working out of the headquarters, which is in Times Square
43:49 in New York City.
43:50 As a side note, right now I work out of Stockholm, Sweden, which I prefer quite a bit more to
43:55 fighting my way through the crowds.
43:56 But I was struck one morning when I was trying to think of sort of a nice data set to
44:02 sort of show examples of how much faster Monary would be than PyMongo.
44:07 I just was so sick of fighting through the crowds and fighting to get to the train station.
44:14 And I kind of figured like, wow, what is...
44:16 Why is there a Batman in my way?
44:17 What's going on in this place?
44:20 Exactly.
44:20 It's like, how can I get out of here as fast as possible?
44:23 Like, where can I go?
44:25 How can I do it?
44:26 So I also had access to the taxi data for all of New York City, which is freely available
44:33 online.
44:34 And it's a really interesting data set.
44:36 And so I basically just took that data set and I looked at all the rides that both started
44:43 and ended in Times Square.
44:44 Because I kind of just wanted to know, like, where are people going?
44:47 Like, what's the rush?
44:48 Where are you coming from?
44:50 Why are you coming here?
44:52 Of all the places in New York City to be, why Times Square?
44:54 Okay.
44:55 Yeah.
44:55 That's really cool.
44:56 And you found, you had a really great bunch of visualizations that came out of MATLAB.
45:01 And I'll put the link to that video up there because you have, you know, some really great
45:06 maps and like bars living on flat maps and all kinds of stuff.
45:10 I thought that was great.
45:10 Yeah.
45:11 It's, like, it's Matplotlib.
45:12 Oh, yeah, yeah.
45:13 Sorry.
45:13 I keep, I don't know why I said MATLAB.
45:14 MATLAB because we were talking about, yeah, yeah.
45:16 Matplotlib is what I had in mind.
45:18 Yeah.
45:18 But yeah, no, I mean, the point of those diagrams was just basically to show off how
45:23 cool NumPy can be and how powerful Matplotlib can be for creating really beautiful visualizations.
45:31 And I know that there are a lot of different visualization tools out there.
45:35 But in terms of selling Python and providing a way for data scientists to use both MongoDB
45:41 and NumPy, I felt like it was a pretty good sell.
45:44 Yeah, that's cool.
45:45 And you also were able to leverage, I think the interesting linkage there is you were able
45:50 to put a huge amount of data in Mongo and apply a geospatial index to it and then use that
45:57 in your queries, but then also do the analysis with NumPy, right?
46:01 Mm-hmm.
46:02 Exactly.
46:03 So you are taking advantage of these geo queries, which are really simple, really easy to use,
46:08 but also very fast.
46:10 And you are taking advantage of Python, NumPy specifically, where you have really, really
46:17 fast analysis, but a lot of these algorithms are actually written for you.
46:20 So you can use a lot of these scientific Python packages that have done what I consider the hard
46:26 work for you completely.
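Editor's sketch: a geo query of the kind described, for rides that both start and end near Times Square, can be written as an ordinary Python dictionary. The field names `pickup` and `dropoff`, the coordinates, the 500 meter radius, and the helper functions are all assumptions for illustration; the actual schema used in the talk is not given here. `$geoWithin` and `$centerSphere` are standard MongoDB query operators, and `$centerSphere` takes its radius in radians.

```python
EARTH_RADIUS_METERS = 6378100.0  # approximate equatorial radius

# Approximate [longitude, latitude] of Times Square (assumed for illustration).
TIMES_SQUARE = [-73.9855, 40.7580]

# Index specifications one might pass to PyMongo's collection.create_index().
INDEX_SPECS = [[("pickup", "2dsphere")], [("dropoff", "2dsphere")]]


def within_circle(field, center, radius_meters):
    """Build a filter matching points in `field` inside a spherical circle."""
    radius_radians = radius_meters / EARTH_RADIUS_METERS
    return {field: {"$geoWithin": {"$centerSphere": [center, radius_radians]}}}


def rides_within(center, radius_meters):
    """Build a filter for rides whose pickup AND dropoff fall inside the circle."""
    query = {}
    query.update(within_circle("pickup", center, radius_meters))
    query.update(within_circle("dropoff", center, radius_meters))
    return query


# The resulting dict is what would be passed to collection.find(query).
query = rides_within(TIMES_SQUARE, 500.0)
print(query)
```

The documents that come back can then be loaded into NumPy arrays for the analysis step.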
46:27 Yeah.
46:28 Yeah.
46:28 It seems like just take the tools, click them together, and you get some really great
46:32 analysis.
46:33 That's awesome.
46:33 Yeah.
46:34 So I'm looking forward to your presentation where you talk about your BSON NumPy version.
46:40 Yeah.
46:41 Well, I'll have to find something better, some better visualizations and some, you know, I
46:47 don't have as many things to complain about now that I live in Stockholm.
46:49 So I have to...
46:51 Yeah.
46:51 You can do something with snow or winter or something.
46:54 Who knows?
46:54 Yeah.
46:55 The darkness.
46:55 Yes, exactly.
46:56 Here's the light analysis.
46:58 All right.
46:59 Cool.
46:59 So I guess let's go ahead and leave it there.
47:01 That sounds like a great project.
47:02 And so if you're out there working with MongoDB and ultimately your data ends up in NumPy,
47:07 you know, maybe skip the dictionaries is the message, right?
47:10 If you can.
47:10 Yeah, exactly.
47:11 And especially for people who use scientific Python and have never considered MongoDB a viable
47:17 option, I think that is going to change.
47:19 And I think that we are going to be able to write programs that take advantage of all this
47:25 cool work that has already been done.
47:27 Yeah.
47:27 And I think having it officially part of the product, the library, gives it a little more,
47:33 probably gives people more confidence to build upon it.
47:35 Yeah, exactly.
47:36 And I am personally just so excited because this is the first project that I've really had
47:41 ownership over.
47:42 I contributed a lot to PyMongo, but it wasn't my project.
47:45 Well, I feel very proud of what I've done for this particular project.
47:50 It sounds really, really cool.
47:51 So, so nice work.
47:52 Let me ask you a couple of questions before we get out of here, as I always do.
47:57 First one, if you're going to write some Python code, maybe not C code, if you write some Python
48:01 code, what editor do you open up?
48:03 So I use PyCharm.
48:04 And it's funny that you mentioned C code because I got a perpetual license for JetBrains.
48:13 And I love that I can write, I can write Node, I can write Ruby, I can write Python, I can write C.
48:20 And all my shortcuts are the same.
48:22 I was a Vim user before Jesse convinced me to switch over to PyCharm.
48:27 And now I'm like completely converted.
48:30 I can have all my Vim shortcuts and I can also have all the power of a really great IDE.
48:34 So I really recommend the JetBrains libraries because it's really nice to not have to switch
48:39 between IDEs, between like CLion, PyCharm, RubyMine, WebStorm.
48:45 You know, it's the same environment.
48:47 So I feel like I don't have to lose any time to relearning stuff.
48:50 Yeah, that's really cool.
48:51 Jesse's a big fan of PyCharm and so am I.
48:54 Just so people who are not familiar, basically there's this IntelliJ platform, which is kind
48:59 of the IDE and then they plug in the language specific stuff.
49:02 So like when you say it's all the same, it's like all kind of the same base.
49:05 And like you said, like it behaves the same.
49:07 It's pretty sweet.
49:08 Yeah.
49:08 Okay, cool.
49:09 And we just this week passed 100,000 packages on PyPI.
49:14 So that's a big milestone.
49:16 Hooray.
49:17 Wow.
49:18 Yeah, no kidding.
49:19 And so there's a ton of them out there.
49:21 Do you have one that maybe like you think people don't necessarily know about?
49:24 You're like, hey, you should check this out.
49:26 BSON NumPy maybe?
49:28 Yes.
49:29 Is it too shamelessly self-promoting to say that you should download BSON NumPy and that's
49:34 the best package on PyPI?
49:36 It's one of the newer ones.
49:38 That's a really good question.
49:39 I guess I'll just go with that because I'm very utilitarian with my packages.
49:44 I pretty much just keep them to a minimum.
49:47 Okay.
49:47 Yeah.
49:48 Very cool.
49:48 All right.
49:49 Yeah.
49:50 It's a great package.
49:50 And you know, you guys need some people to try it out and actually do real data science
49:54 with it.
49:54 So that's great.
49:55 Exactly.
49:56 All right.
49:57 So final call to action.
49:58 What do you want people to check out or do?
50:00 I would like people to check out BSON NumPy because I think it's a really cool project.
50:05 I think that if anyone is interested in a really neat piece of code, they should look through
50:12 the PyLLVM work, especially if you're either new to compilers, maybe new to Python, new to
50:17 LLVM IR.
50:18 It's a very nice sort of pet project and it's very self-contained and it is, in my opinion,
50:24 pretty well documented.
50:25 There's an entire 20-page thesis I wrote about it.
50:28 So if you have questions, they're probably answered.
50:30 I'm also always happy to answer by like email, Twitter, anything.
50:34 So those are my two things I would love to see people do.
50:38 All right.
50:38 That sounds great.
50:39 Anna, thank you so much for being on the show.
50:41 It's been really interesting to talk to you.
50:42 Yeah, of course.
50:43 Thank you so much again for having me.
50:44 Yeah.
50:45 Thanks so much.
50:45 Bye.
50:45 Bye.
50:46 This has been another episode of Talk Python to Me.
50:50 Today's guest has been Anna Herlihy and this episode has been sponsored by Hired and Talk
50:56 Python Training.
50:57 Are you or a colleague trying to learn Python?
51:00 Have you tried books and videos that just left you bored by covering topics point by point?
51:05 Well, check out my online course, Python Jumpstart by Building 10 Apps at talkpython.fm
51:10 slash course to experience a more engaging way to learn Python.
51:13 And if you're looking for something a little more advanced, try my Write Pythonic code course
51:18 at talkpython.fm/pythonic.
51:21 Hired wants to help you find your next big thing.
51:24 Visit Hired.com slash Talk Python to me to get five or more offers with salary and equity
51:29 presented right up front and a special listener signing bonus of $2,000.
51:35 Be sure to subscribe to the show.
51:36 Open your favorite podcatcher and search for Python.
51:38 We should be right at the top.
51:40 You can also find the iTunes feed at /itunes, Google Play feed at /play and direct
51:46 RSS feed at /rss on talkpython.fm.
51:49 Our theme music is Developers, Developers, Developers by Corey Smith, who goes by Smix.
51:54 Corey just recently started selling his tracks on iTunes, so I recommend you check it out at
51:59 talkpython.fm/music.
52:01 You can browse his tracks he has for sale on iTunes and listen to the full length version of the theme
52:06 song.
52:06 This is your host, Michael Kennedy.
52:08 Thanks so much for listening.
52:09 I really appreciate it.
52:11 Smix, let's get out of here.
52:13 I'll see you next time.