#103: Compiling Python through PyLLVM and MongoDB for Data Scientists Transcript
00:00 Michael Kennedy: This episode we have an optimization two-fer. We begin by looking at optimizing a subset of Python code for machine learning using the LLVM compiler, with a project called PyLLVM. It takes plain Python code, compiles it to optimized machine instructions, and distributes it across a cluster to do machine learning. In the second half, we'll look at a fabulous new way to work with MongoDB for Python-writing data scientists. The project is called bson-numpy, and provides direct connections between MongoDB and NumPy. It's 10 times faster than working with PyMongo directly, if you plan to end up in NumPy anyway. You're about to meet the woman behind both of these projects, Anna Herlihy. This is Talk Python To Me, Episode 103, recorded February 6th, 2017. Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @MKennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is brought to you by Talk Python Training and Hired. Be sure to check out what we both have to offer during our segments. It helps support the show. Anna, welcome to Talk Python.
01:43 Anna Herlihy: Thank you, thank you for having me.
01:45 Michael Kennedy: Yeah, it's great to have you here. We got a couple of really cool things to talk about. We're going to talk about PyLLVM, which is a really cool project that you worked on. And we're also gonna talk about a super high performance sort of data science-y thing with MongoDB that you're also working on right now. So I'm looking forward to talking to you about both of those, but before we get into those, what's your story? How'd you get into programming in Python?
02:08 Anna Herlihy: I had my first real experience with programming, I mean I had done some Basic when I was pretty small, but I thought it was not very fun so I kind of put that off for another five or six years. But in university I was kind of all over the place. I didn't know if I wanted to be a writer or an engineer, but I figured that computers are probably gonna be relevant. So I took--
02:33 Michael Kennedy: Good bet.
02:34 Anna Herlihy: I took a CS class because I didn't know the difference between computer science and computer literacy, but it ended up actually working out in my favor because I really really enjoyed it and I ended up prioritizing my CS classes over all my other classes, and then I kinda figured maybe I should just do this full time.
02:50 Michael Kennedy: That's really cool. What was that first CS class?
02:53 Anna Herlihy: That was Andy van Dam's Intro to Computer Science. Which for anybody who went to Brown, will recognize that it's quite a popular course. I mean when I took it it was maybe 150 people, but I think just last year they had 250 or something. It's really really blowing up.
03:11 Michael Kennedy: Oh that's excellent. What language did you study? Do you remember?
03:14 Anna Herlihy: Yeah it was a Java course. We did a lot of Swing.
03:17 Michael Kennedy: Okay, excellent. Yeah and so you decided, "Hey I kinda like this programming stuff, let's just do that as the job," right? Yeah how do you go from learning Java in CS101 to working in Python?
03:29 Anna Herlihy: So I really liked programming languages in general. So I wrote some compilers, some interpreters. My area in computer science was mostly looking at different programming languages, and I found that Python was kind of the most elegant, in my opinion, and I didn't feel bogged down by really complicated syntax, but it also felt very powerful. So I kind of felt that, as opposed to being pure Python I wanted to do more, like, Python in C stuff, sort of more from the very beginning to how it gets done, to the computer code at the end.
04:07 Michael Kennedy: I see, so you really were interested in the internals and stuff, and whatnot, huh?
04:11 Anna Herlihy: Yes.
04:11 Michael Kennedy: Okay. Cool. So the first project we're gonna talk about, which I want to ask you one question quick before that, but PyLLVM, that was part of your university work right?
04:23 Anna Herlihy: Yes, exactly. The project that it was based on, Tupleware, is a research project with some of the database professors at Brown. And my project was a senior thesis that was built sort of on top of that.
04:36 Michael Kennedy: Okay, wow, very cool, very cool. Okay so today you work at MongoDB, right?
04:40 Anna Herlihy: Yes.
04:40 Michael Kennedy: Yeah, what do you do there?
04:42 Anna Herlihy: So I actually recently switched over from spending the majority of my time on Python work, so like PyMongo, Mongo Connector, other sort of Python-related projects, to MongoDB Compass, which is basically a user interface for MongoDB. It's actually written in JavaScript, which was something new for me since I had done a pretty excellent job of avoiding JavaScript until now.
05:06 Michael Kennedy: It finally got you, huh?
05:08 Anna Herlihy: Yeah I know, it was inevitable. But now I really like it. I can't say that I am converted, and I still think that Python is the community and the language that I like best. I'm probably preaching to the choir here. But I now split my time between working on Compass, and working on this bson-numpy package, which I hope we'll get to talk about later.
05:28 Michael Kennedy: Yeah we'll definitely talk about it, 'cause it's very cool. Okay, excellent. So sounds like you have a lot of cool projects going on. Let's talk about the project that you did first with this LLVM thing. I suspect a number of people out there know what LLVM is, but the audience is diverse. There's people from all sorts of places. Let's start with just talking about, what is LLVM?
05:50 Anna Herlihy: Cool so, despite the name, it doesn't really have too much to do with virtual machines. It's basically just a collection of compiler technologies. So the LLVM project itself is huge. But what I worked with is LLVM IR, which stands for Intermediate Representation. And LLVM IR is basically a way of representing code, that's about halfway between a top-level language, so for example Python, and some machine code, so something that you wouldn't really want to read, you could probably look at the assembler and know what's going on but it's not something you would ever really want to write. But runs much much faster than something that a human would be able to write.
06:30 Michael Kennedy: Sure, so I don't know all that much about it actually myself, so, do you compile the source language into this LLVM IR, and then do a further compilation towards some final target?
06:43 Anna Herlihy: Yes, exactly. And that's actually what makes it so powerful is that a lot of the time you have compilers that have to be very specialized. So they take a language and they take a platform, and that is what the compiler does is it compiles Python for a particular architecture. What LLVM IR does, which is not unique to LLVM, it's true for all intermediary representations, or most of them, is that it's a way of making things both language and platform agnostic. So you can take any popular language you want, from Java, R, Python, and you can compile it down to LLVM IR. Then you can take that LLVM IR and you can ship it anywhere, and it can be compiled down to most platforms. So it doesn't even matter what the original language was, and it doesn't matter what platform you're eventually gonna run it on. It makes the code really, it makes it very cross-platform and it makes it fast.
07:44 Michael Kennedy: Yeah sure. So basically if you can get something to compile down to LLVM IR, you can get it to run quickly on many many places, right? Because there's existing infrastructure to turn LLVM IR into executable code on all sorts of platforms, right?
08:00 Anna Herlihy: Exactly, and it's super powerful. The optimizations that the LLVM IR compilers do make it so that it doesn't matter if you wrote your code in C, or if you wrote your code in Python. I'm of the opinion that it's much easier to write your code in Python than it is to write it in C. And so it's pretty nice, because you get the benefits of maybe a more difficult syntax, with pretty much whatever you're comfortable in.
08:23 Michael Kennedy: Yeah that's really cool. What's the weirdest thing that you can execute this stuff on? What's the weirdest platform?
08:30 Anna Herlihy: That's a good question. Actually I don't know. But I think it's pretty easy to, if you just look up the LLVM IR, or the LLVM Compiler Project, I think, it's a really good way to get to know compilers in general. So there are a lot of proof-of-concept compilers that use LLVM because the infrastructure is so good. So I'm sure somebody has written something for an extremely obscure platform just to show that they could do it.
08:58 Michael Kennedy: Yeah I'm sure. Do you think that it makes a lot of sense for people to create sort of from scratch compilers these days? Or should most people just be building on LLVM?
09:08 Anna Herlihy: I think it depends on what the goal of your project is. So because this was a project that was done in the world of academia, I pretty much just picked whatever seemed like it would be the most fun. And I know that's not necessarily true for a lot of people who have needs and users and they have to work quickly, and so I think it just depends on if you want something to work faster, or if you want to learn a lot while you're doing it.
09:37 Michael Kennedy: Sure, yeah, and I guess also, you only had, it was a senior project sort of thing right? So you couldn't start from scratch and do a ton of work, you had a time frame, and limited amount of time and energy right? So I guess that's--
09:48 Anna Herlihy: Exactly, I had to graduate.
09:50 Michael Kennedy: Exactly. So that's a pretty good testament to LLVM. Cool, and I know it's used for Swift and some other things on the Apple platform, and it's pretty cool. Okay so you also talked about a project to do with machine learning called Tupleware, and this is a project at Brown right?
10:07 Anna Herlihy: Yes, it is. So Tupleware's tagline is that it is a distributed analytical platform that leverages LLVM IR to be totally platform agnostic, and totally language agnostic. So the way that PyLLVM fits into Tupleware is basically a proof of concept that you could take Python code, and you could compile it down to LLVM IR, and then you could ship it to your clusters and then have your code be automatically run on a very large dataset.
10:45 Michael Kennedy: Okay cool, so this is like a distributed machine learning type of system, and your project, PyLLVM, basically made it possible to feed Python, instead of say C++ code or something like that to it, right?
10:57 Anna Herlihy: Exactly, and I chose Python because I didn't really want to write an R compiler. But also because at the time I was working on this, which was about three, 3 1/2 years ago now, the language that most scientists were using for programming was still MATLAB. I actually had a summer job where I converted MATLAB code to scientific Python code. And it was becoming more and more common that people who were not programmers would be programming in Python instead of just inheriting these old MATLAB scripts that they would reuse. So that's why I picked Python.
11:33 Michael Kennedy: Yeah that's cool. I've definitely seen that as well. When I was in school, MATLAB was definitely the thing. People would have all these scripts. One of my first programming jobs was to take a bunch of MATLAB code and turn it into, like, graphical visualizations on Silicon Graphics supercomputers. And I had to convert the MATLAB code to C++. But if it was today, I very well might have done that in Python instead.
11:58 Anna Herlihy: Yeah that's funny. Sounds like the same job.
12:00 Michael Kennedy: Yeah exactly, exactly. Awesome. And so you said PyLLVM started from an abandoned open source project. What's the story there?
12:10 Anna Herlihy: Basically once I had pinned down what exactly I wanted to do for my thesis, I went on a search spree where I tried to figure out if somebody else had done it for me, which would've been very convenient. But at the time I couldn't find any Python to LLVM IR compilers that were already up and running, or already published. So what I did is I found this GitHub, actually it wasn't even GitHub, it was--
12:38 Michael Kennedy: A Google Code right?
12:39 Anna Herlihy: Yeah, it was a Google Code project, which had been documented almost zero. There were some slides in Japanese that I found, which I tried to take to Google Translate, which did not work. So it was basically just some code, and it ended up being almost exactly what I needed because it was a pretty simple compiler outline without any of it really implemented, but the structural stuff was there. So a lot of the scoping and a lot of the variable tracking had been started but not finished. So it was kind of the perfect project to pick up because a lot of the design decisions that would've resulted in a lot of cost versus value arguments, so you know, we could implement it this way but it would take a lot of time and we might not actually get that much out of it, were sort of already decided for me, so I had a pretty clear path in terms of getting the subset of Python it supported to be something that I thought a user could actually make use of.
13:38 Michael Kennedy: Yeah that seems really great that that's out there and you can pick it up. It's funny it was on Google Code, because not only was the project kind of abandoned, like Google Code itself is kind of in archive mode, right? I guess it's a real testament to GitHub.
13:52 Anna Herlihy: I think it was in archive mode when I found it even.
13:56 Michael Kennedy: It's definitely in archive mode now, it's several layers. Yeah but that's helpful that it's like, well we already kind of put the thinking into place, but hey I need, they just didn't take it to completion. And so okay, these are the final to-dos on this project or whatever, 'til I can actually use it on Tupleware.
14:13 Anna Herlihy: Yeah I mean when I found it, it wouldn't even compile. So I have really a lot of respect for the person that wrote it, because it really had everything. It was clear that this person didn't actually run it, but they had thought about it a lot. And the skeleton was there. But it hadn't actually gone through the process of becoming a tool that could be run. And that, to me, is very incredible. I mean I run my code every 10 seconds just to make sure that it's working.
14:39 Michael Kennedy: Exactly, exactly. Yeah I'm with you on that, like, run it often. I guess some people, even probably more in the early days when compiling took longer and stuff, but some people really sit there and work through it and then they run it, you know, hours after hours of work. That's not how I work. I run it often.
14:59 Anna Herlihy: Yes. Yeah it's very impressive.
15:01 Michael Kennedy: Yes indeed. It takes a subset of Python. I can't feed it like Django and have it spit that out, right? Because what happens is it takes the Python language and it turns it to LLVM IR, which then finally compiles down to machine instructions. But there's no interpreter, there's no standard library underneath there for it to run, right? So you've gotta be pretty focused on what you give it.
15:26 Anna Herlihy: Yes. What it does have is sort of the C standard library, so for example if you wanna call like, print line. You can do some somewhat complicated gymnastics in order to get LLVM to make that external call. So that's really nice, because getting print line to work when you're writing a compiler is so important. And when it does work, it's really one of the best feelings ever. But generally no, you wouldn't be able to import a Python package, for example. Since the goal of the project was really just to provide people an alternative to using MATLAB, so instead of writing your machine learning algorithms in MATLAB, you can write them in Python. You know in this case specifically it would be instead of writing in C++ you're writing it in Python. But the nature of those algorithms is that they're quite simple, so I didn't spend a lot of time trying to get objects or more complicated data structures working because I didn't anticipate it would be that common.
16:28 Michael Kennedy: Right, so you're working with like, loops, and join, and map, and stuff like that. And then you feed it basic algorithms, huh? So like Bayesian stuff, or linear regression?
16:41 Anna Herlihy: Yeah exactly.
16:41 Michael Kennedy: Okay.
16:43 Anna Herlihy: I'm not a machine learning. I had not really done much machine learning at that time, so I basically just went to my ML professor and was like, give me your top five most common machine learning algorithms that you would expect to want to run on a large dataset.
16:59 Michael Kennedy: I'll make these run in Python. Through this thing. Nice. So you basically, you write the algorithms, they compile down, and then finally get transformed to run basically on the C, C++ runtime, like you said. And that goes through something called Boost.Python, right? What is Boost.Python?
17:17 Anna Herlihy: So Boost.Python, Boost is a bunch of packages for C++ that provide a lot of really really powerful capabilities. And so Boost.Python is actually just an easier way of calling the Python/C API. If I had done this now I probably just would have skipped using Boost entirely because I'm pretty comfortable with the Python/C API, but at the time it was sort of the first Python plus another language interfacing I had done before, and this was pretty simple because it was already being used by Tupleware itself, so I didn't have to do a lot of, I didn't actually have to incorporate it into the ecosystem, it was already there.
17:57 Michael Kennedy: Yeah that's cool. By now you've done so much with PyMongo and all the stuff at MongoDB.
18:02 Anna Herlihy: Yeah.
18:02 Michael Kennedy: Yeah, cool. Give us like a short example of what this algorithm might look like. Like, code over audio's really hard, but just like, what kind of stuff would you feed off to the system?
18:14 Anna Herlihy: So I would expect there to be, so you gotta assign to your variables. You most likely have an array that you are assigning values to. You need to iterate through the array, or you need to iterate through the data that you've been given. And you need to do a lot of, or at least a decent amount of error checking. So what that means in terms of syntax is you would want to have reassignable variables. You would need to have conditionals, loops, and arithmetic. Those are basically the most important things.
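As a rough illustration of the kind of code described here, the sketch below fits a straight line by gradient descent using only reassignable variables, list indexing, loops, a conditional, and arithmetic. It is not taken from PyLLVM or Tupleware; the function name `fit_line` and its parameters are hypothetical, and whether PyLLVM would accept this exact code is an assumption.

```python
def fit_line(xs, ys, learning_rate, steps):
    """Fit y = slope * x + intercept by batch gradient descent.

    Hypothetical example: it sticks to assignment, indexing, loops,
    a conditional, and arithmetic, with no imports and no objects.
    """
    n = len(xs)
    # Basic error check: refuse empty or mismatched inputs.
    if n == 0 or n != len(ys):
        return [0.0, 0.0]

    slope = 0.0
    intercept = 0.0
    step = 0
    while step < steps:
        grad_slope = 0.0
        grad_intercept = 0.0
        i = 0
        while i < n:
            error = slope * xs[i] + intercept - ys[i]
            grad_slope = grad_slope + error * xs[i]
            grad_intercept = grad_intercept + error
            i = i + 1
        # Gradient of the mean squared error, then one descent step.
        slope = slope - learning_rate * 2.0 * grad_slope / n
        intercept = intercept - learning_rate * 2.0 * grad_intercept / n
        step = step + 1
    return [slope, intercept]


result = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], 0.05, 2000)
```

Every variable keeps one numeric type for its whole lifetime, which fits the static-typing restriction discussed later in the conversation.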
18:51 Michael Kennedy: This portion of Talk Python To Me is brought to you by, us! As many of you know, I have a growing set of courses to help you go from Python beginner to novice to Python expert, and there are many more courses in the works. So please consider Talk Python Training for you and your team's training needs. If you're just getting started, I've built a course to teach you Python the way professional developers learn, by building applications. Check out my Python Jumpstart by Building 10 Apps at talkpython.fm/course. Are you looking to start adding services to your app? Try my brand new Consuming HTTP Services in Python. You'll learn to work with RESTful HTTP services as well as SOAP, JSON, and XML data formats. Do you want to launch an online business? Well Matt Makai and I have built an entrepreneur's playbook with Python for Entrepreneurs. This 16-hour course will teach you everything you need to launch your web-based business with Python. And finally there's a couple of new course announcements coming really soon. So if you don't already have an account, be sure to create one at training.talkpython.fm to get notified. And for all of you who have bought my courses, thank you so much. It really really helps support the show. One of the things that seems a little challenging to me is like, you're building something in Python. And it's being more or less just compiled into the results of C++, which is a typed system, right? So it expects, you know, here's a four-byte integer. Here's a Boolean, here's a string pointer, right? Whereas in Python you don't really have that. So what did you have to do to make the type system sort of fit together there?
20:25 Anna Herlihy: That is a problem that comes up time and time again, because no matter how many layers of abstraction you're talking about, ultimately machine code is pretty strictly typed. And we don't really like working in strictly typed languages. So there's a lot of cost-benefit analysis going on there, where you could say, I demand that I can reassign my variable from an array to an integer. Or between different types of numbers, for example. Or you could say, that's gonna save me a ton of work if I just tell my user, who I anticipate is probably used to MATLAB, don't do that. This is Python but it's not that Pythonic. You know, maybe down the line when this is no longer a proof of concept. Ultimately what we ended up doing was just sticking to pretty static types, because whenever you write a compiler it's kind of a debate between how much do I want to reinvent the wheel, and how much are my users gonna be willing to sort of have a more limited experience for the sake of, like, my sanity, or how much time I'm willing to put into this project?
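The "don't do that" rule for static types can be pictured as a small check that a compiler front end might run before generating code. The sketch below is an assumption about what such a check could look like, not PyLLVM's actual implementation: it uses Python 3's standard `ast` module (PyLLVM itself targeted Python 2.7), only looks at top-level assignments of literals, and the name `check_static_types` is hypothetical.

```python
import ast


def check_static_types(source):
    """Reject rebinding a name to a literal of a different type.

    Illustrative only: handles top-level `name = literal` statements
    and ignores everything else.
    """
    seen = {}  # variable name -> type name of its first literal value
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign):
            continue
        if not isinstance(node.value, ast.Constant):
            continue  # this sketch only understands literal right-hand sides
        value_type = type(node.value.value).__name__
        for target in node.targets:
            if not isinstance(target, ast.Name):
                continue
            first_type = seen.setdefault(target.id, value_type)
            if first_type != value_type:
                raise TypeError(
                    "%s was %s, cannot reassign to %s"
                    % (target.id, first_type, value_type)
                )
    return seen


# Same type on reassignment is fine.
accepted = check_static_types("x = 1\nx = 2\ny = 1.5\n")

# Changing the type is rejected.
try:
    check_static_types("x = 1\nx = 2.0\n")
    rejected = False
except TypeError:
    rejected = True
```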
21:37 Michael Kennedy: Sure, that makes a lot of sense. I feel like because it's such a limited set of types really, that you can write your algorithm in, it's not so bad to sort of restrict it and talk about the types, right? You've got the fundamental numerical types, strings, lists, and a few other things that they can really work with, right?
21:58 Anna Herlihy: Yeah and I definitely, that's what we're trying to sell here. But I remember having this argument where I came up with what I thought was a super clever solution for how we're gonna do dynamic types, and it was twice as much code as the entire compiler was before that point, and I brought it to my advisor, and I was like, this is such a good idea. I can't wait to implement this. And he kinda looked at me, and he was like, this is not what we need right now. What we need is something that is working. Which I guess is my first experience with writing code that actually needs to do something, and it needs to do something as soon as possible. As opposed to code that's like, beautiful and elegant, and I've prototyped it three times and I have all the time in the world.
22:42 Michael Kennedy: Yeah, shipping is a feature, right?
22:45 Anna Herlihy: Yeah.
22:46 Michael Kennedy: No one's gonna use your code if you don't actually get it out there and get it working, I guess.
22:49 Anna Herlihy: Yeah, I mean I still struggle with that now. I definitely like to write code that is fully baked, as opposed to just getting stuff out of the door. But I'm definitely getting used to that now that I'm working in JavaScript more.
23:03 Michael Kennedy: Yeah that's one benefit, huh? That's awesome. I guess I really appreciate getting something out there so that people can use it and give me feedback, like, "This is working, this is not working." But yeah, you just have to have some flexibility. Can't get locked into some early prototype API or something, right?
23:20 Anna Herlihy: Yeah exactly.
23:20 Michael Kennedy: Yeah, so you said that some of the LLVM IR features made your life easier, and some made it harder. How'd that work out?
23:27 Anna Herlihy: So probably the most involved issue that comes up between converting from Python to an intermediary representation is like what you mentioned with types and reassigning. LLVM IR is written in SSA, which means Static Single Assignment. And what that basically means is that you have your registers, which are your smallest unit of storage, and you can only assign to them once. So if you assign a number to your register, even if you want to assign something of the same type, it is frozen for that function call. So what I needed to do to get around that was basically move everything onto the stack, and once everything is on the stack then you need to keep track of stuff using what people call a symbol table, which is basically just a dictionary where you can look up, I have a variable named x, and it lives at this memory address, and it's been around for this long, and all sorts of other metadata like that. Yeah, does that make sense?
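One way to picture the stack-plus-symbol-table workaround: each Python variable gets a single stack slot, reassignment becomes another store to that slot, and every read loads into a fresh register that is written exactly once. The sketch below is an assumption for illustration, not PyLLVM's code. The `SymbolTable` class and its methods are hypothetical, and the emitted strings imitate LLVM IR in its older typed-pointer spelling without being checked against any particular LLVM release.

```python
class SymbolTable:
    """Maps variable names to stack slots and records IR-like text."""

    def __init__(self):
        self.slots = {}  # variable name -> name of its stack slot
        self.lines = []  # emitted IR-like instructions, in order
        self.temp_count = 0

    def assign(self, name, value):
        # First assignment reserves a stack slot; later ones reuse it,
        # so no register is ever assigned twice.
        if name not in self.slots:
            slot = "%" + name + ".addr"
            self.slots[name] = slot
            self.lines.append("%s = alloca i64" % slot)
        self.lines.append("store i64 %s, i64* %s" % (value, self.slots[name]))

    def read(self, name):
        # Each read produces a brand-new register, keeping SSA form intact.
        self.temp_count += 1
        register = "%%t%d" % self.temp_count
        self.lines.append("%s = load i64, i64* %s" % (register, self.slots[name]))
        return register


table = SymbolTable()
table.assign("x", 1)  # x = 1
table.assign("x", 2)  # x = 2, same slot, second store
current = table.read("x")
```

A real symbol table would also carry the type, scope, and lifetime metadata mentioned above; this one tracks only the slot name.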
24:34 Michael Kennedy: Yeah yeah yeah, that's pretty interesting you can only assign to them once. Pretty interesting. And then memory management, how did that work actually?
24:43 Anna Herlihy: Because everything was on the stack, for the most part, things basically just took care of themselves. I didn't have to write a garbage collector, thankfully, but there was one particular instance that was really awkward. So if you are making a function call and you want to return something that is more than, that can't fit into a register. So say for example you have an array. Or you have a function call that populates an array, and then you want it to return the array to the original caller. The problem there is that you can no longer save that on the stack, because as anyone who's ever programmed in C or C++ knows it'll go out of scope. So the solution there is either, oh do I have to keep track of scopes now? Do I actually have to write a garbage collector? Do I have to reinvent memory management in Python myself, in two months? And the answer to that was, okay, no that doesn't make sense either. So what ended up happening is I would just move the data that you put in your array onto the heap temporarily, and then pass back a pointer to it. And then either copy it into the stack again and free it, or just chalk that up to a memory leak and let the stuff that you return from a function just call it a memory leak. So that's sort of a lose-lose situation. It was definitely one of those things that if I had had more time I would've liked to dig into. That, and dynamically reassigning variables are the two parts of the compiler I feel like are unfinished.
26:25 Michael Kennedy: Sure, okay. Yeah that's a big challenge in adding in your own garbage collector. That sounds like a lot of work.
26:31 Anna Herlihy: Sounds like it could be really fun, but it sounds like a different thesis.
26:35 Michael Kennedy: Exactly yeah. It's not my problem right now. Okay, interesting. So how was the performance, say one of the options was I could write my code in C++, and I could give it to Tupleware, or I could write it in Python and give it to Tupleware. What was the trade-off there in terms of performance? Was it huge or pretty close?
26:52 Anna Herlihy: So in terms of performance I actually do have the numbers that I ran that I can dig up, but the bottom line was that the LLVM compiler itself was much faster if you take C++ and you compile it down to LLVM IR using a compiler, that I think is written in LLVM IR itself, or at least in C or C++, compared to a compiler that is written in pure Python that takes in Python code, builds the syntax tree, does the parsing, all the semantic error analysis, that kind of thing. It takes longer, but it doesn't take an order of magnitude longer, and compared to the cost of actually analyzing your workflows, deploying to your distributed cluster, it ultimately doesn't really matter. So if you are running ML algorithms on huge datasets, having the compilation of your algorithm, which only happens once, if it takes one second or if it takes five seconds, it doesn't really matter because we're talking about hours and hours of work. So it's definitely slower, but for what the project needed it wasn't prohibitively slower.
28:08 Michael Kennedy: Yeah yeah, sure. And for the execution speed, were the algorithms run about the same or was it really different?
28:14 Anna Herlihy: So the algorithms themselves, once they got down to LLVM IR, pretty much ran the same. There were a couple interesting cases there. So what actually happens when the code gets put into LLVM IR, and is then shipped to the other half of the distributed system where it gets actually compiled, analyzed, and run, is that is all handled by LLVM and C++. There's no Python involved in that. And so the optimization passes that the LLVM IR compilers actually do are incredibly powerful. So the reason that there's not a lot of optimization happening in the compiler itself, I mean the compiler from Python to LLVM IR, is that it's pretty much gonna get squashed no matter what with the LLVM passes themselves. So that was a huge benefit because I didn't really have to sweat optimizations, which is another huge part of compiler writing.
29:12 Michael Kennedy: Yeah that's awesome, just let LLVM do it, do its analysis on the intermediate representation.
29:17 Anna Herlihy: Yeah I mean it's a huge selling point of LLVM IR. But there is some interesting stuff about optimizing function calls, and basically if you have a recursive call it becomes a lot more difficult, I basically discovered that the LLVM IR compiler is not as good at unrolling these recursive calls as it is when you give it Python code because there's sort of, if you're doing--
29:45 Michael Kennedy: It just basically doesn't like the recursion so much huh?
29:48 Anna Herlihy: Yeah. There's some things that you have to do in Python that you don't have to do in C++ because it's closer to the end result that ends up tripping up the optimizer. And so for an algorithm that involves recursion, it will actually perform slower.
30:07 Michael Kennedy: Okay, interesting. Yeah so this sounds like a really cool project if you have something super focused like Tupleware where you can take a really small subset of Python and execute it against that system. There's a bunch of different implementations or runtimes out there. So we've got things like IronPython and Jython that try to take a different take on Python. There's PyPy, there's Pyjion, there's Cython. And Numba. It sounds like you're much closer to something like Numba with this project than you would be, say, with Cython.
30:42 Anna Herlihy: Yeah, I think that a lot of the line in the sand that gets drawn between projects is everybody is trying to get Python code to run really really fast. But the way that LLVM IR actually factors into it can vary a lot between projects. So the goal of Numba is to run your Python code super fast. LLVM IR is just one step in what is, I think, a six-step process. And there's not actually, there wasn't actually a way to extract the LLVM IR directly from Numba, which has changed now. But when I initially wrote it they didn't have that ability. So I would've had to basically go in and pick and choose bits from their code and then move it into a separate project because there is no elegant way to really pull it out. Another huge difference is that a lot of these compilers are JITs, which work great for what they're trying to do and generally work faster, but the thing about JITs is that they're lazy. And if you have a lazy compiler, it won't actually compile anything unless it's run. But for Tupleware we're completely unconcerned with running the code. We just want to compile it, and then we want to take that compiled code and pass it off to somebody else.
32:07 Michael Kennedy: I see, so that's basically part of the mechanism for deploying to these distributed clusters, is you gotta give it the executable code, I see. Sort of ahead-of-time JIT compilation would be as close as you could get, or something.
32:22 Anna Herlihy: I did some performance comparisons between Numba and PyLLVM, but the problem there is that there's no way to actually run code using my compiler, but you have to run the code in order to compile stuff with Numba. So if you're doing benchmarking, the actual cost of the algorithm itself, it doesn't negate the data but it makes it a pretty big asterisk. Like, by the way, we also had to run algorithms.
32:52 Michael Kennedy: Sure, so it's hard to compare apples to apples. I guess maybe we should kinda wrap it up on PyLLVM and talk about your MongoDB stuff. But what's the, two quick questions. Is this a Python 2 or Python 3 project, or both?
33:09 Anna Herlihy: This is Python 2.7 is what I wrote it in.
33:13 Michael Kennedy: Yeah sure, okay cool. And what's the future for this project? Do you know if anyone's picking it up? Or people out there listening, if it sounds interesting, you could pick it up. It's on GitHub right?
33:23 Anna Herlihy: Yes, it is in GitHub. I would really recommend people to look at the project and to contribute, but that is also from a curiosity/selfish interest. I think that if you are actually trying to get code from Python to LLVM IR, there are a lot of design decisions that were made in the interest of this specific project, and so if it matches your use case, that's excellent. But if it doesn't match your use case, because there are so many variables, I recommend just writing one, because it's probably one of my favorite projects I've ever worked on, just because it's such a neat problem. Like a very well-defined and very satisfying problem to solve.
34:04 Michael Kennedy: Nice, do you feel like you understand how a lot of these compilers and execution fits together better now?
34:11 Anna Herlihy: Yeah I think it's probably the best learning project anyone could have, is to actually understand what goes on under the hood of the language they use. It also made me much much better at writing optimizable code, once I knew how compiler optimizations work.
34:26 Michael Kennedy: Yeah okay, very cool. All right so let's talk about what you're up to these days, at MongoDB. And you said you'd worked on PyMongo, and just for everyone listening, that's the primary driver, the primary way to speak to MongoDB, is to pip install PyMongo, and import it, and then you just start talking. Basically the data exchange is dictionaries, right? You write a prototypical query sort of thing into the dictionary and you get back rows which are documents in the form of dictionaries, right?
34:55 Anna Herlihy: Yes, that's correct. So you don't actually have to get dictionaries anymore. For a long time PyMongo would just automatically read your data into dictionaries, but now you can actually get raw BSON out of the driver, and that opens up a lot of doors for what you can do with it.
35:10 Michael Kennedy: Right, so BSON is binary JSON, which is the actual in-memory, on-the-wire representation that you get talking to MongoDB. You had a cool talk about something called Monary, which is kind of getting superseded by the project that you're working on now, but you had some interesting performance numbers about getting dictionaries back, or just in terms of analysis in general, forget the database for a minute. Working with dictionaries versus working with lists versus something like NumPy, right?
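To make "binary JSON" concrete, here is a minimal sketch of the BSON byte layout using only the standard library. It handles a single BSON type (64-bit doubles) and the function names `encode_double_doc` and `decode_double_doc` are hypothetical; in practice the `bson` package that ships with PyMongo does this work, and this is not its implementation.

```python
import struct


def encode_double_doc(fields):
    """Encode {name: float} as a BSON document containing only doubles."""
    body = b""
    for name, value in fields.items():
        # element = type byte (0x01 = double) + NUL-terminated name
        #           + 8-byte little-endian double
        body += b"\x01" + name.encode("utf-8") + b"\x00" + struct.pack("<d", value)
    # document = int32 total length + elements + trailing NUL byte
    total = 4 + len(body) + 1
    return struct.pack("<i", total) + body + b"\x00"


def decode_double_doc(raw):
    """Decode a doubles-only BSON document back into a dict."""
    (total,) = struct.unpack_from("<i", raw, 0)
    if total != len(raw) or raw[-1] != 0:
        raise ValueError("malformed BSON document")
    out, pos = {}, 4
    while pos < total - 1:
        if raw[pos] != 0x01:
            raise ValueError("only BSON doubles are handled in this sketch")
        end = raw.index(b"\x00", pos + 1)      # end of the field name
        name = raw[pos + 1:end].decode("utf-8")
        (out[name],) = struct.unpack_from("<d", raw, end + 1)
        pos = end + 1 + 8                      # skip past the 8-byte value
    return out


raw = encode_double_doc({"x": 1.5, "y": -2.0})
decoded = decode_double_doc(raw)
```

The values already sit in the buffer as fixed-width little-endian numbers, which is why reading them straight into arrays can avoid building a Python dictionary per document.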
35:42 Anna Herlihy: Yes, so that was pretty enlightening for me as a relatively new Python programmer, to realize that Python dictionaries, which I considered kind of the most canonical, the most basic way of storing data in Python, were actually pretty slow. And compared to ndarrays, which are C-style arrays that come with the NumPy package, or even just something like a list, they are significantly slower.
36:11 Michael Kennedy: Yeah you had some cool numbers. You said something like, for a certain algorithm, working with a bunch of Python dictionaries you could do like 12 million a second.
36:18 Anna Herlihy: Yes.
36:18 Michael Kennedy: With lists it was close to 10 times as much, 120 million a second. And with NumPy you could do 500 million a second.
36:25 Anna Herlihy: Yes and that's just a really simple, take a bunch of data that's in this form and just add them up together, or make pairs, or something like that.
36:35 Michael Kennedy: Right right, basically read through it or something to that effect.
36:38 Anna Herlihy: Exactly.
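A comparison of this kind can be sketched as follows, assuming NumPy is installed. The workload (summing one numeric field), the size `N`, and the function names are assumptions chosen for illustration; this is not the benchmark from the talk, and the throughput it prints will vary by machine and will not match the figures quoted in the conversation.

```python
import timeit

import numpy as np

N = 100_000

docs = [{"x": float(i)} for i in range(N)]   # one dict per "document"
values = [float(i) for i in range(N)]        # plain list of floats
array = np.arange(N, dtype=np.float64)       # contiguous C-style array


def sum_dicts():
    # Every element costs a dict lookup plus a float object access.
    return sum(doc["x"] for doc in docs)


def sum_list():
    # No dict lookups, but still one Python float object per element.
    return sum(values)


def sum_ndarray():
    # The loop runs in C over raw 8-byte doubles.
    return float(array.sum())


for fn in (sum_dicts, sum_list, sum_ndarray):
    seconds = timeit.timeit(fn, number=20)
    print(f"{fn.__name__:12s} {20 * N / seconds / 1e6:8.1f} million values/s")
```

All three functions compute the same total; only the container differs, so any gap in the printed rates comes from how the data is stored.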
36:39 Michael Kennedy: So you said, look, the BSON capability notwithstanding, basically the way that PyMongo works is you do a query, you get a bunch of dictionaries back in Python, but if you're doing data science or something computational, you probably want to work in, say, NumPy, right? But the workflow would be, I make a query in my Python layer, it goes to Mongo, that comes back across the wire, it gets turned into Python dictionaries and then serialized back down at the C layer into NumPy, and that was a problem, right?
37:10 Anna Herlihy: Yes, that's a huge problem. Because you're taking something that is pretty fast, namely MongoDB, and you have something on the other end of your line which is also really fast, NumPy. And then you have this bottleneck which is Python dictionaries. And it's kind of a shame that there hasn't been more stuff related to this until recently, because MongoDB is an incredibly powerful database, and it's very easy to use. NumPy also has a selling point that it's extremely powerful and pretty easy to use. So you'd think that for a lot of data scientists or for people who don't love coding in MATLAB, they would want to put these two tools together. But MongoDB hasn't been used in this context super often because it's a limitation that in order to get the data out, you have to put it through this kind of clunky data structure before getting it back into your super fast arrays.
38:01 Michael Kennedy: Right, so you said basically, with some tests you did, going through PyMongo you could read from MongoDB through Python into NumPy at about 150,000 documents a second. And there's this other project called Monary that you were talking about when I saw this presentation, where it basically says, let's stay down in the C layer the entire time, and even though we're calling it from a Python app it sort of connects it directly to NumPy, right?
38:31 Anna Herlihy: Exactly, so now we are taking this raw BSON format, and basically moving it directly into C-style arrays. So you have two things that are in natural machine order, and you no longer have to take it out of that.
38:47 Michael Kennedy: Nice, and that's something like 10 times faster, right? At least it was.
38:49 Anna Herlihy: Yeah. It was 1.7 million reads per second compared to 150,000.
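The idea of keeping data in machine order rather than routing it through dictionaries can be sketched with `numpy.frombuffer`, assuming NumPy is installed. The fixed-width records below are a hypothetical stand-in: real BSON also carries type bytes and field names, so it cannot simply be viewed in place like this, and this is not how Monary or bson-numpy are implemented. The field names `fare` and `distance` are invented for the example.

```python
import struct

import numpy as np

# Hypothetical fixed-width records: (fare, distance) as little-endian doubles.
records = [(12.5, 2.1), (7.0, 0.9), (30.25, 8.4)]
payload = b"".join(struct.pack("<dd", fare, dist) for fare, dist in records)

dtype = np.dtype([("fare", "<f8"), ("distance", "<f8")])

# Slow path: build one Python dict per record, then copy into an array.
as_dicts = [
    {"fare": fare, "distance": dist}
    for fare, dist in struct.iter_unpack("<dd", payload)
]
via_dicts = np.array([(d["fare"], d["distance"]) for d in as_dicts], dtype=dtype)

# Fast path: reinterpret the same bytes as a structured array,
# with no per-record Python objects created along the way.
direct = np.frombuffer(payload, dtype=dtype)
```

Both paths end with the same array contents; the second one never materializes the intermediate dictionaries, which is the bottleneck being discussed.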
38:57 Michael Kennedy: That's a big difference. You might--
38:58 Anna Herlihy: Yeah.
38:59 Michael Kennedy: That's like a difference between going to get a cup of coffee while your algorithm runs on your data versus going to lunch.
39:07 Anna Herlihy: Yes, exactly, and so it's really exciting to me that we can leverage these two really awesome technologies. And Monary itself is actually not under active development anymore. Monary's a project that actually was community based. It was started by somebody who did not work at MongoDB, David Beach, who basically wrote it because he was sick of having to lose so much time to Python dictionaries. And so now we are writing a codec that is 100% in C that takes this raw BSON, which before I think maybe a year or two ago you couldn't actually get raw BSON from PyMongo, but now you can. So now you can take this raw BSON and you can have this super lightweight package that just converts it directly into NumPy, and you are good to go.
39:59 Michael Kennedy: That's really cool. So what's this thing called now?
40:01 Anna Herlihy: So this is called bson-numpy, which is a deeply uncreative name, and I am definitely open to alternatives. But it is descriptive. Basically it is in beta, maybe closer to alpha, because we stopped working on it about two weeks ago, we actually released about two weeks ago. So it's the project of both me and A. Jesse Jiryu Davis, who is my coworker who you spoke to twice I think.
40:29 Michael Kennedy: Yeah yeah, Jesse's a friend of the show. So hello Jesse, and nice work on this project as well.
40:33 Anna Herlihy: We are both really eager to hear people's feedback. We really want MongoDB to become more useful for data scientists, and we think that this is really the tool that is gonna make it happen.
40:46 Michael Kennedy: Yeah it seems like a really great idea. Just we'll skip the serialization where it's slow and just float the data straight into NumPy. This portion of Talk Python To Me is brought to you by Hired. Hired is the platform for top Python developer jobs. Create your profile and instantly get access to 3500 companies who will work to compete with you. Take it from one of Hired's users who recently got a job and said, "I had my first offer on Thursday after going live on Monday, and I ended up getting eight offers in total. I've worked with recruiters in the past but they've always been pretty hit and miss. I tried LinkedIn, but I found Hired to be the best. I really like knowing the salary up front. Privacy was also a huge seller for me." Sounds awesome, doesn't it? Well wait until you hear about the signing bonus. Everyone who accepts a job from Hired gets $1,000 signing bonus, and as Talk Python listeners it gets way sweeter. Use the link hired.com/talkpythontome and Hired will double the signing bonus to $2,000. Opportunity's knocking. Visit hired.com/talkpythontome, and answer the door. I guess people can just go to MongoDB and find it? Like how do they learn more about this project?
42:07 Anna Herlihy: Because it's in its very very first iteration, we released it on bson-numpy.readthedocs.io, and you can find it on Read the Docs. Basically neither myself nor Jesse are data scientists, but we're both super familiar with Python and we're super familiar with MongoDB. But that does leave a lot of open questions that I wouldn't necessarily know how to answer. So I'm not sure what the most common data type for NumPy would be, I mean there are all these really cool complex data types, but for all I know, nobody ever uses 'em. So what I really want is to be able to reach out to the community and have people tell me what they need. Because a lot of the great features of MongoDB pretty much came out because somebody had a need and they asked for it, so we wrote them, and then it became a huge selling point. So what I want is just to hear what people who would use these technologies really want.
43:07 Michael Kennedy: Yeah absolutely. So if people can check it out, it's bson-numpy on PyPI, and also I'll put the link in the show notes. So check it out and give both of you guys feedback on, like hey, this looks really cool but it doesn't do whatever, right?
43:22 Anna Herlihy: Exactly.
43:22 Michael Kennedy: Nice. You also had, in your presentation about Monary, still kind of the same type of question and answer type thing, the same kind of analysis you can do with bson-numpy. You had an interesting analysis you did of taxicabs in Times Square, right? You wanna tell people that story?
43:41 Anna Herlihy: Sure. When I first joined MongoDB I was working out of the headquarters, which is in Times Square in New York City. As a side note, right now I work out of Stockholm, Sweden, which I prefer quite a bit more to fighting my way through the crowds. But I was struck one morning when I was trying to think of sort of a nice dataset to sort of show examples of how much faster Monary would be than PyMongo. I just was so sick of fighting through the crowds, and fighting to get to the train station, and I kind of figured, like, wow, what is--
44:16 Michael Kennedy: Why is there a Batman in my way? What is going on in this place?
44:20 Anna Herlihy: Exactly, like how can I get out of here as fast as possible? Where can I go, how can I do it? So I also had access to the taxi data for all of New York City, which is freely available online and it's a really interesting dataset. So I basically just took that dataset and I looked at all the rides that both started and ended in Times Square. Because I kinda just wanted to know, where are people going? What's the rush? Where are you coming from? Why are you coming here? Of all the places in New York City to be, why Times Square?
44:55 Michael Kennedy: Okay yeah, that's really cool. And you found, you had a really great bunch of visualizations that came out of MATLAB. And I'll put the link to that video up there because you have some really great maps and bars living on flat maps and all kinds of stuff. I thought that was great.
45:11 Anna Herlihy: Yeah, it's matplotlib.
45:12 Michael Kennedy: Oh yeah sorry, I keep, I don't know why I said MATLAB. MATLAB 'cause we were just talking about, yeah matplotlib is what I had in mind.
45:18 Anna Herlihy: But yeah, no, I mean. The point of those diagrams were just basically to show off how cool NumPy can be and how powerful matplotlib can be for creating really beautiful visualizations. And I know that there are a lot of different visualization tools out there, but in terms of selling Python and providing a way for data scientists to use both MongoDB and NumPy, I felt like it was a pretty good sell.
45:45 Michael Kennedy: Yeah and it's cool, you also were able to leverage, I think that the interesting linkage there is you were able to put a huge amount of data in Mongo, and apply a geospatial index to it, and then use that in your queries but then also do the analysis with NumPy, right?
46:02 Anna Herlihy: Exactly, so you are taking advantage of these geoqueries, which are really simple, really easy to use, but also very fast. And you're taking advantage of Python, NumPy specifically, where you have really really fast analysis, but a lot of these algorithms are actually written for you. So you can use a lot of these scientific Python packages that have done what I consider the hard work, for you, completely.
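A query of the kind described, rides that both start and end inside a small circle, could be built roughly as follows. The field names `pickup` and `dropoff`, the helper names, the 400 m radius, and the coordinates are all assumptions for illustration; the actual schema and queries used in the talk are not given in this conversation. The sketch only constructs the filter document, so it runs without a MongoDB server.

```python
EARTH_RADIUS_M = 6_378_100  # equatorial radius, used to convert meters to radians


def within_circle(lon, lat, radius_m):
    """Build a $geoWithin/$centerSphere clause; the radius is given in radians."""
    return {"$geoWithin": {"$centerSphere": [[lon, lat], radius_m / EARTH_RADIUS_M]}}


def rides_inside(lon, lat, radius_m, start_field="pickup", end_field="dropoff"):
    """Filter for rides that both start and end inside the circle."""
    circle = within_circle(lon, lat, radius_m)
    return {start_field: circle, end_field: circle}


# Approximate Times Square coordinates, longitude first.
query = rides_inside(-73.9855, 40.7580, radius_m=400)

# With PyMongo and a running server, this could be used roughly as:
#   rides.create_index([("pickup", pymongo.GEOSPHERE)])
#   cursor = rides.find(query)
```

The database does the geographic filtering, and only the matching rides need to be pulled into NumPy for analysis.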
46:26 Michael Kennedy: Yeah. It seems like, just take the tools, click them together, and you get some really great analysis. That's awesome. So I'm looking forward to your presentation where you talk about your bson-numpy version.
46:40 Anna Herlihy: Yeah, well I'll have to find something better, some better visualizations. You know I don't have as many things to complain about now that I live in Stockholm, so I have to.
46:51 Michael Kennedy: Yeah, you can do something with snow or winter or something, who knows.
46:54 Anna Herlihy: Yeah, the darkness.
46:54 Michael Kennedy: Yes exactly. Here's the light analysis. All right, cool. So I guess let's go ahead and leave it there. That sounds like a great project, and so if you're out there working with MongoDB and ultimately your data ends up in NumPy, maybe skip the dictionaries is the message, if you can.
47:11 Anna Herlihy: Yeah exactly, and especially for people who use scientific Python and have never considered MongoDB a viable option, I think that is going to change. And I think that we are going to be able to write programs that take advantage of all this cool work that has already been done.
47:27 Michael Kennedy: Yeah and I think having it officially part of the product, the library, gives it a little more, probably gives people more confidence to build upon it.
47:35 Anna Herlihy: Yeah exactly, and I am personally just so excited because this is the first project that I've really had ownership over. I contributed a lot to PyMongo, but it wasn't my project, while I feel very proud of what I have done for this particular project.
47:50 Michael Kennedy: It sounds really really cool. So nice work. Let me ask you a couple questions before we get outta here. As I always do. First one, if you're gonna write some Python code, maybe not C code, if you're gonna write some Python code, what editor do you open up?
48:03 Anna Herlihy: So I use PyCharm. And it's funny that you mention C code, because I got a perpetual license for JetBrains, and I love that I can write Node, I can write Ruby, I can write Python, I can write C. And all my shortcuts are the same. I was a Vim user before Jesse convinced me to switch over to PyCharm, and now I'm completely converted. I can have all my Vim shortcuts and I can also have all the power of a really great IDE. So I really recommend the JetBrains libraries because it's really nice to not have to switch between IDEs, between CLion, PyCharm, RubyMine, WebStorm. It's the same environment, so I feel like I don't have to lose any time to relearning stuff.
48:50 Michael Kennedy: Yeah that's really cool. Jesse's a big fan of PyCharm, and so am I. Just so people who are not familiar, basically there's this IntelliJ platform, which is kind of the IDE and then they plug in the language-specific stuff. So when you say it's all the same, it's all kind of the same base, and like you said, it behaves the same. It's pretty sweet. Okay cool, and we just this week passed 100,000 packages on PyPI. So that's a big milestone, hooray.
49:17 Anna Herlihy: Wow.
49:17 Michael Kennedy: Yeah, no kidding. And so there's a ton of them out there. Do you have one that maybe you think people don't necessarily know about, like hey, you should check this out? Bson-numpy maybe?
49:29 Anna Herlihy: Yeah I was gonna say, is it too shamelessly self-promoting to say that you should download bson-numpy? That's the best package on PyPI.
49:37 Michael Kennedy: It's one of the newer ones.
49:38 Anna Herlihy: That's a really great question. I guess I'll just go with that because I'm very utilitarian with my packages. I pretty much just keep 'em to a minimum.
49:47 Michael Kennedy: Okay yeah, very cool. All right yeah, it's a great package and you guys need some people to try it out and actually do real data science with it, so that's great.
49:55 Anna Herlihy: Exactly.
49:56 Michael Kennedy: All right, so final call to action. What do you want people to check out or do?
50:01 Anna Herlihy: I would like people to check out bson-numpy, because I think it's a really cool project. I think that if anyone is interested in a really neat piece of code, they should look through the PyLLVM work, especially if you are either new to compilers, maybe new to Python, new to LLVM IR. It's a very nice sort of pet project, and it's very self-contained, and it is, in my opinion, pretty well-documented. There's an entire 20-page thesis I wrote about it, so if you have questions they're probably answered. I'm also always happy to answer by email, Twitter, anything. So those are my two things I would love to see people do.
50:38 Michael Kennedy: All right, that sounds great. Anna, thank you so much for being on the show, it's been really interesting to talk to you.
50:42 Anna Herlihy: Yeah of course, thank you so much again for having me.
50:44 Michael Kennedy: Yeah thanks so much. Bye.
50:46 Anna Herlihy: Bye.
50:48 Michael Kennedy: This has been another episode of Talk Python To Me. Today's guest has been Anna Herlihy, and this episode has been sponsored by Hired and Talk Python Training. Are you or a colleague trying to learn Python? Have you tried books and videos that just left you bored by covering topics point by point? Well check out my online course Python Jumpstart By Building 10 Apps at talkpython.fm/course to experience a more engaging way to learn Python. And if you're looking for something a little more advanced, try my Write Pythonic Code course at talkpython.fm/pythonic. Hired wants to help you find your next big thing. Visit hired.com/talkpythontome to get five or more offers with salary and equity presented right in front, and a special listeners signing bonus of $2,000. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, Google Play feed at /play, and direct RSS feed at /rss on talkpython.fm. Our theme music is Developers Developers Developers by Cory Smith, who goes by Smixx. Cory just recently started selling his tracks on iTunes, so I recommend you check it out at talkpython.fm/music. You can browse his tracks he has for sale on iTunes and listen to the full-length version of the theme song. This is your host, Michael Kennedy. Thanks so much for listening, I really appreciate it. Smixx? Let's get outta here.