#28: Making Python Fast: Profiling Python Code Transcript
00:00 Is that Python code of yours running a little slow? Are you thinking of rewriting the algorithm or maybe even in another language? Well, before you do, you'll want to listen to what Davis Silverman has to say about speeding up Python code using profiling. This is show number 28, recorded Wednesday, September 16th 2015.
00:00 [music intro]
00:00 Welcome to Talk Python to Me- a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the show on twitter via @talkpython.
00:00 This episode is brought to you by Hired and Codeship. Thank them for supporting the show on twitter via @hired_hq and @codeship
00:00 Let me introduce Davis. Davis Silverman is currently a student at the University of Maryland, working part time at The HumanGeo Group. He writes mostly Python, with an emphasis on performant, Pythonic code.
01:26 Davis, welcome to the show.
01:27 Hello.
01:27 Thanks for being here. I'm really excited to talk to you about how you made some super slow Python code much, much faster using profiling. You work at a place called the Human Geo Group. You guys do a lot of Python there, and we are going to spend a lot of time talking about how you took some of your social media sort of data real time analytics type stuff and built that in Python and improved it using profiling, but let's start at the beginning- what's your story, how did you get into programming in Python?
01:57 I originally, when I was a kid obviously, I grew up and I had internet, I was lucky. And I was into computers, and my parents were very happy with me building and fixing computers for them, obviously. So by the time high school came around, I took a programming course and it was Python, and I fell in love with it immediately, and I've been programming in Python ever since, since sophomore year of high school. So, it's been quite a few years now.
02:32 I think all of us programmers unwittingly become like tech support for our families and what not, right?
02:38 Oh yeah, my entire family, I'm that guy.
02:42 Yeah, I try to not be that guy but I end up being that guy a lot. So, you took Python in high school. That's pretty cool that they offered Python there. Did they have other programming classes as well?
02:55 Yeah, so my high school- I live in the DC Metro Area, I live in Montgomery County, it's a very nice county and the schools are very good, and luckily the intro programming course was taught by a very intelligent teacher. So she taught us Python, and then the courses after that were Java courses, the college level advanced placement Java and then a data structures class after that. So, we got to learn a lot about the fundamentals of computer science in those classes.
03:24 Yeah, that's really cool, I think I got to take Basic when I was in high school, and that was about it. It was a lot-
03:30 I wrote a Basic interpreter, but it wasn't very good.
03:34 Cool, so before we get into the programming stuff, maybe you could just tell me what is the HumanGeo Group, what do you guys do?
03:41 Yeah, so the HumanGeo Group is- we are a small government contractor, and we deal mostly in government contracts, but we have a few commercial projects and ventures that I was working on over the summer that we'll be talking about. We're a great small company in Arlington, Virginia, and we actually just won an award for one of the best places to work in the DC Metro Area for millennials and for younger people.
04:06 If you go to thehumangeo.com, you guys have a really awesome web page. I really like how that looks, you know, just, it's like bam- here we are, we are about data, and it just has kind of this live page. You know, so many companies want to get their CEO statement and all the marketing stuff and the news; you guys are like, look, it's about data. That's cool.
04:28 Yeah. There was a recent- I don't remember how recent, but there was a web site rewrite, and this one guy decided, he took the time, he's like, "I really want to do something, you know, show some geographic stuff," so he used Leaflet JS, which we do a lot of open source with on our GitHub page, and he made it very beautiful, and there are even icons of all the people at HumanGeo. And I think it's much better than any of those, like you said, generic contractor sites, it's much better and much more energetic.
04:59 It looks to me like you do a lot with social media and like sentiment analysis and tying that to location to Geo part right, and all that kind of stuff. What's the story there with, what kind of stuff do you guys do?
05:13 Yeah, so what I was working on was- one of our customers is a billion dollar entertainment company, and I mean you've probably heard of them, I think we talk about them on our site. And what we do is we analyze various social media sites like Reddit and Twitter and YouTube, and we gather geographic data if available, and we gather sentiment data using specific algorithms from things like the Natural Language Toolkit, which is an amazing Python package; then, we show it to the user in a very nice website that we have created.
05:51 So you said you work for this entertainment company as well as like a government contractor. What is the government interested in with all this data- the US government that is, right, for the international listeners?
06:01 Yeah, it's definitely the United States government. We do less social media analysis for the government- we do some, but it's not what people think the NSA does, definitely. I think it's just like anything a company would want: you search on something and then it would show, oh, there are Twitter users talking about this, in these areas-
06:30 Yeah, I guess the government would want to know that, especially in like emergencies or things like that possibly, right?
06:37 Yeah, we also do some platform stuff, like we create certain platforms for the government that's not necessarily social media stuff.
06:46 Right, sure. So, how much of that is Python and how much of that is other languages?
06:51 So at HumanGeo we do tons of Python; on the back end for some of the government stuff we do Java, which is big in the government obviously. We definitely do a lot of Python, we use it a lot in the pipeline for various tools and things that we use internally and externally at HumanGeo. I know that the project that I was working on was exclusively Python in all parts of the pipeline, for gathering data and for representing data, for the back end server and the front end server. So that was all Python.
07:26 Right. So it's pretty much Python end to end other than, it looks, I don't know specifically the project that you were working on, but it looks like it's very heavy like, D3 fancy Javascript stuff on the front end for the visualization, but other than that, it was more or less Python, right?
07:41 We do use a lot- I mean yeah, we have some amazing Javascript people. They do a lot of really fun looking stuff.
07:49 Yeah, you can tell it's a fun place for like data display, front end development, that's cool. So, Python 2 or Python 3?
07:56 So, we use Python 2, but I was making sure- I mean, when I was working on the code base I was definitely writing Python 3 compatible code, using the proper future imports, and I was testing it in Python 3, and we're probably closer to Python 3 than a lot of companies are, we just haven't expended the time to do it. We probably will in 2018, when Python 2 is near the end.
08:19 That seems like a really smart way to go. Did you happen to profile it under CPython 2 and CPython 3?
08:28 I didn't. It doesn't fully run in CPython 3 right now. I wish I could.
08:34 It would just be really interesting since you spent so much time looking at the performance, if you could compare those. But if it doesn't run-
08:39 That would be interesting, you're right, I wish I could do that.
08:42 I suspect that most people know what profiling is, but there is a whole diverse set of listeners so maybe we could just say really quickly- what is profiling?
08:49 Yeah, so profiling in any language and anything is knowing heuristics about what is running in your program, and for example how many times is a function called, and how long does it take for this section of code to run. And it's simply like a statistical thing. Like you get a profile of your code, you see all the parts of your code that are fast or slow for example.
09:13 You wrote a really interesting blog post, and we'll talk about that in just a second, and I think like all good profiling articles or topics, you sort of point out what I consider to be the first rule of profiling, or more like the first rule of optimization, which profiling is kind of the tool to get you there. Which is to not prematurely optimize your stuff, right.
09:36 Yeah, definitely.
09:37 Yeah, you know, I spent a lot of time thinking about how programs run and why they're slow or why they're fast, or worrying about this little part or that little part, and you know, most of the time it just doesn't matter. Or if it's slow, it's slow for reasons that were unexpected. Right?
09:53 Yeah, definitely. I always make sure that there is a legitimate problem to be solved, before spending time doing something like profiling or optimizing the code base.
10:03 Definitely. So let's talk about your blog post. Because that went around on Twitter in a pretty big way, and in the Python newsletters and so on, and I read it and thought, "oh, this is really cool, I should have Davis on, we should talk about this." So, on the HumanGeo blog, you wrote an article called "Profiling in Python". Right? What motivated you to write that?
10:25 Yeah, so when I was working on this code, basically my co-workers and my boss said, "we have this pipeline for gathering data that the customers give us," and we run it at about 2 am every night. The problem was it took ten hours to run this piece of code; it was doing very heavy text processing, which I'll talk about more later, I guess. It was doing a lot of text processing which ended up taking ten hours, and they looked at it and they said, "you know, it's updating anew every day and the work day starts at like 9, so we should probably try to fix this and get it to work faster. Davis, you should totally look at this as a great first project."
11:11 Here's your new project, right.
11:12 Yeah. It was like, day one, ok make this thing ten times faster, go.
11:17 Yeah, oh is that all you want me to do?
11:20 Yeah, so I wrote this, and I did what any guy would do; the first thing I did was profile it, which is the first thing you should do, to make sure that there is actually a hot spot. And I ran through the process that I talk about in the post: what I ran through, the tools I used. And I realized that it wasn't a simple thing for me to do all this, I don't do this often. And I figured that a lot of people, like you said, maybe don't know about profiling and haven't done this. So I said, "I'll write a blog post about this, and hopefully somebody else won't have to Google like 20 different posts and read all of them to come up with one coherent idea of how to do this."
12:04 Right, maybe you could just lay it out like, "these are the five steps you should probably start with." Profiling is absolutely an art more than it is engineering. But at least having the steps to follow is super helpful. You started out by talking about the CPython distribution, and I think it would be interesting to talk a little bit about potential alternative implementations as well, because you did talk about that a bit. You said there are basically two profilers that come with CPython: profile and cProfile.
12:38 Yeah. So, the two profilers that come with Python, as you said, profile and cProfile, both have the same exact interface and collect the same heuristics, but the idea behind profile is that it's written in Python and is portable across Python implementations, hopefully. And cProfile is written in C, and is pretty specific to a C interpreter such as CPython, or even PyPy, because PyPy is very interoperable.
13:09 Right, it does have that interoperability layer. So maybe if we were doing like Jython, or we were doing IronPython or Pyston or something, we couldn't use cProfile?
13:21 Yeah, I wouldn't say that you can't.
13:22 I'm just guessing. I don't really- I didn't test it.
13:26 I would say that for Jython and IronPython you could use the standard Java or .NET profilers, and use those instead. I'm pretty sure those will work just fine, because they are known to work great with their respective ecosystems.
13:41 Do you know whether there is a significant performance difference- let me take a step back. It seems to me like when I've done profiling in the past, that it's a little bit like the Heisenberg uncertainty principle. If you observe a thing, by the fact that you've observed it, you've altered it. Right? You know, when you run your code under the profiler, it might not behave in the same exact timing and so on, as if you were running natively. But you can still get a pretty good idea, usually, right. So, is there a difference across the cProfile versus profile in that regard?
14:17 Oh yes, definitely. Profile is much slower and it has much higher overhead than cProfile. Because it has to do a lot of different- I mean, Python exposes CPython internals in some Python modules, but they are slower than just using straight C and getting straight to it. So if you are using CPython, I would recommend using cProfile instead, because it has much lower overhead and it gives you much better numbers that you can work with, that make more sense.
14:53 Ok. Awesome. So, how do I get started? Suppose that I have some code that is slow, how do I run it in cProfile?
15:01 Yeah, so the first thing that I would do is, I mean, in the blog post which I'm pulling up right now just to see-
15:07 And I'll be sure to link to that in the show notes as well, so that people can just jump to the show notes and pull it up.
15:14 Yeah, so one of my favorite things about cProfile is that you can call it using the python -m syntax to run it as a module, and it will print out to standard out your profile when it's done, and it's super easy. I mean, all you need to do is just give it your main Python file and it'll run it, and then at the end of the run it will give you the profile. It's super simple, and one of the reasons why the blog post was so easy to write.
15:43 Yeah, that's cool. So by default, it looks like it gives you basically the total time spent in each method, for all of the methods, looking at the call stack and stuff, the number of times it was called, stuff like that. The cumulative time, the individual time per call and so on.
16:00 It gives you the default like you said, you are correct, and it's also really easy, you can give it a sorting argument. That way you can call it and see, you know, how many times this was called- like if it was called 60,000 times in a ten minute run, it's probably a problem. And something could be called only twice, but take an hour to run. That would be very scary. In which case you definitely want to sort it both ways, you want to sort it every way, so you can see, just in case you are missing something important.
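The sorting Davis describes can be sketched with cProfile's Python API (the command-line equivalent is `python -m cProfile -s cumulative yourscript.py`). This is a minimal, hypothetical example; `slow_sum` just stands in for any hot function:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately naive: repeated string concatenation in a loop.
    s = ""
    for i in range(n):
        s += str(i)
    return len(s)

# Collect a profile of one call.
profiler = cProfile.Profile()
profiler.enable()
slow_sum(10_000)
profiler.disable()

# Slice the same data two ways: by cumulative time, then by call count,
# so nothing important hides behind one sort order.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
stats.sort_stats("ncalls").print_stats(5)
report = stream.getvalue()
```

The same report can also be written to a file via the `stream` argument, which helps once the output outgrows a terminal.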
16:33 Right. You want to slice it in a lot of different ways. How many times was it called, what was the maximum individual time, cumulative time, all those types of things. Maybe you are calling a database function, right, and maybe that's just got tons of data.
16:49 Maybe, the database is slow, exactly.
16:59 Yeah, maybe the database is slow. And so that says don't even worry about your Python code, just go make your database faster. Or use Redis for caching or something, right.
16:59 Or yeah, work on your query, maybe you can add a distinct so that this query brings back a much smaller data set.
17:07 Yeah, absolutely, it could just be a smarter query you are doing. So this is all pretty good, the output that you get is pretty nice, but in a real program, with many many functions and so on, that can be pretty hard to understand, right?
17:26 Yes. Definitely, I found- so, when I was working on this, I had the same issue; there were so many lines of code, it was filling my terminal, and I had to save it in an output file, and that was too much work. So I was searching around more and I found PyCallGraph, which is amazing at showing you the flow of your program, and it gives you a great graphical representation of what cProfile is also showing you.
17:51 That's awesome, so it's kind of like a visualiser for the cProfile output, right?
17:57 Yeah, and even colors- the redder it is, the hotter the color, the more times it runs and the longer it runs.
18:04 Yeah, that's awesome. So just pip install PyCallGraph, to get started, right?
18:08 Yeah, super simple. It's one of the best things about pip as a package manager.
18:14 Yeah, I mean, that's part of why Python is awesome, right, pip install whatever, antigravity. So, then you say basically pycallgraph and then graphviz and then- how does that work?
18:30 So, PyCallGraph supports outputting to multiple different file formats; graphviz is simply a- I mean, it's a program that can render .dot files, I don't really understand how to say it out loud. So the first argument for PyCallGraph is the output style, how it is going to be read and the program that's going to read it, and then you give it the double dash, which means what follows is not part of the PyCallGraph options, it's now the Python call that you are calling it with, and those arguments. So it's basically the same as cProfile, but kind of inverted.
19:12 Right. And you can get really simple callGraphs so, just this function calls this function which calls this function, and it took this amount of time, or you can get really quite complex call graphs. Right? Like you can say this module calls this function, which then reach out to this other module, and then they are all interacting in these ways. That's pretty amazing.
19:34 Yeah, it shows exactly what module are you using, I mean like for using regex, it'll show you each part of the regex module, like the regex compiler or you know, the different modules that are using the regex module, and then it will show you how many times each is called, and they're boxed nicely and the image is so easy to look at and you can just zoom in at the exact part that you want, and then look at what calls it and what it calls to see how the program flows, much simpler.
20:03 Yeah, that's really cool. And you know, it's something that just...
20:03 [music]
20:03 This episode is brought to you by Hired. Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.
20:03 Each offer you receive has salary and equity presented right up front and you can view the offers to accept or reject them before you even talk to the company. Typically, candidates receive 5 or more offers in just the first week and there are no obligations, ever.
20:03 Sounds pretty awesome, doesn't it? Well did I mention the signing bonus? Everyone who accepts a job from Hired gets a $2,000 signing bonus. And, as Talk Python listeners, it gets way sweeter! Use the link hired.com/talkpythontome and Hired will double the signing bonus to $4,000!
20:03 Opportunity is knocking, visit hired.com/talkpythontome and answer the call.
20:03 [music]
21:18 Because the color is the hot spot, it's really good for profiling, but even if you weren't doing profiling, it seems like that would be pretty interesting for just understanding new code that you are trying to get your head around.
21:29 Oh yeah, that's definitely true, I've since employed that as a method to look at the control flow of a program-
21:38 Right, how do these methods, how do these modules and all this stuff, how do they relate- like, just run the main method and see what happens, right?
21:45 Exactly. It has become a useful tool of mine that I'll definitely be using in the future; I always have it in my virtualenv nowadays.
21:55 So, we've taken cProfile, we've applied it to our program, we've gotten some textual version of the output that tells us where we're spending our time in various ways, and then we can use PyCallGraph to visualize that, to understand it more quickly. So then what? Like, how do you fix these problems, what are some of the common things you can do?
22:15 Yeah, so as I outlined in the blog post, there is a plethora of methods you can apply, depending on what your profile is showing. For example, if you are spending a lot of time in Python code, then you can definitely look at things like using a different interpreter- for example, an optimizing interpreter like PyPy will definitely make your code run a lot faster, as it translates it to machine code at runtime. Or you could also look at the algorithm that you are using and see if it's, for example, an O(n³) time complexity algorithm; that would be terrible and you might want to fix that.
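As a hedged illustration of the algorithmic point (this is a generic example, not code from the show): a classic case is de-duplication, where an innocent-looking `in` check against a list makes the whole thing quadratic, and a set makes it linear:

```python
def dedupe_quadratic(seq):
    # O(n^2): "item not in out" scans the result list for every element.
    out = []
    for item in seq:
        if item not in out:
            out.append(item)
    return out

def dedupe_linear(seq):
    # O(n): a set makes each membership test effectively O(1).
    seen = set()
    out = []
    for item in seq:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

small = [3, 1, 3, 2, 1]
result = dedupe_linear(small)
```

Both versions return the same answer; only the growth rate differs, which is exactly the kind of thing that stays invisible on small test data.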
22:55 Yeah, those are the kinds of problems that you run into when you have small test data. And you write your code and that seems fine, and then you give it real data and it just dies, right?
23:03 Exactly, that is why they gave me the code that took- I mean, they gave me like 5 GB of data, and they said, "this is the data that we get, like, on a nightly basis," and I said, "Oh my God, this will take all day to run," so I used smaller test pieces, and luckily I used ones big enough that they showed some statistically significant numbers for me.
23:25 Right. It was something that you could actually execute in a reasonable amount of time, through your exploration but not so big-
23:32 I'd rather not have like C++ compile times, you know, but at runtime, because it just kind of sits there while it's processing; I'd rather only do that once a day.
23:43 You mentioned some interesting things like PyPy, is super interesting to me. Can you talk a little more about why you chose not to use that but, you know, in show 21 we had Maciej from the PyPy group-
23:57 Oh yeah, I have it up right now, I'm going to watch it later. I'm so excited.
24:00 Yeah, that was super interesting, and we talked a little bit about optimization there, and why you might choose an alternate interpreter, that was cool. Then there are some other interesting things you can do as well, like you could use namedtuples instead of an actual class, or you could use built in data structures instead of creating your own. Because a lot of times, the built in structures like list, array, and dictionary and so on are implemented deep down in C, and they're much faster.
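A quick sketch of the namedtuple idea; `Point` here is a made-up example, not anything from the code discussed:

```python
from collections import namedtuple

# A namedtuple is a lightweight, immutable alternative to a hand-written
# class: no per-instance __dict__, so instances are smaller and cheaper
# to create, and the field access is implemented in C.
Point = namedtuple("Point", ["x", "y"])

p = Point(x=3, y=4)
distance_sq = p.x ** 2 + p.y ** 2
```

Since a namedtuple is still a tuple, it also compares and unpacks like one, which makes it easy to drop into existing code.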
24:28 Yeah, definitely. One of the best examples of this is that I saw some guy who wrote his own dictionary class and it was a lot slower- and this was not in the HumanGeo code base, just so you know, we have good code at HumanGeo.
24:44 You guys don't show up on the daily wtf?
24:46 Oh please no, no, we're much better than that, this was another place that I saw some code. And yeah, I mean, they have a lot of optimizations- like in the latest Python release, they actually made the dict class an ordered dict, because they basically copied what PyPy did, they did the same thing.
25:07 Yeah.
25:09 So you should always trust the internal stuff.
25:12 Yeah, and that's really true. And if you are going to stick with CPython, as the majority of people do, understanding how the interpreter actually works is also super important. I've talked about it several times on the show, and had Philip Guo on the show; he recorded basically a ten hour graduate course on CPython internals at the University of Rochester and put it online, so I'll definitely try to link to that, check it out, it's show 22-
25:41 I'd love to see that.
25:41 Yeah, I mean, it's, you really understand what's happening inside this C runtime and so then you are like "oh I see why this is expensive". And all those things can help. You know, another thing that you talked about that was interesting was IO, like I give you my example of databases or maybe you are talking about a file system, or you know, a web service call, something like this, right, and you had some good advice on that I thought.
26:04 Yeah, definitely- so basically, because of the CPython GIL, the global interpreter lock, you can't do multiple computations at the same time, because CPython only allows one core to be used at a time-
26:24 Right, basically computational parallelism is not such a thing, in Python you've got to drop down to C or fork the processes or something like that, right?
26:32 Exactly. And those are all fairly expensive, for a task that we are running on an AWS server where we are trying to spend as little money as possible, because it runs at like the break of dawn, so we don't want to be running multiple Python instances. But when you are doing IO, which doesn't involve any Python, you can use threading- for expensive IO tasks like getting data off of a URL and things like that, that's great for Python's threading. But otherwise, you don't really want to be using it.
27:07 Right, basically, the built in functions that wait on IO release the global interpreter lock, and so that frees up the code to keep on running, more or less, right?
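A minimal sketch of IO-bound threading. Here `time.sleep` stands in for a blocking network or file call; like real IO, it releases the GIL, so the four waits overlap instead of running back to back:

```python
import threading
import time

def fake_io_task(results, index):
    # Pretend this is a slow URL fetch or disk read; sleep releases the GIL.
    time.sleep(0.2)
    results[index] = index * 2

results = [None] * 4
start = time.perf_counter()

# Start all four "downloads" at once, then wait for them to finish.
threads = [threading.Thread(target=fake_io_task, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

elapsed = time.perf_counter() - start
# Sequential would take ~0.8 s; the threaded version finishes in ~0.2 s.
```

For CPU-bound work the same pattern would buy nothing, since the GIL serializes the actual computation.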
27:19 You definitely want to make sure that if you are doing IO that it's not the bottleneck. I mean, as long as everything else is not the bottleneck.
27:26 Right, and we've had a lot of cool advances in Python 3 around this type of parallelism. Right, they just added async and await, the new keywords- that's 3.5 I think it was, right?
27:45 Yeah, 3.5 just came out like 2 days ago.
27:46 Yeah, 2 days ago, so yeah, I mean, that's super new but these are the type of places where async await would be ideal for helping you increase your performance.
27:55 Yeah, the new syntax is like a special case, it's syntactic sugar for yielding, but it makes things much simpler and easier to read. Because if you are using yields to do complex async stuff that isn't just like a generator, then it's very ugly, so they added this new syntax to make it much simpler.
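A tiny sketch of that syntax, written against today's asyncio API (`asyncio.run` itself arrived later, in 3.7); `fetch` is a hypothetical stand-in for a real IO call:

```python
import asyncio

async def fetch(delay, value):
    # await hands control back to the event loop while the "IO" is pending.
    await asyncio.sleep(delay)
    return value

async def main():
    # Both "requests" run concurrently rather than back to back.
    return await asyncio.gather(fetch(0.1, "a"), fetch(0.1, "b"))

results = asyncio.run(main())
```

The same logic with raw generators and `yield` is exactly the hard-to-read style Davis describes; `async`/`await` makes the suspension points explicit.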
28:12 Yeah, it's great, I'm looking forward to doing stuff with that. You also had some guidance on regular expressions. And the first bit I also really liked, it's kind of like your premature optimization bit: once you decide to use a regular expression to solve a problem, you now have two problems.
28:28 Yeah, I always have that issue. Whenever I do anything and I talk to students, I'm like, "oh, look at this text, you could do this," and they are like, "Oh, I'll just use regex to solve it," and I'm saying, "please, no, you'll end up with something that you won't be able to read in two days." Just find a better way to do it, for goodness sake.
28:49 Yeah, I'm totally with you. A friend of mine has this really great way of looking at complex code, both around regular expressions and parallelism. He says that when you are writing that kind of code, you are often writing code right at the limit of your understanding, or your ability to write complex code- and debugging is harder than writing code. And so, you are writing code you literally can't debug.
29:18 Yeah, I think you should always try to make code as simple as possible, because debugging and profiling and looking through your code will be much less fun if you try to be as clever as possible.
29:29 Yeah, absolutely. Clever code is not so clever when it's your bug to solve later.
29:34 Yeah, and I also want to give special mention to Python's regex engine: as much as I dislike regex, I think Python's verbose flag is amazing, and if you haven't looked into it, Python has great support for verbose regexes. So if you have to use it, you can be very verbose about it, and that way it's much better documented in the code.
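A small example of the verbose flag Davis mentions; the phone-number pattern is just an illustration:

```python
import re

# re.VERBOSE lets you split the pattern across lines and document each
# piece inline; whitespace and comments outside character classes are ignored.
us_phone = re.compile(r"""
    (?P<area>\d{3})    # area code
    [-.\s]?            # optional separator
    (?P<prefix>\d{3})  # exchange prefix
    [-.\s]?
    (?P<line>\d{4})    # line number
""", re.VERBOSE)

match = us_phone.search("call 301-555-0123 today")
```

Compare that to the one-liner `r"(\d{3})[-.\s]?(\d{3})[-.\s]?(\d{4})"`: same pattern, but the verbose form explains itself to the next reader.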
29:59 Yeah, that's great advice. So, let's maybe talk about how you solve your problem after you- what was the problem, what did you find out the issue to be in then how did you solve it?
30:10 Yeah, so what we were doing in this code is we were taking GB upon GB of plain text data, like words from users of various forums, and we processed all this data for sentiment analysis. And to do that you need to stem each word to its base word, so that you can make sure you are analyzing the correct word, because there are like 5 different forms of every single word-
30:40 Right, like run, running, ran, all those are kind of basically the same meaning, right?
30:45 Exactly. Yeah, so we get the base word of all those words, and the thing is, with GB of- let's say 5 GB of words, if a word is like 4 bytes, that's like billions, so many words, I don't even want to think about it. And for every single word we stemmed it and we analyzed it, and it was a slow process, and I realized we run the same stemming function, which is an nltk Porter stemmer- which is amazing and it's what everyone uses for stemming. In my 50 MB test case, which is still so many words, thousands upon thousands of words, it ran about 600,000 times. And I was like, my goodness, this is running way too much, and there are only like 400,000 words in the English language. There is no way each of these words needs to be stemmed, because gamers on the internet aren't very linguistically amazing.
31:47 Yeah. Exactly.
31:50 So, I figured, you know, I can create a cache- or, as it's called in more functional terms, I can create a memoization algorithm that saves the computation so I don't need to re-compute the function. Because stemming is a pure function, if you are into functional programming: you don't need to re-compute it, every single time you get the same input.
32:15 Right, if you guarantee that with the same inputs you get the same output, then you can cache the heck out of this thing, right?
32:22 Exactly, so I went from like 600,000 calls to like 30,000. And it was immediate- the whole program ran orders of magnitude faster.
32:40 That's awesome. And you know, what I think I really love about this- two things I love about it: one is, you talked a little bit about the solution on the blog, and it's like nine lines of code?
32:53 Yes, it's Python. So awesome.
32:57 And the other thing is you didn't even have to change your code, necessarily; you were able to create things that you can just apply to your code, you didn't have to rewrite the algorithms or things like this, right?
33:08 Exactly. Yes, I really found that the algorithm worked, it got things done, it did it correctly. I mean, I wasn't opposed to changing the algorithm, obviously, if that was the hot part of the code. But I found out that the hot part of the code wasn't even code that we wrote, it was just code that we were calling from another library.
33:31 And it's probably really optimized, but if you call it 600,000 times, well...
33:36 Nothing is optimized when you are calling it hundreds of thousands of times. You know, at that point you've got to not call it that many times within that time span.
33:45 You basically created a decorator that will just cache the output, and only call the function if it's never seen the inputs before, right?
33:53 Exactly. So, I mean, internally all a decorator does is wrap the function in another function, so it adds an internal cache, which is just a Python dictionary that keeps the function arguments as the key and the output of the function as the value. And if it's been computed, then it's in the dictionary, so all it needs to do is a simple dictionary lookup. That's just like one or two Python bytecode instructions, as opposed to calling the entire function itself, which would be hundreds of Python bytecode instructions.
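A sketch of such a memoizing decorator, in the spirit of what's described here (this is a generic reconstruction, not HumanGeo's exact code; `stem` is a toy stand-in for the real nltk stemmer, and the call counter is only there to show the caching working):

```python
import functools

def memoize(func):
    # The cache maps the argument tuple to the previously computed result.
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        # A dictionary lookup is far cheaper than re-running the function.
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper

calls = {"count": 0}

@memoize
def stem(word):
    # Hypothetical stand-in for an expensive pure function like stemming.
    calls["count"] += 1
    return word.rstrip("s")

words = ["runs", "runs", "cats", "runs", "cats"]
stems = [stem(w) for w in words]
# Five calls arrive, but the underlying function only runs twice.
```

In modern Python the stdlib ships this pattern as `functools.lru_cache`, so for a pure function you can often just stack `@functools.lru_cache(maxsize=None)` on top and be done.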
34:30 Right, yeah. And that's fantastic.
34:33 And when it doesn't find the arguments in the cache, it's just one computation- and for the number of words in the English language that are primarily used, it'll be called much less.
34:48 Right. Yeah, so your typical data set maybe 30 000? 20 000 times?
34:55 Something like that. Yeah.
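The caching decorator Davis describes can be sketched roughly like this (Python's standard library ships a ready-made version as functools.lru_cache; the score_word function below is a hypothetical stand-in for the expensive library call being cached):

```python
import functools

def memoize(func):
    """Cache results keyed by the function's positional arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        # A dict lookup is only a couple of bytecode operations,
        # far cheaper than re-running the wrapped function.
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper

@memoize
def score_word(word):
    # Hypothetical stand-in for the expensive library call.
    return sum(ord(c) for c in word) % 100

first = score_word("python")   # computed and stored
second = score_word("python")  # served straight from the cache
print(first, second)
```

The same effect in modern Python is just `@functools.lru_cache(maxsize=None)` above the function definition.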
34:55 [music]
34:55 This episode is brought to you by Codeship. Codeship has launched Organizations: create teams, set permissions for specific team members, and improve collaboration in your continuous delivery workflow. Maintain centralized control over your organization's projects and teams with Codeship's new Organizations plan.
34:55 And as Talk Python listeners, you can save 20% off any premium plan for the next 3 months. Just use the code TALKPYTHON.
34:55 Check them out at codeship.com and tell them "thanks" for supporting the show on Twitter where they are at @codeship.
34:55 [music]
35:51 The other thing I like is that the solution was so simple. But you probably needed the profiling to come up with it, right? You know, I have an example that I've given a few times about these types of things, and just choosing the right data structure can be really important. I worked on this project that did real time processing of data coming in at 250 times a second, so that leaves 4 milliseconds to process each segment, which is really not much. But it had to be real time, and if it couldn't keep up- well, then you can't run a real time system, or you need some insane hardware to run it, right? And it was doing crazy math, like wavelet decomposition and all sorts of stuff. This was, like I was saying, at the verge of understanding what we were doing, right? And it was too slow.
36:39 The first time it ran I was like, "Oh no, please don't make me try to hand optimize a wavelet decomposition, there's got to be a better way," right? So I break out the profiler, and it turned out that we had to do lookups back in the past on our data structure, and we happened to use a list to go back and look things up. We were spending 80% of our time just looking for stuff in the list; we switched it to a dictionary and it went 5 times faster. And it was almost one line of code changed. It was ridiculously simple, but if you don't use the profiler to find the real problem, you are going to go muck with your algorithm and not even really make a difference. Because it had nothing to do with the algorithm, right, it was somewhere else.
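The list-versus-dictionary difference described here is easy to demonstrate with the standard library's timeit module: membership tests on a list are linear scans, while dictionary lookups are single hash probes (the exact timings below will vary by machine):

```python
import timeit

n = 100_000
as_list = list(range(n))
as_dict = {i: True for i in range(n)}

# Looking up a value near the end of the list forces a scan
# of nearly all 100 000 elements...
list_time = timeit.timeit(lambda: (n - 1) in as_list, number=100)

# ...while the dict lookup is a single hash probe, regardless of size.
dict_time = timeit.timeit(lambda: (n - 1) in as_dict, number=100)

print(f"list: {list_time:.4f}s  dict: {dict_time:.4f}s")
```

On typical hardware the dictionary wins by several orders of magnitude, which is exactly the kind of gap a profiler surfaces immediately.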
37:24 Yeah, I definitely find that there is a lot more push, like in the Java world, to make things final by default- to make them immutable unless they need to be mutable- and a lot of languages are also embracing immutable by default and trying to keep things as strict as possible, so that you only loosen it when you need to. I find the same thing in Python, where I try to use a set unless I absolutely need a list; if I'm just containing elements, a set is much better for finding things.
37:54 Right. If you just want to know "have I seen this thing before," a set is maybe the right data structure. Or if you are going to store integers in a list, you would be much better off using an array of integers or an array of floats, because it's much, much more efficient. You said one of the things you considered was PyPy. Just for those who don't know, what's the super quick elevator pitch on PyPy, and what did you find out about using it?
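Both points come straight from the standard library: a set answers "have I seen this before?" in constant time, and the array module stores numbers as raw machine values instead of full Python objects. A rough sketch (exact byte counts vary by platform):

```python
import array
import sys

# The array stores raw C ints, the list stores pointers to
# full Python int objects, so the array is far more compact.
nums_list = list(range(10_000))
nums_array = array.array('i', range(10_000))
print(sys.getsizeof(nums_list), sys.getsizeof(nums_array))

# A set for deduplication / "seen before" checks.
seen = set()
for word in ["spam", "eggs", "spam"]:
    if word not in seen:
        seen.add(word)
print(sorted(seen))
```

Note that sys.getsizeof on the list doesn't even count the int objects it points to, so the real gap is larger than the printed numbers suggest.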
38:18 Yeah, so PyPy is a highly compliant Python interpreter that at runtime turns the Python code into machine code- it finds the hot parts of your code, whatever is being run a lot, and finds a better way to run that for you. That way it'll run a lot faster than CPython, because CPython doesn't do many optimizations by default, or just in general.
38:43 Right, it runs it through the interpreter, which is a fairly complex thing, rather than turning it straight into machine instructions.
38:49 Yeah. There is a lot more overhead.
38:51 You tried out PyPy- did it make it faster? Did it matter at all?
38:55 Oh yeah- I actually used PyPy before even profiling, just to see, like, "let's just see how much faster PyPy is in this case," and it ran about 5 times faster, because it figured out what to do. But the thing is, under our constraints we wanted to stick with the little AWS instance that we were just running every night, and PyPy uses more memory than CPython to support its garbage collector and its just-in-time compiler, which both need to run at runtime. We didn't really want to spend that money, because if I get it down to an hour in CPython and it runs at 2 am, no one is going to be looking at that stuff at 3 am.
39:42 Right, absolutely. And if you can optimize it in CPython and then feed it into PyPy maybe it could go even faster, right?
39:49 Exactly. As opposed to going from 10 hours to 1 hour, if I were running in PyPy with the cache optimization, it would probably run in like 20 or 30 minutes. Like I said, it was unnecessary- it would have been nice, but we didn't need it, so we didn't really feel like spending the time to add that to the pipeline.
40:08 Right. Sure. It seems like if you needed it to go faster, you could just keep iterating on this. Your decorator re-wraps the function for every run of your Python app, but you could actually save the cache to Redis, or to a file, or to a database- you could just keep going on this idea, right?
40:33 Yeah, that was definitely something I thought about doing, but it was fast enough already. As for what comes next after that- I always say that when you profile and you have a flat profile, with nothing sticking out and nothing that looks like it needs to be optimized, that's when you need to change the runtime. That's when you need to look into an FFI, or PyPy, or Jython with a proper JIT.
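Persisting the cache across runs, as suggested above, only takes a few more lines. Here is a sketch using pickle and a local file- Redis or a database would follow the same pattern; the cache filename and the slow_square function are arbitrary examples, not anything from the project discussed:

```python
import atexit
import functools
import os
import pickle

CACHE_FILE = "score_cache.pkl"  # arbitrary example path

def disk_memoize(func):
    """Memoize into a dict that is loaded from and saved to disk."""
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, "rb") as f:
            cache = pickle.load(f)
    else:
        cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    def _save():
        # Write the cache out at exit, so the next run starts warm
        # instead of recomputing everything from scratch.
        with open(CACHE_FILE, "wb") as f:
            pickle.dump(cache, f)

    atexit.register(_save)
    return wrapper

@disk_memoize
def slow_square(x):
    return x * x  # stand-in for the expensive call

print(slow_square(12))
```

The arguments must be picklable (and hashable), which is the main constraint this approach adds over an in-memory dictionary.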
40:59 Interesting. What's FFI?
41:00 So, FFI is a Foreign Function Interface- it's a general term used across languages for dropping down into C or any C-compatible language.
41:10 Yeah, so you would basically go write it in C and then hook into that with the FFI.
41:14 Exactly. Or use Cython, for example, which compiles Python down to C with a weird Python-plus-C syntax.
41:23 If you have to, right?
41:24 Yeah. I never tried it but I've seen it. Like you said, you annotate Python variables with C types- you say double i = 0 in Python syntax. It's really strange.
41:38 Yeah, that is strange. So, I talk about PyCharm on the show a lot, I'm a big fan of PyCharm. And they just added built-in profiling in their latest release, 4.5.
41:51 That sounds nice.
41:51 Yeah, I don't know if it is out yet; it has a visualizer too, kind of like PyCallGraph. But they said they are using something called the YAPPI profiler, and it will fall back to cProfile if it has to. Do you know YAPPI?
42:05 Yeah, I was looking at all these profilers- PyPy also comes with a profiler that works on Linux, vmprof- and those are all different profilers. I looked at them and they are nicer, but I really loved how simple cProfile was. I got the results, and it comes with Python- you didn't need to install anything, you just ran the module. That's why I was so happy with using it and didn't need to try a different profiler.
42:34 Right, that's cool, just python -m cProfile and then your app- boom.
42:41 Exactly. And if you are using PyCharm, and it comes with YAPPI or any other profiler, definitely use that- I'm sure it's great, because the PyCharm people are awesome. But for the purposes of the simplicity that I wanted to keep in the blog post, I only used the word pip once- and that was to install PyCallGraph. That's just how simple it needs to be.
43:06 Right, and you maybe could have gotten away without even that if you really wanted to- it's just text output. The other thing is, this is something you can do really well on your server. You can go over and profile something where it has real data you can get to- maybe it works fine on my machine, but not in production, or not in QA or something- and you don't have to go install PyCharm on the server.
43:27 Exactly. And that's why Python is amazing for being "batteries included"- it includes all of these nice things that any developer is going to need to use eventually.
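The batteries-included workflow described here needs nothing beyond the standard library. On the command line it is python -m cProfile -s cumtime yourscript.py; the same thing can be driven from code, with pstats to sort and print the hot spots (the functions below are made-up examples):

```python
import cProfile
import io
import pstats

def slow_part():
    # Made-up hot spot for demonstration purposes.
    return sum(i * i for i in range(200_000))

def main():
    for _ in range(5):
        slow_part()

# Profile main() and print the top entries by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

The report lists, per function, the call count and total/cumulative time- the "600 000 calls" figure from earlier in the conversation is exactly the kind of number that jumps out of this output.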
43:45 You want to talk a little bit about open source and what you guys are doing with open source at HumanGeo?
43:50 Yeah, so at HumanGeo we have a GitHub profile, and most of our open source stuff is the JavaScript Leaflet stuff that we have incorporated- if you want any map visualizations, I like Leaflet much better than the Google Maps API. So I'd recommend looking at the stuff we've done there. We also have a couple of Python libraries- we have an Elasticsearch binding that we've used, which has since been superseded by Mozilla's. We definitely love open source at HumanGeo, we make and use open source libraries, and that is one of my favorite parts about HumanGeo.
44:36 That's really awesome. And you guys are hiring, right?
44:40 Oh yeah, we are always hiring, we are looking for great developers at any age- Python, Java. Like I said, we were named one of the best places to work for millennials in the DC Metro Area-
44:53 Right, and for the international listeners Washington DC, in the United States.
45:00 Yeah, whether you have a government clearance or not, we would love to hear from you- just send an email or call or get in touch if you want to work at an amazing place that does awesome Python.
45:11 Yeah, that's really cool, it should definitely reach a bunch of enthusiastic Python people, so if you are looking for a job give those guys a call, it's awesome.
45:19 Definitely.
45:21 So you guys are big fans of Python, right?
45:22 Oh yeah, I think they have been using Python since the company started. I was looking at the original commits from when the company started- there were 4 projects and they were all Python, which is really exciting; they've been using it since the beginning. It's amazing to rapidly iterate with, it's fast enough obviously, and it's super easy to profile when you need to. That's another reason why it is amazing- it's not just that you can say it's slow; when it is, it's easy to optimize.
45:52 Yeah, that's really cool. So are there some practices or techniques or whatever that you guys have sort of landed on or focusing on that you have been doing it for so long that you could recommend?
46:04 At HumanGeo we don't go, like, full agile or anything- we definitely consider ourselves fast moving, and we work at a very good pace, so I guess you could call it agile. We have great Git practices- we use GitFlow- and we make sure to have good code review on any code base, including Python. And when I write Python code, I write a lot of unit tests for my code, things like that...
46:39 Nice. So is that the unittest module, or pytest, or-
46:46 Oh yeah, the standard library one. I love the standard library stuff.
46:49 Yeah, "the batteries included", right?
46:51 Yes, definitely.
46:51 So, anything else you want to touch on, before we kind of wrap things up, Davis?
47:00 No, I'm just really excited to be given this opportunity, I only want to give a shoutout to all of my amazing colleagues at HumanGeo.
47:06 Yeah, that's awesome. You guys are doing cool stuff, so I'm glad to shine a light on that. So final two questions, before I let you out of here: favorite editor? What do you use to write Python these days?
47:16 I'll say that for my open source work- like the website I work on for the DC Python community- I use Sublime Text for all of that, all of my tools. And when I work for a company, I ask them for a PyCharm license, because it's great for big projects that you really need to focus on.
47:38 Yeah. You know, like I said, that's my favorite editor as well, and there is definitely a group of people that love the lightweight editors, like Vim and Emacs and Sublime, and then people that like IDEs. I feel like when you are on a huge project, you can understand it sort of more in its entirety using something like PyCharm, so...
48:00 Yeah. My favorite thing about PyCharm is that you can control-click or command-click on a module and it will take you to the source for that module. So you can really quickly look at where the code is flowing in the source.
48:14 Yeah, absolutely. Or, "hey, you are importing a module but it's not specified in the requirements. Do you want me to add it for you for this package you are writing...," stuff like that, it's sweet.
48:22 And the great support for the tooling, and the tool chain of Python.
48:25 Yeah. Awesome. Davis, this has been a really interesting conversation. And hopefully some people can go make their Python code faster.
48:33 Yeah, I definitely hope that they will and I hope that they learned a lot from this.
48:38 Yeah, thanks for being on the show.
48:40 No problem, thank you so much.
48:41 Yeah, talk to you later.
48:41 This has been another episode of Talk Python To Me.
48:41 Today's guest was Davis Silverman and this episode has been sponsored by Hired and CodeShip. Thank you guys for supporting the show!
48:41 Hired wants to help you find your next big thing. Visit hired.com/talkpythontome to get 5 or more offers with salary and equity right up front and a special listener signing bonus of $4,000 USD.
48:41 Codeship wants you to ALWAYS KEEP SHIPPING. Check them out at codeship.com and thank them on twitter via @codeship. Don't forget the discount code for listeners, it's easy: TALKPYTHON
48:41 You can find the links from the show at talkpython.fm/episodes/show/28
48:41 Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes and direct RSS feeds in the footer on the website.
48:41 Our theme music is Developers Developers Developers by Cory Smith, who goes by Smixx. You can hear the entire song on talkpython.fm.
48:41 This is your host, Michael Kennedy. Thanks for listening!
48:41 Smixx, take us out of here.