#28: Making Python Fast: Profiling Python Code Transcript
00:00 Is that Python code of yours running a little slow? Are you thinking about rewriting the
00:04 algorithm or maybe even in another language? Well, before you do, you'll want to listen to
00:08 what Davis Silverman has to say about speeding up Python code using profiling.
00:12 This is show number 28, recorded Wednesday, September 16th, 2015.
00:17 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the library,
00:47 the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter
00:51 where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm
00:56 and follow the show on Twitter via @talkpython. This episode is brought to you by Hired and
01:02 CodeShip. Thank them for supporting the show on Twitter via @Hired_HQ and @codeship.
01:08 There's nothing special to report this week, so let's get right to the show with Davis.
01:13 Let me introduce Davis. Davis Silverman is currently a student at the University of Maryland working
01:18 part-time at the Human Geo Group. He writes mostly Python with an emphasis on performant
01:23 Pythonic code. Davis, welcome to the show.
01:26 Hello.
01:27 Thanks for being here. I'm really excited to talk to you about how you made some super slow Python
01:33 code much, much faster using profiling. You work at a place called the Human Geo Group. You guys do a
01:39 lot of Python there, and we're going to spend a lot of time talking about how you took some of your
01:44 social media sort of data, real-time analytics type stuff, and built that in Python and improved
01:51 it using profiling. But let's start at the beginning. What's your story? How did you get into programming
01:56 in Python?
01:56 I originally, when I was a kid, obviously, I grew up and I had internet. I was lucky, and I was very into
02:07 computers, and my parents were very happy with me building and fixing computers for them, obviously.
02:15 So by the time high school came around, I took a programming course, and it was Python, and I fell in
02:22 love with it immediately. And I've since always been programming in Python. I've been programming in Python
02:27 since sophomore year of high school. So it's been quite a few years now.
02:31 I think all of us programmers unwittingly become tech support for our families and whatnot, right?
02:38 Oh, yeah. My entire family, I'm that guy.
02:41 Yeah. I try to not be that guy, but I end up being that guy a lot.
02:45 So you took Python in high school. That's pretty cool that they offered Python there. Do they have
02:52 other programming classes as well?
02:54 Yeah. So my high school, I live in the DC metro area. I live in Montgomery County. It's a very nice
03:01 county, and the schools are very good. And luckily, the intro programming course was taught by a very
03:07 intelligent teacher. So she taught us all Python. And then the courses after that were Java courses,
03:13 the college-level advanced placement Java, and then a data structures class after that.
03:18 So we got to learn a lot about the fundamentals of computer science with those classes.
03:24 Yeah, that's really cool. I think I got to take BASIC when I was in high school, and that was about it.
03:27 It was a while ago.
03:28 I wrote a BASIC interpreter, but it wasn't very good.
03:32 Cool. So before we get into the programming stuff, maybe you could just tell me,
03:37 what is the Human Geo Group? What do you guys do?
03:39 Yeah. So the Human Geo Group, we're a small government contractor, and we deal mostly in government contracts,
03:48 but we have a few commercial projects and ventures that I was working on over the summer that we'll be
03:53 talking about. We're a great small little company in Arlington, Virginia, and we actually just won an award for one of the best places to work in the DC metro area for millennials and for younger people.
04:05 If you go to thehumangeo.com, you guys have a really awesome webpage. I really like how that is. It's like, bam, here we are. We're about data, and it just has kind of this live page. So many companies want to get their CEO statement and all the marketing stuff and the news. You guys are just like, look, it's about data. That's cool.
04:27 Yeah. There was a recent, I don't remember how recent, website rewrite, and this one guy decided to take the time, and he's like, I really want to do something, you know, show some geographic stuff.
04:40 So he used Leaflet.js, which we do a lot of open source with on our GitHub page, and he made it very beautiful. There are even icons of all the people at HumanGeo. And I think it's much better than a generic, you know, contractor site, like you said. It's much better and much more energetic.
04:59 It looks to me like you do a lot with social media, like sentiment analysis, and tying that to location, the geo part, right? And all that kind of stuff. What's the story there? What kind of stuff do you guys do?
05:12 Yeah. So what I was working on: one of our customers is a billion-dollar entertainment company. And, I mean, you've probably heard of them. I think we talk about them on our site.
05:27 And what we do is we analyze various social media sites like, you know, Reddit and Twitter and YouTube, and we gather geographic data if available. And we gather sentiment data using specific algorithms from things like the natural language toolkit, which is an amazing Python package.
05:50 Then we show it to the user in a very nice website that we have created.
05:50 So you work for this entertainment company as well as, like, a government contractor.
05:56 What is the government interested in with all this data? The U.S. government, that is, right? For the international listeners.
06:01 Yeah, it's definitely the United States government. We do less social media analysis for the government. We do some, but it's nothing.
06:11 It's not what people think the NSA does, definitely. I think, you know, just like anything a company would want: you'd search on something and it would have, like, oh, there are Twitter users talking about this
06:26 in the, you know, in these areas. Yeah. I guess the government would want to know that, especially in, like, emergencies or things like that, possibly. Right.
06:36 Yeah. We also do some platform stuff. Like we create certain platforms for the government. That's not necessarily social media stuff.
06:46 Right. Sure. So how much of that is Python and how much of that is other languages?
06:51 So at HumanGeo, we do tons of Python on the backend. For some of the government stuff, we do Java, which is big in the government, obviously. So, I mean, we definitely do have a lot of Python. We use it a lot in the pipeline, for various tools and things that we use internally and externally at HumanGeo. I know that the project that I was working on is exclusively Python.
07:16 Python in all parts of the pipeline, for gathering data and for representing data, for, you know, the backend server and the front end server. So that was all Python.
07:26 Right. So it's pretty much Python end to end, other than, I don't know specifically the project you were working on, but it looks like it's very heavy on D3, fancy JavaScript stuff on the front end for the visualization.
07:39 But other than that, it was more or less Python, right?
07:41 We do use a lot. I mean, yeah, we do have some amazing JavaScript people. They do a lot of really fun-looking stuff.
07:48 Yeah. You can tell it's a fun place for, like, data display, front-end development. That's cool.
07:53 So Python two or Python three.
07:55 So we use Python 2, but I was making sure, I mean, when I was working on the code base, I was definitely writing Python 3 compatible code, using the proper __future__ imports.
08:06 And I was testing it on Python 3, and we're probably closer to Python 3 than a lot of companies are. We just haven't expended the time to do it.
08:14 We probably will in 2018, when Python 2 is nearing end of life.
08:19 That seems like a really smart way to go. Did you happen to profile it under CPython 2 and CPython 3?
08:27 I didn't. It doesn't fully run in CPython three right now. I wish I could.
08:33 It would just be really interesting since you spent so much time looking at the performance if you could compare those, but yeah, if it doesn't run.
08:38 That would be interesting. You're right. I wish I could know that.
08:41 I suspect that most people know what profiling is, but there's a whole diverse set of listeners. So maybe we could just say really quickly, what is profiling?
08:50 Yeah. So profiling, in any language, is knowing heuristics about what's running in your program. For example, how many times is a function called, or how long does it take for this section of code to run? And it's simply, like, a statistical thing. You get a profile of your code, and you see all the parts of your code that are fast or slow, for example.
09:12 You wrote a really interesting blog post and we'll talk about that in just a second. And I think like all good profiling articles or topics, you know, you, you sort of point out what I consider to be the first rule of profiling or more like the first rule of optimization, which profiling is kind of the tool to get you there, which is to not prematurely optimize your stuff. Right.
09:35 Yeah, definitely.
09:36 Yeah. You know, I've spent a lot of time thinking about how programs run and why they're slow or why they're fast, or worrying about this little part or that little part. And, you know, most of the time it just doesn't matter. Or if it is slow, it's slow for reasons that were unexpected. Right.
09:52 Yeah, definitely. Yeah. I, I always make sure that there is a legitimate problem to be solved before spending time doing something like profiling or optimizing a code base.
10:03 Definitely. So let's talk about your blog post because that kind of went around on Twitter in a pretty big way and on the Python newsletters and so on. And I read that. Oh, this is really cool. I should have Davis on and we should talk about this.
10:14 So on the HumanGeo blog, you wrote a post called Profiling in Python, right? What motivated you to write that?
10:25 Yeah. So when I was working on this code, basically my coworkers, my boss, they said, you know, this piece of code, we have this pipeline for gathering data, specifically the data that the customers give us.
10:40 And we ran it at about, like, 2am every night. The problem was it took 10 hours to run this piece of code. It was doing very heavy text processing, which I'll talk about more later, I guess.
10:53 It was doing a lot of text processing, which ended up taking 10 hours, and they looked at it and they said, you know, it's updating at noon every day, and the workday starts at, like, nine.
11:03 So we should probably try to fix this and get it working faster. Davis, you should totally look at this as a great first project.
11:10 Here's your new project, right?
11:12 Yeah. It was like, here's day one. Okay. Make this thing 10 times faster. Go.
11:17 Yeah. Oh, is that all you want me to do?
11:19 Yeah. So I wrote this, and I did what any guy would do. The first thing I did was, you know, profile it, which is the first thing you should do, to make sure that there's actually a hotspot.
11:32 And I ran through the process that I talked about in the blog post. I talk about the tools I used. And I realized that it wasn't a simple thing for me to do all this. I don't do this often. And I figured that a lot of people, like you said, maybe don't know about profiling or haven't done this.
11:51 So I said, you know, if I write a blog post about this, then hopefully somebody else won't have to Google, like, 20 different posts and read all of them to come up with one coherent idea of how to do this, in one fell swoop.
12:03 Right. Maybe you can just lay it out. Like, these are the five steps you should probably start with. And, you know, profiling is absolutely an art more than it is, you know, engineering, but it, at least having the steps to follow is super helpful.
12:17 You started out by talking about the CPython distribution. And I think it'd be interesting to talk a little bit about potentially alternative implementations as well. Cause you, you did talk about that a bit.
12:30 Yep. You said there are basically two profilers that come with CPython: profile and cProfile.
12:38 Yeah. So the two profilers that come with Python, as you said, profile and cProfile, both have the same exact interface and include the same heuristics.
12:50 But the idea behind profile is that it's written in Python and is portable across Python implementations, hopefully, while cProfile is written in C.
13:00 And it's pretty specific to a C interpreter, such as CPython or even PyPy, because PyPy is very interoperable.
13:07 Right. It does have that interoperability layer. So maybe if we were doing Jython or IronPython or Pyston or something, we couldn't use cProfile.
13:19 Yeah. I wouldn't say that you can.
13:22 I'm just guessing. I don't really know, I haven't tested it.
13:25 I would say that, for Jython and IronPython, you could use the standard Java or .NET profilers instead.
13:34 I'm pretty sure those will work just fine because I mean, they've, they're known to work great with their respective ecosystems.
13:41 Do you know whether there's a significant performance difference? Let me take a step back. You know, it seems to me like when I've done profiling in the past, that it's a little bit like the Heisenberg uncertainty principle.
13:54 And that if you observe a thing by the fact you've observed it, you've altered it. Right. You know, when you run your code under the profiler, it might not behave in the same exact timings and so on as if it were running natively, but you can still get a pretty good idea usually. Right. So is there a difference across the C profile versus profile in that regard?
14:17 Oh yeah, definitely. Profile is much slower, and it has much higher latency and overhead, I mean, than cProfile, because it has to do a lot more work.
14:29 I mean, Python exposes CPython internals in some Python modules, but they're a lot slower than just using straight C and getting straight to it. So if you're using regular profile, well, if you're using CPython or PyPy, I'd recommend using cProfile instead, because it has much lower overhead and it gives you much better numbers that you can work with, that make more sense.
14:53 Okay. Awesome. So how do I get started? Like, suppose I have some code that is slow, how do I run it under cProfile?
14:59 Yeah. So the first thing that I would do, I mean, is in the blog post, which I'm pulling up right now just to see.
15:06 And I'll be sure to link to that in the show notes as well so that if people can just jump to the show notes and pull it up.
15:14 Yeah. Well, so one of my favorite things about cProfile is that you can call it using, you know, the python -m syntax, the run-a-module syntax, and it will print your profile out to standard out for you when it's done.
15:27 It's super easy. I mean, all you need to do is just give it your, you know, main Python file and it'll run it. And then at the end of the run, it'll give you the profile. It's super simple, and one of the reasons why the blog post was so easy to write.
15:42 Yeah, that's cool. So by default, it looks like it gives you basically the total time spent in methods, all of the methods, you know, looking at the call stack and stuff, a number of times it was called, stuff like that, right? The cumulative time, the individual time per call and so on.
16:00 It gives you the defaults, like you said, you're correct. And it's also really easy to give it a sorting argument, so that you can sort on, say, how many times a function is called. Like, if it's called 60,000 times, it's probably a problem in a, you know, 10-minute run.
16:17 And, you know, it could be called only twice but take an hour to run. That would be very scary. In which case, you definitely want to sort it both ways. You want to sort it every way, just in case you're missing something important.
16:33 Right. You want to slice it in a lot of different ways. How many times was it called? What was the, you know, maximum individual time, cumulative time, all those types of things.
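[Editor's note: for anyone following along, the invocation being described looks roughly like this; the script name is hypothetical.]

```
# Run a script under cProfile and print stats sorted by cumulative time
python -m cProfile -s cumulative my_script.py

# Sort by call count instead, to spot functions called huge numbers of times
python -m cProfile -s calls my_script.py

# Or save the raw stats to a file for later analysis with the pstats module
python -m cProfile -o profile.out my_script.py
```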
16:42 All right. Maybe you're calling a database function. Right. And maybe that's just got tons of data.
16:49 Yeah. It's slow. Yeah. Yeah. Maybe, maybe the database is slow. And so that says, well, don't even worry about your Python code. Just go make your database fast or use Redis for caching or something. Right.
16:59 Or, yeah, work on your query. Maybe you can make it a distinct query and get a much smaller data set coming back to you.
17:08 Yeah, absolutely. Yeah. It could just be a smarter query you're doing. So this all is pretty good. I mean, this, this output that you get is pretty nice, but in a real program with many, many functions and so on, that can be pretty hard to understand, right?
17:26 Yes. I definitely, when I was working on this, I had the same issue. There were so many lines of code. It was filling up my terminal, you know, and what I had to do was save it to an output file, and that was too much work.
17:39 So I was searching around more, and I found PyCallGraph, which is amazing at showing you the flow of a program, and it gives you a great graphical representation of what cProfile is also showing you.
17:50 That's awesome. So it's kind of like a visualizer for the C profile output, right?
17:56 Yeah. It even colors it in: the more red it is, the hotter the call is, the more times it runs and the longer it runs.
18:04 Yeah. That's awesome. So just pip install PyCallGraph to get started, right?
18:08 Super simple. It's one of the best things about pip as a package manager.
18:13 Yeah. I mean, that's part of why Python is awesome, right? pip install.
18:16 Definitely.
18:17 Whatever. Anti-gravity. So then to invoke it, you say basically pycallgraph and then graphviz.
18:26 And then how does that work?
18:29 So PyCallGraph supports outputting to multiple different file formats. Graphviz is a program that can render DOT files, .dot files,
18:45 I don't really understand how to say it out loud. So the first argument for
18:51 pycallgraph is the output format, the program that's going to
18:55 read it, and then you give it the double dash, which means what follows is not part of the pycallgraph options;
19:01 it's now the Python program that you're calling, and its arguments.
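[Editor's note: the command being described looks something like this; the script name and argument are hypothetical, and Graphviz itself has to be installed for the rendering step.]

```
# Everything before the double dash configures pycallgraph;
# everything after it is the program being profiled.
pycallgraph graphviz -- ./my_script.py --some-arg

# By default this writes the rendered call graph to pycallgraph.png
```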
19:06 So it's almost basically the same as cProfile, but it's kind of inverted, right? And you can get really simple
19:13 call graphs that are just this function called this function, which called this function, and it took this
19:17 amount of time. Or you can get quite complex call graphs, like you can say, you know, this module
19:25 calls these functions, which then reach out to this other module, and then they're all
19:29 interacting in these ways. That's pretty amazing. Yeah, it shows exactly what module you're using.
19:36 I mean, like if you're using regular expressions, it'll show you each part of the re
19:40 module, like re.compile, or, you know, the different modules that are using the re module,
19:46 and then it'll show you how many times each is called, and they're all boxed nicely.
19:50 And it gives you, I mean, the image is so easy to look at, and you can just zoom in on the exact part you want
19:56 and then look at what calls it and where and what it calls, to see, you know, how the program flows, much simpler.
20:03 Yeah, that's really cool. And you know, it's something that just came to me as an idea
20:07 I'm looking at. This episode is brought to you by Hired. Hired is a two-sided,
20:23 curated marketplace that connects the world's knowledge workers to the best opportunities.
20:27 Each offer you receive has salary and equity presented right up front, and you can view the offers to accept or reject them
20:35 before you even talk to the company. Typically, candidates receive five or more offers in just the first week,
20:41 and there are no obligations, ever. Sounds pretty awesome, doesn't it? Well, did I mention there's a signing bonus?
20:47 Everyone who accepts a job from Hired gets a $2,000 signing bonus,
20:51 and as Talk Python listeners, it gets way sweeter. Use the link hired.com/talkpythontome
21:00 and Hired will double the signing bonus to $4,000. Opportunity's knocking,
21:05 visit hired.com/talkpythontome and answer the call. Because it colors the hot spots and all, it's really good for profiling, but
21:22 even if you weren't doing profiling, it seems like that would be pretty interesting for just
21:25 understanding new code that you're trying to get your head around. Oh yeah, that's definitely true.
21:31 I've employed it since as a method to look at the control flow of a program.
21:36 Right, how do these methods, how do these modules and all this stuff, how do they relate? Like, just run the main method and see what happens, right?
21:44 Exactly. It's become a useful tool of mine that I'll definitely be using in the future.
21:52 I always have it in my virtualenv nowadays. So we've taken cProfile, we've
21:58 applied it to our program, we've gotten some textual version of the output that tells us where we're spending our time in various ways, and then we can use PyCallGraph to visualize that and understand it more quickly.
22:09 So then what? Like, how do you fix these problems? What are some of the common things you can do?
22:14 Yeah, so as I outlined in the blog post, there's a plethora of methods, depending on what
22:22 your profile is showing. For example, if you're spending a lot of time
22:28 in Python code, then you can definitely look at things like using a different interpreter, for example an optimizing compiler like
22:37 PyPy, which will definitely make your code run a lot faster, as it'll translate it to machine code at runtime.
22:43 Or you can also look at the algorithm that you're using and see if it's, for example, an O(n³)
22:49 time-complexity algorithm. That would be terrible, and you might want to fix that.
22:54 Yeah, those are the kinds of problems that you run into when you have small test data,
22:58 and you write your code and it seems fine, and then you give it real data and it just dies, right?
23:02 Exactly. They gave me the code, I mean, they gave me like five gigabytes of data and they said, this is the data that we get on a
23:11 nightly basis, and I said, oh my god, this will take all day to run. So I used smaller test pieces,
23:19 and luckily I used ones big enough that they showed some statistically significant numbers for me.
23:25 Right, it was something you could actually execute in a reasonable amount of time as you
23:29 went through your exploration, but not so big that it felt like C++
23:33 compile times, except at runtime, because it just kind of sits there while it's processing.
23:39 Right, so you'd rather not only do it once a day. You mentioned some interesting things.
23:45 PyPy is super interesting to me. We're going to talk more about how, well, you chose not to use that, but
23:51 you know, on show 21 we had Maciej from the PyPy team. Oh yeah, I have it up right now, I'm gonna listen to it later.
23:59 Yeah, I'm so excited. Yeah, that was super interesting, and we talked a little bit about
24:03 optimization there, and like why you might choose an alternate interpreter. That was cool.
24:08 Then there's some other interesting things you can do as well, like you could use namedtuples instead of actual classes, or you could use built-in data
24:18 structures instead of creating your own, because a lot of times the built-in structures like list and array
24:23 and dictionary and so on are implemented deep down in C and they're much faster.
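[Editor's note: a small hedged sketch of that namedtuple idea, with hypothetical names:]

```python
from collections import namedtuple

# A lightweight record type instead of a hand-rolled class
Point = namedtuple("Point", ["x", "y"])

p = Point(x=1.5, y=2.0)
print(p.x, p.y)   # attribute access, like a class
print(tuple(p))   # still an ordinary tuple underneath
```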
24:27 Yeah, definitely. One of the best examples of this is that I saw some guy who wrote his own dictionary class,
24:35 and it was a lot slower. And this isn't in the HumanGeo code base, just so you know, we have good code at HumanGeo.
24:43 You guys don't show up on The Daily WTF. Oh, please, no, we're much better than that. No, this is another place where I saw some code.
24:51 And yeah, I mean, they have a lot of optimizations, like in the latest Python release
24:56 they actually made the dict class, I mean, it's actually now an ordered dict in the latest Python releases,
25:02 because they basically copied what PyPy did, they did the same thing.
25:06 Yeah, so you should always trust the internal stuff. Yeah, and that's really true. And if you're going to stick
25:14 with CPython, as you know the majority of people do, understanding how the interpreter actually works
25:20 is also super important. I've talked about it several times on the show, and had Philip Guo on the show.
25:28 He recorded a 10-hour, basically a graduate course, at the University of Rochester,
25:34 and put it online. So definitely, I can try to link to that, and check it out. That's show 22.
25:40 I'd love to see that. Yeah, I mean, you really understand what's happening inside the C runtime,
25:45 and so then you're like, oh, I see why this is expensive, and all those things can help. You know, another thing
25:50 that you talked about that was interesting was IO, like I gave you my example of databases,
25:54 or maybe you're talking about a file system or, you know, a web service call, something like this,
26:00 right? And you had some good advice on that, I thought. Yeah, definitely. Basically, CPython's
26:07 GIL, the global interpreter lock, is very precluding. You can't do multiple
26:16 computations at the same time, because CPython only allows one core to be used
26:22 at a time. Right, basically computational parallelism is not such a thing in Python. You've got
26:28 to drop down to C or fork the processes or something like that, right? Exactly, and those are all fairly
26:33 expensive for a task that we're running on an AWS server, an AWS server where we're trying to spend as
26:39 little money as possible, because it runs at like the break of dawn. So we don't want to be running multiple
26:45 Python instances. But when you're doing IO, which doesn't involve any Python, you can use
26:52 threading to do things like, you know, expensive file system and IO tasks, like getting
26:59 the data off a URL or things like that. That's great for Python's threading, but otherwise
27:05 you don't really want to be using it. Right, basically the built-in functions that wait on IO
27:12 release the global interpreter lock, and so that frees up your other code to keep on running, more or less.
27:18 Right. You definitely want to make sure that if you're doing IO, it's not the bottleneck.
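[Editor's note: a minimal sketch of the threading-for-IO idea being described, with hypothetical URLs; written for Python 3, where Python 2's urllib2 became urllib.request.]

```python
import threading
import urllib.request

urls = ["https://example.com/a", "https://example.com/b"]  # hypothetical

def fetch(url):
    # The blocking network read releases the GIL while it waits,
    # so the other threads keep making progress.
    with urllib.request.urlopen(url) as resp:
        print(url, len(resp.read()))

threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
```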
27:23 I mean, as long as everything else is not the bottleneck, right? And you know, we've had
27:28 a lot of cool advances in Python 3 around this type of parallelism, right? And they just added async and
27:37 await, the new keywords, in, is that 3.5, I think it was? Right, yeah, 3.5 just
27:43 came out like two days, yeah, two days ago. So yeah, I mean, that's super new, but these are the types
27:49 of places where async and await would be ideal for helping you increase your performance. Yeah,
27:55 the new syntax is like a special case, it's syntactic sugar for yielding, but it makes things
28:02 much simpler and easier to read, because if you're using yields to do complex async stuff that isn't just
28:08 like a generator, then it's very ugly. So they added this new syntax to make it much simpler. Yeah, it's great. I'm
28:13 looking forward to doing stuff with that.
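[Editor's note: a minimal sketch of the Python 3.5 syntax being discussed, using the event loop API of that era; the coroutine names are hypothetical.]

```python
import asyncio

async def greet_later(name, delay):
    # "await" suspends this coroutine without blocking the event loop
    await asyncio.sleep(delay)
    print("hello,", name)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(
    greet_later("first", 1),
    greet_later("second", 1),
))
loop.close()
```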
28:18 You also had some guidance on regular expressions, and the first bit I also really liked, kind of like your premature optimization bit, is: once you
28:24 decide to use a regular expression to solve a problem, you now have two problems. Yeah, I always have that
28:29 issue. Whenever I do anything and I talk to students, I'm like, oh,
28:36 look at this text, and you could do this, and they're like, oh, I'll just use regex to solve it. I'm saying, please, no.
28:40 You know, you'll end up with something that you won't be able to read in the next two days.
28:46 You know, just find a better way to do it, for goodness sakes. Yeah, I'm totally with you. A friend of mine
28:51 has this really great way of looking at complex code, both around regular expressions and
28:58 parallelism, and he says when you're writing that kind of code, you're often writing code right at the limit
29:03 of your understanding, of your ability to write complex code. And debugging is harder than
29:09 writing code, so you're writing code you literally can't debug. So maybe don't
29:15 go quite that far, right? Yeah, I think you should always try to make code as simple as possible, because
29:22 debugging and profiling and looking through your code will be much less fun if you try to be as
29:28 clever as possible. Yeah, absolutely. Clever code, it's not so clever when it's your bug to solve later.
29:32 Yeah, and I also tried to give special mention to Python's regex engine. As much as I dislike regex,
29:40 I think Python's re.VERBOSE flag is amazing, and if you haven't looked into it, Python has great
29:48 support for annotating the regex. So if you have to use it, you can be very verbose about it,
29:55 and that way it's much better documented in the code. Yeah, that's great advice.
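[Editor's note: a small hedged example of the re.VERBOSE flag being praised here, with a hypothetical pattern:]

```python
import re

# Whitespace and comments inside the pattern are ignored under re.VERBOSE,
# so each piece can be documented inline.
phone = re.compile(r"""
    (\d{3})     # area code
    [-.\s]?     # optional separator
    (\d{3})     # prefix
    [-.\s]?     # optional separator
    (\d{4})     # line number
""", re.VERBOSE)

print(phone.match("301-555-0123").groups())  # ('301', '555', '0123')
```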
30:02 So let's maybe talk about how you solved your problem. What was the problem, what did you find the issue
30:08 to be, and then how did you solve it? Yeah, so what we were doing in this code is we were taking gigabytes
30:15 upon gigabytes of plain text data, like words, from users of, you know, various forums, and we
30:24 processed all this data for sentiment analysis. And to do that, you need to stem each word to its base word,
30:32 so that way you can make sure you're analyzing the correct word, because there's like five different
30:37 forms of every single word. Right, like running, ran, all those have kind of basically the same meaning. Exactly.
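[Editor's note: stemming with NLTK's Porter stemmer, which comes up a little later in the conversation, looks something like this:]

```python
from nltk.stem.porter import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["running", "runs", "ran"]:
    print(word, "->", stemmer.stem(word))

# running -> run, runs -> run; "ran" stays "ran", since a stemmer
# only trims suffixes rather than doing full lemmatization
```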
30:45 Yeah, okay. So we get the base word of all those words, and the thing is, gigabytes, like, let's say
30:52 five gigabytes of words, if a word is like four bytes, you know, that's like billions, that's so many
30:59 words, I don't even want to think about it. And for every single word, we stemmed it and we analyzed it, and it was a
31:06 slow, arduous process. And as I was profiling, I realized we run the same stemming function,
31:13 which is in NLTK, it's a Porter stemmer, which is amazing and it's what everyone uses for stemming.
31:18 We ran it, in my, you know, 50-megabyte test case, which is still so many words, thousands upon thousands
31:27 of words, it ran about like 600,000 times, and I was like, my goodness, this is running way too much.
31:34 And there's only like 400,000 words in the English language. There's no way, you know, each of these
31:39 words needs to be stemmed, because, you know, gamers on the internet aren't very, you know, linguistically
31:45 amazing. Yeah, exactly. So I figured, you know, I can create a cache, or as it's called, you know,
31:55 in more functional terms, I can create a memoization algorithm that saves these answers, I mean,
32:01 saves the computation, so I don't need to recompute the function, because stemming is a pure function.
32:07 I mean, if you're into functional programming, you don't need to recompute it every single
32:12 time you get the same input. Right, if you're guaranteed that with the same inputs you get the same output,
32:17 then you can cache the heck out of this thing, right? Exactly. So I went from about like 600,000 calls to,
32:26 like, you know, 30,000, and it was immediate, you know, the whole
32:34 program ran orders of magnitude faster. That's awesome. And you know what I really love about
32:43 this? Two things I love about it. One is, you know, you talked a little bit about the solution on the
32:49 blog, and it's like, I don't know, nine lines of code. Yes, that's it. It's Python. Yeah, it's so awesome. And
32:57 the other thing is, you didn't even have to change your code, necessarily. You're able to create
33:03 things that you can just apply to your code; you didn't have to rewrite the algorithms or things
33:07 like this, right? Yeah, I really find that, you know, the algorithm worked, I mean, it got things
33:14 done, it did it correctly. I mean, I wasn't opposed to changing the algorithm,
33:20 obviously, if that was the hot part of the code, but I found out that the hot part of the code wasn't
33:25 even code that we wrote, you know, it was just code that we were calling from another library, and
33:31 it's probably really optimized. But if you're calling it 600,000 times, well, nothing is optimized when you're
33:36 calling it hundreds of thousands of times. You know, at that point you've got to not call it that many times
33:43 within that time span. You basically created a decorator that will just cache the output and
33:50 only call the function if it's never seen the inputs before, right? Exactly. So what it does is,
33:56 you know, internally, all a decorator does is wrap the function in another function. So it
34:01 adds an internal cache, which is just a Python dictionary, which keeps the function arguments as the key
34:10 and the output of the function as the value. And if it's been computed, then it's in the dictionary, so
34:16 all it needs to do is a simple dictionary lookup, you know, just like one or two Python bytecode
34:21 instructions, as opposed to calling an entire function itself, which would
34:27 be hundreds of Python bytecode instructions. Right, yeah, that's fantastic.
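[Editor's note: the blog post has the real code; a minimal sketch of that kind of memoizing decorator, with hypothetical names, might look like this. On Python 3.2+, functools.lru_cache does the same job out of the box.]

```python
import functools

def memoize(func):
    cache = {}  # maps the argument tuple to the previously computed result

    @functools.wraps(func)
    def wrapper(*args):
        # One dictionary lookup instead of re-running the function
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper

@memoize
def stem(word):
    # stand-in for the expensive NLTK Porter-stemmer call
    return word.rstrip("s")
```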
34:34 And when it doesn't find the answer, I mean, when it doesn't find the arguments in the cache, it's just
34:40 one computation, and for the number of words in the English language that are primarily used, it'll
34:46 be called much less. Right, so on your typical data set, maybe, I don't know, 30,000, 20,000 times, something like that.
34:55 Yeah, yeah. This episode is brought to you by CodeShip. CodeShip has launched Organizations. Create teams, set
35:17 permissions for specific team members, and improve collaboration in your continuous delivery workflow.
35:22 Maintain centralized control over your organization's projects and teams with CodeShip's new Organizations
35:27 plan. And as Talk Python listeners, you can save 20% off any premium plan for the next three months.
35:33 Just use the code TALKPYTHON, all caps, no spaces. Check them out at codeship.com and tell them thanks
35:40 for supporting the show on Twitter, where they're @codeship. The other thing I like is that the solution was so simple, but you probably needed
35:56 the profiling to come up with it, right? You know, so I have an example that I've given a few times about
36:03 these types of things, and just, you know, choosing the right data structure can be really important. I worked
36:09 on this project that did real-time processing of data coming in 250 times a second, so that leaves four
36:15 milliseconds to process each segment, right, which is really not much. But it had to be real time, and if
36:21 it couldn't keep up, well, then you can't write a real-time system, or you need some insane hardware to run it or
36:26 something, right? And it was doing crazy math, like wavelet decomposition and all sorts of stuff. Okay, this is,
36:33 like I was saying earlier, the verge of understanding what we're doing, right? Yeah, and it
36:39 was too slow the first time we ran it. Like, oh no, please don't make me try to optimize a wavelet
36:43 decomposition, you know, it's kind of like Fourier analysis, but worse. Yeah, and I'm like, there's got to
36:49 be a better way, right? So break out the profiler, and it turned out that we had to do lookups
36:54 back in the past on our data structures. Yeah, and we happened to use a list to go back and look up,
37:01 and we were spending 80% of our time just looking for stuff in the list. We just switched it to an O(1),
37:07 yeah, exactly, we just switched it to a dictionary, and it went five times faster. And I mean, it was like
37:11 almost a one-line code change. It was ridiculously simple. But if you don't use the
37:16 profiler to find the real problem, you're going to go muck with your algorithm and not even really
37:20 make a difference, because it had nothing to do with the algorithm, right? It was somewhere else.
37:23 Yeah, I definitely find there's also a lot more push, like in the Java world,
37:30 to make things final by default, to make them immutable unless they need to be mutable.
37:35 And a lot of languages are also embracing immutable by default and trying to stay as strict as possible,
37:40 so that way you can, you know, be more lenient when you need to. And I find the same thing in
37:46 languages like Python, where I try to use a set, like, unless I absolutely need a list. If I'm just
37:51 containing elements, a set is much better for finding things. Right, if you just want to know, have I seen
37:56 this thing before, a set, exactly, is maybe the right data structure. Or if you're going to store
38:00 integers in a list, you'd be much better off using an array of integers or an array of floats, because
38:05 it's much, much more efficient.
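[Editor's note: a quick hedged illustration of both points, membership testing and typed arrays, with hypothetical data:]

```python
import array
import timeit

items = list(range(100000))
as_set = set(items)

# Membership test: O(n) scan of the list vs O(1) hash lookup in the set
print(timeit.timeit(lambda: 99999 in items, number=1000))
print(timeit.timeit(lambda: 99999 in as_set, number=1000))

# Typed, compact storage for plain numbers
ints = array.array("i", items)  # machine-level ints, far smaller than a list
```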
38:12 You said one of the things you considered was PyPy. Just for those who don't know, maybe, what's the super quick elevator pitch on PyPy, and what did you find out about using it?
38:17 Yeah, so PyPy is a very compliant Python interpreter that at runtime turns the Python code into machine
38:26 code. It finds the hot parts of your code, or what's being run a lot, and it finds a better way to run
38:31 that for you, so that way it'll run a lot faster than CPython, because CPython doesn't do many
38:40 optimizations, by default or just in general. Right, it runs it through the interpreter, which is a fairly
38:46 complex thing, rather than turning it straight into machine instructions. Yeah, there's a lot more
38:50 overhead. You tried out PyPy. Did it make it faster, or did it matter at all? Oh yeah, PyPy, I
38:58 actually used PyPy before even profiling, just to see, you know, just because I was like,
39:03 oh, let's just see how much faster PyPy is in this case. And it ran about five times faster, because it
39:10 figured out what to do. But the thing is, under our constraints, we wanted to stick
39:16 with a little AWS instance that we were just running every night, and the thing is, PyPy uses more memory
39:24 than CPython to support its garbage collector and its just-in-time compiler, which both need to run at
39:30 runtime. So it uses a little bit more memory, and we didn't really want to spend that money, you know,
39:36 because if I can get it down to an hour in CPython and it runs at 2 a.m., no one's going to be looking at
39:41 that stuff at 3 a.m. Right, absolutely. And if you can optimize it in CPython and then feed it into PyPy,
39:47 maybe it could go faster still, right? Exactly. That would have run it at, as opposed to
39:52 from 10 hours to one hour, it'd be, if I was running in PyPy with the cache optimization,
39:56 it would probably run in like 30 minutes, 20 minutes. Like I said, it was unnecessary. It would
40:01 have been nice, but we didn't need it, so we didn't really feel like spending the
40:07 time to add that to the pipeline. Right, sure. It seems like, I mean, if you need it to go faster, it seems like
40:12 you could just keep iterating on this, right? So for every execution, your decorator will wrap it,
40:18 for every run of your Python app, but, you know, you could actually keep those stemmed words. Yeah, it could
40:25 save it to a Redis cache or to a file or to a database. Right, you could just keep going on this
40:30 idea, right? Yeah, that was definitely something I thought about doing. Yeah, and
40:36 definitely, it was fast enough already, and that is where we'll go next after that. I always find
40:41 that when you profile and you have a flat profile, with nothing sticking out, with
40:47 nothing, you know, that looks like it needs to be optimized, that's when you need to change the runtime.
40:52 That's when you need to look into FFI or PyPy or Jython with a proper JIT. Interesting. What's FFI?
41:00 So FFI is the foreign function interface. It's just the general term, used by all languages,
41:05 in which case you can drop down into C, that's right, or any C-compatible language. Yeah, so you
41:11 would basically write it in C and then hook into it. That's right. Or use Cython, for example,
41:16 which compiles Python down to C, with a weirdly Python-plus-C syntax, if you have to. Right. Yeah, I've never
41:25 tried it, but I've seen it, and, like you're saying, you annotate Python variables with
41:31 C types, like you say double i equals zero in Python syntax. That's right. It's really strange. Yeah, that is
41:38 strange.
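[Editor's note: a hedged sketch of what that Cython annotation style looks like in a hypothetical .pyx file; this is Cython syntax, not plain Python.]

```cython
# Summing a typed memoryview of doubles; cdef declares C-typed variables
def total(double[:] values):
    cdef double result = 0
    cdef int i
    for i in range(values.shape[0]):
        result += values[i]
    return result
```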
41:44 So, you know, I talk about PyCharm on the show a lot, I'm a big fan of PyCharm, and they just added built-in profiling in their latest release, 4.5. Sounds nice. Yeah, I don't know
41:51 if it's out or not. It has a visualizer too, kind of like PyCallGraph, but they said they're using
41:55 something called the Yappi, Y-A-P-P-I, profiler, and it'll fall back to cProfile if it has to. Do you know
42:02 Yappi? I've not seen this. Yeah, I was looking at all these profilers. Also, I mean, PyPy comes
42:09 with a profiler that works on Linux called vmprof, and those are all different profilers, and I
42:15 looked at them, and they're nice, sure, but I really loved how simple it was, I mean, and I got
42:22 the results that I needed from cProfile, and it comes with Python. You didn't need to
42:27 install anything, you just ran the module, and that's why I was so happy with using it and didn't need to
42:33 try a different profiler. Right, that's cool. Just python, space, dash m, space, you know, cProfile, and
42:38 then your app. Boom. Exactly. And if you're using something like PyCharm, I'm sure, I mean, if it comes
42:44 with Yappi or any other profiler, definitely use that, you know, I'm sure it's great,
42:50 because the PyCharm people are awesome. But for the purposes, I mean, of the simplicity that I wanted,
42:57 keeping the blog post, I only used the word pip once, I mean, I only used it once, to install PyCallGraph,
43:03 and that's just how simple it needs to be. Right, you maybe could have gotten away without it if you
43:07 really wanted to look at just the text output. The other thing is, you know, this is something you
43:13 can do really well on your server, right? You can go over and profile something where maybe it has real
43:18 data it can get to, like maybe it works fine on my machine, right, but not in production or not in QA or
43:23 something. And so you don't have to go install PyCharm on the server, which, exactly, you probably
43:28 don't want to do when you're, you know, SSHed into AWS. Yeah, and that's why Python is amazing for being
43:36 batteries included, is that it includes all of these nice things that you could need, that any
43:41 developer was going to need to use eventually. Do you want to talk a little bit about
43:45 open source and what you guys are doing with open source at HumanGeo? Yeah, so at HumanGeo, we
43:52 have a GitHub profile, github.com/humangeo, and most of our open source stuff is
43:59 the JavaScript Leaflet stuff that we've incorporated. We add themes and stuff. So if you want any
44:04 map visualizations, I like it much better than the Google Maps API, using Leaflet, so
44:11 I'd recommend looking at the stuff we've done there. And we also have one or a couple of Python
44:17 libraries, you know, we have an Elasticsearch binding that we've used, which has since
44:23 been superseded by the Mozilla elastic binding. So we definitely love open source at HumanGeo,
44:29 and we make and use open source libraries, and that's one of my favorite parts about HumanGeo.
44:36 That's really awesome. And you guys are hiring, right? Oh yeah, we're always hiring. We're looking for,
44:42 you know, great developers at any age, Python, Java. Like I said, we won one of the best places to work
44:50 for millennials in the DC metro area. Right, and for the international listeners, that's Washington, DC,
44:56 in the United States. Yes, Washington, DC. Yeah, no worries. Yeah, whether you have
45:00 a government clearance or not, we'd love, you know, just send an email or call or get in touch if
45:07 you want to work at an amazing place that does awesome Python. I love the code bases. Yeah, that's
45:11 really cool. This should definitely reach a bunch of enthusiastic Python people, so if you're
45:17 looking for a job, give those guys a ring. That's awesome. Definitely. So you guys are big fans of Python,
45:21 right? Oh yeah, they've been using Python since the company started, and I was
45:29 looking at the original commits, like when the company started, and it was for these projects, and they're
45:33 all Python. It's really exciting. We've been using it since the beginning. It's amazing to rapidly
45:38 iterate, it's fast enough, obviously, and you can look at it and it's super easy to profile
45:44 when you need to. That's another reason why it's amazing. It's not just that you can say it's slow,
45:48 but then it's easy to optimize in that case. Yeah, that's really cool. So are there some
45:53 practices or techniques or whatever that you guys have sort of landed on or focused in on, if you've
46:02 been doing it for so long, that you could recommend? At HumanGeo, we, you know, we make sure to,
46:08 I mean, we don't go like full agile or anything, I mean, we definitely consider ourselves fast-moving,
46:14 and we work at a very great pace, so I guess you could call it agile. And then we have
46:22 great Git practices, we use git-flow, and we make sure to have good code review. In any code base, including
46:30 in Python, you know, you've got to have good code review. And when you write Python code, I do
46:35 write a lot of unit tests for my code, things like that. Nice. So is that the unittest module, or
46:42 pytest, or... Oh yeah, the standard library, I just love standard library stuff. Yeah, it's the batteries
46:49 included, right? Yes, definitely. So anything else you want to touch on before we kind of wrap things up,
46:55 Davis? No, I'm just really excited to be given this opportunity, and I wanted
47:01 to give a shout-out to all my amazing colleagues at HumanGeo. Yeah, that's awesome. You guys are doing
47:07 cool stuff, so I'm glad to shine a light on that. So, final two questions before I let you out of here.
47:13 Favorite editor? What do you use to write Python these days? I'll say that for all my open
47:18 source, like, I work on, you know, a website for the DC Python community, and I use Sublime Text for all
47:24 that open source stuff, all my tools that I use. And when I work for a company, I ask them
47:30 for a PyCharm license. Yeah, nice. Because PyCharm is great for big projects that you can really
47:37 focus on. Yeah, you know, like I said, that's my favorite editor as well. And there's
47:43 definitely a group of people that love the lightweight editors, like Vim and Emacs and Sublime,
47:48 and then people that like IDEs. It's a bit of a divide, but, you know, I feel like when you're on a huge
47:53 project, you can just understand it more, definitely more in its entirety, using something
47:59 like PyCharm. So yeah, I like it. My favorite thing about PyCharm is that you can control-click
48:04 or command-click, like on a module, and it'll take you to the source for that module,
48:09 so you can really quickly look at where the code is flowing in the source. Yeah, absolutely.
48:15 Or, hey, you're importing a module but it's not specified in the requirements, do you want me to add
48:19 it for you for this package you're writing? Right, stuff like that. It's just sweet. They have great
48:23 support for the tooling and the tool chain of Python. Yeah. Awesome, Davis, this has been a really
48:28 really interesting conversation, and hopefully some people can go make their Python code faster.
48:33 Yeah, I definitely hope that they will, and I hope they learned a lot from this. Yeah, thanks for
48:38 being on the show, man. No problem, thank you so much. Yeah, talk to you later. This has been another
48:43 episode of Talk Python to Me. Today's guest was Davis Silverman, and this episode has been brought to you
48:48 by Hired and CodeShip. Thank you guys for supporting the show. Hired wants to help you find your next big
48:54 thing. Visit hired.com/talkpythontome to get five or more offers with salary and equity
48:58 presented right up front, and a special listener signing bonus of $4,000.
49:03 CodeShip wants you to always keep shipping. Check them out at codeship.com and thank them on Twitter
49:09 via @codeship. Don't forget the discount code for listeners. It's easy: TALKPYTHON, all caps, no spaces.
49:15 You can find the links from today's show at talkpython.fm/episodes/show/28.
49:22 Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top.
49:27 You can also find iTunes and direct RSS feeds in the footer of the website. Our theme music is Developers,
49:34 Developers, Developers, Developers by Cory Smith, who goes by Smixx. You can hear the entire song at talkpython.fm.
49:40 This is your host, Michael Kennedy. Thank you very much for listening. Smixx, take us out of here.
49:47 stating with my voice there's no norm that i can feel within haven't been sleeping i've been using
49:53 lots of rest i'll pass the mic back to who rocked it best developers developers developers developers developers developers developers developers developers developers
50:03 developers developers developers developers developers developers developers developers