#28: Making Python Fast: Profiling Python Code Transcript
00:00 Is that Python code of yours running a little slow? Are you thinking about rewriting the
00:04 algorithm or maybe even in another language? Well, before you do, you'll want to listen to
00:08 what Davis Silverman has to say about speeding up Python code using profiling.
00:12 This is show number 28, recorded Wednesday, September 16th, 2015.
00:17 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the library,
00:47 the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter
00:51 where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm
00:56 and follow the show on Twitter via @talkpython. This episode is brought to you by Hired and
01:02 CodeShip. Thank them for supporting the show on Twitter via @Hired_HQ and @codeship.
01:08 There's nothing special to report this week, so let's get right to the show with Davis.
01:13 Let me introduce Davis. Davis Silverman is currently a student at the University of Maryland working
01:18 part-time at the Human Geo Group. He writes mostly Python with an emphasis on performant
01:23 Pythonic code. Davis, welcome to the show.
01:26 Hello.
01:27 Thanks for being here. I'm really excited to talk to you about how you made some super slow Python
01:33 code much, much faster using profiling. You work at a place called the Human Geo Group. You guys do a
01:39 lot of Python there, and we're going to spend a lot of time talking about how you took some of your
01:44 social media sort of data, real-time analytics type stuff, and built that in Python and improved
01:51 it using profiling. But let's start at the beginning. What's your story? How did you get into programming
01:56 in Python?
01:56 I originally, when I was a kid, obviously, I grew up and I had internet. I was lucky, and I was very into
02:07 computers, and my parents were very happy with me building and fixing computers for them, obviously.
02:15 So by the time high school came around, I took a programming course, and it was Python, and I fell in
02:22 love with it immediately. And I've since always been programming in Python. I've been programming in Python
02:27 since sophomore year of high school. So it's been quite a few years now.
02:31 I think all of us programmers unwittingly become tech support for our families and whatnot, right?
02:38 Oh, yeah. My entire family, I'm that guy.
02:41 Yeah. I try to not be that guy, but I end up being that guy a lot.
02:45 So you took Python in high school. That's pretty cool that they offered Python there. Do they have
02:52 other programming classes as well?
02:54 Yeah. So my high school, I live in the DC metro area. I live in Montgomery County. It's a very nice
03:01 county, and the schools are very good. And luckily, the intro programming course was taught by a very
03:07 intelligent teacher. So she taught us all Python. And then the courses after that were Java courses,
03:13 the college-level advanced placement Java, and then a data structures class after that.
03:18 So we got to learn a lot about the fundamentals of computer science with those classes.
03:24 Yeah, that's really cool. I think I got to take BASIC when I was in high school, and that was about it.
03:27 It was a while ago.
03:28 I wrote a BASIC interpreter, but it wasn't very good.
03:32 Cool. So before we get into the programming stuff, maybe you could just tell me,
03:37 what is the Human Geo Group? What do you guys do?
03:39 Yeah. So the Human Geo Group, we're a small government contractor, and we deal mostly in government contracts,
03:48 but we have a few commercial projects and ventures that I was working on over the summer that we'll be
03:53 talking about. We're a great small little company in Arlington, Virginia, and we actually just won an award for one of the best places to work in the DC metro area for millennials and for younger people.
04:05 If you go to thehumangeo.com, you guys have a really awesome webpage. I really like how that is. It's like, bam, here we are. We're about data, and it just has kind of this live page. So many companies want to get their CEO statement and all the marketing stuff and the news. You guys are just like, look, it's about data. That's cool.
04:27 Yeah. There was a recent, I don't remember how recent, website rewrite, and this one guy decided to take the time, and he's like, I really want to do something, you know, show some geographic stuff.
04:40 So he used Leaflet.js, which we do a lot of open source with on our GitHub page, and he made it very beautiful. There are even icons of all the people at HumanGeo. And I think it's much better than a generic, you know, contractor site, like you said. It's much better and much more energetic.
04:59 It looks to me like you do a lot with social media, like sentiment analysis, and tying that to location, the geo part, right? And all that kind of stuff. What's the story there? What kind of stuff do you guys do?
05:12 Yeah. So what I was working on: one of our customers is a billion-dollar entertainment company. And, I mean, you've probably heard of them. I think we talk about them on our site.
05:27 And what we do is we analyze various social media sites like, you know, Reddit and Twitter and YouTube, and we gather geographic data if available. And we gather sentiment data using specific algorithms from things like the natural language toolkit, which is an amazing Python package.
05:50 Then we show it to the user in a very nice website that we have created.
05:50 So you work for this entertainment company as well as, like, a government contractor.
05:56 What is the government interested in with all this data? The U.S. government, that is, right? For the international listeners.
06:01 Yeah, it's definitely the United States government. We do less social media analysis for the government. We do some, but it's nothing.
06:11 It's not what people think the NSA does, definitely. I think, you know, just like anything a company would want: you'd search on something and it would have, like, oh, there are Twitter users talking about this
06:26 in the, you know, in these areas. Yeah. I guess the government would want to know that, especially in, like, emergencies or things like that, possibly. Right.
06:36 Yeah. We also do some platform stuff. Like we create certain platforms for the government. That's not necessarily social media stuff.
06:46 Right. Sure. So how much of that is Python and how much of that is other languages?
06:51 So at HumanGeo, we do tons of Python on the backend. For some of the government stuff, we do Java, which is big in the government, obviously. So, I mean, we definitely do have a lot of Python. We use it a lot in the pipeline, for various tools and things that we use internally and externally at HumanGeo. I know that the project that I was working on is exclusively Python.
07:16 Python in all parts of the pipeline, for gathering data and for representing data, for, you know, the backend server and the front end server. So that was all Python.
07:26 Right. So it's pretty much Python end to end, other than, I don't know specifically the project you were working on, but it looks like it's very heavy on D3, fancy JavaScript stuff on the front end for the visualization.
07:39 But other than that, it was more or less Python, right?
07:41 We do use a lot. I mean, yeah, we do have some amazing JavaScript people. They do a lot of really fun-looking stuff.
07:48 Yeah. You can tell it's a fun place for, like, data display, front-end development. That's cool.
07:53 So Python two or Python three.
07:55 So we use Python 2, but I was making sure, I mean, when I was working on the code base, I was definitely writing Python 3 compatible code, using the proper __future__ imports.
08:06 And I was testing it on Python 3, and we're probably closer to Python 3 than a lot of companies are. We just haven't expended the time to do it.
08:14 We probably will in 2018, when Python 2 is nearing end of life.
08:19 That seems like a really smart way to go. Did you happen to profile it under CPython 2 and CPython 3?
08:27 I didn't. It doesn't fully run in CPython three right now. I wish I could.
08:33 It would just be really interesting since you spent so much time looking at the performance if you could compare those, but yeah, if it doesn't run.
08:38 That would be interesting. You're right. I wish I could know that.
08:41 I suspect that most people know what profiling is, but there's a whole diverse set of listeners. So maybe we could just say really quickly, what is profiling?
08:50 Yeah. So profiling, in any language, is knowing heuristics about what's running in your program. For example, how many times is a function called, or how long does it take for this section of code to run? And it's simply, like, a statistical thing. You get a profile of your code, and you see all the parts of your code that are fast or slow, for example.
09:12 You wrote a really interesting blog post and we'll talk about that in just a second. And I think like all good profiling articles or topics, you know, you, you sort of point out what I consider to be the first rule of profiling or more like the first rule of optimization, which profiling is kind of the tool to get you there, which is to not prematurely optimize your stuff. Right.
09:35 Yeah, definitely.
09:36 Yeah. You know, I've spent a lot of time thinking about how programs run and why they're slow or why they're fast, or worrying about this little part or that little part. And, you know, most of the time it just doesn't matter. Or if it is slow, it's slow for reasons that were unexpected. Right.
09:52 Yeah, definitely. Yeah. I, I always make sure that there is a legitimate problem to be solved before spending time doing something like profiling or optimizing a code base.
10:03 Definitely. So let's talk about your blog post because that kind of went around on Twitter in a pretty big way and on the Python newsletters and so on. And I read that. Oh, this is really cool. I should have Davis on and we should talk about this.
10:14 So on the HumanGeo blog, you wrote a post called Profiling in Python, right? What motivated you to write that?
10:25 Yeah. So when I was working on this code, basically my coworkers, my boss, they said, you know, this piece of code, we have this pipeline for gathering data, specifically the data that the customers give us.
10:40 And we ran it at about, like, 2am every night. The problem was it took 10 hours to run this piece of code. It was doing very heavy text processing, which I'll talk about more later, I guess.
10:53 It was doing a lot of text processing, which ended up taking 10 hours, and they looked at it and they said, you know, it's updating at noon every day, and the workday starts at, like, nine.
11:03 So we should probably try to fix this and get it working faster. Davis, you should totally look at this as a great first project.
11:10 Here's your new project, right?
11:12 Yeah. It was like, here's day one. Okay. Make this thing 10 times faster. Go.
11:17 Yeah. Oh, is that all you want me to do?
11:19 Yeah. So I wrote this, and I did what any guy would do. The first thing I did was, you know, profile it, which is the first thing you should do, to make sure that there's actually a hotspot.
11:32 And I ran through the process that I talked about in the blog post. I talk about the tools I used. And I realized that it wasn't a simple thing for me to do all this. I don't do this often. And I figured that a lot of people, like you said, maybe don't know about profiling or haven't done this.
11:51 So I said, you know, if I write a blog post about this, then hopefully somebody else won't have to Google, like, 20 different posts and read all of them to come up with one coherent idea of how to do this, in one fell swoop.
12:03 Right. Maybe you can just lay it out. Like, these are the five steps you should probably start with. And, you know, profiling is absolutely an art more than it is, you know, engineering, but it, at least having the steps to follow is super helpful.
12:17 You started out by talking about the CPython distribution. And I think it'd be interesting to talk a little bit about potentially alternative implementations as well. Cause you, you did talk about that a bit.
12:30 Yep. You said there are basically two profilers that come with CPython: profile and cProfile.
12:38 Yeah. So the two profilers that come with Python, as you said, profile and cProfile, both have the same exact interface and include the same heuristics.
12:50 But the idea behind profile is that it's written in Python and is portable across Python implementations, hopefully, while cProfile is written in C.
13:00 And it's pretty specific to a C interpreter, such as CPython or even PyPy, because PyPy is very interoperable.
13:07 Right. It does have that interoperability layer. So maybe if we were doing Jython or IronPython or Pyston or something, we couldn't use cProfile.
13:19 Yeah. I wouldn't say that you can.
13:22 I'm just guessing. I don't really know, I haven't tested it.
13:25 I would say that, for Jython and IronPython, you could use the standard Java or .NET profilers instead.
13:34 I'm pretty sure those will work just fine because I mean, they've, they're known to work great with their respective ecosystems.
13:41 Do you know whether there's a significant performance difference? Let me take a step back. You know, it seems to me like when I've done profiling in the past, that it's a little bit like the Heisenberg uncertainty principle.
13:54 And that if you observe a thing by the fact you've observed it, you've altered it. Right. You know, when you run your code under the profiler, it might not behave in the same exact timings and so on as if it were running natively, but you can still get a pretty good idea usually. Right. So is there a difference across the C profile versus profile in that regard?
14:17 Oh yeah, definitely. Profile is much slower, and it has much higher latency and overhead, I mean, than cProfile, because it has to do a lot more work.
14:29 I mean, Python exposes CPython internals in some Python modules, but they're a lot slower than just using straight C and getting straight to it. So if you're using regular profile, well, if you're using CPython or PyPy, I'd recommend using cProfile instead, because it has much lower overhead and it gives you much better numbers that you can work with, that make more sense.
14:53 Okay. Awesome. So how do I get started? Like, suppose I have some code that is slow, how do I run it under cProfile?
14:59 Yeah. So the first thing that I would do, I mean, is in the blog post, which I'm pulling up right now just to see.
15:06 And I'll be sure to link to that in the show notes as well so that if people can just jump to the show notes and pull it up.
15:14 Yeah. Well, so one of my favorite things about cProfile is that you can call it using, you know, the python -m syntax, the run-a-module syntax, and it will print your profile out to standard out for you when it's done.
15:27 It's super easy. I mean, all you need to do is just give it your, you know, main Python file and it'll run it. And then at the end of the run, it'll give you the profile. It's super simple, and one of the reasons why the blog post was so easy to write.
15:42 Yeah, that's cool. So by default, it looks like it gives you basically the total time spent in methods, all of the methods, you know, looking at the call stack and stuff, a number of times it was called, stuff like that, right? The cumulative time, the individual time per call and so on.
16:00 It gives you the defaults, like you said, you're correct. And it's also really easy to give it a sorting argument, so that you can sort on, say, how many times a function is called. Like, if it's called 60,000 times, it's probably a problem in a, you know, 10-minute run.
16:17 And, you know, it could be called only twice but take an hour to run. That would be very scary. In which case, you definitely want to sort it both ways. You want to sort it every way, just in case you're missing something important.
16:33 Right. You want to slice it in a lot of different ways. How many times was it called? What was the, you know, maximum individual time, cumulative time, all those types of things.
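[Editor's note: for anyone following along, the invocation being described looks roughly like this; the script name is hypothetical.]

```
# Run a script under cProfile and print stats sorted by cumulative time
python -m cProfile -s cumulative my_script.py

# Sort by call count instead, to spot functions called huge numbers of times
python -m cProfile -s calls my_script.py

# Or save the raw stats to a file for later analysis with the pstats module
python -m cProfile -o profile.out my_script.py
```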
16:42 All right. Maybe you're calling a database function. Right. And maybe that's just got tons of data.
16:49 Yeah. It's slow. Yeah. Yeah. Maybe, maybe the database is slow. And so that says, well, don't even worry about your Python code. Just go make your database fast or use Redis for caching or something. Right.
16:59 Or, yeah, work on your query. Maybe you can make it a distinct query and get a much smaller data set coming back to you.
17:08 Yeah, absolutely. Yeah. It could just be a smarter query you're doing. So this all is pretty good. I mean, this, this output that you get is pretty nice, but in a real program with many, many functions and so on, that can be pretty hard to understand, right?
17:26 Yes. I definitely, when I was working on this, I had the same issue. There were so many lines of code. It was filling up my terminal, you know, and what I had to do was save it to an output file, and that was too much work.
17:39 So I was searching around more, and I found PyCallGraph, which is amazing at showing you the flow of a program, and it gives you a great graphical representation of what cProfile is also showing you.
17:50 That's awesome. So it's kind of like a visualizer for the C profile output, right?
17:56 Yeah. It even colors it in: the more red it is, the hotter the call is, the more times it runs and the longer it runs.
18:04 Yeah. That's awesome. So just pip install PyCallGraph to get started, right?
18:08 Super simple. It's one of the best things about pip as a package manager.
18:13 Yeah. I mean, that's part of why Python is awesome, right? pip install.
18:16 Definitely.
18:17 Whatever. Anti-gravity. So then to invoke it, you say basically pycallgraph and then graphviz.
18:26 And then how does that work?
18:29 So PyCallGraph supports outputting to multiple different file formats. Graphviz is a program that can render DOT files, .dot files,
18:45 I don't really understand how to say it out loud. So the first argument for
18:51 pycallgraph is the output format, the program that's going to
18:55 read it, and then you give it the double dash, which means what follows is not part of the pycallgraph options;
19:01 it's now the Python program that you're calling, and its arguments.
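[Editor's note: the command being described looks something like this; the script name and argument are hypothetical, and Graphviz itself has to be installed for the rendering step.]

```
# Everything before the double dash configures pycallgraph;
# everything after it is the program being profiled.
pycallgraph graphviz -- ./my_script.py --some-arg

# By default this writes the rendered call graph to pycallgraph.png
```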
19:06 So it's almost basically the same as cProfile, but it's kind of inverted, right? And you can get really simple
19:13 call graphs that are just this function called this function, which called this function, and it took this
19:17 amount of time. Or you can get quite complex call graphs, like you can say, you know, this module
19:25 calls these functions, which then reach out to this other module, and then they're all
19:29 interacting in these ways. That's pretty amazing. Yeah, it shows exactly what module you're using.
19:36 I mean, like if you're using regular expressions, it'll show you each part of the re
19:40 module, like re.compile, or, you know, the different modules that are using the re module,
19:46 and then it'll show you how many times each is called, and they're all boxed nicely.
19:50 And it gives you, I mean, the image is so easy to look at, and you can just zoom in on the exact part you want
19:56 and then look at what calls it and where and what it calls, to see, you know, how the program flows, much simpler.
20:03 Yeah, that's really cool. And you know, it's something that just came to me as an idea
20:07 I'm looking at. This episode is brought to you by Hired. Hired is a two-sided,
20:23 curated marketplace that connects the world's knowledge workers to the best opportunities.
20:27 Each offer you receive has salary and equity presented right up front, and you can view the offers to accept or reject them
20:35 before you even talk to the company. Typically, candidates receive five or more offers in just the first week,
20:41 and there are no obligations, ever. Sounds pretty awesome, doesn't it? Well, did I mention there's a signing bonus?
20:47 Everyone who accepts a job from Hired gets a $2,000 signing bonus,
20:51 and as Talk Python listeners, it gets way sweeter. Use the link hired.com/talkpythontome
21:00 and Hired will double the signing bonus to $4,000. Opportunity's knocking,
21:05 visit hired.com/talkpythontome and answer the call. Because it colors the hot spots and all, it's really good for profiling, but
21:22 even if you weren't doing profiling, it seems like that would be pretty interesting for just
21:25 understanding new code that you're trying to get your head around. Oh yeah, that's definitely true.
21:31 I've employed it since as a method to look at the control flow of a program.
21:36 Right, how do these methods, how do these modules and all this stuff, how do they relate? Like, just run the main method and see what happens, right?
21:44 Exactly. It's become a useful tool of mine that I'll definitely be using in the future.
21:52 I always have it in my virtualenv nowadays. So we've taken cProfile, we've
21:58 applied it to our program, we've gotten some textual version of the output that tells us where we're spending our time in various ways, and then we can use PyCallGraph to visualize that and understand it more quickly.
22:09 So then what? Like, how do you fix these problems? What are some of the common things you can do?
22:14 Yeah, so as I outlined in the blog post, there's a plethora of methods, depending on what
22:22 your profile is showing. For example, if you're spending a lot of time
22:28 in Python code, then you can definitely look at things like using a different interpreter, for example an optimizing compiler like
22:37 PyPy, which will definitely make your code run a lot faster, as it'll translate it to machine code at runtime.
22:43 Or you can also look at the algorithm that you're using and see if it's, for example, an O(n³)
22:49 time-complexity algorithm. That would be terrible, and you might want to fix that.
22:54 Yeah, those are the kinds of problems that you run into when you have small test data,
22:58 and you write your code and it seems fine, and then you give it real data and it just dies, right?
23:02 Exactly. They gave me the code, I mean, they gave me like five gigabytes of data and they said, this is the data that we get on a
23:11 nightly basis, and I said, oh my god, this will take all day to run. So I used smaller test pieces,
23:19 and luckily I used ones big enough that they showed some statistically significant numbers for me.
23:25 Right, it was something you could actually execute in a reasonable amount of time as you
23:29 went through your exploration, but not so big that it felt like C++
23:33 compile times, except at runtime, because it just kind of sits there while it's processing.
23:39 Right, so you'd rather not only do it once a day. You mentioned some interesting things.
23:45 PyPy is super interesting to me. We're going to talk more about how, well, you chose not to use that, but
23:51 you know, on show 21 we had Maciej from the PyPy team. Oh yeah, I have it up right now, I'm gonna listen to it later.
23:59 Yeah, I'm so excited. Yeah, that was super interesting, and we talked a little bit about
24:03 optimization there, and like why you might choose an alternate interpreter. That was cool.
24:08 Then there's some other interesting things you can do as well, like you could use namedtuples instead of actual classes, or you could use built-in data
24:18 structures instead of creating your own, because a lot of times the built-in structures like list and array
24:23 and dictionary and so on are implemented deep down in C and they're much faster.
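[Editor's note: a small hedged sketch of that namedtuple idea, with hypothetical names:]

```python
from collections import namedtuple

# A lightweight record type instead of a hand-rolled class
Point = namedtuple("Point", ["x", "y"])

p = Point(x=1.5, y=2.0)
print(p.x, p.y)   # attribute access, like a class
print(tuple(p))   # still an ordinary tuple underneath
```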
24:27 Yeah, definitely. One of the best examples of this is that I saw some guy who wrote his own dictionary class,
24:35 and it was a lot slower. And this isn't in the HumanGeo code base, just so you know, we have good code at HumanGeo.
24:43 You guys don't show up on The Daily WTF. Oh, please, no, we're much better than that. No, this is another place where I saw some code.
24:51 And yeah, I mean, they have a lot of optimizations, like in the latest Python release
24:56 they actually made the dict class, I mean, it's actually now an ordered dict in the latest Python releases,
25:02 because they basically copied what PyPy did, they did the same thing.
25:06 Yeah, so you should always trust the internal stuff. Yeah, and that's really true. And if you're going to stick
25:14 with CPython, as you know the majority of people do, understanding how the interpreter actually works
25:20 is also super important. I've talked about it several times on the show, and had Philip Guo on the show.
25:28 He recorded a 10-hour, basically a graduate course, at the University of Rochester,
25:34 and put it online. So definitely, I can try to link to that, and check it out. That's show 22.
25:40 I'd love to see that. Yeah, I mean, you really understand what's happening inside the C runtime,
25:45 and so then you're like, oh, I see why this is expensive, and all those things can help. You know, another thing
25:50 that you talked about that was interesting was IO, like I gave you my example of databases,
25:54 or maybe you're talking about a file system or, you know, a web service call, something like this,
26:00 right? And you had some good advice on that, I thought. Yeah, definitely. Basically, CPython's
26:07 GIL, the global interpreter lock, is very precluding. You can't do multiple
26:16 computations at the same time, because CPython only allows one core to be used
26:22 at a time. Right, basically computational parallelism is not such a thing in Python. You've got
26:28 to drop down to C or fork the processes or something like that, right? Exactly, and those are all fairly
26:33 expensive for a task that we're running on an AWS server, an AWS server where we're trying to spend as
26:39 little money as possible, because it runs at like the break of dawn. So we don't want to be running multiple
26:45 Python instances. But when you're doing IO, which doesn't involve any Python, you can use
26:52 threading to do things like, you know, expensive file system and IO tasks, like getting
26:59 the data off a URL or things like that. That's great for Python's threading, but otherwise
27:05 you don't really want to be using it. Right, basically the built-in functions that wait on IO
27:12 release the global interpreter lock, and so that frees up your other code to keep on running, more or less.
27:18 Right. You definitely want to make sure that if you're doing IO, it's not the bottleneck.
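[Editor's note: a minimal sketch of the threading-for-IO idea being described, with hypothetical URLs; written for Python 3, where Python 2's urllib2 became urllib.request.]

```python
import threading
import urllib.request

urls = ["https://example.com/a", "https://example.com/b"]  # hypothetical

def fetch(url):
    # The blocking network read releases the GIL while it waits,
    # so the other threads keep making progress.
    with urllib.request.urlopen(url) as resp:
        print(url, len(resp.read()))

threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
```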
27:23 I mean, as long as everything else is not the bottleneck, right? And you know, we've had
27:28 a lot of cool advances in Python 3 around this type of parallelism, right? And they just added async and
27:37 await, the new keywords, in, is that 3.5, I think it was? Right, yeah, 3.5 just
27:43 came out like two days, yeah, two days ago. So yeah, I mean, that's super new, but these are the types
27:49 of places where async and await would be ideal for helping you increase your performance. Yeah,
27:55 the new syntax is like a special case, it's syntactic sugar for yielding, but it makes things
28:02 much simpler and easier to read, because if you're using yields to do complex async stuff that isn't just
28:08 like a generator, then it's very ugly. So they added this new syntax to make it much simpler. Yeah, it's great. I'm
28:13 looking forward to doing stuff with that.
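[Editor's note: a minimal sketch of the Python 3.5 syntax being discussed, using the event loop API of that era; the coroutine names are hypothetical.]

```python
import asyncio

async def greet_later(name, delay):
    # "await" suspends this coroutine without blocking the event loop
    await asyncio.sleep(delay)
    print("hello,", name)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(
    greet_later("first", 1),
    greet_later("second", 1),
))
loop.close()
```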
28:18 You also had some guidance on regular expressions, and the first bit I also really liked, kind of like your premature optimization bit, is: once you
28:24 decide to use a regular expression to solve a problem, you now have two problems. Yeah, I always have that
28:29 issue. Whenever I do anything and I talk to students, I'm like, oh,
28:36 look at this text, and you could do this, and they're like, oh, I'll just use regex to solve it. I'm saying, please, no.
28:40 You know, you'll end up with something that you won't be able to read in the next two days.
28:46 You know, just find a better way to do it, for goodness sakes. Yeah, I'm totally with you. A friend of mine
28:51 has this really great way of looking at complex code, both around regular expressions and
28:58 parallelism, and he says when you're writing that kind of code, you're often writing code right at the limit
29:03 of your understanding, of your ability to write complex code. And debugging is harder than
29:09 writing code, so you're writing code you literally can't debug. So maybe don't
29:15 go quite that far, right? Yeah, I think you should always try to make code as simple as possible, because
29:22 debugging and profiling and looking through your code will be much less fun if you try to be as
29:28 clever as possible. Yeah, absolutely. Clever code, it's not so clever when it's your bug to solve later.
29:32 Yeah, and I also tried to give special mention to Python's regex engine. As much as I dislike regex,
29:40 I think Python's re.VERBOSE flag is amazing, and if you haven't looked into it, Python has great
29:48 support for annotating the regex. So if you have to use it, you can be very verbose about it,
29:55 and that way it's much better documented in the code. Yeah, that's great advice.
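[Editor's note: a small hedged example of the re.VERBOSE flag being praised here, with a hypothetical pattern:]

```python
import re

# Whitespace and comments inside the pattern are ignored under re.VERBOSE,
# so each piece can be documented inline.
phone = re.compile(r"""
    (\d{3})     # area code
    [-.\s]?     # optional separator
    (\d{3})     # prefix
    [-.\s]?     # optional separator
    (\d{4})     # line number
""", re.VERBOSE)

print(phone.match("301-555-0123").groups())  # ('301', '555', '0123')
```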
30:02 So let's maybe talk about how you solved your problem. What was the problem, what did you find the issue
30:08 to be, and then how did you solve it? Yeah, so what we were doing in this code is we were taking gigabytes
30:15 upon gigabytes of plain text data, like words, from users of, you know, various forums, and we
30:24 processed all this data for sentiment analysis. And to do that, you need to stem each word to its base word,
30:32 so that way you can make sure you're analyzing the correct word, because there's like five different
30:37 forms of every single word. Right, like running, ran, all those have kind of basically the same meaning. Exactly.
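[Editor's note: stemming with NLTK's Porter stemmer, which comes up a little later in the conversation, looks something like this:]

```python
from nltk.stem.porter import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["running", "runs", "ran"]:
    print(word, "->", stemmer.stem(word))

# running -> run, runs -> run; "ran" stays "ran", since a stemmer
# only trims suffixes rather than doing full lemmatization
```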
30:45 Yeah, okay. So we get the base word of all those words, and the thing is, gigabytes, like, let's say
30:52 five gigabytes of words, if a word is like four bytes, you know, that's like billions, that's so many
30:59 words, I don't even want to think about it. And for every single word, we stemmed it and we analyzed it, and it was a
31:06 slow, arduous process. And as I was profiling, I realized we run the same stemming function,
31:13 which is in NLTK, it's a Porter stemmer, which is amazing and it's what everyone uses for stemming.
31:18 We ran it, in my, you know, 50-megabyte test case, which is still so many words, thousands upon thousands
31:27 of words, it ran about like 600,000 times, and I was like, my goodness, this is running way too much.
31:34 And there's only like 400,000 words in the English language. There's no way, you know, each of these
31:39 words needs to be stemmed, because, you know, gamers on the internet aren't very, you know, linguistically
31:45 amazing. Yeah, exactly. So I figured, you know, I can create a cache, or as it's called, you know,
31:55 in more functional terms, I can create a memoization algorithm that saves these answers, I mean,
32:01 saves the computation, so I don't need to recompute the function, because stemming is a pure function.
32:07 I mean, if you're into functional programming, you don't need to recompute it every single
32:12 time you get the same input. Right, if you're guaranteed that with the same inputs you get the same output,
32:17 then you can cache the heck out of this thing, right? Exactly. So I went from about like 600,000 calls to,
32:26 like, you know, 30,000, and it was immediate, you know, the whole
32:34 program ran orders of magnitude faster. That's awesome. And you know what I really love about
32:43 this? Two things I love about it. One is, you know, you talked a little bit about the solution on the
32:49 blog, and it's like, I don't know, nine lines of code. Yes, that's it. It's Python. Yeah, it's so awesome. And
32:57 the other thing is, you didn't even have to change your code, necessarily. You're able to create
33:03 things that you can just apply to your code; you didn't have to rewrite the algorithms or things
33:07 like this, right? Yeah, I really find that, you know, the algorithm worked, I mean, it got things
33:14 done, it did it correctly. I mean, I wasn't opposed to changing the algorithm,
33:20 obviously, if that was the hot part of the code, but I found out that the hot part of the code wasn't
33:25 even code that we wrote, you know, it was just code that we were calling from another library, and
33:31 it's probably really optimized. But if you're calling it 600,000 times, well, nothing is optimized when you're
33:36 calling it hundreds of thousands of times. You know, at that point you've got to not call it that many times
33:43 within that time span. You basically created a decorator that will just cache the output and
33:50 only call the function if it's never seen the inputs before, right? Exactly. So what it does is,
33:56 you know, internally, all a decorator does is wrap the function in another function. So it
34:01 adds an internal cache, which is just a Python dictionary, which keeps the function arguments as the key
34:10 and the output of the function as the value. And if it's been computed, then it's in the dictionary, so
34:16 all it needs to do is a simple dictionary lookup, you know, just like one or two Python bytecode
34:21 instructions, as opposed to calling an entire function itself, which would
34:27 be hundreds of Python bytecode instructions. Right, yeah, that's fantastic.
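[Editor's note: the blog post has the real code; a minimal sketch of that kind of memoizing decorator, with hypothetical names, might look like this. On Python 3.2+, functools.lru_cache does the same job out of the box.]

```python
import functools

def memoize(func):
    cache = {}  # maps the argument tuple to the previously computed result

    @functools.wraps(func)
    def wrapper(*args):
        # One dictionary lookup instead of re-running the function
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper

@memoize
def stem(word):
    # stand-in for the expensive NLTK Porter-stemmer call
    return word.rstrip("s")
```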
34:34 And when it doesn't find the answer, I mean, when it doesn't find the arguments in the cache, it's just
34:40 one computation, and for the number of words in the English language that are primarily used, it'll
34:46 be called much less. Right, so on your typical data set, maybe, I don't know, 30,000, 20,000 times, something like that.
34:55 Yeah, yeah. This episode is brought to you by CodeShip. CodeShip has launched Organizations. Create teams, set
35:17 permissions for specific team members, and improve collaboration in your continuous delivery workflow.
35:22 Maintain centralized control over your organization's projects and teams with CodeShip's new Organizations
35:27 plan. And as Talk Python listeners, you can save 20% off any premium plan for the next three months.
35:33 Just use the code TALKPYTHON, all caps, no spaces. Check them out at codeship.com and tell them thanks
35:40 for supporting the show on Twitter, where they're @codeship. The other thing I like is that the solution was so simple, but you probably needed
35:56 the profiling to come up with it, right? You know, so I have an example that I've given a few times about
36:03 these types of things, and just, you know, choosing the right data structure can be really important. I worked
36:09 on this project that did real-time processing of data coming in 250 times a second, so that leaves four
36:15 milliseconds to process each segment, right, which is really not much. But it had to be real time, and if
36:21 it couldn't keep up, well, then you can't write a real-time system, or you need some insane hardware to run it or
36:26 something, right? And it was doing crazy math, like wavelet decomposition and all sorts of stuff. Okay, this is,
36:33 like I was saying earlier, the verge of understanding what we're doing, right? Yeah, and it
36:39 was too slow the first time we ran it. Like, oh no, please don't make me try to optimize a wavelet
36:43 decomposition, you know, it's kind of like Fourier analysis, but worse. Yeah, and I'm like, there's got to
36:49 be a better way, right? So break out the profiler, and it turned out that we had to do lookups
36:54 back in the past on our data structures. Yeah, and we happened to use a list to go back and look up,
37:01 and we were spending 80% of our time just looking for stuff in the list. We just switched it to an O(1),
37:07 yeah, exactly, we just switched it to a dictionary, and it went five times faster. And I mean, it was like
37:11 almost a one-line code change. It was ridiculously simple. But if you don't use the
37:16 profiler to find the real problem, you're going to go muck with your algorithm and not even really
37:20 make a difference, because it had nothing to do with the algorithm, right? It was somewhere else.
37:23 Yeah, I definitely find there's also a lot more push, like in the Java world,
37:30 to make things final by default, to make them immutable unless they need to be mutable.
37:35 And a lot of languages are also embracing immutable by default and trying to stay as strict as possible,
37:40 so that way you can, you know, be more lenient when you need to. And I find the same thing in
37:46 languages like Python, where I try to use a set, like, unless I absolutely need a list. If I'm just
37:51 containing elements, a set is much better for finding things. Right, if you just want to know, have I seen
37:56 this thing before, a set, exactly, is maybe the right data structure. Or if you're going to store
38:00 integers in a list, you'd be much better off using an array of integers or an array of floats, because
38:05 it's much, much more efficient.
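[Editor's note: a quick hedged illustration of both points, membership testing and typed arrays, with hypothetical data:]

```python
import array
import timeit

items = list(range(100000))
as_set = set(items)

# Membership test: O(n) scan of the list vs O(1) hash lookup in the set
print(timeit.timeit(lambda: 99999 in items, number=1000))
print(timeit.timeit(lambda: 99999 in as_set, number=1000))

# Typed, compact storage for plain numbers
ints = array.array("i", items)  # machine-level ints, far smaller than a list
```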
38:12 You said one of the things you considered was PyPy. Just for those who don't know, maybe, what's the super quick elevator pitch on PyPy, and what did you find out about using it?
38:17 Yeah, so PyPy is a very compliant Python interpreter that at runtime turns the Python code into machine
38:26 code. It finds the hot parts of your code, or what's being run a lot, and it finds a better way to run
38:31 that for you, so that way it'll run a lot faster than CPython, because CPython doesn't do many
38:40 optimizations, by default or just in general. Right, it runs it through the interpreter, which is a fairly
38:46 complex thing, rather than turning it straight into machine instructions. Yeah, there's a lot more
38:50 overhead. You tried out PyPy. Did it make it faster, or did it matter at all? Oh yeah, PyPy, I
38:58 actually used PyPy before even profiling, just to see, you know, just because I was like,
39:03 oh, let's just see how much faster PyPy is in this case. And it ran about five times faster, because it
39:10 figured out what to do. But the thing is, under our constraints, we wanted to stick
39:16 with a little AWS instance that we were just running every night, and the thing is, PyPy uses more memory
39:24 than CPython to support its garbage collector and its just-in-time compiler, which both need to run at
39:30 runtime. So it uses a little bit more memory, and we didn't really want to spend that money, you know,
39:36 because if I can get it down to an hour in CPython and it runs at 2 a.m., no one's going to be looking at
39:41 that stuff at 3 a.m. Right, absolutely. And if you can optimize it in CPython and then feed it into PyPy,
39:47 maybe it could go faster still, right? Exactly. That would have run it at, as opposed to
39:52 from 10 hours to one hour, it'd be, if I was running in PyPy with the cache optimization,
39:56 it would probably run in like 30 minutes, 20 minutes. Like I said, it was unnecessary. It would
40:01 have been nice, but we didn't need it, so we didn't really feel like spending the
40:07 time to add that to the pipeline. Right, sure. It seems like, I mean, if you need it to go faster, it seems like
40:12 you could just keep iterating on this, right? So for every execution, your decorator will wrap it,
40:18 for every run of your Python app, but, you know, you could actually keep those stemmed words. Yeah, it could
40:25 save it to a Redis cache or to a file or to a database. Right, you could just keep going on this
40:30 idea, right? Yeah, that was definitely something I thought about doing. Yeah, and
40:36 definitely, it was fast enough already, and that is where we'll go next after that. I always find
40:41 that when you profile and you have a flat profile, with nothing sticking out, with
40:47 nothing, you know, that looks like it needs to be optimized, that's when you need to change the runtime.
40:52 That's when you need to look into FFI or PyPy or Jython with a proper JIT. Interesting. What's FFI?
41:00 So FFI is the foreign function interface. It's just the general term, used by all languages,
41:05 in which case you can drop down into C, that's right, or any C-compatible language. Yeah, so you
41:11 would basically write it in C and then hook into it. That's right. Or use Cython, for example,
41:16 which compiles Python down to C, with a weirdly Python-plus-C syntax, if you have to. Right. Yeah, I've never
41:25 tried it, but I've seen it, and, like you're saying, you annotate Python variables with
41:31 C types, like you say double i equals zero in Python syntax. That's right. It's really strange. Yeah, that is
41:38 strange.
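[Editor's note: a hedged sketch of what that Cython annotation style looks like in a hypothetical .pyx file; this is Cython syntax, not plain Python.]

```cython
# Summing a typed memoryview of doubles; cdef declares C-typed variables
def total(double[:] values):
    cdef double result = 0
    cdef int i
    for i in range(values.shape[0]):
        result += values[i]
    return result
```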
41:44 So, you know, I talk about PyCharm on the show a lot, I'm a big fan of PyCharm, and they just added built-in profiling in their latest release, 4.5. Sounds nice. Yeah, I don't know
41:51 if it's out or not. It has a visualizer too, kind of like PyCallGraph, but they said they're using
41:55 something called the Yappi, Y-A-P-P-I, profiler, and it'll fall back to cProfile if it has to. Do you know
42:02 Yappi? I've not seen this. Yeah, I was looking at all these profilers. Also, I mean, PyPy comes
42:09 with a profiler that works on Linux called vmprof, and those are all different profilers, and I
42:15 looked at them, and they're nice, sure, but I really loved how simple it was, I mean, and I got
42:22 the results that I needed from cProfile, and it comes with Python. You didn't need to
42:27 install anything, you just ran the module, and that's why I was so happy with using it and didn't need to
42:33 try a different profiler. Right, that's cool. Just python, space, dash m, space, you know, cProfile, and
42:38 then your app. Boom. Exactly. And if you're using something like PyCharm, I'm sure, I mean, if it comes
42:44 with Yappi or any other profiler, definitely use that, you know, I'm sure it's great,
42:50 because the PyCharm people are awesome. But for the purposes, I mean, of the simplicity that I wanted,
42:57 keeping the blog post, I only used the word pip once, I mean, I only used it once, to install PyCallGraph,
43:03 and that's just how simple it needs to be. Right, you maybe could have gotten away without it if you
43:07 really wanted to look at just the text output. The other thing is, you know, this is something you
43:13 can do really well on your server, right? You can go over and profile something where maybe it has real
43:18 data it can get to, like maybe it works fine on my machine, right, but not in production or not in QA or
43:23 something. And so you don't have to go install PyCharm on the server, which, exactly, you probably
43:28 don't want to do when you're, you know, SSHed into AWS. Yeah, and that's why Python is amazing for being
43:36 batteries included, is that it includes all of these nice things that you could need, that any
43:41 developer was going to need to use eventually. Do you want to talk a little bit about
43:45 open source and what you guys are doing with open source at HumanGeo? Yeah, so at HumanGeo, we
43:52 have a GitHub profile, github.com/humangeo, and most of our open source stuff is
43:59 the JavaScript Leaflet stuff that we've incorporated. We add themes and stuff. So if you want any
44:04 map visualizations, I like it much better than the Google Maps API, using Leaflet, so
44:11 I'd recommend looking at the stuff we've done there. And we also have one or a couple of Python
44:17 libraries, you know, we have an Elasticsearch binding that we've used, which has since
44:23 been superseded by the Mozilla elastic binding. So we definitely love open source at HumanGeo,
44:29 and we make and use open source libraries, and that's one of my favorite parts about HumanGeo.
44:36 That's really awesome. And you guys are hiring, right? Oh yeah, we're always hiring. We're looking for,
44:42 you know, great developers at any age, Python, Java. Like I said, we won one of the best places to work
44:50 for millennials in the DC metro area. Right, and for the international listeners, that's Washington, DC,
44:56 in the United States. Yes, Washington, DC. Yeah, no worries. Yeah, whether you have
45:00 a government clearance or not, we'd love, you know, just send an email or call or get in touch if
45:07 you want to work at an amazing place that does awesome Python. I love the code bases. Yeah, that's
45:11 really cool. This should definitely reach a bunch of enthusiastic Python people, so if you're
45:17 looking for a job, give those guys a ring. That's awesome. Definitely. So you guys are big fans of Python,
45:21 right? Oh yeah, they've been using Python since the company started, and I was
45:29 looking at the original commits, like when the company started, and it was for these projects, and they're
45:33 all Python. It's really exciting. We've been using it since the beginning. It's amazing to rapidly
45:38 iterate, it's fast enough, obviously, and you can look at it and it's super easy to profile
45:44 when you need to. That's another reason why it's amazing. It's not just that you can say it's slow,
45:48 but then it's easy to optimize in that case. Yeah, that's really cool. So are there some
45:53 practices or techniques or whatever that you guys have sort of landed on or focused in on, if you've
46:02 been doing it for so long, that you could recommend? At HumanGeo, we, you know, we make sure to,
46:08 I mean, we don't go like full agile or anything, I mean, we definitely consider ourselves fast-moving,
46:14 and we work at a very great pace, so I guess you could call it agile. And then we have
46:22 great Git practices, we use git-flow, and we make sure to have good code review. In any code base, including
46:30 in Python, you know, you've got to have good code review. And when you write Python code, I do
46:35 write a lot of unit tests for my code, things like that. Nice. So is that the unittest module, or
46:42 pytest, or... Oh yeah, the standard library, I just love standard library stuff. Yeah, it's the batteries
46:49 included, right? Yes, definitely. So anything else you want to touch on before we kind of wrap things up,
46:55 Davis? No, I'm just really excited to be given this opportunity, and I wanted
47:01 to give a shout-out to all my amazing colleagues at HumanGeo. Yeah, that's awesome. You guys are doing
47:07 cool stuff, so I'm glad to shine a light on that. So, final two questions before I let you out of here.
47:13 Favorite editor? What do you use to write Python these days? I'll say that for all my open
47:18 source, like, I work on, you know, a website for the DC Python community, and I use Sublime Text for all
47:24 that open source stuff, all my tools that I use. And when I work for a company, I ask them
47:30 for a PyCharm license. Yeah, nice. Because PyCharm is great for big projects that you can really
47:37 focus on. Yeah, you know, like I said, that's my favorite editor as well. And there's
47:43 definitely a group of people that love the lightweight editors, like Vim and Emacs and Sublime,
47:48 and then people that like IDEs. It's a bit of a divide, but, you know, I feel like when you're on a huge
47:53 project, you can just understand it more, definitely more in its entirety, using something
47:59 like PyCharm. So yeah, I like it. My favorite thing about PyCharm is that you can control-click
48:04 or command-click, like on a module, and it'll take you to the source for that module,
48:09 so you can really quickly look at where the code is flowing in the source. Yeah, absolutely.
48:15 Or, hey, you're importing a module but it's not specified in the requirements, do you want me to add
48:19 it for you for this package you're writing? Right, stuff like that. It's just sweet. They have great
48:23 support for the tooling and the tool chain of Python. Yeah. Awesome, Davis, this has been a really
48:28 really interesting conversation, and hopefully some people can go make their Python code faster.
48:33 Yeah, I definitely hope that they will, and I hope they learned a lot from this. Yeah, thanks for
48:38 being on the show, man. No problem, thank you so much. Yeah, talk to you later. This has been another
48:43 episode of Talk Python to Me. Today's guest was Davis Silverman, and this episode has been brought to you
48:48 by Hired and CodeShip. Thank you guys for supporting the show. Hired wants to help you find your next big
48:54 thing. Visit hired.com/talkpythontome to get five or more offers with salary and equity
48:58 presented right up front, and a special listener signing bonus of $4,000.
49:03 CodeShip wants you to always keep shipping. Check them out at codeship.com and thank them on Twitter
49:09 via @codeship. Don't forget the discount code for listeners. It's easy: TALKPYTHON, all caps, no spaces.
49:15 You can find the links from today's show at talkpython.fm/episodes/show/28.
49:22 Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top.
49:27 You can also find iTunes and direct RSS feeds in the footer of the website. Our theme music is Developers,
49:34 Developers, Developers, Developers by Cory Smith, who goes by Smixx. You can hear the entire song at talkpython.fm.
49:40 This is your host, Michael Kennedy. Thank you very much for listening. Smixx, take us out of here.
49:47 stating with my voice there's no norm that i can feel within haven't been sleeping i've been using
49:53 lots of rest i'll pass the mic back to who rocked it best developers developers developers developers developers developers developers developers developers developers
50:03 developers developers developers developers developers developers developers developers