Profiling data science code with FIL

Episode #274, published Fri, Jul 24, 2020, recorded Wed, Jul 8, 2020


Do you write data science code? Do you struggle loading large amounts of data or wonder what parts of your code use the maximum amount of memory? Maybe you just want to require smaller compute resources (servers, RAM, and so on).

If so, this episode is for you. We have Itamar Turner-Trauring, creator of the Python data science memory profiler FIL, here to talk about memory usage and data science.

Episode Deep Dive

Guest introduction and background

Itamar Turner-Trauring is a seasoned Python developer with experience across various areas such as distributed computing, scientific computing, and asynchronous programming. In this episode, Itamar shares insights on managing memory usage, especially in data-science and batch-processing settings. He also introduces FIL (pronounced like "fill"), a memory profiler designed to help developers find memory bottlenecks in Python code. Itamar’s focus on real-world scenarios, such as dealing with large data sets, brings a practical perspective to an often overlooked topic in Python.

What to Know If You're New to Python

Here are a few things to keep in mind so you can get the most out of this episode:

Python automatically manages memory (via reference counting and a garbage collector) so you don’t have to do everything manually.
When you create large objects or data structures, Python can still use more memory than you might expect because of internal overhead and retained references.
Profiling memory usage in Python often differs from CPU profiling; specialized tools such as FIL help pinpoint temporary peaks in memory usage instead of just leaks.
Generators (using yield) and libraries like NumPy or Pandas can help handle large data more efficiently.

Key points and takeaways

Why Memory Profiling Matters for Data Science: Data science code often handles massive data sets. A single intermediate array or data structure might require gigabytes of RAM, so pinpointing spikes is crucial. Unlike web services, batch-processing pipelines and one-shot data scripts can fail outright if memory usage peaks too high.
- Tools / Links:
  - FIL (Itamar’s Profiler)
  - Pandas
Python’s Memory Management Model: Python primarily uses reference counting. As soon as an object’s reference count drops to zero, it gets deallocated. It also has a garbage collector for resolving cyclical references. While this is mostly seamless, it can lead to hidden overhead if large data structures stay in scope longer than necessary.
- Tools / Links:
  - CPython source (github.com/python/cpython)
Function Calls Retaining Data in Scope: One surprise many developers face is that Python keeps local variables alive until the function exits. Even if you pass a large object onward and never use it again, the original function’s scope retains a reference. This can significantly increase memory usage for data pipelines.
- Techniques:
  - Setting unused references to None
  - Reusing a single variable name (e.g., data = transform(data))
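A minimal sketch of the name-reuse technique, assuming CPython's reference counting. `Blob` and `transform` are hypothetical stand-ins for a large array and a processing step; a weak reference is used only to observe whether the input object is still alive.

```python
import weakref


class Blob:
    """Hypothetical stand-in for a large array (plain classes can be weakly referenced)."""

    def __init__(self, label):
        self.label = label


def transform(blob):
    # Hypothetical processing step that returns a new object.
    return Blob(blob.label + "+step")


def pipeline_separate_names():
    raw = Blob("raw")
    probe = weakref.ref(raw)      # observe the input without keeping it alive
    cleaned = transform(raw)      # raw is never used again...
    return probe() is not None    # ...but the local name still holds it


def pipeline_reused_name():
    data = Blob("raw")
    probe = weakref.ref(data)
    data = transform(data)        # rebinding drops the last reference to the input
    return probe() is not None


still_alive_separate = pipeline_separate_names()  # True on CPython
still_alive_reused = pipeline_reused_name()       # False on CPython
print(still_alive_separate, still_alive_reused)
```

Setting the old name to None (or using `del`) after its last use would have the same effect as rebinding it.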
Python Object Overhead: A Python integer or string includes memory overhead beyond the raw value. For example, an integer might use ~28 bytes. When multiplied by millions of entries in a list, this overhead adds up fast. Using NumPy arrays with fixed dtypes often slashes memory requirements dramatically.
- Tools / Links:
  - NumPy
  - sys.getsizeof() documentation (docs.python.org)
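A quick way to see this overhead, sketched with `sys.getsizeof`. Exact byte counts depend on the CPython version and platform, so the comparison against the raw payload size is the point here, not the specific numbers.

```python
import sys

# Size of the whole Python object, including type pointer and reference count.
int_size = sys.getsizeof(1)
str_size = sys.getsizeof("a")

# Raw payload sizes for comparison: a 64-bit integer and a one-byte character.
raw_int_bytes = 8
raw_char_bytes = 1

print(f"int object: {int_size} bytes vs {raw_int_bytes} raw")
print(f"str object: {str_size} bytes vs {raw_char_bytes} raw")
```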
Comparing Server vs. Batch-Processing Memory Concerns: Many profiling tools revolve around server apps that run for weeks, focusing on leaks over time. For data pipelines, the single-run peak usage is more pressing, because a single spike can crash or wedge a system. FIL is specifically tuned to watch for such momentary peaks.
- Tools / Links:
  - Memory Profiler (PyPI)
  - Austin (Python Sampling Profiler)
FIL, a Dedicated Memory Profiler for Data Pipelines: FIL intercepts allocations at a low level. Unlike sampling profilers, it records every allocation for a detailed map of the call stacks that triggered them. Developers then receive a flame graph indicating which lines of Python code caused large allocations or spikes.
- Tools / Links:
  - FIL (pythonspeed.com)
Flame Graph Visualizations: FIL outputs an interactive flame graph, letting you see the memory usage “width” of each call stack. The redder and wider a block is, the more memory it consumed. This visually guides developers to the lines of code that need attention.
- Tools / Links:
  - Inferno (Rust library for flame graphs)
Sampling vs. Instrumentation in Profiling: While CPU profiling often benefits from sampling to reduce overhead, memory usage can be tricky to sample. Brief but large allocations can go undetected if you only measure intermittently. FIL’s instrumentation approach provides continuous coverage, critical for capturing big one-time spikes.
- Tools / Links:
  - py-spy (CPU profiling)
  - tracemalloc (CPython’s built-in)
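To illustrate why a short-lived spike is easy to miss, here is a small sketch using the standard library's `tracemalloc` (not FIL). The `spike` function is hypothetical; it allocates roughly 20 MB that is gone by the time the function returns, so a measurement taken afterwards sees almost nothing while the recorded peak shows the spike.

```python
import tracemalloc


def spike():
    # Hypothetical step with a large temporary buffer that is freed on return.
    temp = bytearray(20_000_000)
    return len(temp)


tracemalloc.start()
spike()
# current: bytes still allocated now; peak: highest value seen since start().
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current={current} bytes, peak={peak} bytes")
```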
Out-of-Memory Crashes and FIL’s Handling: One of FIL’s unique features is attempting to gracefully handle out-of-memory events. It sets aside a small reserve of memory, then releases that reserve and logs a snapshot if the process hits a failing allocation. This can give you a last-ditch clue on what caused the crash.
- Challenges:
  - Hard to generate complex flame graphs once memory is fully exhausted
  - Plans to refine these emergency routines in future releases
Strategies to Reduce Peak Memory: Beyond using a profiler, you can reduce peak usage by streaming or batching data, compressing or encoding data more tightly (e.g., using smaller dtypes), and limiting references. A small architectural tweak, like reading line by line or overwriting variables in place, can save gigabytes.

Techniques:
- Generators (yield)
- Breaking big tasks into smaller chunks
- Using efficient libraries like NumPy or Pandas
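As a rough sketch of the generator technique, the two hypothetical functions below compute the same values; one materializes every result in a list, the other yields them one at a time so only a single item needs to exist at once.

```python
import sys


def squares_list(n):
    # Builds the whole result in memory before returning.
    return [i * i for i in range(n)]


def squares_stream(n):
    # Yields one value at a time; nothing is accumulated.
    for i in range(n):
        yield i * i


n = 100_000
total_from_list = sum(squares_list(n))
total_from_stream = sum(squares_stream(n))

# Shallow sizes: the list container alone versus the generator object.
list_bytes = sys.getsizeof(squares_list(n))
gen_bytes = sys.getsizeof(squares_stream(n))
print(total_from_list == total_from_stream, list_bytes, gen_bytes)
```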

Interesting quotes and stories

"We were going to spend like 70% of our expected revenue just on cloud computing... so I quietly optimized it first, then told my manager." — Itamar Turner-Trauring

"If you want to understand why something is using too much resources, you need to measure it. Otherwise, you'd never suspect a function call is dragging around a 10 GB array." — Itamar Turner-Trauring

Key definitions and terms

Reference Counting: A technique where each Python object tracks how many references point to it. When that count drops to zero, Python immediately frees the object’s memory.
Garbage Collection (GC): An algorithm in Python that reclaims memory from objects in reference cycles (where objects reference each other), which reference counting alone can’t free.
Peak Memory Usage: The highest amount of RAM a process uses at any single moment. Critical in batch-processing and data pipelines because it often determines whether the process will crash.
Flame Graph: A visualization where each function call is represented as a stack block. The wider the block, the more resources it used. Useful for quickly identifying hotspots.

Learning resources

Here are a couple of courses from Talk Python that align with the topics in this episode:

Python Memory Management and Tips: Dive deeper into how Python handles memory, from reference counting to garbage collection, and learn practical strategies to optimize memory usage.
Python for Absolute Beginners: Ideal if you're entirely new to Python and want to gain a solid foundation before tackling specialized topics like memory profiling.

Overall takeaway

Optimizing memory in Python is as much about understanding your code’s flow as it is about using the right tools. As Itamar Turner-Trauring explains, a single line of code can quietly hold massive data in memory, pushing a system to its limit. By using specialized profilers like FIL, developers gain precise insights into where memory spikes occur. If your Python code processes large amounts of data or you’re running heavy batch jobs, taking a careful look at memory usage can unlock huge performance gains and cost savings.

Links from the show

Itamar on twitter: @itamarst
FIL: pythonspeed.com
Python Bytes coverage of FIL: pythonbytes.fm
Video: Small Big Data: using NumPy and Pandas when your data doesn't fit in memory: youtube.com
Software Engineering for Data Scientists Article: pythonspeed.com

Python Tutor: pythontutor.com
Weak references: docs.python.org

memory_profiler package: github.com
Austin profiler: github.com
WSL2 on Windows: pbpython.com/wsl-python.html
Episode #274 deep-dive: talkpython.fm/274
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Episode Transcript


00:00 Do you write data science code? Do you struggle loading large amounts of data or wonder what parts

00:04 of your code use the maximum amount of memory? Maybe you just want to require smaller compute

00:08 resources, servers, RAM, and so on. If so, this episode is for you. We have Itamar Turner-Trauring,

00:15 creator of the Python data science memory profiler FIL, here to talk about memory usage

00:20 and data science. This is Talk Python To Me, episode 274, recorded July 8th, 2020.

00:26 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem,

00:44 and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy.

00:49 Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter

00:55 via @talkpython. This episode is brought to you by Linode and us. Do you want to learn Python,

01:02 but you can't bear to subscribe to yet another service? At Talk Python Training, we hate subscriptions

01:08 too. That's why our course bundle gives you full access to the entire library of courses

01:12 for one fair price. That's right. With the course bundle, you save 70% off the full price of our

01:18 courses, and you own them all forever. That includes courses published at the time of the purchase,

01:24 as well as courses released within about a year of the bundle. So stop subscribing and start learning

01:30 at talkpython.fm/everything.

01:32 Hey, Itamar. Welcome to Talk Python To Me.

01:34 Hi, great to be here.

01:35 Yeah, it's great to have you here. I'm excited to talk about Python and memory.

01:39 Yeah, me too.

01:40 Yeah, and I think it's something that doesn't really get as much coverage as I think it deserves in the

01:46 Python space. You know, if you're a Java developer or a .NET developer, people go on and on and on about

01:53 optimizing the GC and tweaking this thing or that thing or your code or algorithms for memory management.

02:00 If you're a C developer, you're constantly in fear of memory leaks and memory management. And in Python,

02:07 we get to just kind of coast.

02:08 Or not. And so my motivation for getting into this was doing some scientific computing with basically a giant pile of images,

02:18 and we'd have to extract information from them. And I initially just focused on getting it working.

02:24 And then one day, I said, Okay, we're running this on these cloud computers. And it's taking, you know, 18 hours to process the data. Like most of the CPUs are idle, because you're using so much memory. I wonder if this is a problem.

02:37 And so I did some math, and I talked to management about our expected revenue. And it turns out we were going to spend like 70% of our expected revenue just on cloud computing, given my current implementation, which wouldn't have let any, there wouldn't be any left over for it.

02:53 Were they excited about that? Or were they not so excited?

02:55 I didn't mention this. I went and optimized it. And then I just, like, then I sent emails to my manager saying, Look, look, look, the great work I did.

03:04 Exactly.

03:05 But I hadn't done any optimization.

03:07 Yeah, yeah, yeah, yeah. That's very, very cool.

03:09 And so and reducing the memory, like meant that you could use a lot more CPUs, because that was the bottleneck initially. Like we had this cloud VM that was like mostly just sitting idle, because you just need so

03:22 much RAM for each of the threads or processes.

03:24 Right. You can get a high memory version of a cloud computer, but it still, there is that tradeoff, right? You want to take full advantage of the CPUs there. And obviously, less memory is better. And also just it might mean fewer cloud computers to manage.

03:41 Yeah. And if you think about your computer, if you look at the usage of your computer, much of the time you're using 1% of the CPU; it's just sitting there. And your RAM, a lot of people have, say, 8 gigabytes of RAM in their computer, your RAM is going to be like three quarters full, 75% full.

03:59 And basically, it's just that proportionally RAM is much more expensive than computing. And so you don't have as much of a just look at all the CPU guy, like memory tends to be resource constrained, and then the failure modes are you run out of memory and like your computer's wedged or you lost your data.

04:14 Right, right. You run out of CPU, it goes slower.

04:17 Yeah. So the failure modes are much worse.

04:20 Interesting. Yeah, well, it's going to be really fun to dig into it. And I think it's an interesting angle of just the Python ecosystem that people don't spend that much time obsessing about the memory, but it's important. And it's interesting. And we're going to spend some time obsessing about it for the next hour. So for sure. Before we do, let's get into your story, though. How'd you get into programming and Python?

04:42 I got into programming back in the mid 90s, and my parents had this business creating multimedia CD-ROMs, which was exciting new technology in the mid 1990s. And so I ended up just doing coding for them. I got into Python a few years later, when I discovered Zope, the Z-O-P-E framework, which at the time was like really huge in the Python world. Like, you go to Python conferences, there'd be like a whole track on Zope.

05:09 And then I just stuck around and ended up using Python for lots of things like distributed computing, worked on Twisted for many years, and scientific computing, and just a variety of different things.

05:20 Yeah, very cool. And what do you do day to day?

05:22 I've been doing training, stuff with Docker and packaging for Python. I'm hoping to eventually teach some stuff about Python memory, and then have some products, do a little consulting on the side and start thing.

05:35 Yeah, very cool. Is this training in person? Is it online? What is it like?

05:38 Originally, this was in-person training. I was supposed to have an open enrollment class right after PyCon in Pittsburgh, for example. And nowadays, it's over Zoom. Because what do you do?

05:54 Yeah, because the world is crazy. It's absolutely crazy. Yeah. Okay. Well, cool. That's a lot of fun. I did that for like 10 years and really enjoyed my time doing in-person training. Luckily, there were no pandemics.

06:06 Zoom actually works.

06:08 It definitely got disrupted with other things, but not too much. Yeah, we did some stuff over, I think, GoToMeeting, GoToWebinar, which there was no Zoom. So that's what we were using. It was pretty good, actually. Yeah, it's not a bad story.

06:18 All right. So speaking of obsessing with Python memory, let's just get started off with a little bit of an overview of how Python memory works. So I feel like Python memory lives a little bit in between the C++ world, where it's very explicit, and the Java/.NET GC world, where it's not even deterministic. What's the story?

06:40 As a payload, this actually depends on which Python interpreter you're using. If you're using PyPy, P-Y-P-Y, it's actually basically like Java or .NET. If you're using CPython, which most people do, it's a little bit different.

06:54 And the basic idea is that every Python object has a reference counter. And so when you get a reference to an object, it gets incremented by one. You remove a reference, it gets decremented. So when you append your object to a list, that's an extra reference. If you destroy the list, that reference goes down to...

07:14 If the reference goes down to zero, the object is not being used by anyone. There's no references to it. So it can immediately be freed up and deallocated.

07:22 The problem with reference counting is that it's not... It doesn't cover all cases. If you have a circular... Set of circular references, the objects will never hit reference count to zero.

07:31 So if you take a list and then you append it to itself, it's going... Because it has a reference to itself, its reference count is never going to hit zero, even if you don't have any other references to it.

07:42 So in addition to the reference counting, Python also has a garbage collection system, which every... I think it's based on how many bytecodes have run.

07:49 It will go and look for objects that are in this little loop by themselves, but not being used in any actual code and get rid of them too.
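The cycle problem described here can be sketched as follows, assuming CPython. `Node` is a hypothetical class, and automatic collection is disabled only so the demonstration is deterministic.

```python
import gc
import weakref


class Node:
    """Hypothetical object that can point at another Node."""

    def __init__(self):
        self.other = None


gc.disable()  # keep the cyclic collector from running on its own during the demo

a = Node()
b = Node()
a.other = b
b.other = a                # a -> b -> a: a reference cycle
probe = weakref.ref(a)     # observe a without adding a strong reference

del a, b                   # no outside references remain
alive_after_del = probe() is not None      # True: refcounts never reached zero

gc.collect()               # the cycle detector finds and frees the pair
alive_after_collect = probe() is not None  # False

gc.enable()
print(alive_after_del, alive_after_collect)
```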

07:57 Right. And I think the GC is also generational, like the other main ones, say Java and .NET as well.

08:04 Yeah. And I don't quite remember how this works.

08:09 So, you know, a totally, maybe more maniacal example might be if you're studying some kind of graph theory type object, like a network of things or a network of relationships among people or something like that, where it doesn't even have to be one thing pointing back at itself.

08:27 It could be thing A points at B, B points at C and D, D points back at F, but F points at A.

08:34 If you can make a circle following that chain, reference counting breaks.

08:38 Yeah. And so then you fall back to GC, the garbage collection and...

08:42 Right. But I would say for the most part that just knowing the GC is there to kind of catch that edge case is really all most people need to know, right?

08:48 Because the primary story is this reference counting story. What do you think?

08:53 Unless you're using PyPy, because then there's no reference counting. It's only garbage collection.

08:58 Yeah. But I'm thinking most people are running CPython. Maybe they're using some data science libraries, especially in the context of using your tool that we're going to talk about.

09:06 It feels like it's definitely in the data science side of things. In that world, in the CPython world, then it's probably reference counting that you care the most about.

09:16 Yeah. And I mean, just a fairly high level understanding that as long as something's referring to your object, it will exist. If the references go away, it will either immediately or eventually disappear and get deallocated.

09:26 That's pretty much all you need to know the vast majority of the time.

09:30 Yep.

09:30 And the vast majority of time, that's enough, but not always.

09:33 Not always. So we're going to talk about a project that you started called FIL that is about profiling memory allocations for data pipeline type of scenarios; in particular, it's optimized for that.

09:48 Although I suspect you could use it for a lot of different things.

09:51 I think so, yeah.

09:52 But let's start the story by just talking about some memory challenges, I guess we could call them.

09:57 So you wrote a cool blog post article called Clinging to Memory, How Python Function Calls Can Increase Your Memory Usage.

10:06 Yeah.

10:07 That's a pretty interesting one. Tell us the general idea here.

10:09 And so this is something I encountered in the real world, so it can impact you.

10:14 And this is more of an issue in the kind of applications where you're processing large amounts of data.

10:20 So like one object might be like four gigabytes of RAM.

10:23 Like if it's like, if objects live slightly longer and they're like, you know, a dictionary of three entries and there's only one dictionary, you don't really care how long it lives because it's not using.

10:33 Are you using 2.7 or 2.701 megabytes for this working memory?

10:39 Nobody cares.

10:39 Yeah.

10:40 Yeah.

10:40 When you have like an array that's like four gigabytes or 20 gigabytes, like this can have very significant impacts.

10:46 If an array lives even slightly longer than it needs to.

10:49 And so the idea is if you have a function and you create something in it and then you pass that object to another function that you're calling, you have function f and you're creating this.

11:00 You have this large array, you pass it to g.

11:02 If you have a local variable inside of f, the parent function still refers to that array.

11:08 Like the parameter that accepted the data, for example.

11:11 Yeah.

11:12 Then that reference within that function call is a reference.

11:18 It means reference count's not going to hit zero.

11:20 Even if g like uses that array and then throws it away and doesn't care about it anymore.

11:24 The parent function still has reference to that array.

11:28 And so you can end up with these situations where if you read the code, you know that you are never going to use this data again.

11:34 There is no way you can use it.

11:36 But from Python's perspective, because there's a local variable in the function frame that's referring to it, that object is going to persist until that function returns or exits with an exception.

11:47 Right.

11:47 Because everything that was loaded up in that function got defined.

11:50 And so here's all the variables of the function.

11:52 And reference counting, they're still pointing at things until those variables go away, right?

11:58 And they go away when the function returns.

12:00 Yeah.

12:00 And you can imagine like if you go into PDB, like you can actually travel up and down the stack and like you can go up to like the parent function and see the local variables.

12:09 They're still there.

12:09 Like you can still go in the debugger prompt, just go up two frames to the parent caller and you'll still see the local variable pointing to your large object.

12:17 And so you can restructure your code in various ways to deal with this.

12:22 And the way I ended up actually doing it was basically copying this idiom from C++ where you have this object whose only job is to own another object.

12:32 It has that.

12:33 You end up with only one reference to the large array that you care about, which is from inside the owner object.

12:38 Then you pass the owner object around.

12:40 And when you know that you don't need that data anymore, you tell the owner object, clear your contents.

12:46 And then that one reference goes away and memory is freed.
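A minimal sketch of this owner-object idiom, assuming CPython. The names `Owner`, `BigArray`, `consume`, and `run` are hypothetical and not taken from the blog post; the weak reference only checks when the payload is actually freed.

```python
import weakref


class BigArray:
    """Hypothetical stand-in for a multi-gigabyte array."""

    def __init__(self, size):
        self.size = size


class Owner:
    """Holds the only strong reference to a payload so it can be dropped on demand."""

    def __init__(self, payload):
        self._payload = payload

    def get(self):
        if self._payload is None:
            raise ValueError("payload was already cleared")
        return self._payload

    def clear(self):
        self._payload = None


def consume(owner):
    size = owner.get().size  # use the payload through a temporary, not a local name
    owner.clear()            # payload freed here, even though callers still hold owner
    return size


def run():
    owner = Owner(BigArray(1_000))
    probe = weakref.ref(owner.get())
    size = consume(owner)
    freed_before_return = probe() is None  # checked while run() is still on the stack
    return size, freed_before_return


size, freed_before_return = run()
print(size, freed_before_return)
```

The trade-off is that every user of the owner has to avoid binding the payload to a long-lived local name, which is the same discipline manual memory management asks for.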

12:49 So you sort of just interesting situation where every once in a while you actually have to fall back to the manual memory management techniques that you have to use all the time in languages like C or C++.

13:00 Right.

13:01 You know what's interesting is I see examples of code like this.

13:05 And then you'll see other people lamenting the fact that code is written this way.

13:10 And they'll say, you should never write code this way.

13:12 It's not necessary in Python because it has automatic memory management.

13:16 Or you should never do this, like halfway through randomly set a variable to None and then keep going.

13:22 Why would you ever do that?

13:23 That's like you don't need to do that.

13:25 Right.

13:25 If you're not going to use it again.

13:26 Oh, except when that was costing you an extra gig of memory.

13:32 All of a sudden, this kind of nonstandard pattern, it turns out to be really valuable.

13:37 Right.

13:37 It's the difference between it works or it doesn't work, or it's a thousand versus two hundred dollars of cloud compute or whatever.

13:43 Right.

13:45 This portion of Talk Python To Me is brought to you by Linode.

13:48 Whether you're working on a personal project or managing your enterprise's infrastructure, Linode has the pricing, support, and scale that you need to take your project to the next level.

13:57 With 11 data centers worldwide, including their newest data center in Sydney, Australia, enterprise-grade hardware, S3-compatible storage, and the next-generation network, Linode delivers the performance that you expect at a price that you don't.

14:12 Get started on Linode today with a $20 credit and you get access to native SSD storage, a 40 gigabit network, industry-leading processors, their revamped cloud manager at cloud.linode.com, root access to your server, along with their newest API and a Python CLI.

14:28 Just visit talkpython.fm/linode when creating a new Linode account and you'll automatically get a $20 credit for your next project.

14:36 Oh, and one last thing.

14:37 They're hiring.

14:38 Go to linode.com/careers to find out more.

14:41 Let them know that we sent you.

14:45 Yeah.

14:45 Having never done scientific computing before this job I was at a couple years ago, it was an interesting experience learning a different – because the domain is different, like you have different constraints and different goals, and some of the ways you write software end up being different.

15:01 Unless you're doing large-scale data processing most of the time in Python, you just don't think of any of these things.

15:08 Like, you might have to worry about memory leaks, but that's a different sort of – much of the time, that's a different set of problems, where, like, you don't think about the fact that an object being alive for five more milliseconds might cost you, like, another $100,000 if you're scaling up.

15:25 Yeah, for sure.

15:26 It's interesting.

15:27 Another solution that you proposed – well, you proposed three solutions.

15:31 One is this ownership story.

15:34 One was maybe only applicable for very limited small functions, but you could just have no local variables and just basically chain one function call into another.

15:44 Yeah.

15:44 The intermediate one, though, seems possible, possibly reasonable as well, which is to reuse the local variable.

15:51 So you're going to load up some data, and then you're going to maybe make some changes, which will copy the data.

15:55 Instead of having data one, data two, data three, you just say data equals load it, data equals modify the data, data equals modify the data again.

16:03 And that way, at least as you go through these steps, after each one, it's, you know, released the memory from the prior, potentially.

16:10 Yeah.

16:10 And one of the things about sort of data processing applications, they often have these sort of idioms where you're, like, doing a series of steps.

16:18 And this is where, like, keeping old copies of the data around tends to end up cumulatively being very expensive in terms of memory because it's a series of steps.

16:26 Once you've done step one, you don't really care about the initial input.

16:29 Once you've done step two, you don't care about that one.

16:31 So just explicitly overriding the previous step is another way to do this.

16:36 I can see somebody looking at this in a code review and going, why are you doing this?

16:40 These data mean different things.

16:41 One should be initial data.

16:43 The other should be, you know, grouped by state.

16:46 And the third should be some other thing.

16:48 Like, you're naming these wrong.

16:49 You know what I mean?

16:50 That's what I was kind of hinting at is, like, sometimes you need to break the rules to break through to, like, a better outcome.

16:57 Yeah.

16:58 And in general, pretty much every best practice is very situation-specific.

17:03 And sometimes it's the vast majority.

17:05 But...

17:06 Yeah, that's a really good point.

17:08 A lot of times when you hear advice like that, it's spoken as if it was absolute.

17:13 But there's an implicit context, right?

17:15 Like you said, when you don't really care about memory and that kind of stuff, you just said you just go and write the code.

17:20 But, you know, it probably means implicitly what I care about is readability.

17:25 And what I care about is maintainability.

17:27 And I just want to optimize it to be as clean and pure as possible, which is fine.

17:32 But if pure doesn't work and not clean totally works, like, forget the clean.

17:37 We don't care anymore.

17:37 I want it to work.

17:38 That's more important.

17:39 Like, functioning is primary here.

17:41 Yeah.

17:41 And then there's, like, places like MicroPython where you're running on little embedded devices with very little RAM.

17:46 And then some of the problems that you have in large data processing are translated down to very small programs.

17:53 That's an interesting example for sure because, again, if you didn't care about that extra meg of RAM, but all of a sudden you only have half a meg, now you really care about it.

18:01 Yeah.

18:01 I do want to throw out something from Philip Guo over at pythontutor.com.

18:06 So if you want to understand, like, a lot of these relationships and how objects refer back to each other, he's got a really cool visualization.

18:12 I think when you're over there, you have to check.

18:15 There's, like, a checkbox at the bottom.

18:18 Let me pull it up.

18:19 Under the way it renders objects, I think you have to flip it.

18:23 From inline primitives to, say, render all objects on the heap like Java and Python do.

18:28 Anyway, if you want to, like, show that off or visualize that, that's a really cool quick one.

18:33 Also, if you want to observe reference counting without changing reference counting.

18:39 Because, like, you might want to say, how do I know if there's a reference to this?

18:42 You can't store a variable and point at it and say, now we're going to ask because you've now changed it, right?

18:48 Have you done anything with weak references?

18:51 Weak ref?

18:52 I'm not sure I've ended up using them in scientific computing.

18:55 I've definitely done them and used them in some places with, like, asynchronous programming servers.

19:02 Yeah.

19:03 Yeah, you could use them for, like, caches that can kind of, like, auto expire and stuff as well.

19:08 But they're really good for, I could create a weak reference to an object.

19:11 Then you can ask, how many things point at this?

19:13 And even if you know something points at it, knowing whether that's one or two might help you get a different understanding, right?

19:20 You're like, oh, I thought there was only one pointer.

19:21 Why are there two pointers to this thing?

19:23 Where did that second one come from?

19:25 And so you can ask interesting questions without changing the reference counting with weak references.

19:30 It's really easy.

19:31 Yeah.

19:31 And there's an API, gc.get_referrers.

19:33 It gives you the objects that refer to an object.

19:36 But then, yeah, you inevitably add the current function frame as an additional reference, and you have to discount it.
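A small sketch of inspecting references from the standard library, assuming CPython. `Payload` is a hypothetical class. Because `sys.getrefcount` itself temporarily adds to the count, the example compares counts rather than relying on absolute values.

```python
import gc
import sys
import weakref


class Payload:
    """Hypothetical object whose references are being inspected."""


obj = Payload()
baseline = sys.getrefcount(obj)      # includes the temporary reference from the call

alias = obj                          # one more strong reference
with_alias = sys.getrefcount(obj)

probe = weakref.ref(obj)             # a weak reference does not raise the count
with_weakref = sys.getrefcount(obj)

holder = [obj]                       # a container that refers to obj
referrers = gc.get_referrers(obj)    # objects that directly refer to obj
holder_found = any(r is holder for r in referrers)

print(with_alias - baseline, with_weakref - with_alias, holder_found)
```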

19:43 Right, right.

19:43 You also threw getsizeof in here as well.

19:47 What's the story of getsizeof?

19:49 The function call thing is sort of just an example of places where sort of automatic memory management gets in your way.

19:54 But there are more fundamental limits or problems you end up with when using Python in memory intensive situations, which you need to understand.

20:06 And one of them is just that Python objects use a surprising amount of memory for what information that they store.

20:14 So pretty much every, if you look at the implementation of the CPython interpreter, every object has overhead in addition to whatever data you need to actually store the object itself.

20:22 It has, on a 64-bit machine, which is most of them these days, it has a pointer to the class or the C type for the class.

20:30 So that's another eight bytes.

20:31 And then it has the reference count.

20:33 So that's another eight bytes.

20:35 And I think, if the object supports garbage collection, it's even more.

20:40 And sys.getsizeof is a nice utility that tells you how many bytes an object uses.

20:48 I don't think that actually traverses the object tree, right?

20:53 Like if it's a list, and the list points at things, and those point at other things.

20:57 Yeah.

20:58 I think it's just how much is like that, the immediate thing that that variable value points at, right?

21:04 Yeah.

21:05 Okay.

21:05 That's what I thought.

21:06 Yeah.

21:07 I can check while you're talking.

21:08 And if you check the, how much memory, like an integer uses, like the number one, it takes 28 bytes.

21:16 Yeah.

21:16 So if you think about like how you'd represent numbers in memory, like unless you have really large numbers where you obviously need more, 64 bits is sort of, will get you some really big numbers.

21:27 So if you're using a list with a million integers, I did the math, and I think a list with a million integers in it is 35 megabytes of RAM.

21:49 If you allocated that in a C array, it would be eight megabytes of RAM.

21:52 So you're using four and a half times as much memory just because you're using Python objects.

21:57 In another example, while we're talking about it is the character A. So in C, the character A would be what, four bytes or something like that?

22:06 If you're using UTF-8, you can probably get it down to one byte.

22:09 Yeah. Yeah. You could definitely make it smaller if you do it right.

22:11 Yeah.

22:12 In Python, it's 50.

22:12 Yeah.

22:14 So also get size of, it does some interesting stuff. So if I give it a list of like a million items, it'll say the size is 800,000. It's not quite a million. Maybe it's 100,000. I think it's 100,000.

22:25 But if I give it a list, which has the number one and also contains within that list, that list of 100,000 items, the size is 72. So yeah, you got to be real careful. It's, it's, it doesn't tell you the whole story, but it does.

22:38 Yeah, exactly. But it gives you a sense of, oh, like the letter A is 50 and the number one is 28. The memory that we use per representation and data in Python is fairly expensive, I think is the takeaway, right?
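
The shallow behavior of `sys.getsizeof` described here can be checked directly. The sketch below assumes 64-bit CPython; exact byte counts vary between Python versions, so it compares sizes rather than hard-coding them. `deep_sizeof` is a hypothetical helper, not a standard library function, that follows references through the built-in containers.

```python
import sys


def deep_sizeof(obj, seen=None):
    """Hypothetical helper: sum sys.getsizeof over an object and what it contains."""
    if seen is None:
        seen = set()
    if id(obj) in seen:  # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size


big = list(range(100_000))
nested = [1, big]  # a two-element list whose second element is the big list

shallow_big = sys.getsizeof(big)        # the list object plus its array of pointers
shallow_nested = sys.getsizeof(nested)  # tiny: header plus two pointers
deep_nested = deep_sizeof(nested)       # includes big and every integer in it
```

The outer list looks tiny to `sys.getsizeof` because only its own header and two pointers are counted, which is why the shallow number does not tell the whole story.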

22:50 Yeah. So if you have like a common, one place where people hit this is like, you're reading in some data and then like you're creating like a list per thing for like a reading in some like rows of data from the CSV or something.

23:05 And you're turning into like, here's a list and then like there's a dictionary with like a bunch of keys for each one or an object for each entry.

23:12 And you end up with like a massive amount of, considering the information you're storing, you end up with a huge amount of just overhead from creating all those different Python objects.

23:22 And so like one situation you end up in Python running out of memory is if you're doing like data processing and it's just like, you just have 10 gigabytes of data you loaded.

23:31 It's going to be a lot of memory, but sometimes it's not actually that much data if you store it on disk or if you store it in the appropriate CXC object.

23:40 And it's just a lot of data because you create a lot of Python objects.

23:43 And so it's using like five times as much memory as the actual information.

23:47 Right, right. So maybe load it into NumPy or Pandas or something like that instead of into a native Python dictionary or something.

23:55 Yeah. So think about a Python list, which has a bunch of Python integers in it.

24:00 And so each of those Python integers takes like 28 bytes of RAM.

24:04 A NumPy array basically has the same Python overhead, but only once, at the beginning, where it says, I'm an array.

24:12 And I store 64-bit integers.

24:14 And then the storage is just, it's not a generic pointer to a generic Python object.

24:19 Right, right. It's eight bytes per entry.

24:21 Yeah.

24:22 Yeah. And so the information that would take 35 megabytes in a Python list will be eight megabytes in a NumPy array.
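
The list-versus-array comparison can be reproduced with the standard library alone. This sketch uses `array.array` as a stand-in for a NumPy array (an assumption made so the example has no third-party dependency); both store fixed-width values in one contiguous buffer instead of a pointer per element. Exact totals depend on the Python version.

```python
import sys
from array import array

N = 1_000_000
as_list = list(range(N))
as_array = array("q", as_list)  # "q" = signed 64-bit integers, 8 bytes each

# A list costs its pointer array plus one full Python int object per element.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)

# The typed array pays the object overhead once, then 8 bytes per element.
array_bytes = sys.getsizeof(as_array)

ratio = list_bytes / array_bytes
```

With NumPy installed, `numpy.arange(N, dtype=numpy.int64).nbytes` should report the same 8 bytes per element.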

24:27 Yeah. Another one that, I mean, moving to some of these libraries that support it more efficiently, certainly in the data science world make a lot of sense.

24:34 But also something that makes a lot of sense that I think people may be overlooking is using different algorithms or different ways of processing.

24:43 Like one really simple way is like I need to load, compute a bunch of stuff and give it back as a collection.

24:49 So I'm going to create a list, fill it up and return it. Right.

24:52 That loads, you know, maybe a million items in a memory and all the cost and overhead of that.

24:57 And then you give it over to be processed and off it goes.

24:59 Alternatively, add a yield instead of a list and just do a generator and you process them one at a time.

25:06 Because probably what you're going to do when you get that list back is go through the list one at a time. Right.

25:10 And that uses one one millionth of the memory or something to that effect. Right.

25:14 It doesn't, it only loads one in memory at a time and not all of them.
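
A minimal sketch of the list-versus-generator point, using `tracemalloc` to compare peak memory. The function names are hypothetical, and `tracemalloc` only sees allocations made through Python's own memory APIs, which is sufficient for this pure-Python case.

```python
import tracemalloc


def squares_list(n):
    """Build the whole collection in memory, then return it."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_gen(n):
    """Yield one value at a time; nothing is accumulated."""
    for i in range(n):
        yield i * i


def total_and_peak(make_iterable, n):
    """Consume the iterable one item at a time and report peak traced memory."""
    tracemalloc.start()
    total = 0
    for value in make_iterable(n):
        total += value
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return total, peak


list_total, list_peak = total_and_peak(squares_list, 100_000)
gen_total, gen_peak = total_and_peak(squares_gen, 100_000)
```

Both versions compute the same total; only the peak differs, because the consumer was going to walk the items in order anyway.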

25:18 There's things like that as well that you can do.

25:20 Yeah.

25:21 If processing them one at a time in order makes sense.

25:23 If you need to seek around and ask, well, how does the third one compare to the first one, then forget it.

25:28 Yeah.

25:28 Like, the three basic techniques usually are, first, batching, or streaming, where that generator is sort of batches of one.

25:36 A size of one.

25:37 Yeah.

25:38 Yeah.

25:38 And then there's compression where you have the same memory and same data semantically, but with less overhead.

25:46 So like switching from Python lists to NumPy arrays is in some sense compression.

25:50 If you know your numbers are only going to go up to 100, you can use an 8-bit NumPy array.

25:56 And then like you've cut your memory by like 80% at no cost because you have exact same information.
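
The "smaller integer type" form of compression can be sketched as follows. Again `array.array` stands in for a NumPy array (an assumption to keep the example dependency-free); the NumPy equivalent would be choosing `dtype=numpy.int8` instead of `numpy.int64`. The `readings` data is made up for illustration.

```python
from array import array

# Hypothetical data: values that never exceed 100.
readings = [i % 101 for i in range(100_000)]

wide = array("q", readings)    # signed 64-bit: 8 bytes per value
narrow = array("b", readings)  # signed 8-bit: 1 byte per value, range -128..127

wide_bytes = wide.itemsize * len(wide)
narrow_bytes = narrow.itemsize * len(narrow)

# The information is identical; only the storage width changed.
same_values = list(wide) == list(narrow)
savings = 1 - narrow_bytes / wide_bytes
```

Going from 64-bit to 8-bit storage removes seven of every eight bytes of element storage, and storing a value outside the narrow type's range raises `OverflowError` in `array` rather than silently corrupting data.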

26:02 And then the final technique is indexing where you need to load only some of the data.

26:08 Then you can sort of arrange your data

26:10 so that you only need to load that part.

26:12 So like if you're doing accounting, like if you have one file for every month of the year, like you can just load July's file and then you don't have to worry about the data in the other months.
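
The one-file-per-month idea can be sketched like this. The file naming scheme (`2020-07.csv`), the column names, and the helper functions are all assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def write_monthly_files(directory, rows_by_month):
    """Write one CSV per month, e.g. 2020-07.csv (hypothetical layout)."""
    for month, rows in rows_by_month.items():
        with open(directory / f"{month}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["description", "amount"])
            writer.writerows(rows)


def load_month(directory, month):
    """Load only the requested month; other files are never opened."""
    with open(directory / f"{month}.csv", newline="") as f:
        return [(row["description"], float(row["amount"])) for row in csv.DictReader(f)]


with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    write_monthly_files(
        directory,
        {
            "2020-06": [("rent", 1000.0)],
            "2020-07": [("rent", 1000.0), ("coffee", 4.5)],
        },
    )
    july = load_month(directory, "2020-07")

july_total = sum(amount for _, amount in july)
```

Here the file name acts as the index: memory use scales with one month of data rather than the whole year.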

26:20 Yeah.

26:21 Yeah.

26:21 Very cool summary.

26:22 So that's the picture.

26:24 That's the memory story.

26:25 That's some of the challenges you might hit and some of the potential solutions that you might come up against.

26:31 But at some point you might just need to know like, okay, this is about as good as it's going to get, but I still need to understand the memory better.

26:38 Or I'm running out of memory.

26:39 Why?

26:40 Where?

26:41 Or maybe you want to take the lazy approach.

26:44 Maybe you want to start from, well, I know I have this problem of using too much memory.

26:48 I know one of these things that these guys talked about will possibly solve it, but where should I focus my attention?

26:55 Right.

26:55 I've got a thousand lines of Python.

26:57 Maybe only three need to be changed.

26:59 Which are the three?

27:00 Right.

27:01 So you probably want to profile it somehow.

27:03 Answer the question, like where is the memory coming from?

27:06 What's the problem?

27:07 It's very difficult to optimize something if you can't measure it.

27:10 Like the example we gave with functions keeping local variables, keeping things alive.

27:15 Like I would never, now that I know it's a problem I've encountered, I might be able to look for it.

27:21 But at the time, I believe it was something like an extra 10 gigabytes of RAM or something.

27:27 And I don't think I ever would have spotted it just by reading the code.

27:31 Yeah.

27:31 Because it looks perfect.

27:32 It's clean.

27:33 It's readable.

27:34 It's optimized exactly for the scenario that you most of the time optimize it for.

27:38 So it doesn't look broken.

27:39 Yeah.

27:40 If you want to understand why something is using too much resources, like you need to measure it.

27:44 I built a memory profiler called Fil, F-I-L, which is designed to solve this problem, because I had tried the other tools available.

27:55 I decided they weren't sufficient.

27:56 Yeah.

27:56 So Fil, I think is really interesting.

27:59 And the thing that made it connect for me at first, I was like, well, we already have some pretty interesting ones.

28:06 I mean, you've got the built-in cProfile.

28:07 I think that only does CPU profiling, not memory.

28:10 We have memory_profiler, which will do memory profiling.

28:14 Yeah.

28:15 We have Austin.

28:16 Yeah.

28:17 Yeah.

28:17 We have Austin.

28:18 Are you familiar with Austin?

28:19 I've not used it.

28:20 I've used pyinstrument, and I know about Pi top and py-spy, and they're all sampling profilers.

28:27 Right.

28:28 And Austin's pretty interesting as well.

28:30 But with Fil, the way that you laid it out is really that a lot of these profilers are either general purpose or they're built around the idea of working on servers.

28:40 And long living processes that do short amounts of work many, many times.

28:45 Right.

28:45 Like a web server or something like that.

28:47 And that's a pretty different scenario of I have a script.

28:50 I need to run it once in order and then look at where the memory comes from.

28:56 Right.

28:57 Yeah.

28:57 So memory_profiler is the tool I ended up using when I was trying to reduce memory usage.

29:02 And memory_profiler will do this thing where you run it on a function and it says this line added 10K of memory or 100 megabytes of memory or whatever.

29:11 And if you're trying to find a memory leak, this is actually pretty useful.

29:14 Like you can say, I called this function, and now my memory usage is higher.

29:18 And so what happened?

29:19 So you can just figure out this function is where your memory is leaking.

29:24 But the thing that I was trying to do and what data processing applications, as you mentioned, are trying to do is reduce your peak memory.

29:31 The idea is that you're running this process.

29:35 It's going to load in some data.

29:36 It's going to process it.

29:37 And then it's going to write it out.

29:38 And it's going to exit.

29:39 And the peak memory usage is what determines how much hardware you need or virtual hardware.

29:45 Like it doesn't matter if like 99% of the time it's only using 100 megabytes, if 1% of the time it needs 60 gigabytes of RAM.

29:53 Like it's that peak moment in time that you need to.

29:56 That's what you have to put it in for.

29:58 Yeah.

29:58 The high watermark.

29:59 It's like you're building a dam.

30:00 Like you figure out what the highest flood you get is.

30:04 And the thing about memory_profiler is, you can run it on your function and it'll say this line of code added zero megabytes of RAM.

30:12 Like it measured before, measured after.

30:13 They're the same.

30:14 So no memory was added.

30:16 But that doesn't tell you.

30:18 It's fine, right?

30:19 Right.

30:20 But it may be that it allocated 20 gigabytes of RAM, did something, and then deallocated.

30:25 And so with memory_profiler you have to, like, recursively go through your whole code base, function by function, until you find that one line of code that's spiking things.

30:38 And so you can use it to figure out the peak memory, but it is a very manual, tedious process.
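
The gap between "memory added" and "peak memory" can be demonstrated with `tracemalloc` from the standard library. This is a sketch only: `spiky` is a hypothetical function, and `tracemalloc` sees only allocations made through Python's memory APIs, so it is used here as a convenient illustration rather than as a replacement for the tools discussed.

```python
import tracemalloc


def spiky():
    """Allocate a large temporary buffer, use it, and free it before returning."""
    temp = bytearray(20_000_000)
    length = len(temp)
    del temp
    return length


tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
spiky()
after, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

added = after - before         # what a before/after measurement reports
peak_increase = peak - before  # what actually determines the RAM you need
```

A before/after measurement sees almost nothing, while the peak shows the 20 MB spike that a machine would have to accommodate.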

30:45 Yeah.

30:46 And once your code base is large enough, it can become quite difficult.

30:51 And another big distinction between servers and data pipelines is how much you care about memory leaks.

30:56 As long as it's a small memory leak, like if you're doing like a process that runs for an hour and it leaked 100k, like after an hour, it'll just exit.

31:05 If you have, if you're leaking 100k an hour, but your process, you have like 10 processes and they're running for a year, 100k may still not be a problem.

31:14 But like there are some thresholds where for a server, like it accumulates and your server crashes.

31:18 And for a batch process, so long as it's not impacting the peak, you don't care.

31:22 Right.

31:22 Well, imagine you leak only one kilobyte of memory, but it's in the context of a web request and you're getting 100,000 web requests an hour.

31:31 All of a sudden, your server is toast, right?

31:34 Whereas if you call the function once and you leak a kilobyte and you're doing like a top to bottom run at once data pipeline, who cares?

31:41 Doesn't matter, right?

31:43 It's lost in the void there.

31:45 So I think also just the focus of what you care about is really different.

31:50 You don't generally have these huge spikes in server type applications.

31:54 You can if you're doing like reporting or other weird stuff, but like standard data driven stuff, it's pretty flat line.

32:00 Yeah.

32:01 And it turns out that if you think about it, a memory leak, a tool that can find peak memory can also find memory leaks.

32:08 Because if you have a memory leak, peak memory is always like right now.

32:13 Yeah, exactly.

32:14 If you just run for a while and peak memory, eventually like your memory is overwhelmed by the leak.

32:18 And then you dump the memory then.

32:20 And so that moment is peak memory.

32:22 So a tool that can find peak memory can deal with leaks, but a tool that deals with leaks can't necessarily help you with peak memory.

32:29 So it's actually a more general concept.

32:31 Talk Python To Me is partially supported by our training courses.

32:36 How does your team keep their Python skills sharp?

32:39 How do you make sure new hires get started fast and learn the Pythonic way?

32:43 If the answer is a series of boring videos that don't inspire or a subscription service you pay way too much for and use way too little, listen up.

32:52 At Talk Python Training, we have enterprise tiers for all of our courses.

32:56 Get just the one course you need for your team with full reporting and monitoring.

33:00 Or ditch that unused subscription for our course bundles, which include all the courses and you pay about the same price as a subscription once.

33:08 For details, visit training.talkpython.fm/business or just email sales at talkpython.fm.

33:15 Another thing I like to do is relate quantum mechanics back to programming ideas.

33:22 And I think they're really relevant in both profiling and debugging.

33:27 And that is the idea I'm thinking of is the observer effect, that by observing some phenomenon, you might actually change it, right?

33:35 Maybe the tool you're using to measure it actually makes some difference.

33:39 Or in quantum mechanics, like just insane, bizarre observer effect.

33:43 Things happen that, again, shouldn't, but it does.

33:47 One of the challenges I see around profiling is, especially instrumenting style profilers,

33:55 is you run it because it's too slow, you want to understand the performance.

33:59 So you apply the profiler to it.

34:02 Now it's 10 times slower or 20 times slower, but not evenly, right?

34:06 Like if it's in a really tight loop, that part slows down more than if you're calling like a C function that you're not technically profiling that part,

34:13 but it's still slow.

34:14 That might not really slow down at all.

34:15 So you might exaggerate different parts of it as well.

34:19 And it sounds to me like Fil doesn't have much of this observer problem.

34:23 Yeah.

34:23 So the observer problems tend to be worse in CPU profiling, because as you said, like the act of profiling can change how fast the process runs or which parts of the code run faster.

34:34 So cProfile suffers from this, because it's adding overhead per Python function.

34:39 And so code that has a lot of Python functions will be slower than code that has fewer Python functions, even if the actual runtime is the same.

34:46 So it adds overhead unevenly.

34:48 And the solution for CPU profiling is actually sampling.

34:51 Where, like, a thousand times a second, you see what's running right now.

34:55 And I believe Austin works that way, and py-spy and pyinstrument.

34:59 All right.

35:00 It's more like a helicopter parent.

35:02 Like, what are you doing?

35:03 What are you doing?

35:04 Instead of actually walking along every step, just constantly asking.

35:08 Yeah.

35:09 And so then it gets a chance to run faster or whatever when it's not asking.

35:12 Yeah.

35:12 The impact is quite minimal.

35:14 And because slower CPU functions will show up more when you're just peeking every once in a while, like statistically it'll converge.

35:22 You'll get an overview of where performance is being spent that isn't exactly right, but it's close enough that it doesn't matter that it's not exact.

35:31 So that's CPU.

35:32 In memory, sampling might work well for something like a memory leak.

35:38 Because with a memory leak, like eventually all your memory usage is this one function being called over and over.

35:43 So if you only check some of the time, it's like eventually you'll catch it.

35:48 But if you care about the peak, you have to maybe not have to capture all the allocations, but like you may have like one specific allocation that's like 20 gigabytes.

35:59 That's what causes your peak.

36:00 And if your sampling doesn't catch it, then the sampling, the profiling is useless.

36:06 And so effectively, one way or another, you have to track every memory allocation if you actually want to find peak memory.

36:13 So whereas sampling is the superior implementation approach for CPU, if you care about the high watermark or peak memory, instrumentation is often the only way to go.

36:26 If you have uneven allocation patterns, which is the case in data processing applications.

36:31 Right.

36:31 Yeah.

36:32 And it sounds like maybe a 50% speed hit is what the docs say.

36:36 That doesn't sound too bad.

36:38 Yeah.

36:38 I mean, it's probably slower in some cases and faster in others.

36:43 This is what you get if you run pystone.

36:44 Yeah.

36:45 It's not like a thousand percent or something like that.

36:47 Right.

36:47 Yeah.

36:47 And basically, once your profiler is slow enough, people just don't use it because they don't have the patience.

36:54 Yeah.

36:54 So a lot of the effort I put, like, the basic idea of what it does is not that sophisticated.

37:02 It's basically, like, you intercept all memory allocations and keep track.

37:06 And then whenever you hit a new peak, you store a copy of that so that you know that's the peak.

37:12 It's just if you want to do that with low overhead, that takes work.
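
The basic idea described here (intercept every allocation, keep a running total, and save a copy of the breakdown whenever a new peak is reached) can be sketched as a toy in pure Python. This is an illustration of the concept only, not Fil's actual implementation; the `PeakTracker` class, its method names, and the call-site labels are all hypothetical.

```python
from collections import defaultdict


class PeakTracker:
    """Toy model: track live allocations and snapshot the breakdown at each new peak."""

    def __init__(self):
        self.live = {}                       # address -> (size, callsite)
        self.by_callsite = defaultdict(int)  # callsite -> live bytes
        self.current = 0
        self.peak = 0
        self.peak_snapshot = {}

    def on_alloc(self, address, size, callsite):
        self.live[address] = (size, callsite)
        self.by_callsite[callsite] += size
        self.current += size
        if self.current > self.peak:
            # New high watermark: remember who was holding memory right now.
            self.peak = self.current
            self.peak_snapshot = {k: v for k, v in self.by_callsite.items() if v}

    def on_free(self, address):
        size, callsite = self.live.pop(address)
        self.by_callsite[callsite] -= size
        self.current -= size


tracker = PeakTracker()
tracker.on_alloc(0x1000, 100, "load_data")
tracker.on_alloc(0x2000, 900, "make_copy")   # temporary spike
tracker.on_free(0x2000)
tracker.on_alloc(0x3000, 50, "write_output")
```

At the end only 150 bytes are live, but the snapshot still records that `make_copy` held 900 of the 1,000 bytes in use at the peak. Copying the breakdown on every new peak is the naive version; doing this with low overhead is where the real work lies.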

37:16 Right.

37:16 Absolutely.

37:18 So one of the challenges is the reason you're using the profiler probably is because you have a lot of data and you built it in some small scenario and then you run it in the real scenario, then it actually is not doing as well as you'd hoped.

37:28 That's exactly when you need to be able to run it with the profiler.

37:32 And you need it to work fast, I guess is what I'm saying, to really use it in real scenarios.

37:36 Yeah.

37:36 And another thing I've done to handle that, which, and this is a new project, so this is all, like, work in progress.

37:43 But I've gotten at least one success story of someone saying that, within minutes, they found a memory issue they wouldn't have found otherwise.

37:52 So I know it's useful for some people and other people have bugs.

37:56 But there's another feature that I've added. The worst case scenario for running out of memory is your program just crashes.

38:04 And this can be as bad as, like, your computer just wedges altogether, which is not uncommon.

38:09 Like, everything's just become so utterly slow that, like, yes, if you left it alone for a day, it'd come back, but you're forced to restart it.

38:19 Or you get, like, or it just crashes.

38:20 And you can do a core dump, but, like, the core dump doesn't tell you.

38:23 In theory, it has information you want.

38:25 Yeah, in practice, that's a whole other level right there.

38:28 Yeah.

38:28 Or it actually does not have the information you want, to come to think of it.

38:32 Another feature I've added is that Fil makes some attempt to handle out-of-memory crashes.

38:39 Like, if you run out of memory, it'll say, like, okay, you just got a failed allocation, so I'm going to try to deallocate all the large allocations that I know about just to free up some memory.

38:49 And it has, like, this emergency stash of, like, 16 megabytes that it just allocates up front, and so it...

38:55 It breaks the glass.

38:56 It deallocates that memory, so there's a bit more.

38:58 It lets it go and then starts tearing stuff down as hard as it can.

39:00 Yeah.

39:01 And then it tries to dump a report of, like, this is what your memory usage was.
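
The "emergency stash" pattern can be illustrated in pure Python. This is only a sketch of the general technique under stated assumptions: Fil does this at the native allocator level, whereas the example below works at the level of Python's `MemoryError`, and every name in it (`EmergencyReserve`, `run_with_oom_report`, `failing_task`) is hypothetical. The failure is simulated rather than triggered by real memory exhaustion.

```python
class EmergencyReserve:
    """Hold a block of memory up front so it can be released when memory runs out."""

    def __init__(self, size_bytes=16 * 1024 * 1024):
        self._block = bytearray(size_bytes)

    @property
    def held(self):
        return self._block is not None

    def release(self):
        # Dropping the only reference frees the block for the report code to use.
        self._block = None


def run_with_oom_report(task, write_report, reserve):
    """Run task; on MemoryError, free the reserve and then produce a report."""
    try:
        return task()
    except MemoryError:
        reserve.release()
        return write_report()


def failing_task():
    # Simulated allocation failure, so the example is safe to run.
    raise MemoryError("simulated allocation failure")


reserve = EmergencyReserve()
outcome = run_with_oom_report(failing_task, lambda: "report written", reserve)
```

The reserve exists so that the code writing the report has some memory to work with at the moment nothing else is available.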

39:04 And it won't always work, and I suspect it needs a bunch more work.

39:08 Like, it needs a bunch of optimization, because dumping the report from Fil itself takes memory.

39:13 But the idea, like, my goal, at least, is that when you run out of memory, instead of just a crash, you'll actually get some feedback that will help you diagnose the problem.

39:24 Yeah, that's really, really cool.

39:25 I don't know how cProfile, excuse me, I don't know exactly how deep its reach is.

39:32 But in cProfile, if I'm trying to look at, say, data science stuff, and I'm calling a library, and it's using its internal malloc and its internal C stuff to manage the memory down in the C layer, I don't know if cProfile will check that.

39:48 You know, if it's doing, like, crazy Fortran stuff or other allocations, who knows?

39:53 So cProfile, I mean, it's giving you CPU, but it's...

39:57 Yeah, I'm sorry, memory_profiler, the one that does memory.

40:00 Yeah, yeah.

40:00 So Python actually has a built-in memory profiler, tracemalloc, but it only knows about Python memory APIs.

40:09 Like, if you're using an arbitrary C++ library, it won't know about it.

40:12 Which is common in the data science world, right?

40:14 I mean, that's exactly where a lot of the action is.

40:16 Yeah.

40:17 Yeah, memory_profiler has a bunch of different ways it can work, but it can actually...

40:23 The most general way it works is, like, at the beginning of the line of code, the end of the line of code, it checks just how much memory that process is using.

40:29 And so it'll work with any allocation, but it has the other downsides that we talked about earlier.

40:35 So memory_profiler can actually...

40:38 The reason I was using it was because it can actually catch any allocation from any C library.

40:43 I see.

40:44 Painfully, for purposes of reducing memory usage.

40:47 Yeah, for sure.

40:48 And so my goal with Fil was to not just be tied to Python code allocations and be able to just generically support anything that any third-party library is using.

41:00 Yeah.

41:03 Yeah.

41:31 because the operating system will cleverly load and unload stuff from disk on demand.

41:36 And so it is affecting how much memory you use, but the OS will sort of optimize it for you.

41:41 So it's not clear how to measure it.

41:43 So there's a lot of ways that if you want to track everything, like there's a lot of them,

41:48 and I don't do all of them quite yet.

41:50 But I've been sort of adding them one by one and hope to cover the vast majority of cases pretty soon.

41:57 Yeah, but you covered some of these at least already, huh?

42:00 Yeah.

42:01 I cover basic mmap usage, and malloc, calloc, and realloc, which, as I said, are the standard APIs, and I added aligned_alloc, which C++ uses.

42:10 And apparently, at least in some cases, Fortran. I've never done anything with Fortran.

42:16 I just know that it's a thing that scientific computing uses.

42:19 And so like I said, okay, I'm going to figure out if Fortran is covered by this.

42:23 And it turns out that traditionally Fortran never actually had memory allocation.

42:28 You would just like write some code and you would say, I'm going to have this array and that's all you ever got.

42:34 But modern Fortran from 1990 onwards has explicit allocation.

42:38 And Fil can capture that, at least if you use GCC's Fortran compiler.

42:43 And so the idea is you should be able to just take arbitrary data processing or scientific computing code and it will figure out those allocations.

42:51 It won't tell you like which line of Fortran and which line of C was responsible because that's like there are tools that do that.

42:59 But the performance overhead is immense.

43:02 But it will tell you at least which line of Python was responsible and much of the time that's sufficient.

43:07 Right. And as a Python developer, really, that's kind of the answer you want.

43:10 You don't want to know that like this internal part of NumPy did it.

43:13 You just want to know I called, you know, load CSV on Pandas or something.

43:18 And that's where the memory is or something.

43:20 Yeah, exactly.

43:20 You want to see the kind of boundary into that library because that's where you control.

43:25 You're not going to go rewrite Pandas or NumPy.

43:26 Yeah.

43:27 And yeah, much of it.

43:28 So yeah, the goal of Fil is to tell you where in your Python code the memory usage was, and not only tell you that, but tell you in a very easy to understand way, which was another one of my goals.

43:41 Yeah. So you want to tell people maybe describe the flame graphs that they can see and explore.

43:45 Yeah.

43:46 And maybe we can link to one in the show notes.

43:48 So flame graph, I think Brendan Gregg came up with the idea.

43:52 And the idea is, in your program, at any point you have like a call stack, like you have function F calls function G calls function H.

44:02 That's sort of a stack.

44:04 And so you draw these bars where the wider they are, the more resources they're using.

44:10 Brendan Gregg originally did this for CPU.

44:12 I'm using it for memory.

44:13 And the idea is, if you have a bar that's like 100% of the width of the screen, that means it, or the functions it calls, are using all your memory.

44:22 If it's like narrower, it's using less memory.

44:25 And then I've arranged it in a way that it actually includes the source code.

44:28 So what you're reading looks like a stack trace.

44:30 It looks like something from a thrown exception and you're just reading it.

44:34 But the width of the bar shows you which lines of code were responsible for how much memory cumulatively.
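
The data behind a memory flame graph can be sketched as follows. The sample allocations and function names are made up. The semicolon-separated "folded stack" text lines are the input format used by Brendan Gregg's flame graph scripts and by Inferno; how Fil produces its own data internally is not shown here.

```python
from collections import Counter

# Hypothetical (call stack, bytes allocated) records, outermost frame first.
allocations = [
    (("main", "load", "read_csv"), 700),
    (("main", "load", "read_csv"), 100),
    (("main", "summarize"), 200),
]


def fold_stacks(records):
    """Sum bytes per unique stack, keyed as 'outer;inner;innermost'."""
    folded = Counter()
    for stack, size in records:
        folded[";".join(stack)] += size
    return folded


def cumulative_by_frame(records):
    """Bytes attributed to each frame including everything it calls (the bar width)."""
    totals = Counter()
    for stack, size in records:
        for depth in range(1, len(stack) + 1):
            totals[stack[:depth]] += size
    return totals


folded = fold_stacks(allocations)
widths = cumulative_by_frame(allocations)
folded_lines = [f"{stack} {size}" for stack, size in sorted(folded.items())]
```

In this sample, `main` spans the full width (1,000 bytes), `load` covers 80% of it, and `summarize` the remaining 20%, which is what the relative bar widths in the rendered graph convey.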

44:40 I also added some stuff. I'm building on a Rust library called Inferno, which is great, which did much of the heavy lifting.

44:47 But I added a feature to Inferno where the wider the bar, the more memory it's using, the redder it is.

44:53 And so the idea is you just look at the graph and you can just see like where it's red is where.

44:58 Where is it red?

44:59 That's the problem.

45:00 That's the thing you got to focus on, right?

45:01 Yeah.

45:02 It really focuses on the expensive parts of the code.

45:05 And then what you're reading is a stack trace.

45:07 Yeah.

45:08 And these are cool.

45:09 You can embed these into the web pages and then you can hover over them and click and like zoom into the functions and really explore it quick and easy, right?

45:17 Yeah.

45:18 He originally wrote this sort of Perl script that converted data into these SVGs, and then the Inferno library ported that to Rust, and so I'm using it.

45:26 So they did much of the work and I'm just building on top of it mostly.

45:30 So I added a few small features.

45:32 Yeah.

45:33 It's like this whole UI for exploring there.

45:35 To use it is super simple.

45:36 Like if you were going to run python yourapp.py with its arguments, you would just replace python with fil-profile run and that's it, right?

45:47 And you get this output.

45:48 Yeah.

45:48 My goal was also no options.

45:51 Like, people don't run memory profiling every day.

45:56 Like it's not like a tool you want to tweak and customize your own personal needs or that you want to spend a lot of time learning.

46:02 So another of my goals is just it should just work.

46:05 So at the moment it has one command-line option, which is where to dump the data, and you don't need to set that or think about it.

46:11 And then the output is like a HTML page that has the graphs embedded and has some explanations.

46:17 And so the goal is as much as possible to make it as sort of transparent and easy to use.

46:23 And I have some further ideas of how to improve the UX, which I haven't gotten to yet.

46:28 Nice.

46:29 So if I'm like a data scientist or a computing person who is not necessarily a programmer, I could just drop in here, pip install filprofiler, fil-profile run my thing, where normally I would just say python run.

46:39 And that's, that's all I really got to know.

46:41 And then I just look at a web page.

46:43 Yeah.

46:44 It'll open the web page automatically if it can.

46:46 So you don't even have to.

46:47 Yeah.

46:48 If you're, the goal is you run it and it, yeah.

46:50 Yeah.

46:51 If you're, the goal is you run it, it pops up a web page, read the web page and you have the answer.

46:55 Yeah.

46:56 What's using where memory is going.

46:57 You spoke about one of the cool features being the out of memory catch and analysis, and you've got to do a slightly different thing on the command line to make that work.

47:06 Right?

47:06 Yeah.

47:07 The issue is, and this is a thing I can probably fix eventually.

47:11 It's just, this is sort of a limit in my implementation.

47:13 The code that generates the report right now is in Python.

47:16 And if you just run out of memory, you can't go back into Python at that point.

47:20 Yeah.

47:21 So if you run out of memory, the experience isn't quite as nice.

47:25 Eventually, if it reaches the point where I'm not iterating on it as quickly, I might rewrite that in Rust.

47:33 And then at that point, it might be feasible to actually have the fully nice UI in the crash case.

47:39 Yeah.

47:39 Right.

47:40 Okay, cool.

47:41 Now also currently it runs on POSIX, Linux and macOS only, right?

47:46 Yeah.

47:47 I would expect that.

47:48 I'm not sure it would run on anything other than those, though if you run this on FreeBSD, my guess is it will work.

47:52 Yeah.

47:53 But I don't think-

47:54 Linux and macOS, yeah.

47:55 Yeah.

47:56 Yeah.

47:56 I don't think data scientists or scientists are using much FreeBSD.

47:58 Right.

47:59 Yeah.

48:00 And macOS was added fairly recently.

48:04 And someday I would like to add Windows, but there's a lot of dealing with linkers and fairly low-level details that I don't know as much about on Windows.

48:18 So it should be possible.

48:19 I've seen things that make me think that it is possible.

48:23 It's just a chunk of work I haven't done yet, because it's hard.

48:27 Sure.

48:28 Yeah.

48:29 You've either got to get it working or-

48:31 Yeah.

48:32 Just supporting macOS, that was a lot of work, so.

48:34 Yeah.

48:35 Yeah.

48:36 I'm sure it was.

48:37 So I actually think that maybe you don't have to worry too much about Windows.

48:41 And that's not to say that people don't use Windows.

48:43 Windows is used by about half of Python developers, and it's probably pretty heavy in the data science world as well.

48:49 But, you know, Windows 10 now has Windows Subsystem for Linux, and version 2 is quite nice.

48:55 So it's very possible you can just tell people, you know, you have to use Windows Subsystem for Linux.

49:00 It would probably work, because it's all APIs that I would expect are emulated fairly faithfully.

49:07 Yeah.

49:07 I think it's just a Linux virtual machine, so I don't think you have to do anything.

49:12 My impression is that, well, at least the original one was rather more sophisticated.

49:15 Like, there was something about translating syscalls.

49:18 I don't know about version two.

49:19 But yeah, there's a decent chance it'll work just fine on WSL.

49:23 Yeah, I'll put a link to Chris Moffitt's article on using WSL to build a Python development environment on Windows.

49:32 And maybe that'll help people in general.

49:34 Maybe this will work.

49:35 I don't know.

49:35 We can give it a try.

49:36 Cool.

49:37 And then you also, you know, it's one thing to just say, well, too bad that didn't work.

49:43 It's a lot better to say, and here are some ideas for making it better.

49:46 So you have a couple of recommendations for data scientists on how to be more efficient with their code and their memory.

49:53 So I talked earlier about batching, indexing, and compression.

49:57 And I was actually supposed to give a talk at PyCon about that this year.

50:02 I mean, there's a recording of it, but I never gave it live.

50:05 And there's a series of articles here that sort of talk about those ideas and then show how to apply them in NumPy, show how to apply them in Pandas.

50:12 And I started writing some articles about Python-level issues, like we talked about with function calls, and ways to structure your code to reduce memory usage.

50:23 So there's a bunch of articles already, and I'm adding more over time, with the techniques you need to reduce the memory usage once you figure out where the problem is.
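
Of the three techniques named here, batching is the easiest to show in a few lines. The following is a minimal sketch using only the standard library, not code from the articles or the talk; the helper names batches and mean_in_batches and the column name value are made up for illustration.

```python
import csv
import io
from itertools import islice


def batches(rows, size):
    """Yield lists of at most `size` items so only one batch is in memory at a time."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def mean_in_batches(lines, column, batch_size=1000):
    """Compute the mean of one CSV column without loading every row at once."""
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    for batch in batches(reader, batch_size):
        total += sum(float(row[column]) for row in batch)
        count += len(batch)
    return total / count if count else 0.0


# A tiny in-memory stand-in for a large file on disk.
sample = io.StringIO("value\n1\n2\n3\n4\n")
print(mean_in_batches(sample, "value", batch_size=2))  # 2.5
```

Peak memory here depends on the batch size rather than on the size of the whole file, which is the point of batching.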

50:34 Right, right.

50:35 Yeah.

50:35 I just saw your video.

50:37 I didn't realize, I didn't watch it yet.

50:39 So I'll put a link to it in the show notes so people can watch your virtual PyCon talk.

50:43 Yeah.

50:44 I've been going to PyCon for a very long time.

50:47 And so it's just really sad not being able to see friends that I see only once a year.

50:51 I know, PyCon is like my geek holiday, you know, just get out there and hang out with a lot of my friends that I otherwise only interact with online.

51:01 And it's really special.

51:02 It's too bad it didn't happen this year.

51:03 Yeah.

51:04 Someday.

51:05 Yeah.

51:05 It'll be back someday, like everything.

51:08 All right.

51:09 Well, these are really interesting ideas.

51:12 I think covering them in general was good.

51:14 And Fil is a cool project.

51:15 So I think it'll help some people out there who are having challenges.

51:19 Maybe their code is using too much memory and swaps out and becomes insanely slow, or they just couldn't process the data they wanted because it didn't work.

51:27 So they can hit it with this, use some of your recommendations and maybe unlock some answers.

51:32 Yeah.

51:32 I should add, this is a very new project.

51:34 And so like, I know one person for whom it worked right, but I also know one person for whom it just wildly misreported the memory usage.

51:43 Okay.

51:44 He's hoping to send me a reproducer later this week.

51:46 We can fix it.

51:47 So if it doesn't work, I very much encourage you to file a bug report.

51:52 Let me know.

51:53 I'm happy to do a screen sharing session.

51:55 To help people debug it, just because I want this to be a tool that works for people.

52:01 And so if it's not working, I want to help.

52:03 And it's an early enough stage that I expect that there are still a bunch of major issues, even if it does actually work in some cases.

52:11 So please try it.

52:12 It might just work.

52:13 And if it doesn't, please let me know.

52:15 I'll do my best to help.

52:16 Yeah.

52:16 Very cool.

52:17 And speaking of which, you know, people are asking me recently, hey, I'm looking for an open source project to contribute to.

52:22 Do you have any recommendations on ones I might look at or consider contributing to?

52:27 What's the story there?

52:28 Are you looking for people who might participate?

52:30 I would be happy to accept contributions.

52:33 There's a lot of fun stuff in there.

52:37 In terms of low-level systems programming, there's a bunch of Rust and a bunch of C code and poking into the internals of CPython.

52:47 If that is a thing that interests you, there's a bunch of work there.

52:50 There's also a bunch of UI things that could be done.

52:53 Like, if you think about profiling, the real usage pattern should really be profile this program, try to fix it, and then say, profile this again and show me the difference.

53:04 And then you can have a visualization of the differences.

53:07 My eventual goal is to have a user experience that's not just what you use now, but actually shows you if things are better or worse, and where.
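
The profile, fix, profile again loop can be approximated today with the standard library. The sketch below uses tracemalloc as a stand-in, not Fil, and the function names peak_bytes, before, after, and compare are hypothetical; it only compares peak memory for two versions of the same computation.

```python
import tracemalloc


def peak_bytes(func):
    """Run func and return the peak number of bytes tracemalloc saw allocated."""
    tracemalloc.start()
    try:
        func()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def before():
    # Original version: builds the whole list before summing it.
    return sum([i * i for i in range(200_000)])


def after():
    # "Fixed" version: a generator keeps one value alive at a time.
    return sum(i * i for i in range(200_000))


def compare(old, new):
    """Profile both versions and report the difference in peak memory."""
    old_peak = peak_bytes(old)
    new_peak = peak_bytes(new)
    return {"before": old_peak, "after": new_peak, "saved": old_peak - new_peak}


report = compare(before, after)
print(f"peak before: {report['before']} bytes")
print(f"peak after:  {report['after']} bytes")
print(f"saved:       {report['saved']} bytes")
```

tracemalloc only sees allocations made through Python's allocators, so this is a rough comparison rather than the full picture a dedicated profiler aims for.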

53:15 So if people are interested in that sort of UX work, there's room there.

53:21 What about building like tutorials and stuff like that?

53:25 Yeah.

53:26 I mean, in general, it'd be exciting to see people pick it up, but at the same time.

53:33 Yeah.

53:34 Some low level stuff, right?

53:35 You will hit these places where, like, I'm causing slight memory leaks internally in CPython for optimization purposes.

53:45 Yeah.

53:46 Things like that.

53:47 Sure.

53:48 Because you want to be able to refer to pointers. There's a bunch of work in order to not have a lot of overhead when you report a new allocation.

53:56 And so you want to be able to keep a pointer address in the Python interpreter as a persistent key, which means you have to make sure things don't get garbage collected.
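
A pure-Python analogy may make the idea concrete. In CPython, id() returns an object's memory address, and that address can be handed to a new object once the old one is freed, so using it as a long-lived key only works if the object is kept alive. This is a sketch of the principle with hypothetical names (CodeLocation, LocationRegistry), not Fil's actual C and Rust implementation.

```python
import gc
import weakref


class CodeLocation:
    """Stand-in for an interpreter object that a profiler wants to refer to."""

    def __init__(self, name):
        self.name = name


class LocationRegistry:
    """Use an object's address as a cheap key, pinning the object so the key stays valid."""

    def __init__(self):
        self._by_address = {}

    def key_for(self, location):
        key = id(location)  # in CPython this is the object's memory address
        # Holding a strong reference is a deliberate small "leak": the object
        # can never be freed, so its address can never be reused by another object.
        self._by_address[key] = location
        return key

    def lookup(self, key):
        return self._by_address[key]


registry = LocationRegistry()
location = CodeLocation("load_data")
probe = weakref.ref(location)
key = registry.key_for(location)

del location
gc.collect()

print(probe() is not None)        # True: the registry keeps the object alive
print(registry.lookup(key).name)  # load_data
```

The cost is that pinned objects are never released, which matches the "slight memory leaks" trade-off mentioned in the conversation.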

54:04 Yeah.

54:05 Makes sense.

54:05 Yeah.

54:06 I can imagine low level.

54:07 This is a beast.

54:08 Yeah.

54:09 The debugging can be tricky, but it's a lot of fun.

54:12 And I find it sort of a therapeutic project, because it's tricky and difficult, but it's also like a closed universe.

54:22 You know, you're doing web development or distributed systems.

54:25 You're talking to remote services, you have to spin up five processes, and you're dependent on the whole external world to make anything work these days.

54:35 Whereas this is a program that runs on your computer, reads some data, writes some data.

54:40 Like, there's no outside world.

54:43 Yeah, that's cool.

54:44 So you can stay focused on the problem at hand and not the fact that GitHub is down or whatever.

54:51 Yeah, I've been there.

54:52 All right.

54:53 Before I let you out of here, though, let me ask you the final two questions.

54:55 If you're going to write some Python code, what editor do you use?

54:58 I use Spacemacs, which is a configuration of Emacs that makes Emacs a lot more like a modern IDE.

55:04 Nice.

55:05 Okay.

55:06 It makes the Emacs experience jump 20 years forward.

55:10 That's awesome.

55:11 Just by installing and configuring the right packages.

55:12 Cool.

55:13 And a notable PyPI package?

55:16 pip install.

55:17 Is it fil or fil-profile?

55:18 What have I got to pip install?

55:19 It's filprofiler, no dash or hyphen.

55:22 So like F-I-L-P-R-O-F-I-L-E-R.

55:26 That's an obvious one.

55:27 What's another one that maybe you've come across recently and you're like, oh, this is really cool.

55:30 People should know about.

55:31 Nothing is, I guess, to mention Austin.

55:36 I don't know quite as much about it, but py-spy is another sampling profiler.

55:42 And it's another kind of systems programming package, where it's doing these interesting things in Rust.

55:52 It looks at the memory layout of your Python program, parses out the data structures, and reads things out.

55:58 So it's another sort of very intense systems programming, which ideally is all hidden behind the scenes.

56:05 It just gives you really useful results.

56:06 Cool.

56:07 All right.

56:08 That's a good one.

56:09 Yeah.

56:10 I have to check it out.

56:11 I haven't tried that one.

56:12 All right.

56:13 Final call to action.

56:14 People want to get started with Fil.