GPU Programming in Pure Python
Episode Deep Dive
Guest introduction and background
Bryce Adelstein-Lelbach is a programming language and high-performance computing expert at NVIDIA. He began by coding text-based MUDs before contributing to the Boost C++ libraries and working at a supercomputing research center. Bryce holds an applied mathematics degree, helped develop the HPX distributed runtime, and served on the C++ Standards Committee. Over eight years at NVIDIA, he led the CUDA C++ core libraries team and now champions Python-centric GPU tooling, including the CUDA Python SDKs.
What to Know If You're New to Python
Here are a few things that will help you follow the episode and explore GPU computing with Python:
- NumPy‑style arrays rule the day: Many GPU tools mirror the NumPy API so baseline NumPy knowledge transfers almost directly.
- CUDA is NVIDIA‑only: The frameworks discussed (CUDA Python, Numba CUDA, CuPy) require an NVIDIA GPU; they will not run on integrated or non‑NVIDIA graphics cards.
- JIT means “just‑in‑time” compilation: Numba compiles selected Python functions to machine code at runtime, eliminating Python’s normal overhead where it matters.
- Start in the cloud if you lack hardware: Google Colab provides free T4-powered notebooks, and Compiler Explorer offers an online GPU compiler sandbox, so you can experiment without buying a GPU.
Key points and takeaways
Achieving near-native GPU performance in Python
NVIDIA's CUDA Python SDKs bridge the gap between Python's high-level expressiveness and GPU hardware, delivering performance similar to C++. By leveraging JIT compilation and efficient C bindings, Python code can offload compute-intensive tasks directly to the GPU with minimal overhead. This removes common barriers for data scientists who previously relied on C++ or specialized DSLs. Real-world benchmarks show matrix multiplications and deep-learning kernels reaching nearly identical throughput to native CUDA C++ implementations.
Overview of the CUDA Python meta-package
CUDA Python is a meta-package that centralizes core GPU components (CuPy, cuDF, cuda.core, Numba CUDA, nvmath-python) under a single pip install. This ensures compatible versions of each library and simplifies dependency management for Python developers. Rather than juggling multiple installs, you get a cohesive stack that's tested to work together seamlessly. Future updates to individual modules roll out through this unified channel, keeping your environment stable and up to date.
Accelerated array computing with CuPy
CuPy mirrors NumPy's API for GPU arrays, letting you accelerate array math, FFTs, and linear algebra by simply changing your import. Under the hood, CuPy manages GPU memory, asynchronous operations, and streams to maximize utilization. You can interoperate with CPU arrays via cp.asnumpy() and move data back and forth efficiently. With lazy allocation and just-in-time kernel fusion, CuPy often outperforms straightforward element-wise loops by reducing launch overhead. A minimal sketch follows.
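To make the CuPy takeaway concrete, here is a minimal sketch; the array size and the sqrt/sum operations are arbitrary illustrations, not from the episode:

```python
import numpy as np
import cupy as cp  # requires an NVIDIA GPU and CUDA drivers

# Host (CPU) data, then a copy on the device (GPU)
x_cpu = np.random.rand(1_000_000).astype(np.float32)
x_gpu = cp.asarray(x_cpu)               # host -> device transfer

# Same NumPy-style calls, executed on the GPU
y_gpu = cp.sqrt(x_gpu) * 2.0 + 1.0
total = cp.sum(y_gpu)                   # reduction on the device

# Bring results back to the CPU only when needed
y_cpu = cp.asnumpy(y_gpu)
print(float(total), y_cpu[:3])
```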
GPU DataFrames with cuDF and RAPIDS
cuDF offers a pandas-style DataFrame API that executes on GPUs, enabling blazing-fast groupbys, joins, and aggregations. Part of the RAPIDS ecosystem, it supports GPU-accelerated I/O from CSV and Parquet, as well as UDFs through Numba. Combined with Dask, cuDF scales across multiple GPUs and nodes for truly large-scale analytics. Common workflows (filtering, computing rolling statistics, merging tables) run orders of magnitude faster than on CPU.
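A hedged cuDF sketch of the kind of workflow described above; the file name and column names are made up for illustration:

```python
import cudf

# Read a Parquet file straight into GPU memory (hypothetical path and columns)
df = cudf.read_parquet("sales.parquet")

# pandas-style filtering, groupby, and aggregation, all executed on the GPU
big = df[df["amount"] > 100.0]
summary = big.groupby("state")["amount"].agg(["sum", "mean", "count"])

# Convert to pandas only when you need CPU-side tooling
print(summary.to_pandas().head())
```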
Orchestrating GPU tasks with cuda.core
The cuda.core API maps directly to CUDA C++ runtime functions, exposing memory allocation (cudaMalloc), kernel launch syntax, streams, and events. You can create multiple streams for concurrent workload execution and use events for fine-grained synchronization. Device queries let you inspect GPU properties (compute capability, memory size) at runtime. This level of control is essential for tuning performance in production systems, where explicit management of concurrency and memory placement matters.
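As a rough illustration of this kind of orchestration, here is a sketch using the low-level cuda-python runtime bindings (cuda.cudart), which mirror the CUDA C runtime one-to-one. cuda.core wraps the same concepts in a more Pythonic, still-evolving API, so treat the exact calls below as an assumption about the lower-level bindings rather than the cuda.core interface itself:

```python
from cuda import cudart

def check(err):
    # The bindings return a status code first; raise on anything but success
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"CUDA error: {err}")

# Device queries: how many GPUs, and what are their properties?
err, count = cudart.cudaGetDeviceCount(); check(err)
err, props = cudart.cudaGetDeviceProperties(0); check(err)
print(count, props.totalGlobalMem, props.major, props.minor)

# Streams and events for concurrency and fine-grained synchronization
err, stream = cudart.cudaStreamCreate(); check(err)
err, event = cudart.cudaEventCreate(); check(err)

# Raw device memory allocation (the Python counterpart of cudaMalloc)
err, d_ptr = cudart.cudaMalloc(1 << 20); check(err)   # 1 MiB

# ... enqueue kernels on `stream`, record `event`, etc. ...

err, = cudart.cudaEventRecord(event, stream); check(err)
err, = cudart.cudaStreamSynchronize(stream); check(err)
err, = cudart.cudaFree(d_ptr); check(err)
err, = cudart.cudaEventDestroy(event); check(err)
err, = cudart.cudaStreamDestroy(stream); check(err)
```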
Writing custom GPU kernels using Numba CUDA
Numba's @cuda.jit decorator compiles Python functions into GPU kernels, handling grid and block dimensions for you. Inside a kernel, you use cuda.grid(1) or cuda.grid(2) to compute each thread's global index. Shared memory and thread synchronization primitives (cuda.syncthreads()) enable sophisticated parallel algorithms. Numba also supports device functions, allowing you to structure complex kernels with nested calls and achieve high occupancy.
The role of DSLs and JITs in Python GPU programming
Python's flexible AST makes it ideal for building DSLs (like Triton, CuPy's kernel fusion, or MLIR dialects) that express GPU code in Pythonic syntax. JIT compilers transform these high-level constructs into optimized machine code, eliminating Python's interpreter overhead. This approach lets you experiment rapidly and iterate on performance without rewriting in C++. NVIDIA's MLIR efforts aim to standardize these DSLs, enabling any language targeting MLIR to generate GPU code.
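One small taste of this DSL style is CuPy's kernel-fusion decorator, which JIT-compiles a Python function of element-wise operations into a single fused GPU kernel; the function body here is an arbitrary illustration:

```python
import cupy as cp

@cp.fuse()
def saxpy_relu(a, x, y):
    # Element-wise expression; CuPy fuses it into one GPU kernel instead of
    # launching a separate kernel per operation.
    return cp.maximum(a * x + y, 0.0)

x = cp.random.random(1_000_000).astype(cp.float32)
y = cp.random.random(1_000_000).astype(cp.float32)
z = saxpy_relu(2.0, x, y)
```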
Core parallel algorithms: transform, reduce, scan
Fundamentally, GPU workloads decompose into three primitives (a short sketch follows the list):
- Transform: element-wise functions (e.g., y = f(x)) executed across an array.
- Reduce: aggregations (sum, max) combining values into a single result.
- Scan: prefix sums providing cumulative results, essential for stream compaction and filtering.
Libraries like Thrust and CUDA Cooperative implement these efficiently, letting you build higher-level data pipelines. Mastering these primitives covers most parallel patterns in data science and HPC.
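Illustrative only: the three primitives expressed with CuPy's NumPy-compatible calls (Thrust, CUB, and the cooperative libraries expose the same patterns at a lower level):

```python
import cupy as cp

x = cp.arange(1_000_000, dtype=cp.float32)

y = cp.sqrt(x) + 1.0        # transform: apply f element-wise
total = cp.sum(y)           # reduce: combine everything into one value
prefix = cp.cumsum(y)       # scan: running (prefix) sums, position-aware

# Stream compaction ("copy_if" / filter), the classic use of a scan,
# expressed here with boolean masking:
kept = y[y > 500.0]
```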
GPU architecture fundamentals: threads, warps, blocks, memory
GPUs organize computation hierarchically: thousands of threads grouped into warps (32 threads), which form blocks that share on-chip "shared memory." Coalesced access to global memory and avoiding bank conflicts in shared memory are critical for peak throughput. Divergent branching within a warp can serialize execution and degrade performance. Registers, local memory, and global memory each have different latencies; optimal kernels keep data in registers where possible and minimize global accesses.
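A minimal Numba CUDA sketch of the block and shared-memory hierarchy described above: each block cooperatively sums its slice of the input in shared memory. The sizes and the tree-reduction shape are illustrative, not a tuned implementation:

```python
import numpy as np
from numba import cuda, float32

TPB = 128  # threads per block; must be a compile-time constant for shared arrays

@cuda.jit
def block_sums(x, partial):
    sm = cuda.shared.array(TPB, float32)   # on-chip scratchpad shared by the block
    tid = cuda.threadIdx.x
    i = cuda.grid(1)
    sm[tid] = x[i] if i < x.size else 0.0
    cuda.syncthreads()                     # wait until every thread has loaded

    stride = TPB // 2
    while stride > 0:                      # tree reduction within the block
        if tid < stride:
            sm[tid] += sm[tid + stride]
        cuda.syncthreads()
        stride //= 2

    if tid == 0:
        partial[cuda.blockIdx.x] = sm[0]   # one partial sum per block

x = np.ones(1_000_000, dtype=np.float32)
blocks = (x.size + TPB - 1) // TPB
partial = np.zeros(blocks, dtype=np.float32)
block_sums[blocks, TPB](x, partial)
print(partial.sum())  # ~1,000,000
```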
Determining when to use GPUs: data size and parallelism
GPUs shine when your workload is parallelizable and your dataset large enough to amortize kernel launch and data-transfer overhead. For O(n) tasks, you typically need gigabytes of data to see speedups; for n² or n log n tasks (sorting, graph algorithms), even around a hundred megabytes can suffice. Amdahl's law reminds us that the serial portion (data prep, Python control flow) limits overall speedup. Profiling both compute and transfer times helps decide the right batch sizes and whether GPU acceleration is worthwhile.
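A rough way to check whether your problem size amortizes the overhead is to time transfers and compute separately. The sizes and operations below are arbitrary, and the explicit synchronize calls matter because GPU work is asynchronous:

```python
import time
import numpy as np
import cupy as cp

x_cpu = np.random.rand(50_000_000).astype(np.float32)  # ~200 MB

t0 = time.perf_counter()
x_gpu = cp.asarray(x_cpu)                # host -> device transfer
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()

y_gpu = cp.sqrt(x_gpu) * 2.0 + 1.0       # O(n) compute on the device
cp.cuda.Stream.null.synchronize()
t2 = time.perf_counter()

y_cpu = cp.asnumpy(y_gpu)                # device -> host transfer
t3 = time.perf_counter()

print(f"H2D {t1 - t0:.3f}s  compute {t2 - t1:.3f}s  D2H {t3 - t2:.3f}s")
```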
CPU vs GPU: bandwidth vs latency and GIL implications
CPUs optimize for low-latency, single-thread performance and complex control flow, while GPUs deliver massive parallel bandwidth at higher per-operation latency. Python 3.13's optional free-threaded build (PEP 703) removes the GIL, enabling true multithreading in Python and allowing concurrent data loading and kernel launches. This means you can dedicate one thread per GPU (or several) for maximal feed throughput. Properly partitioning CPU tasks (I/O, queuing) and GPU tasks (compute kernels) yields balanced pipelines and keeps all hardware busy.
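A hedged sketch of the one-thread-per-GPU pattern discussed in the episode; it works today because CuPy releases the GIL around GPU calls, and a free-threaded build removes the remaining CPU-side contention. The device count and workload are illustrative:

```python
import threading
import cupy as cp

def feed_gpu(device_id: int, n: int) -> None:
    # Each host thread selects its own GPU and runs an independent pipeline.
    with cp.cuda.Device(device_id):
        x = cp.arange(n, dtype=cp.float32)
        result = float(cp.sum(cp.sqrt(x)))
        print(f"GPU {device_id}: {result:.3e}")

n_gpus = cp.cuda.runtime.getDeviceCount()
threads = [threading.Thread(target=feed_gpu, args=(i, 10_000_000))
           for i in range(n_gpus)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```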
Getting started: accessible GPU environments and resources
You can prototype CUDA Python on free platforms like Google Colab (T4 GPUs) or Compiler Explorer for C++ snippets. Local gaming GPUs with Docker can host experiments, and cloud VMs (AWS P3, GCP A2) scale to production loads. Kubernetes clusters with GPU nodes enable orchestrated multi-tenant workflows. Always remember to shut down cloud instances when idle to avoid runaway costs.
Interesting quotes and stories
"Python is a great language for doing high performance work." – Bryce Adelstein-Lelbach
"A GPU is just a bandwidth optimized processor." – Bryce Adelstein-Lelbach
"When I used to play text-based games when I was a kid, I didn’t need a graphics card." – Bryce Adelstein-Lelbach
Key definitions and terms
- Kernel: A function executed in parallel by many GPU threads.
- Warp: A group of threads that execute the same instruction in lockstep.
- Block: A collection of warps that share fast, on-chip “shared memory.”
- Global memory: Main device memory accessible by all threads.
- Shared memory: Low-latency scratchpad memory for threads within a block.
- Transform: Element-wise application of a function over an array.
- Reduce: Aggregation operation (e.g., sum, max) across elements.
- Scan: Prefix sum producing intermediate results for each position.
- Data parallelism: Performing the same operation across many data elements simultaneously.
- Device vs Host: “Device” refers to the GPU, “host” to the CPU side of the program.
Learning resources
Here are recommended resources to explore GPU programming in Python and solidify your understanding:
- Python for Absolute Beginners: A comprehensive course covering Python basics and programming fundamentals.
- Accelerated Computing Hub GPU Python Tutorial: Self-guided Jupyter notebooks to practice CUDA Python.
- CuPy Documentation: Official guide for GPU-accelerated NumPy-compatible arrays.
- Numba Documentation: Resources for writing JIT-compiled Python code and GPU kernels.
- Google Colab: Free notebooks with GPU access for hands-on experimentation.
- Compiler Explorer: Online sandbox for compiling and running CUDA code snippets.
Overall takeaway
GPU programming has entered a new era where pure Python delivers performance once reserved for C++. NVIDIA's CUDA Python ecosystem, comprising CuPy, cuDF, cuda.core, Numba CUDA, and more, empowers you to write, orchestrate, and optimize GPU workloads entirely in Python. By grasping core concepts like parallel primitives, architecture hierarchies, and DSL-based JITs, software developers and data scientists can unlock unprecedented computational power without leaving the comfort of Python. Harness these tools, explore the resources above, and start transforming your data-intensive applications today.
Links from the show
Episode Deep Dive write up: talkpython.fm/blog
NVIDIA CUDA Python API: github.com
Numba (JIT Compiler for Python): numba.pydata.org
ADSP: The Podcast (Algorithms + Data Structures = Programs): adspthepodcast.com
NVIDIA Accelerated Computing Hub: github.com
NVIDIA CUDA Python Math API Documentation: docs.nvidia.com
CUDA Cooperative Groups (CCCL): nvidia.github.io
Numba CUDA User Guide: nvidia.github.io
CUDA Python Core API: nvidia.github.io
NVIDIA’s First Desktop AI PC ($3,000): arstechnica.com
Google Colab: colab.research.google.com
Compiler Explorer (“Godbolt”): godbolt.org
CuPy: github.com
RAPIDS User Guide: docs.rapids.ai
Watch this episode on YouTube: youtube.com
Episode #509 deep-dive: talkpython.fm/509
Episode transcripts: talkpython.fm
--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
Episode Transcript
00:00 If you're looking to leverage the insane power of modern GPUs for data science and machine learning, you might think you'll need to use some low-level programming language, such as C++.
00:10 But the folks over at NVIDIA have been hard at work building Python SDKs, which provide near-native level of performance when doing Pythonic GPU programming.
00:19 Bryce Adelstein-Lelbach is here to tell us about programming your GPU in pure Python.
00:26 This is Talk Python To Me, episode 509, recorded May 13th, 2025.
00:32 Are you ready for your host? Here he is!
00:34 You're listening to Michael Kennedy on Talk Python To Me.
00:38 Live from Portland, Oregon, and this segment was made with Python.
00:44 Welcome to Talk Python To Me, a weekly podcast on Python.
00:48 This is your host, Michael Kennedy.
00:50 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both accounts over at fosstodon.org and keep up with the show and listen to over nine years of episodes at talkpython.fm.
01:03 If you want to be part of our live episodes, you can find the live streams over on YouTube.
01:08 Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified about upcoming shows.
01:14 This episode is sponsored by Posit Connect from the makers of Shiny.
01:18 Publish, share, and deploy all of your data projects that you're creating using Python.
01:23 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards, and APIs.
01:30 Posit Connect supports all of them.
01:32 Try Posit Connect for free by going to talkpython.fm/posit, P-O-S-I-T.
01:38 And it's brought to you by Agency.
01:40 Discover agentic AI with Agency.
01:43 Their layer lets agents find, connect, and work together, any stack, anywhere.
01:47 Start building the internet of agents at talkpython.fm/agency spelled A-G-N-T-C-Y.
01:55 I want to share an awesome new resource with you all before we talk GPUs with Bryce.
02:00 You may have seen the episode deep dives on the episode pages at talkpython.fm.
02:06 These are roughly 1,250 word write-ups about the episode.
02:10 It's not just a summary or list of topics.
02:13 These cover additional study materials you might focus on to get the most out of the episode as well as related tools and topics to explore.
02:20 The reason we're talking about these now is that I just finished the very last deep dive write-up for the entire back catalog.
02:27 That's 510 episodes resulting in 600,000 words of extra detailed information for our podcasts.
02:35 These are only available online on the website and not in the podcast player feeds because podcast players are already complaining that our RSS feed is too big for them as if it's 1993 or something.
02:47 adding the deep dives would surely cause more trouble.
02:50 So be sure to visit the episode page for each episode to check out the deep dive.
02:55 I did a write-up on how these were created, why, and more at the Talk Python blog over at talkpython.fm/blog, creative I know.
03:04 I'll put the link to the blog post in the show notes.
03:07 These deep dives were a big effort, but I really think they add a lot of value to the show.
03:12 Thanks as always. Let's talk GPUs.
03:15 Bryce, welcome to Talk Python To Me. Awesome to have you here.
03:18 Thrilled to be here. It's my first appearance on a podcast, actually.
03:22 Okay, very cool, very cool. Well, we're going to talk a lot about GPUs and not that much about graphics, which is ironic, but, you know, that's the world we live in these days, right? It's amazing.
03:33 It's funny, I've worked at NVIDIA for eight years now. I know next to nothing about graphics.
03:39 That's pretty funny. I do on my gaming PC have a 2070 RTX, but I don't do any programming. I probably should. It's just so loud when that thing is turned on. It's like it's going to take off.
03:52 But they sure are powerful. They sure are powerful, these GPUs. So it's going to be exciting to talk about it, what you can do with them.
03:57 That's true. I remember when we launched the 2070.
04:00 I'm trying to think what my first GPU was, but because when I used to play video games when I was a kid, I didn't play video games. I played text-based games, MUDs. So I never really needed a graphics card because it was all just command-line interface.
04:13 The modem is the limiting factor or whatever for your MUD, right? I'm presuming it's on the internet. I used to play a MUD called Shadow's Edge. I don't know if people heard of this one out there in the world, but
04:24 I don't think I've heard of that one. This is early 90s, early 90s.
04:28 That was a little before my time.
04:30 I used to play like a lot of Star Wars MUDs and I played a pretty popular one that was called Avatar.
04:35 That was one of the big ones.
04:37 And that's actually how I got started programming.
04:39 Yeah, I actually know a friend who got started that way as well.
04:42 All right, just acronym police here.
04:44 What is a MUD?
04:45 So it stands for Multi-User Dungeon.
04:48 And it's like a very weird little corner of the gaming world because MUDs are largely not run for profit.
04:55 It's like somebody creates it, and for the most part, they host it, and then they build a little community around it.
05:01 And so they're little multiplayer games.
05:03 Some of them are like a role-playing theme to them, and it's usually coded by volunteers, run by volunteers.
05:10 And I always found it a very pure form of gaming.
05:13 I really like them as well.
05:14 Just take your time, you build a little community of friends, and a little world, and you just go live in it.
05:20 It's been a long time since I played it, and I recently played something like a MUD, but it wasn't multiplayer, with my daughter, and she thought it was the coolest thing.
05:28 So there's hope for the new generation to carry on the torch.
05:32 I've played some text-based games powered by large language models where you're basically just interacting with a large language model recently, and it kind of reminded me of the MUDs from back in the day because it's all just textual interaction.
05:45 I think as the LLMs get better and the tools for building these things get better, there'll probably be some really interesting stories that are powered by them.
05:53 Really interesting stories.
05:54 interesting games. Yeah, it's an exciting future ahead of us.
05:57 Yeah, I'm kind of looking forward to it. And that brings us kind of to our topic a little bit. Like, how do those LLMs get trained?
06:03 Well, probably on NVIDIA things. Before we jump into that, though, give us a quick introduction.
06:08 Who are you? I got my start in programming, teaching myself how to program a MUD. And that was when I was about 19. And from there, I got involved in open source working on the Boost C++ libraries. And I was a college dropout at that point and was looking for a job. And so I just asked somebody who I knew through the open source communities, through the Boost community, like, hey, I need a job. And this guy said, you should come down to Louisiana, come work for me at my supercomputing research center. And I was like, sure, like I'm a college dropout, why not?
06:40 My parents didn't think this was such a good idea, but I managed to convince them. So I went down there. And I worked there for about four years working on HPX, which is a distributed C++ runtime for HPC or high performance computing. And I sort of learned under this professor, Hartmut Kaiser, who was my first mentor. And he kind of tricked me into going back to college. And so I completed my degree when I was there. And we started and ran. What was your
07:11 degree in? It was applied mathematics. I figured if I tried to get a degree in computer science, I knew that I was an arrogant teenager and I figured I'd clash with my professors. So I was like, I got to get my major in a field where I don't know anything so that I'll respect the professors and I won't get in trouble in school. So that's the math
07:30 degree. It's easy to feel like you don't know too much doing math. I've been there. When
07:34 I was there, we started this research group. We developed the HPX runtime together. And then after that, I went to work at Lawrence Berkeley National Lab in California, which is a part of the US Department of Energy research apparatus.
07:48 And I was there for about two years, also doing high performance computing stuff and C++ stuff.
07:54 And around that time, I got involved in the C++ Standards Committee. And then I went to NVIDIA.
07:58 And I've been at NVIDIA for, I think, eight years now, 2017.
08:04 And at NVIDIA, what I primarily do is I work on programming language evolution.
08:10 So that means thinking about how should languages like C++, Python, and Rust evolve.
08:18 And in particular, because it's NVIDIA, my focus is thinking about concurrency, parallelism, and GPU acceleration.
08:25 So how do we make it easier and more accessible to write parallel and GPU accelerated code in these programming languages?
08:33 And it's not just programming languages.
08:35 I also work a lot on library design, interface design.
08:40 I started the CUDA C++ core libraries team here at NVIDIA.
08:45 But I spent maybe the first six or seven years of my career at NVIDIA almost exclusively doing C++ stuff.
08:53 And then the last one, two years, I've been getting ramped up on Python things, learning more about Python.
09:00 And now I'm involved in a lot of NVIDIA's Python efforts, although I am by no means a Python expert, I would say.
09:08 Interesting. That's quite the background.
09:10 A couple of things I want to ask you about.
09:11 I guess start from the beginning.
09:12 I'm less likely to forget them.
09:14 So you were doing high performance computing and grid computing type stuff in the early days.
09:19 I mean, these are like SETI at home and protein folding days and all that kind of stuff.
09:24 I know that's not exactly the same computers, but what were some of the cool things you were working on?
09:28 Did you come across some neat projects or have some neat stories to share from there?
09:33 I came in, you can sort of break up the HPC era into the like what scale of compute we were at.
09:40 And so I came in after TeraScale, TeraFlops, right at the advent of the Petaflop era.
09:46 And there was sort of this big question of how do we go from petaflop to exaflop.
09:53 And at the time, we thought it was going to be really, really hard because this was before accelerators were a thing.
09:59 And so the plan for going from petaflops to exaflops was just to scale it up with CPUs.
10:07 And to scale from petaflops to exaflops and CPUs, to be able to do an exaflop of compute with CPUs, you would need millions of processors and cores. And the mean time to failure for hardware components was very low. Like if you're running something on a million CPUs, a million nodes, a hard drive is going to fail every two to three minutes on average.
10:30 Yeah, something's breaking all the time, right?
10:32 And the computing modalities that we had at the time, we didn't think that they were going to be resilient at that scale. And so there was this big challenge of how do we come up with new computing modalities. The main modality at the time was what's called MPI message passing interface plus X, where X is your on-node solution for parallelism, you know, how you're going to use the threads on your system. And so what I worked on was HPX, which was this sort of one of a variety of different competing research runtimes that were exploring new parallel computing models.
11:08 And HPX was this fine-grained tasking model, so you could launch tasks that were very, very lightweight.
11:15 And it also had what we'd call an active address space.
11:18 So it could dynamically load balance work, and this would give you both load balancing but also resiliency, because if a node went down, you could move the work to another node.
11:28 And I mostly worked on the thread scheduling system and the migration and global address space system.
11:36 But what ended up happening in this push to the exascale era is that GPUs came onto the scene, accelerators came onto the scene. And it turned out that we could get to an exaflop of compute with 10,000 nodes that had GPUs in them or 20,000 nodes that had GPUs in them. And so we were able to scale up a different way.
11:57 And so it ended up that the existing modalities, the existing ways that we did parallel computing, they were basically good enough.
12:06 Like the way that we did the distributed computing was more or less good enough.
12:09 We just had to add the GPU programming aspect.
12:12 Right. Get a whole lot more parallels than per node, right?
12:14 Yep, exactly.
12:15 Today, obviously, there's huge clusters of very powerful GPUs like H100, H200 sort of processors that you all are making, right?
12:23 And our new, the Blackwell processors, which somebody can get their hands on. I don't know who. It's hard for me to get my hands on them, but they're out there somewhere.
12:31 What are your thoughts on ARM in the data center or supercomputers?
12:35 I think that ARM is ultimately going to take over everything for the very simple reason that the software world loves simplicity and consistency.
12:47 And if we can support one type of CPU instead of two types of CPUs, that's so great.
12:53 It's like a 10x win.
12:54 That's what I often like to say with programming languages.
12:57 For a new programming language to be successful, to gain inertia, it needs to be 10x better in some way.
13:04 And for hardware, it's probably more like 20X or maybe 100X better.
13:09 And so like what is the 10X or 100X advantage that ARM has?
13:13 There's lots of like actual advantages.
13:15 We could talk about the merits of the hardware itself.
13:18 But at the end of the day, the only thing that matters is that ARM is the architecture that's in your phones and in your tablets.
13:26 And so naturally, but like x86 doesn't really scale down to the phone and the tablets.
13:31 People have tried, not really a good option.
13:33 not a good architecture for going at the low end of computing for a variety of reasons.
13:38 And so because x86 can't survive at the low scale of computing, then even if it's a better processor at the high end of computing, naturally ARM is going to push it out of the high end of computing.
13:49 And so I think that ARM is the inevitable future for CPU architectures.
13:55 Eventually something will come around to disrupt, but we tend to like uniformity.
14:00 And so I think that ARM will be dominant for a while.
14:04 Yeah, very interesting.
14:05 I also think ARM has a massive energy benefit or advantage, right?
14:09 And data centers are almost limited by energy these days more than a lot.
14:15 Energy and cooling.
14:16 At NVIDIA, we also make CPUs.
14:17 We make ARM processors for both data center and mobile.
14:21 And our ARM processors are, yes, very energy efficient.
14:24 And that's one of the big advantages for us in picking that architecture.
14:28 Yeah, when you hear headlines like, Microsoft is starting up Three Mile Island again for a data center, you realize, oh, that's a lot of energy they need.
14:37 It is becoming the limiting factor on compute.
14:41 This portion of Talk Python To Me is brought to you by the folks at Posit.
14:44 Posit has made a huge investment in the Python community lately.
14:48 Known originally for RStudio, they've been building out a suite of tools and services for Team Python.
14:54 Over the past few years, we've all learned some pretty scary terms.
14:57 typosquatting, supply chain attack, obfuscated code, and more. These all orbit around the idea that when you install Python packages, you're effectively running arbitrary code off the internet on your dev machine, and usually even on your servers. That thought alone makes me shudder, and this doesn't even touch the reproducibility issues surrounding external packages.
15:20 But there are tools to help. Posit Package Manager can solve both problems for you.
15:25 Think of Posit Package Manager as your personal package concierge.
15:29 Use it to build your own package repositories within your firewall that keep your project safe.
15:33 You can upload your own internal packages to share or import packages directly from PyPI.
15:39 Your team members can install from these repos in normal ways using tools like pip, Poetry, and uv.
15:45 Posit Package Manager can help you manage updates, ensuring you're using the latest, most secure versions of your packages.
15:52 but it also takes point-in-time snapshots of your repos, which you can use to rerun your code reproducibly in the future.
16:00 Posit Package Manager reports on packages with known CVEs and other vulnerabilities so you can keep ahead of threats.
16:06 And if you need the highest level of security, you can even run Posit Package Manager in air-gapped environments.
16:12 If you work on a data science team where security matters, you owe it to you and your org to check out Posit Package Manager.
16:19 Visit talkpython.fm/ppm today and get a three-month free trial to see if it's a good fit.
16:25 That's talkpython.fm/ppm.
16:28 The link is in your podcast player's show notes.
16:30 Thank you to Posit for supporting the show.
16:33 The other thing I want to talk to you about is, before we get into the topic exactly, is with all of your background in C++, working with the standards, give people a sense of how has C++ evolved.
16:43 When you're talking to me, you're talking to a guy who did some professional C++, but stopped at the year 2000.
16:50 You know what I mean?
16:51 Like there's one view of C++ and there's probably something different now.
16:55 C++ is, the evolution of C++ is managed by an international standards organization, ISO.
17:03 And ISO has a very interesting stakeholder model where the stakeholders in ISO are national delegations or national bodies as we call them.
17:12 So each different country that participates in ISO can choose to send a national delegation.
17:17 And each different country has different roles for how their national delegations work.
17:20 In the U.S., membership in the national delegation is by company.
17:25 And there's no real requirement other than that you have to pay a fee.
17:29 It's like 2.5K a year.
17:31 And you can join the C++ committee if you're a U.S.-based organization.
17:35 In other countries, like in the U.K., it's a panel of experts.
17:41 And you're invited by the panel of experts to join the panel of experts.
17:45 And there's different, like some other countries have different models.
17:48 In some countries, the standards body is actually run by the government.
17:52 And so you have all these experts, these national delegates that come together, and then they all work on C++.
17:58 And there's a lot of bureaucracy and procedure, and it's sort of like a model UN, but for a programming language.
18:06 But yeah, for angle brackets and semicolons, got it.
18:09 A lot of people on the committee love to talk about all the details of how the committee works.
18:14 I don't really think that it's particularly important.
18:16 I think that the key thing to understand is that it's sort of got an odd stakeholder model.
18:21 It's not a stakeholder model where it's like, oh, let's get the major implementations together or let's get the major users together.
18:27 For the most part, it's like anybody who can figure out how to join a national body can participate.
18:33 And if you happen to live from some small country where you're the only delegate, then you get the same vote as the entire United States national body because at the end of the day, votes on the C++ standard are by national body.
18:48 And so there's some people that have an outsized influence in the C++ committee.
18:52 The C++ committee itself is organized into a number of subgroups.
18:56 There's one for language evolution.
18:58 There's one for library evolution, which I chaired for about three years.
19:03 And then there are core groups for both language and library.
19:07 And so the evolution groups, they work on the design of new features and proposals.
19:13 And then the core groups sort of vet those designs and make sure that they're really solid.
19:18 And then there's a bunch of study groups that there's one for concurrency.
19:22 There's one for particular features like ranges or reflection.
19:26 And those groups develop those particular features or represent a particular interest area, like game developers, for example.
19:34 And proposals flow from those study groups to those evolution groups and then through the core groups.
19:40 And then eventually they go into the standard and then the national bodies vote on the standard.
19:43 Sounds pretty involved.
19:44 I guess the main thing I was wondering about is like, how different is C++ today versus 20, 30 years ago?
19:51 I think it's a radically different language.
19:54 C++ 11 completely changed the language in many ways and really revitalized it after a very long period.
20:01 after the first standard, which was C++98.
20:04 And then after C++11, C++ began shipping new versions every three years.
20:10 Whatever features were ready would ship.
20:11 So we adopted a consistent ship cycle.
20:14 And the next big revision after C++11 was C++20.
20:19 Not as transformative, I think, as C++11, but pretty close.
20:23 And then we're just about to finish C++26, which will also be a pretty substantial release.
20:29 And probably by the time that this goes out to your podcast audience, we'll be right around when we finalize the feature set for C++ 26.
20:38 Interesting.
20:38 So if I want to do more C++, I probably need to start over and learn it again.
20:41 If you learned it before C++ 11, yeah, you'd have to relearn some patterns.
20:47 Yeah.
20:47 We like to talk about modern C++ versus old C++.
20:52 C++ 11 is sort of like a Python 2 to Python 3 sort of jump, except...
20:57 Yeah, I was thinking maybe.
20:58 without as much breaking behavior.
21:00 There was very little breaking behavior.
21:02 But the best practices changed drastically from the pre-C++11 era to the modern era.
21:09 Yeah, super interesting.
21:11 I honestly could talk to you for a long time about this.
21:13 But this is not a C++ show, so let's move on to Python.
21:16 But I've been focused a little bit on C++ because traditionally that's been one of the really important ways to program GPUs and work with CUDA and things like that, right?
21:26 And now one of the things that you all have announced, released, or are working on is CUDA-Python, right?
21:35 How long has this been out here for?
21:36 Five months?
21:37 Something like that?
21:38 Not terribly long.
21:39 I don't know how long the repo's been out, but the CUDA-Python effort's been around for about a year or two.
21:43 And you're absolutely right that for a long time, C++ was the primary interface to not just our compute platform, but to most compute platforms.
21:54 And so what changed?
21:55 Well, the two big things is that data science and machine learning happened.
22:01 Both fields that tend to have a lot of domain experts, computational scientists, who are not necessarily interested in writing low-level code in C++ and learning the best practices of software engineering in a systems language like C++.
22:18 They just want to be able to do their domain expertise, to do their data science, or to build their machine learning models.
22:24 So naturally they gravitated towards a more accessible and user-friendly language, Python.
22:30 And it became apparent a couple of years ago within NVIDIA that we needed to make our platform language agnostic.
22:39 And that's really what we've been focusing on the last year or two.
22:42 And CUDA Python is not just us saying, let's add another language.
22:47 Let's, okay, now we're going to do Python and C++.
22:49 It really reflects our overall goal of making the platform more language agnostic.
22:53 And you'll see that in our focus more and more on exposing things at the compiler level, exposing ways for other languages, other compilers, other DSLs to target our platform via things like MLIR dialects, which we've announced a bunch of recently.
23:11 But CUDA Python obviously was the place where we needed to start.
23:15 So the goal of CUDA Python is to provide the same experience with more or less the same performance that you would get in C++.
23:25 And when I say more or less, I mean that there are some higher level parts of CUDA Python that don't necessarily map directly to C++ things.
23:34 But the parts of CUDA Python that have direct counterparts in CUDA C++, we expect that you will get the same performance.
23:44 And we think that's really important because we don't want users to have to sacrifice performance to be able to do things natively within Python.
23:52 That's pretty impressive, honestly.
23:53 I think a lot of times when people think about doing stuff with Python, they're shown or they discover some kind of benchmark that is 100% Python or 100% the other language.
24:04 Like, oh, here's the three-body problem implemented in Python, and here it is implemented in Rust.
24:10 And look, it's 100 times slower or something.
24:13 But much of this work, much of the data science work especially, but even in the web, a lot of times what you're doing is you take a few pieces together in Python and you hand it off to something else, right?
24:24 In this case, you're handing it off to the GPU through the CUDA bindings, the C bindings.
24:29 And once it's in there, it's off to the races internally, right?
24:33 And yeah, and when I think of like the web, you very likely are taking a little bit of data, a little bit of string stuff, doing some dictionary things, handing it to a database or handing it to a third-party API.
24:43 And again, it doesn't matter what you're doing.
24:45 Like it's off into this, whatever that's written in C or whatever for the database.
24:50 Tell me a bit about like the work you had to do to sort of juggle that.
24:52 Sometimes people think of Python as being a slow language.
24:55 I actually will make the claim that Python is a great language for doing high performance work.
25:03 And the reason for that is because Python, it's very easy.
25:08 Python's a very flexible language where it's very easy to do two things.
25:11 One, to make, to optimize the fast path to either through JIT or through Cython extensions.
25:18 It's very amenable to that.
25:20 And two, the language semantics are flexible enough that it's in the AST and the parsing is accessible enough that it's super, super easy to build a DSL in Python.
25:32 And because the language semantics are flexible, it's very easy to build a DSL where it's like, okay, well, our DSL, you write the syntax of Python, and there's some caveats here where some of the things that you know about Python are maybe a little bit different in this DSL.
25:46 But with those relaxations, we can give you super fast code.
25:49 And that's how things like Numba and things like Numba CUDA and things like cupyx and things like Triton lang, that's how those things all work, is they build the DSL where they take Python-like syntax, they follow most of the rules of Python with a couple relaxations, restrictions, et cetera.
26:07 And then they give you super fast code that has native performance.
26:10 And if you look at other languages that have tried to deliver on this, that have tried to have a managed runtime, tried to give you portability and high level ease of use and also performance, a lot of the other ones have failed.
26:25 I remember seeing a talk a couple of years ago from one of the lead engineers on, I forget which JVM, but of a particular JVM implementation.
26:34 And he was talking about everything that goes into making a native call from Java, like the protocol for Java to call a C function.
26:43 And he was showing us an assembly, like the call convention, and like you have to do all this stuff and save and restore all this stuff.
26:49 And he was like, and we have to do all this work to be able to make this fast.
26:54 And because Java doesn't have as flexible semantics, you have to do all that.
26:59 But in Python, it's so much easier.
27:01 And this is, I think, one of the reasons why Python has succeeded as a language because it's so easy when you need to put something in production, if it is slow, it's so easy to make that slow thing fast.
27:12 It's a really interesting take.
27:13 I hadn't really thought of that because it's a more malleable language that can be shaped to adapt to something underlying that's faster.
27:20 Exactly.
27:21 How much does Numba or Cython or things like that factor into what's happening here?
27:25 I see it's 16% Cython according to GitHub, but I don't know, Yeah, those stats are sometimes crazy.
27:32 We use Cython in a lot of places on fast paths.
27:36 Now, with CUDA, there's a couple different types of things that you do with CUDA.
27:40 The first with CUDA is when you're writing code that runs on your CPU, that is managing and orchestrating GPU actions.
27:50 That could be allocating GPU memory.
27:52 That could be launching and waiting on CUDA kernels.
27:56 It could be that sort of thing.
27:58 making work, making memory, transferring work in memory, waiting on things, setting up dependencies, et cetera. For the most part, a lot of those tasks are not going to be performance critical.
28:11 And the reason for that is because the limiting factor for performance is typically around like synchronization costs. If your major cost is like acquiring a lock or allocating memory, that whether you call cudaMalloc from C++ or Python, it
28:29 doesn't really matter what language you're calling it from.
28:31 cudaMalloc is going to take a little while because it's got to go and get storage.
28:35 Now, one of the exceptions to this is when you're launching a kernel.
28:38 That's a thing that we want to be very, very fast.
28:41 So some of our frameworks have their kernel launch paths Cythonized.
28:48 Numba we use pretty extensively.
28:50 So we use Numba for the other piece.
28:54 So I just talked about the orchestration and management of GPU work and memory.
28:59 But how do you actually write your own algorithms that run on the GPU?
29:04 So you don't have to do this frequently because we provide a whole huge library of existing algorithms.
29:10 And generally, we advise you should try to use the existing algorithms.
29:14 We got a lot of top people, spend a lot of time making sure that those algorithms are fast.
29:18 So you should try really hard to use the algorithms that we provide.
29:22 But sometimes you've got to write your own algorithm.
29:23 If you want to do that natively in Python, you need some sort of JIT compiler that's going to know how to compile some DSL that's going to JIT compile it down to native code that can run on the device.
29:37 We use Numba for that.
29:39 There's a Numba backend called CUDA that allows you to write CUDA kernels in Python with the same speed and performance that you'd get out of writing those kernels in CUDA C++.
29:49 And then we have a number of libraries for the device-side programming that you can use in Numba CUDA.
29:55 That sounds pretty useful, like a pretty good library to start with instead of trying to just straight talk to it, the GPUs directly.
30:03 This portion of Talk Python To Me is brought to you by Agency.
30:07 Agency, spelled A-G-N-T-C-Y, is an open-source collective building the internet of agents.
30:14 We're all very familiar with AI and LLMs these days. But if you have not yet experienced the massive leap that agentic AI brings, you're in for a treat. Agentic AIs take LLMs from the world's smartest search engine to truly collaborative software. That's where Agency comes in. Agency is a collaboration layer where AI agents can discover, connect, and work across frameworks. For developers, this means standardized agent discovery tools, seamless protocols for interagent communication, and modular components to compose and scale multi-agent workflows.
30:50 Agency allows AI agents to discover each other and work together regardless of how they were built, who built them, or where they run.
30:58 Agency just announced several key updates as well, including interoperability for Anthropic's Model Context Protocol, MCP, across several of their key components.
31:08 a new observability data schema enriched with concepts specific to multi-agent systems, as well as new extensions to the Open Agentic Schema Framework, OASF.
31:20 Are you ready to build the future of multi-agent software?
31:23 Get started with Agency and join Crew AI, LangChain, Llama Index, BrowserBase, Cisco, and dozens more.
31:30 Build with other engineers who care about high-quality multi-agent software.
31:35 Visit talkpython.fm/agency to get started today.
31:39 That's talkpython.fm/agency.
31:41 The link is in your podcast player's show notes and on the episode page.
31:45 Thank you to agency for supporting Talk Python To Me.
31:49 So how much do people need to understand GPUs, GPU architecture and compute, especially Python people, to work with it?
31:58 I see a bunch of things like you talked about kernels.
32:02 warps, memory hierarchies, threads?
32:05 Give us a sense of some of these definitions and then which ones we need to pay attention to.
32:09 Most people need to know very little of this.
32:12 When we teach CUDA programming, when we teach it in C++ even these days, we start off by teaching you how to use the CUDA algorithms, how to use CUDA accelerated libraries, how to use kernels or GPU code that people have already written where you just plug in operators in the data that you want and how to use that.
32:34 And then we teach you about the notion of different kinds of memory, memory that the CPU can access and that the GPU can access.
32:42 And then like we teach you about like how to optimize using those algorithms and that memory.
32:47 And it's like only after like hour three or four of the education do we introduce to you the idea of writing a kernel because we don't want you to have to write your own kernels and because it's not the highest productivity thing and oftentimes it's not going to be the highest performance thing.
33:02 Writing GPU kernels has gotten more and more complex as GPUs have matured, because the way that we get more and more perf out of our hardware today, since we can no longer just get more out of Moore's law scaling, is that we have to expose more and more of the complexity of the hardware to the programmer.
33:20 And so writing a good GPU kernel today is in some ways easier but in some ways more challenging than it was 5 to 10 years ago.
33:29 Because five to 10 years ago, there were a lot less tools to help you out in writing a GPU algorithm.
33:35 But the hardware was a bit simpler.
33:37 Most people, I would say 90 to 95% of people, do not need to write their own kernels.
33:44 And if you don't need to write your own kernels, you don't have to understand the CUDA thread hierarchy.
33:49 And the CUDA thread hierarchy is what warps and blocks are.
33:53 So on the GPU, you've got a bunch of different threads.
33:57 And those threads are grouped into subsets that are called blocks.
34:03 And the blocks all run at the same time.
34:06 So all the threads in the block run at the same time, rather.
34:10 And all the threads within the block have fast ways of communicating to each other.
34:15 And they have fast memory that they can use to communicate to each other.
34:18 This scratchpad memory that we call shared memory.
34:21 And blocks are further divided into warps, which are smaller subsets of threads that are executed as one.
34:30 Like they do the same operations on the same particular physical piece of hardware at a time.
34:37 They do each have an individual state, so they can be at different positions.
34:42 They essentially each have their own thread, but they're executed in lockstep.
34:46 But you don't need to know most of that if you're looking to use CUDA.
34:52 What you need to understand is some of the basics of parallel programming and how to use algorithms.
34:58 And there are, I would say, three main generic algorithms that matter.
35:05 And the first is just like four, it's just a for loop.
35:09 And the for loop that we most often think about is the one that we call transform, where you've got an input of an array of some shape and an output of an array of some shape. And you just apply a function to every element of the input, and then that produces the output for you.
35:25 The second algorithm Sounds a little like pandas or something like that, right? The vectorized math and so on.
35:30 That's exactly right. Or in Numba, it's like the notion of a generalized universal function or like a ufunc from NumPy or something like that. Just an element-wise function. The next algorithm is reduction. So this is like doing a sum. It's just a generalized version of doing a sum over something. And a reduction is a basis operation that you can use to implement all sorts of things like counting, any form of counting, any form of searching. If you're looking for the maximum of something, you can do that with a reduction. And the last algorithm is a scan, which is also known in some circles as a partial sum. So a scan works somewhat similar to a reduction, but it gives you the intermediate sums. So if you're scanning an array of length n, the output of that scan is the sum of the first element and the second element, the first element, the second element, and the third element, the first element, the second element, the third element, and the fourth element, et cetera, through to the end. And scans are very useful for anything that has position-aware logic.
36:40 Like if you want to do something where you want to reason about adjacent values, or if you want to do something like a copy if, or like a filter, something like that, that's what you'd use a scan for. And those three, I think, are the basis of programming with parallel algorithms. And you would be surprised at how often a parallel programming problem will break down into calling some combination of those three
37:08 algorithms. With this version of Python 3.13, we kind of have no more GIL. Right. How much does that matter? It sounds like it might not actually really make much difference for this project. This
37:20 is a question I get asked a lot internally. I do think it matters.
37:23 One of the reasons it matters is that these days you don't normally have just one GPU in like a high-end system.
37:32 You tend to have multiple GPUs.
37:35 Multiple being two or four or 20?
37:37 Two, four, or like eight. Eight is what you typically see.
37:40 You typically see like two CPUs and eight GPUs in like your default like high-end compute
nodes.
37:48 Yeah, but even if you just serialize one of those, that's only 16% of the GPU processing, right?
37:54 Like one eighth.
37:55 Oftentimes to feed all the GPUs, you need to have parallelism on the CPU.
38:03 It's oftentimes not sufficient to just have one thread in the CPU, launch everything.
38:08 And there's also like, GPUs are not the answer for everything.
38:13 Like people often have ideas of what they think a GPU is, But I'll tell you my definition of what a GPU is.
38:20 A GPU is just a bandwidth optimized processor.
38:23 It's a general purpose processor, just like a CPU.
38:26 But a CPU is optimized for latency.
38:29 It's optimized for the single threaded case.
38:32 It's optimized for getting you an answer for every operation as quickly as possible.
38:36 A GPU is optimized for bandwidth.
38:38 If you ask a GPU to load from memory, it's going to take a while.
38:42 If you ask a GPU to add something, it's going to take a while.
38:45 but it will have a higher bandwidth of doing those things.
38:48 And so for some tasks, a lot of like data loading, storing and ingestion tasks, the CPU might be the better fit.
38:55 And also the CPU is generally the better thing to talk to disk and to talk to network directly.
39:02 And so for a lot of applications, the CPU has got to do work to get data ready and then to communicate that data to the GPU.
39:10 And oftentimes the highest performance architectures are going to be ones where that data prep and loading and command and control work is being done in parallel.
39:22 And a GIL-less Python will enable us to be able to express those efficient command and control architectures directly in Python and to be able to have Python efficiently communicate with the GPU.
39:34 Yeah, maybe make a thread per GPU or something like that and send them all off the...
39:39 Or even multiple threads per GPU you may need in some cases.
39:43 Probably depends on how many threads your CPU has also.
39:45 Yes, definitely.
39:46 That all sounds super interesting.
39:47 I want to dive into the different building blocks.
39:49 There's all these, like CUDA Python is at least in its current and ongoing, maybe future form, what's called a meta package in the sense that it is not itself a thing that has a bunch of functionality, but it sort of claims dependency on a bunch of things and does an import of them and centralizes that to like sort of bundle a bunch of pieces, right?
40:09 So I want to talk about all of those.
40:11 But before we do, give people out there who maybe have Python, I guess, mostly data science problems, maybe AI, email problems.
40:20 Or it'd be interesting if there was something that didn't fit either of those that is good to solve with CUDA and GPU programming.
40:28 Like, maybe give us some examples of what people build with this.
40:31 The more important question is usually what order of magnitude of data do you have to work with?
40:37 If you don't have a large enough problem size, you're not going to get a benefit out of using a GPU.
40:42 Okay, define large.
40:44 That means different things for different people, right?
40:46 Measured in gigabytes of memory footprint.
40:48 Okay.
40:49 You typically will need to have, well, it's a little more nuanced than that.
40:53 If the compute that you're doing is linear in your problem size, that is, if you're doing like O(n) operations, like you're doing some linear number of operations in your data size, then you need gigabytes.
41:10 If you're doing more than that per element, if you've got something that's like n squared or exponential or n log n, something like sorting, where it's not going to scale linearly with the number of elements, where it's going to be worse than that, then you might have a smaller problem size that will make sense to run on the GPU.
41:31 But the majority of people who have a compute task have things that fall into the O(n) regime.
41:37 It's like, oh, I've got a set of n things and I want to apply this function to each one of them.
41:43 And in that case, you normally need gigabytes.
41:46 If you're sorting things, if you're sorting 100 megabytes of ints, you'll probably see a speedup on a GPU.
41:53 If you're just adding a million ints to a million ints, that's probably about the cutoff point for seeing a benefit from using the GPU.
42:02 The types of workloads, I think generally, It's sort of hard to say because it's so broad in what you can do.
42:10 You need to have something that has some amount of parallelism.
42:13 So there needs to be some data parallel aspect to your problem.
42:17 If you can't parallelize it, you're not going to benefit from the GPUs.
42:20 Right. Maybe I have questions about all the, I don't know, sales or some sort of prediction per state for all 50 states in the U.S.
42:28 or all countries in the world.
42:29 You could just break it up by country or state and let it rip.
42:32 Yeah. And those are definitely the easiest, the easiest.
42:34 When it's completely embarrassingly parallel, that's usually a good sign that it's something that will fit well in GPU.
42:40 But also if you have something like, I want to add up, I want to take the sum of every integer in this huge data set or something like that, that's also a task that GPUs can be good for, even though it's not embarrassingly parallel, even though there are data dependencies.
42:55 Where's a good place to run it?
42:56 Something I've been fascinated with is this thing you guys announced but have not yet shipped, the home AI computer, this little golden box.
43:05 So the DGX Spark, great entry-level platform, but there is no reason to have to buy something if you want to play around with CUDA.
43:14 One of the best places to play around with CUDA Python, I think, is Google Colab.
43:19 Google Colab, it's a Jupyter notebook environment, and they have free GPU instances with T4 GPUs.
43:28 It's not there by default, but if you go in, if you go to Runtime, you can change to use a GPU environment.
43:36 And then you can play around with Python in the Colab environment.
43:40 There is also Compiler Explorer, which is an online platform.
43:45 It's godbolt.org is the link.
43:48 It's an online compiler sandbox, and it has GPU environments.
43:55 And it does also have Python support, although I think they're still working on getting Python packaging to be available here.
44:04 But for something like CUDA C++, you can use this to write code and then see what the assembly is that you'd get, make sure that your code compiles, and then also run that code, and you can even run the code on a GPU.
44:18 So I think if you want to get started, there's a lot of different places where you can do GPU development without having to have your own GPU.
44:25 Here's another couple of possibilities. It's not too bad.
44:29 It's not that cheap, but if you don't leave it on all the time, it's not too bad to get a cloud Linux machine that's got some GPU component.
44:37 But if you leave it on all the time, they do get expensive.
44:39 So set yourself a reminder to turn it off.
44:42 Another one that's interesting, which I was just thinking about with my Mac Mini here.
44:46 I don't know that you've said it yet, but I'm pretty sure it's still true, that CUDA requires NVIDIA GPUs, not just a GPU, right?
44:52 That is true, yes.
44:53 So it's not going to work super well on my Apple Silicon.
44:56 That is true.
44:57 For that reason, amongst possibly others, but certainly that's one of them.
45:00 But I have like a gaming PC, or you might have a workstation with a really good GPU.
45:06 You could set up things like change your Docker host on your computer, and anything you run on Docker will run over there on that machine or build on that machine and so on.
45:15 So that's a pretty interesting way to say, well, everybody in the lab, we're changing your Docker host to like the one big machine that's got the NVIDIA GPU.
45:23 I think there's a lot of flexibility.
45:25 You don't even need to have the highest-end GPU to get started with CUDA programming.
45:30 And there are a lot of people who have use cases that would benefit from GPU acceleration where an everyday commodity gaming GPU would be fine.
45:41 Not going to be true for every application.
45:42 And of course, if you scale up, what usually happens when people start with GPU acceleration is first they take their existing thing and then they add the GPU acceleration.
45:52 and now it runs a lot faster.
45:53 And then they start thinking, aha, now that it runs faster, I can increase my problem size.
45:58 I no longer have these constraints.
46:00 And then they end up needing a bigger GPU.
46:02 But for that initial speedup, you're usually fine to start prototyping and developing on your everyday commodity GPU.
46:08 Not usually going to be the thing that makes sense to take to production.
46:12 And one of the biggest downsides to the GPU that's in your gaming box is that you're going to have greater latency between the CPU and the GPU.
46:22 And you're not going to have as much memory compared to what you get in a server GPU.
46:26 It's a step on the staircase maybe of getting started.
46:29 Yes.
46:29 Prototyping and so on.
46:31 Okay.
46:31 We don't have too much time left, but maybe let's dive into each of these pieces here that make up CUDA Python.
46:37 There's some that are listed here.
46:39 Let me have you give people the overview of what parts of CUDA Python you've created, starting with the first.
46:43 The first one isn't even listed here, and that is CuPy.
46:46 Okay.
46:47 So it's not listed here because it's not part of that CUDA Python meta-package.
46:51 So CuPy is a NumPy- and SciPy-like library that is GPU accelerated.
46:58 So it's the interface that you know from NumPy and SciPy.
47:03 When you invoke the operations, they run on your GPU.
47:06 So this is by far where everybody should start.
47:10 Could I get away with even saying import cupy as np?
47:14 Is it that compatible?
47:15 You could get away with that, but there are certainly going to be cases where the semantics may not 100% line up.
47:23 I think for the most part, you would be fine doing that, but there is no way to make a 100% fully compatible drop-in replacement.
47:32 And so it's important to read the docs and make sure you understand where there may be little subtle differences.
47:38 But yeah, you can definitely think of it as a drop-in replacement.
47:39 Yeah, the reason I was asking is maybe that's a super simple way to experiment, right?
47:44 I've got something written in Pandas, NumPy, and so on.
47:48 Could I just change the import statement and see what happens, without completely rewriting it? That's kind of what I'm getting at. Yeah,
47:54 you can definitely do that. I mean, I think the docs even say right there that, yeah, it's meant to be a drop-in replacement. So yeah, that's definitely how you could get started.
48:01 The one thing to keep in mind is that if you're running in a single thread on the CPU, the problem size that you're running with may not be large enough to see an impact. So you may have to think about running with a larger problem size than what makes sense with NumPy.
48:16 It might even be slower, right? Because of the overhead of going between the CPU and the GPU and stuff.
48:19 If you do like a sum of like three elements.
48:22 Although I would hope and assume that in that case we maybe don't dispatch to the GPU, but I suspect we still do.
48:30 CuPy is sort of the foundational thing that I'd say everybody should get started with.
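As a minimal illustration of the "just change the import" experiment described above, the sketch below aliases CuPy where NumPy would normally be imported and copies a result back to the host with cupy.asnumpy(). It assumes CuPy and an NVIDIA GPU, and, as Bryce notes, a few semantics can differ from NumPy.

```python
# Sketch of the "swap the import" experiment: NumPy-style calls run on the GPU.
import cupy as np   # would normally be: import numpy as np

a = np.random.random((4_000, 4_000)).astype(np.float32)
b = np.random.random((4_000, 4_000)).astype(np.float32)

c = a @ b                  # matrix multiply executes on the GPU
row_means = c.mean(axis=1)

# Results live in GPU memory; convert back to a host NumPy array when needed.
import cupy
host_row_means = cupy.asnumpy(row_means)
print(host_row_means[:5])
```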
48:33 Okay, let me ask you a follow-up question.
48:34 So this is like NumPy.
48:36 NumPy is the foundation mostly, starting to change a little bit.
48:40 But for pandas, is there a way to kind of pandify my GPU programming?
48:47 Yes, there is.
48:47 We have cuDF, libcudf, which is a part of RAPIDS.
48:52 And cuDF, it's a data frame library, and it has a pandas mode.
48:59 I think the module is just cudf.pandas, and it aims to be a drop-in replacement.
49:05 And then it also has its own like data frame interface that's, I think, a little bit different than pandas in some ways.
49:12 That allows it to be more efficient for GPUs and parallelism.
49:16 And there's a whole bunch of other libraries that are a part of the RAPIDS framework for doing accelerated data science.
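As a rough sketch of the two styles just described, the example below shows the cudf.pandas accelerator mode alongside the native cuDF DataFrame API. It assumes a working RAPIDS/cuDF install; check the cuDF docs for the exact behavior in your version.

```python
# Requires an NVIDIA GPU with RAPIDS/cuDF installed.

# Style 1: pandas accelerator mode -- keep your pandas code, run it on the GPU
# where possible. In a notebook you'd use `%load_ext cudf.pandas` instead.
import cudf.pandas
cudf.pandas.install()   # must run before importing pandas
import pandas as pd

df = pd.DataFrame({"state": ["CA", "NY", "CA", "TX"], "sales": [10, 20, 30, 40]})
print(df.groupby("state")["sales"].sum())

# Style 2: the native cuDF DataFrame interface.
import cudf

gdf = cudf.DataFrame({"state": ["CA", "NY", "CA", "TX"], "sales": [10, 20, 30, 40]})
print(gdf.groupby("state")["sales"].sum())
```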
49:24 Yeah, I'm going to probably be doing an episode on RAPIDS later as well.
49:28 So diving more into that.
49:29 But OK, it's good to know that that kind of is the parallel there.
49:32 That was actually the next piece I was going to mention, which was going to be RAPIDS and cuDF.
49:37 Now, for the next two most frequent things that you might need, the first is a Pythonic interface to the CUDA runtime.
49:46 So the CUDA runtime is the thing that you use to do that command and control, that orchestration of NVIDIA GPUs.
49:53 Things like managing configurations and settings in the GPUs, loading programs, compiling and linking programs, doing things like launching work on the GPU, allocating memory, creating queues of work, creating dependencies, etc.
50:11 cuda.core is a Pythonic API to all of those things.
50:15 And that's what almost everybody should be using for doing those sorts of management tasks.
50:20 It pretty much maps one-to-one to the ideas and concepts from the CUDA C++ runtime.
50:28 And then the final piece would be Numba CUDA, which is the thing that you would use to write your own CUDA kernels.
50:37 And when you're writing those CUDA kernels, there are some libraries that can help you with that.
50:42 And one of them is CUDA Cooperative, which is going to be the thing I'll probably talk about the most in my Python talk.
50:49 And so CUDA Cooperative provides you with these algorithmic building blocks for writing CUDA kernels.
50:55 And we also have a package called NVMath Python, which provides you with more domain-specific pieces. Whereas CUDA Cooperative is generic algorithmic building blocks for things like loading and storing or doing a sum or a scan.
51:09 NVMath Python provides you with building blocks for things like a matrix multiply or random numbers or a Fourier transform, etc.
51:18 And NVMath Python also has host side APIs, so you can use it.
51:23 For the most part, you can access those through CuPy's SciPy-style packages, but you can also go directly through NVMath Python if you want to get to the slightly lower-level APIs that give you a little bit more control.
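To give a flavor of writing your own kernel with Numba CUDA, here is a small illustrative sketch (not code from the talk): an elementwise a*x + y kernel launched over CuPy arrays, which Numba accepts directly through the CUDA array interface.

```python
# Minimal Numba CUDA kernel sketch: elementwise a*x + y ("axpy").
import cupy as cp
from numba import cuda

@cuda.jit
def axpy(a, x, y, out):
    i = cuda.grid(1)            # global thread index
    if i < x.size:              # guard the last, partially-filled block
        out[i] = a * x[i] + y[i]

n = 1_000_000
x = cp.random.random(n, dtype=cp.float32)
y = cp.random.random(n, dtype=cp.float32)
out = cp.empty_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
axpy[blocks, threads_per_block](cp.float32(2.0), x, y, out)

cp.cuda.Device().synchronize()  # kernel launches are asynchronous
print(out[:3])
```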
51:35 I've heard the docs talk about host side operations versus not.
51:40 What does that mean?
51:41 Host side operations means things that you call from the CPU so that your Python program calls just as a regular Python program would, and then it runs some work on the GPU, and then when it's done, it reports back to the CPU.
51:57 And typically, in a lot of cases, like CuPy, these operations are synchronous.
52:03 So you call, like, CuPy's sum, it launches the work on the GPU, and the work on the GPU finishes.
52:10 And this whole time, the CuPy sum has been waiting for that work, by default.
52:15 And so it gives you this sequential model.
52:16 So real simple distributed programming model, right?
52:19 It looks like you're just calling local functions, but it's kind of distributed computing.
52:23 Distributed in the sense of it's distributed from the host to the device.
52:26 Device side operations are things that you're calling from within a CUDA kernel.
52:32 And a CUDA kernel is a function that gets run by every thread on the GPU or every thread within the collection that you tell it to run on.
52:43 For simplicity, let's just assume every thread on the GPU.
52:46 And so those operations are what we call cooperative operations in that a cooperative sum is a sum where every thread is expected to call the sum function at the same time.
52:59 and they all cooperate together, communicate amongst themselves to compute the sum across all the threads.
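Here is a hedged sketch of that cooperative idea using plain Numba CUDA primitives (shared memory, syncthreads, and an atomic add for the final combine), alongside the equivalent host-side CuPy call. It is an illustrative hand-rolled reduction, not the CUDA Cooperative library API itself.

```python
# Cooperative sum sketch: every thread in a block loads one element, the block
# cooperates through shared memory to compute a partial sum, and one thread per
# block atomically adds that partial sum to the global result.
import cupy as cp
from numba import cuda, float32

THREADS = 256  # threads per block; must be a power of two for this reduction

@cuda.jit
def block_sum(x, result):
    tile = cuda.shared.array(THREADS, float32)
    tid = cuda.threadIdx.x
    i = cuda.grid(1)

    tile[tid] = x[i] if i < x.size else 0.0
    cuda.syncthreads()                     # all loads done before reducing

    stride = THREADS // 2
    while stride > 0:
        if tid < stride:
            tile[tid] += tile[tid + stride]
        cuda.syncthreads()
        stride //= 2

    if tid == 0:
        cuda.atomic.add(result, 0, tile[0])

n = 1_000_000
x = cp.random.random(n, dtype=cp.float32)
result = cp.zeros(1, dtype=cp.float32)

blocks = (n + THREADS - 1) // THREADS
block_sum[blocks, THREADS](x, result)
cp.cuda.Device().synchronize()

# Device-side cooperative kernel vs. host-side synchronous cp.sum.
print(float(result[0]), float(cp.sum(x)))
```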
53:06 We're just about out of time.
53:07 You mentioned it in passing, but you're going to have a talk coming up at PyCon.
53:13 Yes.
53:13 Which, if people are watching the YouTube live stream, starts in just a couple days.
53:17 If they're listening to the podcast, maybe you can catch the video now.
53:20 That should be about enough time.
53:21 But it's called GPU Programming in Pure Python.
53:24 You want to give a quick shout out to your talk?
53:26 In this talk, we're going to look at how you can write CUDA kernels in Python without having to go to CUDA C++, how you have access to all the tools that you have in CUDA C++, and how you can get the same performance that you would have in CUDA C++.
53:45 Sounds like a good hands-on, not exactly hands-on, but at least concrete code version of a lot of the stuff we talked about here.
53:52 Yeah.
53:52 Yeah, great.
53:53 And now people are excited, they're interested in this.
53:57 What do you tell them?
53:57 How do they get started?
53:58 What do they do?
53:58 I would say they should go to the Accelerated Computing Hub, which is another GitHub repo that we have.
54:04 And on the Accelerated Computing Hub, we have open source learning materials and courses, self-guided courses.
54:12 And one of them is a GPU Python tutorial.
54:15 So you just go to Accelerated Computing Hub.
54:16 It's on GitHub.
54:17 Click on GPU Python tutorial.
54:19 And it takes you to a page with a whole bunch of Jupyter Notebooks.
54:24 And you start with the first one.
54:27 It opens up in CoLab.
54:28 It uses those Colab GPU instances, and you can start learning there.
54:35 There are other resources available on the Accelerated Computing Hub, and we're always working on new stuff.
54:42 So that is a good place to look.
54:45 There's also, I think, a PyTorch tutorial there, and we have Accelerated Python User Guide, which has some other useful learning material.
54:56 Thanks for being here, Bryce.
54:57 I love you.
54:57 Good luck with your talk.
54:58 And yeah, thanks for giving us this look at GPU programming in Python.
55:02 Thank you.
55:03 It was great being here.
55:04 Oh, I should also, I should plug my podcast, ADSP, the podcast, Algorithms Plus Data Structures Equals Programming.
55:11 We talk about parallel programming.
55:13 We talk about those three algorithms that I mentioned, transform, reduce, and scan, and how you can use them to write the world's fastest GPU-accelerated code.
55:22 And we talk a lot about array programming languages and all sorts of fun stuff.
55:26 So check it out.
55:27 Excellent.
55:27 Yeah, I'll link to it in the show notes for people.
55:28 Thanks.
55:29 Thanks for being here.
55:30 See you later.
55:31 This has been another episode of Talk Python To Me.
55:34 Thank you to our sponsors.
55:36 Be sure to check out what they're offering.
55:37 It really helps support the show.
55:39 This episode is sponsored by Posit Connect from the makers of Shiny.
55:43 Publish, share, and deploy all of your data projects that you're creating using Python.
55:48 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards, and APIs.
55:55 Posit Connect supports all of them.
55:57 Try Posit Connect for free by going to talkpython.fm/posit, P-O-S-I-T.
56:02 And it's brought to you by Agency.
56:05 Discover agentic AI with Agency.
56:07 Their layer lets agents find, connect, and work together, any stack, anywhere.
56:12 Start building the internet of agents at talkpython.fm/agency, spelled A-G-N-T-C-Y.
56:19 Want to level up your Python?
56:20 We have one of the largest catalogs of Python video courses over at Talk Python.
56:24 Our content ranges from true beginners to deeply advanced topics like memory and async.
56:29 And best of all, there's not a subscription in sight.
56:32 Check it out for yourself at training.talkpython.fm.
56:35 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.
56:40 We should be right at the top.
56:41 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.
56:50 We're live streaming most of our recordings these days.
56:53 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.
57:02 This is your host, Michael Kennedy.
57:03 Thanks so much for listening.
57:04 I really appreciate it.
57:05 Now get out there and write some Python code.