Learn Python with Talk Python's 270+ hours of courses

Data Science Cloud Lessons at Scale

Episode #519, published Thu, Sep 18, 2025, recorded Tue, Aug 26, 2025
Today on Talk Python: What really happens when your data work outgrows your laptop. Matthew Rocklin, creator of Dask and cofounder of Coiled, and Nat Tabris, a staff software engineer at Coiled, join me to unpack the messy truth of cloud-scale Python. During the episode we actually spin up a 1,000-core cluster from a notebook, twice! We also discuss picking between pandas and Polars, when GPUs help, and how to avoid surprise bills. Real lessons, real tradeoffs, shared by people who have built this stuff. Stick around.

Watch this episode on YouTube
Play on YouTube
Watch the live stream version

Episode Deep Dive

Guests introduction and background

Matthew Rocklin is the creator of Dask and cofounder of Coiled. He has spent years helping Python teams scale data workloads from laptops to clusters, and much of Dask’s early design comes from his work in the PyData ecosystem. (dask.org)

Nat Tabris is a staff software engineer at Coiled. He works on the practical edges of turning “Python on my laptop” into “Python on thousands of cloud machines,” shaping APIs such as Coiled’s cluster and function interfaces for real-world use. (Coiled)

What to Know If You're New to Python

Here are a few quick primers so the cloud-scale parts of this episode click immediately:

Key points and takeaways

  • Spinning up thousands of cores from a notebook is now a few lines of Python. From a local Jupyter or VS Code session, you can create a Coiled cluster with parameters like number of workers, architecture, region, and Spot policy, then attach Dask or other engines to it. In the episode we kick off a 1,000-worker cluster from a notebook, twice, to show how quickly you can go from idea to massive parallelism (a minimal code sketch follows this list). The key is to keep the developer experience simple while hiding the cloud plumbing.
  • Choosing between pandas, Polars, Dask, and DuckDB is about workload shape, not hype. Pandas remains the baseline for a single machine. Polars shines for high-performance, single-node analytics or when lazy execution and expression APIs matter. Dask scales Pythonic workflows across many cores and machines and plugs into pandas, scikit-learn, and more. DuckDB is terrific for in-process analytics and SQL-first tasks, often complementing DataFrame libraries. Coiled’s value is that it can stand up clusters for several of these tools so teams can pick the right engine per job (a short side-by-side sketch follows the key definitions below).
  • Run compute near your data to cut time and cost. Much cloud pain is self-inflicted by hauling data to the code. Running on VMs “next door” to S3 data and Parquet files reduces latency, egress, and brittle downloads. The episode shows decorator-style functions that execute remotely, in-region, with more memory or GPUs when needed.
  • Serverless has limits; long, big, or parallel jobs often want raw VMs. Lambda is fantastic for short, stateless functions with tight time limits, but long-running analytics or a Polars job that needs 64 cores per node will run better on dedicated machines. The tradeoff is keeping the ease of use while giving you direct control of CPUs, memory, GPUs, and region.
  • Cost surprises are real; guardrails must be product features. Stories in the episode include a single month burning an annual budget and a five-figure logging bill from overly chatty debug logs. Practical mitigations include choosing Spot when suitable, constraining regions and AZ spreading consciously, avoiding pricey NAT topologies, and surfacing metrics that teach teams what their workload is actually doing.
  • Environment replication is the silent killer of productivity. Recreating someone’s local Python environment in the cloud can be hairy, especially with private wheels, patched libraries, or data access secrets. The team discusses syncing packages and files automatically, so the cloud runtime mirrors your dev machine well enough that the code just runs, close to the data.
  • Kubernetes can be overkill for data teams; clusters should feel like a library import. Many users do not want to learn node pools, Ingress, and YAML to do ETL. The episode emphasizes APIs that feel like import coiled and a few parameters, but still deliver fleets of machines that scale up, spin down, and reuse resources when notebooks restart.
  • ARM instances, GPUs, and instance variety are practical levers. Clouds increasingly push ARM instances that can be cheaper and fast for many analytics workloads. For ML and heavy vector math, a targeted GPU choice is better than “more small CPUs.” The system should let you request specific VM types, architectures, and regions programmatically.
  • Spark, Dask, Polars, and DuckDB can all be “first-class citizens” on the same platform. Teams rarely have just one workload type. A healthy platform lets you spin up Spark for huge shuffles, Dask for Python-native graphs, Polars for single-node speed, and DuckDB for embedded SQL analytics, without switching vendors or rewriting every pipeline.
  • Notebook ergonomics matter at cluster scale. Little papercuts at 1 worker become real pain at 1,000. The show covers reusing clusters after a notebook restart, clean shutdowns so you don’t accidentally leave fleets running, and having a web UI and CLI that make the state of a cluster obvious.
  • Metrics make architecture choices obvious. The right dashboards quickly show whether a workload is embarrassingly parallel, network bound, or heavy on shuffles. With those insights, you can decide to spread across AZs for more Spot capacity, or pin to a single AZ to keep shuffle traffic cheap and predictable.
  • Decorator-style “run this in the cloud” unlocks adoption. The @coiled.function API mirrors how data folks already write code. Mark a function, choose a region or GPU, and execute near the data without refactoring into a new framework. It’s conceptually similar to platforms like Modal but tuned to data workflows and multi-VM scaling when needed (see the decorator sketch following this list).
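
To make the cluster-from-a-notebook takeaway concrete, here is a minimal sketch of the pattern shown in the live demo. The keyword arguments (n_workers, region, arm, spot_policy, worker_cpu) reflect Coiled's cluster API as best we can recall; treat them as illustrative and check the current docs before relying on them.

```python
# Minimal sketch: spin up a large cluster from a notebook, attach Dask,
# do some work, and shut everything down. Keyword arguments are assumptions
# based on Coiled's cluster API; verify against the current documentation.
import coiled
from dask.distributed import Client

cluster = coiled.Cluster(
    n_workers=1000,                    # one VM per worker, as in the demo
    region="us-east-1",                # run next to the data
    arm=True,                          # ask for ARM (Graviton) instances
    spot_policy="spot_with_fallback",  # prefer Spot, fall back to on-demand
    worker_cpu=2,                      # a couple of CPUs per machine
)

client = Client(cluster)               # point Dask (or another engine) at the fleet
print(client.dashboard_link)           # watch the work as it runs

# ... run Dask / pandas-style work here ...

client.close()
cluster.close()                        # shut the machines down when finished
```

The second run in the episode is essentially the same call with ARM turned off and the region switched to us-west-2, which is what makes this kind of experimentation cheap to try.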

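The decorator-style API from the last takeaway can be sketched as follows. The S3 bucket, paths, and column names are hypothetical, and the keyword arguments (region, vm_type) are assumptions based on Coiled's serverless-functions interface, so treat this as an illustration of the idea rather than a definitive recipe.

```python
# Hedged sketch of "run this function in the cloud, next to the data".
# Bucket, paths, and column names below are hypothetical.
import coiled
import pandas as pd

@coiled.function(
    region="us-east-1",     # same region as the S3 data, so reads stay fast and cheap
    vm_type="m7g.2xlarge",  # a bigger (ARM) machine than the laptop has
)
def summarize(path: str) -> pd.DataFrame:
    # This body executes on a cloud VM that Coiled provisions on demand.
    df = pd.read_parquet(path)
    return df.groupby("customer_id")["amount"].sum().reset_index()

# Called like a normal function; the VM is started, kept warm for a while, then released.
result = summarize("s3://example-bucket/events/2025-08.parquet")

# Fan the same function out across many files, each running near the data.
monthly = list(
    summarize.map(
        f"s3://example-bucket/events/2025-{m:02d}.parquet" for m in range(1, 9)
    )
)
```
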
Interesting quotes and stories

"I've imported library coiled, and I'm going to create a coiled cluster... And n_workers equals 1,000." -- live demo in the episode showing cluster creation from a notebook

"We burned through our annual budget last month, we don't know why." -- story about a Kubernetes-based setup that surprised the team on cost

"Lambda isn't going to cut it." -- on when a 64-core Polars job or long-running analytics outgrow serverless limits

"Our core competency is turning VMs on and off." -- a wry way the team describes what reliable, scalable orchestration really is in practice

Key definitions and terms

  • Dask: A Python library for parallel computing that scales familiar APIs from laptops to clusters. (dask.org)
  • Coiled: A platform and Python library that provisions and manages cloud compute for data workloads, including Dask clusters and serverless functions. (Coiled)
  • pandas: The canonical Python DataFrame library for single-machine analytics. (Pandas)
  • Polars: A fast DataFrame library written in Rust with lazy execution and an expression API. (pola.rs)
  • DuckDB: In-process OLAP database used for analytics and SQL over local files. (DuckDB)
  • Parquet: An open, columnar file format optimized for analytics. (Apache Parquet)
  • S3: AWS object storage where many teams keep data lakes and Parquet datasets. (Amazon Web Services, Inc.)
  • Availability Zone (AZ): Independent data centers within an AWS region; topology and data movement across AZs affect reliability and cost. (Amazon Web Services, Inc.)
  • NAT Gateway: Managed egress for private subnets in AWS that charges per hour and per GB; easy to overspend without care. (Amazon Web Services, Inc.)
  • AWS Lambda: Serverless compute with a maximum 15-minute execution per invocation, great for short, stateless tasks. (AWS Documentation)
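
To make the engine definitions above concrete, here is one aggregation over a Parquet file expressed in pandas, Dask, Polars, and DuckDB. The file path and column name are hypothetical, and which engine is fastest depends on data size and where the data lives, which is exactly the tradeoff discussed in the takeaways.

```python
# One aggregation, four engines. Path and column name are hypothetical.
import pandas as pd
import dask.dataframe as dd
import polars as pl
import duckdb

path = "events.parquet"

# pandas: eager, in-memory, the single-machine baseline
pandas_counts = pd.read_parquet(path).groupby("country").size()

# Dask: the same pandas-style API, split into partitions and run in parallel
dask_counts = dd.read_parquet(path).groupby("country").size().compute()

# Polars: build a lazy plan, then execute it in parallel with .collect()
polars_counts = (
    pl.scan_parquet(path)
    .group_by("country")
    .agg(pl.len())
    .collect()
)

# DuckDB: in-process SQL directly over the Parquet file
duckdb_counts = duckdb.sql(
    "SELECT country, count(*) AS n FROM read_parquet('events.parquet') GROUP BY country"
).df()
```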

Learning resources

Here are curated learning paths to go deeper on the exact tools and tradeoffs discussed.

Overall takeaway

Cloud-scale Python is less about exotic tech and more about excellent defaults. The episode shows that if you give data teams a few ergonomic primitives (spin up clusters from a notebook, run code near data with a decorator, choose machines like you choose pandas options), they will do the rest. The magic is hiding just enough cloud to unlock thousands of cores when you need them, while making costs and constraints obvious so you never wake up to a surprise bill.

Matthew Rocklin: @mrocklin
Nat Tabris: tabris.us

Dask: dask.org
Coiled: coiled.io
Watch this episode on YouTube: youtube.com
Episode #519 deep-dive: talkpython.fm/519
Episode transcripts: talkpython.fm
Developer Rap Theme Song: Served in a Flask: talkpython.fm/flasksong

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
Episode #519 deep-dive: talkpython.fm/519

Episode Transcript

Collapse transcript

00:00 Today on Talk Python, what really happens when your data work outgrows your laptop?

00:04 Matthew Rocklin, creator of Dask and co-founder of Coiled, and Nat Tabris, a staff software engineer at Coiled, joined me to unpack the messy truth of cloud-scale Python.

00:15 During the episode, we actually spin up a 1,000-core EC2 cluster from a notebook, twice.

00:22 We also discussed picking between pandas and polars, when GPUs help, how to avoid surprise cloud bills, real lessons, real trade-offs shared by people who have built this stuff.

00:33 Stick around. This is Talk Python To Me, episode 519, recorded August 26th, 2025.

00:54 Welcome to Talk Python To Me, a weekly podcast on Python. This is your host, Michael Kennedy.

01:00 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both accounts over at fosstodon.org, and keep up with the show and listen to over nine years of episodes at talkpython.fm. If you want to be part of our live episodes, you can find the live streams over on YouTube. Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified about upcoming shows. This episode is brought to you by Sentry. Don't let those errors go unnoticed. Use Sentry like we do here at Talk Python. Sign up at talkpython.fm/sentry.

01:33 Matthew, Nat, welcome to the show. Awesome to have you here.

01:36 Hey, Michael. Good to be here.

01:37 Thanks.

01:38 It's been a while, Matt, since you've been on the show.

01:41 Every year or two, we got to chat. You and I chat at conferences sometimes too. It's always good to see you.

01:45 Absolutely. Yeah, it's always really good to see you as well. And I think it's high time for an update. And we talked a little bit a couple of days ago about what the two of you are up to at Coiled. And wow, is it quite interesting, the things that you've been doing. It's really come a long ways and quite a bit more ambitious in terms of running data science in the cloud. So

02:04 it's going to be a fun conversation with you two. I look forward to it. I also brought Nat Tabris as a colleague here. I've stopped doing some engineering work these days. I used to be really smart and now I mostly do engineering manager-y things or CEO things. So I've brought my friendly sidekick to actually explain real stuff.

02:20 Yeah, wonderful.

02:20 I always love having a couple of people on the show get a little extra riffing off of the ideas and so on.

02:26 So great to have you here, Nat.

02:28 Welcome to the show.

02:29 Thanks.

02:29 Before we dive into cloud computing for data scientists and all the lessons you all have learned, let's just do a quick catch up.

02:37 Matthew, you've been on the show a couple of times, but not everyone has listened to every episode.

02:43 In fact, an interesting stat that just came out of the recent PSF / JetBrains developer survey results is 50% of the people that do Python have only been doing it for less than two years professionally. So there's a lot of new people in our industry. And 50% of Python these days is data science, which I think is, that's a shift as well. That's the beauty of Python,

03:05 right? It's enough of a programming language that like serious computer science people can get excited about it and is accessible enough that everybody can use it. That's why it's become really popular, right? Compare that to maybe C++, which is like very computer science focused, or maybe like MATLAB, which is like very user focused. Python can kind of bridge those two. And that's what's always given it, I think, its special status in the world. I agree. And I think it, people,

03:31 once they get started, they have lots of reasons to stick. All right. All of that is a long-winded way of saying there's probably a lot of people who have not, have not heard about you before. So give us the quick introduction on yourself, Matthew first, and then Nat. My name is Matthew Rocklin.

03:44 I am a long-term open source contributor to the Python space, particularly sort of the data science part of Python.

03:49 That started like many years ago with projects like Toolz, Multiple Dispatch, SymPy.

03:53 And then maybe 10, 12 years ago, I started a project called Dask for parallel computing.

03:58 And like around Dask was lots of other projects.

04:01 For the last decade, I've mostly focused on making it easy for Python developers to solve big problems with lots of computers and lots of hardware.

04:09 Five-ish years ago, I started a company around that called Coiled, and now we do other things.

04:13 But I've sort of been in that space in between Python developers and lots of hardware and trying to make that as easy as possible.

04:21 Yeah.

04:21 First, you built all the packages for people that do cool data science stuff.

04:25 Now you're building the infrastructure and clicking that together.

04:28 That was the next hardest problem.

04:30 Indeed.

04:30 All right.

04:31 Nat, welcome.

04:32 Yeah.

04:32 So I'm Nat Tabris.

04:34 I'm a software engineer at Coiled.

04:36 Been here close to four years now.

04:39 And my background is both some, like I've done some research software engineer stuff, helping people use Python well, helping people use Dask well sometimes.

04:48 And then also some sort of cloud SRE stuff.

04:52 So somewhere like Coiled is a lot of fun because we get to do the Python stuff and we get to do the cloud stuff and we get to help other people do the Python stuff on the cloud.

05:01 I think that's actually one of the interesting, non-obvious things of building data science tools and infrastructure for data scientists. Take Jupyter as an example, right? A lot of the work on Jupyter is JavaScript-y type of stuff. But the purpose is to make it so Python people can do more Python without thinking about that kind of stuff, right? And you all probably have this sort of DevOps angle of that same thing going on there. Like you do a lot of DevOps,

05:28 so many people don't have to do much DevOps at all. Yeah. And most users don't have that capability, right? We need to make tools that live in a space, but that don't require deep expertise in that space. That is, again, where I think Python brings a lot of power, sort of the old like XKCD import anti-gravity comic, if you remember that.

05:47 I'm flying. How are you flying? Well, just typed import anti-gravity. You know what was really funny when I was getting into Python was first learning that you could actually type that in

05:56 the interpreter. It wasn't just a comic. I didn't know that.

05:58 Yeah. If you open a Python REPL and you type import anti-gravity, something happens.

06:04 It has to be done right. It has to be done. So, pretty amazing. Let's start with talking about sort of the evolution of Coiled. So when we first started talking about Coiled, it's been a while, three, four years, I feel like, maybe the first time you and I were on the show together to talk about it. Anyway, it was effectively Dask as a service, right? So Dask is a way to do kind of grid scale-out computing with pandas and that type of work. And it would create a bunch of, you'd point it at a cluster and it would kind of spin up some machines and focus on, like, how do I execute

06:39 pandas-like work on a cluster? Dask did lots of things that weren't pandas. Pandas users are like 30 percent of Dask users. Dask is a much more general purpose project than that. You're 100% correct. That pain that we felt in the Dask community at a certain point wasn't to make Dask better. It was to make Dask easier to deploy, especially in the cloud.

06:57 A lot of people were in the cloud those days.

06:59 And it was just like a pain in the butt to bring up all the machines, set up everything correctly to manage Docker things.

07:05 And so we made a company around that.

07:07 The weird thing that happened, so in order to run Dask in the cloud, we had to figure out how to run Python effectively in the cloud.

07:14 And so we made lots of interesting technology to do that.

07:18 Though the surprising thing that happened is that a lot of our customers started using Coiled not for Dask things.

07:24 They would use Coiled to spin up a Dask cluster.

07:26 They would throw away the Dask cluster and just use the machines that Coiled brought up for them.

07:30 It turns out that making Python run well at scale in the cloud is much more generally applicable than Dask in particular.

07:39 Our customer base has shifted over time from being very Dask heavy to being more general computing heavy.

07:46 That makes tons of sense.

07:48 It's fine to spin up a bunch of Dask clusters, but really, that was one specialization of, I just need a bunch of computers to run my data science workload.

07:58 When the engineers like Nat came to me and said, hey, look, we should do this thing.

08:02 At first, I was like, that doesn't make any sense.

08:04 The cloud must offer the, just run this thing, just bring up a machine, run some code, turn off the machine.

08:11 Obviously, the clouds must do that.

08:12 I was like, no, they don't do that.

08:14 Or they don't do that well.

08:15 There's some APIs to do that, but they're really inaccessible.

08:18 If you want, go to ChatGPT and ask it to give you copy-pastable commands to turn on 100 VMs and run Hello World and turn them off.

08:27 And it'll type at you for a couple of minutes.

08:29 And it's not the kind of typing that most data scientist people who do use Python for a couple of years can do.

08:36 It's actually pretty inaccessible.

08:37 I was actually quite shocked at how hard this relatively commonplace thing was to do.

08:43 A lot of what people do with data science, but also a lot of the courses, the tutorials, the libraries, they all lead data scientists away from developing those skills as well, right?

08:54 They don't necessarily encourage you to start using Docker a lot, to start writing raw Linux commands.

09:00 How can I make this work?

09:02 It's not that they don't, but I think coming into it, like talking about those beginners, like the first two years of their job, they're still working on how do I do data science libraries right?

09:11 It begs the question, should they?

09:13 And my answer is maybe we shouldn't solve this by educating people.

09:17 maybe we should solve it by building better tooling.

09:19 Like I actually don't, like Docker is a great technology, but not necessarily for data science.

09:24 Like Docker is very much specialized to provide like a really stable system that can run for decades.

09:31 But like we want a system that can change every five minutes.

09:34 Like the choices that tools like Docker, Kubernetes or Terraform make are actually quite different than the choices you would make.

09:41 I think if you were building sort of middleware for this audience, those tools, Like the cloud gives you all the things that you would want.

09:51 It gives you this sort of fully flexible system, getting kind of hardware you want.

09:54 It's infinitely scalable.

09:56 It goes away when you stop using it.

09:58 It's like very ephemeral.

09:59 You pay only for what you use, but it's like pretty unusable.

10:03 Like it's designed for cloud infrastructure engineers.

10:05 We've built middleware on top of that, but that middleware is like not designed for our use cases.

10:10 Yeah, there are a bunch of tools like Pulumi and others that'll spin up machines, but they're pretty different.

10:15 And you hinted at it a little bit there.

10:19 Much of the cloud infrastructure and APIs, just the way that it's meant to work is it's for a little bit longer lived systems and it's more focused on web API development.

10:30 How do I maybe take an API that's running on four machines, scale it up to eight through auto scaling, not how do I get a thousand machines now for three minutes and then turn them off?

10:42 That's different, right?

10:42 Yeah.

10:43 I think a lot of the tooling that exists isn't really with our community in mind.

10:48 And so when we built Coiled to run these Dask clusters, we looked around at other software and we couldn't find something that actually fit our needs.

10:56 And as a result, we went and we committed the cardinal sin.

10:59 We rolled our own.

11:00 And my hope, but we're talking to folks today, is that there's actually some interesting, we made some opinionated choices in doing that.

11:05 I think we've actually come up with some interesting things that like use Coiled or don't use Coiled.

11:09 I think some of the choices we made are actually pretty interesting.

11:11 some of the things we ran into in reconstructing one of these new frameworks, but in this sort of service of a sort of highly bursty, highly flexible system, there's some interesting engineering choices in there.

11:24 There's some interesting experiences using the cloud at that scale, which I think people aren't as familiar with.

11:29 I've seen some of the stuff that you've done, and it's really neat.

11:32 It's not just hooking to this infrastructure, but it's down into programming idioms and concepts that make it almost transparent what's happening?

11:42 In building abstractions, you have to think a lot about what kind of things you abstract away from the user and what kind of things you give to them directly.

11:50 So for our users, we find that they really care about what kind of machine they run on.

11:54 They like want to specify like the exact VM type sometimes because they want to have an SSD and an A10 GPU and they want to be in this particular region where they're available.

12:05 They like have a lot of opinions about that and they have like zero opinions

12:08 about their networking setup or the security things other than please make it secure.

12:12 And I think in shaping abstractions, one makes some choices and it's interesting to sort of figure out which choices to make and how to build something that gives that set of choices to a user.

12:25 This portion of Talk Python To Me is brought to you by Sentry's Seer.

12:29 I'm excited to share a new tool from Sentry, Seer.

12:33 Seer is your AI-driven pair programmer that finds, diagnoses, and fixes code issues in your Python app faster than ever.

12:40 If you're already using Sentry, you are already using Sentry, right?

12:45 Then using Seer is as simple as enabling a feature on your already existing project.

12:50 Seer taps into all the rich context Sentry has about an error.

12:54 Stack traces, logs, commit history, performance data, essentially everything.

12:58 Then it employs its agentic AI code capabilities to figure out what is wrong.

13:03 It's like having a senior developer pair programming with you on bug fixes.

13:07 Seer then proposes a solution, generating a patch for your code and even opening a GitHub pull request.

13:13 This leaves the developers in charge because it's up to them to actually approve the PR.

13:18 But it can reduce the time from error detection to fix dramatically.

13:23 Developers who've tried it found it can fix errors in one shot that would have taken them hours to debug.

13:29 Seer boasts a 94.5% accuracy in identifying root causes.

13:34 SEER also prioritizes actionable issues with an actionability score, so you know what to fix first.

13:41 This transforms Sentry errors into actionable fixes, turning a pile of error reports into an ordered to-do list.

13:49 If you could use an always-on-call AI agent to help track down errors and propose fixes before you even have time to read the notification, check out Sentry's SEER.

13:59 Just visit talkpython.fm/SEER, S-E-E-R.

14:03 The link is in your podcast player show notes.

14:06 Be sure to use our code, TALKPYTHON.

14:09 One word, all caps.

14:10 Thank you to Sentry for supporting Talk Python To Me.

14:13 What about cost?

14:14 I mean, if I were to be doing this myself, if I were to spin up, hey, I need 500 machines for 10 minutes, I'd be certainly worried that what if I didn't turn them all off?

14:26 That's catastrophically bad sort of things.

14:29 It's one thing, oh, yeah, okay, I left a GPU-enabled machine on for a day, And that wasn't pretty, but it's a whole nother to leave large, significant numbers of machines on.

14:39 Let's do that. Let's put up a bunch of machines. It might be fun. But Nat, first, do you want to say anything about costs? I mean, I think the cost story isn't just leaving things on. It's more complicated than that. It's like the cloud is really great and really cheap.

14:49 You can do that if you do it right at pennies or dollars, but there's all, I don't know if you see them, but I see all of these stories of, here's how I accidentally spent $60,000 on AWS. And it's always like, oh, I didn't even realize that you could do it that way. Yeah. And a lot of times

15:08 it's a misunderstanding of auto scaling or something like that, right? There's a crazy story of this woman who wrote an AI-or-not image detecting app. She was a photographer and she got really frustrated that there's all this AI generated art. And so it's like, I'm going to make an app that will tell you or only show you real art, not AI art, in order to not have to filter a bunch of stuff. Right.

15:32 and made that serverless and it became super like fifth, sixth most popular thing in the app store.

15:37 And it just scaled like it was supposed to. There was no downtime, but it scaled to like a $96,000-and-climbing Vercel bill. I know you guys talked about maybe a Kubernetes story with like a $50,000 surprise bill. These are not good things.

15:53 The story you just told is actually, it's a positive one, right? She made a thing that was useful and like a lot of people use it. And so it costs more money. Like it's unfortunate that it It was a surprising amount of money, but like it was all useful work.

16:04 I think what we see in customers pre-coiled is like often their costs are not useful work.

16:09 A story from, go ahead, Nat.

16:10 I mean, also part of, I think what the cloud makes hard is these like guide rails.

16:14 You're doing something that you don't know if it's going to be risky.

16:18 And so part of what we try to do is like put in, put in defaults, put in control so that you can't accidentally spend that much money.

16:27 Just to, I mean, like if you don't know the cloud, I remember before being a cloud engineer, like I don't want to sign up for this account and put in my credit card.

16:34 I don't know what the bill is going to be.

16:35 So I'll tell two quick anecdotes.

16:37 One is my first experience with another surprising bill.

16:40 I was like in graduate school, signed up for Amazon.

16:42 I was on the free tier, created some VMs and pulled around to turn them off.

16:46 And then like three months later, I get a bill for $400.

16:49 And it wasn't the VMs.

16:50 It was the like attached storage to the VMs or some networking resource that had stuck around that I had no concept of.

16:56 I wasn't there. There were abstractions I wasn't really aware of. AWS did a fine job.

17:01 They actually refunded me the money. They credit it back. They like happens all the time, but like, that's a case where the cloud is really complex and it's really easy to shoot yourself in the foot with any complex system, especially that complex system comes with dollars attached.

17:14 Yeah. And it's not just compute, right? That's in your example, it was storage, but there's all sorts of little other services. Oh, I just spun up a database for that. And we actually inserted way more data than I thought and then forgot to delete that or whatever, right?

17:27 We've got thousands of customers who do the same thing and then we run through those and we deal with them. So we've seen a lot of those same stories. Another story sort of also pre-coiled but more sort of late in professional life. I was running, this is the $50,000 story you're mentioning. I was running a Kubernetes cluster for a customer, for a research group that I was collaborating with. And we're running Jupyter stuff and Dask stuff. They were all pretty happy.

17:50 They had to learn Kubernetes we weren't super happy about, but they were able to do things they couldn't do before and operate on scales they couldn't do before.

17:55 And this was huge.

17:57 It was really exciting.

17:58 And then one month I got an email, it's like, "Hey, we burned through our annual budget last month.

18:02 We don't know why." I was like, "Hey, what's going on?

18:04 Everything seems fine in the logs, but there's a surprise $50,000 bill." So, well, one thing that's different is that we're now running this thousand node job, but only for like 10 minutes, every six hours.

18:17 So every six hours, this job comes on, runs, a thousand machines, and then turns off. Do the math, it should be like 10 bucks a day, 20 bucks a day, obviously not thousands of dollars a day. And so what had happened is that their code brought up lots of Kubernetes pods. All the pods bring up, then 20 minutes later, they went down, everything worked great. But beneath Kubernetes, there was a node pool, and the node pool had attached an auto scaling group, right? And that auto scaling group had a policy. It's like, hey, if you need lots of nodes, no problem, we'll give you lots of nodes. But in scaling down, I expected the nodes to go away. And actually the policy was, if the average CPU percentage is less than 50%, remove one node, check back every five minutes. And so they were getting a thousand nodes. And then five minutes later, they got 999 nodes and then 998 nodes.

19:06 And that would decline very slowly. And then six hours later, go back up to a thousand.

19:10 And that policy of like, remove one node if CPU utilization is low, makes a whole lot of sense if you are in the web services space, because that's kind of the cadence and kind of the dynamic scales that occur in web services.

19:23 Right, right.

19:24 Ebbs and flows throughout the day, but it's rarely a huge spike and then a huge drop, yeah.

19:30 Yeah, it makes no sense for the kind of users we deal with who want 50 GPUs for 10 minutes and 1,000 CPUs for an hour and then nothing.

19:38 It's like there's two different lessons here.

19:40 One is that the technology that we saw wasn't well-tuned for our audience, for our user base.

19:46 And also like there were just more abstractions than there needed to be.

19:50 What I really wanted at the time wasn't Kubernetes and node pools.

19:53 It was just like, I want a thousand VMs.

19:55 I wanted EC2 was the right abstraction for me.

19:59 I didn't want any other stuff on top.

20:01 And so when we built Coiled, we actually designed for that.

20:04 We designed for raw VMs.

20:06 We call the raw VM architecture.

20:08 And we just spin up a thousand EC2 instances or a thousand Google or Azure equivalents.

20:12 We hook them all up dynamically and then we shut them all down when we're done.

20:15 And that approach is kind of weird.

20:17 A lot of our customers when they first see it, like that's odd, but it actually provides like a really interesting architecture that we found to be really interesting.

20:24 If you're game, go ahead.

20:25 I do think that is certainly what people want, right?

20:29 You don't want to have one of these abstraction layers in there.

20:34 You want just, I just need these machines.

20:36 I need to run my code.

20:37 But I think you're saying, let's walk through an example.

20:40 And I think that's great.

20:41 I think one of the challenges that you're going to run into is like, how do you even make a thousand VMs quickly and not spend most of your compute on machine setup and configuration?

20:51 And things like, even if you're not using some auto-scaling, auto-tune down sort of thing, right?

20:58 You think those would be the kind of problems we'd run into with 10 machines, but with a thousand, you run into other problems, all sorts of problems that'll show up.

21:03 I'm actually, I'm expecting the demo to kind of fail.

21:07 I think it'll be interesting to see what happens.

21:09 It's weird doing a demo on a podcast.

21:11 Yeah.

21:12 I know.

21:12 Everyone listening, I'm going to narrate this very carefully.

21:16 And if you go to the YouTube stream, you can watch it at minute 21, 25.

21:21 But I will narrate it because there's some interesting ideas, like the idioms and stuff that I spoke about.

21:26 I'm on my local machine.

21:28 I'm in a Jupyter notebook, but I can be in VS Code or Cursor or whatever.

21:31 And I'm typing in, I've imported library coiled, and I'm going to create a coiled cluster, just typing into Python some code.

21:37 n_workers equals a thousand.

21:40 We'll ask for some ARM machines.

21:42 We'll ask for spot if it's available.

21:45 If it's not available, fall back to on-demand.

21:47 And we'll ask for each machine to have maybe just a couple of CPUs.

21:51 Matthew, before we go on, let's just talk about some of these things.

21:54 You go to Coiled and you say, create me a cluster.

21:56 Workers is a thousand.

21:57 That's a thousand EC2 instances, right?

21:59 It will be, yeah.

22:00 That's insane.

22:01 I'm also not going to Coiled.

22:02 I'm going to my local Python environment here on my MacBook.

22:06 Mac mini is in Austin, but it could be on a CI job or wherever.

22:09 By saying Coiled, I meant like you're using the Coiled API.

22:11 locally. Yeah, yeah. And then what are spot instances for people who don't live in EC2?

22:17 Yeah. So spot, also called preemptible in other clouds sometimes, are just instances that are cheaper because they don't have the guarantee of sticking around. The cloud can claim them back if some other customer willing to pay full price is willing to pay. And so we're looking for sort of cheap capacity. And then if it's not there, we actually are also willing to pay for on-demand for full priced instances.

22:41 So we're going to try to get a thousand, but we're like, hey, if there's any discount things around, I'm happy to take them too.

22:45 The alternative is maybe you're going to set up, and again, this is more like the web API world.

22:51 I'm going to set up an EC2 instance and I'm going to configure my website on it.

22:54 I'm just going to leave it running because my website should be up 24 seven in a perfect world.

23:00 So I'm going to get a machine and just leave it.

23:03 I'm going to, those are often reserved instances, which have a different type of pricing, but a commitment to long-term, right?

23:09 You pay less by committing to pay for it for a month or a year or something.

23:13 And so that's kind of the opposite of the spot.

23:15 It's like, if there's anything that just happened to be hanging around, give us your cheap temporary ones, right?

23:20 Okay, cool.

23:21 So I think people get a sense of what you're going to go ask for.

23:24 Let's just run that.

23:25 So take about a minute, maybe a couple of minutes, because it's a large number of machines.

23:29 So the first thing that's happening is that we're scraping my local MacBook for all the Python packages I've installed.

23:35 every conda package, pip package, local .py file, local editable package, and then we are

23:43 spinning up a thousand machines. Let's talk about that environment a little bit, because it's fine to have a thousand machines, but you are intending to write a bunch of code in a Jupyter notebook that probably depends on a bunch of stuff, like whatever you've conda or pip installed, and exactly those versions. And they might be uncommon things that you want to run, or maybe you have local files that you're going to subsequently say, load this CSV file or this Parquet file and jam on it, right? How does that stay coherent?

24:14 I'll actually add it, make it even more complex. We're also running this first from my MacBook, but I'm actually running it on some Linux machines. And so I've got to shift the architecture of those packages where appropriate. I may have private packages that are running, I'm pulling from my company's local or artifactory repository. I may have packages that I've installed locally that I've printed debug statements into. Things get actually quite hairy trying to replicate someone's local environment into a remote environment. And we try to do exactly that. We try to replicate as much as we can a local environment remotely. That could be data.

24:51 You've mentioned data like small files might move.

24:53 Large files probably not. Yeah, those might be pulled out of S3 storage or something like that, right? It's a complicated situation just a note for what's on the screen right now it says you've booted 838 machines and uh working on the environment for 137 of them that's pretty wild we only got 160 to go yeah i

25:12 mean also things to note like we got 54 m8g larges those are like the nice generation of aws and actually didn't know those were available those are new for me we've got 942 m7g larges and four of the oldest generation M6G.

25:29 So actually Coiled had to go to the cloud and get actually like a variety of different instance types because AWS ran out.

25:34 In fact, maybe Nat, you can talk a little bit about like what just happened behind the scenes there.

25:39 What are all the steps that we had to do in order to get those machines?

25:42 They're also now all available.

25:44 There's like all sorts of things that at the scale of like, I want one or two computers that just work, that like all sorts of things break down when you're asking for hundreds or thousands.

25:56 So some of that is like, you can't just make individual API calls anymore.

25:59 You would very quickly get rate limited.

26:01 And then we would spend 10 minutes asking for these VMs instead of 15 seconds.

26:06 So we use a variety of things.

26:08 We use fleets.

26:09 We use requests that, again, so this ties nicely with a spot where you say, basically, AWS, I want you to give me any of these range of things that have the best availability at the best price.

26:24 So here, yeah, Matt is pulling up and we can actually see, we can see how much spot we got.

26:31 So it looks like we got a fair number of, actually, I don't know, what is that?

26:35 Like four fifths of our instances, we did manage to get spot.

26:39 And those are going to be anywhere from like, I don't know, roughly half the price of a on-demand instance, sometimes more, but often even less.

26:49 So they can be 30% of full price.

26:52 We were actually running this last night and I was actually hoping that it would fail.

26:56 It didn't fail, sadly.

26:58 Let me see if I can bring up a fail case.

27:00 You have an odd hope for your demo.

27:03 So it wasn't us that failed.

27:04 It was Amazon that failed, actually.

27:07 Amazon failed because it actually didn't have capacity.

27:10 It turns out that the cloud is unlimited until you start doing really big things.

27:14 And then it's like, oh, you got to be clever if you want to get.

27:18 And I mean, to some extent, everyone knows this nowadays with GPUs, right?

27:22 GPUs are fairly constrained.

27:23 But even when you're asking for just like 2,000 CPU instances, AWS often says, sorry, here's 200, unless you know how to like frame that request in the right way.

27:36 Right.

27:37 And you guys have solved, that's some of the gnarly DevOps you all have solved so other people don't have to, right?

27:43 This portion of Talk Python To Me is brought to you by our latest course, Just Enough Python for Data Scientists.

27:49 If you live in notebooks but need your work to hold up in the real world, check out Just Enough Python for Data Scientists.

27:55 It's a focused code-first course that tightens the Python you actually use and adds the habits that make results repeatable.

28:03 We refactor messy cells into functions and packages, use Git on easy mode, lock environments with uv, and even ship with Docker.

28:11 Keep your notebook speed, add engineering reliability.

28:15 Find it at Talk Python Training.

28:16 Just click Courses in the navbar at talkpython.fm.

28:20 Matt, I feel like we should run some code on this thing now. You got them sitting here ready.

28:24 There's a great problem of actually supporting Python users, I think, is that people want to do all sorts of different kinds of things. They want to load a bunch of data with Pandas or with Polars or with DuckDB, or they want to train a machine learning model with a certain kind of GPU or a certain kind of whatever. There's actually a lot of variety in what people do. And it's challenging to build a tool that provides, again, some of that abstraction, but not others. Like Snowflake provides the abstraction of SQL, and that's like a very clear thing to do.

28:54 We provide the abstraction of, here's a machine that you can play with, do whatever you want to do with it. And sometimes that's Python code, sometimes it's other weird stuff. I'm actually, rather than show code, I might actually show a few examples of like, people will play things like with pandas, like what you mentioned, like machine learning, like climate science. There's all sorts of different weird things. We actually found that a lot of people would, inside of their Python code, they would import the subprocess module, and then they would run a process that was calling some other Fortran or C code that had nothing to do with Python. And so we found that and we said, great, let's go and let's support that by running arbitrary batch jobs, arbitrary programs. Like here in this example, I'm just like running echo Hello World.

29:43 So it's a weird set of problems to handle the infrastructure for this set of people, but not be very opinionated about what they do with that infrastructure.

29:54 So at this point, they pretty much have a machine.

29:57 They can, if they can do it from a Jupyter Notebook, they're kind of good to go.

30:01 Yeah, they have a thousand machines.

30:03 And those machines look just like their machine, just more numerous or bigger or with GPUs or whatever you like.

30:10 Sure. And so how do I bring data back together?

30:12 How do I fan it out?

30:13 How do I map reduce or like execute a bunch of, I've got a million rows.

30:19 I want to send a hundred to each machine or I don't know, whatever the math works out to be a thousand each machine.

30:24 Right. So now we're asking the question, how are we going to use the machines that we have?

30:27 The answer to the questions you just asked me is Dask.

30:30 You might use Dask to, hey, you got a petabyte of parquet data on S3, use a Dask cluster.

30:35 Or if you want, use a Spark cluster.

30:37 We can spin up a Spark cluster, we can spin up a Polar, we can spin up lots of different things.

30:40 And then now you're at the point of the, you're at the level of the sort of distributed computing framework.

30:45 And they can go and run things with that, with whatever distributed computing framework they like.

30:49 Often, there's then a whole other set of problems they then deal with.

30:52 Again, when we started Coiled, it was all around deploying Dask.

30:55 But what we found is actually like most times the problems they wanted to solve were not that complicated.

31:00 They were not sophisticated.

31:01 They were very simple problems.

31:03 They had a thousand parquet files in S3.

31:06 They wanted to do the same thing on each parquet file.

31:08 They actually didn't want to use Dask DataFrame.

31:09 They wanted to use Polars, or they wanted to use DuckDB.

31:12 And so we would give them APIs that let them, you know, so an example of an API is the coiled function API.

31:19 It's a decorator.

31:20 Think, if you're familiar with Modal, something similar.

31:22 And it does the same thing.

31:23 It spins up a machine.

31:24 You run your function on that machine, and it goes ahead and spins up the VM, runs it, scales it down.

31:30 Yeah, that's really neat.

31:31 And when you say on your website, it says serverless Python, is it the same thing as like...

31:36 Serverless is a weird term.

31:38 Yeah, I know it is.

31:40 Is it still spinning up dedicated VMs, running your code on that, and then having it going away, and you're just not thinking about it?

31:46 Or is it truly leveraging the serverless functionality as AWS would refer to it in its console?

31:53 Serverless always means there's a server under the hood.

31:57 But roughly the distinction is who has to worry about that server?

32:01 Yeah. Yeah. When we say serverless, we're still using EC2 instances. I mean, Amazon, when you're running things on Lambda, they're still using EC2 instances, but it's a, it's an abstraction where you don't have to worry about that. You get those EC2 instances as you need them, you get them quickly and you get them only for as long as you need them. Yeah. That makes a lot of sense. So

32:23 this is something that really kind of really blew my mind here when you all showed me is you put a decorator, @coiled.function, onto just a regular Python function. And you can express things like the machine type you need, how many of them you want, and so on. And it just fires it up seamlessly behind the scenes when that function is called and then it goes away, right? Yeah. So this is all

32:44 using the same underlying technology of what we joke about internally is that our core competency is turning VMs on and off. Once you have that technology, writing APIs around it is pretty cheap.

32:55 And so a common API is a Python decorator.

32:58 That's one of the various APIs we do.

32:59 And yeah, a common application is I'm running PyTorch code and I'm using it on my MacBook.

33:05 It's fine, but I like to use an NVIDIA CUDA GPU.

33:08 Cool, let's decorate that function with the GPU type that you want.

33:12 And it goes ahead and runs that.

33:13 And it runs locally in my environment.

33:15 I'm still typing on my MacBook Pro in cursor or whatever.

33:18 But that function now, what it does is it just like, it spins up VM, runs code, keeps VM around for a little while.

33:23 see if I want to run anything else on it again, and then spins it down. And that gives a lot of ability for the user to start experimenting with hardware. You can now run that function on any kind of GPU you want. And they do, they try lots of different things. We had a customer a long time ago who was, this is when like, when GPUs were at first very hard to get access to, and they would run through every region in their cloud, trying to find A100s. And it's super easy in Coiled to say like, great, I want to run this in region EU central one. Let's see if Frankfurt has any GPUs.

33:53 Nope, none there. Great. Let's try this in AP, Southwest, whatever. Let's see if Australia has any GPUs. And they would just, because it was now easy to play with things, it was easy to use the cloud. They started to experiment a lot more. That was really valuable. We often see people playing with ARM versus Intel versus AMD, playing with different GPU types.

34:11 Something that's interesting to me about data science, I mean, I, to some extent, come from the web world. And what you do is you just look at the list of instances, You pick one that looks boring.

34:22 You just use that.

34:23 It runs for a year.

34:25 You don't think about it.

34:26 And so much in data science, it actually makes sense to try out different instance types to explore, what's this GPU do for me?

34:36 Sometimes that's really helpful.

34:37 Sometimes it's not.

34:38 To move around to different regions.

34:40 So if you have a data set that's in one region, it makes an orders of magnitude difference how quickly you can download it if you are close to it.

34:49 than if you are far from it.

34:51 But even things like specifics of the CPU family, we, for fun, we get to run like benchmarks on different things.

35:00 And it's really nice because I just go in and like, oh, AWS came out with a new ARM instance type.

35:06 And I can go like change one line of code.

35:09 And now we run all our benchmarks on, as Matt was noticing, M8Gs.

35:13 I was surprised to see M8Gs.

35:15 Have you run the benchmarks yet now?

35:16 How are they doing?

35:17 Yeah, I don't know if we have on those yet.

35:19 I think, what was the difference between six and seven?

35:20 Do you remember?

35:20 All of these are, so these are the like Amazon designed ARM CPUs.

35:26 Some of those differences are the family of ARM.

35:30 It's like ARM v8 versus ARM v7.

35:34 Some of that actually really does make a difference for data science workloads because it has to do with those like wide instructions.

35:41 So they're like mini GPUs that they're able to do many things in parallel on the CPU.

35:47 Sometimes they put in better memory.

35:48 So I think some of these new instances have DDR5 instead of DDR4.

35:53 And it's like, does that make a difference for my workload?

35:56 Is it going to save money?

35:57 I can't tell you that a priori, but it's really easy to just try.

36:02 And sometimes you're like, oh, wow, that this hardware is like better or, oh, wow, this new hardware actually doesn't make any difference for what I do.

36:10 It seems to me like the infrastructure you all built makes experimenting way easier, right?

36:16 If it's really hard for you to set up a machine, let's come back to this in a minute, but I have a question first.

36:21 But I think a lot of times what people might do is, hey, instead of trying to do a lot of the scaling stuff, let's just set up one machine with 64 cores and a lot of memory and just set up a notebook server and let people, we'll just configure it and let people have that.

36:37 Yeah, yeah.

36:38 I do want to come back to that.

36:39 But while we're still on this ARM versus x86 thing, what's the story?

36:44 Do you all recommend ARM in the data center these days for this kind of stuff?

36:48 Or is it still x86?

36:50 What's the right choice?

36:51 Try it.

36:52 That's the short answer.

36:53 I mean, ARM is really nice sometimes.

36:56 AWS is, and a lot of the clouds are, like really into ARM.

37:00 So they're doing some really nice technology.

37:03 They're offering it at a good price.

37:05 It tends to run more power efficiently.

37:07 And power is like actually one of the major costs at a data center.

37:11 I would love to run ARM for my infrastructure.

37:14 But I'm concerned that there's not, there might be some native library or some web server that doesn't run ARM, is not built for ARM or doesn't work right on ARM.

37:26 A couple months in, like, now I want to add this one thing that would be really nice, but it doesn't work on ARM.

37:31 We've been doing this for a few years now.

37:33 It's a lot better, the support, than it was, I think, three or four years ago.

37:37 But also this is like part of, so you're talking about all of those tricks we do to get this seamless experience.

37:44 And part of that is people have Intel machines on their desk or they have ARM MacBooks.

37:52 They're running on Intel or AMD or ARM in the cloud.

37:57 And all of this is stuff we're like having to figure out, OK, so you have this software locally.

38:03 What does that mean in the cloud?

38:05 And it might be slightly different versions of things.

38:07 It might even be you don't have a GPU locally.

38:11 So you have the CPU version of PyTorch, but you actually want to use PyTorch on a GPU in the cloud.

38:18 That involves figuring out how to install the right versions of those different packages.

38:23 So in the screen share that no one can see, I've just switched my code to turn off ARM and switch from region US East 1 to US West 2.

38:31 And we're just bringing up a new cluster.

38:32 And if I had written some debug code in my code, that would also be updated.

38:36 You were talking earlier, Michael, about experimentation being something to think about.

38:39 I think that is one of the major differences between data workloads and web server workloads, is that the data world, like, usually there's eventually, yes, there's a production part where you're running some model inference server. Well, there's this long period where you're experimenting, and really optimizing that period is really critical. And Coiled is very much designed to accelerate that experimentation process. Even our choice to copy your environment rather than use Docker is highly informed by that. If you put in a Docker build, Docker push cycle into the data science work cycle, it just like it gums everything up. People end up not doing it. Coiled is smooth enough and easy to use enough that the cloud is now pleasant enough to actually include inside of the user dev cycle. And that's different. That's new. That's fun. And you can just play with stuff.

39:31 So like I've got now 300 machines that are all, nope, I got all the machines. I've got a bunch of Intel machines and a thousand ARM machines. What do you want to do with them, Michael?

39:40 We can run the same experiment right now inside this podcast. And that's, I think, the joy of this.

39:47 And the cost of all of this is like dollars. That last cluster cost me $1.39. This one is costing me

39:55 45 cents so far, 80 bucks an hour. That's actually way less than I expected.

40:00 Yeah, the cloud is both like way cheaper and way more expensive than I realized going in based on whether or not you're doing it correctly or doing it incorrectly.

40:09 There's like several orders of magnitude difference.

40:11 I know we want to talk about cost at some point, Michael.

40:13 Maybe it's fun to talk through some of the like the crazy stories.

40:16 I definitely want to hear some stories.

40:17 I'm always here for stories.

40:19 I first want to talk about the cost and the logistics of this one big shared Jupyter server sort of story versus this.

40:29 No one's going to have, you asked for two CPUs per machine.

40:31 I doubt anyone's going to have a 2,000 CPU single machine.

40:35 I know, I think they might exist, but they're very rare and they're very expensive.

40:41 So what does the cost look like and the challenges of some team that says, let's just set up one huge server and we'll just share it versus working like this?

40:52 I think the short answer is like low tens of thousands of dollars for like an always-on 100-core machine.

40:58 I'm like, I'm honestly bringing up ChatGPT right now to ask that question.

41:01 It's such a special time we live in.

41:03 Before the machines rise up and kill us, it's going to be, it's amazing.

41:06 ChatGPT is telling me about $30,000 if I don't do any optimization.

41:11 When we see this in practice, people turn it off on the weekends, sometimes at night, sometimes not.

41:17 But like tens of thousands of dollars is typical.

41:19 Yeah, it's expensive, but it's also, it's rarely what people actually want to use.

41:24 Because so much you're like, oh, I want to try.

41:26 Sometimes you're like, oh, I have an idea.

41:28 I want to try this experiment with like 10 different parameters.

41:31 I want to search over some things.

41:34 And you're like, okay, I got to do it.

41:36 That's going to take me 10 days now because I got to run it one at a time.

41:39 There's so much of like, I want to just try something that the one big machine doesn't let you easily do.

41:47 Sure.

41:47 And if someone else is trying that experiment, you either wait or you just go slow.

41:51 Right.

41:51 And then there's also the like, oops, I ran something that made it crash.

41:55 And then you're like, call in your DevOps person.

41:58 Can you restart the...

41:59 Or you just ran something that took way longer than you thought it would.

42:02 And you blocked it unintentionally, right?

42:05 For zero, little value, yeah.

42:06 I'll add some other things.

42:08 Or you wanted to use a GPU.

42:09 Or you wanted to now put that thing into production where it's running every day rather than you pressing a cell in Jupyter.

42:15 Or you wanted to use Cursor rather than use Jupyter.

42:18 There are many ways in which the big Jupyter server on the cloud technically satisfies the requirement of running code in the cloud, but there's just so much more that we do as data professionals. We run things in production. We run things on different hardware. We develop in new ways. We experiment. And the job description is a lot more variable than that single machine is able to satisfy. Look at cars on the road. They're not all Honda Accords. You've got semi-trucks, you've got bicycles, you've got pedestrians.

42:52 We live in a world where we actually need a lot of different kinds of things. It's that variety that's actually really a core part of the cloud.

42:59 That variety is something that we really care about.

43:00 It's very true to the data science ethos, right?

43:03 Like we're going to experiment, we're going to explore, we're going to play.

43:07 And if that becomes seamless, I mean, one of the problems is I want to experiment, but it's going to take seven hours on my local machine.

43:14 Could I ask that question and get an answer in five minutes if I'm willing to pay 10 bucks or my company's willing to pay 10 bucks or something, right?

43:21 Well, it's going to cost 10 bucks either way.

43:23 It's going to cost either 10 bucks of machine time over a week or 10 bucks of machine time over five minutes, just with a thousand machines. And so I think you used a great word there, Michael, which is play. I think a lot of why Python became popular is that it feels like play often. We're given these libraries that are both easy to use and powerful, and that feels like play. We get to go, oh, this is a cool squirt gun. I can press this button and water shoots. I can go shoot my friends. Isn't that fun? And if you look at using the Boto library with AWS, that does not feel like play.

43:54 If you go look at thinking about writing YAML and Kubernetes, that does not feel like play.

43:59 But like here today, we got to play with making 2000 VMs, half ARM, half Intel, half on the US East Coast, half on the US West Coast.

44:06 And we didn't do any work with them, but we could have.

44:08 And now suddenly the cloud is like play.

44:11 And you just do different things when things become playful.

44:13 You behave differently.

44:15 And that's really the fun thing.

44:17 Folks have fun.

44:18 And the cloud is a really fun tool to use once you get past all the pain.

44:22 I agree.

44:23 When you first hear about it, you're like, wow, what can I do with this?

44:25 But then you get into working with Boto and you just kind of want to stop.

44:29 Now, you probably spend more time with Boto than any of us.

44:32 Yeah.

44:32 You take one for the team, for all of us.

44:34 Thanks.

44:34 A lot of reading API docs so that you don't have to.

44:37 Two questions here.

44:38 One, we've started these clusters.

44:40 Are both still running?

44:41 Is one just still running?

44:42 if I do this in a notebook, how do I ensure that it does shut down, that I don't let it run for too long, like unnecessarily long, right? What's the workflow with that? So broadly, how do we

44:53 constrain costs? You know, to make sure that things are as low cost as possible. Yeah. So one already auto shut down. We weren't using it, Coiled saw we weren't using it, and it shut it down. We can bring it up again. It takes a minute. So it's easy to bring things back up and down. If you wanted it to stay up, there are keywords to control all that. But the default behavior is to shut things down pretty

45:10 aggressively. The other one is still up. We can go do things with it if you like. Could you set timeouts in the cluster when you create it? 100%. Idle timeouts. Yeah. Typically,

45:19 if people do want something that sits around for a while, they'll set an idle timeout and give it a name. And at the top of their script, it'll say, hey, I want to use my cluster named prod or whatever, and it has an idle timeout of an hour. Make sure that exists.

45:33 And if it doesn't exist, we'll bring it up. If it does exist, we'll connect you to it.

45:37 And that behavior works pretty well. That's cool. So you can do things like restart my kernel in notebooks and then just like reattach to it rather than, well, now I have 7,000 server clusters.

45:47 How'd that happen? That's the opposite of what you're preaching here.
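A minimal sketch of the "named cluster with an idle timeout" pattern described above. The keyword names follow Coiled's Cluster API as I understand it; treat the exact names and values as assumptions rather than verified API.

```python
import coiled
from dask.distributed import Client

# Reuse the cluster named "prod" if it already exists; otherwise create it.
# The idle timeout asks Coiled to shut it down after an hour of inactivity
# (keyword names are assumptions; check Coiled's docs for the exact spelling).
cluster = coiled.Cluster(
    name="prod",
    n_workers=100,
    idle_timeout="1 hour",
)

client = Client(cluster)  # attach Dask; restart your kernel and reattach by name later
```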

45:50 Sometimes, if you do want to change your code, like when I work on Dask and I need to put a print statement into some code because I want to see what's happening on my cluster, at that point I'll recreate a new cluster, because I need my code to be re-pushed up. But I can go to the logs and look at where all my print statements are, and they're there. Like, oh, that wasn't quite right. I'll change my print statement, make a new cluster, and one minute later things are up again. Again, that sort of minute-long thing is not perfect. I wish it was a second, but a minute does tend to be within the dev-cycle tolerance of a lot of humans.

46:22 I was just thinking, how could you sort of preload these types of things? Like at some point you could just have a thousand machines hanging around, like a pool you hand out. But as we've been going through, you have all these variations.

46:35 I want an ARM one.

46:36 I want an x86.

46:37 I want one with the GPU.

46:38 I want one with this GPU.

46:39 I want it in that region.

46:41 That makes it really hard to completely just have a whole bunch, like a fleet of them ready to just hand out.

46:46 So much of what we try to do is strike that balance between speed and flexibility and low cost.

46:52 So the more you keep things sitting around, the more you're paying for them.

46:55 I think if you can get it faster than someone can go get a cup of coffee, that seems to be a much better experience.

47:03 That seems fair.

47:03 And also if you can get it faster than your experiment is going to take to run.

47:07 You would not use Coiled to back a web API endpoint that has to get back to a human in response time.

47:14 You should go use Lambda, you should use Modal, you should go use something else.

47:16 You should use Coiled when you're using enough hardware that keeping it running all the time would be prohibitively expensive.

47:20 Like you actually don't want to have a pool of a thousand machines sitting around just in case someone wants them.

47:25 That doesn't make sense.

47:26 You should use Coiled when you have these other larger things to do.

47:30 And again, I think that sounded like I got back into pitch mode.

47:33 The point I wanted to talk about here isn't "use Coiled."

47:35 It's the cloud actually has these capabilities.

47:38 You can get a thousand VMs anywhere in the world of any hardware type you like for dollars.

47:44 And that's actually an incredible capability if you can do it right.

47:48 I think today the zeitgeist is go use Kubernetes.

47:52 And we just think that's like, that's dead wrong.

47:54 The answer is just go use raw VMs.

47:56 They're actually pretty good if you do a few things around them: if you figure out software environments, if you figure out how to batch requests, if you sort out logs, if you sort out low cost.

48:05 But this is, I think, actually the right foundation.

48:09 Like the raw VM is maybe the right foundation for a lot of data work.

48:14 How about: just pick a library and infrastructure that's exceptionally tuned to "I need to start them as fast as possible, run one job, and shut them back down."

48:24 And I don't think serverless in the AWS sense is really going to be the way, because serverless gets expensive with too much compute. Just ask Cara, right? That's the $96,000 Vercel bill. Yeah. Serverless,

48:36 Lambda, and similar technologies typically have like a four to five X premium on cost.

48:41 There are also limitations. Like you can't get big machines. You can't get GPUs. Your software

48:45 environments have to be of a certain size. Max timeouts, right? So lots of stuff you can't do.

48:50 Yeah. Mostly we see people who like want to run their Polars job, not with 16 cores, but with 64 cores, like Lambda isn't going to cut it.

48:58 And so you just like, you want the full flexibility of the cloud.
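As a sketch of that "64-core Polars job" case, something like the decorator below pushes the function onto a single large VM instead of Lambda. The coiled.function arguments, column names, and S3 path are illustrative assumptions, not taken from the episode.

```python
import coiled
import polars as pl

# Hypothetical sketch: run a Polars aggregation on one 64-core cloud VM.
# Sizes, region, and the dataset path are assumptions for illustration.
@coiled.function(cpu=64, memory="256 GiB", region="us-east-1")
def total_by_customer(path: str) -> pl.DataFrame:
    return (
        pl.scan_parquet(path)              # lazy scan of Parquet data
        .group_by("customer_id")
        .agg(pl.col("amount").sum())
        .collect()
    )

# result = total_by_customer("s3://my-bucket/transactions/*.parquet")  # hypothetical path
```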

49:01 Absolutely.

49:02 What's a typical size?

49:04 I mean, you created a thousand, you guys, and that's super impressive.

49:08 But is that common?

49:09 I mean, we see all sorts of things.

49:10 So this is like, you give people a flexible tool.

49:12 It turns out they use it in so many different ways.

49:15 We were actually a little bit surprised to find how many people were doing things on one VM.

49:23 We had a whole bunch of users, and it sounds less exciting, but who would just have individual scripts that they needed to run on a cloud VM. And this is the easiest way to do that whenever you need those benefits but maybe don't fit the mold. Here I can put a decorator on a function, and this function runs on a big machine in the cloud, right next door to the storage in S3 where the Parquet file lives, or something like that, right?

49:52 It has more memory than, I don't have enough memory, but this one has enough memory or whatever, right?

49:57 And that's actually pretty neat.
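The same decorator idea covers the "more memory, next door to the data" case just described. Again, the region, memory size, and bucket path below are illustrative assumptions.

```python
import coiled
import pandas as pd

# Run in the same region as the bucket, on a machine with more memory than a laptop.
# Region, memory size, and the path are assumptions for illustration.
@coiled.function(memory="128 GiB", region="us-west-2")
def summarize(path: str) -> pd.DataFrame:
    df = pd.read_parquet(path)   # the read happens in-region, next to S3, not over a home connection
    return df.describe()

# summary = summarize("s3://example-bucket/big-file.parquet")  # hypothetical path
```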

49:58 And then we have other users who are making multi-thousand node clusters.

50:01 And that's a whole different set of challenges.

50:04 At that scale, you start actually hitting cloud capacity limits.

50:09 And you have to do tricks like using multiple availability zones in AWS, basically multiple data centers.

50:16 Can one of these clusters span availability zones or regions even?

50:21 We don't span regions with a single cluster today.

50:24 We haven't found people needing that scale.

50:27 And that opens yet a new set of challenges.

50:30 But yeah, we very commonly do multi-availability zone clusters.

50:35 This is, Matt's pulling up some docs and examples.

50:39 But something this actually pairs really nicely with is big Spot clusters. Spot can be really cheap, but there also isn't as much availability of it, because it's so cheap.

50:51 So this works really nicely. I'll say, sorry, my life is all about trade-offs and edge cases, so all of these things have gotchas; you've got to be careful not to do them in the wrong way. With availability zones, one gotcha is that you pay for moving data between availability zones.

51:11 A simple read of a Parquet file that happens to go across AZs, and you do that a bunch of times, all of a sudden, across a thousand machines, back and forth, and, whoa, there's your surprise.

51:20 Well, it's tricky, actually, because read_parquet is fine, because the Parquet is probably on S3, and S3 crosses all of the availability zones in a region. But if one machine loads data in one availability zone, then transfers it to another machine in another availability zone, then you pay.
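To put rough numbers on "then you pay": AWS charges on the order of $0.01/GB in each direction for inter-AZ traffic, so roughly $0.02/GB total. Treat that rate as an assumption and check current pricing, but the shape of the surprise is clear.

```python
# Back-of-the-envelope inter-AZ transfer cost for a shuffled dataset.
shuffled_tb = 10          # data moved between workers sitting in different AZs
rate_per_gb = 0.02        # ~$0.01/GB out + ~$0.01/GB in (assumption; check current pricing)
cost = shuffled_tb * 1024 * rate_per_gb

print(f"~${cost:,.0f} just for cross-AZ data movement")   # ~$205 for 10 TB
```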

51:34 And people don't know that. Things like needing to shuffle or sort your data set,

51:39 But also just doing naive things, like pulling all of your data back through one machine.

51:45 And so part of what we do is like make it easy to control that.

51:49 It's like, again, one parameter.

51:51 You can say, I do want multiple AZs.

51:54 I don't want multiple AZs.

51:55 But then giving people visibility. You can look at network metrics to see, oh, I think my workload is embarrassingly parallel.

52:03 So this is fine.

52:04 let's take a look at metrics and see, oh yeah, this workload makes sense to span AZs, get more spot, save more money that way.

52:12 This workload, we want to keep this in a single AZ.

52:16 We're going to let AWS pick it so that they pick the one that's cheapest and has the best availability.

52:22 And that's all stuff that we like automatically do.

52:24 I want to double down on that for a second.

52:26 I think spot is a good example of like, this is one of a hundred things you have to do well.

52:31 I think a lot of what we found is that using the cloud well isn't about doing one big thing right.

52:36 It's not like one magical thing that's done.

52:39 There's a long tail of a lot of small things to get right.

52:42 There's a lot of nuance and a lot of polish.

52:44 Spot's a good example.

52:45 So we were running benchmarks internally, and they were costing us some money, like our own Dask benchmarks back in the Dask-focused days.

52:52 And it was like, great, we'll switch to Spot.

52:54 Easy to do.

52:55 And the engineers hated it.

52:56 The Dask engineers hated it because all the benchmarks became very variable.

53:00 Because they asked for 100 machines, but they got 70.

53:03 Or machines would go away.

53:05 And so we did a few things.

53:07 One is what I said before, the sort of falling back to on-demand.

53:11 If you only have 70 machines, it's okay.

53:12 Give me 70 spot and I'll pay full price for 30.

53:15 That's a feature people really like.
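A sketch of that "give me Spot, and fill the rest with on-demand" behavior. The spot_policy keyword and value are my best understanding of Coiled's Cluster API, so treat them as assumptions.

```python
import coiled

# Ask for 100 workers on Spot; if Spot capacity runs short, fall back to
# on-demand instances for the remainder. The spot_policy value is an assumption;
# check Coiled's docs for the exact keyword.
cluster = coiled.Cluster(
    n_workers=100,
    spot_policy="spot_with_fallback",
)
```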

53:17 Another one is availability zones, right?

53:19 So at any given hour of the day, a different data center in US East 1 has more or less spot availability or GPU availability.

53:29 And so we will look at all of them.

53:31 We'll say, okay, this is the data center, this is the availability zone that's got the most availability right now.

53:36 And so we'll go to that one and we'll pull from there instead.

53:39 There are things like that that are known to cloud experts but are just not something your average data person knows to think about.

53:46 And those are the kinds of things that you should abstract away and that we do.

53:49 And there's a hundred similar things.

53:51 And so again, if you want to get GPUs, you've got to make sure you're looking across the region.

53:56 But if those GPUs are going to talk to each other, you got to make sure you're looking across the region and then focus on exactly one AZ and not go across.

54:03 So there's a lot of sort of this interesting, sort of again, nuance to doing this stuff well.

54:08 Right, if the machines become chatty with each other, you want them all next to each other.

54:12 It's a thousand X more expensive, and I'm saying a thousand X without hyperbole.

54:16 Like, you can process data

54:20 a thousand times cheaper on a machine than you can transfer it between machines.

54:23 Compute tends to be like a fairly predictable part of the cost.

54:27 It's all of these other things that you don't even think about. Like, oh, if I flip this setting, now I'm hitting this S3 API a lot.

54:37 And it turns out you pay per API call.

54:39 I didn't know that until that was $1,000.

54:43 There was an XGBoost debug log example.

54:46 Do you want to run through that?

54:48 Logs is another thing that like most of the time that's effectively zero money.

54:53 But then we do things at the scale where like someone was running a thousand node cluster, I think needed to, they had something that wasn't working well.

55:01 So they turned on debug level logging and it gave very chatty logs.

55:07 And I think it was like a $15,000 bill.

55:10 And that's the sort of thing that like, you just, it's not even in your mind as a possibility until you see that, that, that had a happy ending.

55:19 Cause we talked to AWS and they ended up eating that cost for the customer, but.

55:23 Also a good lesson.

55:24 If you talk to AWS, they'll give you money back.

55:26 Yeah.

55:27 Just don't keep crossing those boundaries too many times. Yeah. I mean, right. And part of that was,

55:32 what controls are you putting in place so this doesn't happen again? And so, yeah, we do some things. Like now, if we see you have very chatty logs, we warn you: hey, you might want

55:43 to turn this down. Yeah, absolutely. So when I create a Coiled cluster locally and it's going to go do all the magic that we've been talking about, what is the payment workflow? How is it distributed? Do I have an AWS account that I register my card with, and then Coiled uses that account and there's some kind of fee to Coiled, but mostly I pay directly to AWS? Or do I pay you all, and then you all handle it?

56:08 Like, what does this look like?

56:09 We do have a kind of trial thing that will run in our account, but primarily what we provide and what people want is running compute in their cloud account.

56:21 And we do that in part because it's simpler, but we also do it because a lot of people have their own data in their own account, or they have security requirements or special networking needs.

56:34 Maybe special arrangements with pricing if they're a big customer, right? Something like that.

56:39 Sure. Yeah. So if they have contracts with AWS, they will just use whatever that discount is.

56:46 The flip side of that is people might have AWS accounts, or we have plenty of people who just see that Coiled uses AWS and they sign up for a new one, and they don't know how to go in and set that up. So something that is actually kind of cool that we do, that isn't really part of running clusters but is a necessary thing, is make that setup really easy. Part of what we're doing is using best practices around how to manage those credentials and how to set up the networking resources in ways that don't have a standing cost but are secure. There are all sorts of things that, if you're a data scientist, you don't want to have to think about, like NAT gateways. And we think about that stuff. NAT gateways are one of the famous ways to spend a lot of money on AWS. So we'll set up the network in a way that is secure but doesn't use NAT gateways, and we do that automatically. We have a web UI for that. And, I really like it because I worked on this, we have this lovely CLI tool that does that setup for you, with a lot of rich widgets. So, yeah, we're trying to make that whole thing work because, in my mind, it's very important to make the easy case easy but make the hard case possible. If you don't know all of this cloud stuff and you just want us to give you sensible defaults, we'll do that. If you are a data engineer whose company has given you a whole bunch of requirements for how the network is configured, we can support that. I think a big design

58:29 consideration we're making, Nat and I actually collaborated a bunch and fought a bunch on the setup process. It's like the setup process is something that we care very deeply about.

58:37 We would go to conferences and just sit down with run-of-the-mill new Python data developers and say, cool, can you set this up? And we would have them do it. And they had no idea how their cloud account worked, but at some point someone at the company had given them an AWS credentials file, and all of that will work. You can say pip install coiled, coiled setup, and just press enter a few times, and we will set things up for you in a way that is sensible.

59:01 I think we make the cloud accessible to people who don't really know how the cloud works that well.

59:07 So if you're thinking, oh, I'm a person who knows pandas and NumPy and scikit-learn, and I happen to have this cloud account, you should try out Coiled.

59:12 Coiled is actually designed for you.

59:14 It's not designed for your IT department.

59:16 But as Nat said, Nat talks to lots of IT departments.

59:18 They love us too.

59:19 But the UX around that is especially smooth.

59:23 It looks really great.

59:24 All right, guys, we are pretty much out of time.

59:27 Final thoughts for folks out there or they get to the end of the show, maybe they want to try Coiled, maybe they want to try their own crack at something like this for their team.

59:41 The standard call to action is: go to coiled.io, it's easy to use, have a good time.

59:41 I think more broadly, the thing I want to say is the cloud provides a promise that is great for us, but isn't actually delivered that well.

59:51 And people shouldn't accept or tolerate kind-of-shitty data platforms.

59:57 This can be a delightful and a very powerful tool for the data space.

01:00:02 And if it's not, maybe don't use Coiled, but use something, and have high expectations.

01:00:07 We should have a degree of taste and a degree of standards.

01:00:12 And we can meet that standard.

01:00:13 I think there's actually a lot that we can do here.

01:00:15 There's a lot of potential that's really exciting.

01:00:18 Oh, that sounds right to me.

01:00:19 I'll say, I think this message of, like, it is okay to be unhappy and things are supposed to be delightful is important to us.

01:00:29 I spend a lot of time being unhappy, hopefully, so that other people will be able to have delightful experiences.

01:00:35 Yeah, absolutely.

01:00:36 Absolutely.

01:00:37 It should be delightful.

01:00:38 It sounded delightful when it was created, and it's gotten really complex, but it doesn't have to be.

01:00:43 I guess maybe a lesson I learned from this is use tools optimized for data science workloads.

01:00:48 Don't use tools optimized for long-running web apps and other things like that you hear about all

01:00:53 the time, but they're not for you necessarily. Yeah. And it can be a delightful experience.

01:00:57 I like that term. We should all be playing, so come by and play. Or again, if you don't use Coiled, that's fine. There are other ways to do things. Go play. All right. Well, we will

01:01:06 call it a wrap on the show and people can go play. Guys, thanks for being on the show. It's been really interesting. Cool. And congrats on such a cool company, but also service. This is

01:01:15 really neat. Thanks, Michael. Thanks for having us. Yeah, you bet. Bye.

01:01:18 This has been another episode of Talk Python To Me.

01:01:22 Thank you to our sponsors.

01:01:23 Be sure to check out what they're offering.

01:01:24 It really helps support the show.

01:01:27 This episode is brought to you by Sentry.

01:01:29 Don't let those errors go unnoticed.

01:01:30 Use Sentry like we do here at Talk Python.

01:01:32 Sign up at talkpython.fm/sentry.

01:01:36 Want to level up your Python?

01:01:37 We have one of the largest catalogs of Python video courses over at Talk Python.

01:01:41 Our content ranges from true beginners to deeply advanced topics like memory and async.

01:01:46 And best of all, there's not a subscription in sight.

01:01:49 Check it out for yourself at training.talkpython.fm.

01:01:52 Be sure to subscribe to the show, open your favorite podcast app, and search for Python.

01:01:57 We should be right at the top.

01:01:58 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

01:02:08 We're live streaming most of our recordings these days.

01:02:11 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:02:19 This is your host, Michael Kennedy.

01:02:21 Thanks so much for listening.

01:02:22 I really appreciate it.

01:02:23 Now get out there and write some Python code.

