WEBVTT

00:00:00.020 --> 00:00:04.360
Today on Talk Python, what really happens when your data work outgrows your laptop?

00:00:04.880 --> 00:00:14.540
Matthew Rocklin, creator of Dask and co-founder of Coiled, and Nat Tabris, a staff software engineer at Coiled, join me to unpack the messy truth of cloud-scale Python.

00:00:15.380 --> 00:00:21.280
During the episode, we actually spin up a 1,000-core EC2 cluster from a notebook, twice.

00:00:22.260 --> 00:00:32.660
We also discuss picking between pandas and Polars, when GPUs help, and how to avoid surprise cloud bills. Real lessons, real trade-offs, shared by people who have built this stuff.

00:00:33.220 --> 00:00:39.700
Stick around. This is Talk Python To Me, episode 519, recorded August 26th, 2025.

00:00:54.780 --> 00:01:00.360
Welcome to Talk Python To Me, a weekly podcast on Python. This is your host, Michael Kennedy.

00:01:00.700 --> 00:01:33.480
Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython, both accounts over at fosstodon.org, and keep up with the show and listen to over nine years of episodes at talkpython.fm. If you want to be part of our live episodes, you can find the live streams over on YouTube. Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified about upcoming shows. This episode is brought to you by Sentry. Don't let those errors go unnoticed. Use Sentry like we do here at Talk Python. Sign up at talkpython.fm/sentry.

00:01:33.880 --> 00:01:36.400
Matthew, Nat, welcome to the show. Awesome to have you here.

00:01:36.560 --> 00:01:37.620
Hey, Michael. Good to be here.

00:01:37.840 --> 00:01:37.940
Thanks.

00:01:38.120 --> 00:01:40.680
It's been a while, Matt, since you've been on the show.

00:01:41.020 --> 00:01:45.020
Every year or two, we got to chat. You and I chat at conferences sometimes too. It's always good to see you.

00:01:45.120 --> 00:02:04.480
Absolutely. Yeah, it's always really good to see you as well. And I think it's high time for an update. And we talked a little bit a couple of days ago about what the two of you are up to at Coiled. And wow, is it quite interesting, the things that you've been doing. It's really come a long ways and quite a bit more ambitious in terms of running data science in the cloud. So

00:02:04.820 --> 00:02:19.940
it's going to be a fun conversation with you two. I look forward to it. I also brought Nat Tabris as a colleague here. I've stopped doing some engineering work these days. I used to be really smart and now I mostly do engineering manager-y things or CEO things. So I've brought my friendly sidekick to actually explain real stuff.

00:02:20.180 --> 00:02:20.820
Yeah, wonderful.

00:02:20.930 --> 00:02:26.740
I always love having a couple of people on the show get a little extra riffing off of the ideas and so on.

00:02:26.900 --> 00:02:28.120
So great to have you here, Nat.

00:02:28.500 --> 00:02:29.020
Welcome to the show.

00:02:29.160 --> 00:02:29.540
Thanks.

00:02:29.800 --> 00:02:37.480
Before we dive into cloud computing for data scientists and all the lessons you all have learned, let's just do a quick catch up.

00:02:37.740 --> 00:02:42.880
Matthew, you've been on the show a couple of times, but not everyone has listened to every episode.

00:02:43.440 --> 00:03:05.660
In fact, an interesting stat that just came out of the recent PSF and JetBrains developer survey results is 50% of the people that do Python have only been doing it for less than two years professionally. So there's a lot of new people in our industry. And 50% of Python these days is data science, which I think is a shift as well. That's the beauty of Python,

00:03:05.900 --> 00:03:31.340
right? It's enough of a programming language that like serious computer science people can get excited about it and is accessible enough that everybody can use it. That's why it's become really popular, right? Compare that to maybe C++, which is like very computer science focused, or maybe like MATLAB, which is like very user focused. Python can kind of bridge those two. And that's what's always given it, I think, its special status in the world. I agree. And I think people,

00:03:31.500 --> 00:03:44.100
once they get started, they have lots of reasons to stick. All right. All of that is a long-winded way of saying there's probably a lot of people who have not heard about you before. So give us the quick introduction on yourself, Matthew first, and then Nat. My name is Matthew Rocklin.

00:03:44.440 --> 00:03:49.260
I am a long-term open source contributor to the Python space, particularly sort of the data science part of Python.

00:03:49.600 --> 00:03:53.240
That started like many years ago with projects like Toolz, Multiple Dispatch, SymPy.

00:03:53.720 --> 00:03:58.360
And then maybe 10, 12 years ago, I started a project called Dask for parallel computing.

00:03:58.880 --> 00:04:00.500
And around Dask were lots of other projects.

00:04:01.140 --> 00:04:08.940
For the last decade, I've mostly focused on making it easy for Python developers to solve big problems with lots of computers and lots of hardware.

00:04:09.460 --> 00:04:12.700
Five-ish years ago, I started a company around that called Coiled, and now we do other things.

00:04:13.180 --> 00:04:21.299
But I've sort of been in that space in between Python developers and lots of hardware and trying to make that as easy as possible.

00:04:21.500 --> 00:04:21.640
Yeah.

00:04:21.799 --> 00:04:24.920
First, you built all the packages for people that do cool data science stuff.

00:04:25.000 --> 00:04:28.780
Now you're building the infrastructure and clicking that together.

00:04:28.980 --> 00:04:29.940
That was the next hardest problem.

00:04:30.180 --> 00:04:30.280
Indeed.

00:04:30.960 --> 00:04:31.240
All right.

00:04:31.580 --> 00:04:32.340
Nat, welcome.

00:04:32.600 --> 00:04:32.700
Yeah.

00:04:32.900 --> 00:04:34.120
So I'm Nat Tabris.

00:04:34.300 --> 00:04:36.080
I'm a software engineer at Coiled.

00:04:36.300 --> 00:04:38.660
Been here close to four years now.

00:04:39.080 --> 00:04:47.820
And my background is both some, like I've done some research software engineer stuff, helping people use Python well, helping people use Dask well sometimes.

00:04:48.880 --> 00:04:51.980
And then also some sort of cloud SRE stuff.

00:04:52.600 --> 00:05:01.320
So Coiled is a lot of fun because we get to do the Python stuff and we get to do the cloud stuff and we get to help other people do the Python stuff on the cloud.

00:05:01.560 --> 00:05:28.680
I think that's actually one of the interesting, non-obvious things of building data science tools and infrastructure for data scientists. Take Jupyter as an example, right? A lot of the work on Jupyter is JavaScript-y type of stuff. But the purpose is to make it so Python people can do more Python without thinking about that kind of stuff, right? And you all probably have this sort of DevOps angle of that same thing going on there. Like you do a lot of DevOps,

00:05:28.860 --> 00:05:46.940
so many people don't have to do much DevOps at all. Yeah. And most users don't have that capability, right? We need to make tools that live in a space, but that don't require deep expertise in that space. That is, again, where I think Python brings a lot of power, sort of the old like XKCD import antigravity comic, if you remember that.

00:05:47.020 --> 00:05:56.680
I'm flying. How are you flying? Well, I just typed import antigravity. You know what was really funny when I was getting into Python was first learning that you could actually type that in

00:05:56.680 --> 00:05:58.720
the interpreter. It wasn't just a comic. I didn't know that.

00:05:58.830 --> 00:06:03.100
Yeah. If you open a Python REPL and you type import antigravity, something happens.

00:06:04.260 --> 00:06:38.880
It has to be done, right? It has to be done. So pretty amazing. Let's start with talking about sort of the evolution of Coiled. So when we first started talking about Coiled, it's been a while, three, four years, I feel like, maybe the first time you and I were on the show together to talk about it. Anyway, it was effectively Dask as a service, right? So Dask is a way to do kind of grid scale-out computing with pandas and that type of work. You'd point it at a cluster and it would kind of spin up some machines and focus on, like, how do I execute

00:06:39.400 --> 00:06:56.660
pandas-like work on a cluster? Dask did lots of things that weren't pandas. Pandas users are like 30 percent of Dask users. Dask is a much more general-purpose project than that. You're 100 percent correct that the pain we felt in the Dask community at a certain point wasn't to make Dask better. It was to make Dask easier to deploy, especially in the cloud.

00:06:57.100 --> 00:06:58.480
A lot of people were in the cloud those days.

00:06:59.420 --> 00:07:04.780
And it was just like a pain in the butt to bring up all the machines, set up everything correctly to manage Docker things.

00:07:05.240 --> 00:07:06.840
And so we made a company around that.

00:07:07.220 --> 00:07:14.320
The weird thing that happened, so in order to run Dask in the cloud, we had to figure out how to run Python effectively in the cloud.

00:07:14.760 --> 00:07:17.860
And so we made lots of interesting technology to do that.

00:07:18.200 --> 00:07:23.940
Though the surprising thing that happened is that a lot of our customers started using Coiled not for Dask things.

00:07:24.060 --> 00:07:25.880
They would use Coiled to spin up a Dask cluster.

00:07:26.420 --> 00:07:30.320
They would throw away the Dask cluster and just use the machines that Coiled had brought up for them.

00:07:30.780 --> 00:07:38.580
It turns out that making Python run well at scale in the cloud is much more generally applicable than Dask in particular.

00:07:39.740 --> 00:07:46.100
Our customer base has shifted over time from being very Dask-heavy to being more general-computing-heavy.

00:07:46.280 --> 00:07:47.180
That makes tons of sense.

00:07:48.500 --> 00:07:58.060
It's fine to spin up a bunch of Dask clusters, but really, that was one specialization of, I just need a bunch of computers to run my data science workload.

00:07:58.320 --> 00:08:01.960
When the engineers like Nat came to me and said, hey, look, we should do this thing.

00:08:02.160 --> 00:08:04.220
At first, I was like, that doesn't make any sense.

00:08:04.440 --> 00:08:10.940
The cloud must offer the, just run this thing, just bring up a machine, run some code, turn off the machine.

00:08:11.300 --> 00:08:12.460
Obviously, the clouds must do that.

00:08:12.560 --> 00:08:14.100
They were like, no, they don't do that.

00:08:14.180 --> 00:08:14.960
Or they don't do that well.

00:08:15.320 --> 00:08:18.260
There's some APIs to do that, but they're really inaccessible.

00:08:18.660 --> 00:08:26.420
If you want, go to ChatGPT and ask it to give you copy-pastable commands to turn on 100 VMs and run Hello World and turn them off.

00:08:27.100 --> 00:08:29.080
And it'll type at you for a couple of minutes.

00:08:29.440 --> 00:08:36.039
And it's not the kind of typing that most data science people who have used Python for a couple of years can do.

00:08:36.479 --> 00:08:37.500
It's actually pretty inaccessible.

00:08:37.880 --> 00:08:43.099
I was actually quite shocked at how hard this relatively commonplace thing was to do.

00:08:43.219 --> 00:08:54.300
A lot of what people do with data science, but also a lot of the courses, the tutorials, the libraries, they all lead data scientists away from developing those skills as well, right?

00:08:54.530 --> 00:09:00.380
They don't necessarily encourage you to start using Docker a lot, to start writing raw Linux commands.

00:09:00.940 --> 00:09:02.060
How can I make this work?

00:09:02.440 --> 00:09:11.820
It's not that they don't, but I think coming into it, like talking about those beginners, like the first two years of their job, they're still working on how do I do data science libraries right?

00:09:11.890 --> 00:09:13.320
It begs the question, should they?

00:09:13.650 --> 00:09:17.180
And my answer is maybe we shouldn't solve this by educating people.

00:09:17.500 --> 00:09:19.020
maybe we should solve it by building better tooling.

00:09:19.620 --> 00:09:23.800
Like I actually don't, like Docker is a great technology, but not necessarily for data science.

00:09:24.040 --> 00:09:30.380
Like Docker is very much specialized to provide like a really stable system that can run for decades.

00:09:31.280 --> 00:09:33.320
But like we want a system that can change every five minutes.

00:09:34.040 --> 00:09:41.700
Like the choices that tools like Docker, Kubernetes or Terraform make are actually quite different than the choices you would make.

00:09:41.780 --> 00:09:50.680
I think, if you were building sort of middleware for this audience. Like, the cloud gives you all the things that you would want.

00:09:51.140 --> 00:09:54.340
It gives you this sort of fully flexible system, you can get any kind of hardware you want.

00:09:54.820 --> 00:09:55.780
It's infinitely scalable.

00:09:56.570 --> 00:09:58.540
It goes away when you stop using it.

00:09:58.600 --> 00:09:59.380
It's like very ephemeral.

00:09:59.800 --> 00:10:02.780
You pay only for what you use, but it's like pretty unusable.

00:10:03.380 --> 00:10:05.200
Like it's designed for cloud infrastructure engineers.

00:10:05.820 --> 00:10:09.800
We've built middleware on top of that, but that middleware is like not designed for our use cases.

00:10:10.220 --> 00:10:15.480
Yeah, there are a bunch of tools like Pulumi and others that'll spin up machines, but they're pretty different.

00:10:15.980 --> 00:10:18.500
And you hinted at it a little bit there.

00:10:19.140 --> 00:10:30.200
Much of the cloud infrastructure and APIs, just the way that it's meant to work is it's for a little bit longer lived systems and it's more focused on web API development.

00:10:30.960 --> 00:10:41.760
How do I maybe take an API that's running on four machines, scale it up to eight through auto scaling, not how do I get a thousand machines now for three minutes and then turn them off?

00:10:42.080 --> 00:10:42.720
That's different, right?

00:10:42.980 --> 00:10:43.100
Yeah.

00:10:43.380 --> 00:10:48.380
I think a lot of the tooling that exists isn't really with our community in mind.

00:10:48.750 --> 00:10:55.700
And so when we built Coiled to run these Dask clusters, we looked around at other software and we couldn't find something that actually fit our needs.

00:10:56.300 --> 00:10:58.740
And as a result, we went and we committed the cardinal sin.

00:10:59.110 --> 00:11:00.040
We rolled our own.

00:11:00.360 --> 00:11:05.800
And my hope, in talking to folks today, is that, well, we made some opinionated choices in doing that.

00:11:05.950 --> 00:11:09.000
I think we've actually come up with some interesting things that, like, use Coiled or don't use Coiled.

00:11:09.080 --> 00:11:11.360
I think some of the choices we made are actually pretty interesting.

00:11:11.880 --> 00:11:23.940
Some of the things we ran into in constructing one of these new frameworks, in service of a sort of highly bursty, highly flexible system, there's some interesting engineering choices in there.

00:11:24.120 --> 00:11:29.620
There's some interesting experiences using the cloud at that scale, which I think people aren't as familiar with.

00:11:29.760 --> 00:11:32.800
I've seen some of the stuff that you've done, and it's really neat.

00:11:32.940 --> 00:11:41.860
It's not just hooking into this infrastructure, but it's down into programming idioms and concepts that make it almost transparent what's happening.

00:11:42.180 --> 00:11:49.700
In building abstractions, you have to think a lot about what kind of things you abstract away from the user and what kind of things you give to them directly.

00:11:50.120 --> 00:11:54.040
So for our users, we find that they really care about what kind of machine they run on.

00:11:54.300 --> 00:12:04.840
They like want to specify like the exact VM type sometimes because they want to have an SSD and an A10 GPU and they want to be in this particular region where they're available.

00:12:05.200 --> 00:12:06.360
They like have a lot of opinions about that

00:12:06.640 --> 00:12:08.019
and they have like zero opinions

00:12:08.240 --> 00:12:11.920
about their networking setup or the security things other than please make it secure.

00:12:12.460 --> 00:12:14.640
And I think in shaping abstractions,

00:12:14.840 --> 00:12:23.960
one makes some choices and it's interesting to sort of figure out which choices to make and how to build something that gives that set of choices to a user.

00:12:25.720 --> 00:12:29.060
This portion of Talk Python To Me is brought to you by Sentry's Seer.

00:12:29.760 --> 00:12:32.560
I'm excited to share a new tool from Sentry, Seer.

00:12:33.080 --> 00:12:40.400
Seer is your AI-driven pair programmer that finds, diagnoses, and fixes code issues in your Python app faster than ever.

00:12:40.920 --> 00:12:44.600
If you're already using Sentry, you are already using Sentry, right?

00:12:45.120 --> 00:12:49.400
Then using Seer is as simple as enabling a feature on your already existing project.

00:12:50.180 --> 00:12:53.540
Seer taps into all the rich context Sentry has about an error.

00:12:54.040 --> 00:12:58.080
Stack traces, logs, commit history, performance data, essentially everything.

00:12:58.720 --> 00:13:02.940
Then it employs its agentic AI code capabilities to figure out what is wrong.

00:13:03.400 --> 00:13:07.200
It's like having a senior developer pair programming with you on bug fixes.

00:13:07.960 --> 00:13:13.360
Seer then proposes a solution, generating a patch for your code and even opening a GitHub pull request.

00:13:13.860 --> 00:13:18.480
This leaves the developers in charge because it's up to them to actually approve the PR.

00:13:18.900 --> 00:13:22.620
But it can reduce the time from error detection to fix dramatically.

00:13:23.380 --> 00:13:29.060
Developers who've tried it found it can fix errors in one shot that would have taken them hours to debug.

00:13:29.660 --> 00:13:34.080
Seer boasts a 94.5% accuracy in identifying root causes.

00:13:34.740 --> 00:13:41.420
Seer also prioritizes actionable issues with an actionability score, so you know what to fix first.

00:13:41.930 --> 00:13:48.940
This transforms Sentry errors into actionable fixes, turning a pile of error reports into an ordered to-do list.

00:13:49.580 --> 00:13:58.660
If you could use an always-on-call AI agent to help track down errors and propose fixes before you even have time to read the notification, check out Sentry's Seer.

00:13:59.320 --> 00:14:03.320
Just visit talkpython.fm/SEER, S-E-E-R.

00:14:03.940 --> 00:14:05.780
The link is in your podcast player show notes.

00:14:06.300 --> 00:14:08.620
Be sure to use our code, TALKPYTHON.

00:14:09.080 --> 00:14:09.860
One word, all caps.

00:14:10.540 --> 00:14:12.600
Thank you to Sentry for supporting Talk Python To Me.

00:14:13.580 --> 00:14:14.560
What about cost?

00:14:14.760 --> 00:14:25.200
I mean, if I were to be doing this myself, if I were to spin up, hey, I need 500 machines for 10 minutes, I'd be certainly worried that what if I didn't turn them all off?

00:14:26.139 --> 00:14:28.460
That's a catastrophically bad sort of thing.

00:14:29.020 --> 00:14:39.220
It's one thing, oh, yeah, okay, I left a GPU-enabled machine on for a day, and that wasn't pretty, but it's a whole other thing to leave large, significant numbers of machines on.

00:14:39.480 --> 00:14:49.520
Let's do that. Let's put up a bunch of machines. It might be fun. But Nat, first, do you want to say anything about costs? I mean, I think the cost story isn't just leaving things on. It's more complicated than that. It's like the cloud is really great and really cheap.

00:14:49.840 --> 00:15:08.580
You can do that if you do it right at pennies or dollars, but there's all, I don't know if you see them, but I see all of these stories of, here's how I accidentally spent $60,000 on AWS. And it's always like, oh, I didn't even realize that you could do it that way. Yeah. And a lot of times

00:15:08.760 --> 00:15:32.300
it's a misunderstanding of auto scaling or something like that, right? There's a crazy story of this woman who wrote an AI-or-not image-detecting app. She was a photographer and she got really frustrated that there's all this AI-generated art. And so she's like, I'm going to make an app that will tell you or only show you real art, not AI art, so it has to filter a bunch of stuff. Right.

00:15:32.420 --> 00:15:37.240
She made that serverless and it became super popular, like the fifth or sixth most popular thing in the App Store.

00:15:37.620 --> 00:15:52.980
And it just scaled like it was supposed to. There was no downtime, but it scaled to like a $96,000-and-climbing Vercel bill. I know you guys talked about maybe a Kubernetes story with like a $50,000 surprise bill. These are not good things.

00:15:53.200 --> 00:16:03.600
The story you just told is actually a positive one, right? She made a thing that was useful and a lot of people used it. And so it costs more money. Like it's unfortunate that it was a surprising amount of money, but like it was all useful work.

00:16:04.150 --> 00:16:08.760
I think what we see in customers pre-Coiled is like often their costs are not useful work.

00:16:09.170 --> 00:16:10.540
A story from, go ahead, Nat.

00:16:10.660 --> 00:16:14.060
I mean, also part of, I think, what the cloud makes hard is these like guardrails.

00:16:14.860 --> 00:16:17.960
You're doing something that you don't know if it's going to be risky.

00:16:18.580 --> 00:16:26.660
And so part of what we try to do is like put in defaults, put in controls, so that you can't accidentally spend that much money.

00:16:27.120 --> 00:16:33.880
Just to, I mean, like if you don't know the cloud, I remember before being a cloud engineer, like I don't want to sign up for this account and put in my credit card.

00:16:34.300 --> 00:16:35.680
I don't know what the bill is going to be.

00:16:35.800 --> 00:16:37.400
So I'll tell two quick anecdotes.

00:16:37.920 --> 00:16:40.000
One is my first experience with another surprising bill.

00:16:40.060 --> 00:16:42.400
I was like in graduate school, signed up for Amazon.

00:16:42.740 --> 00:16:46.060
I was on the free tier, created some VMs, played around, and turned them off.

00:16:46.440 --> 00:16:48.960
And then like three months later, I get a bill for $400.

00:16:49.680 --> 00:16:50.620
And it wasn't the VMs.

00:16:50.620 --> 00:16:56.480
It was the like attached storage to the VMs or some networking resource that had stuck around that I had no concept of.

00:16:56.980 --> 00:17:00.980
There were abstractions I wasn't really aware of. AWS did a fine job.

00:17:01.110 --> 00:17:13.800
They actually refunded me the money. They credited it back. That happens all the time, but like, that's a case where the cloud is really complex and it's really easy to shoot yourself in the foot with any complex system, especially when that complex system comes with dollars attached.

00:17:14.060 --> 00:17:26.959
Yeah. And it's not just compute, right? That's in your example, it was storage, but there's all sorts of little other services. Oh, I just spun up a database for that. And we actually inserted way more data than I thought and then forgot to delete that or whatever, right?

00:17:27.120 --> 00:17:49.580
We've got thousands of customers who do the same thing and then we run through those and we deal with them. So we've seen a lot of those same stories. Another story, also pre-Coiled but later in my professional life. This is the $50,000 story you're mentioning. I was running a Kubernetes cluster for a customer, for a research group that I was collaborating with. And we're running Jupyter stuff and Dask stuff. They were all pretty happy.

00:17:50.020 --> 00:17:55.680
They had to learn Kubernetes, which they weren't super happy about, but they were able to do things they couldn't do before and operate on scales they couldn't do before.

00:17:55.780 --> 00:17:57.060
And this was huge.

00:17:57.260 --> 00:17:57.940
It was really exciting.

00:17:58.220 --> 00:18:02.180
And then one month I got an email, it's like, "Hey, we burned through our annual budget last month.

00:18:02.580 --> 00:18:04.580
We don't know why." I was like, "Hey, what's going on?

00:18:04.660 --> 00:18:16.860
Everything seems fine in the logs, but there's a surprise $50,000 bill." So, well, one thing that's different is that we're now running this thousand node job, but only for like 10 minutes, every six hours.

00:18:17.100 --> 00:19:05.900
So every six hours, this job comes on, runs on a thousand machines, and then turns off. Do the math, it should be like 10 bucks a day, 20 bucks a day, obviously not thousands of dollars a day. And so what had happened is that their code brought up lots of Kubernetes pods. All the pods came up, then 20 minutes later, they went down, everything worked great. But beneath Kubernetes, there was a node pool, and the node pool had attached an auto scaling group, right? And that auto scaling group had a policy. It's like, hey, if you need lots of nodes, no problem, we'll give you lots of nodes. But in scaling down, I expected the nodes to go away. And actually the policy was, if the average CPU percentage is less than 50%, remove one node, check back every five minutes. And so they were getting a thousand nodes. And then five minutes later, they got 999 nodes and then 998 nodes.

00:19:06.500 --> 00:19:10.040
And that would decline very slowly. And then six hours later, go back up to a thousand.
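
NOTE
A rough sketch of the arithmetic behind this story, assuming the numbers as told: 1,000 nodes, a 10-minute job, a 6-hour cycle, and a policy that removes one idle node per 5-minute check. The function name and the simplified billing model (whole 5-minute intervals, no price attached) are illustrative assumptions, not details of the actual cluster.
```python
def node_minutes_per_cycle(peak_nodes, job_minutes, cycle_minutes, check_minutes=5):
    """Node-minutes billed in one cycle when idle nodes are removed one per check."""
    nodes = peak_nodes
    billed = 0
    for start in range(0, cycle_minutes, check_minutes):
        billed += nodes * check_minutes
        # Assumed policy: after a fully idle interval, remove a single node.
        if start >= job_minutes and nodes > 0:
            nodes -= 1
    return billed
#
ideal = 1000 * 10  # every node shuts down the moment the job ends
actual = node_minutes_per_cycle(peak_nodes=1000, job_minutes=10, cycle_minutes=360)
print(f"ideal: {ideal / 60:.0f} node-hours, actual: {actual / 60:.0f} node-hours")
print(f"overspend factor: {actual / ideal:.1f}x")
```
Under these assumptions the slow scale-down bills roughly 35 times the node-hours the job itself needs, because the pool never gets close to empty before the next run starts.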

00:19:10.370 --> 00:19:23.500
And that policy of like, remove one node if CPU utilization is low, makes a whole lot of sense if you are in the web services space, because that's kind of the cadence and kind of the dynamic scales that occur in web services.

00:19:23.980 --> 00:19:24.580
Right, right.

00:19:24.960 --> 00:19:30.000
Ebbs and flows throughout the day, but it's rarely a huge spike and then a huge drop, yeah.

00:19:30.240 --> 00:19:38.140
Yeah, it makes no sense for the kind of users we deal with who want 50 GPUs for 10 minutes and 1,000 CPUs for an hour and then nothing.

00:19:38.640 --> 00:19:40.520
It's like there's two different lessons here.

00:19:40.640 --> 00:19:46.480
One is that the technology that we saw wasn't well-tuned for our audience, for our user base.

00:19:46.930 --> 00:19:49.760
And also like there were just more abstractions than there needed to be.

00:19:50.120 --> 00:19:52.880
What I really wanted at the time wasn't Kubernetes and node pools.

00:19:53.070 --> 00:19:55.560
It was just like, I want a thousand VMs.

00:19:55.980 --> 00:19:59.440
I wanted EC2 was the right abstraction for me.

00:19:59.790 --> 00:20:01.320
I didn't want any other stuff on top.

00:20:01.700 --> 00:20:04.320
And so when we built Coiled, we actually designed for that.

00:20:04.390 --> 00:20:06.280
We designed for raw VMs.

00:20:06.570 --> 00:20:07.800
We call it the raw VM architecture.

00:20:08.320 --> 00:20:12.420
And we just spin up a thousand EC2 instances or a thousand Google or Azure equivalents.

00:20:12.840 --> 00:20:15.460
We hook them all up dynamically and then we shut them all down when we're done.

00:20:15.840 --> 00:20:17.660
And that approach is kind of weird.

00:20:17.870 --> 00:20:24.140
A lot of our customers when they first see it, like that's odd, but it actually provides like a really interesting architecture that we found to be really interesting.

00:20:24.580 --> 00:20:25.840
If you're game, go ahead.

00:20:25.940 --> 00:20:28.780
I do think that is certainly what people want, right?

00:20:29.120 --> 00:20:33.920
You don't want to have one of these abstraction layers in there.

00:20:34.040 --> 00:20:36.080
You want just, I just need these machines.

00:20:36.310 --> 00:20:37.320
I need to run my code.

00:20:37.460 --> 00:20:40.440
But I think you're saying, let's walk through an example.

00:20:40.530 --> 00:20:41.360
And I think that's great.

00:20:41.390 --> 00:20:51.000
I think one of the challenges that you're going to run into is like, how do you even make a thousand VMs quickly and not spend most of your compute on machine setup and configuration?

00:20:51.570 --> 00:20:57.880
And things like, even if you're not using some auto-scaling, auto-tune down sort of thing, right?

00:20:58.030 --> 00:21:03.580
You think those would be the kind of problems we'd run into with 10 machines, but with a thousand, you run into other problems, all sorts of problems that'll show up.

00:21:03.840 --> 00:21:06.600
I'm actually, I'm expecting the demo to kind of fail.

00:21:07.220 --> 00:21:08.900
I think it'll be interesting to see what happens.

00:21:09.740 --> 00:21:11.320
It's weird doing a demo on a podcast.

00:21:11.900 --> 00:21:11.980
Yeah.

00:21:12.180 --> 00:21:12.460
I know.

00:21:12.700 --> 00:21:15.980
Everyone listening, I'm going to narrate this very carefully.

00:21:16.380 --> 00:21:20.580
And if you go to the YouTube stream, you can watch it at 21:25.

00:21:21.560 --> 00:21:26.780
But I will narrate it because there's some interesting ideas, like the idioms and stuff that I spoke about.

00:21:26.880 --> 00:21:28.380
I'm on my local machine.

00:21:28.640 --> 00:21:30.840
I'm in a Jupyter notebook, but I could be in VS Code or Cursor or whatever.

00:21:31.340 --> 00:21:37.600
And I'm typing in, I've imported the coiled library, and I'm going to create a Coiled cluster, just typing some code into Python.

00:21:37.900 --> 00:21:39.180
n_workers equals a thousand.

00:21:40.140 --> 00:21:41.700
We'll ask for some ARM machines.

00:21:42.380 --> 00:21:45.080
We'll ask for spot if it's available.

00:21:45.620 --> 00:21:47.220
If it's not available, fall back to on-demand.

00:21:47.870 --> 00:21:51.300
And we'll ask for each machine to have maybe just a couple of CPUs.
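
NOTE
A minimal sketch of the request being described, not the code typed on screen. The keyword names (n_workers, arm, spot_policy, worker_cpu) are assumed from Coiled's public documentation and may differ by version. The helper start_cluster is hypothetical, and calling it needs the coiled package plus a configured cloud account, and would launch real, billable machines.
```python
# Options mirroring the spoken description; names assumed from Coiled's docs.
cluster_options = {
    "n_workers": 1000,                     # "n_workers equals a thousand"
    "arm": True,                           # "we'll ask for some ARM machines"
    "spot_policy": "spot_with_fallback",   # spot if available, else on-demand
    "worker_cpu": 2,                       # "just a couple of CPUs"
}
#
def start_cluster(options):
    """Hypothetical helper: create the cluster only when explicitly called."""
    import coiled  # imported lazily so this sketch loads without the package
    return coiled.Cluster(**options)
#
print(cluster_options)
```
Keeping the options in a plain dictionary makes the request easy to review before anything is launched.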

00:21:51.500 --> 00:21:53.800
Matthew, before we go on, let's just talk about some of these things.

00:21:54.160 --> 00:21:55.960
You go to Coiled and you say, create me a cluster.

00:21:56.540 --> 00:21:57.320
Workers is a thousand.

00:21:57.720 --> 00:21:59.620
That's a thousand EC2 instances, right?

00:21:59.900 --> 00:22:00.520
It will be, yeah.

00:22:00.620 --> 00:22:01.080
That's insane.

00:22:01.340 --> 00:22:02.320
I'm also not going to Coiled.

00:22:02.480 --> 00:22:05.840
I'm going to my local Python environment here on my MacBook.

00:22:06.080 --> 00:22:08.960
Mac mini is in Austin, but it could be on a CI job or wherever.

00:22:09.180 --> 00:22:11.840
By saying Coiled, I meant like you're using the Coiled API.

00:22:11.880 --> 00:22:17.320
locally. Yeah, yeah. And then what are spot instances for people who don't live in EC2?

00:22:17.620 --> 00:22:40.640
Yeah. So spot, also called preemptible in other clouds sometimes, are just instances that are cheaper because they don't have the guarantee of sticking around. The cloud can claim them back if some other customer is willing to pay full price. And so we're looking for sort of cheap capacity. And then if it's not there, we actually are also willing to pay for on-demand, for full-priced instances.

00:22:41.120 --> 00:22:45.540
So we're going to try to get a thousand, but we're like, hey, if there's any discount things around, I'm happy to take them too.
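
NOTE
The spot-with-fallback idea can be sketched as a simple split. The function split_request and the availability figure are hypothetical and only illustrate the policy described here: take whatever discounted capacity exists, then fill the remainder at full price.
```python
def split_request(requested, spot_available):
    """Fill as much of the request as possible with spot, the rest on-demand."""
    spot = min(requested, max(spot_available, 0))
    return {"spot": spot, "on_demand": requested - spot}
#
# Illustrative numbers only: 1,000 machines wanted, 640 spot slots on offer.
print(split_request(1000, 640))
```
With no spot capacity at all, the same function returns the whole request as on-demand, which is the fallback case.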

00:22:45.720 --> 00:22:50.880
The alternative is maybe you're going to set up, and again, this is more like the web API world.

00:22:51.120 --> 00:22:54.500
I'm going to set up an EC2 instance and I'm going to configure my website on it.

00:22:54.620 --> 00:22:59.840
I'm just going to leave it running because my website should be up 24/7 in a perfect world.

00:23:00.460 --> 00:23:02.660
So I'm going to get a machine and just leave it.

00:23:03.000 --> 00:23:08.880
Those are often reserved instances, which have a different type of pricing, but a long-term commitment, right?

00:23:09.220 --> 00:23:12.980
You pay less by committing to pay for it for a month or a year or something.

00:23:13.410 --> 00:23:15.300
And so that's kind of the opposite of the spot.

00:23:15.450 --> 00:23:20.240
It's like, if there's anything that just happened to be hanging around, give us your cheap temporary ones, right?

00:23:20.430 --> 00:23:21.100
Okay, cool.

00:23:21.780 --> 00:23:24.380
So I think people get a sense of what you're going to go ask for.

00:23:24.660 --> 00:23:25.560
Let's just run that.

00:23:25.970 --> 00:23:28.860
So take about a minute, maybe a couple of minutes, because it's a large number of machines.

00:23:29.460 --> 00:23:35.620
So the first thing that's happening is that we're scraping my local MacBook for all the Python packages I've installed.

00:23:35.880 --> 00:23:43.380
Every conda package, pip package, local .py file, local editable package. And then we are

00:23:43.860 --> 00:24:14.080
spinning up a thousand machines. Let's talk about that environment a little bit, because it's fine to have a thousand machines, but you are intending to write a bunch of code in a Jupyter notebook that probably depends on a bunch of stuff, like what you have conda or pip installed, and exactly those versions. And they might be uncommon things that you want to run, or maybe you have local files that you're going to subsequently say, load this CSV file or this Parquet file and jam on it, right? How does that stay coherent?

00:24:14.440 --> 00:24:51.000
I'll actually add to it, make it even more complex. We're also running this first from my MacBook, but I'm actually running it on some Linux machines. And so I've got to shift the architecture of those packages where appropriate. I may have private packages that I'm pulling from my company's local or Artifactory repository. I may have packages that I've installed locally that I've put debug print statements into. Things get actually quite hairy trying to replicate someone's local environment into a remote environment. And we try to do exactly that. We try to replicate as much as we can a local environment remotely. That could be data.

00:24:51.070 --> 00:24:52.900
You've mentioned data like small files might move.

00:24:53.180 --> 00:25:12.600
Large files probably not. Yeah, those might be pulled out of S3 storage or something like that, right? It's a complicated situation. Just a note for what's on the screen right now: it says you've booted 838 machines and, uh, working on the environment for 137 of them. That's pretty wild. We only got 160 to go. Yeah, I

00:25:12.600 --> 00:25:28.620
mean, also things to note: we got 54 m8g.larges. Those are like the nice generation of AWS, and I actually didn't know those were available. Those are new for me. We've got 942 m7g.larges and four of the oldest generation, m6g.

00:25:29.130 --> 00:25:34.400
So actually Coiled had to go to the cloud and get actually like a variety of different instance types because AWS ran out.

00:25:34.670 --> 00:25:39.660
In fact, maybe Nat, you can talk a little bit about like what just happened behind the scenes there.

00:25:39.920 --> 00:25:42.560
What are all the steps that we had to do in order to get those machines?

00:25:42.780 --> 00:25:44.060
They're also now all available.

00:25:44.240 --> 00:25:55.100
There's like all sorts of things that at the scale of like, I want one or two computers that just work, that like all sorts of things break down when you're asking for hundreds or thousands.

00:25:56.020 --> 00:25:59.080
So some of that is like, you can't just make individual API calls anymore.

00:25:59.610 --> 00:26:01.080
You would very quickly get rate limited.

00:26:01.290 --> 00:26:06.180
And then we would spend 10 minutes asking for these VMs instead of 15 seconds.

00:26:06.780 --> 00:26:07.980
So we use a variety of things.

00:26:08.370 --> 00:26:09.340
We use fleets.

00:26:09.690 --> 00:26:23.720
We use requests that, again, so this ties nicely with spot, where you say, basically, AWS, I want you to give me any of this range of things that have the best availability at the best price.

00:26:24.500 --> 00:26:30.960
So here, yeah, Matt is pulling up and we can actually see, we can see how much spot we got.

00:26:31.570 --> 00:26:35.620
So it looks like we got a fair number of, actually, I don't know, what is that?

00:26:35.710 --> 00:26:38.700
Like four fifths of our instances, we did manage to get spot.

00:26:39.250 --> 00:26:49.380
And those are going to be anywhere from like, I don't know, roughly half the price of an on-demand instance, sometimes more, but often even less.

00:26:49.580 --> 00:26:52.260
So they can be 30% of full price.
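
NOTE A minimal arithmetic sketch, not part of the spoken audio, of what those numbers imply. It takes the "four fifths spot" figure and the "roughly half" and "30%" price ratios from the conversation and expresses the blended price relative to on-demand (on-demand = 1.0). No actual AWS prices are assumed. `blended_price_ratio` is a hypothetical helper name.
```python
def blended_price_ratio(spot_fraction: float, spot_price_ratio: float) -> float:
    """Average price per instance relative to on-demand (on-demand = 1.0)."""
    on_demand_fraction = 1.0 - spot_fraction
    return spot_fraction * spot_price_ratio + on_demand_fraction * 1.0
if __name__ == "__main__":
    for ratio in (0.5, 0.3):  # spot at half price, then at 30% of full price
        print(ratio, round(blended_price_ratio(0.8, ratio), 2))
```
With four fifths of the fleet on spot, the blended price comes out to about 60% of on-demand at half-price spot and about 44% when spot is 30% of full price.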

00:26:52.680 --> 00:26:56.060
We were actually running this last night and I was actually hoping that it would fail.

00:26:56.160 --> 00:26:57.220
It didn't fail, sadly.

00:26:58.100 --> 00:27:00.180
Let me see if I can bring up a fail case.

00:27:00.380 --> 00:27:01.820
You have an odd hope for your demo.

00:27:03.380 --> 00:27:04.720
So it wasn't us that failed.

00:27:04.760 --> 00:27:06.740
It was Amazon that failed, actually.

00:27:07.360 --> 00:27:09.900
Amazon failed because it actually didn't have capacity.

00:27:10.120 --> 00:27:14.520
It turns out that the cloud is unlimited until you start doing really big things.

00:27:14.840 --> 00:27:17.640
And then it's like, oh, you got to be clever if you want to get.

00:27:18.040 --> 00:27:21.980
And I mean, to some extent, everyone knows this nowadays with GPUs, right?

00:27:22.140 --> 00:27:23.500
GPUs are fairly constrained.

00:27:23.930 --> 00:27:36.580
But even when you're asking for just like 2,000 CPU instances, AWS often says, sorry, here's 200, unless you know how to like frame that request in the right way.

00:27:36.840 --> 00:27:36.960
Right.

00:27:37.180 --> 00:27:42.080
And you guys have solved, that's some of the gnarly DevOps you all have solved so other people don't have to, right?

00:27:43.840 --> 00:27:48.880
This portion of Talk Python To Me is brought to you by our latest course, Just Enough Python for Data Scientists.

00:27:49.440 --> 00:27:54.820
If you live in notebooks but need your work to hold up in the real world, check out Just Enough Python for Data Scientists.

00:27:55.460 --> 00:28:02.820
It's a focused code-first course that tightens the Python you actually use and adds the habits that make results repeatable.

00:28:03.500 --> 00:28:11.080
We refactor messy cells into functions and packages, use Git on easy mode, lock environments with uv, and even ship with Docker.

00:28:11.700 --> 00:28:14.400
Keep your notebook speed, add engineering reliability.

00:28:15.100 --> 00:28:16.280
Find it at Talk Python Training.

00:28:16.530 --> 00:28:19.280
Just click Courses in the navbar at talkpython.fm.

00:28:20.180 --> 00:28:23.920
Matt, I feel like we should run some code on this thing now. You got them sitting here ready.

00:28:24.340 --> 00:28:53.920
There's a great problem of actually supporting Python users, I think, is that people want to do all sorts of different kinds of things. They want to load a bunch of data with Pandas or with Polars or with DuckDB, or they want to train a machine learning model with a certain kind of GPU or a certain kind of whatever. There's actually a lot of variety in what people do. And it's challenging to build a tool that provides, again, some of that abstraction, but not others. Like Snowflake provides the abstraction of SQL, and that's like a very clear thing to do.

00:28:54.820 --> 00:29:42.640
We provide the abstraction of, here's a machine that you can play with, do whatever you want to do with it. And sometimes that's Python code, sometimes it's other weird stuff. Rather than show code, I might actually show a few examples. People will do things with pandas, like what you mentioned, with machine learning, with climate science. There's all sorts of different weird things. We actually found that a lot of people would, inside of their Python code, import the subprocess module, and then they would run a process that was calling some other Fortran or C code that had nothing to do with Python. And so we found that and we said, great, let's go and let's support that by running arbitrary batch jobs, arbitrary programs. Like here in this example, I'm just running echo hello world.
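
NOTE A minimal sketch, not part of the spoken audio, of the subprocess pattern described above: Python code that simply launches another program. It uses only the standard library and assumes a POSIX-style `echo` executable is on the PATH. `run_command` is a hypothetical helper name, and `echo` stands in for the Fortran or C programs mentioned.
```python
import subprocess
def run_command(args: list[str]) -> str:
    """Run an external program and return its standard output as text."""
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout
if __name__ == "__main__":
    print(run_command(["echo", "Hello World"]), end="")
```
Passing the command as a list avoids shell quoting issues, and `check=True` raises an error if the external program exits with a non-zero status.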

00:29:43.160 --> 00:29:54.100
So it's a weird set of problems to handle the infrastructure for this set of people, but not be very opinionated about what they do with that infrastructure.

00:29:54.460 --> 00:29:57.380
So at this point, they pretty much have a machine.

00:29:57.600 --> 00:30:01.180
They can, if they can do it from a Jupyter Notebook, they're kind of good to go.

00:30:01.380 --> 00:30:03.100
Yeah, they have a thousand machines.

00:30:03.560 --> 00:30:10.000
And those machines look just like their machine, just more numerous or bigger or with GPUs or whatever you like.

00:30:10.100 --> 00:30:12.480
Sure. And so how do I bring data back together?

00:30:12.630 --> 00:30:13.540
How do I fan it out?

00:30:13.680 --> 00:30:18.940
How do I map reduce or like execute a bunch of, I've got a million rows.

00:30:19.070 --> 00:30:23.800
I want to send a hundred to each machine or I don't know, whatever the math works out to be a thousand each machine.
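
NOTE A minimal sketch, not part of the spoken audio, of the "whatever the math works out to be" step: splitting a million rows into near-equal contiguous ranges, one per machine. It is plain Python and says nothing about how Dask itself partitions data. `partition_bounds` is a hypothetical helper name.
```python
def partition_bounds(n_rows: int, n_machines: int) -> list[tuple[int, int]]:
    """Split row indices [0, n_rows) into n_machines contiguous, near-equal ranges."""
    base, extra = divmod(n_rows, n_machines)
    bounds = []
    start = 0
    for i in range(n_machines):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first ranges
        bounds.append((start, start + size))
        start += size
    return bounds
if __name__ == "__main__":
    bounds = partition_bounds(1_000_000, 1000)
    print(len(bounds), bounds[0], bounds[-1])
```
A million rows over a thousand machines works out to a thousand rows each. The remainder handling only matters when the division is not exact.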

00:30:24.040 --> 00:30:27.440
Right. So now we're asking the question, how are we going to use the machines that we have?

00:30:27.780 --> 00:30:30.120
The answer to the questions you just asked me is Dask.

00:30:30.190 --> 00:30:35.560
You might use Dask to, hey, you got a petabyte of parquet data on S3, use a Dask cluster.

00:30:35.740 --> 00:30:36.940
Or if you want, use a Spark cluster.

00:30:37.360 --> 00:30:40.260
We can spin up a Spark cluster, we can spin up Polars, we can spin up lots of different things.

00:30:40.680 --> 00:30:45.200
And then now you're at the point of the, you're at the level of the sort of distributed computing framework.

00:30:45.660 --> 00:30:48.700
And they can go and run things with that, with whatever distributed computing framework they like.

00:30:49.360 --> 00:30:52.760
Often, there's then a whole other set of problems they then deal with.

00:30:52.940 --> 00:30:55.300
Again, when we started Coiled, it was all around deploying Dask.

00:30:55.680 --> 00:31:00.620
But what we found is actually like most times the problems they wanted to solve were not that complicated.

00:31:00.620 --> 00:31:01.400
They were not sophisticated.

00:31:01.940 --> 00:31:02.900
They were very simple problems.

00:31:03.500 --> 00:31:06.080
They had a thousand parquet files in S3.

00:31:06.200 --> 00:31:08.200
They wanted to do the same thing on each parquet file.

00:31:08.600 --> 00:31:09.880
They actually didn't want to use Dask DataFrame.

00:31:09.920 --> 00:31:12.320
They wanted to use Polars, or they wanted to use DuckDB.

00:31:12.880 --> 00:31:15.540
And so we would give them APIs that let them, you know,

00:31:15.600 --> 00:31:18.600
so an example of an API is the coiled function API.

00:31:19.080 --> 00:31:19.640
It's a decorator.

00:31:20.060 --> 00:31:21.700
If you're familiar with Modal, think something similar.

00:31:22.380 --> 00:31:23.200
And it does the same thing.

00:31:23.500 --> 00:31:24.400
It spins up a machine.

00:31:24.820 --> 00:31:30.000
You run your function on that machine, and it goes ahead and spins up the VM, runs it, scales it down.
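
NOTE A minimal sketch, not part of the spoken audio, of the "same thing on each Parquet file" pattern. `run_locally` fans a function out over files with standard-library threads so the sketch runs anywhere. `run_on_cloud` shows how the same function might be wrapped with Coiled's function decorator. That part assumes the `coiled` package, a configured account, and that the `region` and `arm` options and the `.map` method behave as in Coiled's docs. The bucket name and all helper names are hypothetical.
```python
from concurrent.futures import ThreadPoolExecutor
def process_file(path: str) -> int:
    """Stand-in for per-file work (for example, a Polars or DuckDB query on one Parquet file)."""
    return len(path)  # placeholder result so the sketch runs without any data
def run_locally(paths: list[str]) -> list[int]:
    """Apply the same function to every file using local threads."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(process_file, paths))
def run_on_cloud(paths: list[str]) -> list[int]:
    """Same fan-out on cloud VMs. Calling this launches billable resources."""
    import coiled  # lazy import (assumption: package installed and account configured)
    remote = coiled.function(region="us-east-1", arm=True)(process_file)
    return list(remote.map(paths))
if __name__ == "__main__":
    paths = [f"s3://example-bucket/part-{i:04d}.parquet" for i in range(1000)]
    print(sum(run_locally(paths)))
```
The per-file function stays ordinary Python. Only the wrapper decides whether it runs on local threads or on remote machines.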

00:31:30.220 --> 00:31:30.920
Yeah, that's really neat.

00:31:31.140 --> 00:31:36.900
And when you say on your website, it says serverless Python, is it the same thing as like...

00:31:36.900 --> 00:31:38.060
Serverless is a weird term.

00:31:38.300 --> 00:31:39.000
Yeah, I know it is.

00:31:40.220 --> 00:31:46.100
Is it still spinning up dedicated VMs, running your code on that, and then having it going away, and you're just not thinking about it?

00:31:46.500 --> 00:31:53.460
Or is it truly leveraging the serverless functionality as AWS would refer to it in its console?

00:31:53.680 --> 00:31:57.000
Serverless always means there's a server under the hood.

00:31:57.460 --> 00:32:00.960
But roughly the distinction is who has to worry about that server?

00:32:01.240 --> 00:32:22.980
Yeah. Yeah. When we say serverless, we're still using EC2 instances. I mean, Amazon, when you're running things on Lambda, they're still using EC2 instances, but it's a, it's an abstraction where you don't have to worry about that. You get those EC2 instances as you need them, you get them quickly and you get them only for as long as you need them. Yeah. That makes a lot of sense. So

00:32:23.300 --> 00:32:44.700
this is something that really kind of blew my mind here when you all showed me: you put a decorator, @coiled.function, onto just a regular Python function. And you can express things like the machine type you need, how many of them you want, and so on. And it just fires it up seamlessly behind the scenes when that function is called and then it goes away, right? Yeah. So this is all

00:32:44.900 --> 00:32:54.740
using the same underlying technology of what we joke about internally is that our core competency is turning VMs on and off. Once you have that technology, writing APIs around it is pretty cheap.

00:32:55.140 --> 00:32:57.540
And so a common API is a Python decorator.

00:32:58.000 --> 00:32:59.320
That's one of the various APIs we do.

00:32:59.720 --> 00:33:04.860
And yeah, a common application is I'm running PyTorch code and I'm using it on my MacBook.

00:33:05.240 --> 00:33:07.780
It's fine, but I like to use an NVIDIA CUDA GPU.

00:33:08.420 --> 00:33:12.060
Cool, let's decorate that function with the GPU type that you want.

00:33:12.480 --> 00:33:13.540
And it goes ahead and runs that.

00:33:13.880 --> 00:33:15.260
And it runs locally in my environment.

00:33:15.380 --> 00:33:17.860
I'm still typing on my MacBook Pro in Cursor or whatever.

00:33:18.280 --> 00:33:23.640
But that function now, what it does is it just like, it spins up VM, runs code, keeps VM around for a little while.

00:33:23.800 --> 00:33:53.380
see if I want to run anything else on it again, and then spins it down. And that gives a lot of ability for the user to start experimenting with hardware. You can now run that function on any kind of GPU you want. And they do, they try lots of different things. We had a customer a long time ago, this is when GPUs were at first very hard to get access to, and they would run through every region in their cloud, trying to find A100s. And it's super easy in Coiled to say, great, I want to run this in region eu-central-1. Let's see if Frankfurt has any GPUs.

00:33:53.500 --> 00:34:11.139
Nope, none there. Great. Let's try this in AP, Southwest, whatever. Let's see if Australia has any GPUs. And they would just, because it was now easy to play with things, it was easy to use the cloud. They started to experiment a lot more. That was really valuable. We often see people playing with ARM versus Intel versus AMD, playing with different GPU types.

00:34:11.480 --> 00:34:22.159
Something that's interesting to me about data science, I mean, I, to some extent, come from the web world. And what you do is you just look at the list of instances, you pick one that looks boring.

00:34:22.790 --> 00:34:23.580
You just use that.

00:34:23.710 --> 00:34:25.220
It runs for a year.

00:34:25.330 --> 00:34:26.340
You don't think about it.

00:34:26.620 --> 00:34:35.600
And so much in data science, it actually makes sense to try out different instance types to explore, what's this GPU do for me?

00:34:36.060 --> 00:34:37.080
Sometimes that's really helpful.

00:34:37.320 --> 00:34:38.080
Sometimes it's not.

00:34:38.389 --> 00:34:39.720
To move around to different regions.

00:34:40.100 --> 00:34:49.560
So if you have a data set that's in one region, it makes an orders-of-magnitude difference how quickly you can download it if you are close to it

00:34:49.820 --> 00:34:51.040
than if you are far from it.

00:34:51.399 --> 00:35:00.040
But even things like specifics of the CPU family, we, for fun, we get to run like benchmarks on different things.

00:35:00.640 --> 00:35:06.160
And it's really nice because I just go in and like, oh, AWS came out with a new ARM instance type.

00:35:06.560 --> 00:35:08.680
And I can go like change one line of code.

00:35:09.020 --> 00:35:13.020
And now we run all our benchmarks on, as Matt was noticing, M8Gs.

00:35:13.280 --> 00:35:14.500
I was surprised to see M8Gs.

00:35:15.060 --> 00:35:16.200
Have you run the benchmarks yet now?

00:35:16.500 --> 00:35:16.940
How are they doing?

00:35:17.280 --> 00:35:18.800
Yeah, I don't know if we have on those yet.

00:35:19.000 --> 00:35:20.400
I think, what was the difference between six and seven?

00:35:20.470 --> 00:35:20.800
Do you remember?

00:35:20.940 --> 00:35:25.940
All of these are, so these are the like Amazon designed ARM CPUs.

00:35:26.420 --> 00:35:30.460
Some of those differences are the family of ARM.

00:35:30.640 --> 00:35:33.420
It's like ARM v8 versus ARM v7.

00:35:34.000 --> 00:35:40.320
Some of that actually really does make a difference for data science workloads because it has to do with those like wide instructions.

00:35:41.420 --> 00:35:46.460
So they're like mini GPUs that they're able to do many things in parallel on the CPU.

00:35:47.100 --> 00:35:48.400
Sometimes they put in better memory.

00:35:48.840 --> 00:35:52.660
So I think some of these new instances have DDR5 instead of DDR4.

00:35:53.400 --> 00:35:55.540
And it's like, does that make a difference for my workload?

00:35:56.000 --> 00:35:57.100
Is it going to save money?

00:35:57.620 --> 00:36:02.180
I can't tell you that a priori, but it's really easy to just try.

00:36:02.470 --> 00:36:10.540
And sometimes you're like, oh, wow, that this hardware is like better or, oh, wow, this new hardware actually doesn't make any difference for what I do.

00:36:10.740 --> 00:36:16.560
It seems to me like the infrastructure you all built makes experimenting way easier, right?

00:36:16.720 --> 00:36:20.880
If it's really hard for you to set up a machine, let's come back to this in a minute, but I have a question first.

00:36:21.120 --> 00:36:36.780
But I think a lot of times what people might do is, hey, instead of trying to do a lot of the scaling stuff, let's just set up one machine with 64 cores and a lot of memory and just set up a notebook server and let people, we'll just configure it and let people have that.

00:36:37.160 --> 00:36:37.520
Yeah, yeah.

00:36:38.000 --> 00:36:39.580
I do want to come back to that.

00:36:39.760 --> 00:36:44.560
But while we're still on this ARM versus x86 thing, what's the story?

00:36:44.700 --> 00:36:48.640
Do you all recommend ARM in the data center these days for this kind of stuff?

00:36:48.790 --> 00:36:49.960
Or is it still x86?

00:36:50.620 --> 00:36:51.140
What's the right choice?

00:36:51.360 --> 00:36:52.000
Try it.

00:36:52.320 --> 00:36:53.060
That's the short answer.

00:36:53.340 --> 00:36:55.440
I mean, ARM is really nice sometimes.

00:36:56.580 --> 00:37:00.540
AWS and a lot of the clouds are really into ARM.

00:37:00.770 --> 00:37:03.220
So they're doing some really nice technology.

00:37:03.620 --> 00:37:05.240
They're offering it at a good price.

00:37:05.960 --> 00:37:07.640
It tends to run more power efficiently.

00:37:07.860 --> 00:37:11.560
And power is like actually one of the major costs at a data center.

00:37:11.790 --> 00:37:13.720
I would love to run ARM for my infrastructure.

00:37:14.360 --> 00:37:26.180
But I'm concerned that there might be some native library or some web server that doesn't run on ARM, is not built for ARM, or doesn't work right on ARM.

00:37:26.480 --> 00:37:31.680
A couple months in, like, now I want to add this one thing that would be really nice, but it doesn't work on ARM.

00:37:31.920 --> 00:37:33.580
We've been doing this for a few years now.

00:37:33.760 --> 00:37:37.320
It's a lot better, the support, than it was, I think, three or four years ago.

00:37:37.880 --> 00:37:43.920
But also this is like part of, so you're talking about all of those tricks we do to get this seamless experience.

00:37:44.700 --> 00:37:51.780
And part of that is people have Intel machines on their desk or they have ARM MacBooks.

00:37:52.240 --> 00:37:57.000
They're running on Intel or AMD or ARM in the cloud.

00:37:57.760 --> 00:38:03.060
And all of this is stuff we're like having to figure out, OK, so you have this software locally.

00:38:03.410 --> 00:38:05.080
What does that mean in the cloud?

00:38:05.080 --> 00:38:07.320
And it might be slightly different versions of things.

00:38:07.790 --> 00:38:11.040
It might even be you don't have a GPU locally.

00:38:11.300 --> 00:38:17.900
So you have the CPU version of PyTorch, but you actually want to use PyTorch on a GPU in the cloud.

00:38:18.560 --> 00:38:22.980
That involves figuring out how to install the right versions of those different packages.

00:38:23.440 --> 00:38:30.920
So in the screen share that no one can see, I've just switched my code to turn off ARM and switch from region US East 1 to US West 2.

00:38:31.280 --> 00:38:32.160
And we're just bringing up a new cluster.

00:38:32.700 --> 00:38:36.020
And if I had written some debug code in my code, that would also be updated.

00:38:36.380 --> 00:38:39.580
You were talking earlier, Michael, about experimentation being something to think about.

00:38:39.980 --> 00:39:30.940
I think that is one of the major differences between data workloads and web server workloads: in the data world, yes, there's usually eventually a production part where you're running some model inference server. But there's this long period where you're experimenting, and really optimizing that period is really critical. And Coiled is very much designed to accelerate that experimentation process. Even our choice to copy your environment rather than use Docker is highly informed by that. If you put a Docker build, Docker push cycle into the data science work cycle, it just gums everything up. People end up not doing it. Coiled is smooth enough and easy to use enough that the cloud is now pleasant enough to actually include inside of the user dev cycle. And that's different. That's new. That's fun. And you can just play with stuff.

00:39:31.360 --> 00:39:39.880
So like I've got now 300 machines that are all, nope, I got all the machines. I've got a bunch of Intel machines and a thousand ARM machines. What do you want to do with them, Michael?

00:39:40.540 --> 00:39:46.520
We can run the same experiment right now inside this podcast. And that's, I think, the joy of this.

00:39:47.020 --> 00:39:55.640
And the cost of all of this is like dollars. That last cluster cost me $1.39. This one is costing me

00:39:55.900 --> 00:40:00.240
45 cents so far, 80 bucks an hour. That's actually way less than I expected.

00:40:00.500 --> 00:40:08.680
Yeah, the cloud is both like way cheaper and way more expensive than I realized going in based on whether or not you're doing it correctly or doing it incorrectly.

00:40:09.080 --> 00:40:10.680
There's like several orders of magnitude difference.

00:40:11.300 --> 00:40:12.920
I know we want to talk about cost at some point, Michael.

00:40:13.160 --> 00:40:16.020
Maybe it's fun to talk through some of the like the crazy stories.

00:40:16.200 --> 00:40:17.400
I definitely want to hear some stories.

00:40:17.520 --> 00:40:18.700
I'm always here for stories.

00:40:19.280 --> 00:40:28.840
I first want to talk about the cost and the logistics of this one big shared Jupyter server sort of story versus this.

00:40:29.300 --> 00:40:31.500
No one's going to have, you asked for two CPUs per machine.

00:40:31.820 --> 00:40:35.160
I doubt anyone's going to have a 2,000 CPU single machine.

00:40:35.800 --> 00:40:41.120
I know, I think they might exist, but they're very rare and they're very expensive.

00:40:41.700 --> 00:40:52.240
So what does the cost look like and the challenges of some team that says, let's just set up one huge server and we'll just share it versus working like this?

00:40:52.440 --> 00:40:58.260
I think the short answer is like low tens of thousands of dollars for like an always-on 100-core machine.

00:40:58.600 --> 00:41:01.320
I'm like, I'm honestly bringing up ChatGPT right now to ask that question.

00:41:01.520 --> 00:41:02.900
It's such a special time we live in.

00:41:03.360 --> 00:41:05.960
Before the machines rise up and kill us, it's going to be, it's amazing.

00:41:06.920 --> 00:41:11.340
ChatGPT is telling me about $30,000 if I don't do any optimization.

00:41:11.920 --> 00:41:16.920
When we see this in practice, people turn it off on the weekends, sometimes at night, sometimes not.

00:41:17.380 --> 00:41:18.920
But like tens of thousands of dollars is typical.

00:41:19.160 --> 00:41:23.880
Yeah, it's expensive, but it's also, it's rarely what people actually want to use.

00:41:24.540 --> 00:41:26.500
Because so much you're like, oh, I want to try.

00:41:26.960 --> 00:41:28.280
Sometimes you're like, oh, I have an idea.

00:41:28.420 --> 00:41:31.320
I want to try this experiment with like 10 different parameters.

00:41:31.520 --> 00:41:33.740
I want to search over some things.

00:41:34.320 --> 00:41:35.840
And you're like, okay, I got to do it.

00:41:36.140 --> 00:41:39.160
That's going to take me 10 days now because I got to run it one at a time.

00:41:39.520 --> 00:41:46.900
There's so much of like, I want to just try something that the one big machine doesn't let you easily do.

00:41:47.060 --> 00:41:47.180
Sure.

00:41:47.320 --> 00:41:51.220
And if someone else is trying that experiment, you either wait or you just go slow.

00:41:51.500 --> 00:41:51.600
Right.

00:41:51.840 --> 00:41:55.140
And then there's also the like, oops, I ran something that made it crash.

00:41:55.960 --> 00:41:57.800
And then you're like, call in your DevOps person.

00:41:58.040 --> 00:41:59.160
Can you restart the...

00:41:59.240 --> 00:42:02.580
Or you just ran something that took way longer than you thought it would.

00:42:02.580 --> 00:42:05.160
And you blocked it unintentionally, right?

00:42:05.280 --> 00:42:06.640
For zero or little value, yeah.

00:42:06.760 --> 00:42:07.860
I'll add some other things.

00:42:08.180 --> 00:42:09.320
Or you wanted to use a GPU.

00:42:09.800 --> 00:42:15.460
Or you wanted to now put that thing into production where it's running every day rather than you pressing a cell in Jupyter.

00:42:15.840 --> 00:42:18.500
Or you wanted to use Cursor rather than use Jupyter.

00:42:18.840 --> 00:42:51.440
There's many ways in which the big Jupyter server on the cloud, it technically satisfies the requirement of running code in the cloud, but there's just so much more that we do as data professionals. We run things in production. We run things in different hardware. We develop in new ways. We experiment. And the job description is a lot more variable than that single machine is able to satisfy. Look at cars on the road. They're not all Honda Accords. You got semi-trucks, you got bicycles, you got pedestrians.

00:42:52.240 --> 00:42:58.680
We live in a world where we actually need a lot of different kinds of things. It's that variety that's actually really a core part of the cloud.

00:42:59.040 --> 00:43:00.640
That variety is something that we really care about.

00:43:00.740 --> 00:43:03.520
It's very true to the data science ethos, right?

00:43:03.740 --> 00:43:06.940
Like we're going to experiment, we're going to explore, we're going to play.

00:43:07.040 --> 00:43:14.160
And if that becomes seamless, I mean, one of the problems is I want to experiment, but it's going to take seven hours on my local machine.

00:43:14.760 --> 00:43:21.620
Could I ask that question and get an answer in five minutes if I'm willing to pay 10 bucks or my company's willing to pay 10 bucks or something, right?

00:43:21.760 --> 00:43:23.380
Well, it's going to cost 10 bucks either way.
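
NOTE A minimal arithmetic sketch, not part of the spoken audio, of the "same cost either way" point. It assumes perfectly parallel work and billing proportional to machine time. The per-machine price is derived from the roughly $80 an hour quoted earlier in the conversation for the thousand-machine cluster. `run_cost` and `PRICE` are hypothetical names.
```python
def run_cost(n_machines: int, hours: float, price_per_machine_hour: float) -> float:
    """Cost of running n_machines for the given number of wall-clock hours."""
    return n_machines * hours * price_per_machine_hour
PRICE = 80 / 1000  # dollars per machine-hour, from about $80/hour for about 1,000 machines
if __name__ == "__main__":
    wide = run_cost(1000, 5 / 60, PRICE)  # a thousand machines for five minutes
    narrow = run_cost(1, 1000 * 5 / 60, PRICE)  # one machine for the same machine-hours
    print(round(wide, 2), round(narrow, 2))
```
Both runs consume the same number of machine-hours (about 83), so under these assumptions they cost the same, a little under seven dollars. Only the waiting time differs.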

00:43:23.620 --> 00:43:54.340
It's going to cost either 10 bucks of machine time over a week or 10 bucks of machine time over five minutes, just with a thousand machines. And so I think you used a great word there, Michael, which is play. I think a lot of why Python became popular is that it feels like play often. Like we're given these libraries that are both easy to use and powerful, and that feels like play. We get to like, oh, this is a cool squirt gun. I can press this button and water shoots. I can go shoot my friends. Isn't that fun? And if you look at using the Boto library and AWS, that does not feel like play.

00:43:54.600 --> 00:43:58.460
If you go look at thinking about writing YAML and Kubernetes, that does not feel like play.

00:43:59.020 --> 00:44:05.800
But like here today, we got to play with making 2000 VMs, half ARM, half Intel, half on the US East Coast, half on the US West Coast.

00:44:06.160 --> 00:44:07.720
And we didn't do any work with them, but we could have.

00:44:08.040 --> 00:44:10.680
And now suddenly the cloud is like play.

00:44:11.060 --> 00:44:13.500
And you just do different things when things become playful.

00:44:13.600 --> 00:44:15.200
You behave differently.

00:44:15.640 --> 00:44:16.620
And that's really the fun thing.

00:44:17.020 --> 00:44:18.100
Folks have fun.

00:44:18.160 --> 00:44:22.640
And the cloud is a really fun tool to use once you get past all the pain.

00:44:22.780 --> 00:44:23.180
I agree.

00:44:23.540 --> 00:44:25.720
When you first hear about it, you're like, wow, what can I do with this?

00:44:25.800 --> 00:44:29.240
But then you get into working with Boto and you just kind of want to stop.

00:44:29.740 --> 00:44:32.540
Nat, you probably spend more time with Boto than any of us.

00:44:32.640 --> 00:44:32.680
Yeah.

00:44:32.800 --> 00:44:34.340
You take one for the team, for all of us.

00:44:34.440 --> 00:44:34.500
Thanks.

00:44:34.680 --> 00:44:36.940
A lot of reading API docs so that you don't have to.

00:44:37.120 --> 00:44:38.160
Two questions here.

00:44:38.480 --> 00:44:39.720
One, we've started these clusters.

00:44:40.380 --> 00:44:41.220
Are both still running?

00:44:41.720 --> 00:44:42.680
Is one just still running?

00:44:42.760 --> 00:44:53.440
if I do this in a notebook, how do I ensure that it does shut down, that I don't let it run for too long, like unnecessarily long, right? What's the workflow with that? So broadly, how do we

00:44:53.740 --> 00:45:10.960
constrain costs? You know, to make sure that things are as low cost as possible. Yeah. So one already auto shut down. We weren't using it. Coiled saw we weren't using it. It shut it down. We can bring it up again. It takes a minute. So it's easy to bring things back up and down. If you want it to stay up, there are keywords to control all that. But the default behavior is to shut things down pretty

00:45:10.980 --> 00:45:19.200
aggressively. The other one is still up. We can go do things with it if you like. Could you set timeouts in the cluster when you create it? 100%. Idle timeouts. Yeah. Typically what we do is

00:45:19.320 --> 00:45:33.460
people, if people do want to have something that sort of sits around for a while, they'll set an idle timeout and we'll give it a name. And at the top of their script, it'll say, hey, I want to use my cluster named prod or whatever. And it has an idle timeout of an hour. Make sure that exists.

00:45:33.910 --> 00:45:36.600
And if it doesn't exist, we'll bring it up. If it does exist, we'll connect you to it.
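
NOTE
The named-cluster pattern described above (ask for a cluster by name, reuse it if it is up, start it if it is not, and let an idle timeout shut it down) can be sketched roughly as follows.
This is a minimal in-memory model under stated assumptions: the names clusters, get_or_create and reap_idle are hypothetical and are not Coiled's API, and time is passed in explicitly as seconds.
```python
# Hypothetical in-memory model; none of these names are Coiled's real API.
clusters = {}  # name -> {"idle_timeout_s": float, "last_used": float}
#
def reap_idle(now):
    """Shut down every cluster that has been unused for longer than its timeout."""
    for name in list(clusters):
        c = clusters[name]
        if now - c["last_used"] > c["idle_timeout_s"]:
            del clusters[name]
#
def get_or_create(name, idle_timeout_s, now):
    """Reuse the named cluster if it is still up, otherwise 'start' a new one."""
    reap_idle(now)
    created = name not in clusters
    if created:
        clusters[name] = {"idle_timeout_s": idle_timeout_s, "last_used": now}
    clusters[name]["last_used"] = now  # any use resets the idle clock
    return "created" if created else "reconnected"
#
print(get_or_create("prod", 3600, now=0))        # first request starts the cluster
print(get_or_create("prod", 3600, now=600))      # e.g. after a kernel restart
print(get_or_create("prod", 3600, now=10_000))   # idle for longer than an hour
```
Keying on the name is what lets a restarted notebook kernel reattach instead of piling up new clusters.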

00:45:37.000 --> 00:45:47.560
And that behavior works pretty well. That's cool. So you can do things like restart my kernel in notebooks and then just like reattach to it rather than, well, now I have 7,000 server clusters.

00:45:47.860 --> 00:45:50.600
How'd that happen? That's the opposite of what you're preaching here.

00:45:50.960 --> 00:46:22.500
Sometimes if you do want to change your code, if you're working, like when I work on Dask sometimes and I like need to put a print statement into some code because I want to see what's happening on my cluster. At that point, I'll then recreate a new cluster because I need my code to be re-pushed up. But I can go to the logs and I can go look where all my print statements are and they're there. Like, oh, that wasn't quite right. I'll change my print statement, make a new cluster. One minute later, things are up again. Again, that sort of minute-long thing is not perfect. I wish it was a second. But like a minute does tend to be within the dev cycle tolerance of a lot of humans. And you asked

00:46:22.540 --> 00:46:34.740
for, said, I was just thinking, like, how could you sort of preload these types of things? Like, at some point you could have just a thousand machines hanging around, you always, like a pool you hand out. But as we've been going through, you have all these variations.

00:46:35.100 --> 00:46:36.680
I want an ARM one.

00:46:36.690 --> 00:46:37.440
I want an x86.

00:46:37.610 --> 00:46:38.600
I want one with the GPU.

00:46:38.610 --> 00:46:39.760
I want one with this GPU.

00:46:39.790 --> 00:46:40.620
I want it in that region.

00:46:41.140 --> 00:46:45.760
That makes it really hard to completely just have a whole bunch, like a fleet of them ready to just hand out.

00:46:46.100 --> 00:46:52.560
So much of what we try to do is play that balance between speed and flexibility and low cost.

00:46:52.700 --> 00:46:55.680
So the more you keep things sitting around, the more you're paying for them.

00:46:55.960 --> 00:47:02.660
I think we've got, if you can get it faster than someone can go get a cup of coffee, that seems to be a much better experience.

00:47:03.180 --> 00:47:03.700
That seems fair.

00:47:03.770 --> 00:47:07.600
And also if you can get it faster than your experiment is going to take to run.

00:47:07.720 --> 00:47:13.860
You would not use Coiled to like back up a web API endpoint that has to get back to a human in response time.

00:47:14.050 --> 00:47:16.420
You should go use Lambda, you should use Modal, you should go use something else.

00:47:16.660 --> 00:47:20.400
You should use Coiled when you're using enough hardware that it would be prohibitively expensive.

00:47:20.730 --> 00:47:25.000
Like you actually don't want to have a pool of a thousand machines sitting around just in case someone wants them.

00:47:25.330 --> 00:47:26.180
That doesn't make sense.

00:47:26.540 --> 00:47:30.560
You should use Coiled when you have these other larger things to do.

00:47:30.860 --> 00:47:33.500
And again, I think that sounded like I got back into pitch mode.

00:47:33.840 --> 00:47:35.740
The point I wanted to talk about here isn't "use Coiled."

00:47:35.960 --> 00:47:38.220
It's the cloud actually has these capabilities.

00:47:38.680 --> 00:47:44.220
You can get a thousand VMs anywhere in the world of any hardware type you like for dollars.

00:47:44.900 --> 00:47:48.060
And that's actually an incredible capability if you can do it right.

00:47:48.380 --> 00:47:51.920
I think today the zeitgeist is go use Kubernetes.

00:47:52.560 --> 00:47:54.000
And we just think that's like, that's dead wrong.

00:47:54.420 --> 00:47:56.420
The answer is just go use raw VMs.

00:47:56.860 --> 00:48:04.780
They're actually pretty good if you do a few things around them, if you figure out software environments, if you figure out how to sort of batch requests, if you figure out logs, if you figure out low cost.

00:48:05.280 --> 00:48:09.380
But like this is, I think, actually the right foundation, I think.

00:48:09.620 --> 00:48:14.300
Like the raw VM is maybe the right foundation for a lot of data work.

00:48:14.640 --> 00:48:23.800
How about just pick a library and infrastructure that's exceptionally tuned to, I need to start them as fast as possible, run one job, and shut it back down.

00:48:24.180 --> 00:48:36.080
And I don't think serverless in the AWS sense is really going to be the way, because serverless gets expensive with too much compute. Just ask Cara, right? That's the $96,000 Vercel bill. Yeah. Serverless,

00:48:36.280 --> 00:48:40.700
Lambda, and similar technologies typically have like a four to five X premium on cost.

00:48:41.300 --> 00:48:45.800
They also have limitations. Like you can't get big machines. You can't get GPUs. Your software

00:48:45.980 --> 00:48:50.280
environments have to be of a certain size. Max timeout, right? So lots of stuff you can't do.

00:48:50.440 --> 00:48:57.840
Yeah. Mostly we see people who like want to run their Polars job, not with 16 cores, but with 64 cores, like Lambda isn't going to cut it.

00:48:58.070 --> 00:49:01.420
And so you just like, you want the full flexibility of the cloud.

00:49:01.780 --> 00:49:02.040
Absolutely.

00:49:02.640 --> 00:49:04.380
What's a typical size?

00:49:04.450 --> 00:49:07.700
I mean, you created a thousand, you guys, and that's super impressive.

00:49:08.070 --> 00:49:08.820
But is that common?

00:49:09.140 --> 00:49:10.360
I mean, we see all sorts of things.

00:49:10.490 --> 00:49:12.420
So this is like, you give people a flexible tool.

00:49:12.550 --> 00:49:14.960
It turns out they use it in so many different ways.

00:49:15.620 --> 00:49:22.700
We have, we actually have like, we were a little bit surprised to find how many people were doing things on one VM.

00:49:23.160 --> 00:49:52.420
We had a whole bunch of users, and this sounds less exciting, but, like, who would just have individual scripts that they needed to run on a cloud VM. And this is the easiest way to do that whenever you need to get a lot of benefits of that, that maybe don't fit the mold. But here I can put a decorator on a function. And this function runs on a big machine in the cloud, right next door to the storage in S3 where the parquet file lives, or something like that, right?

00:49:52.480 --> 00:49:56.760
It has more memory than, I don't have enough memory, but this one has enough memory or whatever, right?

00:49:57.030 --> 00:49:58.140
And that's actually pretty neat.

00:49:58.240 --> 00:50:01.640
And then we have other users who are making multi-thousand node clusters.

00:50:01.970 --> 00:50:04.320
And that's a whole different set of challenges.

00:50:04.930 --> 00:50:08.880
At that scale, you start like actually hitting cloud limits of like capacity.

00:50:09.170 --> 00:50:16.340
And you have to do tricks like basically using multiple, they're called availability zones in AWS, like multiple data centers.

00:50:16.880 --> 00:50:20.720
Can one of these clusters span availability zones or regions even?

00:50:21.100 --> 00:50:24.680
We don't span regions with a single cluster today.

00:50:24.750 --> 00:50:27.100
We haven't found people needing that scale.

00:50:27.730 --> 00:50:29.940
And that opens yet a new set of challenges.

00:50:30.390 --> 00:50:35.140
But yeah, we very commonly do multi-availability zone clusters.

00:50:35.880 --> 00:50:38.620
This is, Matt's pulling up some docs and examples.

00:50:39.020 --> 00:50:50.980
But something that this actually pairs really nicely with is if you want big spot clusters. Spot can be really cheap, but there also isn't as much availability of it because it's so cheap.

00:50:51.400 --> 00:51:10.820
So this works really nicely. I'll say like all of these things, sorry, my life is all about like trade-offs and edge cases. So all of these things have like, you got to be careful not to do it in the wrong way. So with availability zones, one kind of gotcha is you pay for moving data between availability zones.

00:51:11.240 --> 00:51:20.680
A simple read parquet file that happens to go across AZs, you do that a bunch of times, all of a sudden, across a thousand machines, back and forth, like, whoa, there's your surprise.

00:51:20.960 --> 00:51:34.160
Well, it's tricky, actually, because read parquet is fine, because parquet is probably on S3, and S3 crosses all of the availability zones in a region. But if one machine loads data in one availability zone, then transfers it to another machine in another availability zone, then you pay.
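
NOTE
To make the distinction above concrete, here is a rough cost sketch: reading from S3 within the region adds no inter-AZ charge, while machine-to-machine traffic across zones does.
The rate of 0.01 dollars per GB in each direction, the 10 TB data set and the 50 percent cross-zone fraction are all assumptions for illustration, not figures from the episode; check current pricing before relying on them.
```python
def cross_az_cost(gb_moved_between_azs, rate_per_gb_each_way=0.01):
    """Estimate inter-AZ transfer charges, assuming both directions are billed."""
    return gb_moved_between_azs * rate_per_gb_each_way * 2
#
dataset_gb = 10_000  # assumed 10 TB data set
# Reading it from S3 in the same region: no inter-AZ charge in this model.
s3_read_cost = 0.0
# Shuffling it when roughly half the bytes land on a machine in another AZ (assumed).
shuffle_cost = cross_az_cost(dataset_gb * 0.5)
print(f"S3 read: ${s3_read_cost:.2f}, cross-AZ shuffle: ${shuffle_cost:.2f}")
```
The charge scales with bytes moved between machines, so shuffles, sorts and funneling results through one node are the operations to watch.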

00:51:34.700 --> 00:51:39.360
And people don't know that. Things like if you need to shuffle or sort your data set,

00:51:39.380 --> 00:51:45.280
but also just like if you're doing naive things in terms of like pulling back your data all through one machine.

00:51:45.770 --> 00:51:49.280
And so part of what we do is like make it easy to control that.

00:51:49.400 --> 00:51:51.600
It's like, again, one parameter.

00:51:51.730 --> 00:51:54.020
You can say, I do want multiple AZs.

00:51:54.030 --> 00:51:55.500
I don't want multiple AZs.

00:51:55.940 --> 00:52:03.260
But then giving people like you can look at network metrics to see, oh, I think my workload is embarrassingly parallel.

00:52:03.710 --> 00:52:04.680
So this is fine.

00:52:04.980 --> 00:52:12.640
Let's take a look at metrics and see, oh yeah, this workload makes sense to span AZs, get more spot, save more money that way.

00:52:12.760 --> 00:52:16.060
This workload, we want to keep this in a single AZ.

00:52:16.640 --> 00:52:21.420
We're going to let AWS pick it so that they pick the one that's cheapest and has the best availability.

00:52:22.280 --> 00:52:24.460
And that's all stuff that we like automatically do.

00:52:24.600 --> 00:52:25.780
I want to double down on that for a second.

00:52:26.200 --> 00:52:31.040
I think spot is a good example of like, this is one of a hundred things you have to do well.

00:52:31.060 --> 00:52:36.480
I think a lot of what we found is that, like, using the cloud well isn't doing one big thing right.

00:52:36.660 --> 00:52:38.440
It's not like one magical thing that's done.

00:52:39.020 --> 00:52:42.300
There's a long tail of a lot of small things to get right.

00:52:42.460 --> 00:52:43.780
There's a lot of nuance and a lot of polish.

00:52:44.340 --> 00:52:45.100
Spot's a good example.

00:52:45.600 --> 00:52:52.100
So we were running benchmarks internally, and they were costing us some money, like our own Dask benchmarks back in the Dask-focused days.

00:52:52.700 --> 00:52:53.920
And it was like, great, we'll switch to Spot.

00:52:54.180 --> 00:52:54.780
Easy to do.

00:52:55.080 --> 00:52:56.180
And the engineers hated it.

00:52:56.500 --> 00:53:00.200
The Dask engineers hated it because all the benchmarks became very variable.

00:53:00.620 --> 00:53:02.600
Because they asked for 100 machines, but they got 70.

00:53:03.020 --> 00:53:05.220
Or they asked for, or like machines would go away.

00:53:05.920 --> 00:53:07.060
And so we did a few things.

00:53:07.940 --> 00:53:10.880
One is what I said before, the sort of falling back to on-demand.

00:53:11.240 --> 00:53:12.600
If you only have 70 machines, it's okay.

00:53:12.800 --> 00:53:15.400
Give me 70 spot and I'll pay full price for 30.

00:53:15.840 --> 00:53:17.260
That's a feature people really like.
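
NOTE
The spot-with-fallback arithmetic from this exchange (ask for 100 machines, get 70 on spot, pay full price for the other 30) can be sketched like this.
The function name and both hourly prices are made-up placeholders for illustration, not real AWS quotes and not Coiled's implementation.
```python
def blended_hourly_cost(requested, spot_available, spot_price, on_demand_price):
    """Fill the request with spot first, then top up with on-demand machines."""
    n_spot = min(requested, spot_available)
    n_on_demand = requested - n_spot
    cost = n_spot * spot_price + n_on_demand * on_demand_price
    return n_spot, n_on_demand, cost
#
# Assumed, illustrative prices in dollars per machine-hour.
n_spot, n_on_demand, cost = blended_hourly_cost(
    requested=100, spot_available=70, spot_price=0.03, on_demand_price=0.10
)
all_on_demand = 100 * 0.10
print(f"{n_spot} spot + {n_on_demand} on-demand: ${cost:.2f}/hour vs ${all_on_demand:.2f}/hour")
```
The request is always filled in full, which keeps cluster sizes predictable while still taking whatever spot discount is available.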

00:53:17.560 --> 00:53:19.120
Another one is availability zones, right?

00:53:19.360 --> 00:53:28.360
So at any given hour of the day, a different data center in us-east-1 has more or less spot availability or GPU availability.

00:53:29.420 --> 00:53:31.220
And so we will look at all of them.

00:53:31.540 --> 00:53:36.600
We'll say, okay, this is the data center, this is the availability zone that's got the most available right now.

00:53:36.960 --> 00:53:38.860
And so we'll go to that one and we'll pull from there instead.

00:53:39.340 --> 00:53:46.360
There are things like that that are known to cloud experts, but are just like not something your average data person knows to think about.

00:53:46.720 --> 00:53:49.500
And those are the kinds of things that you should abstract away and that we do.

00:53:49.900 --> 00:53:51.360
And there's a hundred similar things.

00:53:51.960 --> 00:53:55.860
And so again, if you want to get GPUs, you got to make sure you're looking across the region.

00:53:56.460 --> 00:54:03.460
But if those GPUs are going to talk to each other, you got to make sure you're looking across the region and then focus on exactly one AZ and not go across.

00:54:03.920 --> 00:54:07.960
So there's a lot of sort of this interesting, sort of again, nuance to doing this stuff well.

00:54:08.080 --> 00:54:12.120
Right, if the machines become chatty with each other, you want them all next to each other.

00:54:12.280 --> 00:54:16.260
It's a thousand X more expensive than, I'm saying a thousand X without hyperbole.

00:54:16.470 --> 00:54:17.980
Like, you can process that

00:54:20.320 --> 00:54:22.760
a thousand times cheaper on a machine than you can transfer it between machines.

00:54:23.239 --> 00:54:27.360
Compute tends to be like a fairly predictable part of the cost.

00:54:27.580 --> 00:54:37.060
It's all of these other things that you, like, you don't even think about. Like, oh, if I flip this setting, now I'm hitting this S3 API a lot.

00:54:37.080 --> 00:54:39.400
And it turns out you pay per API call.

00:54:39.500 --> 00:54:42.680
I didn't know that until that was $1,000.

00:54:43.320 --> 00:54:46.180
There was an XGBoost debug log example.

00:54:46.800 --> 00:54:48.380
Do you want to run through that?

00:54:48.500 --> 00:54:53.140
Logs is another thing that like most of the time that's effectively zero money.

00:54:53.720 --> 00:55:01.240
But then we do things at the scale where like someone was running a thousand node cluster, I think needed to, they had something that wasn't working well.

00:55:01.280 --> 00:55:07.180
So they turned on debug level logging and it gave very chatty logs.

00:55:07.480 --> 00:55:10.520
And I think it was like a $15,000 bill.

00:55:10.900 --> 00:55:18.840
And that's the sort of thing that, like, it's not even in your mind as a possibility until you see that. That had a happy ending.

00:55:19.040 --> 00:55:23.860
Cause we talked to AWS and they ended up eating that cost for the customer, but.

00:55:23.940 --> 00:55:24.600
Also a good lesson.

00:55:24.760 --> 00:55:26.520
If you talk to AWS, they'll give you money back.

00:55:26.740 --> 00:55:26.920
Yeah.

00:55:27.240 --> 00:55:32.240
Just don't keep crossing those boundaries too many times. Yeah. I mean, right, and part of that was, like,

00:55:32.620 --> 00:55:43.120
what controls are you putting in place so this doesn't happen again? And so we, yeah, we do some things to, like, now we warn if we see you have very chatty logs. We say, hey, you might want

00:55:43.120 --> 00:56:08.240
to turn this down. Yeah, absolutely. So when I create a Coiled cluster locally and it's going to go do all the magic that we've been talking about, what is the payment workflow? How is it distributed? Do I have an AWS account that I register my card with AWS, and then Coiled uses that account, and there's some kind of fee to Coiled, but then mostly I pay directly to AWS? Or do I pay you all, and then you all hand it?

00:56:08.260 --> 00:56:09.960
Like, what does this look like?

00:56:09.960 --> 00:56:21.540
We do have a kind of trial thing that will run in our account, but primarily what we provide and what people want is running compute in their cloud account.

00:56:21.930 --> 00:56:34.600
And we do that in part because it's simpler, but we also do it because a lot of people have their own data in their own account or they have security requirements or special networking needs.

00:56:34.820 --> 00:56:39.120
Maybe special arrangements with pricing if they're a big customer, right? Something like that.

00:56:39.280 --> 00:56:46.280
Sure. Yeah. So if they have contracts with AWS, they will just use whatever that discount is.

00:56:46.700 --> 00:58:28.240
The flip side of that is people might have AWS accounts, or we have plenty of people who just see that Coiled uses AWS and they sign up for a new one. They don't know how to go in and set that up. So something that is actually kind of cool that we do that really isn't part of running clusters, but is a necessary thing to do, is make that setup really easy. So part of what we're doing is using some like best practices around how to manage those credentials, how to set up the networking resources in ways that don't have a standing cost, but are secure. There's all sorts of things that like people, if you're a data scientist, you don't want to, you don't have to think about NAT gateways. And we think about that stuff. NAT gateways are like one of the famous ways to spend a lot of money on AWS. So we'll set up the network in a way that is secure, but doesn't use NAT gateways. And we do that automatically. We do that. We have a web UI for that. And we have, I really like it because I worked on this. We have this lovely CLI tool that does that setup for you, has a lot of rich widgets. And so it's, yeah, trying to make that whole thing so that in my mind, it's very important to like make the easy case easy, but make the hard case possible. So if you don't know all of this cloud stuff and you just want us to like give you sensible defaults, we'll do that. If you are a data engineer whose company is giving you a whole bunch of requirements for how the network is configured, we can support that. I think a big design

00:58:29.140 --> 00:58:36.720
consideration we're making, Nat and I actually collaborated a bunch and fought a bunch on the setup process. It's like the setup process is something that we care very deeply about.

00:58:37.280 --> 00:59:00.900
Like we would go to conferences and just sit down with just like run-of-the-mill new Python data developers and say like, cool, can you set up? And we would have them do it. And they had no idea how the cloud account worked, but at some point someone at the company had given them some like AWS credentials file and like all of that will work. Like you can say pip install coiled, coiled setup, and just like press enter a few times and we will set things up for you in a way that is sensible.

00:59:01.300 --> 00:59:06.760
I think we make the cloud accessible to people who don't really know how the cloud works that well.

00:59:07.230 --> 00:59:12.360
So if you're thinking like, oh, I'm a person who knows pandas and NumPy and scikit-learn, and I happen to have this cloud account, you should try out Coiled.

00:59:12.700 --> 00:59:14.600
Coiled is actually designed for you.

00:59:14.820 --> 00:59:16.300
It's not designed for your IT department.

00:59:16.700 --> 00:59:18.380
But as Nat said, Nat talks to lots of IT departments.

00:59:18.770 --> 00:59:19.380
They love us too.

00:59:19.820 --> 00:59:22.680
But the UX around that is especially smooth.

00:59:23.160 --> 00:59:23.760
It looks really great.

00:59:24.100 --> 00:59:26.580
All right, guys, we are pretty much out of time.

00:59:27.020 --> 00:59:37.020
Final thoughts for folks out there as they get to the end of the show? Maybe they want to try Coiled, maybe they want to try their own crack at something like this for their team.

00:59:37.500 --> 00:59:41.080
The standard call to action is go to coiled.io, it's easy to use, have a good time.

00:59:41.410 --> 00:59:51.100
I think more broadly, the thing I want to say is the cloud provides a promise that is great for us, but isn't actually delivered that well.

00:59:51.560 --> 00:59:56.480
And they shouldn't accept or tolerate the kind of shitty data platform.

00:59:57.240 --> 01:00:01.640
This can be a delightful and a very powerful tool for the data space.

01:00:02.340 --> 01:00:07.860
And if it's not, maybe don't use Coiled, but use something and have high expectations.

01:00:07.980 --> 01:00:12.340
We should have a degree of taste and a degree of standards.

01:00:12.580 --> 01:00:13.500
And we can meet that standard.

01:00:13.840 --> 01:00:15.740
I think there's actually a lot that we can do here.

01:00:15.840 --> 01:00:17.960
There's a lot of potential that's really exciting.

01:00:18.700 --> 01:00:19.720
Oh, that sounds right to me.

01:00:19.800 --> 01:00:28.760
I'll report, like, I think this message of, like, it is okay to be unhappy and things are supposed to be delightful is important to us.

01:00:29.120 --> 01:00:35.040
I spend a lot of time being unhappy, hopefully, so that other people will be able to have delightful experiences.

01:00:35.500 --> 01:00:36.220
Yeah, absolutely.

01:00:36.840 --> 01:00:36.980
Absolutely.

01:00:37.400 --> 01:00:38.240
It should be delightful.

01:00:38.360 --> 01:00:43.160
It sounds delightful when it's created and it's gotten really complex, but it doesn't have to be.

01:00:43.320 --> 01:00:48.560
I guess maybe a lesson I learned from this is use tools optimized for data science workloads.

01:00:48.940 --> 01:00:53.760
Don't use tools optimized for long running web apps and other things like that you hear about all

01:00:53.820 --> 01:00:57.340
the time, but they're not for you necessarily. Yeah. And it can be a delightful experience.

01:00:57.400 --> 01:01:05.980
I like the term, like we should all be playing and come by and play. Or again, if you don't use Coiled, that's fine. But like, there are other ways to do things. Go play. All right. Well, we will

01:01:06.380 --> 01:01:15.440
call it a wrap on the show and people can go play. Guys, thanks for being on the show. It's been really interesting. Cool. And congrats on such a cool company, but also service. This is

01:01:15.540 --> 01:01:17.780
really neat. Thanks, Michael. Thanks for having us. Yeah, you bet. Bye.

01:01:18.700 --> 01:01:21.400
This has been another episode of Talk Python To Me.

01:01:22.210 --> 01:01:23.160
Thank you to our sponsors.

01:01:23.630 --> 01:01:24.860
Be sure to check out what they're offering.

01:01:24.970 --> 01:01:26.300
It really helps support the show.

01:01:27.160 --> 01:01:28.760
This episode is brought to you by Sentry.

01:01:29.180 --> 01:01:30.540
Don't let those errors go unnoticed.

01:01:30.820 --> 01:01:32.320
Use Sentry like we do here at Talk Python.

01:01:32.860 --> 01:01:35.780
Sign up at talkpython.fm/sentry.

01:01:36.460 --> 01:01:37.340
Want to level up your Python?

01:01:37.780 --> 01:01:41.400
We have one of the largest catalogs of Python video courses over at Talk Python.

01:01:41.920 --> 01:01:46.600
Our content ranges from true beginners to deeply advanced topics like memory and async.

01:01:46.940 --> 01:01:49.220
And best of all, there's not a subscription in sight.

01:01:49.630 --> 01:01:52.140
Check it out for yourself at training.talkpython.fm.

01:01:52.840 --> 01:01:57.020
Be sure to subscribe to the show, open your favorite podcast app, and search for Python.

01:01:57.480 --> 01:01:58.340
We should be right at the top.

01:01:58.840 --> 01:02:07.720
You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

01:02:08.360 --> 01:02:10.600
We're live streaming most of our recordings these days.

01:02:11.000 --> 01:02:18.460
If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube.

01:02:19.460 --> 01:02:20.580
This is your host, Michael Kennedy.

01:02:21.020 --> 01:02:21.860
Thanks so much for listening.

01:02:22.020 --> 01:02:23.000
I really appreciate it.

01:02:23.340 --> 01:02:24.940
Now get out there and write some Python code.


