#442: Ultra High Speed Message Parsing with msgspec Transcript
00:00 If you're a fan of Pydantic or data classes, you'll definitely be interested in this episode.
00:04 We are talking about a super fast data modeling and validation framework called msgspec.
00:09 And some of the types in here might even be better for general purpose use than Python's
00:15 native classes.
00:15 Join me and Jim Crist-Harif to talk about his framework, msgspec.
00:19 This is Talk Python to Me, episode 442, recorded November 2nd, 2023.
00:25 Welcome to Talk Python to Me, a weekly podcast on Python.
00:43 This is your host, Michael Kennedy.
00:44 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython,
00:49 both on fosstodon.org.
00:52 Keep up with the show and listen to over seven years of past episodes at talkpython.fm.
00:57 We've started streaming most of our episodes live on YouTube.
01:01 Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming
01:06 shows and be part of that episode.
01:08 This episode is sponsored by Posit Connect from the makers of Shiny.
01:14 Publish, share, and deploy all of your data projects that you're creating using Python.
01:19 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Reports, Dashboards, and APIs.
01:24 Posit Connect supports all of them.
01:27 Try Posit Connect for free by going to talkpython.fm/posit, P-O-S-I-T.
01:32 And it's brought to you by us over at Talk Python Training.
01:36 Did you know that we have over 250 hours of Python courses?
01:41 Yeah, that's right.
01:42 Check them out at talkpython.fm/courses.
01:45 Jim.
01:48 Hello.
01:48 Hello.
01:48 Welcome to Talk Python.
01:49 It's awesome to have you here.
01:50 Yeah.
01:51 Thanks for having me.
01:51 Yeah, of course.
01:52 I spoke to the Litestar guys, you know, at Litestar.dev and had them on the show.
01:57 And I was talking about their DTOs, different types of objects they can pass around in their
02:03 APIs and their web apps.
02:04 And like FastAPI, they've got this concept where you kind of bind a type, like a class
02:09 or something, to an input, to a web API.
02:12 And it does all that sort of magic like FastAPI.
02:15 And I said, oh, so you guys probably work with Pydantic.
02:17 It's like, yes, but let me tell you about msgspec.
02:19 Because that's where the action is.
02:21 They were so enamored with your project that I just had to reach out and have you on.
02:25 It looks super cool.
02:26 I think people are going to really enjoy learning about it.
02:28 Thanks.
02:29 Yeah, it's nice to hear that.
02:30 Yeah.
02:31 We're going to dive into the details.
02:33 It's going to be a lot of fun.
02:33 Before we get to them, though, give us just a quick introduction on who you are.
02:37 So people don't know you yet.
02:39 So my name is Jim Crist-Harif.
02:41 I am currently an engineering manager, doing actually mostly dev work, at Voltron Data, working
02:47 on the Ibis project, which is a completely different conversation than what we're going to have today.
02:51 Prior to that, I popped around a couple of startups, and at most of them Dask was the main
02:58 thing I contributed to in the past on the open source Python front.
03:01 For those not aware, Dask is a distributed compute ecosystem.
03:04 I come from the PyData side of the Python ecosystem, not the web dev side.
03:09 Nice.
03:09 Yeah, I've had Matthew Rocklin on a couple of times, but it's been a while, so people don't
03:13 necessarily know.
03:13 But it's like super distributed pandas, kind of.
03:18 Grid computing for pandas, sort of.
03:20 Or say like Spark written in Python.
03:22 Sure.
03:23 You know, another thing that's been on, kind of on my radar, but I didn't really necessarily
03:27 realize it was associated with you.
03:29 Tell people just a bit about Ibis.
03:31 Like Ibis is looking pretty interesting.
03:33 Ibis is, I don't want to say the wrong thing.
03:35 Ibis is a portable data frame library; that's the current tagline we're using.
03:39 If you're coming from R, it's dplyr for Python.
03:42 It's more than that.
03:43 And it's not exactly that, but that's a quick mental model.
03:46 So you write data frame like code.
03:49 We're not pandas compatible.
03:50 We're pandas like enough that you might find something familiar.
03:53 And it can compile down to generate SQL for 18 plus different database backends.
03:58 Also like PySpark and a couple other things.
04:01 Okay.
04:01 So you write your code once and you kind of run it on whatever.
04:03 I see.
04:03 And you do pandas like things, but it converts those into database queries.
04:08 Is that?
04:08 Yeah.
04:09 Yeah.
04:09 So it's a data frame API.
04:10 It's not pandas compatible, but if you're familiar with pandas, you should be able to pick it up.
04:16 You know, we cleaned up what we thought was a bunch of rough edges of the pandas API.
04:19 Yeah.
04:19 Were those pandas one or pandas two rough edges?
04:22 Both.
04:22 It's, I don't know.
04:23 It's pandas like.
04:25 Sure.
04:25 Yeah.
04:26 This looks really cool.
04:27 That's a topic for another day, but awesome.
04:29 People can check that out.
04:30 But this time you're here to talk about your personal project, msgspec.
04:37 Am I saying that right?
04:38 Or you say MSG or message spec?
04:40 Message spec is how it is.
04:42 I think a lot of these projects sometimes need a little, like here's the MP3 you can press play on.
04:48 Like how it's meant to be said, you know?
04:50 I mean, sometimes it's kind of obvious, like PyPI versus PyPy.
04:55 Other times it's just like, okay, I know you have a really clever name.
04:58 Yes, I know.
04:59 People say NumPy all the time.
05:01 I'm like, I don't want to, I try to not correct guests because it's, it's not kind.
05:05 But I also feel awkward.
05:06 They will say NumPy and I'll say, how do you feel about NumPy?
05:08 Like NumPy's great.
05:09 I'm like, okay, we're just going back and forth like this for the next hour.
05:12 It's fine.
05:12 But yeah, it's, it's always, I think some of these could use a little, like a little play.
05:17 So msgspec.
05:18 Tell people about what it is.
05:19 Yeah.
05:20 So gone through a couple of different taglines.
05:22 The current one is a fast serialization and validation library with built-in support for
05:26 JSON, MessagePack, YAML, and TOML.
05:29 If you are familiar with Pydantic, that's probably one of the closest, you know, most popular libraries
05:34 that does a similar thing.
05:35 You define kind of a structure of your data using type annotations and msgspec will parse
05:40 your data to ensure it is that structure and does so efficiently.
05:44 It's also compatible with a lot of the other serialization libraries.
05:48 You could also use it as a stand-in for JSON, you know, with the JSON dumps, JSON loads.
05:53 You don't need to specify the types.
05:55 Right.
05:55 It's, I think the mental model of kind of like, it swims in the same water or the same pond
06:01 as Pydantic, but it's also fairly distinguished from Pydantic, right?
06:06 As we're going to explore throughout our chat here.
06:09 The goal from my side, one of the goals was to replicate more of the experience writing
06:14 Rust or Go with Rust-SERDE or Go's JSON, where the serializer kind of stands in the background
06:19 rather than my experience working with Pydantic, where it felt like the base model kind of stood
06:24 in the foreground.
06:25 You're defining the model.
06:27 Serialization kind of comes onto the types you've defined, but you're not actually working
06:30 with the serializers on the types themselves.
06:32 Got it.
06:32 So an example, let me see if I do have it.
06:35 An example might be if I want to take some message I got from, some response I got from
06:41 an API, I want to turn it into a Pydantic model or I'm writing an API.
06:44 I want to take something from a client, whatever.
06:46 I'll go and create a Pydantic class.
06:48 And then the way I use it is I go to that class and I'll say star, star, dictionary I got.
06:55 And then it comes to life like that, right?
06:57 Where there's a little more focus on just the serialization and it has this capability.
07:03 But like you said, it's optional in the sense.
07:06 Yeah.
07:06 In message spec, all types are on equal footing.
07:11 So we use functions, not methods, because if you want to decode into a list of ints, I
07:17 can't add a method to a list.
07:18 You know, it's a Python built-in type.
07:20 Yeah.
07:20 So you'd say msgspec dot JSON dot decode your message and then you'd specify the type
07:26 annotation as part of that function call.
07:29 So it could be, you know, list bracket int.
07:31 Right.
07:31 So you'll say decode and then you might say type equals list of your type or like you
07:37 say, list of int.
07:37 And that's hard when you have to have a class that knows how to basically become what the
07:42 model, the data passed in is, even if it's just a list, some Pydantic classes, you got
07:48 to kind of jump through some hoops to say, hey, Pydantic, I don't have a thing to give
07:52 you.
07:52 I want a list of those things.
07:54 And that's the top level thing is, you know, bracket bracket.
07:56 It's not, it's not any one thing I can specify in Python easily.
08:00 Yeah.
08:00 To be fair to the Pydantic project, I believe in V2, the type adapter.
08:04 Yes, exactly.
08:05 Object can work with that.
08:06 But that is, you know, it's a different way of working with it.
08:09 I wanted to have one API that did it all.
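For comparison, the Pydantic v2 `TypeAdapter` mentioned here lets you validate into an arbitrary type without defining a model; a rough sketch (assuming Pydantic v2 is installed):

```python
from pydantic import TypeAdapter

# Wrap a plain built-in type so it can validate JSON like a model would
adapter = TypeAdapter(list[int])
nums = adapter.validate_json("[1, 2, 3]")
print(nums)  # [1, 2, 3]
```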
08:12 Sure.
08:12 And it's awesome.
08:13 They made it.
08:13 I mean, I want to just put this out front.
08:15 Like, I'm a massive fan of Pydantic.
08:17 What Samuel's done there is incredible.
08:19 And it's just, it's really made a big difference in the way that people work with data and Python.
08:24 It's awesome.
08:25 But it's also awesome that you have this project that is an alternative and it makes different assumptions.
08:29 And you can see those really play out in like the performance or the APIs.
08:33 So, you know, like Pydantic encourages you to take your classes and then send them the data.
08:39 But you've kind of got to know, like, oh, there's this type adapter thing that I can give a list of my class and then make it work.
08:45 Right.
08:45 But it's not just, oh, you just fall into that by trying to play with the API, you know?
08:50 Yeah.
08:51 Yeah.
08:51 And I think having, being able to specify any type means we work with standard library data classes.
08:56 The same as we work with our built-in struct type or we also work with adders types.
09:00 You know, everything is kind of on equal footing.
09:02 Yeah.
09:02 And what I want to really dig into is your custom struct type that has some really cool properties.
09:10 Not class properties, but components.
09:11 Features of the class of the type there.
09:14 Yeah.
09:14 Let's look at a couple of things here.
09:16 So, as you said, it's fast, and I love how somehow the italics on the word fast make it feel even faster.
09:23 Like it's leaning forward, you know, it's leaning into the speed.
09:26 A fast serialization and validation library.
09:28 The validation is kind of can be, but not required, right?
09:32 The types can be, but they don't have to be.
09:34 So, I think that's one of the ways it really differs from Pydantic.
09:37 But the other is Pydantic is quite focused on JSON, whereas this is JSON, message pack, YAML, and TOML.
09:45 Everyone knows what JSON is.
09:47 I always thought of TOML as kind of like YAML.
09:49 Are they really different?
09:51 It's another configuration focused language.
09:54 I think some people do JSON for config files, but I personally don't like to handwrite JSON.
09:59 YAML and TOML are like more human friendly, in quotes, forms of that.
10:04 YAML is a superset of JSON.
10:06 TOML is its own thing.
10:08 Got it.
10:08 And then message pack is a binary JSON-like file format.
10:12 Yeah, message pack.
10:13 I don't know how many people work with that.
10:15 Where would people run into message pack?
10:16 Yeah.
10:17 If they were, say, consuming an API, or what API framework would people be generating message
10:22 pack in Python, typically?
10:23 That's a good question.
10:24 So, going back to the creation of this project, actually, msgspec sounds a lot like message
10:29 pack.
10:30 And that was intentional.
10:30 It does, yeah.
10:31 Because that's what I wrote it for originally.
10:33 So, as I said at the beginning, I'm one of the original contributors to Dask.
10:37 Worked on Dask forever.
10:38 And the Dask distributed scheduler uses message pack for its RPC serialization layer.
10:43 That kind of fell out of what was available at the time.
10:46 We benchmarked a bunch of different libraries.
10:48 And that was the fastest way to send bytes between nodes in 2015.
10:52 Sure.
10:53 The distributed scheduler's RPC framework has kind of grown haphazardly over time.
10:58 And there were a bunch of bugs due to some hacky things we were doing with it.
11:01 And also, it was slower than we would have wanted.
11:03 So, this was an attempt to write a faster message pack library for Python that also did
11:09 fancier things.
11:10 Supported more types.
11:11 Did some schema validation because we wanted to catch the worker is sending this data and
11:16 the scheduler is getting it and saying it's wrong.
11:18 And we wanted to also add in a way to make schema evolution, meaning that I can have different
11:25 versions of my worker and scheduler and client process and things kind of work.
11:28 If I add new features to the scheduler, they don't break the client.
11:33 You know, we have a nice forward and backward compatibility story.
11:36 And so, that's what kind of fell out.
11:38 Yeah, it's a really nice feature.
11:39 We're going to dive into that.
11:40 But, you know, you might think, oh, well, just update your client or update the server.
11:45 But there's all sorts of situations that get really weird.
11:47 Like, if you have Redis as a caching layer and you create a message pack object and stick
11:54 it in there and then you deploy a new version of the app, maybe it can't
11:58 deserialize anything in the cache anymore because it says something's missing or something's
12:02 there that it doesn't expect.
12:04 Right.
12:04 And so, this evolution is important there.
12:07 If you've got long running work and you stash it into a database and you pull it back out,
12:10 like all these things where it kind of lives a little outside the process, all of a sudden
12:14 it starts to matter that before you even consider like clients that run separate code.
12:19 Right.
12:19 Like you could be the client, just different places in time.
12:22 Yeah.
12:22 Yeah.
12:23 So, adding a little bit more structure to how you define messages in a way to make the
12:26 scheduler more maintainable.
12:27 That work never landed.
12:28 That's as it is with open source projects.
12:31 It's a do-ocracy and also a democracy.
12:33 And, you know, you don't always get your way; paths can end in dead ends.
12:36 I still think it'll be valuable in the future, but some stuff was changing the scheduler and
12:40 serialization is no longer the bottleneck that it was two and a half years ago when this originally
12:45 started.
12:46 So, let me put this in context for people to maybe make it relevant.
12:49 Like maybe right now someone's got a FastAPI API and they're using Pydantic and obviously
12:55 it generates all the awesome JSON they want.
12:58 Is there a way to, how would you go about creating, say, a Python server-based system set
13:06 of APIs that maybe as an option take message pack or maybe use that as a primary way?
13:11 Like it could be maybe, you know, passing an accept header.
13:14 To take message pack?
13:15 If you want to exchange message pack, client server, Python right now, what do you do?
13:19 That's a good question.
13:20 To be clear, I am not a web dev.
13:22 I do not do this for a living.
13:23 I think there is no standard application/msgpack content type.
13:26 I think people can use it if they want, but it's not a standardized thing the
13:31 same way that JSON is.
13:32 Yeah.
13:32 I think that Litestar as a framework does support this out of the box.
13:35 I don't know about FastAPI.
13:37 I'm sure there's a way to hack it in as there is with any ASGI server.
13:41 Yeah, Litestar, like I said, I had the Litestar guys on maybe a month ago and...
13:45 Yeah, super, super cool about that.
13:47 So, yeah, I know that they support msgspec and a lot of different options there, but,
13:52 you know, you could just, I imagine you could just return binary bits between you and your
13:57 client.
13:58 I'm thinking of like latency sensitive microservice type things sort of within your data center.
14:04 How can you lower serialization, deserialization, all that cost that could
14:08 be, you know, the biggest part of what's making your app spend time and energy?
14:14 Michael out there says, would love PyArrow parquet support for large data.
14:18 There's been a request for Arrow integration with msgspec.
14:22 I'm not exactly sure what that would look like.
14:24 Arrow containers are pretty efficient on their own.
14:26 Breaking them out into a bunch of objects or stuff to work with msgspec doesn't necessarily
14:31 make sense in my mind.
14:32 But anyway, if you have ideas on that, please open an issue or comment on the existing issue.
14:36 Yeah, indeed.
14:37 All right.
14:38 So let's see.
14:39 Some of the highlights are high performance encoders and decoders across those protocols
14:43 we talked.
14:44 You have benchmarks.
14:45 We'll look at them in a minute.
14:46 You have a really nice, a lot of support for different types that can go in there that
14:51 can be serialized, but there's also a way to extend it to say, I've got a custom type that
14:56 you don't think is serializable to whatever end thing, a message pack, JSON, whatever.
15:01 But I can write a little code that'll take it either way.
15:04 You know, dates are something that drive me crazy, but it could be like an object ID out
15:09 of MongoDB or other things that seem like they should go back and forth, but don't, you know,
15:13 right?
15:13 So that's really nice.
15:14 And then zero-cost schema validation, right?
15:18 It decodes and validates JSON two times as fast as orjson, which is one of
15:22 the high-performance JSON decoders.
15:24 And that's orjson just decoding, right?
15:26 And then the struct thing that we're going to talk about, which the struct type is kind
15:30 of what brings the parity with Pydantic, right?
15:33 Yeah.
15:34 You could think of it as Pydantic's base model.
15:36 It's our built-in data class-like type.
15:38 Nice.
15:38 So structs are data class-like.
15:40 Like everything in msgspec, they're implemented fully as a C extension.
15:45 Getting these to work required reading a lot of the CPython source code because we're doing
15:50 some things that I don't want to say that they don't want you to do.
15:54 We're not doing them wrong, but they're not really documented.
15:57 So for example, when you subclass for msgspec.struct, that's using a metaclass mechanism,
16:03 which is a way of defining types to define types.
16:06 And the metaclass is written in C, which CPython doesn't make easy to do.
16:11 So it's a C extension metaclass that creates new C types.
16:16 They're pretty speedy.
16:17 They are 10 to 100x faster for most operations than even handwriting a class that does the
16:23 same thing, but definitely more than data classes or adders.
16:25 Yeah.
16:26 It's super interesting.
16:27 And I really want to dive into that.
16:28 I almost can see the struct type being relevant even outside of msgspec in general, potentially.
16:34 So yeah, we'll see about that.
16:36 But it's super cool.
16:37 And Michael also points out, like, he's the one who made the issue.
16:40 So sorry about that.
16:42 He's commented already, I suppose, in a sense.
16:45 But yeah, awesome.
16:46 Cool.
16:47 All right.
16:47 So let's do this.
16:49 I think probably the best way to get started is we could talk through an example.
16:53 And there's a really nice article by Itamar Turner-Trauring, who's been on the show a couple
16:59 of times, called Faster, More Memory-Efficient Python JSON Parsing with msgspec.
17:04 And just as a couple of examples that I thought maybe we could throw up.
17:06 And you could talk to, speak to your thoughts, like, why does the API work this way?
17:11 Here's the advantages and so on.
17:13 Yeah.
17:13 So there's this big, I believe this is the GitHub API, just returning these giant blobs of stuff
17:17 about users.
17:18 Okay.
17:19 And it says, well, if we want to find out what users follow what repos or how many,
17:24 given a user, how many repos do they follow, right?
17:27 We could just say, with open, read this, and then just do a JSON load.
17:31 And then do the standard dictionary stuff, right?
17:34 Like, for everything, we're going to go to the element that we got out and say, bracket some
17:38 key, bracket some key.
17:40 You know, it looks like key not found errors are just lurking in here all over the place.
17:44 But, you know, it's, you should know that maybe it'll work, right?
17:48 If you know the API, I guess.
17:49 So it's like, this is the standard way.
17:51 How much memory does this use?
17:52 How much time does it take?
17:53 Look, we can basically swap out orjson.
17:57 I'm not super familiar with orjson.
17:59 Are you?
17:59 Yeah.
18:00 orjson is compatible-ish with the standard lib JSON, except that it returns bytes rather
18:05 than strings.
18:06 Got it.
18:06 Okay.
18:07 There's also ijson, I believe, which makes it streaming.
18:10 So there's that.
18:11 And then it says, okay, well, how would this look if we're going to use message spec?
18:15 And in his example, he's using structured data.
18:19 So the structs, this is like the Pydantic version, but it doesn't have to be this way,
18:23 but it is this way, right?
18:25 This is the one he chose.
18:26 So maybe just talk us through, like, how would you solve this problem using message
18:30 spec and classes?
18:31 Yeah.
18:31 So as he's done here in this blog post, he's defined a couple struct types for the various
18:37 levels of this message.
18:38 So repos, actors, and interactions, and then parses the message directly into those types.
18:45 So the final call there is passing in the red message and then specifying the type as a list
18:51 of interactions, which are tree down into actors and repos.
18:54 Exactly.
18:55 So this is what you mentioned earlier about having more function-based.
18:58 So you just say decode, give it the string or the bytes, and you say type equals list of
19:04 bracket, top-level class.
19:06 And just like Pydantic, these can be nested.
19:09 So there's an interaction, which has an actor.
19:10 There's an actor class, which has a login, which has a type.
19:13 So your Pydantic model for how those kind of fit together is pretty straightforward, right?
19:19 Pretty similar.
19:19 Yeah.
19:20 And then you're just programming with classes.
19:21 Awesome.
19:22 Yep.
19:22 And it'll all work well with, like, mypy or PyRite or whatever you're using if you're doing
19:26 static analysis tools.
19:27 Yeah.
19:27 So you've thought about making sure that not just does it work well from a usability perspective,
19:32 but it, like, the type checkers don't go crazy.
19:35 Yeah.
19:35 And any, you know, editor integration you have should just work.
19:38 Nice.
19:39 Because there's sometimes, oh gosh, I think maybe FastAPIs changes, but you'll have things
19:45 like you would say the type of an argument being passed in, if it's, say, coming off the
19:50 query string, you would say it's depend.
19:52 It's a type depends, not an int, for example.
19:56 It's because it's being pulled out of the query string.
19:58 I think that's FastAPI.
19:59 And while it makes the runtime happy and the runtime says, oh, I see you want to get this
20:04 int from the query string, the type checkers and stuff are like, depends.
20:09 What is this?
20:09 Like, this is an int.
20:10 Why are you trying to use this depends as an int?
20:12 This doesn't make any sense.
20:13 I think it's a bit of a challenge to have the runtime, the types drive the runtime, but
20:17 still not freak it out, you know?
20:19 Yeah.
20:19 I think that the Python typing ecosystem, especially with the recent changes in new versions and
20:25 the annotated wrapper, are moving towards a system where these kinds of APIs can be spelled
20:30 natively in ways that the type checkers will understand.
20:33 But if your project that existed before these changes, you obviously had some preexisting
20:38 way to make those work that might not play as nicely.
20:41 So there's the upgrade cost of the project.
20:43 I'm not envious of the work that Samuel Colvin and team have had to do to upgrade Pydantic to erase
20:49 some old warts in the API that they found.
20:50 It's nice to see what they've done and it's impressive.
20:52 But I have the benefit of starting this project after those changes in the typing
20:56 ecosystem existed.
20:57 You know, I can look in hindsight at mistakes others have made and learn from them.
21:01 Yeah, that's really excellent.
21:01 They have done, like I said, I'm a big fan of Pydantic and it took them almost a year.
21:06 I interviewed Samuel about that change and it was no joke.
21:09 You know, it was a lot of work.
21:10 But, you know, what they came up with, pretty compatible, pretty much feels like the same
21:15 Pydantic.
21:15 But, you know, if you peel back the covers, it's definitely not.
21:18 All right.
21:19 So the other interesting thing about Itamar's article here is the performance side.
21:23 So it's okay.
21:23 Do you get fixed memory usage or does it vary based on the size of the data?
21:28 And do you get schema validation?
21:29 Right.
21:30 So standard lib, just straight JSON module, 420 milliseconds.
21:39 orjson, the fast one, a little less than twice as fast, 280 milliseconds.
21:42 ijson, for iterable JSON,
21:45 300, so a little more than the fast one.
21:47 msgspec, 90 milliseconds.
21:47 That's awesome.
21:48 That's like three times as fast as the better one.
21:51 Over four times as fast as the built-in one.
21:54 It also is doing, you know, quote unquote, more work.
21:56 It's validating the response as it comes in.
21:58 Exactly.
21:59 So you're sure that it's correct then too.
22:01 All those other ones are just giving you dictionaries and YOLO.
22:04 Do what you want with them, right?
22:06 But here you're actually, all those types that you described, right?
22:09 The interaction and the actors and the repos and the class structure, that's all validation.
22:13 So in on top of that, you've created classes which are heavier weight than dictionaries because
22:18 general classes are heavier weight than dictionaries because they have the dunder dict that has
22:23 all the fields in there effectively anyway, right?
22:26 That's not true for structs.
22:28 Structs are slot classes.
22:29 Yes.
22:30 Structs.
22:30 They are a lighter weight to allocate than a dictionary or a standard class.
22:34 That's one of the reasons they're faster.
22:35 Yeah.
22:35 Structs are awesome.
22:36 And so the other thing I was going to point out is, you know, you've got 40 megabytes
22:39 of memory usage versus 130.
22:41 So almost four times less than the standard module.
22:45 And the only thing that beats you is the iterative one because it literally only has one in memory
22:50 at a time, right?
22:50 One element.
22:51 Yeah.
22:52 So this benchmark is kind of hiding two things together.
22:56 So there is the output, what you're parsing.
23:03 Everything here except for ijson is going to parse the full input into something.
23:03 One big batch.
23:04 It's more efficient than orjson or the standard lib in this respect because we're only extracting
23:08 the fields we care about.
23:09 But you're still going to end up with a list of a bunch of objects.
23:11 ijson is only going to pull one in memory at a time.
23:14 So it's going to have less in memory there.
23:16 And then you have the memory usage of the parsers themselves, which can also vary.
23:21 So orjson's memory usage in its parser is a lot higher than msgspec's, regardless
23:26 of the output size.
23:27 There's a little more internal state.
23:29 So this is a pretty interesting distinction that you're calling out here.
23:33 So for example, if people check out this article, which I'll link, there's like tons of stuff
23:39 that people don't care about in the JSON, like the avatar URL, the gravatar ID, you know,
23:45 the reference type, whether it's a brand, like this stuff that you just don't care about,
23:49 right?
23:49 But to parse it in, you got to read that.
23:51 But what's pretty cool you're saying is like, in this case, the class that Itamar came up with
23:56 is just Repo deriving from Struct.
23:59 It just has a name.
23:59 There's a bunch of other stuff in there, but you don't care about it.
24:02 And so what you're saying is like, if you say that that's the decoder, it looks at that
24:06 and goes, there's a bunch of stuff here.
24:07 We're not loading that.
24:09 We're just going to look for the things you've explicitly asked us to model, right?
24:13 There's no sense in doing the work if you're never going to look at it.
24:16 A lot of different serialization frameworks.
24:18 I can't remember how Pydantic responds when you do this, but, you know, the comments beyond
24:23 Pydantic, so it doesn't really matter, is they'll freak out to say, oh, there's extra
24:27 stuff here.
24:28 What am I supposed, you know, for example, this repo, it just has name, but in the data model,
24:32 it has way more in the JSON data.
24:35 So you try to deserialize it.
24:37 It'll go, well, I don't have room to put all this other stuff.
24:39 Things are, you know, freak out.
24:40 And this one is just like, no, we're just going to filter down to what you asked for.
24:43 I really, it's nice in a couple of ways.
24:45 It's nice from performance, nice from clean code.
24:47 I don't have to put all those other fields I don't care about, but also from, you talked
24:51 about the evolution friendliness, right?
24:53 Because what's way more common is that things get added rather than taken away or changed.
24:59 It's like, well, the complexity grows.
25:01 Now repos also have this, you know, related repos or sub repos or whatever the heck they
25:06 have, right?
25:06 And this model here will just let you go, whatever, don't care.
25:10 Yeah.
25:10 If GitHub updates their API and adds new fields, you're not going to get an error.
25:14 And if they remove a field, you should get a nice error that says expected, you know,
25:19 field name, and now it's missing.
25:20 You can track that down a lot easier than a random key error.
25:24 I agree.
25:24 I think, okay, let's, let's dive into the struct a little bit because that's where we're
25:28 kind of on that now.
25:29 And I think this is one of the highlights of what you built.
25:32 Again, it's kind of the same mental model as people are familiar with some data classes
25:36 with Pydantic and adders and so on.
25:39 So when I saw your numbers, I won't come back and talk about benchmarks with numbers.
25:43 But I just thought like, wow, this is fast.
25:45 And while the memory usage is low, you must be doing something native.
25:49 You must be doing something crazy in here.
25:51 That's not just Dunder slots.
25:53 While Dunder slots is awesome.
25:55 There's more to it than that, right?
25:57 And so they're written in C, quite speedy and lightweight.
26:01 So measurably faster than data classes, adders and Pydantic.
26:04 Like, tell us about these classes.
26:06 Like, this is, this is pretty interesting.
26:07 As mentioned earlier,
26:08 they're not exactly, but they're basically slots classes.
26:11 So Python's data model, actually CPython's data model: either a class is a standard class
26:16 where it stores its attributes in a dict.
26:19 That's not exactly true.
26:20 There have been some optimizations where the keys are stored separately alongside the class structure
26:25 and only the values are stored on the object instances.
26:27 But in the end, there's dict classes and there's slots classes, where you pre-declare your attributes
26:32 to be in this dunder slots variable.
26:35 And those get stored inline in the same allocation as the object instance.
26:40 There's no pointer chasing.
26:41 What that means is that you can't set extra attributes on them that weren't pre-declared,
26:45 but also things are a little bit more efficient.
26:48 We create those automatically when you subclass from a struct type.
26:51 And we do a bunch of other interesting things that are stored on the type.
26:55 That is why we had to write a metaclass in C.
26:58 I went to read it.
26:59 I'm like, whoa, okay.
27:00 Maybe we'll come back to this.
27:01 There's a lot of stuff going on in that type.
27:03 One of the problems with this hobby project is that I wrote this for fun
27:07 and a little bit of work related, but mostly fun.
27:09 And it's not the easiest code base for others to step into.
27:12 It fits my mental model, not necessarily everyone's.
27:15 Yeah.
27:15 I can tell you weren't looking for VC funding because you didn't write it in Rust.
27:18 Seems to be the common denominator these days.
27:22 Yeah.
27:22 Why C?
27:23 Just because CPython's already in C?
27:25 And that's the...
27:26 And I use C.
27:27 I do know Rust, but for what I wanted to do in the use case I had in mind,
27:31 I wanted to be able to touch the C API directly.
27:34 And that felt like the easiest way to go about doing it.
27:39 This portion of Talk Python to Me is brought to you by Posit, the makers of Shiny, formerly RStudio, and especially Shiny for Python.
27:47 Let me ask you a question.
27:49 Are you building awesome things?
27:51 Of course you are.
27:52 You're a developer or a data scientist.
27:53 That's what we do.
27:54 And you should check out Posit Connect.
27:56 Posit Connect is a way for you to publish, share, and deploy all the data products that you're building using Python.
28:04 People ask me the same question all the time.
28:07 Michael, I have some cool data science project or notebook that I built.
28:10 How do I share it with my users, stakeholders, teammates?
28:13 Do I need to learn FastAPI or Flask or maybe Vue or React.js?
28:18 Hold on now.
28:19 Those are cool technologies, and I'm sure you'd benefit from them.
28:22 But maybe stay focused on the data project?
28:24 Let Posit Connect handle that side of things.
28:27 With Posit Connect, you can rapidly and securely deploy the things you build in Python.
28:31 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards, and APIs.
28:38 Posit Connect supports all of them.
28:40 And Posit Connect comes with all the bells and whistles to satisfy IT and other enterprise requirements.
28:46 Make deployment the easiest step in your workflow with Posit Connect.
28:50 For a limited time, you can try Posit Connect for free for three months by going to talkpython.fm/posit.
28:57 That's talkpython.fm/posit.
28:59 The link is in your podcast player show notes.
29:02 Thank you to the team at Posit for supporting Talk Python.
29:05 Okay, so from a consumer of this struct class, I just say class.
29:11 And your example is user.
29:12 You say class, user, parentheses, derived from struct.
29:14 Then the field, colon, type.
29:16 So like name, colon, string, groups, colon, set of, str, and so on.
29:20 It looks like standard data classes type of stuff.
29:23 But what you're saying is your meta class goes through and looks at that and says, okay,
29:27 we're going to create a class called user, but it's going to have slots called name, email,
29:31 and groups, among other things, right?
29:33 Like does that magic for us?
29:35 Yeah.
29:35 And then it sets up a bunch of internal data structures that are stored on the type.
29:39 Okay.
29:40 Give me a sense of like, like, what's something, why do you got to put that in there?
29:43 What's in there?
29:44 So the way data classes work, after they do all the type parsing stuff, which we have
29:48 to do too, they then generate some code and eval it to generate each of the methods.
29:54 So when you're importing or when you define a new data class, it generates an init method
29:59 and evals it and then stores it on the class.
30:01 That means that you have little bits of bytecode floating around for all of your new methods.
30:05 msgspec structs instead, each of the standard methods that the implementation provides, which
30:10 would be, you know, init, repr, equality checks, copies, you know, various things, are
30:16 single C functions.
30:19 And then the type has some data structures on it that we can use to define those.
30:25 So we have a single init method for all struct types that's used everywhere.
30:29 And as part of the init method, we need to know the fields that are defined on the struct.
30:34 So we have some data stored on there about like the field names, default values, various
30:38 things.
30:38 Nice.
30:38 Because they're written in C rather than, you know, Python bytecode, they could be a lot
30:42 faster.
30:43 And because we're not having to eval a new method every time we define a struct, importing
30:47 structs is a lot faster than data classes.
30:49 Something, I'm not going to guess, I have to look up on my benchmarks, but they are basically
30:54 as efficient to define as a handwritten class where data classes have a bunch of overhead.
30:59 If you've ever written a project that has, you know, a hundred of them, importing
31:02 can slow down.
31:03 Yeah.
31:03 Okay.
31:03 Because you basically are dynamically building them up, right?
31:06 Yeah.
31:06 In the data class case.
31:07 Yeah.
31:08 So you've got kind of the data class stuff.
31:10 You got, as you said, dunder init, repr, copy, et cetera.
31:13 But you also have dunder match args for pattern matching.
31:16 That's pretty cool.
31:18 And dunder rich repr for pretty printing support with Rich.
31:21 Yeah.
31:22 If you just rich.print, it'll take that, right?
31:24 What happens then?
31:25 It pretty-prints it, similar to how a data class would be rendered.
31:28 Rich is making a pretty big impact.
31:29 So rich is special.
31:31 I enjoy using it.
31:32 This is excellent.
31:33 You've got all the stuff generated.
31:35 So much of it is in C and super lightweight and fast.
31:38 But from the way we think of it, it's just a Python class, even a little less weird than
31:43 data classes, right?
31:44 Because you don't have to put a decorator on it.
31:45 You just derive from this thing.
31:47 So that's super cool.
31:48 Yeah.
31:49 Super neat.
31:50 The hope was that these would feel familiar enough to users coming from data classes or
31:54 attrs or Pydantic or all the various models that learning a new one wouldn't be necessary.
32:00 They're the same.
32:02 Excellent.
32:02 One difference if you're coming from Pydantic is there are no methods defined on these by default.
32:06 So you define a struct with fields A, B, and C. Only A, B, and C exist as attributes on that class.
32:13 You don't have to worry about any conflicting names.
32:14 Okay.
32:15 So for example, like the Pydantic ones have, I can't remember, the V1 versus V2.
32:22 It's like a to-dictionary method effectively, right?
32:25 Where they'll like dump out the JSON or strings or things like that.
32:28 In V1, there's a method dot JSON.
32:30 Yeah, that's right.
32:31 Which if you have a field name JSON will conflict.
32:33 They are remedying that by adding a model prefix for everything, which I think is a good idea.
32:38 I think that's a good way of handling it.
32:40 Yeah.
32:40 Yeah.
32:40 It's like model underscore JSON or dict or something like that.
32:43 Yeah.
32:43 Cool.
32:43 Yeah, that's one of the few breaking changes they actually, unless you're deep down in
32:47 the guts of Pydantic that you might encounter.
32:49 Yeah.
32:50 You don't have to worry about that stuff because you're more function-based, right?
32:53 You would say decode, or I guess, yeah, decode.
32:57 Here's some data, some JSON or something.
33:00 And then the thing you decode it into would be your user type.
33:03 You'd say type equals user rather than going to the user directly, right?
33:07 Can we put our own properties and methods and stuff on these classes and that'll work all right?
33:12 Yeah.
33:13 To a user, you should think of this as a data class that doesn't use a decorator.
33:17 Okay.
33:17 They should be identical unless you're ever trying to touch the dunder data class fields attribute
33:23 that exists on data classes.
33:24 There should be no runtime differences as far as you can tell.
33:27 And when you're doing the schema validation, it sounds like you're basically embracing the
33:31 optionality of the type system.
33:35 If you say int, it has to be there.
33:36 If you say optional int or int pipe none, it may not be there, right?
33:40 No.
33:40 It's close.
33:41 I'm going to be pedantic here a little bit.
33:43 The optional fields are ones that have default values set.
33:46 So optional bracket int without a default is still a required field.
33:50 It's just one that could be an int or none.
33:52 You'd have to have a literal none passed in.
33:53 Otherwise, we'd error.
33:54 This more matches with how mypy interprets the type system.
33:57 Got it.
33:57 Okay.
33:58 So if I had an optional thing, but it had no value, I'd have to explicitly set it to
34:02 none.
34:02 Yes.
34:03 Or would, yeah.
34:03 Or it'd have to be there in the data every time.
34:06 Like other things, you have default factories, right?
34:09 Passing a function that gets called.
34:10 If it does, I guess if it doesn't exist, right?
34:12 If the data's in there, it's being deserialized, it won't.
34:15 Okay.
34:15 Excellent.
34:16 And I guess your metaclass creates the initializer.
34:19 But another thing that I saw that you had was you have this post init, which is really nice.
34:24 Like a way to say like, okay, it's been deserialized.
34:26 Let me try a little further.
34:28 Tell us about this.
34:29 This is cool.
34:29 Yeah, it's coming from data classes.
34:31 They have the same method.
34:32 So if you need to do any extra thing after init, you can use it here rather than trying to override
34:37 the built-in init, which we don't let you do.
34:40 Right.
34:40 Because it has so much magic to do.
34:42 Like let it do it.
34:43 And you don't want to override that anyway.
34:45 You'll have to deal with like passing all the arguments.
34:48 Yeah, it runs Python instead of maybe C, all these things, right?
34:51 So post init would exist if you have more complex constraints, right?
34:56 Currently, that's one reason to use it.
34:58 We currently don't support custom validation functions.
35:01 There's no .validate decorator; various frameworks have different ways of defining these.
35:06 We have some constraints that are built in.
35:07 You can constrain a number to be greater than some value, but there's no way to specify
35:12 custom constraints currently.
35:14 It's on the roadmap.
35:15 It's a thing we want to add.
35:16 Post init's a way to hack around that.
35:18 So right now, you're looking at the screen.
35:20 You have a post init defined, and you're checking if low is greater than high, raise an error.
35:25 And that'll bubble up through decodes and raise a nice user-facing validation error.
35:29 In the long run, we'd like that to be done a little bit more field-based.
35:32 Similar to how it comes from other frameworks.
35:34 It is tricky, though, because the validation goes onto one field or the other.
35:38 You don't have composite validators necessarily, right?
35:41 And so there's totally valid values of this low, but whatever it is, it has to be lower than high.
35:47 But how do you express that relationship?
35:49 So I think this is awesome.
35:50 Other areas where it could be interesting is under some circumstances, maybe you've got to compute some field also that's in there that's not set.
35:59 I don't know.
35:59 There's some good options in here.
36:00 I like it a lot.
36:01 Yeah, I guess the errors just come out as something went wrong in dunder post init, rather than field low has this problem.
36:08 It's a little harder to relate an error being raised to a specific field if you raise it in a post init.
36:13 Yeah.
36:13 Also, since you're looking at this, and I'm proud that I got this to work, the errors raised in post init use chained exceptions.
36:20 So you can see a little bit of the cause of where it comes from.
36:22 And getting those to work at the Python C API is completely undocumented and a little tricky to figure out.
36:28 It's a lot of reading how the interpreter does it, and writing, you know, 12 incantations to get them to bubble up right.
36:34 Yeah, I do not envy you working on this struct, this base class.
36:38 But I mean, that's where part of the magic is, right?
36:41 And that's why I wanted to dive into this because I think it behaves like Python classes, but it has these really special features that we don't normally get, right?
36:49 Like low memory usage, high performance, accessing the fields.
36:53 Is that any quicker or is it like standard struct level quick?
36:57 Attribute access and settings should be the same as any other class.
37:00 Things that are faster are init.
37:02 Repr, not that that should matter.
37:04 If you're looking for a high performance repr, that's...
37:06 You're doing it wrong.
37:07 Seems like you're doing something wrong.
37:08 Equality checks, comparisons, so sorting, you know, less than, greater than.
37:12 I think that's it.
37:12 Everything else should be about the same.
37:14 So field ordering, you talked about like evolution over time.
37:17 Does it, does this matter?
37:19 Field ordering is mostly defining how, what happens if you do subclasses and stuff.
37:22 This whole section is, if you're not subclassing, shouldn't hopefully be relevant to you.
37:27 So we match how data class handles things for ordering.
37:30 Okay.
37:30 So I could have my user, but I could have a super user that derives from my user that derives
37:35 from struct and things will still hang together.
37:37 And so figuring out how all the fields order out through that subclassing is what this doc
37:41 is about.
37:41 Yeah.
37:42 Another typing system thing you can do is explicitly declare something
37:46 as a class variable.
37:47 You know, Python is weird about its classes and what makes a variable that's associated
37:53 with a class and or not.
37:55 Right.
37:56 So with these type of classes, you would say like class example colon, and then you have
37:59 x colon int.
38:00 Right.
38:01 And that appears, will appear on the static type, like example.x, but it also imbues each
38:06 object with its own copy of that x.
38:08 Right.
38:09 Which is like a little bit, is it a static thing or part of the type or is it not?
38:14 It's kind of funky.
38:14 But you also can say that explicitly from the typing, you say this is a class variable.
38:19 What happens then?
38:20 Right.
38:20 Like, so standard attributes exist on the instances, where a class var exists on the class itself.
38:26 Class vars are accessible on an instance, but the actual data is stored on the class.
38:31 So you don't have an extra copy.
38:33 I see.
38:33 So if there's some kind of singleton type of thing or just one of them.
38:37 Yeah.
38:37 Yeah.
38:37 It has to do with how Python does attribute resolution where it'll check on the instance
38:43 and then on the type and also there's descriptors in there, you know, somewhere.
38:46 Interesting.
38:47 Okay.
38:47 Like other things, I suppose it's pretty straightforward that you take these types and you use them
38:53 to validate them.
38:54 But one of the big differences with msgspec.Struct versus pydantic.BaseModel and others is the
39:00 validation doesn't happen all the time.
39:02 It just happens on encode decode.
39:04 Right.
39:05 Like you could call the constructor and pass in bad data or like it just doesn't pay attention.
39:10 Right.
39:11 Yeah.
39:11 Why is it like that?
39:12 So this is one of the reasons I wrote my own thing rather than building off of something
39:17 existing like pydantic.
39:18 Side tangent here, just to add some history context.
39:21 msgspec was started about three years ago,
39:22 and it kind of fell into its full model about two years ago.
39:26 So this has existed for around two years.
39:27 Yeah.
39:28 We predate the Pydantic rewrite.
39:30 Anyway, the reason I wanted all of this was: when you have your own code, where can bugs
39:36 come up?
39:36 There are bugs in your own code.
39:37 I've typed something wrong.
39:39 I've made a mistake and I want that to be checked.
39:41 Or it can be user data that's coming in.
39:43 Or, you know, maybe it's a distributed system and it's still my own code.
39:46 It's just a file or database.
39:47 Yeah.
39:47 Whatever.
39:48 Yeah.
39:48 We have many mechanisms of testing our own code.
39:51 You can write tests.
39:52 You have static analysis tools like mypy or pyright for checking.
39:55 It's a lot easier for me to validate that a function I wrote is correct.
39:59 Got it.
39:59 There are other tools I believe that we should lean on rather than runtime validation in those
40:05 cases.
40:05 But when we're reading in external data, whether it's coming over the wire, coming from a file,
40:10 coming from user input in some way, we do need to validate because the user could have
40:13 passed us something that doesn't match our constraints.
40:17 As soon as you start trusting user input, you're in for a bad time.
40:20 We don't want to arbitrarily be trusting.
40:21 We do validate on JSON decoding.
40:24 We validate on message pack decoding.
40:25 We also have a couple of functions for doing in-memory conversions.
40:28 So there's msgspec.convert, and msgspec.to_builtins for going the other way.
40:33 So that's for doing conversion of runtime data that you got from someone rather than a specific
40:37 format.
40:37 Yeah, because if you're calling this constructor and passing the wrong data, mypy should check
40:42 that.
40:42 PyCharm should check that.
40:44 Maybe Ruff would catch it.
40:46 I'm not sure.
40:46 But there's a bunch of tools.
40:48 Yeah, Ruff doesn't have a type checker yet.
40:49 TBD on that.
40:51 Yeah, OK.
40:52 Yeah, being able to check these statically, it means that we don't have to pay the cost
40:57 every time we're running, which I don't think we should.
40:59 That's extra runtime performance that we don't need to be spending.
41:02 Yeah, definitely.
41:03 Check it on the boundaries, right?
41:04 Check it where it comes into the system, and then it should be good.
41:07 The other reason I was against adding runtime validation to these structs is I want all types
41:11 to be on equal footing.
41:12 And so if I am creating a list, the list isn't going to be doing any validation because it's
41:17 the Python built-in.
41:18 Same with data classes, same with attrs types, whatever.
41:21 And so only doing a validation when you construct some object type that subclasses from a built-in
41:26 that I've defined, or a type I've defined, doesn't give parity across all types and might
41:31 give a user misconceptions about when something is validated and when they can be sure it's
41:35 correct first when it hasn't.
41:36 Yeah.
41:37 Have you seen beartype?
41:38 I have.
41:39 Yeah.
41:39 Beartype's a pretty interesting option.
41:41 If people really want runtime validation, they could go in and throw beartype onto their
41:46 system and let it do its thing.
41:48 Even if you're not doing it, you should read the docs just for the sheer joy that these
41:51 docs are.
41:52 Oh, they are pretty glorious.
41:53 Yeah, I'll do it.
41:54 It's kind of like burying the lede a little down here, but they describe themselves as:
41:58 beartype brings Rust- and C++-inspired zero-cost abstractions into the lawless world of
42:04 dynamically typed Python by enforcing type safety at the granular level of functions and methods
42:09 against type hints standardized by the Python community in O(1) non-amortized worst-case
42:14 time with negligible constant factors.
42:17 Oh my gosh.
42:18 So much fun, right?
42:20 They're just joking around here, but it's a pretty cool library.
42:22 If you want runtime type checking, it's pretty fast.
42:25 Okay.
42:25 Interesting.
42:26 You talked about the pattern matching.
42:28 I'll come back to that.
42:29 One thing I want to talk about.
42:30 Well, okay.
42:31 Frozen.
42:31 Frozen instances.
42:32 This comes from data classes.
42:34 Without the instances being frozen, the structs are mutable.
42:37 Yeah, I can get one, change its value, serialize it back out, things like that.
42:42 Yep.
42:42 But frozen, I suppose, means what you would expect, right?
42:44 Yeah.
42:45 Frozen has the same meaning as a data class equivalent.
42:47 How deep does frozen go?
42:49 So for example, is it frozen all the way down?
42:51 So in the previous example from Itamar, it had the top level class and then other structs
42:57 that were nested in there.
42:58 If I say the top level is frozen, do the nested ones themselves become frozen?
43:02 No.
43:03 So frozen applies to the type.
43:05 So if you define a type as frozen, that means you can't change values that are set as attributes
43:10 on that type.
43:10 But you can still change things that are inside it.
43:13 So if a frozen class contains a list, you can still append stuff to the list.
43:16 There's no way to get around that except if we were to do some deep, deep, deep magic,
43:20 which we shouldn't.
43:21 It would definitely slow it down if you had to go through and recreate frozen lists every
43:25 time you saw a list and stuff like that.
43:27 Yeah.
43:27 Okay.
43:27 And then there's one about garbage collection in here.
43:30 Yeah.
43:31 Which is pretty interesting.
43:32 There we go.
43:33 Disabling garbage collection.
43:35 This is under the advanced category.
43:36 Warning box around this that tells you not to.
43:39 What could go wrong?
43:40 Come on.
43:40 Part of this was experimenting with the Dask distributed scheduler, which is a unique application,
43:46 I think, for people that are writing web stuff in that all of its data is kept in memory.
43:50 There's no backing database that's external.
43:52 And so it is as fast to respond as the bits of in-memory computation it needs to do before
43:58 it sends out a new task to a worker.
43:59 So in this case, their serialization performance matters.
44:03 But also, it's got a lot of in-memory state.
44:05 It's dicts of types, with lots of chaining down.
44:09 The way the CPython garbage collector works is that these large dictionaries could add GC overhead.
44:15 Every time a GC thing happens, it has to scan the entire dictionary.
44:19 Any container thing could contain another.
44:20 And once you do that, there could be a cycle.
44:22 And then for very large graphs, GC pauses could become noticeable.
44:27 Yes.
44:27 This is an experiment and seeing ways around that.
44:29 Because we've done some deep magic with how structs work, we can disable GC for subclasses,
44:34 user-defined types, which CPython does not expose normally and really isn't something you
44:40 probably want to be doing in most cases.
44:42 But if you do, you get a couple benefits.
44:44 The types are smaller.
44:45 Every instance needs to include some extra state for tracking GC.
44:49 I believe on recent builds, it's 16 bytes.
44:52 So it's two pointers.
44:53 So that's, you know, you're saving 16 bytes per instance.
44:56 That's non-trivial.
44:56 Yeah.
44:57 If you got a huge list of them, that could be a lot.
44:59 Yeah.
44:59 And two, they're not traced.
45:02 And so if you have a lot of them, that's a reduction in tracing overhead every time
45:07 a GC pass happens.
45:08 GC puts more overhead on top of stuff than you would think.
45:12 So I did some crazy GC stuff over at Talk Python Training with my courses.
45:16 You go to slash sitemap.xml.
45:19 I don't know how many entries are in the sitemap, but there are 30,000 lines of sitemap.
45:24 Like many, many thousands of URLs that come back with details.
45:29 Just to generate that page in one request with the default Python settings in Python 3.10,
45:35 I think it was, it was doing 77 garbage collections while generating this page.
45:40 That's not ideal.
45:43 I switched it to just change or tweak how frequently the GC runs.
45:46 So like every 70,000, no, every 50,000 allocations instead of every 700.
45:51 And the site runs 20% faster now and uses the same amount of memory, right?
45:55 And so this is not exactly what you're talking about here, but it's in the, it plays in the
46:00 same space as like, you can dramatically change the things that are triggering this and dramatically
46:05 change the performance potentially.
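Not msgspec-specific, but a sketch of the kind of threshold tweak being described here, using the stdlib gc module (the 50,000 figure mirrors the anecdote, not a recommendation):

```python
import gc

# By default the generation-0 threshold is 700: a collection is
# considered after 700 more container allocations than deallocations.
before = gc.get_threshold()

# Raise the generation-0 threshold so collections run far less often.
# Trade-off: garbage from cycles can linger longer between passes.
gc.set_threshold(50_000)
```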
46:08 The caveat is you better not have cycles.
46:10 Yeah.
46:11 So the other thing with these is, as you pointed out, is the indicator of when a GC pass happens
46:16 has to do with how many GC aware types have been allocated.
46:19 Yep.
46:19 And so if your type is marked as not a GC type, then the counter isn't incremented.
46:22 You're not paying that cost.
46:23 Right.
46:24 You can allocate all the integers you want all day long.
46:26 It'll never affect the GC.
46:27 But if you start allocating classes, dictionaries, tuples, et cetera, that is like, well, those
46:31 could contain cycles.
46:32 You have 700 more than you've deallocated since last time.
46:35 I'm going to go check it.
46:36 One place this comes up is if you have, say, a really, really large JSON file.
46:41 Because any deserialization is an allocation-heavy workload, which means that you can have
46:46 a GC pause happen, you know, several times during it because you've allocated, you know,
46:50 that many types.
46:51 Turning off GC for these types lets you avoid those GC pauses, which gives you actual runtime
46:56 benefits.
46:56 A different way of doing this that is less insane is to just disable GC during the decode.
47:01 Do a, you know, GC disable, JSON decode, GC enable, and you only do a GC pass once.
47:07 Especially because JSON as a tree-like structure can never create cycles.
47:10 You're not going to be having an issue there.
47:12 But you're probably allocating a lot of different things that are container types.
47:15 And so it looks to the GC like, oh, this is some really sketchy stuff.
47:20 We better get on the game here.
47:21 But you know, as you said, there's no cycles in JSON.
47:25 So there's a lot of scenarios like that, like database queries.
47:29 You know, I got a thousand records back from a table.
47:31 They're all some kind of container.
47:33 So minimum one GC happens just to read back that data.
47:36 But you know, there's no cycles.
47:38 So why is the GC happening?
47:39 You can kind of control that a little bit.
47:41 Or you just turn the number up to 50,000 like I did.
47:43 It still happens, but less.
47:46 A lot less.
47:46 Yeah.
47:47 So this is pretty interesting, though, that you just set GC equals false.
47:50 Where do you set this?
47:51 Is this like in the derived bit?
47:54 It's part of the class definition.
47:56 So we make use of class definition keyword arguments.
48:00 So it goes after the struct type in the subclass.
48:03 You do, you know, my class, open parenthesis, Struct, comma, GC equals false, close parenthesis,
48:08 colon, rest of the class.
48:10 Yeah, that's where I thought.
48:11 But it is a little funky.
48:12 Yeah.
48:13 I mean, it kind of highlights the meta class action going on there, right?
48:16 What else should people know about these structs?
48:18 They're fast and they can be used for not just serialization.
48:21 So if you are just writing a program and you happen to have msgspec on your system,
48:25 it should be faster to use them than data classes.
48:27 Whether that matters is, of course, application dependent.
48:30 But they're like generally a good idea.
48:32 They happen to live in this serialization library, but that's just because that's where I wrote
48:35 them.
48:36 Yeah, that's where they live.
48:37 In a future world, we might split them out into a sub package.
48:39 Yeah.
48:40 Faststruct.
48:41 Pip install faststruct.
48:42 Who knows?
48:43 Yet to be named.
48:45 So better than data classes.
48:47 I mean, they have the capabilities of data classes.
48:48 So that's cool.
48:49 But better than straight up regular classes, like bare classes, you know, class colon name?
48:54 They're opinionated a little bit.
48:56 They're how I think people probably should be writing classes.
48:59 And they're opinionated a bit.
49:00 That means that you can't write them in ways that I don't want you to.
49:03 So the way a struct works is you define attributes on it using type annotations.
49:07 And we generate a fast init method for you.
49:09 We don't let you write your own init.
49:11 In the subclass, you can't override init.
49:13 The generated one is the one you get.
49:15 That means that like if you're trying to create an instance from something that isn't those field
49:19 names, you can't do that.
49:21 You need to use a classmethod for writing those.
49:23 I believe this is how people, at least on projects I work on, generally use classes.
49:27 So I think it's a fine limitation.
49:29 But it is putting some guardrails around how the arbitrariness of how you can define a Python
49:35 class.
49:35 You could have a, you know, a handwritten class that has two attributes, X and Y, and your
49:40 init takes, you know, parameters A and B.
49:43 Sure.
49:43 Or maybe it just takes X and it always defaults Y unless you go and change it after or whatever,
49:48 right?
49:48 I guess you could sort of do that with default values, right?
49:51 But you couldn't prohibit it from being passed in.
49:53 I'm feeling some factory classes.
49:55 The attrs docs have a whole page telling people about why this pattern is better
50:01 and nudging them to do this.
50:02 So this isn't a new idea.
50:03 Yeah.
50:03 Go check out attrs and see what they're saying as well.
50:06 There's probably a debate in the issues somewhere on GitHub.
50:08 There always is a debate.
50:10 Yeah.
50:10 Let's see.
50:10 I've got a bunch of stuff up here I want to talk about.
50:12 I guess really quickly, since there's a lot of like C native platform stuff, right?
50:17 This is available on, you know, pip install msgspec.
50:21 We're getting a wheel.
50:22 It seemed like it worked fine on my M2 MacBook Air.
50:26 Like what are the platforms that I get a wheel that don't have to worry about compiling?
50:30 So we use cibuildwheel for building everything.
50:33 And I believe I've disabled some of the platforms.
50:36 The ones that are disabled are mostly disabled because CI takes time.
50:40 I need to minimize them.
50:42 But everything common should exist, including Raspberry Pi and various ARM builds.
50:46 Excellent.
50:47 Okay.
50:47 Yeah.
50:47 It seemed like it worked just fine.
50:49 I didn't really know that it was like doing a lot of native code stuff, but it seems like
50:53 it.
50:53 And it's also available on conda-forge.
50:55 So that's cool.
50:56 If you use conda, you can just conda install it.
50:59 Kind of promised talking about the benchmarks a little bit, didn't I?
51:02 So benchmarks are always...
51:04 If you click on the graph on the bottom, it'll bring you to it.
51:06 Yeah.
51:06 They're always rife with like, that's not my benchmark.
51:10 I'm doing it different, you know?
51:11 But give us a sense of just...
51:13 It says fast, italicized, leaning forward.
51:16 Give us a sense of like, where does this land?
51:18 Is it, you know, 20% faster or is it a lot better?
51:21 Yeah.
51:21 So as you said, benchmarks are a problem.
51:22 The top of the benchmark docs is a whole argument against believing them and telling you to run
51:27 your own.
51:27 So take it with a grain of salt.
51:29 I started benchmarking this mostly just to know how we stacked up.
51:33 It's important if you're making changes to know if you're getting slower.
51:35 It's also important to know what the actual trade-offs of your library are.
51:38 All software engineering is trade-offs.
51:40 So msgspec is generally fast.
51:43 The JSON parser in it is one of the fastest in Python or the fastest, depending on what
51:50 your message structure is and how you're invoking it.
51:52 It at least is on par with orJSON, which is generally what people consider to be the fast
51:56 parser.
51:57 Right.
51:57 That's where they go when they want fast.
51:59 Yeah.
51:59 Yes.
51:59 If you are specifying types, so if you, you know, add in a type annotation to a JSON decode
52:05 call with msgspec, even if you're decoding the whole message, you're not doing
52:08 a subset.
52:08 We're about 2x faster than orJSON.
52:10 You actually get a speed up by defining your types because struct types are so efficient
52:15 to allocate versus a dict.
52:17 That's kind of the opposite of what you might expect, right?
52:19 It seems like we're doing more work, but we're actually able to do less because we can
52:23 take some more, you know, efficient fast paths.
52:25 And then a thousand objects with validation compared to.
52:29 Yeah.
52:29 mashumaro, cattrs, Pydantic, and so on.
52:34 Probably the last one.
52:34 This was a grab bag of various validation libraries that seemed popular.
52:38 mashumaro is the one that dbt uses.
52:40 I think they're the primary consumer of that.
52:42 cattrs is for attrs.
52:43 Pydantic is, you know, ubiquitous.
52:45 This right here in this benchmark graph we're looking at is against Pydantic V1.
52:50 I have not had a chance to update our benchmarks to go against V2.
52:54 There's a separate gist somewhere that has got some numbers there.
52:57 Standard number they throw out is like 22 times faster.
53:00 So it still puts you multiples faster.
53:02 In that benchmark, we're averaging 10 to 20x faster than Pydantic V2.
53:06 In numbers I run against V1, we're about 80 to 150x faster.
53:11 So it really is structure dependent.
53:13 Yeah, sure.
53:14 Do you have one field or do you have a whole bunch of stuff?
53:17 Yeah, exactly.
53:17 And what types of fields?
53:19 To be getting more into the weeds here, JSON parsing is not easy.
53:24 MessagePack parsing is easy; the format was designed for computers to handle it.
53:27 It's, you know.
53:28 Seven bytes in there is an integer here.
53:30 Yeah, okay.
53:31 Where JSON is human readable and parsing strings into stuff is slow.
53:35 Right.
53:35 The flexibility equals slowness, yeah.
53:37 Our string parsing routines in msgspec are faster than the ones used by orJSON.
53:43 Our integer parsing routines are slower.
53:46 But there's a different trade-off there.
53:47 Interesting.
53:48 Okay.
53:48 Yeah, I think this is, it just seems so neat.
53:50 There's so much flexibility, right, with all the different formats.
53:53 And the restrictions on the class, they exist, but they're on structs.
53:56 But they're not insane, right?
53:58 I mean, you build proper OOP type of things.
54:02 You don't need super crazy hierarchies.
54:04 Like, that's where you get in trouble with that stuff anyway.
54:06 So don't do it.
54:07 I guess we don't have much time left.
54:09 One thing I think we could talk about a bit maybe would be, I find it, the extensions.
54:13 Just maybe talk about parsing stuff that is kind of unknown.
54:16 This is pretty interesting.
54:17 So the way we allow extension currently, there's an intention to change this and expand it.
54:23 But currently, extending adding new types is done via a number of different hooks.
54:28 They're called when a new type is encountered.
54:30 So custom user defined type of some form.
54:32 I liked doing this rather than adding it into the annotation because if I have a new type, I want it to exist probably everywhere.
54:38 And I don't want to have to keep adding in and use the serializer and deserializer as part of the type annotations.
54:45 So to define a new type that you want to encode, you can add an encode hook, which takes in the instance and returns something that msgspec knows how to handle.
54:54 This is similar to, you know, if you're coming from standard library JSON, there's a default callback.
54:59 It's the same.
54:59 We renamed it to be a little better name in my mind, but it's the same thing.
55:02 Right.
55:03 So your example here is taking a complex number, but storing it as a tuple of real and imaginary numbers, but then pulling it back into a proper complex number object.
55:12 Super straightforward.
55:13 Yeah.
55:13 But makes it possible.
55:14 Yeah.
55:15 Yeah.
55:15 That's really cool.
55:16 So people can apply this.
55:17 And this, I guess, didn't really matter on the output destination, does it?
55:21 Your job here is to take a type that's not serializable to one that is, and then whether that goes to a message pack or JSON or whatever is kind of not your problem.
55:31 Yeah.
55:31 And then the decode hook is the inverse.
55:32 You get a bunch of stuff that is, you know, core types, ints, strings, whatever, and you compose them up into your new custom type.
55:39 Jim, I think we're getting about out of time here.
55:41 But I just want to point out, like, if people hit the user guide, there's a lot of cool stuff here.
55:44 And there's a whole performance tips section that people can check out.
55:50 You know, if we had more time, maybe we'd go into them.
55:52 But, like, for example, you can call msgspec dot JSON dot encode, or you can create an encoder and say the type and stuff and then reuse that.
55:59 Right.
55:59 Those kinds of things.
56:01 Yeah.
56:01 There's another method that is, again, a terrible internal hack for reusing buffers.
56:07 So you don't have to keep allocating byte buffers every message.
56:09 You can allocate a byte array once and use it for everything.
56:12 Save some memory.
56:13 Let me ask, Ellie's got a question.
56:14 I'm going to read some words that don't mean anything to me, but they might to you.
56:17 How does the performance of message pack plus msgspec with the array-like equals true optimization compared to flat buffers?
56:25 So by default, objects, so struct types, data classes, whatever, encode as objects in the stream.
56:31 So a JSON object has keys and values, right?
56:34 If you have a point with fields X and Y, it's got X and Y, you know, one, two.
56:39 We have an array-like optimization, which lets you drop the field names.
56:42 And so that would instead include as an array of, you know, one comma two, dropping the X and Y.
56:47 Reduces the message size on the wire.
56:48 If the other side knows what the structure is, it can, you know, pull that back up into a type.
56:53 In terms of MessagePack as a format, plus with the array-like optimization, the output size should be approximately the same as you would expect it to come out of FlatBuffers.
57:03 The Python FlatBuffers library is not efficient for creating objects from the binary.
57:09 So it's going to be a lot faster to pull it in.
57:11 Obviously, this is then a very custom format.
57:13 You're doing a weird thing.
57:15 And so compatibility with other ecosystems will be slower.
57:18 Or it's not slower necessarily, but you'll have to write them yourself.
57:21 Yeah.
57:21 Not everything knows how to read MessagePack.
57:23 More brittle, potentially.
57:24 Yeah.
57:25 Yes.
57:25 Yeah.
57:25 But for Python, talking to Python, that's probably the fastest way to go between processes.
57:29 And probably a lot faster than JSON or YAML or something like that.
57:33 Okay.
57:33 Excellent.
57:34 I guess, you know, there's many more things to discuss, but we're going to leave it here.
57:38 Thanks for being on the show.
57:39 Final call to action.
57:40 People want to get started with msgspec.
57:43 Are you accepting PRs if they want to contribute?
57:45 What do you tell them?
57:47 First, I encourage people to try it out.
57:48 I'm available, you know, to answer questions on GitHub and stuff.
57:51 It is obviously a hobby project.
57:53 So, you know, if the usage bandwidth increases significantly, we'll have to get some more
57:58 maintainers on and hopefully we can make this more maintainable over time.
58:01 But once the sponsor funds exceed $10,000, $20,000, $30,000 a month, like you'll reevaluate
58:06 your...
58:07 No, just kidding.
58:07 Sure.
58:08 Sure.
58:08 But yeah, please try it out.
58:10 Things work, should be hopefully faster than what you're currently using and hopefully intuitive
58:15 to use.
58:15 We've done a lot of work to make sure the API is friendly.
58:17 Yeah.
58:17 It looks pretty easy to get started with.
58:19 The docs are really good.
58:20 No, thank you.
58:21 Congrats on the cool project.
58:22 Thanks for taking the time to come on the show and tell everyone about it.
58:25 Thanks.
58:25 Yeah.
58:26 See you later.
58:26 Bye.
58:27 This has been another episode of Talk Python to Me.
58:29 Thank you to our sponsors.
58:31 Be sure to check out what they're offering.
58:33 It really helps support the show.
58:35 This episode is sponsored by Posit Connect from the makers of Shiny.
58:39 Publish, share, and deploy all of your data projects that you're creating using Python.
58:44 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Reports, Dashboards, and APIs.
58:50 Posit Connect supports all of them.
58:52 Try Posit Connect for free by going to talkpython.fm/posit.
58:57 P-O-S-I-T.
58:58 Want to level up your Python?
59:00 We have one of the largest catalogs of Python video courses over at Talk Python.
59:04 Our content ranges from true beginners to deeply advanced topics like memory and async.
59:09 And best of all, there's not a subscription in sight.
59:12 Check it out for yourself at training.talkpython.fm.
59:15 Be sure to subscribe to the show.
59:17 Open your favorite podcast app and search for Python.
59:20 We should be right at the top.
59:21 You can also find the iTunes feed at /itunes, the Google Play feed at /play,
59:26 and the direct RSS feed at /rss on talkpython.fm.
59:30 We're live streaming most of our recordings these days.
59:33 If you want to be part of the show and have your comments featured on the air,
59:37 be sure to subscribe to our YouTube channel at talkpython.fm/youtube.
59:41 This is your host, Michael Kennedy.
59:43 Thanks so much for listening.
59:45 I really appreciate it.
59:46 Now get out there and write some Python code.
59:48 Thank you.