
#442: Ultra High Speed Message Parsing with msgspec Transcript

Recorded on Thursday, Nov 2, 2023.

00:00 If you're a fan of Pydantic or data classes, you'll definitely be interested in this episode.

00:05 We are talking about a super fast data modeling and validation framework called msgspec.

00:10 And some of the types in here might even be better for general purpose use than Python's native classes. Join me and Jim Crist-Harif to talk about his framework msgspec.

00:20 This is Talk Python to Me, episode 442, recorded November 2nd, 2023.

00:40 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy.

00:45 Follow me on Mastodon, where I'm @mkennedy, and follow the podcast using @talkpython.

00:50 Both on fosstodon.org. Keep up with the show and listen to over seven years of past episodes at talkpython.fm. We've started streaming most of our episodes live on YouTube. Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode.

01:10 This episode is sponsored by Posit Connect from the makers of Shiny. Publish, share, and deploy all of your data projects that you're creating using Python. Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Reports, Dashboards, and APIs. Posit Connect supports all of them.

01:27 Try Posit Connect for free by going to talkpython.fm/posit. P-O-S-I-T. And it's brought to you by us over at Talk Python Training. Did you know that we have over 250 hours of Python courses?

01:41 Yeah, that's right. Check them out at talkpython.fm/courses. Jim.

01:45 Hello.

01:47 Welcome to Talk Python. I mean, it's awesome to have you here.

01:51 Yeah. Thanks for having me.

01:52 Yeah, of course. I spoke to the Litestar guys, you know, at litestar.dev and had them on the show. And I was talking about their DTOs, different types of objects they can pass around in their APIs and their web apps. And like FastAPI, they've got this concept where you kind of bind a type, like a class or something, to an input to a web API. And it does all that sort of magic like FastAPI. And I said, oh, so you guys probably work with Pydantic. They were like, yes, but let me tell you about msgspec. Because that's where the action is. They were so enamored with your project that I just had to reach out and have you on. It looks super cool. I think people are going to really enjoy learning about it.

02:29 Yeah, thanks. Yeah. It's nice to hear that.

02:31 Yeah. We're going to dive into the details. It's going to be a lot of fun. Before we get to them, though, give us just a quick introduction on who you are. So people, you know, people don't know you yet.

02:40 So my name's Jim Crist-Harif. I am currently an engineering manager doing actually mostly dev work at Voltron Data, working on the Ibis project, which is a completely different conversation than what we're going to have today. Prior to that, I've worked at a couple of startups, most of them doing Dask, which was the main thing I've contributed to in the past on the open source Python front. For those not aware, Dask is a distributed compute ecosystem. I come from the PyData side of the Python ecosystem, not the web dev side.

03:09 Nice. Yeah, I've had Matthew Rocklin on a couple of times, but it's been a while, so people don't necessarily know, but it's like super distributed pandas, kind of. Grid computing for pandas, sort of.

03:21 Or say like Spark written in Python.

03:23 Sure. You know, another thing that's been kind of on my radar, but I didn't really necessarily realize it was associated with you. Tell people just a bit about Ibis. Like Ibis is looking pretty interesting.

03:33 Ibis is, I don't want to say the wrong thing. Ibis is a portable data frame library, is the current tagline we're using. If you're coming from R, it's dplyr for Python. It's more than that and it's not exactly that, but that's a quick mental model. So you write data frame like code. We're not pandas compatible. We're pandas like enough that you might find something extra familiar, and it can compile down to, you know, generate SQL for 18 plus different database backends. Also like PySpark and a couple other things.

04:01 Okay.

04:02 So you write your code once and you kind of run it on whatever.

04:03 I see. And you do pandas like things, but it converts those into database queries. Is that right?

04:09 Yeah. Yeah. So it's a data frame API. It's not pandas compatible, but if you're familiar with pandas, you should be able to pick it up. You know, we cleaned up what we thought as a bunch of rough edges with the pandas API.

04:19 Yeah. Were those pandas one or pandas two rough edges?

04:22 Both. It's, I don't know. It's pandas like.

04:25 Sure. Yeah. This looks really cool. That's a topic for another day, but awesome. People can check that out. But this time you're here to talk about your personal project, msgspec.

04:36 Am I saying that right? Are you saying M-S-G spec or message spec?

04:40 Message spec.

04:41 Right on. I think a lot of these projects sometimes need a little, like, here's the MP3 you can press play on, like how it's meant to be said, you know? And sometimes it's kind of obvious like PyPy versus PyPI. Other times it's just like, okay, I know you have a really clever name.

04:59 People say numpy.

05:00 Yes, I know. People say numpy all the time. I'm like, I don't want to, I try to not correct guests because it's not kind, but I also feel awkward. They will say numpy and I'll say, how do you feel about numpy? They're like, numpy's great. I'm like, okay, we're just going back and forth like this for the next hour. It's fine. But yeah, I think some of these could use a little, like a little play button. So msgspec, tell people about what it is.

05:20 Yeah. So gone through a couple of different taglines. The current one is a fast serialization and validation library with a built-in support for JSON, MessagePack, YAML, and TOML. If you are familiar with Pydantic, that's probably one of the closest, you know, most popular libraries that does a similar thing. You define kind of a structure of your data using type annotations and msgspec will parse your data to ensure it is that structure and does so efficiently. It's also compatible with a lot of the other serialization libraries.

05:48 You could also use it as a stand in for JSON, you know, with the JSON dumps, JSON loads, you don't need to specify the types.

05:55 Right. It's, I think the mental model of kind of like it swims in the same water, the same pond as Pydantic, but it's also fairly distinguished from Pydantic, right? As we're going to explore throughout our chat here.

06:09 The goal from my side, one of the goals was to replicate more of the experience writing Rust or Go with Rust Serde or Go's JSON, where the serializer kind of stands in the background rather than my experience working with Pydantic where it felt like the base model kind of stood at the foreground. You're defining the models, serialization kind of comes onto the types you've defined, but you're not actually working with the serializers on the types themselves.

06:32 Got it. So an example, let me see if I, see if I do have it. An example might be if I want to take some message I got from some response I got from an API, I want to turn it into a Pydantic model or I'm writing an API. I want to take something from a client, whatever. I'll go and create a Pydantic class. And then I, the way I use it is I go to that class and I'll say star star dictionary I got. And then it comes to life like that, right? Where there's a little more focused on just the, the serialization and has this capability. But like, like you said, it's optional in the sense.

07:06 Yeah. In msgspec, all types are on equal footing. So we use functions, not methods, because if you want to decode into a list of ints, I can't add a method to a list. You know, it's a Python built-in type. So you'd say msgspec.json.decode with your message. And then you'd specify the type annotation as part of that function call. So it could be, you know, list bracket int.

07:31 Right. So you'll say decode. And then you might say type equals list of, of your type or like you say, list of int. And that's hard when you have to have a class that knows how to basically become what the model, the data passed in is, even if it's just a list, some Pydantic classes, you got to kind of jump through some hoops to say, Hey, Pydantic, I don't have a thing to give you. I want a list of those things. And that's the top level thing is, you know, bracket bracket. It's not, it's not any one thing I can specify in Python easily.

08:00 Yeah. To be fair to the Pydantic project, I believe in V2 the TypeAdapter object can work with that, but that is, you know, it's a different way of working with it.

08:09 I wanted to have one API that did it all.

08:12 Sure. And it's awesome. They made it. I mean, I want to put this out front, like I'm a massive fan of Pydantic. What Samuel's done there is incredible. And it's just, it's really made a big difference in the way that people work with data in Python. It's awesome. But it's also awesome that you have this project that is an alternative and it makes different assumptions and you can see those really play out in like the performance or the APIs. So you know, like Pydantic encourages you to take your classes and then send them the data, but you've kind of got to know like, oh, there's this type adapter thing that I can give a list of my class and then make it work. Right. But it's not just, oh, you just fall into that by trying to play with the API, you know?

08:51 Yeah. Yeah. And I think being able to specify any type means we work with standard library data classes the same as we work with our built-in struct type, or we also work with the attrs types. You know, everything is kind of on equal footing.

09:02 Yeah. And what I want to really dig into is your custom struct type that has some really cool properties, not class properties, but components, features of the class, of the type there. Yeah. Let's look at a couple of things here. So as you said, it's fast, and I love how somehow italics on the word fast makes it feel even faster. Like it's leaning forward, you know, it's leaning into the speed. A fast serialization and validation library. The validation kind of can be, but is not required, right? The types can be, but they don't have to be.

09:34 So I think that's one of the ways that really differs from Pydantic. But the other is Pydantic is quite focused on JSON, whereas this is JSON, MessagePack, YAML, and TOML. Everyone knows what JSON is. I always thought of TOML as kind of like, like YAML or are they really different?

09:51 It's another configuration focused language. I think some people do JSON for config files, but I personally don't like to handwrite JSON. YAML and TOML are like more human friendly, in quotes, forms of that. YAML is a superset of JSON. TOML is its own thing. And then MessagePack is a binary JSON-like file format.

10:12 Yeah. MessagePack. I don't know how many people work with that. Like where would people run into MessagePack if they were, like, say consuming an API, or what API framework would people be generating MessagePack in Python typically?

10:23 That's a good question. So going back to the creation of this project, actually, msgspec sounds a lot like MessagePack. And that was intentional, because that's what I wrote it for originally. So as I said at the beginning, I'm one of the original contributors to Dask, worked on Dask for forever. And the Dask distributed scheduler uses MessagePack for its RPC serialization layer. That kind of fell out of what was available at the time. We benchmarked a bunch of different libraries and that was the fastest way to send bytes between nodes in 2015.

10:52 Sure.

10:53 The distributed scheduler's RPC framework has kind of grown haphazardly over time. And there were a bunch of bugs due to some hacky things we were doing with it. And also it was slower than we would have wanted. So this was an attempt to write a faster MessagePack library for Python that also did fancier things, supported more types, did some schema validation, because we wanted to catch the case where the worker is sending this data and the scheduler is getting it and saying it's wrong. And we wanted to also add in a way to do schema evolution, meaning that I can have different versions of my worker and scheduler and client process and things kind of work. If I add new features to the scheduler, they don't break the client. You know, we have a nice forward and backward compatibility story. And so that's what kind of fell out.

11:38 Yeah, it's a really nice feature. We're going to dive into that. But you know, you might think, oh, well, just update your client or update the server. But there's all sorts of situations that get really weird. Like if you have Redis as a caching layer and you create a msgspec object and stick it in there and then you deploy a new version of the app, it maybe can't deserialize anything in the cache anymore because it says something's missing or something's there that it doesn't expect. Right. And so this evolution is important there. If you've got long running work and you stash it into a database and you pull it back out, like all these things where it kind of lives a little outside the process, all of a sudden it starts to matter, before you even consider clients that run separate code, right? Like you could be the client, just at different places in time.

12:22 Yeah. Yeah. So adding a little bit more structure to how you define messages, in a way to make the scheduler more maintainable. That work never landed. It's as it is with open source projects. It's a democracy and also a do-ocracy. And you know, sometimes these things turn out to be dead ends. I still think it'll be valuable in the future. But some stuff was changing in the scheduler, and serialization is no longer the bottleneck that it was two and a half years ago when this originally started.

12:46 So let me put this in context for people, maybe make it relevant. Like maybe right now someone's got a FastAPI API, and they're using Pydantic and obviously it generates all the awesome JSON they want. How would you go about creating, say, a Python server-based set of APIs that maybe as an option take MessagePack, or maybe use that as a primary way? Like it could be maybe, you know, passing an accept header.

13:14 To take MessagePack?

13:16 If you want to exchange MessagePack, client server, in Python right now, what do you do?

13:19 That's a good question. To be clear, I am not a web dev. I do not do this for a living.

13:23 I think there is no standard application/messagepack content type. I think people can use it if they want, but it's not a standardized thing the same way that JSON is.

13:32 Yeah.

13:33 I think that Litestar as a framework does support this out of the box. I don't know about FastAPI. I'm sure there's a way to hack it in as there is with any ASGI server.

13:40 Yeah. Litestar, like I said, I had Litestar on those guys maybe a month ago and yeah, it was super, super cool about that. So yeah, I know that they support msgspec and a lot of different options there, but you know, you could just, I imagine you could just return binary bits between you and your client. I'm thinking of like latency sensitive microservice type things sort of within your data center. How can you lower serialization, deserialization, serialization, like all that, that cost that could be the max, you know, the biggest part of what's making your app spend time and energy. Michael out there says would love PyArrow parquet support for large data.

14:19 There's been a request for arrow integration with msgspec. I'm not exactly sure what that would look like. Arrow containers are pretty efficient on their own. Breaking them out into a bunch of objects or stuff to work with msgspec doesn't necessarily make sense in my mind. But anyway, if you have ideas on that, please open an issue or comment on the existing issue.

14:36 Yeah, indeed. All right. So let's see. Some of the highlights are high performance encoders and decoders across those protocols we talked about. You have benchmarks. We'll look at them in a minute. You have really nice support for a lot of different types that can go in there and be serialized, but there's also a way to extend it to say, I've got a custom type that you don't think is serializable to whatever end thing, MessagePack, JSON, whatever. But I can write a little code that'll take it either way, you know, dates or something that drive me crazy, but it could be like an object ID out of MongoDB or other things that seem like they should go back and forth, but don't, you know, right. So that's really nice. And then zero cost schema validation, right? It decodes and validates JSON two times as fast as orjson, which is one of the high performance JSON decoders.

15:25 And that's just decoding, right? And then the struct thing that we're going to talk about, the struct type is kind of what brings the parity with Pydantic, right?

15:33 Yeah. You could think of it as Pydantic's base model. It's our built in data class like type. So structs are data class like, like everything in msgspec are implemented fully as a C extension. Getting these to work required reading a lot of the CPython source code because we're doing some things that I don't want to say that they're not, they don't want you to do. We're not doing them wrong, but they're not really documented.

15:57 So for example, when you subclass from msgspec.Struct, that's using a metaclass mechanism, which is a way of defining types that define types. And the metaclass is written in C, which CPython doesn't make easy to do. So it's a C class that creates new C types. They're pretty speedy. They are 10 to 100X faster for most operations than even handwriting a class that does the same thing, but definitely more than data classes or attrs.

16:25 Yeah. It's super interesting. And I really want to dive into that. Like I almost can see the struct type being relevant even outside of msgspec and in general potentially.

16:35 So yeah, we'll see about that, but it's super cool. And Michael also points out like he's the one who made the issue. So sorry about that. He's commented already, I suppose, in a sense, but yeah. Awesome. Cool. All right. So let's do this. I think probably the best way to get started is we could talk through an example, and there's a really nice article by Itamar Turner-Trauring, who's been on the show a couple of times, called Faster, More Memory-Efficient Python JSON Parsing with msgspec. And it just has a couple of examples that I thought maybe we could throw up and you could speak to your thoughts of like, why does the API work this way? Here's the advantages and so on. Yeah. So there's this big, I believe this is the GitHub API, just returning these giant blobs of stuff about users. Okay. And it says, well, if we want to find out what users follow what repos, or given a user, how many repos do they follow? Right. We could just say with open, read this, and then just do a JSON load and then do the standard dictionary stuff, right? Like for everything, we're going to go to the element that we got out and say bracket some key, bracket some key. You know, it looks like key-not-found errors are just lurking in here all over the place, but you know, you should know that maybe it'll work right, if you know the API, I guess. So it was like, this is the standard way.
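
The "standard dictionary stuff" approach can be sketched with the standard library alone (an illustration added for this transcript; the inline sample is made up and only loosely shaped like the event data discussed, and it is not the article's code):

```python
import json

# Made-up sample, loosely shaped like the event data discussed above.
raw = """[
  {"actor": {"login": "alice"}, "repo": {"name": "org/project"}},
  {"actor": {"login": "bob"}, "repo": {"name": "org/project"}},
  {"actor": {"login": "alice"}, "repo": {"name": "org/other"}}
]"""

repos_by_user = {}
for event in json.loads(raw):
    # Each chained lookup raises KeyError if the payload shape ever differs.
    login = event["actor"]["login"]
    repos_by_user.setdefault(login, set()).add(event["repo"]["name"])

print(repos_by_user)
```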

17:51 How much memory does this use? How much time does it take? Look, we can basically swap in orjson. I'm not super familiar with orjson. Are you? Yeah. orjson is compatible-ish with the standard lib json, except that it returns bytes rather than strings. Got it. Okay. There's also ijson, I believe, which makes it streaming. So there's that. And then it says, okay, well, how would this look if we're going to use msgspec? And in his example, he's using structured data. So the structs, this is like the Pydantic version, but it doesn't have to be this way, but it is this way, right? This is the one he chose.

18:26 So maybe just talk us through, like, how would you solve this problem using msgspec and classes? Yeah. So as he's done here in this blog post, he's defined a couple of struct types for the various levels of this message. So repos, actors, and interactions, and then parses the message directly into those types. So the final call there is passing in the read message and then specifying the type as a list of interactions, which then tree down into actors and repos. Exactly. So this is what you mentioned earlier about being more function-based. So you just say decode, give it the string or the bytes, and you say type equals list, bracket, top-level class. And just like Pydantic, these can be nested.

19:09 So there's an interaction, which has an actor. There's an actor class, which has a login, which has a type. So your Pydantic mental model for how those kind of fit together is pretty straightforward, right? Pretty similar. Yeah. And then you're just programming with classes. Awesome. Yep. And it'll all work well with like mypy or Pyright or whatever you're using if you're doing static analysis tools. Yeah. So you've thought about making sure that not just does it work well from a usability perspective, but the type checkers don't go crazy. Yeah. And any, you know, editor integration you have should just work. Nice. Because there's sometimes, oh gosh, I think maybe FastAPI has changed this, but you'll have things like you would say the type of an argument being passed in, if it's say coming off the query string, you would say it's a Depends. Its type is Depends, not an int, for example. It's because it's being pulled out of the query string.

19:58 I think that's FastAPI. And while it makes the runtime happy and the runtime says, oh, I see you want to get this int from the query string, the type checkers and stuff are like, Depends? What is this? Like, this is an int. Why are you trying to use this Depends as an int? This doesn't make any sense. I think it's a bit of a challenge to have the types drive the runtime, but still not freak it out. You know? Yeah. I think that the Python typing ecosystem, especially with the recent changes in new versions and the Annotated wrapper, is moving towards a system where these kinds of APIs can be spelled natively in ways that the type checkers will understand. Right. But if you're a project that existed before these changes, you obviously had some preexisting way to make those work that might not play as nicely. So there's the upgrade cost of the project. I'm not envious of the work that Samuel Colvin and team have had to do to upgrade Pydantic to erase some old warts in the API that they found. It's nice to see what they've done and it's impressive, but I have the benefit of starting this project after those changes in the typing system existed. You know, I can look in hindsight at mistakes others have made and learn from them.
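
The `Annotated` wrapper mentioned here lets runtime metadata ride along with a plain type. A minimal standard library sketch (added for this transcript; `FromQuery` and `handler` are hypothetical names, not any framework's API):

```python
from typing import Annotated, get_type_hints


class FromQuery:
    """Hypothetical marker meaning 'read this value from the query string'."""


def handler(page: Annotated[int, FromQuery()] = 1) -> int:
    # A type checker treats `page` as a plain int here.
    return page + 1


# Without extras, the annotation resolves to the underlying type.
plain = get_type_hints(handler)

# With extras, a framework can recover the marker at runtime.
rich = get_type_hints(handler, include_extras=True)
marker = rich["page"].__metadata__[0]

print(plain["page"], type(marker).__name__)
```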

21:01 Yeah, that's really excellent. They have done, like I said, I'm a big fan of Pydantic and it took them almost a year. I interviewed Samuel about that change and it was no joke.

21:09 You know, it was a lot of work, but you know, what they came up with is pretty compatible, pretty much feels like the same Pydantic, but you know, if you peel back the covers, it's definitely not. All right. So the other interesting thing about Itamar's article here is the performance side. Okay. Do you get fixed memory usage or does it vary based on the size of the data, and do you get schema validation? Right. So standard lib, straight json module, 420 milliseconds. orjson, the fast one, a little less than twice as fast, 280 milliseconds. ijson, for iterable JSON, 300. So a little more than the fast one. msgspec, 90 milliseconds. That's awesome. That's like three times as fast as the better one, over four times as fast as the built-in one.

21:54 It also is doing, you know, quote unquote more work. It's validating the response as it comes in.

21:59 Exactly.

22:00 So you're sure that it's correct then too.

22:01 Yeah. And all those other ones are just giving you dictionaries and YOLO, do what you want with them. Right. But here you're actually all those types that you described, right?

22:09 The interaction and the actors and the repos and the class structure, that's all validation.

22:13 So, and on top of that, you've created classes, which are heavier weight than dictionaries because general classes are heavier weight than dictionaries because they have the dunder dict that has all the fields in there effectively anyway. Right?

22:26 That's not true for, for structs. Structs are slot classes.

22:29 Yes. Structs.

22:30 They are lighter weight to allocate than a dictionary or a standard class. That's one of the reasons they're faster.

22:35 Yeah. Structs are awesome. And so the other thing I was pointing out is, you know, you've got 40 megabytes of memory usage versus 130. So almost four times less than the standard module. And the only thing that beats you is the iterative one, because it literally only has one in memory at a time. Right. One element. Yeah.

22:52 So this benchmark is kind of hiding two things together. So there is the output, what you're parsing. Everything here except for ijson is going to parse the full input into something.

23:03 One big batch.

23:04 msgspec is more efficient than orjson or the standard lib in this respect, because we're only extracting the fields we care about, but you're still going to end up with a list of a bunch of objects. ijson is only going to pull one into memory at a time. So it's going to have less in memory there. And then you have the memory usage of the parsers themselves, which can also vary. So orjson's memory usage in its parser is a lot higher than msgspec's, regardless of the output size. There's a little more internal state.

23:30 So this is a pretty interesting distinction that you're calling out here. So for example, if people check out this article, which I'll link, there's like tons of stuff that people don't care about in the JSON, like the Gravatar URL, the Gravatar ID, you know, the reference type, whether it's a branch, like this stuff that you just don't care about. Right. But the parser, then, has got to read that. But what's pretty cool, you're saying, is like, in this case, the class that Itamar came up with is just Repo deriving from Struct.

23:59 It just has name. There's a bunch of other stuff in there, but you don't care about it.

24:02 And so what you're saying is like, if you say that that's the decoder, it looks at that and goes, there's a bunch of stuff here. We're not loading that. We're just going to look for the things you've explicitly asked us to model. Right. That's all.

24:13 There's no sense in doing the work if you're never going to look at it.

24:16 With a lot of different serialization frameworks, I can't remember how Pydantic responds when you do this, but you know, the comment goes beyond Pydantic, so it doesn't really matter, they'll freak out and say, oh, there's extra stuff here. What am I supposed to do? You know, for example, this repo, it just has name, but in the JSON data, it has way more.

24:35 So you try to deserialize it. It'll go, well, I don't have room to put all this other stuff.

24:39 Things are, you know, freak out. And this one is just like, no, we're just going to filter down to what you asked for. I really, it's nice in a couple of ways. It's nice from performance, nice from clean code. I don't have to put all those other fields I don't care about, but also from, you talked about the evolution friendliness, right? Because what's way more common is that things get added rather than taken away or change. It's like, well, the complexity grows. Now repos also have this, you know, related repos or sub repos or whatever the heck they have. Right. And this model here will just let you go, whatever. Don't care.

25:10 Yeah. If GitHub updates their API and adds new fields, you're not going to get an error.

25:15 And if they remove a field, you should get a nice error that says expected, you know, field name, and now it's missing. You can track that down a lot easier than a random key error.

25:24 I agree. I think, okay, let's dive into the struct a little bit, because that's where we're kind of at now. And I think this is one of the highlights of what you built. Again, it's kind of the same mental model as people are familiar with from data classes, Pydantic, attrs and so on. So when I saw your numbers, and we'll come back and talk about benchmarks with numbers, but I just saw like, wow, this is fast. And the memory usage is low. You must be doing something native. You must be doing something crazy in here. That's not just dunder slots. While dunder slots is awesome, there's more to it than that. Right. And so they're written to be quite speedy and lightweight. So measurably faster than data classes, attrs and Pydantic. Like tell us about these classes. Like this is pretty interesting.

26:07 As mentioned earlier, they're not exactly, but they're, they're basically slots classes.

26:11 So Python's data model, actually CPython's data model, is that either a class is a standard class where it stores its attributes in a dict. That's not exactly true. There's been some optimizations where the keys are stored separately alongside the class structure and all of the values are stored on the object instances. But as a mental model, there's dict classes and there's slots classes, where you pre-declare your attributes in this dunder slots iterable. And those get stored inline in the same allocation as the object instance.

26:40 There's no pointer chasing. What that means is that you can't set extra attributes on them that weren't pre-declared, but also things are a little bit more efficient. We create those automatically when you subclass from a struct type. And we do a bunch of other interesting things that are stored on the type. That is why we had to write a metaclass in C.
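
The dict-class versus slots-class distinction can be seen with plain Python, no msgspec involved (a sketch added for this transcript; `DictPoint` and `SlotPoint` are made-up names):

```python
class DictPoint:
    """Ordinary class: attributes live in a per-instance __dict__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class SlotPoint:
    """Slots class: attributes are pre-declared and stored inline."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


d = DictPoint(1, 2)
d.z = 3  # allowed: new attributes go into d.__dict__

s = SlotPoint(1, 2)
rejected = False
try:
    s.z = 3  # not pre-declared, so this fails
except AttributeError:
    rejected = True

print(hasattr(d, "__dict__"), hasattr(s, "__dict__"), rejected)
```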

26:59 I went to read it. I'm like, whoa, okay. Well, maybe we'll come back to this. There's a lot of stuff going on in that type.

27:03 This is one of the problems with these hobby projects, is that I wrote this for fun and a little bit work related, but mostly fun. And it's not the easiest code base for others to step into. It fits my mental model. Not necessarily everyone's.

27:15 Yeah. I can tell you weren't looking for VC funding cause you didn't write it in Rust.

27:19 Seems to be the common denominator these days.

27:22 Yeah.

27:23 Why C? Just because CPython's already in C and that's the... And I knew C. I do know Rust, but for what I wanted to do and the use case I had in mind, I wanted to be able to touch the C API directly. And that felt like the easiest way to go about doing it.

27:40 This portion of Talk Python to Me is brought to you by Posit, the makers of Shiny, formerly RStudio, and especially Shiny for Python. Let me ask you a question. Are you building awesome things? Of course you are. You're a developer or a data scientist. That's what we do. And you should check out Posit Connect. Posit Connect is a way for you to publish, share and deploy all the data products that you're building using Python.

28:05 People ask me the same question all the time. Michael, I have some cool data science project or notebook that I built. How do I share it with my users, stakeholders, teammates? Do I need to learn FastAPI or Flask or maybe Vue or React.js? Hold on now. Those are cool technologies and I'm sure you'd benefit from them, but maybe stay focused on the data project.

28:25 Let Posit Connect handle that side of things. With Posit Connect, you can rapidly and securely deploy the things you build in Python: Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, reports, dashboards and APIs. Posit Connect supports all of them. And Posit Connect comes with all the bells and whistles to satisfy IT and other enterprise requirements. Make deployment the easiest step in your workflow with Posit Connect. For a limited time, you can try Posit Connect for free for three months by going to talkpython.fm/posit.

28:57 That's talkpython.fm/P-O-S-I-T. The link is in your podcast player show notes.

29:03 Thank you to the team at Posit for supporting talk Python.

29:05 Okay. So from a consumer of this Struct class, I just say class... your example's User, so it's class User, parentheses, derives from Struct, and then field, colon, type. So like name, colon, str; groups, colon, set of str; and so on. It looks like standard data classes type of stuff. But what you're saying is your metaclass goes through and looks at that and says, okay, we're going to create a class called User, but it's going to have slots called name, email, and groups, among other things. Right? Like, it does that magic for us?

29:35 Yeah. And then it sets up a bunch of internal data structures that are stored on the type.

29:39 Okay. Like give me a sense of like, like what's, what's something, why, why do you got to put that in there? What's in there?

29:44 So the way data classes work after they do all the type parsing stuff, which we have to do too, they then generate some code and eval it to generate each of the model methods.

29:54 So when you're importing or when you define a new data class, it generates an init method and evals it and then stores it on the instance. That means that you have little bits of byte code floating around for all of your new methods. Msgspec structs instead, each of the standard methods that the implementation provides, which would be, you know, init, repr, equality checks, copies, you know, various things, are single C functions. And then the type has some data structures on it that we can use to define those. So we have a single init method for all struct types that's used everywhere. And as part of the init method, we need to know the fields that are defined on the struct. So we have some data stored on there about like the field names, default values, various things.

30:38 Nice.

30:39 Because they're written in C rather than, you know, Python byte code, they can be a lot faster. And because we're not having to eval a new method every time we define a struct, importing structs is a lot faster than data classes. Something I'm not going to guess, I have to look up on my benchmarks, but they are basically as efficient to define as a handwritten class where data classes have a bunch of overhead. If you've ever written a project that has, you know, a hundred of them, importing can slow down.

31:03 Yeah. Okay. Because you basically are dynamically building them up, right? In data class story.

31:07 Yeah. So you've got kind of the data class stuff. You've got, as you said, dunder init, repr, copy, et cetera. But you also have dunder match args for pattern matching. That's pretty cool. And dunder rich repr for pretty printing support with Rich. Yeah. If you just rich.print, it'll take that, right? What happens then?

31:25 It pretty-prints it, similar to how a data class would be rendered.

31:28 Rich is making a pretty big impact. So rich is special.

31:31 I enjoy using it.

31:32 This is excellent. You've got all this stuff generated. So much of it is in C and super lightweight and fast. But from the way we think of it, it's just a Python class, even a little less weird than data classes, right? Because you don't have to put a decorator on it. You just derive from this thing. So that's super cool. Yeah. Super neat.

31:50 The hope was that these would feel familiar enough to users coming from data classes or attrs or Pydantic or all the various models that learning a new one wouldn't be necessary.

32:00 They're the same.

32:02 Excellent. One difference if you're coming from Pydantic is there are no methods defined on these by default. So if you define a struct with fields A, B, and C, only A, B, and C exist as attributes on that class. You don't have to worry about any conflicting names.

32:15 Okay. So for example, the Pydantic ones have, I can't remember the V1 versus V2 name, it's like to-dictionary effectively, right? Where they'll dump out the JSON or strings or things like that.

32:28 In V1, there's a method .json(), which, if you have a field named json, will conflict.

32:33 They are remedying that by adding a model prefix for everything, which I think is a good idea. I think that's a good way of handling it.

32:40 Yeah. Yeah. It's like model underscore JSON or dict or something like that. Yeah. Cool.

32:44 Yeah. That's one of the few breaking changes that, unless you're deep down in the guts of Pydantic, you might actually encounter. Yeah. You don't have to worry about that stuff because you're more function based, right? You would say decode, here's some data, some JSON or something. And then the thing you decode it into would be your User type. You'd say type equals User rather than going to the User directly.

33:07 Right. Can we put our own, own properties and methods and stuff on these classes and that'll work all right?

33:13 Yeah. To a user, you should think of this as a data class that doesn't use a decorator. They should be identical unless you're ever trying to touch, you know, the dunder dataclass fields attribute that exists on data classes. There should be no runtime differences as far as you can tell.

33:27 And when you're doing the schema validation, it sounds like you're basically embracing the optionality of the type system. If you say int, it has to be there. If you say Optional int or int pipe None, it may or may not be there, right?

33:40 No, it's close. I'm going to be pedantic here a little bit. The optional fields are ones that have default values set. So Optional bracket int without a default is still a required field. It's just one that could be an int or None. You'd have to have a literal None passed in. Otherwise we'd error. This more closely matches how mypy interprets the type system.

33:57 Okay. So if I had an optional thing, but it had no value, it would have to explicitly set it to none.

34:02 Yes.

34:03 Or would, yeah. Or it'd have to be there in the data every time. Like other things, you have default factories, right? Passing a function that gets called if it does, I guess if it doesn't exist, right? If the data's in there, it's being deserialized, it won't. Okay. Excellent.

34:16 And I guess your, your decorator creates the initializer. But another thing that I saw that you had was you have this post init, which is really nice. Like a way to say like, okay, it's been deserialized. Let me try a little further. Tell us about this. This is cool.

34:30 Yeah. It's coming from data classes. They have the same method. So if you need to do any extra thing after init, you can use it here rather than trying to override the built-in init, which we don't let you do.

34:40 Right. Because it has so much magic to do, like let it do it. And yeah, you don't want to override that anyway. You don't have to deal with like passing all the arguments.

34:48 Yeah. It's, you know, runs Python instead of maybe C, all these things. Right. So post init would exist if you have more complex constraints, right?

34:57 Currently that's one reason to use it. We currently don't support custom validation functions. There's no validate decorator; various frameworks have different ways of defining these. We have some constraints that are built in. You can constrain, you know, a number to be greater than some value, but there's no way to specify custom constraints currently. It's on the roadmap. It's a thing we want to add. Post init's a way to hack around that. So right now you're looking at the screen, you have a post init defined and you're checking if low is greater than high, raise an error. And that'll bubble up through decodes and, you know, raise a nice user-facing validation error. In the long run, we'd like that to be done a little bit more field-based, similar to what you'd see coming from other frameworks.

35:34 It is tricky though, because you know, the validation goes onto one field or the other.

35:38 You don't have composite validators necessarily. Right. And so there are totally valid values of this low, but whatever it is, it has to be lower than high. Right. But how do you express that relationship? So I think this is awesome. Other areas where, you know, it could be interesting is, under some circumstances, maybe you've got to compute some field that's also in there, that's not set. I don't know. There are some good options in here. I like it a lot. Yeah. I guess the errors just come out as, straight out, something went wrong with dunder post init, right? Rather than field low has this problem.

36:08 It's a little harder to relate an error being raised to a specific field if you raise it in the post init. Yeah. Also, since you're looking at this and I'm proud that I got this to work, the errors raised in post init use chained exceptions. So you can see a little bit of the cause of where it comes from. Getting those to work in the Python C API is completely undocumented and a little tricky to figure out. A lot of reading how the interpreter does it and making the right, you know, 12 incantations to get them to bubble up. Right.

36:34 Yeah. I do not envy you working on this Struct, or this base class, but that's where part of the magic is. Right. And that's why I wanted to dive into this, because I think it behaves like Python classes, but it has these really special features that we don't normally get, right? Like low memory usage, high performance. Accessing the fields, is that any quicker or is it standard struct level of quick?

36:57 Attribute access and setting should be the same as any other class. Things that are faster are init, repr, not that that should matter. If you're looking for a high performance repr, you're doing it wrong. Seems like you're doing something wrong. Equality checks, comparisons. So sorting, you know, less than, greater than. I think that's it. Everything else should be about the same.

37:14 So field ordering, you talked about like evolution over time. Does it, does this matter?

37:19 Field ordering is mostly about defining what happens if you do subclasses and stuff. This whole section, if you're not subclassing, hopefully shouldn't be relevant to you. We match how data classes handle things for ordering.

37:29 Okay. So I could have my user, but I could have a super user that derives from my user that derives from struct and things will still hang together.

37:37 And so figuring out how all the fields order out through that subclassing, this doc is about.

37:42 Another typing system thing you can do is explicitly claim something as a class variable. You know, Python is weird about its classes and what makes a variable associated with a class or not. Right. So with these types of classes, you would say like class Example, colon, and then you have x, colon, int. Right. And that will appear on the static type, like Example.x, but it also imbues each object with its own copy of that x. Right. Which is a little bit, is it a static thing, part of the type, or is it not? It's kind of funky, but you also can say that explicitly from typing; you say this is a class variable. What happens then? Right.

38:20 So standard attributes exist on the instances where a class var exists on the class itself.

38:26 Class vars are accessible on an instance, but the actual data is stored on the class.

38:31 So you're not having an extra copy.

38:33 I see. So if there's some kind of singleton type of thing or just one of them. Yeah.

38:37 Yeah. It has to do with how Python does attribute resolution where it'll check on the instance and then on the type. And also there's descriptors in there, you know, somewhere.

38:47 Interesting. Okay. Like other things, I suppose it's pretty straightforward that you take these types and you use them to validate data. But one of the big differences with msgspec.Struct versus pydantic.BaseModel and others is the validation doesn't happen all the time. It just happens on encode, decode. Right. Like you could call the constructor and pass in bad data and it just doesn't pay attention. Right. Yeah. Why is it like that?

39:13 So this is one of the reasons I wrote my own thing rather than building off of something existing like Pydantic. Side tangent here just to add history context. Msgspec was started about three years ago, the JSON and it kind of fell into its full model about two years ago. So this has existed for around two years. Yeah. We're pre the Pydantic V2. Right. Anyway, the reason I wanted all of this was, when you have your own code, where bugs can come up are bugs in your own code. I've typed something wrong, I've made a mistake, and I want that to be checked. Or it can be user data coming in, or, you know, maybe it's a distributed system and it's still my own code. It's just a file or database.

39:47 Yeah. Whatever. Yeah. We have many mechanisms of testing our own code. You can write tests.

39:52 You have static analysis tools like mypy or pyright for checking. It's a lot easier for me to validate that a function I wrote is correct. There are other tools, I believe, that we should lean on rather than runtime validation in those cases. But when we're reading in external data, whether it's coming over the wire or coming from a file, coming from user input in some way, we do need to validate, because the user could have passed us something that doesn't match our constraints. Yeah. As soon as you start trusting user input, you're in for a bad time. We don't want to arbitrarily be trusting. We do validate on JSON decoding and on MessagePack decoding. We also have a couple of functions for doing in-memory conversions. So there's msgspec.convert, and msgspec.to_builtins for going the other way. So that's for doing conversion of runtime data that you got from somewhere, rather than from a specific format. Yeah. Because if you're calling this constructor and passing the wrong data, mypy should check that. PyCharm should check that. Maybe Ruff would catch it. I'm not sure, but there's a bunch of tools. Yeah. Ruff doesn't have a type checker yet.

40:50 Yeah. TBD on that. Yeah. Okay. Yeah. Being able to check these statically, it means that we don't have to pay the cost every time we're running, which I don't think we should. That's extra runtime performance that we don't need to be spending. Yeah. Definitely. Check it on the boundaries, right? Check it where it comes into the system and then should be good.

41:07 The other reason I was against adding runtime validation to these structs is I want all types to be on equal footing. And so if I am creating a list, the list isn't going to be doing any validation because it's, you know, the Python built-in. Same with data classes, same with attrs types, you know, whatever. And so only doing validation when you construct some object type that I've defined doesn't give parity across all types and might give a user, you know, misconceptions about when something is validated and when they can be sure it's correct versus when it hasn't been.

41:37 Yeah. Have you seen beartype? I have. Yeah. Beartype's a pretty interesting option. If people really want runtime validation, they could, you know, go in and throw beartype onto their system and let it do its thing. Even if you're not using it, you should read the docs just for the sheer joy that these docs are. Oh, they are pretty glorious. Yeah, I'll do it. You got it. They're burying the lede a little down here, but they describe it as: beartype brings Rust and C++ inspired zero-cost abstractions into the lawless world of dynamically typed Python by enforcing type safety at the granular level of functions and methods against type hints standardized by the Python community in O(1) non-amortized worst-case time with negligible constant factors. Oh my gosh. So much fun, right? They're just joking around here, but it's a pretty cool library if you want runtime type checking that's pretty fast. Okay. Interesting. You talked about the pattern matching. I'll come back to that. One thing I want to talk about, well, okay. Frozen, frozen instances.

42:33 This comes from data classes without the instances being frozen. The structs are mutable. Yeah.

42:38 I can get one, change its value, serialize it back out, things like that. But frozen, I suppose, means what you would expect, right? Yeah. Frozen has the same meaning as in data classes. How deep does frozen go? So for example, is it frozen all the way down? So in the previous example from Itamar, it had the top-level class and then other structs that were nested in there. If I say the top level is frozen, do the nested ones themselves become frozen? No. So frozen applies to the type. So if you define a type as frozen, that means you can't change values that are set as attributes on that type, but you can still change things that are inside it. So if a frozen class contains a list, you can still append stuff to the list. There's no way to get around that except if we were to do some deep, deep, deep magic, which we shouldn't. It would definitely slow it down if you had to go through and re-create frozen lists every time you saw a list and stuff like that. Yeah. Okay. And then there's one about garbage collection in here. Yeah. Which is pretty interesting.

43:32 There we go. Disabling garbage collection. This is under the advanced category. There's a warning box around this that tells you not to. What could go wrong? Come on. Part of this was experimenting with the Dask distributed scheduler, which is a unique application, I think, for people that are writing web stuff, in that all of its data is kept in memory. There's no backing database that's external. And so it is as fast to respond as, you know, the bits of in-memory computation that it needs to do before it sends out a new task to a worker.

43:59 So in this case, their serialization performance matters. But also it's got a lot of in-memory compute. You know, it's a dicts of types of, you know, lots of chaining down. The way the CPython garbage collector works is that these large dictionaries could add GC overhead.

44:16 Every time a GC thing happens, it has to scan the entire dictionary.

44:19 Any container thing could contain another. And once you do that, there could be a cycle.

44:22 And then for very large graphs, GC pauses could become noticeable. Yes. This was an experiment in seeing ways around that, because we've done some deep magic with how structs work.

44:32 We can disable GC for subclasses, user-defined types, which CPython does not expose normally and really isn't something you probably want to be doing in most cases. But if you do, you get a couple benefits. The types are smaller. Every instance needs to include some extra state for tracking GC. I believe on recent builds, it's 16 bytes. So it's two pointers.

44:54 So that's, you know, you're shaving 16 bytes per instance.

44:56 That's non-trivial. Yeah. If you got a huge list of them, that could be a lot.

44:59 And two, they're not traced. And so if you have a lot of them, that's a reduction in tracing overhead every time a GC pass happens.

45:08 GC puts more overhead on top of stuff than you would think. So I did some crazy GC stuff over at Talk Python Training, with my courses. You go to /sitemap.xml.

45:18 I don't know how many entries are in the sitemap, but there are 30,000 lines of sitemap, like many, many, many, many, many thousands of URLs have to come back with details just to generate that page in one request with the default Python settings in Python 3.10. I think it was, it was doing 77 garbage collections while generating this page. That's not ideal.

45:43 I switched it to just change or tweak how frequently the GC runs. So like every 70,000, no, every 50,000 allocations instead of every 700 and the site runs 20% faster now and uses the same amount of memory. Right. And so this is not exactly what you're talking about here, but it's in the, it plays in the same space as like you can dramatically change the things that are triggering this and dramatically change the performance potentially. The caveat is you better not have cycles.
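
The tweak described here is available from the standard library `gc` module. This sketch only shows the mechanism; the 50,000 figure is the one mentioned in the conversation, and the right value for any other application is something to measure rather than assume.

```python
import gc

# The first number is how many net container allocations trigger a
# generation-0 collection; the conversation cites 700 as the default.
original = gc.get_threshold()

# Raise only the first threshold, keeping the other two as they were.
gc.set_threshold(50_000, original[1], original[2])
current = gc.get_threshold()
```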

46:10 Yeah. So the other thing with these, as you pointed out, is that the indicator of when a GC pass happens has to do with how many GC-aware types have been allocated. And so if you mark a type as not a GC type, then the counter isn't incremented. You're not paying that cost.

46:24 So you can allocate all the integers you want all day long and it'll never affect the GC. But if you start allocating classes, dictionaries, tuples, et cetera, that is like, well, those could contain cycles. You have 700 more than you've deallocated since last time. I'm going to go check it.

46:37 One place this comes up is if you have, say, a really, really large JSON file, because any deserialization is inherently an allocation-heavy workload, which means that you can have a GC pause happen, you know, several times during it because you've allocated, you know, that many types. Turning off GC for these types lets you avoid those GC pauses, which gives you actual runtime benefits. A different way of doing this that is less insane is to just disable GC during the decode. Do a, you know, gc.disable, JSON decode, gc.enable, and you only do a GC pass once, especially because JSON as a tree-like structure can never create cycles. You're not going to be having an issue there.
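
The "less insane" option can be wrapped in a small context manager. This sketch uses the standard library `json` module as a stand-in for whichever decoder is in use; `gc_paused` is a hypothetical helper name, not part of any library.

```python
import gc
import json
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Temporarily disable the cyclic GC, restoring the previous state on exit."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


payload = json.dumps([{"id": i, "tags": ["a", "b"]} for i in range(10_000)])

with gc_paused():
    paused_state = gc.isenabled()  # False while inside the block
    records = json.loads(payload)  # allocation-heavy, but cannot create cycles
```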

47:12 But you're probably allocating a lot of different things that are container types. And so it looks to the GC like, oh, this is some really sketchy stuff. We better get on the game here.

47:22 But you know, as you said, there's no cycles in JSON. So there's a lot of scenarios like that, like database queries. You know, I got a thousand records back from a table. They're all some kind of container. So minimum one GC happens just to read back that data, but you know, there's no cycles. So why is the GC happening? Right. You can kind of control that a little bit, or you just turn the number up to 50,000 like I did. It still happens, but less, a lot less. Yeah. So this is pretty interesting though, that you just set GC equals false. Where do you set this? Is this like in the derived bit or?

47:54 It's part of the class definition. So we make use of class definition keyword arguments.

48:00 So it goes after the Struct type in the subclass. You do, you know, class MyClass, open parenthesis, Struct, comma, gc equals False, close parenthesis, colon, rest of the class.

48:10 Yeah. That's where I thought, but it is a little funky. I mean, it kind of highlights the meta class action going on there. Right. What else should people know about these structs?

48:19 They're fast and they can be used for not just serialization. So if you are just writing a program and you happen to have msgspec on your system, it should be faster to use them than data classes. Whether that matters is, of course, application dependent, but they're generally a good idea. They happen to live in this serialization library, but that's just because that's where I wrote them. Yeah. In a future world, we might split them out into a sub-package. Yeah. Fast struct, pip install fast struct.

48:42 Who knows? Yet to be named. So better than data classes. I mean, they have the capabilities of data classes, so that's cool, but better than straight up regular classes, like bare classes, you know, class, colon, name?

48:54 They're opinionated a little bit. They're how I think people probably should be writing classes, and they're opinionated in a way that means that you can't write them in ways that I don't want you to. So the way a struct works is you define attributes on it using type annotations and we generate a fast init method for you. We don't let you write your own init.

49:11 In the subclass, you can't override init. The generated one is the one you get. That means that like if you're trying to create an instance from something that isn't those field names, you can't do that. You need to use a new class method for writing those.

49:23 I believe this is how people, at least on projects I work on, generally use classes.

49:28 So I think it's a fine limitation, but it is putting some guardrails around the arbitrariness of how you can define a Python class. You could have a, you know, a handwritten class that has two attributes, X and Y, and your init takes, you know, parameters A and B.

49:43 Sure. Or maybe it just takes X and it always defaults Y unless you go and change it after or whatever. Right. I guess you could do sort of do that with default values, right? But you couldn't prohibit it from being passed in. I'm feeling some factory classes.

49:56 The attrs docs have a whole page telling people about why this pattern is better and nudging them to do this. So this isn't a new idea.

50:03 Yeah. Go check out attrs and see what they're saying as well. Huh? There's probably a debate in the issues somewhere on GitHub. There always is a debate. Yeah. Let's see.

50:10 Let's go get a bunch of stuff up here. I want to talk about, I guess really quickly, since there's a lot of C native platform stuff, right? This is available on, you know, pip install msgspec. We're getting the wheel. It seemed like it worked fine on my M2 MacBook Air. What are the platforms where I get a wheel and don't have to worry about compiling?

50:31 So we use cibuildwheel for building everything. And I believe I've disabled some of the platforms.

50:37 The ones that are disabled are mostly disabled because CI takes time and you need to minimize them, but everything common should exist, including Raspberry Pi and various ARM builds.

50:46 Excellent. Okay. Yeah. It seemed like it worked just fine. I didn't really know that it was doing a lot of native code stuff, but it seems like it. And also available on Conda, conda-forge. So that's cool. If you Conda, you can also just conda install it. I kind of promised talking about the benchmarks a little bit, didn't I? So benchmarks are always.

51:04 If you click on the graph on the bottom, it'll get, bring you to it.

51:06 Yeah. They're always rife with, like, that's not my benchmark, I'm doing it differently, you know. But give us a sense of just, it says fast, in italics, leaning forward. Give us a sense of, where does this land? Is it, you know, 20% faster or is it a lot better?

51:22 Yeah. So as you said, benchmarks are a problem. The top of this benchmark docs has a whole argument against believing them and telling you to run your own. So take it with a grain of salt. I started benchmarking this mostly just to know how we stacked up. It's important if you're making changes to know if you're getting slower, it's also important to know what the actual trade-offs of your library are. All software engineering is trade-offs.

51:40 So msgspec is generally fast. The JSON parser in it is one of the fastest in Python or the fastest, depending on what your message structure is and how you're invoking it. It at least is on par with orjson, which is generally what people consider to be the fast parser.

51:57 Right. That's, that's where they go when they want fast. Yeah.

51:59 So if you are specifying types, so if you, you know, add in a type annotation to a JSON decode call with msgspec, even if you're decoding the whole message, you're not doing a subset, we're about 2x faster than orjson. You actually get a speedup by defining your types because struct types are so efficient to allocate versus a dict.

52:17 That's kind of the opposite of what you might expect, right?

52:19 It seems like we're doing more work, but we're actually able to do less because we can take some more, you know, efficient fast paths.

52:26 And then a thousand objects with validation compared to, yeah, mashumaro, cattrs, Pydantic and so on. Probably the last one.

52:35 This was a grab bag of various validation libraries that seemed popular. Mashumaro is the one that dbt uses. I think they're the primary consumer of that. Cattrs is for attrs. Pydantic is, you know, ubiquitous. This right here, this benchmark graph we're looking at, is against Pydantic V1. I have not had a chance to update our benchmarks to go against V2. There's a separate gist somewhere that has got some numbers there.

52:57 The standard number they throw out is like 22 times faster. So it still puts you multiples faster.

53:03 In that benchmark, we're averaging 10 to 20 X faster than Pydantic V2. In numbers I run against V1, we're about 80 to 150 X faster. So it really is structure dependent.

53:13 Yeah, sure. You have one field or do you have a whole bunch of stuff?

53:17 Yeah, exactly. And what types of fields? To get more into the weeds here, JSON parsing is not easy. MessagePack parsing is, like, the format was designed for computers to handle it. It's, you know, seven bytes in there is an integer here.

53:31 Yeah. Okay. Where, where JSON is human readable and parsing strings into stuff is slow.

53:35 Right. The flexibility equals slowness. Yeah.

53:38 Our string parsing routines in msgspec are faster than the ones used by orjson.

53:44 Our integer parsing routines are slower, but there's a different trade off there.

53:48 Interesting. Okay. Yeah. I think this just seems so neat. There's so much flexibility, right? With all the different formats. And the restrictions on the class, they exist on Struct, but they're not insane. Right? I mean, you build proper OOP types of things. You don't need super crazy hierarchies. That's where you get in trouble with that stuff anyway. So don't do it. I guess we don't have much time left. One thing I think we could talk about a bit, if I find it, is the extensions. Just maybe talk about parsing stuff that is kind of unknown.

54:16 This is pretty interesting.

54:18 So the way we allow extension currently, this is, there's an intention to change this and expand it, but currently extending, adding new types is done via a number of different hooks that are called when a new type is encountered. So custom user defined type of some form.

54:32 I liked doing this rather than adding it into the annotation, because if I have a new type, I want it to exist probably everywhere. And I don't want to have to keep adding in "use this serializer and deserializer" as part of the type annotations. So to define a new type that you want to encode, you can add an encode hook, which takes in the instance and returns something that msgspec knows how to handle. This is similar to, if you're coming from standard library json, there's a default callback. It's the same. We renamed it to be a little better name in my mind, but it's the same thing.

55:02 Right. So your example here is taking a complex number, but storing it as a tuple of real and imaginary numbers, but then pulling it back into a proper complex number object.

55:13 Super straightforward. Yeah. But makes it possible. Yeah. Yeah. That's really cool.

55:16 So people can apply this, and I guess it doesn't really matter what the output destination is, does it? Your job here is to take a type that's not serializable and turn it into one that is, and then whether that goes to MessagePack or JSON or whatever, it's kind of not your problem.

55:31 Yeah. And then the decode hook is the inverse. You get a bunch of stuff that is, you know, core types and strings, whatever, and you compose them up into your new custom type.

55:39 Jim, I think we're getting about out of time here, but I just want to point out, if people hit the user guide, there's a lot of cool stuff here, and there's a whole performance tips section that people can check out. You know, if we had more time, maybe we'd go into them. But, for example, you can call msgspec.json.encode, or you can create an encoder and say the type and stuff, and then reuse that, right? Those kinds of things. Yeah.

56:01 There's another method that is, again, a terrible internal hack for reusing buffers. So you don't have to keep allocating byte buffers every message. You can allocate a byte array once and use it for everything. Save some memory.

56:13 Let me ask, Ellie's got a question. I'm going to read some words that don't mean anything to me, but they may to you. How does the performance of MessagePack plus msgspec with the array_like=True optimization compare to FlatBuffers?

56:26 So by default objects, so struct types, data classes, whatever, encode as objects in the stream. So a JSON object has keys and values, right? If you have a point with fields X and Y, it's got X and Y, you know, one, two. We have an array-like optimization, which lets you drop the field names. And so that would instead encode as an array of, you know, one comma two, dropping the X and Y reduces the message size on the wire. If the other side knows what the structure is, it can, you know, pull that back up into a type.

56:53 In terms of MessagePack as a format, plus the array_like optimization, the output size should be approximately the same as you would expect to come out of FlatBuffers.

57:03 The Python FlatBuffers library is not efficient for creating objects from the binary. So msgspec is going to be a lot faster to pull it in. Obviously, this is then a very custom format. You're doing a weird thing. And so compatibility with other ecosystems will be slower, or not slower necessarily, but you'll have to write them yourself. Not everything knows how to read MessagePack.

57:23 More brittle potentially.

57:24 Yeah.

57:25 Yes.

57:26 Yeah. Yeah.

57:27 But for Python, talking to Python, that's probably the fastest way to go between processes.

57:29 And probably a lot faster than JSON or YAML or something like that.

57:33 Okay. Excellent. I guess, you know, there's many more things to discuss, but we're going to leave it here. Thanks for being on the show. Final call to action. People want to get started with msgspec. Are you accepting PRs if they want to contribute, and what do you tell them?

57:47 First, I encourage people to try it out. I am available, you know, to answer questions on GitHub and stuff. It is obviously a hobby project. So, you know, if the usage increases significantly, we'll have to get some more maintainers on, and hopefully we can make this more maintainable over time.

58:01 But once the sponsor funds exceed 10,000, 20,000, 30,000 a month, like, you'll reevaluate. No, just kidding.

58:08 Sure. Sure. But yeah, please try it out. Things should work and hopefully be faster than what you're currently using, and hopefully intuitive to use. We've done a lot of work to make sure the API is friendly.

58:17 Yeah. It looks pretty easy to get started with. The docs are really good.

58:20 Oh, thank you.

58:21 Congrats on the cool project. Thanks for taking the time to come on the show and tell everyone about it.

58:25 Thanks.

58:26 Yeah. See you later.

58:27 Bye.

58:28 This has been another episode of Talk Python to Me. Thank you to our sponsors. Be sure to check out what they're offering. It really helps support the show. This episode is sponsored by Posit Connect from the makers of Shiny. Publish, share and deploy all of your data projects that you're creating using Python. Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards and APIs. Posit Connect supports all of them. Try Posit Connect for free by going to talkpython.fm/posit. P-O-S-I-T.

58:57 Want to level up your Python? We have one of the largest catalogs of Python video courses over at Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async. And best of all, there's not a subscription in sight. Check it out for yourself at training.talkpython.fm. Be sure to subscribe to the show. Open your favorite podcast app and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.

59:49 [MUSIC]

59:51 [END]
