#338: Using cibuildwheel to manage the scikit-HEP packages Transcript
00:00 How do you build and maintain a complex suite of Python packages?
00:03 Of course, you want to put them on PyPI, and the best format there is a wheel.
00:08 This means that when developers use your code, it comes straight down and requires no local tooling to install and use.
00:15 But if you have complex dependencies such as C or Fortran, then you have a big challenge.
00:20 How do you automatically compile and test against Linux, macOS (that's Intel and Apple Silicon), Windows 32 and 64 bit, and so on?
00:31 That's the problem solved by cibuildwheel.
00:34 On this episode, you'll meet Henry Schreiner.
00:36 He's developing tools for the next era of the Large Hadron Collider and is an admin of Scikit-HEP.
00:42 Of course, cibuildwheel is central to that process.
00:47 This is Talk Python to Me episode 338, recorded October 14, 2021.
01:05 Welcome to Talk Python to Me, a weekly podcast on Python.
01:08 This is your host, Michael Kennedy.
01:10 Follow me on Twitter where I'm @mkennedy, keep up with the show and listen to past episodes at 'TalkPython.FM', and follow the show on Twitter via @talkpython.
01:19 We've started streaming most of our episodes live on YouTube. Subscribe to our YouTube channel over at 'TalkPython.FM/youtube' to get notified about upcoming shows and be part of that episode.
01:31 Hey there. I have some exciting news to share before we jump into the interview.
01:34 We have a new course over at Talk Python.
02:15 Check it out over at 'TalkPython.FM/htmx' or just click the link in your podcast player show notes.
02:21 Now let's get on to that interview. Henry, welcome to Talk Python to Me.
02:25 Thank you.
02:26 It's great to have you here.
02:28 I'm always fascinated with cutting edge physics with maybe both ends of physics.
02:34 I'm really fascinated with astrophysics in the super large and then also the very small, and we're going to probably tend a little bit towards the smaller, high energy things this time around, but so much fun to talk about this stuff and how it intersects Python.
02:48 So the smallest things you can measure and some of the largest amounts of data you can get out.
02:53 Yeah, the data story is actually really crazy, and we're going to talk a bit about that.
02:59 So much stuff like we used to think that atoms were the smallest things to get. Right. I remember learning that in elementary school, like, there are these things called atoms.
03:08 They combine to form compounds and stuff. And that's as small as it gets.
03:12 Not so much, right?
03:13 Yeah. That was sort of what "atom" was supposed to mean.
03:16 Exactly the smallest bit, but Nope.
03:19 But that name got used up. So there we are.
03:22 All right.
03:22 Well, before we get into all that stuff, though, let's start with your story. How did you get into programming and Python?
03:27 I started with a little bit of programming that my dad taught me. He was a physicist, and I remember it was C++ and sort of taught the way you teach Java, all objects and classes just a little bit.
03:41 And then when I started at college and I wanted to take classes, and I took a couple of classes again in C++, I just really loved objects and classes.
03:52 Unfortunately, the courses didn't actually cover that much, but the book did. So I really got into that.
03:57 And then for Python, actually, right when I started College, I started using this program called Blender.
04:02 Oh, yeah.
04:03 Blender. I've heard of Blender. It's like a 3D animation tool, like Maya or something like that, right?
04:08 And it's very Python friendly, right?
04:11 It has a built in Python interpreter.
04:13 So I knew it had this built in language called Python. So that made me really want to learn Python.
04:17 And then I went to a Research Experience for Undergraduates at Northwestern University in Chicago, and when I was there, we had this cluster that we were working on. This was in solid state physics, material physics.
04:32 And we would launch these simulations on the cluster.
04:37 And so I started using Python, and I was able to write a program that goes out, and it would create a bunch of threads. And it would watch all of the nodes in the cluster. And as soon as one became available, it would take it. So my simulation could just take the entire cluster. After a few hours, I would have everything.
04:54 So at the end of that, everybody hated me, and everybody wanted my scripts.
04:59 They're like, this is horrible.
05:01 I can't believe you did that to me, but I'll completely forgive you if you just give it to me and only to me because I need that power.
05:08 That's fantastic.
05:11 I think that is one of the cool things about Python, right? Is that it has this quick, prototyping approachability, like, I'm just going to take over a huge hardware, like a huge cluster of servers.
05:23 But it itself doesn't have to be, like, intense programming. It could be like this elegant little bit of code.
05:29 You can sort of do things that normally I think the programming gets in the way more, but Python tends to stay out. It looks more like pseudocode, so you can do more and learn more.
05:38 And eventually you can go do it in C++ or something.
05:43 Or maybe not.
05:45 Sometimes you do need to go do it in some other language, and sometimes you don't.
05:48 I think the stuff at CERN and LHC has an interesting exchange between C++ and maybe some more Python
05:56 and whatnot, so that'll be fun to talk about.
05:59 We were C++ originally, but Python is really showing up in a lot more places, and there's been a lot of movement in that direction.
06:08 There's been some really interesting things that have come out. A lot of interesting things have come out of the LHC computing-wise as well. Awesome.
06:15 As a computing bit of infrastructure, there's a ton going on there. And as physics, it's kind of the center of the particle physics world.
06:23 So it's got those two parallel things generating all sorts of cool stuff.
06:27 I want to go back to just really quickly.
06:29 You talked about your dad teaching a little programming.
06:31 If people are out there and they're the dad, and they want to teach their kids a little bit of programming, I want to give a shout out to CodeCombat.com.
06:38 Such a cool place.
06:40 My daughter just yesterday was like, hey, dad, I want to do a little Python.
06:43 Remember that game that taught me programming? Like, yeah, sure. So she logged in and started playing and basically solved a dungeon interactively by writing Python. And it's such an approachable way. But it's not the, like, drag-and-drop fake stuff. You write real Python, which I think is a cool way to introduce kids. So anyway, shout out to them. I had them on the podcast before, but it's cool to see kids taking to it in that way.
07:05 Whereas you say you could write a terminal app, they're like, I don't want to do that, but solve a dungeon.
07:10 They could do that.
07:11 I actually played with a couple of those. They're actually really fun just to play.
07:14 Yeah, they are.
07:15 I did, like 40 Dungeons along with my daughter. It was very cool.
07:18 How about now? What do you do now?
07:21 I work in a lot of different areas, and I jump around a lot.
07:25 So I do a mix of coding.
07:27 I do some work on websites because they just needed maintenance, and somehow I got volunteered and some writing less coding than I would like. But I definitely do get to do it, which is fun.
07:40 Yeah. And this is at CERN or your University or where is this?
07:44 So now I'm at Princeton University, and I'm part of a local group of RSEs, Research Software Engineers, and I'm also part of IRIS-HEP, which we'll talk about a little bit, but that's sort of a very spread out group.
08:00 Some of us are at CERN, a few are in some other places,
08:05 Fermilab. And physicists are just used to working remote.
08:09 The pandemic wasn't that big of a change for us. We were already doing all our meetings remote. We just eventually changed from Vidyo to Zoom. But other than that.
08:17 Exactly, it was real similar for me as well. That's interesting.
08:20 Fermilab, that's in Chicago, outside Chicago, right?
08:23 Is that still going? I got the sense that that was shutting down.
08:26 They're big in neutrino physics.
08:28 They do a lot of neutrino things there, and then they're also very active just in the particle physics space. So you may be at Fermilab, but working on CERN data. I see.
08:38 Yeah. I got to tour that place a little bit.
08:41 And it's a really neat place.
08:43 It is. CERN is a neat place, too.
08:44 I would love to tour CERN, but it wasn't 20 minutes down the street from where I happen to be.
08:50 I didn't make it there.
08:52 Sadly, I hope to get back there someday.
08:54 All right.
08:54 Well, let's talk about sort of the scikit-HEP side of things and how you got into maintaining all of these packages.
09:05 So you found yourself in this place where you're working on tools that help other people build packages for the physicists and data scientists, and so on. Right.
09:14 So where'd that all start?
09:16 So with maintenance itself, the first thing I started maintaining was a package called 'Plumbum' back in 2015, and at that point, I was starting to submit some PRs, and the author came to me and said, I would like to have somebody do the releases.
09:33 I need a release manager.
09:34 I don't have enough time. And I said, sure, I'd be happy to do it. It was exciting for me because it was the first package, or real package, I got to join, and I think the page might even still have the original news item where it says welcome to me.
09:50 So that was the first thing I started maintaining.
09:53 And then I was working on a physics tool called GooFit when I became a postdoc, and I worked on sort of really renovating that. It started out as code written by physicists, and I worked on making it actually installable and packaged nicely, and worked with a student to add Python bindings to it, things like that.
10:15 And as part of that, I wrote a C++ package, CLI11, the first package I actually wrote and then maintained in C++. It was written for GooFit, but now it's fairly...
10:28 I think it's done pretty well on its own.
10:31 Microsoft Terminal uses it.
10:33 Microsoft Terminal uses it.
10:35 Oh, nice.
10:36 Yeah. I'm a big fan of Microsoft Terminal.
10:38 I've for a while now kind of shied away from working on Windows because the terminal experience has been really crummy.
10:45 The cmd.exe command prompt style is just like, oh, why is it so painful?
10:50 And people who work in that all day, they might not see this as painful, but if you get to work in something like a macOS terminal or even, to not quite the same degree, but still, in like a Linux one,
11:00 Then all of a sudden it kind of gets there, but I'm kind of warming up to it again with Windows Terminal.
11:07 iTerm is one of the reasons I really moved to Mac, because I loved iTerm, and then Windows Terminal is amazing.
11:14 It's a great team working on it, including the fact that they used my parser, but it's actually quite nice.
11:22 The only problem I have in Windows 10 is it's really hard to get the thing to show up instead of seeing the CMD prompt.
11:29 In Windows 11,
11:31 I definitely think it's included now, which is great.
11:34 So CLI11, this is a C++11 command line parser.
11:39 Like Click or argparse or something like that. But for C++, right?
11:43 Yes. It was designed off of the Plumbum command line parser. Plumbum is sort of a toolkit that has several different things.
11:49 I wish those things had been pulled out because I think on their own they might have maybe even been popular on their own.
11:55 It has a really nice parser, but CLI11 is designed off of that and off Click.
11:59 It has some similarities to both of those.
12:02 Yeah. I think probably that's a challenge.
12:05 We're going to get into Scikit-HEP with a whole bunch of these different packages, but finding the right granularity of what is a self-contained unit that you want to share with people, versus things like pulling out a command line parser rather than some other library, right?
12:20 This is a careful balance.
12:22 It's a bit challenging. I think in Python, there's a really strong emphasis on having individual separate pieces and packages, partially because it has a really good packaging system, and being able to take just pieces and swap out one that you don't like is really nice.
12:41 And that's one of the things we'll talk about with the PyPA as well. That's one of the things that they focus on: small individual packages that each do a job, versus all-in-one, like Poetry.
12:52 Well, you'll have to do some checking or some fact checking. Balancing modernizing.
12:58 For me. I did professional C++ development for a couple of years, and I really enjoyed it until there were better options.
13:05 And then I was like, Why am I still doing this?
13:08 I would go work on those.
13:09 But one of the things that struck me as a big difference to that world is basically the number of libraries you use.
13:17 The granularity of the libraries you use, the relative acceptance of things like pip and the ease of using another library, right? In C++,
13:27 you've got the header and you've got the linked file and you've got the DLL.
13:31 There's, like, all sorts of stuff that can get out of sync and go crazy and make weird crashes. Your app just goes away, and that's not great.
13:40 Is that still true? I feel like that difference is one of the things that allows for people to make these smaller composable pieces in Python.
13:47 I think that has a lot to do with it. What has happened in C++ is there's sort of a rise of a lot of header only libraries, and these libraries are a lot easier to just drop into your project because all you do is you put in the headers and you don't have to deal with a lot of the original issues. So a lot of these small standalone libraries are header only.
14:10 And one of the next things that I picked up as a maintainer was Pybind11, and I've sort of been in that space between C++ and Python for quite a bit.
14:22 Kind of like being in that area.
14:25 Joining the two. From listening to the things that you've worked on previously and things like this, it sounds like you're interested in connecting and enabling, piecing things together.
14:34 Like here's my script that's going to pull together the compute on this cluster or here's this library that pulls together Python and C++ and so on.
14:42 Yes, making different things work together and combining things like C++ and Python, or combining different packages in Python and piecing together a solution.
14:50 I think that's one of Python's strengths versus something like MATLAB. I spent quite a bit of time in MATLAB early on and got to move a lot of stuff over to Python.
14:59 That's awesome.
14:59 It was really nice that we didn't have to have a license and things like that.
15:02 I know it's so expensive, and then you get the what are they called toolkits, the Add on tool kits, and they're like, each tool kit is the price of another $1,000 a year or $2,000 a year. It's ridiculous.
15:14 So I know of CFFI, which is a way for Python and C to get clicked together in a simple way.
15:25 How's Pybind11 fit into that? This is seamless interoperability between C++11 and Python.
15:31 How is it different from CFFI?
15:34 I teach, like a little short course where I can go through the different sort of different binding tools, and it usually ends with me saying Pybind11 is my favorite.
15:42 Yeah. Cool.
15:43 Give us an overview of what the options are. CFFI is closer to ctypes. It's more focused on C versus C++, and it's actually the one I've used the least.
15:54 I was just talking with the CFFI developer, but I've used it the least of those. I think it basically parses your C headers and then automates a lot of what you would have to do manually in ctypes, where you have to specify what symbol you want to call and what the arguments are and what the return type is. And if one of those things is wrong, you get a segfault and that sort of thing.
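As a rough illustration of the manual ctypes workflow described here (name the symbol, declare the argument types, declare the return type), the following sketch calls one function from the C API of the running CPython interpreter. The choice of `PyOS_string_to_double` is only an example picked because it needs no extra shared library; it is not something mentioned in the conversation.

```python
import ctypes

# ctypes.pythonapi is a handle to the C library of the running CPython
# interpreter, so no separate shared library has to be located.
# C signature being wrapped:
#   double PyOS_string_to_double(const char *s, char **endptr,
#                                PyObject *overflow_exception);
string_to_double = ctypes.pythonapi.PyOS_string_to_double

# The signature is stated by hand; nothing checks it against a header.
string_to_double.argtypes = [
    ctypes.c_char_p,  # s: the C string to parse
    ctypes.c_void_p,  # endptr: NULL means the whole string must parse
    ctypes.c_void_p,  # overflow_exception: NULL means no custom exception
]
# Without this line ctypes would assume the function returns a C int.
string_to_double.restype = ctypes.c_double

value = string_to_double(b"1.5", None, None)
print(value)
```

If `argtypes` or `restype` did not match the real C signature, the call could return garbage or crash the process, which is the failure mode mentioned above.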
16:17 Whereas pybind11,
16:19 this is about building modules, extension modules.
16:22 And the interesting thing about this is that it's written in pure C++. Of the other tools out there, Cython can do this. It's not what it was designed for, but it immediately became popular for doing this, because Cython turns
16:37 Python-like code, which is a new language, into C or C++ by transpiling it. Which one is a toggle you can change.
16:46 And then when you're there, you can now call C or C++, but it's extremely verbose, and you repeat yourself and you have to learn another language.
16:54 This weird combined Python thing, and just thinking in Cython is difficult because you have to think about, well, am I in Python, or am I in Cython that's going to be bound to Python, or am I in Cython that's just going straight to C, or am I just in C or C++?
17:10 But I've actually used it's a lot of layers there.
17:12 But pybind11 is just C++, and it's basically the C API for Python, but as a C++ API. It's quite natural, and you don't have to learn a new language. It uses fairly advanced C++, but that's it. You're learning something useful anyway.
17:30 So do you do some sort of template type thing and then say, I'm going to expose this class to Python or something like that, and then it figures it out? Does it write the Python code, or is it writing the built .so files, or what do you do here?
17:45 It compiles into the C API calls, and then that would compile into a .so. So there's no separate step like Cython or SWIG or these other tools, because it's just C++ that you compile like any other C++, but it's actually internally using the CPython API or PyPy's wrapper for it.
18:04 And the language looks a lot like Python. The names are similar.
18:07 You just do a def to define a function and give it the name, and then you just pass it the pointer to the underlying thing. It can figure out things like types and stuff like that for you.
18:17 Add a docstring if you want, give the arguments names.
18:19 You can make it as Pythonic as you want.
18:21 It's verbose, but it's not overly verbose.
18:24 Yeah, that's really neat.
18:26 And for people who haven't used those kinds of outputs, basically, it's just import module name, whether it's a .py file or it's a .so file.
18:37 If you've used PyTorch, if you've used SciPy, if you've used one of those things, you've been importing some pybind11 code.
18:45 So let's talk a little bit about Scikit-HEP.
18:48 This is one of the projects that has a lot of these packages inside of it. And your library cibuildwheel is one of the things that is used to maintain and build all those packages, because I'm sure they have a lot of interesting and oddball dependencies.
19:09 I mean, C++ is kind of standard, but there's probably others as well. Right.
19:13 It is.
19:14 One thing that is somewhat unique to HEP is that we are very heavily invested in C++.
19:20 It's usually either you're going to see Python or you're going to see some sort of C++ package of some sort.
19:26 It varies in size there, but it's mostly C++ or Python. We really haven't used other languages as much for the past 39 years.
19:36 Is that inertia or is that by choice?
19:39 Why is that?
19:40 I think it's partially. The community is a fairly cohesive community.
19:46 We're really used to sort of working together. The experiments themselves often might be 1000 or several thousand physicists working on a single experiment.
19:56 And we have been fairly good about sort of meeting together and sort of deciding the direction that we want to go in and sort of sticking to that.
20:04 So for C++, it was heavily ROOT, which is a giant C++ framework, and it's got everything in it.
20:13 And that was C++, and that's what everybody used.
20:18 If I was going to write code that would run and interact with, like, the grid computing or the data access and all that kind of stuff at LHC,
20:27 I would use this ROOT library if I was doing that in C++, right?
20:31 You might be using interpreted C++, which is something we invented.
20:34 Oh, okay.
20:36 This is interesting.
20:37 Is this something people can use?
20:39 Actually, CINT was the original interpreter, and then it got replaced by Cling, which is built on LLVM.
20:47 And I think recently it was merged into mainline LLVM as clang-repl, I think it's called, but sort of a lightweight version.
20:57 It's a C++ interpreter. You can actually get xeus-cling, which I think is from QuantStack, and they package it as well. I think it's just xeus-cling.
21:09 Very interesting.
21:10 C++ really wasn't designed for a notebook, though.
21:13 It does work, but you can't rerun a cell often because you can't redefine things. Python is just really natural in a notebook. And C++ is not.
21:22 Especially if you change the type.
21:23 You compile it as an Int, and they're like, that should be a string.
21:26 That's not going to be a string. It's compiled. Yeah.
21:29 So it seems to me like the community at CERN has decided we need some low level stuff and there's some crazy low level things that happened over there. People can check out a video.
21:40 Maybe I'll mention it a little bit later, but for that use, they've sort of gravitated towards C++, and then for the other aspects, it sounds like Python is what everyone agreed to. Say, hey, we want to visualize this. We want to do some notebook stuff.
21:53 We want to piece things together, something like that. Right.
21:56 It's certainly moving that way.
21:59 They definitely have sort of agreed that Python should be a first class language.
22:05 And join C++. That was decided a few years ago, and I think that's been a great step in the right direction, because what was happening, people are coming in with Python knowledge. They wanted to use Pandas, and I came in that way as well.
22:18 Pandas and Numba and all these tools were really nice, and we were basically just having to write them all ourselves in C++.
22:26 It has a data frame. But why not just use Python, which is what people know?
22:31 Anyway, Pandas exists.
22:33 There's a ton of people already doing the work maintaining it for us.
22:37 Literally has a string class, literally.
22:41 They do everything.
22:44 And that's the idea behind Scikit-HEP: to build this collection of packages that would just fill in the missing pieces, the things that high energy physicists were used to and needed. Some of them are general
22:58 and were just gaps in the data science ecosystem. And some things are very specific
23:02 to high energy physics. Scikit-HEP actually sort of originated as a single package.
23:09 It sort of looked like ROOT at first.
23:13 And it was invented by someone called Eduardo Rodrigues, who was actually in my office at CERN. We were office mates.
23:20 He did something I think really brilliant when he did this. And that is, he created an organization called Scikit-HEP around it.
23:26 And then he went out and spoke with people and got some of the other Python packages that existed at the time to join Scikit-HEP moved them over and started building a collection of some of the most popular Python packages at the time.
23:38 And I thought that was great.
23:40 And I really wanted Scikit-HEP to become a collection of separate tools, and for the scikit-hep package to just be sort of a meta package that just grabbed all the rest.
23:50 And that's actually kind of where it is now.
23:52 I can Pip install Scikit-HEP. Is that right?
23:54 You can. And mostly other than a few little things that are still in there that never got pulled out that will mostly just install our most popular maybe 15 or so packages.
24:04 Only 15 of our most popular packages.
24:06 Yeah. So it probably doesn't really do anything other than say it depends on those packages or something like that, right? Then you get them by virtue of installing it. Almost entirely.
24:16 It's a really cool idea. And I like it. So maybe one of the things I thought would be fun is to go through some of the packages there to give people a sense of what's in here.
24:26 Some of these are pretty particular, and I don't think they would find broad use outside of CERN, for example, conda-forge ROOT.
24:34 It sounds like that's about building ROOT, so I can install it as a dependency or something like that, right?
24:39 Building ROOT is horrible.
24:40 And you actually now can get it as a conda package, which is just way better than anything that was available before for attaching to a specific version of Python, because it has to compile against a very specific version of Python, but that's what it does. So unless you want something in ROOT, that's very HEP specific.
25:01 As for some of the more general ones, I should probably briefly mention that our very first package that was really popular among high energy physicists, that we actually produced, was Uproot, which was just a pure Python package
25:18 that read ROOT files. Again, very specific for somebody who was in high energy physics, but you could actually read a ROOT file and get your data without installing ROOT. And that was a game changer.
25:31 And now you can actually install ROOT slightly more easily, but normally it's a multi-hour compile, and it's gotten better. But it's still a bit of a beast to compile, especially for Python.
25:41 That doesn't sound like a beast.
25:42 Oh, my God.
25:42 Now you can just read in your files.
25:44 Basically, Jim Pivarski just taught Python to understand the decompiled ROOT file structure, and it actually can write now, too. But originally it was just reading. But that's like...
25:58 If I want to create a notebook and maybe visualize some of the data, but I don't really need access to anything else.
26:03 I shouldn't depend on this beast of almost its own operating system type of thing.
26:09 We were very close to being able to use all the data science tools in Python, Pandas, things like that.
26:14 But most data worked fine.
26:16 You just had to get the data.
26:17 And I've done this too, where I had one special install of Python and ROOT together that I had worked several hours on, and it sat somewhere, and I would convert data with it. I'd move it to HDF5, and then I would do all the rest of the analysis in a Python that didn't have it.
26:33 Because then I can use Python libraries that read that HDF5 format, right?
26:39 The first package we had that was really popular on its own was Awkward Array.
26:46 I heard about this one.
26:47 Yeah, that was originally part of Uproot, it sort of grew out of Uproot.
26:51 When you're reading root files, you end up with these jagged arrays so that's an array that is not rectangular, so at least one dimension is jagged. It depends on the data.
27:03 And this shows up in all sorts of places, not just particle collisions, but obviously it shows up lots of places in particle collisions. Like how many hits got triggered in the detector. That's a variable length list.
27:13 How many tracks are in an event that's a variable length list and can be a variable length list of structured data.
27:18 And to store that compactly, the same way you'd use NumPy, was one thing, but you can use Arrow, and there's some other things that do this.
27:28 But Awkward Array also gives you NumPy-like indexing and data manipulation.
27:34 And that was the sort of breakthrough thing here the original one was built on top of NumPy.
27:40 The new one actually has some pybind11 compiled bits and pieces, but it handles that really well. In fact, Jim Pivarski got a grant to expand this to,
27:52 I don't remember the number of different disciplines that he's working with, but lots of different areas, genomics and things like that, all have use cases, and he's adding things like complex numbers and things that weren't originally needed by high energy physicists, but make it more widely useful.
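To make the jagged-array idea concrete, here is a minimal standard-library sketch of one common way to store variable-length lists compactly: a single flat buffer of values plus an array of offsets marking where each list starts and ends. The names `content`, `offsets`, `get_event` and `counts` are illustrative only, and this is not the Awkward Array API; the library's internals are far more general.

```python
from array import array

# Three "events" with a variable number of hits each (a jagged list).
events = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]

# Flatten everything into one contiguous buffer of doubles, and record
# the position in that buffer where each event ends.
content = array("d")
offsets = array("q", [0])
for hits in events:
    content.extend(hits)
    offsets.append(len(content))


def get_event(i):
    """Return event i as a slice of the flat buffer."""
    return content[offsets[i]:offsets[i + 1]]


def counts():
    """Number of hits per event, computed from the offsets alone."""
    return [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]


print(list(get_event(2)))
print(counts())
```

The per-event lengths come from the offsets without touching the values, which is part of why this layout is cheaper than nested Python lists and dictionaries.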
28:07 It's an Evangelism like Dev Evangelism type of role.
28:11 Go talk to the other groups and say, hey, we think you should be using this.
28:16 What is it missing for you to really love it? Something like that.
28:19 How interesting?
28:23 Looking at the awkward array page here says for a similar problem, 2 million times larger than this example given above, which one above is not totally simple. So that's pretty crazy.
28:33 It says Awkward array.
28:36 The one liner takes 4.6 seconds to run and uses two gigs of memory.
28:40 The equivalent using Python lists and dictionaries takes over two minutes and uses ten times as much memory.
28:47 22 gigs. So yeah, that's a pretty appealing value proposition there.
28:51 Yeah. And it supports Numba.
28:53 Jim works very closely with the Numba teams and really is one of the experts on the Numba internals.
28:59 So it has full Numba support now, and he's working on adding Dask.
29:04 He's working with Anaconda on this grant and then working with adding GPU support.
29:10 Very cool.
29:11 Maybe not everyone out there knows what Numba is. Maybe give us a quick elevator pitch on Numba.
29:16 I hear it makes Python code fast, right?
29:18 It's a just-in-time compiler.
29:23 It takes Python. It actually takes the bytecode, and then it basically takes that bytecode and turns it into LLVM.
29:34 It works a lot like Julia, except instead of a new language, it's actually reading Python bytecode, which is challenging because the Python bytecode is not something that stays static or is supposed to be a public detail.
29:48 There's no public promises about consistency of bytecode across versions, because they play with that all the time to try to speed up things and they add byte codes and they try to do little optimizations.
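For a sense of what "reading Python bytecode" means, the standard library's `dis` module can list the instructions CPython generates for a function. This sketch only inspects bytecode; it does not use Numba, and the `add` function is just a placeholder.

```python
import dis
import sys


def add(a, b):
    return a + b


# Collect the instruction names CPython compiled this function into.
opnames = [ins.opname for ins in dis.get_instructions(add)]

# The exact list differs between Python versions, which is why a tool
# that consumes bytecode has to be updated for each new release.
print(sys.version_info[:2], opnames)
```

Running this under two different Python versions will typically print different instruction names for the same source.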
30:00 Yes, every Python release breaks Numba, so they just know Numba will not support the next Python release right away, and it usually takes a month or two.
30:08 But it's very impressive, though.
30:11 The speed up. So you do get full C type speed ups for something that looks just like Python.
30:17 It compiles really fast for a small problem, and it's as fast as anything else you can do.
30:24 I've tried lots of these various programming problems and you just about can't beat them, but it actually knows what your architecture is, since it's just in time compiling.
30:34 Which is an advantage over say, like C, right. It can look exactly at what your platform is and your machine architecture and say, we're going to target.
30:42 I see your CPU supports this special vectorized thing or whatever, and it's going to build that in. Right.
30:48 And then there's what Jim does with Awkward, and we've done this with some other things; Vector does this, too.
30:53 You can control what Python turns into, what LLVM constructs any Python turns into, because you can control that compile phase.
31:02 That's incredibly powerful, because you can say and it doesn't have to be the same thing.
31:06 But obviously you want it to behave the same way.
31:08 You can say, if you see this structure, this is what it turns into in LLVM's machine language, which then gets compiled into your native machine language.
31:21 Interesting assembly.
31:22 So if you have,
31:22 like, a certain data structure that you know can be well represented or gets packed up in a certain way to be super efficient, you can control that.
31:31 You can say that.
31:32 Well, this operation on this data structure, this is what it should do.
31:36 And then that turns into LLVM, and maybe it can get vectorized or things like that for you.
31:41 Yeah, that's super neat.
31:43 Another package in the list that I've got to talk about, because just the name and the graphic are fantastic, is aghast.
31:50 What is aghast? It's got like the scream.
31:53 I forget who the artist was, but the Scream sort of look as part of the logo is good.
31:59 About half of the logos come from Jim, and the other half come from others around or from the individual package authors.
32:09 This is sort of part of the histogramming area, which is sort of the area I work in, so I can help.
32:14 But Jim actually wrote aghast, and the idea was that it would convert between histogram representations.
32:19 I think it came up because Jim got tired of writing histogram libraries. I think he's written at least five.
32:25 One of the things I got the sense of by looking through all the Scikit-Hep stuff.
32:29 There's a lot of histogram stuff happening over there.
32:32 Yes, histogram is sort of the area that I was in, and it ended up coming in several pieces.
32:38 But I think one of the important things was actually, and I think aghast may not really matter,
32:42 it may get archived at some point, because instead of translating between different representations of histograms in memory, what you can do is define a static typing protocol, one that can be checked by MyPy, that describes what an object needs in order to be called a histogram.
33:02 And so I've defined that as a package called UHI, Universal Histogram Interface, and anything that accepts UHI, which can be fully checked by MyPy, will then be able to take any object from any library that implements UHI.
33:17 And so all the libraries we have that produce histograms, so Uproot when it reads a ROOT histogram, or Hist and boost-histogram when they produce histograms,
33:26 they don't need to depend on each other. They don't even depend on UHI; that's just a static dependency.
33:33 And then they can be plotted in mplhep,
33:36 or they can be printed to the terminal with histoprint.
33:40 And there's no dependencies there.
33:42 One doesn't need the other.
33:43 And that's sort of making aghast somewhat unneeded, because now it really doesn't matter. You don't have to convert between two because they both just work.
33:52 They work on the same underlying structure. Basically.
33:55 They work through the same interface, right?
33:58 Yeah. So aghast is a way to work with different histogramming libraries.
34:03 It kind of is the intermediary, the abstraction layer on that. Okay.
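The static-protocol idea described above can be sketched with the standard library's `typing.Protocol`. This is only an illustration: `HistogramLike`, `LibraryAHist`, and `print_histogram` are hypothetical names, and the members shown are not the actual UHI definitions.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class HistogramLike(Protocol):
    """Hypothetical protocol: anything exposing bin edges and counts."""

    def edges(self) -> Sequence[float]: ...

    def counts(self) -> Sequence[float]: ...


class LibraryAHist:
    """Stands in for a histogram produced by some library.

    It never imports or inherits from HistogramLike.
    """

    def __init__(self, edges, counts):
        self._edges, self._counts = list(edges), list(counts)

    def edges(self):
        return self._edges

    def counts(self):
        return self._counts


def print_histogram(h: HistogramLike) -> str:
    """A consumer that depends only on the protocol, not on any producer."""
    lines = []
    for lo, hi, n in zip(h.edges()[:-1], h.edges()[1:], h.counts()):
        lines.append(f"[{lo:g}, {hi:g}) " + "#" * int(n))
    return "\n".join(lines)


hist = LibraryAHist(edges=[0, 1, 2, 3], counts=[2, 5, 1])
assert isinstance(hist, HistogramLike)  # structural match, no inheritance
text = print_histogram(hist)
print(text)
```

A type checker such as MyPy can verify the same structural match statically, which is why the producing and consuming libraries need no runtime dependency on each other.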
34:11 What are some other ones we should kind of give a shout-out to? We talked about GooFit, which is an affiliated package.
34:17 It's not part of Scikit-HEP, but we developed this idea of an affiliated package for sure, things that didn't need to be moved in, but had at least one Scikit-HEP developer working or working with them. At least that's my definition. I was never able to actually get the rest to agree to exactly that definition. But that's my working definition.
34:37 So that's why pybind11 gets listed there.
34:39 It's an affiliated package because we share a developer with the pybind11 library, and we sort of have a say in that and how that is developed.
34:50 And most importantly, if we have somebody come into Scikit-HEP, we want them to use pybind11 over the other tools because that one we have a lot of experience with.
34:59 Very cool.
35:00 Another one I thought was interesting
35:01 is hepunits.
35:02 So this idea of representing units like the standard units, they're not enough for us.
35:08 We have our own kind of things, like molarity and stuff, but also luminosity and other stuff, right?
35:16 Different experiments can differ a bit. So there's sort of a standard that got built up for units.
35:23 And so this just sort of puts that together.
35:26 And the unit that we sort of decided should be the standard unit, that's one, and the rest are different scale factors.
35:33 It's a very tiny little library.
35:35 It was the first one to be fully statically typed because it was tiny and easy to do: MyPy infers constants, and there were like two functions or something. And then it was done.
35:45 Probably a lot of floats.
35:49 That's sort of what it is. You can use that, and the idea is that the rest of the libraries will adhere to that system of units.
35:57 So then if you use this and then use the values it gives you, then you can have nice human-readable units and be sure of your units.
36:06 That's really neat.
36:07 Have you heard of Pint?
36:08 Are you familiar with this one?
36:09 I love pint.
36:11 It actually takes the units through, and I use Pint some, but it actually gives you a quantity out, or a NumPy quantity, whereas hepunits just stays out of the way. And it's a way to be more clear in your code, but it's not enforced. Pint is enforced, which I like, but it also can slow things down. You can't.
36:32 These are not actual real numbers anymore. So you pay.
36:35 So it's going to add a ton of overhead.
36:36 But Pint's interesting, because you can do things like three times meter plus four times centimeter, and you end up with 3.04 meters.
36:45 Those are actually real quantities.
36:46 They're actually a different object, which is the good thing about it. But it's also the reason that then it's not going to talk to a C library that expects a regular number or something as well.
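The "units as plain numbers" approach can be sketched as follows. The constants are defined locally for illustration rather than imported from hepunits, and choosing the millimeter as the base length (value 1.0) is an assumption of this sketch.

```python
# Plain-float unit constants: the base unit is 1.0, everything else is a
# scale factor. These names are local to this sketch (assumption), not
# imported from any library.
millimeter = 1.0
centimeter = 10.0 * millimeter
meter = 1000.0 * millimeter

# Multiplying by a unit gives an ordinary float in the base unit.
length = 3 * meter + 4 * centimeter

# Dividing by a unit converts the number for display.
length_in_meters = length / meter

print(type(length).__name__, length_in_meters)
```

Because `length` is an ordinary `float`, it can be passed straight to NumPy or a C library with no overhead. The trade-off, as noted in the conversation, is that nothing stops you from adding a length to a time; an enforcing library such as Pint would catch that at the cost of wrapping every value.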
36:57 Maybe one or two more and then we'll probably be out of time for this.
37:00 What else should people maybe pay attention to that they can generally find useful over here?
37:04 I can mention Vector. It's a little bit newer, but certainly for general physics.
37:09 I think it's useful because it's a library for 2D 3D and relativistic vectors, and it's a very common sort of learning example, you see, but there aren't really very many libraries that do this that actually have.
37:24 If you want to take the magnitude of a vector in 3D space, there just isn't a nice library for that. So we wrote vector to do that.
37:31 And vector is supported by awkward.
37:34 It has an Awkward back end. It has a Numba back end, a NumPy back end, and then a plain object back end.
37:39 Eventually we might work on more, and it even has a Numba Awkward back end. So you can use a vector inside an Awkward array inside a Numba JIT-compiled loop and still take magnitudes and do stuff like that.
37:51 That's really cool, because we have a lot of those statistics.
37:57 And you can do things like ask if one vector is close to another vector and things like that, even in different coordinate systems, like one in polar coordinates and one in Cartesian or something like that.
38:08 It has different coordinate systems, and it actually stores the vector in that. So you don't have to waste memory or something if that's the representation you have.
38:16 That was a feature from ROOT that we wanted to make sure we got.
38:20 And it also has the idea of momentums, too, and stuff for the relativistic stuff.
38:25 We end up with a lot of that.
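To make the "store the vector in the coordinates you have, compare across coordinate systems" idea concrete, here is a small standard-library sketch. The classes `CartesianXY` and `PolarRhoPhi` and the function `is_close` are hypothetical and do not reproduce the Vector library's API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CartesianXY:
    x: float
    y: float

    def to_cartesian(self) -> "CartesianXY":
        return self

    @property
    def mag(self) -> float:
        return math.hypot(self.x, self.y)


@dataclass(frozen=True)
class PolarRhoPhi:
    rho: float
    phi: float  # angle in radians

    def to_cartesian(self) -> CartesianXY:
        return CartesianXY(self.rho * math.cos(self.phi),
                           self.rho * math.sin(self.phi))

    @property
    def mag(self) -> float:
        # The magnitude is already stored; no conversion is needed.
        return self.rho


def is_close(a, b, tol: float = 1e-9) -> bool:
    """Compare two vectors that may be stored in different coordinates."""
    ca, cb = a.to_cartesian(), b.to_cartesian()
    return (math.isclose(ca.x, cb.x, abs_tol=tol)
            and math.isclose(ca.y, cb.y, abs_tol=tol))


v1 = CartesianXY(0.0, 2.0)
v2 = PolarRhoPhi(2.0, math.pi / 2)
print(v1.mag, v2.mag, is_close(v1, v2))
```

Each object keeps only the two numbers of its own representation, and conversion happens only when an operation needs it.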
38:26 And then maybe just to mention, we mentioned the histogramming stuff, and that's the area that I really work on.
38:32 Of the ones I specifically work on that are general purpose, boost-histogram is a wrapper for the C++ Boost.Histogram library. Boost is sort of the big C++ library just one step below the standard library.
38:45 And right at the time I was starting at Princeton, I met the author of Boost.Histogram, who's from physics, and he was in the process, I believe, of getting this accepted into Boost, and it got accepted after that.
38:58 But one of the things that he decided to do was pull out his initial Python bindings that were written in Boost.Python, which is actually very similar to pybind11 but requires Boost instead of not requiring anything.
39:11 But the design is intentionally very similar.
39:14 And so I proposed that I would work on boost-histogram and write these Python bindings for it inside Scikit-HEP, and that would be sort of the main project I started on when I started at Princeton, and that's what I did. Boost.Histogram is an extremely powerful histogramming library.
39:32 So it's a histogram as an object, rather than, like, in NumPy,
39:36 where there's a histogram function and you give it an array and then it spits a couple of arrays back out at you.
39:40 But you now have to manage these. They don't have any special meaning, whereas histograms really are much more natural as an object. Just like a data frame is more natural as an object where you tie that information together.
39:53 Histograms really natural that way, where you still have the information about what the data actually was on the axes.
39:58 If you have labels, you want to keep those attached to that data, and you may need to fill again, which is one of the main things that physicists really wanted, because we tend to fill histograms and then keep filling them or rebinning them or doing operations on them.
40:13 And you can do all those very naturally in boost-histogram, the C++ library wrapped with pybind11.
40:19 And I actually got involved in cibuildwheel because of boost-histogram, because one of the things I wanted was to make sure it worked everywhere, and it obviously requires C++ compilation.
40:30 And then Hist is a nice wrapper on top of that that just makes it a lot more friendly to use, because the original Boost.Histogram author, Hans Dembinski, wants to keep this quite pure and clean.
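The "histogram as an object" idea, where the axis, label, and counts travel together and you can keep filling or rebin later, can be sketched in plain Python. `Hist1D` is a hypothetical toy class written for this illustration, not the boost-histogram or Hist API.

```python
from dataclasses import dataclass, field


@dataclass
class Hist1D:
    """Hypothetical minimal histogram: regular bins plus an axis label."""
    nbins: int
    start: float
    stop: float
    label: str = ""
    counts: list = field(default_factory=list)

    def __post_init__(self):
        if not self.counts:
            self.counts = [0] * self.nbins

    def fill(self, values):
        """Add values to the existing counts; can be called repeatedly."""
        width = (self.stop - self.start) / self.nbins
        for v in values:
            if self.start <= v < self.stop:  # out-of-range values are dropped
                index = min(int((v - self.start) / width), self.nbins - 1)
                self.counts[index] += 1
        return self

    def rebin(self, factor: int) -> "Hist1D":
        """Merge every `factor` adjacent bins; the label stays attached."""
        if self.nbins % factor:
            raise ValueError("factor must divide the number of bins")
        merged = [sum(self.counts[i:i + factor])
                  for i in range(0, self.nbins, factor)]
        return Hist1D(self.nbins // factor, self.start, self.stop,
                      self.label, merged)


h = Hist1D(4, 0.0, 4.0, label="x [mm]")
h.fill([0.5, 1.5, 1.7, 3.2])
h.fill([3.9, 9.0])          # fill again later; 9.0 is out of range
coarse = h.rebin(2)
print(h.counts, coarse.counts, coarse.label)
```

With a bare pair of arrays, as returned by a histogram function, the caller would have to carry the edges, the label, and the fill logic around separately.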
40:41 So Hist is the more natural. And even if you're not in Hep, I think that's still the more natural one to use.
40:46 Gold plot plot right.
40:48 There's a lot of people who use Histograms across all sorts of disciplines, so that would definitely be one of those that is generally useful.
40:56 All right. So I think that brings us to cibuildwheel.
41:01 Let's talk a bit about that. And I mean, maybe the place to start here is why we want wheels, right?
41:06 The first sentence describes it as: Python wheels are great. Building them across Mac, Linux, Windows, and multiple versions of Python?
41:13 Not so much.
41:14 That's the description of wheel wheels.
41:17 Well, wheels are good.
41:19 There's times when there are no wheels and things install slower. They might not install at all.
41:24 It's generally a bad thing if you don't have a wheel, but they're not easy to make.
41:30 So tell us, what is a wheel? And then let's talk about why maybe building across all these platforms and this cross product along with versions of Python
41:38 and whatnot is a mess. When you distribute Python, you have several options. The most common one, and most packages have at least this, is an SDist, which is just basically a tarball of the source.
41:49 It's modified slightly; maybe you're missing a few things or adding some things.
41:53 Otherwise, it unzips your source and puts it somewhere. Python will find it. And then that's that.
41:58 So it runs your build system.
42:00 So that's setuptools traditionally; that's become a lot more powerful recently. But it has to run the build system to figure out what to do with it. This is just a bunch of files, and then it puts it together in a particular structure on your computer.
42:14 And so a wheel is a package where everything is already in place.
42:20 So it's already in a particular structure. It knows the structure, and all Python has to do.
42:25 For a pure Python wheel, one that does not have any binary pieces in it.
42:30 It just grabs the contents inside and dumps them, following a specific set of rules into places into your site packages.
42:39 You have something installed. There's no setup.py in your wheel, there's no pyproject.toml.
42:45 Those sorts of things are not in the wheel. The wheel is already there.
42:49 It can't run arbitrary code.
42:51 Yeah. Exactly.
42:51 That was one of the points I was making.
42:53 One of those things that can be scary about installing packages is just by virtue of installing them.
42:59 You're running arbitrary code, because often that is executing python setup.py install or something like that.
43:08 Whatever that thing does, that's what happens when you pip install.
43:12 But not with wheels.
43:13 As you said, it comes down in a binary blob and just like, boom, here it is.
43:16 Obviously, the thinking is we have this package delivered to a million computers. Why do we need to have every million computer run all the steps?
43:24 Why don't we just run it once and then go here and then also, that saves you a ton of time.
43:28 Right. Like I just installed uWSGI, and it took 30 seconds, 45 seconds to install because it didn't have a wheel. So it sat there and it just ground away compiling it.
43:39 Yeah. So there's two possibilities.
43:42 A pure Python package wheel is still superior because of the not running arbitrary code.
43:47 Pip will actually go ahead and compile all your .pyc files; it goes ahead and makes the bytecode for all those if it's a wheel. If it's a tarball, it doesn't do that.
43:59 If it doesn't pass through the wheel stage anyway.
44:01 And then every time you open the file, then it's going to the first time. It's going to have to make that byte code. So it'll be a little slower the first time you open it.
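The bytecode step mentioned here can be reproduced by hand with the standard library's `py_compile` module. This is a sketch of the general mechanism only; it does not show how pip itself invokes compilation, and the file name `hello.py` is made up.

```python
import py_compile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "hello.py"
    source.write_text("print('hello')\n")

    # Compile ahead of time, so the first import does not have to do it.
    pyc_path = Path(py_compile.compile(str(source), doraise=True))

    pyc_exists = pyc_path.exists()
    pyc_suffix = pyc_path.suffix

print(pyc_exists, pyc_suffix)
```

When this step is skipped at install time, Python produces the same `.pyc` file lazily on first import, which is the small first-run slowdown described above.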
44:08 There's a variety of reasons. I think it's pythonwheels.com,
44:12 something like that, that describes why you should use wheels.
44:16 That's me. That's not it.
44:17 But yes, Python wheels. So they have, like, a list of advantages there.
44:22 But they also have a little like checklist. It says, how are we doing for the top 360 packages? And apparently 342 of them have wheels, and it shows you for your popular packages which ones like Click does.
44:37 But Future doesn't, for example, and so on.
44:39 Future's been there for a long time.
44:43 So wheels are really good, and they actually replaced an older mechanism that was trying to do something somewhat similar called eggs. But I avoid talking about those.
44:54 Let it live in the past.
44:56 Wheels are also great if you have compilation that happens.
45:01 So if you compile some code as part of your build, then that, of course, is much slower.
45:09 If you have the example.
45:11 It's like it was doing GCC.
45:13 You don't have a compiler. It won't even work.
45:15 Right. Exactly.
45:15 You have to have some setup, at least a little setup. You have to have a compiler set up at the very minimum.
45:20 How many Windows users have seen "cannot find vcvarsall.bat"?
45:26 I don't want to be in the environment or you have to have the right script sourced.
45:30 So wheels also can contain binary components like .so files and things.
45:37 And they have a tag as part of their name.
45:39 They have a very special naming scheme for Wheels, and the tag is stored in the wheel, too.
45:44 And they can tell you what Python version they are good for, what platform they are supported on.
45:52 They have a build number, and then the Python part is actually in two pieces: there's the ABI and the interpreter.
46:01 You can see there's some huge long name that with a bunch of underscore separating it and basically.
46:09 Go ahead.
46:09 It's also one of the reasons that names are normalized.
46:12 There's no difference between a dash and an underscore; it's because that special wheel name has dashes in it. So the package name at that point in the file name has to use underscores.
46:20 Yeah. So basically, when you pip install, it builds up that name and says, do you have this as a binary?
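The naming scheme described above follows the wheel filename convention `{distribution}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl`. A minimal parser might look like the sketch below; `parse_wheel_filename` and `WheelName` are names invented here, and the example filenames are illustrative rather than taken from a real release.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename on dashes into its tagged components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:  # the optional build number is present
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected number of fields: {filename!r}")
    return WheelName(dist, version, build, py, abi, plat)


compiled = parse_wheel_filename(
    "boost_histogram-1.2.1-cp39-cp39-manylinux2014_x86_64.whl"
)
pure = parse_wheel_filename("example_pkg-1.0-py3-none-any.whl")
print(compiled)
print(pure)
```

Because dashes separate the fields, a dash inside the project name would make the split ambiguous, which is why the name is written with underscores in the filename.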
46:27 Give it to me.
46:27 Right. Something like this.
46:29 It knows how to look for the right one. If it finds a binary, it will just download it, depending slightly on the system and how new your pip is.
46:37 Right. And this is one of the main innovation, ideas and philosophies behind Conda and Anaconda.
46:44 Let's just take that and make sure that we build all of these things in a really clear way and then sort of package up the testing and compilation and distributing all that together.
46:55 Yes. This is very similar to this, I think. I'm pretty sure it came after Conda; I think they were still on eggs when Conda was invented, and then building up wheels was challenging.
47:05 Building a wheel was challenging.
47:09 cibuildwheel has really changed that.
47:10 If you want a pure Python wheel, it's really easy today. You should be using the build tool, which I'm a maintainer of as well.
47:17 But build just builds an SDist for you, or it builds a wheel.
47:22 So you say something like python setup.py bdist or something like that and then boom.
47:28 You shouldn't be doing that anymore. Please don't.
47:32 How would I do it? Tell me the right way.
47:35 Well, you could do pip install build and then python -m build, and that will build both an SDist and a wheel, and it'll build the wheel from the SDist.
47:46 If you use pipx, which I would recommend, then you can just say pipx run build and you don't have to do anything; that'll download build into a virtual environment for you.
47:54 It'll do it, and then eventually it will throw away the original after a week. Interesting.
47:58 Okay, so we could just use the build.
48:00 We should be using the build.
48:01 You should be using the build tool for SDists. There's a big benefit to this, and that is it will use your pyproject.toml, and if you say you require NumPy, like you're using the NumPy headers, the C headers, then it will go,
48:19 when it's building SDists, and make the PEP 517 virtual environment.
48:25 It will install NumPy, anything that's in your requirements in your pyproject.toml, and then it will run the setup.py inside that environment. So you can now import NumPy directly in there and it'll work even when you're building an SDist.
48:40 If you do the python setup.py sdist stuff, you can't do that, because you're literally running Python, giving it setup.py, which imports NumPy. Now it's broken, right?
48:52 Nothing triggers that call to the pyproject.toml
48:56 to see what you need. For a wheel,
49:00 the best way to do it is with pip, or the original way to do it was with pip wheel, because pip has to be able to build wheels in order to install things, and that got added to pip before build existed.
49:13 But now the best way to do it would be with build --wheel, and that's actually doing the right thing. It's actually trying to build the wheel you want, whereas pip wheel is actually just building a wheelhouse. So if you depend on NumPy and NumPy doesn't have wheels,
49:27 and they did better with Python 3.10, so I'm not going to complain about them for Python 3.10, but for 3.9, they didn't have wheels for a while,
49:34 then it'll build the wheels there and it'll build your wheels and it'll dump them all in the wheelhouse, whatever the output is. So you'll be building NumPy wheels, which you definitely don't want to try to upload.
49:43 Yeah, definitely not.
49:44 All right.
49:45 Well, that's really cool. And I definitely learned something. I will start using build instead of doing it the other way.
49:50 You can now delete your setup.py, too.
49:53 That's the big thing, right? You don't have to run that kind of stuff, right?
49:57 They're trying to move away from any commands to setup.py, because you don't even need one anymore, and you can't control that environment.
50:05 It's very much an internal detail. Wrapping up this segment of the conversation:
50:10 We want a wheel because that's best. It installs without requiring the compiler tools on our system.
50:15 It installs faster.
50:17 It's built just for our platform.
50:19 The challenge is when you become a maintainer, you've got to solve this matrix of different Python versions that are supported and different platforms. Like, for example, there's macOS Intel, there's macOS M1 Apple Silicon.
50:33 There's multiple versions of Windows.
50:35 There's different versions of Linux, right? Like Arm Linux versus AMD64 Linux,
50:41 musllinux versus the other Linux varieties.
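The size of that matrix is easy to see by enumerating it. The sketch below forms `{python tag}-{platform tag}` pairs in the style of cibuildwheel's build identifiers; the particular version and platform lists are assumptions chosen for illustration, not the set any given cibuildwheel release supports.

```python
from itertools import product

# Illustrative lists (assumptions for this sketch).
python_tags = ["cp38", "cp39", "cp310"]
platform_tags = [
    "macosx_x86_64",      # macOS Intel
    "macosx_arm64",       # macOS Apple Silicon
    "win32",              # 32-bit Windows
    "win_amd64",          # 64-bit Windows
    "manylinux_x86_64",   # glibc-based Linux, Intel/AMD
    "manylinux_aarch64",  # glibc-based Linux, Arm
    "musllinux_x86_64",   # musl-based Linux such as Alpine
]

# Every combination is a separate wheel that has to be built and tested.
build_ids = [f"{py}-{plat}" for py, plat in product(python_tags, platform_tags)]

print(len(build_ids))
print(build_ids[:3])
```

Three Python versions across seven platforms already means 21 wheels per release, which is the bookkeeping that a tool like cibuildwheel takes over.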
50:46 So one of the challenges with the wheel is making it distributable.
50:51 So if you just go out and you build a wheel and then you try to give it to someone else that may not work.
50:55 Certainly on Linux, pretty much, if you do that, it just won't work because the systems are going to be different. On macOS,
51:04 it'll only work on the version you compiled it on and not anything older.
51:08 And you do even see people trying to compile on macOS 10.14 because they want their wheels to work in as many places as possible.
51:20 I find that janky. It's like, I've got a Mac Mini from 2009.
51:24 We're building on that thing because it will work for most people.
51:27 I think that's how they actually build the official Python binaries.
51:31 I'm not sure.
51:32 But then Apple went in like last year.
51:35 Around this time they threw a big spanner in the works and said, you know what? We're going to completely switch to Arm and our own Silicon, and you got to compile for something different now.
51:43 Yeah. And cross compiling has always been a challenge.
51:47 And then Windows is actually the easiest of all of them. You're most likely on Windows to be able to compile something that you can give to someone else.
51:54 Yeah, that's true. That is one of the things that Microsoft's been really pretty good at is backwards compatibility.
51:59 I guess it holds them back in other ways, but yeah, typically you can run an app from 20 years ago and it'll still run.
52:04 Yeah, there are a few caveats, but not many, at least compared to the other systems.
52:09 Apple is really good, but you do have to understand how to set your minimum version, and you have to get a Python that had that minimum version set when it was compiled.
52:19 If you do that, it works really well.
52:21 So what this actually started with in Scikit-HEP: I was building boost-histogram, which needed to be able to run anywhere. That was something I absolutely wanted. It had to be pip install boost-histogram and it just worked, no matter what.
52:33 And also we had several other compiled packages at the time. Several we had inherited and was compiled and that was quite popular.
52:41 We had a couple of specific ones, and we had a couple more that ended up becoming interested in that. In fact, during this sort of period is when Awkward started compiling pieces.
52:52 What I started with was building my own system to do this. It was called azure-wheel-helpers, which, as you can guess by the name, was basically a set of Azure DevOps scripts.
53:04 It was right after Azure had come out, and I wrote a series of blog posts on this and described the exact process and sort of the things I found out about how you build a compatible wheel. On macOS, you have to make sure you get the most compatible CPython, from python.org
53:21 itself.
53:23 You can't use Brew or something like that because those are going to be compiled for whatever system they were targeting.
53:28 And on Linux you have to run the manylinux system, and you should run auditwheel.
53:34 Actually, Mac you should run Wheel that I might be getting him. I think it's a series of things that you have to do.
53:41 And I started maintaining this multi-hundred-line set of scripts to do this, and I was also being limited by Azure at the time. They didn't have all the templates and stuff they have now, so everything had to be managed through git subtree because it couldn't be a separate repository.
54:00 And then when Jim started working on Awkward, he went and just rewrote the whole thing because he wanted it to look simpler for him, and took a couple of things out that were needed, and suddenly made it two separate things. Now I had to help maintain that. So when Python 3.8 or whatever it was came out, now I had a completely different set of changes I had to make for that one, and it was not working out.
54:21 It was not very easy to maintain.
54:23 And I was watching cibuildwheel, and it was this package, a Python package that would do this, and it didn't matter what CI system you were on because it was written in Python, and it followed nice Python principles for good package design and had unit tests and all that sort of stuff. So it looked really good. There were a couple of things that were missing. I came in and made PRs for the things I came up with that it didn't have, and they got accepted.
54:49 And there was a shared maintainer between pybind11 and cibuildwheel as well. I think that's one of the reasons that I heard about it and was really watching it, and I finally decided just to make the switch, and I did. At some point a little later I actually became a maintainer of cibuildwheel, but I think I started doing the switch before that. It made it really easy, once I was a maintainer, to say this is a package that we have some control over. It's okay, let's just make the choice to depend upon this, because we have a say. It just took out all that maintenance, and now Dependabot does all the maintenance for us.
55:20 It moves the pin forward for cibuildwheel, and that's it. Nice.
55:24 So if I want to accomplish if I'm a package developer owner and I want to share that package to everybody, we've already determined we would ideally want to have a wheel.
55:36 But getting that wheel is hard. So cibuildwheel will let you integrate it, as the name indicates, into your continuous integration.
55:43 And one of those steps of CI could be build the wheel, right?
55:48 It reduces it down to pretty much that. There's a step in your CI that says, run cibuildwheel.
55:54 And then cibuildwheel is designed to really integrate nicely with the build matrix. So for a fairly simple package, or for many packages, you can really just have Mac, Windows, and Linux share the same job, like in GitHub Actions. It's easy to do the same job and then call cibuildwheel, and that's about it.
56:13 It just goes through all the different versions of Python that are supported.
56:18 It just goes through and makes a wheel for each.
56:21 And in fact, it even has one feature that was really nice that I struggled with a bit is testing.
56:27 So if you give it a test command, it will even take your package. It will install it in a new environment that's not in a different directory that's not related to your build at all and make sure it works and passes whatever test you give it.
56:38 Will it do that across the platforms? Will it do, like, a Linux test and a Windows test?
56:43 cibuildwheel really just sees the platform it's sitting on because it's inside the build matrix. And so it's run for each, and for each one
56:52 It will run that test.
56:55 And the simplest test is just echo, and that will just make sure it installs, because it won't try to install your wheel unless there's something in that test command.
57:03 Even that's useful, sometimes even that's broken, sometimes because of NumPy not supporting one of those things in that matrix.
57:09 Yeah, it can't install the dependency.
57:11 So that step fails or something.
57:13 So it currently supports GitHub Actions and Azure Pipelines, which I don't know how long those are going to be two separate things. Maybe they'll always be separate. But Microsoft owns GitHub, and they, like, say do stuff in Azure Pipelines, and then they're kind of moving, like.
57:27 I think there's somewhere the runners are the same.
57:29 They actually have the same environments.
57:32 So I think they'll exist just as two different interfaces, probably.
57:35 And Azure is not so tied to GitHub and it has more of an enterprise type.
57:39 Yeah, for sure.
57:41 It was just a rewrite and a better rewrite. In most cases of it. I got to learn.
57:46 Yeah, I think GitHub Actions came second. All right. So then Travis CI, AppVeyor, CircleCI, and GitLab CI, at least all of those.
57:54 At least those are the ones we test on, and then it runs locally.
57:58 There are some limitations to running it locally.
58:02 If you target Linux and any system that has Docker and target Linux, you can just ask to build Linux.
58:08 You can actually run it from my Mac or from Windows, I assume from Windows machine. I tried Windows with Docker and Windows.
58:15 It does install to a standard location, C:\cibuildwheel, but other than that, it's safe to run there. And on macOS it will install to your macOS system.
58:25 It installs system versions of Python, so that's something we haven't solved yet; we might be able to do it some day.
58:30 It's not a good idea unless you really are okay with installing every version of Python that ever existed into your system.
58:38 Maybe get a CPython,
58:39 a python.org Python.
58:42 Yeah, it's somewhat safe.
58:44 If you're on Windows, you could use Windows Subsystem for Linux, WSL, as well.
58:50 In addition to Docker, I suspect that manylinux has to run.
58:55 I'm sure as long as you can launch Docker.
58:57 The thing that you have to be able to do is launch Docker, because you have to use the manylinux Docker images, or you should use that or a derivative of that.
59:05 There's lots of rules to exactly what can be in the environment and things like that.
59:10 And PyPA maintains that. One thing that also helps is that the manylinux maintainer is also a cibuildwheel maintainer. That's one reason that those things tend to fit well together.
59:22 Features tend to match and come out at the same speed, like musllinux, which is a big thing recently.
59:27 It's not actually in a released version of cibuildwheel yet.
59:30 What is musllinux? So normal Linux is based on glibc, and that's actually what controls it. It's one of two things that controls manylinux.
59:39 So can you download the binary wheel or do you have to build? If you have an old version of pip, they had to teach pip about each version of manylinux.
59:49 That was a mess, so they eventually switched to a standard numbering system that is your glibc number.
59:55 And now pip, the current pip, will be able to install a future manylinux as long as your system's glibc is new enough.
01:00:00 But that was a big problem. So pip 9 can only install manylinux1. It can't install a newer manylinux even if your glibc is fine for it.
01:00:09 The other thing is the glibc version, and manylinux1 was based on CentOS 5; manylinux2010 was CentOS 6.
01:00:19 manylinux2014 was CentOS 7, and then now they switched to Debian because of CentOS sort of switching to the Stream model.
01:00:28 So manylinux_2_24 is glibc 2.24, and that's Debian 9 or something like that.
01:00:37 But that's glibc based.
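The glibc-based numbering can be sketched as a simple version comparison. The legacy aliases below map to the glibc versions of the CentOS releases just mentioned; the functions `glibc_required` and `can_install` are hypothetical helpers that model only the glibc check, not the architecture matching or the other rules an installer applies.

```python
import re

# Legacy manylinux names and the glibc (major, minor) they correspond to.
LEGACY_ALIASES = {
    "manylinux1": (2, 5),      # CentOS 5 era
    "manylinux2010": (2, 12),  # CentOS 6 era
    "manylinux2014": (2, 17),  # CentOS 7 era
}


def glibc_required(tag: str) -> tuple[int, int]:
    """Return the minimum glibc version implied by a manylinux tag.

    Accepts legacy names ("manylinux2014") and the numbered form
    ("manylinux_2_24"), both without the architecture suffix.
    """
    if tag in LEGACY_ALIASES:
        return LEGACY_ALIASES[tag]
    match = re.fullmatch(r"manylinux_(\d+)_(\d+)", tag)
    if match is None:
        raise ValueError(f"not a manylinux tag: {tag!r}")
    return int(match[1]), int(match[2])


def can_install(tag: str, system_glibc: tuple[int, int]) -> bool:
    """A wheel is usable if the system glibc is at least as new as the tag."""
    return system_glibc >= glibc_required(tag)


print(glibc_required("manylinux_2_24"))
print(can_install("manylinux2014", (2, 31)))
print(can_install("manylinux_2_24", (2, 17)))
```

With the numbered form, an installer needs no table of names at all, which is why a current pip can handle manylinux tags that did not exist when it was released.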
01:00:38 There are distributions that are not Glibc based, most notably Alpine.
01:00:43 If you've ever used Alpine, it's this tiny, tiny little Docker image. It's a really fun distribution to use if you're on Docker; it actually sounds fun to install, but I've never tried it without Docker. But it's these five-megabyte Docker wheels, or, well,
01:00:57 Docker doesn't do wheels; Docker images. But that doesn't use glibc. It uses musl, and so musllinux will run on Alpine.
01:01:07 Got it. So if you're building for the platform Alpine and similar ones, right?
01:01:15 Yeah. You said I can run this locally as well.
01:01:18 I know I would use it in CI because I've got that matrix of all the versions of CPython and PyPy and then all the platforms. And I want to check as many of those boxes as possible to put wheels in them, right?
01:01:31 Suppose I'm on my Mac and I want to make use of this to fill in, maybe do some testing, at least on some of these columns.
01:01:39 How do I do that? What's the benefit there?
01:01:41 Well, I can tell you the case where it happened.
01:01:43 So we were shipping CMake, and the scikit-build organization ran out of Travis credits, and that's where they were being built.
01:01:53 We hadn't switched them over to being emulated builds on GitHub Actions yet, and it just ran out.
01:01:58 We couldn't build them, and one of them had been missed, and we also weren't waiting to upload. So we uploaded everything, but we had one set. Or maybe it was all of the Emulated builds. I think it was one set.
01:02:08 It didn't work.
01:02:09 And so we wanted to go ahead and upload those missing wheels.
01:02:13 And I tried, but I couldn't actually get emulation, Docker emulation.
01:02:20 I couldn't get that working on my Mac.
01:02:22 So the manylinux maintainer used his Linux machine, and he had QEMU emulation on it, and he built the emulated images. It took a few hours, but he just built locally and then uploaded, and filled in the missing wheels.
01:02:37 So if I'm maintaining a package, I got some package I'm putting on PyPI and I want to test it.
01:02:44 Does it make sense to do it locally or does it just make sense to put it on some CI system?
01:02:50 cibuildwheel.
01:02:51 Usually I do some local testing, but I'm also developing cibuildwheel. Usually it's probably fine to do this just in your CI, and you usually don't want to run the full thing every time. Usually you have your regular unit tests.
01:03:03 The cibuildwheel run is going to be a lot slower because it's going through and it's making each set of wheels, launching Docker images and things like that.
01:03:09 And it's installing Python each time for macOS and Windows.
01:03:14 Usually if you have a fairly quick build, I've seen some people just run cibuildwheel as part of their test suite, but usually you just run it, say, right before a release.
01:03:22 Maybe I usually do it once before the release and then on the release.
01:03:25 Right. Exactly.
01:03:26 That makes sense because it's a pretty heavyweight type of operation.
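For the local run discussed above, a small Python wrapper can assemble the command line. This is a sketch under assumptions: it assumes cibuildwheel is installed in the current environment, that Docker is available when targeting Linux, and the helper name `cibuildwheel_command` is hypothetical. The `--platform`, `--archs`, and `--output-dir` options and the positional package directory are taken from the cibuildwheel command line.

```python
import shlex
import sys


def cibuildwheel_command(platform, archs=None, output_dir="wheelhouse", package_dir="."):
    """Build the argument list for a local cibuildwheel run (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "cibuildwheel",
        "--platform", platform,      # one of: auto, linux, macos, windows
        "--output-dir", output_dir,  # where the finished wheels are written
    ]
    if archs:
        # --archs takes a comma-separated list of CPU architectures.
        cmd += ["--archs", ",".join(archs)]
    cmd.append(package_dir)
    return cmd


if __name__ == "__main__":
    # Only print the command; the real run is slow and, for Linux, needs Docker.
    # To execute it, pass the list to subprocess.run(cmd, check=True).
    print(shlex.join(cibuildwheel_command("linux", archs=["x86_64"])))
```

Printing rather than running keeps the sketch cheap to try; since the full build is heavyweight, it is the kind of thing to run right before a release rather than on every change.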
01:03:30 So when I look at all these different platforms, I see macOS Intel, macOS Apple Silicon, different versions of Windows.
01:03:37 And then I think about CI systems.
01:03:40 What CI systems can I use that support all these things? Like does GitHub Actions support both versions of macOS, for example, plus Windows?
01:03:47 GitHub Actions is by far our most popular platform.
01:03:52 It switched very quickly. It used to be Travis.
01:03:54 Travis was a challenge because they didn't do Windows very well.
01:03:58 And it's a challenge for us because we actually can't run our macOS tests on them anymore, because once we joined the PyPA, the billing became an issue, and we just basically lost macOS running for it.
01:04:11 But CircleCI, I think, Azure and GitHub Actions. I think they do all three.
01:04:17 And you can always split things up: Travis for Linux and then AppVeyor for Windows.
01:04:23 You can do it that way.
01:04:24 One of the big things that I have developed for cibuildwheel was the pyproject.toml configuration.
01:04:31 Usually that's the configuration for cibuildwheel.
01:04:35 That way you can get your cibuildwheel configuration out of your YAML files.
01:04:40 That way it works locally, which is one of the things I was after, but also you can just do it and then run on several different systems. Like, you might like the fact that Travis is, I think, the only one that does the strange architectures natively.
01:04:54 You have to emulate it other places, which is a lot slower, five times slower or something.
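To make the "configuration out of your YAML files" point concrete, here is a rough Python sketch of how options in a `[tool.cibuildwheel]` table of pyproject.toml correspond to the `CIBW_*` environment variables people otherwise set in CI YAML. The helper names are hypothetical and the mapping is simplified: it assumes the naming convention of upper-casing the option, replacing dashes with underscores, and appending the platform for platform-specific tables, and it joins lists with spaces, which is not right for every option (command lists, for example, are combined differently, and `overrides` is not handled).

```python
PLATFORMS = ("linux", "macos", "windows")


def _env_name(option, platform=None):
    """test-command -> CIBW_TEST_COMMAND; with platform -> CIBW_TEST_COMMAND_LINUX."""
    name = "CIBW_" + option.upper().replace("-", "_")
    if platform is not None:
        name += "_" + platform.upper()
    return name


def _as_string(value):
    # Simplification: list-valued options are joined with spaces.
    if isinstance(value, (list, tuple)):
        return " ".join(str(item) for item in value)
    return str(value)


def to_env_vars(config):
    """Translate an already-parsed [tool.cibuildwheel] table into CIBW_* variables."""
    env = {}
    for key, value in config.items():
        if key in PLATFORMS and isinstance(value, dict):
            # A [tool.cibuildwheel.linux]-style sub-table: platform-specific options.
            for sub_key, sub_value in value.items():
                env[_env_name(sub_key, key)] = _as_string(sub_value)
        else:
            env[_env_name(key)] = _as_string(value)
    return env


if __name__ == "__main__":
    # Example table, written as the dict a TOML parser would produce.
    example = {
        "build": ["cp39-*", "cp310-*"],
        "skip": "pp*",
        "test-command": "pytest {project}/tests",
        "linux": {"archs": ["x86_64", "aarch64"]},
    }
    for name, value in sorted(to_env_vars(example).items()):
        print(f"{name}={value}")
```

Keeping the options in pyproject.toml means the same settings apply to a local run and to every CI provider, so each CI file only needs the few lines that install and invoke cibuildwheel.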
01:04:58 Yeah. So kind of split that up, get the definition and then create maybe multiple CI jobs.
01:05:05 Really simple.
01:05:07 The example script is just a few lines. It does not take much to do this compared to what it would take otherwise.
01:05:14 Yeah, sure. And I didn't even scroll down here. You've got a nice grid on GitHub.com/cibuildwheel that shows what's supported on GitHub Actions, what's supported on Azure Pipelines.
01:05:24 What supported CI doesn't do this?
01:05:29 Out there: AppVeyor, Travis, Azure and GitHub.
01:05:35 But we can't test it.
01:05:37 Theoretically, it does.
01:05:38 Gotcha.
01:05:39 And then I wonder about the M1, the Apple Silicon Arm versions versus the Intel versions.
01:05:46 I don't know how well that's permeated into the world yet, but the fact they have Mac at all is kind of impressive.
01:05:52 Nobody has an M1 runner yet.
01:05:54 There are a few places I think now that you can purchase time on one, but no runners.
01:06:00 Last I checked GitHub Actions, you couldn't even run it yourself on an M1. That may have changed.
01:06:05 I don't know.
01:06:06 That was a while back.
01:06:07 Yeah, there are some crazy places out there. I think there's one called Mac Mini Colo.
01:06:13 I think that's what it's called. Let me see.
01:06:16 I think that's it.
01:06:19 You can go to these places like Mac Mini Colo, get a whole bunch of Mac minis and put them into this crazy data center.
01:06:28 But that's not the same as I upload a text file into GitHub that says run on Azure or on GitHub Actions, and then that's the end of it. Right. You probably got to set up your whole, like, some whole build system into a set of Minis.
01:06:42 And that doesn't sound very practical for most people.
01:06:45 Ideally, what you could do is you just need one Mini, and then you set up a GitHub Actions hosted runner, a locally hosted runner. And other systems,
01:06:56 GitLab CI was big on that.
01:06:58 You can do anything on GitLab CI. We just haven't tested that because they don't have those publicly. But if you have your own, you can do that.
01:07:05 I know somebody who does this, who basically has a Mac Mini and runs the M1 builds on that.
01:07:11 But you could do that. I have a Mac Mini, and the lead developer of cibuildwheel also has an M1.
01:07:19 He has an M1 or something. I don't know.
01:07:21 Mine is a Mac Mini.
01:07:22 That's what I'm talking to you on right now. It's a fantastic little machine.
01:07:25 Yeah, it's very impressive. I love the way that it builds boost-histogram. It was fast.
01:07:29 I have a 16-inch, almost maxed out MacBook and the Mac Mini M1. It was faster at boost-histogram than this thing.
01:07:37 Yeah, I have a maxed out 15 inches, a little bit older, a couple of years, but I just don't touch that thing unless I literally need it as a laptop because I want to be somewhere else. But I'm definitely not drawn to it.
01:07:47 So you could probably set up one of these Minis for $700 and then tie it up. But that's again, not as easy as just clicking the public free option that works, but still, it's within the realm of possibility.
01:08:00 Apple has actually helped out several. Like, I know Homebrew and a few others they've helped out by giving them Mac Minis or something that they could build with. So I believe Brew, Homebrew, actually builds on real M1s.
01:08:17 I know it does because the builds are super fast. I remember that, like, it builds ROOT in, like, 20 minutes.
01:08:22 The ROOT recipe, because I maintain that.
01:08:25 And the normal one takes about an hour. It's running on multiple cores, but it's like three times faster. It's done in 20 minutes. I just thought something was wrong when I first saw that.
01:08:34 How could it be done?
01:08:35 Something broke. What broke?
01:08:37 All right, Henry, we're getting really short on time, a little bit over, but it's been a fun conversation. How about you give us a look at the future? Where are things going with all the stuff?
01:08:46 The next thing I'm interested in being involved with is scikit-build, which is a package that currently sort of augments setuptools, but hopefully will eventually sort of replace setuptools as the thing that you build with, and it will call out to CMake.
01:09:03 So you basically just basically write a CMake file and this could wrap an existing package.
01:09:09 Or maybe you need some of the other things that CMake has, and this will then let you build that as a regular Python package.
01:09:16 In fact, recently somebody sort of put together a cibuildwheel, scikit-build and CMake example and built LLVM and pulled out just the clang-format tool and made wheels out of that.
01:09:28 And now you can just do pip install clang-format. It's one to two megabytes. It works on all systems including Apple Silicon and things. I just tried it on Apple Silicon yesterday and it's a pip install.
01:09:37 Now you can clang-format C++ code, and that's just mind-blowing. Add it to pre-commit, and pre-commit.ci runs it too. I mean, I've been fighting for about a week to reduce the size of the clang-format recipe from 600 megabytes to just under 250, which was the maximum for pre-commit.ci.
01:09:54 And then you can now pip install it at under about a megabyte for Linux, that sort of thing. And I think that would be a really great thing to work on. It's been around since 2014, but it needs some serious work.
01:10:09 And so I'm currently actually working on writing a grant to try to get funded to just work on basically the scikit-build system and looking for interesting science use cases that would be interested in adopting it or switching an existing build system over or adapting to it or taking something that has never been available from Python and making it available.
01:10:30 And yes, ROOT might be one.
01:10:33 I'm looking for a wide variety. The scikit-build package is fundamentally just the glue between the setuptools Python module and CMake.
01:10:41 Yeah. So it's a real way to take some of these things based on CMake and sort of expose them to Python.
01:10:46 Yeah. So you can just have a CMake package that does all the CMake things well, like finding different libraries and that. I'm a big CMake person.
01:10:56 I do use it very heavily.
01:10:57 Most C++ does. It's about 60%, I think, of all build systems are CMake based now, get from gateways numbers, but I think it's very powerful. It can be used for things like that.
01:11:11 And it'll really open up a much easier, more natural C++ and Fortran and things like that, and CUDA, than is currently available. Setuptools' distutils is going away in Python 3.12.
01:11:24 Setuptools is not really designed to build C++ packages or Fortran packages. It was really just a hack on top of distutils, which happened to be built just to build Python itself.
01:11:35 Well, scikit-build sounds like the perfect tool to apply to the science space because there's so many of these weird compiled things that are challenging to install and deploy and share and so on. So making that easier sounds good.
01:11:49 All right.
01:11:50 Well, I think we're probably going to need to leave it there just for the sake of time, but it's been awesome to talk about all the internals of supporting Scikit-HEP, and people should check out cibuildwheel.
01:12:02 It looks like if you're maintaining a package either publicly or just for internal for your organization, it looks like it would be a big help if it's got binary.
01:12:09 Any sort of binary build in it?
01:12:12 If not, build is fine.
01:12:14 And I learned about build, which is good to know.
01:12:17 All right.
01:12:17 So before you get out of here, Henry, let me ask you the two final questions.
01:12:21 You're going to write some code.
01:12:23 Python code. What editor would you use?
01:12:25 Depends on how much. It'll either be Vi if it's a very small amount,
01:12:29 or if it's a really large project that takes several days, then I'll use PyCharm. And then I've really started using VS Code quite a bit. And that's sort of expanding to fill in all the middle ground and kind of eating in on both of the other edges.
01:12:44 There's some interesting stuff going there. Good choice.
01:12:46 But all with the Vi mode, with plugins added, of course.
01:12:50 And then notable PyPI package. I mean, we probably talked about 20 already.
01:12:55 If you want to just give a shout out to one of those, that's fine. Or if you got a new idea.
01:12:58 I'm going to go with one that might not get mentioned, but I'm really excited by it. I think the developer is quite new, but what he's actually done as far as the actual package has been nice.
01:13:12 It needs some nice touches.
01:13:15 And that is plotext.
01:13:19 And I'm really excited about that because the actual plots it makes are really, really nice.
01:13:25 And they're plotted to the terminal, and it can integrate with Rich.
01:13:29 And of course, I'm interested in it because I want to see it integrated with a Textual app that combines this with file browsers and things like that.
01:13:44 With Apple.
01:13:45 You could cruise around your files, use your ROOT I/O integration, pull these things up here and put the plot right on the screen.
01:13:53 Right. But in the terminal.
01:13:55 Yeah. This is really cool. I had no idea. And this is based on Rich. You say it can integrate with Rich. Okay.
01:14:01 Got it. Yeah.
01:14:01 As soon as I saw it, I started trying to make sure the two people were talking to each other as soon as they could.
01:14:08 Yeah, exactly.
01:14:09 These things work together.
01:14:10 That's very cool.
01:14:11 They seem like they should right. They're in the same general zone.
01:14:14 Yeah, and they do now.
01:14:16 There had to be some communication back and forth as far as what size the plots were in.
01:14:20 This shouldn't work in.
01:14:21 It's a good recommendation.
01:14:23 Definitely one I had not learned about, so I'm sure people enjoy that.
01:14:26 All right, Henry, final call to action. People want to do more with wheels.
01:14:29 cibuildwheel or maybe some of the other stuff we talked about. What do you tell them?
01:14:33 Look through.
01:14:34 I think one of the best places to go is the Scikit-HEP developer pages, even if you have no interest in Scikit-HEP tools or HEP at all. That sort of shows you how these things integrate together really well.
01:14:43 And nice documentation.
01:14:47 cibuildwheel itself is nice. And a lot of the PyPA projects have gotten good documentation, as well as packaging.python.org.
01:14:55 We've updated that quite a bit to reflect some of these things, but I really like the Scikit-HEP developer pages. I mean, I'm biased because I wrote most of them. Nice.
01:15:06 Yeah, I'll link those together.
01:15:08 I'll try to link to pretty much everything else we spoke about as well, so people can check out the podcast player show notes to find all that stuff. I guess one final thing that we didn't call out, that I think is worth pointing out, is cibuildwheel is under the PyPA, the Python Packaging Authority, so it gives it some officialness.
01:15:23 I guess you should say. Yes.
01:15:24 That happened after I joined. One of the first things I wanted to do was, I thought this should really be in the PyPA, and I was sort of pushing for that. And the other developers were fine with that.
01:15:35 And so we brought it up, and I actually joined the PyPA just before that by becoming a member of build.
01:15:40 So I got to vote on cibuildwheel coming in. But it was a very enthusiastic vote, even without my vote.
01:15:46 And pipx joined right at the same time too. So those were exciting times.
01:15:50 pipx is a great library.
01:15:51 I really like the way pipx works. It's a great tool.
01:15:54 All right, Henry, thank you for being here. It's been great.
01:15:57 Thanks for all the insight on all these internals around building and installing Python packages.
01:16:01 There's also a lot more on my blog, iscinumpy.gitlab.io, that links to all those other things, obviously, too.
01:16:09 Thank you for being here. Yeah.
01:16:11 See you. Thanks for having me.
01:16:12 You bet. This has been another episode of Talk Python to me.
01:16:16 Our guest on this episode was Henry Schreiner, and it's brought to you by us over at Talk Python training and the transcripts were brought to you by 'AssemblyAI'.
01:16:25 Do you need a great automatic speech to text API?
01:16:27 Get human level accuracy in just a few lines of code?
01:16:30 Visit 'talkpython.fm/AssemblyAI'. Want to level up your Python?
01:16:35 We have one of the largest catalogs of Python video courses over at Talk Python.
01:16:39 Our content ranges from true beginners to deeply advanced topics like memory and async.
01:16:44 And best of all, there's not a subscription in sight.
01:16:46 Check it out for yourself at 'training.talkpython.fm'. Be sure to subscribe to the show.
01:16:51 Open your favorite podcast app and search for Python.
01:16:54 We should be right at the top.
01:16:55 You can also find the iTunes feed at /itunes, the Google Play feed at /play and the direct RSS feed at /rss on 'talkpython.fm'. We're live streaming most of our recordings these days.
01:17:08 If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at 'TalkPython.FM/youtube'.
01:17:15 This is your host, Michael Kennedy.
01:17:18 Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.