
#29: Python at the Large Hadron Collider and CERN Transcript

Recorded on Thursday, Sep 24, 2015.

00:00 The largest machine ever built is the Large Hadron Collider at CERN.

00:03 Its primary goal was the discovery of the Higgs boson, the fundamental particle which gives all objects mass.

00:10 The LHC team actually achieved this audacious goal in 2012, winning them the Nobel Prize in physics in the process.

00:18 Today on Talk Python to Me, Kyle Cranmer is here to share how Python was at the core of this amazing achievement.

00:25 This is episode number 29, recorded Thursday, September 24th, 2015.

00:51 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

01:05 This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy.

01:09 Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via at Talk Python.

01:16 This episode is brought to you by Hired and CodeShip.

01:19 Thank them for supporting the show on Twitter via at Hired underscore HQ and at CodeShip.

01:25 I don't have much news to share this week, but I am both honored and thrilled to bring you this episode,

01:31 and I can't wait for you to listen to it.

01:32 So, let's get right to the interview.

01:34 Let me introduce Kyle.

01:37 Kyle Cranmer is an American physicist and professor at New York University at the Center for Cosmology and Particle Physics

01:44 and affiliated faculty member at NYU's Center for Data Science.

01:48 He is an experimental particle physicist working primarily on the Large Hadron Collider based in Geneva, Switzerland.

01:54 Cranmer popularized a collaborative statistical modeling approach and developed the statistical methodology,

02:00 which was used extensively for the discovery of the Higgs boson at the LHC in July 2012.

02:06 Kyle, welcome to the show.

02:08 Thank you. Thank you. It's a pleasure to be here.

02:10 Yeah, we have some amazing science and programming to talk about today, so I'm really excited to dig into all these topics with you.

02:17 Yeah, no, I'm excited to see where it goes.

02:20 Yeah, for sure.

02:21 So, let's, you know, we're going to talk about the Large Hadron Collider,

02:26 about using Python for scientific research and all those sorts of things,

02:31 as well as some other cool projects that you've got going on.

02:33 But people like to know how folks like you in your position kind of got started, and they like to hear the background.

02:41 So, maybe we could start with, you know, what got you interested in physics and what got you interested in programming and how do you get to where you are?

02:48 I've been interested in, you know, in physics, you know, since I was a kid, not really knowing that that's what it was called.

02:53 But later on, you know, I think, I guess it was in high school is when I really realized that it was physics that I wanted to do.

03:01 I grew up in Arkansas and, you know, Arkansas is not exactly known for, like, leading the tide of physicists and computer scientists in the world.

03:09 But they had started a special math and science high school that was a public school, but you actually lived there.

03:15 And when I was there, I was just surrounded by all sorts of people, kind of the nerds and geeks of Arkansas.

03:21 And it was really a special time.

03:23 So, during that time, I, you know, got even more into physics, but it's also, that's when I was first exposed to serious programming.

03:30 So, actually, even Python, like in 95, actually 94, I guess, I had a friend that was into early web things and he was playing with Zope and the Zope, you know, object database.

03:41 And so, I started working with him and did some early web projects and that was kind of my first exposure to Python.

03:49 So, that was a long time ago.

03:51 And then that's, it's funny how much those experiences kind of keep being revisited today.

03:58 Yeah, I'm sure you keep coming back to that.

04:00 You know, basically, programming these days seems like a required skill to be a physicist.

04:06 Yeah, no, well, it depends on what you do, but definitely for what we do, programming is a required skill.

04:14 And unfortunately, it can, you know, for people that don't have those strengths, it really takes away from their ability to try to do the physics that they want to do.

04:23 So, you know, so for incoming graduate students, you usually see a pretty big divide between people that are, have some programming skills and don't.

04:31 And usually, the people that don't will catch up a little bit later, but, you know, you lose time and that's unfortunate.

04:36 Right.

04:37 I'm sure it's like a huge scramble, like, oh my gosh, I got to learn all this programming stuff too because, you know, we have projects or whatever, right?

04:44 Right, right.

04:44 It also colors a lot of the flavor about how we approach computing because somehow you have this enormous computing problem that you need to deal with and you would like to do it as nicely as possible.

04:56 But it also can't be too fancy or the bulk of the physicists might not be able to understand what's going on.

05:02 You have older physicists from the Fortran days and you have younger physicists that maybe never took any, you know, programming courses, like any serious programming courses.

05:10 So things have to be somehow kept simple but still work for the difficult problems.

05:16 And so it's a difficult balance to strike.

05:18 Yeah, I'm sure.

05:20 So let's talk a little bit about what you guys are doing at the Large Hadron Collider.

05:24 And first of all, you know, congratulations on the Higgs boson discovery.

05:30 That's amazing.

05:30 Oh, thank you.

05:31 No, it was, yeah, many, many years of work.

05:35 And when it finally came, it was a huge treat.

05:39 I don't know.

05:39 It's funny to have such a big thing like that happen fairly early in your career.

05:44 It's like, now what?

05:47 Yeah.

05:47 So, but yeah, so at the LHC, you know, we have this huge collider that's in Switzerland.

05:53 It's about 17 miles around and underground.

05:57 There are all these superconducting magnets that help protons go, you know, bend them in a circle at essentially the speed of light.

06:03 And they're colliding together all the time.

06:06 And they smack into each other and they make, you know, a lot of new particles that come flying out.

06:11 And they hit our detector.

06:13 And our detector, you can think of sort of like a digital camera.

06:17 You know, it's like basically a bunch of pixels.

06:19 And the particles smack into it and you get an image.

06:22 But it's a 3D image.

06:24 So it's a 3D detector.

06:25 And the detector is like the size of a 12-story building.

06:29 So, yeah.

06:30 I think that, you know, when you just hear about particle colliders and especially LHC, you have maybe this idea of like a tube where things are shooting around.

06:43 And, you know, how big does the tube have to be?

06:45 Not that big.

06:46 But the actual experiments, the detectors, I was blown away when I learned about how big they are.

06:53 Like you said, 12 stories.

06:54 These things are huge.

06:55 Yeah.

06:56 No, they are.

06:56 And the range of scales is pretty crazy because we have to be able to track where these particles go very precisely.

07:04 So, like, close to where they interact, you know, we're measuring things at the, like, micron level.

07:09 And then they fly out over, like, the size of a building.

07:14 And we're still measuring where they're going at this very precise level.

07:18 But, you know, it's just this gargantuan thing.

07:20 So you have to align it properly and all sorts of challenges there.

07:24 And there are about 100 million, you know, well, a few hundred million electronic readouts coming out of this beast.

07:30 So, you know, it's like a 100-megapixel camera or something like that.

07:35 And we're taking 40 million photos every second.

07:38 That's a stunning amount of data.

07:41 That is a stunning amount of data.

07:42 And so if you, you know, we have to slap special electronics, like, straight onto the detector to be able to start preprocessing it and compressing it and sort of, you know, coming up with some way to deal with the data volume.

07:56 Because, you know, it's something, it's, you know, there are just totally staggering numbers about the data flow that's coming straight out of the detector.
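
For a rough sense of why these numbers are staggering, here is a back-of-envelope calculation in Python using the round figures quoted in the conversation; the bytes-per-channel figure is an assumption for illustration only, not an official detector specification.

```python
# Back-of-envelope data rate, using the round numbers from the conversation.
# The bytes-per-channel figure is an assumption for illustration only.
channels = 100e6          # ~100 million electronic readout channels
crossing_rate = 40e6      # ~40 million "photos" (bunch crossings) per second
bytes_per_channel = 1     # assume roughly one byte per channel per snapshot

raw_rate = channels * crossing_rate * bytes_per_channel   # bytes per second
print(f"Unfiltered: ~{raw_rate / 1e15:.0f} PB every second")
# Far too much to record, hence the real-time filtering described next.
```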

08:04 So how do you capture and store that?

08:06 Do you store that, like, on hardware right on, like, Atlas in the machines?

08:11 Or do you, like, get that into, like, a cluster of servers?

08:15 Or what happens?

08:16 Right, right.

08:17 So we have a kind of a hierarchical online real-time system for tossing away, you know, the majority of the data.

08:27 So we have to actually, we write algorithms that have to look at the data and real-time decide, does this look interesting or not?

08:35 And so we go from the sort of 40 million a second through this, like, three levels of filtering down.

08:41 And then we get to the point that we save something like a few hundred of these collisions every second.

08:48 And that turns into, you know, several petabytes a year of data that we actually analyze later.
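
Here is a minimal Python sketch of the multi-level trigger idea described here: cheap, coarse selections run first, and progressively more expensive checks only see what survives. The event fields, thresholds, and three-level structure are invented for illustration and are not the actual ATLAS trigger logic.

```python
# Sketch of a multi-level trigger: the cheapest checks run first, and each
# stage only sees what the previous one kept. Fields and cuts are invented.

def level1(event):
    # fast, coarse hardware-style decision
    return event["total_energy_gev"] > 20.0

def level2(event):
    # somewhat more detailed software check
    return event["n_muon_candidates"] > 0 or event["missing_energy_gev"] > 50.0

def level3(event):
    # stand-in for an expensive, full-reconstruction decision
    return event["reconstruction_score"] > 0.9

def trigger(events):
    """Yield only events that pass every level; everything else is discarded."""
    for event in events:
        if level1(event) and level2(event) and level3(event):
            yield event  # roughly a few hundred per second survive to storage

sample = [
    {"total_energy_gev": 35.0, "n_muon_candidates": 1,
     "missing_energy_gev": 12.0, "reconstruction_score": 0.95},
    {"total_energy_gev": 5.0, "n_muon_candidates": 0,
     "missing_energy_gev": 3.0, "reconstruction_score": 0.10},
]
print(len(list(trigger(sample))))  # -> 1
```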

08:55 That's amazing.

08:56 It's got to be a little stressful to work on that initial filtering algorithm.

08:59 Because what if you threw away the Higgs boson before you discovered it, right?

09:03 That's right.

09:04 Yeah.

09:04 No, people, we always worry that we're, you know, kind of throwing the baby out with the bathwater.

09:09 And sorry about the living in New York here.

09:13 Yeah, no worries.

09:14 So the, yeah, we call that thing the trigger.

09:18 And, you know, that's something that I worked on a bit.

09:21 It's true that, like, if we don't find anything else in this next run of the LHC, you know, a lot of people will think exactly that.

09:29 That maybe, you know, the way the trigger was configured, we were throwing away the interesting stuff.

09:33 But luckily, we're not stuck to that.

09:36 You know, we can go and we can change it and things like that.

09:38 But that is the worry.

09:40 Sure.

09:41 Yeah.

09:41 I mean, you still have the time spent and the energy and all that, right?

09:45 Yeah.

09:46 Yeah, sure.

09:46 You can rerun it, of course.

09:48 But, you know, you got to, I suspect time is a valuable thing on that machine.

09:54 Yeah.

09:55 No, for sure.

09:56 It's expensive to run.

09:57 So, absolutely.

09:59 Yeah.

10:00 So, you know, I have a lot of listeners who are scientists and physicists and data science and so on.

10:07 But a lot of them who are probably not.

10:12 And so, I wanted to make a movie recommendation and a book recommendation just for people to, you know, if they want to kind of set the stage and learn the background, you know, as part of this whole thing we're talking about.

10:23 I wanted to recommend the Particle Fever documentary.

10:26 Have you seen this?

10:28 Yeah, yeah.

10:28 I actually have a credit in that movie.

10:30 I worked with them quite a bit.

10:35 And at one point there was a scene that was shot in my office, but they ended up having to cut it because it didn't really fit well. You know, it was a good choice.

10:45 But it was painful.

10:46 But they were nice.

10:47 I worked with them a fair amount and got to go to the, like, you know, the opening in Sheffield at a documentary film festival and hang out with the producers and the whole crew.

11:00 But it's a great film.

11:04 I think it definitely, it's good for a non-physics audience also.

11:08 It's not a technical film.

11:09 It just basically captures what it's like to be inside one of these experiments and the sort of stress and the, you know, the drama associated to it.

11:18 I think it's really one of the best science documentaries ever.

11:21 I absolutely agree with you.

11:22 I think it really captures the excitement, the imagination, the drama in a way that, you know, anybody could appreciate.

11:30 And so, I definitely recommend people watch that.

11:32 It's available for streaming on Netflix and iTunes and other places.

11:36 And then the other thing is the book called Present at the Creation: Discovering the Higgs Boson by Amir Aczel.

11:44 I messed up his name.

11:45 But that's also really good.

11:47 So, people who are out there and want to learn more about what we're talking about, I think I recommend those.

11:51 Okay, great.

11:52 Yeah, I actually haven't read that second book.

11:54 Yeah, I really enjoyed that book as well.

11:58 It predates the Higgs boson discovery.

12:00 So, it's like a lot of anticipation.

12:02 So, that's cool.

12:02 I see.

12:03 Okay, great.

12:03 Maybe we could talk a little bit about, like, the really big picture of software at the LHC.

12:11 Because there's not just one team and there's not just one experiment.

12:14 There's how many detectors?

12:15 Are there seven detectors?

12:16 Right.

12:17 So, well, there are two really big kind of multipurpose particle detectors, Atlas and CMS.

12:23 And I'm on Atlas.

12:24 And those two experiments have, you know, in the neighborhood, a little bit more than 3,000 physicists working on them.

12:32 So, you know, so it's a, there are big groups of people.

12:35 And then there are two other experiments that are, you know, slightly smaller in scale, but they do, you know, and slightly more specialized in terms of the physics that they do.

12:45 And then there are several other smaller dedicated experiments that are quite a bit smaller.

12:51 And so, I don't, you know, it depends on how you count a little bit.

12:55 But usually, you know, there's sort of the two big multipurpose detectors and two other more specialized ones that are the dominant, like, LHC experiments.

13:05 Okay, cool.

13:06 And so, maybe from, like, the higher level or larger scale, like, the thing that actually runs the machine down into the experiments, down into more, like, the data processing details.

13:15 Could you give us a picture, like, what the software looks like there, what you guys are doing?

13:18 Sure, yeah.

13:19 So, I mean, it's mainly, you know, we have a whole bunch of collisions.

13:23 And each collision, you know, if you think of what this metaphor of it being like an image, you know, it's like a pipeline for doing a bunch of image processing, you know.

13:31 And you're looking for, you're trying to find the collisions that, you know, maybe have evidence of some new particle.

13:39 So, you have lots of teams of people that are looking for different things, and each of those teams will develop a little pipeline to process the data to try to, you know, to search for what they want.

13:50 Also, to put it into perspective a little bit, we had a quadrillion, you know, a couple quadrillion collisions total at the LHC.

14:00 And when we discovered the Higgs, it was, you know, of the order of 100 or 1,000 of those collisions that were the interesting ones.

14:08 So, it's a huge needle in a haystack problem.

14:11 But it's also not really like a data mining kind of just generally looking for something weird in the data.

14:17 We have theories that tell us, you know, what to look for, which is good because there's such small little deviations in the data that it would be basically impossible to find if you didn't have a good guide.

14:30 And then this processing chain, because there's so much data and performance is such an issue, most of it, well, several years ago, the decision was to write most of the software in C++.

14:41 You know, C++ has also evolved a ton during the time.

14:44 Are you using like, are people using like C++ 11 and those types of things?

14:50 Right.

14:50 So, the different experiments kind of, you know, move to these new, you know, new standards and new computing technologies kind of at different paces.

14:59 There's a lot of worry.

15:01 You know, it's generally a pretty conservative attitude, you know, but we are making those kinds of transitions.

15:08 But, you know, you just, it has to go through a lot of vetting before we make a big jump like that.

15:16 We also usually have a very homogeneous computing environment in terms of like operating systems and things like that because, you know, you don't want to have to be worrying about like floating point arithmetic differences in your kernel or something, so we just try to keep it uniform.

15:33 Yeah, so it's a little bit funny, you know, that CERN was responsible for sort of developing the web browser, right, you know, and HTML and things like that.

15:42 And so, they had this huge win of, you know, where the web was born.

15:45 And then that was followed by this idea of like, okay, we had the web, now we're going to have grid computing.

15:51 And there was a lot of money poured into it.

15:53 And the promise of the grid basically turned into what has happened with the cloud.

15:59 And, but, you know, within high-energy physics, we do have the grid, but it's kind of like a huge global, you know, batch system in some sense.

16:09 So, it tends to be, you know, more uniform and things like that than what, you know, at first people were working really hard to be able to work over very heterogeneous computing environments.

16:20 But, you know, that all evolved over, you know, more than a decade.

16:24 Yeah, I'm sure.

16:25 I used to do a lot of work in sort of scientific computing and visualization.

16:29 And it's super hard to do reproducibility and checking stuff, you know.

16:35 Right.

16:36 If you've got a sufficiently complicated series of mathematical steps, you can apply to something.

16:42 You know, like, if it's so complicated, how do you know when you're right or not?

16:46 Right.

16:46 You know, how do you know when you're discovering something new versus, oh, it's like I expected or whatever, right?

16:52 Right.

16:53 Well, we're working a lot right now on trying to address the sort of reproducibility, you know, issues and challenges specific to our field.

17:02 And there are a lot of challenges because there's so much data and software is very complicated.

17:05 Yeah.

17:06 So the core algorithms tend to all be C++, but they're, you know, they're organized into lots of, you know, lots of different tools.

17:13 And, you know, you have a way of kind of composing this pipeline between different processing algorithms.

17:19 And in the end, the configuration of that thing is such a beast that that's the first place where you see Python happening is that we have a way of kind of, you know, doing introspection on all of the tools.

17:30 And then we just represent their configuration in terms of Python objects.

17:33 And then there's a whole separate layer of, you know, of computing, which I mean, of programming, which is just essentially the configuration.

17:40 And that includes both this trigger, that online system that's tossing out the data, as well as the, you know, the people that are analyzing the data, how they, you know, configure all these tools to be able to process the kajillions of events into something that's more manageable.

17:55 Now, that sounds really interesting.

17:57 When I was doing some research, it seemed like one of the major pieces used in Atlas was this thing called Athena.

18:05 Right.

18:07 That's right.

18:07 That's the kind of name of the C++ framework that we use that also includes the way that it builds the Python bindings for configuring all the tools.

18:17 And yeah, so that's, yeah, I've spent more hours than I'd like to admit doing programming in that framework.
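
The configuration layer described above drives the C++ tools from Python. The snippet below is only a generic illustration of that pattern, with invented tool names and properties; it is not the real Athena job-options API.

```python
# Generic illustration of "configure the C++ tools from Python": a Python
# proxy object holds the settings that the compiled framework will consume.
# All class names, properties, and values below are invented for the example.

class Tool:
    """Python-side stand-in whose attributes become a C++ tool's configuration."""
    def __init__(self, name, **properties):
        self.name = name
        self.properties = dict(properties)

    def config(self):
        return {self.name: self.properties}

# Compose a processing pipeline purely through configuration.
pipeline = [
    Tool("TrackFinder", min_hits=7, pt_threshold_gev=0.5),
    Tool("JetBuilder", algorithm="anti-kt", radius=0.4),
    Tool("HiggsCandidateSelector", channel="H->gamma gamma"),
]

job_options = {}
for tool in pipeline:
    job_options.update(tool.config())

print(job_options)  # this dictionary is what the compiled framework would read
```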

18:24 But then what's also interesting, I think a lot of your audience will find interesting, is that once you've used that huge, heavyweight data processing pipeline, usually you get to something quite a bit smaller.

18:36 And that's where a lot of the more interactive and exploratory part of the data analysis happens.

18:41 And at that stage, a lot of people, well, people stop using things like Athena for the most part.

18:48 And that's where you start using, see people using Python a lot more in terms of data analysis.

18:53 And so it's an interesting transition because people are always arguing about where do you make that swap, you know?

19:00 Sure.

19:01 Yeah.

19:02 I suspect you guys probably do a lot of IPython.

19:05 Is that true?

19:05 Well, you would think that more people would.

19:08 I guess part of it is that it's still, even at that stage, you still have so much data to process that the kinds of things that people end up wanting to do are, you know, well suited to having like, you know, programs that, you know, that look really like programs that run.

19:26 And they might be Python based, but, you know, you kind of sort of batch systemy, you run over this thing, and then you get some results and look at them.

19:33 There are times when you're doing something very interactive.

19:35 And so years ago, the team at CERN that makes this tool called ROOT, which is like kind of the dominant data analysis package in high energy physics, came up with something like an interpreter because you want to sit there and have this feedback loop, right?

19:51 You know, like the end where you can, you know, type commands, see plots.

19:55 And that was actually done amazingly.

19:58 They wrote a C++ interpreter many, many years ago.

20:01 And so you actually write these commands in C++, and then they're interpreted and executed on the fly.

20:08 That's actually pretty interesting by itself, isn't it?

20:11 It is interesting.

20:12 It, of course, had all sorts of issues, and C++ wasn't really meant for doing that, but it worked practically.

20:18 And now they've gone through and they have a much heavier duty version of this interpreter that's based on, you know, Cling and more modern, like, compiling, compiler technologies and things.

20:27 But Python, obviously, is another way to go with that, which is nice.
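
The "other way to go" is driving ROOT from Python through its bindings. A minimal sketch of the same interactive fill-and-plot loop, assuming a ROOT installation with its Python bindings available:

```python
# Minimal PyROOT example of the interactive "type commands, see plots" loop.
# Assumes a ROOT installation with its Python bindings available.
import ROOT

hist = ROOT.TH1F("mass", "Candidate mass;m [GeV];events", 100, 0.0, 200.0)
for _ in range(10000):
    hist.Fill(ROOT.gRandom.Gaus(125.0, 10.0))  # toy data peaked near 125 GeV

canvas = ROOT.TCanvas("c", "c", 800, 600)
hist.Draw()
canvas.SaveAs("mass.png")  # in an interactive session the plot pops up live
```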

20:42 This episode is brought to you by Hired.

20:45 Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.

20:51 Each offer you receive has salary and equity presented right up front, and you can view the offers to accept or reject them before you even talk to the company.

21:01 Typically, candidates receive five or more offers in just the first week, and there are no obligations, ever.

21:07 Sounds pretty awesome, doesn't it?

21:09 Well, did I mention there's a signing bonus?

21:11 Everyone who accepts a job from Hired gets a $2,000 signing bonus.

21:15 And as Talk Python listeners, it gets way sweeter.

21:20 Use the link Hired.com slash Talk Python To Me, and Hired will double the signing bonus to $4,000.

21:27 Opportunity's knocking.

21:29 Visit Hired.com slash Talk Python To Me and answer the call.

21:32 So people started moving to the Python.

21:45 Well, some people started moving to the Python way of doing things, I don't know, whatever, you know, eight years ago or something like that.

21:52 But the field is kind of split between, you know, which way.

21:55 And then since then, like, things like IPython and the IPython notebook have come around.

21:59 I think that that's great, especially from the point of view of this, like, reproducibility.

22:03 So we're working now that we give tons of talks.

22:07 If you go to CERN, we have this agenda system.

22:09 And you can see that there are, like, hundreds of thousands of presentations happening within these experiments every year.

22:15 That's excellent.

22:16 Yeah, so one of the things, but they're always, like, PowerPoint or, you know, Keynote or whatever, or LaTeX-based PDF presentations.

22:23 And you read about what someone's doing, but it's not very handy for, like, trying to have reproducibility or for another graduate student to pick up where someone left off.

22:32 So we have this effort now to try to make it so that the agenda system can, you know, can basically display notebooks directly so people can upload their IPython notebook directly and visualize it.

22:45 And then if someone else thinks it's interesting, they can, you know, download it and execute it.

22:49 There are efforts about trying to make it so that the whole computing environment associated to that notebook can be, you know, packaged up, because they usually aren't just standalone Python notebooks with, like, a SciPy dependency.

23:04 They have a bunch of dependencies.

23:06 So if you can package that all up, that's very handy.

23:11 So there are tools like Binder now.

23:12 There's a tool called Everware.

23:14 And previously, there's something like SageMath, which all allowed you to sort of execute a notebook, you know.

23:21 But the problem was how do you get all these, you know, these software dependencies packaged up?

23:26 And now that problem is starting to be solved.

23:28 Right.

23:29 Oh, that's really excellent.

23:30 Because I can imagine you guys have so much data and maybe these back-end systems you've got to reach into to actually work with the data that you're trying to, you know, do physics on.

23:41 That's right.

23:41 You can't just take the program and hand it out, you know, like, oh, and here's our, you know, 50 gigs of data and you've got to get it this way, right?

23:48 Right.

23:49 And not only that, there's also things like databases that say, like, how was the detector aligned on Friday, you know, November 25th or something.

23:57 So there are all these databases involved that you have to connect to for the software to run.

24:03 And that's also through tons of authentication layers.

24:06 So it's a huge pain in the butt, basically.

24:09 But people are solving it.

24:12 And I think that will be a huge change.

24:15 And the Project Jupyter people, you know, luckily had this great foresight to separate the notebook from the background kernel.

24:23 So we're actually also writing kernel based on this C++ interpreter of root.

24:29 So it still looks like notebooks and all the display and everything is the same.

24:34 But in the background, instead of Python, it's the C++ interpreter, which is, you know, interesting.

24:40 Yeah, I mean, that certainly opens it up to a much wider audience.

24:44 Like, you're saying, like, the group that's working directly with Athena and so on, they can just, you know, possibly start using IPython or what do you call them?

24:53 Call them Jupyter notebooks now?

24:54 I don't know.

24:54 I'm not really sure what the naming is.

24:56 Yeah, the front end, kind of the language agnostic part is now Project Jupyter.

25:01 But it's great because we have people like Fernando Perez, who's, you know, leading this effort as part of this advisory board for a project that we got, a grant we got from the National Science Foundation, to try to take the tools that have been developed in high-energy physics, which are mainly very siloed.

25:18 You know, it's like we're trying to solve our problem.

25:20 It's a very hard problem.

25:20 And we don't have a lot of extra time or money.

25:22 Right.

25:23 And then, but now we've done some nice things.

25:25 So let's try to open that up, make it more interoperable with, like, the scientific Python world.

25:31 And it's definitely a two-way street.

25:33 There are lots of other great tools out there that we don't use.

25:35 So we're working on improving the interoperability of all of these things.

25:39 Yeah, I think that's going to be good for science all over.

25:41 And the Jupyter guys just got a huge grant.

25:45 I'm not sure all the folks that contributed, but it was millions, like six million or something like that.

25:50 Do you remember?

25:50 I don't know the number, but they rightfully have been getting some support because they're doing some great things.

25:57 And yeah, I'm really happy to see that.

26:01 So, you know, people will start building kernels for C++ and, you know, I guess Ruby and maybe even Fortran.

26:07 I don't know.

26:08 That probably is important somewhere in science.

26:11 But I try to not touch that stuff.

26:13 Right.

26:14 You talked about this sort of uniform computing environment.

26:18 And I know just where you're coming from with that, like, just the whole reproducibility and setup and everything.

26:23 Can you talk to, like, are you using Linux?

26:27 What distribution?

26:28 What do things look like there?

26:30 Right.

26:30 So it's definitely Linux-based.

26:32 CERN has a distribution called Scientific Linux that they maintain.

26:39 I've kind of stopped following all the ins and outs of it, to be honest.

26:44 But, you know, at some point it kind of was a derivative of some Red Hat-type package way back in the day.

26:50 But since then it's evolved.

26:51 And now I'm not even sure which distribution it's closest to.

26:55 And then there's been quite a bit of emphasis in virtualization technologies and, you know, OpenStack-related things.

27:04 And what we're starting to see more use of now are also Docker images.

27:08 So I have a student who's working on, you know, making, you know, Docker images that have our very kind of specific computing environment,

27:17 specifically for the issue of reproducibility.

27:21 That's, you know, something I think you also haven't really seen much of in high-energy physics, but we're starting to see now, like, more and more, you know, web-based tools and things like that.

27:31 So you have web services.

27:32 And we have a project going on where we're trying to make analyses not just reproducible but reinterpretable.

27:38 So, you know, you had a team of people analyzing the LHC data, and they were asking a certain question.

27:44 But you can reuse that analysis pipeline to answer other questions.

27:47 So we're trying to wrap up this very heavy computational, you know, pipeline with a very simple web interface, you know, web APIs and things,

27:56 so that people can submit, you know, requests to the system, and they'll be processed through all of this infrastructure.

28:04 And then you'll, you know, come out with a very simple answer.
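
As a hedged sketch of the "simple web interface in front of a heavy pipeline" idea, here is what a minimal endpoint could look like with Flask; the route, request fields, and helper function are hypothetical placeholders, and the real service is considerably more involved.

```python
# Minimal Flask sketch of a "simple web API in front of a heavy pipeline".
# The route, request fields, and helper function are hypothetical placeholders.
from flask import Flask, jsonify, request

app = Flask(__name__)

def submit_pipeline(analysis_id, signal_model):
    # placeholder: in the real system this would queue the heavy,
    # Docker-packaged analysis workflow against the new signal model
    return f"{analysis_id}-job-0001"

@app.route("/reinterpret", methods=["POST"])
def reinterpret():
    req = request.get_json()
    job_id = submit_pipeline(req["analysis_id"], req["signal_model"])
    return jsonify({"job_id": job_id, "status": "queued"})

if __name__ == "__main__":
    app.run()
```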

28:06 That's really cool.

28:09 Yeah, I think it'll be great.

28:10 It also, it addresses a lot of issues with reproducibility when you can't just, like, take the typical open data route, you know.

28:21 Like, here it is, do it again.

28:23 Because, you know, there are so many steps involved, and the, you know, the configuration of everything is so heavy that basically almost no one can do that.

28:32 But if the experiments host it, host that service, it's very valuable.

28:35 So.

28:36 Yeah, that makes sense.

28:37 I suspect if it's sufficiently complicated and has enough configuration parameters and variation, like, even the original researchers couldn't reproduce it if they didn't have the details, right?

28:45 Yeah.

28:46 Well, I mean.

28:47 I mean, without redoing the work from scratch, literally.

28:50 Yeah.

28:51 No, exactly.

28:52 I mean, the goal is that we should all be able to do it in practice.

28:55 That's, you know, rarely, rarely checked.

28:57 But that's what we're working on now is to try to make it so we can more confidently say, yes, we can actually reproduce this stuff.

29:04 That's super cool.

29:05 I hadn't even thought about Docker, but that makes perfect sense.

29:08 I mean, I always think of Docker as, here's the way I'm going to, like, horizontally scale my web app so I can do that easier or, like, you know, get higher density on servers and web data center type places.

29:21 But it makes really good sense for scientific computing, doesn't it?

29:23 Yeah, yeah.

29:24 So we have, you know, these experiments.

29:26 At this point, you know, each of the big experiments has put out, you know, several hundred papers.

29:30 So you have, like, thousands of scientific results.

29:32 And now, if associated to each one of these results, you have some sort of, like, Docker image and you can, you know, on demand spin up this service that you want to reproduce or reinterpret what was done.

29:46 You end up making a very powerful, high-level scientific tool.

29:50 So it's been an idea for several years.

29:53 But now, with, you know, these tools that are around, it's really, you know, possible.

29:58 And so, and it's starting to happen.
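
As a sketch of what "spin up this service on demand" could look like from Python, here is the pattern using the Docker SDK for Python; the image name, command, and mounted paths are hypothetical placeholders.

```python
# Sketch of spinning up a pinned analysis environment on demand, using the
# Docker SDK for Python (pip install docker). The image name, command, and
# mounted paths are hypothetical placeholders.
import docker

client = docker.from_env()
logs = client.containers.run(
    image="example/lhc-analysis-env:2015.09",        # hypothetical pinned image
    command="python run_selection.py --input /data/sample.root",
    volumes={"/local/data": {"bind": "/data", "mode": "ro"}},
    remove=True,                                      # clean up when finished
)
print(logs.decode())
```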

30:00 Yeah, I think that's fantastic.

30:02 And are you guys, I think, considering or actively putting these into the official Docker repository?

30:07 So right now, that's the sort of model is that that's going to be hosted at CERN, partially because, you know, it's a lot of disk space.

30:18 And we're just working closely with the CERN IT people.

30:20 And there's a lot of trust between the CERN computing and IT and the experiments.

30:24 So that's how that's being developed now.

30:27 But, and some of these things are sensitive, you know, like, yeah, these are big international collaborations.

30:35 And different countries are kind of at different places in terms of their attitudes about open science and things like that.

30:42 So some people still want to keep these things closed, but they're willing to entertain the idea of hosting the service.

30:48 So that's a big political discussion.

30:53 Actually, the more I think about it, that it starts to bring me back to when I was in grad school.

30:56 Yes, I can see that.

30:58 So I suspect once a paper is published in a peer-reviewed journal and approved, pretty much everybody would be pretty happy to have their stuff public.

31:09 But as you're developing that, right, before you've declared the Higgs boson to be found, for example, you would necessarily want to give all your algorithms away and let other people take a shot at it, right?

31:19 You want to keep that until you're ready to publish your papers and publicly do your results, right?

31:24 Yeah, no, that's true.

31:26 And also, you know, I mean, the experiments also worry a lot about, you know, you have to be kind of careful about the scientific message that's coming out of these experiments.

31:37 I mean, they're very, very expensive international projects.

31:40 And you don't want people, like, claiming this and that and adding a lot of noise and drama.

31:46 So, you know, we work very hard to get lots and lots of internal cross-checks of things and then make sure everyone's on the same page.

31:53 And then we want to have kind of one unified voice from an experiment.

31:56 And then we have another experiment to check it.

31:58 But we don't want too much noise.

32:01 So, you know, that's a lot of motivation for keeping these things kind of internal until they're ready.

32:08 Yeah, absolutely.

32:09 I mean, it took several years for the Higgs boson analysis to finally declare, hey, we found it, right?

32:17 Yeah, that had a lot to do with just collecting enough data.

32:20 Once we had enough data, the process was sort of streamlined enough that it was really a matter of like a week and a half or something.

32:28 It was about a week and a half that we went from the last data that we had to the talk that, you know, where the discovery claim was made.

32:36 So, but that was in some sense just like adding the last bits of data.

32:40 But, yeah.

32:42 Yeah, yeah, very cool.

32:45 So, I want to ask you a few more things about LHC and CERN and particle physics.

32:50 And then we could talk about space a little bit maybe.

32:52 Oh, sure.

32:53 Yeah.

32:53 So, you said one of your goals there is that you guys are trying to move towards a more sort of dedicated professional software role there at CERN.

33:04 Can you talk about that a little?

33:05 Oh, right.

33:06 So, I think, you know, the physicists, you know, need to write, obviously, a lot of code and be proficient to be able to do the science they want.

33:13 And, but you also, there's a lot of infrastructure for the processing that's needed.

33:18 So, you know, typically what happened was physicists that were very strong on computing kind of specialized on that and then, you know, became sort of software professionals with a physics background.

33:31 And that's, that model has worked, you know, surprisingly well.

33:35 There are some people, you know, relatively few people that are, don't really have a physics background that are really more just, you know, software professionals.

33:44 But separately, there's been, you know, CERN has had like an IT department that deals with actual computing infrastructure that's going on, but has also developed several different tools that were more services.

33:56 Sure.

33:57 Because you guys have a lot of computers, a lot of network.

33:59 How many computers do you have there?

34:01 Oh, the CERN, I should know the number.

34:04 Hundreds, like a hundred thousand or something, right?

34:06 Like really a lot.

34:07 It's a lot, yeah.

34:08 There's a, there's a great YouTube video that I'll send you, that you can post, associated with it, that has a wonderful overview of the whole processing.

34:16 Oh, yeah, that'd be awesome.

34:17 And lots of little factoids that's nicely produced.

34:21 But I'm not going to say, I won't say a bunch of wrong numbers, so.

34:24 Sure.

34:25 Sure.

34:25 This episode is brought to you by CodeShip.

34:42 CodeShip has launched Organizations: create teams, set permissions for specific team members, and improve collaboration in your continuous delivery workflow.

34:51 Maintain centralized control over your organization's projects and teams with CodeShip's new organizations plan.

34:56 And as Talk Python listeners, you can save 20% off any premium plan for the next three months.

35:02 Just use the code TALKPYTHON, all caps, no spaces.

35:06 Check them out at CodeShip.com and tell them thanks for supporting the show on Twitter where they're at CodeShip.

35:11 But the, but there were people that were working on, on that side.

35:20 And then also just general services for like, you know, like the system we use to manage all the papers and the, and the comments to the papers and things like that.

35:33 And like an agenda system where people put up their talks and, you know, schedule meetings and the video services and blah, blah, blah, blah, blah.

35:40 And, and those, the people that were working on that have started branching out into the kinds of services they provide.

35:47 So, so I guess one example that I find interesting is that the very first website in the United States was called Spires.

35:54 And it preceded this, the arXiv, where, like, scientists put the versions of their papers before they're published, this preprint server.

36:03 So the, the Spires was basically just a database of all of the, high energy physics literature.

36:09 You know, like who wrote what papers and who cited whom and da, da, da, da, da.

36:13 And, and so it was like a, it was the first website in the United States was where physicists would go to see where, you know, where the, who's citing who and which papers are about certain topics.

36:23 And that has evolved into something new called Inspire.

36:26 And that's a, you know, an international effort.

36:29 But the software technology that that's based on, which is mainly Flask-based now and, you know, Pythonic, has evolved into a set of different tools that are pretty nice.

36:42 So one of them is a data repository where, this is not solving actually CERN's problem.

36:47 It's like the long tail of science.

36:49 So all these small experiments that are out there that have some data, they can put their data onto these servers and then they can refer to that data when they write their papers.

36:58 So there's a service called Zenodo, and Zenodo started working with GitHub and made it so that now, not just data that you want to point to when you're writing a paper, but the software of your paper, you can, there's a little web hook in GitHub that, whenever you make a new release, will push a copy of the code to the Zenodo service.

37:19 So, and they'll, the jargon is, mint a DOI.

37:23 So it's a digital object identifier, which the publishing industry knows about how to use and things like that.

37:28 And it points to a specific version of the, of the code that was used and you can download that version of the code.

37:34 So it's sort of out of the realm of version control.

37:37 It's more, it's just that GitHub is not permanent, like people could delete their GitHub repo, but once you get a DOI, it's kind of a trust relationship that

37:47 that code is always going to be there at that link.

37:49 Right.

37:49 Almost like escrow for your code.

37:51 Yeah, exactly.

37:53 So it's redundant in a, in a sense, but you know, there's a link back obviously to the repository.

37:57 So you can see how it's evolved, but, but this is the, the, you know, the basic common denominator for all of the publishing industry and how to track citations to, to different, you know, research outputs, whether they be papers or data or software.

38:12 So that, so that connection has been, you know, is nice.

38:15 And now there are actually thousands of pieces of code that have been pushed from GitHub to Zenodo and been given these DOIs.

38:23 And the same, that is also, you know, Flask based and, and they have been very nice about how they break up the project, into different, you know, Flask components and things like that.

38:34 And now we're reusing that same kind of infrastructure to try to build these tools for doing a reproducibility and, reinterpretations and things like that.

38:44 So there's a, there's a new effort that's evolving that has a, you know, much more of a like modern, you know, Pythonic programming, you know, web services mentality, which is not necessarily coming from within the experiments, but coming from within the CERN lab.

39:00 And, and I think it's going to change how the experiments sort of approach these problems, in the next few years.

39:06 That's really cool.

39:07 It seems like a huge step down the right path for reproducibility.

39:10 Yeah, no, I think it's very, very important, for what we're doing.

39:14 Yeah, of course.

39:15 Of course.

39:16 Okay.

39:16 The last thing I wanted to ask you about was this project called Recast that you guys have going, that, not you necessarily, but folks at CERN and the LHC have going.

39:27 So Recast is the system that I was referring to that is, for trying to make the analyses reinterpretable.

39:33 and that's the one where we bring in these sort of Docker images for the, for the different analysis pipelines and try to make this web service, for being able to either reproduce or reinterpret, one of the published LHC analyses.

39:46 So that one is, is definitely within CERN, but using a lot of these more modern, you know, web service, type approaches.

39:54 There was another, another project that maybe, you also wanted to talk about, which was the space one.

40:00 I don't know if that's what.

40:01 Yeah, absolutely.

40:02 So.

40:02 Or I don't know if there's a, we're running out of time.

40:05 No, no, that, yeah, no, we definitely have time.

40:07 Let's talk about that one.

40:08 And that's, CRAYFIS.

40:09 Yeah.

40:10 That's right.

40:10 That's right.

40:11 So, CRAYFIS is this project which we started up, which is a lot of fun.

40:16 and it's a very, very sort of small team of people right now.

40:20 So it's, it's kind of the opposite of these LHC experiments.

40:23 and our basic idea is that we, we want to try to detect very, you know, ultra high energy cosmic rays.

40:30 So these are particles coming from outer space that have much higher energies than the LHC.

40:34 And they smack the atmosphere and create a shower of new particles.

40:39 So you get, you get tons and tons of particles hitting the ground.

40:42 And the idea is that if you get a bunch of people running an app on their phone, the camera of your phone, the little CMOS sensor in your camera is like a miniature particle detector.

40:54 So if one of these particles happens to fly through your camera, it will light your camera up.

40:59 And so basically the idea is that if you get a bunch of people at night running this app, like while their phone's plugged in and they're sleeping, and you see a bunch of these phones light up at the same time,

41:09 then that's, like an indication they just, you just got hit by a high energy cosmic ray.

41:15 And, and, these things are very rare.

41:17 So if you want to see them, like the, one of the best shots you have is to try to cover as much of the earth with a detector as you can.

41:24 And, and it looks like if we have, you know, in the order of a million people running this app, which is a big number, but you know, not crazy big number, that we might be able to do some like really cutting edge science.

41:35 And, and be able to study like the very highest energy particles that people have ever seen in the universe.

41:42 So, so it's a really fun, you know, like citizen science project and it has a lot of data science challenges.
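
To make the "phones lighting up at the same time" idea concrete, here is a toy coincidence-finding sketch in Python; the field names, time window, and thresholds are invented for illustration and are not the actual CRAYFIS analysis.

```python
# Toy sketch of the coincidence idea: many phones report camera "hits" with a
# timestamp and location; a burst of hits from nearby phones in a narrow time
# window is a shower candidate. Field names and thresholds are invented.
from collections import defaultdict

def shower_candidates(hits, window_s=0.001, min_phones=5):
    """hits: list of dicts with 'phone_id', 'time' (seconds), 'lat', 'lon'."""
    buckets = defaultdict(set)
    for hit in hits:
        # bin time into short windows and location into coarse cells
        key = (int(hit["time"] / window_s),
               round(hit["lat"], 2), round(hit["lon"], 2))
        buckets[key].add(hit["phone_id"])
    return [key for key, phones in buckets.items() if len(phones) >= min_phones]

# Example: five different phones firing within the same millisecond, nearby.
hits = [{"phone_id": i, "time": 12.0003, "lat": 40.7291, "lon": -73.9965}
        for i in range(5)]
print(shower_candidates(hits))  # one candidate window
```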

41:49 I'm sure it does.

41:50 That's, that's really cool.

41:51 So does it run on the major phone platforms, iPhone, Android, things like that?

41:55 Yeah.

41:56 So we have an Android app and an, and an iPhone app that they're both kind of in beta right now.

42:01 and, and so they collect data and, you know, one phone can't say anything.

42:06 So they're uploading their data about, sort of when they saw flashes and, and where they were when they saw the flashes to, you know, a central server.

42:15 And, we've, we've just, so it's been fun because we, the people working on this all kind of have a LHC type of background, but we now get a fresh set of choices about what we're going to use.

42:26 So, so, you know, we have Elasticsearch and Django and Spark and, you know, you know, just much, you know, more kind of modern stack of, of, of tools for being able to process the data.

42:39 And then we're working on trying to make it so that the final stages of the data analysis are, are more like IPython notebooks.

42:46 And that we have this ability to, not just view the notebook, but to launch the notebook and recompute it.

42:53 So that's using things like SageMath or this binder project, and, and, and Everware, which have, you know, with Docker, all of the computing environment.

43:03 And this is going to, I think, be just awesome for doing outreach because they're, this has really grabbed the public's imagination.

43:09 We got covered by NPR and Wired and some people.

43:13 And so the, and IFL science.

43:15 So we have like almost a hundred thousand people that have signed up and we're, we haven't even launched the app yet.

43:22 That's really cool.

43:22 So if people want to get started, how do they do it?

43:25 Is that a thing they can do yet?

43:27 Or do we have to wait a while?

43:28 we're going to have to wait a little bit longer.

43:30 we're, we're working as fast as we can, with our ragtag team.

43:34 But the, but if you, you can sign up for it right now, if you go to the website, it's crayfis.io.

43:42 So it's like crayfish without the H.

43:44 Right.

43:45 So see, yeah, I'll definitely add a link.

43:47 C R A Y F I S dot IO.

43:50 That's right.

43:51 That's right.

43:51 Fantastic.

43:52 So this, you know, reminds me a little bit of the, the whole protein folding stuff that people try to do with, like, running it on idle computers, and SETI@home.

44:02 But, but it seems even cooler because this is actually the science.

44:06 That's right.

44:07 Yeah.

44:07 Happening.

44:08 Not just the post analysis computational bits.

44:11 That's right.

44:12 Yeah.

44:12 I think there's sort of three types of citizen science projects like this.

44:15 There are ones like SETI@home and the protein folding where you're basically donating compute cycles, you know, to solve a problem, but you're not really involved otherwise.

44:23 then there are projects like, there's a great platform called the Zooniverse, which I encourage people to go check out if you like these things.

44:33 So they are the very interactive.

44:35 You're like looking at pictures of galaxies and trying to, to classify them or, you know, like pictures of the ocean and trying to say if you see certain things in them.

44:44 And it's, it's a lot like mechanical Turk, you know, you, so you, you're farming out, these things that computers might not be great at.

44:52 We don't have, we're not good at writing algorithms for doing them, so we're having humans do it.

44:56 and this one, this project is neither of those.

45:00 The citizen science is actually the instrument.

45:02 I mean, you are, you are doing the science.

45:04 And, so that's, that's, yeah, I think, you know, that makes our project really unique.

45:10 Yeah.

45:11 That's really awesome.

45:11 So I encourage everyone out there to go sign up, participate.

45:15 I'll definitely do it.

45:16 This is exciting.

45:17 Great.

45:18 Thank you.

45:18 Yeah.

45:19 And then you said this is somewhat, done with a, a grant from Amazon web services.

45:25 That's right.

45:26 Yeah.

45:26 So, so we talked to Amazon and they thought it was an exciting, you know, an exciting project.

45:31 And so they, they've been great about trying to support different, you know, scientific projects that need some computing and that are still in a prototyping stage.

45:44 And, so they, yeah, they've given us a grant for some of the, you know, AWS, sort of budget.

45:50 And, and we're using that for, you know, collecting the data and some of our, you know, you know, offline processing of the data.

45:57 and, and, and we try to make it so that there's a kind of a feedback loop so that the, you know, the Django server is showing stats on what the current status of the network is.

46:07 and, so it's, it, that's been very nice of them.

46:11 And also Yandex, which is, like, sort of the Russian Google, has a group of people that are data scientists, part of a lab that they run, that are very well integrated at CERN.

46:23 And some of them have joined our project and they're, you know, helping us, you know, build the, the, the, sort of offline spark based, you know, data processing system and things like that.

46:33 So it's a lot of fun to have, you know, these different mix of people and a very fresh project.

46:39 We move very fast.

46:40 It's nice.

46:40 Yeah.

46:40 How exciting.

46:41 I mean, I know how it is to pick up a project that's been around for 10 or 20 years.

46:45 It's kind of stale in the technology and what you can do because you don't want to break stuff, but it's nice to start over and do exciting things, right?

46:52 Absolutely.

46:53 Yeah.

46:53 It's a breath of fresh air.

46:54 Cool.

46:55 So we're kind of getting short on time.

46:57 Let me ask you just a few more questions before we wrap things up.

47:00 What do you think the most amazing or interesting thing about the Large Hadron Collider and all the stuff you guys did there that people might not really know about?

47:08 Oh boy.

47:10 Part of it, I think that's really great about the whole project is just that, you know, it, it, we have people from all around the world, right?

47:18 And they're not all employed by one, you know, one, you know, central, you know, corporation or something like that.

47:26 And so, and we're all contributing.

47:28 So I guess in that sense, it has a lot of the flavor of, like, a successful open science, I mean, open source software project where people are contributing from all around.

47:36 And I think the other part is that, you know, everyone should feel proud about it because, you know, it's, you know, it's not a huge amount of your tax dollars.

47:45 You know, if you look at like, I forget for every thousand dollars of taxes you pay, it's just a few cents or something is going to this kind of project.

47:52 But, you know, everyone made this happen.

47:54 And I think everyone in the world should be proud of the fact that this, this happened.

47:59 Yeah, it's definitely a global effort, which is awesome.

48:02 It is a global effort.

48:03 And it's, you know, this is a, I mean, it's a really big accomplishment what's happened in terms of our understanding of the universe.

48:10 So, and there's obviously all of the software and computing and the technical challenges of pulling it off.

48:18 But, you know, it's a big multi-decade global thing.

48:22 And, you know, I don't know, humanity should be proud of itself.

48:25 Yeah, absolutely.

48:27 That's great.

48:27 How do you see Python in science sort of evolving, growing, changing, shrinking?

48:36 I mean, right now, I think it's growing very fast.

48:40 I mean, you know, the whole scientific Python side of things and, you know, distributions that are working, working better.

48:47 And GitHub is changing, you know, how people are doing science very, very quickly.

48:52 If you look in, like, the astronomy and astrophysics community, there have been, you know, some big efforts to create new analysis platforms.

49:01 There's a great project called AstroPy, which is a, you know, just a poster boy for, you know, for different, for, you know, kind of open science mentality of the community contributing to making the tools they need to do the science that they want.

49:18 And the, that's working pretty well because, you know, a lot of those experiments, they have a lot of new experiments coming up and they want a fresh start.

49:26 So, you know, we have this issue that we have a lot of legacy code and, you know, a lot of, you know, a lot of inertia about how we've done things.

49:35 But we're, we're trying to move things into a more, you know, a less siloed software picture that looks more like that model.

49:43 And, and Python is a big, I think Python is like the dominant kind of language of people that have that mentality and at least in, you know, particle physics and astrophysics.

49:53 Yeah, that, that's my feeling as well.

49:55 But obviously I do a lot less science than you do.

49:57 So that's, that's great to hear.

50:00 Yeah, I mean, I think the other place that you see it is just that to the extent that data science is attached to doing sciences, as opposed to like business analytics, you know, you see just a big influx of Python and R.

50:12 And so I think that, you know, there are a lot of other scientific fields that are undergoing a similar transformation, but maybe a little bit heavier on the R side than on the Python side.

50:25 Yeah, it's definitely down to those two, I think.

50:28 And modern analysis for the most part.

50:31 Right.

50:31 Kyle, thank you for being on the show and taking the time.

50:34 I know you're super busy and this, this is really, really interesting.

50:38 Thanks for sharing it with everyone.

50:39 No, absolutely.

50:40 Thank you for inviting me.

50:41 It's great.

50:41 Yeah, you bet.

50:42 Take care.

50:42 Bye.

50:43 This has been another episode of Talk Python to Me.

50:47 Today's guest was Kyle Cranmer and this episode has been sponsored by Hired and CodeShip.

50:51 Thank you guys for supporting the show.

50:53 Hired wants to help you find your next big thing.

50:55 Visit Hired.com slash Talk Python to me to get five or more offers with salary and equity presented right up front and a special listener signing bonus of $4,000.

51:03 CodeShip wants you to always keep shipping.

51:07 Check them out at CodeShip.com and thank them on Twitter via at CodeShip.

51:11 Don't forget the discount code for listeners.

51:13 It's easy.

51:13 Talk Python.

51:14 All caps.

51:15 No spaces.

51:16 You can find the links from today's show at talkpython.fm/episodes/show/29.

51:23 Be sure to subscribe to the show.

51:25 Open your favorite podcatcher and search for Python.

51:27 We should be right at the top.

51:28 You can also find the iTunes and direct RSS feeds in the footer of the website.

51:33 Our theme music is Developers, Developers, Developers by Corey Smith, who goes by Smix.

51:38 You can hear the entire song at talkpython.fm.

51:41 This is your host, Michael Kennedy.

51:43 Thanks for listening.

51:44 Smix, take us out of here.

51:47 Staying with my voice.

51:49 There's no norm that I can fill within.

51:50 Haven't been sleeping.

51:52 I've been using lots of rest.

51:53 I'll pass the mic back to who rocked it best.

51:56 Developers, Developers, Developers, Developers, Developers, Developers, Developers, Developers,

52:07 you .
