#29: Python at the Large Hadron Collider and CERN Transcript
00:00 The largest machine ever built is the Large Hadron Collider at CERN. Its primary goal was the discovery of the Higgs Boson: the fundamental particle which gives all objects mass. The LHC team actually achieved that audacious goal in 2012, winning the Nobel Prize in physics in the process. Today on Talk Python To Me, Kyle Cranmer is here to share how Python was at the core of this amazing achievement! This is episode number 29 recorded Thursday, September 24th 2015.
00:00 [music intro]
00:00 Welcome to Talk Python to Me, a weekly podcast on Python- the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy, follow me on twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the show on twitter via @talkpython.
00:00 This episode is brought to you by Hired and Codeship. Thank them for supporting the show on twitter via @hired_hq and @codeship
00:00 I don't have much news to share this week. But I am both honored and thrilled to bring you this episode. I can't wait for you to listen to it, so let's get right to it.
00:00 Let me introduce Kyle.
00:00 Kyle Cranmer is an American physicist and a professor at New York University at the Center for Cosmology and Particle Physics and Affiliated Faculty member at NYU's Center for Data Science. He is an experimental particle physicist working, primarily, on the Large Hadron Collider, based in Geneva, Switzerland. Cranmer popularized a collaborative statistical modeling approach and developed statistical methodology, which was used extensively for the discovery of the Higgs boson at the LHC in July, 2012.
02:07 Kyle, welcome to the show.
02:08 Thank you, it's a pleasure to be here.
02:10 Yeah, we have some amazing science and programming to talk about today, so I'm really excited to dig into all these topics with you.
02:18 Yeah, and I'm excited to see where it goes.
02:21 Yeah, for sure. So, let's, you know, we are going to talk about the Large Hadron Collider, about using Python for scientific research, and all those sorts of things as well as some other cool projects that you've got going on. But, people like to know how folk like you, in your position, kind of got started and they like to hear the background. So maybe we could start with you know, what got you interested in physics and what got you interested in programming, and how did you get to where you are.
02:48 I've been interested in physics since I was a kid, not really knowing that that is what it was called, but later on, I guess it was in high school, I really realized it was physics that I wanted to learn. I grew up in Arkansas, and it is not exactly known for leading physicists and computer scientists in the world, but they had started a special math and science high school, it was a public school but you actually lived there; and when I was there I was just surrounded by all sorts of people who were kind of the nerds and geeks of Arkansas and it was a really special time. So during that time, I got even more into physics, but that was also when I was first exposed to serious programming.
03:30 So, actually even Python, like in 1995, or actually 1994 I guess, I had a friend that was into early web frameworks and he was playing with Zope, and Zope, you know, the object database, and so I started working with him, and did some early web projects and that was kind of my first exposure to Python. So, that was a long time ago. But it is funny how much those experiences kind of keep being revisited today.
03:58 Yeah, I'm sure you keep coming back to that, you know. Basically programming these days seems like a required skill to be a physicist.
04:07 Yeah, I know, well, it depends on what you do, but definitely for what we do programming is a required skill, and unfortunately, you know, for people that don't have those strengths it really takes away from their ability to try to do the physics that they want to do, so you know, for incoming graduate students you usually see a pretty big divide between people that have some programming skills and people that don't, and usually the people that don't will catch up a little bit later, but you lose time and that is unfortunate.
04:37 Right, I'm sure it's like a huge scramble, like, "oh my Gosh, I've got to learn all this programming stuff too". Right?
04:45 Right. But it also colors a lot of the flavor of how we approach computing, because somehow you have this enormous computing problem that you need to deal with, and you would like to do it as nicely as possible, but it also can't be too fancy or the bulk of the physicists might not be able to understand what is going on. You have older physicists from the Fortran days, and you have younger physicists that maybe never took any programming courses, like any serious programming courses, so things have to be somehow kept simple, but still work for the difficult problems, and so it's a difficult balance to keep.
05:19 Yeah, I'm sure. So, let's talk a little bit about what you guys are doing at the Large Hadron Collider. And, first of all, congratulations on the Higgs boson discovery, that's amazing.
05:31 Oh thank you, it was many, many years of work and when it finally came, it was a huge treat, I don't know, it's funny to have such a big thing like that happen fairly early in your career.
05:47 Like, now what?
05:47 Yes, so at the LHC we have this huge collider that is in Switzerland, it's about 17 miles around, and underground, and there are all these superconducting magnets that help bend the protons around in a circle, at essentially the speed of light, and they are colliding together all the time, they smack into each other, and they make a lot of new particles that come flying out, and they hit our detector, and our detector you can think of sort of like a digital camera, you know, it's basically a bunch of pixels and the particles smack into it and you get an image; but it's a 3D image, so it's a 3D detector and the detector is like the size of a 12 story building, so-
06:31 Yeah, I think the- when you just hear about particle colliders and especially the LHC, you have maybe this idea of like a tube, or things are shooting around, but the actual experiments, the detectors, I was blown away when I learned about how big they are. Like, you said 12 story- these things are huge.
06:56 Yeah, they are. And the range of scales is pretty crazy because we have to be able to track where these particles go very precisely, so close to where they are interacting we are measuring things at the micron level, and then they fly out over like the size of a building, and we're still measuring where they are going at this very precise level. But, you know, it's just the [inaudible 07:19] things, you have to align it properly and all sorts of challenges there. And there are about 100 million or a few hundred million electronic readouts coming out of this beast, it's like a hundred-megapixel camera or something like that, and we are taking 40 million photos every second.
07:40 That's a stunning amount of data.
07:42 That is a stunning amount of data. And so we have to slap special electronics straight onto the detector, to be able to start preprocessing it and compressing it and sort of coming up with some way to deal with the data volume, because there are just totally staggering numbers about the data flow that is coming straight out of the detector.
08:05 So how do you capture and store that? Do you store that like on hardware right on like Atlas, on the machines or do you like get that into like a cluster of servers, or what happens?
08:17 Right, right. So, we have kind of a hierarchical online real-time system for tossing away the majority of the data. So we actually write algorithms that have to look at the data and decide in real time whether this looks interesting or not, and so we go from the sort of 40 million a second, through like 3 levels of filtering, down to the point that we save something like a few hundred of these collisions every second. And that turns into several petabytes a year of data that we actually analyze later.
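As a rough illustration of the layered filtering described here, the following is a minimal sketch of a three-level filter applied to a stream of fake events. Everything in it is an assumption made for illustration: the event fields (`energy`, `n_tracks`, `quality`), the thresholds, and the function names are hypothetical and are not taken from the actual trigger software. The only numbers from the conversation are the 40 million collisions per second and "a few hundred" saved per second, for which 400 is used as a stand-in.

```python
import random


def level1(event):
    # Hypothetical fast cut: keep only events with a large energy deposit.
    return event["energy"] > 50.0


def level2(event):
    # Hypothetical second cut: require at least two reconstructed tracks.
    return event["n_tracks"] >= 2


def level3(event):
    # Hypothetical final cut on an overall quality score.
    return event["quality"] > 0.9


def trigger(events, levels):
    """Yield only the events that pass every level.

    all() stops at the first failing level, so later (more expensive)
    levels only run on events that survived the earlier ones.
    """
    for event in events:
        if all(level(event) for level in levels):
            yield event


def fake_events(n, seed=0):
    """Generate n toy events with made-up distributions."""
    rng = random.Random(seed)
    for i in range(n):
        yield {
            "id": i,
            "energy": rng.expovariate(1 / 20.0),  # mean of 20 in arbitrary units
            "n_tracks": rng.randint(0, 5),
            "quality": rng.random(),
        }


kept = list(trigger(fake_events(100_000), [level1, level2, level3]))
print(f"kept {len(kept)} of 100000 toy events")

# Overall rejection quoted in the conversation, with 400/s standing in
# for "a few hundred" saved collisions per second.
reduction_factor = 40_000_000 / 400
print(f"reduction factor: {reduction_factor:,.0f}")
```

Because the events are consumed through generators, nothing is held in memory until an event has passed all three levels, which mirrors the idea of discarding most of the data before it is ever stored.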
08:55 That's amazing. It's got to be a little stressful to work on that initial filtering algorithm because- what if you threw away the Higgs boson before you discovered it, right?
09:04 That's right, yeah, we always worry that we are going to kind of throw away the baby with the bathwater, and sorry about the- [police sirens] living in New York.
09:14 Yeah, noise.
09:16 So we call that thing "the trigger", and that's something that I worked on a bit, it's true that if we don't find anything else in this next run of the LHC, a lot of people will think exactly that, maybe you know, the way the trigger was configured we were throwing away the interesting stuff, but luckily we are not stuck to that, we can go and we can change it, things like that. But that is the worry.
09:41 Sure, you still have the time spent, and the energy and all that, right.
09:45 Yeah. Sure.
09:48 You could re-run it of course, but you got to- I suspect time is a valuable thing on that machine.
09:54 Yeah, for sure, it's expensive to run, so absolutely.
10:01 Yeah, so, you know, I have a lot of listeners who are scientists and physicists and data scientists and so on, but also a lot who are probably not, so I want to make a movie recommendation and a book recommendation, just for people who want to kind of set the stage and learn the background of this whole thing we are talking about. I wanted to recommend the "Particle Fever" documentary. Have you seen this?
10:28 Yeah, yeah, I actually have a credit in that movie, I worked with them quite a bit, and at one point there was a scene that was shot in my office, but they ended up having to cut it because it didn't really fit well with the- it was a good choice, but it was painful, but they are nice, I worked with them a fair amount and I got to go to the opening in Sheffield at the documentary film festival and hang out with the producers and the whole crew. But, it's a great film, I think it's good for a non-physics audience also, it's not a technical film, it just basically captures what it's like to be inside one of these experiments and sort of the stress and the drama associated with it, I think it's really one of the best science documentaries, ever.
11:22 I absolutely agree with you, I think it really captures the excitement, the imagination, the drama, in a way that anybody could appreciate, and so I definitely recommend people watch that, it's available for streaming on Netflix and iTunes and other places. And then the other thing is the book called "Present At The Creation: Discovering the Higgs Boson", by Amir D. Aczel. That's also really good, so for people who are out there and want to learn more about what we are talking about, I recommend those.
11:52 Ok, great. I actually haven't read that book, but-
11:57 I really enjoyed that book, it predates the Higgs boson discovery, so it's like a lot of anticipation.
12:03 Ok, great.
12:05 Maybe we could talk a little bit about like the really big picture of software at the LHC, because there is not just one team and there is not just one experiment, there is- how many detectors are there, seven?
12:18 Right, so there are two really big kind of multipurpose particle detectors, Atlas and CMS, and I am on Atlas. And those two experiments each have in the neighborhood of a little bit more than 3000 physicists working on them. So it's a big group of people, and then there are two other experiments that are slightly smaller in scale and slightly more specialized in terms of the physics that they do, and then there are several other dedicated experiments that are quite a bit smaller. And so it depends on how you count a little bit, but usually it is the 2 big multipurpose detectors and the two other more specialized ones that are the dominant LHC experiments.
13:06 Ok, cool. And so maybe from like the higher level or larger scale like the thing that actually runs the machine down into the experiments done in the more details like the data processing details- could you give us a picture like what the software looks like there, what you guys are doing?
13:18 Sure. Yeah, so I mean, mainly we have a whole bunch of collisions, and if you think of each collision, in this metaphor, as an image, it's like a pipeline for doing a bunch of image processing, you know, and you are trying to find the collisions that maybe have evidence of some new particle.
13:39 So, we have lots of teams of people that are looking for different things, and each of those teams will develop a little pipeline to process the data to try to search for what they want. Also to put it into perspective a little bit, we had a couple of quadrillion collisions total at the LHC and when we discovered the Higgs, it was on the order of a hundred or a thousand of those collisions were the interesting ones.
14:08 So it's a huge "needle in a haystack" problem, but it's also not really like a data mining kind of just generally looking for something weird in the data. We have theories that tell us what to look for, which is good, because these are such small deviations in the data that they would be basically impossible to find if you didn't have a good guide. And then, this processing chain, because there is so much data and performance is such an issue, several years ago the decision was to write most of the software in C++, and you know C++ has also evolved a ton during that time, so...
14:45 Are you using like, are people using like C++11, and those types of things?
14:50 So the different experiments kind of move to these new standards and new computing technologies kind of at different paces. There is a lot of worry. It's generally a pretty conservative attitude, but we are making those kinds of transitions, but it has to go through a lot of things before we make a big jump like that. We also usually have a very homogeneous computing environment, in terms of like operating systems and things like that, because we run into issues where you don't want to have to be worrying about like floating point arithmetic, or something.
15:32 It's a little bit funny, CERN was responsible for sort of developing the web browser, HTML and things like that, and so it had this huge win of being where the web was born, and that was followed by this idea of like "ok we have the web, now we are going to have grid computing", and there was a lot of money poured into it, and the promise of the grid basically turned into what has happened with the cloud. But within high energy physics we do have the grid, but it's kind of like a huge global batch system in some sense, so it tends to be more uniform and things like that, than what- at first people worked really hard to be able to work over very heterogeneous computing environments, but that all evolved over more than a decade, so...
16:25 Yeah, I'm sure. I used to do a lot of work in sort of scientific computing and visualization and it's super hard to do reproducibility and checking stuff, you know.
16:36 Right.
16:37 If you've got a sufficiently complicated series of mathematical steps you are going to apply to something, it's so complicated, how do you know when you are right or not, you know?
16:47 Right.
16:48 How do you know when you are discovering something new versus, "Oh, it's like I expected", or whatever, right?
16:53 Right, we are working a lot right now on trying to address the sort of reproducibility issues kind of specific and the challenges associated to our field, and there are a lot of challenges, so much data and the software is very complicated.
17:06 Yeah so the core algorithms tend all to be in C++ but they are organized into lots of different tools, and you have a way of kind of composing this pipeline between different processing algorithms, and in the end the configuration of that thing is such a beast, but that is the first place where you see Python happening: we have a way of kind of doing introspection on all the tools, and then we just represent their configuration in terms of Python objects, and then there is a whole separate layer of programming, which is essentially just the configuration, and that includes both this trigger, that online system that is tossing out the data, as well as how the people that are analyzing the data configure all these tools, to be able to process the events into something that is more manageable-
17:56 That sounds really interesting. When I was doing some research, it seems like one of the major pieces used in Atlas was this thing called "Athena"
18:08 Right, that's right, that is kind of the name of the C++ framework that we use, that also includes the way that it builds the Python bindings for configuring all the tools, and yeah, so that's- yeah. I spent more hours than I'd like to admit doing programming in that framework. But then what is also interesting, and I think a lot of the audience will find interesting, is that once you've used that huge heavyweight data processing pipeline, usually you get to something quite a bit smaller, and that is where a lot of the more interactive and exploratory part of the data analysis happens. And at that stage, a lot of people stop using things like Athena for the most part, and that is where you start seeing people using Python a lot more in terms of data analysis, and so it is an interesting transition, because people are always arguing about where you make that swap, you know.
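To make the idea of "configuration represented as Python objects" a little more concrete, here is a minimal sketch. It is not the experiment's actual framework or its API: the class names `ToolConfig` and `Pipeline`, the tool names, and the properties are all hypothetical. It only shows the general pattern of Python objects that hold the settings for tools implemented elsewhere, reject unknown properties, and can be dumped as one plain structure to hand to the processing code.

```python
class ToolConfig:
    """Python-side stand-in for the configuration of one (hypothetical) C++ tool."""

    def __init__(self, name, **defaults):
        self.name = name
        # The declared properties play the role of what introspection
        # on the real tool would report.
        self._defaults = dict(defaults)
        self._props = dict(defaults)

    def set(self, **overrides):
        """Override declared properties; unknown names are an error."""
        for key, value in overrides.items():
            if key not in self._defaults:
                raise AttributeError(f"{self.name} has no property {key!r}")
            self._props[key] = value
        return self

    def properties(self):
        """Return a copy of the current settings."""
        return dict(self._props)


class Pipeline:
    """Ordered collection of tool configurations."""

    def __init__(self):
        self.tools = []

    def add(self, tool):
        self.tools.append(tool)
        return tool

    def dump(self):
        """Flatten the whole configuration into a plain dict."""
        return {tool.name: tool.properties() for tool in self.tools}


# Hypothetical tools and properties, for illustration only.
pipeline = Pipeline()
tracker = pipeline.add(ToolConfig("TrackFinder", min_pt=0.5, max_eta=2.5))
jets = pipeline.add(ToolConfig("JetBuilder", cone_size=0.4))

tracker.set(min_pt=1.0)  # one analysis team tightens a cut
print(pipeline.dump())
```

Rejecting unknown property names at configuration time is one design choice that helps when the configuration itself becomes "such a beast": a typo fails immediately in Python rather than silently leaving a default in place.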
19:03 I suspect you guys probably do a lot of IPython? Is that true?
19:06 You would think that more people would. I guess part of it is that even at that stage you still have so much data to process that the kinds of things that people end up wanting to do are well suited to having programs that really look like programs that run, and they might be Python based, but you know, you kind of have a batch system run over this thing and then you get some results and look at them.
19:33 There are times when you are doing something very interactive, and so years ago, the team at CERN that makes this tool called "ROOT", which is kind of the dominant data analysis package in high energy physics, came up with something like an interpreter, because you want to sit there and have this feedback loop where you type commands, and that was actually done amazingly, they wrote a C++ interpreter many, many years ago, and so you actually write these commands in C++ and then that's interpreted and executed on the fly.
20:09 That's actually pretty interesting by itself, isn't it?
20:12 It is interesting. It of course had all sorts of issues and C++ wasn't really meant for doing that, but it worked practically, and now they have gone through and they have a much heavier version of this interpreter that's based on Clang and more modern compiler technologies and things. But Python obviously is another way to go with that, which is nice.
20:12 [music]
20:12 This episode is brought to you by Hired. Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.
20:12 Each offer you receive has salary and equity presented right up front and you can view the offers to accept or reject them before you even talk to the company. Typically, candidates receive 5 or more offers in just the first week and there are no obligations, ever.
20:12 Sounds pretty awesome, doesn't it? Well did I mention the signing bonus? Everyone who accepts a job from Hired gets a $2,000 signing bonus. And, as Talk Python listeners, it gets way sweeter! Use the link hired.com/talkpythontome and Hired will double the signing bonus to $4,000!
20:12 Opportunity is knocking, visit hired.com/talkpythontome and answer the call.
20:12 [music]
21:43 So some people started moving to the Python way of doing things, I don't know when that would be, 8 years ago or something like that, but the field is kind of split between which way, and since then things like IPython and the IPython Notebook have come around, and I think that is especially interesting from the point of view of reproducibility. So, we give tons of talks; if you go to CERN, we have this agenda system and you can see that there are like hundreds of thousands of presentations happening on these experiments every year.
22:15 That's excellent.
22:17 Yeah, so one of the things is that they are always like PowerPoint or Keynote or whatever LaTeX-based PDF presentations, and you read about what someone is doing, but it is not very handy for trying to have reproducibility or for another graduate student to pick up where someone left off. So, we have this effort now to try to make it so that the agenda system can basically display notebooks directly, so people can upload an IPython notebook directly and visualize it, and if someone else thinks it's interesting, they can download it and execute it.
22:49 There are efforts about trying to make it so that the whole computing environment associated with that notebook can be packaged up. Because they usually aren't just standalone IPython notebooks with like a SciPy dependency, they have a bunch of dependencies. So if you can package that all up, that's very handy. So there are tools like Binder, now there's a tool called Everware, and previously there was something like SageMath, which all allowed you to sort of execute a notebook, you know, but the problem was how do you get all these software dependencies packaged up, and that problem is starting to be solved.
23:28 Right, that is really excellent, because I can imagine you guys have so much data, and maybe these back end systems you've got to reach into it to actually work with the data that you are trying to do physics on.
23:42 That's right.
23:42 That you can't just take the program and hand it out, you know, like "oh and here is our 50GB of data you've got to get it this way", right?
23:48 Right. And not only that, there are also things like databases, that say like how was the detector aligned on Friday, November 21st or something, and so there are all these databases involved that you have to connect to for the software to run, and that's also through tons of authentication layers, so it's a huge pain in the butt basically. But people are solving it, and I think that will be a huge change, and the Project Jupyter people luckily had this great foresight to separate the notebook from the background kernel, so we are actually also writing a kernel based on the C++ interpreter of ROOT, so it still looks like notebooks, and all the display and everything is the same, but in the background, instead of Python, it's the C++ interpreter, which is interesting.
24:40 Yeah, that certainly opens it up to a much wider audience like you are saying, the group that is working directly with Athena and so on, they can just possibly start using IPython in an hour. But you call them Jupyter notebooks- I don't know, I'm not really sure about the name
24:58 Yeah, the front end, kind of language-agnostic part is now Project Jupyter. But it's great, because we have people like Fernando Perez, who is leading this effort, as part of the advisory board for a grant that we got from the National Science Foundation, to try to take the tools that have been developed in high energy physics, which are mainly very siloed, like, we are trying to solve our problem, it's a very hard problem, and we don't have a lot of extra time or money. But now we've done some nice things and so let's try to open that up, make it more interoperable with like the scientific Python world, and it's definitely a two way street. There are lots of other great tools out there that we don't use, so we are working on improving the interoperability of all of these things.
25:39 Yeah, I think that is going to be good for science, all over. And the Jupyter guys just got a huge grant, I'm not sure who all the folks are that contributed, but it was millions, like 6 million or something like that? Do you remember?
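The separation between the notebook front end and the background kernel can be sketched in a few lines. This is a toy model only and an assumption about how to illustrate the idea: real Jupyter kernels talk to the front end through a messaging protocol over ZeroMQ, which is not shown here, and the class names `PythonKernel` and `RpnKernel` and the function `run_notebook` are hypothetical. The point is simply that the front end sends code as text and displays whatever comes back, so swapping the kernel changes the language without changing the notebook.

```python
class PythonKernel:
    """Toy kernel that evaluates Python expressions."""

    language = "python"

    def __init__(self):
        self._namespace = {}

    def execute(self, code):
        # eval() handles expressions only, which is enough for a sketch.
        return repr(eval(code, self._namespace))


class RpnKernel:
    """Toy kernel for a different 'language': reverse Polish arithmetic."""

    language = "rpn"

    _ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
    }

    def execute(self, code):
        stack = []
        for token in code.split():
            if token in self._ops:
                right = stack.pop()
                left = stack.pop()
                stack.append(self._ops[token](left, right))
            else:
                stack.append(float(token))
        return repr(stack[-1])


def run_notebook(cells, kernel):
    """Front end: sends each cell to the kernel and formats the reply.

    It never inspects the code, so it works with any kernel that
    provides an execute(code) method.
    """
    return [
        f"In [{i}]: {cell}\nOut[{i}]: {kernel.execute(cell)}"
        for i, cell in enumerate(cells, start=1)
    ]


for line in run_notebook(["1 + 2", "2 ** 10"], PythonKernel()):
    print(line)
for line in run_notebook(["1 2 +", "3 4 * 2 /"], RpnKernel()):
    print(line)
```

The same `run_notebook` function drives both kernels unchanged, which is the property that makes a C++ kernel behind the usual notebook display possible.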
25:51 I don't know the number, but they rightfully had been given some support because they are doing some great things and-
26:00 Yeah, I'm really happy to see that. So, you know, people will start to build C++ and I guess Ruby and maybe I imagine Fortran, I don't know- that is probably supported somewhere inside, I try not to touch that stuff.
26:15 Right.
26:15 You talked about this sort of uniform computing environment, and I know just where you are coming from with that, just the whole reproducibility and setup and everything. Can you talk to like are you using Linux, what distribution, what do things look like there?
26:30 Right, so it is definitely Linux based. CERN has a distribution called Scientific Linux that they maintain. I've kind of stopped following all the ins and outs, to be honest, but at some point it was derived from some Red Hat type package way back in the day, but since then it has evolved and now I'm not even sure what this distribution is closest to. And then there has been quite a bit of emphasis on virtualization technologies, OpenStack related things, and we are now also starting to see more use of Docker images. So, I have a student who is working on making Docker images that have a very specific computing environment, specifically for the issue of reproducibility.
27:21 What we also haven't really seen in high energy physics, but are starting to see now, is more and more web-based tools and things like that, so you have web services. And we have a project going on where we are trying to make analyses not just reproducible but re-interpretable. So, you had a team of people analyzing data and they were asking certain questions, but you can reuse that analysis pipeline to answer other questions. So we are trying to wrap up this very heavy computational pipeline with a very simple web interface, web APIs and things, so that people can submit requests to the system and they'll be processed through all of this infrastructure and then you'll come out with a very simple answer. And-
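The shape of such a service, a simple request going in, a heavy archived pipeline running behind it, and a very simple answer coming out, might be sketched as follows. This is an illustration under stated assumptions, not the project's real interface: the function `handle_request`, the analysis name `toy-analysis-1`, the request fields, and the cut values are all hypothetical, and a one-line selection stands in for the real processing chain.

```python
import json

# Hypothetical registry of archived analyses. In a real service each entry
# would point at a preserved software environment, not two numbers.
ARCHIVED_ANALYSES = {
    "toy-analysis-1": {"min_energy": 100.0, "max_allowed_events": 3},
}


def select(event_energies, min_energy):
    """Stand-in for the heavy processing pipeline: a single cut."""
    return [energy for energy in event_energies if energy >= min_energy]


def handle_request(body):
    """Take a JSON request, run the archived analysis, return a JSON answer."""
    request = json.loads(body)
    analysis = ARCHIVED_ANALYSES.get(request.get("analysis"))
    if analysis is None:
        return json.dumps({"status": "error", "reason": "unknown analysis"})

    selected = select(request["event_energies"], analysis["min_energy"])
    return json.dumps(
        {
            "status": "ok",
            "n_selected": len(selected),
            # The requester only sees a simple yes/no style answer.
            "compatible": len(selected) <= analysis["max_allowed_events"],
        }
    )


# A requester asks a new question of an old analysis.
reply = handle_request(
    json.dumps(
        {
            "analysis": "toy-analysis-1",
            "event_energies": [20.0, 150.0, 99.9, 300.0],
        }
    )
)
print(reply)
```

The requester never touches the pipeline, its configuration, or the underlying data, which is why hosting the service can be acceptable even where releasing everything openly is not.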
28:09 That's really cool.
28:10 Yeah, I think it will be great, and also it addresses a lot of issues with reproducibility when you can't just say, like typical open data, here it is, do it again. Because there are so many steps involved, and the configuration and everything is so heavy that basically almost no one can do that, but if the experiments host that service, it's very valuable, so...
28:36 Yeah, that makes sense. I suspect if it's sufficiently complicated and has enough configuration parameters and variations, even the original researchers can't reproduce it if they don't have the details, right?
28:47 Yeah, well, I mean-
28:48 I mean without redoing the work from scratch, literally.
28:51 Exactly. I mean the goal is that we should all be able to do it. In practice, that's, you know, rarely checked, but that is what we are working on now, to try to make it so we can more confidently say yes, we can actually reproduce this stuff.
29:05 That's super cool. I hadn't even thought about Docker but that makes perfect sense. I always think of Docker as, here is the way I'm going to horizontally scale my web app so I can do that easier, or like get higher [inaudible 29:17] on servers in a web data center type place, but it makes really good sense for scientific computing, doesn't it?
29:24 Yeah, so we have these experiments, and at this point each of the big experiments has put out several hundred papers, so you have like thousands of scientific results, and now, if associated with each one of these results you have some sort of Docker image, and you can on demand spin up this service when you want to reproduce or reinterpret what was done, then you end up making a very powerful high level scientific tool. So, it's been an idea for several years, but now, with these tools that are around, it's really possible, and it's starting to happen.
30:00 Yeah, I think that's fantastic. And are you guys considering or actively putting these into the official Docker repository?
30:09 So right now, the model is that it is going to be hosted at CERN, partially because it's a lot of disk space and we are just working closely with the CERN IT people, and there is a lot of trust between CERN computing and IT and the experiments, so that is how it is being developed now, but some of these things are sensitive, you know, these are big international collaborations and different countries are kind of at different places in terms of their attitude about open science and things like that. So some people still want to keep these things closed, but they are willing to entertain the idea of hosting the service. So that's a big political discussion.
30:52 Actually, the more I think about it, it brings me back to when I was in grad school, yes, I can see that. I suspect once a paper is published in a peer-reviewed journal and approved, pretty much everybody would be pretty happy to have their stuff public, but as you are developing that, right, before you've declared the Higgs boson to be found for example, you wouldn't necessarily want to give away all your algorithms so that other people can take a shot at it, right, you want to keep that until you are ready to publish your papers and publicly release results, right?
31:26 Yeah, and that's true. And also, you know, I mean the experiments also worry a lot about- you have to be kind of careful about the scientific message that is coming out of these experiments, I mean they are very, very expensive, international projects and you don't want people like claiming this and that and adding a lot of noise and drama, so we work very hard to get lots and lots of internal cross checks of things and then make sure everyone is on the same page and then we want to have kind of one unified voice from an experiment and then we have another experiment to check it. But, we don't want too much noise, so that is a lot of motivation for keeping these things kind of internal until they are ready.
32:09 Yeah, absolutely. It took several years for the Higgs boson analysis to finally declare "hey, we found it", right?
32:17 Yeah, that had a lot to do with just collecting enough data. Once we had enough data, the process was streamlined enough that it was really a matter of like a week and a half or something; it was about a week and a half from the last data that we took to the talk where the discovery claim was made, but that was in some sense just adding the last bits of data. But, yeah.
32:44 Yeah, very cool. So, I want to ask you a few more things about LHC and CERN and particle physics. And then we can talk about space a little bit maybe.
32:53 Oh sure, yeah.
32:55 So you said one of your goals there is that you guys are trying to move towards sort of dedicated professional software roles there at CERN. Can you talk about that a little?
33:06 Right, so I think, you know, physicists obviously need to write a lot of code, and be proficient to be able to do the science they want, but there is also a lot of infrastructure for the processing that is needed. So, typically, what happened was the physicists that were very strong in computing kind of specialized in that, and then, you know, became sort of software professionals with a physics background, and that model has worked surprisingly well. There are some people, relatively few people, that don't really have a physics background and are really more just software professionals. But, separately, CERN has had an IT department that deals with the actual computing infrastructure, but has also developed several different tools that were more services-
33:58 Sure, because you guys have a lot of computers, a lot of network, how many computers do you have?
34:02 The CERN- I should know the number-
34:06 Like a hundred thousand or something, right, like really a lot.
34:08 It's a lot. There is a great YouTube video that you can post associated with this. It has a wonderful overview of the whole processing at CERN and lots of little factoids, and it is nicely produced. But I'm not going to say a bunch of round numbers, so-
34:26 Sure.
34:26 [music]
34:26 This episode is brought to you by Codeship. Codeship has launched organizations. Create teams, set permissions for specific team members, and improve collaboration in your continuous delivery workflow. Maintain centralized control over your organization's projects and teams with Codeship's new organizations plan.
34:26 And as Talk Python listeners, you can save 20% off any premium plan for the next 3 months. Just use the code TALKPYTHON.
34:26 Check them out at codeship.com and tell them "thanks" for supporting the show on Twitter where they are at @codeship.
34:26 [music]
35:17 But there were people that were on that side, and then also just general services, like the system we use to manage all the papers and the comments on the papers and things like that, and an agenda system where people put up their talks and schedule meetings and the videos, blah-blah-blah. And the people that were working on that have started branching out into the kinds of services they provide. So, I guess one example that is maybe interesting is that the very first website in the United States was called SPIRES, and it preceded the arXiv, where scientists put the versions of their papers before they are published, this preprint server.
36:05 So SPIRES was basically just the database of all of the high energy physics literature, you know, like who wrote what papers, and who cited whom and so on. So the first website in the United States was where physicists would go to see who is citing whom and which papers are about certain topics. And that has evolved into something new called INSPIRE, and that is, you know, an international effort, but the software technologies that it is based on, which are mainly Flask and Pythonic-based things, have evolved into a set of different tools that are pretty nice.
36:42 So one of them is a data repository, and this is not actually solving CERN's problem, it's for the long tail of science. All these small experiments that are out there that have some data can put their data onto these servers, and then they can refer to that data when they write their papers.
36:59 So there is a service called Zenodo, and Zenodo started working with GitHub and made it so that now it's not just data that you can point to when you're writing a paper, but also the software for your paper. There is a little webhook in GitHub so that whenever you make a new release, it will push a copy of the code to the servers of Zenodo, and they will issue a DOI, a digital object identifier, which the publishing industry knows how to use, and it points to the specific version of the code that was used, and you can download that version of the code. So it's out of the realm of version control; it's just something that GitHub does not provide, like people can delete their GitHub repo, but once you get a DOI there is kind of a trust relationship that that code, exactly that code, is always going to be there-
37:50 Like, almost like escrow for your code, right?
37:53 Yeah, exactly. So it's redundant in a sense, but there is a link back, obviously, to the repository so you can see how it has evolved. But this is the basic common denominator for all of the publishing industry and how we track citations to different research outputs, whether they be papers or data or software. So that connection is nice, and now there are actually thousands of pieces of code that have been pushed from GitHub to Zenodo, where they have been given these DOIs. And that is also Flask based, and they have been very nice about how they break up the project into different, you know, Flask components and things like that.
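To make the DOI idea above concrete, here is a minimal sketch of turning a DOI into a link a reader can follow. It assumes only that DOIs resolve through the public https://doi.org/ resolver; the function name `doi_url` and the example identifier are illustrative placeholders, not a real Zenodo record or anything described in the episode.

```python
from urllib.parse import quote

# Public DOI resolver; a DOI appended to this prefix redirects to the archived record.
DOI_RESOLVER = "https://doi.org/"


def doi_url(doi: str) -> str:
    """Build a resolver URL for a DOI (hypothetical helper, for illustration only)."""
    doi = doi.strip()
    # Very light sanity check: DOIs start with "10." and contain a prefix/suffix slash.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the slash between prefix and suffix.
    return DOI_RESOLVER + quote(doi, safe="/")


# Placeholder identifier, not a real record.
print(doi_url("10.5281/zenodo.0000000"))
```

The point of the sketch is that a paper only needs to carry the short identifier; the link to the exact archived version of the code can always be rebuilt from it.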
38:35 And now we are reusing that same kind of infrastructure to try to build these tools for doing reproducibility and re-interpretations and things like that. So there is a new effort that is evolving that has much more of a modern Pythonic programming and web services mentality, which is not necessarily coming from within the experiments, but it is coming from within the CERN lab, and I think it is going to change how the experiments approach these problems in the next few years.
39:07 That's really cool, it seems like a huge step down the right path for reproducibility.
39:11 Yeah, and I think it's very important for what we are doing.
39:15 Yeah, of course. Ok, the last thing I wanted to ask you about was this project called Recast that you guys have going, maybe not you necessarily, but that folks at CERN and the LHC have going.
39:28 So Recast is the system I was referring to that is for trying to make the analyses re-interpretable, and that is the one where we are bringing in these Docker images for the different analysis pipelines and trying to make this web service for being able to either reproduce or reinterpret one of the published LHC analyses. So that one is definitely within CERN but using a lot of these more modern web service type approaches. There was another project that maybe you also want to talk about, which was the space one, I don't know if that is what-
40:02 Yeah, absolutely, so-
40:04 I don't know if there is- if we are running out of time?
40:06 No, no, we definitely have time, let's talk about that one, and that's Crayfis, yeah?
40:09 That's right. So Crayfis is this project which we started up, which is a lot of fun, and it's a very small team of people right now, so it's kind of the opposite of these LHC experiments. And our basic idea is that we want to try to detect ultra high energy cosmic rays, so these are particles coming from outer space that have much higher energies than the LHC, and they smack into the atmosphere and create a shower of new particles, so you get tons and tons of particles hitting the ground.
40:43 And the idea is that if you get a bunch of people running an app on their phone, the camera of your phone is like a miniature particle detector. So, if one of these particles happens to fly through your camera, it will light your camera up. And so the basic idea is that if you get a bunch of people at night running this app while their phone's plugged in and they are sleeping, and you see a bunch of these phones light up at the same time, then that is an indication you just got hit by a high energy cosmic ray.
41:14 And these things are very rare, so if you want to see them, one of the best shots you have is to try to cover as much of the Earth with the detector as you can, and it looks like if we have on the order of a million people running this app, which is a big number, but, you know, not a crazy big number, we might be able to do some really cutting edge science and be able to study the very highest energy particles that people have ever seen in the universe. So, it's a really fun citizen science project, and it has a lot of data science challenges.
41:50 I'm sure it does. That's really cool. So, does that run on the major phone platforms, iPhone, Android, things like that?
41:56 Yeah, so we have an Android app and an iPhone app, they are both kind of in beta right now, and so they collect data and, you know, one phone can't say anything, so they are uploading their data about when they saw the flashes, and where they were when they saw the flashes, to a central server. And so it has been fun, because the people working on this all kind of have an LHC type of background, but we now get a fresh set of choices about what we are going to use.
42:26 So, you know, we have Elasticsearch, Django, and Spark, and, you know, just a much more modern stack of tools for being able to process the data. And then we are working on trying to make it so that the final stages of the data analysis are more like IPython Notebooks, and that we have this ability to not just view the notebook but to launch the notebook and re-compute it. So that is using things like SageMath or this Binder project and Everware, which capture, with Docker, all of the computing environment. And this is going to, I think, be just awesome for doing outreach, because this has really grabbed the public's imagination. We got covered by NPR and Wired and some other people, and so we have almost a hundred thousand people that have signed up and we haven't even launched the app yet.
43:22 That's really cool. So, if people wanted to get started, how do they do it? Is that a thing they can do yet or do we have to wait a while?
43:30 You are going to have to wait a little bit longer, we are working as fast as we can, but you can sign up for it right now if you go to the website, it's crayfis.io.
43:45 Right, so I'll definitely add a link. C-R-A-Y-F-I-S.io.
43:51 That's right.
43:52 Fantastic. So this reminds me a little bit of the whole protein folding stuff that people try to do by running it on idle computers, and SETI@home, but it seems even cooler because this is actually the science.
44:07 That's right, yeah.
44:08 - happening, not just the post analyses computational bits.
44:11 That's right, yeah. I think there are sort of three types of citizen science projects like this. There are ones like SETI@home and the protein folding, where you are basically donating compute cycles to solve a problem but you are not really involved otherwise. Then there are projects like, there is a great platform called the Zooniverse, which I encourage people to go check out if you like these things, they are very interactive.
44:35 You are looking at pictures of galaxies and trying to classify them, or pictures of the ocean and trying to say if you see certain things in them, and it's a lot like Mechanical Turk. You are farming out these things that computers might not be great at, or that we are not good at writing algorithms for, and having humans do it. And this project is neither of those. The citizen science is actually the instrument, I mean, you are doing the science, and so I think that makes our project really unique.
45:10 And that's really awesome. So I encourage everyone out there to go sign up, participate, I'll definitely do it, this is exciting.
45:18 Great, thank you.
45:19 Yeah, and then you said this is somewhat done with a grant from Amazon Web Services?
45:26 That's right, yes. So we talked to Amazon, they thought it was an exciting project, and they've been great about trying to support some different scientific projects that need some computing and that are still in a prototyping stage. So they have given us a grant for some of the AWS budget, and we are using that for collecting the data and some of our offline processing of the data, and we try to make it so that there is kind of a feedback loop so that the Django server is showing stats on what the current status of the network is.
46:07 And so that's been very nice of them. And also Yandex, which is like the Russian Google, has a group of people that are data scientists and part of a lab that they run, and they are very well integrated at CERN. Some of them joined our project and they are helping us build the offline Spark-based data processing system and things like that. So it's a lot of fun to have the different mix of people and a very fresh project. We move very fast, it's nice.
46:40 Yeah, how exciting, I mean, I know how it is to pick up a project that's been around for ten or twenty years, it's kind of stale in the technology and what you can do because you don't want to break stuff, but it's nice to start over and do exciting things, right?
46:52 Yeah, absolutely. A breath of fresh air.
46:56 Cool, so we are kind of getting short on time, let me ask you just a few more questions before we wrap things up. What do you think is the most amazing or interesting thing about the LHC and all the stuff that you guys did there that people might not really know about?
47:09 Oh, boy. Part of what I think is really great about the whole project is just that we have people from all around the world, right, and they are not all employed by one central corporation or something like that. So, we are all contributing, so I guess it has a lot of the flavor of a successful open software project where people are contributing from all around. I think the other part is that everyone should feel proud about it, because it's not a huge amount of your tax dollars. If you look at it, I forget, for every $1000 of taxes you pay just a few cents or something is going to this kind of project, but, you know, everyone made this happen, and I think everyone in the world should be proud of the fact that this happened.
48:00 Yeah, it's definitely a global effort which is awesome.
48:03 It is a global effort, and it's a really big accomplishment, what's happened in terms of our understanding of the universe. And there is obviously all of the software and computing and the technical challenges of pulling it off, but it's a big multi-decade global thing. Humanity should be proud of itself.
48:27 Yeah, absolutely, that's great. How do you see Python in science, sort of evolving- growing, changing, shrinking?
48:36 Right now, I think it is growing very fast. The whole scientific Python side of things and the distributions are working better, and GitHub is changing how people are doing science very quickly. If you look in the astronomy and astrophysics community, there have been some big efforts to create new analysis platforms, so there is a great project called Astropy, which is just a poster boy for a different kind of open science mentality, of the community contributing to making the tools that they need to do the science that they want.
49:17 That's working pretty well because they have a lot of new experiments coming up and they want a fresh start. So at the LHC we have this issue that we have a lot of, like, C code and a lot of inertia about how we've done things, but we are trying to move things into a less siloed software picture that looks more like that model, and Python is a big, dominant kind of language for people that have that mentality, at least in particle physics and astrophysics.
49:53 Yeah, that's my feeling as well but obviously I do a lot less science than you do. That's great to hear.
50:00 Yeah, I mean, the other place that you see it is just that, to the extent that data science is attached to doing science as opposed to business analytics, you see just a big influence of Python. And so I think that there are a lot of other scientific fields that are undergoing a similar transformation, but maybe a little bit heavier on the R side than on the Python side.
50:25 Yeah, it's definitely down to those two I think, modern analyses for the most part.
50:32 Right.
50:34 Kyle thank you for being on the show and thank you for taking the time, I know you are super busy, this is really, really interesting. Thanks for sharing with everyone.
50:39 Absolutely. Thank you for inviting me. It's great.
50:42 Yeah, you bet.
50:43 Take care, bye.
50:43 This has been another episode of Talk Python To Me.
50:43 Today's guest was Kyle Cranmer and this episode has been sponsored by Hired and Codeship. Thank you guys for supporting the show!
50:43 Hired wants to help you find your next big thing. Visit hired.com/talkpythontome to get 5 or more offers with salary and equity right up front and a special listener signing bonus of $4,000 USD.
50:43 Codeship wants you to ALWAYS KEEP SHIPPING. Check them out at codeship.com and thank them on twitter via @codeship. Don't forget the discount code for listeners, it's easy: TALKPYTHON
50:43 You can find the links from the show at talkpythontome.com/episodes/show/29
50:43 Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes and direct RSS feeds in the footer on the website.
50:43 Our theme music is Developers Developers Developers by Cory Smith, who goes by Smixx. You can hear the entire song on our website.
50:43 This is your host, Michael Kennedy. Thanks for listening!
50:43 Smixx, take us out of here.