#144: Machine Learning at the Large Hadron Collider Transcript
00:00 Michael Kennedy: We all know Python is becoming increasingly important in both science and machine learning. This week we journey to the very forefront of physics. You'll meet Michela Paganini, Michael Kagan and Matthew Feickert. They all work at the Large Hadron Collider and are using Python and machine learning to help make the next major discovery in physics. Join us this week on Talk Python to Me, Episode 144, recorded December 14th, 2017. Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show, listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython. This episode is brought to you by Linode and Talk Python Training. Be sure to check out what they're offering during their segments. It really helps support the show. Hey everyone, before we get to the interview, I want to share a quick update about our Python courses with you. Do you work on a software team that needs training and could really use the chance to level up their Python? Maybe your entire company is looking to become more proficient. We have special offers that make our courses here at Talk Python the best option for everyone you work with. Our courses don't require an ongoing subscription like so many corporate training options do, and they're priced roughly the same as a book. We're here to help you succeed. Send us a note at sales@talkpython.fm to start a conversation. Now let's get to the interview. Michael, Michela, Matthew, welcome to Talk Python.
01:42 Michael Kagan: Hi everyone, thanks for having me.
01:45 Matthew Feickert: It's an honor to be on the show. Thanks so much for inviting us.
01:48 Michael Kennedy: Yeah you bet, the honor is mine to be honest. You guys are doing amazing stuff. You're pushing the boundaries of science, and I am just really excited that you are here to share what you are working on, and how that intersects with the Python space, with my listeners. We're going to talk about the Large Hadron Collider, particle physics and machine learning, and how those three amazing things go together. But before we get into that, let's just start real quickly with how you guys got into programming and Python. Michela, you want to go first?
02:17 Michela Paganini: Yeah, actually when I started programming, it certainly wasn't Python for me at first. It was IDL and MATLAB in some undergraduate physics and astronomy labs, but then when I got to CERN and started my career as a graduate student on ATLAS, that's when I first encountered Python and C++. So just like everybody else, I learned on the job, and I'm really thankful that I had great mentors who taught me what good code and bad code looked like, so that I could start writing good code instead.
02:47 Michael Kennedy: I feel like so much of what people do in programming really is learned on the job, even for people who have degrees in computer science. You study one thing, but then you go and actually build something different, so I think it's probably not that different from what most people went through.
03:03 Michela Paganini: Yeah absolutely, I think that's something that a lot of people could relate to.
03:07 Michael Kennedy: Nice and IDL, that's the astrophysics type language, is that right?
03:11 Michela Paganini: That's correct, I started as an astronomer in undergraduate and so that was my first encounter I would say with real coding.
03:19 Michael Kennedy: I see so you started out with really tremendously large things, you're like "No, let's have a look at the really small stuff instead."
03:24 Michela Paganini: Yes, it was quite a transition.
03:27 Michael Kennedy: Michael Kagan, how about you?
03:29 Michael Kagan: My story's pretty similar to Michela's. I took a few classes in undergrad, and actually the first language I really used on the job was Fortran, because a lot of the old physics simulation code for doing calculations is built in Fortran, so those were my first real projects. Then once I got to graduate school, everything on the modern high-energy physics experiments was C++, and that's where I learned on the job. Python is used there a bit, so I began learning Python through some of the glue scripting that's done there, and then once I got into more machine learning stuff and realized how many great tools there were, I dove a lot more into using Python.
04:12 Michael Kennedy: That's cool, at CERN a lot of the actual data processing is done in C++ but then the consumption of that is done in Python if I understand it right?
04:22 Michael Kagan: Yeah, that's right. Basically the enormous codebase that crunches on the data is all in C++, and then there's a wide range of Python scripts that glue together the sets of code that crunch on data from one type of experimental apparatus and from another piece of experimental apparatus, and then put that information together.
04:44 Michael Kennedy: Nice Matthew how about you?
04:47 Matthew Feickert: Similarly, the first time I really did any programming was in a CS101 course at university, where I learned a little bit of C and some MATLAB, and nothing was really clicking then. But as an undergraduate, I joined a research group at the University of Illinois, and my advisor at the time was like "Hey, do you want to do this stuff?" And it seemed really exciting. He was telling me about his work at the Large Hadron Collider, so I said sure, and he was like, great, and he sat me down in front of a Linux terminal, and that's when I first got introduced to Bash and C++ and things like that. Then I actually started to see how I could use programming to solve problems in analysis. That's when I had this aha moment and it started to click. But I really didn't start using Python much until I got to grad school, and as Michael said, once I started to get a bit more introduced to how the physics community is using machine learning, then Python and its really great ecosystem of machine learning tools became a really obvious choice for me to start to hone my Python skills. But for me, pretty much all of my programming was just learned on the job.
06:01 Michael Kennedy: That's really cool. I think one of the things that really draws people once they get started with Python is all the extra libraries that you can use. You can just pip install tensorflow, you can just do that, and it's like, wait a minute, it's all right here.
06:16 Michael Kagan: That's been fantastic, especially with this very quickly growing set of libraries and tools. Even in the machine learning domain, there's a new package pretty frequently that you want to try out. pip and Conda and these package managers make this so quick; it's really much faster development time than a lot of the stuff we're doing in C++, where a new package comes out and, given the code base is enormous, it can take a relatively long time just to compile any changes you want to make.
06:45 Michael Kennedy: Yeah, that's for sure. I think of it as one of Python's super powers, it's really cool. Michela, one of the things I wanted to ask you about is it feels like this open-source aspect of Python fits really well with open science, where the research is for public consumption. A lot of the stuff you do here seems like it's much better done with an open-source set of software than, say, MATLAB and proprietary, paid add-ons.
07:15 Michael Kagan: Is that for Michael or Michela?
07:16 Michael Kennedy: Michela.
07:18 Michela Paganini: Yes, I totally agree with you, in the sense that it's a lot easier for us, being a large collaboration, to share tools and collaborate across so many different domains, even within physics, using these libraries that one can very simply, as you both said, install in your environment, as opposed to using something closed source which then needs to be distributed correctly. So I totally agree that it fits very well with the design that at least I have in mind for what our workflows should look like, and I think a lot of the other people on this podcast would agree with me.
07:55 Michael Kennedy: Yeah very cool.
07:57 Michael Kagan: I would just add, and this is something even the experiments take quite seriously: even within any language, even if there is proprietary software that could be quite useful, we are very hesitant about engaging with it because of our inability to look inside the box and really know that it's doing what we want it to do. That's why, even within the experiment, we end up writing a lot of our own code even if there is a proprietary solution, just because it's not really adequate for us to be able to do the research we need to do.
08:24 Michael Kennedy: That makes a lot of sense, is that new? Is that something that 15 years ago, would have people said the same thing?
08:31 Michael Kagan: Yeah absolutely, because a lot of these decisions that I'm thinking about are about packages from 15 years ago, at the beginning of the experiment. There's even a nice neural network package from 10 years ago that was decided not to really be used because it was proprietary at the time.
08:45 Michael Kennedy: Very interesting.
08:46 Michela Paganini: I would say that those decisions were even more important back then than they are right now. I think these days we see more adoption of standard tools, still open-source tools, but a lot of the standard tools from industry, whereas back in the day, maybe decades ago at CERN, we would tend to really customize every single piece of code and write it ourselves.
09:09 Michael Kennedy: I guess you're probably right, that's an interesting point, because today we're swimming in open-source software and these ideas seem more accepted, so it was even more important to have that early on when it wasn't so obvious. Nice, so let's talk about what you guys each do day to day. You all are doing such amazing stuff. You're all involved in ATLAS to some degree. And we'll talk about the Large Hadron Collider a little bit, but maybe you could just touch on what that is. Maybe Matthew, let's start with you.
09:36 Matthew Feickert: Yeah sure. I'm a graduate student at Southern Methodist University in the United States, but I'm stationed over at CERN, and like you said, I work on ATLAS. As a graduate student, you could say one of my main responsibilities is to make plots. What I mean by that is that I'm both learning how to do analyses and I am actually one of the people going in and writing the code. So I work on a specific analysis that's trying to measure specific properties of the Higgs boson, given a specific decay channel that it might have or that it does have. The way I'm doing this right now is by using Jupyter Notebooks, so I can actually go in and use some of the great Python tools like Keras to interact with the data, write some neural networks, and do some exploratory data analysis. In addition to that, I also do some operations work on ATLAS, on something that's called our trigger system. We can talk a bit more about that if there's time.
10:47 Michael Kennedy: The trigger system is really amazing and critical as well.
10:51 Matthew Feickert: Yeah.
10:53 Michael Kennedy: We'll definitely talk about that.
10:54 Matthew Feickert: There I'm just writing a combination of C++ and Python to do optimization studies and provide support for the trigger system.
11:03 Michael Kennedy: Cool, how about you, Michela? What do you do day-to-day, what are you doing on all these experiments?
11:09 Michela Paganini: Of course. These days, I always joke that what I spend most of my time on is preparing talks and posters for conferences and workshops and interviews and all of that, but I guess my average day as a PhD student on ATLAS consists primarily of three things, I would say. First of all, training neural networks, and then while those are training, I contribute to the experiment's code base or my analysis code base, and I read lots of papers on the arXiv. I would say certainly roughly 90% of my time is spent on coding and documenting the code. I'm also involved in various reconstruction and analysis groups, and my goal personally is to bring better algorithmic thinking to the table to help identify bottom quarks, or pairs of Higgs bosons in my case, or any other particle that we might be interested in at the Large Hadron Collider. I can go into more detail if you want about some of the work that I'm doing with neural networks. Specifically, I'm working on speeding up a part of our simulation that is very computationally intensive, and my idea is to use generative adversarial networks to have a higher-accuracy, but at the same time faster, simulator that is powered by deep learning.
12:19 Michael Kennedy: That is really awesome, and I do want to ask you more about that, but one question that came to mind while you were describing that: you say you spend 90% of your time writing code. When you got into physics and thought about what you were going to do as a physicist once you were done with the books, is that what you actually saw yourself doing?
12:38 Michela Paganini: Absolutely not, I was not ready for that at first. It came as a surprise, a pleasant surprise I have to say. It turned out that I love coding a lot more, perhaps, than I like more traditional ideas of physics, but it was a transition, let's say, because not much of our course work prepares us for what the reality of the work of a graduate student in an experiment like ATLAS is really like, and as I said, the majority of it is coding.
13:04 Michael Kennedy: Yeah for sure, I didn't take any graduate classes in physics but I took a number of high-level ones and I don't remember doing hardly any coding for them. It was all pen, paper, prove this, prove that. Lot of equations, not much actual software.
13:19 Michela Paganini: I would certainly advocate that some of the curricula should probably be updated to reflect what the real life of a graduate student in experimental high-energy physics looks like.
13:30 Michael Kennedy: Yeah, that makes a lot of sense. Alright Michael, how about you?
13:32 Michael Kagan: I'm a research scientist at SLAC, without the k, so that's the Stanford Linear Accelerator.
13:38 Michael Kennedy: Not the chat thing, but the really fast thing at Stanford.
13:41 Michael Kagan: Exactly, it's a Linear Accelerator Center. It's one of the DOE, Department of Energy, national laboratories that's run by Stanford. This used to be primarily a high-energy physics lab, and that's been turned into a big x-ray laser, a free electron laser, but anyways, I don't really work on that stuff. I'm part of the team that works on ATLAS, which is this 3000-person collaboration, and each of us works for our own institution. One aspect of that is, with 3000 people trying to improve and run a large piece of equipment and then do all this data analysis, we have to organize. So I find myself now in a phase of my career where I'm more and more part of that organization. I'm helping to run one of the groups which looks at a certain kind of particle we might find in our detectors, called a bottom quark. We develop algorithms to find those particles in the detector, and once we find them, we can give those algorithms out, supply them to any other analyzer on the experiment who wants to look at some data and find out, in a given collision, how many bottom quarks there were in that collision. I spend a lot of my time running that group, which is maybe 50 or 60 people organized together, working to get that moving forward, and then with the free time that I have left, working with postdocs and grad students on data analysis. I've worked with both Michela and Matthew on various data analysis projects, and also on exploring how we might take some new ideas in machine learning, or even develop some when needed, to solve some of our specific tasks.
15:31 Michael Kennedy: That sounds really interesting and you're using some machine learning and those types of algorithms to create these techniques for discovering these bottom quarks.
15:41 Michael Kagan: Yeah, absolutely. Machine learning has been around for a while. We've had algorithms that work in what's still a very common and powerful paradigm: we look at our data and, based on our domain knowledge, we develop all sorts of features, and we can then train machine learning algorithms to, for instance, classify whether this set of data really was a bottom quark or not. So that's been around for a while. One of the things we're also working on is saying, okay, if we take a step back and look at this data and see if we can think about it in different ways, sometimes that maps onto problems like problems in vision or even natural language processing, where we can then start to use some of the super-modern techniques coming from things like deep learning. That way we can improve our classification; in some cases we're doing regression or even generative-type problems, so there's a--
16:49 Michael Kennedy: You are working with a huge camera basically right?
16:51 Michael Kagan: Exactly.
16:52 Michael Kennedy: With ATLAS.
16:55 Michael Kagan: I keep thinking of it in meters, but it's a 75-foot-tall, 120-foot-long detector that sits 100 meters underground, and it's built of many different kinds of cameras. These different kinds of cameras detect different kinds of particles, and we take all that information together to build a picture of what happened every time we collide.
17:17 Michael Kennedy: Maybe that's a great place to segue into just a really quick summary of this particle physics stuff. Matthew, maybe we'll let you start this off. I was told in fifth grade that atoms are the smallest stuff that everything is made of. But not so much, right? Tell us about it.
17:40 Matthew Feickert: I think everyone has this idea going through schooling that you have the periodic table, where you have your atoms, they are made of protons and neutrons and electrons, and that's it. It turns out that's not really the whole story. Electrons, as it turns out, do seem to be fundamental particles, but protons and neutrons are actually composite particles that are made of even more fundamental particles that we call quarks, and there's also gluons in there, which are other sub-atomic particles. So it turns out there's this whole particle zoo, if you will, but the amazing thing is that when we go and explore the world and actually look, it turns out there are 12 matter-type particles: there are six of these quark particles that make up the protons and neutrons, and we call things like protons and neutrons, more generally, hadrons. Then there are things like the electron that we call leptons, and there are both electrically charged leptons, like the electron, and also their ghostly neutral cousins that hardly interact with matter at all. So I think maybe Michela and Michael can talk about the fundamental forces and other things.
19:00 Michael Kennedy: Yeah sure Michela take it away, tell us about it.
19:02 Michela Paganini: Yeah, of course. On top of all of these matter particles that Matthew just described, we also have what we call the force carriers, or gauge bosons, and these are particles that can be thought of as being exchanged among other fundamental particles to mediate some of the forces and attractions that connect them. So we can think of the photon, for example, as being the force carrier for electromagnetism. And then, to complete the puzzle, we have the most recently discovered particle of the Standard Model of particle physics, which is the Higgs boson. By the way, the Standard Model of particle physics is just this great theory that we've come up with over the years that puts all of our fundamental particles into this periodic table that Matthew and I have just tried to describe to you, so there is a little bit of an analogy, maybe, to chemistry, but at an even more fundamental level than the atom itself.
19:59 Michael Kennedy: And it's really amazing that this was created somewhat theoretically, and then the machine to go find things like the Higgs boson was built, and then it really was there.
20:08 Michela Paganini: That's true. It's very fascinating. I think at a certain point in history, experiment was ahead of theory, and then theory surpassed experiment once again. It's a very fascinating field to be in, and at different historical moments, things were very different from what they look like today.
20:27 Michael Kennedy: How much do you think that comes from people just getting better at theory versus computational techniques assisting theory?
20:36 Michela Paganini: I think it's probably partly what you said, but also the energy regimes that we're trying to probe now. I think most particle physicists would agree that, in terms of the energies we are able to probe right now, we have a good understanding of all of the particles that could exist there, but what we're really searching for right now is something at even higher energy, and that's the main issue. At that point, the complication is not so much whether the theory is there or not, but being able to produce the machines, the hardware technology, to even go search for any of these particles. I think certainly software will help us get the most out of the hardware that we currently have and the next hardware that we will build in future generations, but it's both hardware and software in my opinion.
21:28 Michael Kennedy: Okay, really cool. Maybe the last one on this physics intro stuff. Michael, what was the big deal about finding the Higgs boson? What did people learn from it?
21:38 Michael Kagan: Absolutely, the Higgs boson is probably the hardest of the particles to describe in some ways. The job of the Higgs boson is many things, but probably the easiest way to explain it is that it gives mass to all the other particles. The way you can think about that is, as particles move around, they bump into Higgs bosons, and the Higgs bosons slow them down and effectively give them mass. If you're trying to walk through water, it takes a lot more force, a lot more effort, to move your body, and that's the same thing that's happening with the particles. This is an imperfect analogy, so if any theoretical physicists are listening, I apologize. That's roughly the idea, but it turns out the Higgs boson really played a fundamental role in making this theory that Michela and Matthew explained, which is incredibly predictive, maybe the most predictive theory ever, make sense. Without the Higgs boson, the theory effectively predicts things that have probabilities larger than one, which we knew didn't make any sense, and that's a little bit of what you were saying: that knowledge of the theory breaking down really helped drive what became a 40 or 50 year long experimental search.
22:50 Michael Kennedy: It's amazing how people get excited when they're wrong. The theory might be wrong, it'll be so exciting.
22:58 Michael Kagan: Absolutely. Every time something doesn't make sense, the theoretical physicists get incredibly excited that they're going to have to come up with a completely new theory.
23:07 Michael Kennedy: I don't think we're done with that at all. If people are out there listening and they really want to get a sense for what's going on at the LHC, in particle physics, I definitely want to recommend the documentary Particle Fever, which I think is available on Netflix, but I'll link to at least the trailer. And there's a book called Present at the Creation: Discovering the Higgs Boson, which is great, and then there's another one, We Have No Idea: A Guide to the Unknown Universe. Who recommended that one?
23:33 Matthew Feickert: I recommended that. It's actually co-written by our ATLAS colleague Daniel Whiteson and the famous PhD Comics cartoonist Jorge Cham. I really like that book because I think it's both a celebration of how much we still don't know about the universe and of how now is really a great time to get into science, because we're truly in an age of discovery. But it also talks about just how much we do know as well. It's just a great celebration of science, and given that it's co-written by a physicist, it really does convey the ideas well.
24:09 Michael Kennedy: Excellent, people can check all three of those out. They're really good background information. This portion of Talk Python to Me is brought to you by Linode. Are you looking for bulletproof hosting that's fast, simple and incredibly affordable? Look past that bookstore and check out Linode at talkpython.fm/linode, L-I-N-O-D-E. Plans start at just $5 a month for a dedicated server with a gig of RAM. They have 10 data centers across the globe, so no matter where you are, there's a data center near you. Whether you want to run your Python web app, host a private Git server or a file server, you get native SSDs on all machines, a newly upgraded 200 gigabit network and 24-7 friendly support, even on holidays, and a 7-day money back guarantee. Want a dedicated server for free for the next four months? Use the coupon code Python17 at talkpython.fm/linode. Let's catch up on the LHC just a little bit. Michela, could you give people a sense of the scale? Michael said there are 3000 people working on ATLAS, and ATLAS is just one of the experiments. Give us a sense of what this place is like.
25:15 Michela Paganini: Yeah absolutely, I really love CERN. It's a group of fantastic people, some of the brightest minds in the world, and it brings together researchers and engineers and computer scientists from all around the world, probably hundreds of different nationalities. One of my favorite things about it is that I can absolutely say that every single time I'm there, I'm never hanging out with more than one or two people of the same nationality, so to me, that's the best thing about it. There are probably tens of thousands of people across the various different experiments. We've been talking a lot about the LHC, the Large Hadron Collider, which hosts four experiments, so not only ATLAS but also LHCb, ALICE and CMS, but then the Large Hadron Collider is only one small part of the entirety of CERN. There are a lot of other experiments going on, for example some anti-matter experiments at the Antiproton Decelerator, as well as even astronomy experiments as far as I know. It's a large laboratory that spans two different countries, at the border between France and Switzerland, so a fantastic place to work at.
26:24 Michael Kennedy: It sounds really really cool and people can go tour it? They can set up a tour?
26:28 Michela Paganini: Absolutely, anybody can just show up and do the quick tour of, for example, Point 1, which is where ATLAS is located, so you can visit our control room and learn more about our experiment. And if you are lucky enough to visit during a shutdown period, often in the winter when we have short shutdowns, you might even be able to go underground and visit the actual experiment, which is absolutely breath-taking.
26:55 Michael Kennedy: Yeah, I'm sure that it is. Just the scale, from the pictures, it looks like it's just incredible.
27:00 Matthew Feickert: I just want to jump in and make an additional comment about what Michela said, in the sense that CERN really is an open laboratory, and it's a big part of CERN's mission that the scientific discoveries made there are made for all of humanity. So CERN really welcomes the public getting involved and being curious and coming and asking questions. If you're ever traveling through nearby Geneva, Switzerland, sign up for a tour, come visit.
27:28 Michael Kennedy: Yeah, that sounds great. Michael, the data that flows through these experiments, out of the detectors and into the trigger that Matthew mentioned, and then on to the larger computing structures that are there, it's pretty insane. Do you want to give us an overview of the scale of the data?
27:46 Michael Kagan: Yeah absolutely. I'll give you a little bit of the physics background about why we have to design the systems this way. A lot of the things we're searching for are very rare, and the physics that we deal with is probabilistic, so we might be looking for something interesting that only happens in one out of a trillion collisions, or maybe even less frequently. So we have to collide protons as many times as possible: we collide protons 40 million times a second, and those collisions fly out into the massive detectors we've been discussing. The thing is, we can't record all that data; we can only record a fraction of it because it would simply be too much. So we have a set of systems called the trigger, which allows us to go from 40 million collisions, of which many are not super-interesting, down to about a kilohertz, so about a thousand a second. And for each of those collisions, the data that comes out of the detector is about a megabyte per collision. So that means we're taking about a gigabyte of data per second.
28:49 Michael Kennedy: That's the data that made it through the trigger that was not discarded by the hardware?
28:54 Michael Kagan: Exactly, that's just the data that made it through, and the processing there is a combination of custom-built hardware and FPGAs, which are fast enough to look at the data really quickly, at 40 million times a second, and help us pipeline it down through various hardware and software systems to this kilohertz rate. We don't run the detector all the time; we run it a lot, but there are shutdown periods and times when you have to refill the beam, and I think we accumulate something like three or four petabytes of data a year.
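As a quick sanity check on the numbers Michael quotes, here's the arithmetic in Python; all of the inputs are the approximate figures from the conversation, nothing more precise.

```python
# Back-of-the-envelope check of the trigger and data-rate numbers above.
collision_rate = 40e6   # collisions per second delivered by the LHC
trigger_rate = 1e3      # collisions per second kept by the trigger
event_size = 1e6        # roughly one megabyte recorded per collision

print(f"Trigger keeps 1 in {collision_rate / trigger_rate:,.0f} collisions")
print(f"Recording rate: {trigger_rate * event_size / 1e9:.1f} GB/s")

# At about a gigabyte per second, three to four petabytes per year
# implies a few million seconds of actual beam time, i.e. a few dozen days.
seconds_of_beam = 3.5e15 / (trigger_rate * event_size)
print(f"Implied running time: about {seconds_of_beam / 86400:.0f} days per year")
```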
29:29 Michael Kennedy: Yeah that's just crazy and then it's not just there in Geneva, it's also broadcast out right?
29:37 Michael Kagan: I think there are something like 170 institutions around the world that make up the Worldwide LHC Computing Grid, and there are something like 300,000 computing cores, or maybe more at this point. We distribute the data around the world, and then we often need to process and re-process and analyze that data, and that's done on this enormous computing grid. Once it's stored and distributed, the way we go and analyze it is basically by sending jobs to this grid, which you can think of as a precursor to what the cloud is now.
30:10 Michael Kennedy: There's so much data probably that I suspect put it on your laptop doesn't make a lot of sense right? You need to send your computation to the data rather than bring the data to you.
30:19 Michela Paganini: Exactly, that's exactly how it works. You certainly wouldn't want to be downloading petabytes of data onto your computer, so thankfully we have this grid of Tier 0, Tier 1, Tier 2 and Tier 3 locations spread all around the world, where you can send your scripts and they'll be run, and then you'll get the results back.
30:42 Michael Kennedy: Give me a sense of what that's like. You have a question you want to ask about the data; you could write some C++, you could write some Python, maybe even some Fortran. What is the mechanism for, I have this Python here and I would like to make it run there and analyze the data? What do the steps look like?
31:00 Michela Paganini: From the user's perspective, which is the one that I see, it's very simple because of the work of hundreds of people who have made it simple for us. We have an interface to our computing grid where we can specify particular locations if we want to, the length of the job, the number of cores we're requiring, the number of nodes, etc., and then we can simply submit our script, as long as it's in a format that's compliant with what our systems are able to handle. Then you specify what data to operate on, whether it's real data collected from the LHC or simulated data; you can operate on that too. And then you're able to monitor all of your jobs and eventually get the results back and download the histograms, or whatever format your results come to you in.
31:52 Michael Kennedy: It sounds really cool, and Michael, you said 300,000 computing cores?
31:57 Michael Kagan: I think it might be even larger at this point. As of 2010, it was more than 200,000, and it's ever growing with the amount of data we have and the amount of computing we need. Just as Michela was saying, we send our jobs out, and it's built on top of a virtual machine file system. All these sites work coherently, with the same distribution of the ATLAS software located at all of them, so you can send your job with a known version of the software and it's already available locally to you.
32:29 Michael Kennedy: Wow, that sounds really, really fun. One thing I was wondering, looking at the LHC: we have ATLAS, we have LHCb and ALICE and CMS. What is the purpose of ATLAS? What is ATLAS trying to do relative to the larger goal of the LHC? Maybe Matthew, take that.
32:48 Matthew Feickert: So ATLAS and CMS, these are two of our general purpose detectors, and the idea there is that these detectors were explicitly designed and then built to be sensitive to a wide range of interesting physics. Backtracking a bit, ATLAS and CMS are sometimes referred to as cylindrical onions, in the sense that their architecture is: you have your beam pipe, and then you have successive layers of very detailed detectors going out around it.
33:21 Michael Kennedy: Is that so you can basically take a 3D picture?
33:23 Matthew Feickert: Yeah exactly, because when we have these really hard collisions, in some sense the result of the collision is that you have sprays of particles coming out in all directions, so you want to have as much coverage as possible. The idea is that if you have both calorimetry systems and tracking systems, then you're able to get a much more detailed picture of what actually happened, because we're trying to reconstruct essentially point-like interactions that are happening at the subatomic level, but we're doing that by seeing what comes splattering through the detector. It's like trying to reconstruct a car crash when the only way you can investigate what happened is to look at the walls of a tunnel or something nearby. So ATLAS and CMS are general purpose detectors, and then ALICE and LHCb have somewhat different geometries; they're more specialized detectors that look at specific types of physics.
34:30 Michael Kennedy: Very cool. So one of the things I wanted to dig into with each one of you is what you're doing day-to-day and how Python and machine learning fit into that. Michela, let's start with you. You already mentioned your generative adversarial networks and some really amazing stuff. You said that you were able to speed up some of these simulations 100,000 times using a pretty cool technique.
34:52 Michela Paganini: That's correct.
34:54 Michael Kennedy: That's incredible, can you talk about how you're doing that?
34:56 Michela Paganini: Yes. These are obviously preliminary results, and there's a lot more R&D that is just now being started within our collaboration, but the point is that we built this great prototype. We called it CaloGAN; "calo" stands for calorimeter, which is one of the detector layers inside this big onion-like structure that we just described. The calorimeter measures the energy deposited by certain particles as they travel through, as Matthew was just describing. The issue is that because some of the physical processes the particles undergo are so complicated, simulating these traversals of the particles through the calorimeter is really computationally intensive, and that's actually taking more than half of the computing grid power that we were just describing, so billions of CPU hours per year. What I'm working on is this new technique to speed up that part of the simulation, which currently occupies the majority of our computing resources worldwide, and what I'm using is generative adversarial networks. I think some of your audience will maybe recognize these words. We use GANs to provide a function approximator to our simulator, while retaining, hopefully, the majority of the accuracy that the slower, physics-driven simulator possesses. As I said, multiple preliminary results have been put out so far, and we are achieving speedups of over 100,000 times, but now the complicated part will really be to learn how to calibrate all of this machinery and port it into the real simulation within the experiment. But speaking of Python, I think the cool thing for everybody to know is that this is very easily built using Keras and TensorFlow, so very standard machine learning tools from Python, and other standard tools from the Python ecosystem such as NumPy, scikit-learn, h5py and matplotlib all make it into my projects.
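To give a flavor of the adversarial setup Michela is describing, here's a minimal GAN sketch in Keras. This is not the actual CaloGAN architecture; the toy image shape, layer sizes and training loop are all illustrative stand-ins.

```python
# A minimal GAN in Keras, in the spirit of (but much simpler than) CaloGAN.
# Shapes, layer sizes and hyperparameters here are illustrative only.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 64
IMAGE_SHAPE = (12, 12)  # a toy "calorimeter image" of energy deposits

# Generator: maps random noise to a fake shower image.
generator = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(LATENT_DIM,)),
    layers.Dense(IMAGE_SHAPE[0] * IMAGE_SHAPE[1], activation="relu"),
    layers.Reshape(IMAGE_SHAPE),
])

# Discriminator: separates simulated (real) images from generated (fake) ones.
discriminator = keras.Sequential([
    layers.Flatten(input_shape=IMAGE_SHAPE),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model: the generator is trained to fool the frozen discriminator.
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_images, batch_size=32):
    noise = np.random.normal(size=(batch_size, LATENT_DIM))
    fakes = generator.predict(noise, verbose=0)
    # The discriminator learns to tell real showers from generated ones...
    discriminator.train_on_batch(real_images, np.ones((batch_size, 1)))
    discriminator.train_on_batch(fakes, np.zeros((batch_size, 1)))
    # ...while the generator learns to produce fakes that score as real.
    gan.train_on_batch(noise, np.ones((batch_size, 1)))
```

Once a generator like this is trained against simulated showers, producing a new shower is just a single forward pass, which is where a speedup like the one Michela quotes can come from.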
36:59 Michael Kennedy: That's really cool. I think people are probably familiar with most of those but h5py, what is that?
37:03 Michela Paganini: It's just the Python interface for HDF5, HDF5 being a very standard data format that can be ported across various different languages, C++, Python, and h5py certainly saved my life in terms of being able to open these files.
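For anyone who hasn't seen it, here's roughly what h5py looks like in practice: write a NumPy array to an HDF5 file, then read it back. The file and dataset names here are made up for the example.

```python
import h5py
import numpy as np

# Toy stand-in for calorimeter data: 1000 events of 12x12 energy deposits.
energies = np.random.exponential(scale=5.0, size=(1000, 12, 12))

# Write the array to an HDF5 file under a named dataset.
with h5py.File("showers.h5", "w") as f:
    f.create_dataset("calorimeter/layer0", data=energies, compression="gzip")

# Read it back; slicing a dataset returns a plain NumPy array.
with h5py.File("showers.h5", "r") as f:
    loaded = f["calorimeter/layer0"][:]

assert loaded.shape == (1000, 12, 12)
```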
37:21 Michael Kennedy: Of course, that's really cool. Michael, do you want to talk a bit about how you're using machine learning for what you're up to?
37:27 Michael Kagan: Yeah absolutely. In the past, I've been working a lot on these ideas of taking detector measurements and classifying whether the data was from a given particle or not, so I've been working on connecting the data that we have with ideas from machine learning. We were talking about quarks: when you produce quarks, it turns out they produce collimated streams of particles that smash into these calorimeters and leave a bunch of energy distributed in space. It turns out we can connect those distributions of energy in space with basically imaging-type approaches, and then we've been running a lot of computer vision techniques to study those jets, those quarks. That's really jumped into connecting with things like convolutional neural networks and modern computer vision. I've been working on that, just like Michela, with tools like Keras and TensorFlow, built on top of core packages like SciPy and NumPy.
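A minimal sketch of that jet-images idea in Keras might look like the following: treat a jet's calorimeter energy deposits as a small 2D image and classify it with a convolutional network. The 25x25 input size and the architecture are illustrative, not the published models.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small CNN over "jet images": 2D grids of calorimeter energy deposits.
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(25, 25, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    # Single output: probability the jet came from the particle of interest.
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(jet_images, labels, ...) would then train on labeled simulated jets.
```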
38:27 Michael Kennedy: One of the things I was going to ask you is, it seems like the things you are actually looking for are quite rare. There were only a few collisions that produced the Higgs boson, for example, back in 2013-14. I haven't done a lot of machine learning, but my sense is that you have to give a lot of examples to machine learning, and then it can find more of those, even if they're subsequently rare. But how would you bootstrap this? How do you get it started when there aren't enough examples? How do you teach the algorithms, is what I'm asking, so that they can then go and find these things, especially when those occurrences are rare?
39:01 Michael Kagan: In some sense, it is a bootstrap, but it's based on this idea that those super-rare particles like Higgs bosons, we don't observe them directly. They decay into other things that we know about, like electrons.
39:11 Michael Kennedy: I see, the decay result is easy to find, and you can teach it to find those, and when you find them in certain configurations, you're like, this may have originated from what we're looking for.
39:22 Michael Kagan: Exactly, so I work a lot on making sure we can find, not electrons, but other types of particles that are produced copiously. Then we can say, in this case: I know how to find electrons. I can train an algorithm using our simulation, which is very precise. I can go look for those in our real data to make sure that the simulation makes sense in a different configuration, and then I can go hunting with my well-calibrated and well-tuned algorithm for finding electrons. I can go hunt for configurations with four or five electrons, and that might be a really rare thing you want to look for.
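In code, that train-on-simulation workflow might look roughly like this scikit-learn sketch; the random features, labels and classifier choice are stand-ins for the engineered features and truth labels that come out of the experiment's simulation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# In simulation the truth is known: label 1 for detector signatures that
# really came from the particle of interest, label 0 for background.
rng = np.random.default_rng(0)
X_sim = rng.random((10_000, 5))       # stand-in for engineered features
y_sim = rng.integers(0, 2, 10_000)    # stand-in for truth labels

X_train, X_test, y_train, y_test = train_test_split(X_sim, y_sim, test_size=0.3)
clf = GradientBoostingClassifier().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# After checking simulation against real data in a control region, the
# tuned classifier scores real collision data:
# scores = clf.predict_proba(X_data)[:, 1]
```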
39:55 Michael Kennedy: Very interesting, I see how that works, because I was wondering how you get started trying to find very rare things, but I can see it now. I'll throw this question out to everyone. Are any of you using special hardware, or are you just leveraging the many, many cores out there in the computing environment? Are you using the tensor processing unit-type things or GPUs, or those sorts of things?
40:19 Michela Paganini: I think the majority of us are using GPUs to train some of our machine learning models these days, on top of, obviously, the worldwide grid that we just described, but that's more reserved for simulating samples or doing more standard types of analysis. I think most of us rely on GPUs these days.
40:40 Michael Kagan: I would add that one of the interesting potential future directions, on top of all the work with GPUs and CPUs, is that if we ever want some of these algorithms to work in our trigger, in that super-fast system that needs to operate at 40 megahertz, there are already some people in the community beginning to look at putting neural networks onto FPGAs so we can actually run them at super-high speeds. So that may be a future direction this field will move in.
41:04 Michael Kennedy: That would be really amazing. Maybe, Matthew, it's a good time to talk about that trigger thing. You have to take 40 million observations across this thing and get it down to a thousand, and you've got to do that extremely fast, every second. How do you do that?
41:23 Matthew Feickert: Michael's already given a very nice intro summary there, but the trigger is an immensely complex system. I definitely don't understand how all of it works; I actually just work on a sub-system of the trigger for ATLAS. If you want to actually do an analysis, let's say you're looking for a Higgs decay that goes to two b quarks, if you want to go searching for that, then you want to have some confidence that there ever could have been a recording of an event in the ATLAS detector that had two b quarks in it. You want to make sure that some of these interesting collisions aren't getting thrown out, and that's one of the reasons we have the trigger systems. We have what we call a trigger menu, which is a list of basically logical sequences that we are looking for in the different sub-systems of the detector, to say that at the lower, hardware level this event looks like it might have had two sprays of particles that came from b quarks. I work on a sub-system of this called the b-jet trigger. There we already have some really good logical chains in place, but we want to make sure that as we go up to higher energies and even more collisions, so basically as what we call luminosity, the number of collisions we have per crossing, increases, our trigger system can still deal with this. In a single crossing of the beam, we don't just get one collision point; right now we might get somewhere between 30 and 50, but as we go to higher energies and higher luminosities, we are looking at getting something like 200 collisions happening every beam crossing. So if you're trying to say, oh, I have an electron, or I have a b-jet that looks interesting over here, I wonder if it might have a partner somewhere, you're going to have a really difficult problem, because you're now trying to pick out which of these other energy deposits or tracks might be from the 200 other collisions happening at the same time.
43:43 Michael Kennedy: Yeah, it's one of these combinatorial-type things. How many different relationships are there between 30 things, or 200, right? That turns out to be astronomically bigger.
43:52 Matthew Feickert: Yeah it gets pretty crazy.
43:55 Michael Kennedy: I'm sure it does, I'm sure it does. So that's mostly embedded C++, is that right? What other stuff are you doing there? Is there any Python at that layer?
44:06 Matthew Feickert: In the trigger system itself, no, but as a student, what I'm doing a lot more of is performance studies of how our trigger is doing, and there I do use a hybrid of Python and C++. We have colleagues at the University of Chicago who have written a really nice analysis framework that a lot of people in the collaboration use. It's all implemented in C++, but there is also a way for us to interact with it from Python, and that's really nice because it allows me, for example, to write some command line interface tools using argparse and things like that. So if I want to do a quick performance study, I can, without ever really having to write a line of C++, spin up a small analysis to see how the trigger is performing under these scenarios and things like that.
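A toy version of the kind of command line tool Matthew describes might look like this; the flag names and the run_study function are hypothetical, standing in for calls into the C++ framework's Python bindings.

```python
import argparse

def run_study(chain, input_file, max_events):
    # In the real tool this would call into the C++ analysis framework
    # through its Python bindings; here it just reports what it would do.
    print(f"Studying trigger chain {chain} on {input_file} ({max_events} events)")

def main():
    parser = argparse.ArgumentParser(description="Quick trigger performance study")
    parser.add_argument("--chain", required=True, help="trigger chain to study")
    parser.add_argument("--input", required=True, help="input data file")
    parser.add_argument("--max-events", type=int, default=10000,
                        help="number of events to process")
    args = parser.parse_args()
    run_study(args.chain, args.input, args.max_events)

if __name__ == "__main__":
    main()
```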
45:03 Michael Kennedy: And you're using things like Jupyter Notebooks and stuff to do that exploration?
45:07 Matthew Feickert: For the trigger, I don't use Jupyter Notebooks so much. I use Jupyter more for the exploratory data analysis that I'm doing for my actual PhD research, but for the operations work that I do on ATLAS, that's more just going into your favorite editor and writing some command line tools in Python, and then also making sure those interface well with these C++ frameworks that we developed.
45:34 Michael Kennedy: Okay, very interesting. So Michela, maybe we could talk a little bit about where you guys are going in the future. The LHC has been a really long-running project. It started producing major results in 2013-14, but it was started way before then. Where is the future going now, and what are you up to?
45:54 Michela Paganini: Of course. There are still so many open questions around the Standard Model. As Michael said, the Standard Model has been validated over and over again, and it's probably one of the most predictive theories in the history of all theories, but there are some missing pieces. For example, we don't have a complete quantum theory of gravity, so there is this hypothetical particle called the graviton that some are searching for. I think there are very many open questions that could potentially be answered through some of the theories that have already been proposed, whether it's super-symmetry or others, and so I think right now the goal is to continue looking for those while we also continue making very high precision measurements of all of the particles that we know and love. Some ideas about what else could be out there don't necessarily come from directly searching for new particles; they could come from measuring properties of particles that are already part of the Standard Model and trying to see if there are any, even tiny, deviations from the Standard Model prediction, to see if there could potentially be some other mechanism out there that causes a small deviation in a property that we knew, or thought, had a specific value. So I think in terms of the physics, that's where we're going now.
47:23 Michael Kennedy: It could look entirely right but there could be these subtle, subtle deviations. Just look at Newton and gravity and then Einstein and gravity. Newton looked right.
47:34 Michela Paganini: Yes, of course, to first degree it was, but obviously there are then, in certain regimes, some deviations from the more simplistic theory. So we think that perhaps there could be other corrections to what we're measuring that could come from more complicated theories that we have yet to validate.
47:53 Michael Kennedy: Michael, one thing I wanted to ask you is, how do you feel that machine learning, and doing it with Python and TensorFlow and all these things, has changed physics and physics exploration? How much would it have just been more work, or is stuff actually being discovered that wouldn't have been discovered? What do you think?
48:12 Michael Kagan: Yeah, I think a lot of algorithms may not have been discovered, or applied, or implemented on a reasonable time scale. Since we're not professional machine learning engineers or machine learning researchers, a lot of the time the tools we're using we either build from scratch or use what's available, so the availability of things like TensorFlow makes it really easy for us to implement and just try things out on our data, and then try to make these connections with the machine learning world. Without things like these generative models, we might not even have access to these kinds of potentially fast speed-ups or new ways to look at the data with vision or natural language processing approaches; these algorithms might not even be implemented anywhere. Nowadays, you can just download the model if you want and use that as a place to start. I think one of the things that's been really helpful is that the core C++ libraries we use at CERN are built with internal dictionaries, the idea of reflection, so that they can really easily be bound to Python, and we can take our data that's in one format, really quickly switch into Python, and just start pounding away, running through all these different kinds of algorithms and seeing what comes out.
49:30 Michael Kennedy: It's really cool, so it's a little bit like what NumPy did for numerical analysis? You have something similar where it's got that native layer really close to Python?
49:40 Michael Kagan: Exactly, and in fact, we can also interface directly with NumPy and pull our data right into formats that are standardized in the machine learning and data science community.
49:52 Michael Kennedy: Michela, maybe I could ask you the same question, because your work has made computation so much faster, and it seems like machine learning played an important role there as well. If you can really make these simulations 100,000 times faster, that's almost like going from regular computers to quantum computers. That's such a jump.
50:11 Michela Paganini: Yes, of course. Obviously we would like to empower more physicists to do the analyses that they want to do at the precision levels that they require, and right now that simulation bottleneck means that some analyses cannot be performed with the statistical uncertainties that we would like to have, or cannot be performed at all at the accuracy that they need. So hopefully these types of projects will enable more physics to be done, and more analyses, in the future of the LHC.
50:46 Michael Kennedy: Very cool. Matthew, do you think there's room for normal developers, people who just want to go contribute, it's open source, maybe some kind of open science? They want to play around and they're really good at programming, but they're not physicists. Is there a place for them to jump in and be a part of this?
51:02 Matthew Feickert: Yeah, I think so. As Michael mentioned earlier, it's really important that our code is open source, so if people actually want to go take a look at it, it's out there. I think there are some efforts to get something like this started on ATLAS, but at LHCb, one of the other experiments, they have what's called the LHCb Starter Kit. That was started by two PhD students, Kevin Dungs and Tim Head, who have since left the field to go work at Google and be a data science consultant, but they thought, hey, a lot of our students are coming in, and we're all physicists, but we're not all necessarily software experts. So they created this thing called the LHCb Starter Kit, and they've done some partnering with softwarecarpentry.org to actually hold training seminars where people from Software Carpentry come in and help them learn how to get started doing data analysis and programming in physics. I think Software Carpentry and Data Carpentry are really great organizations, so if people want to get involved there, they definitely can as well.
52:12 Michael Kennedy: That's really cool, and I had the Software Carpentry folks on the show a couple of months back.
52:16 Matthew Feickert: Yeah, that was a good episode.
52:17 Michael Kennedy: Thanks, it was really nice. I think it's great. I think more and more of these tools are becoming accessible to everyone. I suspect uploading to the whole computing system and running stuff there is probably restricted to you guys, but still, people who work on these algorithms are, in a sense, contributing. People who make TensorFlow and Keras better are, in a sense, making them better for you guys as well, right?
52:41 Matthew Feickert: Yes.
52:41 Michela Paganini: Absolutely.
52:43 Matthew Feickert: There's a big push right now, and I think Michela is a great example of this, to go out into the communities that are actually building these great tools, and to interact with them directly and contribute. We want to use really powerful tools to do the best analysis we can, so the people who are making the Python tools we use better are really directly impacting science, and we appreciate it.
53:11 Michael Kennedy: Yeah, that's amazing. I have so many more questions to ask you all, but we're pretty much out of time so I want to be respectful of that as well. Before we go, I'm going to ask you all the two questions. I'll start with Michael. Favorite editor?
53:25 Michael Kagan: Emacs.
53:26 Michael Kennedy: Michela?
53:26 Michela Paganini: Sublime.
53:27 Michael Kennedy: Matthew?
53:27 Matthew Feickert: Atom with Vim key bindings.
53:30 Michael Kennedy: Alright, right on. And Michael, notable PyPI package?
53:33 Michael Kagan: PyTorch and scikit-optimize.
53:36 Michael Kennedy: I don't think I've talked about PyTorch on the show before, maybe just really briefly talk about it, what that is?
53:41 Michael Kagan: Yeah, PyTorch is a deep learning library that's built on a different way of constructing the graph representation of your deep neural network for all the downstream computations, but actually I find the API very smooth and easy for really quickly spinning up a neural network and having it running.
53:59 Michael Kennedy: Excellent. Michela, how about you?
54:01 Michela Paganini: I'm a huge fan of Keras. I've been using it since it first started as a project, and it's enabled me to go from idea to experimentation very quickly. And then a huge shoutout to matplotlib; it's allowed me to make all of the plots that I've ever made during my PhD. So those two for sure.
54:20 Michael Kennedy: Excellent, Matthew?
54:21 Matthew Feickert: Definitely Keras and like Michela said, that's a bread and butter thing for us. And then also scikit-hep which is--
54:29 Michael Kennedy: HEP is high energy physics?
54:30 Matthew Feickert: Yeah, so similar to scikit-learn, how that was meant to be a toolbox for machine learning, scikit-hep is a collection of Python tools that have been developed inside the high-energy physics community, meant to help us interface with things like NumPy and Pandas DataFrames and make our lives a lot easier when we're actually trying to go from our ROOT data file format to things like NumPy--
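As one concrete example from that ecosystem, here's roughly what reading a ROOT file looks like with uproot, one of the scikit-hep packages; the file and branch names here are hypothetical.

```python
import uproot

# Open a ROOT file and grab a TTree; no ROOT installation is required.
tree = uproot.open("collisions.root")["events"]

# Read selected branches straight into NumPy arrays.
arrays = tree.arrays(["jet_pt", "jet_eta"], library="np")
print(arrays["jet_pt"][:5])
```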
54:55 Michael Kennedy: Yeah, very cool. I'll give you guys a chance for a final call to action. People are excited about what they heard, they want to get involved. Michael go first?
55:03 Michael Kagan: Sure I think there's a lot of ways in which you can get involved, especially through the CERN Open Data Portal and CERN Open Science. You can download some of our data, you can play around with it and also just help spread knowledge about science and scientific reasoning and how that can benefit society, both from advancing science and advancing the way we think.
55:25 Michael Kennedy: Excellent, Michela?
55:26 Michela Paganini: If you're interested in working at the intersection of science and machine learning or data science, CERN, I think, is quickly becoming one of the best places in the world to do that, because the scale and the fascinating problems that we're working on are completely unparalleled, and we're always looking for new skilled software engineers who are curious about the mysteries of the universe. As I said before, the community at CERN is truly fantastic, so we can always use more Python experts, and you absolutely don't have to be a physicist to make a huge impact at CERN.
55:57 Michael Kennedy: Excellent, it sounds really fun. I'd love to work there. Matthew, you've got the final word.
56:00 Matthew Feickert: Yes as physicists we love to collaborate and we like to think at least that we have really cool and really hard problems so we're always looking for--
56:09 Michael Kennedy: And you consider those to be the same thing, right? Cool equals hard.
56:12 Matthew Feickert: Yeah, probably. We have started in the last couple of years to have some really fruitful collaborations with our CS colleagues so if you know a friendly neighborhood particle physicist and you want to talk with them, please do. We're happy to talk and we really want to try and have our field grow with other fields and try and do the best science we can.
56:38 Michael Kennedy: Thank you all for being on the show. It's been really interesting and I've learned a lot. Thanks for sharing what you're doing and keep up the good work.
56:45 Michela Paganini: Thank you.
56:46 Matthew Feickert: Thanks for having us on, Michael.
56:48 Michael Kagan: Thanks and thanks for hosting this great show.
56:50 Michael Kennedy: You're welcome. Bye everyone.
56:51 Michael Kagan: Bye.
56:53 Michael Kennedy: This has been another episode of Talk Python to Me. This week's guests have been Michela Paganini, Michael Kagan and Matthew Feickert, and this episode's been brought to you by Linode and Talk Python Training. Linode is bullet-proof hosting for whatever you're building with Python. Get your four months free at talkpython.fm/linode, just use the code Python17. Are you or a colleague trying to learn Python? Have you tried books and videos that just left you bored by covering topics point by point? Well, check out my online course Python Jumpstart by Building 10 Apps at talkpython.fm/course to experience a more engaging way to learn Python, and if you're looking for something a little more advanced, try my Write Pythonic Code course at talkpython.fm/pythonic. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening, I really appreciate it. Now get out there and write some Python code.