#289: Discovering exoplanets with Python Transcript

Recorded on Tuesday, Sep 29, 2020.

00:00 When I saw the headline "Machine learning algorithm confirms 50 new exoplanets in historic first," I knew that Python must be operating somewhere in the background and that the story must be told. That's how this episode was born. Join David Armstrong and Jeff Gamper as they tell us how they use Python and machine learning to discover not one, but 50 new exoplanets in pre-existing Kepler satellite data. This is Talk Python To Me, Episode 289, recorded September 29, 2020.

00:43 Welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is brought to you by brilliant.org and us.

01:05 David, Jeff, welcome to Talk Python To Me. Thanks for having us. Yeah, thanks for having us. Yeah, it's good to have you guys here. You know, when I saw this article, its title was something like "Machine learning algorithm confirms 50 new exoplanets in historic first." It didn't say anything about Python, but I'm like, I've got to contact these guys and find out the story, because I bet you Python is involved. And oh my gosh, what an amazing discovery. And Python's all over astrophysics at the moment. I mean, almost everyone uses it. Good chance, I think. Yeah, exactly. Okay, machine learning and astrophysics combined: this has to be Python. So I reached out to you guys, and I'm really excited to have you on the show to talk about the research that you're doing and some of the discoveries that you've made. And it sounds also like this is kind of the beginning of the story of where things are going. Yeah, so this was really a test case we wanted to put together to build up for a load of big missions that are coming online soon. I mean, you might have heard of the Kepler satellite before; it's been around for a while. But it's the biggest data set we have for these exoplanet transits, which I guess we'll talk about in a bit. So it's a perfect test case to work with. Yeah, that's awesome. And it's so amazing that you found all these planets as a small test case. But Kepler is starting to near the end of its life, right? Kepler has stopped recording already. The main mission stopped in 2013, and it had an extension called K2 that went on for a bit longer than that. Yeah. But the data is still full of things to discover, as you can see. Amazing. Before we get too far into the topics, though, let's just set the stage really quickly with each of your stories. How did you get into programming and Python? Would you like to start, David? Or should I go ahead? Sure. I mean, I got into programming with, God, it's like college education here, what you'd call an A level in computer science, learning Pascal, if you believe it. Yeah.

02:45 It's going back a fair way, right? Now, I got into Python while I was doing my undergrad degree, and then mostly when I started doing a PhD in astrophysics, because back then it was kind of either Python or IDL; you basically picked one when you started. I think Python has won the war now. Yeah, I think so as well. I mean, I haven't used IDL, but there was this sort of domain-specific programming language for astronomy called IDL, right? Yeah, but it's proprietary, and you have to pay a fee for it. And I think that's what killed it in the end. But it's not killed; there are still people that use it. But the numbers are shrinking. Yeah, yeah, for sure. Jeff, how about you? So I also got started with Python during my undergrad. At that time, I was very much interested in quantitative finance and quantitative trading, which I haven't been interested in for a long time now, and wrote my undergrad thesis with Python and R. And then I came to Warwick to do my master's with David and Theo; Theo is another co-author on the paper. And there was this master's project to do exoplanet detection and validation with Python and machine learning. So I just jumped on it and got started. And that was kind of my second project with Python. Yeah, that sounds really fun. I got into programming, as in programming in general, and into C++ because I had to learn enough programming for a math research project. And it's funny how these things sort of take you down paths that you don't necessarily expect them to take you. Like, oh, I'm kind of interested in this project, I guess I've got to learn that. Wait, now that's my job, or that's my specialty. How did that happen? Yeah, exactly. I mean, this has been the case for me. It's been quantitative trading, then astronomy, and then a PhD in medical imaging, which is all Python. Yeah, absolutely. Especially the ML side of the medical imaging analysis, for sure. So how about what you're up to day to day? What do you do these days?
So nowadays, I'm towards the end of my PhD in medical imaging, which mostly involves all things Python: PyTorch, deep learning, computer vision networks, all these sorts of things. And I also work as a senior scientist at a startup in London, where we mostly work with climate models, but also remote sensing. So there's quite a bit of machine learning there as well, with computer vision on the remote sensing images and also fitting neural nets to climate data. So that's my day to day. How interesting. I think one of the cool takeaways you could infer from that is you're using

this skill you have with Python to do all these different things, right? Like quantitative trading, climatology, and astrophysics; those are not typically a common skill set. I mean, the common thing in all of those is programming, being able to code. And Python is a good language to start with for everyone, and it has a very big community around it, where most of the tools that you will need on a day-to-day basis for day-to-day tasks are already well developed and well maintained. So that's really what makes Python really cool for that. Absolutely. David, how about you? Stack Overflow, of course. Of course, you can always copy and paste a little segment to make it do something, who knows what. I mean, so I'm a lecturer at the University of Warwick. I'm in the physics department. So most of my days are, well, bits of research and teaching, writing grant proposals, supervising students, this kind of thing. I mean, there's more and more of the latter ones and less and less of the research, but I think that's how it goes. Yeah, absolutely. That is how it goes. Cool, cool, cool. I think being in academics is just really fun. You get to explore so many amazing ideas, and you're not rushed, right? It's not like you've got a six-week sprint to write this code and move on to the next thing. Yeah, you can really think things through and try and work out what you're doing. I mean, there's always something to do, right? So it's a different kind of rush, but nothing like industry, I think. Yeah, yeah. Well, as a PhD student, I'd like to speak here. Sometimes you do have to rush if you have a deadline and your supervisor is behind you telling you, now you've got to do this, you've got to submit your paper there. So, yeah. Yeah, I was in grad school for a while, and there was this guy; everybody knew him. He was working on his PhD, and he had been around for nine years in the PhD program.
And you could just tell he was meant to be at university. Wonderful guy. But there was just no urgency for him to finish this degree and get going, until they said, you know, there's a 10-year limit in this program, and if you don't finish within 10 years, you don't get your PhD. He's like, oh my gosh. He was done in six months. Well, 10 years as well, that's a lot of time. Yeah. He just chilled; he just loved it. Like, I don't know, it's easy living, I guess. All right, so let's talk about astronomy and Python. Now, there's the official research paper that you guys put out, and then there were pop-sci type articles. And in one of those, I think it was the TechRepublic one, it said, "Now astronomers are using machine learning algorithms to search for planets beyond our solar system, formerly known as exoplanets." Is there a different name? Am I missing something? Are they off here? They're still planets, but we just call planets outside the solar system extrasolar planets, or just exoplanets for short. Extrasolar? Yeah. Okay. I don't know if it's a formal name in the same way, but exoplanets is what they've been called in the literature for a few years now. So, yeah. Okay, that's what I thought. You know, they're not wrong. They're not wrong.

07:45 Super. All right. So when was the first exoplanet discovered? There's a bit of arguing about some of the details, but it's roughly 20 years ago, around 1995, I think, for some of the first discoveries. I mean, there were a few candidates before then, but it was back and forth in the literature for a while. Right. I feel like this was done by a couple of astronomers in the observatory in Hawaii. And so this was a ground-based discovery originally, right? Yeah. The two astronomers were based in Geneva, actually. It was Michel Mayor and Didier Queloz. Okay. And they got the Nobel Prize for it last year. Right. Oh, wow. Okay. Yeah. So that was the first one discovered. I mean, since then, we've gone up to over about 4,000 now, and it goes up every day. It's just been accelerating rapidly. Yeah. So we've always thought the solar system that we have cannot be unique, but we didn't have any proof of it. And so now, like you said, we do. And a lot of the data that people are analyzing must be coming from Kepler these days, right? So maybe we start with the story of Kepler. So Kepler launched in 2009. It was active in what we call the primary mission for four years. And it just stared at one patch of the sky, a fairly large bit, but still just one patch, for those four years straight, taking measurements of how bright the stars were every half an hour, for about 200,000 stars. And that's what really revolutionized the field, and the statistics as well. I mean, we went from having sort of hundreds of planets that we knew about to having thousands, and it was the first one that could really find planets that were a bit closer to being like the Earth, even if they were usually much hotter than the Earth, the ones we found. Right, because a lot of the early discoveries had to do with planets that were both very large and very close to their stars.
So the pulsing was frequent and very large, because it was so extreme, right? Like a hot Jupiter type of thing. Yeah, that's right. And those hot Jupiters were fascinating at the time. I mean, no one expected them to exist at all until we started finding them. Yeah, really interesting stuff. But of course, you want to find the planets that are closest to the Earth, right? That's what everyone's got in their mind. So you're always trying to push that boundary out a bit. Yeah, absolutely. So we had Kepler, and you said that shut down in 2013, and other ones must be coming online. The excitement around this can't have gone down, right, with the discoveries being made? Yeah. So with Kepler, what actually happened is it had these reaction wheels in the satellite, and too many of them broke. So they ended up not being able to stabilize it very well, and they sort of repurposed it by this very clever technique,

kind of bouncing it off the wind from the sun. But I don't want to go into that. That was called K2, and it carried on for a bit longer. But that's finished now as well; eventually it ran out of fuel. But now the one that's active that everyone's looking at is called TESS. It's another NASA mission that's, again, observing stars, trying to look for planets transiting by measuring the brightness of the stars, but doing it in a very different way. So it looks at almost the entire sky. And that's been up for about two years now, and it's still going on. Wow. Okay, maybe give us a little background on how this works. So it's not visual, right? They don't see the planets, much as we'd like to. Wouldn't that be amazing? You're like, oh look, it's a water world. No, it's measuring the effects of a planet on the star, right? Yeah. So the way we were looking at it is called the transit method. It's found most of the planets to date, though that might change going forward. If you imagine a planet going around the star, if everything lines up absolutely perfectly, sometimes the planet passes between us and the star, between the telescope and the star. And that's going to block out a tiny little bit of the light from the star. Almost like an eclipse. Yeah, it is an eclipse, but we call them transits because they're planets. We call them eclipses if you have a star instead. And it's a very small fraction. I mean, those hot Jupiters we talked about block out about a percent of the star's light. But for something like the Earth, it's very small fractions of a percent; parts per million is what we usually end up talking about. And so we're trying to measure the brightness of the stars and look for these dips in the light. And if you see them in a sort of regular pattern with the right kind of shape, and lots of other things, you can say it's a planet. Yeah, there was also the wobble method. Yeah.
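To put rough numbers on the transit depths quoted above, here is a minimal sketch. It assumes the standard first-order approximation that the fractional dip in brightness equals the ratio of the planet's disk area to the star's disk area, ignores limb darkening and grazing geometries, and uses approximate mean radii for the Sun, Jupiter, and the Earth. The function name `transit_depth` is made up for this illustration and is not from the paper or any library.

```python
# Approximate mean radii in kilometres (rounded reference values).
R_SUN_KM = 695_700.0
R_JUPITER_KM = 69_911.0
R_EARTH_KM = 6_371.0


def transit_depth(planet_radius_km: float, star_radius_km: float) -> float:
    """Fractional drop in stellar brightness during a central transit.

    First-order approximation: the blocked fraction of light is the ratio
    of the two disk areas, i.e. (R_planet / R_star) ** 2.
    """
    return (planet_radius_km / star_radius_km) ** 2


jupiter_depth = transit_depth(R_JUPITER_KM, R_SUN_KM)
earth_depth = transit_depth(R_EARTH_KM, R_SUN_KM)

print(f"Jupiter-size planet, Sun-like star: {jupiter_depth * 100:.2f} % dip")
print(f"Earth-size planet, Sun-like star:   {earth_depth * 1e6:.0f} ppm dip")
```

A Jupiter-size planet comes out at about one percent and an Earth-size planet at under a hundred parts per million, which matches the orders of magnitude mentioned in the conversation.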
And do you want to do it? Yeah. Yeah, go ahead. Yeah. I mean, with the wobble method, it's again an indirect observation method for detecting the planets orbiting the star. It's mostly based on the star and the planet having a common center of mass and then going in an orbit around that common center of mass. And that would make the star wobble. But most of the observation methods for the wobble effect work for the large planets, which we had been detecting before. And for the smaller planets, that might not necessarily work, especially for the Earth-like planets. So that's why the transit photometry that is used by Kepler is a lot better for detecting the Earth-like planets. And that's what we're interested in: quantifying the frequency of these Earth-like planets. Yeah, figuring out if anything lives there is one thing, but finding them is the first step, right? And the closer the better. Yes, you've got to start somewhere. Exactly. So this used to be done by hand, or maybe studying one-offs, like, let's look at this one star. Maybe you guys give us the background on how this was done before, and then some of the techniques that you brought in with Python here. So previously, people would still survey lots of stars. They'd try to observe loads of them and try and find the ones that had these transits around them. But you kind of identify candidates automatically. And then once you have those candidates, you'd have to look through them all and basically say, okay, this one's false, because clearly, you know, the telescope's jumped about here or something like that. And, you know, people would go through this, and you'd have large teams of, often, PhD students looking at these candidates for a while. You know, it wasn't really a great use of people's time. You got very familiar with the data, right? And that's kind of okay, but that only works for so long.
So, yeah. Yeah, you study all the science, you go to school for seven, eight years with your undergrad, and you're just looking for variations, common variations, right? I mean, there's a lot of science that's like that one way or another, right? I know. Yeah. I mean, I know people who sat in fields, basically watching swallows flying and measuring how often they did it and when, for days upon days upon days. And it's kind of nice to be outside, right? That one's not so bad. But, you know, there's a lot of data taking. Right, yeah. But anyway, the more we can automate that, the better, really. Not only because you don't want people to have to spend all their time doing that kind of thing, but because if you're trying to do statistics with it, you really don't want human biases coming into it. Yeah, absolutely. Like with some of the earliest things, and I don't want to name any names or surveys here, I think you could start to see a frequency in how often people flagged the candidates as potential planets based on when the coffee breaks were. There was a peak after coffee breaks and lunch, right? And, you know, that's fine if you find all of them. Exactly. And yeah, we talked about this previously; this is super interesting, that human bias, not even bias in the traditional sense, but just moods and stuff would affect it. You know, you've had your afternoon coffee and cake or cookies or whatever, and so now you're a little more focused, a little happier. And oh yeah, that probably is a planet. Yeah, you're a little less pessimistic. Yeah. Yeah. So you flag a few more things, right? Yeah, that actually becomes really important if you work in medical imaging.
Because there it's not just one PhD student looking at a particular light curve for a planet. Here you have multiple doctors looking at the same image, and all of them have a different idea of what is happening in that image, and then you try to figure out what the truth is. How interesting. And of course, it has way more serious consequences, right? If there are hundreds of thousands of planets out there and we missed one, well, yeah, maybe it'll even be found later. But if you're trying to determine whether you need treatment for cancer, yes or no, and they're like, no, or yes, or whatever, right? That is a serious knock-on effect. All right.

15:00 So you talked about this light curve. Maybe describe the data that you're looking at. Because if you're doing, say, this cancer research and having machines look at that, that's like a mammogram or something along those lines. But this is not a picture in the traditional sense that you're analyzing with this study, right? Yeah. So when I got started on this project as a master's student, I spent quite a bit of time with David just asking him loads of questions to understand: what is the data? What is it that I'm looking at? How do I even work with it? What am I supposed to do with it? Because around the Kepler satellite itself, there is a massive data processing pipeline. First, the satellite itself naturally looks at the stars and records the brightness of a star over time. Then that data is transmitted to the research center, where it is processed along with the engineering information about what was happening on the satellite. And that together is post-processed to get the final light curve, or brightness of a star. Then that brightness of a star over time is processed to detect new planetary signals, again and again and again. And from that, there are parameters derived on what specific orbital period this planet might have, what its size might be, et cetera. And these become the features that we work with. And I've probably missed a whole lot of things already, so David can probably add some more on that. I mean, I don't want to add too much detail at the get-go, right? Fundamentally, though, it's the satellite measuring how bright these stars are over time. We have to take pictures with the satellite, so you have a CCD image like you'd see on a normal camera, but they have to identify the stars in that, work out where they are and how bright they are at each point, and try and turn that into a long time series.
But in the end, what we've got is lots of time series data. Right. Once you've narrowed it down to a single star, you're able to just say, what is the brightness of that star over time? Yep. And in Kepler's case, for most of them we'd have a measurement every 30 minutes, and that would carry on almost continually for about four years. So you're talking around 60,000 data points for each one there. So it gets quite nice. Yeah, that's quite a bit of data. To step back a little bit on the science side, I guess, one thing that just blows my mind about this is the timescale that you're working on here. If you've got, say, something like Mercury or a hot Jupiter going around and around really fast, you get lots of measurements. But if you've got something far out from the star, it could be a year, multiple years. Like, that's just a couple of passes, right? Yeah. And so with each pass, you get one transit, right? And if there are fewer than three, it starts to get very difficult to say for sure whether it's really a planet you're seeing. Yeah, and not just two random blips where the instrument got a bit hot or something. Yeah. So really, we kind of require that there are at least three in most cases before you'd really claim something was a planet, and even that's pushing it. Yeah. So ideally, you want more than that. And you can see, you need to stare for four years just to start to get anything like a year. I know, it's great. Yeah, it's such a large time scale for these planets. And this is why most of the planets we know just aren't on the same sort of periods as the Earth. I mean, the vast majority of planets we know actually have periods less than Mercury's. Well, Mercury's, I forget off the top of my head, is around 80 days, right? But most of the planets we know are at sort of sub-20-day periods, just because it's easier to find them. That's fine. Yeah, it's where you get all the measurements.
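The arithmetic behind those figures can be sketched in a few lines. This is an idealized illustration that assumes uninterrupted observing with no data gaps and a 365.25-day year, so the cadence count is an upper bound (roughly 70,000, a little above the "around 60,000" quoted in conversation). The function names are invented for this example.

```python
DAYS_PER_YEAR = 365.25


def n_cadences(years: float, cadence_minutes: float = 30.0) -> int:
    """Upper bound on brightness measurements for one star, assuming no gaps."""
    return int(years * DAYS_PER_YEAR * 24 * 60 / cadence_minutes)


def guaranteed_transits(baseline_days: float, period_days: float) -> int:
    """Minimum number of transits that must fall inside the observing window.

    Depending on orbital phase, the window holds either this many transits
    or one more.
    """
    return int(baseline_days // period_days)


baseline_days = 4 * DAYS_PER_YEAR
print("Cadences in 4 years:", n_cadences(4))

# Hypothetical orbital periods, in days, chosen only for illustration.
for period in (20.0, 365.25, 730.5):
    n = guaranteed_transits(baseline_days, period)
    verdict = "meets" if n >= 3 else "fails"
    print(f"period {period:7.2f} d -> at least {n:3d} transits, "
          f"{verdict} the three-transit rule of thumb")
```

With a four-year baseline, a planet on a one-year orbit only just clears the three-transit rule of thumb, while a two-year orbit does not, which is why short-period planets dominate the known sample.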
But the other ones must be out there, right? Yeah, I mean, the indications we have are that planets are everywhere. Very, very common. And it's just a matter of getting the detection efficiency good enough to find the ones that are further out, really, I think. Yeah. David talked about this data, large data, you know, 60,000 measurements for a single star and hundreds of thousands of stars. Maybe you want to talk about the data processing pipeline, and how you took this data and worked through it and got it to somewhere you guys could analyze. I mean, the big advantage was that most of the data is already public. So for all of the parts of the data set, you can access them online on the associated Kepler websites and then do some of the processing steps to derive the features that you're interested in. Okay. And also the confirmed planets and confirmed false positives, as labels for machine learning, are available online. So when I was working on the master's project back then, David pointed me in the right directions: where do I have to get the labels, where do I have to get the features. And then he naturally directed me in terms of what features we were supposed to derive from that and what kind of post-processing was supposed to happen. And that's where the advantage of Python really comes in, and packages like scikit-learn, because most of these steps are already available in terms of normalizing the features, making the machine learning pipelines, and making the hyperparameter optimization pipelines. So that's all been kind of easy to set up. In terms of the features for the data set, I don't know how much into the details we want to get, but there are a few features that we computed on top of the features available from the Kepler data processing pipeline. And we naturally have feature importance to determine what we need and what we don't need. Yeah, tell us about feature importance.
I wouldn't remember which features were important. Well, in fact, I can still remember a few. I've been looking at this more recently, I guess. I mean, the one that always comes up at the top is the shape of the transit. So when the planet is in front of the star, there's a sort of characteristic shape you get. It's very much like a U shape. And that tells you a lot. I mean, a lot of the things

that can cause false positives. So what we're trying to do with all of this is work out, when we see a candidate, whether it's a real planet or whether it comes from something else. Quite often you get things called eclipsing binaries, where it's basically just two stars orbiting each other, and you see the eclipse of one of them. And that can give you a very similar signal that might look a bit like a planet, but isn't actually. And there are a few other things like that that can cause false positives, right? And they tend to have these subtly different shapes, sort of narrower eclipses that are a bit more V-shaped, and things like that. And that always helps us a lot, but it doesn't get every case, sadly enough. If my memory is not too wrong, David, was it also the case that some of the uncertainty measurements that were computed by the Kepler satellite pipeline were also among those important features, or am I wrong? I don't think we used them in the end. Okay. No, they did go in to start with, but they didn't come out so important when we were doing those tests. Okay. Yeah, I'm no expert at image processing with machine learning, but it seems to me that these models are very good at detecting small, minute changes that humans often miss, right? Like, it was so successful detecting breast cancer in mammogram scans that a lot of trained professionals actually missed. And these slight variations, like, oh, this is how it is when it's a star with a star rather than a planet and a star, are subtle, but somehow machine learning seems to be able to find those and pull them out.

21:18 This portion of Talk Python To Me is brought to you by brilliant.org. Brilliant has digestible courses in topics from the basics of scientific thinking all the way up to high-end science, like quantum computing. And while quantum computing may sound complicated, Brilliant makes complex learning uncomplicated and fun. It's super easy to get started, and they've got so many science and math courses to choose from. I recently used Brilliant to get into rocket science for an upcoming episode, and it was a blast. The interactive courses are presented in a clean and accessible way, and you could go from knowing nothing about a topic to having a deep understanding. Put your spare time to good use and hugely improve your critical thinking skills. Go to talkpython.fm/brilliant and sign up for free. The first 200 people that use that link get 20% off the premium subscription. That's talkpython.fm/brilliant, or just click the link in the show notes.

22:11 I think the best thing here is that sometimes, in the extreme cases, someone looking at it who was awake and paying proper attention and everything would spot the difference. But what the machine can do is give you a nice sort of quantified boundary between the two and say, you know, this is where you need to draw your line, and stuff near here falls on this side or that side, and here's how confident you should be about it, given what we can see. And it can make all of that quantified. And that's the real benefit, I think. Yeah, you're not just relying on people keeping themselves the same the whole way through, which no one does, of course. And one of the things I think is probably a challenge here, and it's certainly a challenge in some places, is that as you look at more of these signals, you get better at identifying those things, but you weren't good at the first hundred you looked at, or whatever, right? You've become trained in looking at this type of data. And so you bring on a new grad student, and they go through that similar learning curve as well. And you can try and mitigate all that by getting multiple people to look at every candidate, but then you're just multiplying up the workload, right? In machine learning, they often refer to it as graduate student descent, where you just have multiple graduate students, and they iterate on each and every one of them. Exactly. It's like a different form of deep learning, right? If they get there, they can probably still get the wrong minimum, right? Or the wrong optimization. So one of the things about this that it sounds like made it work is, you talked about how there's already this Kepler data where it's labeled, right? These are verified exoplanet signals, these are verified false positives, and these are not exoplanets. Right. So you were able to leverage that to teach a scikit-learn model. Yes.
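A minimal sketch of that kind of supervised setup, comparing the two discriminant-analysis classifiers that come up in this part of the conversation. It is not the authors' pipeline: the data here is synthetic (generated with scikit-learn's `make_classification`) rather than real Kepler features and labels, and the feature count, fold count, and scoring metric are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a table of candidate features (X) and
# planet / false-positive labels (y). Real features would come from
# the public Kepler data products instead.
X, y = make_classification(
    n_samples=600,
    n_features=8,
    n_informative=4,
    n_redundant=0,
    random_state=0,
)

# Stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Scaling lives inside the pipeline so it is refit on each training fold.
models = {
    "LDA": make_pipeline(StandardScaler(), LinearDiscriminantAnalysis()),
    "QDA": make_pipeline(StandardScaler(), QuadraticDiscriminantAnalysis()),
}

scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean ROC AUC over 5 folds = {scores[name]:.3f}")
```

On real data the features would typically sit in a pandas DataFrame, which scikit-learn estimators accept directly. Which of the two models scores higher on this synthetic data says nothing about the Kepler result; the point is only the shape of the workflow.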
So we took some of the verified planets and some of the verified false positives and took the features. And essentially, you create a pandas DataFrame that you can feed into the scikit-learn machine learning models, split it into your cross-validation folds, fit the hyperparameters, and see what you get. Most of the metrics to really evaluate the performance of these classifiers are already out there; it's more a question of how to interpret them from an understanding of the model. So one interesting thing, from what I remember, was that the data set itself seemed quite easy in terms of us getting very high accuracy metrics, but the slight difference between the different classifiers was interesting. Linear discriminant analysis performed slightly better than quadratic discriminant analysis, which meant that the decision boundary between the false positives and the confirmed planets was actually way simpler, and you didn't need a quadratic function to describe the difference between them. So you needed to be careful not to overfit to too many outliers, essentially. And the clearer the distinction, the better it is for you guys, right? Yeah. And just quantifying how confident we should be about that distinction. I mean, a big part of the improvement we made was trying to get probabilistic output, so we could really get a proper probability for whether something was a planet or not. Yeah, versus just trying to rank the best ones, which is what people have tried to do before. I understand

normal surveys, normal studies, where they have very basic statistics, right? Like, how many standard deviations is this from the mean, or various other standard statistical analyses. What about this? Once you've fed that all through machine learning, what do you get out the other side? How do you make sense of that? So probabilistic machine learning is a huge thing in itself, right? And even within that, there are different ways. I mean, we used multiple different models. One of them that's naturally probabilistic is a Gaussian process classifier, which we built with GPflow, which is a different package. And that actually took a GPU to run, which we thankfully got for free from Nvidia. So I should probably say that; you know, they'll be happy. Awesome. Yeah, that's cool. So that sort of naturally gives you a probability of class membership. You set it up with the whole training set and the labels that you have, and it'll tell you the actual probability that any sample is a member of any one of those classes. Yeah. But all the other ones we have to calibrate. So we get some sort of score out of the machine learning model, and then we have to take all of our examples and say, okay, well, we've got a score of 0.6, but actually 80% of these are really planets, so really that needs to be 0.8. And you can kind of calibrate that way, sort of bootstrapping yourself. Do you have people go back through and basically re-analyze what the machine learning suggested, just focusing on those stars? Once it says, you know, 67% likelihood, what happens? That's definitely where we had to actually work on calibrating the classifiers that were non-probabilistic. We also wanted to compare against the probabilities produced by some of the previous methods, like vespa, which is a physics-based model that evaluates the likelihood of different false positive scenarios and the planetary scenario.
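The calibration idea described here (a raw score of 0.6 that should really read 0.8 because 80% of such candidates turn out to be planets) can be illustrated with simple histogram binning. This is one basic calibration technique chosen for clarity, not necessarily the one used in the paper, and the scores and labels below are made up. Both function names are hypothetical.

```python
from collections import defaultdict


def fit_binned_calibration(scores, labels, n_bins=10):
    """Learn, per score bin, the observed fraction of true planets.

    `scores` are raw classifier outputs in [0, 1]; `labels` are 1 for a
    confirmed planet and 0 for a confirmed false positive, taken from
    held-out examples the classifier was not trained on.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for score, label in zip(scores, labels):
        bin_index = min(int(score * n_bins), n_bins - 1)
        totals[bin_index] += 1
        positives[bin_index] += label
    return {b: positives[b] / totals[b] for b in totals}


def calibrate(score, table, n_bins=10):
    """Map a raw score to the observed planet fraction in its bin."""
    bin_index = min(int(score * n_bins), n_bins - 1)
    # Fall back to the raw score if no held-out example landed in this bin.
    return table.get(bin_index, score)


# Made-up held-out set: ten candidates scored 0.65, eight of them real planets.
held_out_scores = [0.65] * 10
held_out_labels = [1] * 8 + [0] * 2

table = fit_binned_calibration(held_out_scores, held_out_labels)
print(calibrate(0.62, table))  # a raw score near 0.6 is mapped to 0.8
```

scikit-learn also ships `sklearn.calibration.CalibratedClassifierCV`, which wraps an estimator and applies sigmoid or isotonic calibration; the hand-rolled version above is only meant to make the mechanism visible.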
And that's where things got interesting, because the probabilities that our machine learning classifiers were producing versus the probabilities produced by previous models were slightly different, particularly for false positives, if I'm not mistaken. More than slightly. I mean, they disagreed in something like 27% of the cases. Yeah, yeah. And vespa, I should say before we get deep into that, is a sort of field-standard tool. It's like an astrophysical model, so you're fitting models of these different planet and false positive scenarios to the data and trying to say, how well does each model fit? Right. From what you understand, does it make sense with the way you think planets and gravity and all that works? Yeah, yeah. Because we can build up a fake system with a binary star and so on and say, okay, well, this is what that data looks like, and does it match, does it not, and so on. And you can do that properly, in a Bayesian way, and get some good statistics out of it. The trouble is making it run fast enough on a large number of samples, because to do this properly is very, very slow. And vespa makes some approximations too, and it has a whole set of results. But until now, it's been the only thing that can run fast enough to really do this on large numbers of candidates. And so we were the first people to be able to compare against that. And we actually got some quite big discrepancies, which is kind of worrying, right? Oh, wow. Yeah. I mean, it's both exciting and worrying, because somebody has got to change something, right? Or learn something. Yeah. Yeah. Well, it's important to do, you know. We haven't really got to the root of that yet as to what's causing it. But when there was an independent label, it pretty much favored our classifier. So we're optimistic about that part, at least. Yeah, that is optimistic. Do you want to tell us more about this GPflow library? 
So GPflow was something I briefly worked with when working on this project. At that time, I was quite interested in Gaussian process classifiers and Bayesian nonparametrics. The problem with Gaussian processes generally is that they're quite hard to scale, because you need to do a matrix inversion to be able to get the posterior. What GPflow allows you to do is formulate the whole problem, or rather it uses methods that formulate fitting your Gaussian process to the data as an optimization problem. And that's where the GPUs come in. GPflow is naturally built on TensorFlow, which is built for optimizing parameters on GPUs. And that's what GPflow is, but for Gaussian processes specifically, with a few inference methods already prebuilt into the package. You can just define your Gaussian process, define your kernel, define the hyperparameters, give it your data, and it will do the work for you. That's what GPflow essentially is. Yeah. Does it basically require an Nvidia graphics card underneath? Well, any decent GPU will help speed it up, right, but you can get away without one; it just takes forever. Yeah. Yeah.
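The calibration idea mentioned earlier in this segment (a raw score of 0.6 that should really be 0.8 because 80% of such candidates turn out to be planets) can be sketched with simple score binning. This is a minimal illustration using only the standard library and synthetic data; the function names, the bin count, and the assumed relationship between score and true probability are all hypothetical, and the study itself may have used a different calibration method.

```python
import random


def fit_bin_calibration(scores, labels, n_bins=10):
    """For each score bin, return the observed fraction of true planets."""
    counts = [0] * n_bins
    positives = [0] * n_bins
    for score, label in zip(scores, labels):
        b = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        counts[b] += 1
        positives[b] += label
    return [positives[b] / counts[b] if counts[b] else None for b in range(n_bins)]


def calibrated_probability(score, bin_fractions):
    """Map a raw classifier score to the planet fraction observed in its bin."""
    n_bins = len(bin_fractions)
    b = min(int(score * n_bins), n_bins - 1)
    fraction = bin_fractions[b]
    return score if fraction is None else fraction  # empty bin: leave score unchanged


# Synthetic, under-confident classifier (assumption): the true chance of being
# a planet is sqrt(score), so raw scores understate the real probability.
random.seed(0)
scores = [random.random() for _ in range(20000)]
labels = [1 if random.random() < s ** 0.5 else 0 for s in scores]

bin_fractions = fit_bin_calibration(scores, labels)
raw = 0.65
print(f"raw score {raw:.2f} -> calibrated {calibrated_probability(raw, bin_fractions):.2f}")
```

In practice the mapping would be fitted on held-out labeled examples rather than on the same data used to train the classifier, so the calibrated probabilities are not overly optimistic.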

29:18 Nice. So how about the compute cluster? You know, I've had folks from CERN, the Large Hadron Collider; they've got this incredible compute cluster that they can send their stuff out to. The Square Kilometre Array folks also had this huge amount of data they're processing on a big compute cluster. What about yours? So back when I started this project, I was doing it all on the laptop that they gave us as part of the program. But as David mentioned, later on it was run on a GPU. So yeah, it progressed from the tiny laptop that I had to the GPU. Yeah, though still, I mean, just one GPU. We never really got to big computing nodes and clusters. It's just not really a processing-limited job, I think, at least at the scale we were running at. Right.

30:00 I think it's really interesting, and that's why I want to tease that out, because it didn't take a huge amount of computing power to do this. All of the non-Gaussian-process models, I mean, there are some random forests and multilayer perceptrons in there, all of them, even by the end, when we were running publication-level results, were still running on a pretty standard iMac, just a desktop computer. So it's not using these big supercomputers either. Yeah, how much of the processing is happening in there? Do you start from the brightness-over-time signal? Or do you start from the actual image of the whole sky, the image of the star? Where do you begin this analysis? Yeah, so this is a good question, right. A lot of the quick analysis I've been talking about is all of the training of the models and so on. A lot of the features were precomputed by NASA as part of the Kepler pipeline, and that saves us a lot of processing time. I mean, it really does. You can't really overstate it. It's actually a bit more of a processing problem when you try and do that as well. Right. But it's already been condensed down to: you care about this star, here's the data, go look at it. Yeah. So there's a lot of metadata on it. And we'll still take the time series and calculate some extra things from it, but not at the same level. So we don't have to start with the real pixel-level camera images, and those take a long time to process. Yeah. You talked about Kepler being done and TESS coming online. Maybe give us a sense of where this kind of research is going, like additional projects, additional studies you all might be doing. What does the future look like with new telescopes, larger fields of view, more data, and so on? 
Well, I mean, with this one, we were really just trying to demonstrate the method on Kepler, but TESS is where you need it. I mean, Kepler has maybe 8000 candidates, I think, in the database. The yield estimates for TESS imply that we're going to find around 20,000 planets on top of several hundred thousand false positives. So you're already upping it by a couple of orders of magnitude anyway. Yeah, and that's in a sample of around 20 million, so it's a much bigger data set all by itself. And that hasn't really been done yet at this sort of scale. So really, the next thing we want to do is generalize this model and apply it to TESS. Yeah. And there are future missions coming online. There's one called PLATO, a European Space Agency mission, that's due to launch in a few years. With various things coming online, it's a big, interesting topic. There's lots of stuff in the pipeline. How long will TESS run for? Will you be able to detect things that are further out, like Neptune-type things? Or is it still short? So one of the trade-offs with TESS was that it's observing the whole sky, but that means for most of the sky, it observes it for less time, just a month actually for most stars, to start with. I mean, the interesting thing about TESS is they built it so it has about 10 years of fuel, I think, a lot more fuel than it's currently funded to operate for. Optimistically, it could be going for the next decade. They will figure it out. If we could get someone to pay attention for the next eight years, we'll just keep it running. You know, if you're doing well enough, eventually it just sort of, yeah, it's like Kepler, I guess, once you're doing well enough, right? David, I've got a bit of a machine learning question. How applicable are the models actually built on Kepler to the TESS setup? Because from TESS, you won't have initial data to start with, right? Yeah. 
So this is one of the big challenges of building it for TESS: we have to simulate a training set by making models of all the different scenarios and injecting them into the time series. Right, because the measurement might be more sensitive; it might look slightly different. It's not the same instrument. So different cadence, different noise. Yeah. All that stuff. Yeah, for sure. I mean, the advantage of that, though, is we can start to really build up the explainability of it, because we can test how things perform on different scenarios, and actually just increase the size of our training set hugely. So there are some advantages to that. Yes. And work, for sure. Yeah, you've kind of got to go back to the research. You could go find the ones that are verified by Kepler and say, let's look at those, see what the curves look like, line it up. Yeah. And it's a nice thing, because you can look at the stars that both Kepler and TESS observed and say, oh, well, we know there's not a planet there, so this is clearly some kind of noise, and use that in your training set, and this kind of thing. Yeah. There are some nice little synergies to building with it. Yeah, absolutely. Well, what about other fields? Can people take this idea of creating these machine learning models, studying the data that was already labeled, and automatically get insight into, I don't know, climate, or weather, or earthquakes, things like this? I mean, of course, machine learning is being used everywhere. Yes. I mean, I didn't quite understand the question, if you could repeat what it is? Well, I'm just thinking there must be some inspiration that other fields who are not yet using machine learning can take from this. Do you see any where it's being underutilized? I guess, obviously, it's been utilized in a lot of places. Probably plenty of fields. I can't think of many off the top of my head, but, I mean, one big field is remote sensing. 
There is just so much data from observing the Earth's surface, and most of it just kind of sits there unused. If you think about two satellites, Sentinel-1 and Sentinel-2, most of the data from these satellites is not put to use at all, yet they produce terabytes of data. And we've actually seen that in practice. The problem there is that there are very few labels. But what you can learn is that it's actually okay to use labels that have been produced by

35:00 Other methods. In the same way, some of the planets have been confirmed by multiple follow-up observations, but there is still some likelihood that they could have been wrong. Ultimately, once you put all of that together into one big data set, there will still be some signal from which machine learning can learn what features to use to do the classification, and that way you scale up once you have large data sets and slowly move towards the answer that you're looking for. Or you might not be looking for an answer; you just want to see what happens. Yeah. And I guess if you start with sort of messy labels and you get a head start on classification, it can tell you which labels are most likely to be messy, and you can make your training set better, right? You start iterating. Yeah. What about using it for iteration? Like, I don't know very much about this data, but we can label a few things, see what it detects, go back and pay attention to what it thinks is important, and say, yes, you're right, or no, you're wrong, let's retrain you, do it again, and so on. Yeah. So there are actually quite a few publications in machine learning that look into these kinds of problems. Intuitively, it sounds very much like expectation maximization: you move your parameters slightly, then you maximize over your data set. Amazon especially has quite a bit of research on that, because they have Amazon Mechanical Turk, and they have loads of people labeling different data sets. Yeah. And sometimes people will label them wrong. So you want to fit your model to some of the labels, see how that compares to other labels, and then, given that, infer what the possible labeling noise is, and given that, optimize your model, et cetera, and iterate on that continuously. Very cool. Well, this is such a neat project. 
And like I said, I think it really captures the imagination when people see what you guys have done here, because there's so much data out there, and I think we're just scratching the surface of what we've learned, right? This stuff is so hard to detect that the more we can do these studies, and the more we can learn about stuff outside of the solar system, it's just amazing. Yeah, I mean, for sure. And we keep getting more and more data coming in year on year with this. Yeah. Can you study anything from the ground? Is there anything accurate enough on ground-based telescopes to give you that level of information? Or has it got to be... Yeah, so I mean, we actually run a telescope in Chile called NGTS, the Next Generation Transit Survey, that surveys stars and regularly picks up these transits. What you can't do is get down to the Earth-sized planets around Sun-like stars. That's really tough. Yeah, but certainly for Neptune-sized planets and Jupiter-sized planets, we can definitely do it from the ground. Yeah. And some of the first transits were found from the ground, for sure. Oh, yeah. Awesome. Now, before we round out our conversation, I do just want to ask you a couple of more philosophical questions. First of all, it seems to me that there's just an incomprehensible number of stars out there, right? We've got how many stars in an average galaxy? 100 billion? I don't know about the average, but that's about the Milky Way. 100 billion, yeah, something like that. How many galaxies? I mean, uncountable, right? They keep going out to the edge of what we can see. Right. But it's millions of galaxies, or more. I mean, it's not a countable number. So you've got to multiply those two numbers, and then you think each star may have a planet. Many stars have planets, maybe multiple planets. They're certainly common. Yeah. It's probably more common than not, at the very least. So yes, there are definitely planets everywhere. 
It seems incomprehensible to me that life is only here, right? Yeah. And I think a lot of astronomers agree with that. Obviously, we can't detect that life yet. It would seem remarkable if it didn't exist, right? I mean, what is hard to say is how often intelligent life would happen, and this kind of thing. And there are all sorts of open questions there. Yeah, I think that's a really interesting thing to ponder. You put those numbers together, and the fact that you all just keep discovering more and more planets, so they must be fairly common. The chance that there's not something like what we have here in other places in the galaxy seems pretty small. The chance that we find it or ever interact with it, maybe zero. But the fact that it exists is already interesting to dream about. It would be incredible to prove that it exists, that's for sure.

38:59 It sure would. What's even more interesting, if we do ponder it, like you said, about other life forms existing, is the impact it might have on society and society's perception of itself. Previously, when there were big astronomical discoveries that changed our worldview, society changed with them too. And what could the impacts be? I don't know, if we observe sufficiently many exoplanets and we discover something new, maybe that will change our thinking and the way we operate with one another. I don't know. Yeah, that would be great. If it turns out that there really is life on Venus.

39:35 Yes, that's a very new piece of news, that they discovered... I forgot the gas, but some gas that is typically... Phosphine, wasn't it? Yeah, it was phosphine. That's right.

39:44 Amazing. So it feels to me like there must be life out there. And I think your work is tipping the scales on the likelihood, even if it has nothing to do with actually detecting or interacting with the life itself. It's really cool. Like you said, you've got to find the planets to be able to look for life on them.

40:00 That's right.

40:03 Talk Python To Me is partially supported by our training courses. Python's async and parallel programming support is highly underrated. Have you shied away from the amazing new async and await keywords because you've heard it's way too complicated, or that it's just not worth the effort? For the right workloads, a 100 times speedup is totally possible with minor changes to your code. But you do need to understand the internals. And that's why our course, Async Techniques and Examples in Python, shows you how to write async code successfully, as well as how it works. Get started with async and await today with our course at talkpython.fm/async.

40:42 The other one: I'm a big fan of SpaceX in the sense that it's really pushed the boundaries of what we can do in space. Like when I saw those two rockets land side by side, and people thought it was fake because it was so synchronized. Like, there's no way that's going to look that good; that's just CGI, right? So the stuff that we're doing is so amazing. And yet, it's not always positive for astronomy, right? Like their internet satellites; there's a big uproar about them messing up ground-based telescopes. Right, yeah, and you can find some pictures where you see 50 of these satellites streaking through the sky. Right, right, exactly. It's not too bad now, because there aren't that many, but I think they have plans to hugely expand the numbers. Yes. I just wanted to ask you what you thought about that. I think some of the future generations are supposed to be designed to be a bit less bright in the images we have. But there's still a big open question about what effect they're going to have. I think people aren't really sure yet. Yeah, it's certainly not exactly my field. I mean, I work with data sets from satellites. The reason I thought about it is that those things passing in front of stars are going to mess up these light curves, very much in the way that some other kind of transit would. Well, if the light curve is coming from a satellite that's already outside the orbit of these ones, that is okay, right? But all the ground-based stuff can be affected. Yeah, yeah. And I mean, some of the stars we look at are relatively bright, but people study these very faint galaxies and things where the signal could be completely drowned out by a satellite like that passing. And that's what makes you worry. I mean, there are other examples, of radio telescopes where data sets became unusable because of people's mobile phones near them and things like that, right? 
So you can see how some of the really sensitive measurements get affected by anything. And it's a big trade-off, right? The satellites are going to be something. The fact that you all can detect things from so far away, and so faint, still just blows my mind. Yeah, I mean, with some of these, like the wobble method Jeff was talking about earlier, we can sometimes measure stars moving towards us at a sort of walking pace. Yeah, like when they're orbiting, it's that whole star moving towards us at a few meters per second, like a brisk walk. And it's 10 million light years away, or something insane, right? Or how far? Hundreds of light years, for sure. A hundred light years? Yeah, okay. Yeah. So many, many, many millions of miles, right? Very far away. And it's, yeah, like walking pace. Unbelievable. So, I don't know, we'll see what happens, I guess. I mean, maybe there are things you could do: if what you're measuring means looking for a long time and you knew the satellite was coming, just don't take the signal at that moment and have a little gap in the time. But if there's a continuous stream of them, I mean, what do you do? Yeah, I know. It's not good. I think that is the end goal. To me, the question that comes up with this project that SpaceX has, releasing these satellites, is also: why do we need that in the first place? Because to me, it's rather strange. And it's more of a symptom of the problems we have in our century overall: the focus is growth, growth, growth. Yeah. So we can throw satellites in the sky; let's just throw even more satellites. Maybe we don't need that. And maybe in the first place we need to think about, well, maybe we need to think about degrowth, if we think about climate change, et cetera. Yeah, yeah, there's an argument for providing internet to a lot of places where you just can't get connections to. 
And I think once you get into those sorts of things, it's quite hard to decide what the best thing to do is. Never solidly speaking. If it's a for-profit thing, that's different. Yeah. I mean, the point there is that there are specialists who study these kinds of things, like anthropologists, et cetera, in terms of how much we need all that stuff. And sometimes it's useful to consult them too. Maybe we should consult them in this case, and not just look at it as a technical kind of thing, where it can hurt astronomers or will not hurt astronomers, but look at it from a broader picture. Yeah, it's a good question. And I think we're going to leave it there for our main conversation. But let me ask you the final two questions before I let y'all go. Jeff, starting with you: if you're going to write some Python code, what editor do you use? At the moment, for the most part, I'm using PyCharm, and sometimes I'm using Atom, and sometimes I'm using Vim if I'm developing on a different machine that I'm connecting to remotely. All right, David? I've been using TextWrangler for years, which is probably not a very fancy option. I think it has to be BBEdit now; they've asked me to upgrade. Perfect. And then a notable PyPI package, like some library out there that you've run across where you're like, this is really cool, it really helped with our project, or maybe people don't know about it, so check this out. You guys got any that come to mind? I mean, the obvious one is GPflow, which, yeah, enabled a lot of the Gaussian process calculations.

45:00 I mean, everyone knows about scikit-learn, I guess. Yeah, everyone definitely knows about scikit-learn. The one that was most useful to me recently was rasterio, which is a package that allows you to work with georeferenced images and remote sensing images. So that was really useful for me. And PyTorch. I really like PyTorch, big fan, if we want to get into the whole debate of Keras versus PyTorch. I'm definitely on the PyTorch side. Awesome. All right. Yeah, I hadn't heard of that one for geospatial data. Very cool. All right, final call to action. People are out there listening. They want to learn more about your study, or maybe learn how they can take those ideas and apply them to their own research. What do you say? Just reach out on Twitter. Yeah. I mean, if you don't want to read the actual paper, that's definitely the best thing to do. We're happy to answer questions. Okay, cool. I'll put links and ways to get in touch with you in the show notes. And of course, the article. Cool. Sounds good. Great. Thanks a lot. Yeah. All right. Thank you guys for being on the show. It was really fun to talk about your project, and congratulations on doing this work. It's pretty cool. Thanks for inviting us. It's been great. Yeah, thank you so much. Bye. Bye. Bye. This has been another episode of Talk Python To Me. Our guests on this episode were David Armstrong and Jeff Gamper. And it's been brought to you by

46:11 Brilliant.org encourages you to level up your analytical skills and knowledge. Visit talkpython.fm/brilliant and get Brilliant Premium to learn something new every day. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course. Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.
