#81: Python and Machine Learning in Astronomy Transcript
00:00 The advances in astronomy over the past century are both evidence of and confirmation of the highest heights of human ingenuity. We've learned, by studying the frequency of light, that the universe is expanding, and by observing the orbit of Mercury that Einstein's theory of general relativity is correct. It probably won't surprise you to learn that Python and data science play a central role in modern-day astronomy. This week you'll meet Jake VanderPlas, an astrophysicist and data scientist from the University of Washington. Join Jake and me while we discuss the state of Python in astronomy. This is Talk Python To Me, Episode 81, recorded October 21, 2016.
00:39 [theme music] Developer, developer,
00:41 I'm a developer in many senses of the word, because I make these applications, but I also use these verbs to make this music I constructed, just like when I'm coding another software design. In both cases, it's about design patterns. Anyone can get the job done; it's the execution that's interesting.
01:03 Welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. I'm very excited to announce this episode is sponsored by not one but two new sponsors, and they both have excellent offerings for Python developers. Welcome GoCD by ThoughtWorks and Data School to the show. Thank you both for supporting the show. Jake, welcome to Talk Python.
01:37 Thanks. Good to be here.
01:39 Yeah, it's great to have you. I'm a huge fan of astronomy and science. And I'd love to talk to you about how Python and astronomy interact and all the problems you're solving. But before we get to those, let's start with your story. How did you get into programming in Python in the first place?
01:54 Well, I came to programming relatively late. I had a little bit of early experience in, like, sixth grade with HyperTalk and HyperCard, but didn't do much. I took a small programming class in high school, and I didn't really do much programming, aside from evaluating physics stuff in Mathematica, simple things, until I was in grad school, actually. So I arrived at grad school and started working with a research scientist who later became a faculty member, and I asked him, this was around 2006, "Hey, what programming language should I use?" Most of the people around were using IDL, the Interactive Data Language; it's a proprietary scripting language that's similar to MATLAB or Python in some ways. And he was one of the only people using Python at the time. And he said, "Well, you should use Python. That's the future. Everyone's going to be doing that soon." And so I decided to do it. I taught myself Python over winter break. Sudoku was big at the time, so I wrote a Sudoku solver, and that was my way of learning how to do control flow and everything in Python.
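A Sudoku solver really is a nice first project for exactly the reasons discussed here: it exercises loops, conditionals, and recursion with no I/O, database, or UI. As a minimal sketch (this is illustrative code, not Jake's original solver), a backtracking solver fits in a few functions:

```python
def find_empty(board):
    """Return the (row, col) of the first empty cell (0), or None if full."""
    for r in range(9):
        for c in range(9):
            if board[r][c] == 0:
                return r, c
    return None

def is_valid(board, r, c, v):
    """Check whether value v can legally be placed at (r, c)."""
    if v in board[r] or any(board[i][c] == v for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)  # top-left corner of the 3x3 box
    return all(board[br + i][bc + j] != v
               for i in range(3) for j in range(3))

def solve(board):
    """Fill the board in place via backtracking; return True on success."""
    cell = find_empty(board)
    if cell is None:
        return True  # no empty cells left: solved
    r, c = cell
    for v in range(1, 10):
        if is_valid(board, r, c, v):
            board[r][c] = v
            if solve(board):
                return True
            board[r][c] = 0  # undo and backtrack
    return False
```

The whole program is pure control flow over a list of lists, which is what makes it such an effective way to learn a new language.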
03:12 Yeah, that's great. I really think writing little games like that is a great way to learn a language, or at least get started with it, right? Because the problems are not so complicated, and there's not a lot of interaction. It's not like, well, how do I talk to a database? How do I do a UI? How do I call the web?
03:30 Yeah, it was super fun. And being someone who didn't really have any formal background in algorithms, it was a nice way to wrap my head around what sorts of problems you can solve in programming.
03:42 So pretty much Python from then on. You've been doing a lot of stuff; you've been contributing to some machine learning libraries, the whole scikit area.
03:52 Yeah. So where that came in is, basically, I started doing all this work in Python. I was writing, you know, horrible little one-off scripts, like most scientists do who don't have formal training. And a couple of years into my PhD program, I wrote my first paper. That first paper was pretty interesting: it was using a relatively new (at the time) manifold learning algorithm called locally linear embedding, using it to explore some astronomical spectra. This algorithm was implemented out there; the paper introducing it had a link to a little tarball of MATLAB code. But I found pretty quickly that the code didn't scale to the size of the problem we had, which was, you know, hundreds of thousands or millions of spectra in several thousand dimensions. And so I spent a summer basically looking at this Science and Nature paper, looking at this MATLAB software, and trying to figure out how to write a more scalable version of the algorithm. What came out of that was this C++ package, and I published the paper and, you know, did the standard thing of putting the C++ package in a tarball on my website. And I thought to myself, you know, this is ridiculous: the next person who tries to use this in astronomy is going to have to hire another grad student to spend a summer figuring out how to implement this. So I started asking people about how to make sure that your code can be used by other people. And then I found out there's this whole catchphrase: reproducibility, open science, and things like that. So that was my foray into reproducibility and open science. And as I was asking around, someone mentioned that there was this brand new package that might be interested in that algorithm, called scikit-learn. And so I got in touch with Gaël Varoquaux, who was getting scikit-learn off the ground, and they thought it would be a good contribution.
So I started, I think that was 2010 or somewhere around there, contributing to scikit-learn when it was really young. And, you know, I haven't looked back. I've really been turned on by this idea of open and reproducible science, by making sure the software products that come out of your research are actually well documented and reusable. And this thing that was sort of a side project in the beginning has turned into most of what I do in my day-to-day work.
06:29 Isn't it funny how life takes those kinds of turns? Like, you plan to do one thing, and you discover another, and it really becomes something you're passionate about. That's cool.
06:40 Yeah, and it turns out now I'm way more excited by general software tools than I am about the astronomy research that drew me into grad school.
06:49 Yeah, it sounds a lot like my story. That's awesome. Let's talk really high level for just a moment about machine learning. I know a lot of people out there are into data science and they know machine learning, but there are all sorts of listeners. I mean, people used to solve problems with statistics or other techniques, and then this whole machine learning field seemed to formalize it and bring some algorithms together. Give us an overview of what the whole story is there.
07:16 Yeah. Whenever I introduce machine learning, I always emphasize the fact that it's just fitting models to data. When you fit a line to data, you're doing machine learning. When you take two clumps on a two-dimensional plot and draw a line between them to say this side is one type and that side is the other type, that's a form of machine learning. And where machine learning gets powerful is with these algorithms that you can do by eye or by hand in two dimensions, like drawing a line on a piece of paper: once you formalize those algorithms, you can scale them up to large numbers of points and large numbers of dimensions. You said you had a thousand dimensions in your previous problem; you're not doing that by eye, right? Yeah, yeah. So, you know, in a thousand dimensions, the equivalent is fitting a 999-dimensional hyperplane to split things into two groups, and you can't really do that by eye. But the key is, machine learning is nothing more complicated than fitting these models to data in a way that scales to large datasets and to high-dimensional datasets. And of course, it grew out of artificial intelligence and statistics, in some sense. But I think the core distinction between the machine learning way of doing things and the statistics way of doing things is described really well in this paper by Leo Breiman called "Statistical Modeling: The Two Cultures." The overview summary of that is that, you know, in classic statistics, you're building models where you care about the model parameters: you fit a line to the data, and the slope is telling you something fundamental about the world. Whereas in machine learning, you fit a line to data and you're not so much interested in the slope; you're just interested in what that line can tell you about new data that you, you know, want to predict something about.
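To make the "fitting a line is machine learning" point concrete, here's a closed-form ordinary least squares fit in plain Python (an illustrative sketch, not any particular library's implementation):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """The 'machine learning' use of the model: predict y for new x."""
    return slope * x + intercept
```

Breiman's two cultures show up right here: a statistician might read `slope` as telling you something fundamental about the world, while the machine learning view cares mainly about what `predict` returns for unseen inputs.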
09:09 I see. So a lot of machine learning is about predicting the future: you create a model, and then you want to ask it questions.
09:16 Yeah, absolutely. And you're learning something about unknown data. I mean, the distinction is not completely black and white there, but I think it's a useful way to think about machine learning versus statistics.
09:28 Yeah, that is an interesting way to put it. What are some of the major tools in Python that people use?
09:33 Yeah, so scikit-learn is one of them. This is the Python package that's built on NumPy and SciPy and kind of uses the classic tools. It's really nice for doing small- to medium-scale machine learning and modeling problems. It doesn't have a particularly good scalability story. There are some ways to parallelize certain operations within scikit-learn, but if you want to go to out-of-core data and things like that, there are other ways to do it. So scikit-learn, to be honest, works for the bulk of what I end up doing in my work. For scaling to large datasets, often in the work I'm doing, I'm doing kind of massively parallel stuff where I can split the data into chunks and run a small scikit-learn algorithm on one of those chunks and, you know, loop through them that way.
10:32 Embarrassing parallelism that way? Yeah, that makes sense. So maybe you're looking at some large part of the sky, and you could break it into little grids or something.
10:38 Yeah, exactly. We're often looking at things object by object rather than trying to do things all at once. If you need larger models that do things all at once, there are these interesting libraries recently built around things like Spark and TensorFlow. I'm not as experienced with those, but the TensorFlow stuff is interesting. In particular, there's this skflow package that I've been rather intrigued by, which builds a scikit-learn API around the TensorFlow backend. Oh, that sounds like it's worth looking into. That sounds cool. There's also PySpark, which is interesting. So right now, and it's been kind of fun, I'm working with some computer scientists, some neuroimaging people, and some database specialists to put together a comparison between a number of Python-oriented approaches to doing scalable computation in a scientific setting. So hopefully that paper will be coming out in the next several months.
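The chunk-by-chunk strategy described here can be sketched with the standard library alone: split the data, process each chunk independently, then merge the partial results. This toy version just computes a mean (a stand-in for running a small per-chunk model; the function names are mine, purely illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(data, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def summarize_chunk(chunk):
    """Stand-in for fitting a small model on one chunk.
    Here we accumulate (count, sum) so partial results can be merged."""
    return len(chunk), sum(chunk)

def parallel_mean(data, chunk_size=1000, workers=4):
    """Embarrassingly parallel mean: process chunks independently, merge."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize_chunk, chunks(data, chunk_size)))
    n = sum(count for count, _ in partials)
    total = sum(s for _, s in partials)
    return total / n
```

The key property is that each chunk is processed with no reference to the others, which is what lets the same pattern scale from a thread pool on a laptop to a cluster scheduler.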
11:39 Oh yeah, that sounds really interesting. Give us some examples of how this whole machine learning story applies to astronomy. What types of problems are people solving with this?
11:54 Yeah, so in astronomy we have a lot of areas where we want to predict certain aspects of things. One example, just to be concrete: let's say we're looking for the distances to distant galaxies. The distances, or redshifts, of galaxies are important in constraining our understanding of the cosmology of the universe, the structure of the universe. But getting an accurate distance to a galaxy is an expensive observation: you have to do a spectral observation, which basically means you look at an individual object and split the light from that object, using something like a diffraction grating, into its whole spectrum, you know, red on one end, blue on the other end, and a thousand bins in between. Given something like that, you can isolate certain emission lines or absorption lines and calculate its redshift, which relates to its distance. And that's really, really accurate. But the problem is, it's incredibly expensive, because you have to look at individual objects and line up these diffraction gratings one by one. And when we're just taking pictures of the sky, we're getting thousands or millions of galaxies a night, and we don't have the resources to take a spectrum of all of those. So the question is: can you take a small set of objects where you have these very detailed spectral observations and learn something about them, so that you can predict what the redshift might be from a coarser, photometric, picture-type observation? And this maps pretty well onto a machine learning model, right? You take a picture of the whole sky, and so you get data about each object at a coarse level. And then you take spectra of a certain collection of objects, and that gives you finer detail, more information, about a subset of them.
And then you want to build a model that can predict that more detailed information, the redshift and the distance, for all the rest of them. So at first glance, machine learning seems to map pretty well onto astronomy data. The thing that's difficult about it in practice is that most machine learning models assume some sort of statistical similarity between your training set and your unknown set. And in astronomy, unless you specifically design it that way, it's difficult to get that statistical similarity. So for example, we tend to have spectra of nearby, bright objects, because they're easier to take spectra for. If you're looking at distant, faint objects, the noise characteristics are different, the statistical distributions are different. So a straightforward machine learning approach will miss some things, and you might not even know you're missing things.
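This photometric-redshift setup is a supervised regression problem: train on the small spectroscopic sample, predict for the large photometric one. As a toy illustration (synthetic numbers and plain Python, not a real astronomy pipeline; in practice you'd reach for a scikit-learn regressor), a k-nearest-neighbors estimator captures the idea:

```python
def knn_predict_redshift(train_colors, train_z, query_colors, k=3):
    """Predict a redshift for each query object as the mean redshift of
    its k nearest training objects in photometric 'color' space."""
    def dist2(a, b):
        # squared Euclidean distance between two color vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    predictions = []
    for q in query_colors:
        # rank the spectroscopic training set by photometric similarity
        nearest = sorted(range(len(train_colors)),
                         key=lambda i: dist2(train_colors[i], q))[:k]
        predictions.append(sum(train_z[i] for i in nearest) / k)
    return predictions
```

Note how the caveat from the conversation shows up: if the training sample (bright, nearby objects) is not statistically similar to the query sample (faint, distant objects), the nearest "neighbors" are drawn from the wrong distribution and the predictions are silently biased.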
14:46 Yeah, I can imagine. One of the things that just blows my mind is that we can see things so far away, and so small, and effectively so far in the past, and still make intelligent statements about them.
15:01 Yeah, it's
15:03 unbelievable some of the things you guys are doing.
15:06 And the thing that blows me away about astronomy and astrophysics in general is the fact that these laws that we discovered in the laboratory here over the course of centuries actually apply to what we see out there, ten billion light years away, right? And it's not just that we're assuming they apply; it's that we can actually test and confirm that they apply. One example: you know, all these scientists in the 18th and 19th centuries studied the behavior of gases, right? What happens if you blow up a balloon? How fast does the air come out? And all of that led to this formalized field of thermodynamics and statistical mechanics. And as you go into more detail, like what happens if you're looking at ionized gases and things like that, we learned all this stuff in the lab. And then in the mid-20th century, we figured out that the cosmic microwave background, this echo of the Big Bang, actually comes from a plasma in the early universe, and we can understand the properties of the plasma there by the same laws. And the reason we know that the universe is 13 point, you know, I can't remember the decimals, but 13-point-something billion years old, with very, very good accuracy, one of the reasons we know that, is because we understand the thermodynamics and statistical mechanics of the plasma in the early universe and can compute what that says about the cosmic microwave background. And that story right there is just fascinating to me.
16:45 Yeah, it's totally fascinating. And what I think is also fascinating is that the guys who discovered it were at Bell Labs in New Jersey, I think. Yeah, yeah. The guys who discovered that whole cosmic background radiation weren't looking for it. They found it by accident.
17:00 It was in their way, right? Yeah. And their first hypothesis, I guess, was that it was pigeon droppings on the detector. And once they cleaned off all the pigeon droppings, they had to figure out it was something else. Yeah. So they got the Nobel Prize for finding static in their instruments and realizing the static was significant.
17:20 Normally that would be a problem, right? You'd want to get rid of it. Very cool. So you gave a really interesting talk at PyData, I think 2015. It's up on YouTube; I'll be sure to link to the video. You talked about how distance is super important in astronomy and how it relates to many of these big ideas that we hear about if we're sort of paying attention, I guess.
17:42 Yeah, absolutely. Distance is fundamental to a lot of what we do, and it's also really, really hard to figure out. I mean, if you think about just looking at a point of light in the sky, how do you tell how far away that is? And so a big part of the story of astronomy over the past couple of centuries has been people figuring out how to determine how far away things are. The first step people figured out is that we can do it geometrically. You know, the same way as if you put your finger in front of your eye and close one eye and then the other, your finger seems to jump around in front of the background. That's called parallax. And we can use a similar type of trick to find the distance to nearby stars, because the Earth is on one side of the Sun in June and on the other side of the Sun in December. And if you compare what the nearby stars look like, compared to the background stars, in June and December, you see them jump back and forth, and you can use the geometry of that to figure out the distance to those stars. I see. So you measure the sky, and you basically see which ones kind of move more and which ones are more or less fixed. And based on the parallax, you can say, well, these ones that move are five light years away or something. Yep. And you can calculate that based on the angle and what we know about the Earth's orbit around the Sun. But that only works to within, well, up until a couple of months ago, it was within maybe a few thousand light years. There's this Gaia mission whose data was just released in the last couple of weeks, and one of the things Gaia can do is give us really accurate parallax distances out to previously unheard-of distances. So we're really going to be able to figure out the three-dimensional structure of the stars in our galaxy. But parallax is not going to work when you go out to more distant galaxies.
So you have to come up with other ideas, and one of the ideas that's been really fruitful is this idea of standard candles. If I stick you on a street in the dark and I turn on a 100-watt light bulb and put it right next to you, it's really bright; but if I put it two blocks away down the street, it's really dim. That brightness and dimness, you can compute that, because the apparent brightness is attenuated by a factor of one over the distance squared. So if you look at a light two blocks away, and you know that it's a 100-watt light bulb, and you have a very accurate photometer, you can compute exactly how far away that light bulb is. And this works with stars too. If we know the exact intrinsic brightness of a star and we look at its apparent brightness, we can compute the distance very easily. Now, the trick there is you need to know the intrinsic brightness of the star.
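Both rungs of the ladder mentioned so far boil down to one-line formulas: a star's distance in parsecs is the reciprocal of its parallax angle in arcseconds, and the standard-candle comparison of intrinsic to apparent brightness is the inverse-square law, usually written in astronomers' magnitudes as the distance modulus m - M = 5 log10(d / 10 pc). A quick sketch (illustrative helper names, not from the episode):

```python
def parallax_distance_pc(parallax_arcsec):
    """Distance in parsecs from a parallax angle in arcseconds: d = 1 / p."""
    return 1.0 / parallax_arcsec

def standard_candle_distance_pc(apparent_mag, absolute_mag):
    """Invert the distance modulus m - M = 5 * log10(d / 10 pc)."""
    return 10.0 ** ((apparent_mag - absolute_mag) / 5.0 + 1.0)
```

For example, a parallax of 0.1 arcseconds puts a star at 10 parsecs (about 33 light years), and a standard candle that appears 5 magnitudes fainter than its intrinsic brightness sits at 100 parsecs, the factor-of-100 dimming hidden inside the magnitude scale.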
20:42 And that's the kind of stuff that amazes me, because you look at these things super far away, and how do you know their intrinsic brightness? Right?
20:50 Yeah, it's really difficult. One thing you can do is build off the things we learned from parallax. You can look for certain classes of stars that are always around the same brightness, and you know their brightness when you know the parallax distance. And then you look for the same class of stars that are further out, and you can sort of infer their distance that way. This is why, in astronomy, it's known as the distance ladder: we have these direct methods that lead to more indirect methods of finding distances as we go further and further out. And one of the coolest stories of this distance ladder is that back in the early 20th century, there was this woman named Henrietta Leavitt, and she was looking at variable stars. So there are stars out there that get brighter and fainter with time, and she was looking particularly at this class of stars called Cepheid variables, named after the fourth-brightest star in the constellation Cepheus. And she found something curious when she was looking at the variation of these: they would get brighter and fainter with a period of, you know, somewhere between a day and a couple of days, something like that. And she found that the period of how fast they got brighter and dimmer was related to their intrinsic brightness. There's this nice plot where she shows a roughly linear trend between period and intrinsic brightness. And that's really nice, because the period is something that you can find out in the sky. So she looked and, you know, found all these stars and confirmed that the period and the intrinsic brightness were related. So then Hubble came along, and you've probably heard of Hubble from the Hubble Space Telescope. What he did is he used the telescopes available to him and found more and more of these stars.
And based on this period-brightness relation, he was able to estimate the distances to all these stars. The thing that really blew open our understanding of the universe was when Hubble pointed his telescope at one of what they called the spiral nebulae. There were these spiral-shaped clouds out in the sky that, for a long time, people thought were just clouds of dust in our galaxy. But Hubble found individual Cepheid variables in the Andromeda spiral nebula and found that it wasn't in our galaxy: it was about two and a half million light years away, farther away than anything we ever would have imagined existed. So in one fell swoop, the study of variable stars led us to understand that the universe is orders of magnitude bigger than we ever imagined.
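Leavitt's period-luminosity relation plus the distance modulus is the whole trick, and it fits in a few lines. The coefficients below are illustrative round numbers in the spirit of a modern V-band calibration (an assumption on my part; Leavitt's original relation was an empirical plot, and real calibrations vary by band and study):

```python
import math

def cepheid_absolute_mag(period_days, a=-2.43, b=-4.05):
    """Period-luminosity (Leavitt) relation, M = a * (log10 P - 1) + b.
    The coefficients a and b are illustrative, not an authoritative fit."""
    return a * (math.log10(period_days) - 1.0) + b

def cepheid_distance_pc(period_days, apparent_mag):
    """Chain the two rungs: period -> intrinsic brightness,
    then distance modulus -> distance in parsecs."""
    absolute_mag = cepheid_absolute_mag(period_days)
    return 10.0 ** ((apparent_mag - absolute_mag) / 5.0 + 1.0)
```

This chaining is the distance ladder in miniature: the period is observable from photometry alone, the relation converts it to an intrinsic brightness, and the inverse-square law turns apparent versus intrinsic brightness into a distance.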
23:37 That's really amazing how that ladders up there, right? And beyond that, we also learned that the universe is not contracting or sort of static; it's sort of expanding away from itself.
23:51 Yeah, so at the same time, he found that not only are these galaxies really far away, but, he looked at all these galaxies, the spiral nebulae, and now we know them as galaxies, because we know they're separate groups of stars. He looked at all of these and found that there was a relationship between how far away they are and how fast they're receding from us. We can measure their recession velocity by looking at the redshift of the light. It's kind of like the Doppler shift: when a siren goes by, you hear it high at first, and then it passes and goes low. You know what I mean? Yeah. And like the Doppler shift, we could see that the light was shifting to a lower frequency, just like the sound shifts to a lower frequency when a car goes away from you. So you can measure the velocity, and he found this relationship between the distance and the velocity, which basically describes a uniformly expanding universe. And, you know, right around the same time, people were realizing that Einstein's general relativity equations, which describe gravity and had just explained the orbit of Mercury, among other things, could be solved in a way that led to an expanding universe. So it was another confirmation of general relativity. And this is all based on finding distances to galaxies.
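The distance-velocity relationship Hubble found is linear: v = H0 * d. As a quick sketch, here it is with a commonly quoted modern round value of H0 ≈ 70 km/s/Mpc (my assumption for illustration; Hubble's own 1929 estimate was several times larger) and the low-velocity approximation relating velocity to redshift:

```python
def hubble_velocity_km_s(distance_mpc, h0=70.0):
    """Hubble's law: recession velocity v = H0 * d.
    h0 is in km/s/Mpc; ~70 is an illustrative modern round value."""
    return h0 * distance_mpc

def redshift_from_velocity(v_km_s, c_km_s=299792.458):
    """Non-relativistic approximation z ~ v / c, valid for v << c."""
    return v_km_s / c_km_s
```

So a galaxy 100 megaparsecs away recedes at about 7,000 km/s, a redshift of roughly 0.023, and it's the linearity of this relation across all distances that describes a uniformly expanding universe.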
25:40 This portion of Talk Python To Me is brought to you by GoCD from ThoughtWorks. GoCD is the on-premise, open source continuous delivery server. With GoCD's comprehensive pipeline modeling, you can model complex workflows for multiple teams with ease, and GoCD's value stream map lets you track changes from commit to deployment at a glance. GoCD's real power is in the visibility it provides over your end-to-end workflow. You get complete control of, and visibility into, your deployments across multiple teams. Say goodbye to release-day panic and hello to consistent, predictable deliveries. Commercial support and enterprise add-ons, including disaster recovery, are available. To learn more about GoCD, visit talkpython.fm/gocd for a free download. That's talkpython.fm/gocd. Check them out; it helps support the show.
26:41 I think one of the things that's super interesting about this is, you know, this concept of variable stars, and the work Henrietta Leavitt did was very manual, right? Like, she would look at pictures and so on.
26:55 Yeah, the way they measured the brightness of stars back before CCDs is you were looking at photographic plates, and the brighter something is, the more it's saturated. So you'd have to do a detailed measurement of the size of the dot on your photographic plate and use that to compute the brightness of the star. It's just amazing to me that any of that work got done, given how easy we have it now, you know?
27:22 Yeah, exactly. Such a different world. But at the same time, we've kind of answered those questions for the simple, small datasets we focused on. And now the amount of data you guys are getting is so much larger that you have to start applying these machine learning algorithms just to deal with it, right?
27:40 Yeah, absolutely. So the project I've been involved in, which is just starting to get off the ground, first light is going to be in a couple of years, is called the Large Synoptic Survey Telescope. You can think of it as a ten-year movie of the entire southern sky. It's a very wide-field camera that's going to be on a mountaintop in Chile, in the Atacama Desert, one of the driest places on Earth, so we don't run into much weather. It'll be able to scan the entire night sky every three nights or so. So we'll get about a hundred full-sky frames in this movie per year, and then do that for a decade. And the big thing this is going to open up is the time domain. You know, typically astronomers tend to treat the sky as this fixed thing. There are these individual times where we look at specific regions of the sky and see what has changed, but we don't really have a global survey of the time domain of the sky yet. LSST is going to do this on a huge scale: we're going to have ten years of data, with something like 30-ish terabytes per night of data coming through, and the full survey size is going to be in the hundreds of petabytes by the end. So it's bigger than anything that's been done before, and it's really forcing astronomers to confront these old toolchains that they've had that don't really scale anymore. You know, ten years ago you could sit down at a computer and download, say, all of the Sloan Digital Sky Survey and do some sort of local analysis. Even in ten years, I don't know if we're going to have hundreds-of-petabytes-sized hard drives in our laptops, right? We're going to have to do it a little bit differently.
29:39 Yeah, that's really a lot of data. And the other thing you said that was interesting is that this data is being collected for everyone, which means it's not focused on answering some specific type of question. So the techniques, the tools, and the machine learning you have to apply face a greater challenge.
30:01 Yes, it has to be really, really general. Because this data, like you said, is collected for everyone; there aren't really specific areas that it's addressing. It's one of these discovery-class missions, similar to the Hubble Space Telescope: you put it out there, and you hope that the things you find are things you're not able to predict at the moment. And what that means is that for any particular science case, you're not necessarily going to have the best data. You know, if you were designing LSST to do one thing, like look for variable stars, you would do it very differently than if you're designing it to be general, because you have to balance all these different concerns and different areas of research. So for example, going back to variable stars, one of the challenges with LSST is the observing strategy. If you want to look at a variable star and see how the brightness changed from one night to the next, you'd want to take the exact same observation every night, in the same region of the spectrum. But LSST is not taking the exact same observation every night; it's getting a breadth of different bands throughout the spectrum, everything from the infrared to the near ultraviolet. And what that means is that it's really good for things like determining the redshift of galaxies via machine learning, but it's actually pretty bad for finding variable stars, because now you have to model not only the variability, but the spectral variability over the course of time too, and it gets much more challenging.
So as data grows, and as the heterogeneity of the data grows, having these sophisticated algorithms, whether it's machine learning, or some sort of forward modeling, or some sort of nonparametric modeling, is becoming increasingly important. And these are things that need to happen kind of in real time, while you're observing the sky, because we want to be able to alert people within a minute or so if something changes on the sky and we find an interesting object. There's going to be this alert stream, so that somebody sitting at a telescope in another part of the world can point their telescope there right away and catch the interesting phenomenon.
32:27 Yeah. Wow. I'm really excited to see what comes out of this. That's a big project.
32:32 Yeah, it's going to be huge. It's really going to define the way we do astronomy over the course of the 2020s.
32:40 Yeah, for sure. So let's talk about some of the libraries that you might be using to answer questions here. The two major ones, and I guess one is kind of a subset of the other, are Astropy and astroML.
32:53 Yeah, so Astropy is actually the big community standard, and it's been a really cool project to watch and to be involved in. I should step back: 10 years ago, most people were using IDL, and the community had evolved sets of routines in IDL to do a lot of the common tasks. As more and more people moved over to Python because of its advantages, well, I'll go into that later, people built a whole bunch of different tools to do different things. It was this sort of smattering. And the Space Telescope Science Institute people, the folks behind Hubble, came together around 2011 or 2012 and said, we should consolidate all this and create one package to rule them all, and Astropy was born. In fact, it's actually accomplished its goal, and pretty much everyone is using it now. It's an incredible package, really well done, with awesome software engineering behind it and lots of buy-in from the community. AstroML is something that I started around the same time. I didn't have as broad a vision; I just wanted to bring together functionality and examples of doing machine learning specifically for astronomy in Python. We actually wrote the package to accompany our book, a Princeton University Press book on statistical modeling, machine learning, and so on, in Python, for astronomy.
34:38 Yeah, that's great. And what kind of things do you cover in your book? Like, what problems are solved or presented, what datasets, things like that?
34:47 Yeah, in that book, what we do is, well, it's meant to be an intro graduate text on statistics and machine learning with astronomers in mind. So we walk through all the basics of data mining, statistics, and machine learning, all the while using datasets drawn from astronomy, and problems and situations that astronomical researchers will run into. Along the way, we also provide code snippets, and we provide figures with the full figure source available online, so that if people want to actually use these techniques, they can grab our scripts, start modifying them from there, and see where it goes. AstroML is what drives that. In our next edition of the book, which might happen in the next year or so, my big task is going to be to incorporate Astropy, because we actually wrote that book before Astropy existed. So it's already a little bit outdated, and I want to make sure I'm pointing everyone to the tools that are in Astropy.
35:53 Yeah, that makes a lot of sense. So you created the book, and you're like, we really should make this a package that people can just use. And now it's a little more mature, right? Yep. Do you know of any discoveries that were made as a result of AstroML?
36:10 It's been referenced in a lot of papers. I don't know offhand if there's anything that came exactly from it, but it's definitely been used for a lot of the incremental building of knowledge over the last few years, and it's been fun to see that.
36:27 Yeah, I'm sure that's really rewarding. That's awesome.
36:29 Yeah, another big thing in the astronomy community is forward modeling and Bayesian approaches. I alluded earlier to the fact that machine learning is a little bit difficult here, because the statistical similarity of the samples is not always a good assumption. The way astronomers tend to get around that is to use forward modeling. You have some model for your system based on the physics that you know, and you can look at the noise properties and the selection effects of your observations to constrain that model, and then that model will tell you about the data that you observe. That tends to work really well in a Bayesian setting. So a huge push in the last few years in astronomy has been to use tools like Markov Chain Monte Carlo to do Bayesian analysis, and to fit these really large, high-dimensional models to learn about the data. One package that's been pretty impactful there is the emcee package. That's a package for doing Markov Chain Monte Carlo, doing Bayesian estimation, written by an astronomer, and it's been cited, I think, thousands of times in the astronomy community, because so many people are doing that style of analysis.
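For a sense of what these samplers do: emcee itself implements an affine-invariant ensemble sampler, so the hand-rolled, single-chain Metropolis sketch below is not emcee's algorithm, just the basic MCMC idea that such packages automate. The data, proposal width, and chain length are arbitrary toy choices:

```python
import math
import random

random.seed(0)

# Toy "observations": 50 noisy measurements of one parameter mu,
# with true value 3.0 and known measurement error sigma = 1.
data = [3.0 + random.gauss(0, 1.0) for _ in range(50)]

def log_posterior(mu):
    # Gaussian log-likelihood with a flat prior on mu (constants dropped).
    return -0.5 * sum((x - mu) ** 2 for x in data)

# Plain Metropolis sampling: propose a random jump, accept it with
# probability min(1, posterior ratio), and record the chain.
mu = 0.0
lp = log_posterior(mu)
chain = []
for _ in range(20000):
    prop = mu + random.gauss(0, 0.5)
    lp_prop = log_posterior(prop)
    if lp_prop >= lp or random.random() < math.exp(lp_prop - lp):
        mu, lp = prop, lp_prop
    chain.append(mu)

# Drop burn-in; the remaining samples approximate the posterior over mu.
samples = chain[5000:]
posterior_mean = sum(samples) / len(samples)
print(f"posterior mean: {posterior_mean:.2f}")
```

In a real forward-modeling analysis, log_posterior would encode the physics, the noise properties, and the selection effects Jake mentions, often over dozens of parameters rather than one.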
37:57 Yeah, that's really amazing. I guess the whole question of how you solve these prediction problems more quickly is really important, and Monte Carlo simulations are really good at that.
38:09 Mm hmm. In particular the Bayesian approaches. Machine learning tends to be more of a frequentist approach, and the Bayesian forward-modeling approaches give you some advantages: when you have some a priori idea about what's driving your observations, you can take advantage of that more in a Bayesian context than in a machine learning context.
38:34 So you wrote this book, Statistics, Data Mining, and Machine Learning in Astronomy, and you survived that process. And you came back for more; you're just about to finish up another book, right?
38:45 Yeah, I'm just finishing one. It's an O'Reilly book, so, you know, cute little animal on the cover. What's your animal? My animal is a Mexican bearded lizard. This one is the Python Data Science Handbook. The reason I did this is that for years, I've been approached by people who are, you know, in research or in tech or something like that, and they say, hey, I know how to use MATLAB, I know how to use R, but I want to learn how to do Python, and I want to learn how to analyze data in Python. And I hadn't found a really good resource to point them to, except for collections of videos online. So I decided to write it. It's taken much longer than I thought, because life gets in the way, but we're at the point where I'm doing the final edits right now, so it should be released pretty soon.
39:37 Yeah, that's great. Congratulations on that. I'll be sure to link to it from the show notes so everyone can find it.
39:43 One thing I'm particularly excited about with this book is that I wrote it all in the form of Jupyter notebooks, and got the publisher to agree to let me make the Jupyter notebooks public. So you can buy the printed version of the book, or you'll be able to go on GitHub and just work through the Jupyter notebooks for free. Wow, that really is cool. Yeah. Okay, but you should buy the book to
40:07 support the project. That's cool. And it's very cool that it's basically a live book, right? If you have the data and you have the code, you can run it, you can explore it.
40:16 Mm hmm. Yeah. And we're working on getting a hosted version of it up there on some cloud service, so you could basically click and have a live, executable textbook at your fingertips.
40:30 Yeah, it's interesting. I think a lot of things are going that way, right? The days of just a printed book and a zip file are fading, at least. Yeah,
40:38 because there are so many better ways of doing it now.
40:56 This portion of Talk Python To Me is brought to you by Data School. Have you thought about making a career change into the exciting world of data science, but don't know how to get started? Data School helps data science beginners like you to analyze interesting datasets and build machine learning models to predict the future, all using Python. You don't need a PhD or a background in mathematics, just a keen interest in using data to answer your questions. Data School has created a Data Science Learning Path exclusively for Talk Python listeners. Visit talkpython.fm/dataschool to launch your data science career. Data School is run by my friend Kevin Markham, so I know that you're going to get excellent content. Check it out at talkpython.fm/dataschool. So let's talk a little bit about where you work and what you do, because you are breaking some rules around how people in academia and science work with programming technology and how programmers are involved, and I think that's really interesting too. You're at the University of Washington, but you're at this place called the eScience Institute, right? Yeah.
42:03 So I'm in the eScience Institute. I've been here since the beginning of 2014, and the goal of the eScience Institute is basically to further computational research around campus. It's existed for a while, but we really got a big boost in 2014, when I came on: we got this joint grant between New York University, UC Berkeley, and UW, and we each created some version of this data science institute. It's a five-year grant to support what we're doing, and the goals are basically to see how we can reshape the culture of academia to take more advantage of data science tools, to train people better, and to provide career paths for software-focused researchers. So for example, in the job that I have right now, what I do day to day is spend a lot of time consulting with researchers around the university, helping them figure out their data challenges. I mentor students who kind of have one foot in their home domain, their science, and one foot in a data science program. And I work a lot on maintaining the software that astronomers and other scientists use. This is a position that feels like sort of a stepchild in academia, because no one really understands that type of position; it doesn't fit into the model of graduate student, postdoc, faculty. We have a number of people in a similar position to me who are working on this, and it's been super fun to see what comes out of it, and the kind of novel trainings and novel approaches to research that we can do. It's particularly fun because it's not only happening at UW, it's happening at NYU and UC Berkeley as well, and we can compare notes with those institutions and see how things are going.
44:04 That sounds like such a fabulous job.
44:07 Yeah, it's good for the time being. I mean, I'm worried that I'm peaking early, because it's so fun. I don't know what will come next from here.
44:15 No, that's really cool. One of the things you pointed out in your PyData talk is that every field is entering a data-rich era. You're basically there to help support the biologists, sociologists, chemists, all the people who are hitting the limits of how much data they can handle.
44:36 Yeah, absolutely. And the way we're doing this is we have a number of different ways to engage with people on campus. One is we have these open office hours. Just like you used to go see your professor during class, we have office hours oriented towards researchers who have a challenge. They can come talk to one of our people, and we have people with expertise in everything from statistics and machine learning to software engineering to cloud computing and scalability. Another thing we do is run these incubator programs. It's sort of modeled on the startup incubators that came out of Silicon Valley, where instead of incubating a startup idea, we're incubating a research idea and letting researchers work shoulder to shoulder with a data scientist whose expertise complements theirs. And we also have graduate fellowships, where students have one foot in their own department, one foot in eScience, and are taking not only, say, astronomy courses, but also database, machine learning, statistics, and computer science courses, and getting credit toward their PhD for that.
45:43 Yeah, what I thought was really fascinating is, you know, having gone through some part of a PhD program, there are so many things you've got to take and learn, and you're so busy learning your specialty, like biology if that's what your PhD is in, that it's really hard to also become a good data science, software type person.
46:04 Yeah, absolutely.
I think, you know, you said that the students in these cohorts basically get half of their requirements for their PhD program waived, so that they can focus the other half on complementing their domain with data science and programming, right? Yeah,
that's the idea. And what comes with that is they have their home department advisor, but they're also matched with a co-advisor who's more methodological. So it leads not only to the student growing a lot, but to some interesting interdisciplinary collaborations around campus. And we've had a number of pretty cool grants that have been awarded based on some of these partnerships.
46:47 Yeah, that sounds really quite amazing. I wish that was around when I was in school.
Yeah, I do, too. I had to pick up a lot of stuff on my own; it would have been nice to have something like this.
Yeah. If anyone out there is listening, and they're maybe in a position where they're like, oh, this is interesting, how do we do this? Right. Another thing you pointed out that I thought was really interesting is that it's in a beautiful location, and you said that that was really important.
Yeah, definitely. So we have this data science studio. It's an old library branch location on campus, and we're on the sixth floor of this tower, where we have the whole floor, with 360-degree views looking out over Mount Rainier and the Olympic Mountains and things like this. And it's important, not just so that I can have an awesome view while I'm reading code, but because we want people around campus to interact with each other, so we want to be a place where people would like to come and just hang out. It gets back to what we call the water-cooler effect. The people who were around in the '60s and '70s working on computationally intensive science talk about the days when everyone would go to the mainframe on campus, and you'd be sitting there waiting to put your punch cards in, and a hydrologist would be talking to an astrophysicist and finding out they're solving the same equations with their programs. So they would have that sort of talk. And as the campus moved towards desktop-oriented computing, those sorts of opportunities went away. I think we're better off if we can have that sort of connection. So one of the cool things about our space here is that it's open to anyone on campus, for just hanging out and working, but also for scheduling meetings. We have people from all different departments scheduling their group meetings here, coming in, hanging out, having coffee, and then meeting someone from the other side of campus who's solving the same differential equation in a completely different field.
48:53 Yeah, that's great. Very nice. Like I said, I wish that had existed when I was in school. Alright, so we're getting near the end of the show, and I have a couple of questions I wanted to run by you. One is almost metaphysical. I just heard the other day about a study finding that we'd underestimated the number of galaxies by a factor of 20, or something amazing like that, right? Yeah, that was really interesting. Already there are so many galaxies out there, and every galaxy has so many stars and so many planets. What do you think the chances are that there's intelligent life out there? Not necessarily visiting us, just out there, even if we never meet them.
That's hard to say at this point. So one thing is, we have this astrobiology group, which sounds like a funny area of study, because what are they studying, right? But they're working on really interesting things, combining what we know about biology, about geophysics, about planetary astronomy, and looking for locations around the universe where life might exist. They study extremophiles around here, like organisms that live on deep-sea vents and in acidic, boiling-water environments, things like that. One thing that's come out of that group is this notion that simple life, you know, microbial life, probably could exist just about anywhere. And I tend to think I would be pretty surprised if we don't find some sort of microbial life elsewhere in our solar system. The other thing to come out of it, as we study the dynamics of planets and things like this, is that there are a lot of things about Earth in particular that make it very special, and a lot of coincidences that would be hard to duplicate. Things like the fact that Jupiter exists keeps us from having a large number of asteroid impacts on Earth; it's kind of a big shield. And, you know, asteroid impacts, as we know from the geological and paleontological history, can be pretty bad for life on Earth. So the type of stability that we have on Earth, particularly over the last tens to hundreds of millions of years, I suspect that's rather rare, and that makes me think that intelligent life might be rather rare. But the seas of Europa, when we get something that can burrow through the ice and look down there, I really hope there's something swimming around down there. It would be really cool. It'd be very cool. I
51:38 hope I get to see that someday. That'd be awesome. All right, so another one: my wife's a professor here at Portland State University, so I hang out with some of her colleagues. One of her colleagues is teaching something like a numerical methods for partial differential equations class, and she's using Python and things like NumPy for the computation. One of her students came and said, hey, I know MATLAB, can I just use MATLAB? Why do I need to learn this Python thing? What would you tell that student if you got that question?
Yeah, that's a good question. So number one, I think: use the tool that's most effective for your research. For example, if there are programs in MATLAB that don't exist in Python, and they're required for your research, there's no reason to learn a new tool just because it's a new tool. But on the other hand, there are some distinct advantages to Python. I alluded to this earlier when I talked about the field of astronomy shifting from 90% IDL, over the last 10 years, to probably 90% Python. The advantages that I see are, number one, its openness. It's open, and it's free. One thing that has come up with IDL is that site licenses are required for every instance that you run; it's a pay-to-play type of interpreter. So when people started running parallelized jobs, taking advantage of all the computers in the department, there were times when a grad student would start a job, and it would use all the site licenses for the entire department, and research in the department ground to a halt. You don't have that problem in Python, because there are no site licenses, so Python can be much cheaper to use. The other thing is about serving students well: the number of academic jobs versus the number of undergrad degrees or PhDs granted is extremely small, so most of our students are going to go out into the world and work somewhere other than an academic department. And people in the outside world, in the tech world, are much more excited about someone with Python chops than someone with IDL chops or MATLAB chops, just because that's the way the world has gone. So that's another good reason to move to Python. The other thing that I love about Python is the culture of open source.
Particularly now (10 years ago it was different), just about anything you want to do in Python, you can go out there and find that somebody has made an open source library for it, put it on a place like GitHub or Bitbucket, and made it available, and often these libraries are really, really well done. There's just been this culture of well-designed open source, particularly in the scientific Python community. It means that you can do an amazing number of things just out of the box with Python and the scientific installation.
54:49 Yeah, what do you think that means for reproducibility? Like, I want to store this whole thing, the code, the interpreter, maybe even a Linux Docker image of the setup that I used to generate my paper.
55:05 Yeah, that's huge. And, you know, it comes back to the beginning, when I was telling you how I got into the Python open source world: I thought it was ridiculous that I'd spent this whole summer building this tool, and then no one was going to be able to use it. The tools in Python for enabling that sort of reproducibility, even having something like an executable paper, are huge. I think it's really helping science drive itself forward, because we don't need to reinvent the wheel every time we do a new study.
55:34 I think we'll have to leave it there for the topics, but I do have two final questions for you before I let you go. Okay. I just saw on PyPI that we passed 90,000 distinct packages. There are so many amazing things you can install, and in your field, you probably get exposed to really interesting things that maybe not everybody knows about. Tell us about one of your favorite Python packages that you might recommend.
56:00 Well, I mentioned this earlier, but emcee, for Markov Chain Monte Carlo. I think that's just an incredible package, and it allows you to do so much as far as Bayesian modeling. I could talk about one of my own packages, but yeah,
well, you know, your own packages are not off limits; AstroML is all right, you know? Yeah.
56:20 So one that I'm working on recently, which is a lot of fun, is this Altair package. What it is, is a Python interface to Vega-Lite. Vega-Lite is a visualization grammar for statistical visualizations that basically outputs interactive JavaScript plots. So we've been writing this Python wrapper and trying to make a nice API to create the Vega-Lite grammars and Vega-Lite visualizations. I'm pretty excited about this, because, you know, there are so many options for plotting out there right now. There's matplotlib, there's Bokeh, there's plotly, there's HoloViews, there's a ggplot wrapper for Python, there's Seaborn, and I'm going to miss something and someone's going to get mad at me, things like that. But the interesting thing about Altair is that it interfaces to this Vega-Lite grammar, and that grammar, I think, has the possibility of becoming sort of a lingua franca between these various visualization packages. And if you've heard of D3, which is driving a lot of interactive visualization on the web, Vega and Vega-Lite are coming out of the same research group. So it's people who really know what they're doing as far as visualization design. Yeah, that's cool. That's a great pedigree. Yeah. So, you know, I get to write the Python classes that output this stuff, and it's pretty fun. Great. What's the package called? It's called Altair: A-L-T-A-I-R.
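Since Altair's exact API has changed over time, rather than guess at it, here is the kind of declarative Vega-Lite JSON that a wrapper like Altair generates under the hood; the field names and values are made up for illustration:

```python
import json

# A hand-written Vega-Lite-style specification of the kind Altair
# generates: the chart is described declaratively as data plus
# encodings, with no drawing code at all.
spec = {
    "data": {"values": [
        {"color_index": 0.1, "magnitude": 13.2},
        {"color_index": 0.3, "magnitude": 14.1},
        {"color_index": 0.8, "magnitude": 15.6},
    ]},
    "mark": "point",
    "encoding": {
        "x": {"field": "color_index", "type": "quantitative"},
        "y": {"field": "magnitude", "type": "quantitative"},
    },
}

# The spec is just data: a Vega-Lite renderer (in a browser or a
# notebook) turns this JSON into an interactive plot.
print(json.dumps(spec, indent=2))
```

That dictionary, not any plotting call, is the chart, which is why the grammar can act as a lingua franca between plotting libraries.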
57:58 All right, awesome. And when you write some Python code, what editor do you use?
58:04 Jeez, I go back and forth these days between Emacs and Atom, actually. I like the Emacs key bindings, but I like the way that Atom arranges an entire project and lets you see all the files.
Yeah, those are both nice. Cool. All right, any final call to action? I heard you had an announcement about your PhD cohort program.
Yeah. So we just put out this announcement for 2017 postdoctoral fellowships. This is looking for people who have recently finished their PhD, who are interested in continuing research in their own field, but also adding some sort of computational or data science element to it. It's similar to the graduate program I described earlier: you apply to have one foot in your domain department, one foot in the eScience Institute, with two advisors, one from the domain and one in a methodological area. And we have just a great set of postdocs here who are doing some really phenomenal work with that. So if you're a graduating PhD student and this eScience Institute data science stuff sounds good, I'd encourage you to apply; the applications are due sometime mid-January.
All right, that's plenty of time to get them in there. Cool. Yep. All right. And when's your book coming out?
Probably January, I think. I don't know at this point; it depends on how quickly I get this corrected manuscript back to them.
59:37 Yeah, of course, I saw that you can get like an early access version of it, right?
Yeah, the early access is there. So if you want to take a look at the pre-release right now, you can go buy it, and they'll update you to the released version when it comes out.
All right, sounds great. So Jake, it's been super interesting talking about astronomy with you. Thanks for coming on the show and sharing your story.
59:59 Yeah, thanks for having me. You bet. Bye bye.
01:00:02 This has been another episode of Talk Python To Me. Today's guest has been Jake VanderPlas, and this episode has been sponsored by GoCD and Data School. Thank you both for supporting the show. GoCD is the on-premise, open source continuous delivery server that will improve your deployment workflow but keep your code and builds in-house. Check out GoCD at talkpython.fm/gocd and take control over your process. Data School is here to help you become effective with Python data science tools quickly, skipping years at the university. Check out the Talk Python To Me learning path at talkpython.fm/dataschool. Are you or a colleague trying to learn Python? Have you tried books and videos that just left you bored by covering topics point by point? Well, check out my online course, Python Jumpstart by Building 10 Apps, at talkpython.fm/course to experience a more engaging way to learn Python. And if you're looking for something a little more advanced, try my Write Pythonic Code course at talkpython.fm/pythonic. You can find the links from this episode at talkpython.fm/episodes/show/81. And be sure to subscribe to the show: open your favorite podcatcher and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. Our theme music is Developers Developers Developers by Cory Smith, who goes by Smixx. Cory just recently started selling his tracks on iTunes, so I recommend you check it out at talkpython.fm/music, where you can browse the tracks he has for sale on iTunes and listen to the full-length version of the theme song. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Smixx, let's get out of here.