#81: Python and Machine Learning in Astronomy Transcript

Recorded on Friday, Oct 21, 2016.

00:00 The advances in astronomy over the past century are both evidence of and confirmation of the highest heights of human ingenuity. We've learned, by studying the frequency of light, that the universe is expanding, and, by observing the orbit of Mercury, that Einstein's theory of general relativity is correct. It probably won't surprise you to learn that Python and data science play a central role in modern day astronomy. This week, you'll meet Jake VanderPlas, an astrophysicist and data scientist from the University of Washington. Join Jake and me while we discuss the state of Python in astronomy. This is Talk Python To Me, Episode 81, recorded October 21, 2016.

00:39 Developer,

00:41 developer. I'm a developer in many senses of the word, because I make these applications, but I also use these verbs to make this music. I constructed it line by line, just like when I'm coding another software design. In both cases, it's about design patterns. Anyone can get the job done; it's the execution that's interesting.

01:03 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. I'm very excited to announce this episode is sponsored by not one but two new sponsors, and they both have excellent offerings for Python developers. Welcome GoCD by ThoughtWorks and Data School to the show. Thank you both for supporting the show. Jake, welcome to Talk Python.

01:37 Thanks. Good to be here.

01:39 Yeah, it's great to have you. I'm a huge fan of astronomy and science. And I'd love to talk to you about how Python and astronomy interact and all the problems you're solving. But before we get to those, let's start with your story. How did you get into programming in Python in the first place?

01:54 Well, I came to programming relatively late. I had a little bit of early experience in, like, sixth grade with HyperCard scripting, but didn't do much. I took a small programming class in high school, and I didn't really do much programming, aside from evaluating physics stuff in Mathematica, simple things, until I was in grad school, actually. So I arrived at grad school and started working with a research scientist who later became a faculty member, and I asked him, this was around 2006, hey, what programming language should I use? Most of the people around were using IDL, the Interactive Data Language. It's a proprietary scripting language that's similar to MATLAB or Python in some ways. And he was one of the only people using Python at the time. And he said, well, you should use Python, that's the future, everyone's going to be doing that soon. And so I decided to do it, and I learned Python. I taught myself Python over winter break. Sudoku was big at the time, so I wrote a Sudoku solver, and that was my way of learning how to do control flow and everything in Python. And then,

03:12 yeah, that's great. I really think writing little games like that is a great way to learn a language, or at least get started with it, right? Because the problems are not so complicated, and there's not a lot of interaction. It's not like, well, how do I talk to a database? How do I do a UI? How do I call the web?

03:30 Yeah, it was super fun. And being someone who didn't really have any formal background in algorithms, it was a nice way to wrap my head around what sorts of problems you can solve with programming.

03:42 So it's been pretty much Python from then on. You've been doing a lot of stuff; you've been contributing to some machine learning libraries, the whole scikit area.

03:52 Yeah. So where that came in is, basically, I started doing all this work in Python. I was writing, you know, horrible little one-off scripts, like most scientists do who don't have formal training. And a couple of years into my PhD program, I wrote my first paper. And the first paper was pretty interesting. It was using a relatively new (at the time) version of manifold learning called locally linear embedding, using it to explore some astronomical spectra. The algorithm was implemented out there; the paper introducing it had a link to a little tarball of MATLAB code. But I found pretty quickly that the code didn't scale to the size of the problem we had, which was, you know, hundreds of thousands or millions of spectra in several thousand dimensions. And so I spent a summer basically looking at this Science and Nature paper, looking at this MATLAB software, and trying to figure out how to write a more scalable version of the algorithm. What came out of that was this C++ package, and I published the paper and, you know, did the standard thing of putting the C++ package in a tarball on my website. And I thought to myself, you know, this is ridiculous; the next person who tries to use this in astronomy is going to have to hire another grad student to spend a summer figuring out how to implement it. So I started asking people about how to make sure that your code can be used by other people. And then I found out there's this whole catchphrase of reproducibility, open science, and things like that. So that was my foray into reproducibility and open science. And as I was asking around, someone mentioned that there was this brand new package I might be interested in contributing that algorithm to, called scikit-learn. So I got in touch with Gaël Varoquaux, who was getting scikit-learn off the ground, and they thought it would be a good contribution. So, I think that was 2010, somewhere around there, I started contributing to scikit-learn when it was really young. And, you know, I haven't looked back. I've really been turned on by this idea of open and reproducible science, of making sure the software products that come out of your research are actually well documented and reusable. And this thing that was sort of a side project in the beginning has turned into most of what I do during my day-to-day work.
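
The algorithm Jake describes now ships in scikit-learn as LocallyLinearEmbedding. A minimal sketch of using it, with random data standing in for the astronomical spectra he mentions:

```python
# Minimal sketch of locally linear embedding with today's scikit-learn API.
# The random array here is a stand-in for real astronomical spectra.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.RandomState(0)
spectra = rng.normal(size=(1000, 50))   # 1000 "spectra", 50 wavelength bins

lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
embedding = lle.fit_transform(spectra)  # project onto a 2D manifold
print(embedding.shape)                  # (1000, 2)
```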

06:29 Isn't it funny how life takes those kinds of turns? Like, you plan to do one thing, and you discover another and it really becomes something you're passionate about? That's cool. Yeah. And it turns out now, I'm

06:40 way more excited by general software tools than I am about the astronomy research that drew me into grad school.

06:49 Yeah, it sounds a lot like my story. That's awesome. Let's talk just really high level for a moment about machine learning. I know a lot of people out there are into data science, and they know machine learning, but there are all sorts of listeners. I mean, people used to solve these problems with statistics or other kinds of techniques, but then this whole machine learning thing seemed to formalize it and bring some algorithms together. Can you give us an overview of what the whole story there is?

07:16 Yeah. So whenever I introduce machine learning, I always emphasize the fact that it's just fitting models to data. When you fit a line to data, you're doing machine learning. When you take two clumps on a two-dimensional plot and draw a line between them to say this side is one type and that side is the other type, that's a form of machine learning. And where machine learning gets powerful is that these algorithms that you can do by eye or by hand in two dimensions, like drawing a line on a piece of paper, once you formalize them, you can scale them up to large numbers of points and large numbers of dimensions. You said you had 1000 dimensions in your previous problem; you're not doing that by eye, right? Yeah, yeah. So in 1000 dimensions, the equivalent is fitting a 999-dimensional hyperplane to split things into two groups, and you can't really do that by eye. But the key is, machine learning is nothing more complicated than fitting these models to data in a way that scales to large datasets and to high-dimensional datasets. And of course, it grew out of artificial intelligence and statistics, in some sense. But I think the core distinction between the machine learning way of doing things and the statistics way of doing things is described really well in this paper by Leo Breiman called "Statistical Modeling: The Two Cultures." The overview summary of that is that, you know, in classic statistics, you're building models where you care about the model parameters; you fit a line to the data, and the slope is telling you something fundamental about the world. Whereas in machine learning, you fit a line to data and you're not so much interested in the slope; you're just interested in what that line can tell you about new data that you want to predict something about.
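
A minimal sketch of the two examples above in scikit-learn: fitting a line to data, and separating two clumps with a linear decision boundary. The synthetic data and model choices are purely illustrative.

```python
# Fit a line to noisy data (regression) and split two 2D clumps (classification).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.RandomState(42)

# Regression: fit y = a*x + b to noisy data
x = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * x.ravel() + 1.0 + rng.normal(scale=1.0, size=100)
reg = LinearRegression().fit(x, y)
print(reg.coef_[0], reg.intercept_)        # recovered slope and intercept

# Classification: a linear boundary between two clumps of points
clump_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
clump_b = rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))
X = np.vstack([clump_a, clump_b])
labels = np.array([0] * 50 + [1] * 50)
clf = LogisticRegression().fit(X, labels)
print(clf.predict([[0.2, 0.1], [2.8, 3.1]]))  # -> [0 1]
```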

09:09 I see. So a lot of machine learning is about predicting the future, like you create a model and then you want to ask it questions.

09:16 Yeah, absolutely. And you're learning something about unknown data. I mean, the distinction is not completely black and white there, but I think it's a useful way to think about machine learning versus statistics.

09:28 Yeah, that is an interesting way to put it. What are some of the major tools in Python that people use?

09:33 Yeah, so scikit-learn is one of them. This is the Python package that's built on NumPy and SciPy and kind of uses the classic tools. It's really nice for doing sort of small to medium scale machine learning and modeling problems. It doesn't have a particularly good scalability story. There are some ways to parallelize certain operations within scikit-learn, but if you want to go to out-of-core data and things like that, there are other ways to do this. So scikit-learn, to be honest, works for the bulk of what I end up doing in my work. I can use scikit-learn, and for scaling to large datasets, often in the work I'm doing I'm doing kind of massively parallel stuff, where I can split the data into chunks, run a small scikit-learn algorithm on one of those chunks, and, you know, loop through them that way.
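
A hedged sketch of that chunk-and-loop pattern, using joblib for the parallel loop. The chunking scheme and the per-chunk model below are illustrative assumptions, not Jake's actual pipeline.

```python
# Run an independent scikit-learn model on each chunk of a larger dataset.
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import KMeans

def fit_chunk(chunk):
    """Fit a small model to one chunk and return a compact summary."""
    model = KMeans(n_clusters=3, n_init=10).fit(chunk)
    return model.cluster_centers_

data = np.random.RandomState(0).normal(size=(100_000, 5))
chunks = np.array_split(data, 20)            # e.g. one chunk per sky region

# Embarrassingly parallel: each chunk is processed independently
centers_per_chunk = Parallel(n_jobs=4)(
    delayed(fit_chunk)(chunk) for chunk in chunks
)
print(len(centers_per_chunk))                # 20 sets of cluster centers
```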

10:32 Sort of parallelism that way? Yeah, that makes sense. So maybe you're looking at some large part of the sky, and you could break it into little grids or something. Yeah,

10:38 exactly. We're often looking at things object by object rather than trying to do things all at once. If you need to do larger models that are doing things all at once, there are these interesting libraries recently built around things like Spark and TensorFlow. I'm not as experienced with those, but the TensorFlow stuff is interesting. In particular, there's this skflow package that I've been rather intrigued by, which kind of builds a scikit-learn API around the TensorFlow back end. Oh, that sounds like that's worth looking into. That sounds cool. There's also PySpark, which is interesting. So right now, and it's been kind of fun, I'm working with some computer scientists and some neuroimaging people and some database specialists to put together a comparison between a number of Python-oriented approaches to doing scalable computation in a scientific setting. So hopefully that paper will be coming out in the next several months.

11:39 Oh, yeah, that sounds really interesting. Give us some examples of how this whole machine learning story applies to astronomy. Like, what types of problems or things are people doing with this?

11:54 Yeah, so in astronomy, we have a lot of areas where we want to predict certain aspects of things. So one example, just to be concrete: let's say we're looking for the distances to distant galaxies, and the distances or redshifts of galaxies are important in constraining things about our understanding of the cosmology of the universe, the structure of the universe. But getting an accurate distance to a galaxy is an expensive observation. You have to do a spectral observation, which basically means you look at an individual object and split the light from that object, using something like a diffraction grating, into its whole spectrum: red on one end, blue on the other end, and a thousand bins in between. And given something like that, you can isolate certain emission lines or absorption lines and calculate its redshift, which is related to its distance. And that's really, really accurate. But the problem is, it's incredibly expensive, because you have to look at individual objects and line up these diffraction gratings one by one. And when we're just taking pictures of the sky, we're getting thousands or millions of galaxies a night, and we don't have the resources to take a spectrum of all of those. So the question then is: can you take a small set of objects where you have these very detailed spectral observations, and learn something about them, so that you can predict what the redshift might be from a coarser, photometric, picture-style observation? And this maps pretty well onto a machine learning model, right? You take a picture of the whole sky, and so you get data about each object that way at a coarse level. Then you take spectra of a certain collection of objects, and that gives you finer detail, more information about a subset of them. And then you want to build a model that can predict that more detailed information, the redshift and the distance, for all the rest of them. So at first glance, machine learning seems to map pretty well onto astronomy data. The thing that's difficult about it in practice is that most machine learning models assume some sort of statistical similarity between your training set and your unknown set. And in astronomy, unless you specifically design it that way, it's difficult to get that statistical similarity. So for example, we tend to have spectra of nearby bright objects, because they're easier to take spectra for. If you're looking at distant faint objects, the noise characteristics are different, the statistical distributions are different. So a straightforward machine learning approach to that will miss some things, and you might not even know you're missing things.
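
A hedged sketch of that photometric-redshift setup: train on the small spectroscopic sample, predict for the much larger photometric one. The feature columns, sample sizes, and the random forest model are illustrative assumptions, not an actual survey pipeline.

```python
# Train on objects with accurate spectroscopic redshifts, predict for the rest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(1)

# Pretend photometric measurements (e.g. brightness in a few filter bands)
colors_spectroscopic = rng.normal(size=(2_000, 5))       # objects with spectra
redshift_spectroscopic = rng.uniform(0, 2, size=2_000)   # "true" redshifts

colors_photometric = rng.normal(size=(100_000, 5))       # picture-only objects

model = RandomForestRegressor(n_estimators=100)
model.fit(colors_spectroscopic, redshift_spectroscopic)

# Predicted (photometric) redshifts for everything else
photo_z = model.predict(colors_photometric)
print(photo_z[:5])
```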

14:46 Yeah, I can imagine. One of the things that just blows my mind is that we can see things so far away and so small, and effectively so far in the past, and still make intelligent statements about them.

15:01 Yeah, it's

15:03 unbelievable some of the things you guys are doing.

15:06 And the thing that blows me away, actually, about astronomy and astrophysics in general is the fact that these laws we discovered in the laboratory here over the course of the centuries actually apply to what we see out there 10 billion light years away, right? And it's not just that we're assuming they apply; it's that we can actually test them and confirm that they apply. One example is, you know, all these scientists in the 18th and 19th centuries studied the behavior of gases, right? What happens if you blow up a balloon, and how fast does the air come out? And all that led to this formalized field of thermodynamics and statistical mechanics. And as you go into more detail, like what happens if you're looking at ionized gases and things like that, we learned all this stuff in the lab. Then, in the mid 20th century, we figured out that the cosmic microwave background, this echo of the Big Bang, actually comes from a plasma in the early universe, and we can understand the properties of that plasma by the same laws. And the reason that we know the universe is 13 point, you know, I can't remember the decimals, but 13 point something billion years old, with very, very good accuracy, one of the reasons we know that is because we understand the thermodynamics and statistical mechanics of the plasma in the early universe and can compute what that says about the cosmic microwave background. And that story right there is just fascinating to me.

16:45 Yeah, it's totally fascinating. And what I think is also fascinating is the guys who discovered it were at Bell Labs in New Jersey, I think. Yeah, yeah. The guys who discovered that whole cosmic background radiation weren't looking for it; they found it by accident.

17:00 It was in their way, right. Yeah. And their first hypothesis, I guess, was that it was pigeon droppings on the detector. And once they cleaned off all the pigeon droppings, they had to figure out it was something else. Yeah. So they got the Nobel Prize for finding static in their instruments and realizing the static was significant.

17:20 Normally, that would be a problem, right? You want to get rid of it. Very cool. And so you gave a really interesting talk at PyData, I think 2015. It's up on YouTube; I'll be sure to link to the video. You talked about how distance is super important in astronomy, and it relates to many of these big ideas that we hear about if we're sort of paying attention, I guess.

17:42 Yeah, absolutely. Distance is fundamental to a lot of what we do, and it's also really, really hard to figure out. I mean, if you think about just looking at a dot of light in the sky, how do you tell how far away that is? So a big part of the story of astronomy over the past couple of centuries has been people figuring out how to determine how far away things are. The first step people figured out is that we can do it geometrically. You know, the same way as if you put your finger in front of your eye and close one eye and then the other, your finger seems to jump around in front of the background. That's called parallax. And we can use a similar type of trick to find the distance to nearby stars, because the Earth is on one side of the Sun in June and on the other side of the Sun in December. If you compare what the nearby stars look like, compared to the background stars, in June and December, you see them jump back and forth, and you can use the geometry of that to figure out the distance to those stars. I see. So you measure the sky, and you basically see which ones kind of move more and which ones are more or less fixed, and based on the parallax, you can say, well, these ones that moved are five light years away or something. Yep. And you can calculate that based on the angle and what we know about the Earth's orbit around the Sun. But that only works to within, well, up until a couple of months ago, it was within maybe a few thousand light years. There's this Gaia mission whose data was just released in the last couple of weeks, and one of the things Gaia can do is give us really accurate parallax distances out to previously unheard-of distances. So we're going to really be able to figure out the three-dimensional structure of the stars in our galaxy. But parallax is not going to work when you go out to more distant galaxies, so you have to come up with other ideas, and one of the ideas that's been really fruitful is this idea of standard candles. If I stick you on a street in the dark and I turn on a 100 watt light bulb and put it right next to your eye, it's really bright, but if I put it two blocks away down the street, it's really dim. That brightness and dimness, you can compute, because the apparent brightness is attenuated by a factor of one over the distance squared. So if you look at a light two blocks away, and you know that it's a 100 watt light bulb, and you have a very accurate photometer, you can compute exactly how far away that light bulb is. And this works with stars too. If we know the exact intrinsic brightness of a star, and we look at its apparent brightness, we can compute the distance very easily. Now, the trick there is you need to know the intrinsic brightness of the star.
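
The two distance tricks described here reduce to a couple of one-line relations; a small worked sketch, with made-up numbers:

```python
# 1. Parallax: a star with parallax angle p (arcseconds) is at d = 1/p parsecs.
# 2. Standard candle: flux falls off as 1/d^2, so d = sqrt(L / (4*pi*F)).
import math

parallax_arcsec = 0.1
distance_pc = 1.0 / parallax_arcsec
print(distance_pc, "parsecs")            # 10 parsecs (~32.6 light years)

luminosity_watts = 100.0                 # the "100 watt light bulb"
measured_flux = 100.0 / (4 * math.pi * 200.0**2)   # as seen from 200 m away
distance_m = math.sqrt(luminosity_watts / (4 * math.pi * measured_flux))
print(distance_m, "meters")              # recovers the 200 m distance
```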

20:42 And that's the kind of stuff that amazes me, because you look at these things super far away, and how do you know their intrinsic brightness? Right?

20:50 Yeah, it's really difficult. One thing you can do is build off the things we learned from parallax. You can look for certain classes of stars that are always around the same brightness, and you know their brightness when you know the parallax distance. Then you look for the same class of stars that are further out, and you can sort of infer their distance that way. This is why, in astronomy, it's known as the distance ladder: we have these direct methods that lead to more indirect methods of distances as we go further and further out. And one of the coolest stories of this distance ladder is that back in the early 20th century, there was this woman named Henrietta Leavitt. She was looking at variable stars. So there are stars out there that get brighter and fainter with time, and she was looking particularly at this class of stars called Cepheid variables, named after the fourth brightest star in the constellation Cepheus. And she found something curious when she was looking at the variation of these: they would get brighter and fainter with a period of, you know, somewhere between a day and a couple of days, something like that. And she found that the period of how fast they got brighter and dimmer was related to their intrinsic brightness. So there's this nice plot where she shows a roughly linear trend between period and intrinsic brightness. And that's really nice, because the period is something that you can find out from the sky. So she looked and, you know, found all these stars and confirmed that the period and the intrinsic brightness were related. So then Hubble came along, and you've probably heard of Hubble from the Hubble Space Telescope. What he did is he used the telescopes available to him and found more and more of these stars, and based on this period-brightness relation, was able to estimate the distances to all these stars. And the thing that really completely blew open our understanding of the universe was when Hubble pointed his telescope at one of what they called the spiral nebulae. There were these spiral-shaped clouds out in the sky that, for a long time, people thought were just clouds of dust in our galaxy. But Hubble found individual Cepheid variables in the Andromeda spiral nebula and found that it wasn't in our galaxy; it was about two and a half million light years away, farther away than anything we ever would have imagined existed. So in one fell swoop, the study of variable stars led to us understanding that the universe is orders of magnitude bigger than we ever imagined.
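
A hedged sketch of that distance-ladder logic in code: a rough period-luminosity relation gives the intrinsic brightness, and the standard distance modulus turns that plus the apparent brightness into a distance. The coefficients below are illustrative placeholders, not a calibrated fit.

```python
# Period -> intrinsic brightness -> distance, Cepheid-style.
import math

def cepheid_absolute_magnitude(period_days):
    """Rough, illustrative Leavitt-law form: longer period -> intrinsically brighter."""
    return -2.8 * math.log10(period_days) - 1.4   # placeholder coefficients

def distance_parsecs(apparent_mag, absolute_mag):
    """Distance modulus: m - M = 5 * log10(d / 10 pc)."""
    return 10 ** ((apparent_mag - absolute_mag + 5) / 5)

period = 10.0          # days, measured directly from the light curve
m_apparent = 19.0      # measured apparent magnitude
M_intrinsic = cepheid_absolute_magnitude(period)
print(distance_parsecs(m_apparent, M_intrinsic))  # distance in parsecs
```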

23:37 That's really amazing how that ladders up there, right. And beyond that, we also learned that the universe is not contracting or sort of static, but it's sort of expanding away from itself. And so, yeah,

23:51 yeah, so at the same time, he found not only that these galaxies are really far away, but when he looked at all these galaxies, they were the spiral nebulae, and now we know them as galaxies, because we know they're separate groups of stars. He looked at all of these and found that there was a relationship between how far away they are and how fast they're receding from us. We can measure their recession velocity by looking at the redshift of the light. It's kind of like the Doppler shift: when a siren goes by you, you hear it high at first, and as it passes, it goes low. You know what I mean? Yeah. And just like that Doppler shift, we could see that the light was shifting to a lower frequency, the same way the sound shifts to a lower frequency when a car goes away from you. So you can measure the velocity, and he found this relationship between the distance and the velocity, which basically describes a uniformly expanding universe. And right around the same time, with Einstein's general relativity, people were realizing that the general relativity equations, which describe gravity and explained the orbit of Mercury among other things, could be solved in a way that led to an expanding universe. So it was another confirmation of general relativity. And this is all based on finding distances to galaxies.
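
A small sketch of that distance-velocity relationship (Hubble's law), using the low-redshift approximation v ~ c * z and a round, approximate value for the Hubble constant:

```python
# Recession velocity from redshift, then distance from a uniform expansion rate.
C_KM_S = 299_792.458        # speed of light, km/s
H0 = 70.0                   # Hubble constant, km/s per megaparsec (approximate)

def recession_velocity(redshift):
    """Low-redshift Doppler approximation: v ~ c * z."""
    return C_KM_S * redshift

def hubble_distance_mpc(redshift):
    """Uniform expansion: distance = velocity / H0."""
    return recession_velocity(redshift) / H0

for z in (0.01, 0.05, 0.1):
    print(z, round(hubble_distance_mpc(z), 1), "Mpc")
```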

25:40 This portion of Talk Python To Me is brought to you by GoCD from ThoughtWorks. GoCD is the on-premise, open source Continuous Delivery server. With GoCD's comprehensive pipeline modeling, you can model complex workflows for multiple teams with ease, and GoCD's Value Stream Map lets you track changes from commit to deployment at a glance. GoCD's real power is in the visibility it provides over your end-to-end workflow. You get complete control of and visibility into your deployments across multiple teams. Say goodbye to release panic and hello to consistent, predictable deliveries. Commercial support and enterprise add-ons, including disaster recovery, are available. To learn more about GoCD, visit talkpython.fm/gocd for a free download. That's talkpython.fm/gocd. Check them out; it helps support the show.

26:41 I think one of the things that's super interesting about this is, you know, this concept of variable stars, and the work that woman did was very manual, right? Like she would look at pictures and so on. And

26:55 yeah, the way they measured the brightness of stars back before CCDs is you were looking at photographic plates, and the brighter something is, the more it's saturated. So you'd have to do a detailed measurement of the size of the dot on your photographic plate and use that to compute the brightness of the star. It's just amazing to me that any of that work got done, given how easy we have it now, you know.

27:22 yeah, exactly. Exactly. Such a different world. But at the same time, we've kind of answered those questions for the simple, small ones we focused on. And now the amount of data that you guys are getting is so much larger that you have to start applying these machine learning algorithms just to deal with it, right?

27:40 Yeah, absolutely. So the project I've been involved in, which is just starting to get off the ground, first light is going to be in a couple of years, is called the Large Synoptic Survey Telescope. You can think of it, as an overview, as a 10-year movie of the entire southern sky. It's a very wide-field camera that's going to be on a mountaintop in Chile, in the Atacama Desert, one of the driest places on the Earth, so we don't run into much weather. And it'll be able to scan the entire night sky every three nights or so, so we get about a hundred full-sky frames in this movie per year, and then do that for a decade. The big thing this is going to open up is the time domain. You know, typically astronomers tend to treat the sky as this fixed thing. There are these individual times where we look at specific regions of the sky and see what has changed, but we don't really have a global survey yet of the time domain of the sky. LSST is going to do this on a huge scale. We're going to have 10 years of data, with something like 30-ish terabytes per night of data coming through, and the full survey size is going to be in the hundreds of petabytes by the end. So it's really bigger than anything that's been done before, and it's really forcing astronomers to confront these old tool chains that they've had, which don't really scale anymore. You know, 10 years ago you could sit down at a computer and download, say, all of the Sloan Digital Sky Survey and do some sort of local analysis. Even in 10 years, I don't know if we're going to have hundreds-of-petabytes-sized hard drives in our laptops, right? So we're going to have to do it a little bit differently.
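
A quick back-of-the-envelope check of those numbers; the real survey figures are more nuanced, but the order of magnitude holds:

```python
# Roughly 30 TB per night for a decade lands in the hundreds-of-petabytes range.
tb_per_night = 30
nights_per_year = 365
years = 10

total_tb = tb_per_night * nights_per_year * years
total_pb = total_tb / 1000
print(total_pb, "PB")   # ~110 PB of nightly data over the full survey
```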

29:39 Yeah, that's really a lot of data. And the other thing you said that was interesting is that this data is being collected for everyone, which means it's not specifically focused on answering some particular type of question, so the techniques and tools, like the machine learning stuff you have to apply, face a greater challenge.

30:01 Yes, it has to be really, really general. Because this data, like you said, is collected for everyone; there aren't really specific areas that it's addressing. It's one of these discovery-class missions, similar to the Hubble Space Telescope: you put it out there, and you hope that the things you find are things that you're not going to be able to predict at the moment. And what that means is that for any particular science case, you're not necessarily going to have the best data. You know, if you were designing LSST to do one thing, like look for variable stars, you would do it very differently than if you're doing it in general, because you have to balance all these different concerns and different areas of research. So for example, going back to variable stars, one of the challenges with LSST is that it's not taking the exact same observation every night, which is what you'd want if you want to look at a variable star and see how the brightness changed from one night to the other. Instead, it's getting a breadth of different bands throughout the spectrum, everything from the infrared to the near ultraviolet. And what that means is that it's really good for things like determining the redshift of galaxies via machine learning, right, but it's actually very bad for finding variable stars, because now you have to model not only the variability, but also the spectral variability over the course of time, and it gets much more challenging. So as data grows, and as the heterogeneity of the data grows, having these sophisticated algorithms, whether it's machine learning, some sort of forward modeling, or some sort of nonparametric modeling, is becoming increasingly important. And these are things that need to happen kind of in real time, while you're observing the sky, because we want to be able to alert people within a minute or so if something changes on the sky and we find an interesting object. There's going to be this alert stream, so that somebody sitting at a telescope in another part of the world could point their telescope there right away and catch this interesting phenomenon.

32:27 Yeah. Wow. I'm really excited to see what comes out of this. That's a big project.

32:32 Yeah, it's going to be huge. It's really going to define the way that we do astronomy over the course of the 2020s.

32:40 Yeah, for sure. So let's talk about some of the libraries that you might be using to answer questions here. The two major ones, and I guess one is kind of a subset of the other, are Astropy and astroML.

32:53 Yeah, so Astropy is actually the big community standard, and it's been a really cool project to watch and to be involved in. It started a few years ago, when everyone had their own little Python library to do things. I should step back: 10 years ago, most people were using IDL, and so the community evolved these sets of routines in IDL to do a lot of the common tasks. And as more and more people moved over to Python because of its advantages, well, I'll go into that later, people built a whole bunch of different tools to do different things. It was this sort of smattering. And the Space Telescope Science Institute people, the folks behind Hubble, came together around 2011 or 2012 and said, we should consolidate all this and create one uber-package to rule them all, and Astropy was born. And in fact, it's actually accomplished its goal: pretty much everyone is using it now. So that's an incredible package, really, really well done, with awesome software engineering behind it and lots of buy-in from the community. astroML is something that I started around the same time. I didn't have as broad a vision, but I just wanted to bring together functionality and examples of doing machine learning specifically for astronomy in Python. We actually wrote the package to accompany our book, which is a Princeton Press book on statistical modeling, machine learning, and so on, in Python, for astronomy.
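
A tiny taste of the kind of common tasks Astropy standardizes, here just units and sky coordinates; the specific numbers are illustrative:

```python
# Units and sky coordinates: one small corner of what Astropy consolidates.
from astropy import units as u
from astropy.coordinates import SkyCoord

# The distance to Andromeda mentioned earlier, converted between units
d = 2.5e6 * u.lyr
print(d.to(u.Mpc))           # roughly 0.77 Mpc

# A sky position in RA/Dec, convertible to other coordinate frames
m31 = SkyCoord(ra=10.6847 * u.deg, dec=41.2687 * u.deg, frame="icrs")
print(m31.galactic)          # same position in Galactic coordinates
```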

34:38 Yeah, that's great. And what kinds of things do you cover in your book? Like, what problems are solved or presented, what datasets, things like that?

34:47 Yeah, that book is meant to be an intro graduate text on statistics and machine learning with astronomers in mind. So we walk through all the basics of data mining, statistics, and machine learning, all the while using datasets drawn from astronomy, and problems and situations that astronomical researchers will run into. Along the way, we also provide code snippets and figures, with the full figure source available online, so that if people want to actually use these techniques, they can grab our scripts, start modifying them from there, and see where it goes. So astroML is what drives that a little bit. In our next edition of the book, which might happen in the next year or so, my big task is going to be to incorporate Astropy, because we actually wrote that book before Astropy existed, so it's already a little bit outdated. And I want to make sure I'm pointing everyone to the tools that are in Astropy.

35:53 Yeah, that makes a lot of sense. So you created the book, and you were like, we really should make this a package that people can just use, and whatnot. And now it's a little more mature, right? Yep. Yep. Do you know of any discoveries that were made as a result of astroML? It's

36:10 been referenced in a lot of papers. I don't know offhand if there's anything that came exactly from it, but it's definitely been used for a lot of the incremental building of knowledge over the last few years. And it's been fun to see that.

36:27 Yeah, I'm sure that's really rewarding. That's awesome.

36:29 Yeah, another big thing in the astronomy community is forward modeling and Bayesian approaches. I alluded earlier to the fact that machine learning is a little bit difficult, because the statistical similarity of the samples is not always a good assumption. So the way that astronomers tend to get around that is to use forward modeling. You have some model for your system based on the physics that you know, and you can look at the noise properties and the selection effects of your observations to constrain that model, and then that model will tell you about the data that you observe. And that tends to work really well in a Bayesian setting. So a huge push in the last few years in astronomy has been to use tools like Markov chain Monte Carlo to do Bayesian analysis, and to do these really large, high-dimensional models to learn about the data. One package that's been pretty impactful there is the emcee package. It's a package for doing Markov chain Monte Carlo, doing Bayesian estimation, written by an astronomer. It's been cited, I think, thousands of times in the astronomy community, because so many people are doing that style of analysis.
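
A hedged, minimal sketch of the emcee workflow: define a log-probability, start an ensemble of walkers, run the sampler. The toy model below (the mean and width of some noisy measurements, with flat priors) is purely illustrative.

```python
# Toy Bayesian estimation with emcee: infer the mean and log-width of noisy data.
import numpy as np
import emcee

data = np.random.RandomState(0).normal(loc=5.0, scale=2.0, size=200)

def log_prob(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    # Gaussian log-likelihood (up to a constant) with flat, improper priors
    return -0.5 * np.sum((data - mu) ** 2 / sigma**2 + 2 * log_sigma)

ndim, nwalkers = 2, 32
start = np.array([4.0, 0.0]) + 1e-3 * np.random.randn(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(start, 2000)

samples = sampler.get_chain(discard=500, flat=True)
print(samples.mean(axis=0))   # posterior means for mu and log(sigma)
```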

37:57 Yeah, that's really amazing. Yeah, I guess the whole question of how you solve these prediction problems more quickly is really important, and Monte Carlo-style simulations are really good at that.

38:09 Mm hmm. Particularly the Bayesian approaches. Machine learning tends to be more of a frequentist approach, and the Bayesian forward modeling approaches give us some advantages. When you have some a priori idea about what's driving your observations, you can take advantage of that more in a Bayesian context than in a machine learning context.

38:34 So you wrote this book, Statistics, Data Mining, and Machine Learning in Astronomy, and you survived that process. And you got to come back for more; you're just about to finish up another book, right?

38:45 Yeah, I'm just finishing one. It's an O'Reilly book, so think, you know, cute little animal on the cover, you've probably seen them. What's your animal? The animal is a Mexican bearded lizard. Yeah, this one is the Python Data Science Handbook. The reason I did this is that for years, I've been approached by people who are, you know, in research or in tech or something like that, and they say, hey, I know how to use MATLAB, I know how to use R, but I want to learn how to do Python, and I want to learn how to analyze data in Python. And I hadn't found a really good resource to point them to, except for kind of collections of videos online. So I decided to write it. It's taken much longer than I thought, because life gets in the way, but we're at the point where I'm doing the final edits right now, so it should be released pretty soon.

39:37 Yeah, that's great. Congratulations on that. I'll obviously be sure to link to it as well from the show notes so everyone can find it.

39:43 And one thing I'm particularly excited about with this book: I wrote it all in the form of Jupyter notebooks, and got the publisher to agree to let me make the Jupyter notebooks public. So you can buy the printed version of the book, or you'll be able to go on GitHub and just work through the Jupyter notebooks for free. Wow, that really is cool. Yeah. Okay, but you should buy the book to

40:07 support the project. That's cool. But it's very cool that it's basically a live book, right? Like, if you have the data and you have the code and you can run it, you can explore it.

40:16 Mm hmm. Yeah. And we're working on getting a hosted version of it up there on some cloud service, so you could just basically click and have a live, executable textbook at your fingertips.

40:30 Yeah, it's interesting. I think a lot of things are going that way, right? The days of just a printed book and a zip file are fading, let's say. Yeah,

40:38 because there are so many better ways of doing it now.

40:56 This portion of Talk Python To Me is brought to you by Data School. Have you thought about making a career change into the exciting world of data science, but don't know how to get started? Data School helps data science beginners like you to analyze interesting datasets and build machine learning models to predict the future, all using Python. You don't need a PhD or a background in mathematics, just a keen interest in using data to answer your questions. Data School has created a Data Science Learning Path exclusively for Talk Python listeners. Visit talkpython.fm/dataschool to launch your data science career. Data School is run by my friend Kevin Markham, so I know that you're going to get excellent content. Check it out at talkpython.fm/dataschool. So let's talk a little bit about where you work and what you do, because you are breaking some rules around how people in academia and scientists work with programming technology and how programmers are involved, and I think that's really interesting too. You're at the University of Washington, but you're at this place called the eScience Institute, right? Yeah.

42:03 So I'm in the eScience Institute. I've been here since the beginning of 2014. The goal of the eScience Institute is to basically further computational research around campus. It's existed for a while, but we really got a big boost in 2014. When I came on, we got this joint grant between New York University, UC Berkeley, and UW, and we all created some version of this data science institute. So it's a five-year grant to support what we're doing, and the goals are basically to see how we can reshape the culture of academia to take more advantage of data science tools, to train people better, and to provide career paths for software-focused researchers. So for example, in the job that I have right now, what I do day to day is spend a lot of time consulting with researchers around the university, helping them figure out their data challenges. I mentor students who kind of have one foot in their home domain, their science, and one foot in, like, a data science program. And I work a lot on maintaining the software that astronomers and other scientists use. And this is a position that feels like sort of a stepchild in academia, because no one really understands that type of position; it doesn't fit into the model of graduate student, postdoc, faculty. So we have a number of people in a similar position to me who are working on this, and it's been super fun to see what comes out of this and the kind of novel trainings and novel approaches to research that we can do. It's particularly fun because it's not only happening at UW, it's happening at NYU and UC Berkeley as well, and we can compare notes with those institutions and see how things are going.

44:04 That sounds like such a fabulous job.

44:07 Yeah, it's good for the time being. I mean, I'm worried that I'm peaking early, because it's so fun. I don't know what will come next from here. No, that's

44:15 really cool. One of the things you pointed out in your PyData talk is that every field is entering a data-rich era. So there are all these biologists, sociologists; you're basically there to help support the biologists, sociologists, chemists, all the people who are hitting the limits of how much data they can handle.

44:36 Yeah, absolutely. And the way we're doing this is we have a number of different ways to engage with people on campus. One is we have these open office hours. So just like you used to go see your professor, we have office hours oriented towards researchers who have a challenge. They can come talk to one of our people, and we have people with expertise in everything from statistics to machine learning to software engineering to cloud computing and scalability. Another thing we do is run these incubator programs. It's sort of modeled on the startup incubators that come out of Silicon Valley, where, instead of incubating a startup idea, we're incubating a research idea and letting researchers work shoulder to shoulder with a data scientist who has expertise that complements theirs. And we also have graduate fellowships, where students have one foot in their own department and one foot in eScience, and are taking not only, say, astronomy courses, but also database, machine learning, statistics, and computer science courses, and getting credit on their PhD for that.

45:43 Yeah, what I thought was really fascinating is, you know, having gone through some part of a PhD program, there are so many things you've got to take and learn, and you're so busy learning your specialty, right, like biology, if that's what your PhD was in, that it's really hard to be a good data science, software-type person as well.

46:04 Yeah, absolutely.

46:05 I think, you know, you said that these folks in these cohorts basically get half of their requirements for their PhD program waived, so that they can focus the other half on sort of complementing this with data science and programming, right? Yeah,

46:22 that's the idea. And then what comes with that is they have their home department advisor, but they're also matched with a co-advisor that's more methodological. And so it leads not only to the student growing a lot, but also to some interesting interdisciplinary collaborations around campus. We've had a number of pretty cool grants that have been awarded based on some of these partnerships.

46:47 Yeah, that sounds really quite amazing. I wish that was around when I was in school.

46:53 Yeah, I do, too. I had to pick a lot of stuff up on my own, it would have been nice to have something like this.

46:58 Yeah. If anyone out there is listening, and they're maybe in a position where they're like, oh, this is interesting, how do we do this? Right. Another thing that I thought you pointed out that was really interesting is it's in a beautiful location, and you said that that was really important.

47:10 Yeah, definitely. So we have this data science studio; it's an old library branch location on campus. We're on the sixth floor of this tower, where we have the whole floor with 360-degree views looking out over Mount Rainier and the Olympic Mountains and things like this. And it's important not just so that I can have an awesome view while I'm writing code, but because we want people around campus to interact with each other. So we want to be a place where people would like to come and just hang out. It's getting back to what we call the water cooler effect: the people who were around in the 60s and 70s, working on computationally intensive science, talk about the days when everyone would go to the mainframe on campus, and you'd be sitting there waiting to put your punch cards in, and, you know, a hydrologist would be talking to an astrophysicist and finding out they're solving the same equations with their programs. So they would have that sort of talk. As the campus moved towards desktop-oriented computing, those sorts of opportunities went away, and I think we're better off if we can have that sort of connection. So one of the cool things about our space here is it's a space that's open to anyone on campus, for just hanging out and working, but also for scheduling meetings. We have people from all different departments scheduling their group meetings here, coming in, hanging out, having coffee, and then meeting someone from the other side of campus who's solving the same differential equation in their completely different field. Yeah, that's

48:53 great. Yeah, very, very nice. Like I said, I wish that existed when I was in school. Alright, so we're kind of getting near the end of the show, and I have a couple of questions that I wanted to run by you. One is kind of almost metaphysical. I just heard the other day that there was some study done showing we had underestimated the number of galaxies by, like, a factor of 20, or something amazing, right? Yeah, that was really interesting. You know, already there are so many galaxies out there, and every galaxy has so many stars and so many planets. What do you think the chances are that there's intelligent life out there? Not necessarily, like, people visiting us, just out there, even if we'll never meet them.

49:38 That's hard to say, you know, at this point. So one thing: we have this astrobiology group, which sounds like a funny area of study, because what are they studying, right? But they're working on really interesting things, combining what we know about biology, about geophysics, about planetary astronomy, and looking for locations around the universe where life might exist. And so they study extremophiles around here, like organisms that live on deep-sea vents and in acidic, boiling water environments, things like that. One thing that's come out of that group is this notion that simple life, you know, microbial life, probably could exist just about anywhere. And I tend to think I would be pretty surprised if we don't find some sort of microbial life elsewhere in our solar system. The other thing to come out of that, as we study the dynamics of planets and things like this, is that there are a lot of things about Earth in particular that make it very special, and a lot of coincidences that would be hard to duplicate. You know, things like the fact that Jupiter exists keeps us from having a large number of asteroid impacts on Earth; it's kind of a big shield. And asteroid impacts, as we know from the geological and paleontological history, can be pretty bad for life on Earth. So the type of stability that we have on Earth, particularly over the last tens to hundreds of millions of years, I suspect that's rather rare, and that makes me think that intelligent life might be rather rare. I think with the seas of Europa, when we get something that can burrow through the ice and look down there, I really hope there's something swimming around down there. It would be really cool. It'd be very cool. I

51:38 hope I get to see that someday. That'd be awesome. All right. So another one: my wife's a professor here at Portland State University, so I hang out with some of her colleagues and stuff. One of her colleagues has this student; she's teaching, like, a numerical methods for partial differential equations course, something like that, and she's using Python and a lot of things like NumPy in her class for the computation. One of her students came and said, hey, I know MATLAB, can I just use MATLAB? Why do I need to learn this Python thing? What would you tell that student if you got that question?

52:16 Yeah, that's a good question. So number one, I think, use the tool that's most effective for your research. For example, if there are programs in MATLAB that don't exist in Python, and they are required for your research, there's no reason to learn a new tool just because it's a new tool. But on the other hand, there are some distinct advantages to Python. I alluded to this earlier, when I talked about the field of astronomy shifting from 90% IDL over the last 10 years to probably 90% Python. The advantages that I see are, number one, its openness, right? It's open, and it's free. One thing that has come up with IDL is there are site licenses required for every instance that you run; you know, it's a pay-to-play type of interpreter. So when people started running parallelized jobs, taking advantage of all the computers in the department, there were times when a grad student would start a job, and it would use all the site licenses for the entire department, and research in the department ground to a halt. Right. You don't have that problem in Python, because there are no site licenses with Python, so Python can be cheaper to use. The other thing is that, to serve students well: you know, the number of academic jobs versus the number of undergrad degrees or PhDs granted is extremely small, so most of our students are going to be going out into the world and working somewhere other than an academic department. And people in the outside world, in the tech world, are much more excited about someone with Python chops than someone with IDL chops or MATLAB chops, just because that's the way the world has gone. So that's another good reason to move to Python. The other thing that I love about Python is the culture of open source. Particularly now, 10 years ago was different, but now, just about anything you want to do in Python, you can go out there and there's somebody who has made an open source library for it, put it on a place like GitHub or Bitbucket, and made it available, and often these libraries are really, really well done. There's just been this culture of well designed open source, particularly in the scientific Python community. And it means that, you know, you can do an amazing number of things just out of the box with Python and the scientific installation.

54:49 Yeah, what do you think that means for reproducibility? Like, I want to store this thing of code, the interpreter even, like a Linux Docker image of the thing that I used to generate my paper.

55:05 yeah, that's huge. And, you know, it comes back to what I was telling you in the beginning about how I got into the Python open source world: I thought it was ridiculous that I spent this whole summer building this tool, and then no one was going to be able to use it. The tools in Python for enabling that sort of reproducibility, even having something like an executable paper, are huge. And I think it's really helping science drive itself forward, because we don't need to reinvent the wheel every time we do a new study.

55:34 I think we'll have to leave it there for the topics, but I do have two final questions for you before I let you go. Okay. I just saw on PyPI that we passed 90,000 distinct packages. There are so many amazing things you can install, and in your field, you probably get exposed to really interesting things that maybe not everybody knows about. Tell us about one of your favorite Python packages that you might recommend.

56:00 Well, I mentioned this earlier, but emcee, for Markov chain Monte Carlo. I think that's just an incredible package, and it allows you to do so much as far as Bayesian modeling. I could talk about one of my own packages, but yeah,

56:14 well, you know, your own packages are not off limits; like, astroML is all right, you know? Yeah. So

56:20 one that I've been working on recently, which is a lot of fun, is this Altair package. What it is, is a Python interface to Vega-Lite. Vega-Lite is a visualization grammar for statistical visualizations that basically outputs interactive JavaScript plots. And so we've been writing this Python wrapper and trying to make a nice API to create the Vega-Lite grammars and Vega-Lite visualizations. I'm pretty excited about this, because, you know, there are so many options for plotting out there right now: there's Matplotlib, there's Bokeh, there's Plotly, there's HoloViews, there's the ggplot wrapper for Python, there's Seaborn, and I'm going to miss something and someone's going to get mad at me, things like that. But the thing that's interesting about Altair is that it interfaces to this Vega-Lite grammar, and that grammar, I think, has the possibility of becoming sort of a lingua franca between these various visualization packages. And if you've heard of D3, which is driving a lot of interactive visualization on the web, Vega and Vega-Lite are coming out of the same research group. So it's people who really know what they're doing as far as visualization design. Yeah, that's cool. That's a great pedigree. Yeah. So, you know, I get to write the Python classes that output this stuff, and it's pretty fun. Great. What's the package called? It's called Altair: A-L-T-A-I-R.
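
A tiny, hedged example of what the Altair API looks like: a declarative chart specification that compiles down to a Vega-Lite JSON spec. The data frame here is made up.

```python
# Declarative plotting with Altair; the chart is a Vega-Lite spec under the hood.
import altair as alt
import pandas as pd

df = pd.DataFrame({
    "period_days": [1.2, 3.4, 5.6, 10.1, 22.0],
    "brightness": [-1.6, -2.9, -3.5, -4.2, -5.1],
})

chart = (
    alt.Chart(df)
    .mark_point()
    .encode(x="period_days", y="brightness", tooltip=["period_days", "brightness"])
)

print(chart.to_json()[:200])   # the generated Vega-Lite specification
```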

57:58 All right. Awesome. All right. And when you write some Python code, what editor do you use? Geez,

58:04 I go back and forth these days between Emacs and Atom, actually. I like the Emacs key bindings, but I like the way that Atom arranges an entire project and lets you see all the files.

58:17 Yeah, yeah. Those are both nice. Cool. All right. So any final call to action? I heard you had an announcement about your PhD cohorts, like the 50/50 program.

58:28 Yeah. So we just put out this announcement for 2017 postdoctoral fellowships. This is looking for people who have recently finished their PhD and are interested in continuing research in their own field, but also adding some sort of computational or data science element to it. It's similar to the graduate program I described earlier: you apply to have one foot in your domain department and one foot in the eScience Institute, with two advisors, one from the domain and one in a methodological area. We have just a great set of postdocs here who are doing some really phenomenal work with that. And so if you're a graduating PhD student and this eScience Institute or data science stuff sounds good, I'd encourage you to apply; the applications are due sometime mid-January.

59:22 All right, that's plenty of time to get them in there. Cool. Yep. All right. And when's your book coming out?

59:27 Probably January, I think. I don't know at this point. It depends on how quickly I get this corrected manuscript back to them.

59:37 Yeah, of course, I saw that you can get like an early access version of it, right?

59:42 Yeah, the early access is there. So if you want to take a look at the pre-release right now, you can go buy it, and they'll update you to the released version when it comes out.

59:51 All right, sounds great. So Jake, it's been super interesting talking about astronomy with you. Thanks for coming on the show and sharing your story.

59:59 Yeah, thanks for having me. You bet. Bye bye.

01:00:02 This has been another episode of Talk Python To Me. Today's guest has been Jake VanderPlas, and this episode has been sponsored by GoCD and Data School. Thank you both for supporting the show. GoCD is the on-premise, open source Continuous Delivery server. Improve your deployment workflow but keep your code and builds in-house. Check out GoCD at talkpython.fm/gocd and take control over your process. Data School is here to help you become effective with Python's data science tools quickly and skip years at the university. Check out the Talk Python learning path at talkpython.fm/dataschool. Are you or a colleague trying to learn Python? Have you tried books and videos that just left you bored by covering topics point by point? Well, check out my online course, Python Jumpstart by Building 10 Apps, at talkpython.fm/course to experience a more engaging way to learn Python. And if you're looking for something a little more advanced, try my Write Pythonic Code course at talkpython.fm/pythonic. You can find the links from this episode at talkpython.fm/episodes/show/81. Be sure to subscribe to the show: open your favorite podcatcher and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. Our theme music is Developers, Developers, Developers by Cory Smith, who goes by Smixx. Cory just recently started selling his tracks on iTunes, so I recommend you check it out at talkpython.fm/music. You can browse the tracks he has for sale on iTunes and listen to the full-length version of the theme song. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Smixx, let's get out of here.

01:01:50 Dealing with my boys. Having been sleeping. I've been using lots of rats. Got the mic back.

01:02:08 Developers developers
