
#163: Python in Geoscience Transcript

Recorded on Thursday, May 3, 2018.

00:00 Michael Kennedy: Learn how Python is being used in research to understand the inner workings of the Earth. This week, you'll meet Lindsey Heagy, a PhD student in geophysics at the University of British Columbia. She shares how she's using Python to solve these computational problems along with an amazing framework for viewing scientific writing itself through the lens of Python and opensource. This is Talk Python to Me, Episode 163, recorded May 3rd, 2018. Welcome to Talk Python to Me, a weekly podcast on Python, the language, the library, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython. This episode is brought to you by MongoDB and Anvil. Please check out what they're offering during their segments, it really helps support the show. Lindsey, welcome to Talk Python.

01:06 Lindsey Heagy: Thanks, good to be here.

01:07 Michael Kennedy: It's great to meet you. You're doing some really cool stuff in geophysics intersecting with Python so I'm super excited to explore that with you today, but before we do, let's get started with your story. How did you get into programming and Python?

01:20 Lindsey Heagy: Most of that actually was undergrad. So I really hadn't touched much computing until I started at university. I did my undergrad in Edmonton, at the University of Alberta, and it was in my first year that I sort of got introduced a bit to programming ideas, but that was actually in Perl.

01:36 Michael Kennedy: So did programming look like a good thing you wanted to do or were you like whoa, what is this?

01:40 Lindsey Heagy: I was more like what is this? It was interesting to start getting exposed to the types of things you can automate to make your life a bit easier. So that was all that we had really done in the first year and then got a bit into C++ in my second year when I was training in physics, and that was intimidating for sure. And then finally, once I started getting much more into geophysics, we were all in MATLAB, and that was much closer to what the day-to-day scientific work is readily supported by. Yeah, that was really my first foray into computing.

02:14 Michael Kennedy: Yeah, and there's a pretty easy switch from MATLAB to Python and they're not the same. Obviously, they have really different philosophies. MATLAB is like let's make it all super commercial and every little function you want, you got to buy that individually and so on. The sort of feel of scripting is kind of the same. That's pretty cool. So how did you get into Python actually? Was that in grad school?

02:38 Lindsey Heagy: Yeah, it was grad school, and the credit actually goes to Rowan Cockett, who I've done a lot of work with and who just recently finished his PhD. So that was my first or second year at grad school. He had really suggested that we start working together on a number of projects as like a geophysics group and so that's how SimPEG started, which I'm sure we'll come back to. Then starting to build software together was when we looked at Python because there's such a healthy community and so many tools for making that easy. So yeah, that's how we got into it.

03:08 Michael Kennedy: Yeah, that's cool, what year was that?

03:10 Lindsey Heagy: It was only a few years ago so I guess like my second year of grad school.

03:12 Michael Kennedy: 2015?

03:14 Lindsey Heagy: Yeah, earlier than that, maybe 2014, yeah. So I've been here a while.

03:19 Michael Kennedy: I'm just thinking because if you look at the history of Python and especially how it's appeared in data science, it looks like around 2012, I don't know what the actual trigger for this was, but something happened that just really drove the adoption of Python, especially in the data science space. I'm pretty sure Jupyter has a lot to do with it, but I'm sure there's other factors there as well so it makes sense.

03:39 Lindsey Heagy: Yeah, yeah.

03:41 Michael Kennedy: What do you do today? You're still in grad school, is that right?

03:45 Lindsey Heagy: Yes, I am. My sort of day-to-day is I'm trying to wrap up my PhD so it's a lot of writing at the moment.

03:51 Michael Kennedy: Like the thousand little details you didn't know you had to finish up are attacking you?

03:56 Lindsey Heagy: Exactly, yup. So that's where I'm at at this point. But I'm fortunate to be a part of a really great group of people and so there's a lot of collaboration that goes on back and forth. Most of my PhD and grad study journey has really been a lot of collaborative work, which is a lot of fun.

04:11 Michael Kennedy: Nice, and so you're studying geophysics and you're working at this place called the Geophysical Inversion Facility, is that correct? Do I have that right?

04:19 Lindsey Heagy: Yes, that's right.

04:22 Michael Kennedy: First of all, what is geo-science or geophysics, like generally?

04:26 Lindsey Heagy: Geophysics would be a subset of geo-science. So geo-science is basically anything concerned with trying to understand the Earth so that is very, very broad.

04:32 Michael Kennedy: Right. It could be plate tectonics, it could be magnetic field, I guess it could even be climate change, right?

04:38 Lindsey Heagy: Yep, it could be climate change, it could be atmospheric studies, it could be like economic geology, trying to map out where different rock units are. All of that.

04:46 Michael Kennedy: Okay, so what is geophysics?

04:47 Lindsey Heagy: Geophysics, we're then understanding the Earth through physics and so what I do specifically in geophysics is we're looking at geophysical imaging. So it's a lot of like medical imaging in a lot of ways. When you go to the doctor, you're hoping that they don't have to drill into you to get information about what's going on inside of you. There are sensors, like if you go into an MRI or something like that, there's sources and sensors on one side and we can take the data that have been collected in that survey of you, and work with those to get an image of what's going on inside your head, and so we basically do the exact same thing, but then on the Earth. It's a larger scale survey, but it's the same general principles.

05:29 Michael Kennedy: Okay, that sounds really interesting. I've seen some really amazing graphics and I'll link to a couple of them, of course, in the show notes. What are some of the types of questions that you are trying to answer, or people generally in the field are trying to answer?

05:43 Lindsey Heagy: So it totally ranges. Our group, a lot of the history has been really connected with minerals so trying to map out and locate where mineral deposits are, characterize them, delineate the different units and things like that. One of the big topics that's becoming more and more relevant is characterizing ground water. Trying to figure out where do we have pockets of aquifers? How much is in that aquifer, can we quantify that using geophysics? Because in a lot of places right now what's done is wells are drilled and you can get water levels at single points, but that obviously doesn't characterize the whole aquifer.

06:16 Michael Kennedy: Right, of course. And that's going to be an increasingly interesting question. We saw California go through some really serious droughts and they're finally out of it, but that kind of stuff is going to be happening more and more, most likely, so those questions become more critical, right?

06:31 Lindsey Heagy: Yeah, absolutely. So ground water and sea water intrusion is another big thing in California. So it's not only that you are losing aquifer water, the sea water is actually also coming in. There's a whole bunch of different things going on that we really need to get a handle on fairly soon.

06:47 Michael Kennedy: All right, so what are some other questions or some other areas people are focused on?

06:50 Lindsey Heagy: A lot of the work that I've been doing is looking at trying to monitor subsurface injections. Like if carbon dioxide injection or hydraulic fracturing is done, we're injecting fluids into the Earth and trying to track where that fluid is going. I've been doing a lot of work looking at electromagnetics when we have steel cased wells there because often we've got this well and we're going to inject the fluid through there, but what's interesting about a well is that it is actually very, very conductive. So in a lot of ways it's like a big electrode so it can help us get current to depth, but we're then trying to understand the mechanisms and how does that actually happen and how does it work?

07:25 Michael Kennedy: Yeah.

07:25 Lindsey Heagy: So that's been a lot of what I've done.

07:27 Michael Kennedy: That sounds really interesting. I guess the first place we should probably start looking at some of the programming side of things is I guess let's talk about SimPEG. So SimPEG is this thing for simulation and parameter estimation. What does that mean? Put that in plain English words.

07:47 Lindsey Heagy: Fair enough. What we need to do if we're trying to collect any sort of data so we can maybe go out and do a magnetic survey. In that case, the source is the Earth's magnetic field and then basically whatever kind of rocks that you have that are susceptible, meaning that they act like little dipoles, and so they'll try and line up their dipoles with the Earth's field and then we can go over with the magnetometer and try and find these. So that would be a magnetic survey. What we need to be able to do is we need to be able to simulate the physics of that process. So in this case we would be simulating the magneto-static equations.

08:22 Michael Kennedy: I was going to say so different types of minerals maybe react differently to the magnet so if you're looking for a cobalt or lithium or something else like you can detect that using magnets?

08:34 Lindsey Heagy: Yes, some of them. So it depends if they're magnetic or not, so not all minerals are and then the strength can vary. Mag is just one survey that you can do, but generally we'll try and do a few different surveys. So they might be magnetic, they might also be conductive. If you have something like iron, it's pretty easy to get current through that and it's magnetic. So if you go out and do both of those surveys together, we can start to like pinpoint what these different minerals are.

08:58 Michael Kennedy: So, iron is easy?

08:59 Lindsey Heagy: Iron is easy, yeah.

09:02 Michael Kennedy: Nice, okay, so back to the SimPEG. So you want to analyze the magnetic response of the stuff deep underground.

09:09 Lindsey Heagy: Yes, we first need to be able to simulate the set of equations that governs that process. So that's a simulation piece. We're going to solve some sort of partial differential equation that's governing the physics. So we assume that we know what the Earth is, like what the Earth model is, what the physical properties are, and we can predict what the data should be. So that's the simulation piece. The parameter estimation piece is trying to go backwards. It's that we have data, and now from that data we're going to try and estimate what these are. So we do that through an optimization process. We can basically say okay, I know what these data are, and I know how to guess, or I know how to simulate data, and so now we're going to try and find a model that fits those data and is somewhat geologically reasonable.
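
The two pieces described above can be sketched with a toy problem. This is a minimal illustration, not SimPEG code: a small linear operator `G` stands in for the partial differential equation solve, and the names `simulate`, `invert`, `beta`, and `reference` are assumptions made for the example. The regularization term plays the role of keeping the recovered model "reasonable".

```python
# Toy illustration only: a linear operator stands in for the physics.
# None of these names come from SimPEG.

def simulate(model, G):
    """Forward problem: predict data from a known model (d = G m)."""
    return [sum(g * m for g, m in zip(row, model)) for row in G]


def objective_gradient(model, G, observed, beta, reference):
    """Gradient of 0.5*||G m - d||^2 + 0.5*beta*||m - m_ref||^2."""
    residual = [p - o for p, o in zip(simulate(model, G), observed)]
    grad = []
    for j in range(len(model)):
        data_term = sum(G[i][j] * residual[i] for i in range(len(G)))
        grad.append(data_term + beta * (model[j] - reference[j]))
    return grad


def invert(G, observed, beta=1e-3, step=0.05, iterations=2000):
    """Parameter estimation: gradient descent from a reference model."""
    reference = [0.0] * len(G[0])
    model = list(reference)
    for _ in range(iterations):
        grad = objective_gradient(model, G, observed, beta, reference)
        model = [m - step * g for m, g in zip(model, grad)]
    return model


G = [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0]]
true_model = [2.0, -1.0]

observed = simulate(true_model, G)   # simulation: model -> data
recovered = invert(G, observed)      # estimation: data -> model
print(recovered)
```

The recovered model is close to, but not exactly, the true one, because the regularization pulls it slightly toward the reference model.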

09:48 Michael Kennedy: Okay, that's pretty cool. This is all done in Python? Is there maybe some underlying C or FORTRAN code that gets brought in there or is it all straight Python?

09:59 Lindsey Heagy: It's basically all Python. We interface to lower level solvers and things like that and there's a little bit of Cython for some of the meshing stuff, but for the most part it's pure Python.

10:08 Michael Kennedy: Okay. Yeah, that's really cool. There's a picture, you mentioned Rowan Cockett, and I happen to just have grabbed something off Twitter from him so it's kind of a funny tie-in. But if people are kind of interested in seeing what those look like, I'll put in the show notes this cool sort of three dimensional graph of what he's called the Richards equation. That's pretty cool. Are you familiar with this?

10:34 Lindsey Heagy: I'm a co-author on that paper actually.

10:36 Michael Kennedy: Oh, awesome. Tell us a little bit about like what the main question was and what you guys found using SimPEG on it.

10:43 Lindsey Heagy: What this was, this paper actually describes a lot of the fluid flow machinery that we have in SimPEG. The Richards equation describes fluid flow going through soil. So it's a two phase flow where you have air and water in some sort of medium, like a soil. Depending on how hydraulically conductive pieces of that soil are, basically how easily water goes through, that front propagates down differently. In this case, what we developed in this paper is describing how do we solve those equations, but then the important piece is when you want to actually go back and estimate things like hydraulic conductivity, we need that inversion piece and so we need to have gradients that tell you basically if I change this model parameter this much then it changes my data in this manner.

11:31 Michael Kennedy: I see, like how responsive is it to variation in parameters and estimation and stuff.

11:36 Lindsey Heagy: Absolutely and so that's like such an essential piece to be able to solve the parameter estimation problem. So in this paper, we go through and basically derive all the mathematics from that and then show a couple examples.
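
To make "if I change this model parameter this much then it changes my data in this manner" concrete, here is a small sketch of a sensitivity matrix. It uses finite differences on a toy forward model rather than the analytic derivatives the paper derives, and the forward function is a made-up stand-in, not the Richards equation.

```python
import math


def simulate(model):
    """Toy nonlinear forward model (a stand-in, not the Richards equation)."""
    amplitude, decay = model
    return [amplitude * math.exp(-decay * t) for t in (0.0, 1.0, 2.0)]


def sensitivity(forward, model, h=1e-6):
    """Finite-difference Jacobian: J[i][j] = d data_i / d model_j."""
    base = forward(model)
    columns = []
    for j in range(len(model)):
        perturbed = list(model)
        perturbed[j] += h  # nudge one parameter at a time
        shifted = forward(perturbed)
        columns.append([(s - b) / h for s, b in zip(shifted, base)])
    # Transpose so rows index data and columns index model parameters.
    return [list(row) for row in zip(*columns)]


J = sensitivity(simulate, [2.0, 0.5])
for row in J:
    print(row)
```

Each row says how one datum responds to each parameter. The first datum (at time zero) does not depend on the decay rate at all, which is the kind of information an inversion needs.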

11:48 Michael Kennedy: Yeah, the pictures are really, really compelling and it's pretty interesting. I think it's really nice what you guys have put together here. Are a lot of people using SimPEG for other work, or is it mostly within your group?

12:01 Lindsey Heagy: It started mostly within our group, but it is starting to branch out a bit. We've got some collaborators at Colorado School of Mines. There's some people with USGS who are starting to dabble in it. We recently gave a short course that traveled around the world and so that was a great opportunity to start introducing people to the Python ecosystem and for some researchers, SimPEG was an appropriate tool so they're starting to come in and explore that a bit.

12:30 Michael Kennedy: Hey, everyone, Michael here. Did you know I'll be doing a three part webcast series about MongoDB and Python from May to June? We'll see why MongoDB is a great choice for Python web apps. In this series, we'll go through the entire process of building a clone of PyPI, Python's packaging website over at pypi.org. Everything from building the front end to deploying the web app in MongoDB to the cloud. You'll learn everything from document modeling basics to special considerations for running MongoDB in production. The webinar is free, so just click on the link in the show notes, or go to MongoDB.com/Webinar/Python and sign up. See you in May. Sounds like a lot of what you learned from SimPEG sort of fed back into some of the other work that you're doing as well around like more general education research and science, right?

13:19 Lindsey Heagy: Yeah, absolutely. I think for our group and for me, in particular, SimPEG was really my first entry point to the whole opensource ecosystem and like what it's like to operate and run an opensource project and be involved in an opensource project and what some of the strategies are, things like peer review and issue tracking and all of that sort of stuff. That was all really learned when we first jumped into SimPEG.

13:43 Michael Kennedy: I think it's really interesting. You come from a place of like say, working with MATLAB as a community in general, right? Like they publish packages you can add on and you kind of wait for them to release new versions and you get what they give you, right? There's not a whole lot of give and take there and that's so much to contrast with things like SimPEG or other opensource projects, right? Like even on the courses that I do I'll have people who come and say "oh, you did this little demo like this but actually, I've refactored it to really take into account this other thing." You get these out of the blue people helping you out, even if you don't ask for it. It's really quite an interesting experience.

14:25 Lindsey Heagy: It really is and just the ability to put code out there, share it, and get other people's opinions is such a part of the Python community and what people follow. Whereas, if you look at something like MATLAB, there's just not easy mechanisms to do that. So even if you write something useful, like you email code then to your friends, and that's just not sustainable.

14:48 Michael Kennedy: Which one do you want me to run, the one that you sent me on May 23rd, or the one you sent in June? I can't remember which one we're working from. That's not great version control, is it?

14:58 Lindsey Heagy: No.

14:59 Michael Kennedy: So maybe let's spend a little time talking about this presentation that you gave at SciPy 2016. It was called Using Opensource Tools to Refactor Geo-Science Education. I think you kind of put it around the scope of geo-science, but I would say scientific education more generally, right?

15:18 Lindsey Heagy: Yeah, I mean, geo-sciences are our domain so that's where we thought through a lot of these things, but as we've been developing along the way, really trying to take a bit of a big picture perspective and figure out are there things here that we can learn and perhaps translate to teaching and learning as well as scientific publication?

15:35 Michael Kennedy: Yeah, so maybe give us the big idea on your talk and then we'll touch on some of the pieces.

15:40 Lindsey Heagy: GeoSci.XYZ is a collection of like opensource, they're basically living textbooks. The idea is we're trying to take a lot of what we've learned from developing opensource software and try and apply that to opensource educational resources so that there are opportunities for collaboration for peer review, for iterating on things, and to grow and develop resources as a community. One of the things we've really noticed is in geophysics in particular, it's a really small field so there really aren't actually many good textbooks, especially for introductory level classes. So what happens is there are professors scattered around the world teaching this and they're all developing all of their own course notes from scratch and each of them has different expertise and different background so there's one aspect of that that's going to be really strong, but then the rest of it, they're having to go and learn from A to Z a whole bunch of techniques that there may or may not be.

16:37 Michael Kennedy: They maybe learned it in grad school, but they haven't used it for 15 years and now they got to write about it because they got to be comprehensive.

16:43 Lindsey Heagy: Yeah, and so we're hoping to eliminate some of that so that people who are experts in given topics can contribute that and then can leverage what other people know.

16:53 Michael Kennedy: I think that is really a quite interesting perspective. You talked about some of the problems with this sort of teaching and learning and writing sort of largely focused around textbooks and stuff, but more generally as well. If there are bugs in the book, you don't know. They're hard to be tracked. Maybe you can find it, right? Versioning is difficult. Do you have the current version? You don't know, right? How do you diff a book, like that is two textbooks. You just kind of flip through it, right? I don't know. It's not easy, is it? Sometimes it's really obvious like oh, we added a chapter on this, or a figure on that. But other times, you said it's much more subtle. Like oh, there is an error. It should've been a minus sign in this equation right here and you probably wouldn't catch that, right?

17:42 Lindsey Heagy: That's really a challenging position to be in when you're first trying to understand a topic. Like you work through the whole question, you create the plot based on your understanding, and it doesn't match what theirs is, and then what do you do? Like they're different and they could be wrong, you could be wrong, and there's no way to sort that out. Like there's no way to contact the author really in most cases.

18:03 Michael Kennedy: Yeah, it's quite hard and a lot of times you're a student so you're like well, I'm wrong because I'm new at this. I must be wrong, right? Until, somehow, you maybe decide no, I really am right. This is really broken. That can be frustrating. You had a really nice way of breaking down the things involved in that type of creation, that educational content creation, and sort of framing it in terms of concepts from the Python space and you started with functions and you asked the question like what are functions in the context of science and writing? So what are functions in that context?

18:43 Lindsey Heagy: There's a few different things to think through. One of the things that we first looked at... Maybe I'll give you a little bit of the back story before diving into this. One of the big motivating factors for this project was that my supervisor had developed a website quite a number of years ago instead of a textbook. So he was really forward looking in the sense that he wanted to get content out there for students in a very tangible, easy to interact with way. They built this website. It's a great site. But then we found a few typos in it and wanted to try to go in and fix them and realized it was tangled up in this crazy HTML mess. The first step in the refactor of this was really identifying what actually is the data here? What's the data and what is the packaging? So in this case, really the data is just the text and the equations and the images. The HTML and CSS, all of that, is just packaging. In this case, perhaps a function is actually Sphinx. You take your data, which is text and images, and then you compile that into a website. What's powerful about that is then as styles and things like that update and there's better ways to interact, or somebody builds a fantastic new search tool, you've separated out what is data and what is the packaging and so you can immediately leverage all of those new developments.
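
The data-versus-packaging split can be sketched in a few lines. This is only an analogy in code, assuming two hypothetical templates (`OLD_THEME`, `NEW_THEME`) and a hypothetical `build` function; Sphinx itself is far more capable, and this is not how it is implemented.

```python
from string import Template

# The "data": text, equations, image references. No styling lives here.
page = {
    "title": "Magnetic surveys",
    "body": "Susceptible rocks align with the Earth's field.",
}

# The "packaging": swappable templates (hypothetical, for illustration).
OLD_THEME = Template("<html><body><h1>$title</h1><p>$body</p></body></html>")
NEW_THEME = Template('<article class="page"><h1>$title</h1><p>$body</p></article>')


def build(content, template):
    """The "function": compile content into a rendered page."""
    return template.substitute(content)


old_site = build(page, OLD_THEME)
new_site = build(page, NEW_THEME)
print(new_site)
```

Fixing a typo means editing `page` once. Adopting a new look means swapping the template, and the content is untouched.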

20:03 Michael Kennedy: Yeah, it sounds super obvious, as people who write software. You think of course, you're going to do these types of things but then in practice you go look at, for example the website you're talking about, and it's all crammed together. A lot of times these are written by people who don't have sort of formal software training either so they maybe don't even have some of these ideas sort of in the back of their mind when they come to it. So you're like all right, well let's break this out into reStructuredText. People can edit that super easy. You probably know LaTeX or something, and this is an easier version of that, so that's good. Obviously the styles, so that's one part. Another part you talked about was capturing input. So you probably have some picture, but that picture has like a view onto it, it has parameters for the equation, it has all kinds of stuff, right? So that's another aspect of sort of the function analogy, right?

20:56 Lindsey Heagy: One of the great things in the Sphinx documentation is the matplotlib plug-in. That's what we've been leveraging in order to capture the inputs to your figures. To create a figure, we're running some sort of code with, as you said, some inputs. So maybe we're looking at the electrostatic response of a sphere and you want to change it from a resistive sphere to a conductive sphere. That should be something that the user of this resource should be able to readily do. You can't do that with a textbook, obviously, but in this case if you've preserved that source code, then that's actually an entry point for people to actually take that single picture and start to be able to explore that.
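
A sketch of what "capturing the inputs" might look like for the sphere example. The function name and parameter names are assumptions for illustration. The formula is the standard electrostatic result for a sphere in a uniform field, evaluated along the field axis outside the sphere. In a Sphinx project, a script like this could be embedded with matplotlib's `plot` directive so the figure is regenerated from source; plotting is left out here to keep the example dependency-free.

```python
def secondary_potential(x, sigma_sphere, sigma_background, radius=1.0, e0=1.0):
    """Secondary potential on the x-axis outside a sphere in a uniform field.

    Standard result for a uniform field E0 along x and |x| > R:
        V_s = E0 * x * (R / |x|)**3 * (s1 - s0) / (s1 + 2 * s0)
    """
    factor = (sigma_sphere - sigma_background) / (
        sigma_sphere + 2.0 * sigma_background
    )
    return e0 * x * (radius / abs(x)) ** 3 * factor


# The inputs a reader might want to change are explicit arguments.
xs = [1.5, 2.0, 3.0]
conductive = [secondary_potential(x, 1.0, 0.01) for x in xs]
resistive = [secondary_potential(x, 0.0001, 0.01) for x in xs]

print(conductive)
print(resistive)
```

Switching from a conductive to a resistive sphere is a one-argument change, and the sign of the response flips.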

21:37 Michael Kennedy: It sounds like that's touching on one of the whole super significant things in science in general these days is the whole reproducibility thing, right? This just makes the whole paper, or the book, or whatever it is more reproducible if you can re-execute it to regenerate the output, right?

21:53 Lindsey Heagy: Well, Rowan and I gave a talk at JupyterCon last year and sort of touched on reproducibility. I think one of the things that's important to keep in mind too is like what is the point of reproducibility I guess. It is good practice to have your content be able to be regenerated, that's a good thing. The way that you actually build upon somebody's ideas in science is you take what they've done and then extend it. Also, by being able to at least capture all of the instructions to get to point A then somebody can immediately pick that up and start to play around with it and hopefully get to point B, which is then actually maybe some sort of new discovery that extends on that work.

22:37 Michael Kennedy: That's cool so you can actually literally build on the sort of algorithm and steps. So there's probably a lot of data exploration in Jupyter as well in this. Like you could copy a little bit from Jupyter, play around, put it back in reStructuredText, or can you load reStructuredText directly?

22:52 Lindsey Heagy: I don't actually know about that. I'm sure somebody's written something.

22:57 Michael Kennedy: It's got to be out there somewhere, right?

22:59 Lindsey Heagy: Yeah, yeah. We sort of have been developing Jupyter notebooks in parallel. We've been teaching a lot of courses actually where computation is like not taught at all, but we want to be able to have people play with figures. So we've been leveraging Jupyter and interactive widgets to basically wrap functions that compute things and give you plots to make that interactive. That's been really exciting to see that people can actually get up and running and they're running code, but they don't necessarily even need to know anything about Python, what is Jupyter, any of these things.
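
A minimal sketch of wrapping a function in widgets, assuming the third-party `ipywidgets` package and a Jupyter notebook. The `skin_depth` function is an illustrative stand-in chosen for this example, not one of the course notebooks; `make_interactive` is a hypothetical helper and is only meaningful when called inside a notebook.

```python
import math

MU_0 = 4.0e-7 * math.pi  # magnetic permeability of free space (H/m)


def skin_depth(frequency_hz=100.0, conductivity=0.01):
    """Electromagnetic skin depth in metres: sqrt(2 / (omega * mu0 * sigma))."""
    omega = 2.0 * math.pi * frequency_hz
    return math.sqrt(2.0 / (omega * MU_0 * conductivity))


def make_interactive():
    """Attach sliders to skin_depth; call this from a notebook cell."""
    # Imported here so the plain function works without ipywidgets installed.
    from ipywidgets import interact

    return interact(
        skin_depth,
        frequency_hz=(1.0, 1000.0, 1.0),    # (min, max, step) becomes a slider
        conductivity=(0.001, 1.0, 0.001),
    )


print(round(skin_depth(100.0, 0.01), 1))
```

The student moves sliders and sees the number change. The function underneath is ordinary Python that can still be tested on its own.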

23:31 Michael Kennedy: Right, right, right. Just inputs and pictures. Another thing you said is once you have this concept of reusable function, you can test it like in Travis CI or continuous integration, right?

23:42 Lindsey Heagy: This has been exciting to see. There's a few aspects to the testing. First off is like when we have code snippets and things like that in the textbook, we can test them. Same thing with all of the figures, we can test those. If there's API changes, or things like that down the road, we'll catch that so that the code always continues to work. I think we've all seen the case where somebody actually wrote a textbook and there's printed examples of code in there and there's inevitably a bug. There's nothing that can be done about that once it's been published. Then you just end up with generations of frustrated students who can't even get the code to run at the first stop. So here at least that's something that we can test. Another thing that's been kind of interesting too is you can go in and test links so making sure that all of the things that you are pointing to and all of the extra resources that you are connected to continue to be there, and if not, then you can go in and find something else relevant to point people to.
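
One way to test code snippets that appear in written material is Python's standard `doctest` module, sketched below. The `apparent_resistivity` function is a made-up example, and the interview does not say which tools the project uses. For links, Sphinx ships a `linkcheck` builder that serves a similar purpose.

```python
import doctest


def apparent_resistivity(voltage, current, geometric_factor):
    """Apparent resistivity from a DC measurement, in ohm-metres.

    The example below is what a reader would see printed in the text.

    >>> apparent_resistivity(voltage=0.5, current=2.0, geometric_factor=100.0)
    25.0
    """
    return geometric_factor * voltage / current


# Parse the docstring, run the example, and compare with the printed output.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    apparent_resistivity.__doc__,
    {"apparent_resistivity": apparent_resistivity},
    "apparent_resistivity",
    None,
    0,
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results.failed, "failed of", results.attempted)
```

Run on a continuous integration service, a check like this fails the build as soon as an API change makes a published example stop working.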

24:37 Michael Kennedy: Yeah, that's a really good point because you don't necessarily control all of the external resources that exist, right? You don't keep checking them?

24:42 Lindsey Heagy: Yeah.

24:44 Michael Kennedy: Yeah, yeah, really cool. So the functions, that's a pretty low-level concept in structuring code so the next level up would be classes maybe?

24:53 Lindsey Heagy: So it's starting to get to a bit more organization. So I mean a function, you've defined a piece that is reusable and now a class, we're going to try and define something that you can perhaps inherit and build upon. One of the things that we pointed to in this analogy is just looking at a given page structure. So when you're talking through a concept, there's a few obvious things that every page has. Like it has a title, it has contributors to that page. One of the things that we've been trying to promote is a purpose statement on each of the pages. Just to give like a high-level overview of why should I care about what is on this page?

25:28 Michael Kennedy: I think that's a really good idea because one of the major benefits of people writing say like unit tests against real, proper software code is you know when you're done. If you put out all the things it's supposed to do in the test and it does them, you can stop messing around because people can like fiddle with the code and think about what it might need in the future forever. There's this really clear this is what I've wanted to do, I've done it, now what's next, right? You obviously have the same problem in writing and I really love this idea of like we're going to give this a purpose and almost test it. You also said that this leads really well to a collaboration, right? Because people coming to it know what the purpose is. They all agree upon the purpose and it sort of helps communicate that.

26:13 Lindsey Heagy: Yeah, absolutely. We have multiple authors contributing content to one resource. Everybody's got a bit of a different writing style and all of those sorts of things and that's really fine. But it can lead to sort of a hodge-podgey resource. At the very least, you know what is going to be achieved in each page. It's so much easier to collaborate and give meaningful feedback as well. So if I am reviewing somebody else's page and I know what they're trying to accomplish and have maybe a couple ideas about some different examples that they could include to help achieve their goal, that's easy to then point them to, but if the purpose isn't clear from the outset and it's not immediately transparent, it's very hard to then give productive feedback.

26:55 Michael Kennedy: Yeah, for sure. It also helps with peer review. Simple peer review question is does this thing do what it says it does, right? Rather than is it good? Or, is it accurate or whatever? That's really hard to answer. Another thing is you say it leads to templates, right? So you could say this page is for a case study of this type and then that means it has this structure, which can really help with writing.

27:21 Lindsey Heagy: Yeah, and it helps solicit input from other authors as well. A case history, and how we've defined it is it's basically like an exploration, or geophysics example. So we walk through, you start with some sort of question, and then we're going to walk through what are the relevant, physical properties? So what are the different rock types or things like that that we're going to look for? How are we going to try and detect those? What did the data look like? And we go through and how do we process those data? Interpret the results and then did we actually answer our initial question? That's sort of like seven steps that we've broken all of these case histories into. Once you actually start having a few examples of that, we've been able to send these templates out to researchers around the world who have experience in different applications than we do. And just said "Hey, do you have a good example that you can put into this framework?" And once you layout the pieces, people are a lot more willing to go in and put their content there because you've just removed like all of that overhead of figuring out how should I structure this?
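
The page-as-class analogy and the seven-step case history template can be sketched as follows. The class names and the short section labels are assumptions made for this example; they paraphrase the steps described above and are not the project's actual template.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Things every page has: a title, a purpose statement, contributors."""

    title: str
    purpose: str
    contributors: list = field(default_factory=list)
    sections: dict = field(default_factory=dict)

    # Plain class attribute (no annotation), so it is not a dataclass field.
    required_sections = ()

    def missing_sections(self):
        """List the template sections that have not been written yet."""
        return [s for s in self.required_sections if not self.sections.get(s)]


@dataclass
class CaseHistory(Page):
    """Inherits the page basics and adds the seven-step structure."""

    required_sections = (
        "question",             # what are we trying to find out?
        "physical_properties",  # which rock properties matter?
        "survey",               # how will we detect them?
        "data",                 # what did the data look like?
        "processing",           # how were the data processed?
        "interpretation",       # what do the results mean?
        "synthesis",            # did we answer the question?
    )


draft = CaseHistory(
    title="Aquifer mapping",
    purpose="Show how geophysics can delineate an aquifer.",
    contributors=["A. Author"],
    sections={"question": "Where is the aquifer boundary?"},
)
todo = draft.missing_sections()
print(todo)
```

A contributor who receives this template knows exactly which six pieces are still missing, which is the overhead the template is meant to remove.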

28:24 Michael Kennedy: Right, how long should it be, what should I say, what's important, what's not? Yeah, it's super, super interesting. I had Jesse Davis on the show quite a while ago and he did something similar for blogging. He talked about how to write a good developer blog. He had come up with five design patterns for blog posts. What are your goals, then this pattern applies. It's just like once you know what you're trying to do and you have the pattern, you're like okay, these three steps, this is what I do. And then all of a sudden, the writer's block can be largely gone. It's much, much quicker. I really like this idea.

29:00 Lindsey Heagy: This is where we've seen definitely the most contribution coming in because it is an easy place to jump in once something's structured.

29:06 Michael Kennedy: Okay, so once you have functions and classes you might want to reuse them other places, that might be like import? Import statements?

29:12 Lindsey Heagy: Yes so this is where things get a little fuzzy, but we've played around with this analogy and one of the things that has been kind of exciting is we developed an equation bank and so one of the resources we've been working on is about electromagnetics. Maxwell's equations are going to show up all over the place. That's something that you don't want to repeat writing, especially when you have multiple people. We want to try and stick to the same notation conventions and all of that sort of stuff. Ideally, we don't want people rewriting these. So we've actually set up an equation bank and that's something that you can just include in your page, just include Maxwell's equations.
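
For reference, the kind of entry such an equation bank would hold is the set of time-domain Maxwell's equations, written here in one common notation. The symbols are an assumption (e for electric field, h for magnetic field, b for magnetic flux density, d for electric displacement, j for current density, \rho for charge density); the notation actually used in the equation bank is not given in the conversation.

```latex
\begin{align}
\nabla \times \vec{e} + \frac{\partial \vec{b}}{\partial t} &= 0 \\
\nabla \times \vec{h} - \frac{\partial \vec{d}}{\partial t} &= \vec{j} \\
\nabla \cdot \vec{b} &= 0 \\
\nabla \cdot \vec{d} &= \rho
\end{align}
```

Keeping a block like this in a single shared file is what lets every page include the same equations with the same notation.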

29:53 Michael Kennedy: That's cool and so if there's a mistake you fix it one place and it fixes everywhere.

29:57 Lindsey Heagy: Exactly, yeah.

29:57 Michael Kennedy: Nice. Let's see, you can also maybe think of links as external as a sort of thing you import, like an external resource that you depend upon.

30:07 Lindsey Heagy: Yeah, in a sense because what you're doing in that way is you got some word that's linked so maybe we linked the word to some sort of specific geophysical system. That is a containerized piece of knowledge that you can sort of bring in and expose to the user in a meaningful way that also is in context. So that's one piece that I think fits in to that analogy.

30:30 Michael Kennedy: Yeah, another one is once you start importing things you see structures and dependencies and then you could almost say like well, we should refactor this into this other form that's better once you see the large overall structure, that's pretty cool.

30:43 Lindsey Heagy: Because then you get to looking at ideas of like which concepts build upon each other and what concepts do you need to understand this given method, or another method, which is kind of cool to be able to actually like introspect the field.

30:56 Michael Kennedy: Yeah, that's pretty amazing and I'm sure there'll be some good visualizations at some point. And then at the very outer end, someone else wants to use the thing you've created so you have the pip and packaging analogy as well.

31:09 Lindsey Heagy: This is something that I would love to see this idea evolve a bit more, but I think at the basic level making it clear how people can use things. So applying a license, showing which concepts build upon others. So if you're importing this more advanced resource, what are the things that you should be familiar with before that? And then as well, versioning and all of that. Make sure that that's clear when you're changing things.

31:37 Michael Kennedy: This portion of Talk Python to Me is brought to you by Anvil. With Anvil, you can build full-stack web apps with nothing but Python. Building for the web is complex. You typically have to write JavaScript, HTML, CSS, some front-end framework like React, and then you've only done the front-end, you still have the server to write and then you have to decide where and how to deploy it. With Anvil, all you need to know is Python to build production ready apps and deploy and scale them with a single click. You have a visual designer for your page and you've got the entire Python ecosystem to integrate with. It even comes with a built-in database as a service. I've been using Anvil myself and I'm really excited how accessible it makes the web, even for people who are not excited about writing HTML. And if you happen to take my 100 Days of Code Course, you'll see near the end we actually spend a lot of time building a really cool web app with Anvil. I'll put that app in the show notes, but you can find it at pypoint-100days.anvilapp.net. Get started at talkpython.fm/Anvil and we'll throw in a 10% discount on an individual plan just for you Talk Python listeners! If you've been afraid of the web, go have a look. This is something special and they're doing really interesting things with Python. So that's the whole overall like conceptual way of thinking about the work that you guys are doing in education and writing in the software Python space, but you actually took a lot of the tooling literally from the software space, right? Things like git and continuous integration. What all did you use there?

33:08 Lindsey Heagy: Everything is hosted on GitHub so that's our peer review mechanism that does all the versioning for us, issue tracking and all of that. We have used Sphinx to actually build the pages and then I did mention that the matplotlib plug-in has been one that we're using to generate these reproducible figures. We started out hosting stuff on Read the Docs, but then the site got way too big. So we host it separately now. And then Travis CI for all of the testing pieces.

33:37 Michael Kennedy: It's not just the concept, you're actually applying a lot of these tools and techniques to it. It's pretty cool.

33:41 Lindsey Heagy: Oh yeah, and then Jupyter throughout as well.

33:43 Michael Kennedy: Yeah. I'm sure Jupyter is in there. Are you guys moving to Jupyter Lab these days, or are you sticking with Jupyter? What's the thought there?

33:52 Lindsey Heagy: I've dabbled in Jupyter Lab. I'm quite excited to start diving into it a bit more. I've just been with writing, I'm trying to wrap up the PhD, I'm hesitating diving into new and exciting tools because it's easy to lose track of time there.

34:05 Michael Kennedy: Yeah, yeah, I can imagine. Defense probably has top priority.

34:10 Lindsey Heagy: Yeah, at this point.

34:11 Michael Kennedy: Yeah, Jupyter Lab looks really cool. I haven't done anything with it, but I've kind of looked and said oh, this looks a little nicer than Jupyter. Maybe I should start learning this.

34:19 Lindsey Heagy: I've been excited to see some of the Markdown plug-ins and things like that that they've been working on and actually being able to execute and test code within Markdown. I think there's a lot of utility there for the writing that we've done to make that process a lot easier for contributors. So that I'm excited to start playing with.

34:39 Michael Kennedy: Yeah, I really like this idea of testing work and one of the things that I've seen as something of a detraction from the whole notebook way of working and I definitely see the exploration and flow benefits, but one of the things I see is less good is it's harder to say run tests over your Jupyter notebook, or do code coverage of the code in your Jupyter notebook as part of those tests and things like that. You actually tweeted out a really cool project that when I first saw it I didn't realize that its origins were in geophysics, but over at github.com/opengeophysics/testipynb is a thing that lets you unit test Jupyter notebooks, right?

35:26 Lindsey Heagy: We've just started this. This basically got pulled out of the GeoSci ecosystem because as a part of a lot of these courses we were distributing notebooks and there were a whole bunch of different people who were contributing to these notebook repositories and then we were deploying them either on Microsoft Azure or MyBinder, and then stepping up in front of a course and using them to teach. When you're in front of a classroom, you really don't want errors popping up, especially if you didn't write the original notebook.

35:54 Michael Kennedy: Yes, exactly.

35:57 Lindsey Heagy: Yeah so this was really born out of need to make sure that at least if you are standing in front of the class, the notebook should run. What we've been trying to do here is we've extracted a lot of the work that we've done using nbconvert, that just runs the notebook and makes sure that it completes with no errors. I think there's a lot that we can think about to increase the utility of this, but as a first pass like just making sure that the notebook goes from A to Z without any errors, that makes sure too that you properly defined all your dependencies and all of those pieces that are so easy to forget, especially to new contributors it's not always clear like if it works on my machine, why doesn't it work over here?
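
The "run it from A to Z and fail on any error" check described here can be sketched with the nbconvert command line. This is a minimal illustration and not the testipynb implementation; the helper names (find_notebooks, build_nbconvert_command, run_notebook) and the 600 second timeout are assumptions made for the example. It requires the jupyter command to be installed.

```python
import subprocess
from pathlib import Path


def find_notebooks(directory):
    """Return all .ipynb files under directory, skipping checkpoint copies."""
    return sorted(
        path
        for path in Path(directory).rglob("*.ipynb")
        if ".ipynb_checkpoints" not in path.parts
    )


def build_nbconvert_command(notebook, output_dir, timeout=600):
    """Build the command that executes one notebook from top to bottom."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        f"--ExecutePreprocessor.timeout={timeout}",
        "--output-dir", str(output_dir),
        str(notebook),
    ]


def run_notebook(notebook, output_dir, timeout=600):
    """Execute a notebook; return (passed, stderr) for reporting."""
    result = subprocess.run(
        build_nbconvert_command(notebook, output_dir, timeout),
        capture_output=True,
        text=True,
    )
    # nbconvert exits with a non-zero status when a cell raises an error
    return result.returncode == 0, result.stderr
```

A test runner could loop over find_notebooks("notebooks") and assert that run_notebook passes for each one. Running in a clean environment is what surfaces the missing-dependency problems mentioned above.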

36:37 Michael Kennedy: Yeah, exactly so it tests things that your Python environment has the dependencies and so on and stuff like that.

36:43 Lindsey Heagy: Yeah.

36:44 Michael Kennedy: Which that can be challenging in the whole data science, scientific computing base, right?

36:49 Lindsey Heagy: Another one of the reasons I wanted to have something like this out there is that we're sharing notebooks that go along with our publications and so a lot of them are built on SimPEG. We know that down the road we are going to make changes that are not backwards compatible and if you catch that right away, it's very easy to fix, but if you let the notebook lag by a year, it's really hard to then go in and maintain and upgrade that. Part of this too was just to be able to put like the Travis cron jobs that run once a month on our research notebooks and make sure that the research still continues to run.

37:25 Michael Kennedy: Yeah, that's a really good idea, just run it periodically and just go grab everything new and see if it works. Yeah, that's cool. So do you have any way to do more specific data result validation? So for example, if you have a cell, it would be great if it could convert it like to a function you could call with parameters and get the response out and things like that, right? Like not only does it still run, but actually it gives me the same results.

37:53 Lindsey Heagy: Yeah, I think that that would be super cool. I mean the simplest way right now to do that is to include like an assert statement that like downloads your archive data set and make sure that you can still reproduce that, but having something like that, a little more exposed on the outside I think would be quite neat. Same thing with even sort of checking figures. So make sure that your figure looks the same.
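
The assert-based check mentioned here could look roughly like the sketch below. It compares freshly computed values against archived ones within a tolerance. The function name assert_matches_archive and the sample numbers are hypothetical, and the download step is left out; in practice the archived values would be loaded from the data set published with the paper.

```python
import math


def assert_matches_archive(computed, archived, rel_tol=1e-6, abs_tol=0.0):
    """Raise AssertionError if computed values drift from the archived ones."""
    assert len(computed) == len(archived), "result length changed"
    for index, (new, old) in enumerate(zip(computed, archived)):
        assert math.isclose(new, old, rel_tol=rel_tol, abs_tol=abs_tol), (
            f"value {index} changed: archived {old!r}, computed {new!r}"
        )


# Hypothetical archived result and a fresh computation of the same quantity
archived_data = [1.25, 0.5, 0.125]
computed_data = [2.5 / 2, 1.0 / 2, 0.25 / 2]
assert_matches_archive(computed_data, archived_data)
```

Placed in the last cell of a notebook, a call like this turns "the notebook still runs" into "the notebook still reproduces the published numbers".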

38:14 Michael Kennedy: Right. Pixel by pixel just compare that the picture is the same. Of course, when matplotlib updates to have like slightly faded cool axis, you know it's going to break, but you could just say oh, no, no, this new picture is still okay for us, we'll just upgrade that. Or update the baseline. That would be nice, but it sounds like a lot of work, right?

38:39 Lindsey Heagy: Yeah, absolutely. But I think the biggest thing is just knowing when stuff changes. So even if you compared these two things and they are different, but you can visually tell that it's just a style update like that's fine.

38:50 Michael Kennedy: A friend of mine has this project called Approval Tests and I don't know its integration with Python, I know he was doing some there and it's all based on that idea. Like you write a test and you either get a result as a picture, or as a JSON document or whatever and you just go yeah, that looks good and then all subsequent tests just go is it the same? Like you don't have to do all sorts of testing, you just feed it two pictures and it goes they're the same or they're different. You can either re-approve the new one, or it's an error. That idea, I don't know, maybe somehow these can be put together, but it sounds cool.
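
The approve-then-compare idea can be sketched in a few lines. This is an illustration of the concept only and not the Approval Tests library API; the function name check_against_baseline and its return values are assumptions. It compares raw bytes, so for a saved figure any rendering change, including a harmless style update, is reported as different and has to be re-approved by a person.

```python
import hashlib
from pathlib import Path


def digest(data):
    """Return a SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def check_against_baseline(output, baseline_path, approve=False):
    """Compare output bytes with the approved baseline file.

    Returns "approved" when the baseline is (re)written on request,
    "missing" when nothing has been approved yet,
    "same" when the output matches, and "different" otherwise.
    """
    baseline_path = Path(baseline_path)
    if approve:
        # A person has looked at the output and accepted it
        baseline_path.write_bytes(output)
        return "approved"
    if not baseline_path.exists():
        return "missing"
    if digest(baseline_path.read_bytes()) == digest(output):
        return "same"
    return "different"
```

The output could be the bytes of a PNG written by a plotting call or a serialized JSON result; the check itself does not care which.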

39:21 Lindsey Heagy: Yeah, that'd be interesting.

39:22 Michael Kennedy: Yeah, yeah, nice. One thing that sort of comes to mind in this whole space is there's got to be a lot of data that you're collecting to get all these pictures and stuff, right? Like the Earth is big. What are some of the challenges around big data in geophysics?

39:42 Lindsey Heagy: In a lot of cases, we're not necessarily encountering sort of the same style of big data problems that you think about when you think of like social media. We don't have data sets that are that big, at least in our group.

39:51 Michael Kennedy: Not like CERN, for example?

39:53 Lindsey Heagy: No, not nearly like that. But a lot of what we're working with is small, disparate data sets and so we'll have collected a whole bunch of different types of geophysical surveys over one setting and now we want to try and integrate all those different data types and figure out okay, like what is this telling us in terms of the rocks? What does that mean in terms of the geology? So that's one aspect where we're starting to see machine learning coming in in a very powerful way is actually trying to either take data that have been interpreted independently and then try and sort of merge those interpretations to then give you something stronger in terms of like I think this is rock A and this is rock B, rather than this one is magnetically susceptible, this one is conductive.

40:36 Michael Kennedy: I see, more like trying to draw the proper conclusions from the raw data.

40:41 Lindsey Heagy: Yeah so really trying to drive much more so at the geologic interpretation.

40:46 Michael Kennedy: Another thing I was wondering is how is machine learning being used there? Because with these pictures and a lot of this data it seems like somebody could come along and make some really interesting uses of TensorFlow, or PyTorch, or something like that.

41:01 Lindsey Heagy: Yeah, I mean I haven't seen a ton of neural network work yet, but that's also just one sample point. Where I have seen a fair bit of work done is on the clustering side of things. So really trying to either cluster interpretations or one of the things that I think is really kind of cool is when we start to meld together deterministic inversions so that's a lot of the work that we've done in the past where you're running some sort of optimization problem to fit your data, but then connecting that with statistical or machine learning approaches where you say okay, I got some sort of geologic knowledge about these rocks and how do I now couple that to my physics? So that's one area where I think there's a lot of potential.

41:41 Michael Kennedy: Yeah, we're just at the beginning of all this machine learning stuff, right? Who knows what it'll look like in 20 years.

41:47 Lindsey Heagy: Yeah, yeah.

41:47 Michael Kennedy: Yeah. It'll be wild. So it sounds like you do a fair amount of programming around your research projects and stuff and I know what the world looks like as a full on software developer, right? You write a lot of code and things like that. But how do you think about balancing programming and working on some of the libraries, versus say research, versus writing? I know right now it's writing because of the time frame and all.

42:15 Lindsey Heagy: That's the question of any academic is how do you balance all of these things? So if anybody knows I would love to know. But I think in a lot of ways what's been powerful with the group that I'm in is that a lot of this is collaborative so that there are pieces that you're working on together in each of these aspects because as an academic you do need to write about the things that you've been doing so you need to do things and that involves programming and research and then you need to share that with the world and that's writing.

42:46 Michael Kennedy: Yeah, and I think having this opensource angle to it, the collaboration really helps, right? Because people can do some of the programming where they have the specialty, right?

42:54 Lindsey Heagy: Absolutely, yeah. And there's less duplication of efforts, especially on a lot of those like more mundane tasks of just parsing data files and things like that because there's no need for everybody to be writing that, but you all need it.

43:06 Michael Kennedy: Somebody gave me this thing and I need a lot of this program, or generate idea, no, we'll just import this thing and we'll read it and we'll be good.

43:15 Lindsey Heagy: Yeah.

43:15 Michael Kennedy: So another thing that is major in that space is getting sort of citations and credits, right? Those are sort of your up-votes for your career in a sense.

43:25 Lindsey Heagy: Yeah.

43:26 Michael Kennedy: Are you familiar with the Journal of Opensource Software?

43:28 Lindsey Heagy: Yes, I am. I'm actually the geo-science editor.

43:31 Michael Kennedy: You are? I didn't realize that.

43:32 Lindsey Heagy: Yeah.

43:33 Michael Kennedy: Okay, that's awesome. Yeah, so what are your thoughts? Obviously you must be a supporter if you're one of the editors, right? But just maybe tell people really quickly what it is if they didn't listen to my show a while ago with Arfon.

43:46 Lindsey Heagy: So JOSS is the Journal of Opensource Software. It's really a developer-friendly journal so in that sense the biggest thing is you're actually getting peer review much more so on your code development and practices. There's a short paper, it's like one to two pages, so it's meant to be pretty light weight. That's supposed to be a couple hours work. What we're really evaluating is the software. So I'm obviously very supportive of this. I think it's fantastic to be getting credit tied basically immediately to the software because I think that is a big piece that is missing in academia. One of the things I've been really sort of stunned by, I've been with JOSS now for almost a year, is just what a positive process it is. I send out requests for reviewers and people not only say yes, they're very enthusiastic to jump in and learn about projects. I've seen people do small pull requests to fix typos, or fix small things. They're writing very well thought out issues and then the authors are so grateful to have somebody with a fresh set of eyes coming in and looking at what they've done. So it's just been such an amazingly positive process in like peer review, which is not something I've really seen elsewhere.

45:02 Michael Kennedy: Normally it's like oh, I don't really want to do this. No, I think it's a really cool project and I definitely wanted to give a shout out to it because I think it really ties in well with the way you're thinking about the writing and structuring it in terms of software and opensource. I think this is just the perfect complement to it.

45:19 Lindsey Heagy: Related to that, JOSE, the Journal of Opensource Education, is starting up very soon. In fact, they might actually be accepting submissions. It's parallel and I think there's a lot of interesting things sort of from the GeoSci perspective where we're now thinking about developing open education resources and they'll be accepting submissions like that. So a few avenues.

45:40 Michael Kennedy: Yeah, that's really really cool. I definitely think the work that you are doing, even though it sort of found its roots in geophysics, I think it could equally apply to biology, or chemistry, or lots of things.

45:51 Lindsey Heagy: Hopefully. Yeah, that's the goal.

45:53 Michael Kennedy: Hopefully with a slight adaptation.

45:54 Lindsey Heagy: Yeah.

45:56 Michael Kennedy: Another thing that I ran across doing some research for the show is EarthPy at earthpy.org. So over at EarthPy it's a collection of IPython notebooks with examples of Earth science and the related Python code. If people are listening and they're into geo-science of some sort, maybe there's some really interesting things to draw from there.

46:18 Lindsey Heagy: Oh cool, I had not seen that before so I will also be checking that out.

46:22 Michael Kennedy: Yeah, that's really quite neat. What's the IoT story in all of your research? I can imagine a bunch of little sensors planted around. I don't know if you were involved with any of that.

46:36 Lindsey Heagy: I'm not involved in any of that, but there's definitely things coming. There's a lot of interest in looking at the use of drones for smaller scale surveys. I think actually, too, looking at more of the precision farming and precision viticulture for vineyards, I think there's a lot of potential on those smaller scales for bringing in sensors that are giving you much more real-time feedback and you can make decisions about which areas need to be irrigated or not based on what you're seeing.

47:05 Michael Kennedy: You could get that information like at 2:00 PM it needs to be irrigated, not at 3:00, right?

47:10 Lindsey Heagy: Yeah, exactly.

47:10 Michael Kennedy: Like super detailed. With the advent of such cheap, connected devices and the ability to program them so easily with Python. Things like MicroPython and like five dollar microchips that you can set up, I don't know how long you can run those on battery, but they got to last a long time. It just seems like you could do such great stuff with that.

47:35 Lindsey Heagy: Yeah, and this is something I'm going to be very curious to watch and see what happens over the next few years because I think a lot of these ideas are just starting to be worked out and people are figuring out how to make geophysical sensors a lot smaller because that's been a problem for a long time is in a lot of cases, they're big. People are making progress. It totally depends on the survey. If you're trying to go out and do a gravity survey, it's a very big, slow process because you need to go out, you need to level it, it needs to stay level. If the wind picks up, that's a problem. That's a challenging one. But looking at magnetometers and things like that, people can now start to make them only a few centimeters so that's then something that's readily adapted to an IoT application.

48:15 Michael Kennedy: That sounds quite cool. All right well, I think maybe we'll leave it here for our geophysics talk. Of course I have two more questions for you before you go. If you're going to write some Python code, what editor do you use, or editors if you use more than one?

48:31 Lindsey Heagy: Well it would be Jupyter and Sublime is the combination I use.

48:33 Michael Kennedy: Nice. Yeah, I figured Jupyter must be in there. Although, maybe Jupyter Lab someday, I don't know. We'll see.

48:39 Lindsey Heagy: Yes, absolutely.

48:40 Michael Kennedy: Nice. And then notable PyPI package? We definitely have SimPEG out there which people can pip install SimPEG and it's really cool that it's that easy, right?

48:51 Lindsey Heagy: Yeah.

48:53 Michael Kennedy: What else is like really people maybe haven't heard of that would be great?

48:56 Lindsey Heagy: I guess there's two. There's discretize, which SimPEG is built on and that does a lot of finite volume meshing and all that. We got octree and tensor meshes and cylindrical meshes. That's an interesting package if you want to simulate PDEs. One that I'm very excited about is called properties and it basically does strong typing validation, serialization, and all of that in Python, but it does a really good job of helping you design an API that's meant to be interactive. So in the sense of Jupyter, you want to design an explorable API, properties is a great package to help you with that.
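
To make "strong typing validation and serialization" concrete, here is a standard-library sketch of the general idea using a Python descriptor. It is not the properties package API; BoundedFloat, Layer, and the attribute names are hypothetical and exist only for this illustration.

```python
class BoundedFloat:
    """Descriptor that validates a numeric attribute on assignment."""

    def __init__(self, doc, minimum=None, maximum=None):
        self.doc = doc
        self.minimum = minimum
        self.maximum = maximum

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__.get(self.name)

    def __set__(self, instance, value):
        # Reject non-numeric input early, with a message that names the attribute
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(
                f"{self.name} must be a number, got {type(value).__name__}"
            )
        value = float(value)
        if self.minimum is not None and value < self.minimum:
            raise ValueError(f"{self.name} must be >= {self.minimum}")
        if self.maximum is not None and value > self.maximum:
            raise ValueError(f"{self.name} must be <= {self.maximum}")
        instance.__dict__[self.name] = value


class Layer:
    """Hypothetical model of one geologic layer."""

    conductivity = BoundedFloat("electrical conductivity in S/m", minimum=0.0)
    thickness = BoundedFloat("thickness in m", minimum=0.0)

    def __init__(self, conductivity, thickness):
        self.conductivity = conductivity
        self.thickness = thickness

    def serialize(self):
        """Return a plain dictionary that can be written out as JSON."""
        return {"conductivity": self.conductivity, "thickness": self.thickness}
```

In a notebook, a mistaken assignment such as a negative conductivity fails immediately with a readable message instead of surfacing later inside a simulation, which is the interactive, explorable quality described above.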

49:30 Michael Kennedy: Properties. It sounds really cool and I've never heard of that. That's great. That's why I ask this question. That's awesome. Okay so final call to action, maybe there's educators or scientists out there who would like to adapt some of the work you've done, what would you say to them?

49:46 Lindsey Heagy: Yeah, I mean I think that if they have ideas and want to contribute directly to any of these resources we are always keen to have more contributors. I think, too, that trying to build in a way that promotes community and build in a way that you can invite others to go in and contribute is such a powerful thing to keep in mind because then you yourself don't have to do all of the work and the end product that comes out is going to be better than any one person could've done. Sort of the parallel to that then too is if you see something that you have ideas about how to contribute to, get involved and get in touch. People are always looking for contributors and everybody has something to add even if you don't know how they are approaching it or where to even start.

50:28 Michael Kennedy: Yeah it doesn't have to be a major rewrite of some system. First contributions can be much, much smaller, just a little thing you can fix, right?

50:36 Lindsey Heagy: Yeah, absolutely.

50:36 Michael Kennedy: Absolutely. All right, Lindsey, thank you so much for being on the show and sharing what you're up to, it's been really fun to chat.

50:42 Lindsey Heagy: Thanks, Michael, it's been good.

50:43 Michael Kennedy: Yeah, and good luck on the dissertation defense.

50:45 Lindsey Heagy: Thank you very much.

50:46 Michael Kennedy: Yup, bye.

50:46 Lindsey Heagy: Bye.

50:49 Michael Kennedy: This has been another episode of Talk Python to Me. Our guest for this episode has been Lindsey Heagy and it's brought to you by MongoDB and Anvil. Want to see how web apps are built with Python and MongoDB? Register for my webinar I'm doing with MongoDB over at mongodb.com/webinar/python. See you there. Anvil lets you build your web apps quickly and easily. With Anvil you can get your web app in Python up and running in hours, not weeks. Want to level up your Python? If you're just getting started try my Python Jumpstart by Building 10 Apps, or our brand new 100 Days of Code in Python. If you're interested in more than one course, be sure to try out the everything bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite pod-catcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, Google Play feed at /play and direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening, I really appreciate it. Now get out there and write some Python code.
