#134: Python in Climate Science Transcript

Recorded on Monday, Oct 16, 2017.

00:00 Michael Kennedy: What's the biggest challenge facing human civilization right now? Fake news, poverty, hunger, oppression? Yes, all of these are huge problems right now, but if climate change kicks in you can bet that it'll amplify these problems and many more. That's why it's critical that we get answers and fundamental models to help understand where we are, where we're going and how we can improve things. On this episode you'll meet Dr. Damien Irving, a climate science researcher using Python to understand what the climate models are telling us. This is Talk Python To Me, Episode 134, recorded October 16th, 2017. Welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython. Hey everyone, I'm super excited to bring you this Climate Science/Climate Change Episode. But before we do I want to just catch you up really quick on my free MongoDB course that I talked about last week. If you're looking to learn MongoDB and especially with Python, check out freemongodbcourse.com. Just last week over 5,000 people signed up and really enjoyed it. So drop by the website, sign up and check it out. Now let's talk with Damien about climate research and Python. Damien, welcome to Talk Python.

01:33 Damien Irving: Thanks for having me, very happy to be on.

01:34 Michael Kennedy: Yeah, I'm really excited about our topic that we've got lined up today, Python and climate science. And I think there are just so many aspects to that, but we're going to cover a bunch of them, right? A bunch of different things: the programming, the problems you're trying to solve, the education problems in terms of educating data scientists and much more than that, right?

01:54 Damien Irving: Yeah, absolutely, hopefully we can have a wide ranging chat.

01:57 Michael Kennedy: Yeah, I think that's the way we should definitely do this. But before we get to that, let's start with your story, how'd you get into programming in Python?

02:04 Damien Irving: I did my undergraduate degree in science, majoring in meteorology, way back in 2008. But that actually involved very little programming, which sounds shocking but is actually pretty common for people who study science. So I think the only programming course I did was a short summer intensive on FORTRAN in the engineering department. That actually put me ahead of the curve and ahead of most undergraduate scientists and so I picked up a summer job.

02:30 Michael Kennedy: Yeah, that's amazing.

02:31 Damien Irving: Yeah, it's scary to think back. Yeah, so I was able to pick up a job off the back of that basically doing some work for an assistant professor in the meteorology department after I finished my undergraduate degree and he kind of sat me down in front of a command prompt, pointed me towards Ferret, which is a scripting language used in oceanography, and kind of left me to it. And I bumbled my way through that.

02:53 Michael Kennedy: Nice, was that actually in FORTRAN or what language was that in?

02:56 Damien Irving: No, so Ferret is kind of a stand-alone language I guess just for doing basic data analysis in oceanography. So it's a fairly limited language, but it's good if you're dealing with oceanographic data. But yeah, I kind of bumbled my way through that and then bumbled my way through using FORTRAN and things like that in an honors degree, which is kind of a year-long research project you do after your undergraduate in Australia. Yeah, that was my kind of introduction to programming and it's kind of a pretty typical scientist experience, kind of self-taught, no formal training in programming.

03:26 Michael Kennedy: Right, here's a problem to solve, we think programming could help, here's some tools, have at it right?

03:30 Damien Irving: Yeah, go, but then actually after my honors year I was lucky enough to get a job at CSIRO which is a national science organization in Australia and it was kind of a half research, half support scientist role. In my support science role I got to spend a lot of time with IT people who basically introduced me to Python and that's kind of where my path kind of diverged from typical, I got to spend a lot more time with IT nerds than your typical scientist would so I was kind of really lucky in that respect and they pushed me onto Python, thank goodness.

04:00 Michael Kennedy: Yeah, what is this language, nobody knows what you're doing, go learn Python.

04:04 Damien Irving: Yeah.

04:05 Michael Kennedy: Nice and so nowadays, what do you do?

04:07 Damien Irving: So today, nowadays I'm a post-doctoral research fellow at CSIRO, so I'm back there after my PhD which is nice.

04:13 Michael Kennedy: That's cool, what was your PhD in?

04:15 Damien Irving: My PhD was in waves in the upper atmosphere and how they affect the weather down at the surface.

04:20 Michael Kennedy: Okay, didn't even know there are waves in the upper atmosphere, that's awesome.

04:24 Damien Irving: Yeah, it's kind of, I mean the atmosphere is kind of a fluid, I guess you don't think of it like a fluid like you would an ocean, but it flows the same way as a fluid and stuff so there are waves just like if you drop a stone into a pond they ripple and things like that. So looking at those waves in the atmosphere.

04:41 Michael Kennedy: Okay, very cool.

04:42 Damien Irving: My work right now involves looking at climate model data. There's probably about 40 or so modeling groups around the world that have computer climate models and periodically, every five or 10 years or so, they all perform sets of common experiments, so they'll have worlds with humans in them and worlds without humans in them and worlds with different levels of greenhouse gases and all those things. They'll make all that data available and the entire data set is multiple petabytes in size. Individual researchers like me typically only require, you know, a subset of the data set. So all that data in Australia is at a national computing facility and rather than have people download that data to their own institution, which would obviously be impracticable, they build a lot of, I guess, analysis infrastructure on top of the data so that you log in remotely.

05:32 Michael Kennedy: I think that's cool. So there's so much data that it's in one place and you send the code and the analysis to the data rather than the other way, huh?

05:39 Damien Irving: Yeah, so they have all the analysis infrastructure on top of the data so you don't have to move it anywhere. So I live in Hobart, Australia but I spend my days on a computer that sits on top of all the data in Canberra so that's kind of what I spend my days doing.

05:51 Michael Kennedy: That's awesome. That sounds really fun, that's an incredible amount of data. How much of what you work with is it simulation and how much is it analyzing observational type data?

06:01 Damien Irving: For me it's pretty much all simulation, so it's all scenarios of the future world to look at climate change or it's kind of reruns of the past 150 years with various elements taken out. So humans might be removed or different aspects to kind of figure out what's caused what, if you like, in the observed climate.

06:21 Michael Kennedy: I see. What if we never ate beef, or something like this right, never raised cattle?

06:25 Damien Irving: Yeah, you can have models that have different kind of treatments of what's happening at the surface in terms of vegetation and yeah, livestock and stuff like that, yeah, farming practices. So I deal almost exclusively in fictitious worlds but there are plenty of people who deal with actual observations.

06:42 Michael Kennedy: Yeah, and I'm sure that there's a really important place for both. So maybe that's a good place to start. Let's talk about the types of problems that you guys are trying to solve with Python. Maybe some of the modeling and how that goes, things like this.

06:55 Damien Irving: I guess most problems that climate scientists are trying to solve when they're using Python is that someone's run one of these models, either a fictitious kind of world like I'm looking at or ones where a whole bunch of observations have been thrown in and the model is used to fill the gaps, basically. So obviously we don't observe temperature and rainfall and all those things everywhere. So models can be used to kind of fill in the gaps between the observations, using the observations as kind of a ground truth. So in a way, whichever type of model you're using it'll output large, multidimensional arrays of data, so they'll have a time axis, a latitude axis, a longitude axis and a depth or a height, altitude axis. So large, multidimensional arrays and then it's basically trying to draw insights from that data about how the climate system works really. So a lot of kind of time-series analysis on those large arrays, other statistical analysis and just a lot of the time it's just the fundamental task of actually visualizing what the model is simulating. So actually visualizing the temperature data or the humidity data or whatever output it is, just what is this model doing? And a lot of the challenge is just seeing it.
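
To make the shape of such data concrete, here is a minimal sketch of a four-dimensional array with time, depth, latitude and longitude axes, and two of the reductions a time-series analysis might start from. The grid sizes and the synthetic "temperature" values are assumptions for illustration only; real model output is far larger and would normally be read from a file rather than generated.

```python
import numpy as np

# Hypothetical grid sizes: 12 monthly time steps, 3 depth levels,
# 4 latitudes and 8 longitudes.
n_time, n_depth, n_lat, n_lon = 12, 3, 4, 8

rng = np.random.default_rng(seed=0)
# Synthetic "temperature" values in degrees Celsius, not real model data.
temperature = 15.0 + rng.standard_normal((n_time, n_depth, n_lat, n_lon))

# Average over latitude and longitude (axes 2 and 3) to get one time
# series per depth level. This is an unweighted mean; a real analysis
# would usually weight each grid cell by its area.
spatial_mean = temperature.mean(axis=(2, 3))

# Average over time (axis 0) to get a time-mean field.
time_mean = temperature.mean(axis=0)

# Anomaly: departure of every time step from the time mean.
# Broadcasting lines up the trailing (depth, lat, lon) axes.
anomaly = temperature - time_mean

print(spatial_mean.shape)
print(time_mean.shape)
```

Keeping track of which integer axis means what is exactly the bookkeeping that labeled-array libraries take off your hands.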

08:06 Michael Kennedy: Right, I can imagine right, like once you've got so much data just a great way to visualize it so you actually understand because that's a lot of axes.

08:14 Damien Irving: Yeah, it seems like a simple thing but yeah actually just seeing what a model is simulating is quite a challenge at times.

08:21 Michael Kennedy: Yeah, so do you feel like there's a lot of Pandas, NumPy, Matplotlib, Jupyter notebook, those types of things?

08:27 Damien Irving: Yeah, definitely NumPy or libraries built on top of that for multidimensional arrays. But Pandas and stuff would be probably more common among, I guess, a lot of scientists are like me and have model data that comes out on these nice grids and then you have I guess the other scientists who it's kind of their more GIS problems, they've been out on a research vessel and they've dropped a temperature probe down at one point and then 100 nautical miles later they drop another one and then another one so then they have this kind of unstructured spatial data and it becomes much more of a GIS Pandas kind of problem.

09:01 Michael Kennedy: Right, cause when you simulate the data you don't have to clean it up as much.

09:04 Damien Irving: No, it becomes fairly nice and the grid is, you know all the latitudes and longitudes are perfectly spaced and it's this very nice, regular grid. Whereas if you're more of an observational scientist and you're just taking observations wherever you can you have to clean it, you have to get it on a grid that's a bit nicer, so I thank my lucky stars some days that I'm not in that space, but they don't have the ridiculous amount of data that I've got to deal with so maybe there's some nice aspects to it too.

09:32 Michael Kennedy: Yeah, of course. Of course, that's pretty interesting. So for the simulations, I suspect that you have Python in action there, but is it all in Python? Or is it C++ or what's the mix there?

09:46 Damien Irving: Most climate models themselves are written in FORTRAN so mostly you're using Python for the post-processing. So analyzing the data once it's come out of the model. And I guess one reason most of them are in FORTRAN is kind of legacy issues, in that most climate models out there have been developed for decades, so when they started FORTRAN was the only thing available. And the other one I guess is, well, I guess from a speed point of view, you know with Cython and things like that these days you probably could write a climate model in Python, and I think some very simple models written in Python do exist, but in general the models are written in FORTRAN and the post-processing data analysis is done with Python.

10:26 Michael Kennedy: Right, okay that's pretty interesting. It scares me a little to think of that much FORTRAN.

10:31 Damien Irving: Oh, yeah, yeah.

10:33 Michael Kennedy: But that's okay. yeah, when I started in, I studied engineering for awhile and they said the most important language you're going to learn is FORTRAN and I said please let me take C++, they said, nope, you have to learn FORTRAN then you go do these elective courses if you want. I'm like, okay, I see how it's going to be. But it sounds like it really might be the case in climate science, that's cool. So you talked about like tremendous amounts of data probably parallelism is like an important thing there right?

10:58 Damien Irving: A lot of the problems are embarrassingly parallel in that you want to just do the same process on every single latitude point or something like that, so there's no kind of talking between analysis points, or you want to do the same process across lots of different climate models. So yeah, some people will write, I guess, parallel code explicitly, if you like, using the multiprocessing library or something like that, or increasingly I guess it's kind of built into the package they're using. So Dask I'm sure is one that's been talked about on previous episodes of your podcast, but that's kind of built into xarray, which is kind of a build-on of Pandas for multidimensional arrays, and so it kind of chunks your problem and does all this parallel scheduling for you. So some people are doing that, but sometimes I feel like it's a little bit like everyone talks about big data but most people aren't actually in that space with the analysis they do. And it's a little bit the same, I think most people are still in the medium data space, where even if they're dealing with a big data set, they're only interested in a smaller subset. And you can kind of get around the need for writing parallel code by just using the NumPy functions that allow you to vectorize your problem rather than looping, and a lot of the packages in climate science kind of do lazy loading, where you can load all the metadata about the data first and have a look at what's in there and then you can just actually load the data that you want instead of the whole thing at once.
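
The point about vectorizing instead of looping can be sketched as follows. Both functions compute the standard deviation in time at every grid point; the array shape and the synthetic data are assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
# Synthetic data with hypothetical dimensions (time, lat, lon).
data = rng.standard_normal((120, 10, 20))


def std_by_looping(arr):
    """Standard deviation in time at each grid point, using explicit loops."""
    n_time, n_lat, n_lon = arr.shape
    result = np.empty((n_lat, n_lon))
    for i in range(n_lat):
        for j in range(n_lon):
            result[i, j] = arr[:, i, j].std()
    return result


def std_vectorized(arr):
    """Same calculation in a single call along the time axis."""
    return arr.std(axis=0)


looped = std_by_looping(data)
vectorized = std_vectorized(data)

# The two approaches agree to floating-point tolerance.
print(np.allclose(looped, vectorized))
```

The vectorized form hands the per-grid-point loop to NumPy's compiled code, which is often fast enough that explicit parallel code is not needed for a medium-sized subset.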

12:25 Michael Kennedy: Okay, that's sounds pretty interesting. Do you leverage GPUs, running on GPUs for computational stuff any or is it mostly just straight CPU?

12:35 Damien Irving: I think I've now spoken to people who are using GPUs as well as CPUs as well, so I know it can be done, I haven't personally. But yeah, some people do, yeah.

12:44 Michael Kennedy: Okay, that's cool. There's probably even some computational bits you can plug in that, like, below what you're actually doing could harness the power of GPUs maybe. So all this data, is this like in a giant database, is it in a bunch of CSVs? Where do you get your data from?

13:00 Damien Irving: Most of the output from climate models these days goes into a file format called NetCDF, I think Network Common Data Form must be what it stands for. So it's a subset of HDF5 so it's a self-describing format so it carries all its metadata with it, which is really cool.

13:19 Michael Kennedy: Right.

13:19 Damien Irving: And then in weather, climate, ocean science there's a whole set of CF conventions they're called, so climate forecast conventions, around what metadata you use in the files. So how exactly do you describe the time axis, do you say days since YYYY/MM/DD? So all these things, so that then when people build libraries in weather and climate science they can assume this basic level of metadata compliance in the files, which means that they can write functions that really speed up your analysis 'cause they can assume certain things about the data. So NetCDF in this kind of CF compliant metadata format is ubiquitous for weather, ocean, climate science now.
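
As an illustration of why an agreed description of the time axis helps, here is a minimal sketch that decodes numeric time values from a units string of the form "days since <reference date>". The function name `decode_cf_time` is hypothetical, and the sketch makes simplifying assumptions: it handles only days, hours, minutes and seconds, expects an ISO-style reference date written with hyphens, and ignores the non-standard calendars that climate models often use. Libraries built for this data handle those cases properly.

```python
from datetime import datetime, timedelta

SUPPORTED_UNITS = ("days", "hours", "minutes", "seconds")


def decode_cf_time(values, units):
    """Convert numeric offsets to datetimes for units like
    'days since 1850-01-01' (simplified, standard calendar only)."""
    unit, since, reference = units.split(maxsplit=2)
    if since != "since" or unit not in SUPPORTED_UNITS:
        raise ValueError(f"unsupported time units: {units!r}")
    origin = datetime.fromisoformat(reference)
    # timedelta accepts the unit name as a keyword, e.g. timedelta(days=31).
    return [origin + timedelta(**{unit: value}) for value in values]


# Hypothetical time values, as they might be stored in a file.
times = decode_cf_time([0, 31, 59.5], "days since 1850-01-01")
for t in times:
    print(t.isoformat())
```

Because every compliant file describes time the same way, a library can do this decoding automatically instead of each researcher writing it by hand.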

14:05 Michael Kennedy: All right, that's interesting. So it sounds like maybe if you've got this kind of format, like a bunch of different libraries and packages can read and understand it. So what are the major packages that you might use in Python to work with climate science?

14:17 Damien Irving: Probably these days, first off you would be installing your environment using Conda for sure. So a big headache used to be just getting all the non Python dependencies installed so obviously there's a lot of our libraries that are C underneath or there's NetCDF libraries that have to be installed and things like that and that used to be a complete nightmare and Conda has absolutely been a game changer. So yeah, people are installing their stuff with Conda.

14:42 Michael Kennedy: Yeah, and Conda gives you, basically a virtual environment but it delivers the packages pre-compiled in a binary form for your OS, rather than via source like pip or something like that right?

14:53 Damien Irving: But it used to be that if you wanted to use a particular library that had C dependencies or something you had to figure out how to install those C dependencies yourself.

15:01 Michael Kennedy: Right.

15:02 Damien Irving: As well as installing it, but now it's just you go conda install, one line, it's all done, it's amazing. But yeah, so basically in terms of the main libraries that get used, so I mentioned xarray before, so basically that library takes Pandas, which is obviously that labeled array concept for two-dimensional arrays, and then expands it for multidimensional arrays. It was actually a climate scientist, Stephan Hoyer, who initially wrote it. And now it's kind of been taken up by the broader PyData community. So that one is very popular and then the other one is the Met Office in the UK have written one called Iris, which is similar to xarray, except that xarray will let your files not be so kind of NetCDF metadata compliant, so if you have files from a project that isn't very strict on their metadata in their NetCDF files it'll work just fine. Whereas Iris really demands those things of you. For me, I'm using data from projects that are very good about the metadata and stuff and so I use Iris, because it is so strict about the metadata in the input files you can do tasks slightly faster, if you like, with fewer commands, 'cause it can make more assumptions about the input data. Yeah, so I use Iris, but--

16:20 Michael Kennedy: If you're going to draw results you probably want to have some strictness on data you're basing those results on right?

16:24 Damien Irving: Some people find it overly restrictive in that they want some files that don't have standard metadata type things so they'll go the xarray route, but basically if you're a climate scientist you're kind of choosing between Iris or xarray for the bulk of your work. So for input/output in NetCDF files for calculating basic statistical quantities and basic visualization stuff like that, they're the core libraries you'll leverage off.

16:52 Michael Kennedy: This portion of Talk Python To Me is brought to you by ParkMyCloud, every second your cloud servers are running is costing you money, cut your monthly cloud spend and stop paying for idle instances and VMs with ParkMyCloud. A cloud cost management tool that turns off resources when you don't need them, from their dashboard automatically schedule your instances to be turned on or off, saving you as much as 65% or more on your cloud spend. Manage databases, auto scaling groups, set up logical groups of servers to turn off during nights and the weekends when they're not in use. Whether using AWS, Azure or Google Cloud it's easy to save money with ParkMyCloud. Try ParkMyCloud and see why it's chosen by McDonalds, Capital One, Unilever, Fox and more saving customers tens-of-thousands of dollars every month. Visit talkpython.fm/park and cut the cost of your cloud today. That's talkpython.fm/park. It seems to me like the whole Python data science space in the last five, 10, definitely five years, has gotten really polished and really blown up. Probably it was a little harder to work with Python and stuff in the beginning right?

17:59 Damien Irving: When I started, yeah. Obviously Conda didn't exist, Iris and xarray weren't there. There was something a little bit like those that existed. Even these days they're taking it a step forward and I'm not sure if you've heard of the GeoViews or HoloViews library for visualization in Python?

18:16 Michael Kennedy: Yeah, I don't think I've heard of that one, tell us about it.

18:18 Damien Irving: So basically I guess it's getting into that kind of declarative visualization space. The idea being that for data exploration you don't want to spend all your time kind of tweaking the axes of a plot and deciding on exactly which type of plot you want to create to look at this particular data. You basically just want to tell the library the characteristics of your data and have it decide what the best axes would be, what the best map projection would be and all those things. So there's kind of two major libraries getting into that space. One's called HoloViews and the other one's called Altair. So HoloViews is much more established I guess. And it can kind of use Matplotlib or Bokeh under the hood depending on whether you want like a static image or an interactive image.

19:02 Michael Kennedy: Right, that sounds great, and you could maybe almost publish it to the web, real easy with Bokeh yeah?

19:06 Damien Irving: Yeah, so with HoloViews you can definitely do that, but anyway so HoloViews doesn't have support for geographic plots of the type on a world map of whatever kind of map projection you'd like which is a common thing in climate science. So the Met Office again, have developed GeoViews on top of HoloViews and so basically the idea being that you just throw your data at it. You give it a description of the basic characteristics of your data and it figures out the rest. And you can have a static image or you can have an interactive image that you publish to the web. This is really kind of taking down the barrier to kind of what I was talking about before, just that task of visualizing the data you've got, quickly and easily--

19:44 Michael Kennedy: Right. That's awesome, the more the system can do it automatically, just look at and go well, it looks like the axis should be this and so I'll do this. Like that's really great.

19:51 Damien Irving: These types of things, they're just game changing in terms of the amount of analysis you can get done. You know you're not spending your days mucking around with axes and map projections, it just happens, and what used to take weeks now takes a couple of hours.

20:04 Michael Kennedy: Oh, that's awesome. So either now or going into the future how do you see machine learning and AI starting to work its way into analyzing climate data and making predictions and stuff?

20:16 Damien Irving: There is some active stuff happening in that space. I was at a conference last year at Lawrence Berkeley National Lab in San Francisco and there was a group there looking at applying machine learning to weather and climate type problems so I think it'll definitely happen. I think at the moment they're in that kind of space where they're trying to figure out which problems are kind of amenable to machine learning type solutions. It's almost like it doesn't make sense to apply to everything, there will be certain questions within weather and climate science that could be answered much better by applying that. So it seems like, if you like machine learning as a solution and now they're trying to find good questions to answer with it.

21:02 Michael Kennedy: Yeah, there's a bit of that yeah.

21:04 Damien Irving: But that was kind of interesting. It was kind of, it was an interesting conference.

21:07 Michael Kennedy: Oh, I'm sure. I know that machine learning has been taught to do things like look at mammograms and predict breast cancer as good or better than professionals, right. At least under some circumstances so it feels like it's really good at looking at pictures and finding subtle, subtle trends that people might miss and I'm just wondering, like it seems like there's probably some good ways to use it to understand things that are subtle but then explode later in climate evolution.

21:36 Damien Irving: Well particularly like you can imagine it could look at a weather map and identify the fronts or other weather features and it's just a question of would it do that better than existing ways that we do it where you might look at a temperature field and you can look at the gradient, the temperature gradients and identify it. The question is would it actually do it better than we do it now or yeah, so that's kind of I think the question at the moment is figuring out what it would be most useful for.
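
The existing gradient-based approach Damien mentions can be sketched roughly like this: compute the horizontal temperature gradient on a regular grid and flag the places where it is steep. The grid, the synthetic temperature field and the threshold are all assumptions for illustration, and the gradient is taken per degree of latitude and longitude, which ignores the fact that a degree of longitude covers less distance toward the poles. Operational front detection is considerably more involved.

```python
import numpy as np

# Hypothetical regular grid with 1-degree spacing.
lat = np.arange(-50.0, -29.0, 1.0)
lon = np.arange(130.0, 151.0, 1.0)
lon2d, lat2d = np.meshgrid(lon, lat)

# Synthetic temperature field (degrees Celsius) with a sharp
# east-west transition centred on 140E. Not real data.
temperature = 15.0 + 5.0 * np.tanh(lon2d - 140.0)

# np.gradient returns one derivative per axis; passing the coordinate
# arrays gives values in degrees Celsius per degree of lat or lon.
dT_dlat, dT_dlon = np.gradient(temperature, lat, lon)
gradient_magnitude = np.hypot(dT_dlat, dT_dlon)

# Hypothetical threshold for calling something a front.
threshold = 3.0
front_mask = gradient_magnitude > threshold

# Longitudes where any latitude exceeds the threshold.
front_lons = lon[front_mask.any(axis=0)]
print(front_lons)
```

A machine learning approach would have to beat this kind of simple, transparent rule to be worth the extra complexity, which is the open question raised above.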

22:02 Michael Kennedy: Sure, I guess one of the real challenges is if you're trying, especially for prediction, if you're trying to predict the future you know in the long-term, not like what's it going to be like tomorrow in this town, but like what's it going to be like in 20 years? Machine learning is really good when you feed it lots of examples. It was like this and the outcome was that, it was like this, that. You know like that breast cancer one was given like 100,000 mammography scans and the answer, they always ask for more right. We only live from the past towards the future once right, like we can't, we don't have a bunch of examples to feed it right?

22:32 Damien Irving: Yeah, you're right which makes it difficult in a climate sense 'cause there's only one realization of our past climate. Whereas the applications might be more in forecasting and stuff where--

22:41 Michael Kennedy: Yeah, exactly.

22:42 Damien Irving: A cold front has come across Portland, Oregon, 100s and 100s of times in the past. So yeah, so maybe there's more of a weather forecasting application.

22:51 Michael Kennedy: Right, that's interesting. Not as cool as if we get answers to major, major questions with machine learning, but who knows, maybe some day someone will figure out a way to make it work. But speaking about how to make it work, one of the challenges I suspect is all of this programming and these data tools and programming tools are great, but you said your experience was you weren't given tons of programming support as a scientist and that's probably true for many of them, so what do we need to be doing to help, like, instill those real programming skills?

23:20 Damien Irving: I don't think the, I mean I only graduated from my undergraduate degree back in 2008. I don't think the situation has changed much since. So I think obviously Software Carpentry is a big one, a bigger organization that's helping with this. And I think you've had Greg Wilson on a previous episode talking a little bit about Software Carpentry and everything they do.

23:40 Michael Kennedy: Yeah, I actually had Jonah Duckles on the podcast and that was back on Episode 93. We talked about Software Carpentry which is a group that basically works with scientists around the world to help them become better programmers and better data scientists. It's really cool, so there's some aspects with climate science stuff, is there a particular course for a climate scientist or is just general stuff that's Software Carpentry teaches?

24:04 Damien Irving: I'm actually working on some, I guess climate specific stuff, basically the Australian Meteorological and Oceanographic Society which is like the professional society in Australia for weather, ocean, climate scientists have hosted a Software Carpentry workshop alongside their annual conference for the last four years now. So I'm basically in the process of writing those teaching materials we've been using for that up into a, I guess a more climate specific course, that'll be hosted with Data Carpentry which is actually a sister, sibling organization of Software Carpentry that actually has--

24:37 Michael Kennedy: Right, right.

24:38 Damien Irving: Discipline specific materials. Yeah, so if you're lucky enough I guess to be a young scientist coming through who has someone at your institution who's really into Software Carpentry or a professional society like AMOS who offer these types of things then I guess your situation is a bit better than mine was when I came through. But there's still a long way to go and a lot of people who slip through the cracks and just kind of get lumped on a research project with not much assistance at all in terms of learning how to program, which is really sad on a lot of fronts. Sad personally for them but also in terms of limiting the progress of their research it's terrible.

25:14 Michael Kennedy: I also suspect that there's a sharing component that's limited as well, right. Like if you go to MATLAB and write a script that will analyze something and come up with an answer, that's one thing. If you could write it, say in Python, in a way that is reusable and general and tested, then you could put it out on GitHub and all sorts of people could use it and add to it, and I think there's a pure scientific research sort of sharing the knowledge upside there as well.

25:43 Damien Irving: When you think about where I guess the state of climate science is at the moment and, I guess for lack of a better word, the computational literacy of the community, you're basically at the point where you're just introducing people to say GitHub and to having their own personal code under version control and things like that. To go to the next step and have them writing code that the wider community can use and it's up on GitHub and people are submitting pull requests and you know it's tested and there's continuous integration and all those things, it seems like there are exceptional individuals, if you like, within the discipline who do those things but they're very rare. So I think the discipline is another five or 10 years off of kind of just having the computational literacy, if you like, to do those things that, as you say, make life so much easier for everyone.

26:33 Michael Kennedy: Yeah, it would really be great, but those are some advanced ways of working in it. Especially if people don't get started early in that, right, they don't build up those skills slowly over time, you know they're busy solving real problems with science and computation probably with something like MATLAB or something. Yeah, I can definitely see how that's really, really quite challenging. But I think it's important for a lot of things. So hopefully the Software and Data Carpentry folks can keep that going, that's great.

27:02 Damien Irving: They're doing great things. They're an amazing organization.

27:04 Michael Kennedy: What else would you recommend to people out there who are scientists to level up their skills or to keep improving or whatever?

27:11 Damien Irving: Really just encouraging people to kind of participate in the wider Python community. Go to a Python conference or a SciPy conference or something like that which seems obvious to people who are developers in Python and stuff, but academics usually don't really think about attending conferences outside of their research discipline.

27:30 Michael Kennedy: Right, they might go to a climate science conference and not Python, for example, right?

27:34 Damien Irving: The PyCons that I've been to, there are a lot of kind of support staff there so there's support staff at the institution I work at or the Bureau of Meteorology or wherever it might be, they're all there but the actual scientists who are doing a lot of data science with Python aren't and it probably wouldn't occur to them. I must say my first kind of PyCon really blew my mind in terms of, compared to an academic conference, with like the recorded lectures that go up straight away. Even simple things like you could use your own laptop rather than having to give them a USB and put it into their Microsoft Windows machine. And people were like doing things live, like live coding on the screen and stuff, as opposed to just a static PowerPoint presentation. It was all very mind blowing compared to an academic conference. So I think yeah, I'd really encourage people to kind of try and get actively involved in the community in some way and it will really kind of broaden their horizons and keep their skills improving.

28:26 Michael Kennedy: Yeah, I would totally second that. I think, especially the PyCon conferences, you know, pick your continent maybe. There's a lot of science conversation and data science conversations going on there. Like at PyCon U.S. this year the keynote speaker was Jake VanderPlas on the first day, who opened it basically surveying all the different ways people use Python, how it's used in astrophysics, how it's used in space telescopes and all sorts of things. I think it would actually be a really welcoming environment to people who have say some programming skill and some programming ambition but mostly are doing data analysis type stuff, I think they should definitely check it out.

29:05 Damien Irving: Yeah, for sure. Even like in Australia there isn't quite a critical mass to have a stand alone SciPy conference but at PyCon Australia there's always a data science track. So you don't have to kind of sit there and listen to talks on web development and Django.

29:21 Michael Kennedy: Yeah, exactly.

29:22 Damien Irving: And stuff like that.

29:23 Michael Kennedy: I don't care anymore about what Instagram did with Python, I'm over it. Yeah, even if it was cool.

29:27 Damien Irving: No, you can definitely have an almost a pure data science experience when you go to one.

29:31 Michael Kennedy: Sure, that's awesome. Yeah, absolutely, the hallway track of those conferences is great as well, you know just the people you meet doing similar stuff.

29:38 Damien Irving: A lot of the times you don't have much in common with researchers once you get outside of your specific niche but you often have a lot in common with, in terms of the Python libraries you use and the types of basic data analysis you do. So end up, oh, I have a lot in common with all these people who are from very different fields and that doesn't usually happen. If you're just talking about your specific research discipline.

30:00 Michael Kennedy: Yeah, I was really blown away at how many different people kind of solve similar problems with similar tools but you know the outcome is really different 'cause of the questions they ask. Cool, so I definitely want to encourage people to go to the local Python conferences, they're really good. Some of that Software Carpentry stuff we talked about seems like it would be really beneficial to one of the major problems in science, just the whole reproducibility thing, right. The better you can get your code on GitHub, maybe create a docker image that people can download and run exactly. Like the more that you could share, distribute and sort of save your computation seems really valuable.

30:37 Damien Irving: The reproducibility crisis, I guess is a big one. So I guess the central tenet is obviously if someone publishes some research that they describe their methods in such a way that someone else working in that field would be able to reproduce their results if they wanted to. It turns out that most papers published these days aren't reproducible and it's for a variety of reasons. Some of them have to do with experimental design, the availability of the data sets that were analyzed and things like that. A big one is computational reproducibility in that most papers don't make the code available that they wrote to do the analysis or the details of the software environment that their code was executed in. So yeah, the Software Carpentry skills and just the basic things about using version control and all those types of things are huge if we're ever going to actually get past the reproducibility crisis and get to a point where our research is truly reproducible.
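
As a rough illustration of recording "the details of the software environment" alongside an analysis, here is a minimal Python sketch. It is not something described in the episode: the function name `describe_environment` and the idea of passing a list of package names are assumptions made for the example, and it relies only on the standard library (`platform` and `importlib.metadata`, available from Python 3.8).

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Collect the interpreter version, platform string and installed
    versions of the named packages (hypothetical helper for illustration)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly rather than silently skipping it.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }
```

The returned dictionary could be written out next to the results (for example as JSON) so that a reader of the paper can see which library versions produced them. Which packages to list is up to the author; anything not installed is reported as `None`.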

31:30 Michael Kennedy: Yeah, I definitely think that's something that's important. I mean the more that our research depends upon code and data the more important it is, I think, that that's accessible. This portion of Talk Python To Me has been brought to you by Rollbar. One of the frustrating things about being a developer is dealing with errors. Ah, relying on users to report errors, digging through log files, trying to debug issues or getting millions of alerts just flooding your inbox and ruining your day. With Rollbar's full-stack error monitoring you get the context, insight and control you need to find and fix bugs faster. Adding Rollbar to your Python app is as easy as pip install rollbar. You can start tracking production errors and deployments in eight minutes or less. Are you considering self-hosting tools for security or compliance reasons? Then you should really check out Rollbar's compliant SaaS option, get advanced security features and meet compliance without the hassle of self-hosting, including HIPAA, ISO 27001, Privacy Shield and more. They'd love to give you a demo, give Rollbar a try today. Go to talkpython.fm/rollbar and check 'em out. I know the guys at the Large Hadron Collider are doing stuff with like taking their code and putting it in like an escrow thing. It's not GitHub but it's something kind of like GitHub where it's like we promise we won't change or delete this code 'cause it's linked to by this paper, things like that. It's been awhile, I forgot what the name was, it was like two years ago when we talked about this. But it sounds really important.

33:00 Damien Irving: Some papers that I published just recently I give the link to GitHub of where the code is if people want to see, I guess the latest version if you like, but then also I guess at the time of publication you take a snapshot if you like of your code repository at that point and there's websites, there's one called figshare and another one called Zenodo. Where you can put, they kind of call it the long tail of your research so things like code, supplementary figures, supplementary tables and stuff. You put it all out there, it gets a DOI, a digital object identifier and those--

33:31 Michael Kennedy: Yeah, that's right.

33:32 Damien Irving: And those websites guarantee that they're not going to disappear and it'll be around for all eternity for people to be able to get. So yeah, in general the best practice these days is definitely get people to link to GitHub so they can see the latest version, but also have a version up on a persistent storage place like figshare or Zenodo so if you ever change the name of your GitHub repo or something like that it doesn't all just disappear.

33:56 Michael Kennedy: Right, exactly it's great that it is, that GitHub is there and you do get the latest version, but you could delete that if you just were in a bad mood or whatever, right, or your account gets suspended or hacked or whatever, you definitely want to be careful. So that sounds really cool, that's what the Large Hadron Collider guys were talking about as well, the digital object identifiers for the papers, that sounds great. What about like docker containers and these other things where you sort of ship like whole systems, do you see much of that being used?

34:22 Damien Irving: I hear people talking about it as a possible solution, I think because I'm kind of more tuned in than most climate scientists to the kind of reproducibility scene. But I think in general we're probably asking too much I guess of a regular climate scientist to be so up on these things to be able to do the whole docker thing themselves, so I think it's a possible solution in future but it's kind of out of the reach of a regular climate scientist right now, if you know what I mean?

34:51 Michael Kennedy: Yeah, I definitely know what you mean. I think it's out of reach of a lot of developers as well. Like not that many people are actually doing you know complicated docker things or containers in practice. But still it seemed like it would be a great solution because you can capture the whole platform and its dependencies not just the code plus the data.

35:08 Damien Irving: I mean I guess a much lighter-style version of that is a Conda environment, 'cause you can post your environment on anaconda.org and then someone can just kind of go conda env and then the URL to where you posted it on your profile in anaconda.org and then it'll install your software environment.

35:27 Michael Kennedy: Oh, that is really cool. I didn't know that that was a feature they had, that's awesome.

35:30 Damien Irving: Yeah, it's really cool and they actually go a step further now, they've got I guess a bit of a bigger thing called Conda Kapsel, I think it's K-A-P-S-E-L. But in that one instead of posting an environment to anaconda.org that someone can install you basically write the specifications of the libraries you use in essentially your README file and then Conda Kapsel just takes your README file and then installs it all.

35:56 Michael Kennedy: Yeah, that sounds really cool.

35:58 Damien Irving: All the stuff they're doing around Conda, and kind of Conda environments, is really exciting for things that are doable for regular scientists right now, 'cause it takes one line of code.

36:09 Michael Kennedy: Yeah, which is totally possible. It sounds like the Conda guys are doing tons of good stuff. I know with anaconda distribution and all that but it sounds like even maybe even more than I realized. That's awesome. We talked a little bit about reproducibility and the various things people should be doing. Like what are some of the steps that you think are within reach to get people to sort of become better at participating in a wider community, do more stuff on GitHub, more libraries, things like that.

36:36 Damien Irving: One of the major issues I think up until now has been, up until recently has been, there isn't really an incentive for people to take the code that they have hacked together for themselves, for the research problem that they are doing today for this particular paper and putting the time and effort to make it general enough that the wider community could use it and they could pip install it or Conda install it and those types of things. There's a few journals out there now, there's The Journal of Open Research Software, which I'm actually an associate editor on. Or there's one called The Journal of Open Source Software as well and basically the idea is because citations of academic papers is currency, in academia that's how you get promoted, that's what your career depends on, that's the kind of incentive. If people can write a paper documenting this research, scientific research software that they've released and then if people start citing that every time they use it in their papers then that's the kind of incentive that academics kind of need to go that extra step and actually take their personal code and make it community code. That's starting to happen and certainly we're getting a lot of submissions at The Journal of Open Research Software and stuff. So now I guess it's a matter of it becoming part of the culture for people to actually look for those papers and cite them in the method section of their papers to absolutely make sure that the authors of that software are getting the academic credit that they deserve for it.

37:55 Michael Kennedy: Oh, that's a really good point, you're right that it definitely is the currency of academia. And I guess sort of making that a habit, right, it's one thing to go to GitHub and grab a package and just go and like do some analysis with it, but how? I guess it would be great to have in the package, "if you're going to cite this, or use this in an academic paper, please, you know, this goes in the bibliography" or whatever.

38:17 Damien Irving: Just getting people to start the method section of their paper with just one paragraph talking about the software that they used and citing the actual publications that relate to that software, that would make a huge difference. So it's getting that to be a cultural thing where our methods sections always have a bit about the statistical methods that we use, why don't they have a paragraph on the code that we used and where we got it from?

38:41 Michael Kennedy: I think that's a great idea and it's so low hanging fruit, right, it'd be really easy to do that.

38:46 Damien Irving: Yeah, so hopefully as these journals get bigger and people start pushing people to cite them that'll happen and in 10 years' time it'll just be every method section has a paragraph on code.

38:56 Michael Kennedy: I definitely see that as a possible future for us. So speaking of possible futures, let's have a conversation a little bit about climate change. You know you've studied it more than almost, I'm sure you've studied more than anyone else I've spoken to, and more than most people I'd say. So what do you think, climate change, is this a real thing, did people cause it?

39:17 Damien Irving: Absolutely, I guess the frustrating thing from a climate scientist point of view is that question, is climate change real and are humans causing it hasn't been an active research question for more than 30 years. It's been accepted in the scientific community for at least that long.

39:34 Michael Kennedy: Well hold on, if I turn on the news here in the U.S. there's always some other, there's like half of the people on the TV channels, oh I have 400, 4,000 scientists who say this is not real. Like, just give us a sense of why it's so accepted and what not, maybe put a little push back on that vision that gets projected by the news.

39:56 Damien Irving: Even those statistics where they go, 97% of climate scientists agree, it's actually 100%. Like I've been going to climate science conferences for a decade now and I've never, ever, ever, sat in a presentation or read a paper that suggests that climate change isn't happening. It's a given and it's been a given for a very long time. So the disconnect between kind of what's being discussed in the science community and what's being discussed in public is very frustrating.

40:24 Michael Kennedy: I'm sure it is, yeah.

40:25 Damien Irving: It's not unique, obviously like if I went to a health conference, you know the difference between public discussions and policy on health would be very different from what experts in health think should be happening, it's not a unique thing if you like but it is particularly frustrating just because it, yeah, that's a question that we moved on from 30 plus years ago.

40:47 Michael Kennedy: Yeah, I think one of the things that like make people feel that this is more up for debate than it is is at least in the U.S., I don't know outside the U.S., but there's this tendency of you're going to present something you're going to present both sides of it. So if you're going to talk about climate change have somebody for and somebody against and that makes it feel like it's 50/50 and it's not 50/50, it's like 1,000 to one or something. It doesn't seem like it really needs like this other side to say, well here's the other side of the argument. This guy he only works for this coal company, but he's really studied it, it'll be fine you should listen to him.

41:21 Damien Irving: That gets you tearing your hair out when the people that they put up on television, for the against case, like it's in a debate on physical reality.

41:31 Michael Kennedy: Right, so I'm with you, I think this is an important, this might be the fundamental challenge of our generation, our generation's probably multi-generational actually but you know what are some of the things you think we can do as just citizens of the world and what do you think we can do as people with the magical wand of software development where we can actually make things that analyze change?

41:56 Damien Irving: I definitely think getting I guess politically active is important, so writing to congress people, attending protests, getting involved in community organizations, whether they're against a pipeline that's being built or they're encouraging people to divest their money from fossil fuel companies or whatever it may be. I think it's got to just be that groundswell of kind of grassroots activism that gets things changed because I mean there are a lot of vested interests in keeping things the way they are, I mean the biggest companies in the world are fossil fuel companies like ExxonMobil and companies like that. So it's a formidable opponent and it really needs a massive grassroots effort. Now I guess the more I've gotten to know about the issue the more I've kind of got involved in those types of things, and certainly when I first started as a climate scientist I wasn't politically aware or active at all. But I guess all the organizations that I've been involved with as I've become more active are really crying out for help with IT stuff. Whether it's their website, whether they want to analyze some voting data or whatever the case may be. Like if you walked up to a grassroots organization doing climate activism and said hey, I've got a bunch of IT skills, they'd fall over themselves with gratitude and with things that you can help with, and you'll be up to your eyeballs in way more things than you've got time to do. But yeah, no, definitely I think developers and people with IT skills have a particular role to play just because they're such important skills and these organizations are just crying out for help with that kind of stuff. So yeah, if you want to get involved, there's absolutely ways to put your skills to good use.

43:36 Michael Kennedy: Yeah, that sounds like a really good thing to do. Certainly if you gave a couple of hours a week or something to one of these organizations and they really are totally missing the software side of the story, right. They don't have a lot of people to help out there, you could probably make a pretty big difference there.

43:52 Damien Irving: Oh yeah, absolutely.

43:53 Michael Kennedy: So how much of this do you think is like a political fight versus an economic fight? I mean like every time you take an action or you buy something or you don't buy something you're kind of voting with your dollars or your pounds or whatever right and you know do you think it's more important to act as consumers or to push on the political side or where do you think the leverage points are?

44:16 Damien Irving: I think it's both. I mean I think I used to think that it was just definitely talk to politicians and get them to change their minds, but if you read the writing of major climate activists like Bill McKibben and stuff like that, he has a whole book about the fact that he started all these campaigns in Washington thinking he had to be there to tell the politicians and then in the end he comes to the realization that it's corporations that rule the world, and that's why 350.org, his organization, focused so much on divestment and on actually thinking about where you're spending your money, because at the end of the day maybe where you spend your dollars is more important than what you write down at the ballot box in terms of the impact it has.

44:55 Michael Kennedy: Right, and once the voting is kind of done, like things are set for awhile, right. But you buy stuff everyday, you consume things or don't everyday, that's pretty interesting. Here in the U.S. I don't think there's a lot of positive policy that's going to be passed in the next two years on climate stuff but I feel like we still get many, many choices on what we buy, what we don't buy, where do we get our energy from, things like that. So there's still lots of things that people can do but I totally agree with you on donating some time to activist groups. What do you think is the most exciting or encouraging development in the last couple years around sort of positive change to fight climate change? And then maybe, what do you think is like a setback that we've had that is unfortunate lately?

45:45 Damien Irving: The most exciting thing would definitely be just the growth in renewable energy. So I'm sure it's similar in the U.S. but here in Australia in particular the growth of people having solar panels on their own roofs and stuff has been huge and it's kind of in spite of government efforts to kind of slow it down. If you like, because obviously energy companies have a lot of lobbyists and things like that. But in spite of that renewables are just going from strength to strength, so that's definitely the most exciting.

46:15 Michael Kennedy: I would definitely agree with you that a little bit with the politics versus dollars, renewable energy's becoming the financially smart choice and once that happens like it's just forget the politics, it's going to solve itself at that point. But yeah, we can slow or hasten it for sure through politics.

46:33 Damien Irving: Probably, to your question of what was the most discouraging thing, it's probably the ability of kind of vested interests to slow things down. It almost feels like sometimes with climate change that we will eventually get there, and we will eventually reduce our emissions significantly, but like a slow victory is kind of a loss, because of all the heat that will have accumulated in the climate system in the time it took to get there. We have to win and we have to win fast! And the fact that we're potentially winning but winning incredibly slowly is a big problem.

47:08 Michael Kennedy: It's both encouraging and really frustrating at the same time right?

47:12 Damien Irving: Yeah, winning slowly is not really an option but the ability of vested interests to slow things down makes it feel like it could be a very slow victory, which would be really not a victory at all in the end.

47:25 Michael Kennedy: Right, all right, well maybe we'll leave it there with that for the climate science stuff. Let me just ask you the final questions that I ask everyone who's on the show. So if you're going to write some Python code what editor do you open up?

47:39 Damien Irving: I'm a simpleton, it'll be like TextWrangler or gedit or a simple graphical text editor like that. I used to be kind of self conscious about that and then I was teaching at a Software Carpentry workshop and one of the helpers was a core Python developer and he told me he uses simple graphical editors like that too. Ever since he told me that I've felt really good about myself.

48:00 Michael Kennedy: Hey, if the core developers can do it, you can definitely do it.

48:03 Damien Irving: Yeah, exactly.

48:04 Michael Kennedy: All right, and notable PyPI package? I mean you named a few that are involved in climate science?

48:10 Damien Irving: Yeah, I thought I might actually give a shout out to a little one, unknown one, I use GitPython, I'm not sure if you've ever used that, it's basically just a hook to git, but basically--

48:21 Michael Kennedy: G-I-T python?

48:22 Damien Irving: Yeah, all one word, G-I-T python, and so basically I use it a lot because with these NetCDF files that we use that have the metadata in them, some of the tools that have been built keep a record, in the global history attribute of these files, of what was entered in the command line to produce this file. So I can do that basically, I can have a script that at the end puts in the history attribute that the command line was python, the name of the script and then whatever input arguments it was, so I have a complete record of what was entered in the command line to produce this file. And then with GitPython, since every commit in Git has a unique 40-character string associated with it, I can basically put the first seven or so digits of that so I know which version of the code was executed, and so I actually use GitPython.

49:12 Michael Kennedy: Yeah, that's awesome.

49:13 Damien Irving: Yeah, GitPython helps a lot just with reproducibility, down to which version of that script did I use.
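
The workflow Damien describes (command line plus a short commit hash, stored in the file's history attribute) could be sketched roughly as below. This is an illustration under stated assumptions, not his actual code: the function names, the timestamp format and the "(Git hash: ...)" wording are all made up for the example. The only GitPython calls used are `git.Repo(...)` and `repo.head.commit.hexsha`, and GitPython is imported inside the helper so the formatting functions work without it installed. Writing the resulting string back into a NetCDF file depends on whichever NetCDF library is in use and is left out.

```python
import datetime
import sys


def current_commit_hash(repo_dir="."):
    """Return the full 40-character hash of the checked-out commit.

    Requires GitPython (pip install GitPython) and a Git repository at or
    above repo_dir.
    """
    import git  # imported here so the rest of the sketch runs without it

    repo = git.Repo(repo_dir, search_parent_directories=True)
    return repo.head.commit.hexsha


def build_history_entry(commit_hash, argv=None, when=None, executable="python"):
    """Format one provenance line: timestamp, command line, short Git hash."""
    argv = sys.argv if argv is None else argv
    when = when or datetime.datetime.now()
    stamp = when.isoformat(timespec="seconds")
    command = " ".join([executable] + list(argv))
    # Keep only the first seven characters of the hash, as in the discussion.
    return f"{stamp}: {command} (Git hash: {commit_hash[:7]})"


def prepend_history(new_entry, old_history=""):
    """Put the newest entry first, keeping any history the input file had."""
    return new_entry if not old_history else f"{new_entry}\n{old_history}"


# Typical use at the end of an analysis script (hypothetical):
#   entry = build_history_entry(current_commit_hash())
#   new_history = prepend_history(entry, old_history_from_input_file)
```

Putting the newest entry first means the history reads as a log of every processing step applied to the data, most recent at the top.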

49:17 Michael Kennedy: Yeah, and the arguments as well which is really important. Awesome, all right, so we've heard a lot about the tools, a lot about how we can contribute, maybe some of the problems. What's the final call to action if people are interested how do they get further involved or do something along those lines.

49:34 Damien Irving: I can give a shameless plug for myself, so I have a blog, drclimate.wordpress.com, where I talk a lot about Python. I talk about research best practice in general but a lot of the time that means I'm focusing on Python basically in climate science. So if people want to subscribe to that, that's a way to kind of keep up to date with things.

49:53 Michael Kennedy: We can follow you on Twitter, right?

49:54 Damien Irving: Yeah, @drclimate as well.

49:56 Michael Kennedy: @drclimate, which I'll put in the show notes of course.

49:58 Damien Irving: Just to kind of, it'd be good to have I guess more people in the climate science Python discussion if you like. I feel like, through some of my involvement in Software Carpentry with other disciplines, in the bio sciences, bioinformatics and stuff like that, the community around R and those languages is huge and they seem to be a lot further down the track in terms of dealing with the reproducibility crisis and releasing packages that other people use. I feel like climate science is a smaller community but we need that kind of strong sense of community around Python to really help with some of those challenges.

50:34 Michael Kennedy: Right, maybe some people who could actually like convert something into a package that could be reused and help getting it on PyPI or on GitHub, like maybe those type of contributions can be helpful as well?

50:45 Damien Irving: Oh yeah, absolutely yeah. There will be a lot of people that have very interesting code, from a research perspective that would need a bit of assistance in actually releasing it.

50:54 Michael Kennedy: Sure, all right well that sounds great. Thank you so much for all your thoughts and sharing what you guys are up to in the climate space.

51:00 Damien Irving: Oh no worries, thanks a lot for inviting me on the show.

51:03 Michael Kennedy: You bet, talk to you later Damien. This has been another episode of Talk Python To Me. Today's guest has been Dr. Damien Irving and this episode's been brought to you by ParkMyCloud and Rollbar. Do you hear that sucking noise? That's your cloud provider making you pay for your idle instances. Turn on ParkMyCloud, plug the leaks and save money. Visit talkpython.fm/park to get started. Rollbar takes the pain out of errors. They give you the context and insight you need to quickly locate and fix errors that might have gone unnoticed until your users complain, of course. And as Talk Python To Me listeners, track a ridiculous number of errors for free at rollbar.com/talkpythontome. Are you or a colleague trying to learn Python? Have you tried books and videos that just left you bored by covering topics point-by-point? Well, check out my online course, Python Jumpstart by Building 10 Apps at talkpython.fm/course to experience a more engaging way to learn Python. And if you're looking for something a little more advanced try my Write Pythonic Code course at talkpython.fm/pythonic. Be sure to subscribe to the show, open your favorite pod catcher and search for Python, we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.
