Transcript for Episode #252:
What scientific computing can learn from CS
00:00 Michael Kennedy: Did you come into Python from a computational science side of things? Were you just looking for something better than Excel and MATLAB and you got pulled in by all that Python has to offer? That's great. But following that path often means some formal practices from software development weren't part of that journey. On this episode, you'll meet Martin Heroux, who does data science in the context of academic research. He's here to share his best practices and lessons for data scientists of all sorts. This is Talk Python To Me, Episode 252, recorded January 30th, 2020. Welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython. This episode is sponsored by Clubhouse and Linode. Please check out what they're offering during their segments. It really helps support the show.
Everyone, a couple of announcements. First, I have a very exciting new course that we just launched this week. If you've always wanted to solidify your programming foundations and get into Python, but you've never had the chance to take a computer science course, I built a course for you. It's called Python for the Absolute Beginner. It's for absolute beginners, because we start from the very beginning: how to install Python, what is a variable, why do we need loops, and so on. And then we go on to build some really fun projects that will teach you 80% of the Python language. Another challenge people run into when learning programming is having something to work on that's not too big, not too small, but just right, that will actually teach them what they're studying. That's why this course is 50% video lecture and 50% hands-on exercises. So if this sounds useful to you or to your colleagues, visit talkpython.fm/beginners to find out more.
Another quick announcement. Last week, we had a brand new sponsor, Springboard. I'm super happy to have them supporting the show. But I did forget to mention something that I think will be really useful for a lot of you. If you want to take one of their AI machine learning online career tracks, they're offering 20 scholarships of $500 each, exclusively to Talk Python listeners using the code TALKPYTHONTOME, all one word, all caps. It only takes 10 minutes to apply. So that's the thing I left out. Use that code to get the $500 scholarship: TALKPYTHONTOME, all one word. Just visit talkPython.fm/springboard if you're interested.
Another announcement: recently, we've had to trim back our RSS feed length. It had all 252 episodes in there, all the show notes and everything, and as far as I was concerned we were going to ship those to you as long as we could. I want to make sure that everyone can go back through the entire catalog and get everything that they want. And yet, because of the size, I think some of the players, especially in the Apple ecosystem, started to go crazy and ship old episodes as if they were new and not show the latest ones, and all sorts of weird things started happening. So in order to fix this, or let's say rather in order to attempt to fix this, and it seems to have worked, we've reduced our RSS feeds to only show the last half a year of episodes. Again, that's not 'cause we don't want you to get the old ones. That's because we want you to get at least the new ones. So what a pain, but I know that some of you are going through the entire catalog and learning as you go, trying to go through the entire history of what we've done, which, thank you, that kind of blows my mind. That's an incredible effort. So what I've done is I've also made a separate RSS feed that is not part of the subscribe at iTunes or subscribe at Google Play feed, because that's where the problems were. If the entire RSS feed works for your player, you're welcome to have it. I put it up at talkPython.fm/episodes/rss_full_history, and you can get that in the links in the show notes here. You can also get that in the footer of every single page at talkpython.fm. So that's the easiest way to get it: just go to the bottom, look for the RSS full history link, and put that link directly into your podcast player if you want to get the full history. If you're happy with the latest half year, just do nothing. Make sure you're subscribed as you have been, and you'll keep getting the new episodes, hopefully without any hiccups this time. Martin, welcome to Talk Python To Me.
04:39 Martin Heroux: Hey Michael, how's it going?
04:40 Michael Kennedy: Hey, it's going really well. I'm happy to have you on the show.
04:42 Martin Heroux: Thank you very much for having me.
04:43 Michael Kennedy: Yeah, we're going to talk about the science side of Python, which I think is a really interesting place to talk. You know, there's that famous, well-known article entitled The Incredible Growth of Python, done by the data scientists at Stack Overflow. There's this super sharp inflection point around 2012, where the growth of Python, which had been pretty stable, it was like a third or fourth tier language in rank, and then it just took off at that point and went up and up. And I think my theory is, a lot of people came in from data science. But I think that kind of overshadows that there are just a lot of people that started adopting Python for this little bit beyond Excel, or the "I just need to do a little coding, I just need to do a little automation." And I think people are just coming into this from so many angles. And I think that a large portion of that probably is scientists. What do you think?
05:38 Martin Heroux: I totally agree. I mean, that's how I came into Python, as a transition from MATLAB. As a first language for anybody who's interested in doing just that little bit of automation, that little bit of something to go beyond Excel, I think Python is perfect. And now with the tooling, for example Anaconda, those types of things, it just makes it so accessible.
05:58 Michael Kennedy: Yeah, absolutely. You know, my theory of why it's really popular is that you don't have to take on all the computer science ideas and concepts all at once. You don't have to have a static class and a function with accessibility modifiers and all that. It's just like, well, just write the three lines you need here. And then, oh, if you need this idea of functions, then you can learn them. If you need classes, you can learn them, and so on. It's really, really approachable.
06:19 Martin Heroux: Yeah, no, I totally agree with that.
06:20 Michael Kennedy: For sure. And now, before we get too much into our topic, let's start with your story. How did you get into programming and Python?
06:26 Martin Heroux: Unlike some of your previous guests, I'm not your traditional path person. I'm a physical therapist by training. After that I went on to do graduate work, and after my master's I did a PhD, and it's during the PhD that I learned two languages that are very different from each other. One of them was LabVIEW programming, and that was for my data acquisition and interface for data collection. And that's, I guess you can call it coding.
06:50 Michael Kennedy: Is it more of a visual type of thing, like an ETL type thing?
06:53 Martin Heroux: Yeah. You connect your lines, and one person saw the back-end of my program, not the front-end, and said it looks like a bomb diagram, which, you know, I guess it kind of does. It's all the spaghetti behind the scenes. The logic is there once you understand coding, but it doesn't teach you how to program properly. But then at the same time, my supervisor took on the task of teaching me MATLAB, in which, by the end of my PhD, I was quite competent. And it was a great skill; those two together worked quite well. And then I progressed, did my postdoc and continued that, a lot of MATLAB. And then I'd heard of Python from a colleague that went to do some postgrad work in San Francisco. I met her at a conference and she said, "All the folks I work with use Python." And so, me being just curious, I went and looked at it, but that was probably, you know, 2007, 2008 or something. And I kind of read up on it. I saw these tutorials on, you know, going to Python from MATLAB, but the tooling still wasn't there: how to install it, how to get the various packages. I tried for a couple of weekends. And then I just kind of gave up, because obviously I was proficient at MATLAB. And I'm not a programmer. I just wanted to get my job done, my research done.
08:02 Michael Kennedy: So right. And a lot of this stuff, the data science tools and libraries were just early seeds growing at that time, right? I think NumPy came out or it was created around that time and probably Matplotlib. And Jupyter wasn't around yet. So there was a lot of support that wasn't there right?
08:19 Martin Heroux: Exactly. So for somebody who's, you know, in a sense a non-expert, it was just too daunting. And so I went back to MATLAB. And then finally, I'm a Linux person, I started that in probably 2010 and never looked back.
08:36 Michael Kennedy: What distribution are you on these days?
08:38 Martin Heroux: Linux Mint or Ubuntu. It pretty much depends on how I'm feeling on the day that I install, but it's usually between those two.
08:44 Michael Kennedy: Okay, cool.
08:45 Martin Heroux: It's just what I run on my own computers. I've put it on my mom's computer, so I can help her out from a distance, 'cause it's kind of easy. You can use older computers and just use it in the lab or for some of the experiments we run. We just need a basic computer. So I just thought it's a nice way to reuse some of the equipment. And then with this open source kind of mindset that I had, Python started to be a little bit more appealing. And I looked again, partly because I was looking at the courses for Software Carpentry, which is this thing that teaches scientists that have no computer skills some of the basic skills, usually around either R or Python, and you get a bit of shell and you get some git. And so that really made me look again into Python. And from there, that's all I've been using for the last, maybe four or five years now.
09:30 Michael Kennedy: Beautiful. Yeah, the Software Carpentry stuff is pretty interesting. I've had the folks behind that on the show before. It's not a lot, right? It's not a lot to get you started. It's a couple of days, a little bit of git; it's not even all programming. It's here's how to use the shell, and here's what source control is, and so on. But at least it plants the seed, like, hey, these are things you should pay attention to.
09:49 Martin Heroux: Yeah, it's really here's the landscape of a few things. And it's a teaser. But in some labs or some institutions, it's the culture: they use R, for example. If they do a lot of genomics, everybody's using R, and then it's okay, 'cause when you leave the introductory course, everybody around you can help. But at our institute, I did run some Software Carpentry courses, and the uptake is hard because we're so diverse in what we do. There's people who do genomics where I work, and then there's other people who do really just kind of epidemiology type things. And so the range is so wide that the needs are different. So it's a good format, but it doesn't apply to everybody. That's kind of what I found along the way.
10:31 Michael Kennedy: Yeah, for sure. It sounds to me, like reading some of the stuff you've written and looking through some of the stuff you put out there that we've talked to data science folks and even scientific computation folks, who also have maybe some programmers around them or some kind of software stuff going on as well, where most of their work revolves around code. But it sounds to me like what you guys are doing, it's a little bit more like you're scientists out there on your own, and yeah, here's some stuff that you can learn and use. But when you get stuck, you know, you're kind of alone on the internet sort of, is that right?
11:10 Martin Heroux: Pretty much. While it'd be nice if we had a big amount of funding to have dedicated people to program and to automate some of our things, it's just not the case. The funding is not all that great, pretty much around the world. So if you're working in one of those places that has that, it's great. But most everybody else, we just have to, you know, figure it out ourselves. And that's kind of how I learned. Either I could continue to do this pointing and clicking, realizing the mistakes I was making and not being able to reproduce anything and those types of things, or I could teach myself. But that's not easy. It's a bit overwhelming, I think. Rather than there being not enough resources, there's just so many resources for people, and that can be overwhelming. And unfortunately, we're strapped for time. That's a big stress on people at the more senior level. They just feel they don't have the time. And the students, well, they have a little bit more time, but if their supervisors and the people above them aren't necessarily coding, then they don't feel the need to learn it as much, because obviously other people got by without it. And for myself, it's definitely, as you have said on your show a few times, my superpower. Really, it's things that other people just can't even fathom could be answered. I'll just take the data and do a few things in Python and then get them the answer pretty quickly.
12:22 Michael Kennedy: Yeah, they're like, I don't really even know if we can answer this question, and you're like, give me pandas and five lines of code and we'll see what we can do with this, right?
12:30 Martin Heroux: Exactly, yeah. And part of it is actually not even that they need to do it or learn how to do it, but just that they know it's possible, and value that. And in terms of valuing it, I think what's interesting is that regardless of what type of science you're doing, in the end it's all based around computers. It almost doesn't matter what field you're in; you're going to be dealing with data most of the time. And so you want to be able to do that in a repeatable, efficient manner. As you always say, computers are really good at that, the mundane, repetitive tasks.
13:01 Michael Kennedy: Yeah, yeah, yeah.
13:01 Martin Heroux: So why not automate it? People are noticing, and I think there's a push even from the senior people. Even though they might not know how, I think they see that, okay, this is where it's going.
13:11 Michael Kennedy: Yeah, that's a little bit challenging, when the incentives don't necessarily line up. Right, the experienced folks, they're busy, they've got stuff to publish and research projects. They'd like to learn. But at the same time, you know, they've got a job, they've got family or whatever, right? They've got classes to teach or people to mentor in their field. So that makes it really challenging, I think as well.
13:34 Martin Heroux: Yeah, no, the pressures are very high. And this whole publish or perish thing, although some people throw it around as a joke, really is true, unfortunately. And it's in that model where, you know, rather than focusing on good science that's reproducible, which is what you'd expect. When I was a kid growing up, scientists were just, you know, these amazing people that wear white lab coats most of the time, you would think, and that are just extremely careful. And I think we're just people trying to get by. And the incentives make it so that we do have to do a lot of work. And a lot of times, we might cut some corners. It's just the way it is; we just have to get a lot out there. And there's a lot of pressure to do that rather than take a bit of time and either do it carefully or learn some new ways to make it last a little bit longer, in terms of, I can pass on my data with the code and somebody else can reuse it, or later on my future self can reuse it, or convince myself or convince others that, hey, this was actually the real answer, kind of thing.
14:30 Michael Kennedy: Yeah. I'm having a hard time even coming up with an example where you're doing research and there's not a lot of data to be processed. You know what I mean? What field is it where it's like, well, we have seven things, we're going to study these seven, I'll just put them on a line? Almost anywhere that people are doing research now, there's so much data available that you can collect in whatever way. Whether it be economics, where you can just go load up all the data about all the economies and all the purchasing habits and whatnot, or if you're doing neurology, the amount of EEG data or whatever is ridiculous, right? So it seems like programming is a really important skill just to almost be functional these days.
15:10 Martin Heroux: Well, definitely. And for myself, it actually started first with a bit of electronics. I didn't know anything about that, obviously, being a physiotherapist, and I ran into trouble. I almost stalled my PhD for, was it six or eight months, just because of a couple of simple wires that needed to be soldered together and connected. And nobody where I was from knew anything about that. It took a long time just to find the right person. And I was like, never again am I going to not be self-sufficient. So I taught myself electronics. And then I think with programming it was the same thing: you can find the right people, but it's going to take time and money. And if they're not exactly in your field, they'll do their best, and then it's going to be multiple iterations to get to the final thing you would like, and you don't even understand how it got there. So I think this idea of learning just enough to get you by really is quite valuable. And although it's Python, and it's obviously easier than other languages, if it's not your thing, it's a big learning curve, the ideas of programming. At the same time, you don't want to just learn programming; you want your results, that's number one. So you're juggling all these things. But once you get over that curve, the big learning's done, and then it's just adding on. For each new project, you might learn a new technique. You might learn, oh, now I'm going to use databases, or now I'm going to learn git and version control my code, and things like that that come along. So it's how to do it progressively, not get intimidated, and not just say, "I'm going to go back to my point and click programs 'cause at least I can get the results quickly."
16:39 Michael Kennedy: Exactly, or Excel.
16:40 Martin Heroux: Yeah, yeah, Excel. I try not to use that word. I have a fear, I have an allergy, to be honest. I walk by the people who have those screens open and, oh, I just... Yeah, I personally have this thing about Excel in science. For just, you know, keeping a few numbers or doing some basic addition or stuff like that, it's fine. But for more than that, I mean, it just doesn't have a place nowadays, in my opinion.
17:07 Michael Kennedy: Yeah, well, I think the problem is, like, it's fine for the simple case. Like, I use Excel for accounting and that stuff. And it's great. It's going to take a few numbers, it's going to add this column and divide by that and put, you know, 20% over there. I don't know, something like that, right? That's fine. We did a whole episode on Excel with Chris Moffitt, Episode 200, Escaping Excel Hell with Python.
17:30 Martin Heroux: Yes.
17:31 Michael Kennedy: And that was a lot of fun. One of the things that dawned on me that makes these worksheets in Excel really tricky, or it could be Google Sheets, it's really no difference, is that it's kind of full of gotos. You know, when you have a formula that's down here in, like, A32, and then it says add stuff over there and then make this decision and then take that from over here, and then there's a formula over there, it's like, there's no way to look at it and tell what order it flows in, right? It's just go there, then go back over there, then take some of this. And yeah, it just seems like it takes on some of the worst practices when you try to push it too far.
18:13 Martin Heroux: Yeah, I think if it fits on a page, and you've got a few pages, and everything is self-explanatory, I'm okay with that. And, you know, I've even done that for a survey that I published with one of the papers that I wrote, and it was the best way to present it. But when you've got a lot of those embedded formulas, I, as a person opening it, have no idea what the flow of all of this is. And the bigger they get, the worse it gets. There's research that's been done by people who've analyzed Excel spreadsheets, and for every hundred things that you do in Excel, there's going to be approximately five errors. And the more complex it is, the more likely you're going to make these mistakes. And I think the hardest ones are, for example, one that came across in some of the data that I was involved in. I asked to see it. We took three responses from somebody, and you take the average of the three. But what happens if there's a missing data point? Well, just by default, from what we did, Excel puts in a zero, so it was averaging two numbers with a zero. But you had to search through, and it was like page number eight, cell II52. Those are very difficult mistakes to find. Whereas with coding, if you choose your variable names well and you do a bit of documentation, it can almost read like a story. And even somebody who can't necessarily code can at least kind of follow it, or you can just sit down with them and tell them the story of the code, and they might be able to say, "That doesn't make any sense."
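The missing-data mistake described above can be sketched in a few lines of Python. This is only an illustration, not the analysis from the study: the response values and the `average_responses` helper are hypothetical, and `None` is assumed to mark a missing response.

```python
from statistics import mean

def average_responses(responses):
    """Average the recorded responses, skipping missing ones (None)."""
    recorded = [r for r in responses if r is not None]
    if not recorded:
        return None  # nothing was recorded, so there is no average to report
    return mean(recorded)

# Hypothetical participant: three trials, the second one is missing.
responses = [4.0, None, 6.0]

# The silent failure: the missing value is counted as a zero.
zero_filled = mean([0.0 if r is None else r for r in responses])

# The explicit version: the missing value is left out of the average.
skip_missing = average_responses(responses)

print(f"zero-filled average:  {zero_filled:.2f}")
print(f"skip-missing average: {skip_missing:.2f}")
```

Because the rule for missing values sits in one named function, a reader who does not code can still be walked through it and say whether it matches what was intended.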
19:38 Michael Kennedy: Yeah, wait, you're doing this wrong, right?
19:40 Martin Heroux: Yeah Excel is much more nebulous.
19:42 Michael Kennedy: Yeah, exactly, it's very vague. This portion of Talk Python To Me is sponsored by Clubhouse. Clubhouse is a fast and enjoyable project management platform that breaks down silos and brings teams together to ship value, not features. Great teams choose Clubhouse because they get flexible workflows, where they can easily customize workflow states for teams or projects of any size; advanced filtering, quickly filtering by project or team to see how everything is progressing; and effective sprint planning, setting their weekly priorities with iterations and then letting Clubhouse run the schedule. All of the core features are completely free for teams with up to 10 users. And as Talk Python listeners, you'll get two free months on any paid plan with unlimited users and access to the premium features. To get started today, just click the Clubhouse link in your podcast player show notes or on the episode page. So you wrote a blog post series called Errors in science: I make them, do you? Parts 1, 2, and 3, and I'll link to part three. One of the things I found amusing from there was you actually linked to an article which talks about 12 of the biggest spreadsheet fails in history. Do you remember that article?
20:54 Martin Heroux: Yes. It's pretty scary. And I mean, in our case, some of these mistakes in science lead to retraction. That would be the biggest problem, and that's kind of a big embarrassment and something we don't want to do.
21:06 Michael Kennedy: Right, other people can base their conclusions on a paper that then gets retracted. So it's kind of a bit of a house of cards as well.
21:13 Martin Heroux: Exactly, and unfortunately, even though it's retracted, some people continue citing the original one, which, you know, you're not supposed to do, but people haven't realized it's been retracted. So it's a bit of an issue. But some of the other ones, you know, in the financial industry, there are some companies where very simple Excel mistakes have led to millions and millions of dollars being lost. And so yeah, that article just kind of highlights some of the worst. And the European Union actually has a whole kind of group that's working on this and dealing with how we can improve these practices with regards to spreadsheets, because the mistakes are so costly.
21:40 Michael Kennedy: Right, I'll just read a couple 'cause I think they're kind of amusing. TransAlta, a Canadian company, made a cut and paste error in their spreadsheet and it cost them $24 million. It caused them to buy some US power transmission hedge contracts at the wrong rate. Fidelity, their Magellan Fund, which is like an investment, mutual fund type thing, had to cancel a $4 per share dividend because they had a missing minus sign. So they thought they made a profit instead of a loss. MI5, the British spy agency, actually bugged over 1000 wrong phone numbers. There's just, like, all these weird errors. Anyways, they're really fun to read for people who are, you know, sucked into that world. So there's the drag and drop LabVIEW type stuff, which is not that great. There's just not very much programming skill at all, maybe a little MATLAB. There's trying to over-leverage Excel. A lot of these could be fixed by, you know, a little bit of Python, a little bit of pandas, or whatever library works with the type of data that you're using, right?
22:58 Martin Heroux: Yeah. And I think with Python it's the fact that, and you've mentioned this yourself, it's just a Swiss Army knife. MATLAB has a license, and its application is very, you know, engineering, math and science focused. Whereas with Python, I had a master's student who came in and finished with us. His first study, he programmed with me. Second study, he did most of it and I just looked over his shoulder and made sure everything was okay. And now he's working, and in a sense he's got a new superpower: it's universal, he can use it for such a variety of things that aren't specifically for science. So it's obviously great. The learning curve isn't that steep if you've got a bit of time, and it's just so diverse compared to some of the more specialized languages. So as an introduction, Python, I mean, it's so simple, easy to read, and it's a new skill. That's kind of how I sold it to the institute where I work here in Australia. They were looking for ideas for almost like a vision for 2022. I just said, well, wouldn't it be great if we could advertise that every student that comes and does a PhD with us will have basic skills in at least one programming language when they leave? And they thought that was a great idea. So that's how I pitched it, that, you know, in this day and age, it's got to be part of the education, not simply the, you know, sitting at the workbench or working with spreadsheets.
22:58 Michael Kennedy: Yeah, something you just have to pick up in your spare time as part of your research project or whatever be better.
22:58 Martin Heroux: Yeah, because we have tons of spare time when we're doing our PhDs, if it's evenings and weekends. And it's a few of us who are, you know, a bit silly enough to do it, like hermits, and we just do it on our own, 'cause we just feel the need to do it. But yeah, a lot of other people, you know, sensibly, don't do it, and that's fine. But I think in the long term, if it becomes more acceptable, and it's actually presented right from the beginning that, hey, this is how it is, you're going to learn a bit of programming, you're going to learn a bit of this, it just becomes part of the learning process, rather than, it would be nice if you learnt it, but there's no structure, we're not going to teach you, and I'm not going to point you to any resources. That's difficult for a lot of learners.
22:58 Michael Kennedy: Yeah, well, I think it's a great service to offer that to your students, right? To say, you're going to learn whatever it is, you're here to study and get your master's degree in or whatever it is. And also, you're going to have this programming skill, in the context of that, because I'm sure that if you go out there and apply for a job, and there's 10 people, probably more, but let's say there's 10 people apply. Two of them also have good programming skills that are relevant to that, in addition, and all things being equal, like that's down to two candidates, you know what I mean?
22:58 Martin Heroux: Yeah, it's really great.
22:58 Michael Kennedy: Now, one of the things you did that I'd like to just chat a little bit about I think it's interesting is you did a survey around scientific computing at your institute, right? So yeah, so let's talk about that a little bit. What was the survey trying to get at and what were some of the results?
22:58 Martin Heroux: Well, after I pitched this idea of training, obviously, I had some ideas. I'd done the Software Carpentry teacher training. So I had my own opinion of how it all should work. But again, who am I to say how everything should run? And what are the needs of the people, especially because where I work, there's people who work on cells. Some people work on epidemiology, some people work on human research. And the fields are as different as schizophrenia, falls in balance people with vestibular issues, we studied the gamut. And so rather than impose what I thought people would need, I just said, Hey, now we've got these easy survey, online things to do nowadays. So how about I just create a little survey, get some information. So I'm a scientist, I like data. So I just said, let me get some data so I can make informed decisions. And also it allows me to track if, I were to continue this over time, I'd be able to see how people are changing. So it was a variety of questions based on you know, what field are you in? What's your level of training or education? And currently, what are your current practices? And then looking at in terms of what would you want to learn? How would you want to learn that and those types of questions. So I sent out the survey our Institute, I can't really say the number right now it's probably, we're now above 200 people for sure, I think depends on how many of the students are present, but it's a bigish institute but not all that big, but we had 80 people responded it was nice, 'cause senior scientists replied, senior postdocs, postdocs, students, and a few staff as well. So the response is kind of reflect a bit of everybody. And so that was kind of nice.
22:58 Michael Kennedy: Yeah, it's good,
22:58 Martin Heroux: As you might expect, it's not all that surprising that pretty much everybody analyzes data with computers. There's some that don't, and I would have to say, well, it's possibly some of the staff and also some of the senior people. It's kind of why, I don't want to become this most senior scientist because at some point, you almost stop looking at the data, you entrust the pyramid of senior postdocs and postdocs and students to to handle the day to day data. And so they kind of don't do that. But most people as part of their job at some point, manage some form of data. So it's as useful as it's just becoming this is what we do we manage with data. And most people want some form of training or knowledge. And I guess one of the unsurprising, as we were talking before about Excel is that most people 80% or so do stuff with Excel. And that's okay. But what's surprising to me and I guess, because Excel is even easier to pick up is the fact that nobody's actually got most people 20% only had actual training had done a course of some sort with regards to Excel.
22:58 Michael Kennedy: Yeah, I think people think Excel is easy. It is easy, but it can be a really advanced tool. When I was in college, I took a course on basically spreadsheets, and man, back in the day it was like Lotus 1-2-3 and all sorts of random things that aren't really around anymore. But I remember learning; I felt like I can use Excel, I can load data, I can do formulas, whatever, but there's a lot to learn about stuff that you can do in there. So only one out of five get any real instruction on it, and that's surprising.
22:58 Martin Heroux: Yeah, and I have a colleague who went off and did an online course for Excel, and he came back and everybody was really impressed. He was showing everybody new things, and done properly, as he was doing it, I could see that, okay, it's not as bad as I used to picture it. But unfortunately, he's one out of, you know, everybody I've ever come across. And so really, the majority of people just think it's just point, click, add some numbers, try a few things. But yeah, there's not much education behind it. And I think what's interesting also is that in the survey I asked about other tools. So for example, we do a lot of data management or statistics. And there's obviously the typical point and click programs, and a lot of those have equivalents that you could do in programming. Many more people were using point and click programs than people who would program, yet the ones that programmed always had more training, just because obviously to learn it, they had to do some form of training. With point and click, the problem is that it's deceptively simple. And I think statistics is one of those areas that a lot of people have allergies to for some reason in science. I kind of like them myself. I think they're important and you've got to understand them, but with this point and click stuff, a lot of people just kind of click some buttons, think it's okay. And then this big output comes out with sometimes some figures, a lot of times these big tables, and they're difficult to interpret. And yet some people just say, oh, where's my p value in there? And then they just say, great, this one worked. And if it didn't, well, you know, maybe click a few different buttons and see if the answer changes, which is obviously not great science, but it does happen. Whereas with programming, you kind of have to know what you're doing from the get go. 
And so there's a lot less chance of that just kind of what people would call p hacking. It's just searching for and trying permutations of it until you get something that oh, that looks better or what I was expecting.
22:58 Michael Kennedy: Yeah, well, if you're going to wire stuff together and push buttons, it's almost always going to give you an output. So yeah, well, there's the answer, right? With code, you've kind of got to really think through it, line it up, or it's not going to give you an output. It's going to give you an exception, a traceback or something, right?
22:58 Martin Heroux: And I think the value of programming, for example, in those contexts, and it's one of those things I'd never even thought of and now I do almost all the time, is this: if I'm using a new statistic, or I'm using a new program to analyze my data, why risk putting in my own data that I spent all this time on and hoping that I'm doing it right to get an answer, rather than generate, which is quite simple nowadays, some data that I know the answer to? So you simulate the data and put in some artifacts or put in a certain trend, so it will give me a known statistical result, run it through, and is it as expected? If so, great, then I can apply that to my actual hard-earned data. And I can trust the results of it more. I'm sure there are people out there who work with statisticians on a day-to-day basis. There's other people who themselves are quite competent. But my experience has been that a lot of people, a lot of labs, not everybody, but a lot of us kind of just do that. We just collect data, put it in the program, fiddle about with our own data, and then in the end see what comes out that makes sense, and hope that, okay, I think I did it right. But it's pretty risky, given this is science and we're going to publish it and make it publicly available. So yeah, it was a bit of an eye opener. But what was nice is that my survey highlighted that, okay, this is what people are currently doing. But what do people want to do? Are they interested in it? And yeah, there was definitely a big interest in learning these skills. One of the questions was, how valuable is it to have computer coding as a skill currently? And the average was that it was important. And then the next question was, how is it for the future, as a young investigator? In your future career, how important is it? It was very important. 
So it highlights the fact that this is a shift: even the fields that may historically not have incorporated coding into the education of their students are now seeing that, yeah, well, maybe I don't do it as a senior postdoc, but for the future, even those people are acknowledging that it's valuable. But as you might expect, one of my questions was trying to understand what they would want, but most importantly, in a sense, what are the biggest problems? Why aren't you coding right now? Why aren't you learning? What are the obstacles? And some of them are addressable.
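The simulate-first check Martin describes in this turn (generate data with a known trend, run it through the analysis, and confirm the known answer comes back) might be sketched as follows. This is a minimal illustration, not his actual workflow: the function names `simulate_trend` and `fit_slope`, the least-squares slope as the "analysis", and all parameter values are assumptions made for the example.

```python
import random


def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs (stands in for the analysis under test)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate_trend(n, slope, intercept, noise_sd, seed):
    """Generate n points on a straight line plus Gaussian noise (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    xs = [float(i) for i in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


# Simulated data where the true answer is known in advance.
TRUE_SLOPE = 2.0
xs, ys = simulate_trend(n=200, slope=TRUE_SLOPE, intercept=1.0, noise_sd=0.5, seed=42)

# Run the analysis and check it recovers the trend that was put in.
estimated = fit_slope(xs, ys)
recovered = abs(estimated - TRUE_SLOPE) < 0.05
print(f"true slope={TRUE_SLOPE}, estimated={estimated:.3f}, recovered={recovered}")
```

Only once a check like this passes would the same `fit_slope` call be pointed at real, hard-earned data.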
22:58 Michael Kennedy: Lack of institutional resources, yeah, we could have more examples. But one of the biggest ones that you got from your survey was lack of time. And it's hard to help people make more time.
22:58 Martin Heroux: Exactly, I wonder whether that's a real problem, or it's a perception. I think, for people who like myself, when I first got introduced to coding, it's the unknown. And you think my God, that's going to take me, you know, a year of my time.
22:58 Michael Kennedy: Right, these are like the Wizards of Silicon Valley that create this magical digital world around us. And it's like engineering plus, with all the math and everything. It seems so hard from the outside, and then you do it. You're like, oh, that's all? That was not so hard. I just called this library.
22:58 Martin Heroux: Exactly. With all the new tools, it makes it much simpler. My one piece of advice to students or people who come and ask me about these things is just pick one task in your next study; don't do the whole thing. Just pick one thing, whether it's about the acquisition of the data, or if it's just making your figures.
22:58 Michael Kennedy: Or cleaning up the data.
22:58 Martin Heroux: Yeah, just pick one thing that you can automate. And if you want, I can help you along, just do one step. And it makes it so different than if you're learning from a book that's giving you examples about, I don't know, leeches, or the economy or some other problem that doesn't relate to me. Whereas with this one, I'm actually using my learning to help me out. You're going to find a solution, 'cause in a sense you have to, but it's not daunting, 'cause you're not going out learning everything that you need. I'm not setting up a web page to present my data or to do a survey or whatever it is. Just pick a simple bite-sized chunk. And then on your next study, just add to that. I think that's why there's the lack of time: they think they have to just stop and learn. And that's just not feasible.
22:58 Michael Kennedy: Right, I got to redo the way I do everything. We're going to rewrite this whole thing. Yeah, exactly. Like, there's a bunch of little steps. You know, I would say, what is the step that makes you the least want to work on this project? You know what I mean? What is the most not fun thing you do? Could you automate that?
22:58 Martin Heroux: Yeah, and I guess the other one I've done for myself, actually one of the big pushes to do Python, was for data acquisition. Some of the research I do involves taking a lot of responses from other people. We do these perceptual illusions, to do with knowing where your body is in space. And we always have to get the answer. And for me, the first few times I participated in these studies and helped out, it was after we were done, once you look at the data, and there's an outlier, or a few points where you're just like, that can't be right. But then what do you do? You go back and it's a piece of paper, and it's a number that I wrote. And then I'm like, what, did I write it wrong? Did I not understand what the person was saying, or their accent might have been different? And then you try to follow the chain, and I hated that because I had no real reason to exclude the data based on an unknown error. And that just frustrated me, and it happened enough that I built this system around testing these things so that it's all on the computer. It's built around Python with a bit of pygame on top for the interface. And that way, now the data goes straight into the computer. Each response is verified with the person who's given the answer. It shows it to them and says, is this actually what you answered? And it's saved so much trouble. And now we've done a study which had zero data that was even in question. And so we've gone from a lot of stress and possibly a few data points that were wrong to now, where I live in peace after the study's done. And you know, we get good as experimenters; if you do it over and over, it's less stressful to run the experiments. I can kind of think on my feet. But when it's the students and it's their first time running an experiment, they're pretty stressed out. There's a lot to think about. 
And you could see how easily they could forget, you know, to check something, or write something down that's not correct. And then they go and copy-paste that into the computer. There's so many levels of errors that can happen there. So for myself, part of it early on was the lowest hanging fruit, but that one was one I had to tackle 'cause I was just so insecure. I didn't feel good about the fact that I just don't know. I mean, I think, let's say 95% of my data is right, but how can I get it to close to 100? And I think we're nearing that. And it was a big undertaking. But now the system's there, and we've used it multiple times over. So you have to pick what's most important, or as you said, what's the most annoying job that you have to do, and data entry for us was one of the huge ones. Nobody wants to do that. That's boring, and so yeah, just do it on the fly. And then the computer does it perfectly every time, as long as you, you know, again, test your code and make sure it's doing what you expect, which is a whole other story we might get into.
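The confirm-before-saving idea from this turn can be illustrated without any interface code. The sketch below is not Martin's pygame system; `collect_response`, `save_responses`, the retry limit, and the CSV layout are all hypothetical, and the input and confirmation steps are passed in as plain functions so the logic can be exercised with scripted answers.

```python
import csv
import io


def collect_response(read_answer, confirm, max_attempts=3):
    """Ask for a numeric answer and accept it only once the participant confirms it."""
    for _ in range(max_attempts):
        raw = read_answer()
        try:
            value = float(raw)
        except ValueError:
            continue  # not a number: ask again
        if confirm(value):  # "is this actually what you answered?"
            return value
    raise RuntimeError("no confirmed response after max_attempts")


def save_responses(rows, handle):
    """Write (trial, response) pairs straight to CSV, with no hand transcription."""
    writer = csv.writer(handle)
    writer.writerow(["trial", "response"])
    writer.writerows(rows)


# Scripted demo: the first answer is rejected by the participant, the second
# is not a number, and the third is confirmed.
answers = iter(["12.5", "abc", "21.5"])
confirmations = iter([False, True])
value = collect_response(lambda: next(answers), lambda v: next(confirmations))

buffer = io.StringIO()  # stands in for a real data file
save_responses([(1, value)], buffer)
print(buffer.getvalue())
```

In a real experiment, `read_answer` and `confirm` would be replaced by whatever the interface provides; the point of the design is that nothing reaches the data file until the participant has seen and confirmed it.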
22:58 Michael Kennedy: Yeah, we should get into that as well. But at least, you know, as long as the software is working, right? It's the foundation, right? Your data is the foundation of your research, which is the foundation of your papers and your work, and you definitely want that to be right. So if you could automate that. This portion of Talk Python To Me is brought to you by Linode. Whether you're working on a personal project or managing your enterprise's infrastructure, Linode has the pricing, support and scale that you need to take your project to the next level, with 11 data centers worldwide, including their newest data center in Sydney, Australia, enterprise grade hardware, S3 compatible storage and the next generation network. Linode delivers the performance that you expect at a price that you don't. Get started on Linode today with a $20 credit and you get access to native SSD storage, a 40 Gigabit network, industry leading processors, their revamped Cloud Manager at cloud.linode.com, root access to your server along with their newest API and a Python CLI. Just visit talkPython.fm/linode when creating a new Linode account. You automatically get $20 credit for your next project. Oh and one last thing, they're hiring. Go to linode.com/careers to find out more, and let them know that we sent you. I do want to come back and ask you a question about this importance of learning computer programming for the future grad students and whatnot. One of the things I think is interesting here is probably that means people are seeing more and more data that they need to work with. That's probably one angle, I would suspect that's pretty straightforward. I wonder about another one, almost like a grant money angle as well, because when you're a new grad student or postdoc, maybe you don't have a grant yet, and maybe you don't have a lot of resources, and with things like Python and SciPy, it doesn't matter if you have money or not. 
You have some of the best computational tools in the world, the ones that people use to find black holes and work on the Higgs boson and whatnot, and you get those tools for free as well. What do you think about that angle?
22:58 Martin Heroux: Yeah, well, I mean, for us, that's actually the truth: there's things that I wouldn't have been able to do. Some of it is going data mining on our own data that we've accumulated. And I'm about to do the same thing with a colleague of mine that I write a blog with; we're going to data mine three different data sets that have just been lying there, that each have a piece of information that would be really useful together, especially since doing it by hand or in Excel is, I think, an insurmountable task. But programming makes that possible. And then the other thing is, as you mentioned, there's this move toward open science, and funding agencies especially are asking people to make their data public. Some journals also ask you to make your code public. So now really, you have access to all these things. Maybe I can't collect this data, maybe I can't do this, but I have access now. I can download it, I can search through it. I can ask new questions of old data.
22:58 Michael Kennedy: Right, get their Jupyter Notebook and their data, and you can start from there, right? You can start exploring and changing it and slicing it differently and trying to make discoveries, right?
22:58 Martin Heroux: Yeah, and I think that's the whole push towards open data. When you do clinical studies, for example, clinical trials to see if an intervention works, well, that's important. But then really, it's the meta-analysis, the thing that puts it all together. And so there's that push for people to present the data in their papers in a way that could be useful for scientists to do science on science. They analyze all the studies, put them together and say, overall, the evidence is this. Well, similarly, I think this is where coding and open data can help: let's not reinvent the wheel. Some of these data sets can be combined, or they can be just searched by people who have completely different ideas that you've never thought of, and somebody is going to go out there and either, you know, contact you or just move on with that data themselves. And I think progress will just happen a lot faster and more efficiently. And that kind of research, as you're pointing out, doesn't really need any grant money. I mean, you can work on a fairly cheap machine and just get the data and try out a few things. And that's where it's going: definitely big data, but big collaboration as well.
22:58 Michael Kennedy: Right, it needs time, but it doesn't need money for resources potentially, right?
22:58 Martin Heroux: Yeah, it's a culture change, because of the incentives. There's a lot of incentive to be exclusive: I own this data set, I'm the one who collected it, my group pushes these results out, and I want to keep it. I don't want to be scooped by anybody else, for example. Some of that I can understand.
22:58 Michael Kennedy: Right, there's still discoveries to be made in this data. And here's our first one, we're hanging on to it for the rest, right? I do understand that might be hard file together.
22:58 Martin Heroux: Yeah, and so I think it's a culture change that way. Gradually, I think, people will be more transparent. And hopefully, you know, you can see putting an embargo on it, saying we'll make it public, but in a certain amount of time. But the competition for these things is quite high. So you can see why some people like to kind of hoard their data a little bit and not necessarily make it publicly available. But the move is for a lot of the top journals to require people to put their data in some form on public repositories. So the move is there. And actually a lot of funding agencies in Europe, and I think the NIH, have some of these policies: if you're funded by us, well, you're going to make your data available to others.
22:58 Michael Kennedy: Right. This is ultimately paid for by the taxpayers. This is not yours, so share.
22:58 Martin Heroux: Exactly. It's similar for open publications: you know, we're giving you the money to do this, therefore, when you publish, you have to publish in open access journals, or you're going to pay the fee to make it an open publication within that paywalled journal. So those are the other similar terms of the move towards open science.
22:58 Michael Kennedy: Sure, well, one of the interesting things that you've written about, and I think it comes back to this sort of which comes first, but one of the challenges is the life cycle. And the openness, or the lack thereof, of the data and the algorithms and the libraries for science is not necessarily the same as for a random software project by software developers, say Flask, right? You talked about some of the important lessons that people in the science space can take from the computer science side of things, things like GitHub and issue tracking and whatnot. Do you want to talk about that a bit?
22:58 Martin Heroux: Well, it's computer science. And I always wondered, why is it called a science? Really, aren't they just, you know, nerds or geeks on the computer? And I had that false perception, obviously. And now I'm kind of one of those people myself to some extent. But when I was starting out, it was all, isn't it just a mishmash, they're just typing? There's no rhyme or reason to it. And to be honest, it's kind of interesting. I coded in MATLAB for the longest time. And I had so many misconceptions. My PhD code was two scripts, you know, and I thought it was cool that I had like 3000-some lines of code. It was so much copy and paste. It was very embarrassing. I don't want to go back to that. I just didn't know any better. And the people who taught me didn't know any better.
22:58 Michael Kennedy: Nobody said, What are you doing here? They're like, well, that looks like mine. So this must be fine.
22:58 Martin Heroux: So it's just an inheritance; unfortunately, the natural selection is in the wrong direction in this case. So gradually, you know, I read up more. Software Carpentry pointed me in a good direction. And then I read some books. I fell upon your podcast, and then Brian Okken's as well. And then I started realizing that all this stuff has already been figured out. Obviously, people figured this out a long time ago, in the 70s and 80s, and up till now. It's just so much simpler, the workflow, all the things like GitHub with making suggestions or changes and all those kinds of things, pull requests.
22:58 Michael Kennedy: You put your code up there, and somebody will find if they find something wrong, they'll put an issue in there. If they can fix or improve something, they'll do a PR, and it's, science talks so much about this research study is valid. We know the Big Bang happened this way, 'cause this is peer reviewed. Right? Yeah. And I mean, that's kind of what GitHub is. Just less formally, less credentialed, but in a sense, right.
22:58 Martin Heroux: Yeah. But what's interesting there is, we have a floor meeting once a week, and just before Christmas I did a talk on this comparison of what we can learn from computer scientists. And one bit that I was pointing out was that a computer scientist will write their code and do it as best as they can, and then put it up there. And in a sense, it's not that they expect there to be bugs, because some people will then go to the next step, which is they'll make tests for their code. But there's still this knowledge that there possibly, and most likely, is a bug of some sort, or a case that I've not considered, so that if you throw it at my code, okay, it'll crash. And that's expected. And then when you get a pull request, you're thankful. And you say, you know, great, it's not just improving the code, it's actually fixing a problem with my current code. And it's this iterative process that happens over time for improvement. Versus in science, and I realize this doesn't speak to everybody out there, but a vast majority of us, whether we code or not, just do our work. And then we polish it up into this little, you know, either Word or PDF document that we submit to a journal. And this polished final version has, most of the time, no data attached to it; it's just a few figures and some tables. It gets assessed by two or three of my peers, and they have no idea of all the steps that were taken. There is research on this that shows the number of somewhat arbitrary decisions that are made along the way in collecting and analyzing your data: okay, well, I'm going to use this versus that. I'm going to exclude these cases versus those. I'm going to use this test or this filter setting versus that. There's so many permutations that in the end, of course you're going to find something significant, but that in itself is just a self-fulfilling prophecy. 
You're setting yourself up to find something. And then there's the Texas sharpshooter example. It's that thing of the guy who's got a barn and there's bullet holes in the barn and there's always a red circle around them. And this guy comes by and he says, "Wow, do you hit that every time?" He goes, "Yeah, I'm so good." And then his wife comes out, and she says he actually just shoots the barn and draws a circle around it afterwards. So if you want to find something, you'll find it. And yeah, I don't want to say that I'm skeptical about other scientists. But one of the people I work with, Simon Gandy, does a lot of work on cognitive biases. And we have to almost protect ourselves from our own science in a sense, because the people doing it aren't the computers, it's us; we have to make the decisions. And without knowing it, we have these biases. You're not aware of it. Somebody might even point it out to you, and you might just not acknowledge that it's there. You might think that doesn't apply to me, but I know other people that it applies to. And so we just have to come up with ways, and I think that coding, making decisions before the data comes in, and making it transparent is one of the ways that we can improve what we do. And so in terms of things that we can learn from computer scientists, I think, for example, the push towards publishing the data and the code is so interesting, because publishing it is a big stress on people. Currently, it's almost like some people don't even make an effort to clean the data or the code 'cause they don't want anybody to use it.
22:58 Michael Kennedy: Right, just leave it totally obscure and like, well, I have no idea what this does. And the variable names are bad. And what is this about? And just forget it, right? That's exactly the outcome.
22:58 Martin Heroux: Yeah, and there's somebody out there as well saying that, I think, over 80% of the data sets or code that are out there are actually not really useful. But if you think of it from a computer science point of view, well, that's actually almost the start of it, right? Unless you have a very big group, and there's been multiple eyes on the code, this might be the first public appearance of this code. And it would be nice if people had the chance to look at it, run it a few times, figure out if there's any issues. But the risk, the way the current publishing system works, is that if anybody finds a genuine error, not just a little error, but anything that's of significance, that means the paper probably will need to be retracted, which is a huge deal in science. Versus in computer science: well, sure, put in a pull request and I'll fix that for you. They're humble enough to know that I'm not perfect, there's going to be things, I may have made a mistake in this. Versus in science, it's what we publish, and we just want to move on. We want to just put that thing on my CV; I've ticked that box, another number, and I move forward. So that's one lesson that I don't know how to incorporate exactly, in terms of the workflow of how we do things.
22:58 Michael Kennedy: It's very tricky, because the incentives are sort of opposed to the right behavior, like you described: if there's a big problem, we're going to have to retract the paper. You know, there are some things trying to solve that a little bit, like JOSS, the Journal of Open Source Software. Are you familiar with that? So I mean, you can publish the code there and get some eyes on it. But fundamentally, right, eventually you've got to take it and do the research. And if the problem's found afterwards, it's bad. But at the same time, I mean, isn't it really the Zen of science to get it right?
22:58 Martin Heroux: Again, the pressure, the time constraint that we all have, makes it that it doesn't happen. But another lesson from computer science, and part of this was from listening to Brian Okken and his podcast, and then also trying to get through his book, which is really good. But again, it's that thing of it's good but it doesn't really apply, or at least I couldn't jump into it and directly apply it. It's too big of a step for what I do, because some of my code is reusable, but a lot of it is case specific. And so to build up a whole test suite around my code seems a bit overkill. But that's definitely a way to prevent the issues, if I could, you know, test my code, maybe simulate some data.
22:58 Michael Kennedy: That'd be very useful. And I think it's a great lesson, right? Take pytest and write some tests, or even do it in MATLAB or whatever. But I do think testing science is harder than testing a lot of the things that people talk about testing.
22:58 Martin Heroux: Yes, yes.
22:58 Michael Kennedy: Take unit testing. For example, if I write a test for my online course site, right? I want to test that users can sign up, but they can't sign up if they pass a bad email address or something like that, right? Here's a good email address: it returns true, the email is valid. Here's a bad email address: it returns false, email not valid. But if I have, here are some wiggles in gravity, was that a black hole collision? Like, I don't know, we've never observed it before. You know what I'm getting at? It's just really hard. And so I don't know, do you have any advice? Like maybe take data that you know the outcome of and then try to run your code on it? Predict something that you know should be some way? I don't know, what do you advise people there, 'cause it seems much harder, from a software side, to write the right test for that?
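The "easy" kind of test Michael describes here, a good email returns true and a bad one returns false, could look roughly like this in pytest style. The validator `is_valid_email` and its deliberately simple pattern are assumptions made for illustration; they are not the code behind his course site, and real-world email validation is more involved.

```python
import re

# Deliberately simple pattern: something@something.something, no spaces, one "@".
_EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_valid_email(address):
    """Return True if address looks like an email address (hypothetical validator)."""
    return _EMAIL_PATTERN.fullmatch(address) is not None


def test_good_email_is_accepted():
    assert is_valid_email("user@example.com")


def test_bad_email_is_rejected():
    assert not is_valid_email("not-an-email")
    assert not is_valid_email("user@@example.com")
```

Saved in a file whose name starts with `test_`, these functions would be collected and run by the `pytest` command. The contrast with scientific code is that here the expected answer for each input is known for certain, which is exactly what is missing in the black hole example.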
22:58 Martin Heroux: Yeah, definitely. I think you test your code where it's possible, in terms of just line to line or functions. But then, when it's actually interfacing with the data, I think simulating the data, or using a data set that's already been analyzed and processed, those are the two things that I think have to be done a little bit more. Some fields just do it natively, whereas we don't. And sometimes there's a limitation to how much we can test the code in this case, but I think that being able to simulate data, and knowing the outcome before you run it through, gives you at least some amount of certainty. So yeah, that's one thing that people could start doing a little bit more. But where do they start? I mean, for myself, that's one that I don't know. Greg Wilson, who was the founder of Software Carpentry, has the series of papers on, in a sense, best practices for scientific computing, and good enough practices, and in there, it's true, we don't have enough time to be perfect. So good enough, for most of us, is good. And he mentions testing your code, but that's where it ends. You know, he might put a reference, but I don't necessarily think that the current way to do it for big software programs or projects translates all that well. Obviously, for some types of sciences, that's what it is, because it's all based around big programs and big pipelines of data analysis. But for the people that are more like myself, that do data collection themselves, they write their code themselves, they write the papers themselves, I'm just not sure how to test my code. Or if I do, it's going to be clunky. It's going to take me a long time. And is it necessary to be at that level? I would love for somebody to do it. But again, it's this thing: everybody's strapped for time. And so somebody like Brian obviously has no interest in writing a textbook or a blog series on how scientists can test their code. 
And also, it's the fact that we're so different, that the fields are so varied. And so I still feel that while I think computer uptake, in terms of programming, Python, those types of things, has definitely skyrocketed, some of the tools that come along with computer science haven't matured or morphed enough to make them easily implementable. So, you know, GitHub, version control, obviously, is another important thing to talk about in terms of computer software, and I think that Software Carpentry teaches that in its two-day course, and it's just logical. In terms of the way scientists do it, there's a comic strip called PhD, Piled Higher and Deeper. And there's this one that everybody often puts up as a slide: he has his thesis as a Word document. And it has thesis version one doc, and then it has version 1.2. And then it has final, and then final final. And then final, final underscore dash final changes. And I cringe because I actually looked back, when I found that, at my own thesis, and that's exactly what I had.
22:58 Michael Kennedy: Look this is an interesting naming convention. It's slightly different than mine.
22:58 Martin Heroux: Everybody kind of laughs at it when you show it, but really, there's no great workaround, because, you know, like, Word, for example, does have this inbuilt thing that you can, you know, track changes, everybody loves it. But once you accept that change, it's gone. Yeah. So if you want to have a history of it, you need to save a new, a new version of it. And there's the more collaborative online things that you know, Google Docs and stuff like that, but still the idea of having a history And in a sense, that's what science is about is being able to show from the raw data, all the steps that were done, and I think it should continue on to the manuscript as well. So version control is another important lesson that I apply.
22:58 Michael Kennedy: I just noticed that Google Docs, I know they had a version history, but you can tag different versions, which is cool. Like you could have final final final final changes.
22:58 Martin Heroux: And I guess the last thing to highlight... Again, I'm a bit of a geek, so I'd actually like to do this. But the people I've been surrounded by generally aren't people who code, and especially don't code in Python. But this idea of a code review, I was baffled to hear that people actually do that, like people sit in a room and present code. And I was just like, as a person who doesn't know how to code or is just learning, to just sit in on that would be useful. And then also to sit there and have other people who code kind of critique it, but also improve it or see if there's any issues and help answer some of my questions. I'm like, wow. And I'm sure it happens in some bigger labs. But on the day to day, pretty much my second personality is the person I do my code review with. That's about all I have: I just sit there and ask, does this make sense? Talk out loud, maybe. And it's another way, because it seems currently that once the publication is out there, you don't want any problems to be found, 'cause you may have to retract it. So what can be done beforehand?
22:58 Michael Kennedy: Right. How much can you bring that up ahead of...
22:58 Martin Heroux: Exactly.
22:58 Michael Kennedy: Yeah, absolutely.
22:58 Martin Heroux: Yeah, so that's one of the things that we could do. But with the time constraints, unfortunately, I haven't seen it implemented very much.
22:58 Michael Kennedy: It sounds to me, and from my experience working in a cognitive science research lab and some other stuff, like maybe it's time, but the even bigger challenge is expertise, right? Usually, you know, who are you going to go to? Because you're the person that knows the most about that. So, here we are, I'm going to do what I can, right? Yeah, I think it would be a really cool service to set up some kind of GitHub-like thing, some kind of online thing, maybe you can link your GitHub profile to it or something, where it allows scientists to donate a little bit of time in order to receive code reviews of their code, right? If I put up my code into this project, it'll still stay hidden, it's not going to come out. But if I agree to code review somebody's stuff collaboratively with them, then they'll do it for me, or someone else will do that for mine, right? Like I earned a half-hour code review by doing a half hour for someone else. It seems like there's probably some way to put this out on the internet and create it in a way where, you know, the 20 people that are doing this kind of thing actually could get together and have a look and help each other out.
22:58 Martin Heroux: Yeah, no, I think that'd be useful. And, you know, you have to find the people that you would trust to do that. But it's similar to, you know, the burning beans for giving good feedback. I think here you can earn trust points or whatever it would be, you know, for the quality of your comments and keeping it to yourself. But overall, yeah. And proper coding practices, simply breaking it down into small viable chunks, mean that you can reuse it yourself. So once maybe you've got some feedback on a section, or maybe somebody will tell you to turn that into a function, well, then once it's been reviewed, you can kind of trust it. But I think a lot of us just get into it, we just want the answer. And in a sense, we probably reinvent the wheel way too many times.
22:58 Michael Kennedy: Yeah, that's something I definitely saw a lot of. Like, here's a huge long script with no functions or branching, other than an if statement. You know, it doesn't have a lot of structure to it, which means it almost cannot be reused, right?
22:58 Martin Heroux: Exactly.
22:58 Michael Kennedy: And one of the big differences between scientific computation code and computer science, formal software developer code, and you had hinted at it before, is that you often hear that code is read way more times than it's written. It should be written first for the software developers, and then secondly, so it also runs, things like that, right? Whereas with science code, it's like, once you get it working and you get the graph, you're done. You don't need to polish it, you don't need to touch it again. It works. It did the thing, we got the outcome, we're going to come up with something different next time or whatever, right? And I think that puts a very different kind of pressure on organizing your code, reusing your code, documenting your code. There's just all these different pressures. And I'm not saying one is better than the other necessarily. If your job is to, like, quickly analyze this data and get it out, I'm not going to tell you that you have to make a package while you're out there, right? Like, that's not your problem. But it does mean the code is treated differently and potentially has errors lurking in different ways and so on, right?
01:00:44 Martin Heroux: Yeah, and I think the workflow hasn't been adapted very much to the life cycle of the kind of code that we would write. And I think the lesson is: take as much as you can and turn it into functions and things that you can reuse. Then, you know, build tests around those, for example, or have code reviews on those. And then the nuances for this current study, well, that'll be something to maybe just be a little bit more careful about, and maybe have somebody look at it or check the outputs. But it's true, and Brian Okken said that in terms of testing: once the test is built, once the code is written, you're going to make it legible, obviously, make it clear, but also make sure that it's correct, because you're going to be looking at this over and over and over. Whereas in science, I write all the time, I read very little code. I barely know how to read other people's code. And that's interesting to me too, because I'm sure that with all your experience with the various languages, you can kind of look at different code and pick up quickly the general structure of things, even if the variable names are funny. Versus for me, because I've got such a personal style that I've kind of adopted based on the various resources I've used, you know, I can read my code very well. But then the second it goes just slightly differently, I'm just bleary eyed. I'm like, well, what are they doing here? And I can't. And it's because we very much just code for ourselves. Not everybody, but many scientists code for themselves. And so it makes it a little bit more difficult. But that idea of code reviews, I agree that if somebody else doesn't know how to code, well, they're not going to be able to help you with the line-to-line things.
And while I have a bit of a love-hate relationship with Jupyter Notebooks, the one thing that I think they're really useful for is that you can sit in a room with your lab group, for example, and you can intertwine code with outputs, with figures, and you can tell the story of your code. And, you know, if you're calling the wrong function, or there's a little bit of a mistake, those might be more difficult for others who don't code to find. But at least the structure of your code, what it's doing, what exceptions it's trying to catch, you can show that. I mean, I think Python's very readable as a language. But I think that the Jupyter Notebook makes it even more accessible, because you can document it, you can put little interpretations here and there, and then the output. It's kind of a way to do a bit of a code review for people who would probably run away from that and be a bit scared. If you say, "Let's come to this code review at 12 today," nobody's going to show up. But, "Come on, I'm going to show you how I analyze my data, and I'll show you some results." You can kind of secretly put in a bit of a code review and hide it in a Jupyter Notebook disguise.
01:03:33 Michael Kennedy: That's a good idea. Very, very, cool, well, thinking through this, looking back on what you've been saying, it does sound like there's some interesting things to take from the computer science world and adopt into the scientific world. But I would say one really good thing that it seems to me is, if what we're doing is moving people from Excel into something like Python, even if there's fewer tests than maybe ideally and whatnot. Surely the code can be more validated in Python or in Jupyter than it could be as it is in Excel, right? So that's got to be good for reproducibility and correctness.
01:04:09 Martin Heroux: Yeah, no, I totally agree. And simple lessons about variable naming, and just how to structure the code into functions. The more functions you have, then the main script, if you like, just becomes very readable. In one of your courses, you highlighted that. I was surprised at how long your function names were. But on the other hand, they were so clear, I'm like, I know exactly what this thing does. Versus I had this weird idea, I don't know where I learned it, that you have to abbreviate everything and keep everything as short as possible, right? And then six months from now, I can't remember what it does. But when you take the time to just make a very clear, descriptive function name, then the main script actually reads really well. And so those are little lessons that'll help everybody interpret it a bit better. And if you make that public, it's so much easier for somebody else to look at that, possibly even reuse it. Whereas if you open up an Excel spreadsheet, I don't even know how you would find all the cells that have background computations or macros.
01:05:08 Michael Kennedy: I don't even know how you get started. I have no idea. You can't get to every Excel. It's an Excel there, you got it, it's crazy. Yeah, well, I really appreciate that comment. Like you're allergic to Excel, I'm allergic to comments. So I'm about to write a little named function and give it a comment to say what it should do. And I'm like, "Oh, maybe it's named to just be what it does." And then we won't need a comment, will we?
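To make the point about long, descriptive function names concrete, here is a small sketch. Everything in it is hypothetical and invented for illustration: the trial records, the `response_time_ms` field, and all three function names. What it tries to show is that once the steps have clear names, the main script reads almost like a description of the analysis.

```python
import statistics


def remove_trials_with_missing_response(trials):
    """Keep only the trials where a response time was recorded."""
    return [trial for trial in trials if trial["response_time_ms"] is not None]


def convert_response_times_to_seconds(trials):
    """Return the response times of the given trials in seconds."""
    return [trial["response_time_ms"] / 1000 for trial in trials]


def compute_median_response_time(response_times_s):
    """Return the median of a list of response times."""
    return statistics.median(response_times_s)


def main():
    # Hypothetical raw data: one dictionary per trial.
    trials = [
        {"response_time_ms": 420},
        {"response_time_ms": None},  # participant did not respond
        {"response_time_ms": 380},
        {"response_time_ms": 510},
    ]
    # The main script reads top to bottom without needing comments.
    valid_trials = remove_trials_with_missing_response(trials)
    response_times_s = convert_response_times_to_seconds(valid_trials)
    return compute_median_response_time(response_times_s)


if __name__ == "__main__":
    print(main())
```

Compare this with names like `rm_na`, `conv`, and `med`: the code would run the same, but six months later the main script would no longer explain itself.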
01:05:29 Martin Heroux: Yeah, that's a good tip. So I use it all the time.
01:05:30 Michael Kennedy: That's awesome. I really do like this Good Enough Practices in Scientific Computing that you referenced. And I'll be sure to put that in the show notes. But it's got really easy-to-adopt and reasonable stuff. So it's clear, concise steps. That's really great. All right, Martin, I think it's a good place to leave it for our main topic. But before you get out of here, let me ask you the last two questions. If you're going to write some Python code, do some research, what editor do you use?
01:05:57 Martin Heroux: It's been PyCharm for a few years now. And just for anybody out there who is an academic or a researcher, you can get a free pro license if you just email them your academic email. And you can renew it every year.
01:06:09 Michael Kennedy: I thought that's cool. So if it comes from edu or something like that?
01:06:13 Martin Heroux: Exactly, and then you get the functionalities for the scientific mode. That's the main one I would use.
01:06:17 Michael Kennedy: Right.
01:06:17 Martin Heroux: But there's probably others that you get. You can actually access all the JetBrains IDEs, but personally, I just use PyCharm. But yeah, so we can get a pro license as academics.
01:06:27 Michael Kennedy: Nice editor, and nice tip. Alright, notable PyPI package that you've run across that people should know about.
01:06:33 Martin Heroux: I would say, one that I emulated, because at the time it was a bit above my head, is something called PsychoPy. Not as in psychopath, but psychology. And it's this framework to collect data while presenting different types of stimuli that you might use in psychology, like visual illusions or sounds. And it's nice, they have actually now implemented it so you can design and run your experiments on the web, but you can also implement it on your own machines. So PsychoPy is from a group in England that's been doing that. And just recently, there's a package called dabest. I think it's dabest. I don't know how to say it, D-A-B-E-S-T py.
01:07:10 Michael Kennedy: dabest.
01:07:10 Martin Heroux: Yeah, it's data analysis through estimation. And pretty much there's this trend towards moving away from just simple p values, and just saying that something is significant or not, to providing appropriate estimates of these effect sizes. And sometimes your data is parametric, which means normally distributed, other times it's not. And so there's this move towards estimating these things and the confidence intervals. But some of the statistical software doesn't do it for you. So these people, I'm pretty sure they're from Singapore. And it's cross platform, but they have a version for Python. And it computes your estimates for you, and it gives you these wonderful graphs of the effects. And it's all done through bootstrapping, so it has no assumptions about the distribution of your data. So that's a really neat one. It came out, I think, not even a year ago. So that's quite useful.
01:07:57 Michael Kennedy: Yeah, those are great suggestions. Awesome. Alright, final call to action. People are out there. Maybe they're scientists or they do scientific computing. They want to bring some more of these ideas from computer science to the world. What do they do?
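For readers who have not met bootstrapping before, the following is a rough sketch of the general idea of estimating an effect size with a bootstrap confidence interval. It is not dabest's API or its exact method: the function name `bootstrap_mean_difference_ci`, its parameters, and the simple percentile interval are assumptions made for this illustration, using only the Python standard library.

```python
import random
import statistics


def bootstrap_mean_difference_ci(control, treatment, n_resamples=2000,
                                 confidence=0.95, seed=0):
    """Estimate the mean difference and a percentile bootstrap interval.

    Hypothetical helper for illustration only. Each group is resampled
    with replacement, so no distribution is assumed for the data.
    """
    rng = random.Random(seed)  # fixed seed so results are repeatable
    differences = []
    for _ in range(n_resamples):
        resampled_control = rng.choices(control, k=len(control))
        resampled_treatment = rng.choices(treatment, k=len(treatment))
        differences.append(
            statistics.mean(resampled_treatment) - statistics.mean(resampled_control)
        )
    differences.sort()

    # Percentile interval: cut off equal tails of the sorted resampled differences.
    alpha = 1.0 - confidence
    lower_index = round((alpha / 2) * n_resamples)
    upper_index = round((1 - alpha / 2) * n_resamples) - 1

    observed = statistics.mean(treatment) - statistics.mean(control)
    return observed, differences[lower_index], differences[upper_index]


# Made-up example measurements, not real data.
control = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4]
treatment = [12.0, 11.7, 12.4, 12.1, 11.9, 12.3, 11.8, 12.2]

observed, lower, upper = bootstrap_mean_difference_ci(control, treatment)
print(f"mean difference: {observed:.2f}, 95% CI: [{lower:.2f}, {upper:.2f}]")
```

Reporting the estimate together with its interval, rather than only a p value, is the shift in practice described above. A dedicated package would add refinements and the plots.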
01:08:07 Martin Heroux: I would say, for yourself, pick the smallest viable chunk and implement something for yourself. And then also just try, in the tea room or coffee room, to just mention it to people. And if you hear about issues, don't sound like a smartass: "Why didn't you code that?" Just say, "Hey, if you want to sit down for two seconds, I can just show you something." Because I think that there's this dualism of people who code and those who don't. And I think that the magic of it, for some people, is a bit intimidating. And so rather than keeping the magic for ourselves and being a bit elusive, I think, just demystify the whole thing and just say, "Hey, look, it's really simple. Let's just do this." And I think the more people can do that and make it part of the culture of how we do things, I think that helps. And support from above. So, you know, the senior scientists who may not have come through at a time when coding was available all that readily, or when it was a specialty: even if they don't want to code themselves, support the junior people and realize that it matters for their future careers, and just for the betterment of reproducible science, try to support it if you can.
01:09:07 Michael Kennedy: Very good advice and super interesting ideas. Thanks for sharing them with us.
01:09:11 Martin Heroux: Alright, no, thank you, Michael.
01:09:12 Michael Kennedy: You bet, bye.
01:09:12 Martin Heroux: Bye.
Michael Kennedy: This has been another episode of Talk Python To Me. Our guest on this episode was Martin Heroux. And it's been brought to you by Clubhouse and Linode. Clubhouse is a fast and enjoyable project management platform that breaks down silos and brings teams together to ship value, not features. Fall in love with project planning, visit talkpython.fm/clubhouse. Start your next Python project on Linode's state of the art cloud service. Just visit talkpython.fm/linode, L-I-N-O-D-E, you'll automatically get a $20 credit when you create a new account. You want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course. Or if you're looking for something more advanced, check out our new Async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle. It's like a subscription that never expires. Be sure to subscribe to the show, open your favorite pod catcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.