What scientific computing can learn from CS

Episode #252, published Fri, Feb 21, 2020, recorded Thu, Jan 30, 2020


Did you come into Python from a computational science side of things? Were you just looking for something better than Excel or MATLAB and got pulled in by all that Python has to offer?

That's great! But following that path often means some of the more formal practices from software development weren't part of the journey.

On this episode, you'll meet Martin Héroux, who does data science in the context of academic research. He's here to share his best practices and lessons for data scientists of all sorts.

Episode Deep Dive

Guest Introduction and Background

Our guest, Dr. Martin Héroux, has an unconventional path into Python and scientific computing. He started out as a physiotherapist, later went on to earn a PhD, and discovered firsthand how coding and automation could transform his research workflow. Formerly reliant on MATLAB, Excel, and LabVIEW, Martin was drawn to Python for its readability, open-source philosophy, and large scientific ecosystem. Today, he uses Python-based tools to streamline data acquisition, analysis, and reproducibility in academic research.

What to Know If You're New to Python

If you’ve come across MATLAB or Excel but have not tried Python yet, this episode highlights the simplicity and power Python brings to scientific workflows. Here are a few quick tips and reminders before diving into the detailed recap:

Python’s readability reduces the barrier to writing clean, repeatable code.
Scientific libraries such as pandas, NumPy, and matplotlib can often replace or extend Excel and MATLAB functionality.
Tools like PyCharm or Jupyter Notebooks can help you stay organized and visualize your analyses as you go.
Even partial automation (e.g., cleaning data or generating a single report) can immediately pay off.

Key Points and Takeaways

1. Transitioning from Excel / MATLAB to Python

Many researchers find Python to be more flexible, open, and scalable than the tools they started with. While MATLAB or Excel can work for certain tasks, point-and-click workflows are hard to review and reproduce, whereas a Python analysis lives in plain-text scripts that can be shared, version controlled, and rerun.

Tools and Links
- pandas
- NumPy

2. Common Errors and “Excel Hell”

Working with large, complex spreadsheets leads to hidden and untrackable errors, as illustrated by high-profile spreadsheet disasters. Python’s scripting and coding-based approach can minimize these mistakes and make it easier to validate your results.

Tools and Links
- Episode #200 “Escaping Excel Hell” (Talk Python Podcast)
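
One error of this kind that Martin describes in the episode is a missing response being silently counted as zero when three responses are averaged. The sketch below contrasts that behavior with an explicit handling of missing values. It is only an illustration: the function names and the sample values are hypothetical, and it uses the Python standard library rather than any tool named in the episode.

```python
from statistics import mean


def mean_of_responses(responses):
    """Average the recorded responses, skipping missing ones (None)."""
    recorded = [r for r in responses if r is not None]
    if not recorded:
        raise ValueError("no recorded responses to average")
    return mean(recorded)


def mean_with_zero_fill(responses):
    """Reproduce the spreadsheet-style error: a missing response counts as 0."""
    return mean(0 if r is None else r for r in responses)


# Hypothetical data: the second of three responses was never recorded.
responses = [4.0, None, 6.0]

correct = mean_of_responses(responses)   # averages 4.0 and 6.0 only
wrong = mean_with_zero_fill(responses)   # averages 4.0, 0 and 6.0
```

Because the handling of missing data is written out in a named function, a reader can see the decision and question it, instead of hunting for it in a cell formula.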

3. Software Carpentry Influence

Projects like Software Carpentry teach foundational coding skills and version control practices to scientists. These short courses give researchers just enough confidence to begin automating tasks and managing data more systematically.

Links
- Software Carpentry

4. Reproducibility and Open Science

In research, reproducibility is critical. Python’s approach, through scripts, version control, and clear code, lets others (or your future self) understand exactly how results were generated, potentially reducing retractions and errors down the road.

Tools and Links
- GitHub
- Version control basics

5. Leveraging Simulated Data and Testing Approaches

Instead of testing code directly on the final dataset, researchers can generate simulated data whose outcome is already known. This technique (akin to unit tests in software) helps confirm that the analysis pipeline works as intended before it touches real data.

Tools and Links
- pytest (for automated testing)
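
As a minimal sketch of this idea, the example below simulates two groups whose true difference is chosen in advance and then checks that a simple analysis function recovers it. The analysis (a difference in means), the function names, the sample size, and the tolerance are all assumptions made for illustration, not the pipeline discussed in the episode. The test function uses a plain assert, so pytest can collect it, but it also runs without pytest.

```python
import random
from statistics import mean


def mean_difference(group_a, group_b):
    """The analysis under test: difference between two group means."""
    return mean(group_b) - mean(group_a)


def simulate_groups(true_difference, n=5000, sd=1.0, seed=42):
    """Simulate two groups whose true mean difference is known in advance."""
    rng = random.Random(seed)  # fixed seed keeps the simulation repeatable
    group_a = [rng.gauss(10.0, sd) for _ in range(n)]
    group_b = [rng.gauss(10.0 + true_difference, sd) for _ in range(n)]
    return group_a, group_b


def test_mean_difference_recovers_known_effect():
    group_a, group_b = simulate_groups(true_difference=2.0)
    estimate = mean_difference(group_a, group_b)
    # The standard error here is roughly sd * sqrt(2 / n), about 0.02,
    # so a tolerance of 0.1 leaves a wide margin for sampling noise.
    assert abs(estimate - 2.0) < 0.1
```

If the analysis code had a bug, such as subtracting the groups in the wrong order, this check would fail before any real data was involved.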

6. Code Reviews for Scientific Scripts

Computer science teams normalize code reviews and see bugs as part of the improvement process. Academia could benefit from a similar mindset: a peer or group review of analysis scripts helps catch mistakes and fosters collective coding knowledge.

Tools and Links
- Pull requests on GitHub

7. Cognitive Biases and p-Hacking

Subtle “p-hacking” and trial-and-error approaches can emerge when you rely on point-and-click stats tools. Coding in Python often enforces a more deliberate, documented workflow that lessens the risk of chasing spurious significance.

Links
- Texas Sharpshooter fallacy overview (Wikipedia)
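
To make the risk concrete, here is a small simulation, offered as a sketch under stated assumptions rather than anything shown in the episode. Each simulated "study" measures 20 outcomes that are pure noise and tests each one at the 0.05 level with a simple z-test (known standard deviation). The number of outcomes, the sample size, and the function names are hypothetical choices.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(sample, sd=1.0):
    """Two-sided p-value for 'the true mean is 0', assuming a known sd."""
    z = mean(sample) / (sd / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def study_finds_something(rng, n_outcomes=20, n=30, alpha=0.05):
    """Simulate one study that tests n_outcomes pure-noise outcomes."""
    return any(
        z_test_p_value([rng.gauss(0.0, 1.0) for _ in range(n)]) < alpha
        for _ in range(n_outcomes)
    )


rng = random.Random(0)
n_studies = 2000
hits = sum(study_finds_something(rng) for _ in range(n_studies))

# Share of studies reporting at least one "significant" result from noise.
false_positive_rate = hits / n_studies
# For independent tests the expected share is 1 - 0.95**20, about 0.64.
expected_rate = 1 - (1 - 0.05) ** 20
```

Even with no real effect anywhere, most of these simulated studies have something "significant" to report, which is why deciding the analysis before looking at the results matters.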

8. “Good Enough” Practices vs. Perfection

Greg Wilson and others advocate for “good enough” coding and reproducibility: basic version control, simple tests, and documented code. Fully professionalizing every project might not be realistic under time constraints, but minimal improvements go a long way.

Links
- Good Enough Practices in Scientific Computing (Paper)

9. PyCharm and Other Tools for Scientists

PyCharm can be especially helpful for scientists thanks to features like “Scientific Mode” and robust debugging. Academic institutions often offer free or discounted licenses, removing another barrier to better software practices.

Tools and Links
- PyCharm

10. PsychoPy and DABEST for Experimentation

Martin highlighted specific Python packages for experimental design and data analysis. PsychoPy simplifies running psychology and neuroscience experiments, while DABEST (“data analysis using bootstrap-coupled estimation”) offers intuitive effect-size plots that go beyond p-values.

Tools and Links
- PsychoPy
- DABEST Python

11. Balancing Speed and Transparency

A big concern among scientists is time pressure: learning or improving Python feels time-consuming. Yet, investing in code-based workflows often saves effort in the long run by preventing data rework, confusion, or even full-scale retractions.

12. Small Steps to Overcome the “Time Barrier”

Instead of rewriting everything in Python, scientists can pick one part of their workflow, like data cleaning or generating a plot, and automate it. That single improvement often snowballs into discovering more efficient ways to handle data.

Interesting Quotes and Stories

"It's become my superpower, analyzing data in minutes with Python when others say, 'I'm not even sure if we can answer this question.'" - Martin

"Without programming, I'd have to trust these point-and-click steps and hope no step got overlooked. Coding forces me to be precise." - Martin

"The risk with spreadsheets is it's so easy to hide or embed an error, like missing a minus sign or mixing up zeros, that cost people tens of millions of dollars." - Michael

"Never again do I want to be stuck in my PhD because I didn’t know how to solder a wire. I decided I’d learn enough electronics, then the same with code." - Martin

Key Definitions and Terms

Reproducibility: The principle that another researcher (or future you) can replicate the same computational steps and arrive at the same result.
p-Hacking: Manipulating data or analyses (intentionally or unknowingly) until a statistically significant outcome is found.
Texas Sharpshooter Fallacy: Creating a “bullseye” around data points that already exist, giving the illusion of a pattern or “accuracy.”
Version Control: Software tools (e.g., Git) that track changes in files or code, ensuring transparency and allowing rollbacks and collaboration.
Pull Request: A feature on platforms like GitHub that allows contributors to propose changes, enabling reviews and discussions.

Learning Resources

If you’re interested in growing your Python skills in these areas, consider these courses from Talk Python Training:

Move from Excel to Python with Pandas: Great fit for anyone currently working with spreadsheets and seeking to move into programmatic data analysis.
Up and Running with Git: Perfect for learning the fundamentals of version control to improve collaboration and reproducibility.
Getting started with pytest: Learn how to systematically test and validate your code using Python’s popular test framework.
Python for Absolute Beginners: Ideal if you’re totally new to coding and want a thorough, easy-to-follow start.

Overall Takeaway

By adopting even basic software development practices, like version control, “good enough” testing, and open code, scientists can dramatically reduce errors and increase the impact of their work. Small wins, such as automating a single report or verifying analyses via simulated data, often catalyze broader changes across labs. Ultimately, bridging the gap between scientific research and computer science “best practices” benefits the entire scientific community, and Python is a natural, accessible tool to lead the way.

Links from the show

Neuroscience Research Australia: neura.edu.au
Martin Héroux: researchgate.net

Errors in science: I make them, do you? Part 3: scientificallysound.org

PyPI Packages
DABEST: pypi.org/project/dabest
PsychoPy: pypi.org/project/PsychoPy

Spreadsheet Blunders
12 of the Biggest Spreadsheet Fails: blogs.oracle.com
Common spreadsheet errors: datacarpentry.org

Best Practices for Scientific Computing: journals.plos.org
Good enough practices in scientific computing: journals.plos.org
Full episode RSS feed: talkpython.fm/episodes/rss_full_history

Springboard bootcamp scholarships [code TALKPYTHONTOME]: talkpython.fm/springboard
Episode #252 deep-dive: talkpython.fm/252
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Episode Transcript


00:00 Did you come into Python from a computational science side of things?

00:03 Were you just looking for something better than Excel and MATLAB and you got pulled in

00:07 by all that Python has to offer?

00:09 That's great.

00:10 But following that path often means some formal practices from software development weren't

00:16 part of that journey.

00:17 On this episode, you'll meet Martin Héroux, who does data science in the context of academic

00:21 research.

00:22 He's here to share his best practices and lessons for data scientists of all sorts.

00:27 This is Talk Python To Me, episode 252, recorded January 30th, 2020.

00:32 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the

00:50 ecosystem, and the personalities.

00:51 This is your host, Michael Kennedy.

00:53 Follow me on Twitter where I'm @mkennedy.

00:56 Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter

01:01 via @talkpython.

01:02 This episode is sponsored by Clubhouse and Linode.

01:05 Please check out what they're offering during their segments.

01:08 It really helps support the show.

01:09 Hey everyone, a couple of announcements.

01:11 First, I have a very exciting new course that we just launched this week.

01:16 If you've always wanted to solidify your programming foundations and get into Python, but you've

01:21 never had the chance to take a computer science course, I built a course for you.

01:25 It's called Python for the Absolute Beginner.

01:28 It's for absolute beginners because we start from the very beginning.

01:32 How to install Python?

01:33 What is a variable?

01:34 Why do we need loops?

01:35 And so on.

01:36 And then we go on to build some really fun projects that will teach you 80% of the Python

01:41 language.

01:42 Another challenge people run into when learning programming is having something to work on

01:46 that's not too big, not too small, but just right that'll actually teach them what they're

01:51 studying.

01:51 That's why this course is 50% video lecture and 50% hands-on exercises.

01:57 So if this sounds useful to you or to your colleagues, visit talkpython.fm/beginners to find

02:03 out more.

02:03 Another quick announcement.

02:05 Last week, we had a brand new sponsor, Springboard.

02:08 I'm super happy to have them supporting the show, but I did forget to mention something that

02:12 I think will be really useful for a lot of you.

02:14 If you want to take one of their AI or machine learning online career tracks, they're offering

02:19 20 scholarships worth $500 each exclusively to Talk Python listeners using the code TALKPYTHONTOME,

02:26 all one word, all caps.

02:28 And it only takes 10 minutes to apply.

02:30 So that's the thing I left out.

02:32 Use that code, get the $500 scholarship, TALKPYTHONTOME, all one word.

02:36 Just visit talkpython.fm/springboard if you're interested.

02:40 And another announcement.

02:42 Recently, we've had to trim back our RSS feed length.

02:46 It had all 252 episodes in there, all the show notes and everything.

02:52 And as far as I was concerned, we were going to ship those to you all.

02:56 As long as we could.

02:57 I want to make sure that everyone can go back through the entire catalog and get everything

03:01 that they want.

03:02 And yet, because of the size, I think some of the players, especially in the Apple ecosystem,

03:08 started to go crazy and ship old episodes as if they were new and to not show the latest

03:15 ones.

03:15 And all sorts of weird things started happening.

03:18 So in order to fix this, or let's say rather in order to attempt to fix this, and it seems

03:24 to have worked, we've reduced our RSS feed to only show the last half a year of episodes.

03:29 Again, that's not because we don't want you to get the old ones.

03:32 That's because we want you to get at least the new ones.

03:35 So what a pain.

03:36 But I know that some of you are going through the entire catalog and learning as you go,

03:43 trying to go through the entire history of what we've done, which thank you.

03:46 That kind of blows my mind.

03:47 That's an incredible effort.

03:49 So what I've done is I've also made a separate RSS feed that is not part of the subscribe

03:53 at iTunes or subscribe at Google Play feed because that's where the problems were.

03:57 If the entire RSS feed works for your player, you're welcome to have it.

04:02 I put it up at talkpython.fm/episodes/rss_full_history.

04:07 You can get that in the links for this show in the show notes here.

04:12 And you can also get that on the footer of every single page at talkpython.fm.

04:16 So that's the easiest way to get it.

04:18 Just go to the bottom, look for the RSS full history link and put that link directly into

04:23 your podcast player if you want to get the full history.

04:26 If you're happy with the latest half year, just do nothing and make sure you're subscribed

04:31 as you have been and you'll keep getting the new episodes, hopefully without any hiccups

04:36 this time.

04:37 Martin, welcome to Talk Python To Me.

04:39 Hi, Michael.

04:40 How's it going?

04:40 Hey, it's going really well.

04:41 I'm happy to have you on the show.

04:42 Thank you very much for having me.

04:43 Yeah.

04:44 We're going to talk about the science side of Python, which I think is a really interesting

04:49 place to talk.

04:50 You know, there's that famous, well-known article entitled The Incredible Growth

04:56 of Python done by the data scientists at Stack Overflow.

04:59 And there's this super sharp inflection point around 2012 where the growth of Python, which

05:05 had been pretty stable, it was like a third, fourth tier rank language.

05:10 And then it just took off at that point and went up and up and up.

05:14 And I think my theory is a lot of people came in from data science, but I think that that

05:20 kind of overshadows that there are just a lot of people that started adopting Python for

05:25 this a little bit beyond Excel or the first, I just need to do a little coding or I just

05:30 need to do a little automation.

05:31 And I think people are just coming into this from so many angles.

05:35 And I think that a large portion of that probably is scientists.

05:37 What do you think?

05:37 I totally agree.

05:38 I mean, that's how I came into getting into Python was a transition from MATLAB, for example,

05:44 as a early first language for anybody who's interested in doing just that little bit of

05:48 automation, that little bit of something to go beyond Excel.

05:51 I think Python's perfect.

05:53 And now with the tooling and, for example, Anaconda, those types of things, it just makes

05:57 it so accessible.

05:57 Yeah, absolutely.

05:58 You know, my theory, the reason it's really popular is you don't have to take on all the

06:03 computer science ideas and concepts all at once.

06:06 You don't have to have a static class and a function with accessibility modifiers and

06:10 all that.

06:10 It's just like, well, just write the three lines you need here.

06:13 And then, oh, if you need this idea of functions, then you can learn them.

06:15 If you need classes, you can learn them and so on.

06:17 It's really, really approachable.

06:18 Yeah, no, I totally agree with that.

06:20 For sure.

06:20 And now, before we get too much into our topic, let's start with your story.

06:25 How'd you get into programming in Python?

06:26 Unlike some of your previous guests, I'm not your traditional path person.

06:30 I'm a physiotherapist by training.

06:32 And then after that, went on to do graduate work.

06:35 And it's after my master's, did a PhD.

06:37 And it's during the PhD that I learned two different languages that are very different

06:42 from each other.

06:42 One of them was LabVIEW programming.

06:44 And that was for my data acquisition and interface for data collection.

06:47 And that's, I guess you can call it coding.

06:49 Is it more of a visual type of deal, like an ETL type thing?

06:53 Yeah, and you connect your lines.

06:55 And one person, when they saw the back, not the front end, but the back end of my program

06:58 one said, it looks like a bomb diagram, which, you know, it kind of, I guess it does.

07:01 So it's all this spaghetti behind the scenes.

07:03 The logic is there once you understand coding, but it doesn't teach you how to program properly.

07:08 But then at the same time, my supervisor took on the task of teaching me MATLAB, which by

07:13 the end of my PhD, I was quite competent in.

07:14 And it was a great skill.

07:15 Those two together worked quite well.

07:18 And then I progressed, did my postdoc and continued that, a lot of MATLAB.

07:22 And then it was really, I'd heard of Python from a colleague that went to do some post-grad

07:29 work in San Francisco.

07:30 And I met her at a conference and she says, oh, the folks I work with use Python.

07:34 And so me being just curious, I went and looked at it, but that was probably, you know, 2007,

07:41 2008 or something.

07:42 And I kind of read on it, read up, I saw these tutorials on, you know, going to Python

07:48 from MATLAB.

07:48 But the tooling still wasn't there, the how to install it, how to get the various packages.

07:53 I tried for a couple of weekends and then I just kind of gave up because obviously I was

07:57 proficient at MATLAB and I didn't, I'm not a programmer.

07:59 I just wanted to get my job done, my research done.

08:02 So, right.

08:02 And a lot of the stuff, the data science tools and libraries were just early seeds growing

08:08 at that time, right?

08:09 I think NumPy came out, it was created around that time and probably Matplotlib and Jupyter

08:15 wasn't around yet.

08:16 So there was a lot of support that wasn't there, right?

08:19 Exactly.

08:19 So for somebody who's, you know, in a sense, a non-expert, it was just too daunting.

08:23 And so therefore I went back to MATLAB and then it was just, finally, I'm a Linux person.

08:29 I started that in probably 2010 maybe and I've never looked back.

08:34 I've been using Linux the whole time.

08:35 So I'm quite...

08:35 What distribution are you on these days?

08:37 Linux Mint or Ubuntu kind of depends on pretty much how I'm feeling on the day that I install,

08:42 but it's between those two usually.

08:44 Okay, cool.

08:45 It's just what I've run on my computers.

08:46 I put it on my mom's computer so I can help her out from a distance because it's kind of

08:50 easy.

08:50 You can use older computers and just use it in the lab or for some of the experiments we

08:54 run.

08:54 We just need basic computers.

08:56 So I just thought it was a nice way to reuse some of the equipment.

08:59 And then with this open source kind of mind and that I had, Python started to be a little

09:05 bit more appealing and then I looked a little, again, revisited it, partly because I was looking

09:09 at the courses for Software Carpentry, which is this thing that teaches scientists that have

09:14 no computer skills, some of the basic skills and usually around either R or Python and you

09:20 get a bit of shell and you get some Git.

09:22 And so that really made me look again into Python.

09:25 And from there, that's all I've been using for the last maybe four or five years now.

09:29 Beautiful.

09:30 Yeah, the Software Carpentry stuff is pretty interesting.

09:33 I've had the folks behind that on the show before.

09:36 It's not a lot, right?

09:37 It's not a lot to get you started.

09:38 It's a couple of days, a little bit of, it's not even all programming.

09:42 It's here's how to use a shell and here's what source control is and so on.

09:45 But at least it plants the seed like, hey, these are things you should pay attention to.

09:49 Yeah, it's really, here's the landscape of a few things and it's a teaser.

09:54 But if you have some labs or some institutions, it's the culture.

09:58 They use R for example, if they do a lot of genomics, for example, everybody's using R.

10:03 And then it's okay because then you leave the course, the introduction, then everybody around

10:07 you can help.

10:07 But at our institute, I did run some Software Carpentry course.

10:11 And it's hard, the uptake, because we're so diverse in what we do because there's people

10:16 who do genomics at where I work.

10:18 And then there's other people who do really just kind of epidemiology type things.

10:22 And so the range is so different that the needs are different.

10:25 So it's a good format, but it doesn't apply for all.

10:29 That's kind of what I found along the way.

10:30 Yeah, for sure.

10:31 It sounds to me like reading some of the stuff you've written and looking through some of

10:36 the stuff you put out there that we've talked to data science folks and even scientific computation

10:44 folks who also have maybe some programmers around them or some kind of software stuff going

10:52 on as well, where most of their work revolves around code.

10:56 But it sounds to me like what you guys are doing, it's a little bit more like you're

11:01 scientists out there on your own.

11:02 And yeah, here's some stuff that you can learn and use.

11:05 But when you get stuck, you know, you're kind of alone on the internet sort of.

11:09 Is that right?

11:10 Pretty much it is that while it'd be nice if we had a big amount of funding to have dedicated

11:15 people to program and to automate some of our things, it's just not the case that the funding

11:20 is not all that great pretty much around the world.

11:23 But so if you're working in one of those places that has that, it's great.

11:26 But most of everybody else, we just have to, you know, figure it out ourselves.

11:31 And that's kind of how I learned is just by either I could continue to do this pointing

11:35 and clicking and realizing the mistakes I was making and not able to reproduce anything and

11:40 those types of things or teaching yourself.

11:42 But that's not easy.

11:43 It's a bit overwhelming, I think.

11:44 Whereas rather than having not enough resources, I think it can be overwhelming because there's

11:48 just so many resources for people.

11:50 And unfortunately, we're strapped for time.

11:52 And so that's a big stress on people at this more senior level.

11:55 They just feel they don't have the time.

11:56 And the students, well, they have a little bit more time.

11:59 But if their supervisors and the people above them aren't necessarily coding, well, then they

12:04 don't feel the need as much to learn it because obviously other people got by without it.

12:07 And so it's to try and for myself, it's definitely, as you have said on your show a few times,

12:12 it's my superpower, really.

12:14 It's things that other people just can't even fathom could be answered.

12:17 I'll just take the data and do a few things in Python and then get them the answer pretty

12:21 quickly.

12:22 Yeah.

12:22 They're like, I don't really even know if we can answer this question.

12:25 You're like, give me pandas and five lines of code and we'll see what we can do with this.

12:29 Right.

12:30 Exactly.

12:30 Yeah.

12:31 And part of it is actually not even just the fact that maybe they don't need to do it

12:35 or learn how to do it, but just know that it's possible.

12:38 Yeah.

12:38 And value that in terms of valuing it, I think that's what's interesting is that regardless

12:44 of what type of science you're doing, in the end, it's all based around computers.

12:49 It almost doesn't matter what field you're in.

12:51 You're going to be most of the time dealing with data.

12:53 And so to be able to do that in a repeatable, efficient manner, as you always say, computers

12:59 are really good at that, the mundane, repetitive tasks.

13:01 Yeah.

13:01 So why not automate it?

13:03 People are noticing.

13:05 And I think there's a push that even from the senior people, even though they might

13:08 not know, I think they see that, okay, that's, this is where it's going.

13:11 Yeah.

13:11 That's a little bit challenging when the incentives don't necessarily line up.

13:15 Right.

13:16 The, the experienced folks, they're busy.

13:19 They've got stuff to publish and research projects.

13:21 They'd like to learn, but at the same time, you know, they've got a job, they've got a family

13:26 or whatever.

13:27 Right.

13:27 They've got classes to teach or people to mentor in their field.

13:31 So that makes it really challenging.

13:33 I think as well.

13:34 Yeah, no, it's, it's, the pressures are very high and this whole publish or perish thing,

13:39 although some people throw it around as a joke, it really is true, unfortunately.

13:42 And it's in that model where, you know, rather than focusing on good science, that's reproducible

13:49 that, you know, that's what you expect.

13:51 When I was a kid growing up, these scientists are just, you know, these amazing people that

13:55 wear white lab coats most of the time you would think, and that are just extremely careful.

14:00 And I think we're just people trying to get by and, you know, the incentives make it that

14:05 we do have to do a lot of work.

14:07 And a lot of times we might cut some corners.

14:09 It's just the way it is.

14:10 We just have to get a lot out there.

14:11 And there's a lot of pressure to do that rather than take a bit of time and either do it carefully

14:16 or learn some new ways to make it last a little bit longer in terms of I can pass on

14:20 my data with the code and somebody else can reuse it.

14:22 Later on, my future self, I can reuse it or convince myself or convince others that, hey,

14:28 this was actually the real answer kind of thing.

14:30 Yeah.

14:30 I'm having a hard time even coming up with an example where you're doing research and there's

14:37 not a lot of data to be processed.

14:39 You know what I mean?

14:40 Yeah.

14:41 Like, what field is it possible?

14:43 Like, well, we have seven things.

14:44 We're going to study these seven.

14:46 I'll just put them on a line or like almost anywhere that people are doing research now,

14:50 there's so much data available that you can collect in whatever way, whether it be economics,

14:55 you can just go load up all the data about all the economies and all the purchasing and habits

15:00 and whatnot.

15:01 Or, you know, if you're doing neurology, you can, the amount of EEG data or whatever is ridiculous.

15:08 Right.

15:08 So it seems like programming is a really important skill just to almost be functional these days.

15:13 Well, definitely.

15:14 And I think that in this, for myself, it actually started with, first, it was a bit of electronics.

15:19 And I didn't know anything about that, obviously, being a physiotherapist.

15:22 And I ran into trouble.

15:23 And then I almost stalled my PhD for, was it six or eight months just because of a simple

15:28 wire, a couple of wires that needed to be soldered together and connected.

15:32 And nobody where I was from knew anything about that.

15:35 And it took a long time just to find the right person.

15:37 And I was like, never again am I going to have to be not self-sufficient.

15:42 So I taught myself electronics.

15:43 And then I think with programming, it was the same thing is that you can find the right

15:47 people.

15:47 But either it's going to take time and money.

15:50 And it's not always if they're not exactly in your field, they'll do their best.

15:54 And then it's going to be multiple iterations to get to the final what you would like.

15:57 And you don't even understand how it got there.

15:59 So I think this idea of learning just enough to get you by really is quite valuable.

16:03 And as you go, the learning curve, although it's Python and it's obviously easier than other

16:07 languages, if it's not your thing, it's a big learning curve.

16:11 The ideas of programming at the same time, you don't want to just learn programming.

16:14 You want your results.

16:15 That's number one.

16:16 So you're juggling all these things.

16:18 But once you get over that curve, the big learning is done.

16:21 And then it's just adding on for each new project.

16:23 You might learn a new technique.

16:24 You might learn, oh, now I'm going to use databases or now I'm going to learn Git and

16:28 now version control my code and things like that that come along.

16:31 So it's how to do it progressively, not get intimidated and not just say, oh, I'm going

16:35 to go back to my point and click programs because at least I can get the results quickly.

16:38 Exactly.

16:39 Or Excel.

16:40 Yeah.

16:41 Excel.

16:41 I try not to use that word and it's not my, I have a fear.

16:45 Yeah.

16:47 I have an allergy to be honest.

16:48 I walk by the people who have those screens open and I, oh, I just, yeah.

16:51 Yeah.

16:52 I don't, I personally have this thing about Excel and science.

16:54 It just really, for more than just a simple, you know, keeping a few numbers or doing some,

16:58 you know, basic adding or addition or stuff like that.

17:01 It's fine.

17:01 But it really, I mean, it just doesn't have a place in my opinion nowadays.

17:06 So, yeah.

17:08 Well, I think the problem is like, it's fine for the simple case.

17:11 Like I use Excel for accounting and the stuff that we do.

17:15 And it's, it's great.

17:16 It's got to take a few numbers.

17:17 It's got to add this column and it's got to divide by that and put, you know, 20% over

17:21 there.

17:21 I don't know what, something like that.

17:22 Right.

17:22 That's fine.

17:23 We did a whole episode on Excel with Chris Moffitt, episode 200, Escaping Excel Hell with

17:29 Python and pandas.

17:31 And that was a lot of fun.

17:32 One of the things that dawned on me that really makes these worksheets in Excel or any, you

17:38 know, could be Google sheets.

17:39 It's really no, no difference.

17:41 That makes it really tricky is it's kind of full of go-tos.

17:46 You know, when you, when you have a formula that's down here in like A32 and then it says

17:53 add stuff over there and then make this decision and then take that from over here.

17:57 And then there's a formula over there.

17:58 It's like, there's no way to look at it and tell what order it flows in.

18:03 Right.

18:03 It's just go here and go there, then go back over there, then take some of this.

18:07 And yeah, it's just, it seems like it takes some of the worst practices when you try to

18:11 push it too far.

18:13 Yeah.

18:13 I think if it fits on a page and, and you've got a few pages, I think, you know, and everything

18:18 is self-explanatory.

18:20 I'm okay with that.

18:21 And, you know, even though I've done that for a survey that I published with one of the papers

18:25 that I wrote and it was the best way to present it.

18:27 But when you've got a lot of those embedded formulas, I, as a person opening it, have no

18:33 idea what's the flow of all of this and the bigger they get.

18:37 And that's, you know, there's research that's been done and people who've analyzed Excel spreadsheets

18:42 is that for every hundred things that you do in Excel, there's going to be approximately

18:46 five errors.

18:47 And the more complex it is, the more likely you're going to make these mistakes.

18:51 And I think the hardest ones are, for example, the one that came across in some of the data

18:55 that I was involved in was this: for example, take the average, because

18:59 we did three responses from somebody, take the average of the three.

19:02 But what happens is there's a missing data point.

19:04 Well, just by default, from what we did, Excel puts a zero.

19:07 So now you're averaging two numbers with a zero.

19:10 And it was just, but you had to search through it.

19:13 And it was like in, you know, page number eight, in, you know, cell II52.

19:17 And they're very difficult mistakes to find.

19:19 Whereas coding, if it's done, you know, if you choose your variable names well, and you

19:25 do a bit of documentation, it can almost read like a story.

19:28 And you can, even somebody who can't code necessarily can at least follow and kind of, or you could

19:33 just sit down with them and just tell them the story of the code.

19:36 And they might be able to say, that doesn't make any sense.

19:38 Yeah.

19:38 Wait, you're doing this wrong, right?

19:40 Whereas Excel is much more nebulous.

19:42 Yeah.

19:42 It's very vague.
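
The averaging problem described above (three responses, one of them missing, with the blank silently counted as zero) can be sketched in a few lines of Python. This is only an illustration, not the analysis from the study: the response values and both function names are made up, and `None` is assumed as the marker for a missing response.

```python
from statistics import mean

# Three responses from one hypothetical participant; the second is missing.
responses = [12.0, None, 15.0]


def mean_missing_as_zero(values):
    """Reproduce the spreadsheet mistake: a blank is silently counted as 0."""
    return mean(0.0 if v is None else v for v in values)


def mean_skipping_missing(values):
    """Average only the responses that were actually recorded."""
    recorded = [v for v in values if v is not None]
    if not recorded:
        raise ValueError("no recorded responses to average")
    return mean(recorded)


wrong = mean_missing_as_zero(responses)   # (12 + 0 + 15) / 3
right = mean_skipping_missing(responses)  # (12 + 15) / 2
print(f"missing treated as zero: {wrong}")
print(f"missing skipped:         {right}")
```

Because the handling of missing values is spelled out in a named function, a reader can see and question the choice, which is the "reads like a story" point made in the conversation.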

19:46 This portion of Talk Python To Me is sponsored by Clubhouse.

19:49 Clubhouse is a fast and enjoyable project management platform that breaks down silos and brings

19:54 teams together to ship value, not features.

19:56 Great teams choose Clubhouse because they get flexible workflows where they can easily customize

20:01 workflow state for teams or projects of any size.

20:04 Advanced filtering, quickly filtering by project or team to see how everything is progressing.

20:10 And effective sprint planning, setting their weekly priorities with iterations and then

20:14 letting Clubhouse run the schedule.

20:16 All of the core features are completely free for teams with up to 10 users.

20:20 And as Talk Python listeners, you'll get two free months on any paid plan with unlimited users

20:26 and access to the premium features.

20:27 So get started today.

20:29 Just click the Clubhouse link in your podcast player show notes or on the episode page.

20:33 So you wrote a blog post called "Errors in science: I make them, do you?"

20:39 Part one, two, three.

20:40 And I'll link to part three.

20:43 And one of the things I found amusing from there was you actually linked to an article which

20:48 talks about the 12 of the biggest spreadsheet fails in history.

20:51 Do you remember that article?

20:54 Yes.

20:54 It's pretty scary.

20:56 And I mean, in our case, well, I guess some of these mistakes, you know, in science they

21:00 lead to retraction; the biggest problem would be that.

21:03 And that's kind of a big embarrassment and something we don't want to do.

21:06 Right.

21:06 And other people can, right, base their conclusions on a paper that then has been retracted.

21:11 So it's kind of a bit of a house of cards as well.

21:13 Exactly.

21:13 You know, and unfortunately, some people, even though it's retracted, some people continue

21:17 citing the original one, which is, you know, you're not supposed to do, but people

21:20 haven't realized it's been retracted.

21:21 So it's a bit of an issue.

21:22 But some of the other ones that, you know, if you do this in the financial industry,

21:26 that there are some companies that, you know, very simple Excel mistakes have led to, you

21:30 know, millions and millions of dollars being lost.

21:33 And so, yeah, so that article just kind of highlights some of the worst.

21:37 And the European Union actually has a whole kind of group that's working on this and dealing

21:43 with how can we improve these practices with regards to spreadsheets because the mistakes

21:48 are so costly.

21:49 Right.

21:49 I'll just read just a couple because I think they're kind of amusing.

21:53 TransAlta, a Canadian company, made a cut-and-paste error in their spreadsheet and it cost

21:58 them $24 million.

22:03 They bought some U.S. power transmission hedge contracts at the wrong rate.

22:08 Fidelity, their Magellan fund, which is like an investment mutual fund type fund thing.

22:14 They had to cancel a $4 per share dividend because they had a missing minus sign.

22:20 So they thought they made a profit instead of a loss.

22:22 MI5, the British spy agency, actually bugged over a thousand wrong phone numbers.

22:29 There's just like all these weird errors.

22:32 Anyways, a really fun read for people that get, you know, sucked into that world.

22:37 So there's the drag-and-drop LabVIEW-type stuff, which is not that great.

22:41 There's just not very much programming skill at all.

22:44 Maybe a little MATLAB.

22:45 There's trying to over-leverage Excel.

22:48 A lot of these could be fixed by, you know, a little bit of Python, a little bit of pandas

22:53 or, you know, whatever library it is that works with the type of data that you're using.

22:57 Right.

22:58 Yeah.

22:58 And I think with Python, it's the fact that, and you've mentioned this yourself, it's just

23:03 the Swiss army knife.

23:04 It's that MATLAB has a license and its application is very, you know, engineering, math and science

23:10 focused.

23:11 Whereas Python, I had a master's student who came and finished with us.

23:14 And I kind of, his first study, he programmed with me.

23:17 Second study, he did most of it.

23:20 And I just looked over his shoulder and made sure everything was okay.

23:22 But he went on now and he's working.

23:24 You know, in a sense, he's got a new superpower, which is that it's universal.

23:28 He can use it for such a variety of things.

23:30 That's not specifically for science.

23:32 So while it's obviously great, the learning curve isn't that steep if you've got a bit

23:37 of time and it's just so diverse rather than some of the more specialized languages.

23:42 So I think as an introduction, Python, I mean, it really, it's so simple, easy to read.

23:47 And it's a new skill.

23:48 That's kind of how I sold it to the institute where I work at here in Australia is they were

23:53 looking for ideas for almost like a vision for 2022 or what.

23:58 And I just said, well, wouldn't it be great if we could advertise that every student that

24:02 comes and does a PhD with us will have basic skills in at least one programming language

24:07 when they leave.

24:08 And they thought that was a great idea.

24:09 So that's how I pitched it that, you know, in this day and age, it's got to be part of the

24:13 education, not simply the, you know, sitting at the workbench or working with spreadsheets.

24:17 Yeah.

24:18 Something you just have to pick up in your spare time as part of your research project

24:22 or whatever.

24:22 Be better if it was more, right?

24:24 Because we have tons of spare time when we're doing our PhDs.

24:27 It's evenings and weekends.

24:29 And it's the few of us who are, you know, a bit silly enough to do it, bit hermits.

24:33 And we just do it on our own because we just feel the need to do it.

24:36 But yeah, a lot of other people, you know, sensibly don't do it.

24:38 And that's fine.

24:39 But I think in the long term, if it becomes more acceptable and it's actually presented

24:43 right from the beginning that, hey, this is how it is.

24:46 You're going to learn a bit of programming, you're going to learn a bit of this.

24:47 It just becomes part of the learning process rather than when you just, it would be nice

24:52 if you learned it, but there's no structure.

24:54 We're not going to teach you and I'm not going to point you to any resources.

24:57 That's difficult for a lot of learners.

24:58 Yeah.

24:59 Well, I think it's a great service to offer that to your students, right?

25:03 To say, you're going to learn whatever it is you're here to study and get your master's

25:08 degree in or whatever it is.

25:10 And also you're going to have this programming skill in the context of that, because I'm sure

25:15 that if you go out there and apply for a job and there's 10 people, probably more, but let's

25:21 say there's 10 people apply.

25:22 Two of them also have good programming skills that are relevant to that in addition.

25:26 And all things being equal, like that's down to two candidates.

25:30 You know what I mean?

25:30 Yeah.

25:31 So it's really great.

25:32 Now, one of the things you did that I'd like to just chat a little bit about, I think is

25:37 interesting, is you did a survey around scientific computing at your institute, right?

25:44 Yeah.

25:44 So yeah.

25:44 So let's talk about that a little bit.

25:46 What was the survey trying to get at and what were some of the results?

25:49 Well, after I pitched this idea of training, obviously I have some ideas.

25:53 I'd done the Software Carpentry teacher training.

25:56 So I had my own opinion of how it all should work.

25:59 But again, who am I to say how everything should run and what are the needs of the people?

26:03 Especially because where I work, there's people who work on cells.

26:06 Some people work on epidemiology.

26:08 Some people work on human research.

26:09 And the fields are as different as schizophrenia, falls and balance, people with vestibular issues.

26:15 We study the gamut.

26:15 And so rather than impose what I thought people would need, I just said, hey, you know, we've

26:20 got these easy survey online things to do nowadays.

26:23 So how about I just create a little survey, get some information because I'm a scientist.

26:27 I like data.

26:27 So I just said, let me get some data so I can make informed decisions.

26:30 And also it allows me to track.

26:32 If I were to continue this over time, I'd be able to see how people are changing.

26:36 So it was a variety of questions based on, you know, what field are you in?

26:39 What's your level of training or education?

26:41 And currently, what are your current practices?

26:44 And then looking at in terms of what would you want to learn?

26:47 How would you want to learn that?

26:48 And those types of questions.

26:49 Right.

26:50 So I sent out the survey.

26:51 Our institute, I can't really say the number right now.

26:54 We're probably now above 200 people, for sure.

26:57 I think depends on how many of the students are present, but it's a biggish institute, but

27:01 not all that big.

27:01 But we had 80 people respond.

27:03 And it was nice because senior scientists replied, senior postdocs, postdocs, students, and a

27:08 few staff as well.

27:09 So the responses kind of reflect a bit of everybody.

27:11 And so that was kind of nice.

27:14 And yeah, it's good.

27:15 As you might expect, we're not, you know, it's not, it's not all that surprising that

27:18 pretty much everybody analyzes data with computers.

27:20 There's some that don't.

27:22 And I would have to say, well, it's possibly some of the staff and also some of the senior

27:26 people.

27:27 It's kind of why I don't want to become the most senior scientist, because at some point

27:30 you almost stop looking at the data.

27:31 You entrust the pyramid of senior postdocs and postdocs and students to handle the day-to-day

27:38 data.

27:38 And so they kind of don't do that.

27:40 But most people, as part of their job at some point, manage

27:44 some form of data.

27:45 So it is a useful, it's just become, this is what we do.

27:48 We manage with data.

27:49 And most people want some form of training or knowledge.

27:53 And I guess one of the unsurprising, as we were talking before about Excel, is that most

27:58 people, 80% or so, do stuff with Excel.

28:02 And that's okay.

28:03 But what's surprising to me, and I guess because Excel is even easier to pick up, is the fact

28:08 that only 20% had actual training, had done a course

28:12 of some sort with regards to Excel.

28:14 Yeah.

28:15 I think people think Excel is easy.

28:16 In some sense, it is easy.

28:18 But it can be a really advanced tool.

28:21 When I was in college, I took a course on basically like spreadsheets.

28:26 And man, back in the day, it was like Lotus 1-2-3, and all sorts of random things that aren't

28:30 really around anymore.

28:32 But I remember learning, I felt like I can use Excel, I can load data, I can do formulas,

28:38 whatever.

28:38 Like, there's a lot to learn about stuff that you can do in there.

28:41 So that's, you know, that only one out of five get any real instruction on it.

28:46 That's surprising.

28:47 Yeah.

28:47 And, you know, I have a colleague who went off and kind of did an online course for Excel.

28:52 And, you know, he came back and everybody was really impressed.

28:54 And just, he was showing everybody new things.

28:56 And so, done properly, as he was doing it, I could see that, okay, it's not as bad as I

29:02 used to picture it.

29:03 But unfortunately, he's one out of, you know, everybody I've ever come across.

29:07 And so, really, the majority of people just think it's point, click, add some

29:11 numbers, try a few things.

29:12 But, yeah, there's not much education behind it.

29:15 And I think what's interesting also in the survey, I asked about other tools.

29:19 So, for example, you know, we do a lot of data management or statistics.

29:22 And there's obviously the typical point and click programs.

29:26 And a lot of those have equivalents,

29:28 you know, that you could do in programming. But many more people were using point

29:32 and click programs

29:32 than people who program.

29:34 Yet the ones that programmed always had more training.

29:37 Just because, obviously, to learn it, they had to do some form of training.

29:40 Versus the point and click.

29:41 The problem is that it's deceptively simple.

29:44 And I think statistics is one of those areas that, you know, I think a lot of people have

29:48 allergies to for some reason in science.

29:50 I kind of like them myself.

29:51 I think they're important.

29:53 And you got to understand them.

29:55 But with this point and click stuff, a lot of people just kind of click some buttons,

29:59 think it's okay.

30:00 And then this big output comes out with sometimes some figures.

30:03 A lot of times these big tables.

30:04 And they're difficult to interpret.

30:07 And yet some people just say, oh, where's my p-value in there?

30:09 And then they're just like, great, this one worked.

30:11 And if it didn't, well, you know, maybe click a few different buttons and see if the answer

30:15 changes.

30:15 Which is obviously not great science, but it does happen.

30:18 Whereas with programming, it's, you kind of have to know what you're doing from the get

30:22 go.

30:22 And so there's a lot less chance of that just kind of what people would call p-hacking.

30:26 It's just searching for and trying permutations of it until you get something that, oh, that

30:31 looks better or what I was expecting.

30:32 Yeah.

30:33 Well, if you're going to wire stuff together and push buttons, like it's almost always going

30:36 to give you an output, right?

30:37 So yeah, well, there's the answer, right?

30:40 Well, in programming, you kind of got to really think through it and line it up.

30:43 Or it's not going to give you an output, it's going to give you an exception, you know,

30:47 a traceback or something, right?

30:48 Yeah.

30:49 And I think the value of programming, for example, in those contexts, and it's one of those things

30:54 I'd never even thought of.

30:55 Now I do it almost all the time is that if I'm using a new statistic or I'm using a new

31:00 program to analyze my data, why risk putting in my own data that I spent all this time putting

31:06 in and hoping that I'm doing it right to get an answer rather than generate, which is quite

31:10 simple nowadays, generate some data that I know the answer to.

31:14 So, you know, simulate the data and put in some artifacts or put in a certain trend.

31:18 So it will give me a statistical result, run it through.

31:20 And is it as expected?

31:22 If so, great.

31:23 Then I can apply that to my actual hard earned data.

31:26 And I can kind of trust the data and the results a bit more than I otherwise would.
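
The idea of checking an analysis on data whose answer is already known can be sketched roughly as follows. Everything here is an assumption made for illustration: the linear trend, the noise level, the seed, and the hand-written least-squares fit stand in for whatever statistic or program is actually being tried out.

```python
import random

TRUE_SLOPE = 0.5      # trend deliberately built into the fake data
TRUE_INTERCEPT = 2.0
NOISE_SD = 1.0        # standard deviation of the added noise


def simulate(n=200, seed=42):
    """Generate x/y pairs that follow a known straight line plus noise."""
    rng = random.Random(seed)
    xs = [i * 0.1 for i in range(n)]
    ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0.0, NOISE_SD) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs, ys = simulate()
slope, intercept = fit_line(xs, ys)
print(f"known slope {TRUE_SLOPE}, recovered {slope:.3f}")
print(f"known intercept {TRUE_INTERCEPT}, recovered {intercept:.3f}")

# If the analysis cannot recover a trend we planted ourselves,
# it should not be trusted on the real, hard-earned data.
assert abs(slope - TRUE_SLOPE) < 0.1, "analysis does not recover the known trend"
```

Only once the recovered values match the planted ones would the same `fit_line` step be pointed at real measurements.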

31:31 I'm sure there's people out there who work with statisticians on a day to day basis.

31:34 There's other people who themselves are quite competent, but my experience has been a lot

31:39 of people, a lot of labs, not everybody, but a lot of us kind of just do that.

31:43 We just collect data, put it in the program, fiddle about with our own data, and in the

31:47 end, see what kind of comes out that makes sense.

31:49 And hoping in the end, it's a hope that, okay, I think I did it right.

31:52 Yeah.

31:52 But it's pretty risky in terms of, you know, this is science and we're going to publish

31:56 it, make it publicly available.

31:57 So yeah, it was a bit of an eye opener in terms of, but what was nice is that my survey

32:02 highlighted that, okay, this is what people are currently doing, but what do people want

32:07 to do?

32:07 Are they interested in it?

32:08 And yeah, there was definitely a big interest in learning these skills.

32:12 One of the questions was, how valuable do you think it is to have computer coding as

32:17 a skill currently?

32:18 And the average was, it's important.

32:21 And then the next question was, how is it for the future, for a young investigator

32:26 in their future career?

32:28 How important is it?

32:29 And it was very important.

32:30 So it highlights the fact that there's this shift to even the fields that may historically

32:34 not have incorporated coding into their education of their students are now seeing that, yeah,

32:39 well, maybe I don't do it as a senior or a postdoc.

32:42 But I think that for the future, even those people are acknowledging that it's valuable.

32:46 But as you might expect, one of my questions was trying to understand what would they want?

32:51 But most importantly, in a sense, what are the biggest problems?

32:54 Why aren't you coding right now?

32:55 Why aren't you learning?

32:56 And what are the obstacles?

32:57 And some of them are addressable.

32:59 Like lack of institutional resources.

33:01 Like you could have more of that.

33:03 Or you could have more examples.

33:04 But one of the biggest ones in that area that you got from your survey was lack of time.

33:10 And it's hard to help people make more time.

33:12 Exactly.

33:13 I wonder whether that's a real problem or it's a perception.

33:17 I think that for people who, like myself, when I first got introduced to coding, it's the unknown.

33:23 And you think, my God, that's going to take me, you know, a year of my time.

33:27 Right.

33:27 These are like the wizards of Silicon Valley that like create this magical digital world around us.

33:33 And it's like engineering plus with all the math and everything.

33:37 It seems so hard from the outside.

33:40 And then you do it.

33:41 You're like, oh, that's all?

33:42 Well, that was not so hard.

33:43 Right?

33:43 I just called this library.

33:44 Exactly.

33:45 With all the new tools, it makes it much simpler.

33:48 My one piece of advice to students or people who come and ask me about these things is just pick one task in your next study.

33:54 Don't do the whole thing.

33:55 Just pick one thing.

33:57 Whatever you, if it's about the acquisition of the data or if it's just making your figures.

34:02 Or cleaning up the data.

34:03 Yeah.

34:04 Just pick one thing that you can automate.

34:06 And if you want, I can help you along.

34:09 Just do one step.

34:10 And it makes it so different than if you're learning from a book that's giving you examples about, I don't know, leeches or the economy or some other problem that doesn't relate to me.

34:19 Whereas this one, I'm actually building, using my learning to actually help me out.

34:24 You're going to find a solution because in a sense you have to.

34:27 But it's not daunting because you're not going out and learning everything that you need.

34:31 I'm not setting up a web page to present my data or to do a survey or whatever it is.

34:35 Just pick a simple, biteable chunk and then on your next study, just add to that.

34:40 And then you'll see.

34:41 That's really good advice.

34:42 Yeah.

34:42 I think that's why the lack of time, I think, is they think they have to just stop and learn.

34:47 And that's just not feasible.

34:49 Right.

34:49 I got to redo the way I do everything.

34:51 We're going to rewrite this whole thing.

34:52 Yeah.

34:53 Yeah, exactly.

34:53 Like, there's a bunch of little steps.

34:56 You know, I would say, what is the step that makes you the least want to work on this project?

35:01 You know what I mean?

35:02 Like, what is the most not fun thing you do?

35:05 Could you automate that?

35:07 Yeah.

35:07 And I guess the other one I've done for myself, actually, one of the big pushes to do Python,

35:12 especially for data acquisition, for some of the research I do, is a lot of taking responses

35:16 from other people.

35:17 And we do these perceptual illusions to know where your body is in space.

35:21 And we always have to get the answer.

35:24 And for me, it was after we were done these studies, the first few times I participated

35:28 in them and helped out, is once you look at the data and there's an outlier or a few points

35:33 that you're just like, that can't be right.

35:35 But then what do you do?

35:36 You go back and it's a piece of paper.

35:38 And then it's a number that I wrote.

35:39 And then I'm like, well, did I write it wrong?

35:41 Or did the person, you know, did I not understand what they were saying?

35:43 Or their accent might have been different.

35:45 And then you try to follow the chain and you're always, and I hated that because I had no real

35:49 reason to exclude the data based on a known error.

35:53 And that just frustrated me.

35:54 And it happened enough that I kind of built this system around testing these things that,

35:59 you know, it's all computer.

36:01 It's built around, well, Python with a bit of Pygame on top for the interface.

36:06 And in that way, now the data goes straight into the computer.

36:09 Each response is verified with the person who's given me the answer.

36:13 It shows it to them and says, is this actually what you answered?

36:15 And it's just saved so much trouble.

36:18 And now our study we just did had zero data that was even in question.
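
The confirm-each-response step described above could look something like the sketch below. This is not the Pygame-based system from the lab; the function name, the `ask` and `confirm` callables, and the retry limit are all hypothetical, and the scripted answers only stand in for a real participant and interface.

```python
def collect_confirmed_response(ask, confirm, max_attempts=3):
    """Ask for a numeric response and store it only once it is confirmed.

    ask:     callable returning the raw answer as text (hypothetical interface)
    confirm: callable shown the parsed value, returning True if it is correct
    """
    for _ in range(max_attempts):
        raw = ask()
        try:
            value = float(raw)
        except ValueError:
            continue  # not a number at all, so ask again
        if confirm(value):
            return value  # "is this actually what you answered?" -> yes
    raise RuntimeError("response could not be confirmed")


# Scripted stand-ins for a participant: a typo, a misheard value, then the real one.
answers = iter(["12.5x", "21.5", "12.5"])
confirmations = iter([False, True])

value = collect_confirmed_response(
    ask=lambda: next(answers),
    confirm=lambda v: next(confirmations),
)
print(f"stored response: {value}")
```

Because the value is parsed and confirmed at the moment it is given, there is no later step where a handwritten number has to be deciphered and retyped.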

36:22 And so we've gone from a lot of stress and possibly a few data points that were wrong to now, where I live in peace after the study is done.

36:30 And, you know, we get good, surprisingly, as experimenters.

36:33 If you do it over and over, you kind of, it's less stressful to run the experiments.

36:36 I can kind of think on my feet.

36:38 But when it's the students and it's their first time running an experiment, they're pretty stressed out.

36:42 There's a lot to think about.

36:44 And you could see how easily they could forget to, you know, to check something or to write something down that's not correct.

36:49 And then they go and, you know, copy paste that into the computer.

36:52 There's so many levels of errors that can happen there.

36:55 So for myself, part of it was early on the lowest hanging fruit.

36:58 But that one was one I had to tackle because I was just so insecure.

37:02 I didn't feel good about the fact that I just don't know.

37:05 I mean, I think, let's say, 95% of my data is right.

37:08 But can I get it to close to 100?

37:10 And I think we're nearing that.

37:13 And I think it was a big undertaking.

37:15 But now the system's there and we've used it multiple times over.

37:18 And, yeah, so you have to pick what's most important or, as you said, what's the most annoying job that you have to do.

37:24 And data entry for us was one of the huge ones.

37:26 Nobody wants to do that.

37:27 That's boring.

37:28 And so, yeah, just do it on the fly.

37:30 And then the computer does it perfectly every time as long as you, you know, again, test your code and make sure it's doing what you expect, which is a whole other story that we might get into.

37:39 Yeah, we should get into that as well.

37:41 But at least, you know, it's as long as the software is working right, it's the foundation, right?

37:46 Your data is the foundation of your research, which is the foundation of your papers and your work.

37:53 And you definitely want that to be right.

37:55 So if you can automate that.

37:56 This portion of Talk Python To Me is brought to you by Linode.

38:01 Whether you're working on a personal project or managing your enterprise's infrastructure, Linode has the pricing, support, and scale that you need to take your project to the next level.

38:10 With 11 data centers worldwide, including their newest data center in Sydney, Australia, enterprise-grade hardware, S3-compatible storage, and the next-generation network, Linode delivers the performance that you expect at a price that you don't.

38:25 Get started on Linode today with a $20 credit and you get access to native SSD storage, a 40-gigabit network, industry-leading processors, their revamped cloud manager at cloud.linode.com, root access to your server, along with their newest API and a Python CLI.

38:41 Just visit talkpython.fm/Linode when creating a new Linode account and you'll automatically get $20 credit for your next project.

38:49 Oh, and one last thing.

38:50 They're hiring.

38:51 Go to linode.com/careers to find out more.

38:55 Let them know that we sent you.

38:55 I do want to come back and ask you a question about this importance of learning computer programming for the future grad students and whatnot.

39:05 One of the things I think is interesting here is probably that means people are seeing more and more data that they need to work with.

39:14 That's probably one angle I would suspect that's pretty straightforward.

39:17 I wonder another one about just almost like sort of grant money angles as well.

39:25 Because when you're a new grad student or postdoc, maybe you don't have a grant yet.

39:29 Maybe you don't have a lot of resources.

39:31 And with things like Python and the SciPy space and whatnot, you know, it doesn't matter if you have money or not.

39:38 You have some of the best computational tools in the world.

39:41 The ones that people used to find black holes and work on the Higgs boson and whatnot.

39:46 And you get those tools for free as well.

39:49 What do you think about that angle?

39:50 Yeah, well, I mean, for us, that's actually the truth: there's things that I wouldn't have been able to do.

39:56 Some of it is going data mining on our own data that we've accumulated.

40:00 And I'm about to do the same thing with a colleague of mine that I write a blog with.

40:04 We're going to go data mine three different data sets that have just been laying there that have a piece of information that would be really useful together, especially.

40:12 That, you know, doing it by hand or in Excel is, I think, an insurmountable task.

40:17 But programming makes that possible.
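
As a rough sketch of what combining several existing data sets can look like in code, the example below joins three small tables on a shared participant ID. The data sets, their names, and every value are invented for illustration; the real data sets mentioned in the conversation are not described in enough detail to reproduce.

```python
# Three hypothetical data sets, each keyed by participant ID.
strength = {"p01": 31.2, "p02": 27.8, "p03": 35.1}
balance = {"p01": 0.82, "p03": 0.91, "p04": 0.77}
age = {"p01": 64, "p02": 71, "p03": 58, "p04": 69}


def combine(**datasets):
    """Keep only participants present in every data set (an inner join)."""
    shared_ids = set.intersection(*(set(d) for d in datasets.values()))
    return {
        pid: {name: data[pid] for name, data in datasets.items()}
        for pid in sorted(shared_ids)
    }


combined = combine(strength=strength, balance=balance, age=age)
for pid, row in combined.items():
    print(pid, row)
```

With only a handful of rows this is easy to do by hand, but the same few lines apply unchanged to thousands of participants, which is where manual matching stops being practical.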

40:20 And then the other thing is, as you mentioned, a lot of this move for all of science and funding agencies, especially, are asking people to make their data public.

40:27 Some journals also ask to put your code public.

40:30 So now, really, you have access to all these things that maybe I can't collect this data.

40:36 Maybe I can't do this, but I have access now.

40:39 I can download it.

40:40 I can search through it.

40:41 I can ask new questions on old data.

40:43 Right.

40:43 Get their Jupyter notebook and their data, and then you can start thinking from there, right?

40:49 You can start exploring and changing it and slicing it differently and trying to make discoveries, right?

40:53 Yeah.

40:54 And I think that's the whole push towards open data is when you do clinical studies, for example, clinical trials to see if an intervention works.

41:01 Well, that's important.

41:03 But then, really, it's the meta-analysis, the thing that puts it all together.

41:06 And so that push for people to present the data in their papers in a way that could be useful for scientists who do science on science.

41:14 They analyze all the studies, put them together and say, overall, the evidence is this.

41:19 Well, similarly, I think this is where coding and open data can help is let's not reinvent the wheel.

41:25 And some of these data sets can be combined or they can just be searched

41:29 by people who have completely different ideas that you'd never thought of; somebody is going to go out there and either contact you or, themselves, just move on with that data.

41:37 And I think just progress will happen a lot faster and more efficiently.

41:41 And that kind of research, as you're pointing out, doesn't really need any grant money.

41:44 I mean, you can work on a fairly cheap machine and just get the data and try out a few things.

41:49 And I think that's where it's going to, definitely.

41:51 Big data, but big collaboration as well.

41:53 Right.

41:53 It needs time, but it doesn't need money for resources potentially, right?

41:58 Yeah.

41:59 It's a culture change because of the incentives.

42:01 There's a lot of incentive to be exclusive.

42:04 I own this data set.

42:06 I'm the one who collected it.

42:07 And my group pushes these results out.

42:10 And I don't want to be scooped by anybody else, for example.

42:13 Some of that I can understand.

42:15 Right.

42:15 There's still discoveries to be made in this data.

42:18 And here's our first one.

42:20 Yeah.

42:20 But we're hanging on to it for the rest, right?

42:22 Which is, yeah, I understand that.

42:23 That might be a hard thought to get that.

42:25 Yeah.

42:25 And so I think it's a culture change that way.

42:28 It's gradual, I think, that people will become more transparent.

42:31 And hopefully, you know, and you can see putting an embargo saying, you know, we'll make it public,

42:36 but in a certain amount of time.

42:37 But at some point, but the competition for these things is quite high.

42:41 So therefore, you can see why some people like to kind of hoard their data a little bit and

42:44 not necessarily make it publicly available.

42:47 But the move is for a lot of the top journals to require people to put their data in some

42:51 form on public repositories.

42:53 So the move is there.

42:54 And if you're funded, actually, a lot of funding agencies in Europe, and I think the NIH has

42:58 some of these policies, is that if you're funded by us, well, you're going to make your data

43:02 available to others.

43:03 Right, right.

43:03 This is ultimately paid for by the taxpayers.

43:05 This is not yours.

43:06 So share.

43:08 Exactly.

43:08 Right.

43:09 Similar for open publication is that, you know, we're giving you the money to do this.

43:12 Therefore, when you publish, you have to publish in open source journals, or you're going to

43:16 pay the fee to make it an open publication within that paywall journal.

43:20 So those are similar in terms of the move towards open science.

43:23 Sure.

43:24 Well, one of the interesting things that you've written about, and I think it comes back to this

43:30 sort of, which comes first, but one of the challenges is the life cycle and the openness

43:38 or the lack thereof of the data and the algorithms and the libraries for science is not necessarily

43:46 the same as for random software project by software developers, right?

43:50 Say versus Flask, right?

43:52 You talked about some of the important lessons that people in the science space can take from

43:57 the computer science side of things, things like GitHub and issue tracking and whatnot.

44:01 Do you want to talk about that a bit?

44:02 Well, it's computer science.

44:04 And I always thought, why is that called a science?

44:06 Really?

44:06 Aren't they just, you know, nerds or geeks on the computer?

44:09 And I had that false perception, obviously.

44:12 And now I'm kind of one of those people myself to some extent.

44:15 But when I was starting out, it was all just, isn't it just a mishmash of they're just typing

44:19 and there's no rhyme or reason about it.

44:21 And to be honest, it's kind of interesting.

44:23 I coded MATLAB for the longest time and I had so many misconceptions.

44:28 My code was, my PhD code was two scripts that were, you know, and I thought it was cool

44:35 that I had like 3,000 some lines of code.

44:37 Yeah.

44:37 I had so much copy and paste.

44:40 It was very embarrassing.

44:41 I don't want to go back to that because I just didn't know any better.

44:44 And the people who taught me didn't know any better.

44:45 Right.

44:46 Nobody said, what are you doing here?

44:47 They're like, well, that looks like mine.

44:49 So this must be fine.

44:50 Exactly.

44:51 Yeah.

44:52 So it's just an inheritance.

44:53 It's just this, unfortunately, the natural selection is in the wrong direction in this case.

44:57 So gradually, you know, I read up more about it.

45:00 Software Carpentry pointed me in a good direction.

45:02 And then I read some books.

45:04 I fell upon your podcast and then Brian Okken as well.

45:08 And I started realizing that all this stuff has already been figured out, which obviously

45:13 is obvious.

45:13 People have figured this out in, you know, a long time ago in the 70s, 80s.

45:17 And up till now, it's just so much simpler.

45:20 The workflow, all the things like GitHub with making suggestions or changes and all those

45:25 kind of things, pull requests.

45:27 Right.

45:27 You put your code up there.

45:28 Somebody will find, if they find something wrong, they'll put an issue in there.

45:32 If they can fix something or improve something, they'll do a PR.

45:35 And it's, you know, science talks so much about this research study is valid.

45:42 We know the Big Bang happened this way because this is peer reviewed, right?

45:46 Yeah.

45:47 And I mean, that's kind of what GitHub is.

45:49 Just less formally, less credentialed, but in a sense, right?

45:53 Yeah.

45:53 But what's interesting there is I spoke to, we have a floor meeting once a week.

45:59 And just before Christmas, I did talk about this comparison of what we can learn from computer

46:05 scientists.

46:06 And that was a bit what I was pointing out was that a computer scientist will write their

46:10 code and do it as best as they can and then put it up there.

46:15 And in a sense, not that they expect there to be bugs because, you know, some people will

46:19 then go to the next step, which is they'll make tests for their code.

46:22 But there's still this knowledge that there possibly and most likely is a bug of some sort

46:28 or a case that I've not considered that if you throw it at my code, okay, it'll crash.

46:33 Whereas, and that's expected.

46:34 And then when you get a pull request, you're thankful and you say, you know, great.

46:38 It's not just improving the code.

46:40 It's actually fixing a problem with my current code.

46:42 And it's this iterative process that happens over time for improvement versus in science.

46:48 And I realize this doesn't speak to everybody out there, but a vast majority of us, if we

46:53 do code, but even if we don't, we just do our work and then we polish it up into this little,

46:59 you know, either Word or PDF document that we submit to a journal.

47:03 And this polished final version has most of the time, no data attached to it.

47:08 It's just a few figures and some tables gets assessed by two, three other of my peers.

47:12 And they have no idea that, you know, all the steps that were taken.

47:17 There is research on this that kind of shows that the number of somewhat arbitrary decisions

47:23 that are made along the way to collect your data in terms of, okay, well, I'm going to use

47:27 this versus that.

47:28 I'm going to exclude these cases versus that.

47:30 I'm going to use this test or this filter setting versus that.

47:32 There's so many permutations that in the end, it's almost, of course, you're going to find

47:36 something significant, but in itself, it's just a self-fulfilling prophecy.

47:40 You're setting yourself up to find something.

47:41 And then as the Texas sharpshooter example, it's that thing of the guy who, he's got a barn

47:48 and there's bullet holes in the barn and there's always a red circle around it.

47:53 And there's a guy who comes by and he says, wow, do you hit that every time?

47:58 And he goes, yeah, I'm so good.

47:59 And then his wife comes out and then she says, he actually just shoots.

48:02 And he draws a circle around it afterwards.

48:04 So if you want to find something, you'll find it.

48:08 And yeah, there's just, I don't want to say that I'm skeptical about other scientists,

48:13 but one of the people I work with, Simon Gandevia, does a lot of work on cognitive biases.

48:19 And we have to almost protect ourselves from our own science in a sense, because the people

48:24 doing it aren't the computers, it's us.

48:26 We have to make the decisions.

48:27 And without knowing it, we have these, and a bias is that, is that you're not aware of it.

48:31 Somebody might even point it out to you and you might just not acknowledge that it's there.

48:36 You might think that doesn't apply to me, but I know other people that it applies to.

48:39 And so we just have to come up with ways.

48:41 And I think that coding, making decisions before the data comes in and making it transparent,

48:46 it's one of the ways that we can improve what we do.

48:48 And so in terms of things that we can learn from computer scientists, I think, for example,

48:53 the push towards publishing the data and the code, it's so interesting because once we publish it,

48:58 if I make that, it's a big stress on people.

49:01 Currently, it's almost like some people don't even make an effort to clean the data or the code

49:05 because they don't want anybody to use it.

49:06 Right, right.

49:07 Just leave it totally obscure and like, well, I have no idea what this does.

49:11 And the variable names are bad.

49:12 And what is this about?

49:13 And just, ah, forget it, right?

49:15 That's the hopeful outcome.

49:16 Yeah.

49:16 And somebody looked at that as well and saying that I think it's over 80% of the data sets

49:20 that are out there or code is actually not really useful.

49:22 But if you think of it from a computer science point of view, well, that's actually almost the

49:27 start of it, right?

49:28 Unless you have a very big group and there's been multiple eyes on the code, this might be the

49:32 first public appearance of this code.

49:34 And it would be nice if people had the chance to look at it, run it a few times, figure out

49:39 if there's any issues.

49:40 But the risk of that, the way the current publishing system works is that if anybody finds a genuine,

49:45 not more just, not a little error, but anything that's of significance, that means the paper

49:50 probably will need to be retracted, which is a huge deal in science.

49:53 Right.

49:53 Versus in computer science, well, sure, put a pull request and I'll fix that for you.

49:58 They're humble enough to know that I'm not perfect.

50:01 There's going to be things, I may have made a mistake in this versus in science, it's what

50:05 we publish and we just want to move on.

50:07 We want to just, you know, that thing's something on my CV.

50:10 I've ticked that box, another number and I move forward.

50:13 So that's one lesson that I don't know how to incorporate it exactly in terms of the workflow

50:18 of how we do things.

50:19 It's very tricky because the incentives are sort of opposed to the right behavior somewhat,

50:25 as you described about like, if there's a big problem, we're going to have to retract the

50:29 paper.

50:31 There are some things trying to solve that a little bit like JOSS, Journal of Open Source

50:36 Software.

50:36 Are you familiar with that?

50:37 Yeah.

50:37 So, I mean, you can publish the code there and get some eyes on it, but fundamentally, right,

50:44 eventually you've got to take it and do the research.

50:46 And if the problem is found afterwards, it's bad.

50:49 But at the same time, I mean, isn't really the zen of science to get it right?

50:54 Again, the pressure, the time constraint that we all have makes it that it doesn't happen.

50:58 But another lesson from computer science, part of this was listening to Brian Okken and his

51:04 podcast and then also trying to get through his book, which is really good.

51:08 But again, it's that thing of it's good, but it doesn't really apply, or at least I couldn't jump

51:14 into it and directly apply it because it's too big of a step for what I do because most of the code,

51:21 some of it is reusable, but a lot of it is case specific.

51:25 And so to build up a whole test suite around my code seems a bit overkill, but that's definitely

51:32 a way to prevent the issues that if I could test my code, maybe simulate some data.

51:36 That'd be very useful.

51:37 And I think it's a great lesson, right?

51:39 Take some pytest and write some tests or even in MATLAB or whatever.

51:43 But I do think testing science is harder than testing a lot of things that people talk about

51:50 testing when they talk about unit testing.

51:52 For example, if I write a test for my online course site, right?

51:56 I want to test that users can sign up, but they can't sign up if they pass a bad email

52:00 address or something like that, right?

52:01 Here's a good email address.

52:03 It returns true.

52:04 Email is valid.

52:05 Here's a bad email address.

52:06 It returns false.

52:07 Email not valid.
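
As a rough illustration of the sign-up check described here, a pytest-style sketch might look like the following. The `is_valid_email` helper and its deliberately simple pattern are hypothetical stand-ins, not code from the course site.

```python
import re

# Hypothetical helper: a deliberately simple pattern, not a full email validator.
_EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_valid_email(address: str) -> bool:
    """Return True if the address looks like name@domain.tld."""
    return _EMAIL_PATTERN.match(address) is not None


def test_good_email_is_valid():
    # Here's a good email address: it returns True.
    assert is_valid_email("user@example.com")


def test_bad_email_is_not_valid():
    # Here's a bad email address: it returns False.
    assert not is_valid_email("not-an-email")
```

Saved in a file such as `test_signup.py` (an assumed name), running `pytest` would discover and run both test functions.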

52:08 But if I have, here are some wiggles in gravity.

52:13 Was that a black hole collision?

52:14 Like, I don't know.

52:16 We've never observed it before.

52:18 I'm not even like, you know, like, you know what I'm getting at?

52:21 It's just like, it's really hard.

52:23 And so, I don't know.

52:24 Do you have any advice?

52:25 Like, maybe take data that you know the outcome of and then try to run your code and predict

52:30 something that you know should be some way?

52:32 Or I don't know.

52:33 What do you advise people like?

52:34 Because it seems much harder from a software side to write the right test for that kind

52:40 of stuff.

52:41 Yeah, I definitely think that's it: test your code where it's possible in terms of just

52:45 line to line or functions.

52:46 But then when it actually is interfacing with the data, I think simulating the data or using

52:52 data that's already been, you know, analyzed and processed.

52:55 Those are the two things that I think have to be done a little bit more.

52:58 And some fields just do it innately, whereas we don't.

53:01 And sometimes there's a limitation to how much we can test the code in this case.

53:05 But I think that being able to simulate data and knowing the outcome before you

53:10 run it through it, it gives you at least some amount of certainty.

53:13 So yeah, that's one thing that people could start doing a little bit more.
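
One way to picture the idea of simulating data with a known outcome is the sketch below. It assumes a very simple analysis step (a difference between two group means); the function names, effect size, and tolerance are all illustrative choices, not anything taken from the guest's own code.

```python
import random
import statistics


def mean_difference(control, treatment):
    """Analysis step under test: difference between the two group means."""
    return statistics.fmean(treatment) - statistics.fmean(control)


def simulate_groups(true_effect, n, seed):
    """Simulate two normally distributed groups whose means differ by true_effect."""
    rng = random.Random(seed)  # fixed seed so the test is repeatable
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treatment = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    return control, treatment


def test_recovers_known_effect():
    # We built the data, so we know the answer should be close to 0.5.
    control, treatment = simulate_groups(true_effect=0.5, n=10_000, seed=42)
    estimate = mean_difference(control, treatment)
    # With 10,000 samples per group the standard error is about 0.014,
    # so a tolerance of 0.1 is generous.
    assert abs(estimate - 0.5) < 0.1
```

The same pattern could be applied to a real pipeline by swapping `mean_difference` for the actual analysis function and simulating data in the shape that function expects.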

53:17 But where do they start?

53:18 I mean, for myself, that's one that I don't know how.

53:20 There's Greg Wilson, the founder of Software Carpentry, has this series of papers on, in

53:26 a sense, best practices for scientific computing, and good enough practices.

53:30 And in there, it's true.

53:31 We don't have enough time to be perfect.

53:33 So good enough for most of us is good.

53:36 And he mentions testing your code.

53:38 But that's where it ends, is that, you know, he might put a reference.

53:41 But I don't necessarily think that the current way to do it for, you know, big software programs

53:47 or projects translate all that well.

53:49 Obviously, some types of sciences, that's what it is, because it's all based around big programs

53:53 and big pipelines of data analysis.

53:55 But on the people that are more like myself, that do data collection themselves, they write

54:00 their code themselves, they write the papers themselves.

54:02 It's, I'm just not sure how to test my code.

54:05 Or if I do, it's going to be clunky, it's going to take me a long time.

54:08 And is it necessary to be at that level?

54:11 I would love for somebody.

54:12 But again, it's this thing, everybody's strapped for time.

54:15 And so somebody like Brian, obviously has no interest in writing a textbook or a blog series

54:20 on how can scientists test their code.

54:22 And also, it's the fact that we're so different, that the fields are so varied.

54:26 And so I still feel that while I think computer uptake in terms of programming, Python, those

54:31 types of things has definitely skyrocketed.

54:33 Some of the tools that come along with computer science haven't matured or morphed enough to

54:40 make it easily implementable.

54:42 So, you know, GitHub is something, and version control is, obviously, another important thing

54:46 to talk about in terms of computer software.

54:49 And I think Software Carpentry teaches that on its two-day course.

54:53 And it's just logical, I mean, in terms of the way.

54:55 Yet scientists, I don't know, there's a comic strip called PhD, Piled Higher and Deeper.

55:02 And there's this one that everybody puts up as a slide often.

55:06 It's the, he has his thesis version as a doc, a Word document.

55:12 And it has a thesis version one doc, and then it has version 1.2.

55:17 And then it has final, and then final, final, and then final, final, underscore, dash, final

55:21 changes.

55:21 And I cringe because I actually looked back when I found that at my own thesis.

55:27 And that's exactly what I had.

55:29 You're like, look, this is an interesting naming convention.

55:32 It's slightly different than mine.

55:33 Exactly.

55:34 Everybody kind of laughs at it when you show it.

55:38 But really, there's no great workaround.

55:40 Because, you know, like a Word, for example, does have this inbuilt thing that you can, you

55:46 know, track changes, everybody loves it.

55:47 But once you accept that change, it's gone.

55:49 So if you want to have a history of it, you need to save a new version of it.

55:53 And there's the more collaborative online things that, you know, Google Docs and stuff like

55:57 that.

55:57 But still, the idea of having a history, and in a sense, that's what science is about, is

56:02 being able to show from the raw data, all the steps that were done.

56:06 And I think it should continue on to the manuscript as well.

56:09 So version control is another important lesson that we can apply, I think.

56:13 I just noticed that Google Docs, I know they had a version history, but you can tag different

56:17 versions, which is kind of cool.

56:19 Like, so you could have final, final, final, final changes.

56:22 Just in one doc, right?

56:27 And I guess the last thing to highlight, I think that, again, I'm a bit of a geek, so

56:32 I'd actually like to do this.

56:33 But the people I've been surrounded by generally aren't people who code and especially don't

56:38 code in Python.

56:38 But this idea of a code review.

56:40 Yeah.

56:41 I was baffled to hear that people actually do that.

56:43 Like, people sit in a room and present code.

56:46 And I was just like, that would be, as a person who doesn't know how to code or just learning,

56:50 to just sit in on that would be useful.

56:52 And then also to sit there and have other people who code kind of critique it, but also improve

56:57 it or see if there's any issues and help answer some of my questions.

57:01 I'm like, wow.

57:02 And I'm sure it happens in some bigger labs.

57:03 Yeah.

57:04 But on the day-to-day, pretty much my second personality is the person I do my code review

57:09 with.

57:09 That's about all I have is I just sit there and I, does this make sense?

57:13 Talk out loud, maybe.

57:13 And it's another way that, although it seems currently that the publication, once it's out

57:19 there, you don't want any problems to be found because you may have to retract it.

57:23 So what can be done beforehand?

57:25 Well, you know.

57:26 Right.

57:26 How much can you bring that up ahead of the release?

57:29 Exactly.

57:30 Yeah, yeah, yeah.

57:31 Absolutely.

57:31 Yeah.

57:31 So that's one other thing that we could do.

57:33 But the time constraints make it that it just, unfortunately, I haven't seen it implemented

57:37 very much.

57:37 Well, I think it's probably, it sounds to me like it's, in my experience, working in

57:41 like a cognitive science research lab and some other stuff, I feel like maybe it's time,

57:47 but even the bigger challenge is expertise, right?

57:51 Usually, you know, who are you going to go to?

57:53 Because you're the person that knows the most about that.

57:56 So here we are.

57:57 I'm going to do what I can, right?

57:58 Yeah.

57:59 I think it would be a really cool service to set up some kind of, not like GitHub, but

58:05 some kind of online thing.

58:07 Maybe you can link your GitHub profile to it or something where it allows scientists to donate

58:15 a little bit of time in order to receive code reviews of their code, right?

58:19 Like if I could put up my code into this project, you know, it'll still stay hidden, not going

58:24 to come out.

58:24 But if I agree to code review somebody's stuff collaboratively with them, then they'll do

58:29 it for me or someone else will do that for mine, right?

58:32 Like I earned a half hour of code review by doing a half hour for someone else.

58:36 Or I don't know.

58:37 It seems like there's probably some way to like put this out on the internet and create it in

58:41 a way where, you know, the 20 people that are doing this kind of thing actually could

58:45 get together and have a look and help each other out maybe.

58:48 Yeah.

58:48 No, I think that'd be useful.

58:49 And, you know, you have to find the people that you would trust to do that.

58:53 But it's similar to, you know, the earning beans for giving good feedback.

58:57 I think here you can earn trust points or whatever it would be.

59:00 And, you know, the quality of your comments and keeping it to yourself.

59:04 But yeah.

59:04 And proper coding practices, simply breaking it down into small, biteable chunks means that you

59:10 can reuse it yourself.

59:11 So once maybe you've got some feedback on a section or maybe somebody will tell you,

59:15 turn that into a function.

59:16 Well, then once it's been reviewed, you can kind of trust it.

59:19 But I think a lot of us just get into it.

59:22 We just want the answer.

59:22 And we, in a sense, we probably reinvent the wheel way too many times.

59:25 Yeah.

59:25 That's something I definitely saw a lot of is like, here's a huge long script with no

59:30 function or branching or other than an if statement.

59:33 You know, it doesn't have, it doesn't have a lot of structure to it, which means it's nearly,

59:37 it cannot be reused almost, right?

59:40 Exactly.

59:40 And one of the things you touched on is one of the big differences between scientific

59:45 computation code and like computer science, formal software developer code is you hinted

59:52 at it before, but it's like you often hear that code is read way more times than it's written.

59:58 It should be written first for the software developers.

01:00:01 And then secondly, so it also runs things like that, right?

01:00:04 Whereas with science code, it's like, once you get it working and you get the graph, you're

01:00:09 done.

01:00:09 You don't need to polish it.

01:00:10 You don't need to touch it again.

01:00:11 It works.

01:00:13 It did the thing.

01:00:14 We got the outcome.

01:00:15 We're going to come up with something different next time or whatever.

01:00:17 Right.

01:00:18 And I think that puts very different kinds of pressure on organizing your code, reusing

01:00:23 your code, documenting your code.

01:00:24 There's just like all these different pressures.

01:00:26 And I'm not saying one is better than the other necessarily.

01:00:29 If your job is to like, I got to quickly analyze this data and get it out.

01:00:32 Like, I'm not going to tell you all.

01:00:34 And you have to make a package while you're at it, right?

01:00:36 Like, that's not your problem.

01:00:37 But it does mean the code is treated differently.

01:00:41 It potentially has errors lurking in different ways and so on, right?

01:00:45 Yeah.

01:00:45 And I think the workflow hasn't been adapted very much to the lifecycle of the kind

01:00:52 of code that we would write.

01:00:53 And I think it's the lesson of take as much as you can and turn it into functions and things

01:00:58 that you can reuse.

01:00:59 And then, you know, build tests around those, for example, or have code reviews on those.

01:01:04 And then the nuances for this current study, well, that'll be, you know, something to maybe

01:01:08 just be a little bit more careful about and maybe have somebody look at it or check the

01:01:12 outputs.

01:01:12 But it's true that, you know, I was listening, and it really rang a bell when Brian Okken said

01:01:17 that about, you know, in terms of testing, why not?

01:01:20 When once the test is built, once the code is written, you're going to make it legible,

01:01:24 obviously make it clear, but also make sure that it's correct because it's, you're going

01:01:28 to be looking at this over and over and over.

01:01:30 And it's going to, you know, whereas in science, I write all the time.

01:01:34 I read very little code.

01:01:36 I barely know how to read other people's code.

01:01:38 And that's interesting to me too, because I'm sure that with all your experience of the

01:01:43 various languages that you know, you can kind of look at different code and kind of pick

01:01:48 up quickly the general structure of things, even if the variable names are funny or versus

01:01:53 for me, because I've got such a personal style that I've kind of adopted based on the

01:01:57 various things, the resources I've used, you know, I can read my code very well.

01:02:01 But then the second I go just slightly differently, I'm just bleary eyed.

01:02:06 I'm like, what are they doing here?

01:02:07 And I can't.

01:02:08 And it's because we very much just code for ourselves.

01:02:10 Not everybody, but many scientists code for themselves or maybe one other person.

01:02:15 Yeah.

01:02:16 Yeah.

01:02:16 And so it makes it a little bit more difficult.

01:02:19 But that idea of code reviews, I agree that if you can't, somebody else doesn't know how

01:02:23 to code, well, they're not going to be able to help you with the line to line things.

01:02:27 And while I have a bit of a love-hate relationship with Jupyter notebooks, the one thing that I

01:02:33 think they're really useful for is you can sit in a room with your lab group, for example,

01:02:40 and you can intertwine code with outputs, with figures, and you can tell the story of

01:02:47 your code.

01:02:48 And while, you know, if you're calling the wrong function or there's a little

01:02:53 bit of a mistake, those might be more difficult for others who don't code to find, but at least

01:02:57 the structure of your code, what it's doing, what exceptions it's trying to catch.

01:03:02 You can do that.

01:03:03 I mean, I think Python's very readable as a language, but I think that the Jupyter notebook

01:03:07 makes it even more accessible because you can document it.

01:03:09 You could put little interpretations here and there, and then the figures are output.

01:03:12 It's kind of a way to do a bit of a code review for people who would probably run away from

01:03:17 that and be a bit scared of saying, let's come to this code review at 12 today.

01:03:21 Nobody's going to show up, but if you can, come on, I'm going to show you how I analyze

01:03:24 my data and I'll show you some results.

01:03:26 You can kind of secretly put in a bit of a code review.

01:03:29 You can hide it in a Jupyter notebook.

01:03:31 Yeah.

01:03:31 Disguise it.

01:03:32 Exactly.

01:03:32 That's a good idea.

01:03:35 Very, very cool.

01:03:35 Well, you know, thinking through this, looking back on what you've been saying,

01:03:39 it does sound like there's some interesting things to take from the computer science world

01:03:43 and adopt into the scientific world.

01:03:45 But I would say one really good thing that it seems to me is if what we're doing is moving

01:03:51 people from Excel into something like Python, even if there's fewer tests than maybe ideally

01:03:57 and whatnot.

01:03:58 Surely the code can be more validated in Python or Jupyter than it could be as it is in Excel,

01:04:06 right?

01:04:06 So that's got to be good for reproducibility and correctness.

01:04:09 Yeah.

01:04:09 No, I totally agree.

01:04:10 And simple lessons about, you know, variable naming and just how to structure the code into

01:04:16 functions.

01:04:16 The more functions you have, then the main script, if you like, just becomes very readable.

01:04:21 If you give it in one of your courses, it's all about you highlighted, just make, I was

01:04:25 surprised at how long your function names were.

01:04:27 But on the other hand, they were so clear.

01:04:30 I'm like, I know exactly what this thing does versus I had this weird, I don't know where

01:04:34 I learned it, but you know, you had to abbreviate everything and keep everything as short as you

01:04:37 can.

01:04:37 Right, right, right.

01:04:37 And then six months from now, I can't remember what it does.

01:04:40 But when you take the time to just make a bit of a very clear descriptive function name,

01:04:45 then the main script actually reads really well.

01:04:47 And so those are little lessons that, you know, it'll help everybody interpret it a bit better.
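
To make the point about descriptive names concrete, here is a small sketch. Every function name and the toy data are invented for illustration; the aim is only to show how the main script can read almost like a list of analysis steps.

```python
import statistics


def remove_missing_samples(samples):
    """Drop samples recorded as None (for example, dropped readings)."""
    return [s for s in samples if s is not None]


def subtract_baseline(samples, baseline):
    """Shift every sample so the baseline level sits at zero."""
    return [s - baseline for s in samples]


def mean_absolute_amplitude(samples):
    """Average size of the samples, ignoring their sign."""
    return statistics.fmean([abs(s) for s in samples])


def main():
    # Toy data standing in for a real recording.
    raw_samples = [1.2, None, 0.8, 1.5, None, 1.1]

    # The main script now reads as a sequence of named steps.
    clean_samples = remove_missing_samples(raw_samples)
    corrected_samples = subtract_baseline(clean_samples, baseline=1.0)
    amplitude = mean_absolute_amplitude(corrected_samples)
    print(f"Mean absolute amplitude: {amplitude:.3f}")


if __name__ == "__main__":
    main()
```

Compare this with abbreviated names such as `rms`, `sb`, or `maa`: the short forms save typing, but six months later the longer names are the ones that still explain themselves without a comment.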

01:04:52 And if you make that public, it's so much easier for somebody else to look at that, possibly

01:04:56 even reuse it.

01:04:57 Whereas if you open up an Excel spreadsheet, I don't even know, how would you find out all

01:05:02 the cells that have background computations or macros?

01:05:06 I don't even know how you get started.

01:05:07 I have no idea.

01:05:08 You can't go through every cell, it's in Excel, there you got it, it's crazy.

01:05:12 Yeah, yeah.

01:05:13 Well, I really appreciate that comment.

01:05:14 I'm like, you're allergic to Excel, I'm allergic to comments.

01:05:18 So anytime I'm about to write a little name function and then give it a comment to say what

01:05:23 it should do, I'm like, oh, maybe its name should just be what it does.

01:05:26 And then we won't need a comment, will we?

01:05:28 Yeah, no, that was a good tip.

01:05:29 So I use that all the time there.

01:05:31 Awesome.

01:05:31 I really do like this Good Enough Practices in Scientific Computing that you referenced.

01:05:36 And I'll be sure to put that in for the show notes.

01:05:39 But it's got like really easy to adopt and reasonable stuff.

01:05:44 So it's clear, concise steps.

01:05:46 That's really great.

01:05:47 All right, Martin, I think it's a good place to leave it for our main topic.

01:05:50 But before you get out of here, let me ask you the last two questions.

01:05:53 If you're going to write some Python code, do some research.

01:05:56 What editor do you use?

01:05:57 It's been PyCharm for a few years now.

01:05:59 And just to anybody who out there is an academic or a researcher, you can get a free pro license

01:06:05 if you just email them your academic email.

01:06:07 And you can renew it every year.

01:06:09 So it just gives you that extra bit of...

01:06:10 So if it comes from a .edu or something like that?

01:06:13 Exactly.

01:06:13 And then you get the functionalities for the scientific mode.

01:06:16 That's the main one I would use.

01:06:17 Right.

01:06:18 And you can actually access all the JetBrains programs, IDEs.

01:06:23 But personally, I just use PyCharm.

01:06:24 But yeah, so we can get a pro license as academics.

01:06:27 Nice editor and nice tip.

01:06:29 All right.

01:06:29 A notable PyPI package that you've run across that people should know about?

01:06:32 I would say one that I emulated because at the time it was a bit above my head.

01:06:37 But there's something called PsychoPy.

01:06:39 Yeah.

01:06:39 Not as in psychopath, but psychology.

01:06:41 And it's this framework to test, to collect data in presenting different types of stimuli

01:06:47 that you might do in psychology, like visual illusions or sounds.

01:06:50 And it's a nice way they actually have now implemented it.

01:06:54 So you can actually design and run your experiments right on the web.

01:06:56 But you can also implement it on your own machines.

01:06:59 So PsychoPy is a group in England that's been doing that.

01:07:02 Just recently, there's a package called DABEST.

01:07:05 I think it's DABEST.

01:07:07 I don't know how to say it.

01:07:08 D-A-B-E-S-T.

01:07:09 Py.

01:07:10 DABEST.

01:07:11 Yeah.

01:07:12 It's data analysis through estimation.

01:07:14 And it's pretty much there's this trend towards moving away from just simple p-values and just saying that something is significant or not

01:07:22 to providing appropriate estimates of these effect sizes.

01:07:24 And sometimes your data is parametric, which means normally distributed.

01:07:28 Other times it's not.

01:07:29 And so there's this move towards estimating these things in the confidence intervals.

01:07:33 But some of the statistical software doesn't do it for you.

01:07:36 So these people, I'm pretty sure they're from Singapore.

01:07:39 And it's cross-platform, but they have a version for Python.

01:07:42 And it gives you these beautiful plots.

01:07:43 It computes your estimates for you, but it gives you these wonderful graphs of the effects.

01:07:48 And it's all done through bootstrapping.

01:07:50 So it has no assumptions about the distribution of your data.

01:07:53 So that's a really neat one.

01:07:54 It came out, I think, not even a year ago.

01:07:56 So that's quite useful.
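
Editor's note: the bootstrapping idea Martin describes can be illustrated with a minimal sketch using only the Python standard library. This is not DABEST's API or its exact method; it is a plain percentile bootstrap of the difference between two group means, and the function name `bootstrap_mean_diff`, its parameters, and the sample numbers are all made up for illustration.

```python
import random
import statistics


def bootstrap_mean_diff(control, test, n_resamples=5000, ci=95, seed=0):
    """Percentile-bootstrap estimate of mean(test) - mean(control).

    Hypothetical helper for illustration only; not part of DABEST.
    Returns (observed difference, lower bound, upper bound).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = statistics.mean(test) - statistics.mean(control)

    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        t = rng.choices(test, k=len(test))
        diffs.append(statistics.mean(t) - statistics.mean(c))

    # Read the interval straight off the sorted resampled differences,
    # so no normality assumption is needed.
    diffs.sort()
    alpha = (100 - ci) / 200
    lower = diffs[int(alpha * n_resamples)]
    upper = diffs[int((1 - alpha) * n_resamples) - 1]
    return observed, lower, upper


# Made-up measurements for two groups (not real data).
control = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9, 5.1, 4.6]
test = [5.6, 6.1, 5.2, 6.4, 5.9, 5.5, 6.0, 5.8]

observed, lower, upper = bootstrap_mean_diff(control, test)
print(f"mean difference = {observed:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Reporting the effect size together with its interval, rather than only a significant/not-significant verdict, is the shift in practice discussed above. A dedicated package such as DABEST adds refinements and the plots Martin mentions.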

01:07:57 Yeah, those are great suggestions.

01:07:58 Awesome.

01:07:59 All right.

01:07:59 Final call to action.

01:08:00 People are out there.

01:08:01 Maybe they're scientists or do scientific computing.

01:08:04 They want to bring some more of these ideas from computer science to their world.

01:08:07 I would say for yourself, smallest, biteable chunk.

01:08:12 Implement something for yourself.

01:08:14 And then also just try to, in the tea room, coffee room, just mention it to people.

01:08:18 And if you hear about issues, don't sound like the smartass who says, why didn't you code that?

01:08:23 Just say, hey, if you want to sit down for two seconds, I can just show you something.

01:08:27 And because I think that there's this dualism of people who code and those who don't.

01:08:31 And I think that the magic of it for some people is a bit intimidating.

01:08:35 And so rather than keeping the magic for ourselves and being a bit elusive, I think just demystify

01:08:41 the whole thing and just say, hey, look, it's really simple.

01:08:43 Let's just do this.

01:08:44 And I think the more people can do that and make it part of the culture of how we do things,

01:08:48 I think that helps.

01:08:49 And support from above.

01:08:51 So the senior scientists who may not have come through a time when coding was available all

01:08:56 that readily or was a specialty, even if they don't want to themselves, support the

01:09:00 junior people and realize that, you know, for their future careers and just for the betterment

01:09:04 of reproducible science, try to support it if you can.

01:09:07 Very good advice and super interesting ideas.

01:09:10 Thanks for sharing them with us.

01:09:11 All right.

01:09:11 No, thank you, Michael.

01:09:12 You bet.

01:09:12 Bye.

01:09:13 See you.

01:09:21 This has been another episode of Talk Python To Me.

01:09:23 Our guest on this episode was Martin Héroux, and it's been brought to you by Clubhouse and Linode.

01:09:29 Clubhouse is a fast and enjoyable project management platform that breaks down silos and brings teams

01:09:34 together to ship value, not features.

01:09:36 Fall in love with project planning.

01:09:38 Visit talkpython.fm/clubhouse.

01:09:42 Start your next Python project on Linode's state-of-the-art cloud service.

01:09:45 Just visit talkpython.fm/Linode, L-I-N-O-D-E.

01:09:50 You'll automatically get a $20 credit when you create a new account.

01:09:53 Want to level up your Python?

01:09:55 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

01:10:00 Or if you're looking for something more advanced, check out our new async course that digs into all the

01:10:06 different types of async programming you can do in Python.

01:10:08 And of course, if you're interested in more than one of these, be sure to check out our

01:10:12 Everything Bundle.

01:10:13 It's like a subscription that never expires.

01:10:15 Be sure to subscribe to the show.

01:10:17 Open your favorite podcatcher and search for Python.

01:10:19 We should be right at the top.

01:10:20 You can also find the iTunes feed at /itunes, the Google Play feed at /play,

01:10:25 and the direct RSS feed at /rss on talkpython.fm.

01:10:29 This is your host, Michael Kennedy.

01:10:32 Thanks so much for listening.

01:10:33 I really appreciate it.

01:10:34 Now get out there and write some Python code.

01:10:36 I'll see you next time.