Maintainable data science: Tips for non-developers

Did you come to software development outside of traditional computer science? This is common, and even how I got into programming myself. I think it's especially true for data science and scientific computing. That's why I'm thrilled to bring you an episode with Daniel Chen about maintainable data science tips and techniques.

Episode Deep Dive

Guest introduction and background

Daniel Chen is a data scientist and educator who came to software development from a non-traditional path. Originally immersed in neuroscience and epidemiology, he discovered programming out of necessity for data analysis and later became an instructor with Software Carpentry and Data Carpentry. Daniel has deep experience teaching scientists how to structure, clean, and analyze data with Python and R, and he authored the book Pandas for Everyone. His unique journey highlights how many data scientists arrive at coding from fields like public health or the life sciences and learn vital software engineering practices over time.

What to Know If You're New to Python

Many listeners of this episode come from non-computer-science backgrounds and are just beginning their Python journey. A good grasp of basic file operations, loops, and Python’s package ecosystem will help you follow along with discussions on structuring and reusing code. If you’re unfamiliar with version control (like Git) or simple best practices for writing Python functions, reviewing these concepts can be beneficial. Below are a few resources to get you started exploring Python more effectively:

Python for Absolute Beginners (training.talkpython.fm/courses/explore_beginners/python-for-absolute-beginners?utm_source=talkpythondeepdive): Learn Python’s fundamentals, including core syntax and how to think in code.
Practice writing small functions to transform data, and gradually move repeated code blocks into functions or modules for reusability.
Experiment with Jupyter Notebooks if you’re coming from a data or scientific background, but consider how you’ll keep your code clean and versioned.

Key points and takeaways

Project Organization as a Foundation: Maintaining a clear and logical directory structure is crucial for managing both code and data effectively. Placing raw data in a read-only folder, saving processed versions in separate folders, and using scripts to transform and save data incrementally helps avoid confusion. This discipline also paves the way for more advanced workflow tools (e.g., Makefiles or DAG systems like Airflow) and fosters reproducibility.
- Tools and references:
  - “A Quick Guide to Organizing Computational Biology Projects” by William Noble
  - Good Enough Practices in Scientific Computing
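A minimal sketch of that layout, using only the standard library. The folder names (data/raw, data/processed, output), the helper names, and the cleaning rule are assumptions made for illustration; the episode does not prescribe them.

```python
import csv
import stat
from pathlib import Path


def init_project(root: Path) -> dict:
    """Create an illustrative raw/processed/output layout under `root`."""
    dirs = {
        "raw": root / "data" / "raw",              # original files, never edited
        "processed": root / "data" / "processed",  # outputs of cleaning scripts
        "output": root / "output",                 # figures, tables, reports
    }
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs


def mark_read_only(path: Path) -> None:
    """Drop write permission so the raw file is not overwritten by accident."""
    path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


def clean_step(raw_file: Path, processed_file: Path) -> None:
    """Read a raw CSV and write a cleaned copy; the raw file is only read."""
    with raw_file.open(newline="") as src, processed_file.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        # Hypothetical cleaning rule: trimmed, lower-case column names.
        fieldnames = [name.strip().lower() for name in reader.fieldnames]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow({k.strip().lower(): (v or "").strip() for k, v in row.items()})
```

Each later step would read from data/processed and write a new file there, so any stage can be rerun without touching the originals.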
Tidy Data and Data Cleaning: Adopting “tidy data” principles allows you to spot and fix data issues early in your pipeline. Consistent column names, correct handling of missing data, and normalizing data formats make subsequent analysis much smoother. This approach also encourages writing small, composable scripts rather than one giant notebook of transformations.
- Tools and references:
  - PyJanitor (for cleaning workflows in pandas)
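As a rough illustration of the idea, here is a small pandas sketch that reshapes a "wide" table into tidy form. The data, the column names, and the decision to drop missing values are all made up for the example; whether dropping is the right way to handle missing data depends on the analysis.

```python
import pandas as pd

# Hypothetical "wide" table: the year variable is hidden in the column headers.
wide = pd.DataFrame(
    {
        "Country Name": ["A", "B"],
        "2018": [1.0, 3.0],
        "2019": [2.0, None],
    }
)


def tidy(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (country, year) observation."""
    out = df.rename(columns={"Country Name": "country"})  # consistent column name
    out = out.melt(id_vars="country", var_name="year", value_name="value")
    out["year"] = out["year"].astype(int)  # years arrive as strings from the header
    # One possible policy for missing data: drop rows without a value.
    return out.dropna(subset=["value"]).reset_index(drop=True)


tidy_df = tidy(wide)
```

After the reshape, each variable (country, year, value) has its own column, so grouping or plotting by year no longer requires knowing which columns happen to be years.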
Breaking Up Jupyter Notebooks: Jupyter notebooks are popular for quick exploration but can become unwieldy if you cram all code, documentation, and transformations into a single file. Moving data-cleaning steps into separate Python modules or scripts can reduce clutter and clarify the logical flow of your analysis. Keeping notebooks concise, often just for final visualizations or experiment tracking, makes version control diffs far more manageable.
- Tools and references:
  - Papermill (parameterizing and executing notebooks)
  - Google Colab (collaborative notebooks)
Code Smells and Refactoring: Even if your code runs successfully, structural “smells” (e.g., huge functions, deeply nested if-statements, too many repeated lines) can harm maintainability and readability. Paying attention to these smells lets you refactor code into smaller, more reusable functions. This practice is essential in data science contexts, where experimental code often grows organically.
- Tools and references:
  - Jenny Bryan’s Code Smells talk (R community focus, but widely applicable)
  - PyProjRoot (helps avoid “path smell” in notebooks)
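To make the "path smell" concrete: hard-coded absolute paths or chains of "../.." break as soon as a notebook moves or a collaborator clones the project. The sketch below shows the general idea of locating the project root by walking up to a marker file. It is a standard-library illustration, not pyprojroot's actual implementation, and the function name and default markers are assumptions.

```python
from pathlib import Path
from typing import Optional, Sequence


def find_project_root(
    start: Optional[Path] = None,
    markers: Sequence[str] = (".git", "pyproject.toml"),
) -> Path:
    """Walk upward from `start` until a directory contains one of `markers`."""
    current = (start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"No project root found above {current}")


# Typical use from a notebook or script anywhere inside the project
# (the data/raw/survey.csv path is hypothetical):
#   data_file = find_project_root() / "data" / "raw" / "survey.csv"
```

Because every path is built from the discovered root, the same code works from a notebooks/ subfolder, from the repository top level, or on someone else's machine.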
Naming Conventions and File Numbering: A simple but powerful habit is to prefix script names with numbered sequences (e.g., 001_load_data.py, 010_clean_data.py, 020_analysis.py) to reflect processing order. This approach is especially helpful in academic or research settings where there’s a definite sequence of transformations. It also integrates neatly with incremental building workflows, caching partial results along the way.
- Example references:
  - “0, 1, 2” or “001, 010, 020” prefix style for ordering scripts
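One way the numbering can pay off is a tiny runner that executes the scripts in prefix order. The sketch below is an assumption about how you might wire this up, with hypothetical helper names; it is not a tool mentioned in the episode.

```python
import re
import subprocess
import sys
from pathlib import Path

# Matches a leading numeric prefix such as "001_" or "20_".
PREFIX = re.compile(r"^(\d+)_")


def ordered_scripts(names):
    """Return the numbered script names sorted by numeric prefix; skip the rest."""
    numbered = []
    for name in names:
        match = PREFIX.match(Path(name).name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


def run_in_order(script_dir: Path) -> None:
    """Run each numbered script in sequence, stopping at the first failure."""
    for name in ordered_scripts(p.name for p in script_dir.glob("*.py")):
        subprocess.run([sys.executable, str(script_dir / name)], check=True)
```

Zero-padded prefixes also sort correctly in a plain file listing, and leaving gaps between numbers makes it easy to slot in a new step later without renaming everything.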
Collaboration and Git: Using Git or another version control system is essential, even for solo projects, because it preserves a complete history of your code. Commit and push small increments often to keep changes reviewable and traceable; this matters even more when multiple collaborators share the same code base. Remember that notebooks can be tricky in Git diffs, which is another reason to separate logic into .py files.
- Tools and references:
  - GitHub
  - Pair programming for real-time collaboration
Pair Programming for Faster Learning: Pair programming can accelerate learning, uncover hidden assumptions, and improve code quality, because two people working side by side often spot mistakes and alternative approaches more readily. In a research environment, pair programming is less common because many projects are single-person efforts, but it can be a game-changer when introduced.
- Tools and references:
  - VS Code Live Share for remote pair programming
Data Science vs. Traditional Software Engineering: Many data scientists start coding primarily to solve a problem, then discover that maintainable software practices make their results more reproducible and scalable. Concepts like modular design, version control, and code reviews might feel “optional” at first, but they evolve into essentials as projects grow. Embracing these practices (small functions, naming standards, consistent packaging) ultimately saves time and frustration.
Extending pandas for Cleaner APIs: Tools like pandas-flavor let you add methods to native DataFrame objects without forking or changing pandas itself. This can simplify data transformations, letting you keep your logic close to the data and remain consistent with “method chaining” or fluent APIs.
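To give a feel for the pattern without an extra dependency, the sketch below uses pandas' own built-in accessor registration rather than pandas-flavor. The accessor name "clean", the method name, and the sample data are all hypothetical. With pandas-flavor the registered function would be called directly on the DataFrame instead of through an accessor namespace.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("clean")
class CleanAccessor:
    """Hypothetical accessor that groups project-specific cleaning helpers."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    def snake_case_columns(self) -> pd.DataFrame:
        """Return a copy with trimmed, lower-case, underscore-separated names."""
        renamed = {
            col: str(col).strip().lower().replace(" ", "_")
            for col in self._df.columns
        }
        return self._df.rename(columns=renamed)


# Made-up data with inconsistent column names.
df = pd.DataFrame({"Patient ID": [1, 2], " Age ": [34, 51]})

# The custom step slots into an ordinary method chain.
result = df.clean.snake_case_columns().query("age > 40")
```

The original DataFrame is left untouched, and the cleaning logic lives in one named, reusable place instead of being repeated across notebook cells.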
Incremental Improvement in Data Projects: Daniel emphasized the “10% better each time” philosophy for code organization. Rather than overhauling an entire codebase all at once, adopt small, consistent improvements. This incremental approach not only helps maintain momentum but also ensures your code evolves gracefully alongside your experiments and data changes.
Be Kind, All Else Is Details: Daniel references Greg Wilson’s reminder from the teaching perspective: “Be kind, all else is details.” Beyond coding best practices, a supportive and empathetic community fosters learning and growth. This spirit is a hallmark of the Python ecosystem and helps retain newcomers, especially those from non-CS backgrounds.

- Tools and references:
  - Greg Wilson’s Teaching Tech Together

Interesting quotes and stories

“I was always tinkering around with computers… but it wasn’t until I saw other people learning programming from scratch that I realized my earlier exposure gave me a head start.” - Daniel Chen

“If you’re in a situation where you’re running bits of code all over your notebook, that’s a sign to fix it. It’s cheap to create a new file.” - Daniel Chen

“Teaching is the best way to learn. The more I taught, the deeper my own understanding got.” - Daniel Chen

“Be kind, all else is details.” - Greg Wilson (via Daniel Chen)

Key definitions and terms

The Carpentries: A collective of projects including Software Carpentry, Data Carpentry, and Library Carpentry that teach foundational coding and data science skills to researchers worldwide.
Tidy Data: A principle where each variable forms a column, each observation forms a row, and each type of observational unit forms a table, facilitating cleaner transformations and analyses.
Code Smells: Signs that code, though functional, may be poorly structured or hard to maintain (e.g., massive functions, nested loops, repeated blocks).
Pair Programming: A collaboration method where two people work together on one machine or shared environment to improve code quality and learning speed.
Papermill: A tool that allows you to parameterize and execute Jupyter notebooks, capturing the output in a standardized, reproducible way.

Learning resources

Python for Absolute Beginners - For listeners who need a methodical introduction to Python basics.
Move from Excel to Python with Pandas - Ideal for those coming from spreadsheets, wanting to learn Pandas for data cleaning and analysis.
Data Science Jumpstart with 10 Projects - Project-based approach to learning essential data science practices in Python.

Overall takeaway

This episode underscores the importance of applying software engineering principles, such as version control, modular code, naming conventions, and thoughtful file structures, to the day-to-day workflows of data scientists. By emphasizing incremental improvements, reusability, and collaboration, Daniel Chen shows how these practices save time and boost reproducibility. Whether you come from a biology lab or a traditional software background, adding these maintainable processes leads to better data projects and a more scalable, enjoyable coding experience.

Links from the show

Daniel on Twitter: @chendaniely
Pandas for Everyone book: amazon.com
pyprojroot project: github.com
Pyopensci: pyopensci.org

Jenny Bryan naming things: speakerdeck.com

Jenny Bryan’s code smells:
Talk: youtube.com
Slides: speakerdeck.com

Three highly relevant papers:
A Quick Guide to Organizing Computational Biology Projects: journals.plos.org
Best Practices for Scientific Computing: plos.org
Good enough practices in scientific computing: plos.org

Episode #227 deep-dive: talkpython.fm/227
Episode transcripts: talkpython.fm

Episode Transcript

00:00 Did you come to software development outside the traditional computer science path?

00:04 This is common, and it's even how I got into programming myself.

00:07 I think it's especially true for data scientists and folks doing scientific computing.

00:12 That's why I'm thrilled to bring you an episode with Daniel Chen about maintainable data science tips and techniques.

00:18 This is Talk Python To Me, episode 227, recorded August 6th, 2019.

00:23 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

00:43 This is your host, Michael Kennedy.

00:44 Follow me on Twitter, where I'm @mkennedy.

00:47 Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython.

00:54 This episode is sponsored by Indeed and Rollbar.

00:56 Please check out what they're offering during their segments.

00:59 It really helps support the show.

01:00 Dan, welcome to Talk Python To Me.

01:03 Hi, Mike. Nice to meet you.

01:04 It's great to meet you as well.

01:05 I'm so glad that we got a chance to run into each other at PyCon this year

01:09 and learn about what you're up to because we're going to have a good time talking about it.

01:13 Yeah, and this year was the first year I was at PyCon, and I typically live in the data science world,

01:18 so it was, one, super cool to be at pretty much a convention of Python users.

01:23 And I'd almost forgotten how Python is outside of data science.

01:29 Django is a thing.

01:31 It was a thing that was repeated back to me.

01:35 Yeah, exactly.

01:36 That's pretty interesting.

01:37 What was your take on it?

01:39 Do you recommend people, especially data scientists, attend PyCon?

01:43 You're happy you went?

01:44 Yeah.

01:44 I mean, it was super cool.

01:46 I mean, data science is sort of one of the growing parts of Python as a language.

01:51 And I think a lot of people have said it's sort of the reason why Python has picked up in popularity recently.

01:58 And so, yeah, it was super cool just to see all the booths.

02:02 I personally gave a pandas tutorial there, so it is becoming more and more of a thing.

02:08 And I think there were two or at least three pandas-related tutorials during the session.

02:15 Yeah, I know Kevin Markham gave one as well, I'm pretty sure.

02:20 At least something on data science there.

02:22 So, yeah, there was definitely some interest.

02:23 I think I met one other person who's doing one.

02:25 So, yeah, it's pretty incredible, right?

02:27 Yeah, yeah.

02:28 And then, again, there's this whole web stack of things that I almost never really use,

02:35 but a lot of people do use it as well.

02:38 So, it's super cool just to see it and be reminded what Python can do as a language.

02:45 Yeah, that's cool.

02:45 And for me, it's exactly the opposite, right?

02:47 Like, I spend a lot of my days writing web apps and APIs and things.

02:50 And then to see the data science stuff, it really reminds me, like, there's a really different way to work and other things to optimize than, you know, scalable web apps.

02:59 Yeah, yeah.

03:00 For sure.

03:00 So, totally recommended to go.

03:02 And, like, you know, there was, like, some talk on Twitter, like, don't always try to do the hallway track if you can, because, like, speakers,

03:10 sometimes speakers would like people in their audience.

03:12 But I tried to go to the talks that I can.

03:16 But then there were a few, like, education-related hallway track groupings or meetups.

03:21 And that's what I attended to.

03:23 So, it was just nice seeing other, like, Python educators, which I also went to SciPy a couple weeks later.

03:30 So, it was some of the people I saw again for the second time.

03:32 And I was like, oh, cool.

03:33 Yeah, I definitely love going to PyCon.

03:35 As people know, I talk about it all the time.

03:37 And it's a really great experience.

03:39 And what I think is interesting is a lot of people feel like they have to be experts to go.

03:42 I met a lot of people who are fairly beginner in their career.

03:45 And it was really valuable to them to be there.

03:47 So, I just want to throw that out there for people.

03:49 Yeah, super welcoming.

03:50 I mean, that's sort of the reason why I stuck around with Python.

03:54 I'm also, like, pretty active in the R community as well.

03:57 And between Python and R, a lot of people join for whatever reason.

04:01 But again, like, the saying, as the saying goes, like, they stay because of the community.

04:05 And everyone's, like, just super nice and helpful and super beginner-friendly.

04:09 Absolutely, absolutely.

04:10 All right, well, I haven't got a chance to ask you the opening questions.

04:13 Let's start there.

04:14 So, before we get into all the techniques and tips and stuff you have for data scientists

04:21 to bring in more structured programming stuff to make their data science techniques and tools better,

04:27 let's talk about you.

04:29 How did you get into programming in Python?

04:31 So, I was pretty much always surrounded by computers as a kid.

04:34 I always had, like, the hand-me-down computer when I was a kid from, like, my parents when they were working.

04:41 I guess it sort of does help that my dad is a software engineer.

04:45 But it wasn't really, like, a thing when I was at home.

04:47 Other than, like, hey, dad does things on computers.

04:49 That's kind of cool.

04:51 I was always tinkering around with computers.

04:53 So, like, I do remember, like, the first thing I would do every time, like, I open a new app is, like,

04:57 hey, let's go to edit and preferences and just see, like, what I can change.

05:00 And it was sort of just tinkering.

05:01 I grew up in New York City.

05:03 I'm from Queens.

05:04 And I went to one of the specialized math and science high schools in New York City.

05:10 And so, for us, sophomore year, it was actually mandated that every student take one semester of computer science and one semester of technical drawing or drafting.

05:20 That's pretty cool.

05:21 I think drafting is less valuable than people imagined.

05:24 Because I remember I had a drafting class as well and I don't really use it.

05:28 But the software thinking and tools and ideas certainly are.

05:32 What language was that in?

05:34 Pen and paper.

05:35 And then CAD towards the end.

05:36 Yeah.

05:37 So, yeah, we were, we were, like, in, like, a room and we were drawing, like, isometrics by pen or pencil and ruler.

05:44 Those big slanted tables.

05:46 And the programming one, what was that?

05:48 What technologies did you all cover?

05:49 It was, like, it was only towards the end.

05:51 And it was, like, in some CAD program that I don't remember.

05:54 Oh, okay.

05:55 Yeah.

05:55 And it's, like, interesting.

05:56 Like, that was a thing I never, I thought it was super cool.

05:59 And then, like, now that 3D printing is a thing, it's sort of, like, oh, wait, I used to kind of, I've done this once.

06:05 But, like, I just haven't done it in many years.

06:07 So, it's, like, kind of interesting.

06:09 Yeah.

06:09 How interesting.

06:10 Yeah.

06:11 Yeah.

06:11 How cool.

06:11 Well, that's a great introduction.

06:13 And then did you study computer science in college?

06:16 So, I didn't.

06:17 And that was part of it was I didn't notice until I was in college.

06:21 But when we had to take computer science in high school, it was sort of, man, all of these other people.

06:27 And by, like, other people, it's, like, just a handful.

06:29 It was, like, man, they're really good at this.

06:31 There's no way I'm going to be able to study this in the future or, like, for a career and whatnot.

06:37 We did the one semester of computer science.

06:39 I didn't go for the AP or anything because, like, originally I was going to, like, go down and be a medical doctor.

06:46 That was my original plan.

06:47 So, then in college, I ended up, how do I make my, like, medical education as strong as possible?

06:53 Let's do, like, neuroscience and sort of, like, a bio-heavy program.

06:57 But, like, that's where I sort of took my first set of statistics courses.

07:02 And I was, oh, yeah, like, we hear about mean and standard deviation.

07:05 But to finally understand it in the context of, oh, yeah, here's the exam scores for the previous exam.

07:13 Like, how do you actually rank?

07:15 Just something like that, like, get some meaningful understanding of, like, where do you rank in the class?

07:20 And it's, like, oh, maybe this is how the curve is going to be.

07:22 Or, like, I didn't do very well in the exam, but, like, I'm actually kind of okay.

07:26 So, that was cool.

07:28 And then because I ended up switching into neuroscience my second year, I had to stay, like, a fifth year in college.

07:36 And so, like, my last two years was, like, oh, you only need four classes to do, like, a computer science minor.

07:40 So, I was, like, yeah, I've done this in high school.

07:42 Let's, like, just pick this up for fun.

07:44 And then so, it was that first intro computer science class when we got to, like, the actual Python programming portion where I was, like, wait, this is actually not as terrible.

07:55 And I would see the other students, like, in which case, like, they would be freshmen and I would have been, like, already a junior.

08:00 And I would see the freshmen, like, they've never seen this before.

08:03 Their struggles were essentially my struggles back in high school.

08:06 And then I realized, oh, it's literally because, like, I saw it before.

08:10 And, like, even though, like, not much of it got retained, it was just thinking about things procedurally, just doing it once.

08:17 Now I can actually think about, like, syntax errors versus, like, doing everything at once.

08:22 That's sort of when I was, like, huh, maybe I could have done this as, like, a career choice.

08:25 But, nope, nope.

08:28 Let's keep going down the medicine route.

08:30 So, I ended up doing a master's in public health and epidemiology just to stack on more research skills.

08:37 The thought being was, hey, research in medicine was super cool, but I'm pretty sure if I ever start medical school, I'm never going to learn this stuff again.

08:44 So, let's just learn everything and then go to medical school.

08:47 So, I did my master's in epidemiology, and that's when I took my first, like, intro to data science course.

08:53 And that is probably the most life-changing moment in my life.

08:58 When I was doing my master's, I was already just learning about all of these other basic statistical techniques.

09:03 I've never heard of logistic regression before.

09:05 And that's, like, the type of analysis you do when you have a binary outcome.

09:09 So, for us, it was like, did this person die?

09:10 Yes or no.

09:11 Or did this person get cancer?

09:13 Yes or no.

09:14 And I've never seen that before.

09:15 And it was just like, wow, this is amazing.

09:17 And then I take my data science class and I was like, what is this random forest thing?

09:21 This is amazing.

09:22 Or, like, what is this, like, ridge and lasso regression?

09:25 And, like, I can just, like, condense, like, thousands of variables into, like, something meaningful.

09:30 Like, that's super cool.

09:32 And so, that sort of started this whole trajectory down to where I am now.

09:36 Because it wasn't until, like, that data science course I, during that semester, because there was so much learning to do, the instructors set up a Software Carpentry workshop.

09:46 And so, I was an attendee for Software Carpentry.

09:49 I think Software Carpentry is a really cool project for folks with the background exactly like you described.

09:54 I actually had Jonah Duckles on the show way back in episode 93 talking about Software Carpentry.

10:01 So, it's been a really long time since I've spoken about it.

10:03 Maybe just tell the listeners out there what a Software Carpentry workshop is about.

10:08 Because it'd be good for a lot of folks who are in the data science and sort of science in the programming space.

10:13 So, yeah, it's sort of expanded over the past, like, couple years.

10:18 But Software Carpentry and their sister program, Data Carpentry, they're housed under this one umbrella called the Carpentries.

10:25 And, essentially, they're this nonprofit organization.

10:28 And their goal is simply to teach researchers or scientists the skills that they need for, in the sense of Software Carpentry, like, programming skills.

10:37 And then, in the case of Data Carpentry, like, working with data.

10:40 So, like, data skills.

10:42 And the two really just go hand in hand.

10:43 So, you'll mix and match.

10:45 They have a lot of overlap.

10:46 And, essentially, there's these two-day workshops where they cover Bash for the shell.

10:52 And the whole premise of that is to, like, show you about, like, what is a working directory?

10:56 And programs do one thing and one thing really well.

11:00 And you can pipe them into one another to chain things together.

11:03 So, that's, like, what you're supposed to take away from Bash.

11:06 And then they go through Git for version control, which it's really hard to get an understanding of Git in three hours.

11:12 But it's just to show you that, like, there are better ways than naming your files final, final, final, et cetera, et cetera.

11:19 Putting the date on the end.

11:21 No, like, really, final.

11:23 Yeah, and putting the date.

11:24 And then there's a section on Python or some of, like, R or any of the other programming languages.

11:29 And it used to be that they also had a fourth section on SQL.

11:34 But then, usually, SQL gets bumped out for, like, a longer Python or R session.

11:40 So, it's a two-day workshop that covers those skills.

11:43 And it's really to give, like, researchers a primer because we go into science not thinking that we're going to program.

11:51 And so, like, a lot of this stuff is just like, oh, I picked it up on my own.

11:55 And it's just a bunch of stuff cobbled together.

11:58 And that's how we learned it.

11:59 And actually, that's how, like, a lot of people in data science, like, that's how they learn programming.

12:04 And then this is, like, the first time, like.

12:05 I feel like, yeah, I feel like this is actually really common, as you're saying.

12:09 And I think it's also a little bit why Python is growing a lot in the data science space.

12:15 Is it's, like, what can I do that's an easy step to do just enough computation to solve my problem so I can go back to what I actually care about?

12:22 Because I don't want to be a programmer.

12:23 I want to be a biologist or a doctor or whatever.

12:26 But then you slowly find yourself six months later with, like, a lot of scripts.

12:31 And you're running code and you're using pandas or NumPy.

12:35 And you're like, well, I have no qualification for this.

12:38 But here I am, like, in it somehow, even though I swore I would never do this because I hated math or something like that, right?

12:43 Yeah.

12:43 So that's the whole premise of the Carpentries is, like, okay, let's take one step back.

12:46 You learn how to do this on your own.

12:48 And let's, like, refresh, like, the actual basics and, like, kind of, like, steer you in the correct way.

12:53 That's the general lowdown of what the Carpentries are.

12:57 That's cool.

12:57 And you started as a student, but you became an instructor, right?

13:00 I was a student, like, fall of 2013.

13:03 And then it was, like, just at the cusp of, wait, I can actually teach this stuff.

13:08 It wasn't, like, that much of a leap and bound.

13:11 Like, I already knew a little bit about Python programming.

13:13 So, and then the Bash stuff.

13:15 I was, like, one of those people in college that was, like, I'm just going to install Linux and see what happens.

13:20 Deal with, like, problems that come from that.

13:22 I've been saying to myself, like, it's the year of the Linux desktop since, like, 2010 or something.

13:29 It's almost here.

13:30 It's almost here.

13:31 So I ended up signing up to, like, go help out.

13:35 You end up realizing that, like, for a lot of newcomers, a lot of the problems that they have aren't actually that complicated.

13:42 And then just to go into, like, education theory a little bit, it's, they don't have a lot of nodes to make connections with.

13:49 And so a lot of their problems is also, like, just they made a typing mistake, right?

13:53 Like, they're just not used to hitting tab to tab complete things.

13:55 So, like, everything is mainly a typo.

13:57 So I started off helping out a few workshops.

14:01 And then I matriculated into, like, their next, like, instructor class where I was, like, certified to be an instructor where it was mainly, like, getting familiar with the material and, like, learning how to teach the material.

14:13 That's cool.

14:14 Yeah.

14:14 And then I was an instructor.

14:15 And my first couple of years as an instructor, that was, like, right on the border of, I was still in grad school.

14:21 I was, like, finishing up my master's program.

14:23 And also, like, I had a job.

14:26 But I ended up working so much during my job that my boss was pretty much, like, please go home.

14:31 And so I spent a lot of time going home.

14:34 But it was really just, like, go teach, like, other workshops.

14:36 And it was, like, super nice being in the New York City area because, like, going to a university or any place was pretty much local for me.

14:43 So I got a lot of teaching experience out of that.

14:45 And I didn't know at the time, but I say it now, like, teaching is, like, one of the best ways to learn something.

14:50 So Bash and Git and Python and later on R, like, I just got more familiar with it just because I was teaching it all the time.

14:59 And then, you know, once you have some foundation, like, learning the next small bit of information is, it becomes easier and easier.

15:06 And then it just snowballs into something.

15:09 That's cool.

15:10 Yeah.

15:10 And then, like, all of that teaching knowledge ended up being, like, the foundation for, like, the book that I ended up writing or was tasked to write, called Pandas for Everyone.

15:20 I mean, it's really, like, an honor that, like, I got recommended to write this thing.

15:24 So I should frame it in that sense.

15:26 I've done a lot of training as well.

15:28 And I feel like once you kind of go through a couple of cycles of that, you just get so good at learning something with enough depth to present it that it becomes, like, this really great power.

15:39 And it's kind of addicting, right?

15:41 You're like, oh, what's the next thing I can learn?

15:42 What's the next research project I can go on?

15:44 And, yeah.

15:45 So it sounds like you did the Software Carpentry thing and it kind of somehow sucked you down this Pandas for Everyone hole of writing this book, which is Addison-Wesley, which is pretty cool.

15:55 Even writing the book, like, now you're just like, oh, I just can't write, like, really janky code anymore.

15:59 Like, this actually needs to be, like, quote, unquote, like, the better way of doing things.

16:03 So, like, there was, like, still, even though I was, like, writing a book and I was supposed to be the expert in this, like, a lot of it was also, like, I should probably read this part of the documentation just to make sure.

16:13 Because, like, I also learned this, like, on my own.

16:16 Right. Well, that's the thing about the difference of practicing as a programmer or as a data scientist versus an author or an instructor, right?

16:25 Like, as a practicing person, you have a problem.

16:28 You're like, I need to figure out how to make pandas do this.

16:30 Like, it doesn't matter how it happens.

16:32 But if you can make it happen, you're done.

16:33 Like, that's the end of the research.

16:35 You're done.

16:35 This part is solved.

16:37 What's the next problem?

16:38 But as an instructor, like, well, but there's these other two ways.

16:40 And but if somebody says, well, why not this way versus that way?

16:44 What's the difference?

16:44 All of a sudden, like, all these cases that you would never go down, like, you have to start going down those now, which I think is awesome, actually.

16:51 But it's definitely a different way of thinking.

16:53 It's super cool because, like, now it becomes its own, like, learning path.

16:56 Like, you see other people have problems and you see how they think about it.

16:59 And, like, it sort of adapts how you present material.

17:02 For me, when I was originally, when I first started off teaching workshops out of the book, I pretty much went in the order that I presented the chapters in.

17:12 And then more and more recently, like, I realized, like, wait, like, tidy data principles is actually, like, one of the most important things in, like, data science and data cleaning.

17:20 After we load our first data set, I pretty much just jumped to, like, that chapter.

17:24 Because if you can really understand that, everything else becomes way easier, quote unquote easier.

17:29 Yeah, sure.

17:30 Well, if you're trying to do operations on bad data and it keeps crashing, like, that's no fun.

17:35 Like, why does it say none?

17:37 Is that invalid?

17:38 You know, it doesn't have this attribute.

17:39 I don't understand.

17:40 Like, well, let's talk about that.

17:43 This portion of Talk Python To Me is brought to you by Indeed Prime.

17:47 Are you putting your Python skills to good use?

17:50 Find your dream role with Indeed Prime and start doing more of what you love every day.

17:55 Whether you're a developer, data scientist, or anything in between, one application puts you in front of hundreds of companies like PayPal and VRBO in over 90 cities.

18:04 Indeed Prime showcases your experience and tech skills to match you with great fit roles that meet and exceed your salary, location, and career goals.

18:13 And when you start a one-on-one conversation with one of their career coaches, you'll get resume reviews and personalized advice to help you get what you deserve.

18:21 So, if filling out countless job applications isn't your thing, let top tech companies apply to you.

18:28 Join Indeed Prime for free at talkpython.fm/indeed.

18:32 That's talkpython.fm/indeed.

18:35 The reason I wanted to talk a little bit about Software Carpentry, other than just, like, you have been doing it and it's cool,

18:42 is I think it's a really good segue into this larger topic of how do you take the average data scientist and the work that they're doing

18:51 and help bring in these more computer science, maybe not even computer science, let's say software engineering principles,

18:58 to help them basically be more effective, right?

19:01 So, maybe we start at the beginning.

19:03 We've got some idea.

19:05 We probably found out we can open up a Jupyter notebook, load something into Pandas,

19:10 and poke around with it with a Matplotlib or something, right?

19:15 Maybe that's it, right?

19:16 Maybe we've seen a lot of MATLAB code as well, where it's like, well, I've got this, it does this thing,

19:22 but it's like there's no functions.

19:23 Maybe there's loops, maybe not, right?

19:26 It's just like all crammed in there.

19:28 And those are PhDs writing that.

19:30 So, like really brilliant people, but they just don't have the software engineering skills.

19:34 So, where do we start with that?

19:37 There's a few papers I would direct people to sort of get a sense of where I'm coming from.

19:43 So, there's this one paper by William Noble called A Quick Guide to Organizing Computational Biology Projects.

19:48 And that's sort of the premise of how, I guess, I would present, how do we introduce software skills?

19:56 And in that paper, he literally talks about you should have a folder structure.

19:59 And maybe this is one way you should set up your folders for your analysis projects.

20:03 And I'll talk a little bit about that in a bit.

20:05 But yeah.

20:06 Yeah.

20:07 So, it's called A Quick Guide to Organizing Computational Biology Projects.

20:10 And, you know, it's probably focused on biologists, but I'm sure that it's like pretty generally applicable.

20:14 Yeah.

20:15 Yeah.

20:15 Other than like maybe the sequence.py file, like replace that name with whatever you need.

20:21 Right.

20:21 It's Hubble.py or whatever.

20:24 Yeah.

20:25 And the other two papers, the first author is Greg Wilson, who restarted Software Carpentry, like back in the 2000s.

20:33 And he wrote two papers, one in 2014 called like Best Practices for Scientific Computing.

20:38 And then in 2017, the paper is called Good Enough Practices in Scientific Computing.

20:43 If you just look at the papers, it almost seems like, hey, we're presenting like the ideal case.

20:48 And then we almost realize like that's impossible in the real world.

20:51 But they're both like pretty good papers.

20:53 And they talk about like different things.

20:55 Right.

20:55 What would we have if we had like the perfect adaptation of software engineering to this world?

21:00 Like, okay, well, what can we reasonably ask people to do that will make their life better?

21:05 It sounds like.

21:05 Yeah.

21:06 And the way I approach it is like, just like when I teach data science skills, I pretty much make a beeline to tidy data and tidy data principles.

21:15 In this case, it's almost like a beeline towards project organization.

21:20 Just having some kind of structure to your analysis project.

21:25 That will snowball into all of the cool tools that you probably heard of and don't know how people end up there.

21:33 But if you take slow steps, I found that project organization is the fundamental thing where it's sort of like the gateway to everything else.

21:42 Right.

21:43 Because a lot of what you need, it sounds like, is code organization.

21:47 Right.

21:48 It's like the architecture and functions, classes, different modules, the concept of I'm going to pass data to this thing and make it reusable.

21:56 All of that stuff really seems to be like natural follow-ons of like, well, how do we organize this project by function or by purpose?

22:05 And like, just really think through that, right?

22:07 Yeah.

22:07 And it doesn't even have to be as complicated as, oh, we're doing like proper software engineering and like we need to create a Python package.

22:15 Like that can all be deferred to much later.

22:18 Because usually what ends up happening, you mentioned like, hey, I'm a scientist.

22:21 I found out about Jupyter Notebooks.

22:22 It's a really cool tool.

22:24 People are taking pictures of black holes, like, using them.

22:27 So, yeah, you have all these tools.

22:29 And like the scenario is like, hey, it's great that you're using a programming language to work with data.

22:35 Excel is a great GUI for data, but it has its limitations.

22:39 Cool.

22:40 You are now using a programming language.

22:42 Where can we go from there?

22:44 And like when you are in that beginning state, just to make everything work, like you dump everything in one folder.

22:51 You have like your Jupyter Notebooks.

22:52 You have all your scripts, all your data.

22:55 Your data files.

22:56 Yeah.

22:56 If I say load this, I just want to say the file name.

23:00 I don't want to have to think about like where that's relative to the other on some like server or something like that, right?

23:05 Yeah.

23:05 And then like as an academic, you might have like a Word doc in there or maybe a LaTeX file.

23:09 And then you compile that thing and it very quickly becomes this folder with hundreds of files and you can't find anything.

23:17 And that's when you end up, you know, maybe the word final comes into like the beginning of the file name just so like you can find things, right?

23:24 Yeah.

23:25 I was going to say it already sounds bad.

23:26 And then if you start trying to do version control by like having multiple files named the same thing, then you're really pushing your luck.

23:32 Yeah.

23:32 So the most important thing, I think like if you're at that point, where can you go next, right?

23:38 It's always like trying to do things incrementally.

23:40 Like how do you make your life like 10% better each time?

23:43 And then it's like a nice way, especially if you're like brand new grad student or you're in science, but like you've never really learned programming.

23:52 Like where can you go from there?

23:53 It's useful to have some kind of guide or path that you can follow or think about to like make yourself better and do these things more efficiently.

24:01 Yeah.

24:01 So let's talk about some of the programming things that you can think about.

24:05 One of the ones that you have is like try to make your code easy to read.

24:10 Oh, yes.

24:11 So one of the things I talk about in programming is like make things easier to read.

24:15 Do things in steps.

24:17 Don't try to like write one for loop that has a whole bunch of like side effects going on, right?

24:23 Like things should just be incremental just to take like a cue from like education.

24:27 Like we as human beings can only carry, I think the number is like four plus or minus three objects in our mind at the same time, like roughly seven.

24:34 You should pretty much follow that too.

24:37 When you're programming, you shouldn't have to have, I mean yourself or like potentially another reader try to carry like 10 different things going on at the same time.

24:47 It's just not helpful for.

24:48 Like maybe an example is I'm trying to go through a loop.

24:51 I'm really trying to do three things.

24:52 Like as I get the data and I compute something with the first step and then I do some other filtering and I do another thing.

24:59 I could try to cram that into like one giant loop or maybe it should be three separate little loops.

25:04 One that like cleans the data, one that like does that computation, another that then filters it, right?

25:10 Three loops sounds like a better step than one giant loop trying to do it all.

25:13 Yeah.

25:13 Or, more in like an education framework, you can like group things together.

25:18 And in programming, the way we group things together is like write functions.

25:21 So then you end up with one giant loop and it's really just making three function calls, but that's easier to keep track of than like, let's say we didn't write the function.

25:30 Now we have like three different things like scattered in our code and you end up with a loop that's 150 lines long.

25:38 And that's like scary because like I see a loop and like before I even look at this thing, like I'm already like, oh man, we are in for a ride.
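
As a rough illustration of the idea just described (one loop that only makes three function calls), here is a minimal sketch. The record format, the function names (`clean`, `to_fahrenheit`, `is_warm`), and the threshold are made up for the example and do not come from the episode.

```python
def clean(record):
    """Strip whitespace and convert the reading to a float (None if missing)."""
    name, raw = record
    raw = raw.strip()
    return name.strip(), (float(raw) if raw else None)


def to_fahrenheit(record):
    """Convert a (name, celsius) pair to (name, fahrenheit)."""
    name, celsius = record
    return name, celsius * 9 / 5 + 32


def is_warm(record, threshold=70.0):
    """Keep only readings at or above the (hypothetical) threshold."""
    return record[1] >= threshold


# Hypothetical raw input: (station name, temperature in Celsius as text).
raw_records = [(" a ", " 20.0"), ("b", ""), ("c ", "25.0 ")]

warm = []
for record in raw_records:
    cleaned = clean(record)            # step 1: clean
    if cleaned[1] is None:
        continue                       # skip missing readings
    converted = to_fahrenheit(cleaned) # step 2: compute
    if is_warm(converted):             # step 3: filter
        warm.append(converted)

print(warm)
```

The loop body stays a few lines long, and each step can be read, changed, or tested on its own.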

25:47 Right.

25:47 So let me just give my perspective from like the software development, web dev, you know, more application side of things.

25:54 It's like, if I see a function that's more than 10 lines long, it starts to make me nervous.

26:00 I'm like, there is something going on here that's probably bad unless there's like a lot of error handling and like response, like even 10 is a lot in the typical scientific computing bits that I've at least seen a while ago.

26:14 There was more than 10 lines.

26:15 There's more than 10 lines.

26:16 And so like Jenny Bryan, like from the R world, has like this talk about like code smells.

26:23 And it's like, that's like one of those like code smells of like, Hey, why does it look like this?

26:27 Or like, at least when you're working with data or in the PyData stack, usually you shouldn't have to write for loops in the sense of like, if you're trying to operate on a data frame, there should be an apply call to a function.

26:39 Even like sometimes when I see loops, it's like, yes, I will write them just because something broke.

26:44 And I'm just trying to figure out like where my data frame, like I have a bad value, but like the final result ends up being like an apply call or something.
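
A small sketch of that workflow, assuming pandas is installed: a throwaway loop for hunting down a bad value, and an `apply` call as the final form. The column names and the `bmi` function are invented for the example.

```python
import pandas as pd


def bmi(row):
    """Body mass index from one row with weight_kg and height_m columns."""
    return row["weight_kg"] / row["height_m"] ** 2


# Hypothetical data set with made-up column names.
df = pd.DataFrame({"weight_kg": [80.0, 50.0], "height_m": [2.0, 1.0]})

# Debugging style: go row by row to see which row breaks the function.
for index, row in df.iterrows():
    print(index, bmi(row))

# Final style: let pandas do the iteration over rows.
df["bmi"] = df.apply(bmi, axis=1)
```

The same function serves both purposes, so nothing has to be rewritten once the bad value is found.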

26:51 Yeah.

26:52 It's interesting because a lot of the libraries, NumPy, Pandas and whatnot, they can do the looping.

26:57 They do it much faster and more efficiently than you will in Python.

27:00 One of the cool things that I like teach during the data science part is like when we go over like applying functions, if you're doing numerical computations, like just the NumPy decorator for like vectorize or the Numba decorator for vectorize, like just wrap the decorator around your computation function.

27:19 It like pretty much for free gives you order of magnitude speed improvements.

27:24 And so it's like, it's way better than just you trying to like optimize this thing yourself.
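
For reference, this is roughly what wrapping a scalar function with the NumPy decorator looks like, assuming NumPy is installed. The function `clip_negative` is a made-up example. One caveat worth knowing: NumPy's own documentation describes `np.vectorize` as a convenience that is essentially a loop under the hood, so large speedups are more typical of the Numba version, which compiles the function.

```python
import numpy as np


@np.vectorize
def clip_negative(x):
    """Scalar logic with a branch: replace negative values with 0.0."""
    if x < 0:
        return 0.0
    return float(x)


values = np.array([-1.5, 0.0, 2.5])

# The decorated function now accepts a whole array instead of one number.
result = clip_negative(values)
print(result)
```

Both branches return a float on purpose: `np.vectorize` infers the output type from the first element it processes unless `otypes` is given.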

27:28 Right.

27:29 And that's like one of the other programming things.

27:30 It's premature optimization is like the bane of all evil or whatever.

27:34 Just write the thing you want, especially if they're like loops.

27:37 Python has many mechanisms to like help you with that and make it faster pretty much for free.

27:42 That's definitely cool.

27:43 I love this idea of code smells.

27:45 I'm fascinated by it.

27:46 I want to come back to it.

27:46 But another thing I want to throw in that I kind of, I feel like is in this realm is like the idea of reusability.

27:53 You can write code so that it's easily reusable or that it's not so much.

27:57 So like I could write a function, but maybe I have a bunch of global variables that I'm still using and it makes the function like it moves the code away.

28:04 So I understand that it's like it's more compact and more readable, but it doesn't necessarily make it reusable.

28:09 So thinking about like how do I parameterize these things and make them something that I can use in other situations or once you solve this problem in this way, like I never have to think about this again.

28:21 I just now use this in the other part and that was rough, but that was Friday and I don't have to think about it ever again.

28:26 Like that's a pretty good principle, I think, here as well.
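
To make the contrast concrete, here is a minimal sketch of the same logic written twice: once leaning on a global variable, once parameterized. The names `THRESHOLD`, `count_large_global`, and `count_large` are hypothetical.

```python
THRESHOLD = 10  # module-level global


def count_large_global(values):
    """Compact, but silently tied to whatever THRESHOLD happens to be."""
    return sum(1 for v in values if v > THRESHOLD)


def count_large(values, threshold):
    """Everything the function needs arrives through its parameters."""
    return sum(1 for v in values if v > threshold)


readings = [3, 12, 25, 8]

print(count_large_global(readings))  # only ever answers the "above 10" question
print(count_large(readings, 10))     # same answer, explicit input
print(count_large(readings, 20))     # reusable for a different question
```

The second version can be moved to another script or project without dragging the global along with it.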

28:28 Yeah.

28:29 And even when you're writing your functions, you can write your function for your use case now.

28:33 And, you know, for example, it's like a function that is a regular expression parser for like a US telephone number, which is, if you try to write one of those, it's like way more complicated than it ever needs to be.

28:43 But it's like final exam in like regular expression 101 or something.

28:48 Like it's really like way worse than it should be.

28:50 Yeah.

28:51 You'll write your function with that in.

28:52 And like one of the things I end up doing is even if I have hard coded things within the function and then I realize later on like, oh, wait, I pretty much need to run that function again.

29:02 But instead of like the second index, I need like the fourth index or whatever.

29:06 You can make backwards compatible functions or code by like saying like, oh, I'm just going to create a default parameter in my function.

29:15 It's going to default to the one that already works.

29:16 But then now I can just reuse that function later on and like just change that value.

29:20 Simple things like that, that you don't have to rewrite the function just for your second use case.
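
A sketch of the default-parameter trick, using a made-up function `pick_field` that originally hard-coded the second field of a line. Adding `index` with a default equal to the old hard-coded value keeps every existing call working.

```python
def pick_field(line, index=1, sep=","):
    """Return one field from a delimited line.

    The first version hard-coded index 1 (the second field). Turning it
    into a default parameter keeps old calls working unchanged.
    """
    return line.split(sep)[index].strip()


line = "id42, Ada, London, 1815, math"

print(pick_field(line))           # old call, still returns the second field
print(pick_field(line, index=3))  # new use case: the fourth field
```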

29:26 I like, I talk about like, if you ever hit control C on your computer, you better be paying attention when and how many times you're hitting control V, right?

29:35 And if it's like more than three times, you're probably doing something wrong.

29:39 Yeah, for sure.

29:40 For sure.

29:41 One of the things I think would be nice, you talked about premature optimization and all these performance stuff.

29:46 What is your recommendation around like how you structure your code?

29:52 So a lot of times I imagine that the data science stuff has pretty much like there's a Jupyter notebook and most of the code, like the supporting functions are kind of the beginning.

30:03 And then later on, like they're kind of using them and so on.

30:06 When do you tell folks to break out like separate Python modules that you could load into your notebooks?

30:12 And like, what's the, how do you think about like different module files versus notebooks and things like you can apply refactoring tools really easily to a bunch of files using PyCharm or things like VS Code, not so easily in Jupyter, right?

30:27 So where's the balance there?

30:29 Yeah, so the thing with Jupyter notebooks is, yes, there was like a talk at JupyterCon about why Jupyter notebooks are bad.

30:36 And I have this love hate relationship with Jupyter notebooks.

30:39 But one of the things I can say, so Rachael Tatman from Kaggle, she gave like an R-Ladies meetup talk in 2018 about like putting together a data science portfolio.

30:51 And one of the things in there is like the Jupyter notebook is great, but most of the time you probably are just interested in like the figures or tables that's being generated, especially if you're taking this into a meeting, right?

31:04 Like no one wants to like scroll forever to get to the bottom of the notebook because the first like three quarters is cleaning code.

31:12 Right.

31:12 I sort of like got into this sort of workflow of like I'll use the Jupyter notebook to like test things out in like data cleaning pipeline.

31:21 But the actual data cleaning stuff all go into like Python scripts.

31:25 So like what ends up at the end of the day, what happens in the Jupyter notebook is like pretty much load the libraries I want, load the data I want.

31:33 Maybe there's like a few functions that's specific to like the figures I need and then just the figures and tables I need.

31:39 So my Jupyter notebooks are pretty small, and down the line, in terms of like other software engineering practices, that just makes the diffs in Git way more manageable if I start making changes.

31:51 So if you end up with massive Jupyter notebooks where a lot of it is just data cleaning code, you would think about like moving that out to other notebooks or other files just so you have more files.

32:04 I'm in the camp of pretty much in a lot of academic or scientific use cases, maybe not in like physics when they're working with like sensor data, but file IO is not that big of a bottleneck.

32:16 So like I will have more scripts and more files that just write out data just to have another script and file read it back in.

32:24 But that just breaks up my thought processes into smaller manageable pieces.

32:28 That's interesting.

32:29 It's like a little bit of a cache as well, right?

32:31 Like you can take the step N and go to N plus one and like iterate on how that happens without rerunning all the stuff, right?

32:39 Because you just reload that file that you save.

32:40 Yeah, yeah, exactly.

32:41 So like this goes down into like project template world where like I'll have a data folder, and in our data folder, you know, we'll have like an original data folder.

32:49 That is the data that we download stuff in, never make changes to your raw data.

32:54 And then everything else gets modified with a script.

32:57 I'll have like, for example, a script that reads in one of my original data sets.

33:01 I'll do my first set of processing.

33:03 Like maybe it's like, oh, fixing missing values.

33:05 And then I'll immediately write it out to somewhere like under data/processed because it's now a processed data set.

33:13 And I want to distinguish between data sets

33:14 that I can, or should, just pretty much lock as read-only versus things that I could like potentially modify and delete later on.

33:21 I'll have a whole series of these scripts that pretty much just like you'll see it.

33:25 I rarely these days have scripts that are like more than a hundred lines long because it's pretty much read in, do this one task, write it out.

33:33 And especially if you have like one step that just takes a really long time.

33:38 Yeah, it serves as pretty much as a cache where you just save out your temporary results and then you can deal with it later without like accidentally rerunning the part of your code that you didn't mean to run because now you're stuck for an hour.
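
Here is a minimal sketch of one such "read in, do one task, write it out" step, using only the standard library. The folder names (`data/original`, `data/processed`), the file name, and the `fill_missing` function are assumptions for illustration; the demo runs in a temporary directory so it does not touch a real project.

```python
import csv
import tempfile
from pathlib import Path


def fill_missing(in_path, out_path, column, default):
    """Read a CSV, replace empty values in one column, write a new CSV."""
    with open(in_path, newline="") as src:
        rows = list(csv.DictReader(src))
    for row in rows:
        if row[column] == "":
            row[column] = default
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return out_path


# Throwaway project layout for the demo (hypothetical names).
project = Path(tempfile.mkdtemp())
original = project / "data" / "original" / "survey.csv"
original.parent.mkdir(parents=True)
original.write_text("id,age\n1,34\n2,\n3,29\n")

# The raw file is only read; the result goes to a separate folder.
processed = fill_missing(
    original,
    project / "data" / "processed" / "survey_filled.csv",
    column="age",
    default="NA",
)
print(processed.read_text())
```

The next script in the chain would start by reading `survey_filled.csv`, so a slow earlier step never has to be rerun by accident.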

33:50 And that's sort of like what happens with Jupyter Notebooks as well.

33:53 When we first started programming, like when I first started programming, it was just like, I just need this stuff to run.

33:58 So I'll run cell one and then jump to cell 10 and I'll run cell one again and then jump to cell 15.

34:04 And then I can like scroll all the way down and get my plot.

34:07 Right.

34:07 And then it's like, how am I supposed to run?

34:09 How do you document something like that?

34:10 Right.

34:11 And that's sort of one of the drawbacks with the Jupyter Notebook is, yeah, the execution order isn't guaranteed in what was written.

34:18 It's a little bit like a goto.

34:20 Yeah.

34:20 It's pretty much like a goto.

34:21 Yeah.

34:22 Which is kind of bad.

34:22 Except it's not even like documented, right?

34:24 Yeah.

34:25 At least it doesn't even say goto 20.

34:26 It's just like they went to 20.

34:28 Yeah.

34:28 And then when you execute it, it turns to 21, right?

34:31 So like it doesn't even, like you don't even know what 20 is, right?

34:34 So if you end up in a situation where you're running bits and pieces of a code all over the place, that's a sign of like, wait, let's fix this now.

34:42 It's pretty cheap to create a new file.

34:45 And let's do all the data cleaning or the parts I need for this figure.

34:48 Maybe that can just be in one thing.

34:51 And then more and more as you find pieces that need to be reused, you'll, oh, maybe I can turn this into a function.

34:56 Then you'll put that as a module.

34:58 And I would say like even if it's a module, just leave it in.

35:01 If the folder structure is pretty much you have a data folder and an analysis folder and an output folder where output is like your figures and stuff.

35:08 At first, it's okay.

35:10 You can have your modules in your analysis folder.

35:12 And so you can still say import something and it'll still import properly in that sense.

35:17 You don't have to like just go and make a Python package right away because at least in what I've seen is sometimes your analysis, it's not really going to be reused across projects.

35:28 You don't need the overhead of writing a Python package.

35:31 It's when you, for example, if you're querying, if you're doing some study on like code in GitHub, for example, and you write your own GitHub querying API call stuff, and then you realize this is part of one giant grant with many different analyses that need to happen.

35:46 Maybe like your GitHub querying code will turn into a package because you're actually reusing it.

35:51 You don't have to turn everything into a Python project.

35:54 You don't have to do that to do it like quote unquote correctly.

35:57 Yeah, and I feel like the value from going from like just some huge notebook or some huge script file and then moving that into modules that have functions you can import and run and whatever, that's like 85% of the way, right?

36:09 Whether or not you can pip install the thing, it doesn't matter.

36:11 You know, there's a lot of overhead to make something super reusable to make it documented.

36:16 Like maybe if you're in academics, maybe that's a cool project for like a senior undergraduate person.

36:23 Like, hey, you know what?

36:23 You know Python.

36:24 Why don't we take this and turn this into an open source project?

36:27 And that can be your project, right?

36:28 Like, I'm not sure it's a great research time and energy in general.

36:33 Yeah.

36:33 Well, so more and more, there's very recently pyOpenSci, an organization that sprung up, and it's trying to mimic rOpenSci.

36:42 And it's essentially like supposed to be a repository of Python packages made towards making science better for some scientific use case.

36:53 And all of those are going to be reviewed by somebody, and if you want to write a paper based off of that software package, it'll fast track you into JOSS, which is the Journal of Open Source Software.

37:08 Yeah.

37:08 And I had them on the show as well.

37:10 Yeah.

37:10 So now you have at least like the incentives are more or less lined up, right?

37:15 Because before, like if you were just maintaining a software package, you know, what are your academic incentives?

37:20 Because a lot of that is still around publishing and grants.

37:24 So at least now there's the incentives are now lined up where like even though you are writing a software package, you can now write a paper about it.

37:31 Yeah.

37:32 It may generate a paper which might help you with your tenure and so on.

37:35 But I guess let me take a step back really quick on my statement.

37:38 Like it might not help you in your academic career directly to spend the software engineering time, but it may help you significantly in your research.

37:46 If you can publish something and then you get other researchers to start using it, right?

37:51 It becomes a package that you have more contributors to, right?

37:54 Maybe you have one student you could fund part time.

37:57 Now all of a sudden there's 20 institutions like all working.

37:59 Like that could be a huge benefit, but I think a lot of stuff is so specialized, so tied to your data and your particular problem.

38:07 Like you say, your first thought shouldn't be how do I open source this as a package?

38:10 It's like, how do I just like make this a decent software project?

38:13 Yeah.

38:13 And that's a pretty lofty first goal too.

38:16 Like how do I make this work properly for myself?

38:18 Right.

38:19 Because then you go down the route of like, okay, I should write tests for this just to make sure it's like at least behaving correctly.

38:25 There's a bunch of incentives as well for just having an open source project and trying to get other people to play with it because you'll build out the functionality for the thing you built.

38:34 And as functionality expands, you'll sort of get more and more people in.

38:38 And it sort of ties back to like the Python community is great.

38:44 And so like now you are embracing the broader Python community and now you have more and more resources or people you've met to help you with your own project.

38:53 If you're at like PyCon or SciPy, you can have your own sprint for your software project just to have other people try this out.

39:01 You end up building your own community off of your little software project, which is, it makes you feel good.

39:08 And it's still also advancing science.

39:10 And a lot of science is also communication.

39:12 And you built this stuff to help other people.

39:15 So like you might as well try to make it easier for other people to help you as well.

39:19 Yeah.

39:20 It could definitely help your career as well.

39:22 I mean, people like Wes McKinney, Jake VanderPlas, Travis Oliphant, like folks like that, like they're legitimate big names in the whole Python space in general.

39:31 And a lot of that came from, you know, these academic projects and whatnot.

39:34 So that's pretty cool.

39:35 This portion of Talk Python To Me is brought to you by Rollbar.

39:40 Got a question for you.

39:41 Have you been outsourcing your bug discovery to your users?

39:44 Have you been making them send you bug reports?

39:46 You know, there's two problems with that.

39:48 You can't discover all the bugs this way.

39:50 And some users don't bother reporting bugs at all.

39:53 They just leave, sometimes forever.

39:55 The best software teams practice proactive error monitoring.

39:58 They detect all the errors in their production apps and services in real time and debug important errors in minutes or hours, sometimes before users even notice.

40:07 Teams from companies like Twilio, Instacart, and CircleCI use Rollbar to do this.

40:12 With Rollbar, you get a real-time feed of all the errors so you know exactly what's broken in production.

40:18 And Rollbar automatically collects all the relevant data and metadata you need to debug the errors so you don't have to sift through logs.

40:25 If you aren't using Rollbar yet, they have a special offer for you, and it's really awesome.

40:30 Sign up and install Rollbar at talkpython.fm/Rollbar, and Rollbar will send you a $100 gift card to use at the Open Collective, where you can donate to any of the 900-plus projects listed under the Open Source Collective or to the Women Who Code organization.

40:46 Get notified of errors in real time and make a difference in Open Source.

40:50 Visit talkpython.fm/Rollbar today.

40:52 Before we move off, I don't want to drop this idea of code smells because, first of all, I love this concept.

41:01 It's just such a good visualization of what can be wrong with software, but not broken with software.

41:09 Because a lot of times you think of, well, my code now works, but what should I do?

41:14 And I think the code smells is a very practical thing.

41:17 Just for folks listening, like, code smells, the idea is the code is working.

41:23 It's not broken, but when you look at it, you try to read it, like, your nose literally could kind of curl up.

41:29 You're like, ew, there's something wrong with this.

41:31 I guess it works, but I guess it's not good.

41:34 It's really not good, right?

41:35 Like a 300-line function, not good.

41:37 Like, it works, but there's something wrong.

41:40 And I knew this mostly from Martin Fowler's work back in 1999 when he wrote Refactoring.

41:46 And this was sort of the introduction to, like, how do you know when to refactor?

41:50 Well, you look for the places that make your nose turn up.

41:54 You go, ew, what do we do with this, right?

41:56 Like, oh, there's a 300-line function.

41:58 That's bad.

41:59 What can we do about that?

42:01 Or here's a function taking 20 parameters.

42:03 That's really horrible.

42:04 You know, it's really easy to switch this integer for that integer.

42:07 And how do you know when that happens?

42:09 So what could you do to make that better?

42:10 And there's just a bunch of them.

42:11 But I only know this through the sort of software engineering side of things.

42:17 And this presentation that you talked about here, which was Jenny Bryan, right?

42:24 She has some really interesting tips from the data science perspective, right?

42:28 Yeah.

42:29 Yeah, so the first one is do not comment or uncomment sections of your code to alter behavior

42:33 because you want to try different stuff out.

42:35 Yeah, and that's, like, a very common thing, right?

42:37 Like, the easiest case where that happens is if you are in a collaboration environment,

42:43 you have five people.

42:45 You have five comments of data loading because everyone hardcoded, like, a data path, right?

42:52 And so, like, there's literally, like, you comment in your code just to, like, load the data set

42:58 across, like, depending on who you are, right?

43:00 And then, like, you end up, like, if you end up using, like, some kind of version control system,

43:05 like, the vast majority of your commits are just, like, it's my turn that ran it.

43:10 And you just have this one bit of these couple of lines that are just, like, committing.

43:15 Just cycling.

43:16 Just cycling back and forth.

43:17 Yeah, so, I mean, what is the fix, right?

43:19 The fix would be to do something where you have this proper structure, as you already talked about,

43:24 and then you use something like os.path or pathlib, and you compute the relative path over to that,

43:29 and then you generate an absolute path.

43:32 That would work for everybody as long as they all check out the same general structure, which sounds like Git.

43:37 Yeah, and they dealt with this in the R world with these two packages called rprojroot, like, for the root of an R project,

43:45 and here; here as in, like, find this file using here as, like, the root path or something.

43:52 And that's sort of, like, my contribution to all of this.

43:55 I pretty much wrote a package called pyprojroot that tries to mimic, like, the same functionality as well,

44:02 because it works if you are working with scripts and stuff.

44:07 But the second you have, like, some kind of folder structure where you have a Jupyter notebook,

44:12 you'll sort of realize that, like, the Jupyter notebook doesn't care that you have a folder structure.

44:17 Like, the second you're in it, like, the working directory is now wherever the Jupyter notebook is,

44:21 not whatever folder structure you've, like, very carefully pieced together.

44:25 And so this was, like, an attempt.

44:28 It's not a very complicated function.

44:30 It literally takes, like, oh, what is your working directory?

44:32 And I'll recursively go up through its parents, checking for, like, special files like .git or a .here file.

44:40 And then I'll prepend that to whatever path, just so, like, you can now use relative paths in a Jupyter notebook,

44:45 just like you would in a script.

44:47 So you can avoid that problem as well, the commenting in and out.
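
A minimal sketch of the root-finding idea described above, assuming the markers are a `.git` entry or a `.here` file. This is not pyprojroot's actual code; `find_project_root` and `project_path` are hypothetical names used only for illustration.

```python
from pathlib import Path

# Assumption: these marker names follow the ".git" and ".here" files mentioned above.
ROOT_MARKERS = (".git", ".here")


def find_project_root(start=None, markers=ROOT_MARKERS):
    """Walk up from `start` until a directory containing a marker is found."""
    current = Path(start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"No marker {markers} found at or above {current}")


def project_path(*parts, start=None):
    """Build an absolute path from parts given relative to the project root."""
    return find_project_root(start).joinpath(*parts)
```

In a notebook that lives in a subfolder, something like `project_path("data", "raw.csv")` would then resolve to the same file for every collaborator, with no per-person path to comment in and out.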

44:51 Yeah, yeah, that's cool.

44:52 And I'll definitely link to that project that you built.

44:54 Tip two, use if and else in moderation, which seems pretty good.

44:59 Number three is pretty straightforward.

45:00 Use functions.

45:01 I mean, just do.

45:03 It's a good idea.

45:05 You should do this.

45:05 Yes, and, like, even when you're writing a function, it's okay to have a very complex function.

45:12 And even complex functions don't need to be written all in one go, right?

45:16 Like, you can break up your function, even though it does, like, a very complicated task.

45:21 There's probably small subtasks.

45:23 And your function can call other helper functions.

45:26 It's not just like, oh, this is a really complicated thing.

45:30 Let me just write a function for it.

45:31 As you're writing the function for it, like, that's one of the other code smells as well.

45:35 Like, if I have a hundred line function, like, that's kind of scary.

45:38 You couldn't break this down into smaller pieces?

45:40 Like, that's kind of weird.

45:41 And so, like, having helper functions that feed into, like, a larger function is also how you fix that code smell.
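
As a rough illustration of helper functions feeding a larger function, here is a small sketch. The task (averaging the valid values in comma-separated lines) and every function name are made up for the example.

```python
def parse_row(line):
    """Split one comma-separated line into a (name, value) pair."""
    name, value = line.split(",")
    return name.strip(), float(value)


def keep_valid(rows):
    """Drop rows with a negative value (a hypothetical validity rule)."""
    return [(name, value) for name, value in rows if value >= 0]


def mean_value(rows):
    """Average the values of the given rows."""
    return sum(value for _, value in rows) / len(rows)


def summarize(lines):
    """The larger task, written as a short orchestration of the helpers."""
    rows = [parse_row(line) for line in lines]
    valid = keep_valid(rows)
    return mean_value(valid)
```

Each helper is small enough to test on its own, and `summarize` reads as a list of the subtasks.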

45:49 Yeah, absolutely.

45:50 And obviously, that makes testing way easier because you test little bits.

45:54 And then, you know, you test the kind of orchestration of them, and you're good.

45:57 Another one that I'm a huge fan of, it's like a serious pet peeve of mine,

46:01 is to have quick returns near the top or guarding clauses or guard clauses.

46:07 If you've got a function that's, like, indented, and then it's got a loop, and then it's got an if,

46:13 and then another if, and then another if, and it's just, like, way to the right.

46:17 If you're scrolling to the right, you're doing it wrong.

46:20 Yeah, yeah.

46:21 And during PyCon, I actually, like, just bought your entire encyclopedia of training.

46:26 And I forgot which one.

46:28 I think it was the how to write your Python code, like, an experience.

46:31 Pythonic code or something.

46:32 Yeah, yeah, that one, huh?

46:33 Yeah.

46:34 So, like, I remember that chapter.

46:36 Yeah, like, don't write nested if statements.

46:38 Like, essentially, write them inside out so, like, it's flat.

46:42 Yeah, exactly.

46:42 Do them backwards.

46:43 Yeah.

46:44 Yeah.

46:44 So, that was something that was just, like, that's what you should do.

46:48 It makes it so clear.

46:50 And, like, it's not very commonly taught, I don't believe.

46:54 So, these are called guarding clauses.

46:55 And the idea is instead of testing for a good condition and then another good condition and

47:00 another good condition and then doing the thing, which puts everything way on the inside, you

47:04 test for all the bad conditions first and you just bail out.

47:07 And then what you're left with is a non-indented simple bit of code, which is what you're actually

47:11 after.

47:12 So, it's really clear what you're testing against.

47:14 And then once you're past that, here's the simple thing we do.

47:17 I love it.
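
A small sketch of the guard-clause idea, assuming a function that averages a list of numbers. Both versions behave the same; the function names and error messages are illustrative only.

```python
def mean_nested(values):
    """Nested version: the real work ends up three levels deep."""
    if values is not None:
        if len(values) > 0:
            if all(isinstance(v, (int, float)) for v in values):
                return sum(values) / len(values)
            else:
                raise TypeError("values must be numeric")
        else:
            raise ValueError("values must not be empty")
    else:
        raise ValueError("values must not be None")


def mean_guarded(values):
    """Guard-clause version: bail out on bad input first, then do the work."""
    if values is None:
        raise ValueError("values must not be None")
    if len(values) == 0:
        raise ValueError("values must not be empty")
    if not all(isinstance(v, (int, float)) for v in values):
        raise TypeError("values must be numeric")

    return sum(values) / len(values)
```

In the guarded version each bad condition sits next to its own error, and the last line is the simple thing the function is actually for.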

47:18 So, that was one of her tips as well.

47:19 It's a nice one.

47:20 Yeah.

47:21 She's got some great little examples there.

47:23 Some stuff on object orientation and so on.

47:26 But, yeah, these are really good.

47:27 I, you know, switch, which doesn't apply as much to Python.

47:29 I actually wrote a switch language extension for Python using the context manager with

47:35 block.

47:35 That's pretty awesome.

47:36 But I'm not going to get into that because that's a whole different debate.

47:38 But I do think this idea of code smells is really interesting.

47:41 And you should think about them for data science because I'm sure there are different.

47:45 It sounds like, it looks like there are different data science code smells that are more common

47:49 than, say, standard software engineering.

47:51 If you're doing database programming or whatever, you get like a different style there.

47:54 Yeah.

47:54 And just for like other programming related things and how you can like structure your projects,

48:00 Jenny Bryan also has this talk about like, how do you name your files?

48:03 It's kind of interesting because like I, if you think about these common problems long enough,

48:08 everyone pretty much just converges to like the same set of solutions.

48:12 I remember like coming up with, yeah, I should just name things this way or like set up my

48:17 folder this way.

48:18 And then like all of a sudden, Jenny Bryan like gives a talk at like, big R conference.

48:22 Like, wait, that was like, I feel validated that like I didn't come up with something like

48:25 nonsensical.

48:26 Other people as well, like they write packages, sort of like Cookiecutter, just to set

48:31 up projects.

48:31 And it's pretty much like the same way.

48:33 And one of them is like, oh, how do you name your files?

48:36 Right.

48:36 Like, and especially in analytics, there's clearly an order you should run this stuff

48:40 in.

48:41 So one of the ways of like, how do you name your files is prepend a number to them.

48:47 Right.

48:47 So like you can say like one dash and then like the script and that's the order you run

48:51 it in.

48:51 If you want to do better, you say zero one.

48:54 So like 10 and one don't get sorted improperly.

48:56 Yeah.

48:57 And then if you really want to go one step further, I started this habit of like having

49:02 a three digit number.

49:03 So like zero one zero, and that gives you a buffer room to like insert something in

49:08 the middle.

49:09 Or if you like forget something or like you realize that.

49:12 That's like the 10, 20, 30 in BASIC.

49:14 Yeah.

49:15 Like, what if you got to put a line in between that?

49:17 You got to go to 30 still.

49:18 Well, you could do 19.

49:20 Yeah.

49:20 Whatever.

49:20 Yeah.

49:21 And I found that out because like, that's how sort of some of the files in Linux in the

49:26 order of like how it loads up like services or something.

49:29 It's like defined in like those three digit numbers.

49:32 And I was like, oh, this is interesting.

49:34 I should do that.

49:35 It just saves me from like renumbering like a whole bunch of stuff.
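
To make the sorting point concrete, here is a short sketch. The script names and the `numbered_name` helper are hypothetical.

```python
def numbered_name(step, label, width=3, ext=".py"):
    """Build a script name such as '010-load.py' from a step number."""
    return f"{step:0{width}d}-{label}{ext}"


# Without zero padding, a plain string sort puts step 10 before step 2.
unpadded = ["1-load.py", "2-clean.py", "10-model.py"]
print(sorted(unpadded))

# Three digits in steps of ten leave room to insert a step later.
padded = [numbered_name(10, "load"), numbered_name(20, "clean"), numbered_name(30, "model")]
padded.append(numbered_name(15, "validate"))  # slots in between 010 and 020
print(sorted(padded))
```

The inserted `015-validate.py` sorts between the load and clean steps without any other file being renamed.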

49:37 Yeah.

49:38 That's cool.

49:38 I mean, just thinking about the structure is quite interesting.

49:40 At the end of the day, even though you have all this structure for your analytics project,

49:45 because everything is like nice and in some kind of order, if you do, for example, want

49:51 to create a Python package, like it's already there for you, right?

49:54 Like you can create another folder.

49:56 That's the name of your module.

49:58 Put a setup.py file.

50:00 You could have the ability to set that up.

50:02 And now you can like pip install -e.

50:04 And then anytime you edit that file, like your analysis will still work.

50:08 And that's pretty cool.

50:10 The other thing with project structure related stuff, like if you have things numbered at

50:14 the end of the day, everything comes down to like a DAG compute system.

50:19 And so like, because you have your stuff in order and there's properly defined inputs and

50:23 outputs, you can use like a Makefile or like a simple script as like a poor man's Makefile.

50:28 But then you end up in like the situation like, oh, that's where Luigi and Airflow come into

50:33 play.

50:34 They're pretty much just DAG executors.
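
A "simple script as a poor man's Makefile" could look roughly like the sketch below. It uses the standard-library `graphlib` module (Python 3.9 or newer) to order the steps; the step names and their dependencies are invented for the example, and this is not how Luigi or Airflow are implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "010-load": set(),
    "020-clean": {"010-load"},
    "030-model": {"020-clean"},
    "040-figures": {"020-clean", "030-model"},
}


def run_order(dependencies):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def run_pipeline(dependencies, run_step):
    """Call run_step(name) for each step, dependencies first."""
    for step in run_order(dependencies):
        run_step(step)
```

Here `run_step` is whatever actually executes a step, for example a function that launches the matching script.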

50:36 Like I said, at the very beginning, setting up your project is pretty much like the gateway

50:41 drug into like all of this other cool technology.

50:43 Cause like you've, you would have set everything up in such a way that you then use those tools

50:48 when you hit that point where you need it.

50:51 And it's like a nice way to like slowly improve, do self-improvement stuff.

50:55 And then you also like end up using all the cool stuff that you see at like these big conferences

51:00 as well.

51:00 Yeah, that's, that's really cool.

51:02 And of course the structure gets you just that much closer to trying it out.

51:05 Now, what do you think about Papermill and some of these concepts?

51:10 Are you familiar with Papermill?

51:11 Yeah.

51:11 Papermill is, I think that's the Netflix.

51:14 Yes.

51:14 It lets you basically turn a Jupyter notebook into something that can receive inputs and then

51:19 have outputs almost like a function or a module or something like that.

51:22 So I personally haven't used it.

51:25 That's mainly because when I started working, like Papermill wasn't really a thing at that

51:30 point.

51:30 So like I had migrated out into like, let's just make everything a Python script because

51:35 that has no dependency and we can just execute things that way.

51:40 And then the notebook itself just becomes like, Hey, this is the report.

51:44 In some sense, I can see if I, for me, I guess like the next time I start an analysis project,

51:49 like I probably will use Papermill just because it's like, Oh, it's this cool technology.

51:53 And I've like set up my folder structures in such a way where like I can now use it.

51:58 Right.

51:58 So I've heard of it, but I personally haven't used it yet.

52:01 Yeah.

52:01 I haven't used it either, but it sounds pretty interesting.

52:03 Like it sounds like Netflix, like you said, is doing really interesting stuff to me.

52:07 One of the things that sounded special, it made me go, okay, well maybe that is worth considering,

52:12 even though it's like not necessarily my style, right?

52:15 is if you have a big, long sort of pipeline of operations and each one is its own Jupyter

52:20 notebook.

52:21 If it fails, you can save, you basically keep the notebook as it was computed laying around.

52:28 So you can just open it up and you have basically a history of what happened and then what failed,

52:32 which sounds like a pretty interesting way.

52:34 Cause if you switch it to scripts, which I'm all for, but you end up with, you know,

52:38 it exited with, like, a non-zero code.

52:41 Oh, that's bad.

52:42 Right.

52:43 Like, what does that mean?

52:44 Like, I forgot, I don't even have logging or any of these things, right?

52:46 Like what happened?

52:47 Like, why did it not work?
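
One way a script-based pipeline can leave a trace of what failed is plain standard-library logging. The sketch below is an assumption about how that might be wired up; `run_step` and `main` are hypothetical names.

```python
import logging

logger = logging.getLogger("pipeline")


def run_step(name, func):
    """Run one step, logging its start, its finish, and any traceback."""
    logger.info("starting %s", name)
    try:
        result = func()
    except Exception:
        # logger.exception records the full traceback along with the message.
        logger.exception("step %s failed", name)
        raise
    logger.info("finished %s", name)
    return result


def main(steps):
    """Run (name, func) pairs in order and return a process exit code."""
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    try:
        for name, func in steps:
            run_step(name, func)
    except Exception:
        return 1
    return 0
```

A script would typically end by passing the return value of `main(...)` to `sys.exit`, so the non-zero exit code comes with a log that says which step failed and why.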

52:48 So I do think there's some interesting stuff happening around there, but I do also

52:52 feel like the software engineering tools you have apply really well to modules, right?

52:59 Like it's easy to run that through pytest.

53:01 It's easy to run that through a profiler, the refactoring tools work on those.

53:07 Not that you can't do some of that stuff with notebooks, but it's easier to use them on files.

53:13 Yeah.

53:13 And especially if you're checking things into version control, that's sort of like the one

53:18 thing.

53:18 My main gripe with the notebooks is like, every time I make a change, like I have no

53:24 idea what's going on in diff and it's just like, yeah, just add and commit.

53:28 Like, I think it's right.

53:29 Let's see.

53:31 Do you accept their changes or your changes?

53:33 my changes.

53:34 Or like, or like if I just want to open the notebook.

53:38 So there's this program called nteract, which at least is like a, a desktop version.

53:43 So I don't have to like fire up a server and then open a notebook.

53:48 That way.

53:48 But yeah, like sometimes like, I just want to double click this thing just to see it.

53:52 I don't like want to open up a terminal and like launch everything just to see something.

53:56 So it was like little things like that, where I was like, I'll try to do as much as I can

54:00 in a script.

54:01 And then like everything else goes into a notebook.

54:03 And then in the notebook, I still save out the things I want just so like I have an easier

54:08 way to like access figures or tables without having to like look at the entire notebook.

54:13 Yeah.

54:13 I guess that is one of the challenges is the whole diff thing.

54:16 Maybe we could talk.

54:17 We're kind of getting long on time, but there's a lot of interesting stuff to cover.

54:19 So I'll ask you a few more questions.

54:20 Let's think a little bit about collaboration.

54:23 Like you talked about the anti-pattern of having like Sarah's path, Dan's path, Michael's path,

54:32 whatever, like, and just commenting them out which one is active at the moment.

54:36 But there's probably some other stuff for collaboration, like are you using Git?

54:40 Are you using some online shared notebook that's kind of like Google Docs?

54:46 Like what are your thoughts around that kind of stuff?

54:48 So Google has something called like the Colaboratory notebook, which is essentially like Google Docs,

54:53 but gives you a Jupyter Notebook system.

54:56 That's pretty cool in the sense that like, yeah, we won't have this commenting out of like random

55:02 lines because everyone's really just working on the same place.

55:04 Like that's really nice for collaboration.

55:06 I still think that you need some form of version control.

55:10 Like that is, I think like at this day and age, like it's pretty much required, especially

55:15 when programs start to get more and more complex.

55:18 Like you need a way to fall back on.

55:20 The nicest feature I use in Git is like I write something, everything is broken and I just say

55:26 Git reset and I just pretend I never did that and I just start over.

55:30 Yes, exactly.

55:32 Like that was a really bad idea.

55:34 Please revert that.

55:34 Okay.

55:35 Now we're good.

55:35 And it lets you be more exploratory.

55:38 It lets you be more aggressive in trying to change it.

55:41 Like this might not work, but if it works, it's going to be awesome.

55:43 And try it.

55:45 Actually, that didn't work.

55:46 Revert.

55:46 Or, you know, maybe it's a little more forethought.

55:49 You create a feature branch to explore it.

55:50 You do it there.

55:52 And they're like, forget that.

55:53 That was a bad branch.

55:54 We're just going back here.

55:55 Like, let's not do that.

55:56 Right.

55:56 Yeah.

55:56 But it's a really great feature.

55:57 Yeah.

55:57 And like, just along the lines of like collaboration stuff, like make small incremental changes.

56:03 And that's like the actual stuff.

56:04 That's the code that will actually get reviewed.

56:07 Right.

56:07 Like no one will review a code base where you're like at the end of the paper, the entire like submission relies on this code base.

56:15 And you're like, I need someone to review this thing.

56:17 Right.

56:17 Like there's no way that's going to get a proper review.

56:21 And so just in general for like, doesn't even have to be in like research or science.

56:26 It's a good habit to like make small incremental changes.

56:29 And like, maybe that's what your weekly meeting is about.

56:32 It's just like, this is what I did this week.

56:34 Someone press the green button to merge this in because that will actually be reviewed.

56:40 And then you'll have a discussion around that point.

56:42 Like all of that stuff.

56:43 For me, I personally, I'm not in a managerial position.

56:47 So like, those are the types of meetings I find like productive where I can actually talk about.

56:51 This is what I did.

56:52 This was the implementation.

56:53 This is what I'm thinking about next.

56:55 And then have a conversation around that because it can still be productive and you can still have like talks about longer goals.

57:01 But like you also now have the benefit of like someone else looking at your work to make sure it doesn't have like a bad code smell.

57:08 You know, maybe like you, you, you're off by like a factor of 10 and no one's going to notice that in like 900 lines of code.

57:14 But they will if it's just like 20 lines of code, like a change like that is much easier to find.

57:19 Yeah, that's definitely good advice.

57:21 I definitely recommend working in small little, little bits and changes and, you know, make some small change.

57:27 Do a git commit.

57:27 Make another small change.

57:29 A little git commit, right?

57:30 Like don't wait until the end of the week or like until the end of the paper and like, all right, time to check it in.

57:34 Like, no, not a good idea.

57:36 One of the things I wish like exists more in academia is just having more resources to do pair programming.

57:43 Because usually people are assigned one project and there isn't like two people assigned to the exact same bit, which is what you really need pair programming for.

57:51 When I was co-instructing like the summer program in my previous lab, I would sit down next to students and I would pair program them through some kind of data related work.

58:02 And it's super valuable for them because they actually get to see how I'm thinking about like this problem.

58:08 And I'll say like, you're doing a join of two tables.

58:11 Yeah, make sure that like the keys don't have duplicates if you're not expecting duplicates, right?

58:16 That's like one of those things of like, yeah, the code ran, so I'm just going to keep going, right?

58:19 And you don't realize that you just did, you just did a Cartesian product and now you have a million rows and you don't know why, but you're just going to keep going.

58:28 Why is it taking so long?
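
The duplicate-key check can be sketched with plain Python dictionaries standing in for tables. The function names are hypothetical; in pandas, `DataFrame.merge` accepts a `validate` argument (for example `validate="one_to_one"`) that performs a similar check.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return the key values that appear more than once in a list of dict rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def join_one_to_one(left, right, key):
    """Inner-join two lists of dict rows, refusing duplicated keys on either side."""
    for side, rows in (("left", left), ("right", right)):
        dupes = duplicate_keys(rows, key)
        if dupes:
            raise ValueError(f"{side} table has duplicate {key!r} values: {dupes}")
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]}
            for row in left if row[key] in right_by_key]
```

Failing loudly before the join is the point: the code no longer "just runs" and quietly multiplies rows.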

58:29 Yeah.

58:29 So pair programming, yeah, it's super valuable.

58:32 And even now during my internship, it's I'm on the receiving end of pair programming, but this is more on the software engineering side.

58:38 It's super valuable just to see like, oh yeah, this is how you write good code or like, this is how they're thinking about it.

58:45 And it's even stuff like I talk about, yeah, be careful where you're hitting like control V a bunch of times.

58:50 And it's like, oh yeah, like this is in two different places.

58:53 Like, let's just refactor this out.

58:55 And it's like, oh yeah, I didn't catch that.

58:56 And like, when you refactor it out, you can actually have more guarding clauses just to make this like an even better check.

59:04 That's one of the things I wish, like at least in research, like there was more budget and time for is just pair programming.

59:11 And that just makes collaboration easier because you're now just talking with a person back and forth.

59:16 It just makes that whole process like way nicer and smoother.

59:20 Yeah.

59:20 I mean, we certainly have the tools these days for it, right?

59:23 You talked about Google Colaboratory, which has like live multiple editor features, kind of like Google Docs.

59:30 You've got obviously screen sharing, you've got like VS Code's ways to like watch somebody else's system on two sets of Visual Studio Code.

59:40 And there's some really interesting options.

59:41 But yeah, it's got to, it's like also a cultural thing.

59:44 And also you've got to have people to collaborate with on that part, right?

59:48 Right.

59:49 And in the sense of, hey, maybe like when you, even though you're in this small world and you write your package, like now you have someone to collaborate with, right?

59:56 And that's sort of like socially motivating that you have other people using your stuff.

01:00:01 Yeah.

01:00:01 It definitely feels good to have someone looking at it, interacting with what you're building because building software completely in isolation just for yourself.

01:00:11 It's kind of a weird place to be.

01:00:12 It's not as much fun as it could be.

01:00:13 Yeah.

01:00:13 It's fun.

01:00:14 Like when you're just in the sense of like, I got to get something like that minimum viable product, like that's fun.

01:00:20 And then it's just like, as soon as you hit maintenance mode, it's like, who am I maintaining this for?

01:00:26 Yeah.

01:00:26 Or just all like a lot of the projects, you know, you're going to be working on it and you kind of get the happy path, mostly working and you feel like you're mostly done.

01:00:35 But then there's all these little loose ends, the documentation you got to write for the other people involved, all the little tests and the edge cases.

01:00:44 And just, it can just go on and on and on.

01:00:46 It feels like, I thought I was done a month ago with this and I'm still working on it.

01:00:49 How is this still not done yet?

01:00:51 Like I've definitely had that feeling in software and I'm sure it's just the same, you know, that was actually in a semi-research context.

01:00:58 I'm thinking back to it.

01:00:59 Yeah.

01:00:59 Final thought on this collaboration bit.

01:01:03 What do you think about GitHub?

01:01:04 Like creating either a private or a public repo, using that for your work to share with people?

01:01:10 I love it right now.

01:01:12 Like pretty much if I have a thought, I just make a GitHub repo.

01:01:15 So like my personal GitHub account has a bunch of projects where like they're pretty much empty, but they have a name.

01:01:21 And it's just because like I thought of something one day and I just made a repo out of it.

01:01:26 It's even really good for simple stuff.

01:01:28 Like if you're at a conference and you just want a place to take notes, that doesn't matter what machine you're on.

01:01:34 I've taken just notes in Markdown as a GitHub repository.

01:01:38 And then like during like a lightning talk, just be like, hey, I just started putting up my notes.

01:01:42 And then maybe some people will like add, hey, wait, this is my talk.

01:01:46 Let me put my talk in there.

01:01:47 And you end up collaborating on like some kind of notes for like a conference, which is pretty cool.

01:01:53 And for me, like I try to in lines of that 10% improvement, like every time, like originally, like I just made everything in Git just because I needed more practice with it.

01:02:03 And it was just like a nice safe place for me to like, oh yeah, like add and commit.

01:02:07 Like if you do it a couple of hundred times, that part doesn't become scary anymore.

01:02:11 And so that's right.

01:02:13 It just becomes so natural.

01:02:15 Like, oh yeah, when I first learned Git, it's like, why am I doing this?

01:02:17 This is so tedious.

01:02:19 And then it's like, now it's like, okay, whatever.

01:02:20 But then like you can do other stuff with Git, which is like super cool.

01:02:24 So GitHub is like a great way to practice using Git and then also gives you the ability to practice or get ready for collaboration.

01:02:33 Right.

01:02:34 So even for me, even if I'm working on personal projects, sometimes like I will do branches for myself, push branches to GitHub by myself, and I will submit pull requests to myself.

01:02:47 Just to document it and make it really clear.

01:02:49 Like this is the reason for it here.

01:02:51 The files that changed and all that.

01:02:52 Right.

01:02:52 And like, I was doing that for a couple of years.

01:02:54 And like now, like during my internship, like that has become so second nature that like I can actually do Git things and it doesn't hinder collaborating in like the real world.

01:03:04 Yeah.

01:03:04 So it was a lot of like just practice that like, I just thought it was cool.

01:03:07 Like, I didn't realize until now that was like, wait, like this is actually just like years of practicing on my own.

01:03:13 And so like, in that sense, like, and like Microsoft essentially saved GitHub and like, it's just as good as ever.

01:03:20 So like, yeah, plus plus one for GitHub all the way.

01:03:24 Yeah.

01:03:25 Awesome.

01:03:26 I totally agree.

01:03:27 I totally agree.

01:03:27 Okay.

01:03:28 This is really interesting.

01:03:30 I think there's a lot of concrete advice here.

01:03:32 I'll link to the papers.

01:03:34 I'll link to your pyprojroot project.

01:03:37 The code smells thing, all that.

01:03:39 We'll put all this up there and people can come back and definitely dig into the details if that's useful for them.

01:03:46 So before we get to the final bit of the show, though, I've got to ask you the two questions, Dan.

01:03:51 First of all, if you're going to write some Python code, what editor do you use?

01:03:55 So I used to use Emacs with Elpy, and I am now a VS Code convert.

01:04:01 They've brought you over.

01:04:03 You know, I would say like the last four shows that I've had, everyone has said VS Code, which is pretty interesting.

01:04:08 Yeah.

01:04:08 I was pretty reluctant until like I had to write some Python code and I was on, I switched over to my Windows machine and I was like, I don't have any way to edit code right now.

01:04:21 Let's just try this thing.

01:04:23 And, you know, it worked.

01:04:25 And so like I was pretty happy with it.

01:04:27 So I sort of just hung around.

01:04:29 What's actually really cool is the screen sharing ability in VS Code that does pair programming.

01:04:36 Yes.

01:04:37 That live, I think it's called Live Share.

01:04:39 I've never had a good chance to use it, but I've seen it and it looks amazing.

01:04:42 Yeah, I've used it with one of the other interns and it's like, this is really cool.

01:04:47 And they also have like a voice communication mechanism.

01:04:49 So like yet another way to like do voice chat, but at least the screen, like the live coding part, like that was super cool.

01:04:57 Very nice.

01:04:58 Yeah.

01:04:58 Okay, great.

01:04:59 Definitely a good answer for the editor.

01:05:01 Packages, some notable ones.

01:05:02 The notable package that I haven't heard on the show yet is one called pyjanitor by Eric Ma.

01:05:08 And he works at Novartis.

01:05:11 And this is pretty much his consolidation of pretty common data cleaning stuff in pandas.

01:05:19 And that ties to another package by Zachary Sailer called pandas-flavor, which is a wrapper around your ability to extend pandas.

01:05:28 And the benefit of that is, you know, if you want pandas to have a method that you don't already have, like you might think like, oh, let me create another class.

01:05:36 I'll inherit pandas and I'll release a package.

01:05:38 But no one's really going to use that because it's not a pandas data frame object.

01:05:42 It's like some weird class that you created yourself.

01:05:44 And so like this is sort of like a mechanism for you to inject your own methods into a pandas data frame object, but still have a pandas data frame object without having to re-extend the class.

01:05:57 So it's super cool.

01:05:58 Yeah, that's really great.

01:05:59 And yeah, pyjanitor, I really like that one.

01:06:02 It takes a whole bunch of imperative data frame operations and turns it into a really nice fluent API like DataFrame.from_dict().remove_columns().dropna(), you know, rename_column(), and just boom, just flows it all together.

01:06:18 It's really nice.
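
The mechanism of injecting methods into an existing class without subclassing, and the fluent chaining it enables, can be illustrated with a toy class. This is a sketch of the general idea only: `register_method` and `Table` are made up here, and this is not how pandas-flavor or pyjanitor are implemented.

```python
def register_method(cls):
    """Decorator that attaches a function to an existing class as a method."""
    def decorator(func):
        setattr(cls, func.__name__, func)
        return func
    return decorator


class Table:
    """Toy stand-in for a data frame: a list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)


@register_method(Table)
def drop_missing(table):
    """Return a new Table without rows that contain None."""
    return Table(r for r in table.rows if None not in r.values())


@register_method(Table)
def rename_column(table, old, new):
    """Return a new Table with one column renamed."""
    return Table({(new if k == old else k): v for k, v in r.items()}
                 for r in table.rows)
```

Because each method returns a new `Table`, calls chain together, for example `Table(rows).drop_missing().rename_column("a", "b")`, and the object is still an ordinary `Table` rather than a custom subclass.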

01:06:19 I haven't covered it on the show, but we did talk about it over on Python Bytes, that podcast.

01:06:23 So, yeah, it's definitely a cool one.

01:06:25 It's been on my radar as well.

01:06:26 Nice.

01:06:27 All right.

01:06:27 Well, final call to action.

01:06:28 People who are out there, maybe they're in science, data science.

01:06:31 It's something like that.

01:06:33 And they want to make their code take, you know, that 10% step you're talking about towards the more proper engineering structured world.

01:06:42 What do they do?

01:06:43 For me, like I was lucky enough to be in New York City, which is a big city.

01:06:46 So it was always like local meetups were always like a thing that were very busy and you learn a lot from there.

01:06:53 But even if you don't live in a very big city, you can either start one yourself because chances are you are not alone.

01:07:01 And the Python community is super supportive.

01:07:04 You can always if you say something on Twitter, someone will give you the ways of how to start something.

01:07:10 And if you're at a university, you can always have meetings in like a classroom or something.

01:07:16 So don't worry.

01:07:17 Right.

01:07:17 Maybe it can be interdisciplinary, right?

01:07:20 Like maybe there's not that many people in your department.

01:07:21 But if you go across, you could probably find a decent number of folks who want to attend.

01:07:26 Yeah.

01:07:26 And so meetups are a great way to like learn or meet other people or at least just like ask questions about stuff.

01:07:33 And if you can make it to like any of the Python conferences or like attend a sprint, like going to a sprint was probably the fastest way that I became a better Python programmer.

01:07:46 Or even if it was something as like editing a piece of documentation, like just seeing the mechanism of how other people collaborate on such a large scale and then still seeing your work like in one of these major projects like that's super motivating and like cool.

01:08:02 Yeah, that's really cool.

01:08:02 Yeah, it's a great opportunity.

01:08:04 And it's also a great opportunity to, you know, rub shoulders with really prominent people in something that you're working with, right?

01:08:12 The maintainers of this probably important project who are there and, you know, what better chance to get to know them a little bit than to sit down and like add a feature with them or spend a day in the room with them.

01:08:23 Something like that, right?

01:08:24 That really can build some connections that, you know, especially if you're in a small town somewhere and not meeting them in person, that could be a challenge.

01:08:32 Yeah, and a lot of people stay within Python because of the community.

01:08:36 So like, I guess my final call to action comes from Greg Wilson in his book called Teaching Tech Together.

01:08:43 He talks about the rules of teaching how to program or like building community.

01:08:48 And the first rule is be kind, all else is details.

01:08:52 Yeah, that's great.

01:08:53 Be kind, all else is details.

01:08:54 I agree.

01:08:55 It's definitely right up there as one of the most important ones.

01:08:58 All right, Dan, thank you for being on the show.

01:09:00 It's been really great to talk about these ideas with you.

01:09:03 I think there's a lot of good advice people can take away.

01:09:05 Yeah, it's been great talking with you, Michael, as well.

01:09:07 You bet.

01:09:08 Bye.

01:09:09 This has been another episode of Talk Python To Me.

01:09:11 Our guest on this episode was Daniel Chen, and it has been brought to you by Indeed and Rollbar.

01:09:16 With Indeed Prime, one application puts you in front of hundreds of companies like PayPal and VRBO in over 90 cities.

01:09:24 Get started at talkpython.fm/Indeed.

01:09:28 Rollbar takes the pain out of errors.

01:09:30 They give you the context and insight you need to quickly locate and fix errors that might have gone unnoticed until users complain, of course.

01:09:38 Track a ridiculous number of errors for free as Talk Python To Me listeners at talkpython.fm/rollbar.

01:09:44 Want to level up your Python?

01:09:46 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.

01:09:51 Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python.

01:09:59 And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle.

01:10:04 It's like a subscription that never expires.

01:10:06 Be sure to subscribe to the show.

01:10:08 Open your favorite podcatcher and search for Python.

01:10:11 We should be right at the top.

01:10:12 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

01:10:21 This is your host, Michael Kennedy.

01:10:23 Thanks so much for listening.

01:10:24 I really appreciate it.

01:10:25 Now get out there and write some Python code.

01:10:27 I'll see you next time.

01:10:47 Bye.