#227: Maintainable data science: Tips for non-developers Transcript
00:00 Michael Kennedy: Did you come to software development outside the traditional computer science path? This is common, and it's even how I got into programming myself. I think it's especially true for data scientists and folks doing scientific computing. That's why I'm thrilled to bring you an episode with Daniel Chen about maintainable data science tips and techniques. This is Talk Python to Me Episode 227, recorded August 6th, 2019. Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy, keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is sponsored by Indeed and Rollbar. Please check out what they're offering during their segments, it really helps support the show. Dan, welcome to Talk Python to Me.
01:03 Daniel Chen: Hi, Mike, nice to meet you.
01:04 Michael Kennedy: It's great to meet you as well. I'm so glad that we got a chance to run into each other at PyCon this year and learn about what you're up to, 'cause we're going to have a good time talking about it.
01:13 Daniel Chen: Yeah, and this year was the first year I was at PyCon, and I typically live in the data science world, so it was, one, super cool to be at, like, pretty much a convention of Python users, and I'd almost forgotten, like, how Python is outside of data science, like Django is a thing, was a thing that was repeated back to me.
01:36 Michael Kennedy: Yeah, exactly, that's pretty interesting. What was your take on it? Do you recommend people, especially data scientists, attend PyCon, you're happy you went?
01:44 Daniel Chen: Yeah, I mean, it was super cool. I mean, data science is sort of one of the growing parts of Python as a language, and I think, like a lot of people have said, like, it's sort of the reason why Python has picked up in popularity, like recently. And so yeah, it's super cool just to see all the booths. I personally gave a Pandas tutorial there, so it is becoming like more and more of a thing and I think there were two or at least three Pandas-related tutorials during the session, so, like, it is...
02:16 Michael Kennedy: Yeah, I know Kevin Markham gave one as well, I'm pretty sure, at least something on data science there, so yeah, there was definitely some interest. I think I met one other person who's doing one, so yeah, it's pretty incredible, right?
02:27 Daniel Chen: Yeah, yeah, and like, and then again, like, and then there's this whole web stack of things that, like, I almost never really used, but is, a lot of people do use it as well, so it's super cool just to see it and be reminded how, like, what Python can do as a language.
02:44 Michael Kennedy: Yeah, that's cool, and for me, it's exactly the opposite, right? Like I spend a lot of my days writing web apps and APIs and things, and then to see the data science stuff, it really reminds me, there's a really different way to work and other things to optimize than your scalable web apps. For sure.
03:01 Daniel Chen: So totally recommended to go, and like, you know, there was some talk on Twitter like, "Don't always try to do the hallway track if you can," 'cause like, sometimes speakers would like people in their audience, but I tried to go to the talks that I can, but then there were a few, like, education-related hallway track groupings or meetups and that's what I attended, too, so it's just nice seeing other, like, Python educators, which I also went to SciPy a couple weeks later, so it was some of the people I saw again for a second time, and I was like, "Oh, cool!"
03:33 Michael Kennedy: Yeah, I definitely love going to PyCon. As people know, I talk about it all the time and it's a really great experience, and what I think is interesting is a lot of people feel like they have to be experts to go. I met a lot of people who were fairly beginner in their career, and it was really valuable to them to be there, so I just want to throw that out there for people.
03:49 Daniel Chen: Yeah, super welcoming. I mean, that's sort of the reason why I stuck around with Python. I'm also pretty active in the R community as well, and between Python and R, a lot of people join for whatever reason, but again, like, as the saying goes, they stay because of the community, and everyone's just super nice and helpful and super beginner-friendly.
04:07 Michael Kennedy: Absolutely, absolutely. All right, well, I haven't got a chance to ask you the opening questions, let's start there. So before we get into all the techniques and tips and stuff you have for data scientists to bring in more structured programming stuff to make their data science techniques and tools better, let's talk about you. How did you get into programming and Python?
04:32 Daniel Chen: I was pretty much always surrounded by computers as a kid. I always had the hand-me-down computer when I was a kid from my parents when they were working. I guess it sort of does help that my dad is a software engineer, but it wasn't really like a thing when I was at home, other than, like, hey, Dad does things on computers, that's kind of cool. I was always tinkering around with computers, though, like I do remember, the first thing I would do every time I opened a new app is like, hey, let's go to Edit and Preferences and just see what I can change, and I was sort of just tinkering. I grew up in New York City, I'm from Queens, and I went to one of the specialized math and science high schools in New York City, and so for us, sophomore year, it was actually mandated that every student take one semester of computer science and one semester of technical drawing or drafting.
05:20 Michael Kennedy: That's pretty cool. I think drafting is less valuable than people imagined it, 'cause I remember I had a drafting class as well and I don't really see it, but the software, thinking in tools and ideas, are certainly as, what language was that in?
05:34 Daniel Chen: Pen and paper, and then CAD towards the end, yeah. So yeah, we were, like, in a room, and we were drawing isometrics by pen or pencil and ruler.
05:44 Michael Kennedy: On those big slanted tables. And the programming one, what was that? What technologies did you all cover?
05:50 Daniel Chen: It was like, it was only towards the end, and it was, like, in some CAD program that I don't remember.
05:54 Michael Kennedy: Oh, okay.
05:54 Daniel Chen: Yeah, and it's like, interesting that, like, that was a thing I never, I thought it was super cool, and then, now that 3D printing is a thing, it's sort of like, wait, I used to kind of, I've done this once, but like, I just haven't done it in many years. So it's like, kind of interesting.
06:09 Michael Kennedy: Yeah, how interesting, yeah yeah. How cool, well, that's a great introduction. And then, did you study computer science in college?
06:17 Daniel Chen: So I didn't, and that was, part of it was, I didn't notice until I was in college, but when we had to take computer science in high school, it was sort of, man, all of these other people, and by other people, it's like, just a handful, it was like, man, they're really good at this. There's no way I'm going to be able to study this in the future, or for a career and whatnot. We did the one semester of computer science. I didn't go for the AP or anything, because like, originally I was going to, like, go down and be a medical doctor. That was my original plan. So then in college, I ended up, how do I make my medical application as strong as possible? Let's do, like, neuroscience, and it was sort of like a bio-heavy program, but like, that's where I sort of took my first set of statistics courses, and I was, oh yeah. Like, we hear about mean and standard deviation, but to finally understand it in the context of, oh yeah, here's the exam scores for the previous exam. Like, how do you actually rank? Just something like that, like, get some meaningful understanding of where do you rank in the class? And it's like, oh, maybe this is how the curve is going to be, or like, did I, I didn't do very well in the exam, but like, I'm actually kind of okay, so that was cool, and then, because I ended up switching into neuroscience my second year, I had to stay a fifth year in college, and so my last two years was like, oh, you only need four classes to do a computer science minor, so I was like, eh, I've done this in high school. It's like, just pick this up for fun, and then, so it was that first intro computer science class when we got to the actual Python programming portion where I was like, "Wait, this is actually not as terrible," and I would see the other students, in which case they would be freshmen, and I would've been already a junior. I would see the freshmen like, they've never seen this before. 
Their struggles were essentially my struggles back in high school, and then I realized, oh, it's literally because, like, I saw it before, and like, even though not much of it got retained, it was just thinking about things procedurally, just doing it once, now I can actually think about, like, syntax errors, versus doing everything at once. That's sort of when I was like, huh, maybe I could've done this as a career choice, but no, nope, let's keep going down the medicine route. So I ended up doing a master's in public health and epidemiology, just to stack on more research skills. The thought being was, hey, research and medicine were super cool, but I'm pretty sure if I ever start medical school, I'm never going to learn the stuff again, so let's just learn everything and then go to medical school. So I did my master's in epidemiology, and that's when I took my first, like, intro to data science course, and that is probably the most life-changing moment in my life. When I was doing my master's, I was already just learning about all of these other basic statistical techniques. I'd never heard of logistic regression before, and that's, like, the type of analysis you do when you have a binary outcome, so for us, it was like, did this person die, yes or no? Or did this person get cancer, yes or no? And I've never seen that before, and it was just like, wow, this is amazing, and then I take my data science class and it was like, what is this random forest thing? This is amazing, or like, what is ridge and lasso regression and like, I can just condense thousands of variables into something meaningful, like, that's super cool! And so that sort of started this whole trajectory down to where I am now, because it wasn't until that data science course I had during that semester, because it was so much learning to do, the instructor set up a Software Carpentry workshop, and so I was an attendee for Software Carpentry.
09:49 Michael Kennedy: I think Software Carpentry is a really cool project for folks with the background exactly like you described. I actually had Jonah Duckles on the show, way back in Episode 93, talking about Software Carpentry, so it's been a really long time since I've spoken about it. Maybe just tell the listeners out there what a Software Carpentry workshop is about, 'cause it'd be good for a lot of folks who are in the data science and sort of science in the programming space.
10:12 Daniel Chen: So yeah, it's sort of expanded over the past couple years, but Software Carpentry and their sister program, Data Carpentry, they're housed under this one umbrella called The Carpentries, and essentially, they're this nonprofit organization, and their goal is simply to teach researchers or scientists the skills that they need for, in the sense of Software Carpentry, like programming skills, and then in the case of Data Carpentry, like working with data, so like data skills. And the two really just go hand-in-hand, so you'll mix and match. They have a lot of overlap, and essentially, there is these two-day workshops where they cover Bash for the shell, and the whole premise of that is to show you about, like, what is a working directory? And programs do one thing and one thing really well, and you can pipe them into one another to chain things together, so that's, like, what you're supposed to take away from Bash. And then they go through Git for version control, which, it's really hard to get an understanding of Git in three hours, but it's just to show you that, like, there are better ways than naming your files final, final final, et cetera, et cetera.
11:20 Michael Kennedy: Putting the date on the end. No, like really, final.
11:22 Daniel Chen: Yeah, and putting the date, and then there's a section on Python or some of R, or any of the other programming languages, and it used to be that they also had a fourth section on SQL, but then usually SQL gets bumped out for, like, a longer Python or R session, so it's a two-day workshop that covers those skills and it's really to give researchers a primer, because we go into science not thinking that we're going to program, and so like, a lot of this stuff is just like, oh, I picked it up on my own, and it's just a bunch of stuff cobbled together, and that's how we learned it, and actually, that's how a lot of people in data science, like, that's how they learn programming. And then this is the first time, like...
12:05 Michael Kennedy: I feel like yeah, I feel like this is actually really common, as you're saying, and I think it's also a little bit why Python is growing a lot in the data science space, is it's, like, what can I do that's an easy step to do just enough computation to solve my problem, so I can go back to what I actually care about? Because I don't want to be a programmer. I want to be a biologist or a doctor or whatever, but then you slowly find yourself six months later with, like, a lot of scripts and you're running code and you're using Pandas or NumPy and you're like, well, I have no qualification for this, but here I am, like in it, somehow, even though I swore I would never do this 'cause I hated math or something like that, right?
12:43 Daniel Chen: Yeah, so that's the whole premise of The Carpentries, is like, okay, let's take one step back. You learned how to do this on your own, and let's refresh the actual basics and kind of, like, steer you in the correct way. That's the general lowdown of what The Carpentries are.
12:57 Michael Kennedy: That's cool, and you started out as a student, but you became an instructor, right?
13:00 Daniel Chen: I was a student, like, fall of 2013, and then it was like, just at the cusp of, wait, I can actually teach this stuff! It wasn't like that much of leaps and bounds. Like, I already knew a little bit about Python programming, and then the Bash stuff. I was like one of those people in college that was like, I'm just going to install Linux and see what happens, deal with problems that come from that. I've been saying to myself, like, "It's the year of the Linux desktop," and it's like, 2010 or something.
13:29 Michael Kennedy: It's almost here!
13:29 Daniel Chen: It's almost here. So I ended up signing up to go help out. You end up realizing that like, for a lot of newcomers, a lot of the problems that they have aren't actually that complicated, and just to go into education theory a little bit, it's they don't have a lot of nodes to make connections with, and so a lot of their problems are also like, just, they made a typing mistake, right? Like they're just not used to hitting Tab to tab-complete things, so everything is mainly a typo. So I started off helping out at a few workshops, and then I matriculated into the next instructor class, where I was certified to be an instructor, where it was mainly getting familiar with the material, and learning how to teach the material.
14:13 Michael Kennedy: That's cool.
14:13 Daniel Chen: Yeah. And then I was an instructor, and my first couple years as an instructor, that was right on the border of, I was still in grad, I was finishing out my master's program, and also, like, I had a job, but I ended up working so much during my job, that my boss was pretty much like, "Please go home." And so I spent a lot of time going home, but it was really just to go teach other workshops, and it was super nice being in the New York City area, 'cause, like, going to a university or any place was pretty much local for me. So I got a lot of teaching experience out of that, and I didn't know at the time, but I say it now, like, teaching is one of the best ways to learn something. So Bash and Git and Python and later on, R, like, I just got more familiar with it, just because I was teaching it all the time, and then, you know, once you have some foundation, like learning the next small bit of information is, it becomes easier and easier and then it just snowballs into something.
15:09 Michael Kennedy: That's cool.
15:09 Daniel Chen: Yeah, and then, like, all of that teaching knowledge ended up being the foundation for the book that I ended up writing, or was tasked to write, called "Pandas For Everyone." I mean, it's really like an honor that I got recommended to write this thing, so I should frame it in that sense.
15:26 Michael Kennedy: I've done a lot of training as well, and I feel like, once you kind of go through a couple of cycles of that, you just get so good at learning something with enough depth to present it, that it becomes, like, this really great power and it's kind of addicting, right? You're like, "All right, what's the next thing I can learn? What's the next research project I can go on?" and yeah. So it sounds like you did the Software Carpentry thing and it kind of somehow sucked you down this "Pandas For Everyone" hole of writing this book, which is Addison-Wesley, which is pretty cool.
15:54 Daniel Chen: Even writing the book, like now you're just like, oh, I just can't write really janky code anymore. Like, this actually needs to be quote-unquote, like, the better way of doing things. So like, there was still, even though I was writing a book and I was supposed to be the expert in this, like, a lot of it was also like, I should probably read this part of the documentation just to make sure, 'cause I also learned this on my own.
16:15 Michael Kennedy: Right, well that's the thing about the difference with practicing as a programmer or as a data scientist, versus an author or an instructor, right, like as a practicing person, and you have a problem, you're like, I need to figure out how to make Pandas do this. Like, it doesn't matter how it happens, but if you can make it happen, you're done, like that's the end of the research, you're done. This part is solved, what's the next problem? But as an instructor, like, well, but there's these other two ways, and what if somebody says, "Well, why not this way versus that way?" What's the diff, all of a sudden, all these cases that you would never go down, like, you have to start going down those now, which I think is awesome actually, but it's definitely a different way of thinking.
16:53 Daniel Chen: It's super cool 'cause, like, now it becomes its own learning path, like you see other people have problems and you see how they think about it, and like, it sort of adapts how you present material. For me, when I was originally, when I first started off teaching workshops out of the book, I pretty much went in the order that I presented the chapters in, and then more and more recently, like, I realized, wait, tidy data principles is actually, like, one of the most important things in data science and data cleaning. After we load our first dataset, I pretty much just jump to that chapter, 'cause if you can really understand that, everything else becomes way easier, quote-unquote easier.
17:29 Michael Kennedy: Yeah, sure, well, if you're trying to do operations on bad data and it keeps crashing and like, that's no fun. Like, why does it say "None is an invalid," you know, "doesn't have this attribute?" I don't understand, well, let's talk about that. This portion of Talk Python to Me is brought to you by Indeed Prime. Are you putting your Python skills to good use? Find your dream role with Indeed Prime, and start doing more of what you love, every day. Whether you're a developer, data scientist, or anything in between, one application puts you in front of hundreds of companies like PayPal and Vrbo in over 90 cities. Indeed Prime showcases your experience and tech skills to match you with great fit roles that meet and exceed your salary, location, and career goals. And when you start a one-on-one conversation with one of their career coaches, you'll get resume reviews and personalized advice to help you get what you deserve. So if filling out countless job applications isn't your thing, let top tech companies apply to you. Join Indeed Prime for free at talkpython.fm/indeed. That's talkpython.fm/indeed. The reason I wanted to talk a little bit about Software Carpentry, other than just like you've been doing it and it's cool, is I think it's a really good segue into this larger topic of how do you take the average data scientist and the work that they're doing, and help bring in these more computer science, software, maybe not even computer science. Let's say software engineering principles, to help them basically be more effective, right? So maybe we start at the beginning. We've got some idea, we probably found out we can open up a Jupyter notebook, load something into Pandas, and poke around with it with Matplotlib or something, right? Maybe that's it, right? Maybe we've, I've seen a lot of MATLAB code as well where it's like, well, I got, this does this thing, but it's like, there's no functions. Maybe there's loops, maybe not, right? 
It's just like, all crammed in there. And those are PhDs writing that, so like really brilliant people, but they just don't have the software engineering skills, so where do we start with that?
19:37 Daniel Chen: There's a few, like, papers I would direct people to sort of get a sense of where I'm coming from, so like, there's this one paper by William Noble called A Quick Guide to Organizing Computational Biology Projects, and that's sort of the premise of how I guess, like, I would present, like, how do we introduce software skills? And in that paper, he literally talks about, you should have a folder structure and maybe this is one way you should set up your folders for your analysis projects. And I'll talk a little bit about that in a bit, but yeah.
20:06 Michael Kennedy: Yeah, so it's called A Quick Guide to Organizing Computational Biology Projects, and you know, it's probably focused on biologists, but I'm sure that it's pretty generally applicable.
20:14 Daniel Chen: Yeah, yeah, other than maybe the sequence.py file. Like, replace that name with whatever you need.
20:21 Michael Kennedy: Right, it's hubble.py or whatever.
20:25 Daniel Chen: Yeah, and the other two papers, the first author is Greg Wilson, who restarted Software Carpentry back in the 2000s, and he wrote two papers, one in 2014 called, like, Best Practices for Scientific Computing, and then in 2017, the paper is called Good Enough Practices in Scientific Computing. If you just look at the papers, it almost seems like, hey, we're presenting the ideal case, and then we almost realized, like, that's impossible in the real world, but they're both pretty good papers, and they talk about different things.
20:55 Michael Kennedy: Right, what would we have if we had, like, the perfect adaptation of software engineering to this world? Like, okay, well, what can we reasonably ask people to do that will make their life better, it sounds like.
21:05 Daniel Chen: Yeah, and the way I approach it is, like, just like when I teach data science skills, I pretty much make a beeline to tidy data and tidy data principles. In this case, it's almost like a beeline towards project organization, just having some kind of structure to your projects or to your analysis project, that will snowball into all of the cool tools that you probably heard of, and don't know how people end up there, but if you take slow steps, I found that project organization is the fundamental thing, where, sort of like the gateway to everything else.
21:43 Michael Kennedy: Right, because a lot of what you need, it sounds like, is code organization, right? It's like the architecture and functions, classes, different modules, the concept of, I'm going to pass data to this thing and make it reusable. All of that stuff really seems to be, like, natural follow-ons of, like, well, how do we organize this project by function or by purpose and like, just really think through that, right?
22:07 Daniel Chen: Yeah, and it doesn't even have to be as complicated as oh, we're doing proper software engineering and like, we need to create a Python package. Like, that can all be deferred to much later, 'cause usually, what ends up happening, you mentioned, like, hey, I'm a scientist. I found out about Jupyter Notebooks. It's a really cool tool; the people taking pictures of black holes are, like, using them. So yeah, you have all these tools, and the scenario is like, hey, it's great that you're using a programming language to work with data. Excel is a great GUI for data, but it has its limitations. Cool, you are now using a programming language. What, where can we go from there? And like, when you are in that beginning state, just to make everything work, like you dump everything in one folder. You have your Jupyter notebooks, you have all your scripts, all your data.
22:55 Michael Kennedy: Your data files, yeah. If I say load this, I just want to say the file name. I don't want to have to think about, like, where that's relative to the other on some server or something like that, right?
23:05 Daniel Chen: Yeah, and then as an academic, you might have like a Word doc in there, or maybe a LaTeX file, and then you compile that thing, and it very quickly becomes this folder with hundreds of files and you can't find anything, and that's when you just start, end up, you know, maybe the word final comes into, like, the beginning of the file name just so you can find things, right?
23:24 Michael Kennedy: Yeah, I was going to say, it already sounds bad, and then if you start trying to do version control by, like, having multiple files named the same thing, then you're really pushing your luck.
23:32 Daniel Chen: Yeah, so the most important thing, I think, like, if you're at that point, where can you go next, right? It's always like trying to do things incrementally, like how do you make your life 10% better each time? And then it's like a nice way, especially if you're like, brand new grad student or you're in science, but like, you've never really learned programming, like, where can you go from there? It's useful to have some kind of guide or path that you can follow or think about to, like, make yourself better and do these things more efficiently.
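The project-organization idea discussed here can be sketched in a few lines of Python. This is only an illustration, not the layout from any of the papers mentioned: the folder names and the make_project_skeleton helper are hypothetical, and the point is simply that data, notebooks, scripts, results, and write-ups each get their own place instead of sharing one folder.

```python
from pathlib import Path
import tempfile

# Hypothetical folder names; adapt them to your own project.
PROJECT_DIRS = ["data/raw", "data/processed", "notebooks", "src", "results", "doc"]


def make_project_skeleton(root):
    """Create an empty analysis-project layout under `root`; return the folders."""
    root = Path(root)
    created = []
    for name in PROJECT_DIRS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)  # safe to run more than once
        created.append(folder)
    return created


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in make_project_skeleton(Path(tmp) / "my_analysis"):
            print(folder.relative_to(tmp))
```

With a layout like this, a notebook in notebooks/ would load its input through a relative path such as ../data/raw/ rather than relying on everything living side by side.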
24:01 Michael Kennedy: Yeah, so let's talk about some of the programming things that you can think about. One of the ones that you have is like, try to make your code easy to read.
24:10 Daniel Chen: Oh, yes. So one of the things I talk about on programming is like, make things easy to read. Do things in steps, don't try to write one for loop that has a whole bunch of side effects going on, right? Like things should just be incremental. Just to take a cue from education, like, we as human beings can only carry, I think the number is like four, plus or minus three, objects in our mind at the same time, like roughly seven. You should pretty much follow that, too, when you're programming. You shouldn't have to have, I mean yourself, or potentially another reader, try to carry 10 different things going on at the same time. It's just not helpful for...
24:47 Michael Kennedy: Like maybe an example is, I'm trying to go through a loop and I'm really trying to do three things, like as I get the data, then I compute something with the first step, and then I do some other filtering and I do another thing. I could try to cram that into one giant loop, or maybe it should be three separate little loops, one that cleans the data, one that does that computation, another that then filters it, right? Three loops sounds like a better step than one giant loop trying to do it all.
25:13 Daniel Chen: Yeah, or you can, more in education framework, or you can, like, group things together, and in programming, the way we group things together is like, write functions, so then you end up with one giant loop, and it's really just making three function calls, but that's easier to keep track of than, like, let's say we didn't write the function. Now we have, like, three different things scattered in our code, and you end up with a loop that's 150 lines long, and that's like, scary. 'Cause like, I see a loop, and before I even look at this thing, I'm already like, "Oh man, we are in for a ride," right, so.
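As a minimal sketch of the "one loop, three function calls" idea: the example below uses made-up temperature readings, and the function names (clean, to_fahrenheit, is_plausible, process) and the plausible range are assumptions chosen for illustration, not anything from the episode.

```python
def clean(raw):
    """Strip whitespace and convert a raw string reading to a float."""
    return float(raw.strip())


def to_fahrenheit(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def is_plausible(value, low=-100.0, high=150.0):
    """Keep only readings inside a (made-up) plausible range."""
    return low <= value <= high


def process(readings):
    """One short loop; each step is a named function call."""
    results = []
    for raw in readings:
        value = to_fahrenheit(clean(raw))
        if is_plausible(value):
            results.append(value)
    return results


print(process([" 20.0", "37.5 ", "1000"]))  # prints [68.0, 99.5]
```

The loop body stays a few lines long, and each step can be read, tested, and reused on its own.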
25:48 Michael Kennedy: Let me just give my perspective from the software development, web dev, more application side of things. It's like, if I see a function that's more than 10 lines long, it starts to make me nervous. I'm like, "There is something going on here that's probably bad." Unless there's, like, a lot of error handling and response, like, even 10 is a lot, and in the typical scientific computing bits that I've at least seen a while ago, there were more than 10 lines.
26:15 Daniel Chen: There's more than 10 lines. And so, like Jenny Bryan from the R world has this talk about code smells, and that's one of those code smells, like hey, why does it look like this? Or like, at least when you're working with data or in the PyData stack, usually you shouldn't have to write for loops in the sense of, like, if you're trying to operate on a data frame, there should be an apply call to a function. Even like, sometimes when I see loops, it's like, yes, I will write them just because something broke and I'm just trying to figure out where in my data frame, like, I have a bad value, but the final result ends up being an apply call or something.
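A small sketch of the loop-versus-apply point, assuming pandas is installed. The data frame, its column names, and the bmi function are invented for illustration; DataFrame.iterrows and DataFrame.apply with axis=1 are the real pandas calls.

```python
import pandas as pd

df = pd.DataFrame({"height_cm": [150.0, 175.0, 190.0],
                   "weight_kg": [50.0, 70.0, 95.0]})


def bmi(row):
    """Body-mass index for one row: kg / m^2."""
    return row["weight_kg"] / (row["height_cm"] / 100) ** 2


# Debugging style: an explicit loop makes it easy to inspect each row.
looped = [bmi(row) for _, row in df.iterrows()]

# Final style: hand the function to pandas and let it do the iteration.
df["bmi"] = df.apply(bmi, axis=1)

# For simple arithmetic, operating on whole columns is usually shorter still.
df["bmi_cols"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

print(df)
```

All three approaches compute the same values here; the loop is mostly useful while hunting for the row that breaks.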
26:52 Michael Kennedy: Yeah, it's interesting, 'cause a lot of the libraries, NumPy, Pandas and whatnot, can, they do the looping and they do it much faster and more efficient than you will in Python.
26:59 Daniel Chen: One of the cool things that I teach during the data science part, is like, when we go over applying functions, if you're doing numerical computations, like, just the NumPy decorator for vectorize or the Numba decorator for vectorize, just wrap the decorator around your computation function. It pretty much for free gives you order of magnitude speed improvements, and so it's like, it's way better than just you trying to optimize this thing yourself, right? And that's like one of the other programming things, it's premature optimization is like the bane of all evil or whatever. Just write the thing you want, especially if there are, like, loops. Python has many mechanisms to help you with that and make it faster, pretty much for free.
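A hedged sketch of the decorator approach mentioned here, assuming NumPy is installed. The safe_ratio function is a made-up example. One caveat worth stating: NumPy's own documentation describes np.vectorize as provided primarily for convenience rather than performance, so the large speedups are more likely to come from Numba's vectorize decorator, which compiles the function. Numba is not used below so that the example runs with NumPy alone.

```python
import numpy as np


@np.vectorize
def safe_ratio(a, b):
    """Scalar logic with a branch, written for one pair of numbers."""
    if b == 0:
        return 0.0
    return a / b


numerators = np.array([1.0, 2.0, 3.0])
denominators = np.array([2.0, 0.0, 4.0])

# The decorated function now accepts whole arrays and applies the
# scalar logic element by element.
print(safe_ratio(numerators, denominators))
```

The function body stays plain scalar Python, which keeps it easy to read; swapping in a compiled decorator later would not require rewriting the logic.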
27:42 Michael Kennedy: That's definitely cool. I love this idea of code smells. I'm fascinated by it, I want to come back to it, but another thing I want to throw in there that kind of I feel like, is in this realm, is like the idea of reusability. You can write code so that it's easily reusable or that it's not so much so, like I could write a function, but maybe I have a bunch of global variables that I'm still using, and it makes the function, like, it moves the code away, so I understand that it's like, it's more compact and more readable, but it doesn't necessarily make it reusable, so thinking about, like, how do I parameterize these things and make them something that I can use in other situations? Or once you solve this problem in this way, like, I never have to think about this again. I just now use this in the other part, and that was rough, but that was Friday and I don't have to think about it ever again. Like, that's a pretty good principle I think here as well.
28:29 Daniel Chen: Yeah, and even when you're writing your functions, you can write your function for your use case now, and for example, it's like a function that is a regular expression parser for like a US telephone number, which is, if you try to write one of those, it's like, way more complicated than it ever needs to be.
28:45 Michael Kennedy: It's like final exam in regular expression 101, so like it's really, like, way worse than it should be.
28:51 Daniel Chen: Yeah, you'll write your function with that in, and one of the things I end up doing is, even if I have hard coded things within the function, and then I realize later on, like, oh wait, I pretty much need to run that function again, but instead of like, the second index, I need the fourth index or whatever. You can make backwards compatible functions or code by saying oh, I'm just going to create a default parameter in my function that's going to default to the one that already works. But then now I can just reuse that function later on and just change that value. Simple things like that that you don't have to rewrite the function just for your second use case, right? Like I talk about, like, if you ever hit Control + C on your computer, you'd better be paying attention when and how many times you're hitting Control + V, right? And if it's more than three times, you're probably doing something wrong.
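A minimal sketch of the default-parameter idea Daniel describes here. The function name `pick_field` and the sample record are hypothetical, made up for illustration; the point is only that a previously hard-coded index becomes a keyword argument whose default is the value that already worked.

```python
def pick_field(line, index=1, sep=","):
    """Return one field from a delimited line.

    `index` stands in for a value that used to be hard coded to 1
    (the second field). Keeping 1 as the default means every existing
    call keeps working unchanged.
    """
    return line.split(sep)[index].strip()


# Hypothetical sample record, for illustration only.
record = "2019-08-06, Daniel, Chen, PyCon, pandas"

# Original use case: old calls still get the second field.
first_name = pick_field(record)

# Second use case: same function, fourth field, no copy and paste.
event = pick_field(record, index=3)
```

Because the default reproduces the old behavior, nothing that already called the function needs to change.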
29:39 Michael Kennedy: Yeah, for sure, for sure. One of the things I think would be nice, you talked about premature optimization and all these performance stuff. What is your recommendation around, like, how you structure your code? So a lot of times, I imagine that the data science stuff has pretty much, like, there's a Jupyter notebook and most of the code, like the supporting functions are kind of the beginning, and then later on, they're kind of using them and so on. When do you tell folks to break out, like, separate Python modules that you could load into your notebooks, and like, what's the, how do you think about, like, different module files versus notebooks and things like, you can apply refactoring tools really easily to a bunch of files using PyCharm or things like VS Code, not so easily in Jupyter, right? So where's the balance there?
30:29 Daniel Chen: Yeah, so the thing with Jupyter Notebooks is, yes, there was a talk at JupyterCon about why Jupyter notebooks are bad. I have this love-hate relationship with Jupyter Notebooks, but one of the things I can say, so Rachael Tatman from Kaggle, she gave an R-Ladies meetup talk in 2018 about putting together a data science portfolio, and one of the things in there is like, the Jupyter Notebook is great, but most of the time, you probably are just interested in, like, the figures or tables that's being generated, especially if you're taking this into a meeting, right? Like, no one wants to scroll forever to get to the bottom of the notebook because the first 3/4 is cleaning code. I sort of, like, got into this sort of workflow of like, I'll use the Jupyter notebook to test things out in my data cleaning pipeline, but the actual data cleaning stuff all go into, like, Python scripts, so what ends up at the end of the day, what happens in the Jupyter notebook is like, pretty much load the libraries I want, load the data I want, maybe there's a few functions that's specific to the figures I need, and then just the figures and tables I need, so my Jupyter notebooks are pretty small and that, down the line in terms of other software engineering practices, that just makes the diffs, and through Git, just way more manageable if I start making changes. So if you end up with massive Jupyter notebooks that are, a lot of it is just data cleaning code, you would think about moving that out to other notebooks or other files, just so you have more files. I'm in the camp of pretty much, in a lot of academic or scientific use cases, maybe not in physics when they're working with sensor data, but file I/O is not that big of a bottleneck, so like, I will have more scripts and more files that just write out data just to have another script and file read it back in. But that just breaks up my thought processes into smaller manageable pieces.
32:28 Michael Kennedy: That's interesting. It's like a little bit of a cache as well, right? Like you can take the step N and go to N plus one, and iterate on how that happens, without rerunning all the stuff, right? 'Cause you just reload that file that you saved. That's cool.
32:40 Daniel Chen: Yeah, yeah, exactly. So like, this goes down into project template world where I'll have a data folder, and our data folder will have an original data folder. That is where the data we download goes. Never make changes to your raw data, and then everything else gets modified with a script. I'll have, like for example, a script that reads in one of my original datasets. I'll do my first step of processing, like maybe it's like, oh, fixing missing values, and then I'll immediately write it out to somewhere in, like, under data and processed 'cause it's now a processed dataset, and I want to distinguish between datasets I should just pretty much lock as read-only, versus things that I could potentially modify and delete later on. I'll have a whole series of these scripts that pretty much just, like, you'll see it. I rarely these days have scripts that are more than 100 lines long, 'cause it's pretty much read in, do this one task, write it out, and especially if you have one step that just takes a really long time, yeah, it serves pretty much as a cache, where you just save out your temporary results and then you can deal with it later without accidentally rerunning the part of your code that you didn't mean to run, 'cause now you're stuck for an hour. And that's sort of like what happens with Jupyter Notebooks as well. When we first started programming, like when I first started programming, it was just like, I just need this stuff to run, so I'll run cell one and then jump to cell 10, and then I'll run cell one again and then jump to cell 15, and then I can scroll all the way down and get my plot, right? And then it's like, how am I supposed to, how do you document something like that, right? And that's sort of one of the drawbacks with the Jupyter notebook, is yeah, the execution order isn't guaranteed to match what was written.
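The "read in, do this one task, write it out" pattern can be sketched roughly like this, using only the standard library. The folder names, file names, and the `age` column are assumptions for illustration and do not come from the episode.

```python
import csv
from pathlib import Path


def fill_missing(src, dst, column, default="NA"):
    """Read src, replace empty values in one column, write the result to dst.

    The source file is only ever read, so the raw data stays untouched.
    """
    with open(src, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    for row in rows:
        # Treat None (short row) and whitespace-only values as missing.
        if not (row[column] or "").strip():
            row[column] = default

    dst = Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    with open(dst, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    # Hypothetical layout: raw data is read-only, processed data is disposable.
    raw = Path("data/original/survey.csv")
    if raw.exists():
        fill_missing(raw, "data/processed/010-survey_filled.csv", column="age")
```

The next script in the series would start by reading the processed file, which is what makes the written-out file act as a cache.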
34:18 Michael Kennedy: It's a little bit like a goto.
34:20 Daniel Chen: Yeah, it's pretty much like a goto, yeah.
34:22 Michael Kennedy: Which is kind of bad.
34:23 Daniel Chen: Except it's not even, like, documented, right?
34:24 Michael Kennedy: Yeah, at least it doesn't even say "goto 20," it's just like, they went to 20.
34:28 Daniel Chen: Yeah, and then when you execute it, it turns to 21, right? So like, it doesn't even, like you don't even know what 20 is, right? So if you end up in a situation where you're running bits and pieces of a code all over the place, that's a sign of like, wait, let's fix this now. It's pretty cheap to create a new file and let's do all the data cleaning or the parts I need for this figure, maybe that could just be in one thing. And then more and more as you find pieces that are, that need to be reused, you'll, oh, maybe I can turn this into a function. Then you'll put that as a module, and I would say, even if it's a module, just leave it in. If the folder structure is pretty much you have a data folder and an analysis folder and an output folder, where output is like your figures and stuff, at first, it's okay. You can have your modules in your analysis folder, and so you can still say import something and it'll still import properly in that sense. You don't have to, like, just go and make a Python package right away, because at least in what I've seen, is sometimes your analysis, it's not really going to be reused across projects. You don't need the overhead of writing a Python package. It's when you, for example, if you're querying, if you're doing some study on code in GitHub, for example, and you write your own GitHub querying API call stuff, and then you realize this one is part of one giant grant with many different analyses that need to happen. Maybe your GitHub querying code will have turned into a package, because you're actually reusing it. You don't have to turn everything into a Python project. You don't have to do that to do it like, quote-unquote correctly.
35:57 Michael Kennedy: Yeah, and I feel like the value from going from just some huge notebook or some huge script file, and then moving that into modules that have functions that you can import and now run and whatever, that's like 85% of the way, right? Whether or not you can pip install a thing, it doesn't matter. You know, there's a lot of overhead to make something super reasonable, to make it documented. Maybe if you're in academics, maybe that's a cool project for, like, a senior undergraduate person. Like hey, you know what? You know Python, why don't we take this and turn this into an open source project, and that can be your project, right? Like, I'm not sure it's a great research, time and energy, in general.
36:33 Daniel Chen: Yeah, well, so more and more, there's very recently pyOpenSci is an organization that sprung up, and it's trying to mimic rOpenSci and it's essentially, like, supposed to be a repository of Python packages made towards making science better for some scientific use case. And all of those are going to be reviewed by somebody and it fast tracks you for if you want to write a paper based off of that software package, it'll fast track you into JOSS, which is The Journal of Open Source Software.
37:08 Michael Kennedy: Yeah, and I had them on the show as well, quite a while ago, yeah.
37:12 Daniel Chen: So now you have at least the incentives are more or less lined up, right? 'Cause before, like, if you were just maintaining a software package, you know, what are your academic incentives? Because a lot of that is still around publishing and grants, so at least now there's, the incentives are now lined up where even though you are writing a software package, you can now write a paper about it.
37:31 Michael Kennedy: Yeah, it may generate a paper which might help you with your tenure and so on. I guess, let me take a step back really quick on my statement. Like, it might not help you in your academic career directly to spend the software engineering time, but it may help you significantly in your research if you can publish something and then you get other researchers to start using it, right, it becomes a package that you have more contributors to, right? Maybe you have one student you can fund part time and all of a sudden, there's 20 institutions, like, all working, like that can be a huge benefit. But I think a lot of stuff is so specialized, so tied to your data and your particular problem, like you say, your first thought shouldn't be, how do I open source this as a package? It's like, how do I just make this a decent software project?
38:13 Daniel Chen: Yeah, and that's a pretty lofty first goal, too, like how do I make this work properly for myself, right? 'Cause then that, you go into the route of, like, okay, I should write tests for this, just to make sure it's at least behaving correctly. There's a bunch of incentives as well for just having an open source project and trying to get other people to play with it because you'll build out the functionality for the thing you built, and as functionality expands, you'll sort of get more and more people in. And it sort of ties back to, like, the Python community is great, and so now you are embracing the broader Python community and now you have more and more resources or people you've met to help you with your own project. If you're at PyCon or SciPy, you can have your own sprint for your software project just to have other people try this out. You end up building your own community off of your little software project, which is, it makes you feel good, and it's still also advancing science and a lot of science is also communication and you built this stuff to help other people, so you might as well try to make it easier for other people to help you as well.
39:19 Michael Kennedy: Yeah, it could definitely help your career as well. I mean, people like Wes McKinney, Jake Vanderplas, Travis Oliphant, like folks like that, like, they're legitimate big names in the whole Python space in general, and a lot of that came from these academic projects and whatnot, so that's pretty cool. This portion of Talk Python to Me is brought to you by Rollbar. Got a question for you. Have you been outsourcing your bug discovery to your users? Have you been making them send you bug reports? You know, there's two problems with that. You can't discover all the bugs this way, and some users don't bother reporting bugs at all. They just leave, sometimes forever. The best software teams practice proactive error monitoring. They detect all the errors in their production apps and services in realtime and debug important errors in minutes or hours, sometimes before users even notice. Teams from companies like Twilio, Instacart and CircleCI use Rollbar to do this. With Rollbar, you get a realtime feed of all the errors, so you know exactly what's broken in production. And Rollbar automatically collects all the relevant data and metadata you need to debug the errors so you don't have to sift through logs. If you aren't using Rollbar yet, they have a special offer for you, and it's really awesome. Sign up and install Rollbar at talkpython.fm/rollbar and Rollbar will send you a $100 gift card to use at the Open Collective, where you can donate to any of the 900 plus projects listed under the Open Source Collective, or to the Women Who Code organization. Get notified of errors in realtime and make a difference in open source. Visit talkpython.fm/rollbar today. Before we move off, I don't want to drop this idea of code smells, because first of all, I love this concept. It's just so, such a good visualization of like, what can be wrong with software, but not broken with software. 
Because a lot of times, you think of like, well, my code now works, but what should I do? And I think the code smells is a very practical thing, just for folks listening, like, code smells, the idea is, the code is working. It's not broken, but when you look at it, you try to read it, like your nose literally could kind of curl up, be like, there's something wrong with this. I guess it works, but it's not good. It's really not good, right? Like a 300-line function, not good. Like, it works, but there's something wrong, and I know this mostly from Martin Fowler's work back in 1999 when he wrote "Refactoring," and this was sort of the introduction to like, how do you know when to refactor? Well, you look for the places that make your nose turn up. Go, ooh, what do we do with this, right? Like, oh, there's a 300-line function. That's bad, what can we do about that? Or here's a function taking 20 parameters. That's really horrible. You know, it's really easy to switch this integer for that integer, and how do you know when that happens? So what would you do to make that better? And there's just a bunch of, but I only know this through the sort of software engineering side of things. And this presentation that you talked about here, which was Jenny Bryan, right? She has some really interesting tips from the data science perspective, right? Yeah, so the first one is, do not comment or un-comment sections of your code to alter behavior, 'cause you want to try different stuff out.
42:35 Daniel Chen: Yeah, and that's like a very common thing, right? Like, the easiest case where that happens is if you are in a collaboration environment. You have five people, you have five comments of data loading because everyone hard coded a data path, right? And so like, there's literally you commenting out code just to load the data set across, depending on who you are, right? And then you end up, like, if you end up using some kind of version control system, the vast majority of your commits are just like, it's-my-turn. You just have this one bit of these couple lines that are just, like, committing...
43:16 Michael Kennedy: Just cycling.
43:16 Daniel Chen: Just cycling back and forth.
43:17 Michael Kennedy: Yeah, so I mean, what is the fix, right? The fix would be to do something where you have those proper structures you already talked about, and then you use something like os.path or pathlib and you compute the relative path over to that, and then you generate an absolute path and you run it from, right, that would work for everybody long as they all check out the same general structure, which sounds like Git.
43:36 Daniel Chen: Yeah, and they dealt with this in the R world with these two packages: rprojroot, like for the root of an R project, and here, as in, find this file using here as the root path, or something. And that's sort of like my contribution to all of this. I wrote a package called pyprojroot that tries to mimic the same functionality, because relative paths work if you are just working with scripts and stuff, but the second you have some kind of folder structure, or you have a Jupyter notebook, you'll sort of realize that a Jupyter notebook doesn't care that you have a folder structure. Like, the second you're in it, the working directory is now wherever the Jupyter notebook is, not whatever folder structure you've very carefully pieced together. And so this was like an attempt. It's not a very complicated function. It literally takes like, oh, what is your working directory? And it'll recursively go up through the parents, checking for special files like .git or a .here file, and then it'll prepend that to whatever path, just so, like, you can now use relative paths in the Jupyter notebook, just like you would in a script so you can avoid the...
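The lookup Daniel describes (start at the working directory, walk up through the parents, stop at a folder containing `.git` or `.here`) can be sketched with `pathlib`. This is an illustration of the idea only, not pyprojroot's actual API; the function names `find_project_root` and `project_path` are hypothetical.

```python
from pathlib import Path


def find_project_root(start=None, markers=(".git", ".here")):
    """Walk up from `start` (default: the working directory) until a
    folder containing one of the marker files or folders is found."""
    current = Path(start) if start is not None else Path.cwd()
    current = current.resolve()
    for folder in (current, *current.parents):
        if any((folder / marker).exists() for marker in markers):
            return folder
    raise FileNotFoundError(f"no marker {markers} found above {current}")


def project_path(*parts, start=None):
    """Build a path relative to the project root rather than to wherever
    the notebook or script happens to be running."""
    return find_project_root(start).joinpath(*parts)
```

With something like this, a notebook in an analysis subfolder and a script at the top level could both ask for `project_path("data", "original")` and resolve to the same place, so nobody needs a hard-coded personal path.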
44:47 Michael Kennedy: That's pretty cool.
44:49 Daniel Chen: That problem as well, the commenting in and out.
44:51 Michael Kennedy: Yeah yeah yeah, that's cool, and I'll definitely link to that project that you built. Tip two, use if and else in moderation, which seems pretty good. Number three is pretty straightforward: use functions. I mean just do, it's a good idea. You should do this.
45:05 Daniel Chen: Yes, and like, even when you're writing a function, it's okay to have a very complex function, and even complex functions don't need to be written all in one go, right, like you can break up your function, even though it does a very complicated task. There's probably small sub-tasks, and your function can call other helper functions. It's not just like, oh, this is a really complicated thing. Let me just write a function for it. As you're writing the function for it, like that's one of the other code smells as well, like if I have a 100-line function, that's kind of scary. You couldn't break this down into smaller pieces? Like, that's kind of weird. And so having helper functions that feed into a larger function, is also how you fix that code smell.
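As a rough illustration of small helpers feeding one larger function, here is a sketch built on the phone number example from earlier in the conversation. It is deliberately simplified (it only normalizes 10-digit US numbers) and all three function names are hypothetical.

```python
import re


def digits_only(text):
    """Drop everything except the digits."""
    return re.sub(r"\D", "", text)


def strip_country_code(digits):
    """Remove a leading US country code '1' from an 11-digit number."""
    if len(digits) == 11 and digits.startswith("1"):
        return digits[1:]
    return digits


def format_us_phone(text):
    """Orchestrate the helpers; return 'XXX-XXX-XXXX', or None when the
    input does not contain exactly 10 usable digits."""
    digits = strip_country_code(digits_only(text))
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
```

Each helper is small enough to test on its own, and the top-level function reads as a short list of steps.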
45:49 Michael Kennedy: Yeah, absolutely, and obviously, that makes testing way easier, 'cause you test little bits, and then, you know, I mean, test kind of the orchestration of them, and you're good. Another one that I'm a huge fan of, it's like a serious pet peeve of mine, is to have quick returns near the top guarding clauses, guard clauses. If you've got a function that's indented, and then it's got a loop, and then it's got an if, and then another if, and then another if, and it's just, like, way to the right. If you're scrolling to the right, you're doing it wrong.
46:20 Daniel Chen: Yeah, yeah, and during PyCon, I actually just bought your entire encyclopedia of training, and I forgot which one, I think it was the How to Write Your Python Code Like an Experienced...
46:32 Michael Kennedy: Pythonic Code or something, yeah, yeah, that one.
46:34 Daniel Chen: Yeah, so I remember that chapter, yeah, like don't write nested if statements, like essentially, write them inside-out, so like it's flat.
46:42 Michael Kennedy: Yeah, exactly, do them backwards, yeah, yeah.
46:45 Daniel Chen: So that was something that was just like, that's what you should do.
46:49 Michael Kennedy: It makes it so clear, and it's not very commonly taught, I don't believe, so these are called guarding clauses, and the idea is, instead of testing for a good condition and then another good condition, and another good condition, and then doing the thing, which puts everything way on the inside, you test for all the bad conditions first and you just bail out, and then what you're left with is a non-indented, simple bit of code, which is what you're actually after. So it's really clear what you're testing against, and then once you're past that, here's the simple thing we do, I love it. So that was one of her tips as well, it's a nice one. Yeah, she's got some great little examples there. Some stuff on object orientation and so on, but yeah, these are really good. And, you know, switch, which doesn't apply as much to Python. I actually wrote a switch language extension for Python using the context manager with block. It's pretty awesome, but I'm not going to get into that 'cause that's a whole different debate. But I do think this idea of code smells is really interesting and you should think about them for data science, 'cause it looks like there are different data science smells that are more common than, say, in standard software engineering. If you're doing database programming or whatever, you get a different style there.
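A small sketch of the guard-clause style Michael describes: check the bad conditions first and bail out, so the real work sits unindented at the bottom. The function `mean_of_column` and its inputs are hypothetical examples, not code from the talk.

```python
def mean_of_column(rows, column):
    """Average a numeric column from a list of dicts."""
    # Guard clauses: handle every bad case up front and leave early.
    if not rows:
        return None
    if column not in rows[0]:
        raise KeyError(f"unknown column: {column}")

    values = [row[column] for row in rows if row[column] is not None]
    if not values:
        return None

    # The actual work, flat and easy to read.
    return sum(values) / len(values)
```

Written the nested way, the final line would sit three `if` levels deep; here each precondition is visible on its own line.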
47:54 Daniel Chen: Yeah, and just for other programming-related things and how you can structure your projects, Jenny Bryan also has this talk about, like, how do you name your files? It's kind of interesting 'cause, like, if you think about these common problems long enough, everyone pretty much just converges to the same set of solutions. I remember coming up with, hey, I should just name things this way, or like, set up my folder this way and then, like, all of a sudden, Jenny Bryan gives a talk at a big R conference, like wait, that was like, I feel validated that I didn't come up with something nonsensical. Other people as well, like they write packages sort of like a cookie cutter, just set up projects and it's pretty much like the same way, and one of them is like, oh, how do you name your files? Right, like, and especially in analytics, there's clearly an order you should run this stuff in, so one of the ways of like, how do you name your files, is prepend a number to them, right? So you can say, like, one-dash and then the script and that's the order you write it in. If you want to do better, you say zero one, so like 10 and one doesn't get sorted improperly. And then if you really want to go one step further, I started this habit of having a three digit number, so like 010, and that gives you a buffer room to insert something in the middle, or if you forget something, or you realize...
49:12 Michael Kennedy: That's like the 10, 20, 30 in BASIC. You'd be like, what if you've got to put a line in between that and you've got to go to 30 still? Well, you do 19, whatever.
49:20 Daniel Chen: Yeah, and I found that out because, like, that's how sort of some of the files in Linux, in the order of how it loads up services or something, it's defined in those three-digit numbers, and I was like, "Oh, this is interesting, I should do that." It just saved me from renumbering a whole bunch of stuff.
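The sorting problem and the three-digit fix are easy to see with plain string sorting, which is how most file listings order names. The file names below are made up for illustration.

```python
# Unpadded numbers: string sorting puts "10" ahead of "2".
unpadded = ["1-load.py", "2-clean.py", "10-plot.py"]
unpadded_order = sorted(unpadded)

# Three-digit prefixes in steps of ten sort correctly and leave gaps.
padded = ["010-load.py", "020-clean.py", "100-plot.py"]

# A forgotten step slots in between without renumbering anything.
padded_order = sorted(padded + ["015-dedupe.py"])

print(unpadded_order)
print(padded_order)
```

In the unpadded listing the plotting script lands ahead of the cleaning script, which is exactly the mis-ordering the zero padding avoids.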
49:38 Michael Kennedy: Yeah, that's cool, I mean, just thinking about the structure is quite interesting.
49:41 Daniel Chen: At the end of the day, even though you have all this structure for your analytics project, because everything is nice and in some kind of order, if you do, for example, want to create a Python package, it's already there for you, right? Like you can create another folder that's the name of your module, put a setup.py file. You could have the ability to set that up, and now you can pip install -e, and then any time you edit that file, your analysis will still work, and that's pretty cool. The other thing with project structure-related stuff, like if you have things numbered, at the end of the day, everything comes down to a DAG compute system, and so because you have your stuff in order and there's properly-defined inputs and outputs, you can use a makefile or a simple script as like a poor man's make file. But then you end up in the situation like, oh, that's where Luigi and Airflow come into play. They're pretty much just DAG executors. Like I said at the very beginning, setting up your project is pretty much like the gateway drug into all of this other cool technology, 'cause you would've set everything up in such a way that you then use those tools when you hit that point where you need it, and it's like a nice way to slowly improve, do self-improvement stuff, but then you also end up using all the cool stuff that you see at these big conferences as well, so.
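One way to picture the "poor man's makefile" idea: a step only needs to rerun when its output is missing or older than one of its inputs. The sketch below shows just that check, based on file modification times; the function name `is_stale` is hypothetical, and real tools such as make, Luigi, or Airflow do considerably more.

```python
from pathlib import Path


def is_stale(output, inputs):
    """Return True when `output` is missing or older than any input,
    meaning the step that produces it should be rerun."""
    output = Path(output)
    if not output.exists():
        return True
    out_time = output.stat().st_mtime
    return any(Path(p).stat().st_mtime > out_time for p in inputs)
```

A simple runner script could loop over the numbered scripts in order and skip any step whose output is not stale, which gives the caching behavior discussed earlier without any extra dependency.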
51:01 Michael Kennedy: Yeah, that's really cool, and of course the structure gets you just that much closer to trying it out. Now, what do you think about papermill and some of these concepts? Are you familiar with papermill?
51:11 Daniel Chen: Yeah, papermill is, I think that's the Netflix.
51:14 Michael Kennedy: Yes, it lets you basically turn a Jupyter notebook into something that can receive inputs and then have the outputs almost like a function or a module or something like that.
51:22 Daniel Chen: So I personally haven't used it. That's mainly because when I started working, papermill wasn't really a thing at that point, so like, I had migrated out into, like, let's just make everything a Python script, because that has no dependency, and we can just execute things that way. And then the notebook itself just becomes, like, hey, this is the report. In some sense, I can see, for me, I guess like the next time I start an analysis project, like, I probably will use papermill just because it's like, oh, it's this cool technology and I've set up my folder structures in such a way where I can now use it, right? So I've heard of it, but I personally haven't used it yet.
52:01 Michael Kennedy: Yeah, I haven't used it either, but it sounds pretty interesting. It sounds like Netflix, like you said, is doing really interesting stuff. To me, one of the things that sounded special, it made me go, "Okay, well, maybe that is worth considering, even though it's like, not necessarily my style, right?" Is, if you have a big long sort of pipeline of operations, and each one is its own Jupyter notebook, if it fails, you can save, you basically keep the notebook as it was computed, laying around, so you can just open it up and you have basically a history of what happened and then what failed, which just sounds like a pretty interesting way, 'cause if you switch it to scripts, which I'm all for, but you end up with, you know, it exited without, with like, not code zero. Oh, that's bad, right? Like, what does that mean? Like I forgot, I don't even have logging or any of these things, right, like what happened? Why did it not work? So I do think there's some interesting stuff happening around there, but I do also feel like the software engineering tools you have apply really well to modules, right? Like it's easy to run that through pytest. It's easy to run that through a profiler. The refactoring tools work on those, not that you can't do some of that stuff with notebooks, but it's easier to use them on files.
53:13 Daniel Chen: Yeah, and especially if you're checking things into version control, that's sort of like the one thing. My main gripe with the notebooks is like, every time I make a change, I have no idea what's going on in diff, and it's just like, yeah, just add and commit. Like, I think it's right.
53:31 Michael Kennedy: Let's see, do you accept their changes, or your changes? Uh, my changes.
53:35 Daniel Chen: Or like if I just want to open the notebook, so there's this program called nteract, which at least is like a desktop version so I don't have to fire up a server and then open a notebook that way. But yeah, like sometimes, I just want to double-click this thing just to see it. I don't want to open up a terminal and launch everything just to see something, so it was little things like that where I was like, I'll try to do as much as I can in the script, and then, like, everything else goes into a notebook, and then in the notebook, I still save out the things I want, just so I have that easier way to access figures or tables without having to look at the entire notebook.
54:13 Michael Kennedy: Yeah, I guess that is one of the challenges, is the whole diff thing. Maybe we could talk, we're kind of getting long on time, but there's a lot of interesting stuff to cover, so I'll ask you a few more questions. Let's think a little bit about collaboration, like you talked about the anti-pattern of having, like, Sarah's path, Dan's path, Michael's path, whatever, and just commenting them out and which one is active at the moment, right? But there's probably some other stuff for collaboration, like are you using Git, are you using some online shared notebook that's kind of like Google Docs? What are your thoughts around that kind of stuff?
54:48 Daniel Chen: So Google has something called the Colaboratory notebook which essentially, like Google Docs, but gives you a Jupyter Notebook system. That's pretty cool in the sense that, like, yeah, we won't have this commenting out of random lines because everyone's really just working on the same place. Like, that's really nice for collaboration. I still think that you need some form of version control, like that is, I think at this day and age, it's pretty much required, especially when programs start to get more and more complex. Like, you need a way to fall back on. The nicest feature I use in Git is like, I write something, everything is broken, and I just say Git reset and I just pretend I never did that and I just start over.
55:32 Michael Kennedy: Yes, exactly. Like, that was a really bad idea, please revert that. Okay, now we're good, and it lets you be more exploratory. It lets you be more aggressive in trying change. You're like, "This might not work, but if it works, it's going to be awesome." And try it, actually, that didn't work, revert. Or, you know, maybe it's a little more forethought. You create a feature branch to explore it. You do it, then you're like, "Eh, forget that." That was a bad branch, we're just going back here, like, let's not do that, right? But it's a really great feature.
55:57 Daniel Chen: Yeah, and just along the lines of collaboration stuff, like, make small incremental changes and that's like the actual stuff. That's the code that will actually get reviewed, right? Like no one will review a codebase where you're like, at the end of the paper, the entire submission relies on this codebase and you're like, "I need someone to review this thing," right? Like, there's no way that's going to get a proper review, and so just in general, for like, doesn't even have to be in research or science, it's a good habit to make small incremental changes and then maybe that's what your weekly meeting is about, is just like, this is what I did this week. Someone press the green button to merge this in, because that will actually be reviewed and then you'll have a discussion around that point, like all of that stuff. For me, I personally, I'm not in a managerial position, so like, those are the types of meetings I find productive where I can actually talk about, this is what I did. This was the implementation. This is what I'm thinking about next, and then have a conversation around that, 'cause it can still be productive and you can still have, like, talks about longer goals, but you also now have the benefit of someone else looking at your work to make sure it doesn't have, like, a bad code smell. You know, maybe you're off by a factor of 10 and no one's going to notice that in like 900 lines of code, but they will if it's just, like, 20 lines of code, a change like that is much easier to find.
57:19 Michael Kennedy: Yeah, that's definitely good advice. I definitely recommend working in small little bits and changes and, you know, make some small change, do a Git commit, make another small change, a little Git commit, right. Like, don't wait until the end of the week, or like you said, 'til the end of the paper, and like, all right, time to check it in. Like, no, not a good idea.
57:36 Daniel Chen: One of the things I wish existed more in academia is just having more resources to do pair programming, 'cause usually, people are assigned one project and there aren't two people assigned to the exact same bit, which is what you really need for pair programming. When I was co-instructing the summer program in my previous lab, I would sit down next to students and I would pair program with them through some kind of data-related work, and it's super valuable for them because they actually get to see how I'm thinking about, like, this problem, and I'll say, like, "You're doing a join of two tables. Yeah, make sure that, like, the keys don't have duplicates if you're not expecting duplicates, right?" That's like one of those things of like, yeah, the code ran, so I'm just going to keep going, right? You don't realize that you just did, you just did a Cartesian product, and now you have a million rows and you don't know why, but you're just going to keep going with it.
58:27 Michael Kennedy: Why is it taking so long?
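The join pitfall Daniel describes can be sketched in a few lines of Pandas. This is a minimal illustration, not code from the episode: the `patients` and `visits` tables and their columns are made up, and the `validate` argument of `DataFrame.merge` is just one way to make the check explicit.

```python
import pandas as pd

# Hypothetical tables: patient_id 2 is accidentally duplicated on the left side.
patients = pd.DataFrame({"patient_id": [1, 2, 2], "site": ["A", "B", "B"]})
visits = pd.DataFrame({"patient_id": [1, 2, 2], "visit": ["v1", "v1", "v2"]})

# The merge runs without complaint, but every duplicated key on the left is
# paired with every matching key on the right, so the row count grows.
merged = patients.merge(visits, on="patient_id")
row_count = len(merged)

# A quick check before joining: are the keys unique where we expect them to be?
has_duplicate_keys = bool(patients["patient_id"].duplicated().any())

# Asking merge to validate the relationship turns the silent blow-up into an error.
try:
    patients.merge(visits, on="patient_id", validate="one_to_many")
    caught = False
except pd.errors.MergeError:
    caught = True
```

Here three rows joined to three rows produce five, and on real data the same mistake is how a table quietly reaches a million rows.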
58:27 Daniel Chen: Yeah. So pair programming, yeah, it's super valuable and even now during my internship, it's, I'm on the receiving end of pair programming, but this is more on the software engineering side. It's super valuable just to see, like, oh yeah, this is how you write good code, or like, this is how they're thinking about it. And it's even stuff like, I talk about yeah, be careful where you're hitting Control + V a bunch of times, it's like oh yeah, like, this is in two different places. Let's just refactor this out, and it's like, oh yeah, I didn't catch that, and when you refactor it out, you can actually have more guarding clauses just to make this, like, an even better check. That's one of the things I wish, at least in research, like there was more budget and time for, is just pair programming, and that just makes collaboration easier, 'cause you're now just talking with a person back and forth. It just makes that whole process way nicer and smoother.
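The "refactor it out and add guard clauses" idea might look something like the sketch below. The function name `percent_change` and the specific checks are illustrative assumptions, not something described in the episode; the point is only that one shared function gives the checks a single home instead of being pasted into two places.

```python
def percent_change(old, new):
    """Return the percent change from old to new.

    Guard clauses reject bad input up front so the calculation
    at the bottom only ever sees values it can handle.
    """
    if old is None or new is None:
        raise ValueError("both old and new values are required")
    if old == 0:
        raise ValueError("percent change is undefined for a baseline of 0")
    return (new - old) / old * 100.0


# Two call sites that previously each carried their own pasted copy of the formula.
enrollment_change = percent_change(50, 75)
budget_change = percent_change(200, 150)
```

If the formula or the checks ever need to change, there is now one place to change them, and a reviewer has one small function to read.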
59:20 Michael Kennedy: Yeah, I mean, we certainly have the tools these days for it, right? You talked about Google Colaboratory, which has live, multiple editor features, kind of like Google Docs. You've got obviously screen sharing, you've got VS Code's way to watch somebody else's system across two instances of Visual Studio Code, and there are some really interesting options, but yeah. It's got to, it's like also a cultural thing, and also, you've got to have people to collaborate with on that part, right?
59:49 Daniel Chen: Right, and in the sense of, hey, maybe like when you, even though you're in this small world and you write your package, like, now you have someone to collaborate with, right? And that's sort of like socially motivating, that you have other people using your stuff, right?
01:00:01 Michael Kennedy: Yeah, it definitely feels good to have someone looking at it, interacting with what you're building, 'cause building software completely in isolation, just for yourself, it's kind of a weird place to be. It's not as much fun as it could be.
01:00:13 Daniel Chen: Yeah, it's fun when you're just in the sense of, I've got to get something like that minimum viable product. Like, that's fun, and then it's just like, as soon as you hit maintenance mode, it's like, who am I maintaining this for?
01:00:26 Michael Kennedy: Yeah, or just all like, a lot of the projects, you know, you're going to be working on it, and you kind of get the happy path mostly working, and you feel like you're mostly done, but then there's all these little loose ends. The documentation you got to write for the other people involved, all the little tests and the edge cases and just, it can just go on and on and on. It feels like I thought I was done a month ago with this and I'm still working on it, how is this not done yet? Like, I've definitely had that feeling in software and I'm sure it's just the same. You know, that was actually in a semi-research context, I'm thinking back to the, yeah. Final thought on this collaboration bit. What do you think about GitHub, like creating either a private or a public repo, using that for your work to share with people?
01:01:10 Daniel Chen: I love it. Right now, like pretty much, if I have a thought, I just make a GitHub repo, so my personal GitHub account has a bunch of projects where, like, they're pretty much empty, but they have a name, and it's just because, like, I thought of something one day and I just made a repo out of it. It's even really good for simple stuff like if you're at a conference and you just want a place to take notes that doesn't matter what machine you're on, I've taken just notes in markdown as a GitHub repository, and then during, like, a lightning talk, just be like, "Hey, I just started putting up my notes," and then maybe some people will add, "Hey, wait, this is my talk, let me put my talk in there." And you end up collaborating on some kind of notes for the conference, which is pretty cool. And for me, like, I try to, in lines of that 10% improvement every time, like originally, I just made everything in Git just because I needed more practice with it, and it was just like a nice safe place for me to like, oh yeah, like add and commit. Like, if you do it a couple hundred times, that part doesn't become scary anymore. And so...
01:02:12 Michael Kennedy: That's right.
01:02:14 Daniel Chen: It just becomes so natural, like oh yeah, when I first learned Git, it's like, why am I doing this? This is so tedious, and then it's like, now it's like, okay, whatever. But then, like, you can do other stuff with Git, which is super cool, so GitHub is like a great way to practice using Git, and then also gives you the ability to practice or get ready for collaboration, right? So even for me, even if I'm working on a personal project sometimes, like, I will do branches for myself, push branches to GitHub by myself, and I will submit pull requests to myself.
01:02:46 Michael Kennedy: Just to document it and make it really clear, like this is the reason for it, here are the files that changed and all that, right?
01:02:52 Daniel Chen: And like, I was doing that for a couple of years and like now, during my internship, like, that has become so second nature that like, I can actually do Git things and it doesn't hinder collaborating in the real world. So it was a lot of just practice that like, I just thought it was cool, but like, I didn't realize until now, it was like, wait, this is actually just like years of practicing on my own, and so in that sense, and like Microsoft essentially saved GitHub, and is, it's just as good as ever, so like, yeah. Plus plus one for GitHub all the way.
01:03:25 Michael Kennedy: Yeah, awesome, I totally agree, I totally agree. Okay, this is really interesting. I think there's a lot of concrete advice here. I'll link to the papers. I'll link to your pyprojroot project, the code smells thing, all that. We'll put all this up there and people can come back in and definitely dig into the details if it's useful for them. So before we get to the final bit of the show, though, I've got to ask you the two questions, Dan. First of all, if you're going to write some Python code, what editor do you use?
01:03:55 Daniel Chen: So I used to use Emacs with Elpy, and now I'm a VS Code convert.
01:04:02 Michael Kennedy: They've brought you over. You know, I would say like the last four shows that I've had, everyone has said VS Code, which is pretty interesting.
01:04:08 Daniel Chen: Yeah, I was pretty reluctant until, like, I had to write some Python code and I was on, I switched over to my Windows machine and I was like, "I don't have any way to edit code right now. Let's just try this thing." And you know, it worked, and so I was pretty happy with it, so I sort of just hung around. What's actually really cool is the screen sharing ability in VS Code, that does pair programming.
01:04:37 Michael Kennedy: Yes, yes, that live, I think it's called Live Share. I've never had a good chance to use it, but I've seen it and it looks amazing.
01:04:42 Daniel Chen: Yeah, I've used it with one of the other interns, and it's like, this is really cool! And they also have a voice communication mechanism, so like yet another way to do voice chat, but at least the screen, like the live coding part, like that was super cool.
01:04:56 Michael Kennedy: Very nice, yeah. Okay, great, definitely a good answer for the editor. Packages, some notable ones?
01:05:01 Daniel Chen: The notable package that I haven't heard on the show yet is one called pyjanitor by Eric Ma, and he works at Novartis, and this is pretty much his consolidation of pretty common data cleaning stuff in Pandas. And that ties to another package by Zachary Sailer called pandas_flavor, which is a wrapper around your ability to extend Pandas. And the benefit of that is, you know, if you want Pandas to have a method that it doesn't already have, like you might think, like oh, let me create another class. I'll inherit from Pandas and I'll release a package, but no one's really going to use that because it's not a Pandas DataFrame object, it's like some weird class that you created yourself, and so like, this is sort of like a mechanism for you to inject your own methods into a Pandas DataFrame object but still have a Pandas DataFrame object, without having to re-extend the class, so it's super cool.
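To make the "inject your own methods without subclassing" idea concrete, here is a minimal sketch that uses Pandas' own built-in extension hook, `pd.api.extensions.register_dataframe_accessor`, rather than pandas_flavor itself. The accessor name `tidy` and the method `snake_case_columns` are made up for illustration; pandas_flavor offers a decorator-based take on the same idea.

```python
import pandas as pd


# Register a namespace called "tidy" on every DataFrame (the name is hypothetical).
@pd.api.extensions.register_dataframe_accessor("tidy")
class TidyAccessor:
    def __init__(self, pandas_obj):
        # Pandas passes in the DataFrame the accessor was called on.
        self._df = pandas_obj

    def snake_case_columns(self):
        """Return a copy with lower-case, underscore-separated column names."""
        renamed = {
            col: str(col).strip().lower().replace(" ", "_")
            for col in self._df.columns
        }
        return self._df.rename(columns=renamed)


df = pd.DataFrame({"Patient ID": [1, 2], "Visit Date": ["2019-08-06", "2019-08-07"]})

# The new method hangs off an ordinary DataFrame and returns an ordinary DataFrame.
cleaned = df.tidy.snake_case_columns()
```

Because `cleaned` is still a plain `DataFrame`, everything else in Pandas keeps working on it, which is the benefit Daniel points to over releasing a custom subclass.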
01:05:58 Michael Kennedy: Yeah, that's really great, and yeah. The pyjanitor, I really like that one. It takes a whole bunch of imperative data frame operations and turns them into a really nice fluent API, like, DataFrame dot from dictionary, dot remove columns, dot drop not a number, drop, you know, rename column and just boom, just sloughs it all together. It's really nice, I haven't covered it on the show, but we did talk about it over on Python Bytes, that podcast, so yeah, it's definitely a cool one. It's been on my radar as well. Nice, all right, well, final call to action. People who are out there, maybe they're in science, data scientists, something like that, and they want to make their code, take that 10% step you're talking about towards the more proper engineering structured world. What do they do?
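The fluent style Michael describes can be approximated with plain Pandas method chaining. The sketch below deliberately sticks to core Pandas methods rather than pyjanitor's own, and the column names are invented for the example.

```python
import pandas as pd

# Hypothetical raw data with a missing reading and a column we do not need.
raw = {
    "id": [1, 2, 3],
    "temp": [20.5, None, 22.0],
    "notes": ["ok", "sensor offline", "ok"],
}

# Each step returns a new DataFrame, so the whole cleaning recipe reads top to bottom.
result = (
    pd.DataFrame.from_dict(raw)
    .drop(columns=["notes"])             # remove columns
    .dropna()                            # drop rows containing not-a-number values
    .rename(columns={"temp": "temp_c"})  # rename a column
    .reset_index(drop=True)
)
```

Written this way, the cleaning steps form one expression with no intermediate variables to keep in sync.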
01:06:43 Daniel Chen: For me, like, I was lucky enough to be in New York City, which is a big city, so it was always like, local meetups were always like a thing, that were very busy and you learn a lot from there, but even if you don't live in a very big city, you can start one yourself, 'cause chances are, you are not alone, and the Python community is super supportive. You can always, if you say something on Twitter, someone will give you the ways of how to start something, and if you're at a university, you can always have meetings in, like, a classroom or something, so don't worry.
01:07:17 Michael Kennedy: Right, maybe it's interdisciplinary, right? Like maybe there aren't that many people in your department, but if you go across departments, you can probably find a decent number of folks who want to attend.
01:07:26 Daniel Chen: Yeah, and so meetups are a great way to learn or meet other people, or at least just, like, ask questions about stuff, and if you can make it to any of the Python conferences, or like attend a sprint, going to a sprint was probably, like, the fastest way that I became a better Python programmer, even if it was something as simple as, like, editing a piece of documentation. Like, just seeing the mechanism of how other people collaborate on such a large scale, and then still seeing your work, like, in one of these major projects, like, that's super motivating and, like, cool.
01:08:02 Michael Kennedy: Yeah, that's really cool, yeah. It's a great opportunity, and it's also a great opportunity to rub shoulders with really prominent people, in something that you're working with, right? The maintainers of this probably important project, who are there, and you know, what better chance to get to know them a little bit than to sit down and add a feature with them, or spend a day in the room with them. Something like that, right? That really can build some connections that, you know, especially if you're in a small town somewhere and not meeting them in person, that can be a challenge.
01:08:32 Daniel Chen: Yeah, and a lot of people stay within Python because of the community, so like, I guess my final call to action comes from Greg Wilson in his book called "Teaching Tech Together." He talks about the rules of teaching how to program or like, building community, and the first rule is, "Be kind, all else is details."
01:08:52 Michael Kennedy: Yeah, that's great. "Be kind, all else is details," I agree. It's definitely high, right up there as one of the most important ones. All right, Dan, thank you for being on the show. It's been really great to talk about these ideas with you. I think there's a lot of good advice people can take away.
01:09:05 Daniel Chen: Yeah, it's been great talking with you, Michael, as well.
01:09:07 Michael Kennedy: You bet, bye! This has been another episode of Talk Python to Me. Our guest on this episode was Daniel Chen, and it's been brought to you by Indeed and Rollbar. With Indeed Prime, one application puts you in front of hundreds of companies like PayPal and Vrbo in over 90 cities. Get started at talkpython.fm/indeed. Rollbar takes the pain out of errors. They give you the context and insight you need to quickly locate and fix errors that might have gone unnoticed until users complain, of course. Track a ridiculous number of errors for free as Talk Python to Me listeners at talkpython.fm/rollbar. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course. Or, if you're looking for something more advanced, check out our new Async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite pod catcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening, I really appreciate it. Now get out there and write some Python code.