
#275: Beautiful Pythonic Refactorings Transcript

Recorded on Thursday, Jul 9, 2020.

00:00 Do you obsess about writing your code just the right way before you get started? Maybe you have some ugly code on your hands and you need to make it better. Either way, refactoring could be your ticket to happier days. On this episode, we'll walk through a powerful example of iteratively refactoring some code until we eventually turn our ugly duckling into a Pythonic swan. Conor Hoekstra is our guest in this episode to talk us through refactoring some web scraping Python code. This is Talk Python To Me, Episode 275, recorded July 9, 2020.

00:43 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is brought to you by us over at Talk Python Training.

01:05 Python's async and parallel programming support is highly underrated. Have you shied away from the amazing new async and await keywords because you've heard it's way too complicated or that it's just not worth the effort? For the right workloads, a 100-times speedup is totally possible with minor changes to your code. But you do need to understand the internals, and that's why our course, Async Techniques and Examples in Python, shows you how to write async code successfully as well as how it works. Get started with async and await today with our course at talkpython.fm/async.

Conor, welcome to Talk Python To Me. Thanks for having me. I'm excited to be here. I'm excited too. It's gonna be beautiful, man. Hopefully, hopefully, yeah, it's gonna be beautiful refactorings. So I am a huge fan of refactoring. I've seen so many people try to just overthink the code that they're writing. They're like, well, I've got to get it right, and I've got to think about the algorithms and the way I'm writing it and all this stuff. And what I found is, you don't really end up with what you want in the end, a lot of times anyway. If you just go in with an attitude of, this code is plastic, it is malleable, and I can just keep changing it, and you're always on the lookout for making it better, you end up in a good place. Yeah, I completely agree. Refactoring's not a one-time thing, or something that happens only, you know, two years from when you initially write the code. I heard once, actually, that refactoring goes hand in hand with legacy code. And there's a number of different definitions for legacy code, but one definition is that legacy code is code that isn't actively being written. So if you write something once and then you consider it done, and then the next week, like, no one's working on it, then technically, according to that person's definition, it's legacy code, so it can be refactored. You know, you can refactor something you wrote earlier in the day. It doesn't have to be a year later, or ten. Yeah, absolutely. I mean, you just get it working, you know a little bit more, you apply that learning back to it. And the tooling these days is really good. It's not just a matter of, you know, if you go back to 1999 and read Martin Fowler's refactoring book, he talks about these are the steps that you take by hand to make sure you don't make a mistake. And now the steps are highlight, right-click, apply refactoring. I mean, that's not 100% true, and the example we're going to talk through is not like that exactly, but there are steps along the way where it is. Potentially, definitely. Linters and static analyzers are heavily underutilized, I feel, and so many of them will just automatically apply the changes that you want to make. And that's fantastic for huge code bases; it would be almost impossible to do it by hand. Yeah, absolutely. It would definitely be risky, so maybe that's why people sometimes avoid it. Now, before we get into that, though, let's start with your story. How did you get into programming, into Python? I know you're into a lot of languages. I want to talk about that too, but Python first. Yeah, so the shorter... it's a long story, but the shorter version of it is, my degree in university, which wasn't in computer science, required at least two introductory CS courses. So the first intro course was in Python. The second one was in Java. And then I ended up really, really enjoying the classes.
I ended up taking a couple more, but ultimately stuck with the career that I had entered into, which was actuarial science. That's like insurance statistics. Yeah. So you were in some form of math program, I'm guessing. Yeah. Yeah. Cool. It's very, very boring to explain, but if you like math, it's a great career. Yeah, awesome. And so, for my first job out of university, I ended up working at a software company that, very simply explained, created the insurance calculator that many insurance companies use. And after working there for about four or five years, I had just fallen in love with the software engineering side of my job and had decided that I wanted to transition full time to a purely technical company. So skip ahead a couple of years, and now I work for Nvidia as a

05:00 senior library software engineer. And that's how I got into programming. The code base that we work on is completely open source and primarily uses C++14 and Python 3. That's where Python enters. So that sounds like a dream job. Yeah, that sounds awesome. Yeah, I absolutely love it. Yeah. So you're working on the RAPIDS team, right, which works on doing a lot of the computation that might be in pandas, but over on GPUs. Is that roughly right? Yeah, that's a great description. So yeah, within Nvidia, I work for an organization called RAPIDS. We have a number of different projects. Specifically, I work on cuDF, that is c-u-d-f. The cu is the two letters from CUDA, which is like the parallel programming language that Nvidia has made, and the DF stands for data frame. And so this is basically a very similar library to pandas, the difference being that it runs on the GPU. So sort of the one-liner for RAPIDS is, it's a completely open source, end-to-end data science pipeline that runs on the GPU. So if you're using pandas and it works great for you, there's no reason to switch, but if you run into a situation where you have a performance bottleneck, cuDF can be a great drop-in replacement. We don't have 100% parity with the pandas library, but we have enough that a lot of Fortune 500 companies that pick up and use us are able to very easily transition their existing pandas code to cuDF. Right, change an import line and go much faster, something incredible like that. That's the goal. That's the dream. Yeah, I just recently got a new Alienware, a high-end Alienware desktop, and it's the first GeForce I've had in a long time that's, you know, not like, I don't know, some AMD Radeon in a MacBook or something like that. So I'm pretty excited to have a machine that I can now test some of these things out on at some point. Yep. Acceleration on different devices is very exciting. Awesome. All right, well, let's start by introducing, real briefly, a little bit about refactoring. We've talked a tiny bit about it in general, and then we're going to dive into a cool example that you put together that really brings a lot together. And what I love about your example is it's something you've just gone and grabbed off the internet. It's not like a contrived, well, let's do this and then unwind the refactorings until it's done. It's like you just found it, like, well, let's see what this thing does. That's gonna be fun. But let's just start with a quick definition of refactoring. Maybe, how do you know when you need it? How do you know when you need refactoring? For me, I have a number of anti-patterns in my head that I recognize in code. Some people might refer to them as sort of technical debt, this idea that the first time you write things, or maybe initially, when you write things, you don't have the full picture in mind, and then as time goes on, you start to build up technical debt in your code. And a refactoring can be reorganizing or restructuring your code, or rewriting little bits of it, to basically reduce technical debt, to make it more readable, maintainable, scalable, and just, in general, better code. That's sort of the way I think of it. Yeah, in its purest sense, right, it should not change the behavior, at least in terms of inputs and outputs. Exactly.
So the easiest code to refactor is code with tests, whether that's unit tests or regression tests or any of the other number of tests that there are. If you have a code base that has zero tests, refactoring is very, very dangerous, because you can refactor something and completely change the behavior and not know about it, which is not ideal at all. Somewhat suboptimal indeed. You know, Martin Fowler, when he came up with the idea of refactoring, or at least when he publicized it, I don't know, I'm sure the ideas were basically there before, one of the things that struck me most was not the refactorings, but this idea of code smells. And it's this aesthetic of, right, I look at the code, and yeah, it works, but your nose kind of turns up. You're kind of like, you know, ooh. But it still works, right? It's not broken, but it's not nice. And, you know, there's all sorts of code smells, like too many parameters, long method, things like that. But they rarely have clear cutoffs, right? Like, well, if it's over 12 lines, the function is too large, but under that it's totally fine? Right, it's never really super clear cut. So I think this whole idea of refactoring, much like refactoring itself, requires going over and over it through your career to refine what the right aesthetic to achieve is. And it probably varies by language as well, a little. Yeah. If you start to do it consciously when you're looking at code, and asking yourself, when you have that code smell feeling, like something's not right here, if you are consciously paying attention to what it is, then slowly, over time, you will start to pick up on exactly what it is about it. A very, very small one for me, and I think this is mentioned in maybe Clean Code, or it might have been Martin Fowler's book, is declaring a variable earlier than it needs to be declared. So you might declare all your variables at the top of the function, but then two of them you use immediately, and the other three you don't use until the last, you know, four lines of the function. Small things like that. It seems simple, but

10:00 I've made the change where I've put the declaration closer to where it gets used, and then you realize, oh, wait a second, this isn't actually referenced. Like, it's set to something, but then it's not actually used later on, so I can just delete it. And it's because, when it was at the top of the function, you couldn't see where it's being declared or whether it's used somewhere else, so you actually just have a phantom, unused variable that can be deleted. It's simple things that lead to better changes later on. Well, and just mental overhead, like you said, the technical debt side of things, too. For example, there's the variable that was at the top. Surely, when the code was written, it was being used, but the code has been modified over the years, and now it's no longer used. But because it's separated from where it's declared to where it's used, you don't want to mess with that. If you start messing with that, you're earning yourself more work, right? You're asking for more. It's, I'm just gonna make the minor change, I don't want to break anything, who knows? And then the next person that comes to try to understand it has to figure out, well, why is there that, like, set count variable? I don't feel like it's being used, but it's there. And, you know, it's another thing to think about that's in the way. Yeah, for sure. So certainly, I think it's viable. There are fantastic tools that will highlight, this variable is unused, or this assignment is meaningless, or something like that. So there are options. But still, it's better to not let that stuff live in the code. Yeah, 100% agree. Let's talk about this example that you've got here. And maybe you should give a little background on your language enthusiasm and programming competition interest, your interest in coding competitions. I think it's probably worth touching on. This example is from you trying to reach out and understand and do some analysis of those ecosystems, right? There's the background with these different languages and coding competitions. Yeah. So I initially got into competitive programming, quote unquote. The one-sentence description is, there's a number of websites online, HackerRank, LeetCode, Codeforces, that host these one-, two-, or three-hour contests where they have three, four, or five problems that start out easy and then get harder as you progress through them. And you can choose any language you want to solve them in, and the goal is just to get a solution that passes as quickly as possible. So it's not necessarily about how efficient your code is; it has to run within a certain time limit, but if you can get it to run or pass in Python versus C++ versus Java, any solution works. I started doing these to prepare for technical interviews. So if you're interviewing for companies like Google, Facebook, etc., a lot of their interview questions are very similar to the questions on these websites. And I, at one point, was looking for a resource online, like YouTube videos that just explained this stuff, but at the time, I couldn't really find any. So I started a YouTube channel covering the solutions to these problems. And I thought it would be better to solve them in a number of languages, as opposed to just C++. So I started solving them in C++, Python, and Java. And that's sort of what led to my interest in competitive programming.
And even though I'm not interviewing actively anymore, I just find these super fun. It keeps you on your toes in terms of your data structures and algorithms knowledge. And you can treat them as code katas. I'm not sure if you're familiar with the concept, of just writing one little small program and trying it a couple of times in different languages, and you learn different ways of solving the problem that you might not have initially thought of. For this example, I decided to just figure out, what are the top languages that people use to solve these competitive programming problems on a given website? And the site that I chose was Codeforces. Yeah. And you're like, hey, I'm working on this new data frame library that's like pandas, let me see how I can use pandas to solve this problem and get some practice or something, right? Yeah, yeah. So when I had just started at Nvidia, I knew that the pandas library existed, but I had zero experience with it. And I knew that it had this sort of group-by reduction functionality, where if you had a big table of elements, you could get these statistics on, you know, what's the top language, or what's the average time it takes for people to submit, very easily with this kind of library. So I thought, what better way to learn pandas than by trying to build a simple example that uses this library for something that I'm interested in. And so the first thing that I did was I googled, you know, how to scrape HTML tables using pandas. And it brought me to this blog that, at the end of the day, has about 60 lines of code, and it's a tutorial blog. So it walks you through how to get this code off of an HTML table. And basically the PyCon talk that I gave came out of doing this. I had no plans of getting a PyCon talk out of this. It's just, after having gone through it and refactoring one thing at a time, I realized that I could give a pretty simple talk to the Conor of five years ago that didn't know about any of this. I didn't know about list comprehensions, I didn't know about enumerate, I didn't know about all the different techniques I

15:00 was using. And I figured, at least for some individuals out there, it would be a useful talk, highlighting the things that I didn't know when I first started coding in Python, but that now are second nature for me. And that's where that came from. Yeah. And it's really interesting. The example is cool. I do think that a lot of the refactorings were, let's try to make a more Pythonic version of this, a more idiomatic version of this, like really understanding the for-in loop, for example. So in a lot of ways, that's a cool refactoring, but it's also kind of leveraging more of the native bits of the language, if you will. Absolutely, yeah. So you went and grabbed this code, and it does two basic things: it goes and downloads some HTML, and then pulls it apart using, I think, the lxml HTML parser. And then it loops over the results that it gets from the HTML parser and turns them into basically a list or a dictionary, and then you feed that over to pandas and ask pandas some pretty interesting questions. And most of the challenge, or most of the messy code, lived on this HTML side of things, right? Yeah, that's a pretty good description of what's happening. Cool. So let's go and just talk through some of the issues you identified, and then the fix. Basically, how did you identify it as a problem, and then what fix did you apply to it? Now, there's a lot of code, and it's hard to talk about code on audio, so we'll maybe try to talk, as high level as possible, about the general patterns and what we fixed. The first part of the code would go through, and it would create an empty list and create an index to keep track of where it was, and then do a loop over all of the elements, increment the index, add a thing to the list, and print out some information as it went, right? Yep. And I think the first thing that you talked about was the code comments, actually. Like, what is this code comment here? It just says, we're looping over these things. But what do you think a loop is? And why do we have this comment? Yeah, even worse was, arguably, the second comment. Some might argue it adds some value. But the first comment, above the line that creates an empty list, says, create empty list. And the line it describes is only, what is that, six characters if you don't include the spaces? And I think that's definitely one of the things that's called out in a number of refactoring books: comments should add value that is not explicitly clear from the code. I think even beginners are able to tell that you're creating an empty list there. There's no reason to basically state what the code is doing. Typically, comments should say why, when it's not clear why something is being done a certain way, or something that's implicit and not explicitly clear from what the code is doing. Yeah. In terms of refactoring, I love this idea that these comments are almost warning signs. Because if I find myself writing one of these comments to make stuff more clear, I'm like, wait a minute, wait a minute, if this is just describing what's here, something about what I'm doing is wrong. Maybe the variable name is not at all clear about what the heck it is. Or maybe it could use a type annotation to say what types come in. Sort of, here's a list of strings, like how about list[str] goes there to just say what type it is. It's Python 3, after all.
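To make that concrete, here is a small sketch in the spirit of the "before" code being described. It is illustrative only, not the blog's exact code; the stand-in data and variable names are assumptions.

```python
# Roughly the shape of the original pattern under discussion (illustrative stand-in)
table_rows = ["<td>C++</td>", "<td>Python</td>", "<td>Java</td>"]  # stand-in for parsed HTML rows

rows = []   # create empty list   <- a comment that just restates the code
index = 0   # manual counter kept in sync with the loop by hand
for tr in table_rows:   # loop over the rows   <- another "what" comment
    print(index, tr)    # debug print left in the final script
    rows.append(tr)
    index += 1
```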
And you know, from the code smells book, Fowler had this great description of calling these types of comments deodorant for code smells. So there's something wrong, and it smells a little less bad if we lay it out, set the stage. But every time I see one of those, I'm like, you know what, I just need to rename this function to a short version of what this comment would say, or rename this variable, or restructure and break these things apart. Because if it needs a comment, it's probably just too complex. There's an individual in the C++ community, his name is Tony Van Eerd, and he has a rule, or not a rule, but a recommendation, that you should grep your code base for step one, step two, step three. And guaranteed, you're going to get one or two matches. And a lot of times, it's these step comments sitting on top of pieces of code in a larger function. And odds are, you could make that code a lot better by refactoring each of those steps into its own small function. And whatever the step is, if you've written step one in a description, you've already given that piece of code a name. You just need to take the next step, put it in a function, and give that function the name. Yes, exactly, exactly what you said. I think there was even some tool way, way back in the early days of C# where, if you highlighted some code to refactor it and you highlighted a comment, it would try to guess the function name by using the comment, turning it into something that would work as an identifier in the language. Anyway, it was totally a good idea. So there's a couple of things going on here. One is, why is there a print statement? Nobody needs this. Once you take that out, though, you were able to identify this. Well, let's take a step back first. If you have an integer, and you're incrementing it every time through the loop so that it stays in sync with the index of the elements you're looping over, that's probably not the best way to do it, right? Like, Python has a built-in enumerate. Yeah, this is probably one of the most common things I see in Python.

20:00 Sadly, in certain languages, they don't have this function. But in Python, it's right there, built into the language, and as you mentioned, it's called enumerate. So you can pass whatever thing you're looping over to enumerate, and that's going to bundle it with an index, which you can then inline destructure into an index and the element that you were getting from your range-based for loop before. So anytime you see an index, idx or i, or something that's keeping track of the index, sometimes it's i, sometimes it's j, sometimes it's k, x or y if you're being really creative, yeah, there is a built-in pattern for basically avoiding that. And it makes me extremely happy. It happens actually not just once in this piece of code, but twice, where you can make use of enumerate. And once you see it, it's very hard to unsee it. But like I said, this was something that I learned, enumerate, from Python. This was not something that I knew of, and I didn't learn it in school. So there's a lot of Python developers, and developers in many languages out there, that I think are just not aware. And as soon as you tell them, I think they'll agree, oh yeah, this is way better than what I was doing before. Yeah, you just need to be aware of it. You know, you always run into these issues: you've got to create the variable, then, why is the variable there? Then you've got to make sure you increment it. Do you increment it before you work with the value, or do you increment it after? Is it zero-based, is it one-based? All of these things are just complexities that make you ask, what is happening here? Like, what if you have an if with a continue, you skip the rest of the loop body, but you forget to increment it? There's all these little edge cases. And with enumerate you can say, you know, it's always going to work. You can even set the start position to be one if you want it to go 1, 2, 3. Beautiful. Yeah, that's a great point. Yeah, there are use cases where you're going to run into bugs, whereas with enumerate, you know at least you're not going to have a bug with that index, right? It's always going to be tied to the position, with the starting place the way you want it. So yeah, that's really nice. But it's not super discoverable, right? There's nothing in the language that screams and waves its hands and says, hey, you're in a for loop; we don't have this concept of a numbered for loop, but this is actually better, this is what you wanted, you just didn't know you wanted it. Yeah, it has to be something that you stumble across. Interestingly, some languages, Go is the one that comes to mind, actually build the enumerate into their range-based for loop. So in Go, they have the destructuring built in, and if you don't want the index, if you just want a range-based for loop and you want to ignore the index, then you're supposed to use the underbar to say, I don't need the index for this loop. But it's interesting that Go is a more recently created language than Python, and when they designed it, they thought it was such a common use case that most people would need it more often than they wouldn't, so they built it into their for loop. So with that language, you can't avoid learning about it, because it's in their for loop, but it's a syntax error to not at least explicitly say, I ignore this. Yeah. Oh, interesting. I didn't know that about Go.
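A minimal sketch of the manual-counter pattern next to the enumerate version; the data and names are just illustrative.

```python
table_rows = ["C++", "Python", "Java"]  # stand-in data

# Manual counter: easy to forget the increment or get the off-by-one wrong
index = 0
for row in table_rows:
    print(index, row)
    index += 1

# enumerate bundles the index with the element; start=1 if you want 1, 2, 3...
for index, row in enumerate(table_rows, start=1):
    print(index, row)
```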
So now you've got this a little cleaner. You look at it again, and you say, well, now what we're doing is we're creating an empty list, which we had commented with create an empty list. That was cool, we took that comment out, but it was very helpful in the beginning to help you understand. Now just kidding. And then you say, we're going to loop over these items and then append something to that list. Well, that's possible, but this is one of your anti-patterns that you like to find and get rid of, right? This is an anti-pattern that I call initialize-then-modify. And actually, the enumerate example previously also falls into this anti-pattern. So anytime you have a variable, and it doesn't need to be a for loop, but many, many times it is, where inside each iteration of that for loop you are modifying what you just initialized outside, that is initializing and then modifying. And my expectation is that you should try to avoid this as much as possible. And when it comes to the pattern of initializing an empty list and then, in each iteration of your for loop, calling append, there is something built into the Python language that can be used instead: a list comprehension, which is so much more beautiful, in my opinion, compared to a raw for loop appending in each iteration. Yeah, every now and then there's a complicated enough set of tests or conditionals or something going on in there that maybe not, but I agree with you most of the time. It usually just means what I really wanted to write was a list comprehension. It is, though, you know, bracket, item for item in such-and-such, if such-and-such, inside, right? That's what you've got to do. Yeah. List comprehensions, once you start to use them, moving to a language that doesn't have them makes you very sad, because it's such a nice syntax. It totally makes you sad. And I really, really wish list comprehensions had some form of sorting clause, because at that point, you're almost into in-memory database kinds of behaviors, right? Like, I would love to say, projection thing, transform thing, for thing in collection, where the test is, ordered by whatever. I mean, you can always put a sorted around it, but it'd be lovely if, since it's already got those nice steps, I like to

25:00 write it on three lines: the projection, the set, and the conditional. Just one more line, put the order-by in there. But maybe someone, or maybe I, should write a PEP, who knows. I was gonna say, that sounds like a future PEP. It would be easy to implement, just transform it to a sorted and pass that as the key or something like that. But anyway, it would be really cool, and they're very, very nice even without that. And once you have it as a list comprehension, it unlocks the ability to do some other interesting stuff, which you didn't cover in yours, because it didn't really matter. But if you have square brackets there, and those brackets are turning a large data collection into a list, if you put rounded brackets instead, all of a sudden you have a much more efficient generator. Yep, that is something I don't call out at that point, but at the end of the talk I allude to an article that was mentioned on the other podcast that you co-host, Python Bytes. Yeah, thanks for the shout-out on that one, by the way. Yeah, no, it was a great article. But it mentions generator expressions right after it mentions list comprehensions, and I mention that these things go hand in hand and that you should familiarize yourself with both, because if at any point you're passing a list comprehension to an algorithm like any or all or something, you can drop the square brackets and just pass it the generator, and it'll become much more efficient. So it's both of them. And there's no way to go from a for loop really quickly and easily to a generator, yield-style of programming, right? There's no for-yield-i-in-whatever. But with the comprehensions, it's square brackets versus rounded bracket parentheses, right? It's so close that it's basically no effort to make it happen. Yeah, nice. Okay, so we've got it into a list comprehension, which is beautiful. And then you say, all right, it's time to turn our attention to this doubly nested for loop. And it's going to go over a bunch of the items and pull out an index and then, you know, go and work with that index. So it's another enumerate. And then I think another thing that's pretty interesting that you talked about, I don't remember exactly where it came in the talk, but you're like, look, what you're doing in this loop is actually looping from the second element onward for all the items, and that really is just a slice. Yeah, yeah. So in this nested for loop, the outer for loop basically reads for j in range of one to the length of your list. So you're creating a range of numbers from one to the length of your list, and then right inside that for loop, you're creating a variable that's the jth element of your list. So all you're doing is skipping the very first element of your list, but the way you're doing it is generating explicit indices based on the range function and the length function. And I thought at first that they must be doing this because we need access to the index later, or we need access to our elements later, but that wasn't the case. It just seemed like the only reason they were doing all of this was to skip over the first element. And so very nicely, once again, Python has very, very many nice features. They have something called slicing, where you can basically pass it the syntax, which is square brackets, and then something in the middle of the square brackets.
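Here is a quick sketch of the initialize-then-modify pattern next to the comprehension and generator-expression forms just discussed; the data is a made-up stand-in for the scraped table cells.

```python
cells = ["12", "C++", "7", "Python"]  # stand-in for scraped table cells

# initialize-then-modify: make an empty list, then append inside a loop
numbers = []
for cell in cells:
    if cell.isnumeric():
        numbers.append(int(cell))

# the same thing as a list comprehension
numbers = [int(cell) for cell in cells if cell.isnumeric()]

# square brackets -> rounded brackets gives a lazy generator expression,
# handy when feeding any()/all()/sum(), since no intermediate list is built
has_numbers = any(cell.isnumeric() for cell in cells)
```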
And in order to skip the first one, you just go one colon: one to the end. And that's beautiful, because you don't even have to check the length of the items. You just say, go to the end, which avoids errors like, do I have to plus one here, or do I not, is it minus one, fudging the ending piece. You don't worry about it: just, skip the first one and take the rest. Yep, it's so convenient. You avoid making a call to len, you avoid making a call to range, and you avoid the local assignment on the first line of your for loop. You can basically remove all of that and just use slicing, and you're good to go. And yeah, slicing is a really, really awesome feature. It actually comes from a super old language that was created in the 60s called APL. And Python is one of the only languages that has negative index slicing, where you can pass it a negative one so that it wraps around to the last element. It sort of looks weird, but once you use it, it's so much more convenient than doing len minus one or something like that. Yeah, it is a little bit unusual, but once you know what it does, it's great. It's great. It's like, I want the last three, I don't care how long it is, I just want the last three. Yeah, it's fantastic. Slicing, I think, is fairly underused by people who come from other languages. And it fits the bill, because there are so many of these little edge cases. You talk about errors in programming; off-by-one errors are a significant part of the problems with programming, right? And this just skips that altogether. It's beautiful. On to the next thing: so you're parsing this stuff off the internet, which means you're working with 100% strings, but some of the time you need numerical data, so you can ask questions like, is this the sixth or the seventh or whatever. And so they

30:00 have, and this is gonna be fun to talk about, they have try: value equals int of data. So pass the potentially integer-like data over to the int initializer. Either that's going to work, or it's going to throw an exception, in which case you say except: pass. Not you, the original article had that, right? So it's this try-parse, except, pass. Otherwise, it's going to be None, or it's going to be set to the string value, or something to that effect. So what do you think about this? How did you feel when you saw that? Yeah, so my initial reaction was that this is four lines of code that can potentially be done in a single line using something called a conditional expression. Many other languages have something called a ternary operator, which typically uses a question mark, where you can assign to a variable based on a conditional predicate, something that's just asking true or false. And if it's true, you assign one value, and if it's false, you assign another value. In Python, they have something called a conditional expression, which has the syntax: assign the value using the equals sign, then ask your question. So in this case, we just ask, is it an int? Or sorry, the first thing is what gets returned; it's actually backwards from a ternary operator. So the line reads, data equals int of data if, and then check your predicate. And in Python we can just call isnumeric on our value, which will return true or false based on whether it's a number. So if that returns true, it'll end up assigning int of data to data. Otherwise, you can just assign data to itself, and it's not going to do any transformation on that variable, because it's not numeric. It's one line of code, it's more expressive, in my opinion, and it avoids using try and except, so it's preferable from my point of view. I would say it's probably preferable from my point of view as well. I have mixed feelings about this, but I do think it's nice under certain circumstances. One, for example: if you say try, do a thing, except, pass, a lot of linters and PyCharm and whatnot will go, this is too broad of a clause, you're catching too much. And you're like, okay, well, now, to make the little squiggly in the scroll bar go away, I have to put a comment to disable the check, right? And now it's five lines, one with a weird exception to say, no, no, this time it's fine. So that's not ideal. I definitely think it's more expressive to use this conditional-if one-liner. The one situation where I might step back and say, you know, let's just do the try, is if there's more variability in the data. This assumes that the data is not None and that it's string-like, right? But if you potentially got objects back, or you got None some of the time, then you need a little bit more of a test. I mean, you could always do, if data and data.isnumeric(), that's okay, but then it's, if data and isinstance of string, data, and... there's some level where there are enough tests that you're kind of like, fine, just let it crash, and we'll just catch it and go. But we were talking before we hit record, also, that there's a performance consideration, potentially. Yeah, definitely. And it's interesting, I'll let you speak to what you found.
But in the YouTube comments on the PyCon talk, that was probably the most discussed thing: whether or not the conditional expression was less performant than the original try and except, because a couple of individuals commented that it was more Pythonic to use the try and except and that, therefore, it might be more performant. But you can share what you found. Sure. Well, I think in terms of the Pythonic side, certainly coming from other languages, like, say, C or C++, there's more of this easier-to-ask-for-forgiveness-than-permission style of programming, rather than the alternative, look before you leap, right? Because in C, it could be a page fault and the program goes poof and goes away if you do something wrong, whereas here it's just gonna throw an exception and you're gonna catch it or something like that. So there's this tendency toward that style. But in terms of performance, I wrote a little program, because I wanted to ask, maybe this is faster, maybe it's slower, let's think about that, right? I wrote a little program, which I'll link to in the show notes. It builds a list with 1 million items, and it uses a random seed that is always the same, so there's zero variability; even though it's random, it's predictable random. And it builds up this list of either strings or numbers, randomly, a million of them, about two-thirds strings, one-third numbers. And then it goes through and just tries both approaches. It says, let's convert as many of them as we can over to integers, and do it either with the try-except-pass or with this isnumeric test. I got about 6.5 times faster doing the one-line test than letting it crash and realizing that it didn't work. Yeah, so there you go. You heard it here: isnumeric and conditional expressions, faster than try-except.
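For the flavor of it, here is a small sketch of both forms and a rough timing comparison in the spirit of the benchmark described; this is not the exact script linked in the show notes, and the data mix and numbers are illustrative.

```python
import random
import timeit

random.seed(42)  # fixed seed: "predictable random", as discussed
data = [str(random.randint(0, 9)) if random.random() < 1 / 3 else "word"
        for _ in range(1_000_000)]  # roughly one-third numbers, two-thirds strings

def with_try_except(values):
    out = []
    for v in values:
        try:
            v = int(v)          # let it crash on non-numeric strings
        except ValueError:
            pass
        out.append(v)
    return out

def with_conditional(values):
    # conditional expression: value-if-predicate-else-other
    return [int(v) if v.isnumeric() else v for v in values]

print("try/except:", timeit.timeit(lambda: with_try_except(data), number=1))
print("isnumeric: ", timeit.timeit(lambda: with_conditional(data), number=1))
```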

34:54 Talk Python To Me is partially supported by our training courses. How does your team keep their Python skills

35:00 sharp? How do you make sure new hires get started fast and learn the Pythonic way? If the answer is a series of boring videos that don't inspire, or a subscription service you pay way too much for and use way too little, listen up. At Talk Python Training, we have enterprise tiers for all of our courses. Get just the one course you need for your team, with full reporting and monitoring, or ditch that unused subscription for our course bundles, which include all the courses, and you pay about the same price as a subscription, once. For details, visit training.talkpython.fm/business or just email sales@talkpython.fm.

35:38 There's a lot of overhead to throwing an exception, catching it, and dealing with all of that, right? This is a particular use case, and like all these benchmarks, it might vary; if you've got 95% numbers and 5% strings, it might behave differently. So there's a lot of variation. But here's an example you can play with, and in what seems like a reasonable example to me, it's faster to do the numeric test, and a lot faster, right? Not 5% faster, but 650% faster. So it is worth thinking about. Yeah, for sure. Let's see. So, coming through to the end, you had, I mean, a ton of stuff here; it was many lines of code just for these two loops, and now you've got it down to four lines: basically an outer loop, an inner loop, grab the data and append it, with this little test that you've got. Much nicer. I agree. Yeah. So I think if you look at the overall program at this point, you were doing some analysis, or some reporting, and said it started at 60 lines of code and now it's down to 20. Yeah, roughly, depending on whether you count empty lines and whatnot, but it was about 60, down to about 10 or 20 lines. And at this point, I pointed out that I had made a mistake. So this was fantastic, at least I had thought: I'd taken a code snippet from a blog and reduced it by roughly 75% or 67%, depending on how you measure it. But I had made an even bigger mistake than I had realized. It was that, when I originally showed googling for, you know, how to scrape HTML using pandas, I had used the second result, and the third result was actually what I should have chosen. And that was that pandas actually has a read_html method in the library. And so the point I was going to make is, if you use that, you go from 10 or 20 lines down to like four lines of code, and you're just invoking this one pandas API, read_html. It's so much better. So, you know, refactoring is fantastic, but there's some quote about how the best code is no code. If you don't have to write anything to do what you want to do, and you can just use an existing library, that's the best thing you can do, because that's going to be way more tested than the custom code that you've written, it's going to save you a ton of time, and you're going to end up with ultimately less code to maintain yourself. And what's better than having someone else maintain the code that you're using for you? Exactly right. It gets better for no effort on your part. Yeah, it might get faster, or it might handle more cases of broken HTML, or who knows, but you don't have to keep maintaining it. It's just read_html, and pandas, it's definitely getting maintained. Yeah. And so, one of the things that I've echoed in some of the other talks I've given is: know your algorithms. In C++, definitely, there's a whole standard library; in Python, there's a lot of built-in functions, I guess they're not so much called algorithms, they're called built-in functions. But there's a whole page of them that I was just looking at the other day.
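In the spirit of the final, read_html-based version described here, a short sketch follows; the URL and the choice of table are placeholders, not the exact ones from the talk.

```python
# Roughly the shape of the final version: fetch the page, let pandas parse the tables.
import requests
import pandas as pd

html = requests.get("https://example.com/contest-results").text  # placeholder URL
tables = pd.read_html(html)   # returns a list of DataFrames, one per HTML table found
df = tables[0]                # assume the first table is the one we want
print(df.head())
```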
And on that page, there's a ton of them that I'm just not aware of. Everyone knows about map, filter, any, all. But I just saw, I think it's called divmod, which is a built-in function that gives you both the quotient and the remainder. There have definitely been a couple of times where I've needed both of those and done the operations separately. And if you just know about it, you can, in a single line, destructure the result using iterable unpacking. Knowing your algorithms is great, but so is knowing your libraries, knowing your collections. The more you get familiar with what exists out there, the less you have to write and the more readable your code is, because if everybody knows about it, we have a common knowledge base that's transferable across every project you work on, right? Yeah. Your final version basically had two really meaningful lines. One was requests.get, the other was pandas.read_html. You don't have to explain to anyone who has done almost anything with Python what requests.get means. It's like, oh yeah, okay, got it, next, right? We all know that works, we know what it does, and so on. And it's really nice. I think, though, what you've touched on here is actually really important, but it also shows why it's kind of hard to get really good at a language. And the reason is, there are so many packages, right? You go to PyPI... I mean, I'm at pypi.org right now; every time I go there, there are always more. So, 245,000 packages. If you want to learn

40:00 to be a good Python programmer, you need to at least have awareness of a lot of those, and probably some skill set in some of them. Because pandas is one of those, requests is another one, right? The four-line solution that you came up with was building on those two really cool libraries. And so being a good, effective programmer means keeping your eye on all those things. And I just think that's both amazing and also kind of tricky, because, well, I'm really good with for loops and incrementing variables. Great. You've got 200,000 packages to study, go. There's some quote I've heard before, that being a language expert is 10% language, 90% ecosystem. Yeah. And you can't be a guru in insert-any-language if you don't know the tools, if you don't know the libraries. It's so much more than just learning the syntax and the built-in functions that come with your language. It takes years, and it definitely doesn't happen overnight. It's a challenge for all of us. Yeah, for sure. You know, maybe it's worth a shout-out to awesome-python.com right now as well, which has different categories you may care about and highlights some of the more popular libraries in each area. That's a good one. Yeah, for sure. So, you went through and did nine different steps. You actually have those called out very clearly in your slides; you can get the slides from the GitHub repo associated with your talk, which I'll link to in the show notes, of course. But all of this refactoring talk was really part of the journey to come up with a totally different answer, which was, what are the most popular languages for these coding competitions? Yeah, the ultimate goal was to scrape the data and then to use pandas to do that analysis. And at the end of the day, the number one language, I definitely know, was C++, at about, I think it was, 89%. And that typically is the case, because certain websites give the same time limit per language. A website like HackerRank varies by language, so for Python, the execution time that you're allotted is, I think, ten times more. So even though Python's slower, they give you a proportionate amount of time. But most websites don't do that. The Codeforces website gives you, I think, two seconds of execution time regardless of the language you use. And due to that, most people choose the most performant language, which is C++. But in second place was Python. And I know a lot of competitive programmers that, for the problems where performance isn't the thing you're trying to solve for, always use Python, because it's a fraction of the number of lines of code to solve it in Python compared to any other language. Sometimes you can solve a problem in one line in Python, and the next closest language is like five lines, which is a big deal when time matters. Yeah. Are you optimizing execution time or developer time in this competition, right? Yeah, it definitely matters what you're trying to solve for. So yeah, C++ was first, Python was second, Java was third, and then there was a bunch of fringe languages; the top three of those were C#, Haskell, and Kotlin. And yeah, you can see the full list if you go watch the PyCon talk. But it was really hard to find out what was used and what wasn't. Yeah, it sure was. And it was cool to see the evolution of what you created to answer that question, which is pretty neat.
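Just for the flavor of the analysis step, here is a sketch of the kind of group-by reduction being described; the column name and data are made up, not the real Codeforces table.

```python
import pandas as pd

# Stand-in for the scraped submissions table; the real one would come from pd.read_html
df = pd.DataFrame({"language": ["C++", "C++", "Python", "C++", "Java", "Python"]})

counts = df["language"].value_counts()          # submissions per language
share = (counts / counts.sum() * 100).round(1)  # as a percentage of the total
print(share)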
All right. Well, let's talk a little bit about RAPIDS, because I know there are a lot of data scientists out there who are probably interested in that project. We did mention a tiny bit that it basically takes the pandas DataFrame API, something like that, pretty close, not 100% identical in everything, but pretty close, and runs it on GPUs. Why are GPUs better? Like, I have a really fast computer. I have a Core i9 with six cores that I got a couple of years ago. That's a lot of cores, right? So yeah. Well, the first thing I should highlight is that RAPIDS is more than just cuDF. cuDF is the library I work on; we also have cuIO, cuGraph, cuSignal, cuSpatial, and cuML, and each of those maps to a different part of the data science ecosystem. So cuDF is definitely the analog of pandas; for cuML, the analog you can think of is scikit-learn. But also, none of this is meant as a replacement. They're just meant as alternatives. If performance is not an issue for you, stick with what you have; there's no reason to switch. Yeah, don't do it, because, for example, I couldn't run it on my MacBook, right? Because I have a Radeon. Right, right. If you do want to try it out, if you go to rapids.ai, we have links to a couple of examples using Google Colab that are hooked up to free GPUs, so you can take it for a spin. You do need the hardware otherwise, but you can go try it out there. But our pitch is sort of: this is useful for people that have issues with compute, right? And for different pieces, you're going to want different projects. So if you're doing pandas-like data manipulation, cuDF is what you want. But yeah, why are GPUs faster? It's just a completely different device and a completely different model. So, GPUs, typically it's right there in the

45:00 name of the GPU. They're known for being great for graphics processing, which is why it's called the GPU. But at some point, someone coined the term, he actually works on the RAPIDS team, Mark Harris, he coined the term GPGPU, which stands for general-purpose GPU computing. It's now typically referred to as just GPU computing. But it's this idea that even though the GPU model is great for graphics processing, there are other applications that GPUs are also amazing for. The next best one is matrix multiplication, which is why they became huge in neural nets and deep learning. But since then, we've basically discovered that there's not really any domain where we can't find a use for GPUs. So there is a standard library in the CUDA model called Thrust. If you're familiar with C++, the standard library is called the STL, and it has a suite of algorithms and data structures that you can use. Thrust is the analog of that for CUDA, and it has reductions, it has scans, and it has basically all the algorithms that you might find in your C++ STL. And if you can build a program that uses those algorithms, you've just GPU-accelerated your code. However, using Thrust isn't as easy as some might like, and a lot of data scientists are currently operating in Python and R, and they don't want to go and learn C++ and then CUDA and then master the Thrust library just to accelerate their data science code. The RAPIDS goal is to basically bring this GPU computing model, for general-purpose acceleration of data science compute, or whatever compute you want, to the data scientists. And so, if they're familiar with the pandas API, let's just do all that work for them. RAPIDS is built heavily on top of Thrust and CUDA, and we're basically doing all this work for the data scientists so that they can take their pandas code, like you said, hopefully just replace the import, and be off to the races. And some of the performance wins are pretty impressive. I'm not on the marketing side of things, but in the talk I mention that I just happened to be listening to a podcast called the Nvidia AI Podcast, and they had, I believe his name was Kyle Nicholson, and by swapping in cuDF for pandas in their model, they were able to get a 100x performance win and a 30x reduction in cost. That's 30 times, not 30%. Yeah, so 30x

47:30 multiplicatively, which is massive. That's the difference between something running... so if it's 100x in terms of performance, that's the difference between something running in 60 seconds or in an hour and 40 minutes. And if you can also save 30x, if that cost you 100 bucks and now you only have to pay three dollars, it seems like a no-brainer for those individuals that are impacted by a performance bottleneck. Like I said, if you're hitting pandas and it runs in a super short number of seconds, it's probably not worth it to switch over. Yeah, well, and you probably... you tell me how realistic you think this is, but you could probably do some kind of conditional import, like, you could try to get the RAPIDS stuff working, and if that fails, you could just import pandas as the same thing. One is pd, the other is pd, and maybe it just falls back to working on regular hardware, but runs faster when the GPU is there. Do you think... That is definitely possible. There are going to be limitations to it, though. Obviously, if you have a cuDF data frame, I don't think you would be able to do it piecemeal. But if you have a large project... What I'm thinking is, if you wrote it for the RAPIDS version but then let it fall back to pandas, not the other way around. If you take arbitrary pandas code and try to RAPIDS-ify it, that might not work, but the other direction may well work. And that way, if somebody tries to run it and they don't have the right setup, it's just slower. Possible? What do you think? There's definitely a way to make that work. It might require a little bit of boilerplate framework code that is doing some sort of checking, you know, is this compatible, else... But that definitely sounds automatable. Yeah, that sounds cool, because it would be great to have it just fall back to, not not-working, just not-so-fast, right? Yeah, the future of computing is headed to a place where we can dispatch compute to different devices without having to manually specify that I need this code to run on the CPU versus the GPU versus the TPU versus, in the future, I'm sure there's going to be a QPU, a quantum processing unit. Exactly. Currently, we all think serially, or most of us that don't work at Nvidia think serially, in terms of the way that CPUs do compute. But I think in 10 or 20 years, we're all going to be learning about different devices, and it's going to be too much work to always keep track in our heads of which device something should go to. At some point, there's going to be a programming model that comes out that just automatically handles when it can go to the fast device and when it should just be sent to the CPU. Yeah, absolutely.
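A minimal sketch of that conditional-import fallback idea, assuming the parts of the pandas-style API you use are also covered by cuDF, which isn't guaranteed for every call:

```python
# Try the GPU-accelerated library first; fall back to pandas on machines without it.
# This assumes the calls used below exist in both libraries' pandas-like APIs.
try:
    import cudf as xdf      # GPU data frames from RAPIDS, if installed and supported
except ImportError:
    import pandas as xdf    # regular CPU pandas as the fallback

df = xdf.DataFrame({"language": ["C++", "Python", "C++"]})
print(df["language"].value_counts())
```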

50:00 So just while you were talking, I pulled it up. In that Alienware gaming machine I got is a GeForce RTX 2070, which has 2,304 cores. So that's a lot. That's a lot of cores. And if you look around, Google claims that it achieves 7.5 teraflops, and the Super version increases that to 9 teraflops, which is just insane, with a Core i7 doing something like 0.35 or 0.28, or something like that. So anyway, the numbers just boggle the mind when you think of how much computation graphics cards do these days. I think top of the line, I might get this wrong, but the modern GPUs are capable of 15 teraflops. It's an immense amount of compute that's hard to fathom, especially coming from the CPU way of thinking. Yeah, absolutely. Yeah. The only reason I didn't get a higher graphics card is that every version above it required water cooling, and I'm like, that sounds like more effort than I want for a computer. I'll just go with this one. Yeah.

51:03 All right. Well, RAPIDS sounds like a super cool project, and maybe we should do another show on the RAPIDS side of things to talk a little more deeply. It sounds like a great project that you're working on. I work on the C++ lower-level engine of it, but I'd be happy to connect you with some of the Python folks that work on that side of things, and I'm sure they'd love to come on. Yeah, that'd be fun. All right. Now, before we get out of here, I've got to ask you the two questions. If you're going to write some Python code, what editor do you use? So I am a VS Code convert; that's what I typically use day to day. Nice. Yeah, that's quite a popular one these days. And then, a notable PyPI package, one you've run across that you'd say people should know about? Yeah. So I'd like to recommend... there's a built-in standard library module, which I'm pretty sure most Python developers are familiar with, itertools, which has a ton of great functions. But less well known is a PyPI package called more-itertools. I'm not sure if this one's been recommended on the show before, but if you like what's in itertools, you'll love what's in more-itertools. It has a ton of my favorite algorithms, chunked being one of them: you pass it a list and a number, and it gives you a list of lists consisting of that many things each. It's like paging for lists. Yeah. And there are tons of neat functions. Another great one, that's so simple but doesn't exist built in, is all_equal. It just checks, given a list, are all the elements the same? And it's a simple thing to do; you can do it with all, but you have to check, is every element equal to the first one or the last one? So there's just a ton of really convenient functions and algorithms in more-itertools. That's the one I'd recommend. Yeah, that's cool. And you can combine these with generator expressions and stuff: pull some element out of each object in there, generate that collection, ask if all those are equal, and all these ideas go together well. Yeah, they compose super nicely. Yeah, for sure. All right, final call to action. People are interested in doing refactoring and making their code better, maybe even checking out RAPIDS. What do you say? I'd say, if you're interested in what you heard on the podcast, check out the PyCon talk; it's on YouTube. If you search for PyCon 2020, you'll find the YouTube channel. And if you're interested in RAPIDS, it's rapids.ai, check us out there. I assume all this stuff will be in the show notes as well, so maybe that saves the searching. Yeah. Well, and then also, you talked about your YouTube channel a little bit. Maybe just tell people how to find that; we'll put a link in the show notes as well, so they can, if they want to, watch you talk about some of these solutions and these competitions. Yeah, so my online alias is code_report. If you search for that on Twitter, YouTube, or Google, I'm sure all the links will come up, and you can find me that way. Awesome. All right. Yeah, we'll link to that as well. All right. Well, Conor, thank you so much for being on the show. It was a lot of fun to talk about these things with you. Thanks for having me on. This was awesome. Yeah, you bet. Bye-bye. This has been another episode of Talk Python To Me. Our guest on this episode was Conor Hoekstra, and it has been brought to you by us over at Talk Python Training. Want to level up your Python?
If you're just getting started, try my Python Jumpstart by Building 10 Apps course. Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle. It's like a subscription that never expires. Be sure to subscribe to the show: open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Get out there and write some Python code.
