Beautiful Pythonic Refactorings
Conor Hoekstra is our guest on this episode to talk us through refactoring some web scraping code.
Episode Deep Dive
Guest Introduction and Background
Conor Hoekstra is a senior library software engineer at NVIDIA working on the RAPIDS project (in particular, the cuDF library). He has a strong interest in C++ and Python, especially around competitive programming. Conor initially explored Python for data-focused tasks and web scraping, and through that journey, he gave a PyCon talk on transforming “ugly duckling” Python code into a polished, Pythonic beauty.
What to Know If You're New to Python
Before diving into the refactoring concepts, it’s helpful to have some familiarity with Python’s basic syntax and coding style. In particular:
- For loops and how Python’s enumerate() works for index-item pairs.
- List comprehensions and the if ... else ... expressions in Python.
- Basic error handling (try/except) and comparison to built-in functions like str.isnumeric().
- Basic string and list operations (e.g. slicing with [start:end]).
Key Points and Takeaways
- Iterative Refactoring: Turning Ugly Code into Pythonic Code
Refactoring is an ongoing process of restructuring existing code without changing its external behavior. Conor walked through an example of taking a messy web-scraping script and progressively shaping it into concise, maintainable Python code. By focusing on iterative, small changes, you can keep your code running and tested at each step. This approach reduces the risk of introducing bugs and leads to much more readable, elegant solutions.
- Code Smells and “Deodorant” Comments
The episode highlighted that comments explaining obvious lines (e.g. “create empty list”) can be a sign of unclear code. Rather than adding commentary that just states the obvious, using clearer variable names or refactoring logic into separate functions often eliminates the need for these comments entirely. As Conor put it, comments should capture the “why,” not the “what,” which the code itself should express.
- Python Built-ins: enumerate(), Slicing, and List Comprehensions
Three Python features repeatedly mentioned were:
  - enumerate(): For cleaner loops when an index is needed alongside items.
  - Slicing ([start:end]): An easy way to skip or limit ranges without manual indexing.
  - List Comprehensions: An elegant alternative to “initialize then modify” with .append().
Each of these features cuts down on boilerplate code and off-by-one errors, while producing more readable and succinct Python.
- try/except vs. str.isnumeric()
In the original code, a common anti-pattern was using try: int(data) with except: pass to see if something was numeric. Switching to data.isnumeric() and converting accordingly removed the broad exception handling and improved performance substantially (on the order of 6x faster in Conor’s rough benchmarks). It also made the logic more explicit: “If this is numeric, convert it.”
- Automated Refactoring Tools and Linters
While the conversation didn’t dive deeply into any single linter or IDE, there was discussion that many modern editors can automatically identify unused variables, rename functions, extract methods, and more. These tools can dramatically reduce human error during refactoring. The overarching point: Use the powerful tooling in your editor (e.g. PyCharm, VS Code) to safely transform your code.
- The Best Code Is No Code
Refactoring doesn't just mean cleaning up existing functions. Sometimes the biggest win comes from removing large swaths of code if a library or function already exists. In the final example, dropping an entire custom HTML parsing routine in favor of pandas.read_html() reduced lines of code from 60 to just a handful. Less code is easier to maintain.
- Real-World Example: Scraping CodeForces Data
The driving example throughout this discussion was scraping CodeForces to discover which programming languages appear most often in competitive programming submissions. This gave a tangible anchor for all the refactoring efforts, showcasing how to transform a real script that initially relied on repetitive loops, indexing, and manual string checks.
- Results: C++ #1, but Python #2
After analyzing CodeForces data, Conor found that C++ consistently topped the charts (roughly 89% of submissions), largely because CodeForces grants all languages the same execution time limits. Python came in second, favored for its conciseness on lower-compute or simpler problems. This underscores how developer constraints (time limits, memory) play a big role in language choices in competitive programming.
- GPU Computing with RAPIDS
Beyond refactoring scripts, Conor gave insight into how RAPIDS bridges Python and GPU acceleration. Tools like cuDF aim to bring the pandas-like data frame experience to CUDA-based devices for massive speedups (on the order of 30-100x for certain workloads). Even if you aren’t diving into GPU computing right now, it’s valuable to know that “scaling up with GPUs” can be a straightforward step when performance becomes critical.
- Growing Your Python Knowledge Base
Python has a vast standard library and ecosystem, much larger than many expect. Knowing key built-ins (e.g. divmod(), all(), any()) or external libraries (like requests or more-itertools) keeps you from “reinventing the wheel.” As you broaden your horizons, always look for existing solutions first. That awareness is often the difference between hacking code together and writing truly elegant, Pythonic solutions.
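As a rough illustration of the slicing and list-comprehension points above, here is a minimal sketch. The `rows` data and all variable names are made up for illustration and are not taken from the code discussed in the episode.

```python
# Hypothetical scraped rows: a header row followed by data rows.
rows = [
    ["Language", "Count"],
    ["C++", "120"],
    ["Python", "45"],
    ["Java", "30"],
]

# "Initialize then modify" style with manual indexing to skip the header.
languages_loop = []
for i in range(1, len(rows)):
    languages_loop.append(rows[i][0])

# Same result: the slice skips the header, the comprehension builds the list.
languages = [row[0] for row in rows[1:]]

# A conditional expression inside a comprehension converts only numeric strings.
counts = [int(row[1]) if row[1].isnumeric() else row[1] for row in rows[1:]]

print(languages)  # ['C++', 'Python', 'Java']
print(counts)     # [120, 45, 30]
```

The comprehension version has no index arithmetic at all, which is where off-by-one mistakes tend to creep in.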
Interesting Quotes and Stories
"Refactoring is not a one-time thing or something that happens only, you know, two years from when you initially write the code. You can refactor something you wrote earlier in the day.", Conor Hoekstra
“Whenever I see ‘Step 1,’ ‘Step 2,’ ‘Step 3’ in comments, I know I should extract those into their own functions.”, Conor Hoekstra
“The best code is no code. If you can just import a library or a function that does exactly what your custom code does, remove that code.”, Conor Hoekstra
“If you have a code base that has zero tests, refactoring is very, very dangerous.”, Michael Kennedy
Key Definitions and Terms
- Refactoring: The process of improving internal code structure without altering external behavior.
- List Comprehension: A concise way to create lists (e.g. [x for x in items if condition(x)]).
- enumerate(): A Python built-in that returns both index and value when iterating over a sequence.
- Code Smell: An indication (often subjective) that code needs refactoring. Examples include long methods, misleading variable names, or superfluous comments.
- Technical Debt: Accumulated suboptimal code or architectural decisions that make further changes difficult or error-prone.
- try/except: Error handling in Python. Overuse can hide bugs and harm performance if used where direct condition checks (e.g. isnumeric()) would suffice.
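To make the try/except versus isnumeric() comparison concrete, here is a small sketch of both styles side by side. The function names `to_int_try` and `to_int_check` are hypothetical, and the two are not exact equivalents: str.isnumeric() is False for strings such as "-5" that int() accepts, and True for some characters (such as "²") that int() rejects, so the explicit check is only a safe swap when the input is known to be plain digit strings or non-numeric text.

```python
def to_int_try(data):
    """Convert with try/except; return the input unchanged on failure."""
    try:
        return int(data)
    except ValueError:  # narrower than a bare `except:`
        return data


def to_int_check(data):
    """Convert only when the string consists of numeric characters."""
    # Assumption: inputs are plain strings such as "42" or "abc".
    return int(data) if data.isnumeric() else data


print(to_int_try("42"), to_int_check("42"))    # 42 42
print(to_int_try("abc"), to_int_check("abc"))  # abc abc
print(repr(to_int_try("-5")), repr(to_int_check("-5")))  # -5 '-5'
```

The last line shows where the two approaches diverge, which is worth covering with a test before making the swap in real code.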
Learning Resources
Here are some resources to learn more and go deeper into Python refactoring and best practices:
- Write Pythonic Code Like a Seasoned Developer: Learn advanced idiomatic patterns and design in Python.
- Python for Absolute Beginners: If you need a solid Python foundation before focusing on refactoring, this course is for you.
- RAPIDS: Explore GPU-based data science and analytics libraries including cuDF.
- CodeForces: Popular competitive programming platform mentioned in the episode.
- Requests on PyPI: HTTP library used for web scraping calls.
- More-Itertools on PyPI: Extends itertools with additional helpful iteration helpers.
Overall Takeaway
This conversation underscores how “ugly code” can evolve into elegant, maintainable Python by taking small refactoring steps and leveraging Python’s best features. Tools like enumerate(), slicing, list comprehensions, and built-ins (.isnumeric()) go a long way toward cleaning up code. Even more powerful is knowing when to drop your own code and let a library, such as pandas.read_html(), handle the heavy lifting. The art of refactoring isn’t just about shorter code; it’s about writing expressive, reliable solutions that free you to focus on bigger problems, like discovering which languages top the charts in coding challenges or leveraging GPUs for massive data workflows.
Links from the show
Presentation source code: github.com/codereport
Conor on Twitter: @code_report
Youtube channel: youtube.com/codereport
Perf example exceptions vs. test: gist.github.com/mikeckennedy
PyCon Online: us.pycon.org/2020/online
RAPIDS AI project: rapids.ai
Slides from presentation (with 9 refactoring steps): github.com/codereport
Talk Python episode on Sourcery: talkpython.fm/266
pip for venv only environment variable
PIP_REQUIRE_VIRTUALENV: docs.python-guide.org
Episode #275 deep-dive: talkpython.fm/275
Episode transcripts: talkpython.fm
---== Don't be a stranger ==---
YouTube: youtube.com/@talkpython
Bluesky: @talkpython.fm
Mastodon: @talkpython@fosstodon.org
X.com: @talkpython
Michael on Bluesky: @mkennedy.codes
Michael on Mastodon: @mkennedy@fosstodon.org
Michael on X.com: @mkennedy
Episode Transcript
00:00 Do you obsess about writing your code just the right way before you get started?
00:03 Maybe you have some ugly code on your hands and you need to make it better.
00:07 Either way, refactoring could be your ticket to happier days.
00:10 On this episode, we'll walk through a powerful example of iteratively refactoring some code
00:16 until we eventually turn our ugly duckling into a Pythonic beauty.
00:19 Conor Hoekstra is our guest on this episode to talk us through refactoring some web-scraping Python code.
00:25 This is Talk Python To Me, episode 275, recorded July 9th, 2020.
00:30 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.
00:49 This is your host, Michael Kennedy.
00:51 Follow me on Twitter where I'm @mkennedy.
00:54 Keep up with the show and listen to past episodes at talkpython.fm.
00:57 And follow the show on Twitter via @talkpython.
01:00 This episode is brought to you by us over at Talk Python Training.
01:04 Python's async and parallel programming support is highly underrated.
01:09 Have you shied away from the amazing new async and await keywords because you've heard it's way too complicated
01:14 or that it's just not worth the effort?
01:17 With the right workloads, a hundred times speed up is totally possible with minor changes to your code.
01:23 But you do need to understand the internals.
01:25 And that's why our course, Async Techniques and Examples in Python, shows you how to write async code successfully as well as how it works.
01:33 Get started with async and await today with our course at talkpython.fm/async.
01:39 Conor, welcome to Talk Python To Me.
01:41 Thanks for having me on.
01:42 Excited to be here.
01:42 I'm excited too.
01:43 It's going to be beautiful, man.
01:44 Hopefully.
01:45 Hopefully.
01:47 Yeah.
01:47 It's going to be a beautiful refactorings.
01:49 So I am a huge fan of refactoring.
01:51 I've seen so many people try to just overthink the code that they're writing.
01:57 They're like, well, I got to get it right.
01:58 And I got to think about the algorithms and the way I'm writing it and all this stuff.
02:02 And what I found is you don't really end up with what you want in the end a lot of times anyway.
02:07 And if you just go in with an attitude of this code is plastic, it is malleable, and I can just keep changing it.
02:13 And you always are on the lookout for making it better.
02:15 You end up in a good place.
02:16 Yeah, I completely agree.
02:17 Refactoring is not a one-time thing or something that happens only, you know, two years from when you initially write the code.
02:23 I heard once actually that refactoring goes a lot in hand with legacy code.
02:29 And there's a number of different definitions for legacy code.
02:33 But one definition is legacy code is code that isn't actively being written.
02:38 So if you write something once and then you consider it done, and then the next week, like no one's working on it, that technically, according to that person's definition is legacy code.
02:46 So that can be refactored.
02:48 You know, you can refactor something you wrote earlier in the day.
02:50 It doesn't have to be a year later or 10.
02:52 Yeah, absolutely.
02:53 I mean, you just, you get it working, you know a little bit more, you apply that learning back to it.
02:58 And with the tooling these days, it's really good.
03:01 It's not just a matter of, you know, if you go back to 1999 and you read Martin Fowler's refactoring book, he talks about these are the steps that you take by hand to make sure you don't make a mistake.
03:12 And now the steps are highlight, right click, apply, refactoring.
03:16 I mean, that's not 100% true.
03:18 And the example we're going to talk through is not like that exactly.
03:21 But there are steps along the way where it is, potentially.
03:24 Definitely.
03:24 Linters and static analyzers are heavily underutilized, I feel.
03:29 And so many of them will just automatically apply the changes that you want to do.
03:32 And it's fantastic for huge code bases.
03:34 It would be almost impossible to do it by hand.
03:37 Yeah, absolutely.
03:38 It would definitely be risky.
03:39 So maybe that's why people sometimes avoid it.
03:42 Now, before we get into that, though, let's start with your story.
03:44 How did you get into programming and into Python?
03:46 I know you're into a lot of languages.
03:47 We're going to talk about that.
03:48 But Python too?
03:49 Python also?
03:49 Yeah.
03:50 So the shorter, it's a long story, but the shorter version of it is my degree in university,
03:56 which wasn't computer science, required at least two introductory CS courses.
04:02 So the first intro course was in Python.
04:04 The second one was in Java.
04:06 And then I ended up really, really enjoying the classes.
04:09 I ended up taking a couple more, but ultimately stuck with the career that I had entered into,
04:15 which was actuarial science.
04:17 That's like insurance statistics.
04:18 Yeah.
04:19 So you were in some form of math program, I'm guessing?
04:21 Yeah, yeah.
04:22 Yeah, cool.
04:22 It's very, very boring to explain.
04:25 But if you like math, it's a great career.
04:27 Yeah, awesome.
04:28 And so I ended up, for my first job out of university, I ended up working at a software company, basically,
04:34 that very simply explained, created the insurance calculator that many insurance companies use.
04:41 And after working there for about four or five years, I had just fallen in love with the
04:47 software engineering side of my job and had decided that I wanted to transition full time to like a purely technical company.
04:54 So it's several years or a couple years later.
04:58 And now I work for NVIDIA as a senior library software engineer.
05:02 And that's how I got into programming.
05:04 And our code base that we work on is it's completely open source and primarily uses C++14 and Python 3.
05:11 That's where Python enters.
05:13 That sounds like a dream job.
05:14 Yeah, that sounds awesome.
05:15 Yeah, I absolutely love it.
05:16 Yeah, so you're working on the Rapids team, right?
05:19 Which works on doing a lot of the computation that might be in Pandas, but over on GPUs.
05:24 Is that roughly right?
05:25 Yeah, that's a great description.
05:27 So yeah, within NVIDIA, I work for an organization called Rapids.
05:31 We have a number of different projects.
05:33 So specifically, I work on cuDF.
05:36 That is C-U-D-F.
05:37 So the CU is two letters C-U from CUDA, which is like the parallel programming language that NVIDIA has made.
05:46 And the DF stands for data frame.
05:48 And so this is basically a very similar library to Pandas.
05:53 The difference being that it runs on the GPUs.
05:56 So sort of the one liner for Rapids is it's a completely open source, end-to-end data science pipeline that runs on the GPUs.
06:03 So if you're using Pandas and it works great for you, like there's no reason to switch.
06:07 But if you run into a situation where you have a performance bottleneck, cuDF can be like a great drop-in replacement.
06:13 We don't have 100% parity with like the Pandas library, but we have enough that a lot of Fortune 500 companies that pick up and use us are able to very easily transition their existing code in Pandas to cuDF.
06:25 Right.
06:25 Change an import line, go much faster, something incredible like that.
06:28 That's the goal.
06:29 That's the dream.
06:31 Yeah, I just recently got a new Alienware, a high-end Alienware desktop.
06:35 And it's the first GeForce I've had in a long time that's, you know, not like, I don't know, some AMD Radeon in a MacBook or something like that.
06:43 So I'm pretty excited to have a machine that I can now test some of these things out on at some point.
06:47 Yeah.
06:47 Acceleration on different devices is, it is very exciting.
06:51 Awesome.
06:52 All right.
06:53 Well, let's start by introducing real briefly a little bit about refactoring.
06:58 We've talked a tiny bit about it in general.
07:00 And then we're going to dive into a cool example that you put together that really brings a lot together.
07:06 And what I love about your example is it's something you've just gone and grabbed off the internet.
07:11 It's not like a contrived, like, well, let's do this and then unwind the refactorings until it does it.
07:16 It's like you just found it.
07:17 And like, well, let's see what this thing does.
07:19 That's going to be fun.
07:19 But let's just start with a quick definition of refactoring.
07:23 Maybe how do you know when you need it?
07:25 How do you know when you need refactoring?
07:27 For me, I have a sort of number of anti-patterns in my head that when I recognize them in the code, some people might refer to them as sort of technical debt.
07:37 This idea that the first time you write things or maybe initially when you write things, you don't have the full picture in mind.
07:42 And then as time goes on, you start to build up technical debt in your code.
07:46 And a refactoring can be reorganizing or restructuring your code or rewriting little bits of it to basically reduce tech debt, to make it more readable, maintainable, scalable, and just, in general, better code.
07:58 That's sort of the way I think of it.
07:59 Yeah, it is pure sense, right?
08:01 It should not change the behavior, at least in terms of like inputs, outputs.
08:05 Exactly.
08:06 Yes.
08:06 The easiest code to refactor is code with tests, whether that's unit tests or regression tests or any of the other number of tests that there are.
08:14 If you have a code base that has zero tests, refactoring is very, very dangerous because you can refactor something and completely change the behavior and not know about it, which is not ideal at all.
08:24 Somewhat suboptimal, indeed.
08:26 You know, Martin Fowler, when he came up with the idea of refactoring, or at least he publicized, I don't know, I'm sure the ideas were basically there before.
08:33 One of the things that struck me most was not the refactorings, but was this idea of code smells.
08:40 And it's like this aesthetic of, right, like I look at the code and it, yeah, it works, but like your nose kind of turns up, you're kind of like, ew, no, ew, but it still works, right?
08:49 It's like not broken, but it's not nice.
08:53 And, you know, there's all sorts of code smells like too many parameters, long method, things like that.
08:59 But they rarely have clear cutoffs, right?
09:02 Like, well, if it's over 12 lines, the function is too large, but under that, it's totally fine, right?
09:06 Like that's not, it's never really super clear cut.
09:08 So I think this whole idea of refactoring, much like refactoring itself, requires like going over it and over it as sort of through your career to refine, like what the right aesthetic to achieve is.
09:20 And it probably varies by language as well a little.
09:22 Yeah.
09:22 If you start to do it like consciously when you're looking at code and asking yourself, like when you have that code smell feeling like something's not right here.
09:31 If you are conscientiously like paying attention to what it is, like slowly over time, you will start to pick up on exactly what it is about it.
09:37 Like a very, very small one for me.
09:39 And I think this is mentioned in maybe a clean code or it might've been Martin Fowler's book.
09:44 It's like declaring a variable earlier than it needs to be declared.
09:48 So you might declare like all your variables at the top of a function, but then like two of them you use immediately.
09:53 But the other three you don't use until the last, you know, four lines of the function.
09:57 Small things like that.
09:58 It seems simple, but I've made the change where I've put the declaration closer to where it gets used.
10:03 And then you realize, oh, wait a second, this isn't actually referenced.
10:07 Like it's set to something, but then it's not actually used later on.
10:10 So I can just delete this.
10:11 And it's because it was at the top of the function.
10:12 You can't see where it's being declared or if it's used somewhere else, so you actually just have a phantom unused variable that can be deleted.
10:20 It's simple things that lead to better changes later on.
10:24 Well, and just mental overhead.
10:26 Like you said, the technical debt side of things.
10:28 So for example, there's the variable that was at the top.
10:32 Literally when the code was written, it was being used, but it's been modified over the years.
10:37 And now no longer is it being used, but because it's separated from where it's declared to where it's used.
10:41 You don't want to mess with that.
10:43 Like if you start messing with that, you're earning more work, right?
10:47 You're asking for more.
10:47 I'm just going to make the minor change.
10:50 I don't want to break anything.
10:51 Who knows?
10:51 And then the next person that comes to try to understand it, they got to figure out, well, why is there that like set count variable?
10:58 Like, I don't feel like it's being used, but it's there.
11:01 And like, you know, you just got to, it's another thing to think about that's in the way.
11:04 Yeah, for sure.
11:05 Yeah.
11:06 So certainly I think it's viable.
11:07 There are fantastic tools that will like highlight this variable is unused or this assignment
11:13 is meaningless or something like that.
11:15 So there are options, but still it's better to not let that stuff live in the code.
11:20 Yeah.
11:21 A hundred percent agree.
11:21 Let's talk about this example that you've got here and maybe you should give a little background on your language, enthusiasm and programming competition interests and so on.
11:33 Your interest in coding competitions, I think it's probably worth touching on already.
11:38 But then this example is from you trying to reach out and understand it and do some analysis of those environments or those ecosystems, right?
11:46 What's the background with this, these different languages and coding competition?
11:50 Yeah, so I initially got into competitive programming, quote unquote.
11:54 So just the one sentence description is there's a number of websites online, HackerRank, LeetCode, CodeForces, that host these one to three hour contests where they have three to four or five problems that start out easy and then get harder as you progress through them.
12:09 And you can choose any language you want to solve them in.
12:11 And the goal is just to get a solution that passes as quickly as possible.
12:16 So it's not necessarily about how efficient your code is.
12:20 It has to run within a certain time limit.
12:22 But if you can get it to run or pass in Python versus C++ versus Java, any code solution works.
12:28 I started doing these to prepare for technical interviews.
12:31 So if you're interviewing for companies like Google, Facebook, etc., a lot of their interview questions are very similar to the questions on these websites.
12:38 And so I at one point was looking for a resource online, like for YouTube videos that just explain this stuff.
12:44 But at the time, I couldn't really find any.
12:46 So I started a YouTube channel covering the solutions to these problems.
12:50 And I thought it would be better to solve it in a number of languages as opposed to just C++.
12:55 So I started solving them in C++, Python, and Java.
12:59 And that's sort of what led to my interest in competitive programming.
13:03 Even though I'm not interviewing actively anymore.
13:06 I just find these, they're super fun.
13:08 It keeps you sort of on your toes in terms of your data structure and algorithms knowledge.
13:12 And you can treat them as like code katas.
13:15 I'm not sure if you're familiar with the concept of just sort of writing one little small program and trying it a couple times in different languages.
13:22 And you learn different ways of solving the problem that you might not have initially used.
13:28 This example, I decided to just figure out what are the top languages that people use to solve these competitive programming problems on a given website.
13:38 So the site that I chose was CodeForces.
13:40 Yeah.
13:41 And you're like, hey, I'm working on this new data frame library that's like pandas.
13:45 Let me see how I can use pandas to solve this problem and get some practice or something, right?
13:49 Yeah, yeah.
13:50 So when I had just started NVIDIA, I knew that the pandas library existed, but I had zero experience with it.
13:56 And I knew that it had this sort of group by reduction functionality that if you had a big table of elements, you could get these sort of statistics on, you know, what's the top language or what's, you know, the average time it takes for people to submit very easily with this kind of library.
14:10 So I thought, what better way to learn pandas than by trying to build a simple example that uses this library for something that I'm interested in.
14:18 And so the first thing that I did was I googled, you know, how to scrape HTML tables using pandas.
14:24 And then it brought me to this blog that at the end of the day has about 60 lines of code.
14:29 And it's a tutorial blog.
14:30 So it walks you through how to get this code off of an HTML table.
14:36 And basically the PyCon talk that I gave, it came out of doing this.
14:40 I had no plans of giving a PyCon talk on this.
14:43 I just, after having gone through it and sort of refactoring one by one, I realized that like I could give a pretty simple talk to like Conor like five years ago,
14:53 who didn't know about any of this. I didn't know about list comprehensions.
14:57 I didn't know about enumerate.
14:58 I didn't know about all the different techniques I was using.
15:00 And I figured it would be at least for some individuals out there.
15:04 It would be a useful talk highlighting the things that I didn't know when I first started coding in Python.
15:09 But that now are like second nature for me.
15:12 And that's sort of where the talk came from.
15:13 Yeah.
15:13 And it's really interesting.
15:14 The example is cool.
15:15 I do think that a lot of the refactorings were let's try to make a more Pythonic version of this and more idiomatic
15:23 version of this, like misunderstanding the for-in loop, for example, and treat them all right.
15:29 So in a lot of ways, it's a cool refactoring.
15:32 But it's also kind of leveraging more of the native bits of the language, if you will.
15:37 Absolutely.
15:38 Yeah.
15:38 Yeah.
15:38 So you went and grabbed this code and it does two basic things.
15:42 It goes and downloads some HTML and then pulls it apart using, I think, LXML HTML parser.
15:50 And then it's going to loop over the results that it gets from the HTML parser and turn this into basically a list or a dictionary.
15:59 Then you're going to feed that over to pandas.
16:00 Ask pandas some pretty interesting questions.
16:03 And most of the challenge or most of the messy code lived in this HTML side of things, right?
16:09 Yeah.
16:09 That's a pretty good description of what's happening.
16:11 Cool.
16:12 So let's go and just talk through some of the issues you identified and then the fix, basically knowing like how did you identify that as a problem?
16:21 And then what fix did you apply to it?
16:25 Now, there's a lot of code and it's hard to talk about code and audio.
16:28 So we'll maybe try to just like as high level as possible, talk about like the general patterns and what we fixed.
16:33 The first part of the code would go through and it would create an empty list and it would create like an index to keep track of where it was and then did a loop over all of the elements.
16:42 Increment the index, add a thing to the list, print out some information as it went, right?
16:46 Yep.
16:46 And I think the first thing that you talked about was the code comments, actually.
16:51 You're like, what is this code comment here?
16:53 It just says we're looping over these things.
16:55 Well, what do you think a loop is?
16:56 Why do we have this comment?
16:57 Yeah, even worse. Like, arguably the second comment, some might argue, adds some value.
17:03 But the first comment above the line that creates an empty list, it says create empty list.
17:10 And it's only, what is that, six characters if you don't include the spaces.
17:13 And I think that's definitely one of the things that's called out in a number of refactoring books is comments should add value that is not explicitly clear from the code.
17:24 I think even beginners are able to tell that you're creating an empty list there.
17:28 There's no reason to basically state what the code is doing.
17:31 Typically, comments should say why if it's not clear why something is being done a certain way or something that's implicit and not explicitly clear from what the code is doing.
17:40 Yeah.
17:40 In terms of refactoring, I love this idea of these comments are sort of almost warning signs.
17:46 Because if I find myself writing one of these comments to make stuff more clear, I'm like, wait a minute, wait a minute.
17:51 If this is just describing what's here, something about what I'm doing is wrong.
17:56 Maybe the variable name is not at all clear what the heck it is.
17:59 Or maybe it could use a type annotation to say what types come in instead of here's a list of strings.
18:03 Like how about list bracket string goes there to just say what type it is.
18:06 It's Python 3 after all.
18:07 And, you know, from the Code Smells book, Fowler had this great description of calling these types of comments deodorant for Code Smells.
18:16 So there's something wrong.
18:18 It smells a little less bad if we like lay it out, set the stage.
18:21 But every time I see one of those, I'm like, you know what?
18:24 I just need to rename this function to like a short version of what this comment would say.
18:27 Or rename this variable.
18:29 Or like restructure and break these things apart.
18:31 Because if it needs a comment, it's probably just too complicated.
18:34 There's an individual in the C++ community.
18:36 His name is Tony Van Eerd.
18:37 And he has a rule, or not a rule, but a recommendation that you should grep your code base for step one, step two, step three.
18:45 And guaranteed you're going to get like one or two matches.
18:48 And a lot of times it's these steps of comments on top of pieces of code and like a larger function.
18:54 And odds are you could make that code a lot better by refactoring each of those steps into its own small function.
19:00 And just whatever the step, like if you put step one in a description, you've already given that piece of code a name.
19:05 You just need to take the next step, put it in a function and give that function that name.
19:09 Yes, exactly.
19:10 Which is exactly what you said.
19:11 Exactly.
19:12 I think there was even some tool way, way, way back in the early days of C# that if you would highlight some code to refactor it and you highlighted a comment, it would function nameify.
19:23 It would try to guess the function name by using the comment, turning it into a function, you know, like something that would work as an identifier in the language.
19:31 Anyway, it's totally a good idea.
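Editor's sketch of the "step one, step two, step three" advice: each step comment already names a piece of code, so it can become a function with that name. The function names and sample data here are hypothetical, invented for illustration, and are not from the talk or the original blog post.

```python
from collections import Counter


def parse_rows(lines: list[str]) -> list[list[str]]:
    """Formerly '# Step 1: split each line into fields'."""
    return [line.split(",") for line in lines]


def keep_complete_rows(rows: list[list[str]], width: int) -> list[list[str]]:
    """Formerly '# Step 2: drop rows with missing fields'."""
    return [row for row in rows if len(row) == width]


def count_by_language(rows: list[list[str]]) -> dict[str, int]:
    """Formerly '# Step 3: count rows per language'."""
    return dict(Counter(language for _name, language in rows))


def summarize(lines: list[str]) -> dict[str, int]:
    # The body now reads like the old comments, with no comments needed.
    rows = parse_rows(lines)
    rows = keep_complete_rows(rows, width=2)
    return count_by_language(rows)
```

Calling `summarize` on a handful of `"name,language"` strings returns a count per language and silently skips malformed lines.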
19:34 So there's a couple of things going on here.
19:36 One is like, why is there a print statement?
19:38 Nobody needs this.
19:39 Once you take that out, though, you are able to identify this.
19:42 Well, let's take a step back.
19:43 First, if you have an integer and you're incrementing it every time through the loop so that it stays in sync with the index of the elements you're looping over, that's probably not the best way to do it, right?
19:54 Like Python has a built-in enumerate.
19:56 Yeah.
19:56 This is probably one of the most common things I see in Python.
20:00 Sadly, in certain languages, they don't have this function.
20:04 But in Python, it's right there built into the language.
20:07 And as you mentioned, it's called enumerate.
20:09 So you can pass whatever thing you're looping over to enumerate.
20:13 And that's going to bundle it with an index that you can then inline destructure into an index and the element that you were getting from your ranged for loop before.
20:22 So anytime you see an index, IDX or I or something that's keeping track of the index and that's getting... it could be J.
20:31 Sometimes it's J.
20:32 Sometimes it's J.
20:32 Sometimes it's K.
20:33 X or Y if you're being really creative.
20:36 And yeah, like there is a built-in pattern for basically avoiding that.
20:39 And it makes me extremely happy.
20:41 Like it happens actually not just once in this piece of code, but twice where you can make use of enumerate.
20:47 And once you see it, it's very hard to unsee it.
20:49 But like I said, this was something that I learned enumerate from Python.
20:53 And this was not something that I knew of and I didn't learn in school.
20:57 So there's a lot of Python developers and just developers in many languages out there that I think they're just not aware.
21:02 And as soon as you tell them, I think they'll agree.
21:05 Oh, yeah, this is way better than what I was doing before.
21:07 Yeah.
21:07 You just need to be aware of it.
21:09 You know, you always run into these issues.
21:10 You've got to create the variable.
21:11 Then like, why is the variable there?
21:12 Then you've got to make sure you increment it.
21:14 Do you increment it before you work on that with the value or do you increment it after?
21:18 Is it zero based?
21:19 Is it one based?
21:20 All of these things are just like complexities that are like, what is happening here?
21:23 Like, what if you have a have an if test continue and you skip the loop, but you forget to increment it?
21:29 Like there's all these little edge cases.
21:30 And you can just with enumerate, you can say, you know, it's always going to work.
21:34 You can even set the start position to be one if you want it to go one, two, three.
21:37 Beautiful.
21:38 Yeah, that's a great point.
21:39 Yeah, there are use cases where you're going to run into bugs.
21:42 Whereas with enumerate, you know, at least you're not going to have a bug with that index.
21:46 Right.
21:46 It's always going to be tied to the position with the starting place the way you want it.
21:50 So yeah, yeah, that's really nice.
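Editor's sketch of the manual counter next to `enumerate`, including the `start` argument mentioned above. The list contents are made up for the example.

```python
languages = ["C++", "Python", "Java"]

# Manual counter: you have to create it, increment it, and pick the base.
manual = []
rank = 1
for language in languages:
    manual.append(f"{rank}. {language}")
    rank += 1

# enumerate bundles each element with its position; start=1 gives 1, 2, 3.
ranked = []
for rank, language in enumerate(languages, start=1):
    ranked.append(f"{rank}. {language}")
```

Both loops build the same list, but the second has no counter to forget.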
21:52 But it's not super discoverable, right?
21:54 Like there's nothing in the language that screams and waves its hands.
21:57 It says, yeah, you're in a for loop.
21:59 We don't have this concept of a numerical for loop, but this is actually better than what
22:03 this is what you wanted.
22:04 You didn't even know you wanted it.
22:05 Yeah, it has to be something that you stumble across.
22:07 Interestingly, some languages (Go is the one that comes to mind)
22:12 actually build the enumerate into their range-based for loop.
22:16 So in Go, they have built in basically the destructuring.
22:21 And if you don't want the index, if you just want a range based for loop and you want to
22:25 ignore the index, then you're just supposed to use the underbar to say, I don't need the
22:29 index for this loop.
22:30 But it's interesting that like Go is a more recently created language than Python, at least.
22:35 When they decided like they thought it was such a common use case that they would think that
22:40 most people need it more often than they wouldn't.
22:42 So they built it into their for loop.
22:43 So with that language, you can't avoid learning about it because it's in their for loop.
22:48 It's a syntax error to not at least say I explicitly ignore this.
22:51 Yeah.
22:51 Oh, interesting.
22:52 I didn't know that about Go.
22:53 So now you've got this little cleaner.
22:55 You look at it again and you say, well, now what we're doing is we're creating a list,
22:59 an empty list, which we commented, create empty list.
23:02 That was cool.
23:03 Took that comment out, but it was very helpful in the beginning to help you understand.
23:06 No, just kidding.
23:06 And then you say we're going to loop over these items and then append something to that
23:10 list.
23:11 Well, that's possible.
23:12 But this is one of your anti-patterns that you like to find and get rid of, right?
23:18 This is an anti-pattern that I call initialize, then modify.
23:21 And actually, the enumerate example previously also falls into this anti-pattern.
23:27 So anytime you have a variable that it doesn't need to be a for loop, but many, many times it
23:32 is that inside each iteration of that for loop, you are then modifying what you just
23:37 initialized outside.
23:38 That is initializing and then modifying.
23:40 And my assertion is that you should try to avoid this as much as possible.
23:44 And when it comes to the pattern of initializing an empty list, and then in each iteration
23:48 of your for loop, you're calling append.
23:50 That is built in to the Python language as something that can be used as a list comprehension,
23:56 which is so much more beautiful, in my opinion, compared to just a raw for loop and then
24:02 appending for each iteration.
24:03 Yeah.
24:03 Every now and then, there's like a complicated enough set of tests or conditionals or something
24:09 going on in there that maybe not.
24:11 But I agree with you most of the time.
24:13 That just means what I really wanted to write was a list comprehension.
24:17 It is.
24:17 So, you know, bracket item for item in such and such, if such and such, right?
24:22 That's what you got to do.
24:24 Yeah.
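Editor's sketch of the "initialize, then modify" pattern and the list comprehension that replaces it. The sample data is hypothetical.

```python
raw_cells = [" C++ ", "", "Python", "  ", "Java "]

# Initialize, then modify: an empty list plus append in every iteration.
cleaned_loop = []
for cell in raw_cells:
    if cell.strip():
        cleaned_loop.append(cell.strip())

# The same result as a list comprehension.
cleaned = [
    cell.strip()            # what to keep
    for cell in raw_cells   # where it comes from
    if cell.strip()         # which items qualify
]
```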
24:24 List comprehension, once you start to use it, moving to a language that doesn't have it
24:27 makes you very sad because it's such a convenient syntax.
24:31 It totally makes you sad.
24:32 And I really, really wish list comprehensions had some form of sorting clause because at
24:40 that point, you're almost into like in-memory database type of behaviors, right?
24:45 Like I would love to say projection, you know, thing, transform thing, for thing in collection,
24:50 where the test is, order by, whatever, right?
24:54 I mean, you can always put a sorted around it, but it'd be lovely if they're like, it's
24:58 already got those nice steps.
24:59 I like to write it on three lines, right?
25:01 The projection, the set, and the conditional, like just one more line, put the order by in
25:06 there, but maybe someone, or maybe I should put a PEP in there.
25:09 Who knows?
25:10 I was going to say that sounds like a future PEP, but.
25:11 It definitely does.
25:13 I mean, it would be easy to implement, just transform it to a sorted and pass that as a
25:17 key or something like that.
25:18 But anyway, it would be really cool.
25:19 But they're very, very nice, even without that.
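Editor's sketch of "putting a sorted around it": there is no order-by clause in a comprehension, but wrapping one in `sorted` with a `key` gets a similar effect. The tuples are invented sample data.

```python
submissions = [
    ("alice", "C++", 120),
    ("bob", "Python", 45),
    ("carol", "C++", 300),
    ("dave", "Java", 80),
]

fastest_first = sorted(
    (
        (name, seconds)                           # projection
        for name, _language, seconds in submissions
        if seconds < 200                          # condition
    ),
    key=lambda pair: pair[1],                     # the "order by"
)
```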
25:23 And once you have it as a list comprehension, then it unlocks the ability to do some other
25:29 interesting stuff, which you didn't cover in yours because it didn't really matter.
25:32 But if you have square brackets there, and those brackets are turning a large data collection
25:37 into a list, if you put rounded brackets, all of a sudden you have a much more efficient
25:43 generator.
25:43 Yep.
25:43 That is something I don't call out at that point.
25:46 But at the end of the talk, I allude to an article that was mentioned on the other podcast
25:51 that you co-host, Python Bytes.
25:53 Yeah.
25:53 Thanks for the shout out on that one, by the way.
25:54 Yeah, no, it was a great article, but it mentions generator expressions right after it mentions
25:59 list comprehension.
25:59 And I mentioned that these things go hand in hand and that you should familiarize yourself
26:03 because if at any point you're passing a list comprehension to like an algorithm, like
26:08 any or all or something, you can drop the square brackets and then just pass it the generator
26:13 and it'll become much more efficient.
26:15 So it's good to know both of them.
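Editor's sketch of passing a generator expression to `any`. The helper `is_text` and the `checked` list are hypothetical and exist only to show which items get examined. The saving comes from not building the intermediate list and from letting `any` stop early.

```python
cells = ["12", "7", "n/a", "42"]

# List comprehension: every cell is tested before any() sees the list.
has_text_list = any([not cell.isnumeric() for cell in cells])

checked = []


def is_text(cell: str) -> bool:
    checked.append(cell)  # record which cells were actually examined
    return not cell.isnumeric()


# Generator expression: any() pulls items one at a time and stops at
# the first True, so the last cell is never examined.
has_text = any(is_text(cell) for cell in cells)
```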
26:17 And there's no way to go from a for loop really quickly and easily to a generator.
26:22 That's a good thing.
26:23 That's a good thing.
26:24 Right.
26:25 There's not like for yield, I and whatever, right?
26:28 Like there, but with the comprehensions, it's square brackets versus rounded parentheses, right?
26:34 It's so, it's so close that if that makes sense, it's like basically no effort to make
26:38 it happen.
26:38 Yeah.
26:38 Yeah, nice.
26:39 Okay.
26:39 So we've got into a list comprehension, which is beautiful.
26:43 And then you say, all right, it's time to turn our attention to this doubly nested for loop.
26:48 And it's going to go over a bunch of items.
26:50 And it's going to go over a bunch of the items and pull out an index and then, you know, go and work with that index.
26:57 So it's another enumerate.
26:59 And then I think another thing that's pretty interesting that you talk about, I don't remember exactly where it came in the talk, but you're like, look,
27:06 what you're doing in this loop is actually looping from like the second onward for all the items.
27:13 And that really is just a slice.
27:15 Yeah.
27:15 Yeah.
27:16 So in this nested for loop, the outer for loop is basically reads for J in range of one to the length of your list.
27:26 So you're basically creating a range of numbers from one to the length of your list.
27:31 And then right inside that for loop, you're creating a basically a variable that's the Jth element of your list.
27:40 So all you're doing is skipping the very first element of your list.
27:44 But the way you're doing this is generating explicit indices based on the range function and the length function.
27:52 And I thought at first that they must be doing this because we need access to the index later or we need access to our elements later.
28:01 But that wasn't the case.
28:02 It just seemed like the only reason they were doing all of this was to skip over the first element.
28:07 And so very nicely, once again, Python has very, very many nice features.
28:12 They have something called slicing, where you use the square bracket syntax: an opening square bracket, then something in the middle, and a closing square bracket.
28:20 And in order to skip the first one, you just go one colon.
28:23 Yeah, one to the end.
28:24 And that's beautiful because you don't even have to check the length of the items.
28:28 You just say go to the end, which avoids errors of like, do I have to plus one here?
28:32 Do I not?
28:33 Is it minus one?
28:34 Like, what is the ending piece?
28:35 But you don't have to worry about just from I skip the first one and the rest.
28:38 Yeah, it's so convenient.
28:39 You avoid making a call to Len.
28:41 You avoid making a call to range.
28:43 And you avoid your local assignment on the first line of your for loop.
28:47 You can basically remove all of that and just use slicing and you're good to go.
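Editor's sketch of skipping the first element with explicit indices versus a slice. The row values are hypothetical.

```python
rows = ["header", "row 1", "row 2", "row 3"]

# Index-based version: a call to range, a call to len, and a local assignment.
by_index = []
for j in range(1, len(rows)):
    row = rows[j]
    by_index.append(row.upper())

# Slice version: rows[1:] means "from index 1 to the end".
by_slice = [row.upper() for row in rows[1:]]
```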
28:52 And yeah, slicing is slicing is a really, really awesome feature.
28:55 It actually comes from a super old language that was created in the 60s called APL.
29:00 And Python is one of the only languages that has something called negative index slicing,
29:04 where you can pass it a negative one so that it wraps around sort of to the last element,
29:09 which is a super, super, it sort of looks weird.
29:12 But once you use it, it's so much more convenient than doing like a Len minus one or something like
29:18 that.
29:18 It's it is a little bit unreadable.
29:20 But once you know what it does, it's great.
29:22 It's great.
29:22 It's like I want the last three.
29:23 I don't want to care how long it is.
29:25 I just want the last three.
29:26 And that's yeah, it's fantastic slicing, I think is fairly underused for people who
29:31 come from other languages.
29:32 But yeah, and it fits the bill because there's so many of these little edge case.
29:36 You talk about errors and programming, like off by one errors are a significant part of
29:40 problems programming, right?
29:42 And it just skips that altogether.
29:44 It's beautiful.
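Editor's sketch of negative indexing and negative slices, using made-up scores.

```python
scores = [10, 20, 30, 40, 50]

last = scores[-1]          # same element as scores[len(scores) - 1]
last_three = scores[-3:]   # the last three, whatever the length is
all_but_last = scores[:-1]
```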
29:44 Yeah.
29:44 So the next thing to do is so you're parsing this stuff out of the Internet, which means
29:51 you're working with 100% strings.
29:52 But some of the time you need numerical data.
29:55 So you can ask questions like, is this the sixth or seventh or whatever?
29:58 And so they have this is going to be fun to talk about.
30:02 They have try value equals int of data.
30:06 So pass the potentially integer-like data over to the int initializer.
30:11 Either that's going to work, or it's going to throw an exception, which case you will say
30:16 except pass.
30:17 Well, not you, the original article had that right.
30:19 So it's this try, parse, except, pass.
30:23 Otherwise, it's going to be none or it's going to be set to the string value or something
30:26 to that effect.
30:27 So what do you think about this?
30:28 How do you feel when you saw that?
30:29 Yeah.
30:29 So my initial reaction was that this is four lines of code that can potentially be done
30:36 in a single line using something called a conditional expression.
30:40 So in many other languages, they have something called a ternary operator, which is typically
30:45 a question mark, where you can do an assignment to a variable based on a conditional predicate.
30:50 So something that's just asking true or false.
30:53 And if it's true, you assign it one value.
30:55 And if it's false, you assign it another value.
30:56 So in Python, they have something called a conditional if expression, which has the syntax assigned
31:03 to value using the equal sign, ask your question.
31:06 So in this case, we just ask, is it an int?
31:09 Or sorry, it's so the first thing that returns, it's actually backwards from ternary operator.
31:13 So the line reads data equals int of data if, and then check your predicate.
31:19 And in Python, we can just call isnumeric on our value, which will return us true or false
31:24 based on whether it's a number.
31:25 So if that returns true, then it'll end up assigning to data int of data.
31:30 Otherwise, you can just assign it itself data.
31:33 And then it's not going to do any transformation on that variable because it's not numeric.
31:38 It's one line of code.
31:39 It's more expressive, in my opinion, and it avoids using try and except.
31:43 And it's preferable from my point of view.
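Editor's sketch of the two versions being compared. The function names are hypothetical, and the `try` version here catches `ValueError` rather than using the bare `except: pass` described above. One caveat: the two are not exactly equivalent, since `str.isnumeric()` is false for a string like `"-5"` even though `int("-5")` succeeds.

```python
def convert_try(data: str):
    # Four-line version: attempt the conversion, ignore failure.
    try:
        data = int(data)
    except ValueError:
        pass
    return data


def convert_conditional(data: str):
    # One-line version: value_if_true if predicate else value_if_false.
    return int(data) if data.isnumeric() else data
```

For plain digit strings and ordinary words the two functions agree. They differ on signed numbers, which the conditional version leaves as strings.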
31:46 I would say it's probably preferable from my point of view as well.
31:49 I have mixed feelings about this, but I do think it's nice under certain circumstances.
31:55 One, for example, if you say try, do a thing, except, pass, a lot of linters and PyCharm
32:00 and whatnot will go, this is too broad of a clause.
32:03 You're catching too much.
32:04 And you're like, okay, well, now to make the little squiggly in the scroll bar go away,
32:10 I have to put a hashtag disable check, whatever, right?
32:14 And I'm like, well, now it's five lines, one with a weird exception to say, no, no, this time
32:18 it's fine.
32:19 So that's not ideal.
32:20 I definitely think that it's more expressive to use this conditional if one liner.
32:26 The one situation where I might step back and go, you know, let's just do the try is if
32:32 there's more variability in the data.
32:34 So this assumes that the data is not none and that it's string like, right?
32:38 But if you got potentially objects back or you got none some of the time, then you need a
32:43 little bit more of a test.
32:44 I mean, you could always do if data and data.isnumeric().
32:47 That's okay.
32:48 But then it's like if data and isinstance(data, str) and, like, there's some level
32:54 where there's enough tests that it becomes, you kind of like, fine, just let it crash.
32:58 Right.
32:58 And then we'll just catch it and go.
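Editor's sketch of the trade-off when the input might be `None` or a non-string object. Both function names are hypothetical. The first keeps adding tests, and the second lets the conversion fail and catches the specific exceptions.

```python
def to_int_if_numeric(data):
    # Look before you leap: each new kind of input needs another test.
    if isinstance(data, str) and data.isnumeric():
        return int(data)
    return data


def to_int_or_original(data):
    # Ask forgiveness: int(None) raises TypeError, int("x") raises ValueError.
    try:
        return int(data)
    except (TypeError, ValueError):
        return data
```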
33:00 But we were talking before we hit record.
33:03 Also, like there's a performance consideration potentially.
33:06 Yeah, definitely.
33:06 And it's interesting.
33:07 I'll let you speak to what you found.
33:09 But on the YouTube comments of the PyCon talk, that was one of the probably the most discussed
33:15 things was whether or not the conditional expression was less performant than the original try and
33:22 except because a couple individuals commented that it was more Pythonic to use the
33:26 try and except and therefore it might be more performant.
33:29 But you can share with what you found.
33:31 Sure.
33:31 Well, I think in terms of the Pythonic side, like certainly from other languages, like say
33:36 C, C++, there's more of this
33:38 "it's easier to ask for forgiveness than permission" style of programming rather than the alternative,
33:44 "look before you leap."
33:45 Right.
33:45 Because in like C, it can be a page fault and the program just goes poof and goes away.
33:50 If you do something wrong, whereas this, it's just going to throw an exception.
33:52 You're going to catch it or something like that.
33:54 So there's like this tendency to do this style.
33:56 But in terms of performance, I wrote a little program because I wanted to I'm like, maybe this
34:01 is faster.
34:01 Maybe it's slower.
34:02 Like, let's think about that.
34:03 Right.
34:03 So I wrote a little program, which I'll link to.
34:06 There's a simple gist.
34:07 I'll link to it in the show notes.
34:08 It creates one million, a list with one million items.
34:11 And it uses a random seed that is always the same.
34:14 So there's no, there's zero variability, even though it's random.
34:17 It's like predictable random.
34:19 And it builds up this list of either strings or numbers randomly, a million of them.
34:24 About two-thirds strings, one-third numbers.
34:26 And then it goes through and it just tries both of them.
34:28 It says like, let's just convert this as many of them as we can over to integers and do it
34:33 either with the try-except-pass or just do it with this isnumeric test.
34:36 It is six times.
34:38 I got about 6.5 times faster to do the test, the one line test than it is to let it crash
34:45 and realize that it didn't work.
34:46 Yeah.
34:46 So there you go.
34:47 You heard it here on Talk Python To Me.
34:49 That's right, man.
34:50 Conditional expressions faster than try-except.
35:37 I mean, there's a lot of overhead to throw in an exception and catching it and dealing with all that.
35:42 Now, right, this is a particular use case that varies and like all these benchmarks like might vary.
35:49 Like if you've got 95% numbers and 5% strings, it might behave differently.
35:54 Like, so there's a lot of variations, but here's an example you can play with in what seems like a reasonable example to me.
35:59 It's faster to do the isnumeric test.
36:02 So a lot faster, right?
36:03 Not like 5% faster, but 650% faster.
36:07 So it's worth thinking about.
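Editor's sketch of a benchmark in the spirit of the one described. This is not the gist from the show notes. The list size, the seed, the sample values, and the function names are all assumptions, and the measured ratio will depend on the data mix and the Python version.

```python
import random
import timeit

random.seed(0)  # fixed seed: the same "random" data on every run

# Roughly one-third digit strings, two-thirds plain words (assumed mix).
data = [
    str(random.randint(0, 9999)) if random.random() < 1 / 3 else "word"
    for _ in range(100_000)
]


def with_try(items):
    out = []
    for item in items:
        try:
            out.append(int(item))
        except ValueError:
            out.append(item)  # conversion failed: keep the string
    return out


def with_isnumeric(items):
    return [int(item) if item.isnumeric() else item for item in items]


if __name__ == "__main__":
    for fn in (with_try, with_isnumeric):
        seconds = timeit.timeit(lambda: fn(data), number=5)
        print(f"{fn.__name__}: {seconds:.3f}s")
```

Changing the `1 / 3` ratio is a quick way to see how the share of failing conversions affects the comparison.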
36:08 Yeah.
36:08 Yeah, for sure.
36:09 Let's see.
36:10 So come through.
36:11 And in the end, you had, I mean, a ton of stuff was here.
36:14 It was like 20 lines of code just for these two loops.
36:17 And now you've got it down to four lines of code by basically an outer loop and inner loop, grab the data and append it with this little test that you've got.
36:27 Much nicer.
36:28 I agree.
36:28 Yeah.
36:29 So you went, I think if you look at the overall program at this point, you were doing some analysis or like some reporting.
36:35 You said it started at 60 lines of code and now it's down to 20.
36:39 Yeah.
36:39 That's pretty good.
36:40 Roughly.
36:40 Depending on if you count, you know, empty lines and whatnot, but it was about 60 down to about 10 or 20 lines.
36:46 And at this point, I had sort of pointed out that I had made a mistake.
36:52 So like this was fantastic.
36:53 At least I had thought that, you know, I'd taken a code.
36:56 I'd taken a code snippet from a blog, reduced it by, you know, roughly 75% or 67%, depending on how you measure it.
37:04 But that I had made an even bigger mistake than I had realized.
37:09 And it was that when I had originally, I'd shown Googling for, you know, how to scrape HTML using pandas that I read the second result.
37:17 And the third result was actually what I should have chosen.
37:21 And it was that pandas actually has a read_html function in the library.
37:28 And so the point that I go on to make, if you use that, you go from, you know, 10 or 20 lines down to like four lines of code.
37:34 And you're just invoking this one pandas API, read_html.
37:38 And it's so much better.
37:40 So, you know, refactoring is fantastic.
37:41 But there's some quote about like the best code is no code.
37:44 If you don't have to write anything to do what you want to do and you can just use an existing library.
37:50 That's the best thing that you can do because that's going to be way more tested than the custom code that you've written.
37:55 It's going to save you a ton of time.
37:57 And you're going to end up with ultimately less code to maintain yourself.
38:01 And what's better than having someone else maintain the code that you're using for you?
38:05 Exactly right.
38:07 It gets better for no effort on your part.
38:10 Yeah.
38:10 It might get faster or it might handle more cases of like broken HTML or who knows.
38:15 But you don't have to keep maintaining that.
38:17 It's just read underscore HTML on pandas.
38:20 Just it's probably getting maintained.
38:21 Yeah.
38:21 And so like one of the things that I've echoed in some of the other talks that I've given is knowing your algorithms.
38:27 In C++, definitely there's a whole standard library.
38:30 There's a lot of built-in functions.
38:32 I guess they're not so much called algorithms.
38:34 They call them built-in functions in Python.
38:36 But like there's a whole page where I was just looking at it the other day.
38:39 And there's a ton of them that I'm just not aware of.
38:42 Everyone knows about map, filter, any, all.
38:45 Like I just saw, I think it was called divmod, which was a built-in function for giving you both like the quotient and the remainder.
38:51 Which is like, there's definitely been a couple times where I've needed both of those and you do those operations separately.
38:56 And it's like, if I just knew about it, you can in a single line, you know, you can destructure it using the iterable unpacking.
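Editor's sketch of `divmod` with iterable unpacking, using a made-up duration in seconds.

```python
total_seconds = 3725

# divmod(a, b) returns (a // b, a % b) as one tuple.
minutes, seconds = divmod(total_seconds, 60)
hours, minutes = divmod(minutes, 60)

formatted = f"{hours}:{minutes:02d}:{seconds:02d}"
```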
39:02 Knowing your algorithms is great, but also knowing your libraries.
39:06 Knowing your collections.
39:06 Like the more you get familiar with what exists out there, the less you have to write and the more readable your code is.
39:13 Because if everybody knows about it, we have a common knowledge base that it's transferable from every project you work on.
39:19 Right.
39:20 Yeah.
39:20 Your final version basically had two really meaningful lines.
39:23 One was requests.get.
39:25 The other was pandas.read_html.
39:28 You don't have to explain to anyone who has done almost anything with Python what requests.get means.
39:32 Like, oh yeah.
39:33 Okay.
39:33 So got it.
39:34 Next.
39:35 Right.
39:35 We all know how that works.
39:36 We know it's going to work.
39:37 And so on.
39:38 And it's really nice.
39:39 I think, though, what you've touched on here actually is really important, but it also shows why it's kind of hard to get really good at a language.
39:48 And the reason is there are so many packages, right?
39:51 You go to PyPI.
39:52 Let me try pypi.org now.
39:54 Every time I go there, it's always more, right?
39:57 So 245,000 packages.
39:59 If you want to learn to be a good Python programmer, you need to at least have awareness of a lot of those and probably some skill set in some of them.
40:06 Because like pandas is one of those.
40:08 Requests is another one, right?
40:09 The four-line solution that you came up with was building on those two really cool libraries.
40:15 And so to be a good programmer and effective means like keeping your eye on all those things.
40:20 And I just think that's, it's both amazing, but it's also kind of tricky because it's like, well, I'm really good with for loops and I create functions now.
40:26 You're like, great.
40:27 You've got 200,000 packages to study.
40:29 Go.
40:30 There's some quote that I've heard before where being a language expert is 10% language, 90% ecosystem.
40:36 And it's, you can't be a guru in, insert any language, if you don't know the tools, if you don't know the libraries.
40:43 It's so much more than just learning the syntax and learning the built-in functions that come with your language.
40:49 It takes years and it definitely doesn't happen overnight.
40:52 It's a challenge for all of us.
40:53 Yeah, for sure.
40:54 You know, maybe it's worth a shout out to awesome-python.com right now as well, which like has different categories you maybe care about.
41:01 And then we'll like highlight some of the more popular libraries in that area.
41:05 That sounds awesome.
41:05 That's a good one.
41:06 Yeah, for sure.
41:07 So you went through and you did nine different steps.
41:10 You actually have those called out very clearly in your slides.
41:13 You can get the slides from the GitHub repo associated with your talk, which I'll link to in the show notes, of course.
41:19 But all of this refactoring talk was really part of the journey to come up with a totally different answer, which was what are the most popular languages for these coding competitions?
41:28 Yeah.
41:29 Ultimate goal was to scrape the data and then to use pandas in order to do that analysis.
41:34 And at the end of the day, I believe the number one, I definitely know the number one language was C++ at about, I think it was 89%.
41:42 And that typically is the case because certain websites, they give the same time limit per language.
41:50 So a website like HackerRank, they vary by language.
41:53 So Python, your execution time that you're allotted is 10 times more for Python.
41:59 So even though Python's slower, they give you a proportionate amount of time.
42:03 But most websites don't do that.
42:04 So the CodeForces website, it gives like you, I think, two seconds execution time, regardless of the language you use.
42:12 And so due to that, most people choose the most performant language, which is C++.
42:16 But in second place was Python.
42:18 And I know a lot of competitive programmers that for the problems where performance isn't an issue that you're trying to solve for, they always use Python because it's about a fraction of the number of lines of code to solve it in Python than it is in any other language.
42:32 Sometimes you can solve a problem in one line in Python.
42:35 And the next closest language is like five lines, which is a big deal when time matters.
42:40 Yeah.
42:40 Yeah.
42:41 Are you optimizing execution time or developer time in this competition?
42:45 Right?
42:45 Yeah.
42:46 It definitely matters what you're trying to solve for.
42:48 So yeah, C++ was first.
42:49 Python was second.
42:50 Java was third.
42:52 And then there was a bunch of fringe languages.
42:54 The top three were C#, Pascal, and Kotlin.
42:57 And yeah, you can see a full list if you go watch the PyCon talk.
43:00 But it was cool, yeah, to find out what was used and what wasn't.
43:03 Yeah, it sure was.
43:04 And it was cool to see the evolution of what you created to answer that question, which is pretty neat.
43:09 All right, well, let's just talk a little bit about Rapids.
43:12 Because I know that people out there are, there's a lot of data scientists, and they're probably interested in that project.
43:18 So we did mention a tiny bit that it's basically take pandas data frames, apply something like that, that API, pretty close, not 100% identical and everything, but pretty close.
43:30 And it runs on GPUs.
43:31 So why are GPUs better?
43:33 Like, I have a really fast computer.
43:35 I have a Core i9 with like six cores I got a couple years ago.
43:38 That's a lot of cores, right?
43:39 So yeah, well, first thing I should highlight, too, is that RAPIDS is more than just cuDF.
43:46 So cuDF is the library I work on.
43:48 We also have cuIO, cuGraph, cuSignal, cuSpatial, cuML.
43:53 And each of those sort of map to a different thing in like the data science ecosystem.
43:59 So cuDF definitely is the analog of pandas.
44:03 cuML, I think the sort of analog you can think of is like scikit-learn.
44:07 But also, too, like none of this is meant as replacements.
44:10 They're just meant as alternatives.
44:12 Like if performance is not an issue for you, like stick with what you have.
44:16 There's no reason to switch.
44:17 Yeah, don't do it because, for example, I couldn't run it on my MacBook, right?
44:21 Because I have a Radeon.
44:22 Right, right.
44:23 If you do want to try it out, I think they're on the Rapids.
44:26 So if you go to rapids.ai, we have a link to a couple examples using like Google Colab
44:32 that are hooked up to like free GPUs that you can just take it for a spin.
44:35 And you need the hardware, but you can go try it out.
44:38 But like our pitch is sort of like this is useful for people that have issues with compute.
44:43 Right.
44:43 And for different pieces, you're going to want different projects.
44:46 So if you're doing pandas like sort of data manipulation, cuDF is what you want.
44:50 But yeah, why are our GPUs faster?
44:53 It's just a completely different device and a completely different model.
44:56 So GPUs, typically it's in the G of the GPU.
45:00 They're known for being great for graphics processing, which is why it's called a GPU.
45:06 But at some point, someone coined the term.
45:08 He actually works on the Rapids team, Mark Harris.
45:12 He coined the term GPGPU, which stands for general-purpose GPU computing.
45:17 It's now typically referred to as just GPU computing.
45:20 But it's this idea that even though the GPU model is great for graphics processing, there
45:26 are other applications that GPUs are also amazing for.
45:31 The next best one is matrix multiplication, which is why they sort of became huge in neural
45:36 nets and deep learning.
45:37 But since then, we've basically discovered that there's not really any domain that we can't
45:42 find a use for GPUs for.
45:43 So there is a standard library in the CUDA model called Thrust.
45:48 So if you're familiar with C++, the standard library is called STL, and it has a suite of algorithms
45:54 and data structures that you can use.
45:56 Thrust is the analog of that for CUDA.
45:59 And it has reductions, it has scans, and it basically has all the algorithms that you
46:05 might find in your C++ STL.
46:07 And if you can build a program that uses those algorithms, you've just GPU accelerated your
46:12 code.
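For readers more at home in Python than C++, the two algorithm families named here, reductions and scans, can be illustrated with the standard library. This is only a CPU-side sketch of what those algorithms compute; it does not use Thrust or a GPU, and the sample values are made up for illustration.

```python
from functools import reduce
from itertools import accumulate
import operator

values = [3, 1, 4, 1, 5]  # hypothetical sample data

# Reduction: fold the whole sequence down to a single value.
total = reduce(operator.add, values)

# Scan (inclusive prefix sum): keep every running result along the way.
running_totals = list(accumulate(values, operator.add))

print(total)           # 14
print(running_totals)  # [3, 4, 8, 9, 14]
```

Both operations are associative combinations of elements, which is what makes them good candidates for parallel execution.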
46:12 However, using Thrust isn't as easy as some might like.
46:16 And a lot of data scientists, they're currently operating in Python and R, and they don't want
46:21 to go and learn C++ and then CUDA and then master the Thrust library just in order to accelerate
46:27 their data science code.
46:28 The Rapids goal is to basically bring this GPU computing model for a sort of general purpose
46:35 acceleration of data science compute or whatever compute you want to the data scientists.
46:41 And so if they're familiar with the Pandas API, let's just do all that work for them.
46:45 So Rapids is built heavily on top of Thrust and CUDA.
46:50 And so we're basically just doing all this work for the data scientists so that they can take
46:54 their Pandas code, like you said, hopefully just replace the import, and you're off to
46:59 the races.
46:59 And some of the performance wins are pretty impressive.
47:03 Like I'm not on the marketing side of things.
47:05 But in the talk I mentioned, I just happened to be listening to a podcast called the NVIDIA
47:10 AI podcast.
47:11 And they had, I believe his name was Kyle Nicholson.
47:13 And by swapping out Pandas for cuDF in their model, they were able to get a 100x performance
47:22 win and a 30x reduction in cost.
47:26 That's 30 times, not 30%.
47:28 Yeah.
47:28 So 3,000 times, like multiplicatively, which is massive.
47:32 That's the difference between something running.
47:34 So if it's 100x in terms of performance, that's the difference between something running in 60
47:38 seconds or an hour and 40 minutes.
47:41 And if you can also save 30x, if that cost you 100 bucks, and now you only have to pay $3,
47:47 it seems like a no-brainer for those individuals that are impacted by performance bottlenecks.
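The arithmetic behind those two examples can be checked in a few lines. The 60-second runtime and the $100 cost are the illustrative figures from the conversation, not measurements.

```python
speedup = 100         # "100x performance win"
cost_reduction = 30   # "30x reduction in cost"

gpu_runtime_s = 60
cpu_runtime_s = gpu_runtime_s * speedup    # 6000 seconds
hours, remainder = divmod(cpu_runtime_s, 3600)
minutes = remainder // 60

cpu_cost = 100.0
gpu_cost = cpu_cost / cost_reduction       # a little over $3

print(f"{hours}h {minutes}m on CPU vs {gpu_runtime_s}s on GPU")  # 1h 40m on CPU vs 60s on GPU
print(f"${cpu_cost:.2f} vs ${gpu_cost:.2f}")                     # $100.00 vs $3.33
```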
47:52 Like I said, if you're hitting Pandas, and it runs in a super short number of seconds,
47:57 it's probably not worth it to switch over.
47:59 Yeah.
47:59 Well, and you probably, you tell me how realistic you think this is, but you could probably do
48:04 some kind of import, conditional import, like in the import, you could try to get the Rapids
48:11 stuff working.
48:12 If that fails, you could just import Pandas as the same thing.
48:15 One is pd, the other is pd.
48:17 And maybe it just falls back to just working on regular hardware, but faster when it works.
48:21 What do you think?
48:22 That is definitely possible.
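A minimal sketch of the conditional import being described, assuming the code only uses calls that both libraries support. The helper name `import_first_available` is hypothetical, and so are the placeholder module names in the demonstration. On a real machine, importing a GPU library without a usable GPU may fail with something other than `ImportError`, so a production version would likely need broader checks.

```python
import importlib


def import_first_available(*module_names):
    """Return the first module in module_names that imports successfully."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not installed here, try the next candidate
    raise ImportError(f"none of {module_names} could be imported")


# Intended use (not executed here, since neither library may be installed):
#     pd = import_first_available("cudf", "pandas")
#
# Demonstration with a deliberately missing module and a stdlib fallback:
fallback = import_first_available("hypothetical_gpu_module_that_is_missing", "json")
print(fallback.__name__)  # json
```

Listing the GPU library first matches the idea of writing for the accelerated version and letting it degrade to the CPU one.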
48:23 There's going to be limitations to it, though, obviously, if you have a sort of cuDF data frame,
48:29 like I don't think you'd be able to do it piecemeal.
48:32 But if you have a large product, what I'm thinking is if you wrote it for the Rapids version, but then let it fall back to
48:40 Pandas, not the other way around.
48:41 If you take arbitrary Pandas code, you try to rapidify it, that might not work.
48:44 But it seems like the other one may well work.
48:47 And that way, if somebody tries to run it, they don't have the right setup.
48:50 It's just slower, but possible.
48:51 What do you think?
48:52 There's definitely a way to do that, to make that work.
48:54 It might require a little bit of, you know, some sort of boilerplate framework code that
48:59 is doing some sort of checking, you know, is this compatible else?
49:02 But like, that definitely sounds automatable.
49:04 Like, yeah.
49:05 Yeah.
49:05 That sounds cool.
49:06 Because that would be great to have it just fall back to like, not not working, just not
49:10 so fast.
49:10 Right?
49:11 Yeah, yeah.
49:11 The future of computing is headed to a place where we can dispatch compute to like different
49:17 devices without having to like, manually specify that, like, I need this code to run on the CPU
49:24 versus the GPU versus the TPU versus in the future, I'm sure there's going to be a QPU for
49:29 quantum processing unit.
49:31 Like, like, exactly.
49:32 Currently, we all think serially or most of us that don't work at NVIDIA, we think serially
49:37 in terms of like, the way that CPUs do compute, but I think in 10 or 20 years, we're all going
49:43 to be learning about different devices.
49:44 And it's going to be too much work to, in our head, always have to be keeping track of which
49:49 devices it's going to, at some point, there's going to be a programming model that comes out
49:54 that just automatically handles when it can go to the fast device, and when we can just
49:58 send it to the CPU.
49:59 Yeah, absolutely.
50:00 So just while you were talking, I pulled it up on that Alienware gaming machine I got.
50:05 It has a GeForce RTX 2070, which has 2,304 cores.
50:12 So that's a lot.
50:14 That's a lot of cores.
50:16 And if you look somewhere, Google claims that it achieves 7.5 teraflops, and the Super increases
50:23 that to 9 teraflops, which is just insane.
50:27 It's like a Core i7 doing like 0.35, 0.28 or something like that.
50:33 So anyway, the numbers, they just are like, they boggle the mind when you think of how
50:37 much computation graphics cards do these days.
50:40 I think top of the line, I might get this wrong, but like the modern GPUs are capable of 15 teraflops.
50:46 It's an immense amount of compute that's hard to fathom, especially when coming from a CPU sort of way of thinking.
50:51 Yeah, absolutely.
50:52 Yeah, the only reason I didn't get a higher graphics card is every other version required water cooling.
50:57 I'm like, that sounds like more effort than I want for a computer.
50:59 I'll just go with this one.
51:01 Yeah.
51:03 All right.
51:04 Well, Rapids sounds like a super cool project.
51:06 And maybe we should do another show with the Rapids team across these things to talk a little bit more deeply.
51:12 But it sounds like a great project.
51:14 Glad you're working on it.
51:14 I work on the C++ lower engine of it, but I'd be happy to connect you with some of the Python folks that work on that side of things.
51:22 And I'm sure they'd love to come on.
51:23 Yeah, that'd be fun.
51:24 All right.
51:25 Now, before we get out of here, I've got to ask you the two questions.
51:27 If you're going to write some Python code, what editor do you use?
51:30 So I am a VS Code convert.
51:33 That's what I typically use day to day.
51:35 Nice.
51:35 Yeah, that's quite a popular one these days.
51:36 And then a notable PyPI package,
51:39 one you ran across, like, oh, people should know about this.
51:41 Yeah.
51:41 So I like to recommend there's a built in standard library, which I'm pretty sure most Python developers are familiar with.
51:49 itertools, which has a ton of great functions.
51:52 But less well known is a PyPI package called more-itertools.
51:57 And I'm not sure if this one's been recommended on the show before.
52:00 But if you like what's in itertools, you'll love what's in more-itertools.
52:05 It has a ton of my favorite algorithms, chunked being one of them.
52:09 You basically pass it a list and a number, and it gives you a list of lists consisting of that many things.
52:15 It's like paging for lists.
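To make the paging idea concrete, here is a standard-library stand-in for `chunked`, written so it runs without installing anything. It is a sketch of the behavior described above, not the package's own implementation; with more-itertools installed you would import `chunked` from `more_itertools` instead.

```python
from itertools import islice


def chunked(iterable, n):
    """Yield successive lists of up to n items (stand-in sketch)."""
    iterator = iter(iterable)
    while chunk := list(islice(iterator, n)):
        yield chunk


pages = list(chunked([1, 2, 3, 4, 5, 6, 7], 3))
print(pages)  # [[1, 2, 3], [4, 5, 6], [7]]
```

The last chunk is simply shorter when the input does not divide evenly.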
52:17 Yeah, yeah.
52:17 And there's tons of neat functions.
52:19 Another great one that's so simple, but doesn't exist built in, is all_equal.
52:25 It just checks, given a list, are all of the elements the same?
52:28 And it's a simple thing to do.
52:30 You can do it with all, but you have to check is every element equal to the first one or the last one.
52:35 So there's just a ton of really convenient functions and algorithms in more-itertools.
52:40 That's the one I recommend.
52:40 Yeah, that's cool.
52:41 And you can combine these with like generator expressions and stuff.
52:44 Like, you know, pull some element out of each object that's in there and generate that collection.
52:50 Ask if all those are equal.
52:52 And all these ideas go together well there.
52:54 Yeah, they compose super nicely.
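Here is a small sketch of that composition: a stand-in for `all_equal` built on the built-in `all`, fed by a generator expression that pulls one field out of each object. The stand-in and the sample `rows` data are assumptions for illustration; the real function lives in `more_itertools`.

```python
def all_equal(iterable):
    """Return True if every item equals the first one (stand-in sketch)."""
    iterator = iter(iterable)
    try:
        first = next(iterator)
    except StopIteration:
        return True  # an empty input has no differing elements
    return all(item == first for item in iterator)


# Hypothetical records, used only to show the generator-expression pattern.
rows = [
    {"name": "a", "dtype": "int64"},
    {"name": "b", "dtype": "int64"},
    {"name": "c", "dtype": "float64"},
]

same_dtype = all_equal(row["dtype"] for row in rows)
same_width = all_equal(len(row) for row in rows)
print(same_dtype)  # False
print(same_width)  # True
```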
52:55 Yeah, for sure.
52:56 All right.
52:57 Final call to action.
52:58 People are interested in doing refactoring, making their code better, maybe even checking out Rapids.
53:03 What do you say?
53:04 I'd say if you're interested in what you heard on the podcast, check out the PyCon talk.
53:08 It's on YouTube.
53:09 If you search for PyCon 2020, you'll find the YouTube channel.
53:14 And yeah, if you're definitely interested in rapids.ai, check us out there.
53:18 I assume all this stuff will be in the show notes as well.
53:20 So maybe that's easier than YouTube searching.
53:23 Yeah, well, and then also you talked about YouTube channel a little bit.
53:26 Maybe just tell people how to find that.
53:27 We'll put a link in the show notes as well.
53:29 So they can, if they want to watch you, you know, talk about some of these solutions and these competitions.
53:34 Yeah.
53:34 So my online alias is code_report.
53:37 If you search for that on Twitter or YouTube or Google, I'm sure all the links will come up.
53:42 And yeah, you can find me that way.
53:43 Awesome.
53:43 All right.
53:44 Yeah, we'll link to that as well.
53:45 All right.
53:45 Well, Conor, thank you so much for being on the show.
53:47 It was a lot of fun to talk about these things with you.
53:48 Thanks for having me on.
53:49 This was awesome.
53:50 Yeah, you bet.
53:50 Bye-bye.
53:51 This has been another episode of Talk Python To Me.
53:54 Our guest on this episode was Conor Hoekstra, and it's been brought to you by us over at
53:58 Talk Python Training.
53:59 Want to level up your Python?
54:01 If you're just getting started, try my Python Jumpstart by Building 10 Apps course.
54:06 Or if you're looking for something more advanced, check out our new async course that digs into
54:11 all the different types of async programming you can do in Python.
54:14 And of course, if you're interested in more than one of these, be sure to check out our
54:18 everything bundle.
54:19 It's like a subscription that never expires.
54:21 Be sure to subscribe to the show.
54:23 Open your favorite podcatcher and search for Python.
54:25 We should be right at the top.
54:27 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the
54:32 direct RSS feed at /rss on talkpython.fm.
54:36 This is your host, Michael Kennedy.
54:38 Thanks so much for listening.
54:39 I really appreciate it.
54:40 Now get out there and write some Python code.


