Monitor performance issues & errors in your code

#288: 10 tips to move from Excel to Python Transcript

Recorded on Tuesday, Sep 15, 2020.

00:00 Excel is one of the most used and most empowering pieces of software out there. But that doesn't make it a good fit for every data processing need. And when you outgrow Excel a really good option for that next step is Python and the data science tech stack pandas Jupiter and friends. Chris Moffat is back on talk Python to give us concrete tips and tricks for moving from Excel to Python and pandas. This is talk Python to me, Episode 288, recorded September 15 2020.

00:43 Welcome to talk Python to me, a weekly podcast on Python, the language, the libraries, the ecosystem and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm at m Kennedy. Keep up with the show and listen to past episodes at talk python.fm and follow the show on Twitter via at talk Python. This episode is sponsored by the Voyager video game which is built on Python and linode. Check out what they're both offering during their segments. It really helps support the show. Talk Python to me, it's partially supported by our training courses. I got a joke for you. What's the world's most popular ID Excel. Funny, right except many companies really do run on Excel to the point where they would be much better off using clean and simple programming tools. For many pythons data science stack would be vastly better. But moving from Excel to Python as a challenge. Most data science courses don't focus specifically on the Excel use cases. That's why we've teamed up with Chris Moffitt from practical business Python to create a course tailor made for helping people learn just enough pandas and Jupiter to replace the problematic Excel usage with clean and scalable Python code. If you or your coworkers are ready to move beyond Excel, visit, talk python.fm slash Excel or just click the link in the show notes to learn more about this online course at Talk Python Training. Chris, welcome back to talk Python to me.

02:03 Thanks, Michael. Really excited to be back.

02:05 Yeah, it's great to have you back. You know, back on episode 200. This is January of 2019. So coming up on two years ago, we did an episode together, called escaping Excel hell with Python and pandas. And it was I got to tell you out of upwards of 290 episodes now, well, at the time of the recording, and this is the sixth most popular episode ever. There must be some sort of problem with Excel that people are looking forward to dodge get away from to escape.

02:36 Yeah, I think so I wish I could say it was all me and people want to hear me. But no, I think you're right. There's definitely some challenges with Excel that people are trying to get around. Yeah. And there are and we're going to talk about them. I think there's some interesting parallels, you do this in Excel. Here's how you do that in Python. But there's also some interesting possibilities like, Oh, I could turn this into an API, which is not too many people are probably using Excel as an API. But I will bet you there's a Windows machine in the cloud, running Excel that somebody has an API focused on. Oh, I'm sure you're right. And I agree. I mean, I think that's one of the real powerful aspects of Python is your ceiling is a lot higher, you can do a lot more once you get into Python. Yeah. If you only want to use it for a little bit, that's great. But if you really want to take it to the next level, those options are out there for you.

03:23 Yeah, absolutely. So let's just do a quick catch up. It's been a couple years since you've been on the show, what do you been up to since then,

03:29 continuing to work on the blog continuing to write articles,

03:33 I just saw that PB Python, practical business, Python got ranked as the 12th most bestest Python blog on the internet. That's awesome.

03:41 I saw that. Yeah, I don't know what their methodology was. But I will say they ranked I think on four or five different categories. And the one that they gave me a little bit of a lower rank on was the frequency of writing new content. And I think that's spot on. I wish I could write more. Yeah, but it does take time to get the content out there. And I've really enjoyed continuing to do that, mainly because it helps me learn and engage with the audience and learn what people want to hear about with the Python ecosystem. Yeah, so I've been continuing to do that. And awesome, I actually have a new job. So that's always been exciting in a pandemic, to start a new job and work from home full time. Like I know you're very used to I am very used to it. I got lucky and got a mobile remote friendly job about 12 years ago, and I've just never looked back. You know, a lot of the stuff that people are discovering in this pandemic is kind of been things I've embraced in life and really appreciate I don't appreciate being locked away from my friends, not be able to go to restaurants, but I do appreciate not having a commute. Not having all the expense and all those kinds of things. It's really nice. I definitely enjoy getting up in the morning and my commute is a couple steps from the bedroom and not 30 to 45 minutes. definitely enjoy that. That's right. My office is

04:57 in my garage, which is a separated garage in my house. I gotta go go across the sidewalk periodically, I'll scare a squirrel when I go out in the morning. So there's a chance of excitement. But it's it's pretty low excitement.

05:06 Yes, yes, definitely have to find some benefits and all of this. That's right.

05:10 So you said you have a new job? What kind of work are you doing there?

05:13 I'm working. I'm continuing to work in the medical device industry. And right now I'm doing pricing strategy and analytics. So working with the company to plan our new product launches and what our pricing strategy is, and how do we execute those effectively. So it's been really enjoyable so far.

05:30 Yeah, it sounds like you're able to take these ideas and just 100% make that the core of what you're doing.

05:36 Yeah, absolutely. I mean, a lot of you know, I keep quoting your term that Python is a superpower. And that's definitely the case for me is the knowledge that I have to take Python and apply it to some of these real world business problems that I encounter on a day to day basis, clean some data, manipulate some data, build some visualizations to tell a story. It's really powerful. And I really enjoy getting a chance to do that.

05:59 Yeah, I do think programming is a superpower. And Python is a special kind of superpower for sure, exactly. Now, it's been, like I said, about two years. Since you were on the show, we did talk about some of the benefits of choosing basically pythons data science stack over Excel. But let's, for those who don't have an insanely good memory, or haven't heard it before, let's just do a quick review of like some of the reasons you might want to use Python instead of Excel.

06:27 Sure. So the first thing I think of is the repeatability aspect. So in Excel, once you create something, you'll typically end up with a bunch of tabs and a bunch of data all over the place and a bunch of formulas that are nested, maybe some VBA. And it's really hard to repeat that process. There's no way to start from the beginning, and do it again. So if you've done some analysis, say in January, and you want to repeat it in October, you almost have to start from scratch, you can see that file, but there's no just button to push to run it all again, with Python. If you do it right and build a script, then you do have that repeatability aspect.

07:06 One of the things that drives me crazy about Excel from a programmer mindset is you hear of the different control flow structures and whatnot in programming languages, like Python didn't have a switch statement. But maybe Python is getting a switch statement. It has obviously ifs and loops. One of the things very few people boast about their language having or saying the good thing is a go to statement dryco. Our go twos are really improved now like No, No, they're not.

07:33 No, no. And like you said, Excel is just kind of like a whole bunch of go to statements. And that really is the way you need to troubleshoot Excel. And I think that's the other piece is that repeatability kind of goes hand in hand with traceability me, you can certainly write challenging Python code, that's hard to understand. But in general, Python code flows from the top to the bottom, and you can see what happens. Whereas Excel, it is really hard to figure out what's going on. And even if you look at the formulas, there can be errors in the formulas or there can be gaps in the formula. So there's just a lot of subtle bugs that are difficult to trace with Excel that I think Python makes a lot easier.

08:13 I think one of the challenges is when you look at an Excel spreadsheet, or it could be Apple numbers, or it could be Google Sheets, like same basic thing. The data is hiding the logic, right, you see a number in a cell, and you don't know whether that number was typed there was imported from something or if it was computed in a really complicated way. And so you've got these go twos on top of like

08:38 labels that hide the grudges, and you can hide the data and just see the formulas, but then you don't see the data. So like you said, even when he tried to get around it, it is not very easy at all.

08:49 Yeah. What about lots of like, large files or lots of data, something like that. Excel excels at it, right?

08:57 It's funny, I mean, we're certainly not talking about big data by any means. But when Excel gets to a little over a million rows, it just can't read the data. And nowadays, it's very easy to get data that's well outside of what Excel can handle. And I even think if even if you have a data set that's relatively large and fits in Excel, the performance penalty is pretty large. And it's difficult to manipulate that data and analyze that data efficiently in Excel, because it's just kind of a resource hog, right? You've got a lot of things that Excel supports, which are really nice for visualization and the way it looks on screen. But in Python, for the most part, once that data is read in, if it fits in memory, it's going to be pretty performant. And you can manipulate gigabytes of data on kind of your standard laptop pretty easily with Python without having to move to some of the more advanced data science tools. Yeah, I'm always it blows my mind how fast programming languages are, even when you've got slow Python, right. I think performance of Python is super interesting because you can talk about it. This code runs more slowly than say compiled C++, maybe it's 20 times slower if it's doing numerical calculations. But that's still really fast. But there's also of course, I could put that into like a NumPy array. Now it's back to the sea level speed. And there's all these variations. But I remember on this course, I just did the memory management course, there's a section where I did performance around different types of structuring your classes in different ways. And so one of the metrics I had was, how many attributes of the classes can you access per second? I think it was, like 18 million reads per second, you're like, that's a lot of processing. Right? Right, right. Absolutely. And one of the things that I found is even just reading in an Excel file, it just takes time. Even in Python, it takes a lot of time, because that Excel file is pretty bloated, it carries a lot of information outside just the data. But once it's in Python, then it almost doesn't matter. It's really quick and really, really big benefit.

10:57 Yeah. The other thing to keep in mind, as we're talking about this, and to position it is, much of what you're proposing, or you're going to propose shortly, is to use pythons data science stack. So when we talk about the performance of data analysis in Python, what you really mean is mostly the performance of data analysis in C, right? Because when you talk about NumPy, and pandas, a lot of that stuff is orchestration layers and Python over top native code.

11:25 Sure. So people a lot smarter than me have figured out how to build these capabilities in C, you're right. And it also runs on Windows, it runs on Mac, it runs on all the different operating systems. And you're right, it runs very quickly, much more so than people would expect, especially people who think that Python has this reputation as being a slow language or maybe not a performance language. Yeah.

11:49 Yeah. Meanwhile, Instagram is running on Python. YouTube is doing a million requests a second, like there's variations on what define slow, right? I think. And another interesting thing that is not mentioned here, but it's probably worth mentioning, is price.

12:04 Yes, absolutely. So there are commercial variations of Python. And there are companies that support Python, but Microsoft is making it available through the Windows Store for free. All of the tools that we've kind of mentioned, like pandas, or using conda, for managing the environment, are free and open source. So you can at least get started with it without having to buy some enterprise licensing. And when you get in big companies, enterprise licensing is really expensive. And I think sometimes, developers that maybe haven't been in that environment don't realize how much companies pay for some of these licenses for commercial software. It's just really, really crazy to think about.

12:45 Yeah, and I think most enterprises probably have maxed out Excel for everybody. Right? They all have word, they'll have Excel, and, you know, they all have outlook, right? But you want to do some computation, say in the cloud, right? I want to take some of this work and analyze it over there, turn it into an API endpoint, or whatever, all of a sudden, maybe we don't have licenses to like fire up Windows machines. And put Excel like that doesn't make a lot of sense, right. Whereas you just take these same things, as people know, in Python, and you just pip install them over there, instead of over here are combat, you take your pick,

13:20 yes. And I think that speaks to this concept that there's just this higher ceiling with Python that you can get started. And then like you said, start to move to the cloud start to move to some of these other services that are out there. And you can take the same thing that you're doing on your local machine to analyze a few thousand rows of data and scale it in the web to millions and millions of rows of data,

13:42 right? Docker, Kubernetes, all these things become an option.

13:45 Exactly. Yes,

13:47 I think a lot of what we're saying sounds like use the Python data science stack instead of Excel. But I know for sure, there are many people who must deliver an Excel product, or maybe receive input starting from an Excel worksheet, or workbook, and they've got to, maybe they can make this step to the side where I say the rebels live, you know, but they still got to either receive or deliver Excel, present stuff for other people in Excel, and so on. There's some cool integration on that level as well, right?

14:19 Yes. And I think you're touching on a really key point, I think people that come from a programming background and see Excel, just see the challenges. But businesses run on Excel. And there is a lot of power in Excel. And you can't just take Excel out of the organization, you shouldn't march into your CFOs office and say, we're deleting Excel and everybody's gonna learn to use Python that's just not practical in any world. And I don't think that will really be in the future. So thinking about the problem and thinking about where Python really solves the problems and where Excel can still be leveraged excels a great modeling tool, Python, using pandas and some of the other tools will About allows you to export Excel files and even build interactive Excel files that you can share with others. So that those that just want the final answer or that final report, or the final model can still do it in Excel. And Excel is a perfectly appropriate tool for a lot of those scenarios.

15:19 Yeah, and there's some nice libraries that you can read Excel files, obviously, pandas just imports them directly. But you can also write to them like you can format the cells, you can change their colors, you can add formulas back into the Excel file you deliver, right, so there's a way to sort of do all your work in pandas, but then generate something that's a live Excel Yes, artifact for people who need to pick it up and run with it.

15:44 I've done that quite a bit. And I think it's like a lot of data analysis, you start with a rough cut. And so maybe you dump something to excel. And it doesn't look very pretty, but it gives you what you need. And then once you get to the point where it's almost like a production ready report that you want to give to people every week or month or whenever, then you can start to apply that formatting. And you can put all the styles that you can really do in Excel by hand, you can put those in, you can put pivot tables, you can put visualizations, there's a whole bunch of things you can do there. And then if you really want to take it to the next level, you can start to do some of the more sophisticated visualizations like with streamlet or some of the other tools that are out there that really allow a high degree of interactivity on front or on top of that Python face capability.

16:34 This portion of talk Python, to me is sponsored by the PC game Voyager available on Steam and published by totally consulting Voyager is an edutainment game in which you travel around the world, you visit famous cities and discover well known landmarks and attractions, enjoy stylized graphics, beautiful photos, and the local music of each city. You even apply for a variety of jobs to pay for your next plane ticket. Plate yourself or suggest it for your children so they can learn about the world instead of playing battle royales all the time. In for an added bonus. Voyager is written entirely in Python, visit talk python.fm slash Voyager or just click the link in your podcast player, you'll get a key for the game and instructions on how to redeem it on Steam. Or can of course also just search for Voyager on Steam. That's vo y ag er directly on Steam with some live stream that you could publish it, people could consume it live. Yes. Maybe there's even a download as Excel button.

17:33 Yeah, absolutely. I mean, I think there is a challenge. And I think that'll be something that happens in the future is how do you build up that infrastructure at a company when you're the startup, maybe it's easy to spin up a service and get stream like going but as a company grows, it's not that easy for most people to just get access to a Linux server somewhere and run streamlet and host it for the organization, even though it may be technically relatively simple. There's a lot of hoops to go through. So I don't think we've quite made that as easy as we'd like. But I think it's going to continue to improve in the future.

18:04 Yeah, I agree. If you're just barely moving outside of Excel, and you're just starting to pick up Python, like set up your own cloud VM with Linux, and SSH probably is a bridge too far.

18:15 Yeah, I mean, it's certainly gotten a lot easier than it was many, many years ago. I mean, there's so many services that just almost make it push button. But I think there's also this aspect of security as well, there's, I think there's a lot of flexibility and being able to bring Python installed on your system. And there may be some challenges with your IT organization to get that supported. But the one thing that I do think is really challenging is when you deal with sensitive data, you want to be careful about putting it out in the cloud. And those are the types of things you really get in trouble. If you put all your confidential pricing data on a server somewhere and the company is not supportive of that. That's kind of what they call a career limiting move.

18:54 That's right. You go into the big board meeting, like, Look, I've done all this amazing work that's on the internet, you're fired. Exactly in a minute.

19:02 What just happened? I thought I was doing good. I definitely don't want promote that. And I think most people understand that. And there are certainly varying degrees, some companies, it's probably wouldn't be a challenge. But then you go to the banking industry or the government or other places, and that is or the

19:17 medical industry

19:18 like what you're in. Yeah. Or the medical industry. Yes. A huge, huge challenge. Yeah,

19:22 absolutely. So let's dive into the 10 tips. Before we do. Some of these tips are coming out of a course that you wrote a talk Python, and you want to just tell people really quickly about the course. Sure,

19:34 sure. So we just went live with this course. And what I'm really excited about is we start from the beginning. And it is kind of a Windows focus. Because I think a lot of people that are taking this course are in a Windows environment and all the code that we walk through can run on Windows or Mac or Linux, but I start from the beginning. How do you get your system set up and running in Windows, and then one of the powerful aspects of pain This is that it is a complex library, but you really only have to learn a few concepts to get started. So we start from the beginning, assuming you know a little bit of Python, and then progress through these pandas concepts with the idea that we want to get you up and running as quickly as possible so that you can apply the code to your own business problems and start to solve some problems. So I use the model of talking about how you do something in Excel, and translate it to Python, and talk about the pandas library commands and functions that you would use to manipulate the data. And then at the end, I have a kind of a real world scenario where we talk about ingesting an Excel file, or multiple Excel files, combining them, analyzing them, cleaning them, and then generating an Excel report. And I think this course really is a nice summary of how you can quickly get up to speed with Python and pandas get that stack running, and build a stable foundation for you to progress and just take on more and more complex tasks as you grow your skills.

21:02 Yeah, it's super cool. I really enjoyed going through the course as just somebody consuming it, even though I helped put the pieces together. And there's some of my stuff, I do some accounting in Google Sheets, not actually Excel, but effectively the same experience. And after that, I'm like, Yeah, I really need to take some notes and rewrite some of the important parts of what I'm doing in Excel. So yeah, people should definitely check that out. I'll put a link in the show notes. Alright, without further ado, let's talk about 10, quick, easy things that we can do in Python and pandas to move sort of solve some of our problems or solve a series of problems in one analysis in Excel?

21:38 Sure. So I think the first thing is, and it may sound obvious, but being very specific about defining the problem you want to solve. And so the example I'd give is, if you came to me and said, Hey, Chris, I'd like to build a model to predict customer churn at the company. And I'd say, Well, what kind of model Do you have today? And they say, well, we don't have a model. Well, that may be a little bit much to start out with, I think something more appropriate might be, here's a report where I open up Salesforce every day, someone copies and pastes some data into an Excel spreadsheet, someone manually cleans it up, and then emails us out. And we do it every day. And it takes us 30 minutes a day, but we have, you know, an intern do it for us.

22:25 That's the crazy stuff. Is there somebody usually at these companies that do that? Yes, either because they were told every day or every Monday, but getting the month, whatever, before the presentation, like they sit down and they do these things. And it just, it blows my mind. Because you know, you maybe spend two hours, but it's not just the two hours. Like what if there's a mistake? You know, it could be tremendous. One of the things you pointed out in the course, was there's all these there's like this history of crazy Excel errors, right?

22:59 Yeah. And I think I even mentioned one where one of the genome types that they use in scientific notation, you know, in scientific publications, they actually changed it, because it was being confused with a date type in Excel. So essentially, it was easier to change science than it was to change Excel to properly processes data.

23:19 That's right, I read that article and the global organization that oversees the naming of genes, yes, the sequence of DNA pairs that give us all all the different features we have. Some of them were named in a ways that were being detected like one was March one, all one word that would become March 1, if you type that in. So they literally renamed I think, 27 different genes. So they would be Excel friendly. They wouldn't get renamed, and or re interpreted. And then also, they set up guidelines to say you can't name stuff that might be understood in a non obvious way by Excel.

23:55 It's crazy. It's crazy. I mean, is someone from a computer background, you would think well, why would it matter what it's named, but we've all been bitten by those Excel, trying to help you with things that just make it all more worse?

24:07 Yeah, exactly. So the my disbelief or something along those lines is that this person sits down and does this every so often. And you only asked that question weekly, probably, because it's so painful to create the answer. And it's so error prone to create the answer. What if you could instantaneously instance, just like on demand, say, give me the current state of the report for the last seven days now, or whatever, right. And that was a five second calculation, rather than a person that could make errors.

24:39 And when it's a very manual process and takes time, there may be things that you could do to tweak it to make it more useful. Maybe you want to report daily or maybe you want to add some extra fields, but people just don't want to take that on because it is painful. But once you start to automate this process, then adding a new field is easy or increasing your disruption list or reformatting it now I want a PDF version, right the granularity, right? Like, well, we can't do it by zip code, we got to do it by state, because it's already taken half an hour, it would take a week, right? We did it by zip code. So those kinds of things. So step one, or tip one is define the problem clearly, yes. And it's not just defining clearly, but it's making sure that you understand that problem. So it's almost more taking an existing problem that is really well understood. So you can just focus on learning Python and not learning the problem. And as you grow your skills, then you can start to tackle more and more complex problems with that base kind of Python knowledge.

25:36 Yeah, one of the problems with all computer models computer software is it'll almost always give you an answer. Whether or not that answer makes any sense.

25:43 Exactly. Yes, yeah. 100%.

25:47 All right, tip number two.

25:49 So the second one is get your Python environment set up. And it's gotten a lot better over the years, especially in the windows world. As I mentioned, it is now officially supported in the Windows Store, and you can install it there. But I recommend using some sort of environment, I prefer using mini conda. To install it on Windows, install Python and the associated libraries on Windows and manage the environments. I think that's kind of the key thing to take some time and get it set up. Because one of the crazy things just as I've taught new people how to use it, the concept of Where are files on my file system is actually something you need to spend some time on when you're on a Windows environment. And you've got OneDrive and you've got teams, you've got SharePoint, you've got box, you've got network drive, and local files, like most people aren't very familiar with where they are. And windows makes it hard sometimes to even know exactly where on the disk something is. And that's a pretty important skill to have as you start getting your Python environment set up. And so it's really important to kind of get that base foundation in place so that you can start doing the actual Python development.

27:05 Yeah. And there's a little bit of work to manage the concept of an environment, right? Like, that's not a trivial thing. And the, the unfortunate aspect is that hits you, like step one or two, you know, the very next thing you basically start with is okay, you need an environment then. But what's I think trickier is paths, like oh, when I type Python, it means this but actually pip if I type that it means something else from a different environment like I pip install the thing, but then I run Python, and it's not in there. It can't import it, it was like because for some reason, the pip was from the Python two, seven path. And the Python when you typed it, it runs the Python three, one, which is not even the same thing. There's no you don't even want to get into that conversation. So I think you've just started clearly with like conda environments or something like that makes it a much easier thing. Once you get over that small step, then it's pretty straightforward.

27:57 Yes, yeah, absolutely. And I think if you confront it head on, and have that discussion, and encourage people that if you follow this process, you're not going to break anything. So if you're in virtual environment, the worst cases for virtual environments messed up and you can kind of Yeah, restart that.

28:13 Throw it away. Start over. Exactly. Yeah. And especially on a work computer, I suspect, right? Like I periodically will get frustrated and reformat my home computer. But if it's like a corporate one, and it's got all sorts of VPN stuff that they're going to ask you, why are you messing with it? Just, you just don't even want to mess that thing up? Yes. Maybe you're not permissions to either.

28:33 Yeah. And that is one of the things that has gotten better with Python as you can install it without administrative access, which in the past wasn't always the case.

28:42 Yeah. Another thing that you advocate, and I guess this leads us into tip three is really be super organized about your data, because you could have source data, but then maybe you open up those Excel files and edit them. And then they're no longer source data, or they're all mixed in or they have random name. So what's your tip on kind of setting the stage with data,

29:02 one of the things that I encourage people do is get the data in as raw format as possible. So what that might mean is if it's a sales summary data, can you actually get transaction level data that's maybe directly from a data warehouse or your RP system or somewhere so that it is has all the information you need, and is in a format that you don't touch? So you maybe it's Excel, maybe it's CSV, but you store it somewhere and do not touch it at all, because I think the 10 with what people tend to do with Excel is because you can open up the data, you go in and make changes. So people will go and change a file name or change a cell or clean up some data by hand, which is fine, but it's not reproducible, right. So I encourage people to set up a structure where the data is stored and you never touch it. And then if you do need to manipulate it, you have an intermediate directory structure where you store that manipulative file. And then another directory structure where you store the output file and start getting in that discipline process of managing those input and output files.

30:12 Yeah, and you take this as far as having a cookie cutter template that people can run cookie cutter name of the template, and it'll generate, like the raw data file, the intermediate processing, and the reports, the code, all that kind of stuff that always do it the same, right?

30:26 Yes, it just makes it easier. And when someone's starting, they don't have all the bad habits, right. So this is just, you know, this is how you get started, you create these directories, and you move between the directories in this way, and people are trying to learn so much. It's just accepted that that's the best way to do it.

30:43 Yeah. And if you hand it off to someone else, right, some people in the accounting department worked on it, then they hand it off to, I don't know, the marketing people, they all know that this is the way we do it, then they understand where to go.

30:56 Exactly. And I also set it up. And I encourage people to do this so that you're less likely to make a mistake. And if you do make a mistake, you're not gonna really break the whole pipeline. It's kind of segmented out and firewalled, almost, if you will, yeah,

31:10 absolutely. All right, number four, gets us into some of the pandas commands.

31:15 Yes. So I love pandas. pandas is a an extremely powerful toolbox. And I think that is a big challenge for new users. Because if you start to approach pandas, it's almost like, Well, where do I start? How do I get a handle on all the options that are out there. And so I think that the first thing to do is to understand how to select your data by rows and columns. And pandas has the loke command dot LLC, as well as a dot i LLC. And all the basic pandas tutorials kind of cover this. But when you're working with Excel, this is actually an extremely powerful command. And it can be almost deceptive how much you can get done with it, and how much you need to learn it to use it in the day to day analysis. So I'd like to spend time just starting there with the basics of selecting data with by rows and columns. If you have a DataFrame

32:16 got some columns and some rows, I can say df dot loke, LLC bracket, and then you can give it column names, almost like slicing, right? So I could say, Yes, like state colon country, and that might give me state zip code. City country.

32:32 Yes. Yep. So it's kind of building up that mental model. People have Excel, they have A through Z, Z columns, and they have the number of rows. And we're kind of used to manipulating data that way. This starts to get people to think about it in the pandas way. And I think it's different, but there's enough similarities that it kind of fits in that mental model.

32:55 Yeah, for sure. One of the differences is now all of a sudden, it's much more important what you name your column names.

33:01 Yes, you're absolutely right. It matters what your name your columns, but the order doesn't matter now. So that's one of the challenges with Excel is let's say you're doing a V lookup, and then you had an extra column in there, suddenly your V lookups are broken, or something else breaks down the line. So this is a little bit more like kind of a sequel table approach where you care about the names more so than the order. And I think that it's easier to troubleshoot naming changes than it is to troubleshoot order changes.

33:34 This portion of talk Python to me is brought to you by linode. Whether you're working on a personal project or managing your enterprises infrastructure, linode has the pricing support and scale that you need to take your project to the next level, with 11 data centers worldwide, including their newest data center in Sydney, Australia, enterprise grade hardware, s3 compatible storage, and the next generation network linode delivers the performance that you expect at a price that you don't get started on the node today with a $20 credit and you get access to native SSD storage, a 40 gigabit network industry leading processors, their revamped Cloud Manager cloud not linode.com root access to your server along with their newest API and a Python COI just visit talk python.fm slash linode when creating a new linode account, and you'll automatically get $20 credit for your next project. Oh, and one last thing they're hiring go to lynda.com slash careers to find out more, let them know that we sent you. The order changes are a nightmare. Yes. So next tip is to use what are called accessors. Yes, which are really cool. I had no idea about accessors. Before I check that your course those people how much data frames I will actually work with.

34:47 Well, this is another kind of good analogy to excel. So we talk about selecting rows and columns of data. But now what do you actually do to that data and the string accessor gives you A lot of powerful capability to clean data to manipulate string data. So if you have text, you can uppercase it, you can lowercase it, you can strip out characters, you can use regular expressions, you can pretty much do anything that you can do in Python string you can do in pandas. And the creators of pandas have done all this work to make it really fast. So you don't have to write loops. You use these accessors for strings to get at that data, and potentially filter your data or clean it. And then the date time. accessor is extremely powerful. I mean, I think it's one of those things, it seems on the face that it just gives you access to the data, but you can do so much with it. And it is so much easier to work with dates and times and pandas and as Excel. And so I think that's a really fundamental concept to grasp. So you can start to build more manipulation analysis capabilities in Python and pandas.

36:00 Yeah, and that sort of hands a little bit on the underlying types.

36:03 Yes,

36:04 right. Like, Excel has some kind of idea of types a little bit like you do have the date columns, and it knows about numbers, put numbers on the right, and strings in the left, for example, is a pretty good giveaway. But it's more important here, because you're doing these operations, like the.dt probably doesn't make a lot of sense on a string, and vice versa.

36:23 Right? Yes. And that is one of the things where I think as people are new to pandas, there's a lot of times for pandas will make it easy for you. But there are certainly times where you have to go into it a little bit more enforce the types and make sure that you're having good discussion or a good decision about what type you want. And I think it also exposes maybe errors in your data. So you think it's a date time column, but then you go into it and realize you have some dummy values in there. And you have to figure out at least what to do with it. And Excel may not care. But pandas is going to force you to really understand that data and make some decisions about it. So you can process it appropriately.

37:02 Yeah, that makes sense. Another thing that you touched on, or implied there is that even in regular python programming, you do a lot of loops, and things like that, like I want to compare these two sets. So I might do a loop and then zip them together, and then check whether the thing I got back, you know, they compare in some way. But just in pandas in general, you have to have a much more of a set based mindset, right?

37:27 Yes. In general, like, I don't even think in the course I even talk about loops. And for the most part, you should

37:34 Yeah, I don't think you do either at all,

37:36 not have to do a loop in pandas. Now, you may have to loop through files or something to get the data into pandas. But for the most part, looping is not something you want to do in pandas, because they have, like you said, these vectorized formula. So they're specialized, and they kind of apply everything in parallel versus sequentially in a loop. And it's a different way of thinking. And it takes a little bit of time to kind of get that all there. But I think it is really powerful because you think about the data as a whole versus a whole bunch of individual cells in Excel file.

38:10 Right? And you could probably still just loop over everything, index back into them and update it, but it would not just be more verbose, maybe more error prone, it would also be slower.

38:20 Yes. much slower, much slower.

38:22 Yeah. Yeah. Speaking of working with sets and stuff, the next tip, Boolean indexing.

38:28 So Boolean indexing is really combining those accessors that we talked about with location to then filter your data. And I think the best example I always use is like the auto filter in Excel, that's probably one of the most common things that I use. And I suspect most people use when you get a new data set, you open up your data, and you click that auto filter, and you select maybe a date range, and you select maybe certain customers you're looking at or sort by revenue, whatever you're doing. And it's just all those kind of dropdowns that you select. Well, essentially, the Boolean indexing or masking allows you to do that in code so that you can select that those data sets and maybe do additional summary analysis. Or you can actually update the data set. So this is a powerful tool, not just for analysis, but also for cleaning your data,

39:18 right. So you could say things like, if the total sales for this particular customer is greater than 100,000, I want to pull them out to a separate set and work with them special or maybe give them a discount, right add to their discount range and things like that.

39:33 Absolutely. It's almost like anything you would do with an if statement in an excel formula. You can do with this combination, but it's much more powerful and easier to manage because it's in Python, it's not this kind of crazy nested Excel statement. And so you can build it up and as complicated as you want, but at least it's all just clean Python and not just kind of gnarly, nested Excel formulas. Yeah, whoever came up with the if statements and those kinds of things in Excel. What were they thinking? I don't know. I totally it definitely doesn't scale very well to multiple if statements. It's not

40:11 the simplest sort of thing, right?

40:13 No, no, it's not at all. No. All

40:15 right, you talked about Next up, you talked about having the data in its most raw format. And I concur. I think that's a great idea. But that often means it might be too granular, very granular for the types of questions you're asking the granularity might be off, right? Like maybe I want to know, sales by state, but I have sales by city, every cell. And I just want to know the number and the state, for example, right. So I might, maybe you could do a group by or something.

40:43 Yes, it group by is, I think outside of the Logan accessors. I use group by a ton. And group by is the way you can actually aggregate your data across multiple different columns. And then not just sum the data, you can perform many mathematical functions, you can do the average, you can do a standard deviation, you get the min and the max. But you can almost have as many levels as you want. And so it's very trivial to do a group by and like you said, look at it, maybe by state and then you look at it by region, and then maybe look at it by product. Who knows. Yeah, the the sales people or the product? Yeah. So it's just a lot of flexibility, with very minimal code. And I find that that's a really much simpler way to analyze the data and just kind of iterate through that process using group by. And it's another one of those functions. It's almost deceptive in how powerful it is. Yeah,

41:45 absolutely. It's another one of those set based mindsets as well.

41:48 Yes, yes. Because it's not easy, or the everything is kind of done on the column basis. And so it's just doing all the heavy lifting behind the scenes for you. Yeah. So next step, when I hear people who are Excel power users, you know, they're doing serious stuff, because they talk about pivot tables. I don't no idea what pivot tables are these and then how do they manifest over and pandas? Sure. So you're right, the pivot table is kind of the, I think, is probably the number one tool that people use in Excel. And it's a really convenient way to group the data in multiple different levels across rows and columns. So you can adjust how many levels you want to group it. And then you can summarize the numeric values in multiple different ways. And what's really nice about it from an Excel perspective is it's all GUI driven. So you can kind of drag and drop your columns and put them wherever you want, so that you can quickly adjust the way the data is being presented. Just using that GUI, and what pandas has the pivot table command is really kind of grouped by on steroids. So everything that group I can do pivot table can do. And it uses the similar kind of structure that the Excel pivot table uses as well. So what I find is group bys is the first step. And then as I start to group more and more levels, or do something called unstacking, or stacking the data, then pivot table really makes that easier. And I think that's a nice transition for people that have experienced with Excel. They have this pivot table concept, and pandas does the same thing. And if you start to master that you've really got a very powerful tool to do a lot of quick and easy analysis and reporting.

43:35 Yeah, absolutely. To another thing that you talked about is taking multiple sources of data, maybe multiple Excel workbooks that have some column that means the same thing. And sticking those together and having it merge that data together automatically.

43:52 Yes. So I think, in Excel, probably the most common way to do that as the V lookup, so everyone has some v lookup experience, they're more advanced, maybe they use index match, or I think there's the new Excel called x lookup, which is even more powerful. But the basic idea is it's kind of a very simple merge or join, have the data and pandas supports that. But pandas merging data is much more like the sequel approach where you can do a left join or a right join and inner and outer joins and have a lot more sophisticated usage where you can join maybe on multiple columns, without having to concatenate them together, like you wouldn't Excel. So it's really powerful to do that. And then also, I'm sure a lot of people in Excel will do just the cut and paste. So you've got different tabs with data, and you can easily if the data is all the same stacking on top of each other with the pandas concatenate command,

44:50 right? I've got January, February, and March has sales data. I want it for the quarter so we're just gonna jam it on the end. Exactly. One after another, right? Yeah,

44:59 yeah. And I think people do that all the time. And that's certainly a valid approach. It's hard to repeat, it's easy to make mistakes. But if once you master those concatenate and merge commands, then it's very easy and things that you can do in Excel where maybe cutting and pasting no thousand rows is fine. But if you have a million, you just can't really do that in Excel, whereas pandas, it's all the same command while you're doing 10 rows or 10 million rows.

45:24 Yeah, very cool. All right. So far, these nine tips have probably given us a nice result, we've got the data. But one of the cool things I can do in Excel is I can go over to that chart section. They want this kind of chart or that kind of chart.

45:39 Yes, yeah. And so the next tip is picking a plotting library or a data visualization library and starting to learn it. So one of the challenges in the Python ecosystem is there are a lot of options. It's a good thing. Because there are a lot of options. There's a lot of really good options. But it is challenging for new user to figure out where to start. So I just want to do a bar chart, I want to do a line chart, maybe a box plot,

46:06 right? Even just knowing what basic library I need to start researching, should I do plotly? Did I do matplotlib? I've heard there's other things, right? Where do you even start? Right?

46:16 Yeah, it's really hard. And so what I think is important is if you're getting started, and you're tackling this kind of well known problem, and you've built up some core pandas knowledge, pick a library and go with it. And right now, the one that I recommend, it's certainly not the only one, but I've had some good experience with it is plotly. And specifically plotly, has a plotly Express API, which is kind of a streamlined approach to manipulate data. And I think that that's works really well. One of the things I like about it is it integrates Well, with your Jupyter notebooks. It is interactive by default. So once you build one, it does all the JavaScript behind the scenes. So you can hover over your plots and see the individual data points. And that works really well. Nice. And if you need to do more in depth modifications of it, you can do that as well. So I think it strikes a nice balance. There's certainly other options out there. But what I think is best for someone is to pick one, and stick with it a little bit and get a feel for the API. And then if you decide that that's not the right one, maybe look at some alternatives as well. Yeah,

47:26 a lot of these have galleries, right? You can go into the gallery go does it make pictures, kind of like I need yes or no. And then you can decide whether to pay attention to it.

47:35 Yes. And I think the way these libraries plot is so different from Excel, it takes some time for you to kind of figure out how to fit it in your brain, so that you can use them effectively. And so that's why I encourage someone to stick with it. Get it, you know, applied to your problem. And it even goes back to our earlier point about getting the data. So you have to have the data in the right format, in the sufficient level details so that you can do this plotting, but once you get your mind around it, it's much faster, to iterate through your visualizations and really get some meaningful insight with just very few lines of code.

48:12 Yeah, absolutely. The one I've been using for this kind of stuff lately is Altair and nice as well.

48:17 Yeah, I like Altera as well. I mean, it's a great one, okay, is good as well. It's kind of hard. It's like picking your favorite child. You certainly don't want to go on record with that. But if you pick one of those three, I don't think you're going to go wrong.

48:30 Yeah, absolutely. All right. Well, those are some really great tips. Like I said, after I've seen all these things in action, I'm like, I've been doing it wrong, I've got to take some time and go and put some of this in practice what I'm doing, because I've got Excel type things like everyone else, you know,

48:45 yes. And you you're using Google sheets or other things where you want to pull in data, I think that's really where you experience a lot of benefits. And those are some of the other you know, next level benefits of using pandas is you have those ability to bring in data pretty seamlessly and easily through API's or other sources.

49:04 Yeah, it's worth also pointing out that we said, you can read and write Excel files with Python. And that's great. But you can also read and write Google Sheets, right? So if you need to deliver a Google Sheet to somebody or update an existing one, there are API's to connect to a Google Sheet and do similar things to it.

49:23 Right? And it's all pretty straightforward. And especially once you're using pandas, it will either support pandas or if it doesn't support pandas, it's going to be pretty easy for you to figure out how to get that pandas data into that format, because pandas is such a common tool there, but it uses

49:39 Yeah, awesome. All right. Well, let's wrap up this tip section with a really quick thought, you know, you've been going through this journey at a couple of companies, probably to varying degrees of willingness or unwillingness with various participants in the loop. What's your advice to people out there? What's your experience been?

49:58 I think the thing that You have going for you now is the data science kind of revolution. And so many people understanding the data sciences, that thing, artificial intelligence, machine learning, whatever the term is like it is, for better or worse, a little bit of a buzzword, people in your organization are gonna know about it. And I would say leverage that to say this Python tool, which can help us with data science tasks, can also help us with some of these manual processes that we have. And you can point to maybe some of your bigger enterprise applications that actually support Python. So I'll give one example like Tableau is a visualization tool that I think a lot of companies have. And they actually have bindings for Python, if you go to a tableau user group, they will talk about using Python. So it's not as foreign a concept anymore as it used to be. So I think in general, people are going to be more receptive to talking about Python. So I frame it in those terms, as this is kind of a data science industry standard, that you want to start learning and applying to your specific use cases. And then I think the other thing I'd recommend is, it is really hard. I think you talked about this before people look at Python. And Python seems so easy. It is kind of like pseudocode. And people say, oh, okay, I understand it, but getting them to take the next step to actually write code for themselves to solve their own problem. So I think trying to find a way to have a cohort of people, maybe you need two or three people, you're going down this journey together. So it's not just you, it's coworkers, or maybe cross functionally across different departments, where you hold each other accountable to actually doing some of the work. So you're learning about it, you're applying it, you're using this group to reinforce it and keep the momentum versus just getting caught up in all the day to day work that squeezes out all the fun stuff you'd like to do in Python.

51:59 Yeah, for sure. And one other tip I would throw in there is nothing solves the debates about whether this is possible whether this is a good idea as just doing it, right. So your example of we're going to spend 30 minutes to get all this data together. And then we're going to fill it out in Excel. Like if you say, maybe if you're one of the high end programmers of the company, no one is not going to give you much credence. But if you're somebody who's not typically the developer, and you go and say, Look, this used to be super painful and error prone, and monthly, now I wrote this code, and it took me four hours instead of two. But now I just pushed the button and it's instant, always right. Everyone else would be like, wait a

52:38 minute, that was horrible. We can avoid horrible things like that, right? Like play, I think play into that angle as well. It's got a lot of value. And you you have to be able to be comfortable selling those wins, letting people know that yes, this is something we accomplished. Here's something maybe behind the scenes, some data we scrubbed or manipulated. And then nothing breeds success, like success. So once you prove you could do it, yes, exactly. People start to come view and say, Hey, you know, like, I had some Survey Monkey data that I worked with that to clean up. And I didn't spend a lot of time explaining how much work it was to clean it up. I just did it. But the next time a survey came out, you know, they came back to me and say, Hey, can you help with this? And fortunately, I had those scripts. So all that time and energy I spent there, I could just kind of repurpose and keep moving forward.

53:26 Yeah. Cool. Cool. Two really quick things I just want to touch on while I've got you here, you put together a really nice article on setting up a Python developer environment on Windows, but using Linux, I want to tell people real quick about that.

53:40 Sure. So several years ago, Microsoft came out with something called Windows subsystem for Linux wsl. And it's essentially gives you kind of a lightweight kernel in windows that you can then install a boon to or a couple other different versions of Linux, and you've got a full Linux environment running on your Windows system. And it opens essentially, instantaneously. And performance is good. But also it integrates with Windows. So when you want to use your file explorer to look at your files on the Ws L system and copied between your Windows system, it's easy to do. You can even do things like this is really cool, like you run a Jupyter notebook and go USL. And it will open up your Chrome browser in your Windows environment. So they are much more tightly integrated than just like a normal virtual machine or a dual boot system. And a lot of this is on the windows subsystem to right? Correct. If you do enough research and look back, people might complain that it doesn't do some of these things. But that was version one before it was kind of more integrated, correct? Yeah, there were some changes under the hood, especially with the file systems and the way they were set up and how the access is done in speed. So definitely anybody that's Think about going down this path use Ws l two. And the other thing that Windows has done is I don't have any specific insight, but it sounds like it's been very successful. And I know Microsoft is Ethan backported, Ws l to to some of the the prior builds so that you don't even have to be on the bleeding edge quite as much as you did in the past. So I think they've really hit on something with being able to use a full Linux system in Windows, and using the new windows terminal, which, you know, it's a little surprising to get excited about a terminal, given how long they've been around, though it's not It was horrible. But the windows terminal is really nice. And it Yeah, if you've worked on a Linux system, you're gonna feel right at home. And it's really powerful.

55:43 Yeah, absolutely. Now, I guess it's worth pointing out really quickly that you can do everything that you've been talking about up to this point without windows subsystem for Linux. But it's just if you want to, sometimes certain things work better on Linux, and you don't have to go create virtual machines and stuff if you don't want to. Yes,

55:58 and I will say I do think windows subsystem for Linux, I think you need administrative access to get that all set up and running. So that may be a little more difficult for people to do on their work computers. But if you have a computer at home or somewhere you do have admin access. It's pretty cool. Yeah. A lot of power.

56:18 Yeah, for sure. And then you have this other project called side table, which sounds like it makes displaying information exactly like an Excel user who found their way over to pandas might want to do better. Tell us about that project real quick.

56:31 Sure. It's a simple library I put together. But I've actually found it really useful. I use a lot of my day to day analysis, just to get quick summary tables of my data. So I think the probably the more correct term as it's a frequency table, but it gives you quick summary information, if you just want to know, on a new data set, like how much how are sales distributed by state or by region or by salesperson, it's just kind of a quick one line command that will give you a nicely formatted summary of your data. It doesn't require anything else outside of pandas. So it's easy to install, and really streamlined and kind of integrated with your data workflow. And so I've really enjoyed working on that. And I've enjoyed using it, and I hope others will as well. Yeah, people can just check it out. I'll put a link in the show notes. And it's visually very clear why you'd want a better presentation than some of the default pandas output, as it shows up, and it's nice. Yeah. Cool. Cool. All right. Well, before we get out of here, let me ask you the final two questions. If you're going to write some Python code, maybe you're going to do this kind of work that you're talking about, what editor do you typically use? So I'm changing my answer for the last time. So last time, I think I said sublime, but I am using Visual Studio code now. And back to the Ws l point. It integrates really well with wsl as well. Yeah, does. It runs everywhere. So yeah, I really enjoyed I'd say my one knock on it is I don't spend enough time in it to actually memorize all the shortcuts. So I try and force myself to get it more efficient with it every time I use it. But I really enjoy using it.

58:06 Yeah, there's a lot of stuff underneath the surface that you don't know that's there. Unless you know, to go look for it.

58:11 Yes. Yeah.

58:12 Very cool. And if you come from a sublime background, or atom, Visual Studio Code seems like the very natural like, Well, here's the modern version of that thing.

58:20 Yes, yeah, it's great. And I do keep up with you know, every month, they give updates, and they're really doing a lot of improvements on Jupyter notebooks support. And there's some pretty cool things you can do with that integration right now. So I'm excited to see how that evolves over time as well. Yeah, for sure. Alright, also notable pipe UI package, or conda package if you prefer to view it through that lens. But you know, something cool that people maybe don't know about, we could pip install side table as one of them. That's certainly one than the other one I'll go back to is we talked about plotly Express, there's a library called kaleido, I think is how you pronounce it. That actually makes it easy to export your images, the graphical images you create with plotly to a PNG or an SVG file. And I use that a ton just to generate some graphics, save them out, and you know, send them in an email or put them in a PowerPoint. And that's really high quality. So sometimes I've done the copy and paste. So you use the screen snipping.

59:20 It's not just the screenshot or whatever. Yeah,

59:23 exactly. So it doesn't sound that revolutionary, but it's just one of those things. It's really, really handy to have and really useful and it works well.

59:31 Yeah. Well, people spend a lot of time looking at presentations. And it just, if it has those, that fuzzy screenshot, he feels like, this isn't those kind of things add up. It's like having a good microphone when you're on zoom or something like that, right. It's exactly. You're never gonna get criticized for not having it but people just like to talk to you more. It's a more pleasant experience if they don't have to strain to hear you. Yeah, that was perfect analogy. Yeah. Thanks. All right, well, final call to action. People pythons popularity has grown since the last time we spoke by quite a bit. So the opportunity to use the Python data science tools are probably even more in their own now than they were then. What do you say?

01:00:12 Yeah, absolutely. I think you kind of wonder when that curve is going to flatten out. And I still think we have a long way to go. And Python just has so much to offer to automate and improve processes where people only know Excel. So I think trying to encourage people to start taking that journey, and if they're interested in the course, certainly would love to have them take the course. So I think that's a good place to get started. And I'd say, maybe if you're advanced enough with Python, and don't think you need the course, maybe it's a good place for your co workers to get started as well.

01:00:46 Yeah, absolutely. The course is great. Like I said, it inspired me. We're also doing a free webcast in a couple weeks, which is actually a month ago, because of time travel of podcast recording. But just another point of how much this kind of stuff is warranted, right. I sent out one email and one social media message and over almost 1000 people surely by the time it finished 1000 people have signed up for this webcast. Like Sign me up. I want out of this Excel thing.

01:01:11 Yeah, no, that's I'm really looking forward to that. I think it's gonna be great. I mean, it really was great. Yes.

01:01:16 It was great. It went really well. I'm not sure that he did. But I want to point out that it's recorded, and I'll put a link in the show notes. So whatever it is, we Chris and I come up with for that, you'll be able to check that out as well. So that'll be in the show notes.

01:01:28 Yeah. Great. Well, thanks a lot for having me really enjoyed the talk.

01:01:31 Yeah, thanks for coming back and helping us out with this Excel thing and getting more Jupiter more pandas and more Python.

01:01:38 Absolutely. Thank you.

01:01:39 You bet. Bye. This has been another episode of talk Python to me. Our guest on this episode was Chris Moffatt, and it's been brought to you by Voyager, the video game and linode Cloud Hosting booja is a game developed and published by totally consulting, travel the world in front of your computer and learn a whole lot. Visit talk python.fm slash Voyager, or just search for Voyager on Steam. That's vo YAG. Er, start your next Python project on the nodes state of the art cloud service. Just visit talk python.fm slash linode li in Eau de, you'll automatically get a $20 credit when you create a new account. Want to level up your Python if you're just getting started, try my Python jumpstart by building 10 apps course or if you're looking for something more advanced, check out our new async course the digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our everything bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite pod catcher and search for Python we should be right at the top. You can also find the iTunes feed at slash iTunes. The Google Play feed is slash play in the direct RSS feed at slash RSS on talk python.fm. This is your host Michael Kennedy. Thanks so much for listening. I really appreciate it. Get out there and write some Python code

Back to show page
Talk Python's Mastodon Michael Kennedy's Mastodon