#358: Understanding Pandas visually with PandasTutor Transcript

Recorded on Monday, Feb 28, 2022.

00:00 Pandas is a great library that allows you to accomplish a ton of filtering and processing in condensed syntax. But how well do you understand what's happening?

00:09 Sam Lau and Philip Guo built a great site to help us visually explore how Pandas is processing your data set with your specific syntax. It's called Pandas Tutor, and Sam is here to tell us all about it. This is Talk Python to Me, episode 358, recorded February 28, 2022.

00:41 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy, keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. We've started streaming most of our episodes live on YouTube. Subscribe to our YouTube channel over at talkpython.fm/youtube to get notified about upcoming shows and be part of that episode. This episode is brought to you by SignalWire and the Stack Overflow Podcast. Please check out what they're both offering during their segments. It really helps support the show. Transcripts for this and all of our episodes are brought to you by AssemblyAI. Do you need a great automatic speech-to-text API? Get human-level accuracy in just a few lines of code. Visit talkpython.fm/assemblyai. Sam, welcome to Talk Python to Me.

01:31 It's great to be back.

01:31 It's good to have you back. Previously you were here with Philip Guo, your research advisor at UCSD.

01:40 Yeah, Philip is my advisor.

01:42 Philip is a great guy. He's been on the show multiple times before. And you all were here previously to talk about.

01:50 You had analyzed an insane number of notebook environments, right?

01:54 Yeah, something like 60 of them.

01:56 Yeah, like 60. So people think of Jupyter Lab and maybe Google Colab, and then is there another one? Maybe 60. It's insane, right?

02:04 There's just a long list. We have like a giant table in that paper that takes up one whole page and it's a lot going on.

02:11 It was really fun to have you on to talk about that. And I suspect that list is longer now, not shorter. Of course, it just keeps growing. How about we just do a quick sync up? What have you been up to? When was that? That was about a year and a half ago, so it's been a while. Are you still over at UCSD?

02:29 Yeah. So I'm still at UCSD.

02:31 UCSD being University of California, San Diego, for those who are not hanging out on the West Coast of the US and aren't familiar with all the UC acronyms.

02:39 Yeah, UC San Diego. Here in sunny San Diego, I'm doing my PhD in cognitive science. But what that really means for me is most of my research is in a field called human-computer interaction, which, the way I describe it, is more or less the study of user interfaces: how people interact with computers. And for me, specifically, how people use computers to teach and learn programming and data science.

03:04 Yeah, that sounds super fun. There's just so many more interfaces these days. Human computer interfaces used to be just like how to use Windows or something like that, right.

03:14 Well, even before that, it was like, how do you type stuff into a terminal? Before we even had Windows. And then people from the community were like, windows would be good, and then they were like, touch screens would be good. And so now we have all these cool ways to input commands into a computer and make use of that.

03:30 Yeah, absolutely. And so it seems like you and Philip are somewhat focused on how developers interact with computers. Is that fair to say?

03:39 We kind of do a mix. So Philip, I think, has done a lot of developer oriented work, maybe like in the last five years or ten years or so. For me specifically, I'm more interested in how students and instructors use computers.

03:55 Okay.

03:57 How people create lecture slides using screenshots of their notebook environments, how people put code in their slides, and how they use code to teach while also talking over it at the same time.

04:10 Sure. Well, and notebooks are all about communicating computational ideas, whether that's science or development or data science, right?

04:18 Yeah, for sure. And nowadays we see a lot of notebooks being used in the classroom as well. So instructors will not only have lecture slides, but a lot of instructors will flip back and forth between lecture slides and a computational notebook like Jupyter. So they'll have lecture slides for a few minutes, and then they'll do some live coding in front of students for another few minutes, and then they'll switch back and forth. So we're seeing a lot of that sort of use case in the classroom. And that's the sort of thing that I'm really interested in because here we have people doing something visual, doing something verbal, but also now working with code and having students, like, see code in the classroom.

04:54 I think it must be such an advantage for students these days. I remember when I was in school, it was either the instructor will be writing something either on the blackboard or on one of those overlay things with a pen with a light that literally went through it, or you would just get a book or something. But now if you want to say, well, look how these forces in physics come together, or look how these chemical bonds are formed, you could see them actually, you could see actual animations and you could try new ideas. And it just opened up so much exploration, I think.

05:24 Yeah, I think it's really cool. And I don't know about you, but I would definitely prefer to write code in the computer rather than writing code in the overhead slide.

05:32 Oh, my goodness.

05:34 I cannot even imagine doing it that way. For sure. All right. Speaking of code, let's talk about your project. So you have this project called Pandas Tutor.

05:45 Yeah, that's right.

05:46 And this is a little bit of a next generation data science thing followed on from Python Tutor. Right. I see some similarities in the website and stuff like that.

05:56 Yeah. So the Pandas Tutor site is similar to the Python Tutor website because Philip used his old style sheet for both of them.

06:05 Sure.

06:05 We were getting this out the door at the time. We were very focused on squashing bugs and adding new features. And then the website was like, oh, okay, we got to launch this thing. We need something to put on. So Philip was like, okay, let me copy my styles.

06:20 Yes. His styles look good, so I think there's nothing wrong with that. It has very nice visualizations and it's interactive. So what is this? Why would people be interested in Pandas Tutor? It might sound like it's a series of video courses on Pandas or something, but that's not at all what it is. Right? Tell people what this is.

06:38 Yes. So Pandas Tutor is a little website tool where you can paste in some Python Pandas code, so Python code that works with data frames. What the tool will do is take your code and break it down step by step. It'll draw a little diagram for you at every step of your code. So I have a little example on the website. The example basically shows that if you have a few lines of Pandas code, when you run those lines in, let's say, a Jupyter notebook, all you get when you run the cell is the final output. So even if there are four steps in the recipe, all you get is the finished cake. You don't really get to see the middle steps that happen.

07:19 So often that's what you do: I'm going to transform this data frame with these multiple operations to get a new data frame. That's the destination I want to be at, right?

07:27 Totally.

07:28 Yeah. And it happens all the time in real practice for data analysis. And it also happens a lot in the classroom. So a lot of times when I'm teaching a little piece of data science, I'll write a few lines of code. But the problem is students tell me, there's a lot going on. Can you show me, like, step three in the code, or step four? And I have to manually comment out lines of code and display those data frames to students. So this kind of emerged from that use case.

07:55 Yeah, absolutely. And it's totally reasonable that you would comment out those lines, but it's still difficult to see the transformation, right? So if I say, what happens if I don't do, say, a group by, but I just get the previous result, well, then how do you put that side by side with the next step? There's not a great mechanism for saying put these side by side and see the changes without lots of scrolling back and forth and stuff.

08:22 Totally. Yeah. So it's a use case that came directly out of our experiences teaching data science, where a lot of times when we teach, we're like, okay, in the top cell the notebook will display dogs, and then the cell below it will do some stuff with the data frame, like dogs.sort_values. And then we have to say, okay, students, look at the top data frame and look at the bottom one, and then compare those two and stare really hard and try to figure out what happened between the two data frames. Whereas with Pandas Tutor, what I could do instead is just paste the code into Pandas Tutor. It'll display the two data frames side by side and draw some arrows between them, or add some coloring, so that I can see, okay, the rows were sorted. I don't have to ask people to stare really hard and imagine the rows moving from place to place.

09:14 And before we get too far into this episode, I do want to just point out to people that this is a really visual tool, which is its massive advantage. But that also puts Sam and me at a disadvantage for discussing it during this mostly audio presentation. So you might consider also checking out the YouTube stream and flipping around there, or just open up pandastutor.com and play around with it yourself. It's all about just a live example. So with that said, I want to dive into a few things. There's going to be an example here that we're going to cover a lot.

09:46 And it's an example using a Pandas data frame called dogs. The columns are things like the breed, which would be German Shepherd or terrier or whatever. Then the type: it's a herding dog or a hound dog or a toy dog. I love the toy dog. I have an idea. And then there's a filtering statement where you say dogs bracket dogs size equal equal medium. That's a standard Pandas, like, let's-do-this-filter type of where clause. And then there's the sort, where you sort by the type of dog, the herding or hound dog or working dog, whatever. And then you do a group by on the type. So show me all the non-sporting dogs, all the sporting dogs, and so on.
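For readers following along without the video, the steps described above can be sketched roughly as follows. The rows are made up for illustration (they are not the dataset from the Pandas Tutor site), and `numeric_only=True` is an addition that assumes a recent pandas release, where taking the median of text columns raises an error.

```python
import io
import pandas as pd

# Illustrative rows only; NOT the dataset used on pandastutor.com.
csv = """breed,type,longevity,size,weight
Labrador Retriever,sporting,12.0,medium,67.0
Brittany,sporting,13.0,medium,35.0
Bulldog,non-sporting,6.0,medium,45.0
Chow Chow,non-sporting,9.0,medium,55.0
Siberian Husky,working,12.0,medium,48.0
Beagle,hound,12.0,small,25.0
Great Dane,working,7.0,large,140.0
"""
dogs = pd.read_csv(io.StringIO(csv))

result = (
    dogs[dogs["size"] == "medium"]   # step 1: keep medium-sized breeds
    .sort_values("type")             # step 2: sort by type of dog
    .groupby("type")                 # step 3: group rows by type
    .median(numeric_only=True)       # step 4: per-group median of numeric columns
)
print(result)
```

Run in a notebook, this shows only the final table, which is the "finished cake" problem discussed earlier: none of the intermediate data frames are displayed.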

10:26 Yes, that's right.

10:27 We'll be using this data quite a bit. Before we get into analyzing Pandas, I guess one thing I was wondering is, it's super focused on Pandas and Pandas data frames. Did you consider making it for other tools, like a NumPy one or a TensorFlow one? Why Pandas specifically, out of the data science space?

10:49 It's a great question. Pandas happens to be, like, the de facto tool for working with data tables in Python, and it's taught a lot in introductory courses specifically. So a lot of intro data science courses, when they teach students how to work with data tables, will also teach students how to work with Pandas. We chose to focus on Pandas because of those two aspects. So one, it's a common standard package that a lot of people use in practice. And it's also a package that a lot of people learn when they're learning data science for the first time. They'll learn Pandas as one of the first stepping stones to learn how to do data analysis. Now there is another version of this tool for the R tidyverse world. It's called Tidy Data Tutor instead of Pandas Tutor. And it was made by one of my labmates and Philip as well. So the three of us kind of work together. I worked on the Pandas side and then my labmate worked on the R tidyverse side.

11:47 Sure. Okay. Yeah, that's cool. But regardless of which side of data science you're on, you can grab this and run with it.

11:55 Yeah. And as for other tools, I think the approach that the tool uses to analyze code could be applied to all sorts of other tools, including tools like TensorFlow. And that's probably one of the exciting things for me is that like this for me is like a stepping stone towards all sorts of visual tools for learning Python packages or other data science specific things that otherwise are difficult to understand, things that you would normally draw a picture to understand. Normally, I can imagine a tool like this drawing a picture for you or helping you when you're explaining your code to someone else.

12:27 When I first learned about this, Brian Okken and Leah Cole and I spoke about it over on the Python Bytes podcast. And Leah made a really interesting point when she saw this. My first impression was, okay, this is a great way to teach students who are getting into data science. And it sounds like that's a very solid use case. But Leah said, hey, this would be really good if you just went to Stack Overflow or you picked up some new code and you're like, what does this do? This is a complicated Pandas expression. Let me throw it in here and you could visualize that. So I want to sort of put that out there for people to think about. Even if they're not learning data science right now, they're probably still encountering algorithms and data sets that are new to them. And that might be useful as well. What do you think?

13:11 Yes, I totally agree with that. And actually one of the ideas that Philip and I had was to do a little bookmarklet, where if you put the bookmarklet in your browser's bookmarks bar, and then you're on Stack Overflow and you click that button, we could even show you those diagrams, like the ones Pandas Tutor creates, inline, if you click it on, say, the Stack Overflow website or the pandas documentation. And I've certainly run into the problem where I'm trying to Google, like, how do I unpivot a data frame? And then I pull up the first Stack Overflow result and it works. But it's like five lines of complicated machinery and I'm trying to figure out what's going on, and I have to kind of walk through it step by step myself to make sure it's not messing up my data in some way. And so I can imagine a tool like this, if you could display the visualizations inline and just put them in Stack Overflow's website itself on the user side, could be a nice way to make use of the visualizations in real practice.

14:09 That would be really cool. Just install a browser helper so that every time it detects a Pandas statement, it just puts a little...

14:17 Yeah, like a little Chrome extension. That would be super neat.

14:19 I would use that, yeah, for sure. So let's click on start visualizing your Python code. Now that will pull this up and run. And what happens, for those of you who are not familiar with Python Tutor and Pandas Tutor, is you put a block of code in here and it actually executes it in a container or something like that, right?

14:39 Yeah, that's right. So it'll run your code in a Docker container on one of our servers.

14:43 Nice. You don't have to just explore other people's code. Right. You can put whatever you want in here. Now, the first thing that I think about when I think about data frames is data.

14:53 Obviously, you've got to get some data in here. And as we were talking about this Chrome extension, one of the challenges for me is, well, here's a cool little code snippet, but it has to have the proper data frame data back in it before it really means anything, right?

15:08 Yeah, that's right. It's a challenge.

15:10 It is. So how do you get data in here? What are the options for getting data in here?

15:13 Okay. So the one challenge about hosting a tool that runs code for other people is that you run into issues when you let the tool also have Internet access, because then people start mining Bitcoin.

15:28 Exactly.

15:30 Running arbitrary code from the Internet, how could it go wrong? I know it's in a container, so it's not going to harm you or the other things, most likely, unless there's an exploit in Docker. But they're going to exploit your computational resources and things like that, right? That's why you probably don't have full root access to do whatever you feel like here.

15:51 Yeah. So the tool cuts off Internet access for your code, and it also imposes a few memory limits. I don't think you can write, or you might be able to read or write to disk temporarily. I don't quite remember the details of that, but I do know that it does restrict Internet access. So one common way that we'll get data sets in class is we'll do pandas read_csv and put in the URL of a .csv file. And unfortunately that doesn't work for the tool. So the way that we're getting around this right now is in the examples that we have for Pandas Tutor on the website, we have inline CSV. So we put a snippet of a CSV file as a Python string, and then we can read that string into Pandas as if it was a CSV file.

16:36 Yeah. So you just drop it in as part of the code, as a literal string, the triple quotes that will go multi line. So you just drop it in the middle of that, right? Yeah.
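A minimal sketch of the inline-CSV workaround described above, assuming a couple of made-up rows. `io.StringIO` and `pandas.read_csv` are standard calls; the variable names are only illustrative.

```python
import io
import pandas as pd

# A few CSV lines pasted in as a triple-quoted string (made-up rows).
csv_text = """breed,type,size
Beagle,hound,small
Border Collie,herding,medium
"""

# io.StringIO wraps the string in a file-like object, so read_csv can
# parse it without touching the disk or the network.
dogs = pd.read_csv(io.StringIO(csv_text))
print(dogs)
```

Because nothing is fetched from a URL, this pattern still works in a sandbox that has no Internet access.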

16:47 And I definitely don't think it's an ideal solution. One stepping stone to a future solution would be to include some example data sets that people can load in directly with Python code.

17:00 That's a good idea.

17:01 The R version of this tool has some built-in data sets because of R's built-in packages. So R comes with, I think, a default cars data set, and maybe some flowers data set, so you can load those in directly, and that makes the R version of this a little bit more convenient.

17:19 We're thinking about ways of loading in data right now, but in terms of being able to load in arbitrary data, the only real option right now is to go into a .csv file and copy and paste a few lines of it into the tool directly.

17:34 This portion of Talk Python to Me is brought to you by SignalWire. Let's kick this off with a question. Do you need to add multiparty video calls to your website or app? I'm talking about live video conference rooms that host 500 active participants, run in the browser, and work within your existing stack, and even support 1080p without devouring the bandwidth and CPU on your users' devices. SignalWire offers the API, the SDK, and edge networks around the world for building the realest of real-time voice and video communication apps with less than 50 milliseconds of latency. Their core products use WebSockets to deliver 300% lower latency than APIs built on REST, making them ideal for apps where every millisecond of responsiveness makes a difference. Now you may wonder how they get 500 active participants in a browser-based app. Most current approaches use a limited but more economical approach called SFU, or selective forwarding units, which leaves the work of mixing and decoding all those video and audio streams of every participant to each user's device. Browser-based apps built on SFU struggle to support more than 20 interactive participants. So SignalWire mixes all the video and audio feeds on the server and distributes a single unified stream back to every participant. So you can build things like live streaming fitness studios, where instructors demonstrate every move from multiple angles, or even live shopping apps that highlight the charisma of the presenter and the charisma of the products they're pitching at the same time. SignalWire comes from the team behind FreeSWITCH, the open source telecom infrastructure toolkit used by Amazon, Zoom, and tens of thousands more to build mass-scale telecom products. Sign up for your free account at talkpython.fm/signalwire and be sure to mention Talk Python to Me to receive an extra 5,000 video minutes. That's talkpython.fm/signalwire, and mention Talk Python to Me for all those credits.

19:24 It's not a huge problem because the goal here is not to execute and get results from it. It's like, what would it look like if I did this?

19:34 Yeah, that's right. So it's not really about visualizing, like, gigabytes of data frame transformations, because in those cases, we're going to draw a bajillion arrows and it's really hard to see what's actually happening between the two data frames.

19:48 Exactly. From an understanding perspective, it might be worse, right? Because, as people will see as they explore pandastutor.com, it's putting lines and little interactive widgets all over the place to show how the data flows. And if you've got too much, then it's just lines everywhere.

20:02 Yeah.

20:03 Cool. All right, so we've got this data, which we already talked about a little bit, the dogs. And so we've got, like, a Labrador retriever, which is a sporting dog that lives to twelve years, is medium size, and is 67 pounds, and so on. So that's the kind of data that we're working with. You jam that into an in-memory CSV string, use io.StringIO to treat it like a file stream, and tell Pandas to read the CSV.

20:29 Okay.

20:29 So then you want to know, basically, let me go to the final. You want to know, given a type of dog like nonsporting sporting working, what is its median longevity in years and its weight. So that's the kind of question that you're trying to answer, which is kind of incredible, that given this data, it's only three lines of Pandas to answer that.

20:49 Yeah, pandas does a lot behind the scenes.

20:51 Yeah, that's pretty remarkable, actually. So what happens when you say visualize this is it takes that same code block that we've been talking about, dogs, dogs of size equal equal medium, sort values, group by, median, and so on, and puts it on the screen, where the parts that are not relevant at the moment are grayed out, and the code that's actually applied at that step is in a strong or regular color font. That's really clever. How did you come up with that idea, to show the flow through this code by dimming the other parts of the code?

21:28 When we were making this tool, our first cut of this was actually just to show the code that we were running itself.

21:33 Right. So you see, like dog bracket size, equal, equal medium or something like that.

21:37 That's exactly right.

21:38 So in the first version of this tool, if we were filtering a data frame and then sorting it, we would just show the code for filtering first, and we showed all the code for filtering. And then when we sorted the data frame, we would show the code for filtering and sorting, because that's the snippet of code that we would execute and visualize. It ended up being less helpful, in our opinion, especially when we have longer chains, because it looks like there are five things running, but really all we're visualizing is the last step. So the solution we came around to was to say, okay, we'll display all the code, but bold and highlight the part that we're visualizing in this step. So if there are four steps, then we'll have four little diagram displays and we'll highlight the individual step that we're running each time.

22:32 Yeah. There's a lot of subtlety in this tool. At first glance, you don't see exactly how much it's communicating, but there are a lot of nice touches like that. So in this first step, it says we're going to go and filter down our data set from all the different types of dogs, like large dogs, small dogs, and so on, to just look at medium-sized breeds. So then we get to the thing that we were talking about, where you've got two data frames, as if you had done just, like, df.head() or something, where it puts the top and then a few listings of them.

23:05 That's right.

23:05 But it puts them side by side. Right. Tell us what's on the screen here. Describe this.

23:10 Yeah. So on the left, we display the original data frame dogs, and then on the right, we display the data frame dogs, but with the medium dogs filtered. So the right data frame only has medium dogs. The left data frame has all the dogs in the original data frame.

23:25 Right. So that's the filtering that you might first see. But then you also have lines which are pretty cool.

23:30 Yeah. So what we're doing here is we're showing which rows from the original data frame made it into the right data frame. So in the original data frame, it looks like rows three through seven and a few other rows made it into the right side. And what we're doing is we're drawing arrows from the left side to show which rows got copied over into the right side.
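One way to see the "which rows made it across" idea in plain pandas is to look at the index labels after filtering. This is only a sketch on made-up rows, and it is not a claim about how Pandas Tutor computes its arrows internally.

```python
import pandas as pd

# Made-up rows for illustration.
dogs = pd.DataFrame({
    "breed": ["Beagle", "Border Collie", "Great Dane", "Bulldog"],
    "size": ["small", "medium", "large", "medium"],
})

mask = dogs["size"] == "medium"   # one True/False value per original row
medium = dogs[mask]

# Filtering keeps the original index labels, so they record which rows
# of `dogs` ended up in `medium`.
kept_rows = medium.index.tolist()
print(kept_rows)
```

Each label in `kept_rows` corresponds to one arrow from the left table to the right table in the kind of diagram being described.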

23:52 Yeah. So in that filter, you can sort of see them going across, which is fantastic. And the other part that's subtle, because there are two more, is that this looks like a static picture. It almost looks like a JPEG or something. But then as you hover over a particular row in the original data set, it will highlight just what happens to that row and where it ended up, either directly or in some kind of aggregation, like a sum or something, on the right-hand side. Right? So there's a really cool sort of hovery, follow-this-piece-of-data-along story.

24:24 Yeah, that's right. And you can do it on the right side, too. So if you hover on the right side, it'll show you the rows that went into that from the left side.

24:31 Okay.

24:31 This is convenient when there are a lot of arrows. The main reason why we did this was that in some cases there are a lot of arrows and you want to just show a few of them. And the easiest way to do that in this tool right now is to highlight by moving the mouse over a row. Yeah, that's one subtle thing here.

24:48 In this case, it's not super relevant because it's just a filter. When you get down to group by and stuff like that, you're like, well, show me everything that went into this row on this group by these three.

24:57 That's right.

24:58 That's pretty cool.

24:59 Yeah. So it's especially helpful when you have multiple rows on the left going into a single row on the right or some weird combination of rows that otherwise would be hard to see.

25:09 Yeah, for sure. One final thing is we're doing a filter by this comparison on the size column of the data frame, so dogs bracket size equal equal medium. And on both of these, you've got a vertical outline on the size column. That's another aspect that's really nice to highlight what's going on here, right?

25:32 Yeah, that's right. So we're drawing a box around the size column that's meant to show the user what column we're using for filtering. There's actually one more thing here I want to point out, which is that if you display this data frame in a Jupyter notebook, you might not actually see any medium-sized dogs because they're hidden in the middle rows of this data frame.

25:52 Yeah.

25:52 And so by default, Pandas displays a few rows at the top and the bottom of a data frame. But sometimes when you're filtering, the rows that you're filtering only come from the middle of the data frame, which are hidden by default. So you have to really scrub through a data frame yourself in a Jupyter notebook to see that there are medium dogs in the first place. So one thing we're doing with this tool is whenever we draw an arrow, we make sure that the rows on the left are visible to the user. So we'll selectively hide and show rows to make sure that you always see the rows that we're making use of.
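The truncation behavior mentioned here can be reproduced with a small experiment. The sketch below uses 100 made-up rows and explicitly sets the `display.max_rows` and `display.min_rows` options to 10 so the effect does not depend on local settings.

```python
import pandas as pd

# 100 made-up rows; only rows 40 through 59 are "medium".
sizes = ["small"] * 40 + ["medium"] * 20 + ["small"] * 40
dogs = pd.DataFrame({"size": sizes})

# With a 10-row limit, the printed table keeps only the first and last
# few rows and elides everything in between.
with pd.option_context("display.max_rows", 10, "display.min_rows", 10):
    shown = repr(dogs)

print(shown)
print("medium visible in the printout:", "medium" in shown)

# The rows exist all the same; filtering finds them.
medium = dogs[dogs["size"] == "medium"]
print(len(medium))
```

The printout hides every medium row even though the filter matches twenty of them, which is the situation the tool works around by choosing which rows to show.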

26:24 Right. The most important ones are the ones that are included. So let's include some of them, right?

26:29 That's right. That's right.

26:30 Yeah. I've got these little subtle "one more", "four more" notes in this section and so on. So that's pretty neat. All right. So far, we're at step one. We've taken all the different types of dogs and we're now down to a data frame of just medium-sized dogs. The next thing to do was what, sort them, right?

26:47 Yeah, that's right.

26:48 Okay. So maybe tell us about this stuff here.

26:49 Yeah. So in this step, we're sorting the rows by the type column. So again, we're going to highlight or draw a box on the type column for the input and the output to show the user. This is the column we're using for this operation. And here we're drawing arrows and the arrows kind of crisscrossing this diagram because some of the rows go to the top of the data frame and some of the rows go to the bottom of the data frame after we sort them.
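The crisscrossing can also be read off the index in plain pandas: after `sort_values`, each row keeps its original label, so the index of the sorted frame lists the old labels in their new order. A small sketch on made-up rows, with made-up labels standing in for whatever an earlier filter left behind:

```python
import pandas as pd

medium = pd.DataFrame(
    {
        "breed": ["Labrador Retriever", "Siberian Husky", "Bulldog", "Border Collie"],
        "type": ["sporting", "working", "non-sporting", "herding"],
    },
    index=[1, 4, 6, 9],  # hypothetical labels left over from an earlier filter
)

by_type = medium.sort_values("type")

# Each original label paired with its new position after sorting.
moves = list(enumerate(by_type.index))
for new_pos, old_label in moves:
    print(f"row {old_label} -> position {new_pos}")
```

The types are unique in this example so the result does not depend on how ties are ordered.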

27:11 Yeah. As you would expect. But you can see, after the sort, here's where everything started and ended up. So that's pretty straightforward, I would say. And then you have a group by one. Tell us how you visualize the group by stuff, because that's pretty interesting to think about. How do you represent the two-dimensional data but then, additionally, the groups?

27:30 Yeah.

27:31 So when we group data, what we do visually is we use the same color to highlight the backgrounds of each row within a group. So in this example, all the non sporting dogs will get highlighted blue, all the sporting dogs get highlighted light green, and then all the working dogs get highlighted red. So you can see visually how the rows are put together in groups by pandas.

27:54 Yeah, fantastic. And then finally, a simple but also very cool visualization is you just go to that data frame and you say, give us the median of that. And I guess I don't know Pandas super well, but I guess if you just say dot median, that'll just give you the median of all the columns that are numerical. Is that what happens?

28:11 That's right. So it'll give you the median for all the dog longevity and the median for all the dog weights.

28:16 Which is the two numerical columns. Okay.

28:20 And in this case, what we're doing instead is because we've grouped the data frame before taking the median Pandas does the median within each group. So it'll find the median longevity and weight for the non sporting dogs and then for the sporting dogs and then for the working dogs.

28:35 What we're doing in the diagram here is we're showing how the four non-sporting dogs on the left side get aggregated together into one row on the right-hand side. So there are four arrows going from four rows on the left side into one row on the right.

28:50 Yeah, that's awesome. And this is probably the case where showing where the lines go is most interesting, because the first four results are all non-sporting, so they contribute to row one. So you've got all these arrows going into that first group and so on. I think that really helps visualize how this is computed.

29:09 Yeah, and that's right. And group by is especially a tricky thing when we teach students because students are like, I understand sorting because the rows get moved around. When we do group by and then do something afterwards, all of a sudden my rows disappear. Where do my rows go? And so here we can see exactly where all the rows go.

29:24 Yeah. That is very tricky because often you don't see that group by intermediate representation. Right.

29:30 Yeah. And Pandas actually does not help people understand the group by. So in the notebook, if you do group by by itself, all you get back is, like, this Pandas GroupBy object at some memory address. And for novices, it's like, I have no idea what just happened. And so what we decided to do here, as kind of a design step, is instead of showing the default group by display, which is just text, we extended it out to the original data frame plus some colors.
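To make the point about the default display concrete, here is a rough sketch on made-up rows. It prints the opaque `repr` of a grouped data frame and then loops over the groups to list which rows belong to each one, which is roughly the information the colored rows convey.

```python
import pandas as pd

# Made-up rows for illustration.
dogs = pd.DataFrame({
    "breed": ["Bulldog", "Chow Chow", "Brittany", "Siberian Husky"],
    "type": ["non-sporting", "non-sporting", "sporting", "working"],
    "weight": [45.0, 55.0, 35.0, 48.0],
})

grouped = dogs.groupby("type")
print(repr(grouped))  # only an "... object at 0x..." line, no rows

# Looping over the grouped object yields (group name, sub-frame) pairs.
members = {name: rows.index.tolist() for name, rows in grouped}
print(members)

# Aggregating collapses each group into a single output row.
print(grouped["weight"].median())
```

Here the two non-sporting rows feed one output row, which is where rows seem to "disappear" for newcomers.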

30:00 Yeah, it really is a great visualization. So I guess while we're talking about setting up your code, one thing that's interesting here is I can go to the top and I can jam in my code, and maybe I could change this to, like, I don't know, max. Is it going to work? I got a different result by running max, and I might play and explore and so on. And then I'm like, I want to save this. I want to save it for myself or I want to share it on Twitter because I just got to say something on Twitter and let people check it out. I want to put it in a gist, I want to share with my students, whatever. So you have a handy little thing at the bottom that lets you create sort of a shareable link type of thing. How does that work? Do you just Base64-encode the stuff at the top into the URL, or what happens?

30:46 It's actually even simpler than Base64 encoding. We just put the code verbatim into the URL. If you're looking at the URL, like the shareable URL, you'll see that we put the code, like import pandas as pd, into the URL there.

31:00 Yeah, that's true. No, it's URL encoded.

31:02 It's just URL encoded. Yeah. So we don't do anything super special with it. And this works because URLs can actually be pretty long. Like, browsers will accept pretty long URLs. And so for a lot of cases, we actually can put the entirety of the code in the URL itself for sharing.
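A minimal sketch of the URL-encoding idea, using only the Python standard library. The base URL and the `code` parameter name are placeholders chosen for this example; the episode does not say how Pandas Tutor actually lays out its links.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Code a user might type into the editor.
code = (
    "import pandas as pd\n"
    'dogs = pd.read_csv("dogs.csv")\n'
    'dogs.groupby("type").median(numeric_only=True)'
)

# Hypothetical base URL and parameter name, for illustration only.
BASE_URL = "https://example.com/visualize"
share_url = BASE_URL + "?" + urlencode({"code": code})

# Whoever opens the link can decode the exact same code back out.
recovered = parse_qs(urlsplit(share_url).query)["code"][0]
```

Nothing is stored on a server in this scheme: the link itself carries the code, which is why it only works while the code stays within the URL lengths browsers accept.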

31:19 This portion of Talk Python to Me is brought to you by the Stack Overflow Podcast.

31:24 There are few places more significant to software developers than Stack Overflow, but did you know they have a podcast?

31:31 For a dozen years, the Stack Overflow Podcast has been exploring what it means to be a developer and how the art and practice of software programming is changing our world. Are you wondering what skills you need to break into the world of technology or level up as a developer? Curious how the tools and frameworks you use every day were created? The Stack Overflow Podcast is your resource for tough coding questions and your home for candid conversations with guests from leading tech companies about the art and practice of programming. From Rails to React, from Java to Python, the Stack Overflow Podcast will help you understand how technology is made and where it's headed. Hosted by Ben Popper, Cassidy Williams, Matt Kiernander, and Ceora Ford, the Stack Overflow Podcast is your home for all things code. You'll find new episodes twice a week, wherever you get your podcasts. Just visit talkpython.fm/StackOverflow and click your podcast player icon to subscribe. One more thing. I know you're a podcast veteran and you could just open up your favorite podcast app and search for the Stack Overflow Podcast and subscribe there. But our sponsors continue to support us when they see results, and they'll only know you're interested from Talk Python if you use our link. So if you plan on listening, do use our link, talkpython.fm/StackOverflow, to get started. Thank you to Stack Overflow for sponsoring the show.

32:50 So people can come down here and they can copy this URL and then they can do whatever they want, either save it as a bookmark for themselves or share it. Right? That's cool.

32:57 Yeah, that's right. So the idea is if I had, let's say, like an email I'm sending to a colleague and I send them a little picture of data frames, like with some arrows. I can also send this URL, which will let the other person view, like the original code I used to create the diagrams.

33:13 Sure. So maybe it makes sense to do like a snapshot screenshot of just one step and you say, well, here you can explore the whole thing and rerun it and so on. It also kind of leads to the reproducible publications, reproducible science. If you're trying to explain to people what this stuff does, it might be worth putting that in the paper, right?

33:33 Yeah, exactly.

33:34 One use case we imagined was like, let's say for a lecture slide, I'll take a screenshot of this page. I can also put in the URL, like some small text at the bottom of the lecture slides, so people can play around with it afterwards.

33:46 You need a URL shortening service like Pandas tot, or Total.

33:54 Yeah.

33:55 I don't even know or domains, but they should be.

33:58 I don't know, maybe.

33:59 But like a Bitly equivalent of a short one. Yeah. Cool.

34:03 Yeah.

34:04 All right. Well, I do see there's a place I can suggest improvements.

34:08 Let's send a little email.

34:12 Yeah. One thing we do want to do is to include, like, save this diagram as a PNG or SVG button.

34:19 Oh, nice.

34:19 We haven't gotten around to it, but we would like to include that at some point.

34:23 Yeah, that'd be great. You know what would be fantastic is saved as an animated GIF.

34:26 Yeah. That would be cool.

34:27 Where you could sort of see, like, the little arrows run by or something.

34:33 That's a cool idea. Yeah. Think about that.

34:35 Well, just give me more work. So super cool project. This project is part of your PhD work, right?

34:42 Yeah, that's right.

34:44 What's next? Where is it going from here?

34:46 I'm looking to graduate, like next year, spring. So my hope is that this project will be like a major piece of the thesis, and I talk about it a lot. And as for the tool, next steps in the short term, we'd like to expand the types of visualizations it can do. So right now, it doesn't know how to do joins, for instance, or pivots. And so we'd really like to include those because those also tend to be really confusing for people learning for the first time.

35:14 They're in the group by category of hard to understand. Yeah, I think they're worse, actually.

35:19 Yeah, it's a tossup. I think in my experience, students will always have at least one of those three to struggle with: group bys, joins, and pivots. Those three are just really confusing. And longer term for this tool, we're thinking it would be great to have, as I alluded to a little bit earlier, a version of it that we can embed into, like, Stack Overflow or even the kind of documentation itself. And we're also looking at different alternative ways of drawing arrows and colors to see what can help students. And then we're also looking at ways of applying this general code analysis approach to other tools for visualization.

35:57 Yeah. There are certainly other tools that could benefit from it, although not so many of them are as popular as Pandas. But for example, Dask is Pandas-like, but it's fairly complicated what happens to compute stuff.

36:11 Yeah.

36:11 Not understanding the ideas of it necessarily, but it's like, okay, well, it's going to go after this cluster and these things happened.

36:17 There could be interesting things there.

36:19 Yeah, totally. There's Dask, and now there's other tools like Ray, and even, if we're going really far out, there are tools like Spark and MapReduce and stuff that distribute computing across different computers. Visualizing those would be really cool because for me, as a user of those tools, sometimes when one computer breaks, it's really hard to debug and figure out what happened.

36:40 It definitely is. Another one is just database queries.

36:44 Oh, yeah, for sure.

36:45 Right. I mean, it's pretty similar. You've got, like, group bys, you've got sorting, you've got filtering, select where clauses, and yeah, I think that would be a pretty natural match as well. But, yeah, there's a lot of places for it to go. So, very cool. Are you using this in any of your courses and things like that?

37:01 At the moment, I'm not teaching any data science courses, but this summer I will be teaching a course at UC San Diego. And so I'm really hoping to use this tool to make some diagrams to copy and paste into my lecture slides. Before, I would make these diagrams manually using Google Slides, like arrows and shapes and such. But it's a huge pain. It's a massive pain.

37:22 Oh, gosh, I know. It's so painful. As someone who does online courses, I'm like, I really want to put some arrows and show how this goes.

37:28 Like, yeah, it's a lot of work. It's a lot of work. Yes.

37:32 I've actually used Python Tutor in my beginner Python course to show people, like, okay, when you have two variables, but they point to the same object in memory, here's why changing one of them changes the other, for example. Because they're really the same, they're just looking at the same thing. So they're changing the thing they point at and whatnot. Pretty cool. So very nice. Yes. And you're also working on a book, right?
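The aliasing behavior mentioned here can be reproduced in a few lines of plain Python. The variable names are arbitrary.

```python
# Two names bound to the same list object.
a = [1, 2, 3]
b = a

# Mutating through one name is visible through the other.
b.append(4)

# An explicit copy is a separate object, so changes to it stay local.
c = list(a)
c.append(5)

same_object = a is b       # both names point at one list
copy_is_separate = c is not a
```

After this runs, `a` and `b` both show the appended 4, while only `c` contains the 5.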

37:55 Yeah, it's a book. The website right now says Principles and Techniques of Data Science, but we're renaming the book to Learning Data Science.

38:03 Okay, cool. When is that coming out?

38:06 That's a great question. Question to ask every book out there.

38:10 Yes, I know.

38:11 We're really hoping to get a first cut of it by the end of this year. We're about one third or halfway through the content right now, but we're hoping to get the first cut of it out by the end of this year and then go through the whole editing cycle and pre-publication work and have it published sometime thereafter.

38:33 Yeah, fantastic. Well, good luck on that. We are getting a little bit short on time, mostly because I have an extra tight constraint today. Sorry about that.

38:41 No problem.

38:41 But let's talk a little bit about the internals. So we talked about uploading some code in the text area field of the website, how it runs in a Docker container. It has restrictions. But how do you all pull this off? Like, understanding some of these things that we're getting color and pictures and arrows for? I don't think that's built into Pandas or Python, is it?

39:02 No, it's not.

39:03 How do you pull this off?

39:04 Yeah. So what we're doing to actually make this work is we're parsing the code behind the scenes and then running each step using sort of like a debugger. So what we're doing is we parse the code to split it up into the steps that we want to run. So in this case, we'd split it up to say the first step is dogs and the second step is filtering and third step is sorting and then grouping and then taking the Max.

39:30 So you don't go all the way down to disassembly, like dis.dis, because that's many more steps than this. Okay.

39:38 Yeah. So we're trying to keep it roughly at the level of function calls. We just split up into individual function calls. And in this case we have like a slice syntax for filtering. So we split that up as well.
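One way to picture splitting a chained expression at the level of function calls is with Python's built-in `ast` module. This is only a sketch of the general idea, not Pandas Tutor's actual parser (its source is not public). The function name `chain_steps` and the example expression are invented here, and `ast.unparse` needs Python 3.9 or newer.

```python
import ast


def chain_steps(source: str) -> list[str]:
    """Return each prefix of a method chain, from the innermost value outward."""
    node = ast.parse(source, mode="eval").body
    steps = []
    while True:
        steps.append(ast.unparse(node))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            node = node.func.value  # drop the trailing .method(...)
        elif isinstance(node, ast.Subscript):
            node = node.value       # drop the trailing [...] filter
        else:
            break                   # reached the starting name
    return steps[::-1]


expression = (
    "dogs[dogs['size'] == 'medium']"
    ".sort_values('type').groupby('type').max()"
)
steps = chain_steps(expression)
```

For this expression the result has five entries: the bare `dogs`, then the filter, the sort, the grouping, and finally the `max()` call, which lines up with the steps listed in the conversation. Chains containing plain attribute access or other syntax are not handled by this sketch.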

39:50 Okay. So then you basically run each step. You have the before data frame and the after data frame.

39:56 That's right. Yeah.
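A sketch of the "before and after" idea: once the steps are known, each one can be evaluated in turn and consecutive results paired up. The step strings and data are made up, and using `eval` like this is only reasonable for trusted code or inside a sandbox such as the Docker container mentioned earlier in the episode.

```python
import pandas as pd

# Made-up data for illustration only.
dogs = pd.DataFrame({
    "breed": ["Boxer", "Akita", "Poodle"],
    "weight": [70, 100, 55],
})

# Hypothetical list of steps; each string is a longer prefix of one chain.
steps = [
    "dogs",
    "dogs.sort_values('weight')",
    "dogs.sort_values('weight').head(2)",
]

# Evaluate every step, then pair each result with the one before it.
namespace = {"dogs": dogs}
snapshots = [eval(step, namespace) for step in steps]
transitions = list(zip(steps[1:], snapshots, snapshots[1:]))
```

Each entry of `transitions` holds a step together with the data frame before it and the data frame after it, which is the raw material a diagram needs. Re-evaluating every prefix from scratch is wasteful for long chains, so a real tool would more likely reuse the previous result.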

39:57 So we run a step unwound expression. Yeah.

40:00 We run a step to get, like, the left-hand side and the right-hand side. And then we essentially have some special rules where if we see sort_values, we'll use this rule for drawing arrows. If we see a group by, we'll use that rule for drawing arrows. So underneath the hood, it's really just a lot of heuristics and rules for specific Pandas functions.

40:18 Yeah. So you might say if it's doing a group by and here's the groups, here's how we're going to understand which group this goes to. So you're like it's grouping on the type column. That's right. Then you know which thing to point it to. I see.

40:32 Yeah, that's right. And we could have done a smarter approach using some deeper code analysis or machine learning. But I think for teaching purposes, we really wanted to avoid the case where we draw arrows wrongly, where we draw arrows that are not supposed to be there. Yes.

40:50 That'd be confusing, wouldn't it?

40:52 Yeah. And especially for learners, we would rather just not draw arrows than draw wrong arrows. I think that was one of our main design decisions here. So that's why we resorted to a simple approach of just using rules and heuristics. But it gets the job done.
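The rule-per-function approach, with "no arrows" as the safe default, might look something like the sketch below. The registry, function names, and data are all invented for illustration and say nothing about how Pandas Tutor is really organized. The sort rule also assumes row labels are unique.

```python
import pandas as pd


def sort_arrows(before, after):
    """One arrow per row: (old position, new position), matched by row label."""
    old_position = {label: i for i, label in enumerate(before.index)}
    return [(old_position[label], new) for new, label in enumerate(after.index)]


def no_arrows(before, after):
    """Fallback for methods without a rule: draw nothing rather than guess."""
    return []


# Hypothetical registry mapping a Pandas method name to its drawing rule.
ARROW_RULES = {"sort_values": sort_arrows}


def arrows_for(method, before, after):
    return ARROW_RULES.get(method, no_arrows)(before, after)


dogs = pd.DataFrame({
    "breed": ["Boxer", "Akita", "Poodle"],
    "weight": [70, 100, 55],
})
sorted_dogs = dogs.sort_values("weight")

known = arrows_for("sort_values", dogs, sorted_dogs)
unknown = arrows_for("rolling", dogs, sorted_dogs)
```

The sort rule yields one arrow per row, while a method with no registered rule yields an empty list, mirroring the stated preference for missing arrows over wrong ones.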

41:06 Sure. What about source code? Can people come and check it out and play with it, or is it really they can just play with the website right now?

41:14 Only the website is available for the public. The source code right now we're thinking about making it open source, but right now it's not really in a state for the rest of the world to see.

41:25 What about the front end stuff here? When I went and did a view source on it, there's not a lot to see. Basically some JavaScript and five or six empty locations on the page. Right. What's the story there?

41:38 Yeah. So the front end code is a similar story to the back end code. It's like pretty messy and very specific to our workflows right now. So essentially the issue right now is getting it set up on someone else's computer would be like a massive pain. That's one of the main issues with releasing the source code out.

41:58 If we were to release it right now as open source and people wanted to play around with it, then we would probably get tons of emails from people asking how to debug this one, like, really weird step that we're doing or how to make it work. So that's something that we'd like to get around to. But for the first release, unfortunately, we ran out of time and we just wanted to get the tool out the door.

42:15 Yeah, I know that's the most useful bit is for people to go and play with it, right?

42:19 Yeah.

42:20 Fantastic. Any chance of an offline version, like a progressive web app?

42:24 So I actually would like a version of this tool that would work as a Jupyter Lab extension, because Jupyter Lab has a Python back end that we can make use of. So I could totally imagine this being used directly in a notebook and displaying diagrams directly in a notebook without needing to go to pandastutor.com.

42:44 What about WebAssembly and Pyodide?

42:46 Oh, yeah, we thought about that, too. That's another, like, far-fetched idea we'd like to do. Pyodide right now is just super cool to me. And I'd really like to do it.

42:54 You could turn on the Internet accessibility and stuff again. Right. Because they're only going to hack themselves or exactly their own resources. Right.

43:02 Yeah.

43:03 You know what?

43:03 That's great. That's a great point. I didn't even think about that. But that totally makes sense.

43:07 It's coming along pretty well. And Steve Dower was just telling us that they're starting to do official WebAssembly builds out of CPython, so it's a little more stable.

43:17 That's super cool.

43:18 Some other projects. So I think there's progress in the WebAssembly space. And the Pyodide guys have got some of the data science libraries.

43:26 Yes. They're going to log in to it.

43:27 Yeah, possibly. I don't know what that means in terms of your system that understands and takes it apart. That could be totally tricky, but maybe that's an option.

43:36 Yes, I think it's really viable. I really like to look into it in the months to come.

43:39 Yes, for sure. All right. Well, Sam, I'm afraid that's all the time we got to talk about today. So before, though, we get to move off of this, tell us about the final two questions. If you're going to write some Python code, what editor are you using these days?

43:52 My editor right now is VS Code. I've jumped editors from Notepad++ to Sublime to Vim and Emacs, and now I'm settling into the VS Code world. It's got me.

44:04 Yeah, cool. Right on. And then notable PyPI package, like. Oh, that's really cool.

44:10 Honestly, I'm really vibing with tqdm right now. I always forget how to pronounce it, but I really like that one. It displays a little progress bar for looping, and I just think it's a really nicely done package. Yeah.

44:21 If you just do wrap, just do a decorator and then do a for loop or something like that. I don't remember the exact API but then you just do a for loop and it'll automatically do a progress bar. It's fantastic.
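For reference, the most common way to use tqdm is to wrap the iterable in a for loop with `tqdm(...)` rather than to apply a decorator. A minimal sketch, assuming the tqdm package is installed; the loop body is a placeholder.

```python
from tqdm import tqdm

total = 0

# Wrapping the iterable is enough; a progress bar is drawn as the loop runs.
for n in tqdm(range(1000)):
    total += n
```

The loop behaves exactly as it would without the wrapper, with the progress bar written to the terminal as a side effect.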

44:32 I love it.

44:33 Yeah. Super cool. Alright. Final call to action. People want to visualize their Pandas code. What do you tell them?

44:39 I tell them go to pandastutor.com and put your code in and see what comes out.

44:44 Awesome. And I encourage people, when you go there, try to interact with the diagrams. Right. There's a lot of stuff going on that doesn't look interactive at first, but it is. So yeah, go play around. Sam, thanks for being on the show.

44:55 Thanks, Michael.

44:56 Great to have you here.

44:57 It was really great. It was a lot of fun for me and I really enjoyed talking to you about it.

45:00 Same. Bye. Okay.

45:01 See you.

45:03 This has been another episode of Talk Python to Me. Thank you to our sponsors. Be sure to check out what they're offering. It really helps support the show. Add high-performance multiparty video calls to any app or website with SignalWire. Visit talkpython.fm/SignalWire and mention that you came from Talk Python to Me to get started and grab those free credits. For over a dozen years, the Stack Overflow Podcast has been exploring what it means to be a developer and how the art and practice of software programming is changing the world. Join them on that adventure at talkpython.fm/StackOverflow. Want to level up your Python? We have one of the largest catalogs of Python video courses over at Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async. And best of all, there's not a subscription in sight. Check it out for yourself at 'training.talkpython.fm'. Be sure to subscribe to the show: open your favorite podcast app and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm.

46:09 We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at 'talkpython.com/Youtube'. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code.
