
#90: Data Wrangling with Python Transcript

Recorded on Monday, Nov 28, 2016.

00:00 Do you have a dirty, messy data problem? Whether you work as a software developer or as a data scientist, you've surely run across data that is malformed, incomplete, or maybe even wrong. Don't let messy data wreck your apps or generate wrong results. What should you do? Listen to this episode of Talk Python To Me with Katharine Jarmul about the book she co-authored, Data Wrangling with Python, and her PyCon UK presentation entitled How to Automate Your Data Cleanup with Python. This is Talk Python To Me, recorded November 28, 2016.

00:35 I'm a developer in many senses of the word, because I make these applications, but I also use these verbs to make this music. I constructed it just like when I'm coding another software design. In both cases, it's about design patterns. Anyone can get the job done; it's the execution that matters. I have many interests.

00:55 Welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode has been sponsored by Rollbar and GoCD. Thank them both for supporting the podcast by checking out what they're offering during their segments. Katharine, welcome to Talk Python.

01:24 Thanks.

01:25 I'm really excited to wrangle some data together. That's fun.

01:29 Yeah, it sounds good.

01:30 I've seen some of your talks, and, you know, looked through your book, and you're doing really cool stuff with data: data cleansing, data processing, data pipelines. So I'm really looking forward to getting into those topics and sharing them with everybody. But before we do, let's start at the beginning. What's your story? How did you get into programming and Python?

01:47 Yeah, so I kind of have a varied history with programming. I started doing programming in high school, where I learned C++ as part of my AP Computer Science. And then I went to school, and I was attempting to get my degree in computer science. And although I really loved my math courses, I was pretty secluded in my computer programming courses. And I attribute this to the fact that I wasn't that into gaming. And I was one of about three women out of an incoming freshman class of over 300.

02:24 I totally know where you're coming from. It seems like, with young guys that are into programming, so many of them come from the gaming world; that's why they got into programming, because they're so excited about games. Do you feel like that kind of left you excluded from the social circles a little bit, just because you didn't want to go hang out and play games and eat pizza in the morning?

02:43 Yeah, it definitely was. I mean, this was a bit of a different era; I'm happy to hear it's a bit different now. But this was definitely LAN parties and EverQuest, and I just didn't connect with that very well. So I didn't find a lot of people that wanted to work on projects with me, or, you know, socialize with me outside of class, except maybe about math things rather than about computer things. And for that reason, I kind of gravitated into economics and statistics and political science, and then kind of ended up doing statistics with the soft sciences, if you will.

03:16 Sure. And that's kind of the nudge that sent you down this data science path a little bit, huh?

03:21 Yeah, yeah. I ended up eventually getting into journalism from that, and I ended up at The Washington Post. That's where I met Jacqueline Kazil, my co-author for the book. And I learned Python there, and that was really fun. Oh, nice. What kind of work were you doing there? So The Washington Post was initially, and I don't know if it still is, one of the largest Django installs in the world. We had built up a big app stack where we did the elections, and numerous... all of the data pieces were built on top of Django. We had quite a lot of Python in the background doing data wrangling. So that was kind of my first exposure to data wrangling, if you will, with Python.

04:02 Yeah, that sounds really cool. And there's actually a lot of data science and interesting things going on in this data journalism space, right? That's a pretty big, growing area, isn't it?

04:14 Yeah, it's really neat. I'm still in touch with some folks that are still in journalism, and I meet new people all the time because I'm involved in PyData Berlin. So for those reasons, I'm constantly impressed, seeing new things that journalists are doing with data. And you know, if anything, now is the time to do really good data journalism, in my opinion.

04:35 Yeah, it's definitely a good time. And it's also a time where a lot of it's very challenging, right? I just did a show with Jonathan Morgan from Partially Derivative about the top 10 data science stories of 2016. That's coming out at the end of the year; I'm saving it for the end of the year, because that seems like the time to do a look-back episode. But you know, a lot of the themes are about how data science, and sort of the pollsters and so on, didn't really get things right this year. I think it's both an interesting time and a challenging time. It's pretty difficult with statistics to predict something like human behavior sometimes. So math can only do so much for you when you have some unpredictability, if you will. Yeah, absolutely. All right, well, we have a bunch of stuff to talk about, so we're gonna kind of skip around a little bit. But let's start with your book. Tell us, what's the title? Where did you get the idea to write it in the first place?

05:31 Yeah, so the title is Data Wrangling with Python, and it was, you know, kind of a project of love between Jacqueline Kazil and I. Again, we had worked together initially at the Post, back in 2008. She actually first pitched the book to O'Reilly, and she was working on it, and she decided, you know what, this would be a lot better if I had a co-author. And so she called me up, I was in Berlin, and I said, sure, that sounds great, I would love to help.

06:01 That's really cool. So the idea is, it's really for people who are not super seasoned Python programmers, but they're just getting into this whole data wrangling world. And it sort of introduces them to Python a little bit and hits all the major problems or types of things you want to do getting started, right?

06:21 Yeah, so our initial idea was: this is for beginners, this is for someone that may or may not even know where their command line prompt is, and we're gonna take them through the steps. And this is to make data wrangling more accessible to people that might not have a computer science background. That kind of is a passion of mine, and also of Jackie's, being that we both were involved in PyLadies chapters. And for that, yeah, we want an easy, accessible way to get involved working with Python, working with data. And this was our idea, our product of that.

07:01 Yeah, I feel like, depending on your background, you sympathize more with people who, you know, are going to struggle, right? Like, I learned programming in college, not when I was really young. And so I still remember not getting my C libraries to link correctly, and all the pain of what it is to be a new programmer. So I feel like, you know, it sounds like you have some of those experiences as well, and that probably comes through in the book.

07:26 Yeah, I think, you know, there are a lot of ways that, even just as we get more advanced as developers and programmers, or engineers, whatever you think of yourself as, you start using a lot of jargon, and you start, you know, thinking in these bigger pictures. And sometimes it's good to remind yourself, you know, there's probably an easier way to explain this; there's probably a way to make this slightly more accessible. And I feel like that's a really big, important step towards making beginners feel like they can be a part of the community.

07:56 Yeah, absolutely. I mean, there are certainly times when having high-level conversations that maybe depend on terminology out of design patterns, or some common library that you all know, or something like that, is important, because you want to be efficient and get stuff done as experts. But when you're presenting, or when you're writing books, sometimes this perspective is really, really cool. So some of the things you talked about in there were basically just, you know, how do you get data loaded up? And there were three major areas where this data comes from. One of them was CSV and JSON, basically text files, right?

08:31 Yeah. So this is, particularly from a journalism perspective, and also from just data that I see at companies, a big way that people handle data: yeah, plain text files, if you will.

08:45 Yeah, absolutely. And so how often did you hear from people that were, like, trying to parse CSV files directly, or JSON files directly, rather than using the built-in modules?

08:55 I'm not quite sure. I do think that there are a lot of people that just try to do that directly in the files, or directly in a program like Excel. And so this is kind of taking them out of the pre-programmed programs to actually get started writing code and using Python to do it. Yeah, that's cool. Speaking of Excel, that's got to be the world's biggest database, right?
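
For reference, the built-in route is only a few lines; here is a minimal sketch using the standard library's csv and json modules (the file names are made up):

```python
# Loading text-based data with the standard library; file names are hypothetical.
import csv
import json

with open("survey.csv") as f:
    rows = list(csv.DictReader(f))   # each row becomes a dict keyed by the header

with open("survey.json") as f:
    records = json.load(f)           # the whole file parsed into Python objects
```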

09:20 You know, so much data and data processing actually runs on and into Excel. Yeah, yeah. Until it's like, it just doesn't work anymore? Like, I don't think people leave Excel very willingly. It's like, we have to find a different answer.

09:36 Yeah, and I've found, you know, quite a lot of resistance to learning programming from people that are really adept at SQL. They kind of stay in this databasing world where everything is a SQL query, and every report can be made with a really, really complex 40-line SQL query. And I have a lot of respect for that, but maybe it's a little easier to let a programming language do some of the heavy lifting for you.

10:03 Sure. I mean, SQL is great for declarative stuff, but sometimes you want an imperative type of problem solving, with a little declarative mixed in, or who knows, right? Another area that you said you really liked was PDFs, right?

10:19 Oh, yeah. Yeah, PDFs are amazing. Amazingly painful, right? Yeah, exactly, indeed, a very large pain point. And I still find that there are so many different NGOs and other governmental organizations that release all of their data in a yearly report in PDF form.

10:42 Yeah, and you know, to some degree that's fine if it's meant for reading, but there should be an "and here's the actual JSON version" or something, right? And that's probably what's missing.

10:52 Yeah. And you know, the biggest thing that we encourage folks to do in the book is actually just pick up the phone and call the agency or the NGO or whomever it is, and ask, hey, do you happen to have this in any other format? Because PDFs are really painful. They're not a lot of fun, and you end up doing quite a lot of post-processing. Can you do it, though, from Python? If you've got, like, tables in a PDF, can you get in there and get them out with Python? Yeah, so we ran into this issue with the book, because it was actually quite difficult to get out some of the tabular data. And I came across an old, kind of forgotten library called pdftables. At the time, it wasn't Python 3 compliant, so we needed to do it all in Python 2. But then I went to EuroPython, and I was talking about this problem, and there was a guy there, I forget his name right now, but he converted it to Python 3. So now it's, I believe, compatible with 3.4. And you can just simply use a few commands; the documentation is not very well made, but you can use a few commands and you can parse out actual tabular data, and it does it quite well.
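
As a rough sketch of what that looks like, going by the library's get_tables entry point (the file name here is hypothetical, and, as noted, the API is sparsely documented):

```python
# Pulling tabular data out of a PDF with pdftables; the file name is made up.
from pdftables import get_tables

with open("ngo-annual-report.pdf", "rb") as f:
    tables = get_tables(f)

for table in tables:
    for row in table:
        print(row)   # rows come back as lists of cell strings
```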

12:00 That's really excellent. I think that's an interesting story, and that happens all the time, right? Like, there are so many packages, and this functionality is spread across so many different places, right? I mean, Python is great because you can pip install anything, right? pip install, you know, import antigravity sort of thing. Which is great. But when you're newer, how do you know where to find these things? So I think when you get a chance to share these cool libraries that you found that solve problems, I think that's great. And we'll do some more of that later, actually, from your talk. Yeah, for sure. So that's sort of the data acquisition stuff. And then you talk a little bit about storing and presenting data. You want to talk about that?

12:40 Yeah, so I give people basically an introduction to databasing, and kind of went over with them, okay, here's how you might use relational or non-relational databases, and here's a little bit of the pros and cons of each. It's kind of difficult to make that extremely accessible to beginners, but I feel like it's a good problem to start to introduce them to: that there are quite a lot of different ways to store data, and you should start thinking now about what makes the most sense for your project or for your team. And that's a problem that we face every day, you know, as data people, when we're deciding how to construct something new, or we're deciding, okay, how should we build this workflow? We constantly have to think, okay, do we need it in high-availability storage, so to speak, or do we need it maybe stored away somewhere in a file on somebody's computer?

13:33 Right. Are you going to do aggregation or MapReduce-type stuff? Do you need joins? Like, there's just a lot to consider. And when you're new, of course, that's extra hard, right?

13:44 Yeah, yeah. Whether

13:45 to join, it's like, how

13:47 do you even evaluate whether you need one? It's even something that we as more seasoned people forget sometimes. I mean, sometimes I'm doing some speed comparisons of something, and I just change the format to maybe one that's more preferable to that particular library or tool, and, oh yeah, that sped it up about five times. I don't even have to optimize code, because I chose a different format to read the data from.

14:10 Yeah, it's really amazing. Another area that you talk about is web scraping, and I think web scraping is pretty cool. Web scraping and APIs kind of go together, right? Like, there's so much data out there; you just got to go get it.

14:25 Yeah. And I feel like that is a really empowering first step when you're beginning: I scraped, you know, my favorite website, or whatever it is. I feel like that can be a really empowering first step, and it doesn't take that much Python code. In the book, I use some Scrapy, which is about my favorite library for doing web scraping and web crawling.

14:47 Yeah, Scrapy is cool. I had Pablo on the show a while ago, and his story with Scrapy is really cool.

14:52 Oh, that's awesome. I'll have to check it out. They're a really great team. I had a chance to meet them at a PyCon a while ago, and they're just a really awesome, inspiring, you know, fully remote team.

15:03 So how much do you have to be careful about usage rights or restrictions on websites, if you're looking at data for, like, internal use? You know, this is just something that's coming to mind as I'm thinking about web scraping. Like, if you want to answer a question for your company, how open is the web for you?

15:22 You know, I feel like I go over this quite a lot in the book, and also, you know, whenever I give talks on web scraping: you need to look at the terms of service of the website, you need to look at the robots.txt file, you should be a conscientious scraper. These are important things to think about, clearly. There have been times where I've found APIs that are undocumented, and I've reached out, and sometimes they're like, no, don't use that, we didn't know it was out there. And other times they're like, oh yeah, fine, we don't have documentation for it, but you can go ahead. So I think that's, you know, the best practice. I know quite a few people that don't follow those, but I do think it's really important to be a conscientious scraper. And there are, of course, laws around that, and so you don't want to find yourself on the wrong end of a lawsuit because you were a little bit too lazy to read the terms of service.
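
A small sketch of the conscientious-scraper check, using the standard library's robotparser (the site and user agent here are made up):

```python
# Check robots.txt before fetching anything; site and bot name are hypothetical.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("my-research-bot", "https://example.com/reports/"):
    print("OK to crawl this path")
else:
    print("Disallowed; pick up the phone and ask instead")
```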

16:15 Yeah, absolutely. I feel like those kinds of challenges are really hard, because it's not just knowing the laws of your country, right? I mean, it's all the countries you might be interacting with, many of them, right? And so it's extra hard. And, you know, Europe has different privacy rules than the US, and it's a bit of a wild west.

16:36 Yeah, you know, there's obviously no international regulation of this at any point in time, and I don't even think international regulation would work properly for this. But I do think it's a matter of being kind of the ethical person and saying, okay, am I doing this in an unethical way? And if so, probably trying to avoid it.

17:02 Seems like a good rule to live your life by, really. Especially if you're going to go write an article, right? If you're doing this for, like, a data journalism sort of thing.

17:10 Yeah, clearly. And a lot of times, honestly, if you pick up the phone and you call people, or you write an email, people are more than willing to help share the data with you. And I feel like we're kind of in an era of open data everywhere, open APIs everywhere. And so I think people are very responsive to that.

17:28 Yeah, I totally agree. So the last thing I want to talk about that you cover in your book, before we get to your cleaning data story, is automation and scaling. What do you guys do with regard to that?

17:39 So I think that we just give a little bit of a taste of that. And this is the idea of... Automate the boring stuff? Yes. Yeah, sorry, yeah. And I think that this is a really important skill as a beginner, where it's just like, wow, that's magical, I can just have something run. And it runs either via cron, or it runs as a Celery task, or however it runs, and wow, it just did it on its own.

18:06 I think as a beginner, that's a really exciting moment of just, like, oh, my program's alive. So yeah, it's awesome. If you can take something that's, like, four hours you have to do at the end of the week, that's super painful and repetitive, and make it a button press and remove all the human error, like, that's magical.

18:27 Yeah. And I think it's really, you know, something that people can give back quickly to their team. So if you're just learning on the side, like, okay, I'm gonna learn Python, I want to do this one report I do every week in Excel, I want to do it on my own. And then, yeah, you're right, they can use Flask or Django or something and turn it into a one-click button. They're, you know, the hero for the next year or so.

18:51 Your book seems really interesting, and if you're just getting into Python from this data angle, it's definitely worth checking out. Yeah, thanks. Yeah, you bet. So let's move on to something that you did this fall, right? Like September, right? PyCon UK?

19:07 Yeah, yeah. So my PyCon UK talk was based a little bit off of my initial talk on this topic, which was at PyData Berlin last year.

19:17 Alright, so at PyCon UK 2016, you gave a talk called... cleaning data something. What was the title, exactly? I don't think I have it written down here.

19:26 I believe it was How to Automate Your Data Cleanup with Python.

19:29 I think it was really interesting. You really laid out some of the places that we get data from, and then you gave a bunch of libraries and techniques to fix it. And that's what I found really interesting: I feel like, if any of these problems that we're going to talk about line up with a problem you have, it's like, and here's the solution. Which is super cool.

19:49 Yeah. And even if it doesn't quite fix your problem, looking through the source code, and how people are approaching it, can sometimes give you new ideas about how to fix it for your own particular problems.

20:00 So you kind of set the stage. Let me read a little quote from one of your slides. There was a paper called Towards Reliable Interactive Data Cleaning, and it said something to the effect of: such approaches require a clear definition of dirty data that is independent of downstream analysis. Even worse, one consultant noted that errors might be found only after some result is reported. That's a really big problem, right? Like, even knowing whether you have good data or not, right?

20:31 Yeah, I feel like this is a massive problem, and a problem that we see often. One thing I'm pretty passionate about, that a lot of people don't do, is even just data unit testing. So we have these unit tests for nearly everything, and then we don't test, you know, what happens if we get a really odd piece of data or something that doesn't fit? And it just goes through the pipeline, and then it's reported. This can take, you know, a few analysts down the line before somebody's like, wait, we had negative sales in January? What does that even mean?

21:04 That's right. That's the pit in your stomach that forms when you realize what has happened. Because it's not just, like, a bug in some code where, well, this website renders a little weird or doesn't scale so well. It's like, decisions could be made, or maybe already have been made, based on this data. We decided to buy this company or not, as an acquisition, because of your report. We decided to pursue this product line and not that one, on your report.

21:36 Or whatever, right? You know, killed a product based on your report that was not quite accurate. So yeah. Or, you know, this kind of goes back to what we were initially talking about, about statistics being off about the election. You know, it's a little bit of knowing your inputs and your outputs, and being cognizant that you need to actually test for some of those things, or at least have safety checks along the way.

21:59 Yeah, absolutely. And so you brought up data unit testing, which was one of your recommendations during the talk, and you talked about Hypothesis, right?

22:09 Oh, I'm a big fan of Hypothesis. It's really amazing work by David MacIver, and yeah, I highly recommend, if you haven't looked at it, to take a look at it. It's based off of Haskell's QuickCheck, and it's this idea of generative testing. It's really, really fun.

22:26 Yeah, I'm really impressed with it as well. I didn't know about it until six months ago or something like this, and then I learned, wow, this is really nice. I actually had David on episode 67, if people want to learn more. But the idea is, you're going to do some kind of test, and I guess the terminology we were coming up with was that the type of tests you think of when you write unit tests are example-based, or something like this, where we say: if I put in a five here, and this customer has registered equals true on them, and I call this function, then this thing happens; I'm going to assert something like that. And that's not really how Hypothesis works. It's more like, there's an input that's a number, there's an input that's a customer, and you can let the framework change all the values, and then you just make assertions: inputs like this should lead to outputs like that. It's really cool, I think.

23:22 Yeah, it's the idea of property-based testing. So you're essentially saying, like, okay, this accepts, you know, a list of floats, and it should return a valid float that's greater than two. And when that fails, then you know you have these edge cases. And the fun thing is, sometimes you just come up on the edge cases of whatever tool you're using, or even the edge cases of floats, if you're using Python. And that's fine. But it's good to first understand, okay, what are the possible inputs, and what do I expect the output to be? And this helps determine, in your workflow or in the data science that you're doing, that you actually know what you're seeing coming in, and that you actually know what you should be producing.
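
A minimal sketch of that, assuming a toy cleanup function of our own invention:

```python
# Property-based test: state a property over all inputs, let Hypothesis
# hunt for counterexamples. normalize() is a made-up cleanup step.
from hypothesis import given
from hypothesis import strategies as st

def normalize(values):
    peak = max(abs(v) for v in values)
    return [v / peak for v in values]   # scale so the largest magnitude is 1

@given(st.lists(st.floats(allow_nan=False, allow_infinity=False), min_size=1))
def test_normalize_stays_bounded(values):
    # Hypothesis will quickly find the all-zeros list, where normalize
    # divides by zero -- exactly the kind of edge case discussed above.
    assert all(-1.0 <= v <= 1.0 for v in normalize(values))
```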

24:07 Yeah, absolutely. And it covers both the happy path and the edge cases, where so many of the bugs and errors live. So whether that's bugs in data or bugs in just pure algorithms, right, it's really important. So I definitely second your recommendation of Hypothesis. That's cool. So we talked about having this bad data and not even really knowing. I mean, having your report or your pipeline, or whatever it is you're working on to process this data, run to completion is not really enough, is it?

24:41 The problem is that just running something to completion, like you said, or just assuming that because there's new data in my database, that means everything has processed correctly, is always a false assumption. Sometimes you're correct in that, and other times you didn't have the right checks. And it's impossible to write perfect code, right? We know this as engineers, as data people: there are going to be bugs in our programming. And because of that, we need to have, you know, real eyes on the problem. And this is interesting, this is some of the things that I was looking at when I was doing research for this talk: within the academic field, they're actually determining ways to have the data cleanup process report, hey, I'm not quite sure about this one, because it seems either like an outlier, or the probability that it's correct, based on the algorithm I used, is very low. And then actually taking that data and presenting it to the user again, you know, the next day, or once a week, and having the user actually confirm yes or no, whether the cleanup operated correctly. And I think that this is, you know, an important lesson to learn from where the academics are coming from, and also a way that we can help automation actually be a solution where we have buy-in from everyone, right?

26:04 Yeah, that's really cool. And basically, you take the least trusted data and say, okay, based on the algorithm, this and this I'm really not sure about; the other ones were cool. And you can just show that to the user. Right. You also had some other interesting stuff from academics as well.

26:35 This portion of Talk Python To Me has been brought to you by Rollbar. One of the frustrating things about being a developer is dealing with errors: relying on users to report errors, digging through log files trying to debug issues, or a million alerts just flooding your inbox and ruining your day. With Rollbar's full-stack error monitoring, you get the context, insights, and control that you need to find and fix bugs faster. It's easy to install; you can start tracking production errors and deployments in eight minutes or even less. Rollbar works with all the major languages and frameworks, including the Python ones such as Django, Flask, and Pyramid, as well as Ruby, JavaScript, Node, iOS, and Android. You can integrate Rollbar into your existing workflow, send error alerts to Slack or HipChat, or even automatically create issues in JIRA and Pivotal Tracker, and a whole bunch more. Rollbar has put together a special offer for Talk Python To Me listeners: visit rollbar.com/talkpythontome, sign up, and get the bootstrap plan free for 90 days. That's 300,000 errors tracked, all for free. But hey, just between you and me, I really hope you don't encounter that many errors. Rollbar is loved by developers at awesome companies like Heroku, Twilio, Kayak, Instacart, Zendesk, Twitch, and more. Give Rollbar a try today: go to rollbar.com/talkpythontome.

27:56 One of the parts of that talk, the second half of that talk, I guess, that I thought was really interesting was the sort of catalog of libraries or Python packages that you could use for solving various problems, that rework your data into a way that's going to be much more useful to you. So let's take a moment to kind of go through those and tell people about them, because I think they're really useful. And the first one that you mentioned, for if you have some kind of duplication, is called dedupe.

28:31 Yeah. And that, along with probablepeople and usaddress, they're all from this company called DataMade that works with journalists, really, to tell good data stories. I believe they're based in Chicago, I'm not quite sure. But they have a few good talks on this. And so they have dedupe, which essentially, you know, can go through your tabular data and say, hey, these two rows look quite similar, are they actually the same data? And they use a mixture of fuzzy matching and, I believe, string edit distance to determine, okay, this Bob Dole and Robert Dole are the same person. And that's a really great one. They also have used some of their same techniques for probablepeople, which is their other library, which tries to essentially parse people's names and determine, this is the first name, this is the last name, this is the title; and usaddress, which they use to parse, you know, this is the street number, this is the city. And it's all based on a very, very simple neural network, and you can therefore train it on your own data.

29:46 That's really great. Like, with probablepeople, you can give it a first name, middle name, like a nickname, last name, a junior, senior, or whatever, and it can pull those out and say, well, here's actually the last name, and things like that, right?

30:00 Yeah, so it's really great if you have, you know, survey data or other things that haven't been processed, which is most of the data you'll have when you're dealing with randomly input data. Then, yeah, these are really great ways to kind of cut down on your time.
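
A quick sketch of probablepeople's parse call (the name is made up, and the exact label strings may vary by version):

```python
# probablepeople labels each token of a raw, unprocessed name string.
import probablepeople as pp

print(pp.parse("Robert J. 'Bob' Dole Jr."))
# -> something like [('Robert', 'GivenName'), ('J.', 'MiddleInitial'),
#    ("'Bob'", 'Nickname'), ('Dole', 'Surname'), ('Jr.', 'SuffixGenerational')]
```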

30:18 You also mentioned some for string matching. One was called jellyfish, to do approximate and phonetic matching of strings.

30:27 Yeah. So in this, you know, speech-to-text world, we all know that sometimes things go wrong. So if you have, you know, strings that you potentially need phonetic matches for, then jellyfish is a great tool for that. And it can help you find maybe some of the errors in this, you know, auto-transcription or speech-to-text.
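
For instance, a small sketch of jellyfish's phonetic and approximate matching (the outputs shown are what the algorithms should produce, but treat them as illustrative, and function names have shifted across versions):

```python
import jellyfish

# Phonetic encoding: names that sound alike map to the same code.
print(jellyfish.metaphone("Catherine"))   # 'K0RN'
print(jellyfish.metaphone("Katherine"))   # 'K0RN', same sound, same code

# Approximate matching: distance/similarity between misheard strings.
print(jellyfish.levenshtein_distance("Dole", "Door"))   # 2 edits apart
print(jellyfish.jaro_winkler("wrangling", "wranglin"))  # close to 1.0
```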

30:49 Oh yeah, very cool. What about FuzzyWuzzy?

30:52 FuzzyWuzzy is one of my favorite libraries.

30:54 You gotta love it a little bit just for the name, right?

30:56 Yeah, right. And it allows you basically to, you know, take strings and do Levenshtein distance between them. And this can help you, for example, in a token-sort way, so it doesn't matter the order of the words, just that the words are the same. And they use this a lot because they actually sell game tickets. So whether I say it's the Steelers versus the Patriots, or the Patriots versus the Steelers, it's the same game that I'm likely talking about. So they can do, you know, token-sort matches, and they can also just do partial ratio matches. And it's really useful if you just want a quick installation to work on string distance.
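
That token-sort idea in a tiny sketch:

```python
# FuzzyWuzzy: token_sort_ratio ignores word order; partial_ratio finds substrings.
from fuzzywuzzy import fuzz

a, b = "Steelers vs. Patriots", "Patriots vs. Steelers"
print(fuzz.ratio(a, b))              # middling: plain Levenshtein-based ratio
print(fuzz.token_sort_ratio(a, b))   # 100: same tokens, order ignored
print(fuzz.partial_ratio("New York Mets", "the New York Mets game"))  # 100
```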

31:37 Yeah, that sounds great. The next one, I really like the name; there are a lot of great names in here. The next one is scrubadub.

31:46 Yeah. So scrubadub is great if you've ever had to work with medical data, or maybe even potentially customer data, and you essentially need to anonymize it before you do reporting. scrubadub is going to try to go through, find these personally identifiable pieces of information, and remove them or replace them with, like, UIDs or something like that. Yeah, that's really cool. You know what another area is that comes to mind that might be useful? Developer data for, like, web applications and stuff. You want to take the data that drives your website, say, and put it on the dev machines and on staging machines and whatnot, but maybe it's got, like, e-commerce information in there, and you want that gone, right? Yeah, like maybe, if the developer loses their laptop or it gets broken into, it's like, well, it's not really that important. That would be great. It doesn't happen

32:37 very often, I think.
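
A minimal sketch of scrubadub in action (the sample text is made up, and name detection is heuristic, so treat the output as approximate):

```python
# scrubadub swaps personally identifiable information for placeholders.
import scrubadub

text = "Reach Jane Doe at jane.doe@example.com about order 1234."
print(scrubadub.clean(text))
# -> roughly: "Reach {{NAME}} at {{EMAIL}} about order 1234."
```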

32:40 Yeah. And there's quite a lot of tools that you can use to generate, you know, fake data with that, I believe. I think one of them is called Faker.

32:49 Yeah, Faker. You're definitely thinking of Faker. Faker is cool, and it'll generate all sorts of interesting data. It'll do, like, addresses, right?

32:58 Yeah, it can do addresses, it can do people. And then you can write your own methods, and it will generate those, too. I think it even actually has credit cards built in, and other things like that. Oh, that's excellent. Yeah.
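
A short sketch of Faker, including the locale support that comes up a bit later (outputs are random each run):

```python
# Faker generates realistic-looking fake records; locales change the formats.
from faker import Faker

fake_us = Faker("en_US")
fake_de = Faker("de_DE")
print(fake_us.name(), "|", fake_us.address())
print(fake_de.name(), "|", fake_de.address())   # German postal-code ordering
print(fake_us.credit_card_number())
```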

33:09 Because one of the things that's challenging is, it's harder than it sounds, I think, to generate real-looking fake data, for unit tests, for web design. Like, if I'm going to look at a profile page, if it doesn't have realistic-looking data, maybe my design isn't really going to capture what it should.

33:29 Yeah. And that kind of goes back to what we were talking about with testing your pipeline. If you don't have realistic enough data, then everything can look fine until you throw actual data into it. Yeah, that's for sure.

33:41 Yeah, it's cool because it'll do addresses in different locales. You can say, give me a US address, give me a German address, and of course, like, the zip and postal code come in different orders and stuff like that. Yeah, I was impressed with that one. Another one that made me happy that you talked about was Pint. Because who wouldn't want a pint? Actually, how big is a pint? I mean, how big relative to, say, a liter? Like, I have no idea. My wife is German; she asks me these questions all the time. How many ounces in this thing? I have no idea; ounces don't make any sense. I just have to compute it, right? And so Pint kind of addresses this problem of unit conversions, right?

34:16 Yeah. And it does it in really, really simple terminology. And for that reason, you can do things like multiply meters by centimeters, or convert easily from feet. Yeah, that's a lifesaver when, maybe, you have an international application. We don't really have a good way of internationalizing units built into Python, and so for that reason, yeah, Pint is a great tool.

34:42 Yeah, Pint is really cool. And, like you said, you can compute with the measurements within a particular scheme of measurement, or convert between them. So, for example, you can say, like, 3 * ureg.meter + 4 * ureg.centimeter, and it will compute that; or use 5 * ureg.foot + 9 * ureg.inch, now tell me that in meters, or something like that, right?

35:06 Yeah. And that's just a lot of math that you don't have to do yourself, and that you don't have to check yourself.
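
Putting that example into runnable form:

```python
# Pint: arithmetic across units, then conversion, via a unit registry.
import pint

ureg = pint.UnitRegistry()
print(3 * ureg.meter + 4 * ureg.centimeter)            # 3.04 meter
print((5 * ureg.foot + 9 * ureg.inch).to(ureg.meter))  # about 1.7526 meter
```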

35:13 Sure. And it sounds simple, right? Like, okay, well, there are, you know, like 3.28 feet per meter, so you divide like this. But we've had, like, billion-dollar spacecraft crash straight into the ground just because somebody inverted that multiplication or something, right?

35:29 Yeah, I'm not quite sure I'm going to challenge NASA or SpaceX on that, but sure, of course.

35:37 But definitely, these sorts of problems have vexed really important projects.

35:41 Yeah. And clearly, you know, again, if you have an international site, or if you're dealing with data from numerous places, and you want to make sure that it's correct, it's better to rely on another library that has its own set of unit tests, that has its own set of supporters and contributors, and, you know, rely on the smarts of other people.

36:01 absolutely.

36:01 Another thing that's really cool, that I talked about a little bit before on the show, I think it was on episode 77, is Arrow. And Arrow lets you deal with time in a much cleaner way, right? Arrow is super useful if you don't want to sit there and determine, okay, this date is written in exactly this way. It has auto-recognition, which can really help you if you're seeing datetimes in a few different syntaxes. And this is especially great if you have, like, a distributed system, and maybe half of your machines are set to report dates one way and half of your machines are set another way. You can still distribute the same code, and it will actually function properly most of the time. Yeah, that's really cool. It definitely adjusts for time zones in a much nicer way than the built-in datetimes. It also has a nice humanize bit, so given an Arrow object, you can humanize it and say, well, that was about an hour ago. Yeah, yeah. Which is great, so you don't have to write, you know, all of the math around that, or use, like, five JavaScript libraries to have it written out for your users.
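
A tiny sketch of both features:

```python
# Arrow: parse a datetime string, shift time zones, humanize the delta.
import arrow

stamp = arrow.get("2016-11-28T10:30:00+00:00")
print(stamp.to("Europe/Berlin"))   # timezone conversion without the pain
print(stamp.humanize())            # e.g. 'an hour ago', relative to now
```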

37:11 I've definitely written that code more times than I need to in my life. So another one that you mentioned, just to shout it out again, is pdftables. That's really cool, to go get the tabular data out of PDFs, because who thinks it belongs in there?

37:29 And nobody wants to look at the code underneath the hood there. So again, hat tip.

37:36 Yeah, absolutely. And by the way, all these packages we're talking about, there are links to every one of them in the show notes, so you can just go back and find them that way. Another one was called datacleaner. What's the story with datacleaner?

37:49 The idea behind datacleaner is that it can automatically clean data for you. Now, that sounds a bit too good to be true, and it definitely depends on what you need to have cleaned and how you might need to clean it. But I feel like this is starting to get at the edge of my biggest excitement when researching this topic, which is the ways that academics are using machine learning to automatically clean and automatically edit dirty data, if you will. And I feel like we're definitely on the edge of this. I mean, machine learning is very sexy and has had quite a lot of advances, and it's exciting to see it applied to something that is definitely unsexy, like data cleanup.

38:37 Well, that's what we should use AI for, right?

38:40 Yeah, right. I mean, nobody wants to sit around and do string matching on their own. Really, this is not a thing that any of us have a deep passion for. So the more that we can automate this away, the better our lives will be, and the more time we can spend doing the fun stuff, like actual data analysis.

38:57 Yeah, absolutely. Another one that you talked about was really cool. I thought it was about creating simple domain-specific languages, basically, or parsers that parse simple domain-specific languages, called parserator.

39:12 Yeah, so this is again from the DataMade team, and this is kind of some of the backend that powers their usaddress and their probablepeople. And you can essentially use it to define some different data structures, and, depending on if you have data available that's labeled, you can train it. I think that this is just a really great way that we can stop doing a lot of the difficult work. This is the last time I want to see, you know, a 40-line if-else statement going through and testing, is it this, that, or the other, to, you know, parse a name or to parse an address. We can use machines to help us do this in far less code.
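
As a taste of what the parserator-built parsers give you, here's usaddress on a made-up address (the labels shown are approximate):

```python
# usaddress labels each token of a US address string.
import usaddress

print(usaddress.parse("123 Main St. Suite 100 Chicago, IL 60601"))
# -> something like [('123', 'AddressNumber'), ('Main', 'StreetName'),
#    ('St.', 'StreetNamePostType'), ('Suite', 'OccupancyType'), ...]
```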

40:09 This portion of Talk Python To Me is brought to you by GoCD from ThoughtWorks. GoCD is the on-premise, open source continuous delivery server. With GoCD's comprehensive pipeline modeling, you can model complex workflows for multiple teams with ease, and GoCD's value stream map lets you track changes from commit to deployment at a glance. GoCD's real power is in the visibility it provides over your end-to-end workflow. You get complete control of, and visibility into, your deployments across multiple teams. Say goodbye to release-day panic and hello to consistent, predictable deliveries. Commercial support and enterprise add-ons, including disaster recovery, are available. To learn more about GoCD, visit talkpython.fm/gocd for a free download. That's talkpython.fm/gocd. Check them out; it helps support the show. Another thing that you mentioned that I thought was interesting was DBpedia.

41:15 DBpedia is a knowledge base based on Wikipedia. There's also Yago, which is based off of Wikipedia. And these are essentially, you know, knowledge-base APIs. You have to write SPARQL, which, yeah, takes a bit of getting used to, and for that reason there are several ways that you can write this SPARQL using some Python helpers, including RDFAlchemy, which is a really popular one, and SuRF. But basically, you can essentially get all this data from this Wikipedia-based database, and this can be really essential in terms of cleaning your data, particularly when you get into natural language processing. You want to say, okay, you know, I have William Clinton; what does this mean, who is this person? And you can essentially go query Wikipedia, via DBpedia or via Yago, and it can tell you all these different things: okay, it's in topics human, people; it's in topics US presidents. And you can kind of get a lot of information from this that you wouldn't know just from saving the string and then having to have humans go through and annotate it.
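
One way to issue that kind of query, sketched here with SPARQLWrapper rather than the helpers named above, so treat it as an assumed alternative:

```python
# Ask DBpedia's public SPARQL endpoint what types an entity belongs to.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?type WHERE {
      <http://dbpedia.org/resource/Bill_Clinton> rdf:type ?type .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["type"]["value"])
```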

42:29 Okay. Yeah, that sounds really cool. I mean, to be able to harness Wikipedia in a structured way is definitely more useful than trying to scrape it. And I don't know how big the download of the data is; I know it used to be huge many years ago, so it's probably even huger.

42:45 Yeah, it's pretty massive. And this is a great way that you can just use it like an API. Okay. Yeah. Very cool.

42:51 You know, another thing that you talked about that looks really useful, and it's quite Pythonic in its style, is this thing called Engarde. And you can use it to come up with decorators to apply to your functions to verify stuff, right?

43:07 Yeah. So Engarde is specifically for pandas, and you can essentially decorate functions that, you know, take in data frames, and ensure that they have particular types, or ensure that they're not empty, things like that. And so this is essentially static typing for pandas, if you will. And I think that that's really great. I mean, I know that there are numerous opinions within the Python community about the need for something like static typing or type checking, but when you work with data, these are really important checks. And yeah, maybe you only want it to throw a warning, not an exception, but these are the things that are important to building scripts, building pipelines, that actually work.
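
A minimal sketch of what those decorators look like (the column name and file path are hypothetical):

```python
# Engarde decorators assert properties of the DataFrame a function returns.
import engarde.decorators as ed
import pandas as pd

@ed.none_missing()                        # raise if any NaN slips through
@ed.has_dtypes(items={"sales": float})    # raise if 'sales' isn't numeric
def load_monthly_sales(path):
    df = pd.read_csv(path)
    df["sales"] = pd.to_numeric(df["sales"], errors="coerce")
    return df.dropna(subset=["sales"])
```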

43:51 Yeah, absolutely. Because it's super easy for something that you expected to hold a number to have, quote, that number, unquote, in it as a string. And then things go

44:01 crazy, right? Yeah. And, you know, in pandas there are certain things that are only available for particular dtypes. So, you know, given that everything is NumPy-based in pandas, if you have an incorrect dtype, it could just go through and, like, add all your strings together and give you, you know, a really awesome large number that has absolutely nothing to do with the math that you expected. Absolutely. I expected a

44:23 number, but I got a huge string that's full of numbers back. I don't know why.

44:27 Exactly.

44:30 I think that kind of wraps it up for those solutions: stating the problems and these little tools that you found to solve them. I thought they were really cool, so hopefully people find them interesting and know they exist as well.

44:42 I hope so. And people can check out the talk on this on the PyCon UK YouTube channel.

44:48 Yeah, absolutely. And I'm definitely going to link to it in the show notes as well, so you can find it there on the page. And then, speaking of videos, you actually took this idea of data wrangling and data pipelines and did a couple of cool O'Reilly videos as well, right?

45:04 Yeah. So I have one O'Reilly video that's very focused on pandas. It's an introduction video to pandas, so it's meant for somebody that's trying to learn about pandas, that hasn't done a lot in it yet. And it basically gives you an overview of the things that you can do with pandas, and we play around with some data in Jupyter notebooks and such. So it's a really great introduction if you've been meaning to check out pandas and you haven't. And then I also have a new one coming out on data pipelines, or automation workflows, if you will. And this is going to cover things like Luigi and Airflow, as well as Celery, and kind of talk about, how do you do distributed task processing and DAGs in Python?

45:49 That sounds really cool, especially the second one. You know, I hadn't really heard of Airflow or Luigi, so I suspect many people listening probably haven't either. Maybe tell us what those are?

46:00 Yes, so Luigi is from Spotify. It's been around, I believe, for four or five years now. And Airflow is from Airbnb, and it's actually incubating as an Apache Incubator project right now. They're both really neat. They have a little bit different approaches, but essentially they're a way to build DAGs, which are directed acyclic graphs, in Python, and to push all of your data through those graphs. So essentially, when you think about a problem like MapReduce: we have a beginning, right, where we have all the data collected somewhere; then we map the task; then we shuffle and sort; then we reduce; and finally we have our output files. And when you think about that flow, you can see that it goes through a directed acyclic graph, right? It only moves one way, and it has only a particular set of nodes or edges that it will go through. And for that reason, you know, you can take this idea and apply it to quite a lot of workflow or data pipeline problems. And in the video series, I cover a few of those, and how to parallelize them, and how to work with them, so that you get kind of an introduction to how to build these things and whether they're right for your toolset.
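
To make the DAG idea concrete, here is a toy two-task Luigi pipeline (task and file names are invented):

```python
# A toy Luigi DAG: Clean depends on Extract; Luigi reruns only what's missing.
import luigi

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("raw.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("city,sales\nBerlin,42\n")

class Clean(luigi.Task):
    def requires(self):
        return Extract()                # the directed edge in the graph

    def output(self):
        return luigi.LocalTarget("clean.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read().lower())

if __name__ == "__main__":
    luigi.build([Clean()], local_scheduler=True)
```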

47:17 Yeah, it sounds really cool. I feel like a lot of projects have these data pipelines. And they're just hiding. They're not really brought out in a real clear way, you know?

47:27 Yeah, I mean, there's definitely quite a lot of scripting done in this space instead of actually building a pipeline, right? So you have a script, and it's calling: okay, when this finishes, then call this other function. What happens if it fails? Then, you know, you get an exception in your logs, and you have to go through and say, okay, do I start it again, or where did it actually error? Having these instead as graphs, you can see exactly where they failed. And Luigi and Airflow, as well as Celery, as we know, have ways to go and retry those based on the failure status.

48:03 Yeah, absolutely. Once you understand them in this data pipeline way, distributing them so they run in parallel might be a really cool thing to do. There's a lot of stuff there, right?

48:13 Yeah. And I mean, this is just, again, an introduction, but it's there, you know, to give you an idea of what's available and what people are using at a distributed scale. And then you can determine, okay, is this right for my project? Another thing that I covered in that, which is one of my favorite libraries to play around with, is Dask. And Dask is amazing. Basically, you can have parallelized DAGs directly on your local computer, going through sometimes terabytes of data, just because it has the ability to parallelize it and to do out-of-core memory processing. That sounds really cool. I haven't played with Dask; maybe I need to. Yeah, yeah. And Matthew Rocklin is super smart, so we'd love to have him on the show sometime.
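
A small sketch of that out-of-core style with dask.dataframe (the file pattern and columns are made up):

```python
# Dask chunks a larger-than-memory dataset and computes the DAG in parallel.
import dask.dataframe as dd

df = dd.read_csv("sales-2016-*.csv")         # lazy: builds a task graph
totals = df.groupby("city")["sales"].sum()
print(totals.compute())                       # executes the graph in parallel
```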

49:02 Oh, that'd be great. Absolutely. Those videos sound really interesting, and people can definitely check them out if they're interested. Let's talk about a few of the things you're up to, because you're juggling many things, right?

49:13 I like to keep busy. That's good.

49:17 So you were involved in the first PyLadies chapter, in LA, right?

49:21 I was lucky enough to be part of the original seven ladies that formed PyLadies in Los Angeles. And that was a really, really neat experience, and very near and dear to my heart.

49:35 Yeah, that's really great. You must be really proud to see how far it's come. Yeah, yeah. It's

49:39 kind of amazing to, you know, be at a conference somewhere in Europe, and people will be like, oh yeah, I run PyLadies Prague, I run PyLadies Moscow, I run PyLadies wherever. That's just so inspiring, and I feel really, really great that the community has kind of just embraced it and run with it. I feel like that's really inspiring for young women that are getting into programming.

50:05 Yeah, I think it's really great to have that support. And just that whole structure, I think, is really important, and I think it makes a big difference in the Python community. As I look to other programming communities that don't have these things, they're definitely less well off for it.

50:21 Yeah, I feel like there are a lot of communities that are now looking at Python as a really great example of how you have a diverse community, and how you have a supportive and open community. And it just feels really amazing to be part of a community that other languages are trying to replicate, if you will.

50:39 Yeah, I'm sure. It definitely feels like a better place to be, in my opinion. I love the Python community, and that's one of the reasons. So another thing, speaking of community, that you do is you help organize PyData in Berlin, right?

50:54 Yeah. So I got asked to be a part of that at the conference last year, and it's a great group of folks. We're constantly organizing the monthly meetups, sometimes we have hackathons, and we've already started organizing the conference for next year, which will probably be the first weekend of July. So if you want to come to Europe, you should do so the first weekend of July and come visit PyData Berlin.

51:20 Yeah, I definitely recommend that. And maybe I'll make it, who knows? That would be wonderful. Yeah, yeah.

51:25 We're gonna have a call for speakers pretty soon here.

51:28 Okay, excellent. So, I hear this a lot, and I know it means that you're really busy, but what does it mean to organize a conference, a Python conference?

51:36 Yeah. So I mean, right now it means quite a lot of emailing and telephoning with folks to find a good venue. We have a few keynote speakers that we're speaking with. And then we work, of course, with NumFOCUS, which is the overarching nonprofit organization based in the States, and they're the ones that are really kind of running PyData behind the scenes, and really helping organizers like us with the tools that we need to set up a good conference.

52:06 Yeah, that's really cool. One of the things I think is great about Python is there's not just one big PyData conference somewhere in the world that you get to go to; they're in many places, right? And I think that makes it more accessible as well.

52:19 Yeah. And the support is really there, kind of with the same idea of, you know, having PyLadies chapters all over. So the support is there within NumFOCUS to really say, okay, you know, we have your back, and we're going to help you figure out how to get good speakers, we're going to help you figure out a good venue. And having support like that is just, I think, tremendous for being able to make Python accessible, and also make it amazing and great. And I think this is why it has been able to grow so much within the data and scientific communities. That's an interesting thing, and maybe we can talk about it a little bit: just a few weeks ago, just this year, somebody measured and realized that Python became the second most popular language on GitHub. It's second only to JavaScript; it displaced Java. Cheers for that, right?

53:12 The current most popular one is JavaScript, and I feel like that's probably highly over-counted, because JavaScript appears in Python web apps, Ruby web apps, ASP.NET apps; like, you know, everything that is up there has JavaScript, right? So I think it's over-counting it. People are asking, why is Python suddenly so much more popular, when it's been around for 25 years? What are your thoughts on that?

53:37 I have my own thoughts, but I definitely come at it from a data perspective, so I know I'm looking at it through my own Python-colored glasses. But I do think that the embrace of the data and scientific communities around Python has been massive. And I feel like it definitely came from second place, possibly second place to Java, and now it has really been embraced within machine learning, within artificial intelligence, within the chatbot movement, and within data analysis. And I think that that's a really strong thing, because those fields are obviously growing right now, and it's really powerful to have Python be kind of the up-and-coming language within those communities. Even for something like, okay, I do Apache Spark and I only write Scala: the second in line is Python, with the PySpark library. And so for these reasons, I feel like we are no longer the weird kid on the block.

54:40 Yeah, I tend to agree with that analysis. I think that's a really large part of it. I think there's more to it, but if you had to pick one thing, I think that might be the one thing that's making the biggest difference. You know, for example, cloud computing, right? Google App Engine and some of the other things made running Python apps at scale much easier than it had been before; I think that has something to do with it. But the data story is probably number one, sort of because it's a new area that's really growing fast.

55:09 And it's just really wonderful when you hear about frameworks being released, particularly around new machine learning or deep learning, that almost always Python is supported from the start. And that's just really powerful: even if the backend is C++ or Java, they understand and know that the Python community will kind of pick it up and run with it. And so they want to make sure that, from the very beginning, you know, TensorFlow runs with Python, and things like that. Yeah, absolutely. That's pretty cool.

55:37 So another thing, as if you're not busy enough with all this stuff, that you're up to is running a data consulting company in Berlin. What's the name of it? What do you guys do?

55:45 Yeah, so I guess auf Deutsch it's kjamistan, and that's from a fond nickname from several lifetimes ago, job-wise. And basically, we do data consulting. It's primarily me; sometimes I get a chance to hire some folks to help out, some friends and other people I respect within the data community. I kind of specialize in doing a mixture of natural language processing work and data analysis work. So I usually take on clients and get to work alongside teams a lot of the time, or build proofs of concept. And that's really, really fun work.

56:28 Yeah. Are there a lot of companies that maybe are not software companies per se, their thing is some other specialty, but they've got a software team in house, and not a data science team? And is it pretty common that they bring somebody like you or your company in to mix in a little data science or natural language processing for a particular project?

56:50 Yeah. So sometimes it's things like that. Other times, it's just a new startup, and they need some data analyzed so that they can take it to their investors, things like that. Other times, maybe they have a data science team, but the data science team is very busy doing one thing, and they want somebody that knows Python, whose code they can trust, to build out kind of a new proof of concept or a new idea. And then they decide, okay, is this useful? Are people in the company happy with it? If so, then they usually will take it over themselves, or potentially hire people to help manage it.

57:27 Yeah, sure, that makes sense. All right, very cool. That sounds like a fun job. Yeah, it's really fun. Very cool. All right, Katharine, I think we're running low on time, so we probably need to wrap up our topics there. Let me ask you the two questions I always ask my guests at the end. First: we've already enumerated quite the list, but you know, there are over 90,000 packages on PyPI. Which one would you recommend to people that maybe they don't know about or haven't heard about? Well, we

57:54 didn't get to talk much about NLP, but I really, really love Gensim and spaCy. Both of them are really changing and pushing the space of kind of where academia is with natural language processing, and making it available for us mere mortals. So I really recommend, if you want to take a look at how to use neural networks with natural language processing, that you check out both Gensim and spaCy.
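
A tiny taste of the two together, spaCy for tokenization feeding Gensim for topic modeling (the "en" model name follows spaCy's convention at the time of recording; the texts are made up):

```python
import spacy
from gensim import corpora, models

nlp = spacy.load("en")
texts = ["Messy data wrecks apps.", "Clean data saves analysts hours."]
tokens = [[t.lemma_ for t in nlp(doc) if t.is_alpha] for doc in texts]

# Bag-of-words corpus, then a two-topic LDA model over it.
dictionary = corpora.Dictionary(tokens)
bow = [dictionary.doc2bow(doc) for doc in tokens]
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary)
print(lda.print_topics())
```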

58:20 Okay, those sound really, really excellent. And I have not tried either. So sounds fun. And when you write some Python code, what editor do you open up?

58:28 I use Vim. So I'm kind of old school.

58:32 You know, I think that might be the single most popular editor of all the guests. There's a wide variety of editors, but that one might have the highest histogram bar, then. Right on. Okay, well, that's really cool. And a final call to action before we say goodbye to everyone?

58:52 Keep in touch, @kjam on Twitter, and you can find my company and blog posts at kjamistan.com. And if you're interested in speaking at PyData Berlin, feel free to reach out. Yeah, that sounds great.

59:05 That sounds like such a lovely opportunity, to go to Berlin, meet all the people, and present there. So I definitely want to second that. All right, well, I've learned a lot about data cleaning and picked up a bunch of cool tools along the way. So thank you for your time, Katharine. That was great.

59:20 Thanks so much, Michael. Thanks for having me. You bet. Talk to you later.

59:26 This has been another episode of Talk Python To Me. Today's guest was Katharine Jarmul, and this episode has been sponsored by Rollbar and GoCD. Thank you both for supporting the show. Rollbar takes the pain out of errors. They give you the context and insight you need to quickly locate and fix errors that might have gone unnoticed, until your users complain, of course. As Talk Python To Me listeners, track a ridiculous number of errors for free at rollbar.com/talkpythontome. GoCD is the on-premise, open source continuous delivery server. Improve your deployment workflow, but keep your code and builds in house. Check out GoCD at talkpython.fm/gocd and take control over your process. Are you or a colleague trying to learn Python? Have you tried books and videos that just left you bored by covering topics point by point? Well, check out my online course, Python Jumpstart by Building 10 Apps, at talkpython.fm/course, to experience a more engaging way to learn Python. And if you're looking for something a little more advanced, try my Write Pythonic Code course at talkpython.fm/pythonic. Be sure to subscribe to the show: open your favorite podcatcher and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. Our theme music is Developers Developers Developers by Cory Smith, who goes by Smixx. Cory just recently started selling his tracks on iTunes, so I recommend you check it out at talkpython.fm/music, where you can browse the tracks he has for sale on iTunes and listen to the full-length version of the theme song. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Smixx, let's get out of here.

