Data Wrangling with Python

Episode #90, published Wed, Dec 21, 2016, recorded Mon, Nov 28, 2016


Do you have a dirty, messy data problem? Whether you work as a software developer or as a data scientist, you've surely run across data that was malformed, incomplete, or maybe even wrong. Don't let messy data wreck your apps or generate wrong results.

What should you do? Listen to this episode of Talk Python To Me with Katharine Jarmul about the book she co-authored called Data Wrangling with Python and her PyCon UK presentation entitled How to Automate your Data Cleanup with Python.

Links from the show:

Katharine on the web: kjamistan.com
Katharine on twitter: @kjam
Book: Data Wrangling with Python: Tips and Tools to Make Your Life Easier: amzn.to/2fGc0Cx
PyCon UK 2016: How to Automate your Data Cleanup with Python: youtube.com/watch?v=gp-ngPV_ZX8

Packages from Data Cleanup talk
Dedupe Python Library: github.com/datamade/dedupe
probablepeople: github.com/datamade/probablepeople
usaddress: github.com/datamade/usaddress
jellyfish: github.com/jamesturk/jellyfish
Fuzzywuzzy: github.com/seatgeek/fuzzywuzzy
scrubadub: github.com/datascopeanalytics/scrubadub
pint: pint.readthedocs.io
arrow: github.com/crsmithdev/arrow
pdftables.six: github.com/vnaydionov/pdftables
Datacleaner: github.com/rhiever/datacleaner
Parserator: github.com/datamade/parserator
Gensim: radimrehurek.com/gensim
Faker: github.com/joke2k/faker
Dask: dask.pydata.org
SpaCy: spacy.io
Airflow: airflow.incubator.apache.org
Luigi: luigi.readthedocs.io
Hypothesis (testing): hypothesis.works

Katharine's courses

Data Pipelines with Python
shop.oreilly.com/product/0636920055334.do
Data Wrangling & Analysis with Python. Learn Pandas
shop.oreilly.com/product/0636920051831.do

Sponsors
Rollbar: rollbar.com/talkpythontome
GoCD: go.cd

Episode Deep Dive

Guest introduction and background

Katharine Jarmul is a data scientist, educator, and author based in Berlin. She got her start in programming in high school with C++ and later shifted toward economics, statistics, and political science in college. Her path led her into data journalism at the Washington Post, where she discovered Python and its potential for data wrangling and data analysis. Katharine co-authored the book Data Wrangling with Python and is deeply involved in the Python community, including PyLadies and the PyData Berlin meetup. Beyond writing and speaking at conferences, Katharine runs a data consulting company in Berlin focusing on natural language processing and data analytics.

What to Know If You're New to Python

If you're just starting out with Python and want to dive into data wrangling and cleaning, here are a few pointers so you can get the most out of this episode:

Understand basic file I/O (reading text files, CSV, JSON) as Python often handles data in these formats.
Familiarity with Python’s built-in modules (csv, json) and some third-party ones (like pandas) will help.
Knowing a little about how Python’s ecosystem manages packages (via pip) can help you quickly install libraries such as scrubadub or fuzzywuzzy.
Stay aware that data cleaning often extends beyond pure Python code to the discipline of verifying and testing data itself.
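
To make the first two pointers concrete, here is a minimal sketch that reads CSV and JSON with the built-in csv and json modules. The sample data and the helper names (load_csv_rows, load_json_rows) are made up for illustration; with real files you would pass an open file object instead of io.StringIO.

```python
import csv
import io
import json

# Hypothetical sample data; in practice these would come from files on disk.
CSV_TEXT = 'name,city\n"Smith, Jane",Berlin\nBob,"Washington, DC"\n'
JSON_TEXT = '[{"name": "Ada", "city": "London"}]'


def load_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json_rows(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


rows = load_csv_rows(CSV_TEXT) + load_json_rows(JSON_TEXT)
for row in rows:
    print(row["name"], "-", row["city"])
```

Note that the csv module copes with the quoted commas in "Smith, Jane" and "Washington, DC", which is exactly where splitting each line on "," by hand tends to go wrong.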

Key points and takeaways

1. Data Wrangling with Python (the Book and Motivation)

Katharine co-authored Data Wrangling with Python with Jackie Kazil to help beginners (including those who might not even know where their command prompt is) get comfortable cleaning and structuring data with Python. The book guides readers through the fundamentals of reading data from diverse sources like CSV and JSON files, parsing information from PDFs, and using Python to automate cleanup processes. Katharine’s personal experience in both journalism and data science gave her a first-hand view of how messy data can be.

Tools / Links
- Data Wrangling with Python (O’Reilly): amzn.to/2fGc0Cx

2. Importance of Handling Messy Data

Dirty or malformed data leads to wrong results and wrong decisions. Katharine emphasized that a pipeline running to completion does not mean the data is correct. Real-world data might be incomplete, have strange encodings, or be full of duplicates that can derail analyses or applications. Ensuring data is standardized, validated, and cleaned is crucial.

Key Tools and Concepts
- Validation libraries (e.g., engarde)
- Data unit testing with hypothesis
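
As a rough illustration of what "validated" can mean in practice, here is a minimal standard-library sketch that flags incomplete rows and duplicates. The field names (id, email) and the find_problems helper are hypothetical and do not come from the episode or from any of the libraries mentioned.

```python
def find_problems(rows, required=("id", "email")):
    """Return a list of (row_index, message) pairs for rows that look wrong."""
    problems = []
    seen = set()
    for index, row in enumerate(rows):
        # Incomplete data: a required field is absent or blank.
        for field in required:
            if not str(row.get(field) or "").strip():
                problems.append((index, f"missing {field}"))
        # Duplicates: compare on a normalized key so " ANA@example.org "
        # is treated the same as "ana@example.org".
        key = str(row.get("email") or "").strip().lower()
        if key:
            if key in seen:
                problems.append((index, f"duplicate email {key}"))
            seen.add(key)
    return problems


records = [
    {"id": "1", "email": "ana@example.org"},
    {"id": "2", "email": " ANA@example.org "},
    {"id": "", "email": "li@example.org"},
]
for index, message in find_problems(records):
    print(f"row {index}: {message}")
```

The point of returning a list of problems rather than raising on the first one is that a person can review everything that looks off in one pass.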

3. Data Unit Testing for Reliability

Instead of waiting until a report is generated to notice something went wrong, data unit testing helps catch anomalies early. Katharine highlighted the Hypothesis library, inspired by Haskell’s QuickCheck, which auto-generates input examples to test assumptions in your code. This method uncovers edge cases you might never think of (e.g., extreme values or unexpected types).

Tools / Links
- Hypothesis (GitHub)
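
To show the idea without requiring any installs, here is a standard-library stand-in for property-based testing. It is not Hypothesis itself: Hypothesis provides far richer input generation and shrinks failing inputs to minimal examples. The normalize_name function and the two properties are hypothetical examples.

```python
import random
import string


def normalize_name(text):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())


def random_text(rng, max_len=20):
    """Generate a short string biased toward awkward whitespace."""
    alphabet = string.ascii_letters + " \t\n"
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))


def check_property(prop, trials=500, seed=0):
    """Run prop on many generated inputs; return the first failing input or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = random_text(rng)
        if not prop(value):
            return value
    return None


def is_idempotent(text):
    """Cleaning already-clean text should change nothing."""
    once = normalize_name(text)
    return normalize_name(once) == once


def has_no_double_spaces(text):
    """Cleaned text should never contain two spaces in a row."""
    return "  " not in normalize_name(text)


print(check_property(is_idempotent))
print(check_property(has_no_double_spaces))
```

Instead of hand-picking a few inputs, you state what must always be true of the output and let generated data look for a counterexample.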

4. Handling PDFs and Tabular Data Extraction

Extracting structured content from PDFs is a notoriously painful step in many data workflows. Katharine mentioned pdftables, an older library that was not Python 3 compatible when the book was written and was later ported to Python 3 by a developer she met at EuroPython, as one way to parse tabular data in PDF documents. If you must rely on PDF datasets (e.g., government or NGO reports), libraries like this can save countless hours.

Tools / Links
- pdftables / pdftables.six: github.com/vnaydionov/pdftables

5. String and Duplicate Cleanup

Katharine recommends several libraries for fuzzy matching, duplicate detection, and cleaning personally identifiable information:

dedupe helps find likely duplicates using fuzzy string matching.
probablepeople and usaddress (both from DataMade) parse out structured data like names or addresses.
fuzzywuzzy helps with approximate string matching (e.g., "Steelers vs. Patriots" vs. "Patriots vs. Steelers").
scrubadub removes personal data, such as phone numbers or email addresses, to anonymize datasets.
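
The "Steelers vs. Patriots" case can be sketched with the standard library's difflib. This is a stand-in to show the technique, not fuzzywuzzy's implementation (fuzzywuzzy exposes a scorer with the same name, fuzz.token_sort_ratio); the ratio and token_sort_ratio helpers below are written for this example only.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity between 0 and 100 based on matching character blocks."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a, b):
    """Compare after lowercasing, stripping punctuation, and sorting the words."""
    def normalize(text):
        words = "".join(ch if ch.isalnum() else " " for ch in text.lower()).split()
        return " ".join(sorted(words))
    return ratio(normalize(a), normalize(b))


left = "Steelers vs. Patriots"
right = "Patriots vs. Steelers"
print(ratio(left, right))             # plain comparison is hurt by word order
print(token_sort_ratio(left, right))  # word order no longer matters
```

Sorting the tokens first is a design choice: it makes the comparison insensitive to word order, which suits match-ups and names but would be wrong for text where order carries meaning.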

6. Automating Cleanup Processes

Automation is vital when dealing with recurring data feeds or large volumes of information. Katharine encouraged using scheduling tools such as cron, or asynchronous task queues such as Celery, to rerun data cleaning tasks without manual intervention. This approach is especially powerful when combined with unit tests or warnings for suspicious data (e.g., negative sales).

Tools / Links
- Celery
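
Here is a minimal sketch of a cleanup job that could run unattended and still flag suspicious values. The column names (store, amount), the clean_sales function, and the script path in the cron comment are all placeholders, not details from the episode.

```python
import csv
import io
import logging

logging.basicConfig(level=logging.WARNING, format="%(levelname)s %(message)s")
log = logging.getLogger("cleanup")


def clean_sales(csv_text):
    """Return cleaned rows; warn about values that are unreadable or suspicious."""
    cleaned = []
    # Data starts on line 2 because line 1 is the header row.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            log.warning("line %d: unreadable amount %r, skipped",
                        line_no, row.get("amount"))
            continue
        if amount < 0:
            # Keep the row, but make sure a human hears about it.
            log.warning("line %d: negative sale %.2f for %s",
                        line_no, amount, row["store"])
        cleaned.append({"store": row["store"].strip(), "amount": amount})
    return cleaned


if __name__ == "__main__":
    # A cron entry like this one (path is a placeholder) would run it nightly:
    #   0 2 * * * /usr/bin/python3 /path/to/clean_sales.py
    sample = "store,amount\nBerlin ,120.50\nLA,-40\nDC,n/a\n"
    print(clean_sales(sample))
```

The job finishes either way; the warnings are what keep "it ran to completion" from being mistaken for "the data is correct".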

7. Creating Data Pipelines with DAG Tools

For more complex data flows, Katharine’s upcoming O’Reilly video dives into tools like Luigi (by Spotify) and Apache Airflow. They help structure tasks into Directed Acyclic Graphs (DAGs), making your process transparent, modular, and easily parallelizable. If one step fails, you can restart that node in the DAG without rerunning the entire pipeline.

Tools / Links
- Airflow (Apache)
- Luigi (GitHub)
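
The DAG idea itself can be sketched with the standard library's graphlib (Python 3.9+). This is not how Luigi or Airflow are used; it only illustrates dependency ordering and resuming after a failure. The task names and the run helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks that must finish before it starts.
PIPELINE = {
    "extract_pdf": set(),
    "extract_csv": set(),
    "clean": {"extract_pdf", "extract_csv"},
    "dedupe": {"clean"},
    "report": {"dedupe"},
}


def run(pipeline, actions, already_done=()):
    """Run tasks in dependency order, skipping ones that already succeeded."""
    done = set(already_done)
    for task in TopologicalSorter(pipeline).static_order():
        if task in done:
            continue
        actions[task]()  # an exception here stops the run before downstream tasks
        done.add(task)
    return done


log = []
actions = {name: (lambda name=name: log.append(name)) for name in PIPELINE}

# Pretend both extract steps succeeded on an earlier run; only the rest reruns.
run(PIPELINE, actions, already_done={"extract_pdf", "extract_csv"})
print(log)
```

Because the graph has no cycles, there is always a valid order to run the tasks in, and recording which tasks finished is enough to resume from the point of failure.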

8. Web Scraping: Ethical and Practical Considerations

Katharine reminded listeners about “conscientious scraping”: checking robots.txt, reading the terms of service, and verifying whether the site already has an API. Tools like Scrapy can automate the process of collecting data from websites, but developers must respect site owners’ wishes and not overwhelm servers or breach user privacy.

Tools / Links
- Scrapy
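
Checking robots.txt before fetching a page can be done with the standard library's urllib.robotparser. The sketch below parses a made-up robots.txt from a string so it runs without network access; the user agent name and URLs are placeholders. It covers only robots.txt, so reading the terms of service is still up to you.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real scraper would load it from the site
# with parser.set_url("https://example.org/robots.txt") followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "example-research-bot"  # placeholder user agent


def may_fetch(url):
    """True when the parsed robots.txt allows AGENT to request url."""
    return parser.can_fetch(AGENT, url)


print(may_fetch("https://example.org/reports/2016.html"))
print(may_fetch("https://example.org/private/users.html"))
```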

9. Simple Solutions for Conversions and Time Data

Sometimes data cleaning involves smaller but essential steps like unit conversion or handling date-times:

pint for units (meters, feet, liters, etc.).
arrow for friendlier date and time handling, including human-friendly “3 hours ago” style strings.

Both help ensure you’re not dealing with a crash-landing spacecraft scenario because of a unit mismatch.
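
To illustrate the two ideas without extra installs, here is a small standard-library sketch of explicit unit conversion and "time ago" strings. It is a stand-in for what pint and arrow do far more completely; the conversion table and both helper functions are written for this example only.

```python
from datetime import datetime, timedelta, timezone

# Meters per unit; a tiny hand-written table for illustration.
METERS_PER = {"m": 1.0, "km": 1000.0, "ft": 0.3048, "mi": 1609.344}


def convert_length(value, from_unit, to_unit):
    """Convert a length between units, refusing units the table does not know."""
    try:
        return value * METERS_PER[from_unit] / METERS_PER[to_unit]
    except KeyError as exc:
        raise ValueError(f"unknown length unit: {exc.args[0]}") from None


def humanize(when, now):
    """Describe how long before `now` the moment `when` was."""
    seconds = int((now - when).total_seconds())
    for size, label in ((86400, "day"), (3600, "hour"), (60, "minute")):
        if seconds >= size:
            count = seconds // size
            return f"{count} {label}{'s' if count != 1 else ''} ago"
    return "just now"


now = datetime(2016, 11, 28, 12, 0, tzinfo=timezone.utc)
print(convert_length(100, "ft", "m"))
print(humanize(now - timedelta(hours=3), now))
```

Carrying the unit alongside every number, and failing loudly on an unknown unit, is the habit that prevents silent feet-versus-meters mistakes.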

10. Community Involvement: PyLadies & PyData

Community events like PyLadies and PyData were highlighted as vital resources for aspiring developers. Katharine was part of the original PyLadies LA and helps organize PyData Berlin. Involvement in meetups and conferences can connect beginners and experts alike, fostering the open culture that Python is well known for.

Links
- PyLadies
- PyData

11. Data Journalism Roots and Real-World Insights

Katharine’s background in data journalism at the Washington Post gave her a first-hand look at how quickly messy data can derail a story. It underscores the real-world impact of data wrangling on accurate reporting and timely insights. Her advice? Sometimes the best data tool is simply picking up the phone to request a more appropriate or raw data feed rather than scraping PDFs or websites.

Interesting quotes and stories

"I started doing programming in high school where I learned C++, and then in college I drifted into economics and statistics because I didn’t click with the gaming culture in my CS classes." -- Katharine Jarmul

"There’s probably an easier way to explain this. There’s probably a way to make this slightly more accessible." -- Katharine Jarmul

"When you work with data, you need real eyes on the problem. It’s impossible to write perfect code or just trust your pipeline if it runs to completion." -- Katharine Jarmul

"Sometimes, the best approach is to literally call up the NGO or government agency and ask if they have the data in another format rather than PDFs." -- Katharine Jarmul

"Data cleanup is unglamorous, but if we can automate it, we can spend more time on real analysis or on the creative side of data science." -- Katharine Jarmul

Key definitions and terms

Data Wrangling: The process of cleaning, structuring, and enriching raw data into a more refined and usable format for analysis or application development.
Fuzzy Matching: A string comparison technique to find approximate matches rather than exact matches (e.g., "Steelers vs Patriots" vs "Steelers vs. the Patriots").
DAG (Directed Acyclic Graph): A graph whose edges have a direction and never form a cycle. Workflow tools like Luigi and Airflow use it to model task dependencies, so each task runs only after the tasks it depends on have finished.
Property-Based Testing: An approach (as seen in Hypothesis) that states properties a function should always satisfy and checks them against many automatically generated inputs, uncovering edge cases you might not think to write by hand.
Conscientious Scraping: The concept of respecting robots.txt, reading terms of service, and being mindful of server load and data ownership when collecting data from websites.

Learning resources

For those who want to dive deeper into Python-based data analysis and beyond, here are a few recommended courses:

Python for Absolute Beginners: If you're brand new to Python or programming in general, this course covers core language fundamentals at a beginner’s pace.
Move from Excel to Python with Pandas: Ideal if you’re comfortable with spreadsheets and want to streamline and scale your data analysis in Python.
Getting Started with NLP and spaCy: Katharine often works with NLP. If you want to go beyond data cleaning and start extracting meaning from text, spaCy is a powerful place to begin.

Overall takeaway

Data can become messy in many ways (incorrect formats, duplicates, partial fields, or entire datasets stuck in PDFs), so data cleaning is an indispensable skill. Katharine Jarmul’s examples show that Python offers a diverse ecosystem of libraries and best practices to tackle these issues more reliably. By testing your data rigorously, automating cleanup pipelines, and leveraging specialized tools, you can turn messy data into trustworthy insights. Ultimately, making data wrangling more efficient frees you to focus on discovery, analysis, and innovation rather than wrestling with endless manual fixes.

Episode Transcript

00:00 Do you have a dirty, messy data problem?

00:01 Whether you work as a software developer or as a data scientist, you've surely run across data that is malformed, incomplete, or maybe even wrong.

00:09 Don't let messy data wreck your apps or generate wrong results.

00:12 What should you do?

00:13 Listen to this episode of Talk Python To Me with Katharine Jarmul about the book she co-authored called Data Wrangling with Python

00:19 and her PyCon UK presentation entitled How to Automate Your Data Cleanup with Python.

00:25 This is Talk Python To Me, recorded November 28, 2016.

00:29 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.

01:00 This is your host, Michael Kennedy.

01:03 Follow me on Twitter where I'm @mkennedy.

01:05 Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython.

01:11 This episode has been sponsored by Rollbar and GoCD.

01:15 Thank them both for supporting the podcast by checking out what they're offering during their segments.

01:21 Katharine, welcome to Talk Python.

01:24 Thanks.

01:25 I'm really excited to wrangle some data together.

01:28 It's going to be fun.

01:28 Yeah, it sounds good.

01:30 I've seen some of your talks and, you know, looked through your book, and you're doing really cool stuff with data, data cleansing, data processing, data pipelines.

01:38 And so I'm really looking forward to getting into those topics and sharing them with everybody.

01:43 But before we do, let's start at the beginning.

01:45 What's your story?

01:46 How did you get into programming and Python?

01:48 Yeah, so I kind of have a varied history with programming.

01:51 I started doing programming in high school where I learned C++ as part of my AP computer science.

01:58 And then I went to school and I was attempting to get my degree in computer science.

02:05 And although I really loved my maths courses, I was pretty secluded in my computer programming courses.

02:13 And I attribute this to the fact that I wasn't that into gaming.

02:16 And I was one of about three women out of an incoming freshman class of over 300.

02:24 I totally know where you're coming from.

02:26 It seems like young guys that are into programming, so many of them come from the gaming world.

02:31 That's why they got into programming is because they're so excited about games.

02:34 Do you feel like that kind of left you excluded from the social circles a little bit

02:39 just because you didn't want to go hang out and play games and eat pizza at two in the morning?

02:43 Yeah, it was definitely.

02:45 I mean, this was a bit of a different era.

02:47 I'm happy to hear it's a bit different now, but this was definitely like LAN parties and EverQuest.

02:53 And I just didn't connect with that very well.

02:56 So I didn't find a lot of people that wanted to work on projects with me or, you know, socialize

03:02 with me outside of class, maybe about math things rather than about computer things.

03:06 And for that reason, I kind of gravitated into economics and statistics and political science.

03:12 And I kind of ended up doing statistics with soft sciences, if you will.

03:16 Sure.

03:17 And that's kind of the nudge that sent you down this data science path a little bit, huh?

03:21 Yeah, yeah.

03:22 I ended up eventually getting into journalism from that.

03:25 And I ended up at the Washington Post, and that's where I met Jackie Kazil, my co-author

03:30 for the book.

03:31 And I learned Python there.

03:33 And that was really fun.

03:34 Oh, nice.

03:35 What kind of work were you doing there?

03:36 So the Washington Post was initially, and I don't know if it still is, one of the largest

03:41 Django installs in the world.

03:43 And we had built up a big app stack where we did the elections and numerous, all of the

03:52 data pieces were built on top of Django with quite a lot of Python in the background doing

03:57 the data wrangling.

03:58 So that was kind of my first exposure to data wrangling, if you will, with Python.

04:02 Yeah, that sounds really cool.

04:04 And there's actually a lot of data science and interesting things going on in this data

04:10 journalism space, right?

04:12 That's a pretty big growing area, isn't it?

04:13 Yeah, it's really neat to...

04:15 I'm still in touch with some folks that are still in journalism.

04:18 And I meet new people all the time because I'm involved in PyData Berlin.

04:23 And so for those reasons, I'm constantly impressed and seeing new things that journalists are doing

04:29 with data.

04:29 And if there was ever a time, now is the time to do really good data journalism, in my opinion.

04:36 Yeah, it's definitely a good time.

04:38 And it's also a time where a lot of...

04:40 It's very challenging, right?

04:42 I just did a show with Jonathan Morgan from Partially Derivative about the top 10 data science

04:47 stories of 2016 that's coming out at the end of the year.

04:49 And I'm saving it for the end of the year because that seems like that's the time you do a look

04:53 back episode.

04:54 But a lot of the themes were about how data science and sort of the pollsters and so on

05:00 didn't really get things right this year.

05:04 And I think it's both an interesting time and a challenging time.

05:07 It's pretty difficult with statistics to predict something like human behavior sometimes.

05:12 So math can only do so much for you when you have some unpredictability, if you will.

05:19 Yeah, absolutely.

05:19 Absolutely.

05:20 All right.

05:21 Well, we have a bunch of stuff to talk about.

05:23 So we're going to kind of skip around a little bit.

05:26 But let's start with your book.

05:28 Tell us, what's the title?

05:29 Where did you get the idea to write it in the first place?

05:31 Yeah.

05:31 So the title is Data Wrangling with Python.

05:34 And it was a kind of project of love between Jackie Kazil and I.

05:41 Again, we had worked together initially at The Post back in 2008.

05:46 And she actually first pitched the book to O'Reilly.

05:50 And she was working on it.

05:52 And she decided, you know what, this would be a lot better if I had a co-author.

05:56 And so she called me up.

05:58 I was in Berlin.

05:59 And I said, sure, that sounds great.

06:01 I would love to help.

06:02 That's really cool.

06:02 So the idea is, it's really for people who are not super seasoned Python programmers, but they're just getting into this whole data wrangling world.

06:12 And it sort of introduces them to Python a little bit and hits all the major problems or types of things you want to do getting started, right?

06:21 Yeah.

06:21 So our initial idea was, this is for beginners.

06:24 This is for someone that may or may not even know where their command line prompt is.

06:29 And we're going to take them through the steps.

06:31 And this is to make data wrangling more accessible to people that might not have a computer science background.

06:39 And that kind of is a passion of mine and also of Jackie's, being that we both were involved in PyLadies chapters.

06:47 And for that, yeah, we want an easy, accessible way to get involved working with Python, working with data.

06:55 And this was our idea, our product of that.

06:58 Yeah, I feel like, depending on your background, you sympathize more with people who you know are going to struggle, right?

07:06 Like, I learned programming in college, not when I was really young.

07:11 And so I still remember not getting my C libraries to link correctly and all the pain of what it is to be a new programmer.

07:19 So I feel like, you know, it sounds like you have some of those experiences as well.

07:23 And that probably comes through in the book.

07:25 Yeah, I think, you know, there's a lot of ways that even just as we get more advanced as developers and programmers or engineers,

07:33 whatever you think of yourself as, that you start using a lot of jargon and you start, you know, thinking in these bigger pictures.

07:40 And sometimes it's good to remind yourself, you know, there's probably an easier way to explain this.

07:47 There's probably a way to make this slightly more accessible.

07:49 And I feel like that's a really big, important step towards making beginners feel like they can be a part of the community.

07:57 Yeah, absolutely.

07:58 I mean, there's certainly times when having high level conversations that maybe depend on terminology out of design patterns or some common library that you all know or something like that.

08:08 It's important because you want to be efficient and get stuff done, right, as experts.

08:12 But when you're presenting or when you're writing books, sometimes this perspective is really, really cool.

08:17 So some of the things you talked about in there were basically just, you know, how do you get data loaded up?

08:24 And there were three major areas where this data comes from.

08:28 One of them was CSV and JSON, basically text files, right?

08:31 Yeah.

08:32 So this is particularly from a journalism perspective and also from just data that I see at companies.

08:38 This is a big way that people handle data, even just, yeah, plain text files, if you will.

08:45 Yeah, absolutely.

08:45 And so how often did you hear from people that were like trying to parse CSV files directly or JSON files directly rather than using the built-in modules?

08:55 I'm not quite sure.

08:56 I do think that there's a lot of people that just try to do that directly in the files or directly in a program like Excel.

09:04 And so this is kind of taking them out of the pre-programmed programs to actually get started writing code and using Python to do it.

09:14 Yeah, that's cool.

09:15 Speaking of Excel, like that's got to be the world's biggest database, right?

09:20 You know, there's so much data and data processing actually runs on Excel.

09:25 Until you hit about.

09:26 Yeah, yeah.

09:27 Yeah.

09:28 Until it just like, it just doesn't work anymore.

09:30 Like, I don't think people leave Excel very willingly.

09:33 It's like, we have to find a different answer.

09:36 Yeah.

09:37 Yeah, that and I've found, you know, quite a lot of resistance to learning programming for people that are really adept at SQL.

09:44 They kind of stay in this databasing world where everything is a SQL query.

09:49 And every report can be made with a really, really complex 40 line SQL query.

09:55 And I have a lot of respect for that.

09:57 But maybe it's a little bit easier to let a programming language do some of the heavy lifting for you.

10:03 Sure.

10:04 I mean, SQL is great for declarative stuff.

10:06 But sometimes you want an imperative type of problem solving and with a little declarative mixed in or who knows, right?

10:14 Another area that you said you really liked were PDFs, right?

10:19 Oh, yeah.

10:20 Yeah, PDFs are amazing.

10:23 Amazingly painful, right?

10:26 Yeah, exactly.

10:27 Indeed, a very large pain point.

10:30 And I still find that there's so many different NGOs and other governmental organizations that release all of their data in a yearly report in PDF form.

10:42 Yeah.

10:42 And, you know, to some degree, that's fine if it's meant for like reading.

10:46 But there should be a, and here's the actual JSON version or something, right?

10:51 And that probably is missing.

10:52 Yeah.

10:52 And, you know, the biggest thing that we encourage folks to do in the book is actually just pick up the phone and call the agency or the NGO or whomever it is and ask, hey, do you happen to have this in any other format?

11:05 Because PDFs are really painful.

11:07 There's not a lot of fun.

11:09 And you end up doing quite a lot of post-processing.

11:12 Can you do it, though, like from Python?

11:13 If you've got like tables and a PDF, can you get in there and get it out with Python?

11:17 Yeah.

11:17 So we ran into this issue with the book because it was actually quite difficult to get at some of the tabular data.

11:22 And I came across an old kind of forgotten library called pdftables.

11:28 At the time, it wasn't Python 3 compliant.

11:31 So we needed to do it all in Python 2.

11:34 But now I went to EuroPython and I was talking about this problem.

11:39 And there was a guy there.

11:42 I forget his name right now.

11:43 But he converted it to Python 3.

11:46 So now it's, I believe, compatible with 3.4.

11:48 And you can just simply use a few commands.

11:53 The documentation is not very well made.

11:55 But you can use a few commands and you can parse out actual tabular data.

11:59 And it does it quite well.

12:00 That's really excellent.

12:01 I think that's an interesting story and that happens all the time, right?

12:06 Like there are so many packages and this functionality is spread across so many different places, right?

12:13 I mean, Python is great because you can pip install anything, right?

12:16 pip install, you know, import antigravity sort of thing, right?

12:20 Yeah.

12:20 Which is great.

12:21 But when you're newer, how do you know where to find these things?

12:25 So I think when you get a chance to share these cool libraries that you found that solve problems, I think that's great.

12:30 And we'll do some more of that later, actually, from your talk.

12:32 Cool.

12:32 Yeah, for sure.

12:33 So that's sort of the data acquisition stuff.

12:36 And then you talk a little bit about storing and presenting data.

12:40 You want to talk about that?

12:40 Yeah.

12:41 So I gave people basically an introduction to databasing and kind of went over with them.

12:48 Okay.

12:48 Here's how you might use relational or non-relational databases.

12:52 And here's a little bit of the pros and cons of each.

12:55 It's kind of difficult to make that extremely accessible to beginners.

12:58 But I feel like it's a good problem to start to introduce them to, that there's quite a lot of different ways to store data.

13:06 And you should start thinking now about what makes the most sense for your project or for your team.

13:13 And that's a problem that we face every day, you know, as data people, when we're deciding how to construct something new or we're deciding, okay, how should we build this workflow?

13:24 We constantly have to think, okay, what's going to be, you know, do we need it in high availability storage, so to speak?

13:30 Or do we need it maybe stored away somewhere in a file on somebody's computer?

13:34 Right.

13:34 Are you going to do aggregation or MapReduce type stuff to it?

13:38 Are you going to do joins?

13:40 Like, there's just a lot to consider.

13:41 And when you're new, of course, that's extra hard, right?

13:44 Yeah.

13:45 You don't even know what a join is.

13:46 Like, how do you, like, evaluate whether you need one?

13:49 It's even something that we, as more seasoned people, forget sometimes.

13:52 I mean, sometimes I'm doing some speed comparisons of something, and I just change the format to maybe one that's more preferable to that particular library or tool.

14:02 And, oh, yeah, that speeded it up five times.

14:05 I don't even have to optimize code because I chose a different format to read the data from.

14:10 Yeah, it's really amazing.

14:12 So then another area that you talk about is web scraping.

14:17 And I think web scraping is pretty cool.

14:19 Web scraping and APIs kind of go together, right?

14:21 Like, there's so much data out there.

14:23 You just got to go get it.

14:25 Yeah, and I feel like that is a really empowering first step when you're beginning: oh, yeah, I scraped, you know, my favorite website or whatever it is.

14:34 I feel like that can be a really empowering first step.

14:37 And it doesn't take that much Python code.

14:40 In the book, I used some Scrapy, which is about my favorite library for doing web scraping and web crawling.

14:47 Yeah, Scrapy is cool.

14:48 Oh, I had Pablo on the show a while ago.

14:51 And his story with Scrapy is really cool.

14:52 Oh, that's awesome.

14:53 I'll have to check it out.

14:55 They're a really great team.

14:56 I had a chance to meet them at PyCon a while ago.

14:59 And they're just really awesome, inspiring, you know, fully remote team.

15:03 So how much do you have to be careful about usage rights or restrictions on websites if you're looking at data for, like, internal use?

15:13 You know, this is just something that's coming to mind as I'm thinking about web scraping.

15:17 Like, if you want to answer a question for your company, how open is the web for you?

15:21 Do you know?

15:22 I feel like I go over quite a lot in the book.

15:26 And also, you know, whenever I give talks on web scraping, that you need to look at the terms of service of the website.

15:33 You need to look at the robots.txt file.

15:36 You should be a conscientious scraper.

15:39 These are important things to think about.

15:41 Clearly, there's been times where I've found APIs that are undocumented.

15:45 And I've reached out and sometimes they're like, no, don't use that.

15:48 We didn't know it was out there.

15:49 And other times they're like, oh, yeah, fine.

15:51 We don't have documentation for it, but you can go ahead.

15:54 So I think that's, you know, that's the best practices.

15:57 I know quite a few people that don't follow those.

16:00 But I do think it's really important to be a conscientious scraper.

16:04 And there are, of course, media laws around that.

16:08 And so you don't want to find yourself on the wrong end of a lawsuit because you were a little bit too lazy to read the terms of service.

16:15 Yeah, absolutely.

16:16 I feel like those kinds of challenges are really hard because it's not just knowing the laws of your country.

16:23 Right.

16:23 I mean, it's all the countries.

16:26 You might be interacting with many of them.

16:28 Right.

16:28 And so it's extra hard.

16:30 And, you know, Europe has different privacy rules than the U.S.

16:33 And it's a bit of a Wild West.

16:36 Yeah, it's definitely, you know, there's obviously no international regulation of this at any point in time.

16:42 And I don't even think international regulation would work properly for this.

16:47 But I do think it's a matter of being kind of an ethical person and saying, okay, am I doing this in an unethical way?

16:58 Probably try and avoid it.

17:02 It seems like a good rule to live your life by, really.

17:04 I mean, especially if you're going to go write an article, right, if you're doing this for like data journalism sort of thing.

17:10 Yeah, clearly.

17:11 And a lot of times, honestly, if you pick up the phone and you call people or you write an email, people are more than willing to help share the data with you.

17:20 And I feel like we're kind of in an era of open data everywhere, open APIs everywhere.

17:25 And so I think people are very responsive to that.

17:27 Yeah, totally agree.

17:29 So the last thing that I want to talk about that you cover in your book before we get to your cleaning data story is about automation and scaling.

17:37 What do you guys do with regard to that?

17:39 So I think that we just give a little bit of a taste of that.

17:43 And this is this idea, automate the...

17:46 Automate the Boring Stuff with Python.

17:47 Yes, yes.

17:48 Sorry.

17:48 That's a great one.

17:49 Yeah.

17:49 And I think that this is a really important skill as a beginner that is just like, wow, that's magical.

17:55 I can just have something run and it runs either via cron or it runs as a Celery task or however it runs.

18:03 And wow, it just did it on its own.

18:06 I think as a beginner, that's a really exciting moment of just like, oh, my program's alive.

18:13 So...

18:14 Yeah, if you can take something that...

18:15 Yeah, yeah, that's awesome.

18:16 And if you could take something that's like four hours you had to do at the end of the week that's super painful and repetitive and make it a button press and remove all the human error.

18:26 Like, that's magical.

18:27 Yeah, and I think it's really, you know, it's something that people can give back quickly to their team.

18:33 So if you're just on the side like, okay, I'm going to learn Python.

18:36 I want to do this one report I do every week in Excel.

18:40 I want to do it on my own.

18:41 And then, yeah, you're right.

18:43 Like, they can use Flask or Django or something and turn it into a one-click button.

18:47 That's, you know, they're the hero for the next year or so.

18:51 Your book seems really interesting.

18:53 And if you're just getting into Python from this data angle, it's definitely worth checking out.

18:57 Yeah, thanks.

18:59 Yeah, you bet.

18:59 So let's move on to something that you did.

19:03 Was that this summer?

19:04 Fall?

19:04 It was this fall, right?

19:05 Like September, right?

19:06 The PyCon UK?

19:07 Yeah, yeah.

19:08 So this is a bit, my PyCon UK talk was based a little bit off of my initial talk on this topic, which was at PyData Berlin last year.

19:17 All right.

19:17 So at PyCon UK 2016, you gave a talk called Cleaning Data, something that, what was the title exactly?

19:25 I don't think I have it written down here.

19:26 I believe it was Automate Your Data Cleanup with Python.

19:29 I think it was really interesting.

19:31 Like, you really laid out some of the places that we get data from.

19:35 And then you gave a bunch of libraries and techniques to fix it.

19:39 And that's what I found really interesting is I feel like if any of these problems that we're going to talk about line up with a problem you have, it's like, and here's the solution, which is super cool.

19:49 Yeah.

19:49 And even if it doesn't quite fix your problem, looking through sometimes the source code and how people are approaching it can give you new ideas about how to fix it for your own particular problem set.

20:00 Yeah.

20:00 So you kind of set the stage.

20:01 Let me read a little quote from one of your slides.

20:03 There was a paper or an article or something called Towards Reliable Interactive Data Cleaning.

20:08 And you said something to the effect, or the quote said something like, such approaches require a clear definition of dirty data that is independent of downstream analysis.

20:18 Even worse, one consultant noted that errors might be found after some result is reported.

20:26 That's a really big problem, right?

20:27 Like understanding, even knowing whether you have good data or not, right?

20:31 Yeah, I feel like this is a massive problem and a problem that we see often.

20:36 One thing I'm pretty passionate about that a lot of people don't do is even just data unit testing.

20:42 So we have these unit tests for nearly everything.

20:45 And then we don't test, you know, what happens if we get a really odd piece of data or something that doesn't fit.

20:52 And if it just goes to the pipeline and then it's reported, this can take, you know, a few analysts down the line before somebody's like, wait, we had negative sales in January.

21:01 What does that even mean?
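
The "negative sales in January" case can be caught by a small data check that runs before a record enters the pipeline. What follows is a minimal sketch using only the standard library; the field names (`month`, `sales`) and the `validate_sales_row` helper are hypothetical and do not come from the talk or the book.

```python
def validate_sales_row(row):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    sales = row.get("sales")
    if not isinstance(sales, (int, float)) or isinstance(sales, bool):
        problems.append("sales is not a number")
    elif sales < 0:
        problems.append("sales is negative")
    if row.get("month") not in range(1, 13):
        problems.append("month is outside 1-12")
    return problems


rows = [
    {"month": 1, "sales": 1200.0},
    {"month": 1, "sales": -50.0},   # the suspicious record
    {"month": 13, "sales": "n/a"},  # wrong month and wrong type
]

# Keep the row index with each list of problems instead of passing bad rows on.
bad_rows = {}
for index, row in enumerate(rows):
    problems = validate_sales_row(row)
    if problems:
        bad_rows[index] = problems

print(bad_rows)
```

A check like this can run as an ordinary unit test against sample data, or as a guard inside the pipeline itself.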

21:03 That's right.

21:06 That's the pit in your stomach that forms when you realize what has happened, because it's not just like a bug in some code that like, well, this website renders a little weird or doesn't scale so well.

21:18 It's like decisions could have been made, right?

21:22 Yeah, or maybe already have been.

21:24 Yeah, may have been made based on this data.

21:26 We decided to buy this company or not as an acquisition because of your report.

21:31 We decided to pursue this product line and not that one on your report or whatever, right?

21:37 You know, killed a product based on your report that was not quite accurate.

21:41 So, yeah.

21:42 Or, you know, yeah, like kind of this goes back to what we were initially talking about, about statistics being off about the election.

21:48 You know, it's a little bit of knowing your inputs and your outputs and being cognizant that you need to actually test for some of those things or at least have safety checks along the way.

21:59 Yeah, absolutely.

22:00 And so you brought up data unit testing, which was one of your recommendations, let's say, during the talk.

22:06 And you talked about hypothesis, right?

22:09 Oh, I'm a big fan of hypothesis.

22:11 It's really amazing work by David MacIver.

22:14 And, yeah, I highly recommend if you haven't looked at it to take a look at it.

22:19 It's based off of Haskell's QuickCheck.

22:21 And it's this idea of deterministic testing.

22:25 It's really, really fun.

22:26 Yeah, I'm really impressed with it as well.

22:28 I didn't know about it until six months ago or something like this.

22:33 And then I learned, wow, this is really nice.

22:36 So, yeah, I actually had David on episode 67 if people want to learn more.

22:41 But the idea is you're going to do some kind of test.

22:44 And I guess the terminology we were coming up with was the type of test that you think of when you write unit tests are example-based unit tests or something like this, where you say, if I put in a five here and this customer that has registered equals true on them and I call this function, then this thing happens.

23:06 And that's not really how Python hypothesis works.

23:09 It's more like there's an input that's the number.

23:12 There's an input that's a customer.

23:13 And you can let the framework change all the values.

23:16 And then you just make assertions.

23:18 Inputs like this should lead to outputs like that.

23:21 And it's really cool, I think.

23:23 Yeah, it's the idea of property-based testing.

23:25 So you're essentially saying like, okay, this accepts a list of floats.

23:29 And it should return a valid float that's greater than two.

23:33 And when that fails, then you know you have these edge cases.

23:38 And the fun thing is sometimes you just come up on the edge cases of whatever tool you're using or even the edge cases if you're using Python 2 of floats.

23:47 And that's fine.

23:49 But it's good to first understand, okay, what are the possible inputs?

23:54 And what do I expect the output to be?

23:56 And this helps determine in your workflow or in your data science that you're doing that you actually know what you're seeing coming in and that you actually know what you should be producing.
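
The "list of floats in, valid float out" idea can be sketched without any third-party package. The code below is not Hypothesis and does not use its API. It is a hand-rolled stand-in that generates random inputs with the standard library and checks a property. The names `mean`, `check_property`, and `mean_is_sane` are hypothetical. Hypothesis itself adds much smarter input generation and shrinks failing cases to small examples.

```python
import math
import random


def mean(values):
    """Function under test: average of a non-empty list of floats."""
    return math.fsum(values) / len(values)


def check_property(func, prop, trials=500, seed=0):
    """Call func on random float lists; return the first input that breaks prop."""
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    for _ in range(trials):
        size = rng.randint(1, 20)
        values = [rng.uniform(-1e6, 1e6) for _ in range(size)]
        if not prop(values, func(values)):
            return values  # a counterexample
    return None  # no counterexample found in this run


def mean_is_sane(values, result):
    """Property: the result is finite and lies between min and max."""
    tolerance = 1e-6  # allow for floating-point rounding
    return (math.isfinite(result)
            and min(values) - tolerance <= result <= max(values) + tolerance)


counterexample = check_property(mean, mean_is_sane)
print("counterexample:", counterexample)
```

The test states what must hold for every input rather than listing hand-picked examples. A function that breaks the property, such as one that returns the sum instead of the mean, gets reported together with the input that exposed it.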

24:07 Yeah, absolutely.

24:08 And it both has the happy path and the edge cases where so many of the bugs and errors live.

24:15 Yeah.

24:17 Yeah.

24:17 So whether that's bugs in data or bugs in just pure algorithms, right, it's really important.

24:22 So I definitely second your recommendation of Hypothesis.

24:25 That's cool.

24:26 So we talked about having this bad data and not even really knowing.

24:31 I mean, having your report or your pipeline or whatever it is you're working on to process this data run to completion is not really enough, is it?

24:41 The problem is, is that just running something to completion, like you said, or just assuming that because there's new data in my database, that that means that everything has processed correctly is always a false assumption.

24:54 And sometimes you're correct in that.

24:55 And other times you didn't have the right checks.

24:59 And it's impossible to write perfect code, right?

25:02 We know this as engineers, as data people, that there's going to be bugs in our programming.

25:09 And because of that, we need to have, you know, real eyes on the problem.

25:15 And this is interesting.

25:16 This is some of the things that I was looking at when I was doing research for this talk, is that within the academic field, they're actually determining ways to have the data cleanup process report.

25:29 Hey, I'm not quite sure about this one because it seems either like an outlier or the probability that it's correct based on the algorithm I used is very low.

25:40 And then actually taking that data and presenting it to the user again, you know, the next day or once a week and having the user actually confirm yes or no, whether the cleanup operated correctly.

25:52 And I think that this is, you know, an important lesson to learn from where academics are coming from it.

25:57 And also a way that we can help automation actually be a solution where we have buy-in from everyone, right?
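
One way to picture "report the records I'm not sure about" is to split values into a trusted set and a review queue. This is only a sketch under stated assumptions: it uses a robust z-score based on the median absolute deviation, which is a technique swapped in here for illustration and not the method from the research discussed. The `split_for_review` name and the 3.5 threshold (a common rule of thumb) are likewise assumptions.

```python
import statistics


def split_for_review(values, threshold=3.5):
    """Split values into (trusted, needs_review) using a robust z-score."""
    median = statistics.median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = statistics.median([abs(v - median) for v in values])
    trusted, needs_review = [], []
    for v in values:
        score = 0.0 if mad == 0 else 0.6745 * abs(v - median) / mad
        if score > threshold:
            needs_review.append(v)
        else:
            trusted.append(v)
    return trusted, needs_review


daily_sales = [102.0, 98.5, 101.2, 99.9, 100.4, -250.0, 97.8]
trusted, needs_review = split_for_review(daily_sales)
print("ask a human about:", needs_review)
```

The review queue could then be shown to a person daily or weekly, and their yes/no answers fed back into the cleanup rules.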

26:04 Yeah, that's really cool.

26:05 And basically you take the least trusted data and you say, okay, based on your algorithm, like this, this, I'm really not so sure about the other ones that were cool.

26:14 And you can just show that to the user, right?

26:16 So you also had some, some other interesting stuff from academics as well.

26:34 This portion of Talk Python To Me has been brought to you by Rollbar.

26:38 One of the frustrating things about being a developer is dealing with errors, relying on users to report errors, digging through log files, trying to debug issues, or a million alerts just flooding your inbox and ruining your day.

26:50 With Rollbar's full stack error monitoring, you'll get the context, insights, and control that you need to find and fix bugs faster.

26:58 It's easy to install.

26:59 You can start tracking production errors and deployments in eight minutes or even less.

27:03 Rollbar works with all the major languages and frameworks, including the Python ones, such as Django, Flask, Pyramid, as well as Ruby, JavaScript, Node, iOS, and Android.

27:12 You could integrate Rollbar into your existing workflow, send error alerts to Slack or HipChat, or even automatically create issues in Jira, Pivotal Tracker, and a whole bunch more.

27:22 Rollbar has put together a special offer for Talk Python To Me listeners.

27:25 Visit rollbar.com slash Talk Python To Me, sign up, and get the bootstrap plan free for 90 days.

27:31 That's 300,000 errors tracked, all for free.

27:34 But hey, just between you and me, I really hope you don't encounter that many errors.

27:37 Loved by developers at awesome companies like Heroku, Twilio, Kayak, Instacart, Zendesk, Twitch, and more.

27:43 Give Rollbar a try today.

27:45 Go to rollbar.com slash Talk Python To Me.

27:48 One of the parts of that talk, or the second half of that talk, I guess, that I thought was really interesting

28:01 was the sort of catalog of libraries or Python packages that you could use for solving various problems

28:10 that were reworking your data into a way that's going to be much more useful to you.

28:16 So let's take a moment and kind of go through those and tell people about them because I think they're really useful.

28:22 And the first one that you mentioned was about if you have some kind of duplication.

28:29 It's called Dedupe.

28:30 Yeah, and that along with probablepeople and usaddress, they're all from this company called DataMade that works with journalists really to tell good data stories.

28:43 I believe they're based in Chicago.

28:45 I'm not quite sure.

28:46 But they have a few good talks on this.

28:49 And so they have Dedupe, which essentially, you know, kind of can go through your tabular data and say,

28:54 hey, these two rows look quite similar.

28:57 Are they actually the same data?

28:59 And they use a mixture of fuzzy matching and I believe string edit distance to determine,

29:07 okay, this Bob Dole and Robert Dole are the same person.

29:11 And that's a really great one.
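
The "these two rows look quite similar" check can be sketched with the standard library's `difflib`. This is not Dedupe's API or algorithm; it is a simple string-similarity stand-in, and the `similarity` and `likely_duplicates` names and the 0.7 threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Similarity between 0 and 1 after lowercasing and trimming."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def likely_duplicates(names, threshold=0.7):
    """Return pairs of names whose similarity meets the threshold."""
    return [(a, b) for a, b in combinations(names, 2)
            if similarity(a, b) >= threshold]


names = ["Bob Dole", "Robert Dole", "Elizabeth Dole", "Bill Clinton"]
print(likely_duplicates(names))
```

A fixed threshold like this is brittle, since a shared surname alone can push unrelated people close to the cutoff. That brittleness is part of the motivation for tools that learn from labeled examples instead.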

29:14 They also have used some of their same techniques in probablepeople,

29:19 which is their other library, which tries to essentially parse people's names and determine this is the first name,

29:26 this is the last name, this is the title, and usaddress, which they use to parse, you know, this is the street number, this is the city.

29:35 And it's all based on a very, very simple neural network.

29:40 And you can therefore train it on your own data.

29:43 That's really great.

29:44 Like with probablepeople, you can give it like a first name, a middle name, like a nickname,

29:51 last name, you know, junior, senior, whatever.

29:55 And it can pull those out and say, well, here's actually the last name.

29:58 Things like that, right?

29:59 Yeah.

30:00 So it's really great if you have, you know, survey data or other things that haven't been processed,

30:05 which is most of the data that you have to do when you're dealing with just randomly generated data

30:12 or randomly input data, then yeah, these are really great ways to kind of cut down on your time.

30:18 You also mentioned some for string matching.

30:21 One was called Jellyfish to do approximate and phonetic matching of strings.

30:27 Yeah.

30:27 So in this, you know, speech to text world, we all know that sometimes things go wrong.

30:33 So if you have, you know, strings that you potentially need phonetic matches for,

30:40 then Jellyfish is a great tool for that.

30:42 And it can help you find kind of maybe some of the errors in this, you know, auto transcription

30:48 or speech to text.
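
To make "phonetic matching" concrete, here is a hand-written version of American Soundex, one of the classic phonetic encodings. It is a sketch for illustration and does not use Jellyfish or show its API; the `soundex` function below is written from the published Soundex rules.

```python
def soundex(name):
    """Encode a name as a letter plus three digits using American Soundex."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for letter in letters:
            codes[letter] = digit

    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    last_code = codes.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(letter, "")  # vowels get no code and reset last_code
        if code and code != last_code:
            result += code
        last_code = code
    return (result + "000")[:4]  # pad or trim to four characters


print(soundex("Robert"), soundex("Rupert"))
```

Names that sound alike map to the same code, so comparing codes finds matches that exact string comparison would miss.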

30:49 Oh yeah.

30:50 Very cool.

30:50 What about FuzzyWuzzy?

30:52 FuzzyWuzzy is one of my favorite libraries.

30:54 You got to love it just for the name, right?

30:56 Yeah.

30:57 Right.

30:57 And it allows you basically to, you know, take strings and do Levenshtein distance between them.

31:05 And this can help you either in like a token sort way.

31:09 So if it doesn't matter the order of the words, just that the words are the same.

31:14 And they use this a lot because they actually sell game tickets.

31:17 So whether I say it's the Steelers versus the Patriots or the Patriots versus the Steelers,

31:23 it's the same game that I'm likely talking about.

31:25 And so they can do, you know, token sort matches and they can also just do partial ratio matches.

31:31 And it's really useful if you just want a quick installation to work on string distance.
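
The two ideas mentioned here, Levenshtein distance and token-sort matching, can be sketched in plain Python. This is not FuzzyWuzzy's implementation or API; `levenshtein` and `token_sort_key` are illustrative names for a textbook dynamic-programming version.

```python
def levenshtein(a, b):
    """Edit distance: minimum single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def token_sort_key(text):
    """Lowercase, split into words, and sort them so word order is ignored."""
    return " ".join(sorted(text.lower().split()))


game_a = "Steelers vs Patriots"
game_b = "Patriots vs Steelers"

# Raw comparison is penalized by word order; the token-sorted one is not.
print(levenshtein(game_a, game_b))
print(levenshtein(token_sort_key(game_a), token_sort_key(game_b)))
```

Sorting the tokens first is what lets "Steelers vs Patriots" and "Patriots vs Steelers" compare as the same game.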

31:37 Yeah.

31:38 Yeah.

31:38 That sounds great.

31:39 The other one, the next one, rather, I really liked the name.

31:42 There's a lot of great names in here.

31:44 The next one is scrubadub.

31:46 Yeah.

31:47 So scrubadub is great if you've ever had to work with medical data or maybe even potentially

31:52 customer data and you essentially need to privatize it before you do reporting.

31:56 scrubadub is going to try to go through, find these personally identifiable pieces of information

32:03 and remove them or replace them with like UIDs or something like that.
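
As a rough picture of what removing personally identifiable information involves, the sketch below replaces email addresses and US-style phone numbers with placeholders using regular expressions. It is a deliberately simplified stand-in, not scrubadub's API, and the patterns and the `scrub` helper are assumptions that will miss many real-world formats.

```python
import re

# Simplified patterns for illustration; real PII detection needs far more care.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def scrub(text):
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL.sub("{{EMAIL}}", text)
    text = PHONE.sub("{{PHONE}}", text)
    return text


note = "Patient reachable at jane.doe@example.com or 555-867-5309 after 5pm."
print(scrub(note))
```

Names, addresses, and free-text identifiers are much harder to catch with patterns alone, which is why a dedicated library is the safer choice for real data.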

32:08 Yeah.

32:08 That's really cool.

32:08 You know, another area that comes to mind that that might be useful is developer data

32:14 for like web applications and stuff.

32:16 You want to take the data that drives your website, say, and you want to put it on the

32:22 dev machines and staging machines and whatnot.

32:24 But maybe it's got like e-commerce information in there and you want that gone.

32:28 Right?

32:29 Yeah.

32:29 Like maybe you, like if the developer loses their laptop or it gets broken into, it's like,

32:33 well, it's not really that important.

32:35 That would be great to say.

32:37 It doesn't happen very often, I think.

32:38 Yeah.

32:39 I mean, and there's quite a lot of tools too that you can use to generate, you know, fake

32:45 data with that.

32:46 I believe, I think one of them is called Faker.

32:49 Yeah.

32:50 Faker.

32:50 You definitely mentioned Faker.

32:51 Faker is cool.

32:53 And it'll generate all sorts of interesting data.

32:56 It'll do like addresses, right?

32:58 Yeah.

32:58 It can do addresses.

32:59 It can do people.

33:00 And then you can write your own methods and it will generate those.

33:03 So, so you need to, I think it even actually has credit cards built in and other things

33:08 like that.

33:08 So.

33:09 Oh, that's excellent.

33:10 Yeah.

33:10 Because one of the things that's challenging is it's harder than it sounds, I think, to

33:15 generate real looking fake data for a unit test for web design.

33:20 Like if I'm going to look at like a profile page, if it doesn't like have realistic looking

33:25 data, well, maybe my design isn't really going to capture what it should.

33:29 Yeah.

33:30 And kind of goes back to what we were talking about with testing your pipeline.

33:33 If you don't have realistic enough data, then everything can look fine until you throw actual

33:38 data into it.

33:39 Yeah.

33:40 That's for sure.

33:41 Yeah.

33:41 And it's cool because it'll do addresses in different locales.

33:44 You can say, give me a US address.

33:45 Give me a German address.

33:47 And of course, like the zip postal code comes in different orders and stuff like that.

33:51 Yeah.

33:52 Yeah.

33:52 I definitely was impressed with that one.

33:53 Another one that made me happy that you talked about was pint.

33:57 Because who wouldn't want a pint?

33:58 Actually, how big is a pint?

34:00 I mean, that's like how, like, well, to say a liter.

34:03 Like, I have no idea.

34:04 My wife is German.

34:05 She asks me these questions all the time.

34:06 How many ounces in this thing?

34:07 I have no idea.

34:08 Like, ounces don't make any sense.

34:10 Like, I just have to compute it.

34:13 Right?

34:13 And so pint kind of addresses this problem of conversions, right?

34:16 Yeah.

34:16 So, and it does it in really, really simple terminology.

34:20 And for that reason, you can do things like multiply meters by centimeters or convert easily

34:27 from feet.

34:28 Yeah.

34:29 That's a lifesaver when it comes to maybe you have an international application.

34:33 We don't really have a good way of internationalizing units built into Python.

34:40 And so for that reason, yeah, pint is a great tool.

34:42 Yeah.

34:43 Pint is really cool.

34:43 And you can do, like you said, you can compute with the measurements within a particular scheme

34:49 of measurement or convert between them.

34:52 So, for example, you can say like three times unit registry dot meter plus four times unit

34:58 registry dot centimeters.

34:59 And it will compute that.

35:00 Or you can say five times foot plus nine times inches.

35:03 Now, tell me that in meters or something like this, right?
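
The arithmetic described here can be sketched with a small conversion table. This is a stand-in written with plain Python, not Pint's API; `METERS_PER_UNIT`, `to_meters`, and `total_in` are hypothetical names. The conversion factors themselves are the exact definitions of the foot and inch in meters.

```python
# Conversion factors to meters; the foot and inch values are exact by definition.
METERS_PER_UNIT = {
    "meter": 1.0,
    "centimeter": 0.01,
    "foot": 0.3048,
    "inch": 0.0254,
}


def to_meters(value, unit):
    """Convert a length in the given unit to meters."""
    return value * METERS_PER_UNIT[unit]


def total_in(target_unit, *quantities):
    """Add (value, unit) pairs and express the sum in target_unit."""
    total_m = sum(to_meters(value, unit) for value, unit in quantities)
    return total_m / METERS_PER_UNIT[target_unit]


print(total_in("meter", (3, "meter"), (4, "centimeter")))  # about 3.04
print(total_in("meter", (5, "foot"), (9, "inch")))         # about 1.7526
```

Even this tiny version shows why a tested library is preferable: every quantity has to carry its unit, and an unknown unit should fail loudly (here with a `KeyError`) rather than be silently treated as a bare number.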

35:07 Yeah.

35:07 And that's just a lot of math that you don't have to do yourself and that you don't have

35:12 to check yourself.

35:13 Sure.

35:13 And it sounds simple, right?

35:14 Like, okay, well, there's, you know, like 3.31 feet per meter.

35:19 So you divide like this.

35:20 But we've had like billion dollar spacecraft like crash straight into the ground because

35:26 somebody inverted that multiplication or something, right?

35:29 Like they needed pint.

35:30 Yeah.

35:32 Not quite sure I'm going to challenge NASA or SpaceX, but sure.

35:36 No, of course.

35:37 But definitely these sorts of problems have vexed like really important projects.

35:41 Yeah.

35:42 And clearly, you know, again, if you have, you know, an international site or if you're

35:46 dealing with data from numerous places and you want to make sure that it's correct, it's

35:51 better to rely on another library that has its own set of unit tests, that has its own

35:56 set of supporters and contributors and, you know, rely on the smarts of other people.

36:00 Yeah, absolutely.

36:02 Another thing that's really cool that I've talked about a little bit before on the show,

36:06 I think it was on episode 77, was Arrow.

36:09 And Arrow lets you deal with time in a much cleaner way, right?

36:14 Arrow is super useful if you're, you don't want to sit there and determine, okay, this

36:21 date is written in exactly this way.

36:23 It has an auto-recognition, which can really help you if you're seeing data, date times in

36:29 a few different syntaxes.

36:31 And this is especially great for if you have like a distributed system and maybe half of

36:37 your machines are set to report dates in one way and half of your machines are set in another

36:42 way.

36:42 You can still distribute the same code and it will actually function properly most of the

36:46 time.

36:47 Yeah, that's really cool.

36:48 It definitely adjusts for the time zone in a much nicer way than the built-in one, built-in

36:53 date times.

36:54 And also it has a nice humanized bit.

36:55 So you can say like, given an arrow time, humanize this and say, well, that was about an hour

37:00 ago.
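
The two conveniences discussed, accepting dates written in several formats and describing a time as "about an hour ago", can be approximated with the standard `datetime` module. This is a sketch, not Arrow's API: `KNOWN_FORMATS`, `parse_any`, and `humanize` are hypothetical, and time zones are ignored entirely.

```python
from datetime import datetime

# Hypothetical list of formats seen across machines; extend as needed.
KNOWN_FORMATS = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%Y%m%dT%H%M%S"]


def parse_any(text):
    """Try each known format in turn; raise ValueError if none match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def humanize(then, now):
    """Describe how long before `now` the moment `then` happened."""
    seconds = int((now - then).total_seconds())
    if seconds < 60:
        return "just now"
    for unit, size in (("day", 86400), ("hour", 3600), ("minute", 60)):
        if seconds >= size:
            count = seconds // size
            return f"{count} {unit}{'' if count == 1 else 's'} ago"


now = parse_any("2016-11-28 12:00:00")
then = parse_any("28/11/2016 10:30")
print(humanize(then, now))
```

Trying formats in a fixed order is ambiguous for dates such as 01/02/2016, so a hand-rolled list like this needs to be chosen with the actual data sources in mind.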

37:00 Yeah.

37:01 Yeah.

37:02 Which is great.

37:03 So you don't have to write, you know, all of the math around those and use like five JavaScript

37:07 libraries to have it written out to your users.

37:10 Yeah.

37:11 I've definitely, definitely written that code more than I need to write it in my life.

37:11 So another one that you mentioned, just a shout out to it again, is pdftables, pdftables.six.

37:23 That's really cool to go get the tabular data out of PDFs because, you know, who thinks it

37:29 belongs in there?

37:29 And nobody wants to look at the code underneath the hood there.

37:33 So again, hat tip.

37:35 Yeah, yeah, absolutely.

37:37 And by the way, all of these packages we're talking about, there's links to every one of them

37:41 in the show notes.

37:42 So just go back and find them that way.

37:45 Another one was called datacleaner.

37:47 What's the story of datacleaner?

37:49 The idea behind datacleaner is that it can automatically clean data for you.

37:54 Now, that sounds a bit too good to be true.

37:59 And it definitely depends on what you need to have cleaned and how you might need it cleaned.

38:04 But I feel like this is starting to get on the edge of what my biggest excitement when

38:11 researching this talk was, which is kind of the ways that academics are using machine learning

38:17 to automatically clean and automatically edit dirty data, if you will.

38:22 And I feel like we're definitely on the edge of this.

38:25 I mean, machine learning is very sexy and has had quite, quite a lot of advances.

38:31 And it's exciting to see it applied to something that is definitely unsexy, like data cleanup.

38:37 Well, that's what we should use AI for, right?

38:40 Yeah, right?

38:41 I mean, nobody wants to sit around and do string matching on their own.

38:45 Really, this is not a thing that any of us have a deep passion for.

38:50 So the more that we can automate this away, the better our lives will be.

38:54 The more time we can spend doing the fun stuff, like actual data analysis.

38:57 Yeah, absolutely.

38:58 Another one that you talked about was really cool, I thought, was about creating simple

39:03 domain-specific languages, basically, or parsers that parse simple domain-specific languages

39:10 called Parserator.

39:12 Yeah, so this is, again, from the DataMade team.

39:15 And this is kind of some of the back end that powers their usaddress and their probablepeople.

39:21 And you can essentially use it to find some different data structures.

39:25 And depending on if you have data available that's labeled, you can train it.

39:31 And I think that this is just a really great way that we can stop doing a lot of the difficult

39:38 work.

39:38 So this is the last time I want to see, you know, a 40-if-else statement line going through

39:45 and testing, is it this, that, or the other, to, you know, parse a name or to parse an address.

39:50 We can use machines to help us do this in far less code.

39:54 This portion of Talk Python To Me is brought to you by GoCD from ThoughtWorks.

40:14 GoCD is the on-premise, open-source, continuous delivery server.

40:18 With GoCD's comprehensive pipeline modeling, you can model complex workflows for multiple

40:24 teams with ease.

40:25 And GoCD's value stream map lets you track changes from commit to deployment at a glance.

40:31 GoCD's real power is in the visibility it provides over your end-to-end workflow.

40:36 You get complete control of and visibility into your deployments across multiple teams.

40:41 Say goodbye to release day panic and hello to consistent, predictable deliveries.

40:46 Commercial support and enterprise add-ons, including disaster recovery, are available.

40:50 To learn more about GoCD, visit talkpython.fm/gocd for a free download.

40:56 That's talkpython.fm/gocd.

41:00 Check them out.

41:01 It helps support the show.

41:09 Another thing that you mentioned that I thought was interesting was DBpedia.

41:14 What's DBpedia?

41:16 DBpedia is a knowledge base based on Wikipedia.

41:20 There's also YAGO that's based off of Wikipedia.

41:23 And it essentially is these, you know, knowledge base APIs.

41:27 And you have to write SPARQL, which, yeah, takes a bit of getting used to.

41:33 And for that reason, there's several ways that you can write this SPARQL using some Python helpers,

41:40 including RDFAlchemy, which is a really popular one, and SuRF.

41:45 But basically, you can essentially get all of this data from this Wikipedia-based database.

41:52 And this can be really essential in terms of cleaning your data, particularly when you get into natural language processing.

42:00 And you want to say, okay, you know, I have William Clinton.

42:03 Like, what does this mean?

42:05 Who is this person?

42:06 And you can essentially go query Wikipedia via DBpedia or via YAGO.

42:11 And it can tell you all of these different, okay, it's in topics, human people.

42:16 It's in topics, U.S. presidents.

42:19 And you can kind of get a lot of information from this that you wouldn't know just from saving the string and then having to have humans go through and annotate it.
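
For a sense of what such a query looks like, the sketch below builds a request URL that asks for the types attached to one resource. Several things here are assumptions rather than details from the episode: the public endpoint address `https://dbpedia.org/sparql`, the `Bill_Clinton` resource URI, and the `format` query parameter. The code only constructs the URL and does not contact the network.

```python
from urllib.parse import urlencode

# Assumption: DBpedia's public SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

# Ask which types (categories) are attached to one resource.
QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?type WHERE {
  <http://dbpedia.org/resource/Bill_Clinton> rdf:type ?type .
}
LIMIT 20
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",  # assumed parameter name
    }
    return f"{endpoint}?{urlencode(params)}"


url = build_request_url(QUERY)
print(url)
# Fetching is left out on purpose; urllib.request.urlopen(url) would send it.
```

The returned types are what would let a pipeline tell that a bare string such as "William Clinton" refers to a person and a former president.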

42:29 Okay.

42:29 Yeah, that sounds really cool.

42:31 I mean, to be able to harness Wikipedia in like a structured way is definitely more useful than trying to scrape it.

42:38 Or I don't know how big the download of the data is.

42:40 I know it used to be huge many years ago, so it's probably even huger.

42:44 Yeah, it's pretty massive.

42:46 And this is a great way that you can just use it like an API.

42:50 Okay.

42:50 Yeah, very cool.

42:51 You know, another thing that you talked about that looks really useful and it's quite Pythonic in its style is this thing called Engarde.

43:01 And you can use it to come up with decorators to apply to your functions to verify stuff, right?

43:07 Yeah.

43:07 So Engarde is specifically for pandas.

43:10 And you can essentially decorate functions that, you know, essentially take in data frames and ensure that it has particular types or ensure that it's not empty.

43:22 Things like that.

43:24 And so this is essentially static typing for pandas, if you will.

43:28 And I think that that's really great.

43:30 I mean, I know that there's numerous opinions within the Python community about the need for something like static typing or type checking.

43:38 But when you work with data, these are really important steps.

43:41 And yeah, maybe you only want it to throw a warning, not an exception.

43:44 But these are things that are important to building scripts, building pipelines that actually work.
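
The decorator idea can be illustrated without pandas. The sketch below checks a list of dictionaries rather than a DataFrame, so it is a stand-in for the concept and not Engarde's API; the `expects_columns` decorator and the `total_revenue` function are hypothetical.

```python
from functools import wraps


def expects_columns(**column_types):
    """Decorator: require non-empty rows whose columns have the given types."""
    def decorator(func):
        @wraps(func)
        def wrapper(rows, *args, **kwargs):
            if not rows:
                raise ValueError("no rows to process")
            for index, row in enumerate(rows):
                for column, expected in column_types.items():
                    value = row.get(column)
                    if not isinstance(value, expected):
                        raise TypeError(
                            f"row {index}: {column!r} should be "
                            f"{expected.__name__}, got {type(value).__name__}")
            return func(rows, *args, **kwargs)
        return wrapper
    return decorator


@expects_columns(price=float, quantity=int)
def total_revenue(rows):
    """Sum price times quantity over all rows."""
    return sum(row["price"] * row["quantity"] for row in rows)


print(total_revenue([{"price": 2.5, "quantity": 4}]))
```

Swapping the `raise` for `warnings.warn` would give the "warning, not an exception" behavior mentioned above.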

43:51 Yeah, absolutely.

43:52 Because it's super easy for something that you thought had a number to have, quote, that number in it.

43:59 And then math goes crazy, right?

44:02 Yeah.

44:02 And, you know, in pandas, there are certain things that are only available from particular dtypes.

44:07 So, you know, given that everything is NumPy based in pandas, if you have an incorrect dtype, it could just go through and, like, add all your strings together and give you, you know, a really awesome large number that has absolutely nothing to do with the math that you expected.

44:22 Absolutely.

44:23 I expected a number.

44:24 I got a huge string that's full of numbers back.

44:26 I don't know why.

44:27 Exactly.

44:28 Nice.

44:30 All right.

44:30 Well, I think that kind of wraps it up for the solutions, stating the problem, these little tools that you found to solve it.

44:37 And I thought they were really cool.

44:38 So hopefully people will find them interesting to know they exist as well.

44:41 I hope so.

44:42 And if people want to check out the talk, it's on the PyCon UK YouTube channel.

44:48 Yeah, absolutely.

44:49 And I'm definitely going to link to it in the show notes as well.

44:51 So you can find it there on the page as well.

44:53 And then speaking of videos, you actually took this idea of data wrangling and data pipelines and did a couple of cool O'Reilly videos as well, right?

45:04 Yeah.

45:04 So I have one O'Reilly video that's very focused on pandas.

45:08 It's an introduction video to pandas.

45:10 So it's meant for somebody that's trying to learn about pandas that hasn't done a lot in it yet.

45:16 And it basically gives you an overview of these are the things that you can do with pandas.

45:21 And we play around with some data in Jupyter notebooks and such.

45:25 So it's a really great introduction if you've been meaning to check out pandas and you haven't.

45:30 And then I also have a new one coming out on data pipelines or automation workflows, if you will.

45:37 And this is going to cover things like Luigi and Airflow as well as Celery and kind of talk about how do you do distributed task processing and DAGs in Python.

45:49 That sounds really cool, especially the second one.

45:51 You know, I hadn't really heard of Airflow or Luigi.

45:55 So I suspect many people listening probably haven't either.

45:58 Maybe tell us what those are.

46:00 Yeah.

46:01 So Luigi is from Spotify.

46:03 It's been out, I believe, for four or five years now.

46:06 And Airflow is from Airbnb.

46:08 And it's actually incubating as an Apache incubator project right now.

46:12 And they're both really neat.

46:14 They have a little bit of different approaches.

46:16 But essentially, they're a way to build DAGs, which are directed acyclic graphs in Python, and to push all of your data through those graphs.

46:26 So essentially, when you think about, let's think about a problem like MapReduce.

46:30 We have a beginning, right, where we have all the data collected somewhere.

46:35 Then we map the task.

46:36 Then we shuffle and sort.

46:38 And then we reduce.

46:39 And finally, we have our output files.

46:43 And when you think about that flow, you can see that it goes through a directed acyclic graph, right?

46:48 It only moves one way.

46:50 And it has only a particular set of nodes or edges that it will go through.

46:55 And for that reason, you know, you can take this idea and apply it to quite a lot of workflow or data pipeline problems.

47:03 And in the video series, I cover a few of those and how to parallelize them and how to work with them so that you get kind of an introduction to how to build these things and whether they're right for your tool set.
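
The MapReduce-style flow described above can be written down as a small directed acyclic graph and ordered with the standard library's `graphlib` (Python 3.9 or later). This sketch is not Luigi or Airflow code; the task names are made up to mirror the steps mentioned.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
pipeline = {
    "map": {"collect"},
    "shuffle_sort": {"map"},
    "reduce": {"shuffle_sort"},
    "write_output": {"reduce"},
}

# static_order() yields tasks so that every dependency comes first.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

Because the dependencies are explicit data rather than function calls buried in a script, a scheduler can see which step failed and which downstream steps are blocked by it.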

47:17 Yeah, it sounds really cool.

47:18 I feel like a lot of projects have these data pipelines and they're just hiding.

47:24 They're not really brought out in a real clear way, you know?

47:27 Yeah.

47:28 I mean, there's definitely like quite a lot of scripting done in this instead of actually building a pipeline, right?

47:35 So you have a script and it's calling, okay, when this finishes, then call this other function.

47:41 What happens if it fails then?

47:43 You know, you get an exception in your logs and you have to go through and say like, okay, do I start it again?

47:48 Or where did it actually error?

47:50 And having these instead as graphs, you can see exactly where they failed.

47:55 And both Luigi and Airflow, as well as, you know, Celery, have ways to go and retry those based on the failure status.
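
Retrying a failed step can be sketched in a few lines. This is a generic illustration, not the retry mechanism of Luigi, Airflow, or Celery; `run_with_retries` and the deliberately flaky `flaky_download` task are hypothetical.

```python
def run_with_retries(task, max_attempts=3):
    """Run task(); on failure, retry until max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as error:
            print(f"attempt {attempt} failed: {error}")
            if attempt == max_attempts:
                raise  # give up and surface the last error


calls = {"count": 0}


def flaky_download():
    """Hypothetical task that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network problem")
    return "downloaded"


result = run_with_retries(flaky_download)
print(result)
```

Real schedulers usually add a delay between attempts and record each failure against the specific node in the graph.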

48:03 Yeah, absolutely.

48:04 And once you understand them in this data pipeline way, distributing them so they run in parallel might be a really cool thing to do.

48:12 There's a lot of stuff, right?

48:13 Yeah.

48:14 And I mean, this is just kind of, again, an introduction, but it's there, you know, it's to give you an idea of what's available and what people are using from a distributed scale.

48:24 And then you can determine, okay, is this right for my project?

48:28 Another thing that I covered in that, which is one of my favorite libraries to play around with, is Dask.

48:34 And Dask is amazing.

48:36 Basically, you can have parallelized DAGs directly on your local computer going through sometimes terabytes of data, just because it has the ability to parallelize the work and to do out-of-core processing.
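
[Editor's sketch] The out-of-core idea mentioned here means working through data in pieces that fit in memory rather than loading everything at once. The sketch below shows only that idea with the standard library; it is not Dask's API, and Dask additionally schedules such chunks in parallel. The names chunks and chunked_mean are hypothetical.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def chunked_mean(values, chunk_size=1000):
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else None


if __name__ == "__main__":
    # A generator stands in for a file too large to load at once.
    print(chunked_mean((n for n in range(1, 101)), chunk_size=7))
```

Only the running total and count survive between chunks, so memory use depends on the chunk size rather than on the size of the whole dataset.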

48:53 That sounds really cool.

48:53 I haven't played with Dask, but maybe I need to now.

48:56 Yeah, yeah.

48:56 And Matthew Rocklin is super smart.

49:00 So, I'd love to hear him on the show sometime.

49:02 Oh, that'd be great.

49:03 Absolutely.

49:03 Those videos sound really interesting.

49:05 And people can definitely check them out if they're interested in it.

49:09 Let's talk about a few other things you're up to because you're juggling many things, right?

49:13 I like to keep busy.

49:14 That's good.

49:17 So, you were involved in the first PyLadies chapter in LA, right?

49:22 I was lucky enough to be part of the original seven ladies that formed PyLadies in Los Angeles.

49:28 And that was a really, really neat experience and very near and dear to my heart.

49:35 Yeah, that's really great.

49:36 You must be really proud to see how far it's come.

49:39 Yeah, it's kind of amazing to be at a conference somewhere in Europe and people will be like,

49:45 Oh, yeah, I run PyLadies.

49:46 I run PyLadies Prague.

49:48 I run PyLadies Moscow.

49:50 I run PyLadies wherever.

49:51 That's just so inspiring.

49:53 And I feel really, really great that the community has kind of just embraced it and ran with it.

49:59 And I feel like that's really inspiring for young women that are getting into programming.

50:04 Yeah, I think it's really great to have that support and just that whole structure.

50:09 I think it's really important.

50:10 And I think it makes a big difference in the Python community.

50:13 As I look to other programming communities that don't have these things, they're definitely less well off for it.

50:21 Yeah, I feel like there's a lot of communities that are now looking at Python as a really great example of how do you have a diverse community?

50:29 And how do you have a supportive and open community?

50:33 And that just feels really amazing to be part of a community that other languages are trying to replicate, if you will.

50:39 Yeah, I'm sure.

50:40 I'm sure.

50:41 It's definitely, it feels like a better place to be, in my opinion.

50:44 I love the Python community.

50:46 And that's one of the reasons.

50:47 So another thing, speaking of community, that you do is you help organize PyData in Berlin, right?

50:53 Yeah.

50:54 So I got asked to be a part of that at the conference last year.

50:58 And it's a great group of folks.

51:01 And we're constantly organizing the monthly meetups.

51:04 Sometimes we have hackathons.

51:06 And then we've already started organizing the conference for next year, which will probably be the first weekend of July.

51:13 So if you want to come to Europe, you should do so the first weekend of July and come visit PyData Berlin.

51:19 Yeah, I definitely recommend that.

51:22 And maybe I'll make it.

51:23 Who knows?

51:24 It would be wonderful.

51:24 Yeah, yeah.

51:25 We're going to have a call for speakers pretty soon here.

51:28 So.

51:28 Okay, excellent.

51:29 So I hear this a lot.

51:31 And I know it means that you're really busy.

51:32 But what does it mean to organize a conference, a Python conference?

51:36 Yeah.

51:36 So, I mean, right now it means quite a lot of emailing and telephoning with folks to find a good venue.

51:43 We have a few keynote speakers that we're speaking with.

51:46 And then we work, of course, with NumFOCUS, which is the overarching nonprofit organization based in the States.

51:54 And they're the ones that are really kind of running PyData behind the scenes and really helping organizers like us with the tools that we need to set up a good conference.

52:06 Yeah, that's really cool.

52:07 One of the things I think is great about Python is there's not just one big PyData conference somewhere in the world and you have to go to it.

52:15 But they're in many places, right?

52:17 And I think that makes it more accessible as well.

52:19 Yeah, and really the support is there, kind of with the same idea of, you know, having PyLadies chapters all over.

52:26 The support is there within NumFOCUS to really say, okay, you know, we have your back and we're going to help you figure out how to get good speakers.

52:36 We're going to help you figure out a good venue.

52:38 And having support like that is just, I think, tremendous for being able to make Python accessible and also make it amazing and great.

52:47 And I think this is why it has been able to grow so much within the data and scientific communities.

52:53 That's an interesting question.

52:54 And maybe it's worth talking about a little bit: just a few weeks ago, somebody measured and realized that Python became the second most popular language on GitHub.

53:05 And it's second only to JavaScript.

53:07 It displaced Java, so cheers for that, right?

53:10 Current most popular one is JavaScript.

53:14 And I feel like that's probably highly overcounted because JavaScript appears in Python web apps, Ruby web apps, ASP.NET.

53:24 Like, you know, everything that is up there has JavaScript, right?

53:27 So I think it's overcounting it.

53:28 People were asking, like, why is Python so much more popular all of a sudden when it's been around for 25 years?

53:36 What are your thoughts on that?

53:37 I have my own, but.

53:38 I definitely come at it from a data perspective.

53:41 So I know I'm looking at it through my own Python-colored glasses.

53:45 But I do think that the embrace of the data and scientific communities around Python has been massive.

53:52 And I feel like it definitely came from second place to R, possibly second place to Java.

54:02 And now it has really been embraced within machine learning, within artificial intelligence, within the chatbot movement, and within data analysis.

54:12 And I think that that's a really strong thing because those fields are obviously growing right now.

54:17 And it's really powerful to have Python be kind of the up and coming language within those communities.

54:24 And even for something like, okay, I do Apache Spark and I only write Scala, the second in line is Python, always the PySpark library.

54:34 And so for these reasons, I feel like we are no longer the weird kid on the block.

54:39 Yeah, I tend to agree with that analysis.

54:42 I think that's a really large part of it.

54:43 I think there's more to it, but I think that maybe is the single, if you're going to pick one thing, I think that might be the one thing that's making the biggest difference.

54:52 You know, for example, like cloud computing, right, Google App Engine, and some of the other things that made running Python apps at scale much easier than it had been before.

55:00 I think that has something to do with it.

55:02 Yeah.

55:03 But the data is probably number one and sort of, because it's a new area where it's really growing fast.

55:08 Well, and it's just really wonderful when you hear frameworks released, particularly around new machine learning or deep learning, that almost always Python is supported from the start.

55:20 And that's just really powerful that even if the backend is C++ or Java, that they understand and know that the Python community will kind of pick it up and run with it.

55:30 And so they want to make sure that from the very beginning, you know, TensorFlow runs with Python and things like that.

55:35 Yeah, absolutely.

55:36 That's very cool.

55:37 So another thing, as if you're not busy enough with all this stuff, another thing that you're up to is you're running a data consulting company in Berlin.

55:43 What's the name of it?

55:44 And what do you guys do?

55:45 Yeah.

55:45 So I guess auf Deutsch, it's Kjamistan.

55:48 But it's Kjamistan.

55:50 And that's from a fond nickname several lifetimes ago in job perspective.

55:56 And basically, we do data consulting.

56:00 It's primarily me.

56:02 Sometimes I get a chance to hire some folks to help out, some friends and other people I respect within the data community.

56:09 And we kind of specialize in doing a mixture of natural language processing work and data analysis work.

56:17 And so I usually take on clients and get to work alongside teams a lot of the time or build proof of concepts.

56:24 And that's really, really fun work.

56:26 I bet it is.

56:28 Yeah.

56:29 Yeah.

56:29 Are there a lot of companies that maybe are not software companies per se?

56:33 There's some other specialty, but they've got a software team in-house, but not a data science team in-house.

56:40 Also, is it pretty common that they maybe bring somebody like you or your company in to mix in a little data science or natural language processing for a particular project?

56:50 Yeah.

56:50 So sometimes it's things like that.

56:53 Or other times, yeah, it's just a new startup and they need some data analyzed so that they can take it to their investors, things like that.

57:01 Other times, it's that maybe they have a data science team and the data science team is very busy doing one thing.

57:08 They want somebody who knows Python, whom they can trust with their code, to build out kind of a new proof of concept or a new idea.

57:16 And then they decide, okay, is this useful?

57:19 Are people in the company happy with it?

57:21 If so, then they usually will take it on their own or potentially hire people to help manage it.

57:27 Yeah, sure.

57:27 That makes sense.

57:28 All right.

57:29 Very cool.

57:29 That sounds like a fun job.

57:31 Yeah, it's really fun.

57:32 Very, very cool.

57:33 All right, Katharine, I think we're running low on time, so we probably need to wrap up our topics there.

57:39 So let me ask you two questions I always ask my guests at the end.

57:42 And the first is we already enumerated quite the list, but there's over 90,000 packages on PyPI.

57:49 What one would you recommend to people that maybe they don't know about or haven't heard about?

57:54 Well, we didn't get to talk much about NLP, but I really, really love Gensim and spaCy.

58:00 Both of them are really changing and pushing the space of kind of where academia is with natural language processing and making it available for us mere mortals.

58:12 So I really recommend if you want to take a look at how to use neural networks with natural language processing, that you check out both Gensim and spaCy.

58:21 Okay.

58:21 Those sound really, really excellent.

58:23 And I have not tried either, so it sounds fun.

58:25 And when you write some Python code, what editor do you open up?

58:28 I use Vim, so I'm kind of old school.

58:32 You know, I think that might be the single most popular editor of all the guests.

58:38 So there's a wide variety of editors, but that one might have the highest histogram bar.

58:44 Go Vim.

58:45 Right on.

58:46 Okay.

58:47 Well, that's really cool.

58:48 And final call to action before we say goodbye to everyone.

58:52 Keep in touch.

58:53 I'm @kjam on Twitter.

58:54 And you can find my company and blog posts on kjamistan.com.

58:59 And if you're interested in speaking at PyData Berlin, feel free to reach out.

59:04 Yeah, that sounds great.

59:05 That sounds like such a lovely opportunity to go to Berlin, to meet all the people, and to present there.

59:11 So definitely want to second that.

59:13 All right.

59:14 Well, I've learned a lot about data cleaning and picked up a bunch of cool tools along the way.

59:18 So thank you for your time, Katharine.

59:20 That was great.

59:20 Thanks so much, Michael.

59:22 Thanks for having me.

59:23 You bet.

59:23 Talk to you later.

59:23 Ciao.

59:24 This has been another episode of Talk Python To Me.

59:29 Today's guest has been Katharine Jarmul.

59:31 And this episode has been sponsored by Rollbar and GoCD.

59:34 Thank you both for supporting the show.

59:36 Rollbar takes the pain out of errors.

59:39 They give you the context and insight you need to quickly locate and fix errors that might have gone unnoticed until your users complain, of course.

59:47 As Talk Python To Me listeners, track a ridiculous number of errors for free at rollbar.com/talkpythontome.

59:54 GoCD is the on-premise, open-source, continuous delivery server.

59:59 Want to improve your deployment workflow but keep your code and builds in-house?

01:00:03 Check out GoCD at talkpython.fm/GoCD and take control over your process.

01:00:10 Are you or a colleague trying to learn Python?

01:00:12 Have you tried books and videos that just left you bored by covering topics point by point?

01:00:17 Well, check out my online course, Python Jumpstart, by building 10 apps at talkpython.fm/course to experience a more engaging way to learn Python.

01:00:25 And if you're looking for something a little more advanced, try my Write Pythonic Code course at talkpython.fm/Pythonic.

01:00:34 Be sure to subscribe to the show.

01:00:35 Open your favorite podcatcher and search for Python.

01:00:38 We should be right at the top.

01:00:39 You can also find the iTunes feed at /itunes, Google Play feed at /play, and direct RSS feed at /rss on talkpython.fm.

01:00:49 Our theme music is Developers, Developers, Developers by Corey Smith, who goes by Smix.

01:00:53 Corey just recently started selling his tracks on iTunes, so I recommend you check it out at talkpython.fm/music.

01:01:00 You can browse his tracks he has for sale on iTunes and listen to the full-length version of the theme song.

01:01:05 This is your host, Michael Kennedy.

01:01:07 Thanks so much for listening.

01:01:09 I really appreciate it.

01:01:10 Smix, let's get out of here.
