Parsing horrible things with Python

Episode #85, published Thu, Nov 17, 2016, recorded Thu, Oct 27, 2016

Do you have horribly convoluted things that need parsing? Obviously you'll learn a bunch of tips and tricks from this episode. But you'll see that advanced parsing is a gateway to many interesting computer science techniques.

Listen in as I speak with Erik Rose about his journey to parse weird things at Mozilla.

Links from the show:

Erik on Twitter: @erikrose
parsimonious: pypi.org/project/parsimonious
Erik on GitHub: github.com/erikrose
PyCon Talk: Parsing Horrible Things with Python: youtube.com/watch?v=tCUdeLIj4hE
Poetic APIs Talk: pyvideo.org/pycon-us-2014/designing-poetic-apis.html
fathom-web project: npmjs.com/package/fathom-web
NLTK Project: nltk.org
Mozilla's DXR: wiki.mozilla.org/DXR
Let's Encrypt: letsencrypt.org
Turtles: github.com/erikrose/turtles
more-itertools package: pypi.org/project/more-itertools
Erik's Blog at Mozilla: blog.mozilla.org/webdev/author/erosemozilla-com
Things GTD App: culturedcode.com/things
Project Quantum: medium.com/mozilla-tech/a-quantum-leap-for-the-web-a3b7174b3c12
Book: Getting Things Done, The Art of Stress-Free Productivity: amzn.to/2gupfs9
Michael's Data Science Pythonic Webcast: crowdcast.io/e/pythonic

Episode Deep Dive

Guest Introduction and Background

Erik Rose is a seasoned Python developer and active member of the open-source community, currently working at Mozilla. He is well known for projects such as DXR (a code navigation tool for large codebases like Firefox), his “mad scientist” project Fathom for semantic web extraction, and his contributions to the Let’s Encrypt initiative. Erik also created the Parsimonious parsing library, which offers a simple but powerful approach to handling complex and “horrible” data. His journey into Python began by bridging VBScript to Python in a Windows environment, and he has since specialized in advanced parsing, Pythonic design, and community-building best practices.

What to Know If You're New to Python

Here are a few quick suggestions to help you understand some of the parsing concepts discussed:

Make sure you’re comfortable with basic text manipulation in Python (string methods, slicing, and regular expressions).
Understand Python’s core data structures (lists and dicts) and how nesting them produces trees, since tree-based structures come up throughout.
Know how to install external libraries (via pip) and import them so you can experiment with libraries like pyparsing, PLY, and parsimonious.
Brush up on Python’s functions and modular design so you can organize both parsing logic and the intermediate representations (ASTs, parse trees, etc.).

Key Points and Takeaways

Why Advanced Parsing Matters: While regular expressions can solve simple text extraction problems, they break down with nested or highly complex input. This episode shows how robust parsing solutions in Python open the door to deeper computer science concepts (compilers, DSLs, natural language processing, etc.). Erik emphasizes that tackling “horrible things” to parse can be both a challenge and a pathway to bigger opportunities in software design. Once you move beyond trivial matching, structured parse trees enable transformation, analysis, and reusability in ways regex alone cannot.
Regex vs. Real Parsers: In the conversation, Erik highlights that a long series of regular expressions is often brittle and unreadable for significant projects. True parsing libraries support arbitrarily nested structures and produce parse trees you can analyze or transform. This is critical when building maintainable code around formats such as MediaWiki markup, programming languages, or deeply nested data formats.
- Tools and Links:
  - PyParsing on SourceForge
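To make the nesting point concrete, here is a minimal sketch using only the standard library. The `parse_nested` function is a hypothetical illustration, not code from the episode or from any parsing library: a single regular expression only finds the innermost group, while a small recursive parser recovers the whole nested structure.

```python
import re


def parse_nested(text):
    """Parse balanced parentheses into nested lists (illustrative helper)."""
    pos = 0

    def parse_group():
        nonlocal pos
        items = []
        while pos < len(text):
            ch = text[pos]
            if ch == "(":
                pos += 1                      # consume "("
                items.append(parse_group())   # recurse for the nested group
                if pos >= len(text) or text[pos] != ")":
                    raise ValueError("missing ')'")
                pos += 1                      # consume ")"
            elif ch == ")":
                break                         # let the caller consume it
            else:
                start = pos
                while pos < len(text) and text[pos] not in "()":
                    pos += 1
                items.append(text[start:pos])  # a run of plain characters
        return items

    tree = parse_group()
    if pos != len(text):
        raise ValueError("unexpected ')'")
    return tree


sample = "a(b(c)d)e"
# A classic regex can only see groups that contain no further parentheses.
flat = re.findall(r"\(([^()]*)\)", sample)
tree = parse_nested(sample)
print(flat)   # ['c']
print(tree)   # ['a', ['b', ['c'], 'd'], 'e']
```

The recursive function mirrors the shape of the data, which is what lets it handle arbitrary depth.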
Intermediate Representations (ASTs / Parse Trees): Generating a parse tree (or an Abstract Syntax Tree) is central to building flexible, maintainable text-processing solutions. Once you have the data in a structured form, you can render or transform it however you like: convert to HTML, strip out markup, or even produce alternate text-based formats. This separation of concerns, parsing first and transforming second, drastically reduces complexity.
- Tools and Links:
  - Parsimonious documentation
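As a rough illustration of "parse first, transform second", the sketch below builds a tiny tree by hand and renders it two ways. The node classes (`Text`, `Bold`, `Italic`) and the renderer names are assumptions made up for this example; a real parser would produce the tree from markup.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Text:
    value: str


@dataclass
class Bold:
    children: list


@dataclass
class Italic:
    children: list


# Which HTML tag each formatting node maps to (illustrative choice).
HTML_TAGS = {Bold: "strong", Italic: "em"}


def to_plain(nodes):
    """Render the tree as plain text, ignoring formatting nodes."""
    parts = []
    for node in nodes:
        if isinstance(node, Text):
            parts.append(node.value)
        else:
            parts.append(to_plain(node.children))
    return "".join(parts)


def to_html(nodes):
    """Render the same tree as HTML, keeping formatting."""
    parts = []
    for node in nodes:
        if isinstance(node, Text):
            parts.append(escape(node.value))
        else:
            tag = HTML_TAGS[type(node)]
            parts.append(f"<{tag}>{to_html(node.children)}</{tag}>")
    return "".join(parts)


tree = [
    Text("Parsing is "),
    Bold([Text("not "), Italic([Text("that")])]),
    Text(" hard."),
]
print(to_plain(tree))  # Parsing is not that hard.
print(to_html(tree))   # Parsing is <strong>not <em>that</em></strong> hard.
```

Adding another output format means writing one more small renderer; the parsing side does not change.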
Chomsky Hierarchy and Context-Free Grammars: The episode dives briefly into formal grammar theory, such as why some tasks are context-free but others are context-sensitive. Python itself is nearly context-free, except for its indentation rules. This impacts what type of parser or technique you can use (e.g., LR(1), PEG, etc.). Understanding these grammar classes is especially relevant when you need advanced lookahead or custom logic for your parser.
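One way to see how Python sidesteps the indentation problem is to look at its tokenizer, which turns changes in indentation into explicit INDENT and DEDENT tokens before the grammar ever sees them. The snippet below is a small sketch using the standard `tokenize` module; the sample source string is made up for illustration.

```python
import io
import tokenize

source = "if x:\n    y = 1\nz = 2\n"

# generate_tokens takes a readline callable that yields lines of text.
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
names = [tokenize.tok_name[tok.type] for tok in tokens]

# Keep only the tokens the tokenizer synthesized from leading whitespace.
layout = [name for name in names if name in ("INDENT", "DEDENT")]
print(layout)  # ['INDENT', 'DEDENT']
```

With block structure expressed as ordinary tokens, the rest of the grammar can treat them much like opening and closing braces.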
PEG Parsers and the PackRat Algorithm: Parsimonious is built upon Parsing Expression Grammars (PEGs) coupled with “PackRat” memoization. PackRat caching keeps partial parsing results around so you don’t redundantly do the same parsing work over and over. This ensures you can handle complex grammars with possible backtracking in linear time.
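The effect of that caching can be sketched with a toy recognizer. This is not how Parsimonious is implemented; it is a hand-written illustration with a made-up two-rule grammar, using `functools.lru_cache` to stand in for the packrat table keyed on rule and position.

```python
from functools import lru_cache


def count_term_calls(text, memoize):
    """Recognize `text` and count how often the `term` rule body really runs.

    Toy PEG (illustrative only):
        expr <- term "+" expr / term
        term <- "(" expr ")" / digit
    """
    calls = 0

    def expr(pos):
        # First alternative: term "+" expr
        end = term(pos)
        if end is not None and end < len(text) and text[end] == "+":
            rest = expr(end + 1)
            if rest is not None:
                return rest
        # Backtrack to the second alternative, which parses term at pos again.
        return term(pos)

    def term(pos):
        nonlocal calls
        calls += 1
        if pos < len(text) and text[pos] == "(":
            end = expr(pos + 1)
            if end is not None and end < len(text) and text[end] == ")":
                return end + 1
            return None
        if pos < len(text) and text[pos].isdigit():
            return pos + 1
        return None

    if memoize:
        # Packrat idea: remember the result of each rule at each position.
        expr = lru_cache(maxsize=None)(expr)
        term = lru_cache(maxsize=None)(term)

    matched = expr(0) == len(text)
    return matched, calls


plain = count_term_calls("((((1))))", memoize=False)
packrat = count_term_calls("((((1))))", memoize=True)
print(plain)    # (True, 62)
print(packrat)  # (True, 5)
```

Without the cache, every extra level of nesting roughly doubles the work because the backtracking alternative re-parses the same span; with it, each rule runs once per position.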
MediaWiki and Parsing ‘Weird’ Formats: One of Erik’s motivating stories comes from having to parse MediaWiki’s markup language (the syntax used on Wikipedia). The original approach used dozens of regexes in a loop, which was difficult to maintain. By switching to a real parser and building an explicit grammar, Erik and his team gained clarity and maintainability, and could add new directives easily.
Why Erik Wrote Parsimonious: Erik researched PyParsing and PLY but needed a slightly different approach that was more readable and allowed forward references in the grammar. Parsimonious uses a domain-specific language for grammar definition as multiline text blocks, keeping it easy to read and adapt. This approach also leads to simpler debugging and potential compile-time optimizations that reduce overhead.
- Tools and Links:
  - Parsimonious GitHub
DXR at Mozilla: Erik works on DXR, a tool for navigating huge codebases like Firefox, which he puts at roughly 17 million lines of code. DXR needs advanced code analysis, indexing, and searching, highlighting how a deeper grasp of parsing and compilers can help with real-life large projects.
- Tools and Links:
  - DXR GitHub
Fathom: Semantic Extraction from the Web: Fathom is Erik’s “mad scientist” project to teach browsers (or web analysis tools) to identify meaningful semantic units in a webpage, like headers, footers, ads, or even next/previous buttons. It expresses extractors as unordered rules (an approach inspired by Prolog) that raise or lower the scores of page elements. By focusing on semantic extraction, Fathom aims to simplify tasks like data extraction and user interface customizations.
- Tools and Links:
  - Fathom GitHub
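The idea of unordered scoring rules can be sketched in a few lines. Fathom itself is a JavaScript library, and this Python snippet does not reproduce its API; the `Rule` class, the dictionary-based elements, and the additive scoring are all assumptions chosen to keep the illustration small.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    matches: Callable[[dict], bool]  # predicate over one page element
    boost: int                       # score adjustment when it matches


def score(element, rules):
    """Sum the boosts of every matching rule; rule order is irrelevant."""
    return sum(rule.boost for rule in rules if rule.matches(element))


def best_match(elements, rules):
    return max(elements, key=lambda element: score(element, rules))


# Hypothetical, simplified stand-ins for DOM elements.
elements = [
    {"tag": "a", "text": "Next", "classes": ["pager"]},
    {"tag": "div", "text": "Next big sale!", "classes": ["ad"]},
    {"tag": "a", "text": "Home", "classes": ["nav"]},
]

base_rules = [
    Rule(lambda el: "next" in el["text"].lower(), 2),
    Rule(lambda el: el["tag"] in ("a", "button"), 1),
]
# A third party can refine the extractor by appending rules of its own.
extra_rules = [Rule(lambda el: "ad" in el["classes"], -3)]

rules = base_rules + extra_rules
winner = best_match(elements, rules)
print(winner["text"])  # Next
```

Because each rule contributes independently, shuffling or extending the rule list never changes how existing rules behave, which is the maintainability property described above.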
Open Source, Let’s Encrypt, and Security: Erik also discusses Let’s Encrypt, which revolutionized SSL certificates by making them free and automatable, so that more sites can run under HTTPS by default. The discussion underscores the importance of trusting your software sources and verifying dependencies, something Erik brought into pip itself with hash-checking mode (added in pip 8).
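The core of hash-based verification is simple enough to sketch with the standard library. This is not pip's implementation; the function names and the byte strings standing in for a downloaded archive are hypothetical, and the only idea borrowed is comparing a SHA-256 digest against a previously pinned value.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_pin(data: bytes, pinned_hex: str) -> bool:
    """True only if the content hashes to the pinned digest."""
    return hmac.compare_digest(sha256_hex(data), pinned_hex)


# Pretend these bytes are a package archive recorded at pin time.
original = b"pretend package archive"
pinned = sha256_hex(original)

print(matches_pin(original, pinned))                   # True
print(matches_pin(b"tampered archive bytes", pinned))  # False
```

Any change to the content, however small, produces a different digest, so a mismatch is a signal to stop before installing.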
Productivity and Mindset (GTD and Code Review): Erik shares how practicing GTD (Getting Things Done) helped him reduce stress and handle obligations from open-source responsibilities. He also emphasizes welcoming community interactions, exemplified by code reviews and encouraging contributions at Mozilla. A constructive and empathetic review process lets volunteers and maintainers thrive.

Tools and Links:
- David Allen’s GTD (general reference, not Python-specific)
- Mozilla’s Code Review guidelines (for reference)

Interesting Quotes and Stories

"Well, nobody in their right mind really looks forward to parsing horrible things, but it's something we have to do." -- Erik Rose

"Advanced parsing is actually a gateway to many interesting computer science techniques." -- Michael Kennedy

"Once you have a tree-shaped intermediate representation, you can do anything you want with it." -- Erik Rose

Key Definitions and Terms

PEG (Parsing Expression Grammar): A type of formal grammar offering a simpler and more flexible way to define how a language is parsed, used by libraries like Parsimonious.
PackRat Algorithm: A memoization strategy in parsing that stores partial results to avoid repeated work, guaranteeing linear time parsing in PEG-based systems.
Context-Free Grammar: A grammar where each production rule’s left-hand side is a single nonterminal, allowing for easier parsing. HTML and many languages fit here in most cases.
LR(1) Parsing: A popular bottom-up parsing technique that looks ahead one token. PLY uses the closely related LALR(1) variant.
Abstract Syntax Tree (AST) / Parse Tree: A tree-based internal structure representing the syntactic organization of code or data, separate from textual representation.

Learning Resources

Parsimonious GitHub: Official repo with docs for Erik’s parsing library.
pyparsing: Another Python parsing library mentioned in the episode.
Python for Absolute Beginners: A great place to start if you want a solid foundation in Python before diving into advanced topics like custom parsing.

Overall Takeaway

This episode highlights that while parsing is often seen as drudgery, doing it properly unlocks higher-level capabilities in software. By building structured representations of messy data, you can transform it for varied use-cases and tap deeper computer science knowledge. Whether you’re building new DSLs, analyzing massive codebases, or extracting rich semantics from the web, advanced parsing in Python is both approachable and powerful.

Episode Transcript

00:00 Do you have horribly convoluted things that need parsing?

00:02 Well, you're in luck because this week we're covering parsing horrible things in Python.

00:08 Obviously, you'll learn a bunch of tips and tricks from this episode, but you'll see that advanced parsing is actually a gateway to many interesting computer science

00:16 techniques. Listen in as I speak with Erik Rose about his journey to parse weird things at Mozilla.

00:21 This is Talk Python To Me, episode 85, recorded October 27, 2016.

00:28 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the

00:57 ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where

01:02 I'm at mkennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the

01:08 show on Twitter via at Talk Python. This episode has been sponsored by GoCD and Hired. Thank them

01:14 both for supporting the show by checking out what they have to offer during their segments.

01:17 Hey, everyone. A little bit of news for you. So first of all, I'm doing a webcast next week,

01:24 Tuesday, November 22, at 11 a.m. Pacific time. The topic is Write Pythonic Code for Better Data Science.

01:31 And I've partnered with Kevin Markham from Data School. And this is 100% free. You can drop in.

01:36 And it's kind of like a super miniature version of my Write Pythonic Code course. So if you want to come

01:42 check it out and register, just go to crowdcast.io slash e slash Pythonic or click that same link in

01:49 the show notes. I was talking to the PyCharm team this week, and they agreed to give away a free copy

01:55 of PyCharm Professional every week to one lucky listener. So all you have to do to be in the running

02:01 for this is be a friend of the show. Just visit talkpython.fm/friends, enter your email address,

02:07 and randomly I'll choose an email address out of the list, and somebody will win. And because we're

02:12 doing every week, the odds are pretty decent for you to get a copy. Let's get to this excellent

02:16 interview about parsing horrible things with Erik Rose. Erik, welcome to Talk Python.

02:22 Thanks for having me, Michael. Pleasure to be here.

02:23 Yeah, it's great to have you. I'm really looking forward to parsing horrible things and a bunch of

02:28 other stuff.

02:29 Well, nobody in their right mind really looks forward to parsing horrible things, but it's something we

02:33 have to do.

02:33 Well, when you have horrible things and you want to parse them, someone's got to do it, right?

02:37 That's right.

02:38 So I think it's, we're going to talk about some really cool techniques, some libraries,

02:42 some algorithms, and whatnot to pull that off, and some interesting work and talks that you've done.

02:47 But of course, before we get into those, let's start with your story. How did you get into

02:50 programming and Python?

02:50 Well, I mean, I used to be a Legos kid, right? You know, like snapping those things together.

02:55 And of course, the thing about being a kid is you have no economic power, so you've only got the

02:59 ones you've got in the box, and you get at birthdays, and you get at Christmas or what have you.

03:03 And so I kind of discovered programming as a way to program with, to build with Legos without ever

03:09 worrying about running out of parts. And so I became kind of an early collector of programming

03:13 languages, you know, before the internet. So you have to get these things kind of happen upon them

03:17 on floppy disks or whatever. And I did BASIC and I did HyperCard and kind of learned programming

03:22 through mimicry like we all did in the 80s. You know, you get these magazines with program listings

03:28 in them. And you just kind of have to type them in if you want to play with them. And so you're just forced to plow

03:34 linearly through these things, making mistakes throughout. And that kind of teaches you debugging and proofreading.

03:39 And you come to all kinds of wrong conclusions. Like I remember in HyperCard, there's this statement global

03:45 start flag, which of course is a global variable declaration, but I thought it was this round checkerboard cursor,

03:51 which totally looked like a global start flag to me. So, you know, fast forward two dozen languages later

03:57 and I ended up working at a university over at Penn State and we had this crazy seminar registration

04:03 system called SEMREG. Maybe they're still running it. I really hope not. And it was, well, the simple thing

04:10 I could say is it was written in VBScript, but really it was written in PHP and then transliterated into

04:17 VBScript via a series of regular expressions. Wow. That's crazy. I've never heard of anything

04:22 being compiled from PHP to VBScript. Absolutely insane. And compiled is probably too kind of a word,

04:27 but, so I didn't realize this when I took the job, you know, I've become a lot smarter about

04:33 interviewing since then, I hope. And I thought I'm going to give this a chance. You know, I'm this big

04:37 Mac guy and I, it was, it was the height of the platform wars and Windows was terrible, but I, you know

04:42 what? I'm going to, I'm going to not be a jerk. I'm going to give this thing a chance. I'm going to see if I

04:45 can write stuff in VBScript. And the short answer is you can't write anything in VBScript.

04:49 The thing has classes, but no inheritance. And so, you know, with a little detour through making

04:56 my own prototype-based inheritance language out of the call statement, I thought, okay, we need to

05:03 bridge this over to something more usable. And Python was very well supported at the time on Windows

05:09 scripting host, which allowed me to do ridiculous things like share a database connection between the

05:14 legacy VBScript code and the new Python code. And that's how I got into Python, believe it or not.

05:20 Interesting. Because if you can get it to run in the script host, you can basically get it in the same

05:24 environment as this other bad thing, right? Yeah. You can kind of mash it all together. It'll bridge

05:28 strings to strings. It'll actually bridge objects to objects and methods to methods, numbers to numbers.

05:34 And if you bang on it hard enough, you can get database handles shared and share a single

05:38 transaction between the two languages. So I was very fortunate to happen upon Python that way.

05:43 Yeah, that's great. And it looks like you've been doing a bunch of Python since then, huh?

05:47 Well, Python is a really nice little patchwork of, of, of stolen bits of language.

05:53 There's not a whole lot unique about Python, which I think is one of its strengths apart from maybe the

05:57 with statement, but it knows where to steal all the best things, the list comprehensions out of Haskell,

06:03 for example. And, you know, hey, we're going to steal, you know, class-based object orientation

06:08 from wherever that came from, Smalltalk, I guess, but we'll also have top level functions from any

06:13 number of languages. So I've been pretty happy with it. Yeah, absolutely. And it's still,

06:17 still on its way to doing that thing with like async and await in the latest version, right? Which is

06:22 a fantastic language feature. Yeah. We'll see where that goes. I haven't played with it myself.

06:26 Yeah. I'm still looking for a good use case for that as well, but it's, it's definitely a neat

06:29 concept. So how about now? What do you work on these days?

06:33 Well, day to day, I maintain a project called DXR over at Mozilla, which doesn't really stand for

06:39 anything, but is a language analysis and navigation tool for large code bases. So

06:45 Firefox is something like 17 million lines of code. You know, if you want to make a change,

06:50 you've got to be thinking, well, what am I going to break with this change? Who am I affecting?

06:53 What functions call this function that I want to change? What, what eats the result of the

07:00 contents of this variable that I want to alter or, what invariant might I be violating here?

07:06 And DXR answers those kinds of questions through a structured query language and through both text

07:12 search and a cleverly accelerated regular expressions search. So you can get through these 17 million

07:17 lines with a regex in, you know, sub second. Wow. And how you said 17 million lines of code. That's,

07:24 that's quite impressive. That's bigger even than I thought it would be, but I knew that was huge.

07:29 How many languages are involved in that? Oh, that's a crazy question to ask. Well,

07:34 okay, I'll just go with language. So HTML is a language though, not programming language,

07:39 arguably CSS is Turing complete. Someone has proved as long as it's level three. So that's a programming

07:45 language. Now a lot of the UI is written in JavaScript. A lot of the down and dirty stuff

07:50 is written in C++. And more recently we've begun importing Rust into the code base.

07:55 Rust is now a part of the released Firefox and, and more and more with the release of Project Quantum,

08:01 which was just announced last week, will be ported over to Rust. So it's, so we can make more

08:07 guarantees so it can be safer and more concurrent. Nice. And what's, what is Project Quantum?

08:12 Project Quantum, let's see if I can get this right, is a Mozilla project that just became unsecret to import a lot of our

08:22 experimental Rust-based rendering pipeline into Gecko, which is the current released

08:28 Firefox pipeline. So Project, have I mentioned Servo already? I forget.

08:33 No, the sort of thing we're importing things from is Servo, an experimental web renderer written in Rust,

08:40 a CSS renderer, HTML, all that jazz. And, yeah, we're, we're pulling bits of that into Gecko.

08:47 Okay. That's awesome. It's a very exciting time that'll let us be more concurrent and use all these

08:52 different cores on all these different things. You know, phones even have four cores now.

08:55 Yeah. Watches even have multi cores now.

08:57 Yeah. It's crazy. It's a crazy time to live.

08:59 It is a crazy time. You said another thing you're working on is Fathom. What's Fathom?

09:03 So Fathom kind of fits into all this, this parsing subject. Fathom is my new kind of toy over at

09:08 Mozilla. It's a mad scientist project to see if we can make it easier to write semantic extractors for the web.

09:14 So an example of a semantic extractor you might know is something along the lines of readability

09:18 or a browser's reader mode, where it just pulls the content text out and dispenses with headers and

09:24 footers and ads and such. But other things you might want to extract are, Hey, let's, let's teach a browser

09:30 to recognize what a previous or next button looks like. So that maybe the browser can let you assign

09:35 a keystroke to that in the general sense and not have to chase them as they bounce around as you advance

09:40 through a slideshow. Or maybe we can teach the browser at a deeper level to appreciate what an

09:45 advertisement is or what a navigation element is. So maybe we can collapse those on small screens and

09:50 hide them in a menu. There really, there's really endless potential to this. And I'm trying to

09:56 fix it. So those extractors are easier to maintain, faster to write and become less of a mess.

10:02 If you were to read the readability source code, which is kind of the thing that both Firefox's

10:08 reader mode and Safari's reader mode are based on, you get the sense that it's been written by hand,

10:13 maintained over time. There is state flying everywhere. It's hard to tell where to make tweaks.

10:21 So what Fathom does is express these extractors as lists of unordered rules in a Prolog sense. If

10:28 you've ever played with Prolog, it's just a whole bunch of kind of logical statements. And then

10:32 the environment figures out how to fit those together and run them. Well, Fathom works along those

10:37 same lines. And as a result, since order doesn't matter, third parties can tweak and, can tweak an

10:45 existing extractor just by inserting their own rules. They can say here, whenever you see a,

10:50 whatever kind of element, for example, I want to boost the score by this much or lower it by that

10:55 much. Nice. So you could put like understanding of time of calendars and dates and whatnot, possibly

11:01 as another rule. Exactly. Hey, this looks like a calendar or, Hey, this looks like a payment form.

11:05 Let me help you fill it. Yeah. Okay. Yeah. That's awesome. Is that in Firefox yet? Or is this,

11:11 this is just a project so far? We hope to get it in so that other things

11:15 can make use of it more easily, but it's already embeddable within Firefox add-ons and has

11:20 been embedded in at least one add-on. It's kind of a Fathom debugging add-on.

11:25 It runs on the server side. It's all just kind of vanilla ES6. You can compile it down to ES5 for

11:30 older browsers. It kind of runs all over the place. I wrote it in JavaScript despite not liking

11:35 JavaScript myself so that we could get it popular. Yeah. Sometimes, sometimes you got to go with

11:41 that. Well, that sounds really cool. And then another thing that you're involved in is

11:44 Let's Encrypt, which is very exciting. Yes. Let's Encrypt has probably got the easiest

11:49 business plan I've ever heard, which is to give away a hundred dollar bills.

11:52 How's that work? How do I, how do I get my a hundred dollar bill?

11:55 Well, if you've ever bought a, well, it's very easy. So certificates, this is all certificates have historically been fairly expensive on the order of a hundred bucks.

12:04 And there are some other ways to get cheaper ones through StartSSL or Comodo or whatever, which is kind

12:09 of a hassle. But what we do is we have a little command line tool that you run. And if,

12:14 if you can authenticate yourself to our little certificate-creating server through a DNS record or

12:21 putting a little thing on your server temporarily, then we will give you a cert, which is recognized by

12:27 all the major browsers. So really there's no way, no reason not to use it at this point.

12:32 Yeah. So having encrypted content is obviously important if you're doing like e-commerce or something

12:38 like that, but it's also just becoming increasingly important to be a first class citizen on the web,

12:43 right?

12:44 It really is. It, I mean, it gives you the impression of trust. First of all, if people are

12:48 watching their URL bars, but also it's important for your visitors just to have the requests be private.

12:53 If I'm surfing to Wikipedia and not doing anything particularly suspicious, but say I

12:59 suspect I have a medical condition and I'm reading about all these different skin diseases or whatever

13:04 is ailing me. I don't want my ISP logging that away and selling it to their marketing partners,

13:10 which until the FCC's ruling last week was perfectly legal to do. State actors, I mean,

13:16 the best defense we have against really anybody in the future coming to power and looking into our past

13:23 and seeing things they don't like, and then coming down on us is keeping things that are our business,

13:27 our business and using SSL certificates and surfing to secure sites is the best way to do that

13:34 right now.

13:34 Yeah. And the fact that Let's Encrypt is, is free makes that very, very possible. And I think that's,

13:40 that's great. You know, there are some really interesting studies quoted in the original Edward Snowden book

13:47 that came out, I think, by Greenwald. I can't remember that guy's, Glenn Greenwald, I believe.

13:52 Yes. That guy. And it was a great book. And basically, you know, a lot of people say, look,

13:56 I don't care about this privacy stuff. Like I have nothing to hide, but there've been psychological

14:01 studies and social, social studies saying people behave differently. If they know somebody's listening,

14:07 they might not break a rule in private, but they behave differently. They are slightly more private,

14:13 less willing to think, you know, sort of contrarian ideas and at least share them. And the less that

14:19 people are watching the better as far as I'm concerned for, for people in general.

14:24 Yeah. It's really the idea of chilling effects, which came out of some academic institution. It's

14:29 a wonderful phrase. And when you have chilling effects in operation, where you are afraid to do

14:35 perfectly legal things, which otherwise you might not do, democracy really cannot function.

14:40 How democracy works is by means of, well, really the Overton window, this range of things that you're

14:46 allowed to say and think that don't get you kicked out of cocktail parties.

14:50 And this window moves around over time. It's, you know, yesterday's conservative, rather yesterday's

14:56 progressive is today's conservative. And if we're not allowed to play with ideas that are just outside

15:02 the bounds of the Overton window, then the future really has nowhere to pick its new ideas from.

15:07 And we just kind of stagnate and the powers that be become the powers that always will be.

15:12 And the ideas that are, are the ideas that always will be, and we can't really go anywhere.

15:17 So, yeah, I think privacy is key to having a functioning democracy at all.

15:22 Yeah, I totally agree. And if you don't buy that, Google ranks sites higher if they have SSL.

15:28 We all want to rank higher in Google, right? So there's the Google juice.

15:31 There's the final straw to, like, start encrypting stuff. My blog's encrypted. This podcast site is encrypted and so on.

15:38 And I think it's great. Happy to do that on as much as I can.

15:42 I'm wondering, how did you become interested in parsing all these horrible things? Where did you get started with that?

15:48 Well, I guess, I guess it probably started when I was seven years old and playing around with BASIC and HyperCard and thinking,

15:54 oh, you know, I want to write my own programming language because, of course, you're seven and no one has told you that it's hard.

15:59 And if you spend a couple of days trying to do that with just if statements, you end up in this spaghetti-fied mess and you can't get anywhere.

16:07 And you're just amazed that anyone has ever managed to write a language ever.

16:11 So, you know, then put in a little pause of about 25 years and it came up at work.

16:17 Over at Mozilla, we had a support site. We still have a support site.

16:22 Support.mozilla.org. And it's got a wiki and it's got a Stack Overflow clone and all this different stuff to help out people who are using Firefox and our other products.

16:30 And our wiki is powered by not MediaWiki itself, but the MediaWiki syntax for various hysterical purposes, as we like to say.

16:41 And not only do we want the MediaWiki syntax, but we wanted to be able to add our own little directives to the syntax.

16:48 Crazy little things that'll maybe, well, one example is we wanted to change the text that would come up according to which version of Firefox someone was using to visit the site.

16:57 It was very, very difficult to make those edits to the implementation of MediaWiki that we were using, which was a port by David Kramer, very nice little port, of the original MediaWiki machinery.

17:10 Such a direct port, in fact, that it still had dollar signs in the comments from the PHP.

17:15 Right. Yeah, it was originally PHP and you guys translated it over to Python and that helped because at least you were working in Python.

17:22 But you said the way that it worked was the parser was pretty insane.

17:25 Like it was just like a crazy bunch of regexes, right?

17:28 Yeah, there were, I think, 41 regexes.

17:31 And then there was another 2,100 lines of PHP, which just ran them over and over again against the source text, finding and replacing and finding and replacing, hopefully in the right order, interacting with each other, dropping little markers so that they didn't smash over each other when they shouldn't.

17:47 And then hopefully at the end out comes the proper rendered text.

17:51 Of course, in reality, MediaWiki language changes from release to release as they find little corner cases that this crazy slapdash loopy way didn't handle.

18:01 Yeah, you called it the Klingon MediaWiki.

18:04 Yeah, I look at these regexes and I think, well, this is the original Klingon, clearly.

18:08 Yes, obviously.

18:09 Nice.

18:11 And so you were looking for a way to escape this Klingon parser world and create something nicer.

18:19 Like one of the problems you said that was inherent in the algorithm was it would directly parse into its new representation.

18:26 Right.

18:27 There was never an intermediate representation.

18:29 So tell people what, like, is, like, why do you care?

18:32 Like, what's the intermediate representation do for you?

18:35 Well, it gives you flexibility, like any abstraction.

18:37 So let's say we have this imaginary intermediate representation for MediaWiki syntax.

18:42 We bring in the MediaWiki syntax.

18:43 We parse it into a tree because that's what all these things end up as in parsing land.

18:48 And then we have a lot of options.

18:49 We can output plain text from that, ignoring the bits of the tree that say bold or italic.

18:55 We can render out HTML from that, not ignoring those things.

18:59 We can go hunting through for just, say, date or time elements and pull out some date and timey entities.

19:06 Once you have this abstraction, you have any kind of output you like.

19:09 Or you can do any kind of analysis or transformation you like.

19:12 Nice.

19:12 So, for example, if you wanted to possibly represent stuff by a markdown output or an HTML output or plain text, those are super easy because both, like, markdown and HTML have a bold concept.

19:25 One is a bracket, you know, angle bracket strong.

19:27 One is a star.

19:29 But it's, you know, you can do that final translation pretty easily once you have the tree, right?

19:34 Exactly.

19:34 That's the trivial part, going from tree back to a linear representation.
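
A minimal sketch of that idea, not code from the episode: the tree shape (a plain string, or a `(tag, children)` tuple) and every name here are assumptions made for illustration. One tree is rendered three ways, and each renderer decides for itself what to do with bold and italic nodes.

```python
# Hypothetical tree shape: a node is either a plain string or (tag, children).
tree = ("doc", ["Hello ", ("bold", ["brave ", ("italic", ["new"])]), " world"])

HTML_TAGS = {"bold": "strong", "italic": "em"}
MARKDOWN_MARKS = {"bold": "**", "italic": "*"}


def render(node, wrap):
    """Walk the tree depth-first and let `wrap` decide how each tag is emitted."""
    if isinstance(node, str):
        return node
    tag, children = node
    inner = "".join(render(child, wrap) for child in children)
    return wrap(tag, inner)


def as_text(tag, inner):
    # Plain text ignores the bits of the tree that say bold or italic.
    return inner


def as_html(tag, inner):
    # No HTML escaping here; a real renderer would need it.
    name = HTML_TAGS.get(tag)
    return f"<{name}>{inner}</{name}>" if name else inner


def as_markdown(tag, inner):
    mark = MARKDOWN_MARKS.get(tag, "")
    return f"{mark}{inner}{mark}"


print(render(tree, as_text))
print(render(tree, as_html))
print(render(tree, as_markdown))
```

Adding another output format means writing one more small `wrap` function; the parsing side does not change.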

19:38 So the hard part is getting it into a tree, huh?

19:41 Right.

19:41 Now, we're doing that as we talk, which is nice.

19:44 Now, you make a linear sequence of sounds.

19:46 And I hear them.

19:47 And I deconstruct them back into, you know, phrases and word pairs and things and idioms.

19:51 And I say, wow, that's probably what he means.

19:53 And then you do it back from my end.

19:55 Yeah, exactly.

19:55 So pretty much pulling structure out of any flat, linear stream of data is parsing, right?

20:02 Exactly.

20:03 And so the applications are as wide as you like.

20:06 I mean, anything that has any kind of structure, text, sound, you can think of musical phrases

20:11 and the sorts of, a lot of music has a theme in variations or development of a thematic statement.

20:17 On one hand, I sort of can see how you would do some of this with regular expressions.

20:22 But it's really the goal is to move it into this parse tree because then it's not just transforming

20:27 one bit of text into another text, but it's actually transforming into the structure that

20:30 we can do all kinds of interesting things, huh?

20:33 Well, right.

20:33 And more formally, regular expressions can't really support nesting.

20:37 Now, there are little corner cases that people were going to write in about where things that

20:42 are called regular expressions nowadays, like in Perl, have support for nesting bolted on,

20:47 but then they cease to become proper regular expressions.

20:49 Exactly.

20:50 Proper parsers, on the other hand, descend into the, what we call, like, Chomsky type one or

20:56 two grammar and can support nesting.

20:59 So you could say, understand sets of nested parentheses of arbitrary depth.

21:04 And that's how things like, well, HTML, for example, are.

21:08 You could have a bold tag with an italic tag in it with another bold tag in it ad infinitum.
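
To make the nesting point concrete, here is a small recursive sketch (the function name and the nested-list output are assumptions for illustration). It turns parentheses of arbitrary depth into nested lists, which a classical regular expression cannot do because it has no way to count how many levels are still open.

```python
def parse_parens(text):
    """Parse a string of nested parentheses into nested lists.

    Raises ValueError when the parentheses are unbalanced or when any
    character other than '(' or ')' appears.
    """

    def parse_items(pos, depth):
        items = []
        while pos < len(text):
            ch = text[pos]
            if ch == "(":
                # Recurse: the call stack remembers how deep we are.
                child, pos = parse_items(pos + 1, depth + 1)
                items.append(child)
            elif ch == ")":
                if depth == 0:
                    raise ValueError(f"unexpected ')' at position {pos}")
                return items, pos + 1
            else:
                raise ValueError(f"unexpected character {ch!r} at position {pos}")
        if depth > 0:
            raise ValueError("unclosed '('")
        return items, pos

    tree, _ = parse_items(0, 0)
    return tree


print(parse_parens("(()())"))
```

The recursion is what gives the parser its memory of open groups; the same shape works for nested tags once the tokens carry names.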

21:13 Yeah.

21:13 Div, span, A, all sorts of stuff.

21:17 Yeah.

21:17 More div.

21:18 And in addition to the nesting, a proper parser gives you the ability to read what you're doing,

21:22 essentially.

21:23 Regexes are famous for being a write-only language.

21:27 Absolutely.

21:27 I definitely think of them as write-only.

21:29 Yeah.

21:29 And you can comment them as much as you like and put white space in there, but it still

21:33 becomes, you know, at least awkward to name sub-patterns, to repeat sub-patterns, and to

21:39 maintain something where you have repeated bits of regex.

21:42 Sure.

21:42 Well, and as it gets increasingly complicated, the more you need the regular expression, the

21:46 less you can understand it, I think.

21:48 Yeah, for sure.

21:48 There seems to be a ceiling for one reason or another with straight-out regexes.

21:52 Yeah.

21:52 So you talked about there being different types of grammars that are different complexities

21:58 of grammars.

21:59 Yeah.

21:59 There's the idea of the Chomsky hierarchy of grammars.

22:02 And Chomsky is really a linguist.

22:05 And so he spoke of these grammars in terms of generative power, where, you know, I can

22:11 start from a production, a production name, something like sentence, and then descend in

22:16 one level and say, well, sentences, you know, phrase, and then maybe another phrase, and maybe

22:20 another phrase.

22:21 And what are the phrases?

22:22 Well, we have a subject, and we have verb, and we have an object.

22:25 And he's interested in creating sentences from the top level.

22:28 Now, as computer scientists, we're usually more interested in going in the opposite direction,

22:34 starting with this complete sentence, and then kind of inferring out the structure, recognizing,

22:40 as they call it in the literature.

22:41 So Chomsky came out with Chomsky hierarchy of grammar, something like that.

22:46 Anyway, levels 0, 1, 2, and 3.

22:49 And type 0 is basically a free-for-all.

22:53 It kind of comes out looking like a directed graph.

22:56 And there's not really any readily available kit for parsing that sort of thing, or any algorithmic

23:02 bounds to its complexity.

23:03 When we get down to a level 1, that's your context-sensitive grammars.

23:08 And Python is actually a context-sensitive grammar, I believe

23:13 only because it has whitespace sensitivity.

23:15 So I was actually trying to parse something like Python the other day, a little side project

23:20 of mine called Turtles.

23:21 And I ran up against this.

23:23 When you indent in Python, in a normal language, in a curly brace language, when you go inward

23:29 a block, you say, curly brace.

23:31 And then when you end a block, you say, end brace.

23:34 Now, in Python, think about trying to write a tokenizer for Python, trying to find those

23:40 instances where we go in a level or out a level.

23:42 You're reading along, you're reading along.

23:45 You're at the beginning of a line, and you see eight spaces.

23:48 Did you go in a level?

23:49 I don't know.

23:50 What is the previous line?

23:52 Is it four spaces?

23:53 If so, I went in a level.

23:54 So you can see that eight space span is interpreted differently depending on our context, depending

24:01 on what the previous line had done.

24:02 So Python is a context-sensitive grammar.
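
The usual way around this is to let the tokenizer carry the context, so the grammar itself never has to. A rough sketch, assuming spaces only and using made-up token names: a stack of indentation widths turns leading whitespace into explicit INDENT and DEDENT tokens, much like the open and close braces of a curly-brace language.

```python
def indent_tokens(source):
    """Turn leading spaces into INDENT/DEDENT tokens using a stack of widths.

    Simplified for illustration: spaces only, blank lines skipped, no
    handling of tabs, comments, or line continuations.
    """
    stack = [0]  # indentation widths of the blocks currently open
    tokens = []
    for line in source.splitlines():
        if not line.strip():
            continue
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            # Deeper than the enclosing block: one new level opens.
            stack.append(width)
            tokens.append("INDENT")
        else:
            # Shallower: close blocks until the widths line up again.
            while width < stack[-1]:
                stack.pop()
                tokens.append("DEDENT")
            if width != stack[-1]:
                raise ValueError(f"inconsistent dedent: {line!r}")
        tokens.append(("LINE", line.strip()))
    # Close whatever is still open at end of input.
    tokens.extend("DEDENT" for _ in stack[1:])
    return tokens


example = "if x:\n    if y:\n        a\n    b\nc\n"
for token in indent_tokens(example):
    print(token)
```

The same eight spaces produce an INDENT, nothing, or a DEDENT depending on what is on the stack, which is exactly the context sensitivity described above.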

24:05 Now, the rest of Python is at Chomsky level two, which is a context-free grammar.

24:12 So if you were to dig into the Python source code, one of these files there, there's a little,

24:16 let's see if I have it sitting here, a nice little sort of summary where it says, okay,

24:22 here's what a suite is, a series of statements.

24:24 A suite is a statement, statement, statement, or statement star.

24:27 But it's always that.

24:28 A suite is always this.

24:30 An if statement is always that.

24:31 You know, a function definition is always this.

24:34 It doesn't matter what came before or what comes after.

24:36 It's context-free.

24:38 Right.

24:38 Like, the if statement doesn't mean something different if it's in a while loop.

24:43 It always means the same thing.

24:45 So if you know what an if is, like, if you can define the structure of an if, you can define

24:49 its meaning.

24:50 And you don't have to do more interesting parsing.

24:52 Yep.

24:52 All the time.

24:53 All the time, if means the same thing.

24:54 There are no modes to go into or out of.

25:11 This portion of Talk Python To Me is brought to you by GoCD from ThoughtWorks.

25:16 GoCD is the on-premise, open-source, continuous delivery server.

25:20 With GoCD's comprehensive pipeline and model, you can model complex workflows for multiple

25:26 teams with ease.

25:27 And GoCD's value stream map lets you track changes from commit to deployment at a glance.

25:33 GoCD's real power is in the visibility it provides over your end-to-end workflow.

25:38 You get complete control of and visibility into your deployments across multiple teams.

25:43 Say goodbye to release day panic and hello to consistent, predictable deliveries.

25:47 Commercial support and enterprise add-ons, including disaster recovery, are available.

25:52 To learn more about GoCD, visit talkpython.fm/gocd for a free download.

25:58 That's talkpython.fm/gocd.

26:02 Check them out.

26:03 It helps support the show.

26:04 One of the things that you said, a quote from your PyCon 2012 talk, which I'll link to

26:18 the show notes, of course, was that parsing is a gateway drug to other areas of computer

26:24 science.

26:24 It's true.

26:25 I mean, once you have this tree-shaped intermediate representation, you can do anything you want.

26:30 You can get into natural language processing, which if you're doing any of this in Python,

26:35 NLTK is a not-to-be-missed library.

26:38 It, first of all, is a shining example of what all library documentation should be.

26:43 There is an NLTK book freely available and is a fantastic place to start if you want to

26:48 start understanding human language with a computer or doing sorts of data mining and machine learning.

26:53 It's also full of those sorts of algorithms.

26:55 Great piece of kit.

26:56 Yeah.

26:56 Awesome.

26:57 Awesome.

26:57 Also, from the intermediate representation, you can get into programming language design.

27:01 That's fun.

27:02 Make your own little toy language.

27:04 Anybody can make a lisp in a couple of days because the parsing is so darn simple.

27:10 It's just a bunch of nested parentheticals.

27:13 There's a lot of parentheses there.

27:16 Yeah.

27:16 Now I'm a lot less intimidated to reach for custom little query languages.

27:21 Like, DXR has a custom little query language, which looks a lot like Google's custom little

27:25 query language.

27:26 And it's, you know, 10 or 20 lines of grammar description that I feed into my library parsimonious.

27:32 And out comes the tree.

27:33 Yeah.

27:34 That's really cool.

27:35 I feel like when you learn these new techniques and these data structures and algorithms, like, you see problems where you just saw opaqueness before.

27:44 You're like, oh, I could actually apply this thing.

27:46 And out would pop interesting answers.

27:48 Whereas before, you're just like, there's no way I can answer that.

27:50 That's just text blobs, not structure.

27:53 Yeah, for sure.

27:53 It's just pattern recognition.

27:54 The more you've exposed yourself to, the more you'll say, hey, this is just one of those.

27:59 Let's review some of the various options for parsing text in Python today.

28:04 So there's two pretty well-known ones that have been around for a while.

28:09 There's pyparsing and there's PLY.

28:11 So David Beazley's Python Lex-Yacc.

28:14 Yeah, PLY is fantastic.

28:16 And anything that David Beazley does is fantastic.

28:19 You should immediately pause the podcast and go watch all of his talks.

28:24 PLY came out of when he was teaching a course on building your own little Pascal interpreter, I believe.

28:31 PLY has the advantage of being tripped over by hundreds and hundreds of students.

28:36 And they've hit every little corner case and made every possible mistake.

28:39 And so the error reporting is top notch.

28:42 And really, I would seriously recommend taking a look at PLY if the complexity of what you're parsing is amenable to it.

28:42 Now, the limit of PLY and why I couldn't use it for the MediaWiki stuff is that it implements what we call LR(1) parsing; you can look up the formalisms.

29:00 And I probably forget most of them.

29:01 But this is something we did.

29:02 I think most languages that are currently in production probably use something along the lines of LR(1) just because it's very, very memory efficient and CPU efficient as well.

29:13 So it was a thing that we implemented when we had tiny little computers.

29:15 But where you run into problems is that in LR(1), that one means we can only look ahead one token to decide what kind of thing we're recognizing.

29:27 And in the case of MediaWiki, that wasn't actually enough.

29:31 For example, I believe it was internal links that begin with two brackets, bracket, bracket.

29:37 So it's not every time that bracket, bracket means an internal link.

29:42 It has to be followed by, oh, I'm going to get this wrong, but something along the lines of maybe a URL or page name, and then maybe a vertical bar, and then maybe a page title.

29:56 And so you might have to look ahead two or three tokens to see if your internal link is going to work out.

30:02 And if it doesn't, that bracket, bracket is just part of plain text and should be emitted verbatim.

30:07 And so we couldn't use any LR(1) parser, and I put that on my shopping list and ruled out a whole lot of libraries because of it.
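
A hand-rolled sketch of that "try the link, otherwise emit the brackets verbatim" behavior. The link syntax here is a simplified stand-in for MediaWiki's (target, optional `|label`, closing `]]`), and all names are assumptions for illustration. The point is only that the decision needs more than one token of lookahead.

```python
import re

# Tokens: "[[", "]]", "|", a run of ordinary text, or any other single character.
TOKEN = re.compile(r"\[\[|\]\]|\||[^\[\]|]+|.", re.S)


def is_text(token):
    """True for a run of ordinary text (no bracket or pipe characters)."""
    return not set(token) & set("[]|")


def try_link(tokens, pos):
    """Return (node, next_pos) if a complete link starts at pos, else None."""
    if tokens[pos] != "[[":
        return None
    i = pos + 1
    if i >= len(tokens) or not is_text(tokens[i]):
        return None
    target = label = tokens[i]
    i += 1
    if i < len(tokens) and tokens[i] == "|":
        if i + 1 >= len(tokens) or not is_text(tokens[i + 1]):
            return None
        label = tokens[i + 1]
        i += 2
    if i >= len(tokens) or tokens[i] != "]]":
        return None  # the link did not work out; the caller rewinds to pos
    return ("link", target, label), i + 1


def parse(text):
    tokens = TOKEN.findall(text)
    nodes, pos = [], 0
    while pos < len(tokens):
        result = try_link(tokens, pos)
        if result:
            node, pos = result
            nodes.append(node)
        else:
            # Not a link after all: emit this one token as plain text.
            nodes.append(("text", tokens[pos]))
            pos += 1
    return nodes


print(parse("see [[Main Page|home]] now"))
print(parse("a [[b c"))
```

`try_link` may read several tokens before giving up, and when it does, `parse` simply carries on from the original position as if the brackets were ordinary text.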

30:14 Right.

30:14 It just needs to keep more of it in its head, in the algorithm, all at once, right?

30:19 Pretty much, yeah.

30:20 All right.

30:20 So what's the story with pyparsing?

30:22 So pyparsing is what I consider, or was, the canonical Python parsing library, the one I reached for first before I wrote my own, of course, because I'm in that bad habit.

30:32 And pyparsing is, you know, fairly Pythonic.

30:35 It mashes all the grammar definition stuff into Python objects.

30:42 So you literally say in a piece of Python, bold_toggle = Literal("'''"): you construct an object from quote, quote, quote, and say, okay, well, quote, quote, quote is the thing that turns on bold in MediaWiki syntax.

30:57 And then on that object, you call methods like setName("bold_toggle"), so that when you have this tree, you can actually tell that this thing was a bold toggle as you're walking the tree.

31:07 And kind of on you go.

31:09 Now, the disadvantage of this is it's kind of hard to read.

31:14 It's kind of wordy because, after all, we had to make this valid Python.

31:18 And you can't do certain things like make forward references.

31:22 And oftentimes there are cyclical references in a non-trivial grammar.

31:26 In the case of pyparsing, they had to put a little hack around that and say, you know, bold_toggle = Forward(), which is sort of a promise that I'm going to declare something later.

31:37 And then later on, you kind of jam that on.

31:40 For people who haven't seen the syntax of pyparsing, it's kind of like formalized Python that acts as regular expressions.

31:49 So you'll see like Python objects, but they're clearly coming from sort of regular expression land like groups and whatnot, right?

31:58 Well, I mean, groups is a valid word to use when you're talking about parsing.

32:02 After all, we're talking about trees here, and a tree is nothing more than a nested list.

32:06 And so every sub list you can think of as a group.

32:09 So that's what pyparsing is doing with the groups.

32:11 Yeah, okay.

32:12 And pyparsing is of, I think, equal power with parsimonious.

32:17 They're both able to describe all the context-free grammars and probably a subset of context-sensitive ones.

32:25 And parsimonious is your library that you wrote, right?

32:28 Yes, it's my mad scientist experiment.

32:30 Though its version number starts with zero point, I have it in production all over the place.

32:35 There are lots of people using it, so.

32:36 Nice.

32:36 Do you have some examples for what it's being used for?

32:39 Yeah, well, I'm using it in DXR to parse our little query language.

32:42 What awesome people are using it for?

32:44 You know, they don't report back to me, but it gets a lot of downloads.

32:47 Yeah, excellent.

32:48 So it's out there somewhere.

32:49 Excellent.

32:49 So to me, it seems like parsimonious, it's a little simpler to define the grammar.

32:54 Is that right?

32:55 That was really my goal, both to make it simpler, to make it run fast, and to make it optimizable.

33:01 One of the goals you said was frugal RAM use, which I thought was just a great way to phrase it.

33:08 Yeah, and I haven't done a lot of RAM profiling, so I'm not ready to make any claims about that just yet.

33:13 But the formalism underneath parsimonious, which is parsing expression grammars, which come out of a 2004 paper by Bryan Ford, meets that goal just fine.

33:22 I think RAM use is something along the lines of order n to the third, but n is the grammar size, so it's actually not that big an n.

33:31 And as a result of blowing that RAM on caches for each individual little production, each individual little context-free equals,

33:40 You get linear parse time.

33:42 Yeah, which is great.

33:43 Yeah, that's really, really nice.

33:44 So the reason people care about RAM is, I mean, if you're running this over some text and you're just going to dump it out and produce a text version, that's fine.

33:54 But if you're wanting to keep that in memory and say like a web server and continuously serve requests from it or something, then all of a sudden you care way more about RAM, right?

34:04 Well, just imagine serving up a bunch of MediaWiki pages and you've got to be parsing 100 of these at a time on a web server.

34:10 You know, RAM is not free.

34:12 No, it's definitely not free, especially if you've got a lot of traffic.

34:15 So one of the algorithms involved in here is you said that it uses something called the PackRat algorithm, which sounds fun.

34:23 What's that?

34:25 Yeah, I think Bryan Ford came up with that too.

34:26 It's really simple.

34:28 It's what you would come up with yourself if you're implementing one of these.

34:31 These PEG parsers, they're really just recursive descent parsers.

34:36 And as you descend, you might find out, hey, now that I've looked ahead two tokens and I find out this internal link isn't going to work out for me, I need to rewind a little.

34:45 Let me take a couple steps back up the stack and let me try parsing it as plain text, for example, or maybe as an external link, for example.

34:52 And in the course of that, you may need to use a partial parse from a previous stack frame.

34:58 And rather than redoing that work, it makes sense oftentimes to cache the results of each partial parse.

35:04 And that's all the PackRat does.

35:06 Just keeps all these intermediate results around.

35:08 I see.

35:09 Basically to allow you to look ahead an arbitrary number of tokens and then adjust.

35:14 Without paying a penalty of redoing work.
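
A toy illustration of that caching, not parsimonious's actual implementation: the grammar, class, and method names are invented for the example. Each rule result is stored under the key (rule name, position), so when an alternative fails and the parser rewinds, retrying a rule at the same position is a dictionary lookup instead of a re-parse.

```python
class PackratParser:
    """Recursive descent over a tiny grammar with per-(rule, position) caching.

    Grammar (illustrative only):
        expr <- term "+" expr / term
        term <- "(" expr ")" / digits
    Each rule returns (value, next_position) on success or None on failure.
    """

    def __init__(self, text):
        self.text = text
        self.memo = {}  # (rule name, position) -> cached result
        self.hits = 0   # how many times the cache saved us a re-parse

    def apply(self, rule, pos):
        key = (rule.__name__, pos)
        if key in self.memo:
            self.hits += 1
        else:
            self.memo[key] = rule(self, pos)
        return self.memo[key]

    def expr(self, pos):
        # Alternative 1: term "+" expr
        left = self.apply(PackratParser.term, pos)
        if left is not None and self.text[left[1]:left[1] + 1] == "+":
            right = self.apply(PackratParser.expr, left[1] + 1)
            if right is not None:
                return left[0] + right[0], right[1]
        # Alternative 2: rewind and try a bare term; same position, so cached.
        return self.apply(PackratParser.term, pos)

    def term(self, pos):
        if self.text[pos:pos + 1] == "(":
            inner = self.apply(PackratParser.expr, pos + 1)
            if inner is not None and self.text[inner[1]:inner[1] + 1] == ")":
                return inner[0], inner[1] + 1
            return None
        end = pos
        while end < len(self.text) and self.text[end] in "0123456789":
            end += 1
        return (int(self.text[pos:end]), end) if end > pos else None


def evaluate(text):
    """Parse and sum the whole input, returning (value, parser)."""
    parser = PackratParser(text)
    result = parser.apply(PackratParser.expr, 0)
    if result is None or result[1] != len(text):
        raise ValueError(f"could not parse {text!r}")
    return result[0], parser


value, parser = evaluate("1+(2+3)+4")
print(value, "cache entries:", len(parser.memo), "cache hits:", parser.hits)
```

The cache holds at most one entry per rule per input position, which is where the memory cost comes from and also why no rule body ever runs twice at the same place.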

35:16 Exactly.

35:16 Right.

35:17 Yeah.

35:17 Very cool.

35:18 Well, the thing I was mostly trying to do with parsimonious, the thing that really differentiates it so much, was that its grammars look a lot like what you'd find if you looked up the definition of the grammar in a book or in the documentation.

35:30 They're just big blobs of multi-line text in the Python quote, quote, quote way.

35:35 And so it's able to do forward references because it's not bound to the idea of undefined symbols in an outer programming language.

35:42 We're able to do compile time optimizations on it really without limit because we haven't lost anything in its transformation down to a bunch of Python objects prematurely.

35:54 And it's also very easy to read as a result.

35:56 And you also keep the representation part as a separate phase so you can render to multiple formats, which is cool.

36:04 Yes, exactly.

36:05 A lot of these Python, not Python rather, but a lot of these parsing kits tend to inter-twingle output rendering with parsing.

36:12 And that way leads to pain in my experience.

36:17 Certainly rigidity.

36:19 Interesting.

36:20 It sounds like your parsimonious project is really cool.

36:24 And it's on GitHub.

36:24 People can check it out, right?

36:26 Please.

36:26 Send feedback.

36:27 Play around with it.

36:28 Send patches.

36:29 Okay, cool.

36:29 Yeah, one of the things that made me a little bit sad about pyparsing is it's on SourceForge, which I don't know.

36:35 Wow, still?

36:36 Yeah, when I see things on SourceForge, it kind of makes me feel like, oh, I'm not really sure that thing's actually still going.

36:42 Yeah, right on the homepage it says, download now from SourceForge.

36:45 It's like, oh.

36:46 Oh.

36:47 Oh.

36:47 Okay, that's unfortunate.

36:49 It's still fine code.

36:50 I mean.

36:51 Yeah, of course.

36:52 It is a single 3,000 line file.

36:54 But, you know, old doesn't mean bad.

36:58 It means proven.

36:59 Yeah, absolutely.

37:00 Absolutely.

37:00 So that was a really cool talk on parsing horrible things you gave.

37:04 Do you have some other favorites?

37:04 Oh, let's see.

37:05 What else have I done?

37:06 I have a talk called Poetic APIs, where I kind of will expand on the idea that, you know, these grammars, they should be really easy to read.

37:15 In fact, all programs should be easy to read.

37:17 In fact, here are seven, you know, kind of checklist-y things you can bang against what you're writing to make sure that you set a good language for the users of your API.

37:27 Oh, interesting.

37:27 Yeah.

37:28 What we're doing when we're programming is always creating language.

37:31 Every time we name a function, name a variable, create semantics of an object, we're kind of creating the mental model in which everybody who interacts with that code in the future has to play.

37:44 So if we do that irresponsibly, we can really make people think terrible, stupid thoughts that make their jobs hard.

37:50 But if we give them really good symbols that correspond well to reality and are easily composable and flexible, like the requests package is a great example.

38:00 Like, hey, you know what?

38:01 We could just totally represent HTTP requests instead of making everyone think about raw sockets all the time, like urllib.

38:07 Then people can have their efforts magnified.

38:09 Yeah, it really does define the way that you think about a problem, the APIs and whatnot you have to work with, and the language itself, right?

38:17 And so I think Python itself is something, an example of, like, why that's important, right?

38:22 Well, yeah, I mean, so Python is a fairly close match to what we tend to write as pseudocode.

38:29 Right, and the reason we write pseudocode is it's easy to understand and communicate, so.

38:33 Exactly.

38:34 I was just reading the topological sort algorithms on Wikipedia, and you know what?

38:38 If you put some colons in there and take out the eaches, it's about valid Python.

38:42 That's awesome.

38:43 Yeah, I've heard that before, that people have copied algorithms out of Wikipedia, more or less, just straight up.

38:48 Turn that into Python, and it works beautifully.

38:50 That's great.

38:51 Which is also interesting verification of Wikipedia content.

38:54 Yes.

38:55 This portion of Talk Python To Me is brought to you by Hired.

39:10 Hired is the platform for top Python developer jobs.

39:13 Create your profile and instantly get access to 3,500 companies who will work to compete with you.

39:18 Take it from one of Hired's users who recently got a job and said, I had my first offer on Thursday after going live on Monday, and I ended up getting eight offers in total.

39:26 I've worked with recruiters in the past, but they've always been pretty hit and miss.

39:29 I tried LinkedIn, but I found Hired to be the best.

39:32 I really like knowing the salary up front.

39:34 Privacy was also a huge seller for me.

39:37 Sounds awesome, doesn't it?

39:38 Well, wait until you hear about the signing bonus.

39:41 Everyone who accepts a job from Hired gets a $1,000 signing bonus.

39:44 And as Talk Python listeners, it gets way sweeter.

39:46 Use the link Hired.com slash Talk Python To Me, and Hired will double the signing bonus to $2,000.

39:52 Opportunity's knocking.

39:53 Visit Hired.com slash Talk Python To Me and answer the door.

40:04 There's another one that you talked about called the Code Review Review.

40:07 What's the story of that?

40:08 Yeah, this is a newer talk.

40:09 So something we're trying to do at Mozilla is make sure we don't drive people away by being jerks,

40:16 doing code review or otherwise having kind of unwelcoming culture.

40:20 Mozilla is historically and today largely driven by volunteer contributions.

40:25 I mean, even the guy who owns our security sockets layer or something, or the NSS, whatever that is, like the module owner for this, the one who has the final say,

40:36 he doesn't get paid by Mozilla.

40:37 He does something else, and he just takes on this responsibility in his own free time.

40:42 And so it's really important for Mozilla to keep that rolling, you know,

40:46 welcome people into the community, take in contributions, help people level up to become better programmers and more familiar with the project.

40:52 And so the Code Review Review is a piece of our onboarding right now that I'm turning into a more generically applicable talk where we talk about,

41:00 well, you know, how do we create that kind of welcoming atmosphere?

41:02 How do we do a proper review so that good programs come out?

41:07 And how do we do a review such that better programmers come out of it?

41:11 Yeah, that's really important that people, when they come to these new projects

41:15 or when they stick around, that they feel like it's a delightful experience

41:18 because they're doing it of their own free time and energy, right?

41:22 Yeah, for sure.

41:24 You definitely don't want it to be like slogging through hard code.

41:27 I mean, I think an example of the opposite comes to mind is the old Python packaging,

41:33 PyPI web code.

41:35 I talked to Donald Stufft on episode 64, and he was like, a lot of people want to come along and help maintain and evolve this,

41:44 but it's like two files in this hugely complicated old custom web framework that they built for it.

41:52 It's just people look at it, you know, actually, thanks, but no thanks.

41:56 And finally, they're rewriting it at pypi.org, and it's in Pyramid and Bootstrap, and it's lovely.

42:05 But for a long time, I think it turned people away from pushing that project forward,

42:09 and you could tell that it kind of, it was just getting maintained, which is good,

42:13 but it's also evidence of this anti-approachability, I guess.

42:17 Yeah, it's kind of funny how all those old projects tend to be like two enormous files.

42:22 All I can think of is folders must have been expensive 20 years ago.

42:24 Yes, exactly.

42:25 Really expensive.

42:26 All right, so what else?

42:28 We've got a few moments to talk about a few other things, and I know you've got a lot of interesting pieces out there.

42:34 What else is going on?

42:35 You're doing something with pip, right?

42:36 Oh, pip, yes.

42:37 So we deploy a lot of Python at Mozilla, and we used to check everything into a vendor library,

42:44 you know, just to be sure no one had slipped anything under the radar, anything malicious like that.

42:48 And vendor libraries are a pain to maintain.

42:50 You know, you have to update the versions of things, and that creates enormous diffs in your version control,

42:55 and then checkouts take forever, and your checkouts are huge.

42:57 And so we, for a while, ran an internal PyPI mirror.

43:02 A lot of people run their own little index server, and you kind of keep track of who's allowed to upload what to the server,

43:08 and then who did it last, and what versions are going to work with your own projects,

43:13 and you have to keep an access control list and an audit trail.

43:15 And that was a pain, and it slowed things down.

43:18 And then I thought, well, you know, we were actually having a beer and tell.

43:22 We have these little sessions where we have a beverage of our choice and talk about something that we've been playing around with as a side project.

43:28 And I needed something to talk about one day, and I thought, well, why not just hash the results of what you download from PyPI

43:37 and make sure they match some local hash that you've pre-vetted, and then go ahead and install it?

43:44 And so I put this thing together as this little tool called PEEP for prudently examine every package.

43:50 And we ended up moving the whole production lifecycle over to that for a number of years, year or two.

43:57 And then I thought, well, okay, this is proven out.

43:59 And PEEP called deep into pip's internal APIs, and so it would break all the time.

44:04 And it was a pain to maintain.

44:06 I thought, you know what?

44:06 I'm going to lift this up into pip, see if I can get people interested.

44:10 And long story short, I did.

44:12 People were interested, and it's in pip 8 and above.

44:15 So if you're out there deploying Python and running your own index server

44:21 or keeping up to date with a vendor library, hey, consider just putting a bunch of SHA-256 hashes into your requirements file

44:29 with a funny little syntax and running pip 8 over it.

44:32 And it'll vet these things for you.

44:34 I mean, it won't vet them for you.

44:35 You have to make sure that there's nothing malicious in a given version of a package,

44:38 but it'll make sure that what you got that first time is the same thing you got.

44:42 Yeah, once vetted, it'll verify it can't change.

44:44 So what do you do?

44:45 You can't just say, I depend upon SQLAlchemy.

44:48 You've got to say, I depend upon SQLAlchemy 1.0.whatever, and here's its SHA.

44:54 Dash, dash, hash equals whatever it is.
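
In a requirements file, such a line has roughly the shape `examplepkg==1.2.3 --hash=sha256:<digest>` (the package name and version are placeholders), and pip's hash-checking mode can be forced with `--require-hashes`. The check itself is easy to picture; the sketch below shows the idea in plain Python with made-up helper names, and is not pip's own code.

```python
import hashlib
import hmac


def pin(data: bytes) -> str:
    """Return a 'sha256:<hex digest>' pin for the given archive bytes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def matches_pin(data: bytes, pinned: str) -> bool:
    """Check downloaded bytes against a previously recorded pin."""
    algorithm, _, expected = pinned.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algorithm!r}")
    actual = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(actual, expected)


# Stand-in for the bytes of an archive you reviewed once and decided to trust.
vetted_archive = b"pretend these are the bytes of a package archive you reviewed"
recorded = pin(vetted_archive)

print(matches_pin(vetted_archive, recorded))         # same bytes as before
print(matches_pin(vetted_archive + b"!", recorded))  # anything else is rejected
```

As said above, the hash does not judge whether the package is safe; it only guarantees that later installs get byte-for-byte what was vetted the first time.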

44:57 Right, because obviously if the version changes, you'd imagine the code would change and the package would change.

45:01 Absolutely.

45:01 I see.

45:02 And I try to keep the handholding in there so that if you forget to pin the version but provide a hash,

45:09 it'll say, you know what?

45:10 You should really pin the version because you're going to have an unpleasant surprise down the line.

45:14 This is a really stable product.

45:16 You're going to find out that this is not the same.

45:19 Awesome.

45:20 And you mentioned Turtles before.

45:21 What's Turtles?

45:22 Turtles is a real mad scientist project.

45:25 So I used to do a lot of Zope and Plone consulting back in the day, and I watched a lot of really smart people from all different walks of life.

45:36 You know, it was in higher ed, so there were professors, and there were very good writers,

45:40 and there were sys-ops, and there were chemists, and they'd come in, and we'd have to teach them,

45:44 okay, how do I build a website, or how do I use this content management system that has Python underlying it?

45:50 And they'd have to learn HTML, and CSS, and JavaScript, and Python, and then ZCML as a control language,

45:58 and DTML for dynamic CSS back in the day, and all these crazy different languages.

46:03 And it was ridiculous, and they would get discouraged and go away, a lot of them,

46:08 or at least not work to the potential that I thought we could provide them.

46:11 And so Turtles kind of had its genesis there.

46:14 I thought, you know, if only we had a single language that we could use end-to-end,

46:19 and we didn't need to constantly reinvent, say, for loops, or variables that allow us to not repeat ourselves,

46:28 and have a different way to do variables in Python, and a different way in CSS, and another way in JavaScript,

46:32 wouldn't things be easier to learn, and easier to remember, and heck, easier to read?

46:37 And wouldn't we have all of these synergies?

46:40 Turtles is one of those.

46:48 Turtles is a single-language system, or will be, I hope, for web development.

46:53 Right now, it is just a context-sensitive parser that does run, but everything else is still up in the air.

47:01 Interesting.

47:02 So basically, instead of teaching people these three or four languages, when they come, plus the frameworks, you're like, look, learn this one thing,

47:09 and you'll have a website, a full-on website.

47:12 That is the idea, to teach them, you know, an hour or two's worth of stuff,

47:16 and then have them be able to make real progress without an internet connection.

47:21 You know, make this thing explorable, like the old Smalltalk environments used to be,

47:25 where you can drill into an example and take it apart and rip out pieces that you want to use,

47:30 where you can make changes to a live system and see what happens.

47:34 I mean, that's really how I learned programming, by mimicry, which we don't really do anymore.

47:39 You know, I had to type this stuff out of magazines, make mistakes, see what the effects of the mistakes were, fix the mistakes,

47:45 and then screw around and make my own new mistakes and see what effects they had.

47:50 I think that's how we learn human language, and I think it's a powerful way to learn programming.

47:53 It is a powerful way, but we've definitely moved beyond that in lots of areas.

47:57 I mean, I try to, I think of getting started with, like, Node.js and, like, the crazy packaging and requirements and all that,

48:04 and it's like, I thought that was a simple thing to get started with, you know?

48:07 But I feel like Python's done a better job of keeping it simple.

48:11 But still, how do I know where a package comes from?

48:14 If I type import requests, why doesn't that run, right?

48:17 Like, when you're new, all these challenges that we are just like, yeah, whatever, just pip install or whatever,

48:22 I just, you forgot that step.

48:23 These are all levels of friction.

48:25 Yeah, that's one of the things that I'm trying to solve in Turtles by making the answer always the same.

48:30 You ask, where does this come from?

48:32 Or where do I put this?

48:33 And the answer is always the same: in Turtles, it goes on a page.

48:37 Your config goes on a page.

48:39 Your program goes on a page.

48:41 Everything goes on a page.

48:43 The thing that we've lost in the web, because the web is a state-free kind of environment,

48:48 is that the default behavior is for everything to go poof.

48:51 You put something in a form, you leave the page, poof, it's gone.

48:55 Unless you're, you know, using a nice browser like Firefox where you can say, hey, reopen that tab,

48:59 and then your content comes back.

49:01 But as a developer, you have the same problem, right?

49:03 You end up writing these template languages, and you end up having to take state and shuttle it aside

49:07 into some other process, into a relational database or a document store or something,

49:12 and then reconstitute, rehydrate these pages all the time out of these databases,

49:17 where the representation may not look anything like the structure you're actually trying to pull out.

49:23 And so the idea with Turtles is put everything on pages and have a single representation for everything,

49:28 which I think is going to be trees.

49:30 Because as we said with the parsing stuff, trees are a very good choice for universal representation of things.

49:38 You can do a lot with trees.
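
[Editor's sketch: the "single representation" idea can be pictured with one generic tree type that holds both settings and content. Nothing here comes from Turtles itself; the `Node` class, the `walk` and `find` helpers, and the example page are hypothetical names chosen only to illustrate how one traversal can serve every kind of data once it all lives in a tree.]

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """One generic tree node; the same shape holds config or content."""
    label: str
    value: str = ""
    children: List[Node] = field(default_factory=list)


def walk(node: Node, depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Yield (depth, node) pairs in pre-order: parents before children."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


def find(node: Node, label: str) -> List[str]:
    """Collect the values of every node carrying the given label."""
    return [n.value for _, n in walk(node) if n.label == label]


# A hypothetical "page" that keeps its settings and its content together.
page = Node("page", children=[
    Node("config", children=[Node("title", "Home")]),
    Node("body", children=[
        Node("paragraph", "Hello"),
        Node("paragraph", "World"),
    ]),
])

titles = find(page, "title")
paragraphs = find(page, "paragraph")
```

[The same `find` call answers a configuration question and a content question, which is the kind of uniformity the discussion is pointing at.]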

49:39 It definitely sounds like an interesting project.

49:41 So there's nothing quite yet that we can go play with, but you're working on it, huh?

49:44 Nothing but 100K of design notes and parser up on GitHub.

49:48 But yeah, not quite ready yet.

49:49 All right, cool.

49:50 Keep us posted on that.

49:51 That's great.

49:52 All right, so another thing that you said you're into these days is GTD, or Getting Things Done, right?

49:57 Oh, my gosh.

49:58 Changed my life.

49:58 So as a repeat offender of Python library creation, I have the open source guilt.

50:06 I put this thing out there, and people are like, oh, well, it's broken this way, and it's broken this way,

50:11 and I wish it did this, and here's a patch, and why don't you review my patch,

50:14 and shouldn't you be doing this instead of watching Netflix and spending time with your family?

50:17 I'm like, well, I guess.

50:18 And the guilt builds and builds and builds, and you're just kind of, you know, you're just kind of shaking all the time, and you can't relax.

50:25 So GTD, I got this book at work seven years ago.

50:31 Somebody gave me a copy and it sat on the shelf for seven years, and I picked it up,

50:34 and it has all these little helpful practices for getting rid of that guilt

50:37 and making sure you're always working on the most important thing at any one given time.

50:42 And you know what?

50:43 It's really changed my life.

50:44 I no longer have the guilt to such a degree.

50:46 Hardly at all, really.

50:48 My response time has gone way down.

50:50 My email box has been at zero.

50:51 I mean, my work one and my home one were 5,000 before.

50:55 Now they've been zero for months.

50:57 And it's been easy.

50:58 It's a crazy thing.

51:00 That sounds delightful.

51:01 Yeah.

51:02 Yeah.

51:02 I've read that book, and I've lived the GTD lifestyle.

51:06 And I found it didn't quite work for me, but I gained a huge value from doing it.

51:13 And so if you can find a few techniques to help tame the world, whether it's so you have

51:20 better response time on your open source project, or you're not stressed all the time, or you

51:23 can go home and see the family and not have the weight of 5,000 unread emails on you, these

51:31 are all good.

51:31 The biggest thing that helps me these days is Google Inbox with its ability to snooze items

51:38 for two weeks until that item comes back right when I'm supposed to deal with it and things

51:42 like that, that's actually been sort of where I've evolved to.

51:45 But GTD is great.

51:47 Yeah.

51:47 For real.

51:48 And nobody, you know, you say you don't use GTD per se because it didn't work out for you.

51:53 Well, nobody uses vanilla GTD.

51:55 It's, you know, it's made to be customized.

51:57 It's a grab bag of tricks.

51:58 And some of them are, you know, more important than others.

52:00 Some of them are, you know, kind of vital, and some of them are optional.

52:03 Yeah, absolutely.

52:03 And you ran into something with that inbox that really rings true to me.

52:07 I had been trying for years to use my email inbox as a sort of quasi-to-do list.

52:13 I'll leave this in my box because I'm going to need it in three days.

52:17 And so for some reason I thought it's expensive to do a find.

52:20 I don't know why I thought that, but that's what I thought apparently.

52:23 Or, you know, oh, I need to respond to this mail, so it's going to stay in there.

52:27 And it ended up just being this mixture of chronologically sorted things, some of which were reference material, some of which were things to do, some of which I couldn't do for a certain period, like you said, and needed to be snoozed for two weeks.

52:38 And yet the only way it would present them to me is as this linear kind of last-touched first list of things.

52:46 It's not a useful presentation.

52:47 And I have been unable to make any email client really bend to my will as a to-do list.

52:54 And so the reason my boxes are empty these days is any time there's an actionable email item that takes me more than two minutes to do, because otherwise I would do it immediately as part of GTD, it goes into my to-do system.

53:05 And it pops up, like you say, when you're able to take action on it.

53:09 It's really a wonderful thing.

53:11 I do agree with that.

53:12 So I would be remiss not asking which to-do system you use for this.

53:16 So I went shopping.

53:17 I wanted something that – so I'm kind of weird.

53:19 I wanted something that was expressly not cross-platform, because I think that the platform-specific stuff usually ends up with a better UI, and I'm kind of a UI enthusiast.

53:28 So I looked at OmniFocus.

53:29 I love all of Omni's stuff.

53:31 I love OmniGraffle and OmniOutliner.

53:33 I write all my talks in OmniOutliner.

53:34 And I really wanted to like OmniFocus.

53:36 But there were a couple things that it was just grating against me, and it wouldn't do what I wanted.

53:41 I couldn't reorder tasks except within projects.

53:44 And I kind of like to plan my day out.

53:45 And so I ended up with something called Things by a little German company called Cultured Code.

53:51 And it's a Mac and iPhone-only gadget.

53:54 It uses their own little cloud sync service.

53:56 And the UI has been thought out in great detail.

54:00 Development goes glacially slowly.

54:03 That's the downside.

54:05 These people seem very, very intent on getting things exactly right and not releasing until then.

54:09 So that's the downside.

54:10 But Things is great.

54:11 Things lets you schedule things out ahead of time and not bother you until then.

54:15 It lets you express due dates like, well, this drop dead has to happen now.

54:19 And it does all the GTD-style contexts.

54:22 Let me know about this when I'm at home or at work or in the car or what have you.

54:26 It's nothing particularly, you know, whizzy from a technical point of view.

54:31 But it has those kind of three core features of contexts and due dates and hide until that any large-scale to-do system needs.

54:40 Oh, yeah.

54:41 It looks really, really cool.

54:42 All right.

54:42 Well, thanks for sharing your research results there.

54:44 That's cool.

54:45 Great.

54:46 Okay.

54:46 So it looks like we're just about out of time.

54:49 We've covered all the horrible things.

54:50 Now let's talk about some cool things.

54:52 How about your favorite PyPI package?

54:54 Well, I mean, there's a lot of good ones out there.

54:57 I enjoy Flask.

54:58 I've used that to power DXR.

55:00 It's a nice little lightweight web framework by Armin Ronacher.

55:05 Yeah.

55:05 I can't remember exactly who it is.

55:06 Yeah.

55:06 I had him on one of the early shows.

55:08 Yeah.

55:08 All of Armin's stuff is fantastic.

55:09 It is.

55:10 Likewise, Click is a fantastic one by Armin.

55:12 It's, I think he kind of, did he pull it out of Flask?

55:16 No, he's pulling it into Flask, in fact.

55:18 Click is a kit for making command line tools.

55:22 And I pulled it into DXR as well because it makes it so easy to do nested subcommands.

55:27 Like if you use Homebrew or something: brew install, brew this, brew that.

55:31 It helps you brew up commands like that.

55:34 And like all of Armin's stuff, it's sort of a decorator soup that I like very much.

55:39 Absolutely.

55:39 And then one of my own things that I always pull off the shelf when I start a new project

55:44 is more-itertools.

55:46 And it is, as the name says, more itertools.

55:49 The ones that got left behind.

55:50 Collations and such that come in handy in almost every project.
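
[Editor's sketch: to give a feel for the kind of helper being described, here is a standard-library-only approximation of two such utilities. The more-itertools package provides a `chunked` function; the version below is a minimal re-implementation written for illustration and is not the library's own code. The collation step uses `heapq.merge` from the standard library, which merges already-sorted inputs.]

```python
from heapq import merge
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(iterable: Iterable[T], n: int) -> Iterator[List[T]]:
    """Yield successive lists of up to n items; the last may be shorter."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, n))
        if not chunk:
            return
        yield chunk


# Break a stream into batches of three.
batches = list(chunked(range(7), 3))

# Collate two sorted sequences into one sorted sequence.
collated = list(merge([1, 4, 9], [2, 3, 10]))
```

[In a real project you would more likely install the package and import its helpers directly rather than rewrite them.]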

55:54 Yeah.

55:54 Okay.

55:55 Excellent.

55:55 I'll be sure to link to all of those.

55:57 That's great.

55:57 And how about the editor?

55:59 Favorite editor?

56:00 Are you going to write some Python code?

56:01 What do you pull up?

56:01 I pull up a little Mac app called BBEdit.

56:04 It's been around for, must be 40 years or something.

56:07 Yeah.

56:08 They're on BBEdit 11.

56:09 I like their little subtitle.

56:12 I don't know.

56:12 It doesn't suck.

56:14 Yeah.

56:15 Bare Bones Software.

56:15 It doesn't suck.

56:16 And they typically make software that looks like nothing.

56:18 You pull it up and their mailer looks like a big empty window of white.

56:22 And their text editor looks like a big empty window full of white.

56:25 Not a lot of toolbars.

56:26 Not a lot of fluff.

56:27 But under the hood, they do a really nice job.

56:31 The text editor uses what must be ropes because I can edit very large, you know, multi-tens

56:37 of megabyte files without a lot of lag.

56:39 It's incredibly stable.

56:41 It just doesn't lose data.

56:42 On the off chance that it maybe crashes every couple of weeks, its magical little implicit

56:50 auto-save thing will bring up your windows in exactly the state that they were.

56:53 Even your untitled documents, it won't save and plow over your stuff.

56:57 It'll save it in its own little buffer and just make sure nothing is unduly lost.

57:00 Yeah.

57:01 That sounds really cool.

57:02 All right.

57:02 Awesome.

57:02 So check that out, people.

57:04 That's cool.

57:04 And final call to action.

57:06 Well, if you're parsing in Python, give Parsimonious a try and send feedback and complaints and

57:12 patches my way.

57:12 Yeah, definitely.

57:13 It's on GitHub.

57:14 I'll put the links to it in the show notes so people can find it there.

57:17 Thank you.

57:17 And second, be safe out there.

57:19 If you're running a web server, get a certificate.

57:22 Use Let's Encrypt or something else.

57:24 And if you're deploying Python, again, be safe.

57:26 Use pip hashing.
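
[Editor's sketch: "pip hashing" refers to pip's hash-checking mode, where each pinned requirement carries a `--hash=sha256:...` value and installation is run with `--require-hashes`, so a download that does not match its pinned digest is rejected. The snippet below is not pip's implementation; it is a minimal illustration of the underlying check, and the function names `sha256_of` and `matches_pin` are made up for this example.]

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, block_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_pin(path: Path, pinned_hex: str) -> bool:
    """Check a downloaded file against the digest pinned in advance."""
    return hmac.compare_digest(sha256_of(path), pinned_hex.lower())
```

[The point of pinning is that the expected digest is recorded when you vet a release, so a tampered or swapped package fails the comparison at deploy time.]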

57:28 I really appreciate this look inside of the whole parsing world and how we can move beyond

57:33 regular expressions to do something way cooler.

57:36 So thanks for your project and your talk and being on the show.

57:38 It was a pleasure to speak with you, Michael.

57:40 Yeah.

57:40 Bye, Erik.

57:41 This has been another episode of Talk Python To Me.

57:45 Today's guest has been Erik Rose, and this episode has been sponsored by GoCD and Hired.

57:51 Thank you both for supporting the show.

57:53 GoCD is the on-premise, open-source, continuous delivery server.

57:58 Want to improve your deployment workflow but keep your code and builds in-house?

58:02 Check out GoCD at talkpython.fm/gocd and take control over your process.

58:08 Hired wants to help you find your next big thing.

58:11 Visit hired.com/talkpythontome to get five or more offers with salary and equity presented

58:16 right up front and a special listener signing bonus of $2,000.

58:20 Are you or a colleague trying to learn Python?

58:23 Have you tried books and videos that just left you bored by covering topics point by point?

58:28 Well, check out my online course, Python Jumpstart by Building 10 Apps, at talkpython.fm/course

58:34 to experience a more engaging way to learn Python.

58:37 And if you're looking for something a little more advanced, try my Write Pythonic Code course at talkpython.fm/pythonic.

58:43 Be sure to subscribe to the show.

58:46 Open your favorite podcatcher and search for Python.

58:48 We should be right at the top.

58:49 You can also find the iTunes feed at /itunes, Google Play feed at /play, and direct RSS feed at /rss on talkpython.fm.

58:59 Our theme music is Developers, Developers, Developers by Corey Smith, who goes by Smix.

59:04 Corey just recently started selling his tracks on iTunes, so I recommend you check it out at talkpython.fm/music.

59:10 You can browse his tracks he has for sale on iTunes and listen to the full-length version of the theme song.

59:16 This is your host, Michael Kennedy.

59:18 Thanks so much for listening.

59:19 I really appreciate it.

59:20 Smix, let's get out of here.

59:22 I'll see you next time.
