#85: Parsing horrible things with Python Transcript
00:00 Do you have horrible, convoluted things that need parsing? Well, you're in luck, because this week we're covering parsing horrible things in Python. Obviously, you'll learn a bunch of tips and tricks from this episode. But you'll also see that advanced parsing is actually a gateway to many interesting computer science techniques. Listen in as I speak with Erik Rose about his journey to parse weird things at Mozilla. This is Talk Python To Me, episode 85, recorded October 27, 2016.
00:30 developer,
00:32 developer, in many senses of the word because I make these applications, vowels and use these verbs to make this music constructed. Just like when I'm coding another software design, in both cases, it's about design patterns, anyone can get the job done. It's the execution that matters. Interesting Chomsky,
00:52 Welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode has been sponsored by GoCD and Hired. Thank them both for supporting the show by checking out what they have to offer during their segments. Everyone, a little bit of news for you. So first of all, I'm doing a webcast next week, Tuesday, November 22, at 11am Pacific time. The topic is Write Pythonic Code for Better Data Science, and I've partnered with Kevin Markham from Data School. This is 100% free; you can drop in, and it's kind of like a super miniature version of my Write Pythonic Code course. So if you want to come check it out and register, just go to crowdcast.io slash IE slash pythonic, or click that same link in the show notes. I was talking to the PyCharm team this week, and they agreed to give away a free copy of PyCharm Professional every week to one lucky listener. All you have to do to be in the running is be a friend of the show: just visit talkpython.fm/friends, enter your email address, and I'll randomly choose an email address out of the list, and somebody will win. And because we're doing it every week, the odds are pretty decent for you to get a copy. Let's get to this excellent interview about parsing horrible things with Erik Rose. Erik, welcome to Talk Python. Thanks for having me, Michael,
02:22 pleasure to be here.
02:23 Yeah, it's great to have you I'm really looking forward to parsing horrible things and a bunch of other stuff.
02:29 Well, nobody in their right mind really looks forward to parsing horrible things. But it's something we have to do.
02:33 Well, when you have horrible things that you want to parse, someone's got to do it, right? So I think we're going to talk about some really cool techniques, some libraries and algorithms and whatnot to pull it off, and some interesting work and talks that you've done. But of course, before we get into those, let's start with your story. How did you get into programming and Python? Well, I
02:51 mean, I used to be a Legos kid, right? I liked snapping those things together. And of course, the thing about being a kid is you have no economic power, so you've only got the ones you've got in the box and the ones you get at birthdays and at Christmas or what have you. And so I kind of discovered programming as a way to build with Legos without ever worrying about running out of parts. I became kind of an early collector of programming languages, you know, before the internet, so you had to kind of happen upon these things on floppy disks or whatever. I did BASIC and I did HyperCard, and kind of learned programming through mimicry, like we all did in the 80s. You know, you'd get these magazines with program listings in them, and you just kind of had to type them in if you wanted to play with them. So you're forced to plow linearly through these things, making mistakes throughout, and that kind of teaches you debugging and proofreading. And you come to all kinds of wrong conclusions. Like, I remember in HyperCard there's this statement, global start flag, which of course is a global variable declaration, but I thought it was this round, checkerboard cursor, which totally looked like a global start flag to me. So, you know, fast forward two dozen languages later, and I end up working at a university, over at Penn State. And we had this crazy seminar registration system called SEMrush. Maybe they're still running it, and I really hope not. The simplest thing I could say is it was written in VBScript, but really it was written in PHP and then transliterated into VBScript via a series of regular expressions. Wow,
04:20 that's crazy. I've never heard of anything being passed from PHP to VBScript.
04:24 Absolutely insane, and compiled is probably too kind a word. But I didn't realize this when I took the job. You know, I've become a lot smarter about interviewing since then, I hope. And I thought, I'm going to give this a chance. You know, I'm this big Mac guy, and it was the height of the platform wars and Windows was terrible, but you know what, I'm going to not be a jerk. I'm going to give this thing a chance and see if I can write stuff in VBScript. And the short answer is you can't write anything in VBScript. The thing has classes but no inheritance. And so, you know, with a little detour through making my own prototype-based inheritance language out of the call statement, I thought, okay, we need to bridge this over to something more usable. And Python was very well supported at the time on Windows Scripting Host, which allowed me to do ridiculous things like share a database connection between the legacy VBScript code and the new Python code. And that's how I got into Python, believe it or not.
05:20 Interesting, because if you can get it running in the Scripting Host, you can basically get the same environment as this other bad thing, right?
05:26 Yeah, you can kind of mash it all together. It'll bridge strings to strings, it'll actually bridge objects to objects and methods to methods, numbers to numbers. And if you bang on it hard enough, you can get database handles shared and share a single transaction between the two languages. So I was very fortunate to happen upon Python that way.
05:43 Yeah, that's great. And it looks like you've been doing a bunch of Python since then.
05:47 Well, Python is a really nice little patchwork of stolen bits of language. There's not a whole lot unique about Python, which I think is one of its strengths, apart from maybe the with statement. But it knows where to steal all the best things, like list comprehensions out of Haskell, for example. And, you know, hey, we're going to steal class-based object orientation from wherever that came from, Smalltalk, I guess. But we'll also have top-level functions from any number of languages. So I've been pretty happy with it.
06:16 Yeah, absolutely. And it's still on its way, doing that with things like async and await in the latest version, right? Which is a fantastic language feature.
06:24 Yeah, we'll see where that goes. I haven't played with it myself. Yeah, I'm
06:26 still looking for a good use case for that as well. But it's definitely a neat concept. So how about now what do you work on these days?
06:33 Well, day to day, I maintain a project called DXR over at Mozilla, which doesn't really stand for anything, but is a language analysis and navigation tool for large codebases. Firefox is something like 17 million lines of code. If you want to make a change, you've got to be thinking: what am I going to break with this change? Who am I affecting? What functions call this function that I want to change? What eats the contents of this variable that I want to alter? What invariant might I be violating here? DXR answers those kinds of questions through a structured query language, and through both text search and a cleverly accelerated regular expression search, so you can get through these 17 million lines with a regex in, you know, sub-second time. Wow. And you said 17 million lines of code. That's quite impressive. That's bigger even than I thought it would be. Or thereabouts. Yeah, I knew that was huge. How many languages are involved in that? That's a crazy question to ask. Well, I'll just go with languages. So HTML is a language there, not a programming language. Arguably, CSS is Turing complete, someone has proved, as long as it's level three. So that's a programming language now. A lot of the UI is written in JavaScript, and a lot of the down and dirty stuff is written in C++. And more recently, we've begun importing Rust into the codebase. Rust is now a part of the released Firefox, and more and more, with Project Quantum, which was just announced last week, will be ported over to Rust, so we can make more guarantees, so it can be safer and more concurrent. Nice. And
08:11 what is Project Quantum?
08:12 Project Quantum, let's see if I can get this right, is a Mozilla project that just became unsecret, to import a lot of our experimental Rust-based rendering pipeline into Gecko, which is the current released Firefox pipeline. Did I mention Servo already? I forget. The thing we're importing things from is Servo, an experimental web renderer written in Rust: CSS, rendering HTML, all that jazz. And yeah, we're pulling bits of that into Gecko. Okay, that's awesome. Very exciting time. That'll let us be more concurrent and use all these different cores on all these different things. You know, phones even have four cores now.
08:55 Yeah, watches even have multiple cores now. Yeah, it's crazy. It's a crazy time to live. It is a crazy time. You said another thing you're working on is Fathom. What's Fathom?
09:03 So Fathom kind of fits into all this parsing subject. Fathom is my new kind of toy over at Mozilla. It's a mad scientist project to see if we can make it easier to write semantic extractors for the web. An example of a semantic extractor you might know is something along the lines of Readability, or a browser's reader mode, where it just pulls the content text out and dispenses with headers and footers and ads and such. But other things you might want to extract are, hey, let's teach a browser to recognize what a previous or next button looks like, so that maybe the browser can let you assign a keystroke to that in the general sense and not have to chase them as they bounce around as you advance through a slideshow. Or maybe we can teach the browser at a deeper level to appreciate what an advertisement is, or what a navigation element is, so maybe we can collapse those on small screens and hide them in a menu. There's really endless potential to this, and I'm trying to fix it so those extractors are easier to maintain, faster to write, and become less of a mess. If you were to read the Readability source code, which is the thing that Firefox's reader mode and Safari's reader mode are based on, you get the sense that it's been written by hand and maintained over time. There is state flying everywhere. It's hard to tell where to make tweaks. So what Fathom does is express these extractors as lists of unordered rules, in a Prolog sense. Have you ever played with Prolog? It's just a whole bunch of logical statements, and then the environment figures out how to fit those together and run them. Well, Fathom works along those same lines. And as a result, since order doesn't matter, third parties can tweak existing extractors by inserting their own rules. They can say, whenever you see a whatever kind of element, for example, I want to boost the score by this much or lower it by that much.
10:56 nice. So you could put like, understanding of time of calendars and dates and whatnot, possibly as another rule,
11:02 exactly. Hey, this looks like a calendar or Hey, this looks like a payment form. Let me help you fill it. Yeah.
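The unordered-rules idea described above can be sketched in a few lines of Python. This is only an illustration of the concept, not Fathom's actual API (Fathom itself is JavaScript): the `Element` dictionaries, the rule list, and the `score` and `best` helpers are all hypothetical names, and the additive scoring is an assumption chosen so that rule order cannot matter.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: an "element" is just a dict, and a rule pairs a
# predicate with the amount to add to (or subtract from) the score.
Element = Dict[str, str]
Rule = Tuple[Callable[[Element], bool], float]

RULES: List[Rule] = [
    (lambda el: el["tag"] == "p", 2.0),                      # paragraphs look like content
    (lambda el: "ad" in el.get("class", "").split(), -5.0),  # ads do not
    (lambda el: len(el.get("text", "")) > 40, 1.5),          # long text looks like content
]


def score(element: Element, rules: List[Rule]) -> float:
    """Sum the adjustments of every rule that matches; order is irrelevant."""
    return sum(delta for matches, delta in rules if matches(element))


def best(elements: List[Element], rules: List[Rule]) -> Element:
    """Return the highest-scoring element."""
    return max(elements, key=lambda el: score(el, rules))
```

Because the score is a plain sum, a third party could append a rule of their own, for example `RULES + [(lambda el: el["tag"] == "h1", 3.0)]`, without touching or reordering the existing ones.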
11:07 Okay. Yeah, that's awesome. Is that in Firefox yet, or is this just a project so far? We hope
11:13 to get it in so that other things can make use of it more easily. But it's already embeddable within Firefox add-ons, and has been embedded in at least one add-on, just kind of a Fathom debugging add-on. It runs on the server side. It's all just kind of vanilla ES6; you can compile down to ES5 for older browsers. It kind of runs all over the place. I wrote it in JavaScript, despite not liking JavaScript myself, so it can get popular.
11:38 Yeah, sometimes, sometimes you got to go with that. Well, that sounds really cool. And then another thing that you're involved in is Let's Encrypt, which is very exciting.
11:46 Yes, Let's Encrypt has probably got the easiest business plan I've ever heard, which is to give away $100 bills.
11:52 How's that work? How do I get my hundred dollar bill?
11:55 Well, if you've ever bought a house, very easy. So certificates, SSL certificates, have historically been fairly expensive, on the order of 100 bucks. There are some other ways to get cheaper ones, StartSSL or Comodo or whatever, but it's kind of a hassle. What we do is we have a little command line tool that you run. And if you can authenticate yourself to our little certificate-creating server, through a DNS record or by putting a little thing on your server temporarily, then we will give you a cert which is recognized by all the major browsers. So really, there's no reason not to use it at this point.
12:32 Yeah. So having encrypted content is obviously important if you're doing like e commerce or something like that. But it's also just becoming increasingly important to be a first class citizen on the web, right?
12:44 It really is. I mean, it gives you the impression of trust, first of all, if people are watching their URL bars. But also, it's important for your visitors just to have their requests be private. If I'm surfing to Wikipedia, and not doing anything particularly suspicious, but say I suspect I have a medical condition, and I'm reading about all these different skin diseases or whatever is ailing me, I don't want my ISP logging that away and selling it to their marketing partners, which, until the FCC's ruling last week, was perfectly legal to do. State actors, too. I mean, the best defense we have against really anybody in the future coming to power and looking into our past and seeing things they don't like and then coming down on us is keeping things that are our business our business. And using SSL certificates and surfing to secure sites is the best way to do that right now.
13:34 Yeah. And the fact that Let's Encrypt is free makes that very, very possible. And I think that's great. You know, there are really interesting studies quoted in the original Edward Snowden book that came out, I think by Glenn Greenwald, I believe. Yes, that guy. And it was a great book. And basically, you know, a lot of people say, look, I don't care about this privacy stuff, I have nothing to hide. But there have been some psychological studies and social studies showing people behave differently if they know somebody's listening. They might not break a rule in private, but they behave differently. They are slightly more private, less willing to think, you know, sort of contrarian ideas, and at least share them. And the less that people are watching, the better, as far as I'm concerned, for people in general.
14:24 Yeah. It's really the idea of chilling effects, which came out of some academic institution. It's a wonderful phrase. And when you have chilling effects in operation, where you are afraid to do perfectly legal things which otherwise you might do, democracy really cannot function. How democracy works is by means of, well, really the Overton window, this range of things that you're allowed to say and think that don't get you kicked out of cocktail parties. And this window moves around over time. You know, yesterday's progressive is today's conservative. And if we're not allowed to play with ideas that are just outside the bounds of the Overton window, then the future really has nowhere to pick its new ideas from, and we just kind of stagnate. The powers that be become the powers that always will be, and the ideas that are become the ideas that always will be, and we can't really go anywhere. So yeah, I think privacy is key to having a functioning democracy at all.
15:22 Yeah, I totally agree. And if you don't buy that, Google ranks sites higher if they have SSL, and we all want to rank higher in Google, right? So there's the Google juice, there's the final straw to, like, start encrypting stuff. My blog is encrypted, this podcast site is encrypted, and so on. And I think it's great. Happy to do that as much as I can. I'm wondering, how did you become interested in parsing all these horrible things? Where did you get started with that?
15:47 Well, I guess it probably started when I was seven years old and playing around with BASIC and HyperCard, and thinking, oh, you know, I want to write my own programming language. Because of course, you're seven, and no one has told you that's hard. And if you spend a couple of days trying to do that with just if statements, you end up in this spaghettified mess, and you can't get anywhere, and you're just amazed that anyone has ever managed to write a language, ever. So then put in a little pause of about 25 years, and it came up at work. Over at Mozilla, we had a support site, and still have a support site, support.mozilla.org. It's got a wiki, and it's got a Stack Overflow clone, and all this different stuff to help out people who are using Firefox or other products. And our wiki is powered by, not MediaWiki itself, but the MediaWiki syntax, for various historical purposes, as we like to say. And not only did we want the MediaWiki syntax, but we wanted to be able to add our own little directives to the syntax, crazy little things. One example: if you wanted to change the text that would come up according to which version of Firefox someone was using to visit the site. It was very, very difficult to make those edits to the implementation of MediaWiki that we were using, which was a port by David Cramer, a very nice little port of the original MediaWiki machinery. Such a direct port, in fact, that it still had dollar signs in the comments from the PHP. Right. Yeah, it was originally PHP, and it got translated over to Python. And that helped, because at least you were working in Python. But you said the way that it worked, the parser was pretty insane. Like, it was just a crazy bunch of regexes, right? Yeah, there were, I think, 41 regexes.
And then there was another 2100 lines of PHP that was just run over and over again against the source text, finding and replacing and finding and replacing, hopefully in the right order, interacting with each other, dropping little markers so that they didn't smash over each other when they shouldn't. And then hopefully, at the end, out comes the properly rendered text. Of course, in reality, the MediaWiki language changes from release to release as they find little corner cases that this crazy, slapdash, loopy way didn't handle. Yeah, you called it the Klingon of
18:03 MediaWiki.
18:05 Yeah, I look at these regexes, and I think, well, this is the original Klingon, clearly.
18:09 Yes, obviously. Nice. And so you were looking for a way to escape this Klingon parser world and create something nicer. I think one of the problems you said was inherent in the algorithm was that it would directly parse into its new representation, right? There was never an intermediate representation. So tell people, like, why do you care? What does the intermediate representation do for you?
18:34 Well, it gives you flexibility, like any abstraction. So let's say we have this imaginary intermediate representation for MediaWiki syntax. We bring in MediaWiki syntax, we parse it into a tree, because that's what all these things end up as in parsing land, and then we have a lot of options. We can output plain text from that, ignoring the bits of the tree that say bold or italic. We can render out HTML from that, not ignoring those things. We can go hunting through for just, say, date/time elements and then pull out some date and time entities. Once you have this abstraction, you can have any kind of output you like, or you can do any kind of analysis or transformation you like. Nice. So for example, if you wanted to possibly represent stuff by
19:15 the Markdown output, or an HTML output, or plain text, it's super easy, because both Markdown and HTML have a bold concept. One is an angle bracket strong, one is a star. But you can do that final translation pretty easily once you have the tree. Right,
19:34 exactly. That's the trivial part, going from tree back to a linear representation.
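To make the "tree in the middle, many outputs" point concrete, here is a minimal sketch in Python. It is not the parser discussed in the episode: the `Node` class, the node kinds, and the marker tables are assumptions invented for illustration, and the HTML renderer does no escaping.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Node:
    """A hypothetical parse-tree node: a kind plus child nodes or text."""
    kind: str  # e.g. "doc", "bold", or "italic"
    children: List[Union["Node", str]] = field(default_factory=list)


# Each output format is just a table of (opening, closing) markers per kind.
Marks = Dict[str, Tuple[str, str]]
HTML: Marks = {"bold": ("<strong>", "</strong>"), "italic": ("<em>", "</em>")}
MARKDOWN: Marks = {"bold": ("**", "**"), "italic": ("*", "*")}
PLAIN: Marks = {}  # plain text ignores bold and italic entirely


def render(node: Union[Node, str], marks: Marks) -> str:
    """Walk the tree and flatten it back into a linear string."""
    if isinstance(node, str):
        return node
    opening, closing = marks.get(node.kind, ("", ""))
    inner = "".join(render(child, marks) for child in node.children)
    return opening + inner + closing
```

The same tree can then be rendered three ways by swapping the table, for example `render(tree, HTML)`, `render(tree, MARKDOWN)`, or `render(tree, PLAIN)`, which is the easy direction: tree back to linear text.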
19:38 So the hard part's getting into a tree.
19:41 Right? Now we're doing that as we talk, which is nice. So you have like a linear sequence of sounds and I hear them and I deconstruct them back into you know, phrases and word pairs and things and idioms. And I say well, that's probably what he means and then you do it back from my end.
19:55 Exactly. So pretty much pulling structure out of any flat linear stream of data is parsing Right,
20:02 exactly. And so the applications are as wide as you like. I mean, anything that has any kind of structure: text, sound. You can think of musical phrases; a lot of music is a theme and variations, or development of a thematic statement. On one hand, I sort of can see how you would do some of this with regular expressions. But really the goal is to move it into this parse tree, because then it's not just transforming one bit of text into another text, it's actually transforming it into a structure that we can do all kinds of interesting things with. Right. And more formally, regular expressions can't really support nesting. Now, there are little corner cases that people are going to write in about, and things that are called regular expressions nowadays, like in Perl, have support for nesting bolted on, but then they cease to be proper regular expressions. Exactly. Proper parsers, on the other hand, descend into the lower, call it Chomsky type 1 or 2, grammars and can support nesting. So you could, say, understand sets of nested parentheses of arbitrary depth. And that's how things like, well, HTML, for example, work: you could have a bold tag with an italic tag in it, with another bold tag in it, ad infinitum.
21:13 Yeah, div span a all sorts of stuff. Yeah.
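As a small illustration of the nesting point, the sketch below parses balanced parentheses of arbitrary depth into nested Python lists using recursion, which is exactly the part a classical regular expression cannot express. The function names are made up for this example, and very deep inputs would hit Python's recursion limit.

```python
def parse_parens(text):
    """Parse a string of balanced parentheses into nested lists.

    Each "(...)" becomes a list containing the groups nested inside it.
    Raises ValueError on unbalanced or unexpected input.
    """
    def parse_groups(pos):
        # Parse zero or more "(" groups ")" starting at pos.
        groups = []
        while pos < len(text) and text[pos] == "(":
            inner, pos = parse_groups(pos + 1)  # recurse for the nested part
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            groups.append(inner)
            pos += 1  # consume the ")"
        return groups, pos

    tree, pos = parse_groups(0)
    if pos != len(text):
        raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
    return tree


def depth(tree):
    """Return the deepest level of nesting in a parsed tree."""
    return max((1 + depth(child) for child in tree), default=0)
```

For example, `parse_parens("(()())")` yields one outer group holding two empty groups, and `depth` of that tree is 2.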
21:18 In addition to the nesting, a proper parser gives you the ability to read what you're doing. Regexes are famous for being a write-only language.
21:27 I definitely think of them as write only.
21:29 Yeah, and you can comment them as much as you like and put whitespace in there. But it still becomes, you know, at least awkward to name sub-patterns, to repeat sub-patterns, and to maintain something where you have repeated bits of regex.
21:42 Sure, well, and as it gets increasingly complicated, the more you need the regular expression, the less you can understand it, I think,
21:48 Yeah, for sure. There seems to be a ceiling, for one reason or another, with straight-out regexes.
21:52 Yeah. So you talked about there being different types of grammars that are different complexities of grammars.
21:59 Yeah, there's the idea of the Chomsky hierarchy of grammars. Chomsky is really a linguist, and so he spoke of these grammars in terms of generative power, where I can start from a production, a production named something like sentence, and then descend one level: well, a sentence is, you know, a phrase, and then maybe another phrase, and maybe another phrase. And what are the phrases? Well, we have a subject, and we have a verb, and we have an object. He's interested in creating sentences from the top level. Now, as computer scientists, we're usually more interested in going the opposite direction, starting with this complete sentence and then kind of inferring out the structure: recognizing, as it's called in the literature. So Chomsky came out with the Chomsky hierarchy of grammars, or something like that, anyway: levels 0, 1, 2, and 3. Type zero is basically a free-for-all; it kind of comes out looking like a directed graph. And there's not really readily available kit for parsing that sort of thing, or any algorithmic bound on its complexity. When we get down to level one, that's your context-sensitive grammars. And Python is actually a context-sensitive grammar, I believe only because it has whitespace sensitivity. I was actually trying to parse something like Python the other day, for a little side project of mine called turtles, and I ran up against this. In a normal language, a curly brace language, when you go into a block, you say curly brace, and when you end a block, you say end brace. Now in Python, think about trying to write a tokenizer, trying to find those instances where we go in a level or out a level. You're reading along, reading along, you're at the beginning of the line, and you see eight spaces. Did you go in a level? I don't know. What was the previous line? Was it four spaces? If so, I went in a level.
So you can see that eight-space span is interpreted differently depending on our context, depending on what the previous line had done. So Python is a context-sensitive grammar. Okay. Now, the rest of Python is Chomsky level two, which is a context-free grammar. If you were to dig into the Python source code, in one of these files there's, let's see, I have it sitting here, a nice little sort of summary where it says, okay, here's what a suite is: a series of statements. A suite is a statement, statement, statement, or statement star. But a suite is always this, an if statement is always that, a function definition is always this. It doesn't matter what came before or what comes after. It's context free. Right, like the if statement doesn't mean something different if it's in a while loop; it always means the same thing. So if you know what an if is, if you can define the structure of an if, you can define its meaning, and you don't have to do more interesting parsing. Yep, all the time, it means the same thing. There are no modes to go into or out of.
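The "eight spaces could mean anything" problem is commonly handled by having the tokenizer remember the enclosing indentation widths on a stack and emit explicit indent and dedent tokens, so the rest of the grammar can stay context free. The sketch below is a simplified illustration of that approach, not CPython's real tokenizer: it assumes spaces only, skips blank lines, and ignores comments and line continuations.

```python
def indentation_tokens(source):
    """Turn leading whitespace into explicit INDENT and DEDENT tokens.

    Returns a list mixing the strings "INDENT" and "DEDENT" with
    ("LINE", text) tuples for each non-blank line.
    """
    stack = [0]  # indentation widths of the blocks we are currently inside
    tokens = []
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines carry no indentation information
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            # Deeper than the enclosing block: we went in one level.
            stack.append(width)
            tokens.append("INDENT")
        else:
            # Same width or shallower: close blocks until the widths match.
            while width < stack[-1]:
                stack.pop()
                tokens.append("DEDENT")
            if width != stack[-1]:
                raise ValueError(f"inconsistent dedent: {line!r}")
        tokens.append(("LINE", line.strip()))
    # Close any blocks still open at the end of the input.
    tokens.extend("DEDENT" for _ in stack[1:])
    return tokens
```

With this in place, a line starting with eight spaces produces an `INDENT` when the stack top is 0 or 4, and nothing at all when the stack top is already 8, which is the context sensitivity described above.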
This portion of Talk Python To Me is brought to you by GoCD from ThoughtWorks. GoCD is the on-premise, open source continuous delivery server. With GoCD's comprehensive pipeline modeling, you can model complex workflows for multiple teams with ease. And GoCD's Value Stream Map lets you track changes from commit to deployment at a glance. GoCD's real power is in the visibility it provides over your end-to-end workflow. You get complete control of and visibility into your deployments across multiple teams. Say goodbye to release day panic and hello to consistent, predictable deliveries. Commercial support and enterprise add-ons, including disaster recovery, are available. To learn more about GoCD, visit talkpython.fm/gocd for a free download. That's talkpython.fm/gocd. Check them out; it helps support the show. And
26:11 one of the things that you said, a quote from your PyCon 2012 talk, which I'll link to in the show notes, of course, was that parsing is a gateway drug to other areas of computer science.
26:24 It's true. I mean, once you have this tree-shaped intermediate representation, you can do anything you want. You can get into natural language processing, where, if you're doing any of this in Python, NLTK is a not-to-be-missed library. First of all, it is a shining example of what all library documentation should be. There is an NLTK book freely available, and it's a fantastic place to start if you want to start understanding human language with the computer, or doing sorts of data mining and machine learning. It's also full of those sorts of algorithms. Great piece of kit. Yeah, awesome. Also, from the intermediate representation, you can get into programming language design. That's fun. Make your own little toy language. Anybody can make a Lisp in a couple of days, because the parsing is so darn simple. It's just a bunch of nested parentheticals.
27:12 There's a lot of parentheses there.
27:16 Yeah, now I'm a lot less intimidated to reach for custom little query languages. Like, DXR has a custom little query language, which looks a lot like Google's custom little query language. And it's, you know, 10 or 20 lines of grammar description that I feed into my library, Parsimonious, and then out comes the tree. Yeah, that's,
27:34 that's really cool. I feel like when you learn these new techniques, and these data structures and algorithms, you see problems where you just saw opaqueness before, like, oh, I could actually apply this thing and out would pop interesting answers. Whereas before it's like, there's no way I can answer that. That's just text blobs, not structure.
27:52 Yeah, for sure. It's just pattern recognition. The more you expose yourself to, the more you'll say, hey, isn't it just one of those?
27:59 Let's review some of the various options for parsing text in Python today. So there's two pretty well-known ones that have been around for a while: there's pyparsing, and there's PLY, so David Beazley's Python Lex-Yacc.
28:14 Yeah, pl y is fantastic. Anything that David Beasley does is fantastic, you should immediately pause the podcast and go watch all of his talks. Feel why came out when he was teaching a course on building your own little Pascal interpreter, I believe Pillai has the advantage of being tripped over by hundreds and hundreds of students. And they've hit every little corner case and made every possible mistake. And so the error reporting is top notch. And really, I would seriously recommend taking a look at PL y if you're if the complexity of what you're parsing is amenable to it. Now, the limit of PL y and why I couldn't use it for the media wiki stuff is that implements what we call LR one parsing, which you can look up the formalisms. And I probably forget most of them. But this is this is something we did. I think most most languages that are currently in production probably use something along the lines of LR one just because it's very, very memory efficient and CPU efficient as well. So it was the thing that we implemented when we had tiny little computers. But where you run into problems is the LR one, that one means we can only look ahead one token to decide what kind of thing we're recognizing. And in the case of media wiki, that wasn't actually enough. For example, I believe it was internal links that begin with two brackets, bracket, bracket. So it's not every time that bracket bracket means an internal link. It has to be followed by, oh, I'm gonna get this wrong, but something along the lines of maybe a URL or page name, and then maybe a vertical bar and then maybe a page title. And so you might have to look ahead two or three tokens to see if Your internal link is going to work out. And if it doesn't, that bracket bracket is just part of plain text and should be emitted verbatim. And so we couldn't use any LR one parser. And I put that on my shopping list and rule out a whole lot of libraries because of it. Right? It just needs to keep more
30:15 of it in its head, in the algorithm, all at once, right? Pretty much. Yeah. All right. So let's do pyparsing.
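A minimal sketch of the lookahead problem described here, assuming a simplified link shape of `[[page]]` or `[[page|title]]`. This is not MediaWiki's real grammar, and `try_internal_link` and `parse_wikitext` are hypothetical names. The recognizer commits to a link only after scanning ahead to the closing brackets; if that fails, the opening bracket is emitted verbatim as plain text.

```python
def try_internal_link(text, pos):
    """Return (node, next_pos) if an internal link starts at pos, else None."""
    if not text.startswith("[[", pos):
        return None
    end = text.find("]]", pos + 2)
    if end == -1:
        return None  # never closed: the caller falls back to plain text
    inner = text[pos + 2:end]
    if not inner or "[" in inner or "\n" in inner:
        return None
    page, bar, title = inner.partition("|")
    if not page or (bar and not title):
        return None
    return ("link", page, title or page), end + 2


def parse_wikitext(text):
    """Split text into ('link', page, title) and ('text', chunk) nodes."""
    nodes, plain, pos = [], [], 0
    while pos < len(text):
        found = try_internal_link(text, pos)
        if found:
            if plain:
                nodes.append(("text", "".join(plain)))
                plain = []
            node, pos = found
            nodes.append(node)
        else:
            # Lookahead failed: this character is ordinary text.
            plain.append(text[pos])
            pos += 1
    if plain:
        nodes.append(("text", "".join(plain)))
    return nodes
```

For example, `parse_wikitext("see [[Main Page|home]] and [[oops")` yields a text node, a link node, and a final text node that still contains the unmatched `[[`.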
30:22 So pyparsing is what I consider, or was, the canonical, I-reach-for-it-first Python parsing library, before I wrote my own, of course, because I'm in that bad habit. And pyparsing is, you know, fairly Pythonic: it mashes all the grammar definition stuff into Python objects. So you literally say in a piece of Python, bold_toggle equals Literal, parentheses, and then you construct an object: you say quote, quote, quote, and say, okay, well, quote-quote-quote is the thing that turns on bold in MediaWiki syntax. And then on that object, you call methods like setName, bold_toggle, so that when you have this tree, you can actually tell that this thing was a bold toggle as you're walking the tree, and kind of on you go. Now, the disadvantage of this is it's kind of hard to read, and it's kind of wordy, because after all, we had to make this valid Python. And you can't do certain things like make forward references. And oftentimes, there are cyclical references in a nontrivial grammar. In the case of pyparsing, we had to put a little hack around that and say, you know, bold_toggle equals Forward, which is sort of a promise that I'm going to declare something later. And then later on, you kind of
31:39 jam that on. For people who haven't seen the syntax of pyparsing, it's kind of like formalized Python that acts as regular expressions. So you'll see Python objects, but some of them are clearly coming from sort of regular expression land, like groups and whatnot, right?
31:58 Well, I mean, groups is a valid word to use when you're talking about parsing. After all, we're talking about trees here, and a tree is nothing more than a nested list. And so every sublist you can think of as a group. So that's what pyparsing is doing with the groups. Yeah, okay. And pyparsing is of, I think, equal power with Parsimonious. They're both able to describe all the context-free grammars and probably a subset of context-sensitive ones.
32:25 And parsimonious is your library that you wrote, right?
32:28 Yes, it's my mad scientist experiment, which, though its version number starts with zero-point, I have in production all over the place. There are lots of people using it. So
32:36 nice. Do you have some examples of what it's being used for?
32:39 Yeah, I'm using it in DXR to parse our little query language. But also people are using it for, I don't know, they don't report back to me, but it gets a lot of downloads from somewhere.
32:49 Excellent. So to me, it seems like with Parsimonious it's a little simpler to define the grammar. Is that right?
32:55 That was really my goal: to make it simpler, to make it run fast, and to make it optimizable.
33:02 One of the goals you listed was frugal RAM use, which I thought was just a great way to phrase it.
33:08 Yeah, and I haven't done a lot of RAM profiling, so I'm not ready to make any claims about that just yet. But the formalism underneath Parsimonious, which is parsing expression grammars, which come out of a 2004 paper by Bryan Ford, meets that goal just fine. I think RAM use is something along the lines of order n to the third, but the n is the grammar size, so it's actually not that big an n. Yeah. And as a result of blowing that RAM on caches for each individual little production, each individual little, you know, context-free "equals", you get linear parse time, which is great. Yeah,
33:43 that's really, really nice. So the reason people care about RAM is, I mean, if you're running this over some text and you're just going to dump it out and produce a text version, that's fine. But if you want to keep that in memory in, say, a web server and continuously serve requests from it or something, then all of a sudden you care way more about RAM,
34:04 right? Well, just imagine serving up a bunch of MediaWiki pages, and you've got to be parsing 100 of these at a time on a web server. You know, RAM is not free. No,
34:13 it's definitely not free, especially if you've got a lot of traffic. So one of the algorithms involved in here, you said, is that it uses something called the packrat algorithm, which sounds fun.
34:24 I think Bryan Ford came up with that, too. It's really simple. It's what you would come up with yourself if you were implementing one of these PEG parsers. They're really just recursive descent parsers. And as you descend, you might find out, hey, now that I've looked ahead two tokens, I find out this internal link isn't going to work out for me. I need to rewind a little; let me take a couple of steps back up the stack, and let me try parsing it as plain text, for example, or maybe as an external link. And in the course of that, you may need to use a partial parse from a previous stack frame, and rather than redoing that work, it makes sense oftentimes to cache the results of each partial parse. And that's all that packrat does: it just keeps all these intermediate results around. I see, basically to allow you to look ahead an arbitrary number of tokens, and then adjust without paying a penalty of redoing work. Exactly right. Yeah, very cool. Well, one of the things I was mostly trying to do with Parsimonious, really the thing that differentiates it so much, was that its grammars look a lot like what you'd find if you looked up the definition of the grammar in a book or in the documentation. They're just big blobs of multi-line text, in the Python quote-quote-quote way. And so it's able to do forward references, because it's not bound to the idea of undefined symbols in an outer programming language. We're able to do compile-time optimizations on it really without limit, because we haven't lost anything in its transformation down to a bunch of Python objects prematurely. And it's also very easy to read as a result.
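A minimal sketch of the caching idea described here, using only the standard library. It is not Parsimonious's implementation; the toy grammar (`expr <- term "+" expr / term`, `term <- [0-9]+`) and the names `PackratParser`, `apply`, and `term_runs` are assumptions made for illustration. Each (rule, position) result is stored, so when the first alternative fails and the parser backtracks, the second alternative reuses the cached `term` instead of re-parsing it.

```python
import re

_NUMBER = re.compile(r"[0-9]+")


class PackratParser:
    """Toy PEG parser that sums numbers: expr <- term "+" expr / term."""

    def __init__(self, text):
        self.text = text
        self.memo = {}      # (rule name, position) -> cached result
        self.term_runs = 0  # how many times term() was really computed

    def apply(self, rule, pos):
        """Run a rule at pos, reusing the cached result if there is one."""
        key = (rule.__name__, pos)
        if key not in self.memo:
            self.memo[key] = rule(pos)
        return self.memo[key]

    def term(self, pos):
        """Return (value, next_pos) for a number at pos, or None."""
        self.term_runs += 1
        match = _NUMBER.match(self.text, pos)
        return (int(match.group()), match.end()) if match else None

    def expr(self, pos):
        # First alternative: term "+" expr
        left = self.apply(self.term, pos)
        if left and self.text.startswith("+", left[1]):
            right = self.apply(self.expr, left[1] + 1)
            if right:
                return (left[0] + right[0], right[1])
        # Backtrack to the second alternative: term (a cache hit)
        return self.apply(self.term, pos)


parser = PackratParser("1+2+3")
print(parser.apply(parser.expr, 0))  # (6, 5): the sum and the end position
print(parser.term_runs)              # 3: one real term parse per number
```

Without the cache, the last number would be parsed twice: once while trying the first alternative and again after backtracking. The trade is the one mentioned in the conversation: memory spent on the memo table in exchange for not redoing work.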
35:57 And you also keep the representation part as a separate phase. So you can render to multiple formats, which is cool.
36:04 Yes, exactly. A lot of these Python, well, not just Python, rather, but a lot of these parsing kits tend to intertwingle output rendering with parsing, and that way leads to pain, in my experience,
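A small sketch of why keeping rendering out of the parser pays off, assuming a hand-built tree of nested tuples in place of real parser output. The node shape and the names `render_html` and `render_text` are hypothetical. Because the tree carries no output format, the same parse can feed more than one renderer.

```python
from html import escape

# A hand-built parse tree: ("kind", child, child, ...) with str leaves.
tree = ("doc", "Parsing ", ("bold", "horrible"), " things")


def render_html(node):
    """Render the tree as HTML, wrapping bold nodes in <b> tags."""
    if isinstance(node, str):
        return escape(node)
    kind, *children = node
    inner = "".join(render_html(child) for child in children)
    return f"<b>{inner}</b>" if kind == "bold" else inner


def render_text(node):
    """Render the same tree as plain text, dropping all markup."""
    if isinstance(node, str):
        return node
    return "".join(render_text(child) for child in node[1:])
```

Here `render_html(tree)` gives `Parsing <b>horrible</b> things` and `render_text(tree)` gives `Parsing horrible things`, with no change to the parsing step.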
36:17 certainly rigidity. Interesting. It sounds like your Parsimonious project is really cool. And it's on GitHub. People can check it out, right?
36:26 Please send feedback, play around with it.
36:28 Send patches. Okay, cool. Yeah, one of the things that made me a little bit sad about pyparsing is it's on SourceForge, which, I don't know, still? Yeah, when I see things on SourceForge, it kind of makes me feel like I'm not really sure that thing's actually still going. Yeah, right on the homepage it says download now from SourceForge. It's like, oh, okay. That's unfortunate.
36:49 It's still fine code. I mean,
36:51 yeah, of course, it
36:51 is a single 3000 line file. But you know, old doesn't mean bad. It means proven?
36:59 Yeah, absolutely. Absolutely. So that was a really cool talk on parsing horrible things you gave. Do you have some other favorites?
37:05 Oh, let's see, what else have I done? I have a talk called Poetic APIs, where I kind of expand on the idea that, you know, these grammars should be really easy to read. In fact, all programs should be easy to read. In fact, here are seven, you know, kind of checklist things you can bang against what you're writing to make sure that you've set up a good language for the users of your API. How interesting. Yeah, what we're doing when we're programming is always creating language. Every time we name a function, name a variable, create the semantics of an object, we're kind of creating the mental model in which everybody who interacts with that code in the future has to play. So if we do that irresponsibly, we can really make people think terrible, stupid thoughts that make their jobs hard. But if we give them really good symbols that correspond well to reality and are easily composable and flexible, like the requests package, which is a great example of, hey, you know what, we could just totally represent HTTP requests instead of making everyone think about raw sockets all the time like urllib, then people can have their efforts magnified.
38:09 Yeah, it really does define the way that you think about a problem, the APIs and whatnot you have to work with, but also the language itself, right? And so I think Python itself is an example of why that's important, right?
38:22 Well, yeah, I mean, Python is a fairly close match to what we tend to write as pseudocode, right? And
38:29 the reason we write pseudocode is it's easy to understand and communicate. So
38:33 Exactly. I was just reading the topological sort algorithms on Wikipedia. And you know what, if you put some colons in there and take out the "each"es, it's about valid Python. That's awesome.
38:43 Yeah, I've heard that before. People have copied algorithms out of Wikipedia and more or less just straight up turned them into Python, and it works beautifully. That's great, which is also an interesting verification of Wikipedia content. Yes.
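To make the pseudocode comparison concrete, here is a sketch of Kahn's algorithm, one of the algorithms usually given for topological sorting, written as plain Python. It is not copied from any particular source; the function name and the graph representation (a dict mapping each node to the nodes it points to) are assumptions.

```python
from collections import deque


def topological_sort(graph):
    """Order nodes so that every edge n -> m has n before m."""
    # Count incoming edges for every node, including ones that only
    # appear as targets.
    indegree = {node: 0 for node in graph}
    for targets in graph.values():
        for m in targets:
            indegree[m] = indegree.get(m, 0) + 1

    ready = deque(n for n, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in graph.get(n, ()):
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)

    if len(order) != len(indegree):
        raise ValueError("graph has at least one cycle")
    return order
```

The loop structure reads almost line for line like the usual pseudocode: take a node with no incoming edges, append it to the output, remove its outgoing edges, and repeat.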
39:07 This portion of Talk Python To Me is brought to you by Hired. Hired is the platform for top Python developer jobs. Create your profile and instantly get access to 3,500 companies who will compete to work with you. Take it from one of Hired's users, who recently got a job and said: I had my first offer on Thursday after going live on Monday, and I ended up getting eight offers in total. I've worked with recruiters in the past, but they've always been pretty hit and miss. I tried LinkedIn, but I found Hired to be the best. I really like knowing the salary upfront, and privacy was also a huge seller for me. It sounds awesome, doesn't it? Well, wait until you hear about the signup bonus. Everyone who accepts a job from Hired gets a $1,000 signing bonus, and as Talk Python listeners, it gets way sweeter. Use the link hired.com/talkpythontome and Hired will double the signing bonus to $2,000. Opportunity's knocking. Visit hired.com/talkpythontome and answer the door. There's another one that you talked about called the Code Review Review. What's the story of that?
40:08 Yeah, this is a newer talk. So something we're trying to do at Mozilla is make sure we don't drive people away by being jerks when doing code review, or otherwise having, you know, kind of an unwelcoming culture. Mozilla is, historically and today, largely driven by volunteer contributions. I mean, even the guy who owns our Secure Sockets Layer something, NSS, whatever that is, like the module owner for this, the one who has the final say, he doesn't get paid by Mozilla. He does something else, and he just takes on this responsibility out of his own free time. And so it's really important for Mozilla to keep that rolling, you know: welcome people into the community, take contributions, help people level up to become better programmers and more familiar with the project. And so the Code Review Review is a piece of our onboarding right now that I'm turning into a more generically applicable talk, where we talk about, well, you know, how do we create that kind of welcoming atmosphere? How do we do a proper review so that good programs come out? And how do we do a review such that better programmers come out of it?
41:11 Yeah, it's really important that when people come to these new projects, or when they stick around, they feel like it's a delightful experience, because they're doing it on their own free time and energy, right? So
41:23 yeah, for sure,
41:24 you definitely don't want it to be like slogging through hard code. I mean, an example of the opposite that comes to mind is the old Python packaging PyPI web code. I talked to Donald Stufft on episode 64, and he was like, a lot of people want to come along and help maintain and evolve this, but it's like two files, and this hugely complicated old custom web framework that they built for it. People look at it and go, you know, actually, thanks, but no thanks. And finally they're rewriting it at pypi.org, and it's in Pyramid and Bootstrap, and it's lovely. But for a long time, I think it turned people away from pushing that project forward. And you could tell that it was just getting maintained, which is good, but it's also evidence of this anti-approachability, I guess.
42:17 Yeah, it's kind of funny how all those old projects tend to be like two enormous files. All I can think of is folders must have been expensive 20 years ago.
42:24 That's exactly it, really expensive. All right, so what else? We've got a few moments to talk about a few other things, and I know you've got a lot of interesting pieces out there. What else is going on? And specifically pip, right? Oh, pip. Yes. So.
42:37 So we deploy a lot of Python at Mozilla. And we used to check everything into a vendor library, you know, just to be sure no one had slipped anything under the radar, anything malicious like that. And vendor libraries are a pain to maintain. You have to update the versions of things, and that creates enormous diffs in your version control, and checkouts take forever, and your checkouts are huge. And so for a while we ran an internal PyPI mirror. A lot of people run their own index server, and you keep track of who's allowed to upload what to the server, and who did it last, and what versions are going to work with your own projects. And you have to keep an access control list and an audit trail. And that was a pain, and it slowed things down. And then I thought, well, we were actually having a Beer and Tell, where we have these little sessions where we have a beverage of our choice and talk about something that we've been playing around with as a side project. And I needed something to talk about one day, and I thought, well, why not just hash the results of what you download from PyPI, make sure they match some local hash that you've pre-vetted, and then go ahead and install it? And so I put this thing together as a little tool called peep, for "prudently examine every package". And we ended up moving the whole production lifecycle over to that for a number of years here. And I thought, well, okay, this is proven out. And peep called deep into pip's internal APIs, so it would break all the time, and it was a pain to maintain. I thought, you know what, I'm going to lift this up into pip, if I can get people interested. And long story short, I did, people were interested, and it's in pip 8 and above.
So if you're out there deploying Python and running your own index server, or keeping up to date with a vendor library, hey, consider just putting a bunch of SHA-256 hashes into your requirements file with a funny little syntax and running pip 8 over it, and it'll vet these things for you. I mean, it won't vet them for you; you have to make sure that there's nothing malicious in a given version of the package. But it'll make sure that what you got that first time is the same thing you get from then on. Yeah, once you've vetted it, it'll verify it can't change. And so what do you do? You can't just say I depend upon SQLAlchemy; you've got to say I depend upon SQLAlchemy one-dot-oh-dot-whatever, and here's its SHA,
44:54 --hash equals whatever it is, right? Because obviously, if the version changes, you'd imagine the code would change in the package. Absolutely. I see.
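For reference, pip's hash-checking mode reads requirements lines of the form `SomePackage==1.2.3 --hash=sha256:<hex digest>` and can be enforced with `pip install --require-hashes -r requirements.txt`. The comparison at the heart of the idea can be sketched with the standard library alone. This is an illustration of the principle, not pip's or peep's code, and `sha256_hex` and `verify_download` are hypothetical names.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_download(data: bytes, allowed_hashes: set) -> None:
    """Raise ValueError unless data matches one of the pre-vetted hashes."""
    digest = sha256_hex(data)
    if digest not in allowed_hashes:
        raise ValueError(f"hash mismatch: got sha256:{digest}")


# Stand-in bytes for a downloaded archive; a real check would hash the file.
archive = b"pretend this is a package archive"
pinned = {sha256_hex(archive)}   # recorded once, after vetting the archive

verify_download(archive, pinned)  # same bytes as before: passes silently
```

A set of allowed hashes is used because one pinned version can legitimately have several archives, for example a source distribution and one or more wheels.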
45:03 And I've tried to keep the hand-holding in there, so that if you forget to pin the version but provide a hash, it'll say, you know what, you should really pin the version, because you're going to have an unpleasant surprise down the line. That's this
45:15 is a really stable product, you're gonna find out that this is not the same. Awesome. You mentioned Turtles before. What's Turtles?
45:22 Turtles is a real mad scientist project. So, I used to do a lot of Zope and Plone consulting back in the day, and I watched a lot of really smart people from all different walks of life in higher ed. They were professors, and they were very good writers, and they were sysops, and they were chemists, and they'd come in and we'd have to teach them, okay, how do I build a website? Or how do I use this content management system that has Python underlying it? And they'd have to learn HTML and CSS and JavaScript and Python, and then ZCML as a control language, and DTML for dynamic CSS back in the day, all these crazy different languages. It was ridiculous, and they would get discouraged and go away, a lot of them, or at least not work to the potential that I thought we could provide them. And so Turtles kind of had its genesis there. I thought, you know, if only we had a single language that we could use end to end, and we didn't need to constantly reinvent, say, for loops, or variables that allow us to not repeat ourselves, and have a different way to do variables in Python and a different way in CSS and another way in JavaScript, wouldn't things be easier to learn, and easier to remember, and easier to read? And wouldn't we have all of these synergies between different parts of our website that we don't have right now? And so Turtles is a single-language system, or will be, I hope, for web development. Right now, it is just a context-sensitive parser that does run. But, you know, everything else is still up in the air. Interesting. So
47:02 basically, instead of teaching people these three or four languages when they come, plus the frameworks, you're like, look, learn this one thing and you have a website, a full-on website,
47:12 that is the idea: to teach them, you know, an hour or two's worth of stuff, and then have them be able to make real progress without an internet connection. You know, make this thing explorable, like the old Smalltalk environments used to be, where you can drill into an example and take it apart and rip out pieces that you want to use, where you can make changes to a live system and see what happens. I mean, that's really how I learned programming, by mimicry, which we don't really do anymore. You know, I had to type this stuff out of magazines, make mistakes, see what the effects of the mistakes were, fix the mistakes, and then screw around and make my own new mistakes and see what effects they had. I think that's how we learn human language. I think it's a powerful way to learn programming.
47:53 It is a powerful way. But we've definitely moved beyond that in lots of areas. I think of getting started with, like, Node.js, and the crazy packaging and requirements and all that. And it's like, I thought that was a simple thing to get started with. But I feel like, instead of keeping it simple, it's still, how do I know where a package comes from? If I type import requests, why doesn't that run? Right? Like when you're new, all these challenges where we're just like, eh, whatever, just pip install it, or whatever, I just forgot that step. These are all levels of friction. Yeah, that's one
48:26 of the things that I'm trying to solve in Turtles by making the answer always the same. You ask, where does this come from? Or where do I put this? The answer in Turtles is always: it goes on a page. Your config goes on a page, your program goes on a page, everything goes on a page. The thing lost in the web, because the web is a state-free kind of environment, is that the default behavior is for everything to go poof. You put something in a form, you leave the page, poof, it's gone. Unless you're using a nice browser like Firefox, where you can say, hey, reopen that tab, and then your content comes back. But as a developer, you have the same problem, right? You end up writing these template languages, and you end up having to take state and shuttle it aside into some other process, into a relational database or a document store or something, and then reconstitute, rehydrate these pages all the time out of these databases, where the representation may not look anything like the structure you're actually trying to pull out. And so the idea with Turtles is put everything on pages, and have a simple representation for everything, which I think is going to be trees, because as we said with the parsing stuff, trees are a very,
49:36 very good choice for universal representation of things you can do a lot with trees. It definitely sounds like an interesting project. So there's nothing quite yet that we can go play with. But you're working on it. Hmm,
49:44 nothing but 100K of design notes and a parser up on GitHub, but yeah, not quite ready yet.
49:49 All right. Cool. Well, keep us posted on that. That's great. All right. So another thing that you said you're into these days is GTD, Getting Things Done, right?
49:57 Oh my gosh, it changed my life. So as a repeat offender of Python library creation, I have the open source guilt. You know, I put this thing out there and people are like, oh, well, it's broken this way, and it's broken that way, and I wish it did this, and here's a patch, and why don't you review my patch? And shouldn't you be doing this instead of watching Netflix and spending time with your family? And the guilt builds and builds and builds, and you're just kind of you don't know he's gonna show you again all the time. And you can't relax. So GTD, I got this book at work. Seven years ago, somebody gave me a copy. It sat on the shelf for seven years. I picked it up, and it has all these little helpful practices for getting rid of that guilt and making sure you're always working on the most important thing at any one given time. And you know what, it's really changed my life. I no longer have the guilt to such a degree, hardly at all, really. My response time has gone way down. My email inboxes have been at zero, my work one and my home one, where they were at four or five thousand before. Now they've been at zero for months. And it's been easy. It's a crazy thing. That sounds delightful.
51:01 Yeah, I've read that book, and I've lived the GTD lifestyle. I found it didn't quite work for me, but I gained huge value from doing it. And so if you can find a few techniques to help tame the world, whether it's so you have better response time on an open source project, or so you're not stressed all the time, or so you can, you know, go home and see the family and not have the weight of 5,000 unread emails on you, these are all good. The biggest thing that helps me these days is Google Inbox, with the ability to snooze items for two weeks until that item comes back right when I'm supposed to deal with it, and things like that. That's been sort of where I've evolved to, but GTD is great.
51:47 Yeah, for real. And you know, you say you don't use GTD per se, because it didn't work out for you. Nobody uses vanilla GTD. It's made to be customized; it's a grab bag of tricks. Some of them are more important than others. Some of them are kind of vital, and some of them are optional. Yeah, you ran into something with that inbox that really rings true to me. I had been trying for years to use my email inbox as a sort of quasi to-do list. I'll leave this in my box, because I'm going to need it in three days. For some reason, I thought it was expensive to do a find. I don't know why I thought that, but that's what I thought, apparently. Or, you know, oh, I need to respond to this mail, so it's going to stay in there. And it ended up just being this mixture of chronologically sorted things, some of which were reference material, some of which were things to do, some of which I couldn't do for a certain period and, like you said, needed to be snoozed for two weeks. And yet the only way it would present them to me is as this linear, kind of last-touched-first list of things. It's not a useful presentation, and I have been unable to make any email client really bend to my will as a to-do list. And so the reason my boxes are empty these days is that anytime there's an actionable email item that takes me more than two minutes to do (because otherwise I would do it immediately, as part of GTD), it goes into my to-do system and pops up, like you say, when you're able to take action on it. It's really,
53:10 it's awesome. I do agree with that. So I would be remiss in not asking which to-do system you use for this.
53:16 So I went shopping. I'm kind of weird: I wanted something that was expressly not cross-platform, because I think that platform-specific stuff usually ends up with a better UI, and I'm kind of a UI enthusiast. So I looked at OmniFocus. I love all of Omni's stuff. I love OmniGraffle and OmniOutliner. I outline talks in OmniOutliner. And I really wanted to like OmniFocus, but there were a couple of things that were just grating against me, and it wouldn't do what I wanted. I couldn't reorder tasks except within projects, and I kind of like to plan my day out. And so I ended up with something called Things, by a little German company called Cultured Code. It's a Mac and iPhone only gadget, and it uses their own little cloud sync service. And the UI has been thought out in great detail. Development goes glacially slowly; that's the downside. These people seem very, very intent on getting things exactly right and not releasing until then. But Things is great. Things lets you schedule things out ahead of time and not be bothered until then. It lets you express due dates, like, well, this is a drop-dead, it has to happen now. And it does, you know, GTD-style contexts: let me know about this when I'm at home, or at work, or in the car, what have you. It's nothing particularly whizzy from a technical point of view, but it has those three core features of contexts and due dates and hide-until that any large-scale to-do system needs. Oh, yeah,
54:40 it looks really, really cool. All right. Well, thanks for sharing your research results, and it's cool. Great. Okay, so it looks like we're just about out of time. We've covered all the horrible things. Now let's talk about some cool things: what's your favorite PyPI package?
54:54 Well, I mean, there are a lot of good ones out there. I enjoy Flask. I've used that to power DXR. It's a nice little lightweight web framework by Armin Ronacher.
55:05 Yeah, Armin Ronacher it is. Yeah, I had him on in one of the early episodes. Yeah, all of Armin's stuff is fantastic.
55:09 It is. Likewise, Click is a fantastic one by Armin. He kind of pulled it out of Flask; now he's pulling it into Flask. Click is a kit for making command line tools. And I pulled it into DXR as well, because it makes it so easy to do nested subcommands, like if you use Homebrew or something at home: brew install, brew this, brew that. It helps you brew up commands like that. And like all of Armin's stuff, it's sort of a decorator soup that I like very much. Absolutely. And then one of my own things that I always pull off the shelf when I start a new project is more-itertools. And it is, as the name says, more itertools: the ones that got left behind, collations and such, that come in handy in almost every project.
55:54 Yeah. Okay. Excellent. We'll be sure to link to all those. That's great. And how about the editor? What's your favorite? If you're going to write some Python code, what do you pull up? I pull
56:02 up a little Mac app called BBEdit. It's been around for almost 40 years or something.
56:07 Yeah, they're on BBEdit 11. I like their little subtitle: "It doesn't suck", all rights reserved.
56:14 Yeah, Bare Bones Software: it doesn't suck. And they typically make software that looks like nothing. You know, you pull it up and their mailer looks like a big empty window of white, and their text editor looks like a big empty window full of white. Not a lot of toolbars, not a lot of fluff. But under the hood, they do a really nice job. The text editor uses what must be ropes, because I can edit very large, you know, multi-tens-of-megabyte files without a lot of lag. It's incredibly stable. It just doesn't lose data. On the off chance that it maybe crashes every couple of weeks, its magical little implicit autosave thing will bring up your windows in exactly the state that they were in, even your untitled documents. It won't save and plow over your stuff. It'll save it in its own little buffer and just make sure nothing is unduly lost. Yeah, that sounds really cool.
57:02 All right, awesome. So check that out people. That's cool. And final call to action. Well, if
57:06 you're parsing in Python, give parsimonious a try and send feedback and complaints and patches my way.
57:12 Yeah, definitely. It's on GitHub. I'll put the links to it in the show notes so people can find it there.
57:17 Thank you. And second, be safe out there. If you're running a web server, get a certificate use Let's Encrypt or something else. And if you're deploying Python, again, be safe use
57:27 pip hashing. I really appreciate this look inside the whole parsing world and how we can move beyond regular expressions to do something way cooler. So thanks for your project and your talk and
57:38 being on the show. It was a pleasure to speak with you, Michael. Yeah,
57:40 bye. This has been another episode of Talk Python To Me. Today's guest has been Erik Rose, and this episode has been sponsored by GoCD and Hired. Thank you both for supporting the show. GoCD is the on-premise, open source continuous delivery server. It will improve your deployment workflow but keep your code and builds in house. Check out GoCD at talkpython.fm/gocd and take control over your process. Hired wants to help you find your next big thing. Visit hired.com/talkpythontome to get five or more offers with salary and equity presented right up front, and a special listener signing bonus of $2,000. Are you or a colleague trying to learn Python? Have you tried books and videos that just left you bored by covering topics point by point? Well, check out my online course Python Jumpstart by Building 10 Apps at talkpython.fm/course to experience a more engaging way to learn Python. And if you're looking for something a little more advanced, try my Write Pythonic Code course at talkpython.fm/pythonic. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. Our theme music is Developers, Developers, Developers by Cory Smith, who goes by Smixx. Cory just recently started selling his tracks on iTunes, so I recommend you check it out at talkpython.fm/music. You can browse the tracks he has for sale on iTunes and listen to the full-length version of the theme song. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Smixx, let's get out of here.
59:24 Staying with my boys, there's no going back and having been sleeping. I've been using lots of rest.
59:33 Developers,
59:39 developers developers