WEBVTT

00:00:00.001 --> 00:00:02.740
Do you have horrible convoluted things that need parsing?

00:00:02.740 --> 00:00:08.160
Well, you're in luck because this week we're covering parsing horrible things in Python.

00:00:08.160 --> 00:00:11.260
Obviously, you'll learn a bunch of tips and tricks from this episode,

00:00:11.260 --> 00:00:16.380
but you'll see that advanced parsing is actually a gateway to many interesting computer science

00:00:16.380 --> 00:00:21.800
techniques. Listen in as I speak with Erik Rose about his journey to parse weird things at Mozilla.

00:00:21.800 --> 00:00:28.040
This is Talk Python To Me, episode 85, recorded October 27, 2016.

00:00:28.040 --> 00:00:57.760
Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the

00:00:57.760 --> 00:01:02.560
ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where

00:01:02.560 --> 00:01:08.200
I'm at mkennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the

00:01:08.200 --> 00:01:14.340
show on Twitter via at Talk Python. This episode has been sponsored by GoCD and Hired. Thank them

00:01:14.340 --> 00:01:17.940
both for supporting the show by checking out what they have to offer during their segments.

00:01:17.940 --> 00:01:24.240
Hey, everyone. A little bit of news for you. So first of all, I'm doing a webcast next week,

00:01:24.580 --> 00:01:31.720
Tuesday, November 22, at 11 a.m. Pacific time. The topic is Write Pythonic Code for Better Data Science.

00:01:31.720 --> 00:01:36.620
And I've partnered with Kevin Markham from Data School. And this is 100% free. You can drop in.

00:01:36.620 --> 00:01:42.180
And it's kind of like a super miniature version of my Write Pythonic Code course. So if you want to come

00:01:42.180 --> 00:01:49.060
check it out and register, just go to crowdcast.io slash e slash Pythonic or click that same link in

00:01:49.060 --> 00:01:55.380
the show notes. I was talking to the PyCharm team this week, and they agreed to give away a free copy

00:01:55.380 --> 00:02:01.660
of PyCharm Professional every week to one lucky listener. So all you have to do to be in the running

00:02:01.660 --> 00:02:06.820
for this is be a friend of the show. Just visit talkpython.fm/friends, enter your email address,

00:02:07.040 --> 00:02:12.240
and randomly I'll choose an email address out of the list, and somebody will win. And because we're

00:02:12.240 --> 00:02:16.980
doing every week, the odds are pretty decent for you to get a copy. Let's get to this excellent

00:02:16.980 --> 00:02:22.140
interview about parsing horrible things with Erik Rose. Erik, welcome to Talk Python.

00:02:22.140 --> 00:02:23.960
Thanks for having me, Michael. Pleasure to be here.

00:02:23.960 --> 00:02:28.980
Yeah, it's great to have you. I'm really looking forward to parsing horrible things and a bunch of

00:02:28.980 --> 00:02:29.500
other stuff.

00:02:29.500 --> 00:02:33.240
Well, nobody in their right mind really looks forward to parsing horrible things, but it's something we

00:02:33.240 --> 00:02:33.700
have to do.

00:02:33.700 --> 00:02:37.380
Well, when you have horrible things and you want to parse them, someone's got to do it, right?

00:02:37.380 --> 00:02:38.160
That's right.

00:02:38.160 --> 00:02:42.200
So I think we're going to talk about some really cool techniques, some libraries,

00:02:42.200 --> 00:02:47.080
some algorithms, and whatnot to pull that off, and some interesting work and talks that you've done.

00:02:47.080 --> 00:02:50.140
But of course, before we get into those, let's start with your story. How did you get into

00:02:50.140 --> 00:02:50.840
programming in Python?

00:02:50.840 --> 00:02:55.080
Well, I mean, I used to be a Legos kid, right? You know, like snapping those things together.

00:02:55.080 --> 00:02:59.480
And of course, the thing about being a kid is you have no economic power, so you've only got the

00:02:59.480 --> 00:03:03.340
ones you've got in the box, and you get at birthdays, and you get at Christmas or what have you.

00:03:03.340 --> 00:03:09.160
And so I kind of discovered programming as a way to build with Legos without ever

00:03:09.160 --> 00:03:13.600
worrying about running out of parts. And so I became kind of an early collector of programming

00:03:13.600 --> 00:03:17.700
languages, you know, before the internet. So you had to kind of happen upon these things

00:03:17.700 --> 00:03:22.960
on floppy disks or whatever. And I did BASIC and I did HyperCard and kind of learned programming

00:03:22.960 --> 00:03:28.620
through mimicry like we all did in the 80s. You know, you get these magazines with program listings

00:03:28.620 --> 00:03:34.120
in them. And you just kind of have to type them in if you want to play with them. And so you're just forced to plow

00:03:34.120 --> 00:03:39.480
linearly through these things, making mistakes throughout. And that kind of teaches you debugging and proofreading.

00:03:39.480 --> 00:03:45.760
And you come to all kinds of wrong conclusions. Like I remember in HyperCard, there's this statement global

00:03:45.760 --> 00:03:51.500
start flag, which of course is a global variable declaration, but I thought it meant this round checkerboard cursor,

00:03:51.500 --> 00:03:57.000
which totally looked like a global start flag to me. So, you know, fast forward two dozen languages later

00:03:57.000 --> 00:04:03.860
and I ended up working at a university over at Penn State and we had this crazy seminar registration

00:04:03.860 --> 00:04:10.460
system called SEMREG. Maybe they're still running it. I really hope not. And it was, well, the simple thing

00:04:10.460 --> 00:04:17.420
I could say is it was written in VBScript, but really it was written in PHP and then transliterated into

00:04:17.420 --> 00:04:22.460
VBScript via a series of regular expressions. Wow. That's crazy. I've never heard of anything

00:04:22.460 --> 00:04:27.820
being compiled from PHP to VBScript. Absolutely insane. And compiled is probably too kind of a word,

00:04:27.820 --> 00:04:33.400
but, so I didn't realize this when I took the job, you know, I've become a lot smarter about

00:04:33.400 --> 00:04:37.920
interviewing since then, I hope. And I thought I'm going to give this a chance. You know, I'm this big

00:04:37.920 --> 00:04:42.500
Mac guy, and it was the height of the platform wars and Windows was terrible, but, you know

00:04:42.500 --> 00:04:45.880
what? I'm going to not be a jerk. I'm going to give this thing a chance. I'm going to see if I

00:04:45.880 --> 00:04:49.640
can write stuff in VBScript. And the short answer is you can't write anything in VBScript.

00:04:49.640 --> 00:04:56.840
The thing has classes, but no inheritance. And so, you know, with a little detour through making

00:04:56.840 --> 00:05:03.800
my own prototype-based inheritance language out of the call statement, I thought, okay, we need to

00:05:03.800 --> 00:05:09.440
bridge this over to something more usable. And Python was very well supported at the time on Windows

00:05:09.440 --> 00:05:14.980
Script Host, which allowed me to do ridiculous things like share a database connection between the

00:05:14.980 --> 00:05:20.320
legacy VBScript code and the new Python code. And that's how I got into Python, believe it or not.

00:05:20.320 --> 00:05:24.400
Interesting. Because if you can get it to run in the script host, you can basically get it in the same

00:05:24.400 --> 00:05:28.880
environment as this other bad thing, right? Yeah. You can kind of mash it all together. It'll bridge

00:05:28.880 --> 00:05:34.360
strings to strings. It'll actually bridge objects to objects and methods to methods, numbers to numbers.

00:05:34.360 --> 00:05:38.840
And if you bang on it hard enough, you can get database handles shared and share a single

00:05:38.840 --> 00:05:43.760
transaction between the two languages. So I was very fortunate to happen upon Python that way.

00:05:43.760 --> 00:05:47.460
Yeah, that's great. And it looks like you've been doing a bunch of Python since then, huh?

00:05:47.460 --> 00:05:53.180
Well, Python is a really nice little patchwork of stolen bits of language.

00:05:53.180 --> 00:05:57.380
There's not a whole lot unique about Python, which I think is one of its strengths apart from maybe the

00:05:57.380 --> 00:06:03.400
with statement, but it knows where to steal all the best things, the list comprehensions out of Haskell,

00:06:03.400 --> 00:06:08.280
for example. And, you know, Hey, we're going to steal, you know, class-based object orientation

00:06:08.280 --> 00:06:13.260
from wherever that came from, Smalltalk, I guess, but we'll also have top-level functions from any

00:06:13.260 --> 00:06:17.520
number of languages. So I've been pretty happy with it. Yeah, absolutely. And it's still,

00:06:17.520 --> 00:06:22.460
still on its way to doing that thing with like async and await in the latest version, right? Which is

00:06:22.460 --> 00:06:26.140
a fantastic language feature. Yeah. We'll see where that goes. I haven't played with it myself.

00:06:26.140 --> 00:06:29.980
Yeah. I'm still looking for a good use case for that as well, but it's, it's definitely a neat

00:06:29.980 --> 00:06:33.280
concept. So how about now? What do you work on these days?

00:06:33.280 --> 00:06:39.860
Well, day to day, I maintain a project called DXR over at Mozilla, which doesn't really stand for

00:06:39.860 --> 00:06:45.480
anything, but is a language analysis and navigation tool for large code bases. So

00:06:45.480 --> 00:06:50.460
Firefox is something like 17 million lines of code. You know, if you want to make a change,

00:06:50.460 --> 00:06:53.880
you've got to be thinking, well, what am I going to break with this change? Who am I affecting?

00:06:53.880 --> 00:07:00.580
What functions call this function that I want to change? What, what eats the result of the

00:07:00.580 --> 00:07:06.480
contents of this variable that I want to alter? Or what invariant might I be violating here?

00:07:06.480 --> 00:07:12.020
And DXR answers those kinds of questions through a structured query language and through both text

00:07:12.020 --> 00:07:17.100
search and a cleverly accelerated regular expression search. So you can get through these 17 million

00:07:17.100 --> 00:07:24.260
lines with a regex in, you know, sub-second time. Wow. And you said 17 million lines of code. That's,

00:07:24.260 --> 00:07:29.640
that's quite impressive. That's bigger even than I thought it would be, but I knew that was huge.

00:07:29.640 --> 00:07:34.620
How many languages are involved in that? Oh, that's a crazy question to ask. Well,

00:07:34.620 --> 00:07:39.860
okay, I'll just go with languages. So HTML is a language, though not a programming language.

00:07:39.860 --> 00:07:45.320
Arguably CSS is Turing complete, someone has proved, as long as it's level three. So that's a programming

00:07:45.320 --> 00:07:50.360
language. Now a lot of the UI is written in JavaScript. A lot of the down and dirty stuff

00:07:50.360 --> 00:07:55.740
is written in C++. And more recently we've begun importing Rust into the code base.

00:07:55.740 --> 00:08:01.500
Rust is now a part of the released Firefox, and more and more, with the release of Project Quantum,

00:08:01.500 --> 00:08:07.880
which was just announced last week, will be ported over to Rust, so we can make more

00:08:07.880 --> 00:08:12.960
guarantees so it can be safer and more concurrent. Nice. And what is Project Quantum?

00:08:12.960 --> 00:08:21.860
Project Quantum, let's see if I can get this right, is a Mozilla project that just became unsecret, to import a lot of our

00:08:22.060 --> 00:08:28.700
experimental Rust-based rendering pipeline into Gecko, which is the current released

00:08:28.700 --> 00:08:33.580
Firefox pipeline. Did I mention Servo already? I forget.

00:08:33.580 --> 00:08:40.860
No. So the thing we're importing things from is Servo, an experimental web renderer written in Rust,

00:08:40.860 --> 00:08:47.820
a CSS renderer, HTML, all that jazz. And yeah, we're pulling bits of that into Gecko.

00:08:47.820 --> 00:08:52.020
Okay. That's awesome. It's a very exciting time that'll let us be more concurrent and use all these

00:08:52.020 --> 00:08:55.120
different cores on all these different things. You know, phones even have four cores now.

00:08:55.120 --> 00:08:57.220
Yeah. Watches even have multiple cores now.

00:08:57.220 --> 00:08:59.800
Yeah. It's crazy. It's a crazy time to live.

00:08:59.800 --> 00:09:03.560
It is a crazy time. You said another thing you're working on is Fathom. What's Fathom?

00:09:03.560 --> 00:09:08.780
So Fathom kind of fits into all this parsing subject. Fathom is my new kind of toy over at

00:09:08.780 --> 00:09:14.640
Mozilla. It's a mad scientist project to see if we can make it easier to write semantic extractors for the web.

00:09:14.740 --> 00:09:18.880
So an example of a semantic extractor you might know is something along the lines of Readability

00:09:18.880 --> 00:09:24.500
or a browser's reader mode, where it just pulls the content text out and dispenses with headers and

00:09:24.500 --> 00:09:30.000
footers and ads and such. But other things you might want to extract are, hey, let's teach a browser

00:09:30.000 --> 00:09:35.320
to recognize what a previous or next button looks like. So that maybe the browser can let you assign

00:09:35.320 --> 00:09:40.440
a keystroke to that in the general sense and not have to chase them as they bounce around as you advance

00:09:40.440 --> 00:09:45.220
through a slideshow. Or maybe we can teach the browser at a deeper level to appreciate what an

00:09:45.220 --> 00:09:50.900
advertisement is or what a navigation element is. So maybe we can collapse those on small screens and

00:09:50.900 --> 00:09:56.000
hide them in a menu. There's really endless potential to this. And I'm trying to

00:09:56.000 --> 00:10:02.360
fix it so those extractors are easier to maintain, faster to write, and become less of a mess.

00:10:02.620 --> 00:10:08.160
If you were to read the Readability source code, which is kind of the thing that both Firefox's

00:10:08.160 --> 00:10:13.100
reader mode and Safari's reader mode are based on, you get the sense that it's been written by hand,

00:10:13.100 --> 00:10:19.980
maintained over time. There is state flying everywhere. It's hard to tell where to make tweaks.

00:10:21.080 --> 00:10:28.440
So what Fathom does is express these extractors as lists of unordered rules in a Prolog sense. If

00:10:28.440 --> 00:10:32.600
you've ever played with Prolog, it's just a whole bunch of kind of logical statements. And then

00:10:32.600 --> 00:10:37.900
the environment figures out how to fit those together and run them. Well, Fathom works along those

00:10:37.900 --> 00:10:45.680
same lines. And as a result, since order doesn't matter, third parties can tweak an

00:10:45.680 --> 00:10:50.860
existing extractor just by inserting their own rules. They can say, here, whenever you see

00:10:50.860 --> 00:10:55.600
whatever kind of element, for example, I want to boost the score by this much or lower it by that

00:10:55.600 --> 00:11:01.840
much. Nice. So you could put like understanding of time of calendars and dates and whatnot, possibly

00:11:01.840 --> 00:11:05.920
as another rule. Exactly. Hey, this looks like a calendar or, Hey, this looks like a payment form.
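
NOTE
A minimal sketch of the unordered-rules idea described above, written in Python for
illustration. This is not Fathom's API (Fathom itself is JavaScript). The element dicts,
rule names, and multipliers are all hypothetical. Because each rule only multiplies a score,
the result does not depend on rule order, so a third party can add a rule without editing
the existing ones.

NOTE
```python
from itertools import permutations
#
# Hypothetical rules: each inspects an element (a plain dict) and returns a score multiplier.
def boost_paragraphs(element):
    return 2.0 if element["tag"] == "p" else 1.0
#
def penalize_ads(element):
    return 0.25 if "ad" in element["classes"] else 1.0
#
def boost_long_text(element):
    return 1.5 if len(element["text"]) > 40 else 1.0
#
def score(element, rules):
    """Multiply together every rule's verdict; multiplication makes rule order irrelevant."""
    total = 1.0
    for rule in rules:
        total *= rule(element)
    return total
#
base_rules = [boost_paragraphs, penalize_ads]
# A third party tweaks the extractor by adding a rule, not by editing the existing ones.
tweaked_rules = base_rules + [boost_long_text]
article = {"tag": "p", "classes": [], "text": "Parsing horrible things is a gateway to computer science."}
banner = {"tag": "div", "classes": ["ad"], "text": "Buy now!"}
article_score = score(article, tweaked_rules)
banner_score = score(banner, tweaked_rules)
# Every ordering of the rules yields the same score for the article.
orderings_agree = all(
    score(article, list(order)) == article_score for order in permutations(tweaked_rules)
)
print(article_score, banner_score, orderings_agree)
```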

00:11:05.920 --> 00:11:11.280
Let me help you fill it. Yeah. Okay. Yeah. That's awesome. Is that in Firefox yet? Or is this,

00:11:11.280 --> 00:11:15.660
this is just a project so far? We hope to get it in so that other things

00:11:15.660 --> 00:11:20.760
can make use of it more easily, but it's already embeddable within Firefox add-ons and has

00:11:20.760 --> 00:11:25.160
been embedded in at least one add-on. It's kind of a Fathom debugging add-on.

00:11:25.160 --> 00:11:30.840
It runs on the server side. It's all just kind of vanilla ES6. You can compile it down to ES5 for

00:11:30.840 --> 00:11:35.240
older browsers. It kind of runs all over the place. I wrote it in JavaScript despite not liking

00:11:35.240 --> 00:11:41.320
JavaScript myself so that we could get it popular. Yeah. Sometimes you got to go with

00:11:41.320 --> 00:11:44.900
that. Well, that sounds really cool. And then another thing that you're involved in is

00:11:44.900 --> 00:11:49.080
Let's Encrypt, which is very exciting. Yes. Let's Encrypt has probably got the easiest

00:11:49.080 --> 00:11:52.040
business plan I've ever heard, which is to give away a hundred dollar bills.

00:11:52.040 --> 00:11:55.420
How's that work? How do I get my hundred dollar bill?

00:11:55.420 --> 00:12:00.180
Well, it's very easy. So certificates,

00:12:00.180 --> 00:12:04.300
SSL certificates, have historically been fairly expensive, on the order of a hundred bucks.

00:12:04.300 --> 00:12:09.420
And there are some other ways to get cheaper ones, through StartSSL or Comodo or whatever, but it's kind

00:12:09.420 --> 00:12:14.700
of a hassle. But what we do is we have a little command line tool that you run. And

00:12:14.700 --> 00:12:21.620
if you can authenticate yourself to our little certificate-creating server through a DNS record or

00:12:21.620 --> 00:12:27.440
putting a little thing on your server temporarily, then we will give you a cert, which is recognized by

00:12:27.440 --> 00:12:32.180
all the major browsers. So really there's no reason not to use it at this point.

00:12:32.180 --> 00:12:38.900
Yeah. So having encrypted content is obviously important if you're doing like e-commerce or something

00:12:38.900 --> 00:12:43.900
like that, but it's also just becoming increasingly important to be a first class citizen on the web,

00:12:43.900 --> 00:12:44.160
right?

00:12:44.160 --> 00:12:48.800
It really is. I mean, it gives you the impression of trust, first of all, if people are

00:12:48.800 --> 00:12:53.960
watching their URL bars, but also it's important for your visitors just to have the requests be private.

00:12:53.960 --> 00:12:59.740
If I'm surfing to Wikipedia and not doing anything particularly suspicious, but say I

00:12:59.740 --> 00:13:04.140
suspect I have a medical condition and I'm reading about all these different skin diseases or whatever

00:13:04.140 --> 00:13:10.340
is ailing me. I don't want my ISP logging that away and selling it to their marketing partners,

00:13:10.340 --> 00:13:16.220
which until the FCC's ruling last week was perfectly legal to do. State actors, I mean,

00:13:16.220 --> 00:13:23.060
the best defense we have against really anybody in the future coming to power and looking into our past

00:13:23.060 --> 00:13:27.700
and seeing things they don't like, and then coming down on us is keeping things that are our business,

00:13:27.920 --> 00:13:34.020
our business, and using SSL certificates and surfing to secure sites is the best way to do that

00:13:34.020 --> 00:13:34.420
right now.

00:13:34.420 --> 00:13:40.720
Yeah. And the fact that Let's Encrypt is free makes that very, very possible. And I think that's,

00:13:40.720 --> 00:13:47.040
that's great. You know, there are really interesting studies quoted in the original Edward Snowden book

00:13:47.040 --> 00:13:52.040
that came out, I think, by Greenwald. I can't remember that guy's name. Glenn Greenwald, I believe.

00:13:52.120 --> 00:13:56.780
Yes. That guy. And it was a great book. And basically, you know, a lot of people say, look,

00:13:56.780 --> 00:14:01.520
I don't care about this privacy stuff. Like I have nothing to hide, but there've been psychological

00:14:01.520 --> 00:14:07.380
studies and social studies saying people behave differently if they know somebody's listening.

00:14:07.380 --> 00:14:13.120
they might not break a rule in private, but they behave differently. They are slightly more private,

00:14:13.120 --> 00:14:19.620
less willing to think, you know, sort of contrarian ideas and at least share them. And the less that

00:14:19.620 --> 00:14:24.500
people are watching, the better, as far as I'm concerned, for people in general.

00:14:24.500 --> 00:14:29.420
Yeah. It's really the idea of chilling effects, which came out of some academic institution. It's

00:14:29.420 --> 00:14:35.120
a wonderful phrase. And when you have chilling effects in operation, where you are afraid to do

00:14:35.120 --> 00:14:40.080
perfectly legal things, which otherwise you might not do, democracy really cannot function.

00:14:40.080 --> 00:14:46.480
How democracy works is by means of, well, really the Overton window, this range of things that you're

00:14:46.480 --> 00:14:50.720
allowed to say and think that don't get you kicked out of cocktail parties.

00:14:50.720 --> 00:14:56.200
And this window moves around over time. It's, you know, yesterday's conservative, rather yesterday's

00:14:56.200 --> 00:15:02.080
progressive is today's conservative. And if we're not allowed to play with ideas that are just outside

00:15:02.080 --> 00:15:07.740
the bounds of the Overton window, then the future really has nowhere to pick its new ideas from.

00:15:07.860 --> 00:15:12.980
And we just kind of stagnate and the powers that be become the powers that always will be.

00:15:12.980 --> 00:15:17.580
And the ideas that are, are the ideas that always will be, and we can't really go anywhere.

00:15:17.580 --> 00:15:22.380
So, yeah, I think privacy is key to having a functioning democracy at all.

00:15:22.380 --> 00:15:28.460
Yeah, I totally agree. And if you don't buy that, Google ranks sites higher if they have SSL.

00:15:28.940 --> 00:15:31.920
We all want to rank higher in Google, right? So there's the Google juice.

00:15:31.920 --> 00:15:38.540
There's the final straw to, like, start encrypting stuff. My blog's encrypted. This podcast site is encrypted and so on.

00:15:38.540 --> 00:15:42.020
And I think it's great. Happy to do that on as much as I can.

00:15:42.020 --> 00:15:47.960
I'm wondering, how did you become interested in parsing all these horrible things? Where did you get started with that?

00:15:48.120 --> 00:15:54.220
Well, I guess it probably started when I was seven years old and playing around with BASIC and HyperCard and thinking,

00:15:54.220 --> 00:15:59.660
oh, you know, I want to write my own programming language because, of course, you're seven and no one has told you that it's hard.

00:15:59.660 --> 00:16:07.560
And if you spend a couple of days trying to do that with just if statements, you end up in this spaghetti-fied mess and you can't get anywhere.

00:16:07.560 --> 00:16:10.960
And you're just amazed that anyone has ever managed to write a language ever.

00:16:11.760 --> 00:16:17.920
So, you know, then put in a little pause of about 25 years and it came up at work.

00:16:17.920 --> 00:16:22.580
Over at Mozilla, we had a support site. We still have a support site.

00:16:22.580 --> 00:16:30.880
Support.mozilla.org. And it's got a wiki and it's got a Stack Overflow clone and all this different stuff to help out people who are using Firefox and our other products.

00:16:30.880 --> 00:16:41.160
And our wiki is powered by not MediaWiki itself, but the MediaWiki syntax for various hysterical purposes, as we like to say.

00:16:41.160 --> 00:16:48.180
And not only do we want the MediaWiki syntax, but we wanted to be able to add our own little directives to the syntax.

00:16:48.180 --> 00:16:57.580
Crazy little things that'll maybe, well, one example is we wanted to change the text that would come up according to which version of Firefox someone was using to visit the site.

00:16:57.580 --> 00:17:10.680
It was very, very difficult to make those edits to the implementation of MediaWiki that we were using, which was a port by David Cramer, very nice little port, of the original MediaWiki machinery.

00:17:10.680 --> 00:17:15.100
Such a direct port, in fact, that it still had dollar signs in the comments from the PHP.

00:17:15.100 --> 00:17:22.020
Right. Yeah, it was originally PHP and you guys translated it over to Python and that helped because at least you were working in Python.

00:17:22.020 --> 00:17:25.940
But you said the way that it worked was the parser was pretty insane.

00:17:25.940 --> 00:17:28.560
Like it was just like a crazy bunch of regexes, right?

00:17:28.560 --> 00:17:31.880
Yeah, there were, I think, 41 regexes.

00:17:31.880 --> 00:17:47.700
And then there was another 2,100 lines of PHP, which just ran them over and over again against the source text, finding and replacing and finding and replacing, hopefully in the right order, interacting with each other, dropping little markers so that they didn't smash over each other when they shouldn't.

00:17:47.700 --> 00:17:51.520
And then hopefully at the end out comes the proper rendered text.

00:17:51.520 --> 00:18:01.400
Of course, in reality, MediaWiki language changes from release to release as they find little corner cases that this crazy slapdash loopy way didn't handle.

00:18:01.400 --> 00:18:04.560
Yeah, you called it the Klingon MediaWiki.

00:18:04.560 --> 00:18:08.900
Yeah, I look at these regexes and I think, well, this is the original Klingon, clearly.

00:18:08.900 --> 00:18:09.880
Yes, obviously.

00:18:09.880 --> 00:18:11.920
Nice.

00:18:11.920 --> 00:18:18.620
And so you were looking for a way to escape this Klingon parser world and create something nicer.

00:18:19.100 --> 00:18:26.920
Like one of the problems you said that was inherent in the algorithm was it would directly parse into its new representation.

00:18:26.920 --> 00:18:27.840
Right.

00:18:27.840 --> 00:18:29.960
There was never an intermediate representation.

00:18:29.960 --> 00:18:32.840
So tell people, like, why do you care?

00:18:32.840 --> 00:18:35.000
Like, what does the intermediate representation do for you?

00:18:35.000 --> 00:18:37.020
Well, it gives you flexibility, like any abstraction.

00:18:37.020 --> 00:18:42.180
So let's say we have this imaginary intermediate representation for MediaWiki syntax.

00:18:42.180 --> 00:18:43.480
We bring in the MediaWiki syntax.

00:18:43.480 --> 00:18:47.900
We parse it into a tree because that's what all these things end up as in parsing land.

00:18:48.160 --> 00:18:49.380
And then we have a lot of options.

00:18:49.380 --> 00:18:55.320
We can output plain text from that, ignoring the bits of the tree that say bold or italic.

00:18:55.320 --> 00:18:59.820
We can render out HTML from that, not ignoring those things.

00:18:59.820 --> 00:19:06.340
We can go hunting through for just, say, date or time elements and pull out some date and timey entities.

00:19:06.340 --> 00:19:09.600
Once you have this abstraction, you have any kind of output you like.

00:19:09.600 --> 00:19:12.340
Or you can do any kind of analysis or transformation you like.

00:19:12.340 --> 00:19:12.740
Nice.

00:19:12.740 --> 00:19:25.180
So, for example, if you wanted to possibly represent stuff by a markdown output or an HTML output or plain text, those are super easy because both, like, markdown and HTML have a bold concept.

00:19:25.180 --> 00:19:27.720
One is a bracket, you know, angle bracket strong.

00:19:27.720 --> 00:19:29.020
One is a star.

00:19:29.020 --> 00:19:33.960
But it's, you know, you can do that final translation pretty easily once you have the tree, right?

00:19:34.280 --> 00:19:34.720
Exactly.

00:19:34.720 --> 00:19:38.380
That's the trivial part, going from tree back to a linear representation.
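
NOTE
A minimal sketch of the point above: once the markup is a tree, each output format is a
short walk over it. The node format (a string, or a tuple of kind and children), the
function names, and the sample tree are hypothetical, not MediaWiki's or Mozilla's actual
representation. Text is not HTML-escaped here, to keep the sketch short.

NOTE
```python
# Hypothetical tree format: a node is either a plain string or a (kind, children) tuple.
tree = ("doc", ["Parsing is ", ("bold", ["really ", ("italic", ["fun"])]), "."])
#
HTML_TAGS = {"bold": "strong", "italic": "em"}
MARKDOWN_MARKS = {"bold": "**", "italic": "*"}
#
def to_text(node):
    """Plain text: ignore the bold/italic nodes and keep only the words."""
    if isinstance(node, str):
        return node
    _kind, children = node
    return "".join(to_text(child) for child in children)
#
def to_html(node):
    """HTML: wrap known kinds in tags; unknown kinds (like 'doc') add no wrapper."""
    if isinstance(node, str):
        return node
    kind, children = node
    inner = "".join(to_html(child) for child in children)
    tag = HTML_TAGS.get(kind)
    return f"<{tag}>{inner}</{tag}>" if tag else inner
#
def to_markdown(node):
    """Markdown: surround known kinds with their marker characters."""
    if isinstance(node, str):
        return node
    kind, children = node
    inner = "".join(to_markdown(child) for child in children)
    mark = MARKDOWN_MARKS.get(kind, "")
    return f"{mark}{inner}{mark}"
#
print(to_text(tree))      # Parsing is really fun.
print(to_html(tree))      # Parsing is <strong>really <em>fun</em></strong>.
print(to_markdown(tree))  # Parsing is **really *fun***.
```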

00:19:38.380 --> 00:19:41.160
So the hard part is getting it into a tree, huh?

00:19:41.160 --> 00:19:41.760
Right.

00:19:41.760 --> 00:19:44.040
Now, we're doing that as we talk, which is nice.

00:19:44.040 --> 00:19:46.360
Now, you make a linear sequence of sounds.

00:19:46.360 --> 00:19:47.060
And I hear them.

00:19:47.120 --> 00:19:51.980
And I deconstruct them back into, you know, phrases and word pairs and things and idioms.

00:19:51.980 --> 00:19:53.540
And I say, wow, that's probably what he means.

00:19:53.540 --> 00:19:55.280
And then you do it back from my end.

00:19:55.280 --> 00:19:55.820
Yeah, exactly.

00:19:55.820 --> 00:20:02.560
So pretty much pulling structure out of any flat, linear stream of data is parsing, right?

00:20:02.560 --> 00:20:03.460
Exactly.

00:20:03.460 --> 00:20:06.340
And so the applications are as wide as you like.

00:20:06.340 --> 00:20:11.360
I mean, anything that has any kind of structure, text, sound, you can think of musical phrases

00:20:11.360 --> 00:20:17.900
and the like. A lot of music has a theme and variations or development of a thematic statement.

00:20:17.900 --> 00:20:22.140
On one hand, I sort of can see how you would do some of this with regular expressions.

00:20:22.140 --> 00:20:27.160
But it's really the goal is to move it into this parse tree because then it's not just transforming

00:20:27.160 --> 00:20:30.880
one bit of text into another text, but it's actually transforming into the structure that

00:20:30.880 --> 00:20:33.040
we can do all kinds of interesting things, huh?

00:20:33.040 --> 00:20:33.660
Well, right.

00:20:33.660 --> 00:20:37.980
And more formally, regular expressions can't really support nesting.

00:20:37.980 --> 00:20:42.020
Now, there are little corner cases that people are going to write in about where things that

00:20:42.020 --> 00:20:47.180
are called regular expressions nowadays, like in Perl, have support for nesting bolted on,

00:20:47.180 --> 00:20:49.860
but then they cease to be proper regular expressions.

00:20:49.860 --> 00:20:50.780
Exactly.

00:20:50.780 --> 00:20:56.980
Proper parsers, on the other hand, descend into what we call, like, a Chomsky type one or

00:20:56.980 --> 00:20:59.820
two grammar and can support nesting.

00:20:59.820 --> 00:21:04.900
So you could, say, understand sets of nested parentheses of arbitrary depth.

00:21:04.900 --> 00:21:08.800
And that's how things like, well, HTML, for example, are.

00:21:08.800 --> 00:21:13.340
You could have a bold tag with an italic tag in it with another bold tag in it, ad infinitum.
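
NOTE
A minimal sketch of a parser that understands nested parentheses of arbitrary depth, the
case a classical regular expression cannot handle. It is a small hand-written recursive
descent parser; the function names are illustrative, and depth is limited in practice only
by Python's recursion limit.

NOTE
```python
def parse_parens(text):
    """Parse a string such as '(()())' into nested lists; raise ValueError if unbalanced."""
    def parse_group(pos):
        # Called just after an opening '(' was consumed.
        # Returns (children, position just past the matching ')').
        children = []
        while pos < len(text):
            if text[pos] == "(":
                child, pos = parse_group(pos + 1)  # recursion handles the nesting
                children.append(child)
            elif text[pos] == ")":
                return children, pos + 1
            else:
                raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        raise ValueError("missing closing parenthesis")
    groups, pos = [], 0
    while pos < len(text):
        if text[pos] != "(":
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        group, pos = parse_group(pos + 1)
        groups.append(group)
    return groups
#
def depth(groups):
    """Maximum nesting depth of a parsed result."""
    return 1 + max(depth(group) for group in groups) if groups else 0
#
print(parse_parens("(()())"))         # [[[], []]]
print(depth(parse_parens("((()))")))  # 3
```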

00:21:13.340 --> 00:21:13.800
Yeah.

00:21:13.800 --> 00:21:17.140
Div, span, A, all sorts of stuff.

00:21:17.140 --> 00:21:17.380
Yeah.

00:21:17.380 --> 00:21:18.260
More div.

00:21:18.260 --> 00:21:22.600
And in addition to the nesting, a proper parser gives you the ability to read what you're doing,

00:21:22.600 --> 00:21:23.060
essentially.

00:21:23.060 --> 00:21:26.520
Regexes are famous for being a write-only language.

00:21:27.120 --> 00:21:27.520
Absolutely.

00:21:27.520 --> 00:21:29.100
I definitely think of them as write-only.

00:21:29.100 --> 00:21:29.640
Yeah.

00:21:29.640 --> 00:21:33.400
And you can comment them as much as you like and put white space in there, but it still

00:21:33.400 --> 00:21:39.720
becomes, you know, at least awkward to name sub-patterns, to repeat sub-patterns, and to

00:21:39.720 --> 00:21:42.220
maintain something where you have repeated bits of regex.

00:21:42.220 --> 00:21:42.760
Sure.

00:21:42.760 --> 00:21:46.620
Well, and as it gets increasingly complicated, the more you need the regular expression, the

00:21:46.620 --> 00:21:48.160
less you can understand it, I think.

00:21:48.160 --> 00:21:48.900
Yeah, for sure.

00:21:48.900 --> 00:21:52.300
There seems to be a ceiling for one reason or another with straight-out regexes.

00:21:52.640 --> 00:21:52.760
Yeah.

00:21:52.760 --> 00:21:58.600
So you talked about there being different types of grammars that are different complexities

00:21:58.600 --> 00:21:59.320
of grammars.

00:21:59.320 --> 00:21:59.760
Yeah.

00:21:59.760 --> 00:22:02.500
There's the idea of the Chomsky hierarchy of grammars.

00:22:02.500 --> 00:22:04.540
And Chomsky is really a linguist.

00:22:05.240 --> 00:22:11.140
And so he spoke of these grammars in terms of generative power, where, you know, I can

00:22:11.140 --> 00:22:16.380
start from a production, a production name, something like sentence, and then descend in

00:22:16.380 --> 00:22:20.620
one level and say, well, sentences, you know, phrase, and then maybe another phrase, and maybe

00:22:20.620 --> 00:22:21.460
another phrase.

00:22:21.460 --> 00:22:22.520
And what are the phrases?

00:22:22.520 --> 00:22:25.380
Well, we have a subject, and we have verb, and we have an object.

00:22:25.580 --> 00:22:28.580
And he's interested in creating sentences from the top level.

00:22:28.580 --> 00:22:34.840
Now, as computer scientists, we're usually more interested in going in the opposite direction,

00:22:34.840 --> 00:22:40.380
starting with this complete sentence, and then kind of inferring out the structure, recognizing,

00:22:40.380 --> 00:22:41.660
as they call it in the literature.

00:22:41.660 --> 00:22:45.960
So Chomsky came out with Chomsky hierarchy of grammar, something like that.

00:22:46.320 --> 00:22:49.220
Anyway, levels 0, 1, 2, and 3.

00:22:49.220 --> 00:22:53.040
And type 0 is basically a free-for-all.

00:22:53.040 --> 00:22:56.320
It kind of comes out looking like a directed graph.

00:22:56.320 --> 00:23:02.700
And there's not really readily available kit for parsing that sort of thing, or any algorithmic

00:23:02.700 --> 00:23:03.780
bounds to its complexity.

00:23:03.780 --> 00:23:08.060
When we get down to a level 1, that's your context-sensitive grammars.

00:23:08.060 --> 00:23:13.180
And Python is actually a context-sensitive grammar, I believe only for the reason that it has,

00:23:13.180 --> 00:23:15.640
only because it has whitespace sensitivity.

00:23:15.640 --> 00:23:20.260
So I was actually trying to parse something like Python the other day, a little side project

00:23:20.260 --> 00:23:21.140
of mine called Turtles.

00:23:21.140 --> 00:23:23.200
And I ran up against this.

00:23:23.200 --> 00:23:29.600
When you indent in Python, in a normal language, in a curly brace language, when you go inward

00:23:29.600 --> 00:23:31.440
a block, you say, curly brace.

00:23:31.440 --> 00:23:34.380
And then when you end a block, you say, end brace.

00:23:34.380 --> 00:23:40.160
Now, in Python, think about trying to write a tokenizer for Python, trying to find those

00:23:40.160 --> 00:23:42.660
instances where we go in a level or out a level.

00:23:42.660 --> 00:23:44.840
You're reading along, you're reading along.

00:23:45.200 --> 00:23:48.020
You're at the beginning of a line, and you see eight spaces.

00:23:48.020 --> 00:23:49.380
Did you go in a level?

00:23:49.380 --> 00:23:50.440
I don't know.

00:23:50.440 --> 00:23:52.260
What is the previous line?

00:23:52.260 --> 00:23:53.260
Is it four spaces?

00:23:53.260 --> 00:23:54.920
If so, I went in a level.

00:23:54.920 --> 00:24:01.240
So you can see that eight space span is interpreted differently depending on our context, depending

00:24:01.240 --> 00:24:02.840
on what the previous line had done.

00:24:02.840 --> 00:24:05.120
So Python is a context-sensitive grammar.
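
NOTE
A rough sketch of the indentation problem described above, assuming
spaces-only indentation. The tokenizer keeps a stack of open indentation
widths, so the same eight spaces can mean "go in a level" or "stay put"
depending on earlier lines. The name indent_tokens and the token labels are
hypothetical; this is not CPython's actual tokenizer.
```python
def indent_tokens(lines):
    """Yield (kind, text) tokens, where kind is INDENT, DEDENT, or LINE."""
    stack = [0]  # indentation widths of the blocks currently open
    for line in lines:
        if not line.strip():
            continue  # blank lines do not change the block structure
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)  # deeper than the enclosing block
            yield ("INDENT", "")
        while width < stack[-1]:
            stack.pop()  # close every block we have left
            yield ("DEDENT", "")
        if width != stack[-1]:
            raise IndentationError("dedent does not match any open block")
        yield ("LINE", line.strip())
    while len(stack) > 1:
        stack.pop()  # close blocks still open at end of input
        yield ("DEDENT", "")
```
The stack is the "context": without it, a run of leading spaces cannot be
classified.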

00:24:05.700 --> 00:24:12.120
Now, the rest of Python is at Chomsky level two, which is a context-free grammar.

00:24:12.560 --> 00:24:16.620
So if you were to dig into the Python source code, one of these files there, there's a little,

00:24:16.620 --> 00:24:22.040
let's see if I have it sitting here, a nice little sort of summary where it says, okay,

00:24:22.040 --> 00:24:24.540
here's what a suite is, a series of statements.

00:24:24.540 --> 00:24:27.440
A suite is a statement, statement, statement, or statement star.

00:24:27.740 --> 00:24:28.840
But it's always that.

00:24:28.840 --> 00:24:30.060
A suite is always this.

00:24:30.060 --> 00:24:31.980
An if statement is always that.

00:24:31.980 --> 00:24:34.200
You know, a function definition is always this.

00:24:34.200 --> 00:24:36.620
It doesn't matter what came before or what comes after.

00:24:36.620 --> 00:24:38.200
It's context-free.

00:24:38.200 --> 00:24:38.700
Right.

00:24:38.700 --> 00:24:43.100
Like, the if statement doesn't mean something different if it's in a while loop.

00:24:43.100 --> 00:24:44.840
It always means the same thing.

00:24:45.060 --> 00:24:49.640
So if you know what an if is, like, if you can define the structure of an if, you can define

00:24:49.640 --> 00:24:50.200
its meaning.

00:24:50.200 --> 00:24:52.120
And you don't have to do more interesting parsing.

00:24:52.120 --> 00:24:52.460
Yep.

00:24:52.460 --> 00:24:53.160
All the time.

00:24:53.160 --> 00:24:54.700
All the time, if means the same thing.

00:24:54.700 --> 00:24:57.100
There are no modes to go into or out of.

00:25:11.740 --> 00:25:16.140
This portion of Talk Python To Me is brought to you by GoCD from ThoughtWorks.

00:25:16.140 --> 00:25:20.800
GoCD is the on-premise, open-source, continuous delivery server.

00:25:20.800 --> 00:25:26.480
With GoCD's comprehensive pipeline modeling, you can model complex workflows for multiple

00:25:26.480 --> 00:25:27.600
teams with ease.

00:25:27.600 --> 00:25:33.420
And GoCD's value stream map lets you track changes from commit to deployment at a glance.

00:25:33.420 --> 00:25:38.420
GoCD's real power is in the visibility it provides over your end-to-end workflow.

00:25:38.420 --> 00:25:43.400
You get complete control of and visibility into your deployments across multiple teams.

00:25:43.400 --> 00:25:47.880
Say goodbye to release day panic and hello to consistent, predictable deliveries.

00:25:47.880 --> 00:25:52.780
Commercial support and enterprise add-ons, including disaster recovery, are available.

00:25:52.780 --> 00:25:58.740
To learn more about GoCD, visit talkpython.fm/gocd for a free download.

00:25:58.740 --> 00:26:02.040
That's talkpython.fm/gocd.

00:26:02.040 --> 00:26:03.140
Check them out.

00:26:03.140 --> 00:26:04.140
It helps support the show.

00:26:04.140 --> 00:26:18.100
One of the things that you said, a quote from your PyCon 2012 talk, which I'll link to

00:26:18.100 --> 00:26:24.160
in the show notes, of course, was that parsing is a gateway drug to other areas of computer

00:26:24.160 --> 00:26:24.660
science.

00:26:24.660 --> 00:26:25.440
It's true.

00:26:25.440 --> 00:26:30.300
I mean, once you have this tree-shaped intermediate representation, you can do anything you want.

00:26:30.300 --> 00:26:35.420
You can get into natural language processing, which if you're doing any of this in Python,

00:26:35.420 --> 00:26:38.100
NLTK is a not-to-be-missed library.

00:26:38.100 --> 00:26:43.920
It, first of all, is a shining example of what all library documentation should be.

00:26:43.920 --> 00:26:48.500
There is an NLTK book freely available and is a fantastic place to start if you want to

00:26:48.500 --> 00:26:53.540
start understanding human language with a computer or doing sorts of data mining and machine learning.

00:26:53.540 --> 00:26:55.240
It's also full of those sorts of algorithms.

00:26:55.240 --> 00:26:56.340
Great piece of kit.

00:26:56.720 --> 00:26:56.840
Yeah.

00:26:56.840 --> 00:26:57.500
Awesome.

00:26:57.540 --> 00:27:01.740
Also, from the intermediate representation, you can get into programming language design.

00:27:01.740 --> 00:27:02.520
That's fun.

00:27:02.520 --> 00:27:04.620
Make your own little toy language.

00:27:04.620 --> 00:27:10.820
Anybody can make a Lisp in a couple of days because the parsing is so darn simple.

00:27:10.820 --> 00:27:13.000
It's just a bunch of nested parentheticals.
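
NOTE
A small sketch of why Lisp-style syntax is so easy to parse: the whole
reader is one recursive walk that turns nested parentheses into nested
Python lists. The name read is hypothetical, atoms are kept as plain
strings, and unbalanced input is not checked here.
```python
def read(source):
    """Parse parenthesized expressions into nested Python lists."""
    tokens = source.replace("(", " ( ").replace(")", " ) ").split()
    def walk(pos):
        items = []
        while pos < len(tokens):
            tok = tokens[pos]
            if tok == "(":
                sub, pos = walk(pos + 1)  # recurse into the sub-list
                items.append(sub)
            elif tok == ")":
                return items, pos + 1  # this list is finished
            else:
                items.append(tok)  # a bare atom
                pos += 1
        return items, pos
    tree, _ = walk(0)
    return tree
```
Every sub-list in the result is one node of the tree-shaped intermediate
representation mentioned earlier.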

00:27:13.000 --> 00:27:15.520
There's a lot of parentheses there.

00:27:16.380 --> 00:27:16.560
Yeah.

00:27:16.560 --> 00:27:21.280
Now I'm a lot less intimidated to reach for custom little query languages.

00:27:21.280 --> 00:27:25.500
Like, DXR has a custom little query language, which looks a lot like Google's custom little

00:27:25.500 --> 00:27:26.200
query language.

00:27:26.200 --> 00:27:32.560
And it's, you know, 10 or 20 lines of grammar description that I feed into my library, Parsimonious.

00:27:32.560 --> 00:27:33.860
And out comes the tree.

00:27:33.860 --> 00:27:34.240
Yeah.

00:27:34.240 --> 00:27:35.240
That's really cool.

00:27:35.400 --> 00:27:44.620
I feel like when you learn these new techniques and these data structures and algorithms, like, you see problems where you just saw opaqueness before.

00:27:44.620 --> 00:27:46.660
You're like, oh, I could actually apply this thing.

00:27:46.660 --> 00:27:48.440
And out would pop interesting answers.

00:27:48.440 --> 00:27:50.620
Whereas before, you're just like, there's no way I can answer that.

00:27:50.620 --> 00:27:53.000
That's just text blobs, not structure.

00:27:53.000 --> 00:27:53.740
Yeah, for sure.

00:27:53.740 --> 00:27:54.860
It's just pattern recognition.

00:27:54.860 --> 00:27:58.560
The more you've exposed yourself to, the more you'll say, hey, this is just one of those.

00:27:59.520 --> 00:28:04.720
Let's review some of the various options for parsing text in Python today.

00:28:04.720 --> 00:28:09.340
So there's two pretty well-known ones that have been around for a while.

00:28:09.340 --> 00:28:11.480
There's pyparsing and there's PLY.

00:28:11.480 --> 00:28:14.860
So David Beazley's Python Lex-Yacc.

00:28:14.860 --> 00:28:16.720
Yeah, PLY is fantastic.

00:28:16.720 --> 00:28:19.260
And anything that David Beazley does is fantastic.

00:28:19.260 --> 00:28:22.660
You should immediately pause the podcast and go watch all of his talks.

00:28:24.940 --> 00:28:31.900
PLY came out of when he was teaching a course on building your own little Pascal interpreter, I believe.

00:28:31.900 --> 00:28:36.620
PLY has the advantage of being tripped over by hundreds and hundreds of students.

00:28:36.620 --> 00:28:39.580
And they've hit every little corner case and made every possible mistake.

00:28:39.580 --> 00:28:42.040
And so the error reporting is top notch.

00:28:42.040 --> 00:28:50.120
And really, I would seriously recommend taking a look at PLY if the complexity of what you're parsing is amenable to it.

00:28:50.680 --> 00:29:00.000
Now, the limit of PLY and why I couldn't use it for the MediaWiki stuff is that it implements what we call LR(1) parsing, and you can look up the formalisms.

00:29:00.000 --> 00:29:01.420
And I probably forget most of them.

00:29:01.420 --> 00:29:02.980
But this is something we did.

00:29:02.980 --> 00:29:12.800
I think most languages that are currently in production probably use something along the lines of LR(1) just because it's very, very memory efficient and CPU efficient as well.

00:29:13.000 --> 00:29:15.620
So it was a thing that we implemented when we had tiny little computers.

00:29:15.620 --> 00:29:27.020
But where you run into problems is that LR(1), that one means we can only look ahead one token to decide what kind of thing we're recognizing.

00:29:27.020 --> 00:29:31.260
And in the case of MediaWiki, that wasn't actually enough.

00:29:31.580 --> 00:29:37.520
For example, I believe it was internal links that begin with two brackets, bracket, bracket.

00:29:37.520 --> 00:29:42.780
So it's not every time that bracket, bracket means an internal link.

00:29:42.780 --> 00:29:55.040
It has to be followed by, oh, I'm going to get this wrong, but something along the lines of maybe a URL or page name, and then maybe a vertical bar, and then maybe a page title.

00:29:56.040 --> 00:30:02.000
And so you might have to look ahead two or three tokens to see if your internal link is going to work out.

00:30:02.000 --> 00:30:07.040
And if it doesn't, that bracket, bracket is just part of plain text and should be emitted verbatim.
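
NOTE
A hedged sketch of the "try it, and fall back to plain text" behavior
described above. The link rules here (a closing "]]", no "[" or newline
inside, optional "|label") are simplified assumptions for illustration, not
MediaWiki's real rules, and try_link and parse are hypothetical names.
```python
def try_link(text, pos):
    """Return (target, label, new_pos) if a link starts at pos, else None."""
    if not text.startswith("[[", pos):
        return None
    end = text.find("]]", pos + 2)
    if end == -1:
        return None  # never closed: caller rewinds and treats "[[" as text
    body = text[pos + 2:end]
    if not body or "[" in body or "\n" in body:
        return None
    target, _, label = body.partition("|")
    return target, label or target, end + 2
def parse(text):
    """Split text into ("link", target, label) and ("text", chunk) pieces."""
    pieces, pos, plain = [], 0, []
    while pos < len(text):
        link = try_link(text, pos)
        if link is None:
            plain.append(text[pos])  # emit this character verbatim
            pos += 1
            continue
        if plain:
            pieces.append(("text", "".join(plain)))
            plain = []
        target, label, pos = link
        pieces.append(("link", target, label))
    if plain:
        pieces.append(("text", "".join(plain)))
    return pieces
```
Deciding that "[[oops" is not a link requires scanning well past one token,
which is the lookahead a one-token parser cannot spend.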

00:30:07.040 --> 00:30:14.420
And so we couldn't use any LR(1) parser, and I put that on my shopping list and ruled out a whole lot of libraries because of it.

00:30:14.420 --> 00:30:14.740
Right.

00:30:14.740 --> 00:30:19.500
It just needs to keep more of it in its head, in the algorithm, all at once, right?

00:30:19.500 --> 00:30:20.500
Pretty much, yeah.

00:30:20.500 --> 00:30:20.900
All right.

00:30:20.900 --> 00:30:22.100
So what's the story with pyparsing?

00:30:22.100 --> 00:30:32.300
So pyparsing is what I consider or was the canonical, I reached for it first, Python parsing library before I wrote my own, of course, because I'm in that bad habit.

00:30:32.300 --> 00:30:35.920
And pyparsing is, you know, fairly Pythonic.

00:30:35.920 --> 00:30:41.940
It mashes all the grammar definition stuff into Python objects.

00:30:42.800 --> 00:30:56.860
So you literally say in a piece of Python, bold toggle equals literal parentheses, and then you construct an object, you say, quote, quote, quote, and say, okay, well, quote, quote, quote, is the thing that turns on bold in MediaWiki syntax.

00:30:57.340 --> 00:31:07.340
And then on that object, you call methods like set name bold toggle, so that when you have this tree, you can actually tell that this thing was a bold toggle as you're walking the tree.

00:31:07.340 --> 00:31:09.620
And kind of on you go.

00:31:09.620 --> 00:31:14.480
Now, the disadvantage of this is it's kind of hard to read.

00:31:14.480 --> 00:31:18.300
It's kind of wordy because, after all, we had to make this valid Python.

00:31:18.780 --> 00:31:22.240
And you can't do certain things like make forward references.

00:31:22.240 --> 00:31:26.820
And oftentimes there are cyclical references in a non-trivial grammar.

00:31:26.820 --> 00:31:37.380
In the case of pyparsing, they had to put a little hack around that and say, you know, bold toggle equals forward, which is sort of a promise that I'm going to declare something later.

00:31:37.380 --> 00:31:40.100
And then later on, you kind of jam that on.

00:31:40.100 --> 00:31:49.580
For people who haven't seen the syntax of pyparsing, it's kind of like formalized Python that acts as regular expressions.

00:31:49.580 --> 00:31:58.400
So you'll see like Python objects, but they're clearly coming from sort of regular expression land like groups and whatnot, right?

00:31:58.400 --> 00:32:02.180
Well, I mean, groups is a valid word to use when you're talking about parsing.

00:32:02.180 --> 00:32:06.920
After all, we're talking about trees here, and a tree is nothing more than a nested list.

00:32:06.920 --> 00:32:09.500
And so every sub list you can think of as a group.

00:32:09.880 --> 00:32:11.760
So that's what pyparsing is doing with the groups.

00:32:11.760 --> 00:32:12.440
Yeah, okay.

00:32:12.440 --> 00:32:17.380
And pyparsing is of, I think, equal power with Parsimonious.

00:32:17.380 --> 00:32:25.420
They're both able to describe all the context-free grammars and probably a subset of context-sensitive ones.

00:32:25.420 --> 00:32:28.340
And Parsimonious is your library that you wrote, right?

00:32:28.340 --> 00:32:30.440
Yes, it's my mad scientist experiment.

00:32:30.440 --> 00:32:35.080
Though its version number starts with zero point, I have it in production all over the place.

00:32:35.080 --> 00:32:36.340
There are lots of people using it, so.

00:32:36.340 --> 00:32:36.700
Nice.

00:32:36.700 --> 00:32:38.920
Do you have some examples for what it's being used for?

00:32:39.300 --> 00:32:42.760
Yeah, well, I'm using it in DXR to parse our little query language.

00:32:42.760 --> 00:32:44.500
What other people are using it for?

00:32:44.500 --> 00:32:47.700
You know, they don't report back to me, but it gets a lot of downloads.

00:32:47.700 --> 00:32:48.800
Yeah, excellent.

00:32:48.800 --> 00:32:49.300
So it's out there somewhere.

00:32:49.300 --> 00:32:49.920
Excellent.

00:32:49.920 --> 00:32:54.580
So to me, it seems like Parsimonious, it's a little simpler to define the grammar.

00:32:54.580 --> 00:32:55.240
Is that right?

00:32:55.240 --> 00:33:01.840
That was really my goal, both to make it simpler, to make it run fast, and to make it optimizable.

00:33:01.840 --> 00:33:08.300
One of the goals you said was frugal RAM use, which I thought was just a great way to phrase it.

00:33:08.300 --> 00:33:13.560
Yeah, and I haven't done a lot of RAM profiling, so I'm not ready to make any claims about that just yet.

00:33:13.560 --> 00:33:22.720
But the formalism underneath Parsimonious, which is parsing expression grammars, which come out of a 2004 paper by Bryan Ford, meets that goal just fine.

00:33:22.720 --> 00:33:31.260
I think RAM use is something along the lines of order n to the third, but n is the grammar size, so it's actually not that big an n.

00:33:31.600 --> 00:33:40.080
And as a result of blowing that RAM on caches for each individual little production, each individual little context-free equals,

00:33:40.080 --> 00:33:42.560
You get linear parse time.

00:33:42.560 --> 00:33:43.520
Yeah, which is great.

00:33:43.520 --> 00:33:44.780
Yeah, that's really, really nice.

00:33:44.780 --> 00:33:54.400
So the reason people care about RAM is, I mean, if you're running this over some text and you're just going to dump it out and produce a text version, that's fine.

00:33:54.400 --> 00:34:04.420
But if you're wanting to keep that in memory and say like a web server and continuously serve requests from it or something, then all of a sudden you care way more about RAM, right?

00:34:04.420 --> 00:34:10.640
Well, just imagine serving up a bunch of MediaWiki pages and you've got to be parsing 100 of these at a time on a web server.

00:34:10.640 --> 00:34:12.760
You know, RAM is not free.

00:34:12.760 --> 00:34:15.480
No, it's definitely not free, especially if you've got a lot of traffic.

00:34:15.480 --> 00:34:23.160
So one of the algorithms involved in here is you said that it uses something called the packrat algorithm, which sounds fun.

00:34:23.160 --> 00:34:25.100
What's that?

00:34:25.100 --> 00:34:26.760
Yeah, I think Bryan Ford came up with that too.

00:34:26.760 --> 00:34:28.720
It's really simple.

00:34:28.720 --> 00:34:31.380
It's what you would come up with yourself if you're implementing one of these.

00:34:31.380 --> 00:34:36.020
These PEG parsers, they're really just recursive descent parsers.

00:34:36.020 --> 00:34:45.140
And as you descend, you might find out, hey, now that I've looked ahead two tokens and I find out this internal link isn't going to work out for me, I need to rewind a little.

00:34:45.140 --> 00:34:51.900
Let me take a couple steps back up the stack and let me try parsing it as plain text, for example, or maybe as an external link, for example.

00:34:52.500 --> 00:34:58.400
And in the course of that, you may need to use a partial parse from a previous stack frame.

00:34:58.400 --> 00:35:04.640
And rather than redoing that work, it makes sense oftentimes to cache the results of each partial parse.

00:35:04.640 --> 00:35:06.320
And that's all packrat does.

00:35:06.320 --> 00:35:08.640
Just keeps all these intermediate results around.
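
NOTE
A minimal sketch of the caching idea described above, assuming a toy
grammar: expr = term "+" expr / term, and term = one or more digits. Each
rule's result is cached per input position, so when the first alternative
fails and the parser rewinds, the second alternative reuses the stored
result. functools.lru_cache stands in for the packrat table; make_parser
and the call counter are hypothetical and are not Parsimonious internals.
```python
from functools import lru_cache
def make_parser(text):
    """Build a memoized recognizer for the toy grammar over text."""
    calls = {"term": 0}  # counts real (uncached) evaluations of term
    @lru_cache(maxsize=None)
    def term(pos):
        calls["term"] += 1
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        return end if end > pos else None  # None means "no match here"
    @lru_cache(maxsize=None)
    def expr(pos):
        # First alternative: term "+" expr
        after_term = term(pos)
        if after_term is not None and text.startswith("+", after_term):
            after_expr = expr(after_term + 1)
            if after_expr is not None:
                return after_expr
        # Second alternative asks for term(pos) again; the cache answers.
        return term(pos)
    return expr, calls
```
Each function returns the position just past the match, so callers can
continue from there or rewind to the position they started from.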

00:35:08.640 --> 00:35:09.100
I see.

00:35:09.100 --> 00:35:14.480
Basically to allow you to look ahead an arbitrary number of tokens and then adjust.

00:35:14.480 --> 00:35:16.480
Without paying a penalty of redoing work.

00:35:16.480 --> 00:35:16.800
Exactly.

00:35:16.800 --> 00:35:17.140
Right.

00:35:17.140 --> 00:35:17.540
Yeah.

00:35:17.540 --> 00:35:17.740
Yeah.

00:35:17.740 --> 00:35:18.140
Very cool.

00:35:18.260 --> 00:35:30.440
Well, one of the things I was mostly trying to do with Parsimonious, really the thing that differentiates it so much, was that its grammars look a lot like what you'd find if you looked up the definition of the grammar in a book or in the documentation.

00:35:30.440 --> 00:35:35.660
They're just big blobs of multi-line text in the Python quote, quote, quote, quote way.

00:35:35.660 --> 00:35:42.920
And so it's able to do forward references because it's not bound to the idea of undefined symbols in an outer programming language.

00:35:42.920 --> 00:35:54.400
We're able to do compile time optimizations on it really without limit because we haven't lost anything in its transformation down to a bunch of Python objects prematurely.

00:35:54.620 --> 00:35:56.340
And it's also very easy to read as a result.

00:35:56.340 --> 00:36:04.800
And you also keep the representation part as a separate phase so you can render to multiple formats, which is cool.

00:36:04.800 --> 00:36:05.600
Yes, exactly.

00:36:05.600 --> 00:36:12.740
A lot of these Python, not Python rather, but a lot of these parsing kits tend to intertwingle output rendering with parsing.

00:36:12.740 --> 00:36:17.240
And that way leads to pain in my experience.

00:36:17.240 --> 00:36:19.200
Certainly rigidity.

00:36:19.200 --> 00:36:19.960
Interesting.

00:36:20.680 --> 00:36:24.240
It sounds like your Parsimonious project is really cool.

00:36:24.240 --> 00:36:24.980
And it's on GitHub.

00:36:24.980 --> 00:36:26.120
People can check it out, right?

00:36:26.120 --> 00:36:26.920
Please.

00:36:26.920 --> 00:36:27.660
Send feedback.

00:36:27.660 --> 00:36:28.440
Play around with it.

00:36:28.440 --> 00:36:29.140
Send patches.

00:36:29.140 --> 00:36:29.980
Okay, cool.

00:36:29.980 --> 00:36:35.540
Yeah, one of the things that made me a little bit sad about pyparsing is it's on SourceForge, which I don't know.

00:36:35.540 --> 00:36:36.020
Wow, still?

00:36:36.020 --> 00:36:42.800
Yeah, when I see things on SourceForge, it kind of makes me feel like, oh, I'm not really sure that thing's actually still going.

00:36:42.800 --> 00:36:45.700
Yeah, right on the homepage it says, download now from SourceForge.

00:36:45.700 --> 00:36:46.380
It's like, oh.

00:36:46.380 --> 00:36:46.840
Oh.

00:36:46.840 --> 00:36:47.480
Oh.

00:36:47.480 --> 00:36:47.600
Oh.

00:36:47.600 --> 00:36:49.140
Okay, that's unfortunate.

00:36:49.760 --> 00:36:50.840
It's still fine code.

00:36:50.840 --> 00:36:51.120
I mean.

00:36:51.120 --> 00:36:52.060
Yeah, of course.

00:36:52.060 --> 00:36:54.080
It is a single 3,000 line file.

00:36:54.080 --> 00:36:58.480
But, you know, old doesn't mean bad.

00:36:58.480 --> 00:36:59.020
It means proven.

00:36:59.020 --> 00:37:00.000
Yeah, absolutely.

00:37:00.000 --> 00:37:00.580
Absolutely.

00:37:00.580 --> 00:37:04.100
So that was a really cool talk on parsing horrible things you gave.

00:37:04.100 --> 00:37:04.960
Do you have some other favorites?

00:37:04.960 --> 00:37:05.980
Oh, let's see.

00:37:05.980 --> 00:37:06.840
What else have I done?

00:37:06.840 --> 00:37:14.940
I have a talk called Poetic APIs, where I kind of will expand on the idea that, you know, these grammars, they should be really easy to read.

00:37:15.020 --> 00:37:17.360
In fact, all programs should be easy to read.

00:37:17.360 --> 00:37:26.520
In fact, here are seven, you know, kind of checklist-y things you can bang against what you're writing to make sure that you set a good language for the users of your API.

00:37:27.220 --> 00:37:27.580
Oh, interesting.

00:37:27.580 --> 00:37:28.320
Yeah.

00:37:28.320 --> 00:37:31.740
What we're doing when we're programming is always creating language.

00:37:31.740 --> 00:37:43.980
Every time we name a function, name a variable, create semantics of an object, we're kind of creating the mental model in which everybody who interacts with that code in the future has to play.

00:37:44.420 --> 00:37:50.180
So if we do that irresponsibly, we can really make people think terrible, stupid thoughts that make their jobs hard.

00:37:50.180 --> 00:38:00.980
But if we give them really good symbols that correspond well to reality and are easily composable and flexible, like the requests package is a great example.

00:38:00.980 --> 00:38:01.680
Like, hey, you know what?

00:38:01.700 --> 00:38:07.340
We could just totally represent HTTP requests instead of making everyone think about raw sockets all the time, like urllib.

00:38:07.340 --> 00:38:09.840
Then people can have their efforts magnified.

00:38:09.840 --> 00:38:17.560
Yeah, it really does define the way that you think about a problem, the APIs and whatnot you have to work with, and the language itself, right?

00:38:17.560 --> 00:38:22.220
And so I think Python itself is something, an example of, like, why that's important, right?

00:38:22.420 --> 00:38:29.380
Well, yeah, I mean, so Python is a fairly close match to what we tend to write as pseudocode.

00:38:29.380 --> 00:38:33.540
Right, and the reason we write pseudocode is it's easy to understand and communicate, so.

00:38:33.540 --> 00:38:34.320
Exactly.

00:38:34.320 --> 00:38:38.440
I was just reading the topological sort algorithms on Wikipedia, and you know what?

00:38:38.440 --> 00:38:42.920
If you put some colons in there and take out the eaches, it's about valid Python.

00:38:42.920 --> 00:38:43.540
That's awesome.

00:38:43.540 --> 00:38:48.380
Yeah, I've heard that before, that people have copied algorithms out of Wikipedia, more or less, just straight up.

00:38:48.380 --> 00:38:50.440
Turn that into Python, and it works beautifully.

00:38:50.440 --> 00:38:51.120
That's great.

00:38:51.220 --> 00:38:54.940
Which is also interesting verification of Wikipedia content.

00:38:54.940 --> 00:38:55.680
Yes.

00:38:55.680 --> 00:39:10.340
This portion of Talk Python To Me is brought to you by Hired.

00:39:10.340 --> 00:39:13.360
Hired is the platform for top Python developer jobs.

00:39:13.360 --> 00:39:18.160
Create your profile and instantly get access to 3,500 companies who will work to compete with you.

00:39:18.160 --> 00:39:21.040
Take it from one of Hired's users who recently got a job and said,

00:39:21.040 --> 00:39:26.340
I had my first offer on Thursday after going live on Monday, and I ended up getting eight offers in total.

00:39:26.340 --> 00:39:29.780
I've worked with recruiters in the past, but they've always been pretty hit and miss.

00:39:29.780 --> 00:39:32.620
I tried LinkedIn, but I found Hired to be the best.

00:39:32.620 --> 00:39:34.720
I really like knowing the salary up front.

00:39:34.720 --> 00:39:37.080
Privacy was also a huge seller for me.

00:39:37.080 --> 00:39:38.760
Sounds awesome, doesn't it?

00:39:38.760 --> 00:39:40.780
Well, wait until you hear about the signing bonus.

00:39:41.240 --> 00:39:44.200
Everyone who accepts a job from Hired gets a $1,000 signing bonus.

00:39:44.200 --> 00:39:46.860
And as Talk Python listeners, it gets way sweeter.

00:39:46.860 --> 00:39:52.100
Use the link Hired.com slash Talk Python To Me, and Hired will double the signing bonus to $2,000.

00:39:52.100 --> 00:39:53.880
Opportunity's knocking.

00:39:53.880 --> 00:39:57.660
Visit Hired.com slash Talk Python To Me and answer the door.

00:40:04.120 --> 00:40:07.280
There's another one that you talked about called the Code Review Review.

00:40:07.280 --> 00:40:08.440
What's the story of that?

00:40:08.440 --> 00:40:09.600
Yeah, this is a newer talk.

00:40:09.600 --> 00:40:16.580
So something we're trying to do at Mozilla is make sure we don't drive people away by being jerks,

00:40:16.580 --> 00:40:20.220
doing code review or otherwise having kind of an unwelcoming culture.

00:40:20.220 --> 00:40:25.420
Mozilla is historically and today largely driven by volunteer contributions.

00:40:25.420 --> 00:40:32.340
I mean, even the guy who owns our security sockets layer or something,

00:40:32.340 --> 00:40:36.380
or the NSS, whatever that is, like the module owner for this, the one who has the final say,

00:40:36.380 --> 00:40:37.860
he doesn't get paid by Mozilla.

00:40:37.860 --> 00:40:42.840
He does something else, and he just takes on this responsibility out of his own free time.

00:40:42.840 --> 00:40:46.280
And so it's really important for Mozilla to keep that rolling, you know,

00:40:46.280 --> 00:40:48.840
welcome people into the community, take in contributions,

00:40:48.840 --> 00:40:52.820
help people level up to become better programmers and more familiar with the project.

00:40:52.960 --> 00:40:56.940
And so the Code Review Review is a piece of our onboarding right now

00:40:56.940 --> 00:41:00.360
that I'm turning into a more generically applicable talk where we talk about,

00:41:00.360 --> 00:41:02.960
well, you know, how do we create that kind of welcoming atmosphere?

00:41:02.960 --> 00:41:07.520
How do we do a proper review so that good programs come out?

00:41:07.520 --> 00:41:11.280
And how do we do a review such that better programmers come out of it?

00:41:11.280 --> 00:41:15.320
Yeah, that's really important that people, when they come to these new projects

00:41:15.320 --> 00:41:18.900
or when they stick around, that they feel like it's a delightful experience

00:41:18.900 --> 00:41:22.780
because they're doing it of their own free time and energy, right?

00:41:22.780 --> 00:41:24.120
Yeah, for sure.

00:41:24.120 --> 00:41:27.780
You definitely don't want it to be like slogging through hard code.

00:41:27.780 --> 00:41:33.480
I mean, I think an example of the opposite comes to mind is the old Python packaging,

00:41:33.480 --> 00:41:35.780
PyPI web code.

00:41:35.780 --> 00:41:40.000
I talked to Donald Stufft on episode 64, and he was like,

00:41:40.000 --> 00:41:43.980
a lot of people want to come along and help maintain and evolve this,

00:41:44.180 --> 00:41:52.880
but it's like two files in this hugely complicated old custom web framework that they built for it.

00:41:52.880 --> 00:41:56.680
It's just people look at it, you know, actually, thanks, but no thanks.

00:41:56.680 --> 00:42:04.820
And finally, they're rewriting it at pypi.org, and it's in Pyramid and Bootstrap, and it's lovely.

00:42:05.180 --> 00:42:09.340
But for a long time, I think it turned people away from pushing that project forward,

00:42:09.340 --> 00:42:13.580
and you could tell that it kind of, it was just getting maintained, which is good,

00:42:13.580 --> 00:42:17.080
but it's also evidence of this anti-approachability, I guess.

00:42:17.080 --> 00:42:20.520
Yeah, it's kind of funny how all those old projects tend to be like two enormous files.

00:42:22.020 --> 00:42:24.500
All I can think of is folders must have been expensive 20 years ago.

00:42:24.500 --> 00:42:25.140
Yes, exactly.

00:42:25.140 --> 00:42:26.120
Really expensive.

00:42:26.120 --> 00:42:28.840
All right, so what else?

00:42:28.840 --> 00:42:31.480
We've got a few moments to talk about a few other things,

00:42:31.480 --> 00:42:34.020
and I know you've got a lot of interesting pieces out there.

00:42:34.020 --> 00:42:35.180
What else is going on?

00:42:35.180 --> 00:42:36.320
You're doing something with pip, right?

00:42:36.320 --> 00:42:37.640
Oh, pip, yes.

00:42:37.640 --> 00:42:40.040
So we deploy a lot of Python at Mozilla,

00:42:40.040 --> 00:42:44.180
and we used to check everything into a vendor library,

00:42:44.180 --> 00:42:47.640
you know, just to be sure no one had slipped anything under the radar,

00:42:47.640 --> 00:42:48.980
anything malicious like that.

00:42:48.980 --> 00:42:50.820
And vendor libraries are a pain to maintain.

00:42:50.980 --> 00:42:52.680
You know, you have to update the versions of things,

00:42:52.680 --> 00:42:55.100
and that creates enormous diffs in your version control,

00:42:55.100 --> 00:42:57.920
and then checkouts take forever, and your checkouts are huge.

00:42:57.920 --> 00:43:02.940
And so we, for a while, ran an internal PyPI mirror.

00:43:02.940 --> 00:43:04.540
A lot of people run their own little index server,

00:43:04.540 --> 00:43:08.140
and you kind of keep track of who's allowed to upload what to the server,

00:43:08.140 --> 00:43:09.460
and then who did it last,

00:43:09.460 --> 00:43:13.040
and what versions are going to work with your own projects,

00:43:13.040 --> 00:43:15.840
and you have to keep an access control list and an audit trail.

00:43:15.840 --> 00:43:18.560
And that was a pain, and it slowed things down.

00:43:18.560 --> 00:43:20.240
And then I thought, well, you know,

00:43:20.600 --> 00:43:22.260
we were actually having a Beer and Tell.

00:43:22.260 --> 00:43:25.480
We have these little sessions where we have a beverage of our choice

00:43:25.480 --> 00:43:28.680
and talk about something that we've been playing around with as a side project.

00:43:28.680 --> 00:43:31.080
And I needed something to talk about one day,

00:43:31.080 --> 00:43:32.940
and I thought, well,

00:43:32.940 --> 00:43:37.520
why not just hash the results of what you download from PyPI

00:43:37.520 --> 00:43:42.580
and make sure they match some local hash that you've pre-vetted,

00:43:43.120 --> 00:43:44.780
and then go ahead and install it?

00:43:44.780 --> 00:43:50.560
And so I put this thing together as this little tool called PEEP for prudently examine every package.

00:43:50.560 --> 00:43:56.860
And we ended up moving the whole production lifecycle over to that for a number of years, year or two.

00:43:57.600 --> 00:43:59.520
And then I thought, well, okay, this is proven out.

00:43:59.520 --> 00:44:04.500
And PEEP called deep into pip's internal APIs, and so it would break all the time.

00:44:04.500 --> 00:44:06.320
And it was a pain to maintain.

00:44:06.320 --> 00:44:06.980
I thought, you know what?

00:44:06.980 --> 00:44:09.700
I'm going to lift this up into pip, see if I can get people interested.

00:44:10.480 --> 00:44:12.360
And long story short, I did.

00:44:12.360 --> 00:44:15.380
People were interested, and it's in pip 8 and above.

00:44:15.380 --> 00:44:21.700
So if you're out there deploying Python and running your own index server

00:44:21.700 --> 00:44:24.400
or keeping up to date with a vendor library,

00:44:24.400 --> 00:44:29.180
hey, consider just putting a bunch of SHA-256 hashes into your requirements file

00:44:29.180 --> 00:44:32.880
with a funny little syntax and running pip 8 over it.

00:44:32.880 --> 00:44:34.400
And it'll vet these things for you.

00:44:34.400 --> 00:44:35.480
I mean, it won't vet them for you.

00:44:35.480 --> 00:44:38.140
You have to make sure that there's nothing malicious in a given version of a package,

00:44:38.140 --> 00:44:42.100
but it'll make sure that what you got that first time is the same thing you got.

00:44:42.100 --> 00:44:44.660
Yeah, once vetted, it'll verify it can't change.

00:44:44.660 --> 00:44:45.660
So what do you do?

00:44:45.660 --> 00:44:48.460
You can't just say, I depend upon SQLAlchemy.

00:44:48.460 --> 00:44:54.360
You've got to say, I depend upon SQLAlchemy 1.0.whatever, and here's its SHA.

00:44:54.360 --> 00:44:57.300
Dash, dash, hash equals whatever it is.

00:44:57.300 --> 00:44:59.460
Right, because obviously if the version changes,

00:44:59.460 --> 00:45:01.640
you'd imagine the code would change and the package would change.

00:45:01.640 --> 00:45:01.840
Absolutely.

00:45:01.840 --> 00:45:02.840
I see.

00:45:02.840 --> 00:45:06.860
And I try to keep the handholding in there

00:45:06.860 --> 00:45:09.740
so that if you forget to pin the version but provide a hash,

00:45:09.740 --> 00:45:10.500
it'll say, you know what?

00:45:10.500 --> 00:45:14.560
You should really pin the version because you're going to have an unpleasant surprise down the line.

00:45:14.560 --> 00:45:16.960
This is a really stable product.

00:45:16.960 --> 00:45:19.420
You're going to find out that this is not the same.
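
NOTE
Editor's sketch, not pip's actual implementation: a minimal illustration of the hash-checking idea described above, assuming requirement lines of the form "name==version --hash=sha256:HEX". The package name examplepkg, the stand-in bytes, and the helper names are hypothetical.
```python
import hashlib
# Hex SHA-256 digest of a downloaded artifact's bytes.
def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()
# Split "name==version --hash=sha256:HEX ..." into (spec, set of allowed digests).
def parse_requirement(line: str):
    parts = line.split()
    prefix = "--hash=sha256:"
    digests = {p[len(prefix):] for p in parts[1:] if p.startswith(prefix)}
    return parts[0], digests
# Accept the artifact only if the version is pinned and its digest was pre-vetted.
def verify(line: str, artifact: bytes) -> bool:
    spec, digests = parse_requirement(line)
    if "==" not in spec:
        # Mimics the "you should really pin the version" complaint.
        raise ValueError("pin the version with == when supplying hashes")
    return sha256_hex(artifact) in digests
# Stand-in bytes play the role of the archive that was vetted once.
vetted = b"pretend these bytes are the vetted archive"
requirement = "examplepkg==1.0.0 --hash=sha256:" + sha256_hex(vetted)
print(verify(requirement, vetted))                 # True
print(verify(requirement, vetted + b"tampered"))   # False
```
The check only proves that later downloads are byte-for-byte what was vetted the first time; deciding that a given version is safe remains a human job.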

00:45:19.420 --> 00:45:20.120
Awesome.

00:45:20.120 --> 00:45:21.720
And you mentioned Turtles before.

00:45:21.720 --> 00:45:22.280
What's Turtles?

00:45:22.280 --> 00:45:24.500
Turtles is a real mad scientist project.

00:45:25.960 --> 00:45:30.900
So I used to do a lot of Zope and Plone consulting back in the day,

00:45:30.900 --> 00:45:36.000
and I watched a lot of really smart people from all different walks of life.

00:45:36.000 --> 00:45:40.200
You know, it was in higher ed, so there were professors, and there were very good writers,

00:45:40.200 --> 00:45:42.200
and there were sys-ops, and there were chemists,

00:45:42.200 --> 00:45:44.780
and they'd come in, and we'd have to teach them,

00:45:44.880 --> 00:45:46.380
okay, how do I build a website,

00:45:46.380 --> 00:45:50.300
or how do I use this content management system that has Python underlying it?

00:45:50.300 --> 00:45:56.280
And they'd have to learn HTML, and CSS, and JavaScript, and Python,

00:45:56.280 --> 00:45:58.960
and then ZCML as a control language,

00:45:58.960 --> 00:46:02.440
and DTML for dynamic CSS back in the day,

00:46:02.440 --> 00:46:03.700
and all these crazy different languages.

00:46:03.700 --> 00:46:04.880
And it was ridiculous,

00:46:04.880 --> 00:46:08.000
and they would get discouraged and go away, a lot of them,

00:46:08.000 --> 00:46:11.020
or at least not work to the potential that I thought we could provide them.

00:46:11.020 --> 00:46:14.460
And so Turtles kind of had its genesis there.

00:46:14.460 --> 00:46:19.340
I thought, you know, if only we had a single language that we could use end-to-end,

00:46:19.340 --> 00:46:24.540
and we didn't need to constantly reinvent, say, for loops,

00:46:24.540 --> 00:46:28.000
or variables that allow us to not repeat ourselves,

00:46:28.000 --> 00:46:29.600
and have a different way to do variables in Python,

00:46:29.600 --> 00:46:32.240
and a different way in CSS, and another way in JavaScript,

00:46:32.240 --> 00:46:35.940
wouldn't things be easier to learn, and easier to remember,

00:46:35.940 --> 00:46:37.580
and heck, easier to read?

00:46:37.580 --> 00:46:40.720
And wouldn't we have all of these synergies?

00:46:40.720 --> 00:46:48.200
Turtles is one of those.

00:46:48.200 --> 00:46:53.060
Turtles is a single-language system, or will be, I hope, for web development.

00:46:53.060 --> 00:46:58.320
Right now, it is just a context-sensitive parser that does run,

00:46:58.320 --> 00:47:01.720
but everything else is still up in the air.

00:47:01.720 --> 00:47:02.200
Interesting.

00:47:02.200 --> 00:47:06.000
So basically, instead of teaching people these three or four languages,

00:47:06.000 --> 00:47:09.520
when they come, plus the frameworks, you're like, look, learn this one thing,

00:47:09.520 --> 00:47:12.300
and you'll have a website, a full-on website.

00:47:12.300 --> 00:47:16.780
That is the idea, to teach them, you know, an hour or two's worth of stuff,

00:47:16.780 --> 00:47:21.300
and then have them be able to make real progress without an internet connection.

00:47:21.300 --> 00:47:25.280
You know, make this thing explorable, like the old Smalltalk environments used to be,

00:47:25.280 --> 00:47:30.940
where you can drill into an example and take it apart and rip out pieces that you want to use,

00:47:30.940 --> 00:47:34.600
where you can make changes to a live system and see what happens.

00:47:34.600 --> 00:47:39.380
I mean, that's really how I learned programming, by mimicry, which we don't really do anymore.

00:47:39.380 --> 00:47:42.460
You know, I had to type this stuff out of magazines, make mistakes,

00:47:42.460 --> 00:47:45.840
see what the effects of the mistakes were, fix the mistakes,

00:47:45.840 --> 00:47:50.040
and then screw around and make my own new mistakes and see what effects they had.

00:47:50.420 --> 00:47:53.600
I think that's how we learn human language, and I think it's a powerful way to learn programming.

00:47:53.600 --> 00:47:57.640
It is a powerful way, but we've definitely moved beyond that in lots of areas.

00:47:57.640 --> 00:48:04.340
I mean, I try to, I think of getting started with, like, Node.js and, like, the crazy packaging and requirements and all that,

00:48:04.340 --> 00:48:07.300
and it's like, I thought that was a simple thing to get started with, you know?

00:48:07.300 --> 00:48:11.380
But I feel like Python's done a better job of keeping it simple.

00:48:11.380 --> 00:48:14.460
But still, how do I know where a package comes from?

00:48:14.460 --> 00:48:17.540
If I type import requests, why doesn't that run, right?

00:48:17.580 --> 00:48:22.220
Like, when you're new, all these challenges that we are just like, yeah, whatever, just pip install or whatever,

00:48:22.220 --> 00:48:23.760
I just, you forgot that step.

00:48:23.760 --> 00:48:25.740
These are all levels of friction.

00:48:25.740 --> 00:48:30.200
Yeah, that's one of the things that I'm trying to solve in Turtles by making the answer always the same.

00:48:30.200 --> 00:48:32.220
You ask, where does this come from?

00:48:32.220 --> 00:48:33.700
Or where do I put this?

00:48:33.700 --> 00:48:37.100
And the answer is always in Turtles, it goes on a page.

00:48:37.100 --> 00:48:39.400
Your config goes on a page.

00:48:39.400 --> 00:48:41.280
Your program goes on a page.

00:48:41.280 --> 00:48:43.400
Everything goes on a page.

00:48:43.960 --> 00:48:48.320
The thing that we've lost in the web, because the web is a state-free kind of environment,

00:48:48.320 --> 00:48:51.900
is that the default behavior is for everything to go poof.

00:48:51.900 --> 00:48:55.280
You put something in a form, you leave the page, poof, it's gone.

00:48:55.280 --> 00:48:59.900
Unless you're, you know, using a nice browser like Firefox where you can say, hey, reopen that tab,

00:48:59.900 --> 00:49:01.120
and then your content comes back.

00:49:01.120 --> 00:49:03.120
But as a developer, you have the same problem, right?

00:49:03.160 --> 00:49:07.780
You end up writing these template languages, and you end up having to take state and shuttle it aside

00:49:07.780 --> 00:49:12.100
into some other process, into a relational database or a document store or something,

00:49:12.100 --> 00:49:17.240
and then reconstitute, rehydrate these pages all the time out of these databases,

00:49:17.240 --> 00:49:22.200
where the representation may not look anything like the structure you're actually trying to pull out.

00:49:23.200 --> 00:49:28.480
And so the idea with Turtles is put everything on pages and have a single representation for everything,

00:49:28.480 --> 00:49:30.840
which I think is going to be trees.

00:49:30.840 --> 00:49:38.400
Because as we said with the parsing stuff, trees are a very good choice for universal representation of things.

00:49:38.400 --> 00:49:39.440
You can do a lot with trees.

00:49:39.440 --> 00:49:41.160
It definitely sounds like an interesting project.

00:49:41.160 --> 00:49:44.560
So there's nothing quite yet that we can go play with, but you're working on it, huh?

00:49:44.560 --> 00:49:48.020
Nothing but 100K of design notes and parser up on GitHub.

00:49:48.020 --> 00:49:49.780
But yeah, not quite ready yet.

00:49:49.780 --> 00:49:50.360
All right, cool.

00:49:50.360 --> 00:49:51.420
I'll keep us posted on that.

00:49:51.420 --> 00:49:51.960
That's great.

00:49:52.580 --> 00:49:57.280
All right, so another thing that you said you're into these days is GTD or getting things done, right?

00:49:57.280 --> 00:49:58.220
Oh, my gosh.

00:49:58.220 --> 00:49:58.940
Changed my life.

00:49:58.940 --> 00:50:06.700
So as a repeat offender of Python library creation, I have the open source guilt.

00:50:06.700 --> 00:50:11.860
I put this thing out there, and people are like, oh, well, it's broken this way, and it's broken this way,

00:50:11.860 --> 00:50:14.400
and I wish it did this, and here's a patch, and why don't you review my patch,

00:50:14.400 --> 00:50:17.220
and shouldn't you be doing this instead of watching Netflix and spending time with your family?

00:50:17.220 --> 00:50:18.440
I'm like, well, I guess.

00:50:18.440 --> 00:50:22.200
And the guilt builds and builds and builds, and you're just kind of,

00:50:22.200 --> 00:50:25.740
you know, you're just kind of shaking all the time, and you can't relax.

00:50:25.740 --> 00:50:31.440
So GTD, I got this book at work seven years ago.

00:50:31.440 --> 00:50:34.300
Somebody gave me a copy and it sat on the shelf for seven years, and I picked it up,

00:50:34.300 --> 00:50:37.800
and it has all these little helpful practices for getting rid of that guilt

00:50:37.800 --> 00:50:42.320
and making sure you're always working on the most important thing at any one given time.

00:50:42.780 --> 00:50:43.440
And you know what?

00:50:43.440 --> 00:50:44.500
It's really changed my life.

00:50:44.500 --> 00:50:46.780
I no longer have the guilt to such a degree.

00:50:46.780 --> 00:50:48.120
Hardly at all, really.

00:50:48.120 --> 00:50:50.260
My response time has gone way down.

00:50:50.260 --> 00:50:51.960
My email box has been at zero.

00:50:51.960 --> 00:50:55.800
I mean, my work one and my home one were 5,000 before.

00:50:55.800 --> 00:50:57.180
Now they've been zero for months.

00:50:57.180 --> 00:50:58.700
And it's been easy.

00:50:58.700 --> 00:51:00.380
It's a crazy thing.

00:51:00.500 --> 00:51:01.220
That sounds delightful.

00:51:01.220 --> 00:51:02.100
Yeah.

00:51:02.100 --> 00:51:02.400
Yeah.

00:51:02.400 --> 00:51:06.400
I've read that book, and I've lived the GTD lifestyle.

00:51:06.400 --> 00:51:13.380
And I found it didn't quite work for me, but I gained a huge value from doing it.

00:51:13.380 --> 00:51:20.000
And so if you can find a few techniques to help tame the world, whether it's so you have

00:51:20.000 --> 00:51:23.920
better response time on your open source project, or you're not stressed all the time, or you

00:51:23.920 --> 00:51:31.320
can go home and see the family and not have the weight of 5,000 unread emails on you, these

00:51:31.320 --> 00:51:31.860
are all good.

00:51:31.860 --> 00:51:38.940
The biggest thing that helps me these days is Google Inbox with its ability to snooze items

00:51:38.940 --> 00:51:42.580
for two weeks until that item comes back right when I'm supposed to deal with it and things

00:51:42.580 --> 00:51:45.980
like that, that's actually been sort of where I've evolved to.

00:51:45.980 --> 00:51:47.320
But GTD is great.

00:51:47.320 --> 00:51:47.860
Yeah.

00:51:47.860 --> 00:51:48.740
For real.

00:51:48.740 --> 00:51:53.180
And nobody, you know, you say you don't use GTD per se because it didn't work out for you.

00:51:53.180 --> 00:51:55.160
Well, nobody uses vanilla GTD.

00:51:55.160 --> 00:51:57.080
It's, you know, it's made to be customized.

00:51:57.080 --> 00:51:58.540
It's a grab bag of tricks.

00:51:58.540 --> 00:52:00.880
And some of them are, you know, more important than others.

00:52:00.880 --> 00:52:03.380
Some of them are, you know, kind of vital, and some of them are optional.

00:52:03.380 --> 00:52:03.980
Yeah, absolutely.

00:52:03.980 --> 00:52:07.260
And you ran into something with that inbox that really rings true to me.

00:52:07.260 --> 00:52:12.560
I had been trying for years to use my email inbox as a sort of quasi

00:52:12.560 --> 00:52:13.240
to-do list.

00:52:13.240 --> 00:52:17.280
I'll leave this in my box because I'm going to need it in three days.

00:52:17.280 --> 00:52:20.580
And so for some reason I thought it's expensive to do a find.

00:52:20.580 --> 00:52:23.500
I don't know why I thought that, but that's what I thought apparently.

00:52:23.500 --> 00:52:27.080
Or, you know, oh, I need to respond to this mail, so it's going to stay in there.

00:52:27.240 --> 00:52:38.700
And it ended up just being this mixture of chronologically sorted things, some of which were reference material, some of which were things to do, some of which I couldn't do for a certain period, like you said, and needed to be snoozed for two weeks.

00:52:38.700 --> 00:52:46.180
And yet the only way it would present them to me is as this linear kind of last-touched first list of things.

00:52:46.560 --> 00:52:47.920
It's not a useful presentation.

00:52:47.920 --> 00:52:54.240
And I have been unable to make any email client really bend to my will as a to-do list.

00:52:54.240 --> 00:53:05.380
And so the reason my boxes are empty these days is any time there's an actionable email item that takes me more than two minutes to do, because otherwise I would do it immediately as part of GTD, it goes into my to-do system.

00:53:05.380 --> 00:53:09.020
And it pops up, like you say, when you're able to take action on it.

00:53:09.020 --> 00:53:11.720
It's really a wonderful thing.

00:53:11.820 --> 00:53:12.360
I do agree with that.

00:53:12.360 --> 00:53:16.260
So I would be remiss not asking which to-do system you use for this.

00:53:16.260 --> 00:53:17.120
So I went shopping.

00:53:17.120 --> 00:53:19.640
I wanted something that – so I'm kind of weird.

00:53:19.640 --> 00:53:28.060
I wanted something that was expressly not cross-platform, because I think that the platform-specific stuff usually ends up with a better UI, and I'm kind of a UI enthusiast.

00:53:28.060 --> 00:53:29.960
So I looked at OmniFocus.

00:53:29.960 --> 00:53:31.640
I love all of Omni's stuff.

00:53:31.640 --> 00:53:33.240
I love OmniGraffle and OmniOutliner.

00:53:33.240 --> 00:53:34.560
I read all my talks on OmniOutliner.

00:53:34.560 --> 00:53:36.760
And I really wanted to like OmniFocus.

00:53:36.760 --> 00:53:41.640
But there were a couple things that it was just grating against me, and it wouldn't do what I wanted.

00:53:41.940 --> 00:53:44.400
I couldn't reorder tasks except within projects.

00:53:44.400 --> 00:53:45.760
And I kind of like to plan my day out.

00:53:45.760 --> 00:53:51.460
And so I ended up with something called Things by a little German company called Cultured Code.

00:53:51.460 --> 00:53:54.880
And it's a Mac and iPhone-only gadget.

00:53:54.880 --> 00:53:56.740
It uses their own little cloud sync service.

00:53:56.740 --> 00:54:00.440
And the UI has been thought out in great detail.

00:54:00.440 --> 00:54:03.440
Development goes glacially slowly.

00:54:03.440 --> 00:54:04.440
That's the downside.

00:54:05.100 --> 00:54:09.840
These people seem very, very intent on getting things exactly right and not releasing until then.

00:54:09.840 --> 00:54:10.900
So that's the downside.

00:54:10.900 --> 00:54:11.900
But Things is great.

00:54:11.900 --> 00:54:15.320
Things lets you schedule things out ahead of time and not bother you until then.

00:54:15.320 --> 00:54:19.160
It lets you express due dates like, well, this drop dead has to happen now.

00:54:19.160 --> 00:54:22.100
And it does all the GTD-style contexts.

00:54:22.100 --> 00:54:26.440
Let me know about this when I'm at home or at work or in the car or what have you.

00:54:26.960 --> 00:54:31.300
It's nothing particularly, you know, whizzy from a technical point of view.

00:54:31.300 --> 00:54:40.480
But it has those kind of three core features of contexts and due dates and hide until that any large-scale to-do system needs.

00:54:40.480 --> 00:54:41.060
Oh, yeah.

00:54:41.060 --> 00:54:42.060
It looks really, really cool.

00:54:42.060 --> 00:54:42.440
All right.

00:54:42.440 --> 00:54:44.960
Well, thanks for sharing your research results there.

00:54:44.960 --> 00:54:45.320
That's cool.

00:54:45.320 --> 00:54:46.280
Great.

00:54:46.280 --> 00:54:46.540
Okay.

00:54:46.600 --> 00:54:49.120
So it looks like we're just about out of time.

00:54:49.120 --> 00:54:50.820
We've covered all the horrible things.

00:54:50.820 --> 00:54:52.100
Now let's talk about some cool things.

00:54:52.100 --> 00:54:54.920
How about your favorite PyPI package?

00:54:54.920 --> 00:54:57.820
Well, I mean, there's a lot of good ones out there.

00:54:57.820 --> 00:54:58.760
I enjoy Flask.

00:54:58.760 --> 00:55:00.620
I've used that to power DXR.

00:55:00.620 --> 00:55:05.000
It's a nice little lightweight web framework by Armin Ronacher.

00:55:05.000 --> 00:55:05.460
Yeah.

00:55:05.460 --> 00:55:06.280
I can't remember exactly who it is.

00:55:06.280 --> 00:55:06.740
Yeah.

00:55:06.740 --> 00:55:08.100
I had him on one of the early shows.

00:55:08.100 --> 00:55:08.280
Yeah.

00:55:08.280 --> 00:55:09.620
All of Armin's stuff is fantastic.

00:55:09.620 --> 00:55:10.260
It is.

00:55:10.260 --> 00:55:12.680
Likewise, Click is a fantastic one by Armin.

00:55:12.680 --> 00:55:16.480
It's, I think he kind of, did he pull it out of Flask?

00:55:16.640 --> 00:55:18.480
No, he's pulling it into Flask, in fact.

00:55:18.480 --> 00:55:22.600
Click is a kit for making command line tools.

00:55:22.600 --> 00:55:27.700
And I pulled it into DXR as well because it makes it so easy to do nested subcommands.

00:55:27.700 --> 00:55:31.900
Like if you use Homebrew or something, brew install, brew this, brew that.

00:55:31.900 --> 00:55:34.520
It helps you brew up commands like that.

00:55:34.520 --> 00:55:39.340
And like all of Armin's stuff, it's sort of a decorator soup that I like very much.

00:55:39.340 --> 00:55:39.900
Absolutely.

00:55:39.900 --> 00:55:44.160
And then one of my own things that I always pull off the shelf when I start a new project

00:55:44.160 --> 00:55:45.620
is more-itertools.

00:55:46.340 --> 00:55:49.020
And it is, as the name says, more itertools.

00:55:49.020 --> 00:55:50.980
The ones that got left behind.

00:55:50.980 --> 00:55:54.480
Collations and such that come in handy in almost every project.

00:55:54.480 --> 00:55:54.840
Yeah.

00:55:54.840 --> 00:55:55.100
Okay.

00:55:55.100 --> 00:55:55.580
Excellent.

00:55:55.580 --> 00:55:57.160
I'll be sure to link to all of those.

00:55:57.160 --> 00:55:57.660
That's great.

00:55:57.660 --> 00:55:59.240
And how about the editor?

00:55:59.240 --> 00:56:00.540
Favorite editor?

00:56:00.540 --> 00:56:01.480
Are you going to write some Python code?

00:56:01.480 --> 00:56:01.920
What do you pull up?

00:56:01.920 --> 00:56:04.900
I pull up a little Mac app called BBEdit.

00:56:04.900 --> 00:56:07.820
It's been around for, must be 40 years or something.

00:56:07.820 --> 00:56:08.360
Yeah.

00:56:08.360 --> 00:56:09.720
They're on BBEdit 11.

00:56:09.720 --> 00:56:12.080
I like their little subtitle.

00:56:12.080 --> 00:56:12.660
I don't know.

00:56:12.660 --> 00:56:14.080
It doesn't suck.

00:56:14.080 --> 00:56:14.780
All rights reserved.

00:56:14.780 --> 00:56:15.180
Yeah.

00:56:15.180 --> 00:56:15.960
Bare Bones Software.

00:56:15.960 --> 00:56:16.520
It doesn't suck.

00:56:16.520 --> 00:56:18.860
And they typically make software that looks like nothing.

00:56:18.860 --> 00:56:22.920
You pull it up and their mailer looks like a big empty window of white.

00:56:22.920 --> 00:56:25.520
And their text editor looks like a big empty window full of white.

00:56:25.520 --> 00:56:26.380
Not a lot of toolbars.

00:56:26.380 --> 00:56:27.260
Not a lot of fluff.

00:56:27.260 --> 00:56:30.640
But under the hood, they do a really nice job.

00:56:31.280 --> 00:56:37.060
The text editor uses what must be ropes because I can edit very large, you know, multi-tens

00:56:37.060 --> 00:56:39.300
of megabyte files without a lot of lag.

00:56:39.300 --> 00:56:41.320
It's incredibly stable.

00:56:41.320 --> 00:56:42.960
It just doesn't lose data.

00:56:42.960 --> 00:56:50.020
On the off chance that it maybe crashes every couple of weeks, its magical little implicit

00:56:50.020 --> 00:56:53.220
auto-save thing will bring up your windows in exactly the state that they were.

00:56:53.220 --> 00:56:57.200
Even your untitled documents, it won't save and plow over your stuff.

00:56:57.200 --> 00:57:00.960
It'll save it in its own little buffer and just make sure nothing is unduly lost.
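
NOTE
Editor's sketch of the rope idea mentioned for the text editor: text is kept as a tree of small strings, so an insert rebuilds only the nodes along one path instead of copying the whole buffer. This is a toy, unbalanced version under that assumption, not a description of BBEdit's internals; all names here are hypothetical.
```python
from dataclasses import dataclass
from typing import Union
# A leaf holds a short run of text.
@dataclass(frozen=True)
class Leaf:
    text: str
    @property
    def length(self) -> int:
        return len(self.text)
# An inner node joins two ropes and caches their combined length.
@dataclass(frozen=True)
class Concat:
    left: "Rope"
    right: "Rope"
    length: int
Rope = Union[Leaf, Concat]
def concat(left: Rope, right: Rope) -> Rope:
    return Concat(left, right, left.length + right.length)
# Return (rope[:index], rope[index:]); untouched subtrees are shared, not copied.
def split(rope: Rope, index: int):
    if isinstance(rope, Leaf):
        return Leaf(rope.text[:index]), Leaf(rope.text[index:])
    if index <= rope.left.length:
        head, tail = split(rope.left, index)
        return head, concat(tail, rope.right)
    head, tail = split(rope.right, index - rope.left.length)
    return concat(rope.left, head), tail
# Insert by splitting at the position and joining three pieces.
def insert(rope: Rope, index: int, text: str) -> Rope:
    head, tail = split(rope, index)
    return concat(concat(head, Leaf(text)), tail)
def to_str(rope: Rope) -> str:
    if isinstance(rope, Leaf):
        return rope.text
    return to_str(rope.left) + to_str(rope.right)
doc = concat(Leaf("Hello, "), Leaf("world"))
doc = insert(doc, 7, "big ")
print(to_str(doc))  # Hello, big world
```
A production rope would also rebalance the tree and cap leaf sizes; both are omitted here to keep the sketch short.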

00:57:00.960 --> 00:57:01.360
Yeah.

00:57:01.360 --> 00:57:02.020
That sounds really cool.

00:57:02.020 --> 00:57:02.720
All right.

00:57:02.720 --> 00:57:02.960
Awesome.

00:57:02.960 --> 00:57:04.440
So check that out, people.

00:57:04.440 --> 00:57:04.980
That's cool.

00:57:04.980 --> 00:57:06.620
And final call to action.

00:57:06.780 --> 00:57:12.140
Well, if you're parsing in Python, give Parsimonious a try and send feedback and complaints and

00:57:12.140 --> 00:57:12.900
patches my way.

00:57:12.900 --> 00:57:13.420
Yeah, definitely.

00:57:13.420 --> 00:57:14.420
It's on GitHub.

00:57:14.420 --> 00:57:17.240
I'll put the links to it in the show notes so people can find it there.

00:57:17.240 --> 00:57:17.800
Thank you.

00:57:17.800 --> 00:57:19.800
And second, be safe out there.

00:57:19.800 --> 00:57:22.540
If you're running a web server, get a certificate.

00:57:22.540 --> 00:57:24.040
Use Let's Encrypt or something else.

00:57:24.040 --> 00:57:26.780
And if you're deploying Python, again, be safe.

00:57:26.780 --> 00:57:28.100
Use pip hashing.

00:57:28.100 --> 00:57:33.860
I really appreciate this look inside of the whole parsing world and how we can move beyond

00:57:33.860 --> 00:57:36.340
regular expressions to do something way cooler.

00:57:36.580 --> 00:57:38.960
So thanks for your project and your talk and being on the show.

00:57:38.960 --> 00:57:40.320
It was a pleasure to speak with you, Michael.

00:57:40.320 --> 00:57:40.600
Yeah.

00:57:40.600 --> 00:57:41.040
Bye, Eric.

00:57:41.040 --> 00:57:45.180
This has been another episode of Talk Python To Me.

00:57:45.180 --> 00:57:51.220
Today's guest has been Eric Rose, and this episode has been sponsored by GoCD and Hired.

00:57:51.220 --> 00:57:53.500
Thank you both for supporting the show.

00:57:53.500 --> 00:57:58.160
GoCD is the on-premise, open-source, continuous delivery server.

00:57:58.160 --> 00:58:02.300
Want to improve your deployment workflow but keep your code and builds in-house?

00:58:02.740 --> 00:58:08.680
Check out GoCD at talkpython.fm/gocd and take control over your process.

00:58:08.680 --> 00:58:11.780
Hired wants to help you find your next big thing.

00:58:11.780 --> 00:58:16.940
Visit hired.com slash talkpythontome to get five or more offers with salary and equity presented

00:58:16.940 --> 00:58:20.280
right up front and a special listener signing bonus of $2,000.

00:58:20.280 --> 00:58:23.380
Are you or a colleague trying to learn Python?

00:58:23.380 --> 00:58:28.040
Have you tried books and videos that just left you bored by covering topics point by point?

00:58:28.040 --> 00:58:34.040
Well, check out my online course, Python Jumpstart, by building 10 apps at talkpython.fm/course

00:58:34.040 --> 00:58:36.660
to experience a more engaging way to learn Python.

00:58:37.200 --> 00:58:43.980
And if you're looking for something a little more advanced, try my Write Pythonic Code course at talkpython.fm/pythonic.

00:58:43.980 --> 00:58:46.280
Be sure to subscribe to the show.

00:58:46.280 --> 00:58:48.480
Open your favorite podcatcher and search for Python.

00:58:48.480 --> 00:58:49.720
We should be right at the top.

00:58:49.720 --> 00:58:59.040
You can also find the iTunes feed at /itunes, Google Play feed at /play, and direct RSS feed at /rss on talkpython.fm.

00:58:59.360 --> 00:59:04.140
Our theme music is Developers, Developers, Developers by Corey Smith, who goes by Smix.

00:59:04.140 --> 00:59:10.820
Corey just recently started selling his tracks on iTunes, so I recommend you check it out at talkpython.fm/music.

00:59:10.820 --> 00:59:16.180
You can browse his tracks he has for sale on iTunes and listen to the full-length version of the theme song.

00:59:16.180 --> 00:59:18.260
This is your host, Michael Kennedy.

00:59:18.260 --> 00:59:19.540
Thanks so much for listening.

00:59:19.540 --> 00:59:20.720
I really appreciate it.

00:59:20.720 --> 00:59:22.860
Smix, let's get out of here.


