#299: Personal search engine with Datasette and Dogsheep Transcript
00:00 In this episode, we'll be discussing two powerful tools for data reporting and exploration: Datasette and Dogsheep. Datasette helps people take data of any size or shape, analyze and explore it, and publish it as an interactive website and accompanying API. Dogsheep is a collection of tools for personal analytics using SQLite and Datasette. Imagine a unified search engine for everything personal in your life, such as Twitter, photos, Google Docs, todos, Goodreads, and more, all in one place and outside of the cloud companies. On this episode, we talk with Simon Willison, who created both of these projects. He's also one of the co-creators of Django, and we'll discuss some of the early Django history. This is Talk Python To Me, Episode 299, recorded November 18, 2020.
00:57 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy, and keep up with the show and listen to past episodes at talkpython.fm. Follow the show on Twitter via @talkpython. This episode is brought to you by Linode and Talk Python Training. Please check out the offers during their segments. It really helps support the show. Simon, welcome to Talk Python To Me. Hi, great to be here. Hey, it's great to have you here. We're going to talk about some really interesting projects that you've been working on: one that is extremely broad and far-reaching, that anybody who touches Python, or maybe even hasn't, knows about, and then another which I feel is such a personal project. It's for everybody, but it reveals so much data about you. Absolutely, yeah. We're going to talk about Django a little bit, which you were part of creating. We're talking about Datasette, which is a way to basically put a UI on top of any database in a friendly way that lets you explore it and treat it as an API. And then Dogsheep, which has a funny name, and a nice story around the name, that allows you to basically take all of the data about you and turn it into like your own Google, right? It's your personal data warehouse, I've started calling it. Yeah, I'm fascinated by where that could go. Now, before we get to all three of those cool things, let's just start with your story. How'd you get into programming, into Python? I learned to program with my dad on a Commodore 64 back in the 80s, and then kind of moved on to early Windows and DOS, where you didn't really get to program anything beyond some QBasic.
But I got really into programming during the first dot-com boom, like web 1.0, 1999, 2000, when I was working in London for an online gaming company called gameplay.com, which was selling boxed games by mail order but also running online gaming servers for Team Fortress Classic and Half-Life and Quake and all of those kinds of things. Oh, yeah, Team Fortress and Counter-Strike. I love that stuff. That was so fun. Oh, absolutely. And so I was working for that online gaming division for a year and a half before the dot-com crash, when everyone got laid off en masse. During that time, I started at the company as the downloads editor, so I was responsible for the section of the website where you download plugins and patches and mods and so forth. And I essentially taught myself web development as part of that, and as part of some other gaming-related side projects I had. So yeah, I'm a veteran of the first round of dot-coms, back when no one had any idea what we were doing. Yeah, everyone was just making it up on the spot, right? Oh, totally. So when you were working on that web development stuff, it probably wasn't Python, I'm guessing, and certainly wasn't Django, because Django was what, 2003, 2004 when you built it? I think it was open sourced in 2005. Yeah, right. Right. Okay, so what were you programming in? Oh, so gameplay.com was running on a vast, expensive content management system called Mediasurface, which was a combination of Perl for templates, and Java and Oracle under the hood, and it was insanely expensive and very, very tricky to get things done with. And then I had side projects, which were classic PHP and MySQL, right? So I was very much the classic PHP programmer. And actually, this is where Django came from: Adrian Holovaty and I were working together at this local newspaper in Lawrence, Kansas, and we were both PHP developers.
And we saw the siren call of Python and wanted to figure out how we could build websites using this, we felt, much more exciting programming language. Oh, that's awesome. It's definitely more exciting. But I do want to ask you, it's such a different world, right? We can go to the cloud providers and pay $5 a month and probably get better infrastructure than you guys had. I mean, who knows how much that cost? Enormously more. I mean, back then with content management systems, you could spend a million dollars on your content management system, and they were terrible, because nobody knew what a good content management system looked like. This was back before things like WordPress as well. Your option was to spend
05:00 huge amounts of money on these, like, giant enterprise systems and then cross your fingers and hope that they would work for you. Yeah. So how did it feel when you'd go to work in this kind of, like, ultra-expensive, clunky environment and then go home and, even though it was PHP still, do PHP and MySQL, and kind of, I paid nothing for this, and maybe it felt as good as that crazy online system you guys were working with? What was that back and forth like? I don't think it did feel nearly as good, to be honest, because open source was still just starting up. So if you wanted to build something in PHP and MySQL, you wrote everything
05:33 you would build: you'd start with the authentication system and work on it from there. So to be honest, I feel like today open source for me has solved the whole code reuse problem, the how do we stop wasting our time rebuilding the same things over and over again. Yeah, you know, 20 years ago, you just built everything from scratch, and everything took months to do. It's really interesting, because I started out in C++, and I felt like a lot of the stuff I built was extremely low-level, not just because it was C++, but because of this sort of difference that you're talking about. And throughout the years, I've seen it just get... what used to be really hard is now just grab this library and plug it in; that other thing used to be hard, grab this library and plug it in. And it just seems like the natural consequence would be, well, we need fewer developers, because they just have to plug things together now instead of building them. And yet all we've done is decided to solve more ambitious problems. Absolutely, and in more interesting ways, right? It amazes me, the quality of software that we build these days. You know, like 10, 15 years ago, we weren't writing tests for everything; the quality of the stuff we wrote was just abysmal. And today, we've got continuous integration and continuous deployment, and it's really easy to get out there and analyze like 15 different options and pick the one that has the highest quality. Not only do you have CI/CD, you've got it for free in the cloud, on GitHub, you know, on Actions that trigger it, right? It's just such a cool place, cool time to be doing this stuff. Definitely. Yeah, yeah. So before we get to Django, how about now, what are you up to these days? So these days... I spent the last year at Stanford University doing a fellowship. There's a fellowship program called JSK, and it's journalism fellowships.
The idea is to get journalists from around the world together at Stanford for a year, thinking about problems facing journalism. And I managed to make my way in as a sort of computer scientist with journalism leanings, basically thinking about, okay, what are the open source tools that I can build that can help make data journalism more powerful and more widely accessible? And so the Datasette project was really accelerated by doing that. The fellowship finished a few months ago. The problem I'm having is that at Stanford, I was essentially paid to tinker on my own projects and go after whatever I thought was interesting, which is great, but I'm having real trouble stopping. So now I'm not getting paid, but I'm still working on my own projects, going after the things I find interesting. So I'm calling myself a consultant, and I am available for consulting opportunities, especially around this sort of set of tools that I've been building. But mainly, yeah, I'm focusing on continuing to build out this ecosystem of tooling for data journalism and related projects. Yeah, well, we're gonna get to those, and they're definitely super interesting ones. And like I said, this personal aspect, I think, could help a lot of people, not just journalists, for sure. But let's talk a little bit about Django. You mentioned that it came out of a journal... the Lawrence Journal-World or something like that? Well, yeah, it's a tiny newspaper in Lawrence, Kansas, a town I'd never even heard of. And this was back in 2002, 2003. I was a blogger; I had a blog about web development, and about 100 other people had blogs about web development. We all read each other's blogs. And Adrian Holovaty, who was a journalist and programmer, posted a job ad on his blog saying, hey, I want somebody to come and join me in Lawrence, Kansas, building websites for local newspapers.
And it coincided with my university course giving me the option to spend a year in industry, which is something that UK degrees do quite often. Nice. So I could take a year out, get a student visa, which meant I could travel, work in a different country, spend a year working, and then go back and finish my degree. And so the opportunities sort of aligned themselves. And I had huge respect for Adrian, just based on what I'd seen he'd been doing. And it felt like a pretty interesting adventure to run off to Kansas. Yeah, that's a cool adventure. And yes, so I did that. So essentially, it was a year-long, almost a paid internship. But it was in Lawrence, Kansas, at this little newspaper. And it was a fascinating place, because the family that owned the newspaper had laid fiber optic cable around a bunch of states a bunch of years beforehand, when everyone thought they were crazy, and then sold the whole lot to maybe Comcast or one of these big companies. So financially, they were very secure, which meant they could invest huge resources in that local newspaper for this little town. And so this newspaper, despite serving a town with a population of like 100,000 people, had way, way more resources than you would expect any local newspaper to have. They had
10:00 their own software engineering team who were building websites for things. And because the family owned the local cable company for the town, everyone in the town of Lawrence, Kansas had broadband internet in 2003. And that meant these websites could have like online videos and stuff, which no one else was doing, because what newspaper had an audience who could actually watch that kind of stuff? So it was a really exciting place to be inventing things around online news. And we also had a very ambitious boss, this chap called Rob Curley, who basically wanted us to act like we were the New York Times, even though we were like six nerds in a basement somewhere. We were this little local newspaper, and we had things like the local softball league, where all of the local kids were in softball teams competing against each other. And it turns out this is an amazing thing for a local newspaper to cover, because if you have good coverage of the softball league, everyone who knows a child in your town will buy your newspaper. So we went all in on kids' softball, and we ended up building a website for them; the idea was to treat them like the New York Yankees. So we had like player profiles and match reports and photo galleries. And then we sent two interns out to take 360-degree photographs of every softball pitch in town, and we had those on the website. And this was like VR or something; it was absolutely astonishing. That's so neat. I'm sure the kids felt so special as well; I bet they still have saved copies from that time. The best website we worked on, though, was the local entertainment portal for the town of Lawrence, Kansas. It was basically a website that had the events calendar, it had band profiles, it had restaurant reviews. It was sort of a super hyper-local version of Yelp crossed with a music magazine,
plus an events website, just for this one little town. And we had features like a download page where you could download MP3s of bands who were playing in town that week, because we had the MP3s for all of the local bands, again in like 2003, and a little radio widget that you could click play on. It was astonishing. I have never seen an entertainments website since that was as good as this one that we were building back in Lawrence, Kansas.
12:14 This portion of Talk Python To Me is sponsored by Linode. Simplify your infrastructure and cut your cloud bills in half with Linode's Linux virtual machines. Develop, deploy, and scale your modern applications faster and easier. Whether you're developing a personal project or managing large workloads, you deserve simple, affordable, and accessible cloud computing solutions. As listeners of Talk Python To Me, you'll get a $100 free credit. You can find all the details at talkpython.fm/linode. Linode has data centers around the world with the same simple and consistent pricing regardless of location. Just choose the data center that's nearest to your users. You also receive 24/7/365 human support with no tiers or handoffs, regardless of your plan size. You can choose shared and dedicated compute instances, or you can use your $100 in credit on S3-compatible object storage, managed Kubernetes clusters, and more. If it runs on Linux, it runs on Linode. Visit talkpython.fm/linode or click the link in your show notes, then click that create free account button to get started.
13:18 I have a history with Lawrence; I actually went to college there and got my math degree at the University of Kansas. No! I love that town. I love it. Like Mass Street, the little downtown, the brewery and all that. It was such a beautiful place, and I really enjoyed my time there. Yeah, it's a great town. I think when I was there, I'd never been anywhere else in America, basically, so it felt like a cool town, but I didn't really understand how cool it was until like 20 years later. I've now lived in America, and I've been to lots of towns, and Lawrence is special, you know; it's a very special town. It definitely is. But what I never knew was how cool this newspaper was. I mean, I was just a kid in college; I didn't read the newspaper a lot. So this is a really interesting cradle from which Django sprung. So tell us about Django and how it fits into this world. Sure. So basically, Adrian had built lawrence.com, this amazing entertainments website, in PHP. And both Adrian and I had hit that point in our PHP careers where it was straining under the size and complexity of the things that we wanted to do with it. This was before PHP 4, even, so classes were very new, and the PHP language was pretty primitive compared to what you have today. And meanwhile, Python was exploding in popularity. We were both huge fans of Mark Pilgrim's Dive Into Python, and Mark Pilgrim's blog where he talked about this. And so we decided that we really wanted to be working with Python for building these websites. But the Python web options back then were not particularly great. The main thing was Zope, and Zope was pretty good, but it didn't match the way Adrian and I thought about the web. We cared about things like designing our URLs carefully and separating our CSS from our markup, right, the sort of modern MVC framework that people almost take for granted
15:00 now. Right. But there weren't really any great options for that in Python. So we were looking at mod_python, the Apache module, as the way that we would put Python on the internet. And we were a little bit worried about it, because mod_python wasn't being used very widely, and we're like, okay, what happens if we bet the newspaper on mod_python and it turns out to be the wrong bet? Yeah. So what we'll do is we'll have a very thin abstraction layer between us and mod_python, so that if we have to swap mod_python for something else, we can do so. And that, basically, is what Django was; that was the initial seed of Django. We wanted a request and response object, a basic way of doing templating, basic URL routing. And so we built that out. We never thought of it as a framework. We called it the CMS, right; it was the CMS that ran the newspaper, and it kept on evolving these additional little bits and pieces. The Django admin was something... I went away to the South by Southwest festival for like four days, and when I came back, Adrian had written a code generator for admin websites that was churning out all of this stuff. We just kept on building these extra bits out. And then I went back to England; my year in Kansas ended. And about six months later, they open sourced Django. At the time I was working on it, it wasn't called Django; there were various ideas for names, which were truly terrible. But yeah, Jacob Kaplan-Moss had joined the team at that point, and they made the case to the newspaper that they should open source this thing. It was early days for that, right? Like, now it would be an easy sell, but back then that was weird, right? I asked them about this, and apparently one of the arguments they used is that Ruby on Rails had just come out and was exploding in popularity. And they could see that this company that released Rails was hiring people left, right, and center and was doing really well out of it.
So they went to the newspaper and said, hey, look, if we open source this, it's a great way for us to get talent and get free fixes and all of that sort of thing. And it worked. Like, you and I are sitting here talking about this small newspaper in Lawrence right now, right? I mean, we wouldn't be doing this otherwise. That's true. But the argument that worked is they said to the newspaper owners: we've been building on open source, right? The newspaper runs Linux, and we run Apache and Perl and Python, and we've used all of this open source stuff. This is a way of giving back. And that's the argument that apparently resonated with them. They said, oh, that completely makes sense; we can give back in that way. And yeah, so Django was open sourced. That was 15 years ago, I think, and it's just been growing ever since. Yeah. Did you predict this? You look around now, it's just ubiquitous. Does it blow you away, what's happened? It completely blows me away. The thing that really amuses me is that I keep on seeing people talking about Django as the boring option. Like, Django and Rails, yeah, those are the safe, boring options for things. And I actually saw someone on Twitter the other day say, well, nobody ever got fired for choosing Django. And I direct messaged Adrian and Jacob about that quote; I'm like, can you believe this, that we are now the "nobody ever got fired for choosing IBM" option? Exactly, exactly. I think there's definitely some truth to that. Quite interesting. It seems like it's got a ton of momentum, and it's really starting to embrace the async and await world, which is so lovely. Oh, yeah. So a lot of my projects these days don't use Django, but they do use ASGI. I really feel that the ASGI ecosystem that's growing up is so exciting. And Django is getting better at ASGI itself.
So I'm going to be able to merge a bunch of my ASGI projects back into my Django projects pretty soon, which is super exciting. Yeah, that's exciting. I'm super excited about FastAPI, and it's one of those that fits really well in that world. Yeah, I mean, I haven't really done FastAPI, but I love Starlette, which is the framework it's built on. Yeah, super cool. All right. Well, congratulations on Django, to you and everyone who worked on it. I do think it's really interesting if you look at the timing: Ruby on Rails came out of 37signals, now Basecamp, and was extracted from what they were using inside, right; they built it for themselves and extracted it. You guys built it at the newspaper and said, this thing we can pull out and make into something else. I think it's really interesting that it was polished and proven in a real place, right? The way I see it is, Rails was extracted from Basecamp, right; they built Basecamp, then they pulled the framework out of it. With Django, the goal was always lawrence.com. We had this entertainments website in PHP and MySQL, and we knew that we wanted the thing we were building to power that. So with Django, there was an existing target, and we evolved the framework until it could run a very high-quality newspaper entertainments listings website. So it's almost like one was extracted and one was evolved in the direction of supporting this one site. Yeah, it's neat to see; I think they both came out quite successful from those experiences. All right, let's start at the foundation of this recent work you've been doing. And in some sense, it's a natural progression, right? The journalism side of things is where the origin came from. So tell us about Datasette. So Datasette is... on its website, I call it an open source multi-tool for exploring and publishing data. Basically,
20:00 it's a web application which you can point at a SQLite relational database, and it gives you pages where you can browse the tables and run queries. It lets you type custom SQL queries and run them against that database, lets you use custom templates for how things render, and lets you get everything back out as JSON or CSV, so you can use it for API integrations. And it lets you publish the whole thing on the internet really easily. So it's a lot. Yeah. And one of the biggest challenges I've had is, how do I turn this into a bite-sized description that really helps people understand what the software does? I'm at a point now where if I can get somebody on a video chat, I can do a 15-minute demo, and at the end of it, they come out going, I totally get this, this is amazing. But that's not a way of explaining software that scales particularly well. Yeah, well, let me see if I can, with my limited exposure to it and knowing somewhat where we're going. You have this data source that's pretty ubiquitous, or can become ubiquitous in terms of like some sort of ETL, with SQLite, right? SQLite is everywhere. What's beautiful about it is there's no "please set up the server and make it not run as root and then put it on your network." Right. The magic of SQLite... it boasts that it's the most widely distributed database in the world, which it is, because it runs on every phone. My watch has SQLite tracking my stats. Every iPhone app, every Android app, every laptop, they're all running it. Yeah, your phone, that's crazy. It's a file format, right? A SQLite database is a single .db binary file on disk, which, like you said, makes it so convenient, because I don't have to ask a sysadmin to set me up a Postgres schema or anything like that. I just create a file on my laptop, and that's my new database. Yeah. And it's even built into Python, right? It just comes with Python's standard library. Yeah, exactly.
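The "a database is just one file, and Python ships with it" idea can be seen in a few lines. This is a minimal sketch with a made-up table; it uses an in-memory database so it's self-contained, but passing a path instead would give you a real .db file you could point Datasette at.

```python
import sqlite3

# SQLite needs no server: a database is a single file on disk, and the
# sqlite3 module ships with Python's standard library. ":memory:" keeps
# this demo self-contained; pass a path like "photos.db" to get a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, city TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO photos (city, label) VALUES (?, ?)",
    [("San Francisco", "pelican"), ("Lawrence", "dog")],
)
labels = [row[0] for row in conn.execute("SELECT label FROM photos ORDER BY id")]
print(labels)
```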
So that's super cool. And it's great that we have this data format where we either have data in there already, or you could do like an API call and then jam the data in there, right, something to get it into that format, which is great. But you could explore that with like Beekeeper Studio or some data visualization tool or SQL management studio. That doesn't work for journalists, though; that doesn't work for getting it on the internet; that doesn't do the transformations. In some sense, I kind of see it almost as a really advanced, web-based data IDE, but user-friendly. Yeah, but the emphasis is absolutely on publishing, on getting it online, and then on being web-native. Like, everything in Datasette can be got as JSON as well as HTML; it can return CSV to you. You pass the SQL query in a GET request, in a query string, so you can bookmark queries, all of that kind of stuff. Yeah, that's really the key idea: how do you take relational databases and make them as web-native as possible, and as cheap and inexpensive to host and to run as possible? So you can take any data that fits in a SQLite database, which is almost everything, and stick it online in a way that people can both explore it and start integrating with it as well. And another key idea in Datasette is that it has a plugin system. I've actually written over 50 plugins for it now that add all sorts of different things: different output formats, so you can get your database out as an Atom feed or an iCal feed; I've got visualization plugins that plot the data on a map or give you charts and line graphs and so on. I just this morning released an authentication plugin that supports the IndieAuth authentication mechanism, so you can use IndieAuth login to password-protect your data. All of these different things.
And honestly, having a plugin system is so much fun, because I can come up with a terrible idea for a feature, and I can build it as a plugin. And it doesn't matter if it's just an awful idea that nobody should ever have implemented, because it's not causing any harm to the core project. It's a super interesting idea. I also think it might be a way to encourage others to contribute, because they don't have to understand the whole system and be afraid of breaking it; they just have to understand, here's the three functions I implement to make this happen. True. When people contribute to open source, that's more work for me, because I have to review that pull request and figure it out and so on. But if you write a plugin, you can release that plugin to the Python Package Index, and I don't even have to know about it. I can wake up one day, and my software has new features, because somebody built a plugin and shipped it, which I think
24:24 is fantastic. And they don't have to go through you as a gatekeeper. Even if you might be super friendly and whatnot, they just don't have to have that interaction, right, which is pretty cool. Yeah. So one of the things that's interesting about Datasette is the way you get your stuff online: you basically just run Datasette against a SQLite database, and now you have this website that lets you explore it like you described. So you say "datasette", space, path to SQLite file, and now you have a web app running, right? So you type "datasette" and the name of the file, hit enter, and it runs on your local laptop, and you can start browsing and exploring it. But then, if you want to put it online, I've been building out integrations with a bunch of different hosting
25:00 providers, where you can, from the command line, type "datasette", space, "publish", space, then pick your provider, say Google Cloud Run: datasette publish cloudrun, name of database, enter, and it'll upload that database to the internet, wrap it in the application itself, give it a URL, and start serving. So it's like a one-liner for publishing data online with a URL that other people can start using. And that space is enabled by all of these fascinating serverless hosting providers, and that was actually one of the original inspirations.
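The workflow just dictated looks roughly like this as shell commands. The database name is hypothetical, and this assumes the datasette package is installed (for example via pip) and that you have Google Cloud credentials configured for the publish step:

```shell
# Serve a local browsing and query UI for a SQLite file:
datasette photos.db

# Publish the same database to a serverless host, e.g. Google Cloud Run:
datasette publish cloudrun photos.db --service=my-photos
```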
30:00 So I've got a query for that, and now I've bookmarked it, and I've got an application, which is the "here are the GitHub issues that you should go and look at" application. So an entire application ends up being a URL that you can bookmark. That's really interesting. That's, again, a very web-native way of thinking about the problem domain. You know, you were talking about starting out working in the first dot-com boom, and one of the things that was all the rage back then were mashups. Do you remember like Yahoo mashups and all that kind of stuff? Absolutely. Yep.
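That "an application is just a bookmarkable URL" idea can be sketched with nothing but the standard library: Datasette accepts the SQL itself as a query-string parameter, and appending .json (or .csv) to the database path selects the output format. The host and database name below are invented for illustration:

```python
from urllib.parse import urlencode

# A Datasette "report" is just a URL: the SQL lives in the query string,
# so sharing the whole application means sharing a bookmark.
# Host and database name here are made up.
base = "https://example.datasette.example.com/issues"
sql = "select repo, count(*) from issues group by repo order by count(*) desc"
url = base + ".json?" + urlencode({"sql": sql})  # .csv would return CSV instead
print(url)
```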
30:32 Talk Python To Me is partially supported by our training courses. Python's async and parallel programming support is highly underrated. Have you shied away from the amazing new async and await keywords because you've heard it's way too complicated, or that it's just not worth the effort? With the right workloads, a 100-times speedup is totally possible with minor changes to your code. But you do need to understand the internals, and that's why our course, Async Techniques and Examples in Python, shows you how to write async code successfully, as well as how it works. Get started with async and await today with our course at talkpython.fm/async.
31:10 So it sounds to me like what you can almost build here is a super interesting mashup. You can extract the data from GitHub, you can extract it from over here, and you put it together in this new form, and now you've got this API on top of it, right? A massive realization I've had working on this stuff is that lots of websites have APIs, and APIs sometimes have a lot of features; like, the GitHub API can do some pretty powerful stuff. But I can always think of something the API can't do, that they didn't predict I'd want. If I can get all of my data out of that API and into a SQLite database, then there are no limits, and any question I can think to ask, I can apply against that thing. So basically, the only thing I use APIs for now is to get everything
31:54 out of there: download it, sync it, get everything into a database, and now I can start asking questions of my data and building things. Yeah, I think where it gets interesting, as we'll see when we get on to the final Dogsheep side of things, is: it's great that GitHub has an API, it's great that Twitter has an API, that Gmail has IMAP, and all these different things have rich, deep ways to talk to them. But if you want to talk to all of it at the same time and say, I want to know what I've tweeted, emailed, or whatever else I've done about something, you don't want to try to build that integration of all those APIs; it gets super gnarly. But if you get it into some kind of SQLite database, all of a sudden it becomes an option, right? It's this personal data warehouse idea. And it's not just personal data. As a company, if you're a company with 50 different Git repositories, which lots of companies have, getting all of that metadata from all 50 of those repos into one place, and I've got tooling that will let you do exactly that, is crazy useful. It lets you query across all of your issues and all of your comments, and it lets you talk about, like, here is what our software team as a company has accomplished, that kind of stuff, right, which is still super hard, even if you go to GitHub to do that. So I want to talk through a few examples that maybe we could mention really quickly. You gave a talk that covered both Datasette and Dogsheep at PyCon AU online this year, right? This year, last year? Yes, I did. It was this year. Yeah. So in there, you talked about all sorts of interesting things, so I want to cover some of the examples there, because they really made this stuff connect for me. Oh, okay. Yeah. So the first one was, you said, let's just search for random SQLite databases on my Mac.
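As a toy sketch of that "query across all your repos" pattern: the rows below are hardcoded stand-ins for what you would actually pull from the GitHub API (for example with Simon's github-to-sqlite tool), and the repo names are invented. The point is that once everything is in one SQLite database, a single SQL query answers a question no individual API endpoint does.

```python
import sqlite3

# The data-warehouse pattern: sync everything the API will give you into
# SQLite once, then answer arbitrary questions with plain SQL.
# These rows are hardcoded stand-ins for real GitHub API responses.
sample_issues = [
    {"repo": "acme/web", "title": "Fix login redirect", "state": "open"},
    {"repo": "acme/api", "title": "Add rate limiting", "state": "open"},
    {"repo": "acme/web", "title": "Upgrade Django", "state": "closed"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (repo TEXT, title TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO issues VALUES (:repo, :title, :state)", sample_issues
)

# One query across every repo at once.
open_per_repo = dict(
    conn.execute(
        "SELECT repo, COUNT(*) FROM issues WHERE state = 'open' GROUP BY repo"
    )
)
print(open_per_repo)
```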
You said, oh look, here we randomly found one in the Photos library, let's look at that, right? You just pointed Datasette at it. How did you search for it? Did you do like ".db" or something like that in Spotlight? Yeah, there's a Spotlight command that you can run which will show you every SQLite database on your Mac, and it's fascinating. Oh my goodness, the number of weird little databases that you already have: your Firefox history, your Chrome history is on there, Evernote uses SQLite. There were quite a few databases I found that I still don't quite know what they are, but they've got things like places that I've been over the past couple of years, just sat there in a SQLite database somewhere, which is super interesting. Yeah, for sure. So in this demo, you used that command to find the SQLite database backing your Photos library, and then said, well, let's just pull that up and poke around, right? Tell us about that. Yeah. So photos, this was always one of my sort of white whales: I want my photos data. I've taken 40,000 photos; they've got timestamps and latitude and longitude and all of this. How can I get that metadata into a SQLite database so I can run queries against my life in photos? And I've tried getting this to work with things like Google Photos in the past. Google Photos doesn't give you access to latitudes and longitudes, I think for privacy reasons. But anyway, the big realization I had was that Apple Photos, on your phone and on your laptop, uses SQLite. There's a little secret database where they've actually already done it for you; you've just got to go find it. It's probably huge. Yeah, 800 megabytes of data in one SQLite database file.
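For listeners not on a Mac, where Spotlight's `mdfind` can do this kind of search, a portable way to hunt for SQLite files is to check for the 16-byte magic header that every SQLite database starts with. This is a sketch of the idea, not the command Simon actually used:

```python
import os
import sqlite3
import tempfile

MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite database file

def find_sqlite_files(root):
    """Walk a directory tree, yielding paths that start with the SQLite magic header."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    if f.read(16) == MAGIC:
                        yield path
            except OSError:
                continue  # unreadable file: skip it

# Demo: create a throwaway database, then find it again by its header
with tempfile.TemporaryDirectory() as root:
    conn = sqlite3.connect(os.path.join(root, "demo.db"))
    conn.execute("CREATE TABLE t (x)")
    conn.commit()
    conn.close()
    found = list(find_sqlite_files(root))
    print(found)
```

Pointing this at your home directory turns up the same surprising crop of hidden databases Simon describes.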
35:00 It's not very easy to query, because it's not designed for people to use it from outside, but if you jump through a bunch of hoops you can get that data back out again. And then you start finding some really interesting things. So obviously they've got a record for each of your photos with when it was taken and the latitude and longitude. They've reverse-geocoded those locations, so they can actually tell you in the database: this was in San Francisco, this was in the Mission District, those kinds of things. But then the coolest thing is Apple use machine learning to identify what your photos are, is it a dog or a cat or a pelican, because you can go to your Photos app and search for that. Like, show me cars, and somehow cars come up. Exactly. But the beautiful thing about that is it turns out they run the models on your laptop. Where Google and Facebook will upload your photos to the internet and put them in a data center somewhere, Apple downloads these big binary machine learning weights files onto your device, they actually run them on your phone overnight, and they use those to identify what's in your photos. So from a privacy point of view this is perfect, because you're not uploading your photos somewhere for some creepy machine learning model to run against; it's all happening on devices that you control. And the results of that go into a SQLite database, so I can get them out and into Datasette. So I have an example query that shows me photographs I've taken of pelicans, based on Apple's machine learning labeling my photos of pelicans. And I can visualize those on a map, because they've got latitudes and longitudes with them, and so on.
And then the really fun thing is, there were various clues in the Photos app that they're doing quality evaluations. Like, if they show you all of your photos for a month, they'll pick a good photo to show as the sort of cover of that album or whatever. That's machine learning as well, it's running on your device, and it's based on these scores. And the scores are sat there in the database, with names like ZOVERALLAESTHETICSCORE and ZPLEASANTCAMERATILTSCORE and ZHARMONIOUSCOLORSCORE. So you can say things like, show me my pelican photograph with the most harmonious colors, with the most pleasant camera tilt, and just get things back that way. You could even set up like a walking tour: show me where I've taken aesthetic photos of pelicans, starting with the best one, and then the next, and then the next. That is such a good idea. Yeah, and that's just a SQL query: order by aesthetic score, descending. And there's also facial recognition, which again is trained by you and runs on your device, so it's the least creepy version of it. So I've run a SQL query saying, show me photographs of my wife Natalie and my friend Andrew, and show me the one with the most pleasant camera tilt that was taken outdoors. And this stuff all just works; it's baffling, and really super fun. So one piece that we probably should connect for folks: if they try to follow along, they find that SQLite database and then they throw Datasette at it, they're going to end up with binary blobs where this data lives, right? Oh, totally. Yeah, Apple's SQLite format uses binary plists in some of the columns. And also it's actually quite hard to even open it, because it'll crash and tell you that you don't have their custom something-extension running.
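A toy version of the "order my photos by aesthetic score" query. The table and column names below are invented stand-ins modeled on the scores described above, not Apple's actual schema:

```python
import sqlite3

# Illustrative stand-in for the kind of data dogsheep-photos extracts;
# the table and column names here are hypothetical, not Apple's schema.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE photos (
        uuid TEXT, label TEXT,
        overall_aesthetic_score REAL,
        pleasant_camera_tilt_score REAL
    )
""")
db.executemany("INSERT INTO photos VALUES (?, ?, ?, ?)", [
    ("a1", "pelican", 0.71, 0.40),
    ("a2", "pelican", 0.93, 0.65),
    ("a3", "dog",     0.88, 0.90),
])

# "Show me my pelican photos, best-looking first" is just an ORDER BY
best = db.execute("""
    SELECT uuid FROM photos
    WHERE label = 'pelican'
    ORDER BY overall_aesthetic_score DESC
""").fetchall()
print([row[0] for row in best])  # ['a2', 'a1']
```

The walking-tour idea is the same query with the latitude and longitude columns added to the SELECT.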
So the way I've addressed that is, I found this software on GitHub called osxphotos, which is someone's open source library for talking to the macOS Photos SQLite database and working around some of these weird issues. And then I built my own tool on top of that called dogsheep-photos, which pulls out the metadata of your photos into a nicer format, including getting the machine learning labels and stuff. But it's also got a tool for uploading the photo files themselves to an S3 bucket, because if you want to really take control of your photos, you need them to have URLs so that you can embed them in pages and link to them and so on. So I've got a whole toolchain for uploading all of my photographs to S3, extracting the metadata into a separate database, and then publishing that database with Datasette somewhere so that I can run queries against it. Yeah, super cool. All right, well, I think that's probably a good transition over to Dogsheep. So we have these different sources of data; it's nearly unbounded at this point, right, where data about you might live on the internet. And there was this firebrand of a character that you mentioned, around Wolfram Alpha. Yes, he does some crazy, crazy, weird stuff, but he also had this idea that inspired you to try to bring those sources together and build on top of Datasette. So maybe start with that story, and we can tell people what Dogsheep is. Okay. So there's this chap called Stephen Wolfram, who created Mathematica and Wolfram Alpha. He's the CEO of a 1,000-person company, and it turns out he's a remote CEO, runs the entire company from home, which is kind of fascinating. Yeah, and he has been for a while, right, like before it was cool. Absolutely, he's been doing the COVID thing for years and years and years.
And so in February of last year, he published an essay called Seeking the Productive Life: Some Details of My Personal Infrastructure. And this thing, I would thoroughly recommend everyone take a look, just to marvel at quite how long it is. He has spent 40 years
40:00 optimizing every single inch of his personal and professional life, and he wrote about all of it in one place. Like, he scanned every document he's worked on since he was 11 years old and got them OCRed. He's got a green screen in his basement for giving remote talks. He had a standing desk, but his heart rate monitor showed him that walking outside is better for his heart, so he rigged up a sort of little tray mechanism so he could use his laptop while walking in the woods. It's just astonishing. And I read through this essay thinking, this is next level stuff. But there was this one little bit in it that caught my eye: he talks about how he has a personal search engine, something he calls his metasearcher. So he's got his own private search engine that searches everything: every email he's sent, every paper he's written, everyone he knows who might know things about a topic, everything he's read, all the files on his machine, all in one place. And I thought, well, that's something I'd like. I would love to have one place with as much of my personal data from different sources as possible, where I can query it. Like, I know I was talking to this person, but was it on iMessage? Was it in email? Was it over Slack? Where the heck did I tell them this thing that I need to get back? Absolutely. And combine that with, you know, your bookmarks and your GitHub issues and your messages and all of these
41:21 things? Oh, yeah. I felt like there was something interesting there. And then, honestly, the best idea I've had in all of this: I thought, well, it's inspired by Stephen Wolfram, but it's not as good as what he's done, so if he's Wolfram, maybe I should be doing something called Dogsheep, because dogs and sheep are the less alpha versions of both animals. And then I thought, well, he's got a search engine called Wolfram Alpha, so I could build a search engine called Dogsheep Beta. And that joke stuck in my head, and I enjoyed it so much that I've spent like 12 months building this. The pun was so good it had to exist.
41:57 Yeah, so this entire project is basically pun-driven development; it's driven out of this pun that I came up with a year and a half ago. And so the idea with Dogsheep: it's basically an umbrella project for a whole bunch of tools around this idea of personal analytics. Like, what data is there about me in the world? How can I get that data out of lots of different sources and into SQLite databases? Because once it's in SQLite, I can run Datasette on top of it, and now I've got this personal data warehouse of my data from all of these different sources. And then on top of that, I can build a search engine, which I've now built, which ties all of this stuff together again. So I've been tinkering around with all sorts of tools in this category for just over a year now. Right now I've got data in my personal Dogsheep from Twitter: I've got all of my tweets, but also all the tweets I've favorited. I've favorited like 30,000 tweets, and I can search those and see who I've favorited the most, and so on. I've got all of my photos, as we discussed earlier. I've got my HealthKit data from my Apple Watch, which means I can tell you my heart rate going back three years or something. How do you get it off the Apple Watch? So again, Apple are really good for this kind of stuff. They don't upload it somewhere; they keep it on your phone and on your watch. But there's an export button in the Health app on the iPhone, a button that says export my data. It actually creates a zip file full of XML on your phone, and then gives you the option to AirDrop it to your laptop. So I do that, I get a 300 megabyte zip file full of XML, and then I wrote a script which reads that XML into SQLite. So I've got all of that data.
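The XML-to-SQLite script Simon mentions can be sketched roughly like this. The XML fragment below is a simplified, hypothetical stand-in shaped like Apple Health's export, not a real export file:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Simplified fragment in the general shape of Apple Health's export.xml;
# the real file is hundreds of megabytes of <Record> elements.
xml_data = """
<HealthData>
  <Record type="HKQuantityTypeIdentifierHeartRate"
          startDate="2020-11-01 08:00:00" value="61" unit="count/min"/>
  <Record type="HKQuantityTypeIdentifierHeartRate"
          startDate="2020-11-01 12:00:00" value="88" unit="count/min"/>
</HealthData>
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (type TEXT, start TEXT, value REAL, unit TEXT)")
for rec in ET.fromstring(xml_data).iter("Record"):
    db.execute(
        "INSERT INTO records VALUES (?, ?, ?, ?)",
        (rec.get("type"), rec.get("startDate"),
         float(rec.get("value")), rec.get("unit")),
    )

# Once it's in SQLite, "my average heart rate" is one query away
avg = db.execute(
    "SELECT avg(value) FROM records WHERE type LIKE '%HeartRate'"
).fetchone()[0]
print(avg)  # 74.5
```

The real healthkit-to-sqlite tool does essentially this at scale, plus splitting workouts and locations into their own tables.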
The best thing about that is, any time you record an outdoor workout, like if you go for a run, or even if you go for a walk, it records your latitude and longitude every few seconds, and that's available in the data. So I've got, like, within-10-meter maps of every walk I've taken for the past year, which is super fun. I mentioned GitHub: I've got all of the data from all of my GitHub projects, and I've got over 400 repositories now, so that's actually quite a lot of stuff. I use Foursquare Swarm and check into places, so I've got 4,000 Swarm check-ins. Or you have Google Takeout, which is insane, right? I've only done a little bit of work with Google Takeout; that's one of the least developed tools. But yeah, there you can get Google's version of your location history, which for me is like 250,000 latitude-longitude points. I don't even know where they got that stuff from. Yeah, I recently did a Google Takeout, and I think zipped it was 61 gigabytes or something. It's a lot of data. It's a lot of data, and a lot of that is photographs and document files and stuff, but there's a ton of very detailed JSON data about you in those exports as well. It's always fun to look for the ad targeting stuff, because you'll find out that you have been assigned the role of, like, middle-aged tech executive or something, and you can see what they're targeting you based on. I've got Evernote; I've got like 600 notes from Evernote. My Goodreads data on books I've read, which is synced from my Kindle. Oh, and then the most fun one is I've got a copy of my genome.
45:00 Because I did 23andMe a few years ago, and I found out they've got an export button, and you get back a CSV file of 600,000 gene pairs from your genome, which I can run SQL queries against. So I have a saved query that tells me what color my eyes are, based on interrogating my own copy of my genome, which delights me. That is pretty amazing. That's just insane. Yeah. So that's a ton of data, right? This is a lot of stuff, and I'm barely even scratching the surface of what could be pulled into this. Right, and these are all like plugins, or separate tools. So wherever the data lives, if there's an API or web scraping, you can have it, right? Yeah, the tools are all called things like twitter-to-sqlite, or github-to-sqlite, or I think I've got genome-to-sqlite somewhere; that's just the naming convention that I use. But the core idea is you knock out a quick Python command line tool which either takes a zip file you got from somewhere, or hits an API with API credentials, and it slurps down as much data as it can and puts it in a SQLite database. And that's all it does. Then it's up to you to run Datasette against it and start doing the fun querying. So that's cool. So I think maybe it would be good to connect this to an example again. In your PyCon AU talk, you talk about your dog, and figuring out, using Twitter, how to graph the weight of your dog over time, and a map of where your dog likes to go on walks. Absolutely. So my dog is Cleo. First of all, tell me how dog and Twitter go together. Not Dogsheep, but just, like, a dog. So Cleo has a Twitter account. Because, I mean, to be honest, most dogs have Instagram these days, but Cleo's a bit more old fashioned, so Cleo's on Twitter. She's more on the tech side, less on the young influencer side. Yeah, got it. She's @cleopaws, C-L-E-O-P-A-W-S, on Twitter, and she tweets about things: she tweets selfies and things that she likes, and so on.
And every time we go to the vet, she tweets a selfie of herself at the vet, and they weigh her, and she tweets how much she weighs: I weigh 49.3lb, I grew more dog, and there's a selfie. And one of the things I've done with Dogsheep is I've imported all of her tweets, and so now I can run a SQL query that just pulls up the tweets containing "lb" for pounds and "weigh". So I can pull back just the tweets where she said how much she weighs. And then I've got a regular expression plugin for Datasette that adds a custom SQL function that can do regular expressions, because that's a useful thing to have. So I can pull out her weight with a regular expression into a separate column. And then there's a charting plugin, so I can chart date against weight and see a chart of how much she weighs, based on her self-reported weight in the selfies that she's posted on Twitter. Which is clearly a killer app, right?
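The weigh-in extraction is easy to sketch in plain Python; in Datasette it would be a custom regexp() SQL function doing the same thing. The tweets here are invented examples in the style described:

```python
import re

# Invented tweets in the style of @cleopaws' vet visits
tweets = [
    ("2018-03-01", "Just got weighed at the vet. I weigh 48.9lb. I grew more dog!"),
    ("2019-07-12", "Vet day again! 49.3lb of good dog right here."),
    ("2020-02-20", "No weigh-in today, just treats."),
]

# Pull "<number>lb" out of each tweet so (date, weight) pairs can feed a chart
weights = [
    (date, float(m.group(1)))
    for date, text in tweets
    if (m := re.search(r"(\d+(?:\.\d+)?)\s*lb", text))
]
print(weights)  # [('2018-03-01', 48.9), ('2019-07-12', 49.3)]
```

Feed those pairs to any charting tool and you have the dog-weight graph, no manual record-keeping required.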
47:41 It's absolutely frivolous, but what I think it shows that's so interesting is what you can do once you put these pieces of data together. All of a sudden you can run arbitrary queries and apply some special filtering to data sources you never expected to combine, and you have a graph you never knew you were keeping, about the weight of your dog or some other thing you're interested in. And this is just a SQL query, and the SQL query goes in a bookmark. So the entire application, show me a chart of my dog's weight based on her self-reported tweets, is a bookmark.
48:15 And it's actually super useful. Super cool. Now, how do you get a map of where your dog likes to go? So for that one, I mentioned I use Foursquare Swarm and I check in to places. Every time the dog's with me, I use the wolf emoji in the check-in message, because it looks a little bit like her. And it turns out SQLite does emoji these days, so you can run a SQL query where you look for LIKE percent wolf-emoji percent. Right, it's just a character that's unique. Exactly. So then you get back the check-ins where my dog was there, and because I've got latitude and longitude in that query, I can put them on a map. So I've got a map of places my dog likes to go, based on the wolf emoji in my Swarm check-ins. And again, it's just a bookmark; each of these custom applications is a bookmark. I think those few examples really bring home the unexpected power of what you kind of unleash when you get at this stuff. Completely. There's a project I should mention that relates to this. I've been writing a lot of these tools that create SQLite databases, right? All of these Dogsheep tools pull something from somewhere and turn it into SQLite. And the way I do that is using a Python library that I've been building called sqlite-utils, sqlite hyphen utils. sqlite-utils is a bunch of utility functions that make it really productive to create new SQLite databases. The core idea is, say you've got an array of JSON objects: you can call .insert() with those JSON objects, and it will create a SQLite table with the schema that's needed to match them. Wow. So it just looks and says, these are the top-level keys, so we're going to make those the columns, something like that.
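The wolf-emoji trick really is just a LIKE query, since an emoji is only another Unicode character to SQLite. A sketch with made-up check-ins:

```python
import sqlite3

# Hypothetical Swarm-style check-ins; the wolf emoji tags dog walks
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE checkins (venue TEXT, lat REAL, lon REAL, shout TEXT)")
db.executemany("INSERT INTO checkins VALUES (?, ?, ?, ?)", [
    ("Ocean Beach", 37.76, -122.51, "Morning walk 🐺"),
    ("Blue Bottle", 37.78, -122.41, "Coffee run"),
    ("Fort Funston", 37.72, -122.50, "🐺 loves it here"),
])

# An emoji is just another character, so LIKE works fine
dog_spots = db.execute(
    "SELECT venue, lat, lon FROM checkins WHERE shout LIKE '%🐺%'"
).fetchall()
print([row[0] for row in dog_spots])  # ['Ocean Beach', 'Fort Funston']
```

The latitude and longitude columns in the result are exactly what a Datasette map plugin needs to plot the walks.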
These ones are integers, these ones are floats, this one's text, and it creates the table automatically. Which means if you're working with an API that's well designed, like the GitHub API, which returns lists of JSON objects, it's a Python one-liner to turn those into a SQLite table with the correct columns. And you can say, oh, and make the ID
50:00 column the primary key, and set these up as foreign keys, and those kinds of things. And that's been crucial, because it means that I didn't have to come up with a database schema for Swarm and Twitter and GitHub and Apple Photos and all of that; I just had to get the data into a list of objects, and the schema was created for me. Oh, it's got a little bit of a NoSQL feel, in SQL. Exactly. And SQLite, it turns out, can deal with JSON as well, so you can stick a JSON document in a SQLite column, and then there are SQLite functions for pulling out individual keys and that kind of thing. But yeah, it means it's all super productive. And sqlite-utils also comes with a command line tool, so for simple things you don't have to write any Python at all. You can, like, wget a JSON blob, pipe it into the sqlite-utils command line tool, and tell it to insert it into a table, and it will create a database file on disk and populate the table. And then you can do stuff like configure full-text search against it, or set up extra foreign keys, or whatever it is. That's super neat. And so all of these different integrations that you built, it sounds to me like they could be useful on their own for people listening who go, you know, I'd really love to get my Foursquare Swarm data as a SQLite database, and they don't necessarily want to use Dogsheep; these plugin pieces might be cool building blocks. If you can get yourself an OAuth token for your Swarm account, which I've got an online tool that will do for you, you pip install swarm-to-sqlite, then you type swarm-to-sqlite swarm.db --token=that and hit enter, and that's it. It's like a one-liner on the terminal, and that will give you a SQLite database with all of your Swarm check-ins in it. Wow. I'm pretty fascinated by the idea that I could go to one place and just search everything about me. Absolutely.
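A minimal sketch of the auto-schema idea behind sqlite-utils' .insert(): infer column types from the first record's values. The real library handles far more (primary keys, upserts, nested data, type coercion); this shows only the core trick, and the function name here is invented:

```python
import sqlite3

def insert_all(db, table, rows):
    """Create a table whose schema is inferred from a list of dicts, then insert them."""
    types = {int: "INTEGER", float: "REAL"}
    cols = ", ".join(
        f"[{k}] {types.get(type(v), 'TEXT')}" for k, v in rows[0].items()
    )
    db.execute(f"CREATE TABLE [{table}] ({cols})")
    placeholders = ", ".join(f":{k}" for k in rows[0])
    db.executemany(f"INSERT INTO [{table}] VALUES ({placeholders})", rows)

db = sqlite3.connect(":memory:")
insert_all(db, "repos", [
    {"id": 1, "name": "datasette", "stars": 4500},
    {"id": 2, "name": "sqlite-utils", "stars": 800},
])
schema = db.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'repos'"
).fetchone()[0]
print(schema)
```

With this shape of helper, ingesting any well-behaved JSON API response is a one-liner, which is exactly why none of the Dogsheep tools needed hand-written schemas.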
So that feature, because I had the stuff in dozens of different databases and tables, I actually built the Dogsheep Beta search engine just a couple of months ago. And basically the way that works is you give it SQL queries to run against all of your other tables. So you say, for the GitHub one, select title and created date and so forth from issues; for the Twitter one, select this, select that. And you run a script and it will run those SQL queries against all of your, like, 20 different databases, load the results into a new database table, and set up full-text search on it. So it's kind of like using something like Elasticsearch, where you pipe your data from lots of different sources into one index. In this case, the index is just another SQLite table, and that gives you a faceted search interface on top that lets you search across all of the different things you've ingested into it. Right, if you build an index it'll be nice and fast, and then you just say, well, you've got to go back to these five tables and get these various details, right? Yeah, and that's actually part of the tool: you can set up a SQL query for each type of content that says, to display it, run this SQL query to grab these details, stick them in this Jinja template, and put that on the page. So when you display the results, it can use all of the rich data that's coming back, but the actual index underneath it is basically title, content, and date, and that's it. Yeah. Wow. Okay, pretty interesting. What about email? I don't see that in this list, like a connector for email; there's Google Takeout, but that's not exactly the same. I will admit, I have not done email yet, because I am terrible at email.
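The Dogsheep Beta indexing scheme described above, run one SQL query per source table and copy the results into a single full-text index, can be sketched with SQLite's built-in FTS5 (assuming your SQLite build includes it):

```python
import sqlite3

# Two toy "source" tables standing in for twitter.db, github.db, etc.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets (text TEXT, created TEXT)")
db.execute("CREATE TABLE issues (title TEXT, created TEXT)")
db.execute("INSERT INTO tweets VALUES ('Walked the dog on the beach', '2020-01-05')")
db.execute("INSERT INTO issues VALUES ('Map plugin crashes on load', '2020-02-10')")

# One FTS5 index, populated by a SQL query per source table
db.execute("CREATE VIRTUAL TABLE search USING fts5(type, title, date)")
db.execute("INSERT INTO search SELECT 'tweet', text, created FROM tweets")
db.execute("INSERT INTO search SELECT 'issue', title, created FROM issues")

# One query now searches everything at once
hits = db.execute(
    "SELECT type, title FROM search WHERE search MATCH 'dog'"
).fetchall()
print(hits)  # [('tweet', 'Walked the dog on the beach')]
```

The type column is what lets the display layer route each hit back to its source table for the rich per-type rendering Simon describes.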
And I'm almost a little terrified what will happen if I start running SQL queries against 10 years of unread email. But I'm sort of transitioning into doing freelancing and consulting, and one of the most important traits for consultants is that they're on top of their email. So I think the next task I have to take on is getting good at email, and then ingesting it into Dogsheep. Yeah. I mean, it's kind of like this open plugin type of architecture, so someone else could create it as well, if they're listening and they just want it to exist, right? Absolutely. I mean, honestly, the email standards are good enough now that writing a tool that turns your email archive into a SQLite database is pretty trivial. Apple's Mail.app uses SQLite anyway, so I've actually done a little bit of poking around just looking at that database. Okay, interesting. You could probably point it at an Outlook PST file, if the world has cursed you and you have to work in Outlook, or just use IMAP or POP3 or something like that, right? There's a very solid Python library for reading Outlook mailboxes, so you could totally use that. It would just be a bit of glue code on top to turn that into a list of JSON objects and pipe them into the sqlite-utils library. Yeah. All right, super cool, Simon. This is like a bunch of levels building on top of each other. And also, thank you for the history of Django; it was really cool to hear how you experienced it coming into existence. That was neat. Oh, yeah. We could talk about a bunch more, but I don't want to take all of your time, so let me just ask you the final two questions here.
I always ask: if you're going to write some Python code, what editor are you using these days? I'm all about VS Code, especially the most recent version of their Python integration, which is controversial because it's the one bit of it that's not open source, but that thing is just miraculous. Nice. Pylance, right? I think so, yeah. It's showing me, hey, this variable hasn't been used yet, and this import wasn't working, and all of that kind of stuff. So I'm all into VS Code now. Yeah, okay, that's definitely one that seems to be coming along and catching a lot of traction. Cool, cool. Notable
55:00 PyPI package, something you've run across that you're like, oh, this thing is cool, you should really know about it? This is great, because I can answer it using Dogsheep. I've got all of my starred GitHub repos pulled into my Dogsheep database, and I can actually run a Dogsheep Beta search and say, show me everything I've starred, sorted by most recent. So the most recent Python one I starred is astor, A-S-T-O-R. I've heard of that; I forget what it is, though. Yeah, it works with the Python abstract syntax tree, so it's for building software on top of Python itself. And the reason I found it is that I found this tool called flynt, which rewrites all of your .format() calls and turns them into Python 3.6 f-strings. Yes. And I was like, oh, how does that work? And it turns out it's astor under the hood. I was actually going to give a shout-out to flynt. On Datasette, the last commit at the time of this recording was "use f-strings in place of format". You can just point flynt at the top-level directory, and it just fixes everything. That's how I found astor; I was playing around with that. And then the other one, it's not a recent favorite, but I'm going to promote HTTPX. Oh yeah, I'm a big fan of that one as well. You talked about the ASGI side; this is like consuming services, the client side rather than the server side. Yeah. One way to look at HTTPX is that it's the new requests; I think it was almost called requests 3 at one point in its history. But it's basically the modern version of requests with full async support, so you can use it synchronously and you can use it asynchronously. But the killer feature, from my point of view, is that you can instantiate an HTTPX client and point it at a Python ASGI or WSGI object, and then start running requests against it. So it's an amazing test harness.
So yeah, all of my tests, in every Datasette plugin that I've written, are actually using HTTPX. You get to do HTTP testing without even having to spin up a localhost server; it's all happening in memory, and that's just extraordinary. That sounds fantastic; it makes such a good way of writing unit tests against things. Yeah, I've never tried it, but a lot of times you'd, like, install a web test library and then wrap it in a test framework, or maybe even fire the server up and talk to it over the network. But this is just, yep, connect the two pieces in code and skip the network, and go. I've got a really nerdy thing that I've just started doing with it. So Datasette has plugins, and Datasette plugins can do a bunch of different things. But I realized that Datasette itself is an API; the whole point is that it gives you a JSON API that you can use to interrogate your tables and run queries and so on. And I wanted my plugins to be able to use that API, but I didn't really want them making outbound HTTP requests against themselves. So, just a couple of weeks ago, I added a feature to Datasette where plugins get an internal client: they can call a client.get() method and feed it the URL, and that's actually using HTTPX and ASGI under the hood. So the idea here is that any feature of Datasette that has an external JSON API is now also an internal API that plugins can use, and I've started building plugins against that myself. The dogsheep-beta plugin actually runs internal searches against the Datasette search API for things like adding faceted search. And datasette-graphql is a plugin I'm writing that adds a GraphQL API on top of Datasette, and that's going to be using this client as well. So you'll run a GraphQL query, which gets turned internally into a JSON query and runs over this ASGI mechanism. It's cool. I hadn't really thought about that side of HTTPX.
I've always used it just as, I'm doing some async methods, so here's a good choice for a client. Yeah, the deep integration with ASGI, I think, is really exciting. Yeah, super neat. All right, well, those are all good recommendations. Now, the final call to action: people are interested in Datasette or Dogsheep. First, I just want to throw out that you should really go watch the 25-minute or whatever it was talk that you did at PyCon AU; that'll connect a ton of things for people. I'll throw in another recommendation. Yeah, go for it. I gave a talk last week for the GitHub OCTO speaker series, which I think is the best talk I've given about Datasette and Dogsheep. It's got a lot of very recent demos, and it's linked to on my blog; it's a talk about building personal data warehouses. Yeah, very neat. Also, and this is somewhat unusual for an open source project, but I think cool, because promoting open source projects is always like, why do they take off or not: you have a Datasette weekly newsletter? Yes, I do. It's not quite weekly, so maybe I should have picked a different name, but I've got a newsletter which goes out every week or so with the latest from the Datasette ecosystem. That's datasette.substack.com. My blog, simonwillison.net, I update at least once a week with all sorts of bits and pieces. And then if you're interested in the Dogsheep stuff, I would love it if people started building these themselves. There is quite a bit of assembly required: all of the code that I've written is open source, but you have to track down your authentication tokens and run cron jobs and find somewhere to host it, so it's not easy to get up and running. But if you do get it up and running, I would love to hear from you about what kind of things you managed to do with it. And if people want to build tools themselves
01:00:00 for the ecosystem, I'd be absolutely thrilled. Yeah, it'd be awesome: if they want to build a something-to-sqlite, whatever that something is, let you know, right? Well, congratulations on this project. I think it's super neat, and thanks for coming on to share it with everyone. Awesome, thanks a lot for having me. You bet. Bye.
01:00:15 This has been another episode of Talk Python to Me. Our guest on this episode was Simon Willison, and it's been brought to you by Linode and us over at Talk Python Training. Simplify your infrastructure and cut your cloud bills in half with Linode's Linux virtual machines: develop, deploy, and scale your modern applications faster and easier. Visit talkpython.fm/linode and click the Create Free Account button to get started. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course. Or, if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle; it's like a subscription that never expires. Be sure to subscribe to the show: open your favorite podcatcher and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening; I really appreciate it. Now get out there and write some Python code.