#299: Personal search engine with Datasette and Dogsheep Transcript
00:00 In this episode, we'll be discussing two powerful tools for data reporting and exploration: Datasette and Dogsheep. Datasette helps people take data of any size or shape, analyze and explore it, and publish it as an interactive website and accompanying API. Dogsheep is a collection of tools for personal analytics using SQLite and Datasette. Imagine a unified search engine for everything personal in your life, such as Twitter, photos, Google Docs, todos, Goodreads, and more, all in one place and outside of the cloud companies. On this episode, we talk with Simon Willison, who created both of these projects. He's also one of the co-creators of Django, and we'll discuss some of the early Django history. This is Talk Python To Me, Episode 299, recorded November 18, 2020.
00:57 Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy, and keep up with the show and listen to past episodes at talkpython.fm. Follow the show on Twitter via @talkpython. This episode is brought to you by Linode and Talk Python Training. Please check out the offers during their segments. It really helps support the show. Simon, welcome to Talk Python To Me. Hi, great to be here. Hey, it's great to have you here. We're going to talk about some really interesting projects that you've been working on: one that is extremely broad and far-reaching, that anybody who touches Python, or maybe even hasn't, knows about, and then another which I feel is such a personal project. It's for everybody, but it reveals so much data about you. Absolutely, yeah. We're going to talk about Django a little bit, which you were part of creating. We're talking about Datasette, which is a way to basically put a UI on top of any database in a friendly way that lets you explore it and treat it as an API. And then Dogsheep, which has a funny name, and a nice story around the name, that allows you to basically take all of the data about you and turn it into like your own Google, right? It's your personal data warehouse, I've started calling it. Yeah, I'm fascinated by where that could go. Now, before we get to all three of those cool things, let's just start with your story. How'd you get into programming, into Python? I learned to program with my dad on a Commodore 64 back in the 80s, and then kind of moved on to early Windows and DOS, where you didn't really get to program anything beyond some QBasic.
But I got really into programming during the first dot-com boom, like web 1.0, 1999, 2000, when I was working in London for an online gaming company called gameplay.com, which was selling boxed games by mail order but also running online gaming servers for Team Fortress Classic and Half-Life and Quake and all of those kinds of things. Oh, yeah, Team Fortress and Counter-Strike. I love that stuff. That was so fun. Oh, absolutely. And so I was working for that online gaming division for a year and a half before the dot-com crash, when everyone got laid off en masse. During that time, I started at the company as the downloads editor, so I was responsible for the section of the website where you download plugins and patches and mods and so forth. And I essentially taught myself web development as part of that, and as part of some other gaming-related side projects I had. So yeah, I'm a veteran of the first round of dot-coms, back when no one had any idea what we were doing. Yeah, everyone was just making it up on the spot, right? Oh, totally. So when you were working on that web development stuff, it probably wasn't Python, I'm guessing, and certainly wasn't Django, because Django was what, 2003, 2004 when you built it? I think it was open sourced in 2005. Yeah, right. Right. Okay, so what were you programming in? Oh, so gameplay.com was running on a vast, expensive content management system called Mediasurface, which was a combination of Perl for templates, and Java and Oracle under the hood, and it was insanely expensive and very, very tricky to get things done with. And then I had side projects, which were classic PHP and MySQL, right? So I was very much the classic PHP programmer. And actually, this is where Django came from: Adrian Holovaty and I were working together at this local newspaper in Lawrence, Kansas, and we were both PHP developers.
And we saw the siren call of Python and wanted to figure out how we could build websites using this, we felt, much more exciting programming language. Oh, that's awesome. It's definitely more exciting. But I do want to ask you, it's such a different world, right? We can go to the cloud providers and pay $5 a month and probably get better infrastructure than you guys had. I mean, who knows how much that cost? Enormously more. I mean, back then with content management systems, you could spend a million dollars on your content management system, and they were terrible, because nobody knew what a good content management system looked like. This was back before things like WordPress as well. Your option was to spend
05:00 huge amounts of money on these, like, giant enterprise systems and then cross your fingers and hope that they would work for you. Yeah. So how did it feel when you'd go to work in this kind of, like, ultra-expensive, clunky environment and then go home and, even though it was PHP still, do PHP and MySQL, and kind of, I paid nothing for this, and maybe it felt as good as that crazy online system you guys were working with? What was that back and forth like? I don't think it did feel nearly as good, to be honest, because open source was still just starting up. So if you wanted to build something in PHP and MySQL, you wrote everything
05:33 you would build: you'd start with the authentication system and work on it from there. So to be honest, I feel like today open source for me has solved the whole code reuse problem, the how do we stop wasting our time rebuilding the same things over and over again. Yeah, you know, 20 years ago, you just built everything from scratch, and everything took months to do. It's really interesting, because I started out in C++, and I felt like a lot of the stuff I built was extremely low-level, not just because it was C++, but because of this sort of difference that you're talking about. And throughout the years, I've seen it just get... what used to be really hard is now just grab this library and plug it in; that other thing used to be hard, grab this library and plug it in. And it just seems like the natural consequence would be, well, we need fewer developers, because they just have to plug things together now instead of building them. And yet all we've done is decided to solve more ambitious problems. Absolutely, and in more interesting ways, right? It amazes me, the quality of software that we build these days. You know, like 10, 15 years ago, we weren't writing tests for everything; the quality of the stuff we wrote was just abysmal. And today, we've got continuous integration and continuous deployment, and it's really easy to get out there and analyze like 15 different options and pick the one that has the highest quality. Not only do you have CI/CD, you've got it for free in the cloud, on GitHub, you know, on Actions that trigger it, right? It's just such a cool place, cool time to be doing this stuff. Definitely. Yeah, yeah. So before we get to Django, how about now, what are you up to these days? So these days... I spent the last year at Stanford University doing a fellowship. There's a fellowship program called JSK, and it's journalism fellowships.
The idea is to get journalists from around the world together at Stanford for a year, thinking about problems facing journalism. And I managed to make my way in as a sort of computer scientist with journalism leanings, basically thinking about, okay, what are the open source tools that I can build that can help make data journalism more powerful and more widely accessible? And so the Datasette project was really accelerated by doing that. The fellowship finished a few months ago. The problem I'm having is that at Stanford, I was essentially paid to tinker on my own projects and go after whatever I thought was interesting, which is great, but I'm having real trouble stopping. So now I'm not getting paid, but I'm still working on my own projects, going after the things I find interesting. So I'm calling myself a consultant, and I am available for consulting opportunities, especially around this sort of set of tools that I've been building. But mainly, yeah, I'm focusing on continuing to build out this ecosystem of tooling for data journalism and related projects. Yeah, well, we're gonna get to those, and they're definitely super interesting ones. And like I said, this personal aspect, I think, could help a lot of people, not just journalists, for sure. But let's talk a little bit about Django. You mentioned that it came out of a journal... the Lawrence Journal-World or something like that? Well, yeah, it's a tiny newspaper in Lawrence, Kansas, a town I'd never even heard of. And this was back in 2002, 2003. I was a blogger; I had a blog about web development, and about 100 other people had blogs about web development. We all read each other's blogs. And Adrian Holovaty, who was a journalist and programmer, posted a job ad on his blog saying, hey, I want somebody to come and join me in Lawrence, Kansas, building websites for local newspapers.
And it coincided with my university course giving me the option to spend a year in industry, which is something that UK degrees do quite often. Nice. So I could take a year out, get a student visa, which meant I could travel, work in a different country, spend a year working, and then go back and finish my degree. And so the opportunities sort of aligned themselves. And I had huge respect for Adrian, just based on what I'd seen he'd been doing. And it felt like a pretty interesting adventure to run off to Kansas. Yeah, that's a cool adventure. And yes, so I did that. So essentially, it was a year-long, almost a paid internship. But it was in Lawrence, Kansas, at this little newspaper. And it was a fascinating place, because the family that owned the newspaper had laid fiber optic cable around a bunch of states a bunch of years beforehand, when everyone thought they were crazy, and then sold the whole lot to maybe Comcast or one of these big companies. So financially, they were very secure, which meant they could invest huge resources in that local newspaper for this little town. And so this newspaper, despite serving a town with a population of like 100,000 people, had way, way more resources than you would expect any local newspaper to have. They had
10:00 their own software engineering team who were building websites for things. And because the family owned the local cable company for the town, everyone in the town of Lawrence, Kansas had broadband internet in 2003. And that meant these websites could have like online videos and stuff, which no one else was doing, because what newspaper had an audience who could actually watch that kind of stuff? So it was a really exciting place to be inventing things around online news. And we also had a very ambitious boss, this chap called Rob Curley, who basically wanted us to act like we were the New York Times, even though we were like six nerds in a basement somewhere. We were this little local newspaper, and we had things like the local softball league, where all of the local kids were in softball teams competing against each other. And it turns out this is an amazing thing for a local newspaper to cover, because if you have good coverage of the softball league, everyone who knows a child in your town will buy your newspaper. So we went all in on kids' softball, and we ended up building a website for them; the idea was to treat them like the New York Yankees. So we had like player profiles and match reports and photo galleries. And then we sent two interns out to take 360-degree photographs of every softball pitch in town, and we had those on the website. And this was like VR or something; it was absolutely astonishing. That's so neat. I'm sure the kids felt so special as well; I bet they still have saved copies from that time. The best website we worked on, though, was the local entertainment portal for the town of Lawrence, Kansas. It was basically a website that had the events calendar, it had band profiles, it had restaurant reviews. It was sort of a super hyper-local version of Yelp crossed with a music magazine,
plus an events website, just for this one little town. And we had features like a download page where you could download MP3s of bands who were playing in town that week, because we had the MP3s for all of the local bands, again in like 2003, and a little radio widget that you could click play on. It was astonishing. I have never seen an entertainments website since that was as good as this one that we were building back in Lawrence, Kansas.
12:14 This portion of Talk Python To Me is sponsored by Linode. Simplify your infrastructure and cut your cloud bills in half with Linode's Linux virtual machines. Develop, deploy, and scale your modern applications faster and easier. Whether you're developing a personal project or managing large workloads, you deserve simple, affordable, and accessible cloud computing solutions. As listeners of Talk Python To Me, you'll get a $100 free credit. You can find all the details at talkpython.fm/linode. Linode has data centers around the world with the same simple and consistent pricing regardless of location. Just choose the data center that's nearest to your users. You also receive 24/7/365 human support with no tiers or handoffs, regardless of your plan size. You can choose shared and dedicated compute instances, or you can use your $100 in credit on S3-compatible object storage, managed Kubernetes clusters, and more. If it runs on Linux, it runs on Linode. Visit talkpython.fm/linode or click the link in your show notes, then click that create free account button to get started.
13:18 I have a history with Lawrence; I actually went to college there and got my math degree at the University of Kansas. No! I love that town. I love it. Like Mass Street, the little downtown, the brewery and all that. It was such a beautiful place, and I really enjoyed my time there. Yeah, it's a great town. I think when I was there, I'd never been anywhere else in America, basically, so it felt like a cool town, but I didn't really understand how cool it was until like 20 years later. I've now lived in America, and I've been to lots of towns, and Lawrence is special, you know; it's a very special town. It definitely is. But what I never knew was how cool this newspaper was. I mean, I was just a kid in college; I didn't read the newspaper a lot. So this is a really interesting cradle from which Django sprung. So tell us about Django and how it fits into this world. Sure. So basically, Adrian had built lawrence.com, this amazing entertainments website, in PHP. And both Adrian and I had hit that point in our PHP careers where it was straining under the size and complexity of the things that we wanted to do with it. This was before PHP 4, even, so classes were very new, and the PHP language was pretty primitive compared to what you have today. And meanwhile, Python was exploding in popularity. We were both huge fans of Mark Pilgrim's Dive Into Python, and Mark Pilgrim's blog where he talked about this. And so we decided that we really wanted to be working with Python for building these websites. But the Python web options back then were not particularly great. The main thing was Zope, and Zope was pretty good, but it didn't match the way Adrian and I thought about the web. We cared about things like designing our URLs carefully and separating our CSS from our markup, right, the sort of modern MVC framework that people almost take for granted
15:00 now. Right. But there weren't really any great options for that in Python. So we were looking at mod_python, the Apache module, as the way that we would put Python on the internet. And we were a little bit worried about it, because mod_python wasn't being used very widely, and we're like, okay, what happens if we bet the newspaper on mod_python and it turns out to be the wrong bet? Yeah. So what we'll do is we'll have a very thin abstraction layer between us and mod_python, so that if we have to swap mod_python for something else, we can do so. And that, basically, is what Django was; that was the initial seed of Django. We wanted a request and response object, a basic way of doing templating, basic URL routing. And so we built that out. We never thought of it as a framework. We called it the CMS, right; it was the CMS that ran the newspaper, and it kept on evolving these additional little bits and pieces. The Django admin was something... I went away to the South by Southwest festival for like four days, and when I came back, Adrian had written a code generator for admin websites that was churning out all of this stuff. We just kept on building these extra bits out. And then I went back to England; my year in Kansas ended. And about six months later, they open sourced Django. At the time I was working on it, it wasn't called Django; there were various ideas for names, which were truly terrible. But yeah, Jacob Kaplan-Moss had joined the team at that point, and they made the case to the newspaper that they should open source this thing. It was early days for that, right? Like, now it would be an easy sell, but back then that was weird, right? I asked them about this, and apparently one of the arguments they used is that Ruby on Rails had just come out and was exploding in popularity. And they could see that this company that released Rails was hiring people left, right, and center and was doing really well out of it.
So they went to the newspaper and said, hey, look, if we open source this, it's a great way for us to get talent and get free fixes and all of that sort of thing. And it worked. Like, you and I are sitting here talking about this small newspaper in Lawrence right now, right? I mean, we wouldn't be doing this otherwise. That's true. But the argument that worked is they said to the newspaper owners: we've been building on open source, right? The newspaper runs Linux, and we run Apache and Perl and Python, and we've used all of this open source stuff. This is a way of giving back. And that's the argument that apparently resonated with them. They said, oh, that completely makes sense; we can give back in that way. And yeah, so Django was open sourced. That was 15 years ago, I think, and it's just been growing ever since. Yeah. Did you predict this? You look around now, it's just ubiquitous. Does it blow you away, what's happened? It completely blows me away. The thing that really amuses me is that I keep on seeing people talking about Django as the boring option. Like, Django and Rails, yeah, those are the safe, boring options for things. And I actually saw someone on Twitter the other day say, well, nobody ever got fired for choosing Django. And I direct messaged Adrian and Jacob about that quote; I'm like, can you believe this, that we are now the "nobody ever got fired for choosing IBM" option? Exactly, exactly. I think there's definitely some truth to that. Quite interesting. It seems like it's got a ton of momentum, and it's really starting to embrace the async and await world, which is so lovely. Oh, yeah. So a lot of my projects these days don't use Django, but they do use ASGI. I really feel that the ASGI ecosystem that's growing up is so exciting. And Django is getting better at ASGI itself.
So I'm going to be able to merge a bunch of my ASGI projects back into my Django projects pretty soon, which is super exciting. Yeah, that's exciting. I'm super excited about FastAPI, and it's one of those that fits really well in that world. Yeah, I mean, I haven't really done FastAPI, but I love Starlette, which is the framework it's built on. Yeah, super cool. All right. Well, congratulations on Django, to you and everyone who worked on it. I do think it's really interesting if you look at the timing: Ruby on Rails came out of 37signals, now Basecamp, and was extracted from what they were using inside, right; they built it for themselves and extracted it. You guys built it at the newspaper and said, this thing we can pull out and make into something else. I think it's really interesting that it was polished and proven in a real place, right? The way I see it is, Rails was extracted from Basecamp, right; they built Basecamp, then they pulled the framework out of it. With Django, the goal was always lawrence.com. We had this entertainments website in PHP and MySQL, and we knew that we wanted the thing we were building to power that. So with Django, there was an existing target, and we evolved the framework until it could run a very high-quality newspaper entertainments listings website. So it's almost like one was extracted and one was evolved in the direction of supporting this one site. Yeah, it's neat to see; I think they both came out quite successful from those experiences. All right, let's start at the foundation of this recent work you've been doing. And in some sense, it's a natural progression, right? The journalism side of things is where the origin came from. So tell us about Datasette. So Datasette is... on its website, I call it an open source multi-tool for exploring and publishing data. Basically,
20:00 it's a web application which you can point at a SQLite relational database, and it gives you pages where you can browse the tables and run queries. It lets you type custom SQL queries and run them against that database, lets you use custom templates for how things render, and lets you get everything back out as JSON or CSV, so you can use it for API integrations. And it lets you publish the whole thing on the internet really easily. So it's a lot. Yeah. And one of the biggest challenges I've had is, how do I turn this into a bite-sized description that really helps people understand what the software does? I'm at a point now where if I can get somebody on a video chat, I can do a 15-minute demo, and at the end of it, they come out going, I totally get this, this is amazing. But that's not a way of explaining software that scales particularly well. Yeah, well, let me see if I can, with my limited exposure to it and knowing somewhat where we're going. You have this data source that's pretty ubiquitous, or can become ubiquitous in terms of like some sort of ETL, with SQLite, right? SQLite is everywhere. What's beautiful about it is there's no "please set up the server and make it not run as root and then put it on your network." Right. The magic of SQLite... it boasts that it's the most widely distributed database in the world, which it is, because it runs on every phone. My watch has SQLite tracking my stats. Every iPhone app, every Android app, every laptop, they're all running it. Yeah, your phone, that's crazy. It's a file format, right? A SQLite database is a single .db binary file on disk, which, like you said, makes it so convenient, because I don't have to ask a sysadmin to set me up a Postgres schema or anything like that. I just create a file on my laptop, and that's my new database. Yeah. And it's even built into Python, right? It just comes with Python's standard library. Yeah, exactly.
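The "a database is just one file, and Python ships with it" idea can be seen in a few lines. This is a minimal sketch with a made-up table; it uses an in-memory database so it's self-contained, but passing a path instead would give you a real .db file you could point Datasette at.

```python
import sqlite3

# SQLite needs no server: a database is a single file on disk, and the
# sqlite3 module ships with Python's standard library. ":memory:" keeps
# this demo self-contained; pass a path like "photos.db" to get a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, city TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO photos (city, label) VALUES (?, ?)",
    [("San Francisco", "pelican"), ("Lawrence", "dog")],
)
labels = [row[0] for row in conn.execute("SELECT label FROM photos ORDER BY id")]
print(labels)
```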
So that's super cool. And it's great that we have this data format where we either have data in there already, or you could do like an API call and then jam the data in there, right, something to get it into that format, which is great. But you could explore that with like Beekeeper Studio or some data visualization tool or SQL management studio. That doesn't work for journalists, though; that doesn't work for getting it on the internet; that doesn't do the transformations. In some sense, I kind of see it almost as a really advanced, web-based data IDE, but user-friendly. Yeah, but the emphasis is absolutely on publishing, on getting it online, and then on being web-native. Like, everything in Datasette can be got as JSON as well as HTML; it can return CSV to you. You pass the SQL query in a GET request, in a query string, so you can bookmark queries, all of that kind of stuff. Yeah, that's really the key idea: how do you take relational databases and make them as web-native as possible, and as cheap and inexpensive to host and to run as possible? So you can take any data that fits in a SQLite database, which is almost everything, and stick it online in a way that people can both explore it and start integrating with it as well. And another key idea in Datasette is that it has a plugin system. I've actually written over 50 plugins for it now that add all sorts of different things: different output formats, so you can get your database out as an Atom feed or an iCal feed; I've got visualization plugins that plot the data on a map or give you charts and line graphs and so on. I just this morning released an authentication plugin that supports the IndieAuth authentication mechanism, so you can use IndieAuth login to password-protect your data. All of these different things.
And honestly, having a plugin system is so much fun, because I can come up with a terrible idea for a feature, and I can build it as a plugin. And it doesn't matter if it's just an awful idea that nobody should ever have implemented, because it's not causing any harm to the core project. It's a super interesting idea. I also think it might be a way to encourage others to contribute, because they don't have to understand the whole system and be afraid of breaking it; they just have to understand, here's the three functions I implement to make this happen. True. When people contribute to open source, that's more work for me, because I have to review that pull request and figure it out and so on. But if you write a plugin, you can release that plugin to the Python Package Index, and I don't even have to know about it. I can wake up one day, and my software has new features, because somebody built a plugin and shipped it, which I think
24:24 is fantastic. And they don't have to go through you as a gatekeeper. Even if you might be super friendly and whatnot, they just don't have to have that interaction, right, which is pretty cool. Yeah. So one of the things that's interesting about Datasette is the way you get your stuff online: you basically just run Datasette against a SQLite database, and now you have this website that lets you explore it like you described. So you say "datasette", space, path to SQLite file, and now you have a web app running, right? So you type "datasette" and the name of the file, hit enter, and it runs on your local laptop, and you can start browsing and exploring it. But then, if you want to put it online, I've been building out integrations with a bunch of different hosting
25:00 providers, where you can, from the command line, type "datasette", space, "publish", space, then pick your provider, say Google Cloud Run: datasette publish cloudrun, name of database, enter, and it'll upload that database to the internet, wrap it in the application itself, give it a URL, and start serving. So it's like a one-liner for publishing data online with a URL that other people can start using. And that space is enabled by all of these fascinating serverless hosting providers, and that was actually one of the original inspirations.
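The workflow just dictated looks roughly like this as shell commands. The database name is hypothetical, and this assumes the datasette package is installed (for example via pip) and that you have Google Cloud credentials configured for the publish step:

```shell
# Serve a local browsing and query UI for a SQLite file:
datasette photos.db

# Publish the same database to a serverless host, e.g. Google Cloud Run:
datasette publish cloudrun photos.db --service=my-photos
```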
30:00 So I've got a query for that, and now I've bookmarked it, and I've got an application, which is the "here are the GitHub issues that you should go and look at" application. So an entire application ends up being a URL that you can bookmark. That's really interesting. That's, again, a very web-native way of thinking about the problem domain. You know, you were talking about starting out working in the first dot-com boom, and one of the things that was all the rage back then were mashups. Do you remember like Yahoo mashups and all that kind of stuff? Absolutely. Yep.
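That "an application is just a bookmarkable URL" idea can be sketched with nothing but the standard library: Datasette accepts the SQL itself as a query-string parameter, and appending .json (or .csv) to the database path selects the output format. The host and database name below are invented for illustration:

```python
from urllib.parse import urlencode

# A Datasette "report" is just a URL: the SQL lives in the query string,
# so sharing the whole application means sharing a bookmark.
# Host and database name here are made up.
base = "https://example.datasette.example.com/issues"
sql = "select repo, count(*) from issues group by repo order by count(*) desc"
url = base + ".json?" + urlencode({"sql": sql})  # .csv would return CSV instead
print(url)
```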
30:32 Talk Python To Me is partially supported by our training courses. Python's async and parallel programming support is highly underrated. Have you shied away from the amazing new async and await keywords because you've heard it's way too complicated, or that it's just not worth the effort? With the right workloads, a 100-times speedup is totally possible with minor changes to your code. But you do need to understand the internals, and that's why our course, Async Techniques and Examples in Python, shows you how to write async code successfully, as well as how it works. Get started with async and await today with our course at talkpython.fm/async.
31:10 So it sounds to me like what you can almost build here is a super interesting mashup. You can extract the data from GitHub, you can extract it from over here, and you put it together in this new form, and now you've got this API on top of it, right? A massive realization I've had working on this stuff is that lots of websites have APIs, and APIs sometimes have a lot of features; like, the GitHub API can do some pretty powerful stuff. But I can always think of something the API can't do, that they didn't predict I'd want. If I can get all of my data out of that API and into a SQLite database, then there are no limits, and any question I can think to ask, I can apply against that thing. So basically, the only thing I use APIs for now is to get everything
31:54 out of there: download it, sync it, get everything into a database, and now I can start asking questions of my data and building things. Yeah, I think where it gets interesting, as we'll see when we get on to the final Dogsheep side of things, is: it's great that GitHub has an API, it's great that Twitter has an API, that Gmail has IMAP, and all these different things have rich, deep ways to talk to them. But if you want to talk to all of it at the same time and say, I want to know what I've tweeted, emailed, or whatever else I've done about something, you don't want to try to build that integration of all those APIs; it gets super gnarly. But if you get it into some kind of SQLite database, all of a sudden it becomes an option, right? It's this personal data warehouse idea. And it's not just personal data. As a company, if you're a company with 50 different Git repositories, which lots of companies have, getting all of that metadata from all 50 of those repos into one place, and I've got tooling that will let you do exactly that, is crazy useful. It lets you query across all of your issues and all of your comments, and it lets you talk about, like, here is what our software team as a company has accomplished, that kind of stuff, right, which is still super hard, even if you go to GitHub to do that. So I want to talk through a few examples that maybe we could mention really quickly. You gave a talk that covered both Datasette and Dogsheep at PyCon AU online this year, right? This year, last year? Yes, I did. It was this year. Yeah. So in there, you talked about all sorts of interesting things, so I want to cover some of the examples there, because they really made this stuff connect for me. Oh, okay. Yeah. So the first one was, you said, let's just search for random SQLite databases on my Mac.
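As a toy sketch of that "query across all your repos" pattern: the rows below are hardcoded stand-ins for what you would actually pull from the GitHub API (for example with Simon's github-to-sqlite tool), and the repo names are invented. The point is that once everything is in one SQLite database, a single SQL query answers a question no individual API endpoint does.

```python
import sqlite3

# The data-warehouse pattern: sync everything the API will give you into
# SQLite once, then answer arbitrary questions with plain SQL.
# These rows are hardcoded stand-ins for real GitHub API responses.
sample_issues = [
    {"repo": "acme/web", "title": "Fix login redirect", "state": "open"},
    {"repo": "acme/api", "title": "Add rate limiting", "state": "open"},
    {"repo": "acme/web", "title": "Upgrade Django", "state": "closed"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (repo TEXT, title TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO issues VALUES (:repo, :title, :state)", sample_issues
)

# One query across every repo at once.
open_per_repo = dict(
    conn.execute(
        "SELECT repo, COUNT(*) FROM issues WHERE state = 'open' GROUP BY repo"
    )
)
print(open_per_repo)
```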
You said, oh look, here we randomly found one in the Photos library, let's look at that, right? You just pointed Datasette at it. How did you search for it? Did you do like ".db" or something like that in Spotlight? Yeah, there's a Spotlight command that you can run which will show you every SQLite database on your Mac, and it's fascinating. Oh my goodness, the number of weird little databases that you already have: your Firefox history, your Chrome history is on there, Evernote uses SQLite. There were quite a few databases I found that I still don't quite know what they are, but they've got things like places that I've been over the past couple of years, just sat there in a SQLite database somewhere, which is super interesting. Yeah, for sure. So in this demo, you used that command to find the SQLite database backing your Photos library, and then said, well, let's just pull that up and poke around, right? Tell us about that. Yeah. So photos, this was always one of my sort of white whales: I want my photos data. I've taken 40,000 photos; they've got timestamps and latitude and longitude and all of this. How can I get that metadata into a SQLite database so I can run queries against my life in photos? And I've tried getting this to work with things like Google Photos in the past. Google Photos doesn't give you access to latitudes and longitudes, I think for privacy reasons. But anyway, the big realization I had was that Apple Photos, on your phone and on your laptop, uses SQLite. There's a little secret database where they've actually already done it for you; you've just got to go find it. It's probably huge. Yeah, 800 megabytes of data in one SQLite database file.
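For listeners not on a Mac, where Spotlight's `mdfind` can do this kind of search, a portable way to hunt for SQLite files is to check for the 16-byte magic header that every SQLite database starts with. This is a sketch of the idea, not the command Simon actually used:

```python
import os
import sqlite3
import tempfile

MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite database file

def find_sqlite_files(root):
    """Walk a directory tree, yielding paths that start with the SQLite magic header."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    if f.read(16) == MAGIC:
                        yield path
            except OSError:
                continue  # unreadable file: skip it

# Demo: create a throwaway database, then find it again by its header
with tempfile.TemporaryDirectory() as root:
    conn = sqlite3.connect(os.path.join(root, "demo.db"))
    conn.execute("CREATE TABLE t (x)")
    conn.commit()
    conn.close()
    found = list(find_sqlite_files(root))
    print(found)
```

Pointing this at your home directory turns up the same surprising crop of hidden databases Simon describes.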
35:00 It's not very easy to query, because it's not designed for people to use it from outside, but if you jump through a bunch of hoops you can get that data back out again. And then you start finding some really interesting things. So obviously they've got a record for each of your photos with when it was taken and the latitude and longitude. They've reverse-geocoded those locations, so they can actually tell you in the database: this was in San Francisco, this was in the Mission District, those kinds of things. But then the coolest thing is Apple use machine learning to identify what your photos are, is it a dog or a cat or a pelican, because you can go to your Photos app and search for that. Like, show me cars, and somehow cars come up. Exactly. But the beautiful thing about that is it turns out they run the models on your laptop. Where Google and Facebook will upload your photos to the internet and put them in a data center somewhere, Apple downloads these big binary machine learning weights files onto your device, they actually run them on your phone overnight, and they use those to identify what's in your photos. So from a privacy point of view this is perfect, because you're not uploading your photos somewhere for some creepy machine learning model to run against; it's all happening on devices that you control. And the results of that go into a SQLite database, so I can get them out and into Datasette. So I have an example query that shows me photographs I've taken of pelicans, based on Apple's machine learning labeling my photos of pelicans. And I can visualize those on a map, because they've got latitudes and longitudes with them, and so on.
And then the really fun thing is, there were various clues in the Photos app that they're doing quality evaluations. Like, if they show you all of your photos for a month, they'll pick a good photo to show as the sort of cover of that album or whatever. That's machine learning as well, it's running on your device, and it's based on these scores. And the scores are sat there in the database, with names like ZOVERALLAESTHETICSCORE and ZPLEASANTCAMERATILTSCORE and ZHARMONIOUSCOLORSCORE. So you can say things like, show me my pelican photograph with the most harmonious colors, with the most pleasant camera tilt, and just get things back that way. You could even set up like a walking tour: show me where I've taken aesthetic photos of pelicans, starting with the best one, and then the next, and then the next. That is such a good idea. Yeah, and that's just a SQL query: order by aesthetic score, descending. And there's also facial recognition, which again is trained by you and runs on your device, so it's the least creepy version of it. So I've run a SQL query saying, show me photographs of my wife Natalie and my friend Andrew, and show me the one with the most pleasant camera tilt that was taken outdoors. And this stuff all just works; it's baffling, and really super fun. So one piece that we probably should connect for folks: if they try to follow along, they find that SQLite database and then they throw Datasette at it, they're going to end up with binary blobs where this data lives, right? Oh, totally. Yeah, Apple's SQLite format uses binary plists in some of the columns. And also it's actually quite hard to even open it, because it'll crash and tell you that you don't have their custom something-extension running.
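A toy version of the "order my photos by aesthetic score" query. The table and column names below are invented stand-ins modeled on the scores described above, not Apple's actual schema:

```python
import sqlite3

# Illustrative stand-in for the kind of data dogsheep-photos extracts;
# the table and column names here are hypothetical, not Apple's schema.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE photos (
        uuid TEXT, label TEXT,
        overall_aesthetic_score REAL,
        pleasant_camera_tilt_score REAL
    )
""")
db.executemany("INSERT INTO photos VALUES (?, ?, ?, ?)", [
    ("a1", "pelican", 0.71, 0.40),
    ("a2", "pelican", 0.93, 0.65),
    ("a3", "dog",     0.88, 0.90),
])

# "Show me my pelican photos, best-looking first" is just an ORDER BY
best = db.execute("""
    SELECT uuid FROM photos
    WHERE label = 'pelican'
    ORDER BY overall_aesthetic_score DESC
""").fetchall()
print([row[0] for row in best])  # ['a2', 'a1']
```

The walking-tour idea is the same query with the latitude and longitude columns added to the SELECT.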
So the way I've addressed that is, I found this software on GitHub called osxphotos, which is someone's open source library for talking to the macOS Photos SQLite database and working around some of these weird issues. And then I built my own tool on top of that called dogsheep-photos, which pulls out the metadata of your photos into a nicer format, including getting the machine learning labels and stuff. But it's also got a tool for uploading the photo files themselves to an S3 bucket, because if you want to really take control of your photos, you need them to have URLs so that you can embed them in pages and link to them and so on. So I've got a whole toolchain for uploading all of my photographs to S3, extracting the metadata into a separate database, and then publishing that database with Datasette somewhere so that I can run queries against it. Yeah, super cool. All right, well, I think that's probably a good transition over to Dogsheep. So we have these different sources of data; it's nearly unbounded at this point, right, where data about you might live on the internet. And there was this firebrand of a character that you mentioned, around Wolfram Alpha. Yes, he does some crazy, crazy, weird stuff, but he also had this idea that inspired you to try to bring those sources together and build on top of Datasette. So maybe start with that story, and we can tell people what Dogsheep is. Okay. So there's this chap called Stephen Wolfram, who created Mathematica and Wolfram Alpha. He's the CEO of a 1,000-person company, and it turns out he's a remote CEO, runs the entire company from home, which is kind of fascinating. Yeah, and he has been for a while, right, like before it was cool. Absolutely, he's been doing the COVID thing for years and years and years.
And so in February of last year, he published an essay called Seeking the Productive Life: Some Details of My Personal Infrastructure. And this thing, I would thoroughly recommend everyone take a look, just to marvel at quite how long it is. He has spent 40 years
40:00 optimizing every single inch of his personal and professional life, and he wrote about all of it in one place. Like, he scanned every document he's worked on since he was 11 years old and got them OCRed. He's got a green screen in his basement for giving remote talks. He had a standing desk, but his heart rate monitor showed him that walking outside is better for his heart, so he rigged up a sort of little tray mechanism so he could use his laptop while walking in the woods. It's just astonishing. And I read through this essay thinking, this is next level stuff. But there was this one little bit in it that caught my eye: he talks about how he has a personal search engine, something he calls his metasearcher. So he's got his own private search engine that searches everything: every email he's sent, every paper he's written, everyone he knows who might know things about a topic, everything he's read, all the files on his machine, all in one place. And I thought, well, that's something I'd like. I would love to have one place with as much of my personal data from different sources as possible, where I can query it. Like, I know I was talking to this person, but was it on iMessage? Was it in email? Was it over Slack? Where the heck did I tell them this thing that I need to get back? Absolutely. And combine that with, you know, your bookmarks and your GitHub issues and your messages and all of these
41:21 things? Oh, yeah. I felt like there was something interesting there. And then, honestly, the best idea I've had in all of this: I thought, well, it's inspired by Stephen Wolfram, but it's not as good as what he's done, so if he's Wolfram, maybe I should be doing something called Dogsheep, because dogs and sheep are the less alpha versions of both animals. And then I thought, well, he's got a search engine called Wolfram Alpha, so I could build a search engine called Dogsheep Beta. And that joke stuck in my head, and I enjoyed it so much that I've spent like 12 months building this. The pun was so good it had to exist.
41:57 Yeah, so this entire project is basically pun-driven development; it's driven out of this pun that I came up with a year and a half ago. And so the idea with Dogsheep: it's basically an umbrella project for a whole bunch of tools around this idea of personal analytics. Like, what data is there about me in the world? How can I get that data out of lots of different sources and into SQLite databases? Because once it's in SQLite, I can run Datasette on top of it, and now I've got this personal data warehouse of my data from all of these different sources. And then on top of that, I can build a search engine, which I've now built, which ties all of this stuff together again. So I've been tinkering around with all sorts of tools in this category for just over a year now. Right now I've got data in my personal Dogsheep from Twitter: I've got all of my tweets, but also all the tweets I've favorited. I've favorited like 30,000 tweets, and I can search those and see who I've favorited the most, and so on. I've got all of my photos, as we discussed earlier. I've got my HealthKit data from my Apple Watch, which means I can tell you my heart rate going back three years or something. How do you get it off the Apple Watch? So again, Apple are really good for this kind of stuff. They don't upload it somewhere; they keep it on your phone and on your watch. But there's an export button in the Health app on the iPhone, a button that says export my data. It actually creates a zip file full of XML on your phone, and then gives you the option to AirDrop it to your laptop. So I do that, I get a 300 megabyte zip file full of XML, and then I wrote a script which reads that XML into SQLite. So I've got all of that data.
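The XML-to-SQLite script Simon mentions can be sketched roughly like this. The XML fragment below is a simplified, hypothetical stand-in shaped like Apple Health's export, not a real export file:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Simplified fragment in the general shape of Apple Health's export.xml;
# the real file is hundreds of megabytes of <Record> elements.
xml_data = """
<HealthData>
  <Record type="HKQuantityTypeIdentifierHeartRate"
          startDate="2020-11-01 08:00:00" value="61" unit="count/min"/>
  <Record type="HKQuantityTypeIdentifierHeartRate"
          startDate="2020-11-01 12:00:00" value="88" unit="count/min"/>
</HealthData>
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (type TEXT, start TEXT, value REAL, unit TEXT)")
for rec in ET.fromstring(xml_data).iter("Record"):
    db.execute(
        "INSERT INTO records VALUES (?, ?, ?, ?)",
        (rec.get("type"), rec.get("startDate"),
         float(rec.get("value")), rec.get("unit")),
    )

# Once it's in SQLite, "my average heart rate" is one query away
avg = db.execute(
    "SELECT avg(value) FROM records WHERE type LIKE '%HeartRate'"
).fetchone()[0]
print(avg)  # 74.5
```

The real healthkit-to-sqlite tool does essentially this at scale, plus splitting workouts and locations into their own tables.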
The best thing about that is, any time you record an outdoor workout, like if you go for a run, or even if you go for a walk, it records your latitude and longitude every few seconds, and that's available in the data. So I've got, like, within-10-meter maps of every walk I've taken for the past year, which is super fun. I mentioned GitHub: I've got all of the data from all of my GitHub projects, and I've got over 400 repositories now, so that's actually quite a lot of stuff. I use Foursquare Swarm and check into places, so I've got 4,000 Swarm check-ins. Or you have Google Takeout, which is insane, right? I've only done a little bit of work with Google Takeout; that's one of the least developed tools. But yeah, there you can get Google's version of your location history, which for me is like 250,000 latitude-longitude points. I don't even know where they got that stuff from. Yeah, I recently did a Google Takeout, and I think zipped it was 61 gigabytes or something. It's a lot of data. It's a lot of data, and a lot of that is photographs and document files and stuff, but there's a ton of very detailed JSON data about you in those exports as well. It's always fun to look for the ad targeting stuff, because you'll find out that you have been assigned the role of, like, middle-aged tech executive or something, and you can see what they're targeting you based on. I've got Evernote; I've got like 600 notes from Evernote. My Goodreads data on books I've read, which is synced from my Kindle. Oh, and then the most fun one is I've got a copy of my genome.
45:00 Because I did 23andMe a few years ago, and I found out they've got an export button, and you get back a CSV file of 600,000 gene pairs from your genome, which I can run SQL queries against. So I have a saved query that tells me what color my eyes are, based on interrogating my own copy of my genome, which delights me. That is pretty amazing. That's just insane. Yeah. So that's a ton of data, right? This is a lot of stuff, and I'm barely even scratching the surface of what could be pulled into this. Right, and these are all like plugins, or separate tools. So wherever the data lives, if there's an API or web scraping, you can have it, right? Yeah, the tools are all called things like twitter-to-sqlite, or github-to-sqlite, or I think I've got genome-to-sqlite somewhere; that's just the naming convention that I use. But the core idea is you knock out a quick Python command line tool which either takes a zip file you got from somewhere, or hits an API with API credentials, and it slurps down as much data as it can and puts it in a SQLite database. And that's all it does. Then it's up to you to run Datasette against it and start doing the fun querying. So that's cool. So I think maybe it would be good to connect this to an example again. In your PyCon AU talk, you talk about your dog, and figuring out, using Twitter, how to graph the weight of your dog over time, and a map of where your dog likes to go on walks. Absolutely. So my dog is Cleo. First of all, tell me how dog and Twitter go together. Not Dogsheep, but just, like, a dog. So Cleo has a Twitter account. Because, I mean, to be honest, most dogs have Instagram these days, but Cleo's a bit more old fashioned, so Cleo's on Twitter. She's more on the tech side, less on the young influencer side. Yeah, got it. She's @cleopaws, C-L-E-O-P-A-W-S, on Twitter, and she tweets about things: she tweets selfies and things that she likes, and so on.
And every time we go to the vet, she tweets a selfie of herself at the vet, and they weigh her, and she tweets how much she weighs: I weigh 49.3lb, I grew more dog, and there's a selfie. And one of the things I've done with Dogsheep is I've imported all of her tweets, and so now I can run a SQL query that just pulls up the tweets containing "lb" for pounds and "weigh". So I can pull back just the tweets where she said how much she weighs. And then I've got a regular expression plugin for Datasette that adds a custom SQL function that can do regular expressions, because that's a useful thing to have. So I can pull out her weight with a regular expression into a separate column. And then there's a charting plugin, so I can chart date against weight and see a chart of how much she weighs, based on her self-reported weight in the selfies that she's posted on Twitter. Which is clearly a killer app, right?
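The weigh-in extraction is easy to sketch in plain Python; in Datasette it would be a custom regexp() SQL function doing the same thing. The tweets here are invented examples in the style described:

```python
import re

# Invented tweets in the style of @cleopaws' vet visits
tweets = [
    ("2018-03-01", "Just got weighed at the vet. I weigh 48.9lb. I grew more dog!"),
    ("2019-07-12", "Vet day again! 49.3lb of good dog right here."),
    ("2020-02-20", "No weigh-in today, just treats."),
]

# Pull "<number>lb" out of each tweet so (date, weight) pairs can feed a chart
weights = [
    (date, float(m.group(1)))
    for date, text in tweets
    if (m := re.search(r"(\d+(?:\.\d+)?)\s*lb", text))
]
print(weights)  # [('2018-03-01', 48.9), ('2019-07-12', 49.3)]
```

Feed those pairs to any charting tool and you have the dog-weight graph, no manual record-keeping required.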
47:41 It's absolutely frivolous, but what I think it shows that's so interesting is what you can do once you put these pieces of data together. All of a sudden you can run arbitrary queries and apply some special filtering to data sources you never expected to combine, and you have a graph you never knew you were keeping, about the weight of your dog or some other thing you're interested in. And this is just a SQL query, and the SQL query goes in a bookmark. So the entire application, show me a chart of my dog's weight based on her self-reported tweets, is a bookmark.
48:15 And it's actually super useful. Super cool. Now, how do you get a map of where your dog likes to go? So for that one, I mentioned I use Foursquare Swarm and I check in to places. Every time the dog's with me, I use the wolf emoji in the check-in message, because it looks a little bit like her. And it turns out SQLite does emoji these days, so you can run a SQL query where you look for LIKE percent wolf-emoji percent. Right, it's just a character that's unique. Exactly. So then you get back the check-ins where my dog was there, and because I've got latitude and longitude in that query, I can put them on a map. So I've got a map of places my dog likes to go, based on the wolf emoji in my Swarm check-ins. And again, it's just a bookmark; each of these custom applications is a bookmark. I think those few examples really bring home the unexpected power of what you kind of unleash when you get at this stuff. Completely. There's a project I should mention that relates to this. I've been writing a lot of these tools that create SQLite databases, right? All of these Dogsheep tools pull something from somewhere and turn it into SQLite. And the way I do that is using a Python library that I've been building called sqlite-utils, sqlite hyphen utils. sqlite-utils is a bunch of utility functions that make it really productive to create new SQLite databases. The core idea is, say you've got an array of JSON objects: you can call .insert() with those JSON objects, and it will create a SQLite table with the schema that's needed to match them. Wow. So it just looks and says, these are the top-level keys, so we're going to make those the columns, something like that.
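The wolf-emoji trick really is just a LIKE query, since an emoji is only another Unicode character to SQLite. A sketch with made-up check-ins:

```python
import sqlite3

# Hypothetical Swarm-style check-ins; the wolf emoji tags dog walks
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE checkins (venue TEXT, lat REAL, lon REAL, shout TEXT)")
db.executemany("INSERT INTO checkins VALUES (?, ?, ?, ?)", [
    ("Ocean Beach", 37.76, -122.51, "Morning walk 🐺"),
    ("Blue Bottle", 37.78, -122.41, "Coffee run"),
    ("Fort Funston", 37.72, -122.50, "🐺 loves it here"),
])

# An emoji is just another character, so LIKE works fine
dog_spots = db.execute(
    "SELECT venue, lat, lon FROM checkins WHERE shout LIKE '%🐺%'"
).fetchall()
print([row[0] for row in dog_spots])  # ['Ocean Beach', 'Fort Funston']
```

The latitude and longitude columns in the result are exactly what a Datasette map plugin needs to plot the walks.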
These ones are integers, these ones are floats, this one's text, and it creates the table automatically. Which means if you're working with an API that's well designed, like the GitHub API, which returns lists of JSON objects, it's a Python one-liner to turn those into a SQLite table with the correct columns. And you can say, oh, and make the ID
50:00 column the primary key, and set these up as foreign keys, and those kinds of things. And that's been crucial, because it means that I didn't have to come up with a database schema for Swarm and Twitter and GitHub and Apple Photos and all of that; I just had to get the data into a list of objects, and the schema was created for me. Oh, it's got a little bit of a NoSQL feel, in SQL. Exactly. And SQLite, it turns out, can deal with JSON as well, so you can stick a JSON document in a SQLite column, and then there are SQLite functions for pulling out individual keys and that kind of thing. But yeah, it means it's all super productive. And sqlite-utils also comes with a command line tool, so for simple things you don't have to write any Python at all. You can, like, wget a JSON blob, pipe it into the sqlite-utils command line tool, and tell it to insert it into a table, and it will create a database file on disk and populate the table. And then you can do stuff like configure full-text search against it, or set up extra foreign keys, or whatever it is. That's super neat. And so all of these different integrations that you built, it sounds to me like they could be useful on their own for people listening who go, you know, I'd really love to get my Foursquare Swarm data as a SQLite database, and they don't necessarily want to use Dogsheep; these plugin pieces might be cool building blocks. If you can get yourself an OAuth token for your Swarm account, which I've got an online tool that will do for you, you pip install swarm-to-sqlite, then you type swarm-to-sqlite swarm.db --token=that and hit enter, and that's it. It's like a one-liner on the terminal, and that will give you a SQLite database with all of your Swarm check-ins in it. Wow. I'm pretty fascinated by the idea that I could go to one place and just search everything about me. Absolutely.
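A minimal sketch of the auto-schema idea behind sqlite-utils' .insert(): infer column types from the first record's values. The real library handles far more (primary keys, upserts, nested data, type coercion); this shows only the core trick, and the function name here is invented:

```python
import sqlite3

def insert_all(db, table, rows):
    """Create a table whose schema is inferred from a list of dicts, then insert them."""
    types = {int: "INTEGER", float: "REAL"}
    cols = ", ".join(
        f"[{k}] {types.get(type(v), 'TEXT')}" for k, v in rows[0].items()
    )
    db.execute(f"CREATE TABLE [{table}] ({cols})")
    placeholders = ", ".join(f":{k}" for k in rows[0])
    db.executemany(f"INSERT INTO [{table}] VALUES ({placeholders})", rows)

db = sqlite3.connect(":memory:")
insert_all(db, "repos", [
    {"id": 1, "name": "datasette", "stars": 4500},
    {"id": 2, "name": "sqlite-utils", "stars": 800},
])
schema = db.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'repos'"
).fetchone()[0]
print(schema)
```

With this shape of helper, ingesting any well-behaved JSON API response is a one-liner, which is exactly why none of the Dogsheep tools needed hand-written schemas.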
So that feature, because I had the stuff in dozens of different databases and tables, I actually built the Dogsheep Beta search engine just a couple of months ago. And basically the way that works is you give it SQL queries to run against all of your other tables. So you say, for the GitHub one, select title and created date and so forth from issues; for the Twitter one, select this, select that. And you run a script and it will run those SQL queries against all of your, like, 20 different databases, load the results into a new database table, and set up full-text search on it. So it's kind of like using something like Elasticsearch, where you pipe your data from lots of different sources into one index. In this case, the index is just another SQLite table, and that gives you a faceted search interface on top that lets you search across all of the different things you've ingested into it. Right, if you build an index it'll be nice and fast, and then you just say, well, you've got to go back to these five tables and get these various details, right? Yeah, and that's actually part of the tool: you can set up a SQL query for each type of content that says, to display it, run this SQL query to grab these details, stick them in this Jinja template, and put that on the page. So when you display the results, it can use all of the rich data that's coming back, but the actual index underneath it is basically title, content, and date, and that's it. Yeah. Wow. Okay, pretty interesting. What about email? I don't see that in this list, like a connector for email; there's Google Takeout, but that's not exactly the same. I will admit, I have not done email yet, because I am terrible at email.
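The Dogsheep Beta indexing scheme described above, run one SQL query per source table and copy the results into a single full-text index, can be sketched with SQLite's built-in FTS5 (assuming your SQLite build includes it):

```python
import sqlite3

# Two toy "source" tables standing in for twitter.db, github.db, etc.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets (text TEXT, created TEXT)")
db.execute("CREATE TABLE issues (title TEXT, created TEXT)")
db.execute("INSERT INTO tweets VALUES ('Walked the dog on the beach', '2020-01-05')")
db.execute("INSERT INTO issues VALUES ('Map plugin crashes on load', '2020-02-10')")

# One FTS5 index, populated by a SQL query per source table
db.execute("CREATE VIRTUAL TABLE search USING fts5(type, title, date)")
db.execute("INSERT INTO search SELECT 'tweet', text, created FROM tweets")
db.execute("INSERT INTO search SELECT 'issue', title, created FROM issues")

# One query now searches everything at once
hits = db.execute(
    "SELECT type, title FROM search WHERE search MATCH 'dog'"
).fetchall()
print(hits)  # [('tweet', 'Walked the dog on the beach')]
```

The type column is what lets the display layer route each hit back to its source table for the rich per-type rendering Simon describes.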
And I'm almost a little terrified what will happen if I start running SQL queries against 10 years of unread email. But I'm sort of transitioning into doing freelancing and consulting, and one of the most important traits for consultants is that they're on top of their email. So I think the next task I have to take on is getting good at email, and then ingesting it into Dogsheep. Yeah. I mean, it's kind of like this open plugin type of architecture, so someone else could create it as well, if they're listening and they just want it to exist, right? Absolutely. I mean, honestly, the email standards are good enough now that writing a tool that turns your email archive into a SQLite database is pretty trivial. Apple's Mail.app uses SQLite anyway, so I've actually done a little bit of poking around just looking at that database. Okay, interesting. You could probably point it at an Outlook PST file, if the world has cursed you and you have to work in Outlook, or just use IMAP or POP3 or something like that, right? There's a very solid Python library for reading Outlook mailboxes, so you could totally use that. It would just be a bit of glue code on top to turn that into a list of JSON objects and pipe them into the sqlite-utils library. Yeah. All right, super cool, Simon. This is like a bunch of levels building on top of each other. And also, thank you for the history of Django; it was really cool to hear how you experienced it coming into existence. That was neat. Oh, yeah. We could talk about a bunch more, but I don't want to take all of your time, so let me just ask you the final two questions here.
I always ask: if you're going to write some Python code, what editor are you using these days? I'm all about VS Code, especially the most recent version of their Python integration, which is controversial because it's the one bit of it that's not open source, but that thing is just miraculous. Nice. Pylance, right? I think so, yeah. It's showing me, hey, this variable hasn't been used yet, and this import wasn't working, and all of that kind of stuff. So I'm all into VS Code now. Yeah, okay, that's definitely one that seems to be coming along and catching a lot of traction. Cool, cool. Notable
55:00 PyPI package, something you've run across that you're like, oh, this thing is cool, you should really know about it? This is great, because I can answer it using Dogsheep. I've got all of my starred GitHub repos pulled into my Dogsheep database, and I can actually run a Dogsheep Beta search and say, show me everything I've starred, sorted by most recent. So the most recent Python one I starred is astor, A-S-T-O-R. I've heard of that; I forget what it is, though. Yeah, it works with the Python abstract syntax tree, so it's for building software on top of Python itself. And the reason I found it is that I found this tool called flynt, which rewrites all of your .format() calls and turns them into Python 3.6 f-strings. Yes. And I was like, oh, how does that work? And it turns out it's astor under the hood. I was actually going to give a shout-out to flynt. On Datasette, the last commit at the time of this recording was "use f-strings in place of format". You can just point flynt at the top-level directory, and it just fixes everything. That's how I found astor; I was playing around with that. And then the other one, it's not a recent favorite, but I'm going to promote HTTPX. Oh yeah, I'm a big fan of that one as well. You talked about the ASGI side; this is like consuming services, the client side rather than the server side. Yeah. One way to look at HTTPX is that it's the new requests; I think it was almost called requests 3 at one point in its history. But it's basically the modern version of requests with full async support, so you can use it synchronously and you can use it asynchronously. But the killer feature, from my point of view, is that you can instantiate an HTTPX client and point it at a Python ASGI or WSGI object, and then start running requests against it. So it's an amazing test harness.
So yeah, all of my tests, in every Datasette plugin that I've written, are actually using HTTPX. You get to do HTTP testing without even having to spin up a localhost server; it's all happening in memory, and that's just extraordinary. That sounds fantastic; it makes such a good way of writing unit tests against things. Yeah, I've never tried it, but a lot of times you'd, like, install a web test library and then wrap it in a test framework, or maybe even fire the server up and talk to it over the network. But this is just, yep, connect the two pieces in code and skip the network, and go. I've got a really nerdy thing that I've just started doing with it. So Datasette has plugins, and Datasette plugins can do a bunch of different things. But I realized that Datasette itself is an API; the whole point is that it gives you a JSON API that you can use to interrogate your tables and run queries and so on. And I wanted my plugins to be able to use that API, but I didn't really want them making outbound HTTP requests against themselves. So, just a couple of weeks ago, I added a feature to Datasette where plugins get an internal client: they can call a client.get() method and feed it the URL, and that's actually using HTTPX and ASGI under the hood. So the idea here is that any feature of Datasette that has an external JSON API is now also an internal API that plugins can use, and I've started building plugins against that myself. The dogsheep-beta plugin actually runs internal searches against the Datasette search API for things like adding faceted search. And datasette-graphql is a plugin I'm writing that adds a GraphQL API on top of Datasette, and that's going to be using this client as well. So you'll run a GraphQL query, which gets turned internally into a JSON query and runs over this ASGI mechanism. It's cool. I hadn't really thought about that side of HTTPX.
I've always used it just as, I'm doing some async methods, so here's a good choice for a client. Yeah, the deep integration with ASGI, I think, is really exciting. Yeah, super neat. All right, well, those are all good recommendations. Now, the final call to action: people are interested in Datasette or Dogsheep. First, I just want to throw out that you should really go watch the 25-minute or whatever it was talk that you did at PyCon AU; that'll connect a ton of things for people. I'll throw in another recommendation. Yeah, go for it. I gave a talk last week for the GitHub OCTO speaker series, which I think is the best talk I've given about Datasette and Dogsheep. It's got a lot of very recent demos, and it's linked to on my blog; it's a talk about building personal data warehouses. Yeah, very neat. Also, and this is somewhat unusual for an open source project, but I think cool, because promoting open source projects is always like, why do they take off or not: you have a Datasette weekly newsletter? Yes, I do. It's not quite weekly, so maybe I should have picked a different name, but I've got a newsletter which goes out every week or so with the latest from the Datasette ecosystem. That's datasette.substack.com. My blog, simonwillison.net, I update at least once a week with all sorts of bits and pieces. And then if you're interested in the Dogsheep stuff, I would love it if people started building these themselves. There is quite a bit of assembly required: all of the code that I've written is open source, but you have to track down your authentication tokens and run cron jobs and find somewhere to host it, so it's not easy to get up and running. But if you do get it up and running, I would love to hear from you about what kind of things you managed to do with it. And if people want to build tools themselves
01:00:00 for the ecosystem, I'd be absolutely thrilled. Yeah, it'd be awesome: if they want to build a something-to-sqlite, whatever that something is, let you know, right? Well, congratulations on this project. I think it's super neat, and thanks for coming on to share it with everyone. Awesome, thanks a lot for having me. You bet. Bye.
01:00:15 This has been another episode of Talk Python to Me. Our guest on this episode was Simon Willison, and it's been brought to you by Linode and us over at Talk Python Training. Simplify your infrastructure and cut your cloud bills in half with Linode's Linux virtual machines: develop, deploy, and scale your modern applications faster and easier. Visit talkpython.fm/linode and click the Create Free Account button to get started. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course. Or, if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle; it's like a subscription that never expires. Be sure to subscribe to the show: open your favorite podcatcher and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening; I really appreciate it. Now get out there and write some Python code.