Learn Python with Talk Python's 270 hours of courses

Personal search engine with datasette and dogsheep

Episode #299, published Sun, Jan 17, 2021, recorded Wed, Nov 18, 2020

In this episode, we'll be discussing two powerful tools for data reporting and exploration: Datasette and Dogsheep.

Datasette helps people take data of any shape or size, analyze and explore it, and publish it as an interactive website and accompanying API.

Dogsheep is a collection of tools for personal analytics using SQLite and Datasette. Imagine a unified search engine for everything personal in your life such as twitter, photos, google docs, todoist, goodreads, and more, all in once place and outside of cloud companies.

On this episode we talk with Simon Willison who created both of these projects. He's also one of the co-creators of Django and we'll discuss some early Django history!

Episode Deep Dive

Guests introduction and background

Simon Willison is a seasoned Python developer and an influential figure in the Python community. He co-created the Django web framework and has a deep background in data journalism, open-source software, and designing tools to help people publish and explore data. Simon is the creator of both Datasette, an open-source platform to publish and explore SQLite databases online, and Dogsheep, a project that brings your personal data from diverse sources into a single place.

What to Know If You're New to Python

If you’re newer to Python and want to follow along with the ideas in this episode, here are a few quick pointers:

  • Be comfortable reading basic Python code, as you’ll see references to imports and simple scripts.
  • Know how to install libraries (e.g., pip install datasette) and how to run Python from the command line or a virtual environment.
  • You don’t have to fully understand SQL, but familiarity with databases (like SQLite) will help.

Key points and takeaways

  1. Building a Personal Search Engine with Datasette and Dogsheep Datasette allows you to turn your SQLite databases into interactive web apps, while Dogsheep is all about unifying personal data into one searchable platform. Simon’s vision is a personal “universal search engine” that indexes everything from Twitter likes and GitHub issues to photos and health data.
  2. Datasette’s Core Idea: Publish and Query SQLite Databases Online By simply typing a command like datasette publish, you can bundle a read-only SQLite database with a web interface and an API. It requires no traditional database server, and you can easily share data or build interactive tools on top of it.
  3. Read-Only SQL, Where SQL Injection Becomes a Feature Typically, SQL injection is a vulnerability. However, Datasette’s design—where data is read-only—turns it into a powerful query interface. Users can write SQL queries directly in the URL or via the web UI, exploring data without risk to underlying records.
  4. Dogsheep and Personal Analytics Inspired by Stephen Wolfram’s idea of a “personal search engine,” Dogsheep aggregates personal data sources into SQLite databases. GitHub repos, Twitter history, health data, and more can be synced and queried in one place to gain insights or just search across everything you do.
  5. Exploring Apple Photos and Machine Learning Labels Apple Photos stores images and machine learning tags (e.g., cats, pelicans, location data) in a SQLite database on your device. Simon taps into it with OSX Photos and Dogsheep to create queries like “Show me pictures of pelicans, sorted by the highest aesthetic score.”
  6. Turning Data Into ‘Mashups’ This approach harks back to the early “mashup” concept: Gather data from different services, unify it in one database, and then build new insights or mini-apps on top. You can combine data from GitHub, Twitter, or even your phone’s location logs for creative projects.
  7. Cleo the Dog and Social Media Data A fun example is using regular expressions on a dog’s Twitter account to track her weight over time. By mining tweets that mention pounds or “weigh,” you can build a chart of your pet’s progress, showcasing how flexible these SQLite-driven approaches can be.
  8. Plugin Architecture for Datasette Datasette’s plugin system encourages experimentation. Simon has written over 50 plugins, enabling features like additional output formats, interactive charting, custom authentication, and more. Plugins can be independently developed and published, extending Datasette’s capabilities without modifying its core.
  9. Working with ASGI, FastAPI, and Next-Generation Web Technologies While not a core part of his recent work, Simon referenced the broader Python web ecosystem (including Django, FastAPI, and ASGI). Tools like HTTPX make testing and internal API calls simple, bridging modern asynchronous capabilities with these data-driven frameworks.
  10. Origins of Django and Lessons Learned Simon recounted Django’s creation story at a local newspaper in Lawrence, Kansas. They needed a flexible web framework for newsroom-driven projects, leading to Django’s hallmark emphasis on fast development, the admin interface, and best practices that are now industry standard.

Interesting quotes and stories

On building a personal data warehouse: “It’s like you can create your own private Google, but just for your stuff.”

On read-only data exploration: “Having SQL injection as a feature rather than a bug is something most frameworks would never dream of. But Datasette is built for that.”

On Apple’s machine learning labels: “You can literally run queries on your own device that say, ‘Show me all my pictures of pelicans with a high ‘pleasant camera tilt’ score.’ And it just works.”

Key definitions and terms

  • Datasette: An open-source tool that turns SQLite databases into interactive web apps and JSON APIs.
  • Dogsheep: A collection of utilities (and an overall project) to centralize personal data from multiple online services into SQLite for search and analytics.
  • SQLite: A lightweight relational database stored in a single file, making it easy to bundle with applications.
  • ASGI: Asynchronous Server Gateway Interface, a standard for async Python web apps.
  • HTTPX: A Python HTTP client supporting both sync and async operations, with tight ASGI integration.
  • Wolfram Alpha: A computational knowledge engine that inspired the concept of a “personal search engine,” leading to the pun-driven name “Dogsheep Beta.”

Learning resources

Here are some resources to help deepen your Python and data knowledge:

Overall takeaway

Simon Willison’s work with Datasette and Dogsheep shows the power of combining Python, SQLite, and a bit of creativity to make data not only more accessible, but deeply personal. His approach exemplifies how anyone—developers and data enthusiasts alike—can unify scattered information into a single, interactive hub, unlocking new ways to visualize insights, explore old memories, and even run machine learning queries on a personal scale.

Links from the show

Datasette: datasette.io
Dogsheep: dogsheep.github.io
Datasheet newsletter: datasette.substack.com
Video: Build your own data warehouse for personal analytics with SQLite and Datasette: youtube.com

Examples
List: github.com
Personal data warehouses: github.com
Global power plants: datasettes.com
SF data: datasettes.com
FiveThirtyEight: fivethirtyeight.datasettes.com
Lahman’s Baseball Database: baseballdb.lawlesst.net
Live demo of current main: datasette.io
Episode transcripts: talkpython.fm

--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy

Talk Python's Mastodon Michael Kennedy's Mastodon