Personal search engine with datasette and dogsheep
Datasette helps people take data of any shape or size, analyze and explore it, and publish it as an interactive website and accompanying API.
Dogsheep is a collection of tools for personal analytics using SQLite and Datasette. Imagine a unified search engine for everything personal in your life such as twitter, photos, google docs, todoist, goodreads, and more, all in once place and outside of cloud companies.
On this episode we talk with Simon Willison who created both of these projects. He's also one of the co-creators of Django and we'll discuss some early Django history!
Episode Deep Dive
Guests introduction and background
Simon Willison is a seasoned Python developer and an influential figure in the Python community. He co-created the Django web framework and has a deep background in data journalism, open-source software, and designing tools to help people publish and explore data. Simon is the creator of both Datasette, an open-source platform to publish and explore SQLite databases online, and Dogsheep, a project that brings your personal data from diverse sources into a single place.
What to Know If You're New to Python
If you’re newer to Python and want to follow along with the ideas in this episode, here are a few quick pointers:
- Be comfortable reading basic Python code, as you’ll see references to imports and simple scripts.
- Know how to install libraries (e.g.,
pip install datasette
) and how to run Python from the command line or a virtual environment. - You don’t have to fully understand SQL, but familiarity with databases (like SQLite) will help.
Key points and takeaways
- Building a Personal Search Engine with Datasette and Dogsheep
Datasette allows you to turn your SQLite databases into interactive web apps, while Dogsheep is all about unifying personal data into one searchable platform. Simon’s vision is a personal “universal search engine” that indexes everything from Twitter likes and GitHub issues to photos and health data.
- Links and tools:
- Datasette (github.com/simonw/datasette)
- Dogsheep (github.com/dogsheep/dogsheep)
- Links and tools:
- Datasette’s Core Idea: Publish and Query SQLite Databases Online
By simply typing a command like
datasette publish
, you can bundle a read-only SQLite database with a web interface and an API. It requires no traditional database server, and you can easily share data or build interactive tools on top of it.- Links and tools:
- SQLite (sqlite.org)
- Command-line publish examples (docs.datasette.cloud)
- Links and tools:
- Read-Only SQL, Where SQL Injection Becomes a Feature
Typically, SQL injection is a vulnerability. However, Datasette’s design—where data is read-only—turns it into a powerful query interface. Users can write SQL queries directly in the URL or via the web UI, exploring data without risk to underlying records.
- Links and tools:
- Documentation: docs.datasette.cloud/en/stable/sql_queries.html
- Links and tools:
- Dogsheep and Personal Analytics
Inspired by Stephen Wolfram’s idea of a “personal search engine,” Dogsheep aggregates personal data sources into SQLite databases. GitHub repos, Twitter history, health data, and more can be synced and queried in one place to gain insights or just search across everything you do.
- Links and tools:
- Exploring Apple Photos and Machine Learning Labels
Apple Photos stores images and machine learning tags (e.g., cats, pelicans, location data) in a SQLite database on your device. Simon taps into it with
OSX Photos
and Dogsheep to create queries like “Show me pictures of pelicans, sorted by the highest aesthetic score.”- Links and tools:
- Turning Data Into ‘Mashups’
This approach harks back to the early “mashup” concept: Gather data from different services, unify it in one database, and then build new insights or mini-apps on top. You can combine data from GitHub, Twitter, or even your phone’s location logs for creative projects.
- Links and tools:
- Cleo the Dog and Social Media Data
A fun example is using regular expressions on a dog’s Twitter account to track her weight over time. By mining tweets that mention pounds or “weigh,” you can build a chart of your pet’s progress, showcasing how flexible these SQLite-driven approaches can be.
- Links and tools:
- Plugin Architecture for Datasette
Datasette’s plugin system encourages experimentation. Simon has written over 50 plugins, enabling features like additional output formats, interactive charting, custom authentication, and more. Plugins can be independently developed and published, extending Datasette’s capabilities without modifying its core.
- Links and tools:
- Datasette plugins
- Example: datasette-indieauth
- Links and tools:
- Working with ASGI, FastAPI, and Next-Generation Web Technologies
While not a core part of his recent work, Simon referenced the broader Python web ecosystem (including Django, FastAPI, and ASGI). Tools like HTTPX make testing and internal API calls simple, bridging modern asynchronous capabilities with these data-driven frameworks.
- Links and tools:
- HTTPX (github.com/encode/httpx)
- Links and tools:
- Origins of Django and Lessons Learned Simon recounted Django’s creation story at a local newspaper in Lawrence, Kansas. They needed a flexible web framework for newsroom-driven projects, leading to Django’s hallmark emphasis on fast development, the admin interface, and best practices that are now industry standard.
- Links and tools:
- Django (djangoproject.com)
- Django Admin docs (docs.djangoproject.com/en/stable/ref/contrib/admin)
Interesting quotes and stories
On building a personal data warehouse: “It’s like you can create your own private Google, but just for your stuff.”
On read-only data exploration: “Having SQL injection as a feature rather than a bug is something most frameworks would never dream of. But Datasette is built for that.”
On Apple’s machine learning labels: “You can literally run queries on your own device that say, ‘Show me all my pictures of pelicans with a high ‘pleasant camera tilt’ score.’ And it just works.”
Key definitions and terms
- Datasette: An open-source tool that turns SQLite databases into interactive web apps and JSON APIs.
- Dogsheep: A collection of utilities (and an overall project) to centralize personal data from multiple online services into SQLite for search and analytics.
- SQLite: A lightweight relational database stored in a single file, making it easy to bundle with applications.
- ASGI: Asynchronous Server Gateway Interface, a standard for async Python web apps.
- HTTPX: A Python HTTP client supporting both sync and async operations, with tight ASGI integration.
- Wolfram Alpha: A computational knowledge engine that inspired the concept of a “personal search engine,” leading to the pun-driven name “Dogsheep Beta.”
Learning resources
Here are some resources to help deepen your Python and data knowledge:
- Python for Absolute Beginners: A comprehensive foundation if you’re new to Python and want to build real applications.
- Django: Getting Started: Learn to build your first Django project, especially helpful if Simon’s Django story piqued your interest.
- Building Data-Driven Web Apps with Flask and SQLAlchemy: Get into creating and deploying web apps that talk to a database in Python.
- Python Data Visualization: Once you gather your data in Datasette or Dogsheep, you can visualize it more extensively.
Overall takeaway
Simon Willison’s work with Datasette and Dogsheep shows the power of combining Python, SQLite, and a bit of creativity to make data not only more accessible, but deeply personal. His approach exemplifies how anyone—developers and data enthusiasts alike—can unify scattered information into a single, interactive hub, unlocking new ways to visualize insights, explore old memories, and even run machine learning queries on a personal scale.
Links from the show
Dogsheep: dogsheep.github.io
Datasheet newsletter: datasette.substack.com
Video: Build your own data warehouse for personal analytics with SQLite and Datasette: youtube.com
Examples
List: github.com
Personal data warehouses: github.com
Global power plants: datasettes.com
SF data: datasettes.com
FiveThirtyEight: fivethirtyeight.datasettes.com
Lahman’s Baseball Database: baseballdb.lawlesst.net
Live demo of current main: datasette.io
Episode transcripts: talkpython.fm
--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy