DuckDB and Python: Ducks and Snakes living together
Episode #491,
published Fri, Dec 27, 2024, recorded Tue, Dec 10, 2024
Join me for an insightful conversation with Alex Monahan, who works on documentation, tutorials, and training at DuckDB Labs. We explore why DuckDB is gaining momentum among Python and data enthusiasts, from its in-process database design to its blazingly fast, columnar architecture. We also dive into indexing strategies, concurrency considerations, and the fascinating way MotherDuck (the cloud companion to DuckDB) handles large-scale data seamlessly. Don’t miss this chance to learn how a single pip install could totally transform your Python data workflow!
Links from the show
Alex on Mastodon: @__Alex__
DuckDB: duckdb.org
MotherDuck: motherduck.com
SQLite: sqlite.org
Moka-Py: github.com
PostgreSQL: www.postgresql.org
MySQL: www.mysql.com
Redis: redis.io
Apache Parquet: parquet.apache.org
Apache Arrow: arrow.apache.org
Pandas: pandas.pydata.org
Polars: pola.rs
Pyodide: pyodide.org
DB-API (PEP 249): peps.python.org/pep-0249
Flask: flask.palletsprojects.com
Gunicorn: gunicorn.org
MinIO: min.io
Amazon S3: aws.amazon.com/s3
Azure Blob Storage: azure.microsoft.com/products/storage
Google Cloud Storage: cloud.google.com/storage
DigitalOcean: www.digitalocean.com
Linode: www.linode.com
Hetzner: www.hetzner.com
BigQuery: cloud.google.com/bigquery
DBT (Data Build Tool): docs.getdbt.com
Mode: mode.com
Hex: hex.tech
Python: www.python.org
Node.js: nodejs.org
Rust: www.rust-lang.org
Go: go.dev
.NET: dotnet.microsoft.com
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm
--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
DuckDB: duckdb.org
MotherDuck: motherduck.com
SQLite: sqlite.org
Moka-Py: github.com
PostgreSQL: www.postgresql.org
MySQL: www.mysql.com
Redis: redis.io
Apache Parquet: parquet.apache.org
Apache Arrow: arrow.apache.org
Pandas: pandas.pydata.org
Polars: pola.rs
Pyodide: pyodide.org
DB-API (PEP 249): peps.python.org/pep-0249
Flask: flask.palletsprojects.com
Gunicorn: gunicorn.org
MinIO: min.io
Amazon S3: aws.amazon.com/s3
Azure Blob Storage: azure.microsoft.com/products/storage
Google Cloud Storage: cloud.google.com/storage
DigitalOcean: www.digitalocean.com
Linode: www.linode.com
Hetzner: www.hetzner.com
BigQuery: cloud.google.com/bigquery
DBT (Data Build Tool): docs.getdbt.com
Mode: mode.com
Hex: hex.tech
Python: www.python.org
Node.js: nodejs.org
Rust: www.rust-lang.org
Go: go.dev
.NET: dotnet.microsoft.com
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm
--- Stay in touch with us ---
Subscribe to Talk Python on YouTube: youtube.com
Talk Python on Bluesky: @talkpython.fm at bsky.app
Talk Python on Mastodon: talkpython
Michael on Bluesky: @mkennedy.codes at bsky.app
Michael on Mastodon: mkennedy
Episode Breakdown and Key Points
Alex Monahan
- Alex works on documentation, tutorials, and training at DuckDB Labs and also as a forward-deployed engineer at MotherDuck.
- Explained how a forward-deployed engineer helps customers integrate DuckDB (and MotherDuck) into their data workflows, from tool selection to SQL optimization.
What is DuckDB?
- An in-process analytical database designed for data science, data analysis, and data engineering tasks.
- Installable via a simple
pip install duckdb
command; no separate server needed. - Supports reading data from various sources (CSV, Parquet, JSON) without prior loading or schema creation.
- Official site: DuckDB.org
SQLite vs. DuckDB
- Similarities: Both are in-process databases, packaged as a library, and can be embedded 491-duckdb-and-python-ducks-and-snakes-living-togethereasily in Python or other languages.
- Differences:
- DuckDB is columnar, compresses better, and is optimized for large-scale analytical queries (aggregations, joins over millions or billions of rows).
- DuckDB has strict typing (versus SQLite’s more flexible typing) and supports big data operations bigger than available RAM by streaming and disk spill.
Columnar Architecture and Performance
- Discussed how column-based storage enables faster queries and better compression for analytical workloads.
- Automatically creates “chunk-based” indexing (min/max) on each column to speed up filtering (so-called zone maps).
- Built-in vectorized execution processes data in batches (e.g., chunks of 2,000 rows) to balance speed and memory usage.
Working with Python Data Frames
- Integration with pandas, Polars, and Apache Arrow: You can run SQL queries directly on a pandas
DataFrame
(or Polars / Arrow) by referencing the variable name in the SQL. - Offers commands like
.df
to return query results as pandas, or.pl
for Polars, etc. - This in-process design means DuckDB can “see” local variables in Python without extra data copying—especially efficient for Arrow/Polars.
- Pandas site: pandas.pydata.org
- Polars site: pola.rs
- Apache Arrow site: arrow.apache.org
- Integration with pandas, Polars, and Apache Arrow: You can run SQL queries directly on a pandas
File-Based and In-Memory Usage
- By default, running
duckdb.sql(...)
uses an in-memory database. - You can also persist data to disk via a single file (e.g.,
duckdb.connect("file.db")
), which stores all tables in compressed columnar format. - Supports reading and writing files in cloud object stores (S3, GCS, Azure) and even hierarchical data (folder partitioning).
- By default, running
Concurrency and Write Access
- Single-process read-write model: one process at a time can open a DuckDB file in write mode.
- Multi-threading within that one process is fine (DuckDB uses multiple CPU cores for parallel execution).
- For read-only connections, concurrent processes can safely share the same file.
MotherDuck: DuckDB in the Cloud
- A serverless cloud data warehouse built around DuckDB.
- Offers concurrency, managed storage, access control, and easy data sharing.
- Allows partial execution in the cloud and partial execution locally, optimizing query performance and developer workflows.
- MotherDuck: motherduck.com
Indexing Strategies
- DuckDB automatically creates “rough indexes” for skipping irrelevant chunks of data (based on min/max per column).
- Optional adaptive radix tree (ART) indexes for more granular lookups if needed, though often not essential in analytical scenarios.
JSON and Nested Data
- JSON type support for semi-structured data.
- Tools to unnest JSON into columns or selectively query parts of the JSON via path syntax.
- Also has native nested types if you want to store strictly typed hierarchical data in a single column.
Integration in Data Engineering
- Commonly used alongside dbt (github.com/dbt-labs/dbt) for transformations and aggregations.
- Helps turn large, raw data sources (CSV, Parquet) into aggregated datasets quickly.
- Useful in pipelines: triggered transformations or micro-batch processing at scale by spawning multiple lightweight DuckDB tasks.
Relevant Tools and Links
- DuckDB: duckdb.org
- MotherDuck: motherduck.com
- dbt: github.com/dbt-labs/dbt
- Pandas: pandas.pydata.org
- Polars: pola.rs
- Apache Arrow: arrow.apache.org
- SQLite (reference): sqlite.org
Overall Takeaway
DuckDB’s in-process, columnar architecture makes it remarkably easy to embed fast analytical queries in any Python workflow—no external servers needed. Whether you’re aggregating billions of rows on your laptop or integrating large-scale analytics with MotherDuck in the cloud, you get a streamlined SQL experience that slots right into your existing data science or engineering stacks.