ADR-003: DuckDB for Citation Graph

Date: 2026-03-14
Status: Accepted

Context

The citation graph database needs to support:

- Graph edge storage (citing -> cited relationships)
- Analytical queries (centrality, clustering, coverage analysis)
- Full-text search on paper titles/abstracts
- Potential time-travel/versioning for run diffing (v0.6)
- Scaling to tens of thousands of papers over time
- Zero external system dependencies (embedded database)

The response cache already uses SQLite (simple key-value TTL store).
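
For reference, a key-value TTL store on SQLite can be as small as the sketch below. It is illustrative only: the `ResponseCache` class, the `responses` table, and its column names are assumptions, not the project's actual cache schema. The `now` parameter exists only to make expiry easy to demonstrate.

```python
import sqlite3
import time


class ResponseCache:
    """Hypothetical key-value cache with per-entry expiry, backed by SQLite."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS responses ("
            "key TEXT PRIMARY KEY, value TEXT NOT NULL, expires_at REAL NOT NULL)"
        )

    def set(self, key, value, ttl_seconds, now=None):
        now = time.time() if now is None else now
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)",
                (key, value, now + ttl_seconds),
            )

    def get(self, key, now=None):
        now = time.time() if now is None else now
        # Expired rows are simply ignored on read.
        row = self.conn.execute(
            "SELECT value FROM responses WHERE key = ? AND expires_at > ?",
            (key, now),
        ).fetchone()
        return row[0] if row else None


cache = ResponseCache()
cache.set("doi:10.1000/example", '{"title": "Example"}', ttl_seconds=60, now=1000.0)
fresh = cache.get("doi:10.1000/example", now=1030.0)    # still within the TTL
expired = cache.get("doi:10.1000/example", now=1100.0)  # past the TTL
```

Point lookups by primary key like these are the row-oriented access pattern SQLite handles well, which is why the cache is not part of this decision.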

Options Considered

- SQLite: stdlib, zero deps, well-understood
  - Pro: no extra dependency, same pattern as cache
  - Con: row-oriented, poor for analytical queries, no native graph traversal
- DuckDB: embedded analytical database, columnar
  - Pro: fast analytics, good for graph queries, Parquet export, single pip install (~30MB)
  - Con: extra dependency, newer than SQLite
- DuckDB + DuckLake: adds catalog/versioning layer
  - Pro: time-travel queries ("graph as of date X"), perfect for run diffing
  - Con: very new (2025), adds complexity, may be premature
- NetworkX (in-memory): Python graph library
  - Pro: native graph algorithms (centrality, shortest path, clustering)
  - Con: no persistence, memory-bound, won't scale
- SQLite + FTS5: SQLite with full-text search extension
  - Pro: no extra dependency, decent text search
  - Con: still row-oriented, not good for analytical graph queries
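
To make "graph traversal" concrete for the SQL-based options: without a dedicated traversal operator, multi-hop queries are usually written as recursive CTEs. The sketch below uses the stdlib `sqlite3` module so it runs without extra dependencies; DuckDB also accepts `WITH RECURSIVE`. The `citations` table, its columns, and the `cited_within` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citations (citing TEXT, cited TEXT)")
conn.executemany(
    "INSERT INTO citations VALUES (?, ?)",
    [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")],
)

# Walk outgoing citation edges from a start paper, up to max_hops away.
REACHABLE = """
WITH RECURSIVE reach(paper, depth) AS (
    SELECT cited, 1 FROM citations WHERE citing = ?
    UNION
    SELECT c.cited, r.depth + 1
    FROM citations AS c
    JOIN reach AS r ON c.citing = r.paper
    WHERE r.depth < ?
)
SELECT paper, MIN(depth) AS hops
FROM reach
GROUP BY paper
ORDER BY hops, paper
"""


def cited_within(conn, paper, max_hops):
    """Return (paper, hops) for everything reachable from `paper`."""
    return conn.execute(REACHABLE, [paper, max_hops]).fetchall()


reachable = cited_within(conn, "A", 2)
# E is three hops from A, so it is excluded when max_hops is 2.
```

Algorithms such as centrality or clustering do not reduce to a single query like this, which is the gap the NetworkX option covers at the cost of persistence.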

Decision

DuckDB for the citation graph. Response cache stays SQLite.

DuckLake deferred to v0.6 when run diffing becomes a real need.

Rationale

- DuckDB's columnar storage and analytical query engine fit the graph use cases (traversals, aggregations, coverage analysis) better than SQLite's row-oriented storage
- The Python API is close to the stdlib sqlite3 module: duckdb.connect("graph.db") returns a connection with execute() and fetchall()
- Installation is a single pip install duckdb, with no system-level dependencies
- Parquet export is useful for interop with data science tools
- The roadmap includes features (run diffing, coverage analysis, 3D visualization) that would strain SQLite's analytical performance
- Keeping the cache on SQLite avoids making DuckDB a hard dependency for basic search-only use
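
The API similarity and the optional-dependency point can be sketched together. In the example below, `open_graph`, `most_cited`, and the `citations` table are hypothetical names. The fallback to `sqlite3` exists only so the snippet runs where DuckDB is not installed; it is not part of the decision. The query code uses only `execute`, `executemany`, and `fetchall`, which both connection types provide.

```python
import sqlite3

try:
    import duckdb  # optional: only the graph module needs it
except ImportError:
    duckdb = None


def open_graph(path=":memory:"):
    """Open the graph database, using DuckDB when it is installed."""
    if duckdb is not None:
        return duckdb.connect(path)
    # Illustration-only fallback so this sketch runs without DuckDB.
    return sqlite3.connect(path)


def most_cited(conn):
    """Return (paper, citation_count) pairs, most-cited first."""
    return conn.execute(
        "SELECT cited, COUNT(*) AS n FROM citations "
        "GROUP BY cited ORDER BY n DESC, cited"
    ).fetchall()


conn = open_graph()
conn.execute("CREATE TABLE citations (citing VARCHAR, cited VARCHAR)")
conn.executemany(
    "INSERT INTO citations VALUES (?, ?)",
    [["A", "C"], ["B", "C"], ["A", "B"], ["D", "C"]],
)
ranking = most_cited(conn)
```

A real graph module would more likely raise a clear error asking the user to install duckdb than fall back silently; the guarded import is the part that keeps search-only installs free of the dependency.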

Consequences

- duckdb is added to the project dependencies
- The graph module uses DuckDB; the cache module stays on SQLite
- Two database files: ~/.cache/litseer/responses.db (SQLite) and ~/.cache/litseer/graph.db (DuckDB)
- DuckDB's SQL dialect follows PostgreSQL more closely than SQLite's, so some queries need adjusting in places (for example, DuckDB enforces column types where SQLite is permissive)
- Graph export features can use DuckDB's built-in Parquet/JSON export
- A DuckLake migration path exists for future versioning needs
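
As a rough sketch of the export consequence, DuckDB's `COPY ... TO` statement can write a table straight to Parquet. The `citations` table and the `export_edges` helper are assumptions for illustration, and the snippet skips the export when DuckDB is not installed.

```python
import tempfile
from pathlib import Path

try:
    import duckdb
except ImportError:  # DuckDB is optional outside the graph module
    duckdb = None


def export_edges(conn, out_path):
    """Write the citations table to Parquet and return the exported row count."""
    target = Path(out_path).as_posix()
    # The path is interpolated into the SQL text, so it must come from trusted code.
    conn.execute(f"COPY citations TO '{target}' (FORMAT PARQUET)")
    # Read the file back to confirm what was written.
    row = conn.execute(f"SELECT COUNT(*) FROM read_parquet('{target}')").fetchone()
    return row[0]


exported_rows = None
if duckdb is not None:
    conn = duckdb.connect(":memory:")
    conn.execute("CREATE TABLE citations (citing VARCHAR, cited VARCHAR)")
    conn.execute("INSERT INTO citations VALUES ('A', 'B'), ('B', 'C')")
    with tempfile.TemporaryDirectory() as tmp:
        exported_rows = export_edges(conn, Path(tmp) / "citations.parquet")
```

The resulting file can be opened by any Parquet-aware tool, which is the interop benefit noted in the rationale.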