ADR-003: DuckDB for Citation Graph
Date: 2026-03-14 Status: Accepted
Context
The citation graph database needs to support:
- Graph edge storage (citing -> cited relationships)
- Analytical queries (centrality, clustering, coverage analysis)
- Full-text search on paper titles/abstracts
- Potential time-travel/versioning for run diffing (v0.6)
- Scaling to tens of thousands of papers over time
- Zero external system dependencies (embedded database)
The response cache already uses SQLite (simple key-value TTL store).
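For orientation, a key-value TTL store on SQLite can be as small as the sketch below. This is an illustration only, not the project's actual cache module: the `ResponseCache` class, the `responses` table, and its column names are all hypothetical.

```python
import sqlite3
import time


class ResponseCache:
    """Hypothetical key-value cache with per-entry expiry, backed by SQLite."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS responses ("
            "key TEXT PRIMARY KEY, value TEXT NOT NULL, expires_at REAL NOT NULL)"
        )

    def set(self, key, value, ttl_seconds, now=None):
        # `now` is injectable so expiry can be tested without sleeping.
        now = time.time() if now is None else now
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)",
                (key, value, now + ttl_seconds),
            )

    def get(self, key, now=None):
        # Expired rows are simply filtered out; they are not deleted here.
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT value FROM responses WHERE key = ? AND expires_at > ?",
            (key, now),
        ).fetchone()
        return row[0] if row else None
```

A workload like this is point lookups by primary key, which is the access pattern a row-oriented store handles well.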
Options Considered
- SQLite: stdlib, zero deps, well-understood
    - Pro: no extra dependency, same pattern as cache
    - Con: row-oriented, poor for analytical queries, no native graph traversal
- DuckDB: embedded analytical database, columnar
    - Pro: fast analytics, good for graph queries, Parquet export, single pip install (~30MB)
    - Con: extra dependency, newer than SQLite
- DuckDB + DuckLake: adds catalog/versioning layer
    - Pro: time-travel queries ("graph as of date X"), perfect for run diffing
    - Con: very new (2025), adds complexity, may be premature
- NetworkX (in-memory): Python graph library
    - Pro: native graph algorithms (centrality, shortest path, clustering)
    - Con: no persistence, memory-bound, won't scale
- SQLite + FTS5: SQLite with full-text search extension
    - Pro: no extra dependency, decent text search
    - Con: still row-oriented, not good for analytical graph queries
Decision
DuckDB for the citation graph. Response cache stays SQLite.
DuckLake deferred to v0.6 when run diffing becomes a real need.
Rationale
- DuckDB's columnar storage and analytical query engine fit the graph workloads (traversals, aggregations, coverage analysis) better than SQLite's row-oriented engine
- The API is nearly identical to SQLite: duckdb.connect("graph.db")
- Single pip install duckdb, no system-level dependencies
- Parquet export is useful for interop with data science tools
- The roadmap includes features (run diffing, coverage analysis, 3D visualization) that will stress SQLite's analytical capabilities
- Keeping the cache as SQLite avoids adding DuckDB as a hard dependency for basic search-only use
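As a rough illustration of the points above, the sketch below stores citing -> cited edges and runs two of the query shapes mentioned: an aggregation (citation counts) and a traversal (recursive CTE). The `citations` table and all function names are assumptions made for this example, not the project's actual schema. DuckDB is imported lazily inside `open_graph`, in line with keeping it out of the search-only code path.

```python
def open_graph(path=":memory:"):
    """Open the graph store. The import is deferred so that code which
    never touches the graph does not need the duckdb package installed."""
    import duckdb  # third-party: pip install duckdb

    return duckdb.connect(path)


def init_schema(conn):
    # Hypothetical edge table: one row per citing -> cited relationship.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citations ("
        "citing TEXT NOT NULL, cited TEXT NOT NULL, "
        "PRIMARY KEY (citing, cited))"
    )


def add_edges(conn, edges):
    # edges: non-empty iterable of (citing, cited) pairs.
    conn.executemany("INSERT INTO citations VALUES (?, ?)", list(edges))


def citation_counts(conn):
    """In-degree per paper, most cited first (a simple centrality proxy)."""
    return conn.execute(
        "SELECT cited, COUNT(*) AS n FROM citations "
        "GROUP BY cited ORDER BY n DESC, cited"
    ).fetchall()


def reachable_from(conn, paper):
    """Every paper reachable by following citing -> cited edges."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(paper) AS (
            SELECT cited FROM citations WHERE citing = ?
            UNION  -- UNION (not UNION ALL) deduplicates, so cycles terminate
            SELECT c.cited FROM citations c JOIN reach r ON c.citing = r.paper
        )
        SELECT paper FROM reach ORDER BY paper
        """,
        [paper],
    ).fetchall()
    return [r[0] for r in rows]
```

Typical use would be `conn = open_graph("graph.db")`, then `init_schema(conn)` and the query helpers. The helpers only rely on `execute`, `executemany`, and `fetchall`, so they should also accept a `sqlite3` connection, which can be handy in unit tests that should not require DuckDB.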
Consequences
- duckdb added to project dependencies
- Graph module uses DuckDB, cache module stays SQLite
- Two database files: ~/.cache/litseer/responses.db (SQLite) and ~/.cache/litseer/graph.db (DuckDB)
- Need to handle DuckDB's slightly different SQL dialect in places
- Can leverage DuckDB's built-in Parquet/JSON export for graph export features
- DuckLake migration path exists for future versioning needs
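The export consequence could look roughly like the sketch below, which builds a DuckDB `COPY ... TO` statement for Parquet or JSON output. The helper names and the `citations` table used in the usage note are hypothetical; only the `COPY ... TO '<file>' (FORMAT ...)` statement itself is DuckDB syntax.

```python
def export_statement(table, path, fmt="parquet"):
    """Build a DuckDB COPY ... TO statement for a Parquet or JSON export."""
    formats = {"parquet": "PARQUET", "json": "JSON"}
    if fmt not in formats:
        raise ValueError(f"unsupported export format: {fmt!r}")
    if not table.isidentifier():
        # Table names cannot be bound as parameters, so validate them instead.
        raise ValueError(f"invalid table name: {table!r}")
    escaped = str(path).replace("'", "''")  # escape quotes in the SQL literal
    return f"COPY {table} TO '{escaped}' (FORMAT {formats[fmt]})"


def export_table(conn, table, path, fmt="parquet"):
    """Run the export on an open DuckDB connection."""
    conn.execute(export_statement(table, path, fmt))
```

For example, `export_table(conn, "citations", "citations.parquet")` would write the edge table to a Parquet file that pandas or Polars can read directly.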