
ADR-003: DuckDB for Citation Graph

Date: 2026-03-14
Status: Accepted

Context

The citation graph database needs to support:

  • Graph edge storage (citing -> cited relationships)
  • Analytical queries (centrality, clustering, coverage analysis)
  • Full-text search on paper titles/abstracts
  • Potential time-travel/versioning for run diffing (v0.6)
  • Scaling to tens of thousands of papers over time
  • Zero external system dependencies (embedded database)

The response cache already uses SQLite (simple key-value TTL store).

Options Considered

  1. SQLite: stdlib, zero deps, well-understood
     • Pro: no extra dependency, same pattern as cache
     • Con: row-oriented, poor for analytical queries, no native graph traversal

  2. DuckDB: embedded analytical database, columnar
     • Pro: fast analytics, good for graph queries, Parquet export, single pip install (~30MB)
     • Con: extra dependency, newer than SQLite

  3. DuckDB + DuckLake: adds catalog/versioning layer
     • Pro: time-travel queries ("graph as of date X"), perfect for run diffing
     • Con: very new (2025), adds complexity, may be premature

  4. NetworkX (in-memory): Python graph library
     • Pro: native graph algorithms (centrality, shortest path, clustering)
     • Con: no persistence, memory-bound, won't scale

  5. SQLite + FTS5: SQLite with full-text search extension
     • Pro: no extra dependency, decent text search
     • Con: still row-oriented, not good for analytical graph queries

Decision

DuckDB for the citation graph. Response cache stays SQLite.

DuckLake deferred to v0.6 when run diffing becomes a real need.

Rationale

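  • DuckDB's columnar storage and analytical query engine are better suited than SQLite's row-oriented engine to the graph use cases (traversals, aggregations, coverage analysis)
  • The Python API is nearly identical to sqlite3's: duckdb.connect("graph.db") returns a connection with the familiar execute() and fetchall() methods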
  • Single pip install duckdb, no system-level dependencies
  • Parquet export is useful for interop with data science tools
  • The roadmap includes features (run diffing, coverage analysis, 3D visualization) that will stress SQLite's analytical capabilities
  • Keeping the cache as SQLite avoids adding DuckDB as a hard dependency for basic search-only use
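
To make the API-similarity point concrete, here is a minimal sketch in which the same functions run against either a stdlib sqlite3 connection or a DuckDB connection. The `citations(citing, cited)` table, the function names, and the sample edges are all hypothetical and chosen for illustration; they are not the project's actual schema. It shows one traversal (a recursive CTE) and one aggregation (in-degree as a rough centrality proxy).

```python
import sqlite3

# Hypothetical schema for illustration: one row per citing -> cited edge.
SCHEMA = "CREATE TABLE IF NOT EXISTS citations (citing TEXT, cited TEXT)"

# Papers reachable from a starting paper by following citations.
# UNION (not UNION ALL) deduplicates rows, so citation cycles terminate.
REACHABLE = """
WITH RECURSIVE reach(paper) AS (
    SELECT cited FROM citations WHERE citing = ?
    UNION
    SELECT c.cited FROM citations AS c JOIN reach AS r ON c.citing = r.paper
)
SELECT paper FROM reach ORDER BY paper
"""

# In-degree (number of times cited) as a simple centrality proxy.
IN_DEGREE = """
SELECT cited, COUNT(*) AS n
FROM citations
GROUP BY cited
ORDER BY n DESC, cited
"""


def load_edges(conn, edges):
    """Create the edge table if needed and insert (citing, cited) pairs."""
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO citations VALUES (?, ?)", edges)


def reachable_from(conn, paper):
    """Return every paper reachable from `paper` via citation edges."""
    return [row[0] for row in conn.execute(REACHABLE, [paper]).fetchall()]


def in_degree(conn):
    """Return (paper, times_cited) rows, most cited first."""
    return conn.execute(IN_DEGREE).fetchall()


def demo(conn):
    """Load a tiny made-up graph (with a cycle) and run both queries."""
    load_edges(conn, [("A", "B"), ("B", "C"), ("C", "A"), ("D", "C")])
    return reachable_from(conn, "A"), in_degree(conn)


if __name__ == "__main__":
    print(demo(sqlite3.connect(":memory:")))
    try:
        import duckdb  # optional: only needed for the second run
    except ImportError:
        pass
    else:
        print(demo(duckdb.connect(":memory:")))
```

The queries here are small enough that either engine handles them easily; the sketch only shows that the calling code does not need to change, not that the two engines perform alike on larger graphs.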

Consequences

  • duckdb added to project dependencies
  • Graph module uses DuckDB, cache module stays SQLite
  • Two database files: ~/.cache/litseer/responses.db (SQLite) and ~/.cache/litseer/graph.db (DuckDB)
  • Graph queries must account for DuckDB's SQL dialect, which differs slightly from SQLite's in places
  • Can leverage DuckDB's built-in Parquet/JSON export for graph export features
  • DuckLake migration path exists for future versioning needs
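
One way the two-file layout could look in code is sketched below. The helper names `open_cache` and `open_graph` are hypothetical, and importing `duckdb` lazily inside `open_graph` is an assumption about how search-only use might avoid touching DuckDB; the decision above does not prescribe it.

```python
import sqlite3
from pathlib import Path

# Directory from the Consequences list; both database files live here.
CACHE_DIR = Path.home() / ".cache" / "litseer"


def open_cache(cache_dir=CACHE_DIR):
    """Open the SQLite response cache; needs only the standard library."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    return sqlite3.connect(cache_dir / "responses.db")


def open_graph(cache_dir=CACHE_DIR):
    """Open the DuckDB citation graph, importing duckdb only when needed."""
    try:
        # Assumption: a lazy import keeps search-only code paths free of DuckDB.
        import duckdb
    except ImportError as exc:
        raise RuntimeError(
            "The citation graph requires DuckDB: pip install duckdb"
        ) from exc
    cache_dir.mkdir(parents=True, exist_ok=True)
    return duckdb.connect(str(cache_dir / "graph.db"))
```

With this shape, code that only reads or writes cached responses calls `open_cache()` and never triggers the `duckdb` import, while graph features call `open_graph()` and fail with a clear message if the package is missing.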