Litseer Roadmap

Vision

Litseer is a technology intelligence platform for automated, reproducible literature review and technology assessment. The core principle: deterministic, verifiable outputs with minimal built-in AI — structured data that humans and LLMs can independently verify.

The end goal is near-full automation of the technology lit review and SME interview pipeline, designed to integrate with LLMs for high-quality verified analysis without depending on them.

Version Plan

Version  Focus                                                Status
v0.1     Core search, dedup, quality, export, cache           Alpha
v0.2     Local citation graph + technology portfolio search   Planned
v0.3     Unstructured reference parsing                       Planned
v0.4     PDF ingestion with OCR                               Planned
v0.5     Report generation (quad charts, tech assessment)     Planned
v0.6     Run diffing and coverage analysis                    Planned
v1.0     Interactive graph dashboard (Rust/WASM)              Planned

v0.1 — Core Functionality (current)

  • Multi-source search: OpenAlex, Semantic Scholar, CrossRef, NASA NTRS, IEEE Xplore, AIAA, SAE, SKYbrary
  • Citation snowballing (forward + backward) via API-backed sources
  • Quality tier filtering
  • Cross-source deduplication (DOI + title normalization)
  • Export: BibTeX, JSON, markdown summary
  • SQLite response cache with 7-day TTL
  • GitHub Actions CI (Python 3.11-3.14)
  • AGPL-3.0 dual licensing

v0.2 — Local Citation Graph + Technology Portfolio

Citation Graph

Build a permanent local citation graph that accumulates over time, enabling citation walking even for sources without citation graph APIs.

Every time a source adapter returns papers with structured reference DOIs (OpenAlex, Semantic Scholar, CrossRef), the edges (citing_doi -> cited_doi) are stored in a local SQLite database. Forward citations emerge for free: if papers A and B both cite paper C, and we've seen A and B, we know C's citers without asking any API.

Schema:

papers:   paper_id, doi, title, title_normalized, authors, year, venue, source_db
edges:    citing_id -> cited_id (with source attribution)
meta:     schema versioning for forward migrations
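The schema above, and the "forward citations for free" lookup it enables, can be sketched with Python's sqlite3 module. This is an illustrative sketch, not the shipped DDL: table and column names follow the schema listing, but constraints and the `forward_citations` helper are assumptions.

```python
import sqlite3

# In-memory stand-in for ~/.cache/litseer/graph.db
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE papers (
    paper_id INTEGER PRIMARY KEY,
    doi TEXT UNIQUE,
    title TEXT,
    title_normalized TEXT,
    authors TEXT,
    year INTEGER,
    venue TEXT,
    source_db TEXT
);
CREATE TABLE edges (
    citing_id INTEGER REFERENCES papers(paper_id),
    cited_id  INTEGER REFERENCES papers(paper_id),
    source_db TEXT,                 -- which API reported this edge
    PRIMARY KEY (citing_id, cited_id)
);
CREATE TABLE meta (key TEXT PRIMARY KEY, value TEXT);
INSERT INTO meta VALUES ('schema_version', '1');
""")

def forward_citations(conn, doi):
    """Who cites this DOI? Answered from the local graph, no API call."""
    rows = conn.execute("""
        SELECT p_citing.doi
        FROM edges
        JOIN papers p_citing ON p_citing.paper_id = edges.citing_id
        JOIN papers p_cited  ON p_cited.paper_id  = edges.cited_id
        WHERE p_cited.doi = ?
    """, (doi,)).fetchall()
    return [r[0] for r in rows]
```

If papers A and B were ever stored with reference edges to C, `forward_citations(conn, doi_of_C)` returns A and B purely from the accumulated edges.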

New capabilities:

  • --source local for cite-walk — queries the accumulated local graph
  • litseer graph stats — paper count, edge count, top-cited papers
  • litseer graph export — JSON, DOT, or GraphML format
  • Automatic graph population during search and cite-walk operations
  • DOI-less papers matched by normalized title + year (±1 year tolerance)
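One plausible shape for the DOI-less matching: normalize titles (lowercase, strip punctuation, collapse whitespace) and accept a candidate whose year is within the tolerance. The function names and exact normalization rules here are illustrative assumptions, not the shipped implementation.

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    title = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", title).strip()

def match_paper(candidate: dict, known_papers: list, year_tolerance: int = 1):
    """Match a DOI-less candidate against known (title_normalized, year) records."""
    key = normalize_title(candidate["title"])
    for paper in known_papers:
        if (paper["title_normalized"] == key
                and abs(paper["year"] - candidate["year"]) <= year_tolerance):
            return paper
    return None
```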

Technology Portfolio

Batch search across a folder of technology definition YAML files, with per-technology output folders.

Directory structure:

techs/
  turbine-cooling.yaml
  additive-mfg-cooling.yaml
  cmc-materials.yaml
output/
  turbine-cooling/
    search-2026-03-14.json
    new-refs-2026-03-14.bib
    summary-2026-03-14.md
  additive-mfg-cooling/
    ...

New capabilities:

  • litseer portfolio techs/ — batch search all YAML configs in a directory
  • Per-technology output subfolders with bib files and summaries
  • Portfolio-level summary (cross-technology coverage, shared references)
  • Designed for scheduled/automated runs (cron, CI)
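The batch loop behind the directory structure above can be sketched as follows. This is a minimal sketch: `run_portfolio` and its `run_search` callback are hypothetical names standing in for the real search pipeline, and only pathlib globbing is used (no YAML parsing).

```python
from datetime import date
from pathlib import Path

def run_portfolio(techs_dir: str, output_dir: str, run_search):
    """Run a search per tech YAML; write dated outputs to per-technology folders.

    `run_search` stands in for the real pipeline: it takes a config path
    and returns (results_json, bibtex, summary_md) as strings.
    """
    stamp = date.today().isoformat()
    for config in sorted(Path(techs_dir).glob("*.yaml")):
        out = Path(output_dir) / config.stem          # e.g. output/turbine-cooling/
        out.mkdir(parents=True, exist_ok=True)
        results, bib, summary = run_search(config)
        (out / f"search-{stamp}.json").write_text(results)
        (out / f"new-refs-{stamp}.bib").write_text(bib)
        (out / f"summary-{stamp}.md").write_text(summary)
```

Because the loop is a pure function of the configs in `techs/`, it suits unattended cron or CI runs.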

Scope

  • ~700 LOC new source code
  • ~200 LOC modifications to existing files
  • ~450 LOC tests

v0.3 — Unstructured Reference Parsing

Parse bibliography text blobs from sources that don't provide structured DOIs (e.g., NASA NTRS gives "Smith, J., 'Turbine Cooling', AIAA J., 2023").

Rule-based regex parser extracts author, year, title, venue, DOI from reference strings. Parsed references are matched against the local graph using trigram similarity, or optionally resolved via CrossRef API.
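Trigram similarity here can be read as Jaccard similarity over character trigrams; a minimal sketch (the padding scheme and any match threshold are assumptions):

```python
def trigrams(s: str) -> set:
    """Character trigrams of a lowercased, space-padded string."""
    s = f"  {s.lower()} "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two strings' trigram sets, in [0.0, 1.0]."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```

A parsed title is matched against `title_normalized` values in the local graph and accepted when the similarity clears a chosen threshold; anything below falls through to the optional CrossRef resolution.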

No LLM calls — all parsing is deterministic/rule-based. Output is structured for external LLM consumption if desired.

New capabilities:

  • litseer graph resolve — match unresolved references to known papers
  • --resolve-refs flag on search/cite-walk for automatic resolution
  • unresolved_refs table tracks what hasn't been matched yet
  • Confidence scoring for parsed references

Parser patterns (priority order):

  1. DOI extraction (regex, high confidence)
  2. Year extraction (4-digit number in the 1900-2099 range)
  3. Author extraction (APA-style and abbreviated "et al." forms)
  4. Title extraction (longest capitalized phrase heuristic)
  5. Venue matching (against a known abbreviation list)
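The two highest-confidence patterns might look like this — the regexes are illustrative sketches, not the shipped patterns:

```python
import re

# DOIs start "10." followed by a 4-9 digit registrant code and a suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s,;]+", re.IGNORECASE)
# Years restricted to 1900-2099 to avoid matching page numbers etc.
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")

def parse_reference(ref: str) -> dict:
    """Extract DOI and year from a raw reference string, if present."""
    doi = DOI_RE.search(ref)
    year = YEAR_RE.search(ref)
    return {
        "doi": doi.group(0).rstrip(".") if doi else None,  # drop sentence-final period
        "year": int(year.group(0)) if year else None,
    }
```

Because every pattern is a plain regex, the same reference string always parses the same way, which keeps the pipeline deterministic.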

Scope

  • ~500 LOC new source code (refparse.py, refmatch.py)
  • ~100 LOC modifications to existing files
  • ~400 LOC tests

v0.4 — PDF Reference Extraction

Extract bibliography sections directly from PDF papers.

  1. Extract text from PDF using pdfplumber
  2. Detect bibliography section by header scan or last-15% heuristic
  3. Split into individual reference strings
  4. Feed into the v0.3 rule-based reference parser -> local graph
  5. OCR fallback via pytesseract for scanned PDFs
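Step 2 above (header scan with a last-15% fallback) could be sketched like this, operating on already-extracted text; the header list and the 15% cutoff are assumptions:

```python
import re

# Common bibliography section headers, matched as whole lines.
REF_HEADERS = re.compile(r"^\s*(references|bibliography|works cited)\s*$",
                         re.IGNORECASE | re.MULTILINE)

def bibliography_section(text: str) -> str:
    """Return the likely bibliography: text after a header, else the last 15%."""
    match = REF_HEADERS.search(text)
    if match:
        return text[match.end():].strip()
    return text[int(len(text) * 0.85):].strip()
```

The output is then split into individual reference strings and handed to the v0.3 parser.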

New capabilities:

  • litseer ingest paper.pdf — extract and parse references from a PDF
  • litseer ingest papers/ — batch ingest a directory of PDFs
  • Optional dependencies: pdfplumber, pytesseract, Pillow

Scope

  • ~200 LOC new source code (pdfextract.py)
  • ~55 LOC modifications to existing files
  • ~150 LOC tests

v0.5 — Report Generation

Automated technology assessment reports from search results and graph data.

Quad charts:

  • Typst templates for technology quad charts (TRL, risk, maturity, impact)
  • Export to PPTX for presentation use
  • NASA Aviation Technology Report style formatting

Tech assessment reports:

  • Per-technology summaries with citation statistics
  • Cross-technology comparison tables
  • Gap analysis (areas with sparse literature coverage)
  • SME interview integration — structured templates for capturing expert input alongside automated lit search results

New capabilities:

  • litseer report <tech-folder> — generate quad chart + assessment
  • litseer report --format pptx|pdf|typst
  • Typst template system for customizable report layouts
  • SME input YAML format for structured expert annotations

v0.6 — Run Diffing and Coverage Analysis

Compare searches at different dates to track how the literature evolves.

Run diffing:

  • litseer diff output/run-2026-01/ output/run-2026-03/ — show new papers, removed papers, citation count changes
  • Keyword impact analysis — show how changing search terms affects coverage
  • Timeline visualization of literature growth per technology area
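At its core, run diffing reduces to set operations over the DOIs present in two runs' exports; a minimal sketch (the `diff_runs` name and input shape are assumptions):

```python
def diff_runs(old_papers: list, new_papers: list) -> dict:
    """Compare two runs' paper lists by DOI; each paper is a dict with a 'doi' key."""
    old = {p["doi"] for p in old_papers}
    new = {p["doi"] for p in new_papers}
    return {
        "added": sorted(new - old),     # papers only in the newer run
        "removed": sorted(old - new),   # papers that dropped out
        "kept": sorted(old & new),      # stable coverage
    }
```

Because runs are deterministic, any difference is attributable to new literature or to a change in the search configuration, never to run-to-run noise.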

Coverage analysis:

  • Identify under-researched subtopics within a technology area
  • Detect citation clusters and isolated papers
  • Map technology maturity by publication volume and recency trends

v1.0 — Interactive Graph Dashboard

Portable, browser-based visualization for exploring the citation graph.

Technology:

  • Rust compiled to WASM for performance
  • 3D force-directed graph layout (WebGPU or WebGL)
  • Served by litseer dashboard as a local web server

Capabilities:

  • Interactive 3D citation graph with zoom, pan, filter
  • Visual YAML config builder — drag to define search clusters, see coverage update in real time
  • Run diff visualization — overlay two runs, highlight new/changed nodes
  • Technology area coloring and clustering
  • Click-through to paper details, DOI links, local PDF viewer
  • Export graph views as images for reports

Architecture Notes

Design Principles

  1. Deterministic: Same inputs produce same outputs. No randomness, no LLM-dependent features baked in.
  2. Verifiable: Every claim traces to a specific paper, DOI, and source. Structured output enables automated and human verification.
  3. LLM-ready: All output formats (JSON, graph export, structured markdown) are designed for LLM consumption without requiring LLM integration.
  4. Incremental: The citation graph and cache accumulate value over time. Each run adds to the knowledge base.
  5. Portable: No cloud dependencies. Everything runs locally. Graph and cache are SQLite files that can be backed up or shared.

Graph vs Cache

            Response Cache                  Citation Graph
Database    ~/.cache/litseer/responses.db   ~/.cache/litseer/graph.db
Lifespan    Ephemeral (7-day TTL)           Permanent (grows over time)
Content     Raw API JSON                    Normalized paper metadata + edges
Purpose     Reduce API calls                Enable local citation walking

Privacy

The citation graph is strictly local. No graph data is sent to any API. The optional reference resolution (--resolve-refs) sends only raw reference text to CrossRef, which is already public data from published papers.

Structured Output for LLM Consumption

Graph export produces machine-readable JSON:

{
  "papers": [{"doi": "10.xxx", "title": "...", "year": 2023}],
  "edges": [{"citing_doi": "10.xxx", "cited_doi": "10.yyy"}],
  "stats": {"paper_count": 150, "edge_count": 420}
}

This can be fed to an external LLM for analysis (centrality, gap detection, cluster identification) without any built-in LLM dependency.
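Simple analyses also fall out of this JSON without any LLM at all — for example, in-degree (times cited) as a crude centrality measure. A sketch, assuming the export shape shown above (`top_cited` is a hypothetical helper, not part of the CLI):

```python
from collections import Counter

def top_cited(export: dict, n: int = 5) -> list:
    """Rank papers by in-degree (times cited) within the exported graph."""
    counts = Counter(edge["cited_doi"] for edge in export["edges"])
    return counts.most_common(n)
```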