Litseer Roadmap

Vision

Litseer is a technology intelligence platform for automated, reproducible literature review and technology assessment. The core principle: deterministic, verifiable outputs with minimal built-in AI — structured data that humans and LLMs can independently verify.

The end goal is near-full automation of the technology lit review and SME interview pipeline, designed to integrate with LLMs for high-quality verified analysis without depending on them.

Version Plan

Version  Focus                                                Status
v0.1     Core search, dedup, quality, export, cache           Alpha
v0.2     Local citation graph + technology portfolio search   Planned
v0.3     Unstructured reference parsing                       Planned
v0.4     PDF ingestion with OCR                               Planned
v0.5     Report generation (quad charts, tech assessment)     Planned
v0.6     Run diffing and coverage analysis                    Planned
v1.0     Interactive graph dashboard (Rust/WASM)              Planned

v0.1 — Core Functionality (current)

  • Multi-source search: OpenAlex, Semantic Scholar, CrossRef, NASA NTRS, IEEE Xplore, AIAA, SAE, SKYbrary
  • Citation snowballing (forward + backward) via API-backed sources
  • Quality tier filtering
  • Cross-source deduplication (DOI + title normalization)
  • Export: BibTeX, JSON, markdown summary
  • SQLite response cache with 7-day TTL
  • GitHub Actions CI (Python 3.11-3.14)
  • AGPL-3.0 dual licensing

v0.2 — Local Citation Graph + Technology Portfolio

Citation Graph

Build a permanent local citation graph that accumulates over time, enabling citation walking even for sources without citation graph APIs.

Every time a source adapter returns papers with structured reference DOIs (OpenAlex, Semantic Scholar, CrossRef), the edges (citing_doi -> cited_doi) are stored in a local SQLite database. Forward citations emerge for free: if papers A and B both cite paper C, and we've seen A and B, we know C's citers without asking any API.

Schema:

papers:   paper_id, doi, title, title_normalized, authors, year, venue, source_db
edges:    citing_id -> cited_id (with source attribution)
meta:     schema versioning for forward migrations
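The schema above, and the "forward citations for free" lookup it enables, can be sketched with Python's sqlite3 module. This is an illustrative sketch, not the shipped DDL: table and column names follow the schema listing, but constraints and the `forward_citations` helper are assumptions.

```python
import sqlite3

# In-memory stand-in for ~/.cache/litseer/graph.db
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE papers (
    paper_id INTEGER PRIMARY KEY,
    doi TEXT UNIQUE,
    title TEXT,
    title_normalized TEXT,
    authors TEXT,
    year INTEGER,
    venue TEXT,
    source_db TEXT
);
CREATE TABLE edges (
    citing_id INTEGER REFERENCES papers(paper_id),
    cited_id  INTEGER REFERENCES papers(paper_id),
    source_db TEXT,                 -- which API reported this edge
    PRIMARY KEY (citing_id, cited_id)
);
CREATE TABLE meta (key TEXT PRIMARY KEY, value TEXT);
INSERT INTO meta VALUES ('schema_version', '1');
""")

def forward_citations(conn, doi):
    """Who cites this DOI? Answered from the local graph, no API call."""
    rows = conn.execute("""
        SELECT p_citing.doi
        FROM edges
        JOIN papers p_citing ON p_citing.paper_id = edges.citing_id
        JOIN papers p_cited  ON p_cited.paper_id  = edges.cited_id
        WHERE p_cited.doi = ?
    """, (doi,)).fetchall()
    return [r[0] for r in rows]
```

If papers A and B were ever stored with reference edges to C, `forward_citations(conn, doi_of_C)` returns A and B purely from the accumulated edges.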

New capabilities:

  • --source local for cite-walk — queries the accumulated local graph
  • litseer graph stats — paper count, edge count, top-cited papers
  • litseer graph export — JSON, DOT, or GraphML format
  • Automatic graph population during search and cite-walk operations
  • DOI-less papers matched by normalized title + year (±1 year tolerance)
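One plausible shape for the DOI-less matching: normalize titles (lowercase, strip punctuation, collapse whitespace) and accept a candidate whose year is within the tolerance. The function names and exact normalization rules here are illustrative assumptions, not the shipped implementation.

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    title = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", title).strip()

def match_paper(candidate: dict, known_papers: list, year_tolerance: int = 1):
    """Match a DOI-less candidate against known (title_normalized, year) records."""
    key = normalize_title(candidate["title"])
    for paper in known_papers:
        if (paper["title_normalized"] == key
                and abs(paper["year"] - candidate["year"]) <= year_tolerance):
            return paper
    return None
```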

Technology Portfolio

Batch search across a folder of technology definition YAML files, with per-technology output folders.

Directory structure:

techs/
  turbine-cooling.yaml
  additive-mfg-cooling.yaml
  cmc-materials.yaml
output/
  turbine-cooling/
    search-2026-03-14.json
    new-refs-2026-03-14.bib
    summary-2026-03-14.md
  additive-mfg-cooling/
    ...

New capabilities:

  • litseer portfolio techs/ — batch search all YAML configs in a directory
  • Per-technology output subfolders with bib files and summaries
  • Portfolio-level summary (cross-technology coverage, shared references)
  • Designed for scheduled/automated runs (cron, CI)
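The batch loop behind the directory structure above can be sketched as follows. This is a minimal sketch: `run_portfolio` and its `run_search` callback are hypothetical names standing in for the real search pipeline, and only pathlib globbing is used (no YAML parsing).

```python
from datetime import date
from pathlib import Path

def run_portfolio(techs_dir: str, output_dir: str, run_search):
    """Run a search per tech YAML; write dated outputs to per-technology folders.

    `run_search` stands in for the real pipeline: it takes a config path
    and returns (results_json, bibtex, summary_md) as strings.
    """
    stamp = date.today().isoformat()
    for config in sorted(Path(techs_dir).glob("*.yaml")):
        out = Path(output_dir) / config.stem          # e.g. output/turbine-cooling/
        out.mkdir(parents=True, exist_ok=True)
        results, bib, summary = run_search(config)
        (out / f"search-{stamp}.json").write_text(results)
        (out / f"new-refs-{stamp}.bib").write_text(bib)
        (out / f"summary-{stamp}.md").write_text(summary)
```

Because the loop is a pure function of the configs in `techs/`, it suits unattended cron or CI runs.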

Scope

  • ~700 LOC new source code
  • ~200 LOC modifications to existing files
  • ~450 LOC tests

v0.3 — Unstructured Reference Parsing

Parse bibliography text blobs from sources that don't provide structured DOIs (e.g., NASA NTRS gives "Smith, J., 'Turbine Cooling', AIAA J., 2023").

Rule-based regex parser extracts author, year, title, venue, DOI from reference strings. Parsed references are matched against the local graph using trigram similarity, or optionally resolved via CrossRef API.
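Trigram similarity here can be read as Jaccard similarity over character trigrams; a minimal sketch (the padding scheme and any match threshold are assumptions):

```python
def trigrams(s: str) -> set:
    """Character trigrams of a lowercased, space-padded string."""
    s = f"  {s.lower()} "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two strings' trigram sets, in [0.0, 1.0]."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```

A parsed title is matched against `title_normalized` values in the local graph and accepted when the similarity clears a chosen threshold; anything below falls through to the optional CrossRef resolution.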

No LLM calls — all parsing is deterministic/rule-based. Output is structured for external LLM consumption if desired.

New capabilities:

  • litseer graph resolve — match unresolved references to known papers
  • --resolve-refs flag on search/cite-walk for automatic resolution
  • unresolved_refs table tracks what hasn't been matched yet
  • Confidence scoring for parsed references

Parser patterns (priority order):

  1. DOI extraction (regex, high confidence)
  2. Year extraction (4-digit number in the 1900-2099 range)
  3. Author extraction (APA-style and abbreviated "et al." forms)
  4. Title extraction (longest capitalized phrase heuristic)
  5. Venue matching (against a known abbreviation list)
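The two highest-confidence patterns might look like this — the regexes are illustrative sketches, not the shipped patterns:

```python
import re

# DOIs start "10." followed by a 4-9 digit registrant code and a suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s,;]+", re.IGNORECASE)
# Years restricted to 1900-2099 to avoid matching page numbers etc.
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")

def parse_reference(ref: str) -> dict:
    """Extract DOI and year from a raw reference string, if present."""
    doi = DOI_RE.search(ref)
    year = YEAR_RE.search(ref)
    return {
        "doi": doi.group(0).rstrip(".") if doi else None,  # drop sentence-final period
        "year": int(year.group(0)) if year else None,
    }
```

Because every pattern is a plain regex, the same reference string always parses the same way, which keeps the pipeline deterministic.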

Scope

  • ~500 LOC new source code (refparse.py, refmatch.py)
  • ~100 LOC modifications to existing files
  • ~400 LOC tests

v0.4 — PDF Reference Extraction

Extract bibliography sections directly from PDF papers.

  1. Extract text from PDF using pdfplumber
  2. Detect bibliography section by header scan or last-15% heuristic
  3. Split into individual reference strings
  4. Feed into the v0.3 rule-based reference parser -> local graph
  5. OCR fallback via pytesseract for scanned PDFs
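Step 2 above (header scan with a last-15% fallback) could be sketched like this, operating on already-extracted text; the header list and the 15% cutoff are assumptions:

```python
import re

# Common bibliography section headers, matched as whole lines.
REF_HEADERS = re.compile(r"^\s*(references|bibliography|works cited)\s*$",
                         re.IGNORECASE | re.MULTILINE)

def bibliography_section(text: str) -> str:
    """Return the likely bibliography: text after a header, else the last 15%."""
    match = REF_HEADERS.search(text)
    if match:
        return text[match.end():].strip()
    return text[int(len(text) * 0.85):].strip()
```

The output is then split into individual reference strings and handed to the v0.3 parser.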

New capabilities:

  • litseer ingest paper.pdf — extract and parse references from a PDF
  • litseer ingest papers/ — batch ingest a directory of PDFs
  • Optional dependencies: pdfplumber, pytesseract, Pillow

Scope

  • ~200 LOC new source code (pdfextract.py)
  • ~55 LOC modifications to existing files
  • ~150 LOC tests

v0.5 — Report Generation

Automated technology assessment reports from search results and graph data.

Quad charts:

  • Typst templates for technology quad charts (TRL, risk, maturity, impact)
  • Export to PPTX for presentation use
  • NASA Aviation Technology Report style formatting

Tech assessment reports:

  • Per-technology summaries with citation statistics
  • Cross-technology comparison tables
  • Gap analysis (areas with sparse literature coverage)
  • SME interview integration — structured templates for capturing expert input alongside automated lit search results

New capabilities:

  • litseer report <tech-folder> — generate quad chart + assessment
  • litseer report --format pptx|pdf|typst
  • Typst template system for customizable report layouts
  • SME input YAML format for structured expert annotations

v0.6 — Run Diffing and Coverage Analysis

Compare searches at different dates to track how the literature evolves.

Run diffing:

  • litseer diff output/run-2026-01/ output/run-2026-03/ — show new papers, removed papers, citation count changes
  • Keyword impact analysis — show how changing search terms affects coverage
  • Timeline visualization of literature growth per technology area
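At its core, run diffing reduces to set operations over the DOIs present in two runs' exports; a minimal sketch (the `diff_runs` name and input shape are assumptions):

```python
def diff_runs(old_papers: list, new_papers: list) -> dict:
    """Compare two runs' paper lists by DOI; each paper is a dict with a 'doi' key."""
    old = {p["doi"] for p in old_papers}
    new = {p["doi"] for p in new_papers}
    return {
        "added": sorted(new - old),     # papers only in the newer run
        "removed": sorted(old - new),   # papers that dropped out
        "kept": sorted(old & new),      # stable coverage
    }
```

Because runs are deterministic, any difference is attributable to new literature or to a change in the search configuration, never to run-to-run noise.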

Coverage analysis:

  • Identify under-researched subtopics within a technology area
  • Detect citation clusters and isolated papers
  • Map technology maturity by publication volume and recency trends

v1.0 — Interactive Graph Dashboard

Portable, browser-based visualization for exploring the citation graph.

Technology:

  • Rust compiled to WASM for performance
  • 3D force-directed graph layout (WebGPU or WebGL)
  • Served by litseer dashboard as a local web server

Capabilities:

  • Interactive 3D citation graph with zoom, pan, filter
  • Visual YAML config builder — drag to define search clusters, see coverage update in real time
  • Run diff visualization — overlay two runs, highlight new/changed nodes
  • Technology area coloring and clustering
  • Click-through to paper details, DOI links, local PDF viewer
  • Export graph views as images for reports

Architecture Notes

Design Principles

  1. Deterministic: Same inputs produce same outputs. No randomness, no LLM-dependent features baked in.
  2. Verifiable: Every claim traces to a specific paper, DOI, and source. Structured output enables automated and human verification.
  3. LLM-ready: All output formats (JSON, graph export, structured markdown) are designed for LLM consumption without requiring LLM integration.
  4. Incremental: The citation graph and cache accumulate value over time. Each run adds to the knowledge base.
  5. Portable: No cloud dependencies. Everything runs locally. Graph and cache are SQLite files that can be backed up or shared.

Graph vs Cache

            Response Cache                  Citation Graph
Database    ~/.cache/litseer/responses.db   ~/.cache/litseer/graph.db
Lifespan    Ephemeral (7-day TTL)           Permanent (grows over time)
Content     Raw API JSON                    Normalized paper metadata + edges
Purpose     Reduce API calls                Enable local citation walking

Privacy

The citation graph is strictly local. No graph data is sent to any API. The optional reference resolution (--resolve-refs) sends only raw reference text to CrossRef, which is already public data from published papers.

Structured Output for LLM Consumption

Graph export produces machine-readable JSON:

{
  "papers": [{"doi": "10.xxx", "title": "...", "year": 2023}],
  "edges": [{"citing_doi": "10.xxx", "cited_doi": "10.yyy"}],
  "stats": {"paper_count": 150, "edge_count": 420}
}

This can be fed to an external LLM for analysis (centrality, gap detection, cluster identification) without any built-in LLM dependency.
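Simple analyses also fall out of this JSON without any LLM at all — for example, in-degree (times cited) as a crude centrality measure. A sketch, assuming the export shape shown above (`top_cited` is a hypothetical helper, not part of the CLI):

```python
from collections import Counter

def top_cited(export: dict, n: int = 5) -> list:
    """Rank papers by in-degree (times cited) within the exported graph."""
    counts = Counter(edge["cited_doi"] for edge in export["edges"])
    return counts.most_common(n)
```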