# Litseer Roadmap

## Vision
Litseer is a technology intelligence platform for automated, reproducible literature review and technology assessment. The core principle: deterministic, verifiable outputs with minimal built-in AI — structured data that humans and LLMs can independently verify.
The end goal is near-full automation of the technology lit review and SME interview pipeline, designed to integrate with LLMs for high-quality verified analysis without depending on them.
## Version Plan
| Version | Focus | Status |
|---|---|---|
| v0.1 | Core search, dedup, quality, export, cache | Alpha |
| v0.2 | Local citation graph + technology portfolio search | Planned |
| v0.3 | Unstructured reference parsing | Planned |
| v0.4 | PDF ingestion with OCR | Planned |
| v0.5 | Report generation (quad charts, tech assessment) | Planned |
| v0.6 | Run diffing and coverage analysis | Planned |
| v1.0 | Interactive graph dashboard (Rust/WASM) | Planned |
## v0.1 — Core Functionality (current)
- Multi-source search: OpenAlex, Semantic Scholar, CrossRef, NASA NTRS, IEEE Xplore, AIAA, SAE, SKYbrary
- Citation snowballing (forward + backward) via API-backed sources
- Quality tier filtering
- Cross-source deduplication (DOI + title normalization)
- Export: BibTeX, JSON, markdown summary
- SQLite response cache with 7-day TTL
- GitHub Actions CI (Python 3.11-3.14)
- AGPL-3.0 dual licensing
## v0.2 — Local Citation Graph + Technology Portfolio

### Citation Graph
Build a permanent local citation graph that accumulates over time, enabling citation walking even for sources without citation graph APIs.
Every time a source adapter returns papers with structured reference DOIs (OpenAlex, Semantic Scholar, CrossRef), the edges (citing_doi -> cited_doi) are stored in a local SQLite database. Forward citations emerge for free: if papers A and B both cite paper C, and we've seen A and B, we know C's citers without asking any API.
Schema:

- papers: paper_id, doi, title, title_normalized, authors, year, venue, source_db
- edges: citing_id -> cited_id (with source attribution)
- meta: schema versioning for forward migrations
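A minimal sketch of this schema and the "free" forward-citation lookup, using Python's sqlite3. Column details and the edge `source` column are illustrative, not the actual migration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real graph lives at ~/.cache/litseer/graph.db
conn.executescript("""
CREATE TABLE papers (
    paper_id INTEGER PRIMARY KEY,
    doi TEXT UNIQUE,
    title TEXT,
    title_normalized TEXT,
    authors TEXT,
    year INTEGER,
    venue TEXT,
    source_db TEXT
);
CREATE TABLE edges (
    citing_id INTEGER REFERENCES papers(paper_id),
    cited_id INTEGER REFERENCES papers(paper_id),
    source TEXT,                 -- which adapter reported this edge
    PRIMARY KEY (citing_id, cited_id)
);
CREATE TABLE meta (key TEXT PRIMARY KEY, value TEXT);
""")

# Papers A and B both cite C: C's citers are known without any API call.
conn.executemany("INSERT INTO papers (paper_id, doi) VALUES (?, ?)",
                 [(1, "10.1/a"), (2, "10.1/b"), (3, "10.1/c")])
conn.executemany("INSERT INTO edges (citing_id, cited_id, source) VALUES (?, ?, ?)",
                 [(1, 3, "openalex"), (2, 3, "crossref")])
citers = [row[0] for row in conn.execute(
    "SELECT p.doi FROM edges e JOIN papers p ON p.paper_id = e.citing_id "
    "WHERE e.cited_id = 3 ORDER BY p.doi")]
print(citers)  # ['10.1/a', '10.1/b']
```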
New capabilities:
- --source local for cite-walk — queries accumulated local graph
- litseer graph stats — paper count, edge count, top-cited papers
- litseer graph export — JSON, DOT, or GraphML format
- Automatic graph population during search and cite-walk operations
- DOI-less papers matched by normalized title + year (±1 year tolerance)
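The DOI-less matching rule could be sketched as below; `normalize_title` is a hypothetical helper and the real normalization may differ:

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace (illustrative)."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", "", title.lower())).strip()

def is_same_paper(title_a: str, year_a: int, title_b: str, year_b: int) -> bool:
    """DOI-less match: identical normalized titles within +/-1 year."""
    return (normalize_title(title_a) == normalize_title(title_b)
            and abs(year_a - year_b) <= 1)

is_same_paper("Turbine Cooling: A Review.", 2023,
              "turbine cooling  a review", 2024)  # True
```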
### Technology Portfolio Search
Batch search across a folder of technology definition YAML files, with per-technology output folders.
Directory structure:

    techs/
        turbine-cooling.yaml
        additive-mfg-cooling.yaml
        cmc-materials.yaml
    output/
        turbine-cooling/
            search-2026-03-14.json
            new-refs-2026-03-14.bib
            summary-2026-03-14.md
        additive-mfg-cooling/
            ...
New capabilities:
- litseer portfolio techs/ — batch search all YAML configs in a directory
- Per-technology output subfolders with bib files and summaries
- Portfolio-level summary (cross-technology coverage, shared references)
- Designed for scheduled/automated runs (cron, CI)
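The portfolio loop amounts to one output subfolder per technology YAML. A sketch, assuming the layout above; `portfolio_run` and the dated file names are illustrative, and the actual search/export step is elided:

```python
from datetime import date
from pathlib import Path

def portfolio_run(techs_dir: str, output_dir: str) -> list[Path]:
    """Batch over *.yaml configs, one dated output folder per technology."""
    stamp = date.today().isoformat()
    out_dirs = []
    for cfg in sorted(Path(techs_dir).glob("*.yaml")):
        tech_out = Path(output_dir) / cfg.stem   # e.g. output/turbine-cooling/
        tech_out.mkdir(parents=True, exist_ok=True)
        # ... run the search for this technology, then write dated artifacts:
        (tech_out / f"summary-{stamp}.md").touch()
        out_dirs.append(tech_out)
    return out_dirs
```

Because each run only appends dated files, the same command is safe to re-run from cron or CI.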
### Scope
- ~700 LOC new source code
- ~200 LOC modifications to existing files
- ~450 LOC tests
## v0.3 — Unstructured Reference Parsing
Parse bibliography text blobs from sources that don't provide structured DOIs (e.g., NASA NTRS gives "Smith, J., 'Turbine Cooling', AIAA J., 2023").
Rule-based regex parser extracts author, year, title, venue, DOI from reference strings. Parsed references are matched against the local graph using trigram similarity, or optionally resolved via CrossRef API.
No LLM calls — all parsing is deterministic/rule-based. Output is structured for external LLM consumption if desired.
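Trigram similarity here means Jaccard overlap of character trigrams, a common formulation; this is a sketch, not necessarily the exact metric used:

```python
def trigrams(s: str) -> set[str]:
    """Character trigrams of a whitespace-normalized, lowercased string."""
    s = " ".join(s.lower().split())
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity over character trigrams, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

trigram_similarity("Turbine Cooling", "Turbine Cooling")  # 1.0
```

Deterministic and cheap, so it can run over every unresolved reference on every run.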
New capabilities:
- litseer graph resolve — match unresolved references to known papers
- --resolve-refs flag on search/cite-walk for automatic resolution
- unresolved_refs table tracks what hasn't been matched yet
- Confidence scoring for parsed references
Parser patterns (priority order):

1. DOI extraction (regex, high confidence)
2. Year extraction (4-digit number in 1900-2099 range)
3. Author extraction (APA-style, abbreviated "et al." forms)
4. Title extraction (longest capitalized phrase heuristic)
5. Venue matching (against known abbreviation list)
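The first two passes might look like this; the regexes, field names, and confidence values are illustrative, not the shipped patterns:

```python
import re

DOI_RE = re.compile(r"10\.\d{4,9}/[^\s,;]+")   # crude DOI pattern
YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")       # 4-digit year, 1900-2099

def parse_reference(ref: str) -> dict:
    """Sketch of passes 1-2: DOI (high confidence), then year."""
    doi = DOI_RE.search(ref)
    year = YEAR_RE.search(ref)
    return {
        "doi": doi.group(0) if doi else None,
        "year": int(year.group(0)) if year else None,
        "confidence": 0.9 if doi else (0.5 if year else 0.1),
    }

parse_reference("Smith, J., 'Turbine Cooling', AIAA J., 2023")
# {'doi': None, 'year': 2023, 'confidence': 0.5}
```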
### Scope
- ~500 LOC new source code (refparse.py, refmatch.py)
- ~100 LOC modifications to existing files
- ~400 LOC tests
## v0.4 — PDF Reference Extraction
Extract bibliography sections directly from PDF papers.
- Extract text from PDF using pdfplumber
- Detect bibliography section by header scan or last-15% heuristic
- Split into individual reference strings
- Feed into the v0.3 reference parser -> local graph
- OCR fallback via pytesseract for scanned PDFs
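The detection step can be sketched as follows; `bibliography_text` is a hypothetical helper combining the two heuristics named above:

```python
import re

HEADER_RE = re.compile(r"^\s*(references|bibliography)\s*$", re.I | re.M)

def bibliography_text(full_text: str) -> str:
    """Return the bibliography portion of extracted PDF text.

    Header scan first; if no header line is found, fall back to
    the last 15% of the document.
    """
    m = HEADER_RE.search(full_text)
    if m:
        return full_text[m.end():]
    return full_text[int(len(full_text) * 0.85):]
```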
New capabilities:
- litseer ingest paper.pdf — extract and parse references from a PDF
- litseer ingest papers/ — batch ingest a directory of PDFs
- Optional dependencies: pdfplumber, pytesseract, Pillow
### Scope
- ~200 LOC new source code (pdfextract.py)
- ~55 LOC modifications to existing files
- ~150 LOC tests
## v0.5 — Report Generation
Automated technology assessment reports from search results and graph data.
Quad charts:

- Typst templates for technology quad charts (TRL, risk, maturity, impact)
- Export to PPTX for presentation use
- NASA Aviation Technology Report style formatting
Tech assessment reports:

- Per-technology summaries with citation statistics
- Cross-technology comparison tables
- Gap analysis (areas with sparse literature coverage)
- SME interview integration — structured templates for capturing expert input alongside automated lit search results
New capabilities:
- litseer report <tech-folder> — generate quad chart + assessment
- litseer report --format pptx|pdf|typst
- Typst template system for customizable report layouts
- SME input YAML format for structured expert annotations
## v0.6 — Run Diffing and Coverage Analysis
Compare searches at different dates to track how the literature evolves.
Run diffing:
- litseer diff output/run-2026-01/ output/run-2026-03/ — show new papers, removed papers, citation count changes
- Keyword impact analysis — show how changing search terms affects coverage
- Timeline visualization of literature growth per technology area
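Assuming each run export contains a papers list keyed by DOI (the exact file layout may differ), the core diff reduces to set operations:

```python
import json
from pathlib import Path

def diff_runs(old_path: str, new_path: str) -> dict:
    """Compare two run export JSON files by DOI (illustrative format)."""
    def dois(path: str) -> set[str]:
        papers = json.loads(Path(path).read_text())["papers"]
        return {p["doi"] for p in papers}
    old, new = dois(old_path), dois(new_path)
    return {"added": sorted(new - old), "removed": sorted(old - new)}
```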
Coverage analysis:

- Identify under-researched subtopics within a technology area
- Detect citation clusters and isolated papers
- Map technology maturity by publication volume and recency trends
## v1.0 — Interactive Graph Dashboard
Portable, browser-based visualization for exploring the citation graph.
Technology:
- Rust compiled to WASM for performance
- 3D force-directed graph layout (WebGPU or WebGL)
- litseer dashboard serves it as a local web server
Capabilities:

- Interactive 3D citation graph with zoom, pan, filter
- Visual YAML config builder — drag to define search clusters, see coverage update in real time
- Run diff visualization — overlay two runs, highlight new/changed nodes
- Technology area coloring and clustering
- Click-through to paper details, DOI links, local PDF viewer
- Export graph views as images for reports
## Architecture Notes

### Design Principles
- Deterministic: Same inputs produce same outputs. No randomness, no LLM-dependent features baked in.
- Verifiable: Every claim traces to a specific paper, DOI, and source. Structured output enables automated and human verification.
- LLM-ready: All output formats (JSON, graph export, structured markdown) are designed for LLM consumption without requiring LLM integration.
- Incremental: The citation graph and cache accumulate value over time. Each run adds to the knowledge base.
- Portable: No cloud dependencies. Everything runs locally. Graph and cache are SQLite files that can be backed up or shared.
### Graph vs Cache
| | Response Cache | Citation Graph |
|---|---|---|
| Database | ~/.cache/litseer/responses.db | ~/.cache/litseer/graph.db |
| Lifespan | Ephemeral (7-day TTL) | Permanent (grows over time) |
| Content | Raw API JSON | Normalized paper metadata + edges |
| Purpose | Reduce API calls | Enable local citation walking |
### Privacy
The citation graph is strictly local. No graph data is sent to any API.
The optional reference resolution (--resolve-refs) sends only raw
reference text to CrossRef, which is already public data from published papers.
### Structured Output for LLM Consumption
Graph export produces machine-readable JSON:

    {
      "papers": [{"doi": "10.xxx", "title": "...", "year": 2023}],
      "edges": [{"citing_doi": "10.xxx", "cited_doi": "10.yyy"}],
      "stats": {"paper_count": 150, "edge_count": 420}
    }
This can be fed to an external LLM for analysis (centrality, gap detection, cluster identification) without any built-in LLM dependency.
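Simple analyses need no LLM at all. For example, an in-degree ranking over the export above (`top_cited` is illustrative):

```python
from collections import Counter

def top_cited(export: dict, n: int = 3) -> list[tuple[str, int]]:
    """Most-cited DOIs by in-degree over the exported edge list —
    a simple stand-in for the centrality analyses mentioned above."""
    counts = Counter(edge["cited_doi"] for edge in export["edges"])
    return counts.most_common(n)

export = {
    "papers": [{"doi": "10.1/a"}, {"doi": "10.1/b"}, {"doi": "10.1/c"}],
    "edges": [
        {"citing_doi": "10.1/a", "cited_doi": "10.1/c"},
        {"citing_doi": "10.1/b", "cited_doi": "10.1/c"},
        {"citing_doi": "10.1/a", "cited_doi": "10.1/b"},
    ],
}
print(top_cited(export))  # [('10.1/c', 2), ('10.1/b', 1)]
```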