
ADR-008: Input Validation and Security Hardening

Date: 2026-03-15
Status: Accepted
Completed: 2026-03-15

Context

Litseer ingests large volumes of untrusted data from external APIs (OpenAlex, Semantic Scholar, CrossRef, IEEE, NASA NTRS, AIAA, SAE, SKYbrary) and from user-supplied YAML configs. This data flows into SQL queries, BibTeX export, file path operations, and the local citation graph. As the tool accumulates data over time and adds features like PDF ingestion (v0.4) and unstructured reference parsing (v0.3), the attack surface grows.

Although the current codebase has no known exploitable vulnerabilities (all SQL is parameterized, YAML is loaded with safe_load, regexes are linear), we need a systematic approach to input validation that scales with the roadmap.

Threat Model

Data Sources (untrusted)

  1. API JSON responses — titles, authors, venues, DOIs, abstracts from 8+ APIs
  2. YAML config files — user-authored, could contain path traversal in existing_bib_path
  3. BibTeX files — parsed for existing citekeys/DOIs during dedup
  4. PDF text (v0.4) — extracted text from arbitrary PDFs
  5. Reference strings (v0.3) — unstructured citation text from any source

Attack Vectors

Vector                      Current Status                                             Risk
SQL injection               Mitigated — all queries parameterized                      Low
YAML code execution         Mitigated — yaml.safe_load()                               Low
ReDoS (regex backtracking)  Mitigated — linear patterns only                           Low
LIKE wildcard injection     Mitigated — normalize_title strips wildcards               Low
BibTeX injection            Mitigated — _escape_bibtex + triple-brace wrapping         Low
DOT label injection         Mitigated — quote escaping + truncation                    Low
Path traversal              Needs hardening — existing_bib_path not validated          Medium
Field length exhaustion     Needs hardening — no limits on API response fields         Medium
Unicode/encoding attacks    Needs hardening — no normalization of control chars        Medium
Log injection               Needs hardening — API strings logged without sanitization  Low

Decision

Adopt a defense-in-depth input validation strategy with three layers:

Layer 1: Boundary Validation (v0.1 milestone)

Validate and sanitize all data at system boundaries — where external data first enters the application.

  • Field length limits: Cap title (1000), authors (2000), venue (500), abstract (10000), DOI (500) chars at the Work model level
  • Control character stripping: Remove \x00-\x1f (except \n, \t) from all text fields
  • Path validation: Resolve existing_bib_path and verify it doesn't escape the config file's parent directory
  • DOI format validation: Reject DOIs that do not match the 10.\d{4,}/ prefix pattern
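The boundary rules above could be sketched as a small sanitization module. This is a sketch only: the limits and the 10.\d{4,}/ pattern come from this ADR, but the function signatures and FIELD_LIMITS structure are assumptions, not the shipped litseer API.

```python
import re

# Per-field length caps from the boundary-validation rules above (assumed layout).
FIELD_LIMITS = {"title": 1000, "authors": 2000, "venue": 500,
                "abstract": 10000, "doi": 500}

# Control characters \x00-\x1f, except \n (\x0a) and \t (\x09).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f]")

# DOIs start with "10.", a 4+ digit registrant code, then a slash.
_DOI_RE = re.compile(r"^10\.\d{4,}/\S+$")


def sanitize_text(value: str, field: str = "title") -> str:
    """Strip disallowed control characters, then cap length for the field."""
    cleaned = _CONTROL_CHARS.sub("", value)
    limit = FIELD_LIMITS.get(field, 1000)
    return cleaned[:limit]


def validate_doi(doi: str) -> bool:
    """Accept only DOIs matching the 10.\\d{4,}/ prefix pattern."""
    return bool(_DOI_RE.match(doi))
```

Stripping before truncating matters: truncating first could let a payload of control characters crowd out legitimate content within the cap.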

Layer 2: Structural Validation (v0.2 milestone)

Validate data structure and semantics before processing.

  • Year range validation: Reject years outside 1900-2100
  • URL scheme validation: Only allow http://, https:// in URL fields
  • Config schema validation: Validate YAML config against expected schema before processing (cluster IDs, query strings, year ranges)
  • Graph integrity checks: Validate paper_id references in edge operations
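The Layer 2 checks are simple enough to sketch directly. The ranges and allowed schemes are from this ADR; the function names and the shape of the graph-integrity check are assumptions for illustration.

```python
from urllib.parse import urlparse


def validate_year(year: int) -> bool:
    """Reject publication years outside the plausible 1900-2100 window."""
    return 1900 <= year <= 2100


def validate_url(url: str) -> bool:
    """Allow only http:// and https:// schemes in URL fields."""
    return urlparse(url).scheme in ("http", "https")


def validate_edge(paper_ids: set[str], src: str, dst: str) -> None:
    """Graph integrity: both endpoints of an edge must be known paper_ids."""
    for node in (src, dst):
        if node not in paper_ids:
            raise KeyError(f"unknown paper_id in edge: {node!r}")
```

An allowlist of schemes (rather than a denylist) also blocks javascript:, file:, and data: URLs without enumerating them.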

Layer 3: Output Encoding (v0.3 milestone)

Ensure all output formats are properly encoded for their context.

  • BibTeX: Audit escaping for completeness (add $, ^, ~ to escape list)
  • Markdown: Escape user-controlled strings in markdown output
  • JSON: Already safe via json.dumps
  • DOT: Already safe via quote escaping + truncation
  • Log messages: Sanitize before logging (truncate, strip control chars)
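Log sanitization is the one Layer 3 item with no existing mitigation, so a sketch may help. The truncation length and replacement policy here are assumptions; note that, unlike field sanitization, newlines must also go, since an embedded \n in an API-supplied title could forge an extra log line.

```python
import re

# For logs, strip ALL controls including \n and \r to prevent forged log lines.
_LOG_CONTROL = re.compile(r"[\x00-\x1f]")
MAX_LOG_FIELD = 200  # assumed cap for untrusted strings in log messages


def sanitize_for_log(value: str) -> str:
    """Replace control characters and truncate before interpolating an
    untrusted string into a log message."""
    cleaned = _LOG_CONTROL.sub(" ", value)
    if len(cleaned) > MAX_LOG_FIELD:
        cleaned = cleaned[:MAX_LOG_FIELD] + "…"
    return cleaned
```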

Implementation Plan

Milestone 1: Core Input Sanitization ✅ Complete

  • Add sanitize_text() and validate_doi() to a new src/litseer/sanitize.py
  • Integrate into Work model __post_init__ or source adapter parsing
  • Add validate_config_paths() to config loader
  • Tests for each sanitization function with adversarial inputs
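The validate_config_paths() piece deserves a sketch, since path traversal is the highest-risk open item. This assumes the rule stated in Layer 1 (existing_bib_path must not escape the config file's parent directory); the exact signature is hypothetical.

```python
from pathlib import Path


def validate_config_paths(config_path: Path, bib_path: str) -> Path:
    """Resolve existing_bib_path relative to the config file and refuse
    values that escape the config's parent directory."""
    base = config_path.parent.resolve()
    # Joining an absolute bib_path replaces base, so absolute escapes
    # are caught by the same relative_to() check as ../ traversal.
    resolved = (base / bib_path).resolve()
    try:
        resolved.relative_to(base)
    except ValueError:
        raise ValueError(f"existing_bib_path escapes config directory: {bib_path!r}")
    return resolved
```

Resolving before checking is the important step: a naive string check would miss paths like subdir/../../secret.bib.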

Milestone 2: Structural Validation ✅ Complete

  • Add config schema validation with clear error messages
  • Add year/URL/field semantic validation
  • Graph integrity checks (paper_id validation in add_edge())
  • Fuzz testing for source adapter parsers

Milestone 3: Output Hardening ✅ Complete

  • Audit and extend BibTeX escaping
  • Add markdown escaping for user-controlled fields (_escape_markdown())
  • Log sanitization

Milestone 4: Ongoing (with each new feature)

  • v0.3 (reference parsing): Validate parsed reference fields
  • v0.4 (PDF ingestion): Sanitize extracted text, validate file types
  • Every new source adapter: Apply sanitize_text to all response fields

Consequences

Positive:

  • Systematic defense against current and future input vectors
  • Each milestone is independently valuable and testable
  • Sanitization module is reusable across all source adapters
  • Positions the codebase for safe PDF/reference parsing in v0.3-v0.4

Negative:

  • Small performance overhead from validation (negligible vs. API latency)
  • May truncate legitimately long titles in rare cases (1000 char limit)
  • Adds a dependency between source adapters and the sanitize module

Neutral:

  • Does not protect against denial-of-service at the network level (rate limiting is handled by source adapters and the cache)