# ADR-008: Input Validation and Security Hardening

Date: 2026-03-15 · Status: Accepted · Completed: 2026-03-15
## Context
Litseer ingests large volumes of untrusted data from external APIs (OpenAlex, Semantic Scholar, CrossRef, IEEE, NASA NTRS, AIAA, SAE, SKYbrary) and from user-supplied YAML configs. This data flows into SQL queries, BibTeX export, file path operations, and the local citation graph. As the tool accumulates data over time and adds features like PDF ingestion (v0.4) and unstructured reference parsing (v0.3), the attack surface grows.
Although the current codebase has no known exploitable vulnerabilities (all SQL is parameterized, YAML uses `safe_load`, regexes are linear), we need a systematic approach to input validation that scales with the roadmap.
## Threat Model

### Data Sources (untrusted)
- API JSON responses — titles, authors, venues, DOIs, abstracts from 8+ APIs
- YAML config files — user-authored, could contain path traversal in `existing_bib_path`
- BibTeX files — parsed for existing citekeys/DOIs during dedup
- PDF text (v0.4) — extracted text from arbitrary PDFs
- Reference strings (v0.3) — unstructured citation text from any source
### Attack Vectors
| Vector | Current Status | Risk |
|---|---|---|
| SQL injection | Mitigated — all queries parameterized | Low |
| YAML code execution | Mitigated — `yaml.safe_load()` | Low |
| ReDoS (regex backtracking) | Mitigated — linear patterns only | Low |
| LIKE wildcard injection | Mitigated — `normalize_title` strips wildcards | Low |
| BibTeX injection | Mitigated — `_escape_bibtex` + triple-brace wrapping | Low |
| DOT label injection | Mitigated — quote escaping + truncation | Low |
| Path traversal | Needs hardening — `existing_bib_path` not validated | Medium |
| Field length exhaustion | Needs hardening — no limits on API response fields | Medium |
| Unicode/encoding attacks | Needs hardening — no normalization of control chars | Medium |
| Log injection | Needs hardening — API strings logged without sanitization | Low |
## Decision
Adopt a defense-in-depth input validation strategy with three layers:
### Layer 1: Boundary Validation (v0.1 milestone)
Validate and sanitize all data at system boundaries — where external data first enters the application.
- Field length limits: Cap title (1000), authors (2000), venue (500), abstract (10000), DOI (500) chars at the Work model level
- Control character stripping: Remove `\x00`-`\x1f` (except `\n`, `\t`) from all text fields
- Path validation: Resolve `existing_bib_path` and verify it doesn't escape the config file's parent directory
- DOI format validation: Reject DOIs that don't match the `10.\d{4,}/` pattern
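A minimal sketch of the Layer 1 helpers, assuming the `sanitize_text()`/`validate_doi()` names used elsewhere in this ADR and the field limits listed above; the actual signatures in `src/litseer/sanitize.py` may differ:

```python
import re

# Illustrative caps mirroring the limits above (title 1000, authors 2000, ...).
MAX_LENGTHS = {"title": 1000, "authors": 2000, "venue": 500,
               "abstract": 10000, "doi": 500}

# Control characters \x00-\x1f except \n (0x0a) and \t (0x09).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

# DOI prefix shape from the rule above: 10.<4+ digit registrant>/<suffix>.
_DOI_RE = re.compile(r"^10\.\d{4,}/\S+$")


def sanitize_text(value: str, max_len: int = 1000) -> str:
    """Strip disallowed control characters, then cap the field length."""
    cleaned = _CONTROL_CHARS.sub("", value)
    return cleaned[:max_len]


def validate_doi(doi: str) -> bool:
    """Accept only DOIs that match the 10.\\d{4,}/ pattern and length cap."""
    return len(doi) <= MAX_LENGTHS["doi"] and bool(_DOI_RE.fullmatch(doi))
```

Running the stripping pass before truncation means an attacker cannot use control characters to push meaningful content past the length cap.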
### Layer 2: Structural Validation (v0.2 milestone)
Validate data structure and semantics before processing.
- Year range validation: Reject years outside 1900-2100
- URL scheme validation: Only allow `http://`, `https://` in URL fields
- Config schema validation: Validate YAML config against expected schema before processing (cluster IDs, query strings, year ranges)
- Graph integrity checks: Validate paper_id references in edge operations
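The year and URL rules above can be sketched as follows; the function names are illustrative, not the actual litseer API:

```python
from urllib.parse import urlparse

# Bounds and schemes taken from the Layer 2 rules above.
YEAR_MIN, YEAR_MAX = 1900, 2100
ALLOWED_SCHEMES = {"http", "https"}


def validate_year(year: int) -> bool:
    """Reject publication years outside the plausible 1900-2100 range."""
    return YEAR_MIN <= year <= YEAR_MAX


def validate_url(url: str) -> bool:
    """Allow only http(s) URLs that actually have a host component."""
    parts = urlparse(url)
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)
```

Checking the scheme via `urlparse` (rather than a prefix string match) also rejects oddities like `javascript:` or `file:` URLs that slip through naive startswith checks.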
### Layer 3: Output Encoding (v0.3 milestone)
Ensure all output formats are properly encoded for their context.
- BibTeX: Audit escaping for completeness (add `$`, `^`, `~` to escape list)
- Markdown: Escape user-controlled strings in markdown output
- JSON: Already safe via `json.dumps`
- DOT: Already safe via quote escaping + truncation
- Log messages: Sanitize before logging (truncate, strip control chars)
## Implementation Plan

### Milestone 1: Core Input Sanitization ✅ Complete
- Add `sanitize_text()` and `validate_doi()` to a new `src/litseer/sanitize.py`
- Integrate into Work model `__post_init__` or source adapter parsing
- Add `validate_config_paths()` to config loader
- Tests for each sanitization function with adversarial inputs
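The path-traversal check for `existing_bib_path` could look like the following sketch; `validate_config_path` is a simplified single-path variant of the `validate_config_paths()` mentioned above, and its signature is an assumption:

```python
from pathlib import Path


def validate_config_path(config_file: Path, candidate: str) -> Path:
    """Resolve a config-relative path and refuse anything that escapes
    the config file's parent directory (e.g. '../../etc/passwd')."""
    base = config_file.resolve().parent
    resolved = (base / candidate).resolve()
    try:
        # relative_to raises ValueError if resolved is outside base.
        resolved.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes config directory: {candidate!r}") from None
    return resolved
```

Resolving before comparing is the important step: it collapses `..` segments and symlinks, so the containment check runs on the real target rather than the literal string.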
### Milestone 2: Structural Validation ✅ Complete
- Add config schema validation with clear error messages
- Add year/URL/field semantic validation
- Graph integrity checks (paper_id validation in `add_edge()`)
- Fuzz testing for source adapter parsers
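The graph integrity check can be illustrated with a toy graph; `CitationGraph` here is hypothetical and only shows the `add_edge()` guard, not litseer's actual graph implementation:

```python
class CitationGraph:
    """Toy in-memory citation graph illustrating edge-endpoint validation."""

    def __init__(self) -> None:
        self.nodes: set[str] = set()
        self.edges: set[tuple[str, str]] = set()

    def add_node(self, paper_id: str) -> None:
        self.nodes.add(paper_id)

    def add_edge(self, citing: str, cited: str) -> None:
        # Reject edges whose endpoints were never registered as nodes,
        # so malformed API responses cannot create dangling references.
        for pid in (citing, cited):
            if pid not in self.nodes:
                raise KeyError(f"unknown paper_id: {pid!r}")
        self.edges.add((citing, cited))
```

Failing fast at `add_edge()` keeps the invariant "every edge endpoint is a known node" local to one method instead of requiring a global consistency sweep later.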
### Milestone 3: Output Hardening ✅ Complete
- Audit and extend BibTeX escaping
- Add markdown escaping for user-controlled fields (`_escape_markdown()`)
- Log sanitization
### Milestone 4: Ongoing (with each new feature)
- v0.3 (reference parsing): Validate parsed reference fields
- v0.4 (PDF ingestion): Sanitize extracted text, validate file types
- Every new source adapter: Apply `sanitize_text()` to all response fields
## Consequences
Positive:

- Systematic defense against current and future input vectors
- Each milestone is independently valuable and testable
- Sanitization module is reusable across all source adapters
- Positions the codebase for safe PDF/reference parsing in v0.3-v0.4

Negative:

- Small performance overhead from validation (negligible vs. API latency)
- May truncate legitimately long titles in rare cases (1000 char limit)
- Adds a dependency between source adapters and the sanitize module

Neutral:

- Does not protect against denial-of-service at the network level (rate limiting is handled by source adapters and the cache)