
ADR-008: Input Validation and Security Hardening

Date: 2026-03-15
Status: Accepted
Completed: 2026-03-15

Context

Litseer ingests large volumes of untrusted data from external APIs (OpenAlex, Semantic Scholar, CrossRef, IEEE, NASA NTRS, AIAA, SAE, SKYbrary) and from user-supplied YAML configs. This data flows into SQL queries, BibTeX export, file path operations, and the local citation graph. As the tool accumulates data over time and adds features like PDF ingestion (v0.4) and unstructured reference parsing (v0.3), the attack surface grows.

Although the current codebase has no known exploitable vulnerabilities (all SQL is parameterized, YAML is loaded with safe_load, regexes are linear), we need a systematic approach to input validation that scales with the roadmap.

Threat Model

Data Sources (untrusted)

  1. API JSON responses — titles, authors, venues, DOIs, abstracts from 8+ APIs
  2. YAML config files — user-authored, could contain path traversal in existing_bib_path
  3. BibTeX files — parsed for existing citekeys/DOIs during dedup
  4. PDF text (v0.4) — extracted text from arbitrary PDFs
  5. Reference strings (v0.3) — unstructured citation text from any source

Attack Vectors

Vector                      Current Status                                             Risk
SQL injection               Mitigated — all queries parameterized                      Low
YAML code execution         Mitigated — yaml.safe_load()                               Low
ReDoS (regex backtracking)  Mitigated — linear patterns only                           Low
LIKE wildcard injection     Mitigated — normalize_title strips wildcards               Low
BibTeX injection            Mitigated — _escape_bibtex + triple-brace wrapping         Low
DOT label injection         Mitigated — quote escaping + truncation                    Low
Path traversal              Needs hardening — existing_bib_path not validated          Medium
Field length exhaustion     Needs hardening — no limits on API response fields         Medium
Unicode/encoding attacks    Needs hardening — no normalization of control chars        Medium
Log injection               Needs hardening — API strings logged without sanitization  Low

Decision

Adopt a defense-in-depth input validation strategy with three layers:

Layer 1: Boundary Validation (v0.1 milestone)

Validate and sanitize all data at system boundaries — where external data first enters the application.

  • Field length limits: Cap title (1000), authors (2000), venue (500), abstract (10000), DOI (500) chars at the Work model level
  • Control character stripping: Remove \x00-\x1f (except \n, \t) from all text fields
  • Path validation: Resolve existing_bib_path and verify it doesn't escape the config file's parent directory
  • DOI format validation: Reject DOIs that do not match the 10.\d{4,}/ prefix pattern
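The boundary rules above could be sketched as a small sanitization module. This is a sketch only: the limits and the 10.\d{4,}/ pattern come from this ADR, but the function signatures and FIELD_LIMITS structure are assumptions, not the shipped litseer API.

```python
import re

# Per-field length caps from the boundary-validation rules above (assumed layout).
FIELD_LIMITS = {"title": 1000, "authors": 2000, "venue": 500,
                "abstract": 10000, "doi": 500}

# Control characters \x00-\x1f, except \n (\x0a) and \t (\x09).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f]")

# DOIs start with "10.", a 4+ digit registrant code, then a slash.
_DOI_RE = re.compile(r"^10\.\d{4,}/\S+$")


def sanitize_text(value: str, field: str = "title") -> str:
    """Strip disallowed control characters, then cap length for the field."""
    cleaned = _CONTROL_CHARS.sub("", value)
    limit = FIELD_LIMITS.get(field, 1000)
    return cleaned[:limit]


def validate_doi(doi: str) -> bool:
    """Accept only DOIs matching the 10.\\d{4,}/ prefix pattern."""
    return bool(_DOI_RE.match(doi))
```

Stripping before truncating matters: truncating first could let a payload of control characters crowd out legitimate content within the cap.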

Layer 2: Structural Validation (v0.2 milestone)

Validate data structure and semantics before processing.

  • Year range validation: Reject years outside 1900-2100
  • URL scheme validation: Only allow http://, https:// in URL fields
  • Config schema validation: Validate YAML config against expected schema before processing (cluster IDs, query strings, year ranges)
  • Graph integrity checks: Validate paper_id references in edge operations
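The Layer 2 checks are simple enough to sketch directly. The ranges and allowed schemes are from this ADR; the function names and the shape of the graph-integrity check are assumptions for illustration.

```python
from urllib.parse import urlparse


def validate_year(year: int) -> bool:
    """Reject publication years outside the plausible 1900-2100 window."""
    return 1900 <= year <= 2100


def validate_url(url: str) -> bool:
    """Allow only http:// and https:// schemes in URL fields."""
    return urlparse(url).scheme in ("http", "https")


def validate_edge(paper_ids: set[str], src: str, dst: str) -> None:
    """Graph integrity: both endpoints of an edge must be known paper_ids."""
    for node in (src, dst):
        if node not in paper_ids:
            raise KeyError(f"unknown paper_id in edge: {node!r}")
```

An allowlist of schemes (rather than a denylist) also blocks javascript:, file:, and data: URLs without enumerating them.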

Layer 3: Output Encoding (v0.3 milestone)

Ensure all output formats are properly encoded for their context.

  • BibTeX: Audit escaping for completeness (add $, ^, ~ to escape list)
  • Markdown: Escape user-controlled strings in markdown output
  • JSON: Already safe via json.dumps
  • DOT: Already safe via quote escaping + truncation
  • Log messages: Sanitize before logging (truncate, strip control chars)
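Log sanitization is the one Layer 3 item with no existing mitigation, so a sketch may help. The truncation length and replacement policy here are assumptions; note that, unlike field sanitization, newlines must also go, since an embedded \n in an API-supplied title could forge an extra log line.

```python
import re

# For logs, strip ALL controls including \n and \r to prevent forged log lines.
_LOG_CONTROL = re.compile(r"[\x00-\x1f]")
MAX_LOG_FIELD = 200  # assumed cap for untrusted strings in log messages


def sanitize_for_log(value: str) -> str:
    """Replace control characters and truncate before interpolating an
    untrusted string into a log message."""
    cleaned = _LOG_CONTROL.sub(" ", value)
    if len(cleaned) > MAX_LOG_FIELD:
        cleaned = cleaned[:MAX_LOG_FIELD] + "…"
    return cleaned
```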

Implementation Plan

Milestone 1: Core Input Sanitization ✅ Complete

  • Add sanitize_text() and validate_doi() to a new src/litseer/sanitize.py
  • Integrate into Work model __post_init__ or source adapter parsing
  • Add validate_config_paths() to config loader
  • Tests for each sanitization function with adversarial inputs
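The validate_config_paths() piece deserves a sketch, since path traversal is the highest-risk open item. This assumes the rule stated in Layer 1 (existing_bib_path must not escape the config file's parent directory); the exact signature is hypothetical.

```python
from pathlib import Path


def validate_config_paths(config_path: Path, bib_path: str) -> Path:
    """Resolve existing_bib_path relative to the config file and refuse
    values that escape the config's parent directory."""
    base = config_path.parent.resolve()
    # Joining an absolute bib_path replaces base, so absolute escapes
    # are caught by the same relative_to() check as ../ traversal.
    resolved = (base / bib_path).resolve()
    try:
        resolved.relative_to(base)
    except ValueError:
        raise ValueError(f"existing_bib_path escapes config directory: {bib_path!r}")
    return resolved
```

Resolving before checking is the important step: a naive string check would miss paths like subdir/../../secret.bib.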

Milestone 2: Structural Validation ✅ Complete

  • Add config schema validation with clear error messages
  • Add year/URL/field semantic validation
  • Graph integrity checks (paper_id validation in add_edge())
  • Fuzz testing for source adapter parsers

Milestone 3: Output Hardening ✅ Complete

  • Audit and extend BibTeX escaping
  • Add markdown escaping for user-controlled fields (_escape_markdown())
  • Log sanitization

Milestone 4: Ongoing (with each new feature)

  • v0.3 (reference parsing): Validate parsed reference fields
  • v0.4 (PDF ingestion): Sanitize extracted text, validate file types
  • Every new source adapter: Apply sanitize_text to all response fields

Consequences

Positive:

  • Systematic defense against current and future input vectors
  • Each milestone is independently valuable and testable
  • Sanitization module is reusable across all source adapters
  • Positions the codebase for safe PDF/reference parsing in v0.3-v0.4

Negative:

  • Small performance overhead from validation (negligible vs. API latency)
  • May truncate legitimately long titles in rare cases (1000 char limit)
  • Adds a dependency between source adapters and the sanitize module

Neutral:

  • Does not protect against denial-of-service at the network level (rate limiting is handled by source adapters and the cache)