ADR-007: Bipartite Sparse Matrix Pattern for Network Analysis

Date: 2026-03-14
Status: Accepted

Context

Bibliometric network analysis (co-citation, bibliographic coupling, keyword co-occurrence, author collaboration) requires building and analyzing relationship networks from paper metadata. The R bibliometrix package (the most widely used open-source bibliometric tool) uses an elegant pattern that we should adopt.

Design from bibliometrix (R)

The core pattern, identified by reading bibliometrix's cocMatrix.R and biblioNetwork.R:

  1. Build a binary sparse bipartite matrix A (papers x items):
     • Rows = papers, columns = unique items (references, keywords, authors)
     • A[i,j] = 1 if paper i contains item j

  2. Derive all network types from A via matrix multiplication:
     • Co-citation / co-occurrence: A.T @ A (items that appear together)
     • Bibliographic coupling: A @ A.T (papers that share items)

  3. Normalize with association strength (van Eck & Waltman): S[i,j] = C[i,j] / (C[i,i] * C[j,j]), where C is a network matrix from step 2
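The three steps above can be sketched on a toy matrix. This is a minimal illustration, not the planned implementation: the 3 x 4 matrix and the helper name association_strength are made up for the example, and rows or columns with a zero diagonal are simply left at zero, which is an assumption about how empty items should be handled.

```python
import numpy as np
from scipy import sparse

# Toy bipartite matrix A: 3 papers (rows) x 4 references (columns).
A = sparse.csr_matrix(np.array([
    [1, 1, 0, 0],   # paper 0 cites refs 0 and 1
    [1, 1, 1, 0],   # paper 1 cites refs 0, 1 and 2
    [0, 0, 1, 1],   # paper 2 cites refs 2 and 3
]))

cocitation = (A.T @ A).tocsr()   # refs x refs: how often two refs are cited together
coupling = (A @ A.T).tocsr()     # papers x papers: how many refs two papers share


def association_strength(C):
    """S[i,j] = C[i,j] / (C[i,i] * C[j,j]), computed without densifying C.

    Hypothetical helper: items with a zero diagonal get a zero row and column.
    """
    d = C.diagonal().astype(float)
    inv = np.divide(1.0, d, out=np.zeros_like(d), where=d > 0)
    D = sparse.diags(inv)
    return (D @ C @ D).tocsr()


S = association_strength(cocitation)
# refs 0 and 1 are co-cited twice and each is cited twice: 2 / (2 * 2) = 0.5
```

Scaling by a diagonal matrix on both sides keeps the result sparse, which matters once the item dimension reaches tens of thousands of columns.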

Decision

Adopt the bipartite sparse matrix pattern using scipy.sparse for all network analysis. This will be implemented in a new src/litseer/networks.py module that reads from the DuckDB citation graph.

Implementation Plan

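import scipy.sparse

# One function builds the bipartite matrix from any field
def build_bipartite(works, field_accessor) -> scipy.sparse.csr_matrix:
    ...  # to be implemented in src/litseer/networks.py

# All networks are one-liners
A = build_bipartite(works, lambda w: w.cited_references)
A_kw = build_bipartite(works, lambda w: w.keywords)
coupling_matrix = A @ A.T             # bibliographic coupling
cocitation_matrix = A.T @ A           # co-citation
keyword_cooccurrence = A_kw.T @ A_kw  # keyword co-occurrence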
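One possible shape for the matrix builder is sketched below. Several things here are assumptions rather than part of the plan: the function is named build_bipartite_labeled to mark it as hypothetical, it returns the column labels alongside the matrix (the planned signature returns only the matrix), and works are plain dicts standing in for the real Work model.

```python
import numpy as np
from scipy import sparse


def build_bipartite_labeled(works, field_accessor):
    """Return (A, items): a binary papers x items CSR matrix and its column labels.

    Hypothetical sketch; the real builder may differ in signature and input type.
    """
    index = {}            # item -> column index, assigned in first-seen order
    rows, cols = [], []
    for i, work in enumerate(works):
        # dict.fromkeys drops repeats within one paper, so A stays binary
        for item in dict.fromkeys(field_accessor(work) or ()):
            j = index.setdefault(item, len(index))
            rows.append(i)
            cols.append(j)
    data = np.ones(len(rows), dtype=np.int32)
    A = sparse.csr_matrix((data, (rows, cols)), shape=(len(works), len(index)))
    return A, list(index)


# Usage with made-up records; "networks" is repeated in the second paper on purpose.
works = [
    {"keywords": ["sparse", "networks"]},
    {"keywords": ["networks", "citation", "networks"]},
]
A_kw, items = build_bipartite_labeled(works, lambda w: w["keywords"])
keyword_cooccurrence = A_kw.T @ A_kw
```

Deduplicating per paper matters because scipy sums repeated (row, column) pairs when it builds a matrix from coordinates, which would otherwise turn a repeated keyword into a 2.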

Python equivalents of R dependencies:

  • Matrix::sparseMatrix -> scipy.sparse.csr_matrix
  • igraph -> networkx or python-igraph
  • stringdist -> rapidfuzz
  • tm (text mining) -> scikit-learn TfidfVectorizer
  • RAKE (keyword extraction) -> yake

Rationale

  • One matrix builder function, all network types via linear algebra
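  • Sparse matrices handle 5K papers x 50K references efficiently, since each paper cites only a small fraction of the references and only nonzero entries are stored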
  • Same pattern used by VOSviewer and bibliometrix (proven at scale)
  • scipy.sparse is battle-tested and performant
  • Enables future centrality/community detection via networkx conversion
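The networkx conversion mentioned in the last point could look roughly like this. It is a sketch under assumptions: the toy matrix and keyword labels are invented, networkx 2.7 or later is assumed for from_scipy_sparse_array, and degree centrality stands in for whichever centrality measure is eventually chosen.

```python
import numpy as np
import networkx as nx
from scipy import sparse

# Toy papers x keywords matrix with hypothetical labels.
A = sparse.csr_matrix(np.array([
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
]))
labels = ["kw_a", "kw_b", "kw_c"]

C = (A.T @ A).tocsr()
# Remove the diagonal (per-item counts) so the graph has no self-loops.
C = (C - sparse.diags(C.diagonal())).tocsr()
C.eliminate_zeros()

# Nonzero entries become weighted edges; nodes are column indices until relabeled.
G = nx.from_scipy_sparse_array(C)
G = nx.relabel_nodes(G, dict(enumerate(labels)))

centrality = nx.degree_centrality(G)
# kw_b co-occurs with both other keywords, so it has the highest centrality
```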

Consequences

  • scipy becomes an optional dependency, required only when network analysis features are used
  • The Work model now carries cited_references, keywords, and indexed_keywords to feed the matrix builder
  • Graph export formats (GraphML, GEXF) will enable interop with Gephi/Cytoscape
  • Future: keyword importance analysis (litsearchr pattern) for automated search term discovery
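One way to keep scipy out of the import path until network features are actually used is a small on-demand import helper. This is only a sketch: the helper name require and the "networks" extra are hypothetical and not taken from the project's packaging.

```python
import importlib


def require(module_name, extra="networks"):
    """Import a module on demand, raising an actionable error if it is missing.

    Hypothetical helper; the extra name is an assumption about packaging.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for network analysis; "
            f"install it with the '{extra}' extra."
        ) from exc
```

A network function would then call require("scipy.sparse") at the top of its body instead of importing scipy at module level, so users who never touch network analysis do not need it installed.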