ADR-007: Bipartite Sparse Matrix Pattern for Network Analysis
Date: 2026-03-14 Status: Accepted
Context
Bibliometric network analysis (co-citation, bibliographic coupling, keyword co-occurrence, author collaboration) requires building and analyzing relationship networks from paper metadata. The R bibliometrix package (the most widely used open-source bibliometric tool) uses an elegant pattern that we should adopt.
Design from bibliometrix (R)
The core pattern, discovered from analyzing bibliometrix's cocMatrix.R
and biblioNetwork.R:
- Build a binary sparse bipartite matrix A (papers x items):
  - Rows = papers, Columns = unique items (references, keywords, authors)
  - A[i,j] = 1 if paper i contains item j
- All network types derive from A via matrix multiplication:
  - Co-citation / Co-occurrence: A.T @ A (items that appear together)
  - Bibliographic coupling: A @ A.T (papers that share items)
- Normalize with association strength (van Eck & Waltman), where C is the derived network matrix and its diagonal holds each item's occurrence count:
  S[i,j] = C[i,j] / (C[i,i] * C[j,j])
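The pattern above can be illustrated on a toy matrix. This is a minimal sketch, not bibliometrix's code: the 3 x 4 matrix and the helper name `association_strength` are assumptions made for illustration, and the normalization is done on a dense array for readability.

```python
import numpy as np
from scipy import sparse

# Toy bipartite matrix (illustrative data): 3 papers (rows) x 4 references (columns).
A = sparse.csr_matrix(np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
]))

cocitation = (A.T @ A).toarray()  # items x items: references cited together
coupling = (A @ A.T).toarray()    # papers x papers: shared references


def association_strength(C):
    """Hypothetical helper: S[i,j] = C[i,j] / (C[i,i] * C[j,j])."""
    C = np.asarray(C, dtype=float)
    d = np.diag(C)                # occurrence count of each item
    denom = np.outer(d, d)
    S = np.zeros_like(C)
    # Leave S at 0 wherever the denominator is 0 (items that never occur).
    np.divide(C, denom, out=S, where=denom > 0)
    return S


S = association_strength(cocitation)
```

Here references 0 and 1 are cited together by two papers, reference 0 occurs twice and reference 1 three times, so `S[0, 1]` is 2 / (2 * 3).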
Decision
Adopt the bipartite sparse matrix pattern using scipy.sparse for all
network analysis. This will be implemented in a new src/litseer/networks.py
module that reads from the DuckDB citation graph.
Implementation Plan
import scipy.sparse

# One function builds the bipartite matrix from any field
def build_bipartite(works, field_accessor) -> scipy.sparse.csr_matrix:
    ...  # to be implemented in src/litseer/networks.py

# All networks are one-liners (A and A_kw are matrices returned by build_bipartite)
coupling_matrix = A @ A.T             # bibliographic coupling
cocitation_matrix = A.T @ A           # co-citation
keyword_cooccurrence = A_kw.T @ A_kw  # keyword co-occurrence
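One possible shape for the builder is sketched below. Several things here are assumptions rather than part of the plan: the `Work` dataclass is a hypothetical stand-in for the project's model, `field_accessor` is taken to be a callable returning an iterable of strings, and the function returns the column labels alongside the matrix (the planned signature returns only the matrix).

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np
from scipy import sparse


@dataclass
class Work:
    """Hypothetical stand-in for the project's Work model."""
    cited_references: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)


def build_bipartite(works, field_accessor):
    """Return (A, items): A is a binary papers x items CSR matrix."""
    index = {}           # item -> column number, in first-seen order
    rows, cols = [], []
    for i, work in enumerate(works):
        # dict.fromkeys de-duplicates within a paper so A stays binary,
        # while keeping a deterministic column order.
        for item in dict.fromkeys(field_accessor(work)):
            j = index.setdefault(item, len(index))
            rows.append(i)
            cols.append(j)
    data = np.ones(len(rows), dtype=np.int64)
    A = sparse.csr_matrix((data, (rows, cols)), shape=(len(works), len(index)))
    return A, list(index)


works = [
    Work(cited_references=["r1", "r2"]),
    Work(cited_references=["r2", "r3", "r2"]),  # duplicate is collapsed
    Work(cited_references=[]),                  # paper with no references
]
A, refs = build_bipartite(works, lambda w: w.cited_references)
coupling_matrix = A @ A.T
```

Returning the labels keeps row and column indices interpretable once the matrix is multiplied; whether the real module does this or keeps a separate lookup is an open design choice.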
Python equivalents of R dependencies:
- Matrix::sparseMatrix -> scipy.sparse.csr_matrix
- igraph -> networkx or python-igraph
- stringdist -> rapidfuzz
- tm (text mining) -> scikit-learn TfidfVectorizer
- RAKE (keyword extraction) -> yake
Rationale
- One matrix builder function, all network types via linear algebra
- Sparse matrices handle 5K papers x 50K references efficiently
- Same pattern used by VOSviewer and bibliometrix (proven at scale)
- scipy.sparse is battle-tested and performant
- Enables future centrality/community detection via networkx conversion
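The networkx conversion mentioned in the last point could look roughly like the sketch below. It assumes networkx 2.7 or later (for `from_scipy_sparse_array`); the toy matrix and the labels are illustrative only, and degree centrality stands in for whichever measure is eventually chosen.

```python
import networkx as nx
import numpy as np
from scipy import sparse

# Illustrative data: 3 papers x 3 references.
A = sparse.csr_array(np.array([
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
]))
labels = ["ref_a", "ref_b", "ref_c"]  # hypothetical column labels

C = sparse.csr_array(A.T @ A)  # co-citation counts
C.setdiag(0)                   # drop self-loops (occurrence counts)
C.eliminate_zeros()            # remove the now-zero stored entries

# Each remaining nonzero becomes an edge with a "weight" attribute.
G = nx.from_scipy_sparse_array(C)
G = nx.relabel_nodes(G, dict(enumerate(labels)))

centrality = nx.degree_centrality(G)
```

In this toy graph `ref_b` is co-cited with both other references, so it ends up with the highest degree centrality.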
Consequences
- scipy becomes a dependency (only when network analysis features are used)
- The Work model now carries cited_references, keywords, indexed_keywords to feed the matrix builder
- Graph export formats (GraphML, GEXF) will enable interop with Gephi/Cytoscape
- Future: keyword importance analysis (litsearchr pattern) for automated search term discovery