Pipeline Overview: Big-O per Stage

The table below summarises the asymptotic cost of every major PROTEA stage. Symbols used throughout this section:

Symbol	Meaning
`N`	Number of sequences in the corpus (reference + query)
`L`	Sequence length (residues). ESM-2 / ESM-C cap at 1024; T5/ProstT5 cap at the configured `max_length`.
`d`	Embedding dimension (320–2560 depending on PLM)
`n`	Reference pool size (the “train” snapshot). Typically 300k–550k in the PROTEA v226 corpus.
`q`	Query pool size (the “test” snapshot). Typically 3k–15k.
`k`	Number of nearest neighbours (3, 5, or 10 in the FARM-EXP.13 grid)
`T`	Number of candidate GO terms transferred per (query, reference) pair
`F`	Number of features in the LightGBM feature vector (fixed at schema compile time; currently O(100))

Big-O summary by pipeline stage
Stage	Dominant cost	Notes	Page
PLM forward pass (embedding)	O(N * L^2 * d) for attention-based PLMs	O(L^2 * d) per sequence, summed over all N sequences; chunked to cap peak VRAM	PLM Attention Complexity
KNN search (numpy backend)	O(q * n * d) per aspect	Matrix multiply + argpartition; chunked over queries	KNN Search Complexity
KNN search (FAISS Flat)	O(q * n * d)	Same asymptotic; 3-10x faster constant via BLAS	KNN Search Complexity
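KNN search (FAISS IVFFlat / HNSW)	O(q * sqrt(n) * d) approximate	The sqrt(n) figure assumes an IVF index with roughly sqrt(n) lists; HNSW typically scales closer to log(n). Recall trade-off; not used in PROTEA production	KNN Search Complexity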
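GO transfer + feature engineering	O(q * k * T) + O(q * k) pairwise alignments	Alignment is the hot path despite the smaller count, because a single alignment costs far more than a single term transfer; cached + parallelised in PR #421	Export Pipeline Complexity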
anc2vec lookup	O(1) per GO term (hash table)	200-dim vectors; entire index loaded into RAM once	Anc2Vec Complexity
LightGBM inference	O(q * k * T * F * depth)	Linear in candidates; negligible vs KNN	LightGBM Complexity
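
As a rough illustration of the "matrix multiply + argpartition; chunked over queries" pattern listed for the numpy backend, the sketch below computes brute-force top-`k` neighbours. It is not PROTEA's actual code: the function name `knn_cosine_chunked`, the `chunk_size` parameter, and the choice of cosine similarity as the metric are all assumptions made for this example.

```python
import numpy as np


def knn_cosine_chunked(queries, refs, k, chunk_size=1024):
    """Brute-force top-k search by cosine similarity, chunked over queries.

    Hypothetical helper for illustration only (not a PROTEA API).
    Assumes every row has non-zero norm and that k <= number of references.
    Cost: O(q * n * d) for the matmuls plus O(q * n) for the partitions.
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    q_norm = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    r_norm = refs / np.linalg.norm(refs, axis=1, keepdims=True)

    q = q_norm.shape[0]
    idx_out = np.empty((q, k), dtype=np.int64)
    sim_out = np.empty((q, k), dtype=q_norm.dtype)

    # Chunking bounds the (chunk, n) similarity matrix held in memory.
    for start in range(0, q, chunk_size):
        stop = min(start + chunk_size, q)
        sims = q_norm[start:stop] @ r_norm.T  # shape (chunk, n)

        # argpartition moves the k largest similarities into the first
        # k columns in O(n) per row, but leaves them unordered.
        part = np.argpartition(-sims, k - 1, axis=1)[:, :k]
        part_sims = np.take_along_axis(sims, part, axis=1)

        # Sort only those k candidates, best first.
        order = np.argsort(-part_sims, axis=1)
        idx_out[start:stop] = np.take_along_axis(part, order, axis=1)
        sim_out[start:stop] = np.take_along_axis(part_sims, order, axis=1)

    return idx_out, sim_out
```

Calling `knn_cosine_chunked(query_emb, ref_emb, k=5)` on arrays of shape `(q, d)` and `(n, d)` returns two `(q, 5)` arrays: reference indices and their similarities, best first. Chunking does not change the O(q * n * d) total; it only limits peak memory to one `(chunk_size, n)` block at a time.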

End-to-end: FARM-EXP.13 measured times

The 24-cell grid (8 PLMs x 3 values of `k`) exported from the v226 corpus was timed on the production stack (single A100, 48-core CPU, 256 GB RAM). The per-cell bottleneck was the GPU embedding pass for the two PLMs with the longest sequences (ProstT5-XL, ESM-C 600M); the KNN and feature steps always accounted for less than 10% of total wall-clock time.
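
To see why the embedding term can dwarf the KNN term, the asymptotic expressions from the summary table can be evaluated for one set of values inside the ranges quoted in the symbol table. The numbers below are illustrative choices, not measurements, and the helper names `embedding_ops` and `knn_ops` are hypothetical. The comparison ignores constant factors such as layer count and GPU versus CPU throughput, and it uses the 1024-residue cap for `L`, so it is an upper bound on the embedding side and only an order-of-magnitude sketch.

```python
def embedding_ops(N, L, d):
    """Operation count implied by O(N * L^2 * d), constants dropped."""
    return N * L**2 * d


def knn_ops(q, n, d):
    """Operation count implied by O(q * n * d), constants dropped."""
    return q * n * d


# Illustrative values chosen from the ranges in the symbol table.
n, q = 500_000, 10_000   # reference and query pool sizes
N = n + q                # corpus size (reference + query)
L = 1024                 # sequence-length cap, used as a worst case
d = 1280                 # one embedding width within 320-2560

ratio = knn_ops(q, n, d) / embedding_ops(N, L, d)
print(f"KNN ops / embedding ops = {ratio:.4f}")
```

The ratio is sensitive to `L`: because the embedding term grows with `L^2`, shorter average sequences narrow the gap considerably, so this arithmetic is a sanity check on the scaling rather than a substitute for the measured timings.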

Detailed flamegraphs are planned under docs/perf/ as part of the upcoming PERF.1 profiling slice (scalene).