Export Pipeline Complexity

Note

Hot path: O(q * k) pairwise sequence alignments, parallelised across a process pool (PR #421).

The export_research_dataset operation materialises train.parquet and eval.parquet for the reranker lab. Its runtime is dominated by the pairwise alignment step (Needleman-Wunsch plus Smith-Waterman, NW+SW) in _KnnTransferRunner._compute_pair_features: for each (query, reference) pair in the KNN result, parasail computes the alignment with a full traceback (BLOSUM62, gap-open 10, gap-extend 1). With q query proteins and k neighbours per query, there are up to q * k unique pairs to align. At the FARM-EXP.13 scale (q ~ 5000, k = 10) this is 50k alignments per aspect; at ~2 ms each single-threaded, that comes to ~100 s per aspect before parallelisation.
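
The arithmetic behind that estimate can be written out as a small cost model. This is only a back-of-the-envelope sketch: the function name `alignment_seconds` and its parameters are illustrative and not part of the codebase, and the ~3x factor is the process-pool benchmark quoted for PR #421, not a guarantee.

```python
def alignment_seconds(q: int, k: int, ms_per_alignment: float, speedup: float = 1.0) -> float:
    """Estimated wall-clock seconds to align q * k pairs for one aspect."""
    pairs = q * k  # one alignment per (query, neighbour) pair
    return pairs * ms_per_alignment / 1000.0 / speedup


# FARM-EXP.13 scale: q ~ 5000 queries, k = 10 neighbours, ~2 ms per alignment.
serial = alignment_seconds(q=5000, k=10, ms_per_alignment=2.0)
# Same workload with the ~3x speedup reported on a 12-core box.
parallel = alignment_seconds(q=5000, k=10, ms_per_alignment=2.0, speedup=3.0)

print(f"serial: {serial:.0f} s, with ~3x pool speedup: {parallel:.0f} s")
```

The model ignores cache hits, so it describes a cold first run only.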

Two structural optimisations landed in PR #421:

Process-level parallelism. Alignments are distributed across a ProcessPoolExecutor whose size is set by PROTEA_PAIR_FEATURE_WORKERS (default: 1, i.e. serial). parasail's traceback variants do not release the GIL efficiently, so threads do not scale; processes do (~3x on a 12-core box).
Persistent on-disk cache. A SQLite file keyed by the ordered sequence pair plus the alignment parameters stores the computed alignment features. Within a single PLM, the K=10 neighbour set is a superset of the K=5 and K=3 sets, so the first run at K=10 pays the alignment cost and subsequent runs at smaller K are near-free cache hits.
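
The caching idea can be illustrated with a minimal sketch. Everything here is hypothetical: the `AlignmentCache` class, the `cache_key` helper, the table layout, and the `toy_features` stand-in for the real alignment are assumptions made for illustration, not the schema or API used in PR #421.

```python
import hashlib
import json
import sqlite3


def cache_key(query_seq: str, ref_seq: str, params: dict) -> str:
    """Hash the ORDERED pair plus parameters; (a, b) and (b, a) get different keys."""
    payload = json.dumps([query_seq, ref_seq, params], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AlignmentCache:
    """Hypothetical memoising wrapper around a single SQLite file."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS alignments "
            "(key TEXT PRIMARY KEY, features TEXT NOT NULL)"
        )
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, query_seq, ref_seq, params, compute):
        key = cache_key(query_seq, ref_seq, params)
        row = self.conn.execute(
            "SELECT features FROM alignments WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            self.hits += 1
            return json.loads(row[0])
        self.misses += 1
        features = compute(query_seq, ref_seq)
        self.conn.execute(
            "INSERT INTO alignments (key, features) VALUES (?, ?)",
            (key, json.dumps(features)),
        )
        self.conn.commit()
        return features


def toy_features(a: str, b: str) -> dict:
    """Stand-in for the real alignment: fraction of identical positions."""
    return {"identity": sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))}


PARAMS = {"matrix": "blosum62", "gap_open": 10, "gap_extend": 1}
cache = AlignmentCache()
query = "MKTAYIAKQR"
# Ten distinct neighbours, ordered by rank: each differs from the query at one position.
neighbours = [query[:i] + "W" + query[i + 1:] for i in range(10)]

for ref in neighbours[:10]:  # K=10 run: every pair is a miss
    cache.get_or_compute(query, ref, PARAMS, toy_features)
for ref in neighbours[:5]:   # K=5 run: a prefix of the K=10 set, so all hits
    cache.get_or_compute(query, ref, PARAMS, toy_features)

print(f"hits={cache.hits} misses={cache.misses}")
```

Including the alignment parameters in the key means a change of matrix or gap penalties never returns stale features; it simply produces misses.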

"""Parallel + persistent computation of per-pair alignment features.

The export pipeline's hotspot is the per-(query, reference) alignment in
``_KnnTransferRunner._compute_pair_features``: a single-threaded triple
loop calling ``compute_alignment`` (parasail NW+SW with traceback) at a
few hundred pairs/s. Two structural wins live here:

1. **Process-level parallelism** over the unique alignment pairs. parasail's
   traceback variants do not release the GIL well enough for threads to
   scale, so a :class:`ProcessPoolExecutor` is used (benchmarked ~3x on a
   12-core box). The taxonomy lookups stay in the parent process because
   they are cheap (an ``lru_cache`` over ete3 lineages) and process
   re-init of the ete3 sqlite handle would dwarf the work.

2. **A persistent on-disk cache** keyed by the ordered sequence pair plus
   the alignment parameters. Intra-PLM the K=10 neighbour set is a
   superset of K=5 / K=3, so alignments computed once for the largest-K
   dataset make the smaller-K datasets (and any re-run) near-free.
   ``protea.training`` is serialised (one job at a time) so there is no
   concurrent-write contention; a plain sqlite file is enough.

This module is value-preserving by construction: the alignment feature
dict it returns is exactly what ``compute_alignment`` returns, only
computed concurrently and memoised. Disabling parallelism (workers <= 1)
and the cache (``PROTEA_ALIGN_CACHE_DIR`` unset / cache-miss) reduces to
the original serial code path.
"""

from __future__ import annotations
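
The fallback behaviour the docstring describes (a process pool over the unique pairs, reducing to the serial path when workers <= 1) might look roughly like the sketch below. `toy_align` is a hypothetical stand-in for ``compute_alignment``, and `align_pairs` is an illustrative name; neither is taken from the module itself.

```python
from concurrent.futures import ProcessPoolExecutor


def toy_align(pair):
    """Hypothetical stand-in for compute_alignment: count matching positions."""
    query_seq, ref_seq = pair
    matches = sum(a == b for a, b in zip(query_seq, ref_seq))
    return {"matches": matches, "length": max(len(query_seq), len(ref_seq))}


def align_pairs(pairs, workers=1):
    """Align each unique (query, reference) pair once, optionally in a process pool."""
    unique = list(dict.fromkeys(pairs))  # de-duplicate while preserving order
    if workers <= 1:
        # Serial path: no pool, same results.
        results = [toy_align(p) for p in unique]
    else:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            # chunksize amortises inter-process overhead over many short tasks.
            results = list(pool.map(toy_align, unique, chunksize=64))
    return dict(zip(unique, results))


if __name__ == "__main__":
    pairs = [
        ("MKTAYIAK", "MKTAYIAR"),
        ("MKTAYIAK", "MKSAYIAK"),
        ("MKTAYIAK", "MKTAYIAR"),  # duplicate, aligned only once
    ]
    serial = align_pairs(pairs, workers=1)
    pooled = align_pairs(pairs, workers=2)
    assert serial == pooled  # parallelism must not change the values
    print(len(serial), "unique pairs aligned")
```

The worker function is defined at module level so it can be pickled, and the demo sits behind a `__main__` guard because start methods that re-import the main module (such as spawn) require it.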

Stage-by-stage breakdown

The operation is orchestrated in three session windows (see ExportResearchDatasetOperation.execute). Within the second window, _dump_to_stage calls the internal _KnnTransferRunner:

| Sub-step | Complexity | Parallelism |
| --- | --- | --- |
| Load embeddings from DB (`SequenceEmbedding`) | O(n + q) rows | Single DB query |
| KNN search (numpy backend, per aspect) | O(q * n * d) per aspect | Serial (one aspect at a time) |
| GO transfer | O(q * k * T) | Vectorised |
| Pairwise alignment | O(q * k) | ProcessPoolExecutor + SQLite cache |
| Taxonomy features | O(q * k) with ete3 lru_cache | In-process (cheap) |
| anc2vec lookup | O(q * k * T) | O(1) per term via hash table |
| PCA projection | O((n + q) * d * p) fit; O(q * d * p) transform | NumPy (p = 16 components, transductive) |
| Parquet write + MinIO upload | O(rows) | Sequential (single connection) |

Environment variables

| Variable | Effect |
| --- | --- |
| `PROTEA_PAIR_FEATURE_WORKERS` | Number of process-pool workers for alignment (default: 1, serial) |
| `PROTEA_ALIGN_CACHE_DIR` | Directory for the persistent SQLite alignment cache (unset disables) |
| `PROTEA_METHOD_NUMPY_QUERY_CHUNK` | Per-chunk query count for the numpy KNN backend (default: 500) |
| `PROTEA_ANC2VEC_PATH` | Path to the anc2vec NPZ artifact (required for export to succeed) |
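
As a rough illustration of how these variables and their documented defaults could be read in one place, consider the sketch below. The `ExportSettings` dataclass and `load_export_settings` function are hypothetical names, and clamping the worker count to at least 1 is an assumption; the actual parsing code may differ.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class ExportSettings:
    pair_feature_workers: int
    align_cache_dir: Optional[Path]
    numpy_query_chunk: int
    anc2vec_path: Path


def load_export_settings(env: Mapping[str, str] = os.environ) -> ExportSettings:
    """Read the export-related variables, applying the documented defaults."""
    anc2vec = env.get("PROTEA_ANC2VEC_PATH")
    if not anc2vec:
        # Documented as required for the export to succeed.
        raise RuntimeError("PROTEA_ANC2VEC_PATH must be set")
    cache_dir = env.get("PROTEA_ALIGN_CACHE_DIR")
    return ExportSettings(
        # Assumption: values below 1 are treated as serial.
        pair_feature_workers=max(1, int(env.get("PROTEA_PAIR_FEATURE_WORKERS", "1"))),
        # Unset (or empty) disables the persistent cache.
        align_cache_dir=Path(cache_dir) if cache_dir else None,
        numpy_query_chunk=int(env.get("PROTEA_METHOD_NUMPY_QUERY_CHUNK", "500")),
        anc2vec_path=Path(anc2vec),
    )


# Example with an explicit mapping instead of the real process environment.
settings = load_export_settings({"PROTEA_ANC2VEC_PATH": "/tmp/anc2vec.npz"})
print(settings)
```

Passing the mapping explicitly keeps the function easy to exercise in tests without mutating `os.environ`.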

Forward reference

Scalene flamegraphs for the full FARM-EXP.13 run (24-cell grid, 8 PLMs x 3 K values) will be published under docs/perf/ as part of the upcoming PERF.1 profiling slice. They will show the fraction of wall clock consumed by each sub-step and guide future parallelisation work.

Cross-reference

Thesis Ch. 5.5 presents end-to-end timing measurements for the EXP.13 export grid, the alignment cache hit rate per PLM, and the performance model used to predict export duration for larger corpora.