Export Pipeline Complexity
Note
Hot path: O(q * k) pairwise sequence alignments, parallelised across a process pool (PR #421).
The export_research_dataset operation materialises
train.parquet + eval.parquet for the reranker lab. Its runtime
is dominated by the pairwise NW+SW alignment step in
_KnnTransferRunner._compute_pair_features: for each (query,
reference) pair in the KNN result, a full Needleman-Wunsch traceback
is computed via parasail (BLOSUM62, gap-open 10, gap-extend 1). With
q query proteins and k neighbours per query there are q * k unique
pairs to align. At the FARM-EXP.13 scale (q ~ 5000, k = 10) this is
50k alignments per aspect, each taking ~2 ms single-threaded, for a
total of ~100 s per aspect before parallelisation.
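The estimate above is plain arithmetic and can be reproduced in a few lines. The figures are the approximate ones quoted in the text. The last step applies the ~3x process-pool speedup reported for a 12-core box, an extrapolation rather than a measured per-aspect timing.

```python
# Back-of-envelope cost of the alignment step, using the figures quoted above.
q = 5000              # query proteins (approximate, FARM-EXP.13 scale)
k = 10                # neighbours per query
ms_per_alignment = 2  # approximate single-threaded cost per pair

pairs = q * k
serial_seconds = pairs * ms_per_alignment / 1000

# Assumption: the ~3x speedup applies uniformly to the whole alignment step.
parallel_seconds = serial_seconds / 3

print(pairs)                       # 50000
print(serial_seconds)              # 100.0
print(round(parallel_seconds, 1))  # 33.3
```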
Two structural optimisations landed in PR #421:
- Process-level parallelism. Alignments are distributed across a ProcessPoolExecutor (PROTEA_PAIR_FEATURE_WORKERS, default: serial). parasail’s traceback variants do not release the GIL efficiently, so threads do not scale; processes do (~3x on a 12-core box).
- Persistent on-disk cache. A SQLite file keyed by the ordered sequence pair plus alignment parameters stores computed alignment features. Because the K=10 neighbour set is a superset of K=5 and K=3, the first run at K=10 pays the alignment cost; subsequent runs at smaller K are near-free cache hits.
"""Parallel + persistent computation of per-pair alignment features.
The export pipeline's hotspot is the per-(query, reference) alignment in
``_KnnTransferRunner._compute_pair_features``: a single-threaded triple
loop calling ``compute_alignment`` (parasail NW+SW with traceback) at a
few hundred pairs/s. Two structural wins live here:
1. **Process-level parallelism** over the unique alignment pairs. parasail's
traceback variants do not release the GIL well enough for threads to
scale, so a :class:`ProcessPoolExecutor` is used (benchmarked ~3x on a
12-core box). The taxonomy lookups stay in the parent process because
they are cheap (an ``lru_cache`` over ete3 lineages) and process
re-init of the ete3 sqlite handle would dwarf the work.
2. **A persistent on-disk cache** keyed by the ordered sequence pair plus
the alignment parameters. Intra-PLM the K=10 neighbour set is a
superset of K=5 / K=3, so alignments computed once for the largest-K
dataset make the smaller-K datasets (and any re-run) near-free.
``protea.training`` is serialised (one job at a time) so there is no
concurrent-write contention; a plain sqlite file is enough.
This module is value-preserving by construction: the alignment feature
dict it returns is exactly what ``compute_alignment`` returns, only
computed concurrently and memoised. Disabling parallelism (workers <= 1)
and the cache (``PROTEA_ALIGN_CACHE_DIR`` unset / cache-miss) reduces to
the original serial code path.
"""
from __future__ import annotations
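The two optimisations described in the docstring can be sketched as follows. This is a minimal illustration, not the module's actual code: `align_stub` is a hypothetical stand-in for `compute_alignment` (it does not call parasail), and the table layout, key hashing, and function names are assumptions made for the example. What it does show is the shape of the approach: look up each unique ordered pair in SQLite, compute only the misses (in a process pool when `workers > 1`, serially otherwise), then write the new results back.

```python
import hashlib
import json
import sqlite3
from concurrent.futures import ProcessPoolExecutor

# Alignment parameters quoted in the text; they are part of the cache key.
PARAMS = ("BLOSUM62", 10, 1)


def align_stub(pair):
    """Hypothetical stand-in for compute_alignment (no parasail involved)."""
    a, b = pair
    matches = sum(x == y for x, y in zip(a, b))
    return {"identity": matches / max(len(a), len(b), 1)}


def cache_key(pair, params=PARAMS):
    """Hash the ordered pair plus parameters; (a, b) and (b, a) differ."""
    raw = "|".join([pair[0], pair[1], *map(str, params)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def open_cache(path):
    """Open (or create) the SQLite cache file."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS align (key TEXT PRIMARY KEY, features TEXT)"
    )
    return conn


def pair_features(pairs, conn=None, workers=1):
    """Return ({pair: features}, number_of_cache_misses)."""
    unique = list(dict.fromkeys(pairs))  # drop duplicates, keep order
    found, missing = {}, []
    for pair in unique:
        row = None
        if conn is not None:
            row = conn.execute(
                "SELECT features FROM align WHERE key = ?", (cache_key(pair),)
            ).fetchone()
        if row is None:
            missing.append(pair)
        else:
            found[pair] = json.loads(row[0])

    if workers > 1 and missing:
        # Processes rather than threads, as the text recommends.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            computed = list(pool.map(align_stub, missing, chunksize=64))
    else:
        computed = [align_stub(pair) for pair in missing]  # serial path

    for pair, feats in zip(missing, computed):
        found[pair] = feats
        if conn is not None:
            conn.execute(
                "INSERT OR REPLACE INTO align VALUES (?, ?)",
                (cache_key(pair), json.dumps(feats)),
            )
    if conn is not None:
        conn.commit()
    return found, len(missing)
```

Calling `pair_features` on the K=10 pair list and then on a subset of it reports zero misses on the second call, which is the superset effect described above. When running with `workers > 1` from a script, the call should sit under an `if __name__ == "__main__":` guard on platforms that start worker processes with `spawn`.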
Stage-by-stage breakdown
The operation is orchestrated in three session windows (see
ExportResearchDatasetOperation.execute). Within the second window,
_dump_to_stage calls the internal _KnnTransferRunner:
| Sub-step | Complexity | Parallelism |
|---|---|---|
| Load embeddings from DB | O(n + q) rows | Single DB query |
| KNN search (numpy backend, per aspect) | O(q * n * d) per aspect | Serial (one aspect at a time) |
| GO transfer | O(q * k * T) | Vectorised |
| Pairwise alignment | O(q * k) | ProcessPoolExecutor + SQLite cache |
| Taxonomy features | O(q * k) with ete3 lru_cache | In-process (cheap) |
| anc2vec lookup | O(q * k * T) | O(1) per term via hash table |
| PCA projection | O((n + q) * d * p) fit; O(q * d * p) transform | NumPy (p = 16 components, transductive) |
| Parquet write + MinIO upload | O(rows) | Sequential (single connection) |
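To make the O(q * n * d) term for the KNN search concrete, here is a rough sketch of a chunked brute-force search in NumPy. It is not the project's backend: the function name is hypothetical, and squared Euclidean distance is an assumption, since the text does not state which metric is used. The `chunk_size` default mirrors the per-chunk query count listed under Environment variables.

```python
import numpy as np


def knn_chunked(queries, refs, k, chunk_size=500):
    """Brute-force top-k by squared Euclidean distance, in query chunks."""
    ref_sq = np.einsum("ij,ij->i", refs, refs)  # ||r||^2 for each reference, shape (n,)
    idx_out = np.empty((queries.shape[0], k), dtype=np.int64)
    for start in range(0, queries.shape[0], chunk_size):
        chunk = queries[start:start + chunk_size]  # shape (c, d)
        # ||q - r||^2 = ||q||^2 - 2 q.r + ||r||^2; the matrix product
        # chunk @ refs.T is the O(c * n * d) term that dominates.
        q_sq = (chunk ** 2).sum(axis=1, keepdims=True)
        d2 = q_sq - 2.0 * chunk @ refs.T + ref_sq
        # argpartition finds the k smallest per row without a full sort.
        top = np.argpartition(d2, k - 1, axis=1)[:, :k]
        order = np.argsort(np.take_along_axis(d2, top, axis=1), axis=1)
        idx_out[start:start + chunk.shape[0]] = np.take_along_axis(top, order, axis=1)
    return idx_out
```

Chunking bounds the size of the intermediate distance matrix at `chunk_size * n` entries without changing the result, so the chunk size trades peak memory against the number of matrix products.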
Environment variables
| Variable | Effect |
|---|---|
| PROTEA_PAIR_FEATURE_WORKERS | Number of process-pool workers for alignment (default: 1, serial) |
| PROTEA_ALIGN_CACHE_DIR | Directory for the persistent SQLite alignment cache (unset disables) |
| | Per-chunk query count for the numpy KNN backend (default: 500) |
| | Path to the anc2vec NPZ artifact (required for export to succeed) |
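As an illustration of how the two alignment-related settings might be read, the sketch below parses PROTEA_PAIR_FEATURE_WORKERS and PROTEA_ALIGN_CACHE_DIR with the defaults stated in this page. The helper names and the handling of invalid values are assumptions; the real module may parse them differently.

```python
import os


def pair_feature_workers(env=os.environ):
    """Parse PROTEA_PAIR_FEATURE_WORKERS; use 1 (serial) when unset or invalid."""
    raw = env.get("PROTEA_PAIR_FEATURE_WORKERS", "")
    try:
        return max(1, int(raw))
    except ValueError:
        # Assumption: a malformed value falls back to serial rather than raising.
        return 1


def align_cache_dir(env=os.environ):
    """Return the cache directory, or None when the variable is unset or empty."""
    return env.get("PROTEA_ALIGN_CACHE_DIR") or None
```

With both variables unset, these helpers return `1` and `None`, which corresponds to the original serial, uncached code path.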
Forward reference
Scalene flamegraphs for the full FARM-EXP.13 run (24-cell grid, 8 PLMs
x 3 K values) will be published under docs/perf/ as part of the
upcoming PERF.1 profiling slice. They will show the fraction of wall
clock consumed by each sub-step and guide future parallelisation work.
Cross-reference
Thesis Ch. 5.5 presents end-to-end timing measurements for the EXP.13 export grid, the alignment cache hit rate per PLM, and the performance model used to predict export duration for larger corpora.