Core¶
The protea.core package contains all domain logic. It has no dependency
on the infrastructure layer: operations receive an open SQLAlchemy session
and an emit callback, but they do not manage connections, queues, or
transactions themselves. This strict boundary makes every operation
independently testable and trivially substitutable.
Axis tuple
protea.core.axis_tuple defines the AxisTuple named-tuple used to
identify a training configuration in the multi-PLM sweep. Each axis
carries the PLM identifier, neighbourhood size k, reranker flag,
feature family set, evaluation set name, propagation mode, and ensemble
strategy. All serialised config keys in Dataset and ExperimentRun
rows use the canonical axis-tuple string form derived from this type.
Axis-tuple shortid: thin shim that prefers the protea-contracts helper.
The canonical implementation lives in protea_contracts.axis_tuple
(shipped by FARM-EXP.1 companion PR on protea-contracts). PROTEA pins
protea-contracts@main in pyproject.toml, so once that PR
merges and we re-lock, the local fallback below disappears via the
try / except ImportError path.
The local fallback is byte-for-byte identical to the upstream helper so the cross-repo shortid contract holds during the brief window where PROTEA is merged before the contracts release that ships the helper.
Formula (mirrors ExperimentSpec.hash() in protea-reranker-lab/src/protea_reranker_lab/experiment.py:108-111):
sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()[:12]
When the upstream helper is available, this module re-exports it verbatim (same symbol, same digest). Callers should import from here so the swap is invisible:
from protea.core.axis_tuple import axis_tuple_shortid
- protea.core.axis_tuple.CANONICAL_AXIS_KEYS: tuple[str, ...] = ('plm', 'k', 'reranker_spec_id', 'feature_schema_sha', 'eval_set_name', 'eval_set_manifest_sha', 'propagation', 'ensemble_spec')¶
Canonical axis-tuple key set. Falls back to the locally-pinned tuple while the upstream contracts release is being rolled out. Order does NOT affect the digest (
json.dumps(..., sort_keys=True)).
- protea.core.axis_tuple.SHORTID_HEX_LEN: int = 12¶
Length of the truncated hex digest. Matches the lab convention.
- protea.core.axis_tuple.axis_tuple_shortid(axis_tuple: Mapping[str, Any]) str¶
Fallback implementation: see module docstring for the formula.
Domain types
protea.core.domain.aspect defines the Aspect enum used throughout
the codebase to identify the three GO namespaces: MFO (Molecular Function),
BPO (Biological Process), and CCO (Cellular Component).
GO aspect (namespace) — the three-way partition of the Gene Ontology.
Two encodings circulate in the codebase, both of them load-bearing:
Single-char codes (
"P"/"F"/"C") — the wire format used byGOTerm.aspectin PostgreSQL and thego-basic.obofile itself. Thego_termtable CHECK constraint is on these codes.Three-char CAFA codes (
"BPO"/"MFO"/"CCO") — the format expected bycafaeval(the upstream Fmax / AuPRC evaluator) and surfaced in the UI for human readers.
Until this module landed, both encodings appeared as bare string literals in 30+ places — a textbook Primitive Obsession smell. The enum is the single source of truth; everything else converts at the boundary.
Typical usage:
from protea.core.domain.aspect import Aspect
# Iterate the three aspects in a stable canonical order
for aspect in Aspect:
...
# Convert from a DB row
aspect = Aspect.from_code(row.aspect)
# Hand off to cafaeval
result_dict[aspect.cafa] = ...
- class protea.core.domain.aspect.Aspect(*values)¶
Bases:
EnumGene Ontology aspect / namespace.
The three GO sub-ontologies. Iteration order is the canonical PROTEA order (P → F → C), matching the historical
_ASPECTS = ("P", "F", "C")tuples this enum replaces.- BIOLOGICAL_PROCESS = 'P'¶
- CELLULAR_COMPONENT = 'C'¶
- MOLECULAR_FUNCTION = 'F'¶
- property cafa: str¶
Three-char CAFA code (
"BPO"/"MFO"/"CCO").Format expected by the upstream
cafaevalpackage and the evaluation results JSON; also the canonical UI label.
- property code: str¶
Single-char code (
"P"/"F"/"C").Wire format in PostgreSQL (
go_term.aspectcolumn) and thego-basic.obofile. Use this when reading/writing the DB or comparing against the ORM column.
Contracts¶
The contracts module defines the interfaces that every operation must satisfy and the shared types used across the entire codebase.
Operation is a structural Protocol: any class that exposes a name
string and an execute(session, payload, *, emit) method conforms to it,
without needing to inherit from a base class. ProteaPayload is the
immutable, strictly-typed Pydantic base class for all operation payloads:
strict mode prevents silent type coercion, and frozen configuration prevents
accidental mutation after validation. OperationResult is the return value
of every execute call; its deferred flag tells BaseWorker that
completion will be signalled by child workers rather than immediately.
RetryLaterError is raised when a shared resource (e.g. the GPU) is
occupied; BaseWorker catches it, resets the job to QUEUED, and
re-publishes the message after a configurable delay.
OperationRegistry is a simple dict-backed mapping from operation name
strings to instances. Workers resolve the correct operation at message
dispatch time; new operations are registered at process startup in
scripts/worker.py without modifying any worker code.
parent_progress exposes the shared
_update_parent_progress helper used by every coordinator
operation (compute_embeddings, predict_go_terms) to advance
the parent job’s progress as child workers finish their batches.
Extracted to its own module in F0 (T0.7) to remove duplicated copies
across coordinators.
Retry middleware¶
protea.core.retry exposes with_retry, a wrapper function used
by BaseWorker to run the execute session against transient
database errors (deadlocks, connection drops, serialisation
failures) and brief network blips. Exponential backoff with jitter;
all knobs (max_attempts, base_delay, max_delay,
jitter_ratio, predicate, on_retry) are bundled in a
RetryPolicy frozen dataclass passed via the policy keyword
argument (T-CONTEXTS, PR #237). BaseWorker instantiates a fixed
policy at call site (RetryPolicy(max_attempts=3, base_delay=1.0,
max_delay=10.0, jitter_ratio=0.3)); there is no global
TuningSettings field for these values. Added as part of F0
(T0.3) of the master plan revision 3.
Operation catalogue¶
protea.core.operation_catalog builds the singleton
OperationRegistry that workers consult at message dispatch. The
public function build_operation_registry() instantiates each
operation class and registers it under its canonical name. Adding a
new operation is a one-line edit here plus a new module under
protea/core/operations/.
Plugin discovery¶
protea.core.plugins centralises importlib.metadata.entry_points
discovery for every PROTEA plugin group (protea.backends,
protea.sources, protea.runners). discover_plugins(group)
returns a cached {name: plugin} map and hard-errors with
RuntimeError if a plugin’s name attribute drifts from its
entry-point name. reset_plugin_cache is a test-only seam for
suites that install/uninstall plugins between cases. Added in T2A.5
for backend dispatch and generalised in T2A.8 (PR #240).
Generic entry-point plugin discovery (T2A.5 + T2A.8).
Both backends (T2A.5) and runners (T2A.8) follow the same shape: a
package declares plugins in [tool.poetry.plugins."protea.<group>"]
mapping name = "module:plugin", PROTEA discovers them via
importlib.metadata.entry_points, and a resolver maps a string
identifier to a plugin instance for dispatch.
This module centralises that discovery + cache + name-mismatch hard
error so the backend and runner code paths share one implementation
instead of forking. See protea.core.runners and the
_get_backend_plugins shim in
protea.core.operations.compute_embeddings for the call sites.
- protea.core.plugins.discover_plugins(group: str) dict[str, Any]¶
Discover and cache plugins in the given
entry_pointsgroup.Returns a dict keyed by
plugin.name. Each plugin must declare anameclass attribute that matches its entry-point name; mismatches raiseRuntimeErrorat discovery time so the failure surface is a clear “your declaration drifted from your entry_points file” rather than a confusing “Unknown <something>” later on.Discovery happens once per process per group; subsequent calls return the cached map. The cache key is the group string itself, so the same group cannot be discovered twice (which keeps the import-time side effects stable across reload-heavy test runs).
- protea.core.plugins.reset_plugin_cache(group: str | None = None) None¶
Drop the cached plugin map for one (or every) group.
Test-only seam: lets unit tests force re-discovery without restarting the process. Passing
group=Noneclears every group; passing a specific name clears only that one. Production callers should not invoke this; the cache is intentionally process-local.
Feature registry
protea.core.features.registry is the central registry that maps
feature-family names to their producer functions. The ALL_FEATURES
constant lists every column the pipeline may populate; adding a column
without wiring a producer here causes a T1.8 invariant failure in the
dataset-export pipeline.
Schema SHA
protea.core.schema_sha_v2 computes a 16-character deterministic
fingerprint of the feature schema version. The hash is embedded in
Dataset.schema_sha and RerankerModel.feature_schema_sha; a
mismatch at inference time indicates that the booster was trained on a
different feature set and must be rejected.
schema_sha_v2 dual-write feature flag (T1.6 of master plan v3).
The protea.infrastructure.orm.models Dataset, RerankerModel, and
ExperimentRun rows carry both the legacy schema_sha /
feature_schema_sha column and the canonical
schema_sha_v2 column added by ADR D10. Until the production
cut-over window, write paths only fill the legacy column; once the
operator flips PROTEA_SCHEMA_SHA_V2_WRITE_ENABLED=true new rows
dual-write both columns.
The flag defaults to false (off) so a normal deploy of this slice
is a pure no-op for the data plane. The accompanying alembic
migration ships the column and a one-shot backfill; production reads
still come off schema_sha until a later slice flips the routers.
Truthy values follow the same convention as
protea.api.auth._authn_required() so operators do not learn two
different env-flag dialects: 1, true, yes, on (case
insensitive). Anything else (including unset) is treated as false.
- protea.core.schema_sha_v2.is_dual_write_enabled() bool¶
Return
Truewhenschema_sha_v2should be written on inserts.The check is read-fresh each call so tests can flip the value via
pytest.MonkeyPatch.setenv()without restarting the process and operators can roll the flag forward without redeploying.
- protea.core.schema_sha_v2.maybe_v2(value: str | None) str | None¶
Pass-through helper for dual-write call sites.
Returns
valuewhen the flag is on,Noneotherwise. Lets write paths stay terse:schema_sha_v2=maybe_v2(legacy_sha)instead of an inline ternary at every insert site.
JSONB dual-write
protea.core.jsonb_dual_write provides helpers for writing structured
data to both a typed column and a JSONB fallback column in the same
transaction. Used during schema-migration windows where old and new code
may run concurrently against the same database.
Env-flag helper for the T3.1 GOPrediction.predictions_jsonb dual-write.
T3.1 scaffolds the prediction-tuple JSONB dual-write: writers continue
filling the typed prediction columns (go_term_id, distance,
evidence_code) and additionally serialise the same tuple into the
predictions_jsonb blob when the environment opt-in is set. This
keeps the scaffolding inert in production until a human-coordinated
deploy window flips the flag.
Environment variable¶
PROTEA_GO_PREDICTION_JSONB_WRITE_ENABLEDTruthy values (
1,true,yes,on; case-insensitive) enable the dual-write. Anything else (unset, empty,0,false…) leaves the writer skipping the JSONB column sopredictions_jsonbstaysNULLon new rows. Default off.
Design notes¶
The helper API is intentionally tiny so every writer site can opt in with a single import + one-line call:
from protea.core.jsonb_dual_write import maybe_jsonb
row["predictions_jsonb"] = maybe_jsonb(
[(pred["go_term_id"], pred["distance"], pred.get("evidence_code"))]
)
maybe_jsonb returns None when the flag is off (so the row
dict carries an explicit NULL for SQL) and a compact dict when on.
The compact shape is a single-level dict with two keys:
predictions: a list of{"go_term_id": int, "score": float, "evidence": str | None}records, preserving caller order.count: len(predictions), redundant but useful for cheap GIN filtering and sanity checks (WHERE predictions_jsonb @> '{"count": 5}'etc.).
The list shape lets a single row carry one or more prediction tuples without re-shaping the helper at the T3.2 backfill / T3.3 reader cut-over.
- protea.core.jsonb_dual_write.JSONB_DUAL_WRITE_ENV: str = 'PROTEA_GO_PREDICTION_JSONB_WRITE_ENABLED'¶
Environment variable name; exported for tests / docs.
- protea.core.jsonb_dual_write.is_jsonb_dual_write_enabled(env: dict[str, str] | None = None) bool¶
Return
Trueiff the env opt-in is set to a truthy value.envdefaults toos.environ; the explicit parameter is a test hook so callers can verify the helper without mutating process state.
- protea.core.jsonb_dual_write.maybe_jsonb(predictions: Sequence[tuple[int, float, str | None]], *, env: dict[str, str] | None = None) dict[str, Any] | None¶
Serialise a list of (go_term_id, score, evidence) tuples.
Returns
Nonewhen the dual-write env opt-in is off so the caller can assign the result directly to a row dict / ORM attribute without an explicit guard at every writer site.When the flag is on, returns the compact JSONB shape documented in the module docstring. Empty input still returns a well-formed dict (
{"predictions": [], "count": 0}) so downstream consumers can rely on the keys being present.
Experiment runners¶
protea.core.runners adapts the generic plugin discovery to the
protea.runners group. resolve_runner(name) maps an identifier
("knn" / "baseline" / "lightgbm") to a runner plugin
instance implementing the protea_contracts.ExperimentRunner
interface; unknown names raise ValueError listing the discovered
set. PROTEA does not yet dispatch to runners at inference time (the
active KNN + reranker path stays in PredictGOTermsBatchOperation
until F2C of master plan revision 3 hoists the inference core into a
shared package). The adapter exists so GET /v1/runners has a
stable resolver and future code has a one-line entry. Closes T2A.8
(PR #240).
Experiment-runner plugin adapter (T2A.8).
Mirrors the backend-plugin adapter in
protea.core.operations.compute_embeddings: discovers runners via
importlib.metadata.entry_points("protea.runners") and exposes a
resolve_runner() lookup that maps a string identifier (e.g.
"knn" / "baseline" / "lightgbm") to a runner plugin
instance implementing protea_contracts.ExperimentRunner.
The protea-runners package declares the canonical plugin set:
knn: KNN-only baseline (no reranker), evaluates against a frozen dataset for ablation.baseline: predict-by-frequency baseline (most-frequent GO terms in the training set).lightgbm: the v9 / v18 LightGBM reranker, currently trained out-of-tree inprotea-reranker-lab.
PROTEA does not yet dispatch to runners at inference time (the active
KNN + reranker path lives in PredictGOTermsBatchOperation and stays
there until F2C of master plan v3 hoists the inference core into a
package both PROTEA and protea-runners can depend on). The adapter
exists today so:
The plugin registry endpoint (
GET /v1/registry/runners) has a stable resolver under the hood (currently still using its own_discoverfor the API shape; that path is untouched here).Future code that wants a runner instance has a one-line entry:
from protea.core.runners import resolve_runner.The discovery + name-mismatch hard error follows the same pattern as backends, so the lab + plugin authoring workflow is uniform across both groups.
- protea.core.runners.get_runner_plugins() dict[str, Any]¶
Return the discovered runner plugins keyed by
plugin.name.Wraps
protea.core.plugins.discover_plugins()so the runner callers do not need to know the entry-point group string; the constant lives here next to the consumers.
- protea.core.runners.resolve_runner(name: str) Any¶
Resolve a runner identifier to a plugin instance.
Raises
ValueErrorlisting the discovered runners when the identifier is unknown, so the failure message is actionable. The returned object implementsprotea_contracts.ExperimentRunner.
Utilities¶
protea.core.utils provides two shared utilities used across multiple
operations.
utcnow() returns a timezone-aware UTC datetime, avoiding the common
mistake of calling datetime.utcnow() which returns a naive object.
chunks(seq, n) splits any sequence into fixed-size chunks, used by
coordinator operations to partition work into batches.
The previous UniProtHttpMixin (exponential backoff with jitter,
Retry-After header parsing, cursor extraction for paginated UniProt
REST endpoints) was inlined into InsertProteinsOperation and
FetchUniProtMetadataOperation when those operations were rewritten;
the retry/backoff/cursor logic now lives directly in each operation.
- protea.core.utils.chunks(seq: Sequence[Any], n: int) Iterable[Sequence[Any]]¶
Yield successive n-sized chunks from seq.
- protea.core.utils.utcnow() datetime¶
Return the current UTC datetime (timezone-aware).
KNN search¶
protea.core.knn_search provides the nearest-neighbour search layer used
during GO term prediction. The single public entry point is search_knn(),
which dispatches to one of two backends based on the backend parameter.
The numpy backend computes exact cosine or L2 distances via matrix multiplication. It requires no additional dependencies and is the default. For cosine distance, query and reference matrices are L2-normalised and the distance is computed as \(D = 1 - \cos(\theta) \in [0, 2]\). This is \(O(NQ)\) and is appropriate for reference sets up to approximately 100 000 proteins when embeddings fit in RAM as float16.
The faiss backend wraps the FAISS library and supports three index
types: Flat (exact), IVFFlat (approximate, Voronoi partitioning),
and HNSW (approximate, hierarchical graph). IVFFlat is recommended
for datasets above 100 000 vectors: it restricts search to the nprobe
nearest Voronoi cells, reducing query time from \(O(N)\) to approximately
\(O(\sqrt{N})\) with negligible recall loss at default settings.
Important
KNN search is never performed at the database layer. pgvector index types (HNSW, IVFFlat) are not used. All search happens in Python after loading reference embeddings into a numpy array. See Predict GO terms in the how-to guides.
Feature engineering¶
protea.core.feature_engineering enriches each query–reference pair in a
prediction result with sequence-level and phylogenetic signals. These features
are opt-in: they are computed only when compute_alignments=true and/or
compute_taxonomy=true are set in the prediction payload.
Pairwise alignment is computed via the parasail library using the
BLOSUM62 substitution matrix with gap-open/extend penalties of 10/1. Both
global (Needleman–Wunsch) and local (Smith–Waterman) alignments are run for
each pair, producing identity, similarity, raw score, gap percentage, and
alignment length for each. These metrics capture sequence similarity beyond
what the embedding distance alone encodes, which is especially valuable for
distant homologues where embedding geometry may be unreliable.
Taxonomic distance is computed via ete3 and the NCBI taxonomy tree
(local SQLite, downloaded on first use). For each (query, reference) pair
where taxonomy IDs are available from UniProt metadata, PROTEA finds the
lowest common ancestor and computes the edge count through it. Results are
cached with an LRU cache keyed by taxon-ID pair to avoid redundant tree
traversals across a batch.
Feature engineering utilities for functional annotation enrichment.
Provides pairwise alignment metrics (Needleman–Wunsch and Smith–Waterman) via parasail and taxonomic distance computation via ete3 NCBITaxa.
These features complement the embedding-space KNN distance stored in
GOPrediction.distance with sequence-level and phylogenetic signals.
Performance notes:
Alignment is O(m*n) per pair; parasail uses SIMD acceleration.
Taxonomy lookups use an LRU cache over lineage queries (ete3 local SQLite). First call may trigger a DB download if the ete3 database is absent.
- protea.core.feature_engineering.compute_alignment(seq1: str, seq2: str) dict[str, Any]¶
Compute both NW and SW alignment metrics in one call.
- protea.core.feature_engineering.compute_nw(seq1: str, seq2: str, *, gap_open: int = 10, gap_extend: int = 1) dict[str, Any]¶
Global alignment (Needleman–Wunsch) via parasail/BLOSUM62.
- Returns a dict with keys:
identity_nw, similarity_nw, alignment_score_nw, gaps_pct_nw, alignment_length_nw, length_query, length_ref
- protea.core.feature_engineering.compute_sw(seq1: str, seq2: str, *, gap_open: int = 10, gap_extend: int = 1) dict[str, Any]¶
Local alignment (Smith–Waterman) via parasail/BLOSUM62.
- Returns a dict with keys:
identity_sw, similarity_sw, alignment_score_sw, gaps_pct_sw, alignment_length_sw
- protea.core.feature_engineering.compute_taxonomy(t1_raw: Any, t2_raw: Any) dict[str, Any]¶
Compute taxonomic distance between two NCBI taxonomy IDs.
- Returns a dict with keys:
taxonomic_lca, taxonomic_distance, taxonomic_common_ancestors, taxonomic_relation
- protea.core.feature_engineering.warmup_taxonomy_db() None¶
Pre-initialize the NCBITaxa database.
Call at worker startup so the download (~100 MB on first run) happens before any batch is processed, not mid-flight.
Re-ranker¶
protea.core.reranker implements a LightGBM binary classifier that
re-scores GO term predictions using 20 numeric features (embedding distance,
NW/SW alignment metrics, sequence lengths, taxonomic distance and common
ancestors, and 5 aggregate re-ranker signals) plus 3 categorical features
(qualifier, evidence code, taxonomic relation). The full feature list is
documented in train_reranker.
The module provides:
prepare_dataset(df): extracts and coerces feature columns. Numeric columns are coerced witherrors="coerce"(invalid strings becomeNaN); categorical columns are converted to pandascategorydtype, which LightGBM consumes directly without manual label encoding.train(df): stratified positive/negative split with early-stopping on a held-out validation set (default 20 %). Returns aTrainResultwith the Booster, validation metrics (AUC, logloss, precision, recall, F1 at the 0.5 threshold), the best boosting iteration, and gain-based feature importance.predict(model, df): returns probability scores in[0, 1].model_to_string()/model_from_string(): serialization for DB storage in theRerankerModeltable.load_training_tsv(): parses a training data TSV as produced by the/scoring/prediction-sets/{id}/training-data.tsvendpoint.
Note
load_reranker / apply_reranker / infer_active_feature_families
were originally split into a sibling protea.core.reranking module;
they were folded back into protea.core.reranker to remove a naming
trap (reranker vs reranking were impossible to grep apart).
This module is now the single inference-side surface.
Parquet export (protea.core.parquet_export)¶
protea.core.parquet_export consolidates per-split, per-category
parquet shards produced by the KNN + feature pipeline into the frozen
dataset layout consumed by protea-reranker-lab: exactly
train.parquet, eval.parquet and manifest.json under a single
directory. The manifest follows ManifestV1 (schema version 2)
and records PROTEA’s producer_version + producer_git_sha.
The single public function export_reranker_parquets(...) is shared
by two callers:
training_dump_helpers._dump_frozen_dataset: thin wrapper that uses this helper to emit the dataset alongside a training-data dump.ExportResearchDatasetOperation: stand-alone operation that only materialises and publishes the dataset, without running LightGBM.
When store is provided, the three consolidated files are
additionally uploaded under key_prefix using the ArtifactStore
interface, and the returned dict includes train_uri / eval_uri
/ manifest_uri.
Scoring¶
protea.core.scoring implements the scoring engine that applies weighted
formulas to GO predictions. A ScoringConfig defines a set of weights for
each feature column (embedding distance, alignment metrics, taxonomy, re-ranker
features). The engine computes a composite score per prediction row and can
stream scored results as TSV or compute CAFA-style metrics (Fmax, AUC-PR)
against an evaluation set.
Metrics¶
protea.core.metrics implements CAFA-style precision-recall evaluation.
Provides functions for computing Fmax (maximum F-measure over all thresholds),
weighted precision/recall, and coverage for a set of predictions against
ground-truth annotations.
Evidence codes¶
protea.core.evidence_codes provides mappings between ECO (Evidence and
Conclusion Ontology) identifiers and GO evidence codes used in GAF files.
Used by the QuickGO annotation loader to resolve ECO IDs to standard
three-letter evidence codes.
GO evidence code utilities.
Provides normalisation from ECO IDs to canonical GO evidence codes and classification of codes as experimental vs. non-experimental.
- Mapping source (canonical ECO → GO code):
https://github.com/evidenceontology/evidenceontology/blob/master/gaf-eco-mapping.txt
Additional ECO IDs used by UniProt GAF (not in the canonical mapping file) are resolved here using the EBI QuickGO ECO term definitions; all are “used in automatic assertion” and therefore map to IEA.
- protea.core.evidence_codes.is_experimental(code: str) bool¶
Return True if code (GO or ECO) represents experimental evidence.
Examples:
is_experimental("IDA") # True is_experimental("ECO:0000314") # True (IDA) is_experimental("IEA") # False is_experimental("ECO:0000501") # False (IEA)
- protea.core.evidence_codes.normalize(code: str) str¶
Return the canonical GO evidence code for code.
If code is already a GO root evidence code it is returned as-is. ECO IDs are translated via
ECO_TO_CODE. Unknown codes are returned unchanged so that no information is silently discarded.Examples:
normalize("IDA") # → "IDA" normalize("ECO:0000314") # → "IDA" normalize("ECO:0000256") # → "IEA" normalize("UNKNOWN") # → "UNKNOWN"
Evaluation¶
protea.core.evaluation implements the CAFA5 evaluation protocol for
computing the ground-truth delta between two annotation snapshots.
The module’s central data structure is EvaluationData, a frozen dataclass
that holds the NK, LK, PK, known, and pk_known annotation dictionaries.
Each dictionary maps a protein accession to a set of GO term IDs.
EvaluationData fields:
nk: delta annotations for No-Knowledge proteins (no prior annotations in any namespace at t0).lk: delta annotations for Limited-Knowledge proteins (had annotations in some namespaces but gained new terms in a previously empty namespace).pk: novel annotations for Partial-Knowledge proteins (gained new terms in a namespace where they already had annotations).pk_known: old experimental annotations for PK proteins in the relevant namespaces; passed as-knowntocafaevalto exclude them from scoring.known: all old experimental annotations flattened across namespaces; available for download via the reference endpoint.
The public entry point is compute_evaluation_data(session,
old_annotation_set_id, new_annotation_set_id, ontology_snapshot_id).
It loads the GO DAG for NOT-propagation, builds a per-namespace annotation
map for both old and new sets, and classifies each (protein, namespace) pair
into NK, LK, or PK. The same protein can appear in multiple categories across
different namespaces simultaneously (e.g., LK in CCO and PK in BPO).
Provenance¶
protea.core.provenance provides capture_provenance(extra=None),
a side-effect-free runtime snapshot for jobs / experiments / artefacts
to carry an audit trail without DB or network probes. Returns a fresh
dict[str, Any] with auto-keys protea_version (from
importlib.metadata), protea_git_sha (delegates to
parquet_export.resolve_protea_git_sha), python_version,
platform, hostname, and captured_at (ISO-8601 UTC). Any
caller-supplied extra mapping is overlaid last, so callers always
win on key collisions.
Every probe is wrapped: missing distribution metadata, a non-git
checkout, or an absent git binary all degrade to None rather
than raising. Added in T3.11 of master plan v3.2 §24 Fase 4.
Operations¶
PROTEA ships a curated set of registered operation instances at worker
startup via protea.core.operation_catalog.build_operation_registry,
which is the authoritative list (read the function body for the live
count).
The catalog splits into job-backed entries (reachable through
POST /jobs) and ephemeral consumers (dispatched internally by the
compute_embeddings and predict_go_terms coordinators; see
Operations for that taxonomy). Each operation is a
class that implements the Operation protocol: a name string and
an execute method. Operations are stateless with respect to
infrastructure: they receive a session and emit structured events, but
do not open connections or manage transactions. The job-backed entries
are documented below; the four ephemeral siblings
(compute_embeddings_batch, store_embeddings,
predict_go_terms_batch, store_predictions) live in
Operations.
- ping
Smoke-test operation. Returns immediately with a success result. Used to verify end-to-end connectivity between the API, RabbitMQ, and worker processes.
- insert_proteins
Fetches protein sequences from the UniProt REST API using cursor-based FASTA streaming. Sequences are deduplicated by MD5 hash before upsert; proteins are upserted by accession. Exponential backoff with jitter and
Retry-Afterheader handling are implemented inline in the operation. Isoforms are parsed and stored separately, sharing the canonical sequence where the amino-acid string is identical.
- fetch_uniprot_metadata
Downloads TSV functional annotation data from UniProt and upserts
ProteinUniProtMetadatarows keyed by canonical accession. Fields include functional description, EC numbers, pathway membership, and kinetics. Isoforms inherit metadata through thecanonical_accessionjoin; no duplicate rows are created.
- load_ontology_snapshot
Downloads a GO OBO file and populates
OntologySnapshot,GOTerm, andGOTermRelationshiprows. Theobo_versionfield carries a unique constraint so that re-importing the same release is idempotent. If a snapshot already exists but its relationships are missing, they are backfilled automatically.
- load_goa_annotations
Bulk-loads a GAF (Gene Association Format) file. Annotations are filtered against canonical accessions present in the database, avoiding orphaned foreign keys. Each batch is committed independently to bound transaction size.
- load_quickgo_annotations
Streams GO annotations from the QuickGO bulk download API (paginated TSV). Supports optional ECO→GO evidence code mapping and per-page commits. Filters out annotations whose accessions are not already in the database.
- compute_embeddings
Coordinator operation that partitions the target sequence set into batches and dispatches one
compute_embeddings_batchmessage per batch toprotea.embeddings.batch. The coordinator serialises on theprotea.embeddingsqueue (one at a time) to prevent concurrent model loads from exhausting GPU memory. Batch and write workers scale independently. Returnsdeferred=True; the parent job is closed by the last write worker.
- predict_go_terms
Coordinator operation that loads reference embeddings into a process-level float16 cache, partitions the query set into batches, and dispatches one
predict_go_terms_batchmessage per batch toprotea.predictions.batch. Feature engineering (alignments, taxonomy) is opt-in via payload flags. Returnsdeferred=True; the parent job is closed by the last write worker.
- generate_evaluation_set
Computes the NK/LK/PK evaluation delta between two annotation sets using the CAFA5 protocol (experimental evidence only, NOT-propagation through the GO DAG, per-namespace classification). Stores an
EvaluationSetrow with summary statistics. Ground-truth files are generated on-demand by the download endpoints.
- run_cafa_evaluation
Runs
cafaevalfor NK, LK, and PK settings against a given prediction set. Downloads the OBO file, writes ground-truth and prediction TSVs, callscafa_eval()three times (NK and LK without-known, PK withpk_known_terms.tsvas-known), and persists anEvaluationResultrow with per-namespace Fmax, precision, recall, τ, and coverage.
- load_interpro_go_mapping
Downloads and persists the InterPro-to-GO mapping file, linking InterPro domain entries to their associated GO terms. Used as a prerequisite step before running InterProScan-based annotation.
- run_interproscan_batch
Submits sequences to the EBI InterProScan REST API in configurable batch sizes. Stores
InterproAnnotationrows linking each protein to its domain signatures. Supports resume via previously stored job IDs.
- predict_go_terms_from_interpro
Derives GO term predictions from stored
InterproAnnotationrows and theInterproGoMappingtable, without running KNN search. Produces aPredictionSetwhose entries are directly evidence-backed by domain-matching rather than embedding distance.
- export_research_dataset
Publishes the frozen re-ranker dataset (
train.parquet/eval.parquet/manifest.json) consumed byprotea-reranker-lab. Runs the KNN + feature-generation pipeline viaTrainRerankerAutoOperationindump_onlymode and uploads the resulting artefacts through the configuredArtifactStore(local FS by default, MinIO when thestoragecompose profile is active). Manifest records PROTEA’sproducer_version/producer_git_shafor full traceability from lab runs back to PROTEA HEAD.
Training-dump helpers¶
protea.core.training_dump_helpers is the home of the KNN /
feature-generation helpers that were extracted in F0 (T0.6) when
protea.core.operations.train_reranker was deleted. The module is
deliberately not an operation: it is reused in-process by
ExportResearchDatasetOperation to materialise train and
eval shards before the parquet_export consolidation pass.
LightGBM training itself lives in
protea-reranker-lab,
which consumes the published Dataset rows produced by
export_research_dataset.
As of T-RES.1 (PR #368), parent_map_str is always passed to the KNN
transfer runner unconditionally, regardless of the expand_votes_to_ancestors
flag. The lineage producer (protea_method.lineage.compute_lineage_features)
requires the parent map for its lineage_* column computation; the flag only
controls ancestor-vote expansion, not the availability of the map itself.
Internal helpers¶
These modules are imported by the operations and the feature engineering layer; they are documented here for completeness but are not part of the public API.
protea.core.anc2vec_embeddings: anc2vec ancestry embeddings for GO terms, used as features by the re-ranker.protea.core.annotation_intern: string interning helper for reducing memory pressure when loading large annotation sets.protea.core.disk_cache: generic on-disk cache with TTL used by the KNN reference loader and the PCA cache.protea.core.feature_enricher: orchestrator that combines alignment, taxonomy and anc2vec features into a single per-candidate row.protea.core.pca_cache: per-PLM PCA projection cache, used to pre-compute theemb_pcafeature family.protea.core._anc2vec_phases: phase helpers for the anc2vec embedding pipeline.protea.core._feature_enricher_helpers: low-level helpers extracted fromfeature_enricherto keep per-phase logic self-contained.protea.core.features._bindings: internal feature-column binding declarations consumed byfeatures.registry.protea.core._knn_transfer_runner: worker-side KNN transfer runner dispatched by the predictions coordinator.protea.core._leaf_record_builder: builder for leaf-prediction records in the KNN transfer pipeline.protea.core._pair_feature_compute: pair-level feature computation with optional process-pool parallelism and SQLite alignment cache.protea.core._training_dump_loaders: data-loading helpers used by the training-dump pipeline.protea.core.training_dump._constants: shared constants for the training-dump subpackage.protea.core.training_dump._contexts: context dataclasses bundling session, settings, and configuration for training-dump passes.protea.core.training_dump._data_loaders: loaders that hydrate protein, embedding, and annotation data for the training-dump pipeline.protea.core.training_dump._knn_transfer: KNN search and GO-term transfer logic specific to the training-dump path.protea.core.training_dump._payload: Pydantic payload model for theexport_research_datasetoperation.protea.core.training_dump._runner: top-level runner that coordinates the training-dump passes.protea.core.training_dump._test_split: helper that materialises the eval/test shard of the frozen dataset.protea.core.training_dump._train_split: helper that materialises the train shard of the frozen dataset.protea.core.operations._compute_embeddings_backends: backend dispatch logic for thecompute_embeddingscoordinator.protea.core.operations._compute_embeddings_helpers: batch construction and progress helpers forcompute_embeddings.protea.core.operations._load_ontology_helpers: OBO parsing helpers used byload_ontology_snapshot.protea.core.operations._predict_go_terms_adapter: adapter that routes thepredict_go_termscoordinator to the unified-path or batch-op implementations.protea.core.operations.predict_go_terms._aspect_helpers: per-aspect splitting and merging helpers for the predict pipeline.protea.core.operations.predict_go_terms._batch_op: batch operation executed byOperationConsumerworkers.protea.core.operations.predict_go_terms._batch_op_feature: feature generation phase of the batch operation.protea.core.operations.predict_go_terms._batch_op_reference: reference-embedding loading and KNN search for batch operations.protea.core.operations.predict_go_terms._batch_op_reranker: reranker scoring phase of the batch operation.protea.core.operations.predict_go_terms._common: shared types and constants used across the predict-go-terms subpackage.protea.core.operations.predict_go_terms._coordinator: coordinator that partitions the query set and dispatches batch messages.protea.core.operations.predict_go_terms._post_knn_pipeline: post-KNN pipeline steps (feature enrichment, reranker, aggregation).protea.core.operations.predict_go_terms._reranker_scorer:RerankerScorercompositive class that applies a loaded booster to batch prediction DataFrames.protea.core.operations.predict_go_terms._store: bulk-insert helpers for writingGOPredictionrows.protea.core.operations.predict_go_terms._unified_path: unified in-process execution path used when a single worker handles both KNN and write phases.protea.core.operations._run_cafa_artifacts: artifact download and staging helpers forrun_cafa_evaluation.protea.core.operations._run_cafa_data_helpers: data-loading helpers for the CAFA evaluation pipeline.protea.core.operations._run_cafa_eval_driver: driver that callscafaeval.cafa_evaland collects per-namespace results.protea.core.operations._run_cafa_helpers: shared helper functions used across the_run_cafa_*modules.protea.core.operations._run_cafa_reranker_loader: loads and applies the reranker model to CAFA prediction TSVs.protea.core.operations._run_cafa_setup: environment and directory setup for a CAFA evaluation run.protea.core.operations._run_cafa_summary: summarises cafaeval results intoEvaluationResultrows.
Process-local string interning for annotation hot loops.
The predict / train pipelines build millions of per-row dicts of the
shape {"go_term_id": int, "qualifier": str|None, "evidence_code": str|None}
when they materialise annotations from PostgreSQL. SQLAlchemy returns
each qualifier and evidence_code as a fresh Python string —
even though across a million rows there are only ~5-10 distinct values
("IEA", "IDA", "EXP", "TAS", …) and most rows have
qualifier = None.
Without interning, each duplicate string allocates ~50 B in CPython, so a 5 M-row batch can carry ~500 MB of redundant string objects. Interning collapses every duplicate to a single shared instance, a Flyweight-style intrinsic-state share. Python already does this implicitly for short identifier-like literals; this module forces the same dedup for the strings that come back from the DB driver.
Process-local by design (one cache per worker), trivially small (the
domain has fewer than 50 distinct codes/qualifiers in practice), and
thread-safe via the GIL — setdefault is atomic for built-in dicts.
- protea.core.annotation_intern.intern_string(value: str | None) str | None¶
Return a shared instance of
valueif it has been seen before.Pass
Nonethrough unchanged so callers don’t need a guard. The pool grows monotonically; reset only when the worker restarts.
- protea.core.annotation_intern.pool_size() int¶
Diagnostic — how many distinct strings the pool has retained.
On-disk caches for the predict_go_terms pipeline.
The reference embedding pool for an (EmbeddingConfig, AnnotationSet)
pair is large (millions of rows × ~1280 dims × 2 bytes = several GB).
Re-fetching it from PostgreSQL on every batch worker is the main
bottleneck, so the pool is materialised once into data/ref_cache/
and read back via numpy mmap on subsequent runs. Annotations are stored
in a CSR-style layout (offsets + flat go_term_ids /
qualifiers / evidence_codes arrays) so per-protein lookups stay
O(1) instead of dictionary-of-list-of-dict.
Files (under _DISK_CACHE_DIR):
{cfg}__{ann}_embeddings.npy— float16 reference embeddings{cfg}__{ann}_accessions.npy— aligned accession list{cfg}__{ann}__{aspect}_indices.npy— int32 indices into the unified pool{cfg}__{ann}__{aspect}_anno_*.npy— CSR annotation arrays
Invalidation: AnnotationSet rows are immutable once loaded, so the
cache stays valid for as long as the files exist. Delete the files
manually to force a reload after a model change.
Per-group helpers extracted from expand_predictions_to_ancestors.
The ancestor-expansion loop in feature_enricher mutates either an
existing leaf record (when an ancestor coincides with a sibling leaf
prediction) or a synthetic ancestor record (cloned from the leaf with
go_id overridden). Both branches were inlined in the orchestrator,
pushing it well past the master plan v3.2 §3 method-LOC ceiling. This
module hosts the two branch helpers + the small label-config bundle so
the orchestrator can stay readable and short.
- class protea.core._feature_enricher_helpers.LabelConfig(q_acc: str, gt_pairs: set[tuple[str, str]] | None, column: str, present: bool)¶
Bases:
NamedTupleBundle for the per-query label-injection knobs.
Carries the query accession (used as half of the
(q_acc, anc)ground-truth lookup) plus the optionalgt_pairsset and the column / presence flags that drive whether a synthesized ancestor record should carry a label at all. The training dump path passes a populated set; the livepredict_go_termspath leavesgt_pairsasNoneandpresentasFalse.- column: str¶
Alias for field number 2
- gt_pairs: set[tuple[str, str]] | None¶
Alias for field number 1
- present: bool¶
Alias for field number 3
- q_acc: str¶
Alias for field number 0
- protea.core._feature_enricher_helpers.compute_ia_weight(anc_gid: str, leaf_gid: str, ia_weights: dict[str, float] | None) float¶
Weight contributed by
leaf_gidto its ancestoranc_gid.Without IA weights every edge counts as 1.0. When weights are provided, the contribution is the IA ratio anc/leaf (with a guard against zero-weight leaves to avoid division blow-ups when an incomplete IA file leaves a term at IA=0).
- protea.core._feature_enricher_helpers.make_ancestor_closure(parent_map: dict[str, set[str]] | dict[str, list[str]] | None) Callable[[str], frozenset[str]]¶
Return a memoized
go_id -> ancestor closurecallable.The orchestrator calls the result repeatedly for every leaf GO id; caching is_a / part_of closures here keeps the inner loop O(1) after the first visit.
parent_mapis normalised tofrozensetparents up front so subsequent traversal cannot accidentally mutate the caller’s dict.
- protea.core._feature_enricher_helpers.merge_into_existing_leaf(leaf_anc: dict[str, Any], leaf_rec: dict[str, Any], vote_increment: float, leaf_d: float) None¶
Fold a leaf-as-ancestor vote into the existing leaf candidate.
Called when the ancestor of a leaf prediction is itself already a leaf candidate for the same
(protein, aspect)group. Bumps the target’sneighbor_vote_fractionand lowers itsneighbor_min_distanceif the contributing leaf is closer.
- protea.core._feature_enricher_helpers.update_synth_entry(synth: dict[str, dict[str, Any]], anc: str, leaf_rec: dict[str, Any], leaf_d: float, vote_increment: float, label_ctx: LabelConfig) None¶
Insert or update a synthetic ancestor record.
When the ancestor is not already a leaf candidate the record is fabricated from the closest contributing leaf. If a synthetic entry already exists, the closer leaf wins the per-pair feature payload (alignment, taxonomy, anc2vec, emb_pca) so the booster sees a consistent training-time view; otherwise only the vote fraction is bumped.
Parallel + persistent computation of per-pair alignment features.
The export pipeline’s hotspot is the per-(query, reference) alignment in
_KnnTransferRunner._compute_pair_features: a single-threaded triple
loop calling compute_alignment (parasail NW+SW with traceback) at a
few hundred pairs/s. Two structural wins live here:
Process-level parallelism over the unique alignment pairs. parasail’s traceback variants do not release the GIL well enough for threads to scale, so a
ProcessPoolExecutoris used (benchmarked ~3x on a 12-core box). The taxonomy lookups stay in the parent process because they are cheap (anlru_cacheover ete3 lineages) and process re-init of the ete3 sqlite handle would dwarf the work.A persistent on-disk cache keyed by the ordered sequence pair plus the alignment parameters. Intra-PLM the K=10 neighbour set is a superset of K=5 / K=3, so alignments computed once for the largest-K dataset make the smaller-K datasets (and any re-run) near-free.
protea.trainingis serialised (one job at a time) so there is no concurrent-write contention; a plain sqlite file is enough.
This module is value-preserving by construction: the alignment feature
dict it returns is exactly what compute_alignment returns, only
computed concurrently and memoised. Disabling parallelism (workers <= 1)
and the cache (PROTEA_ALIGN_CACHE_DIR unset / cache-miss) reduces to
the original serial code path.
- class protea.core._pair_feature_compute.PairAlignmentComputer¶
Bases:
objectCompute alignment-feature dicts for many pairs, cached + parallel.
Use as a context manager so the sqlite handle and process pool are released deterministically.
- compute(pairs: list[tuple[str, str]]) dict[tuple[str, str], dict[str, Any]]¶
Return
{(q_seq, r_seq): alignment_dict}for every input pair.Empty-sequence pairs are skipped by the caller; here every pair is assumed to have two non-empty sequences. Deduplicates on the ordered sequence pair so identical pairs are aligned once.
- protea.core._pair_feature_compute.build_pair_feature_dict(q_acc: str, ref_acc: str, align_by_pair: dict[tuple[str, str], dict[str, Any]], *, do_alignments: bool, do_taxonomy: bool, tax_ids: tuple[dict[str, Any] | None, dict[str, Any] | None]) dict[str, Any]¶
Merge the precomputed alignment + per-pair taxonomy for one pair.
Value-identical to the original inline
compute_alignment+compute_taxonomymerge: the alignment half is looked up from the batch-computed map, the taxonomy half is computed here (cheap,lru_cache-backed). Either family is skipped when its flag is off.
- protea.core._pair_feature_compute.precompute_alignment_features(*, aspects: Sequence[str], neighbors_by_aspect: dict[str, list[list[tuple[str, float]]]], valid_queries: Sequence[str], query_sequences: dict[str, str], ref_sequences: dict[str, str]) dict[tuple[str, str], dict[str, Any]]¶
Batch-align every unique (q_acc, ref_acc) neighbour pair.
Drives
PairAlignmentComputer(parallel + on-disk cache) over the deduplicated pair set and remaps the result back onto the(q_acc, ref_acc)accession keys the runner indexes by. Returns an empty map when there are no alignable pairs.
- protea.core._pair_feature_compute.resolve_workers() int¶
Number of process-pool workers for pair-feature computation.
PROTEA_PAIR_FEATURE_WORKERSoverrides; default is the CPU count. A value <= 1 forces the serial path (no pool spin-up).
Shared constants for the dump pipeline submodules.
Kept here so the split helpers, the test-split helpers, and the runner can share them without forming an import cycle (T2B.6).
Parameter objects for the dump pipeline.
Originally lived in protea/core/training_dump_helpers.py. Extracted
to a leaf submodule (T2B.6) so the type-only consumers in
protea/core/_knn_transfer_runner.py can import them without
dragging the larger orchestration code into the import graph.
- class protea.core.training_dump._contexts.KnnTransferContext(valid_queries: list[str], query_emb: np.ndarray, ref_by_aspect: dict[str, dict[str, Any]], go_id_map: dict[int, str], aspect_map: dict[int, str], gt_pairs: set[tuple[str, str]], query_known_gos: dict[str, set[str]] | None = None, parent_map_str: dict[str, set[str]] | None = None, ia_weights: dict[str, float] | None = None, pca_state: tuple[np.ndarray, np.ndarray] | None = None, pivot_go_ids: set[str] | frozenset[str] | None = None, embedding_pool: np.ndarray | None = None)
Bases:
objectBundle of KNN inputs + enrichment maps for
_knn_transfer_and_label.Groups the 12 per-call data arguments (queries, references, ontology maps, optional enrichment helpers) so the entry-point signature stays under flake8-bugbear’s parameter ceiling.
session, payloadp,sequence_context, andstream_outputremain standalone arguments because they are configuration / IO concerns, not data.- aspect_map: dict[int, str]
- embedding_pool: np.ndarray | None = None
- go_id_map: dict[int, str]
- gt_pairs: set[tuple[str, str]]
- ia_weights: dict[str, float] | None = None
- parent_map_str: dict[str, set[str]] | None = None
- pca_state: tuple[np.ndarray, np.ndarray] | None = None
- pivot_go_ids: set[str] | frozenset[str] | None = None
- query_emb: np.ndarray
- query_known_gos: dict[str, set[str]] | None = None
- ref_by_aspect: dict[str, dict[str, Any]]
- valid_queries: list[str]
- class protea.core.training_dump._contexts.SequenceContext(query_sequences: dict[str, str] | None = None, ref_sequences: dict[str, str] | None = None, query_tax_ids: dict[str, int | None] | None = None, ref_tax_ids: dict[str, int | None] | None = None)
Bases:
objectPer-protein sequence and taxonomy lookups.
All four attributes are optional; passing
Nonedisables the corresponding feature family (alignment / taxonomy).- query_sequences: dict[str, str] | None = None
- query_tax_ids: dict[str, int | None] | None = None
- ref_sequences: dict[str, str] | None = None
- ref_tax_ids: dict[str, int | None] | None = None
- class protea.core.training_dump._contexts.StreamOutput(output_parquet: Path, chunk_rows: int = 100000)
Bases:
objectStreaming parquet output for memory-bounded dataset generation.
When provided,
_knn_transfer_and_labelwrites labeled rows directly tooutput_parquetin chunks ofchunk_rowsinstead of accumulating the full result list in memory.- chunk_rows: int = 100000
- output_parquet: Path
See also
Operations: narrative documentation for every operation listed above, with payload examples and execution flow.
Infrastructure: the ORM models that
protea.corereads and writes.How-to Guides: task-oriented recipes that exercise these modules end-to-end.