Core

The protea.core package contains all domain logic. It has no dependency on the infrastructure layer: operations receive an open SQLAlchemy session and an emit callback, but they do not manage connections, queues, or transactions themselves. This strict boundary makes every operation independently testable and trivially substitutable.

Axis tuple

protea.core.axis_tuple defines the AxisTuple named-tuple used to identify a training configuration in the multi-PLM sweep. Each axis tuple carries the keys listed in CANONICAL_AXIS_KEYS: the PLM identifier (plm), neighbourhood size (k), reranker spec (reranker_spec_id), feature schema hash (feature_schema_sha), evaluation set name and manifest hash (eval_set_name, eval_set_manifest_sha), propagation mode (propagation), and ensemble spec (ensemble_spec). All serialised config keys in Dataset and ExperimentRun rows use the canonical axis-tuple string form derived from this type.

Axis-tuple shortid: thin shim that prefers the protea-contracts helper.

The canonical implementation lives in protea_contracts.axis_tuple (shipped by FARM-EXP.1 companion PR on protea-contracts). PROTEA pins protea-contracts@main in pyproject.toml, so once that PR merges and we re-lock, the local fallback below disappears via the try / except ImportError path.

The local fallback is byte-for-byte identical to the upstream helper so the cross-repo shortid contract holds during the brief window where PROTEA is merged before the contracts release that ships the helper.

Formula (mirrors ExperimentSpec.hash() in protea-reranker-lab/src/protea_reranker_lab/experiment.py:108-111):

sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()[:12]

When the upstream helper is available, this module re-exports it verbatim (same symbol, same digest). Callers should import from here so the swap is invisible:

from protea.core.axis_tuple import axis_tuple_shortid

protea.core.axis_tuple.CANONICAL_AXIS_KEYS: tuple[str, ...] = ('plm', 'k', 'reranker_spec_id', 'feature_schema_sha', 'eval_set_name', 'eval_set_manifest_sha', 'propagation', 'ensemble_spec'): Canonical axis-tuple key set. Falls back to the locally-pinned tuple while the upstream contracts release is being rolled out. Order does NOT affect the digest (json.dumps(..., sort_keys=True)).

protea.core.axis_tuple.SHORTID_HEX_LEN: int = 12: Length of the truncated hex digest. Matches the lab convention.

protea.core.axis_tuple.axis_tuple_shortid(axis_tuple: Mapping[str, Any]) → str: Fallback implementation: see module docstring for the formula.
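The formula above can be sketched with the standard library alone. This is a minimal illustration, not the shipped helper: the function name axis_tuple_shortid_sketch and the sample values are hypothetical, and whether the real helper validates keys against CANONICAL_AXIS_KEYS is not stated here, so the sketch does not.

```python
import hashlib
import json
from typing import Any, Mapping

SHORTID_HEX_LEN = 12  # value documented above


def axis_tuple_shortid_sketch(axis_tuple: Mapping[str, Any]) -> str:
    """Hypothetical re-statement of the documented formula."""
    payload = dict(axis_tuple)
    # sort_keys=True makes the serialised form independent of key order.
    serialised = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(serialised).hexdigest()[:SHORTID_HEX_LEN]


# Two mappings with the same content but different key order (sample values are made up).
first = axis_tuple_shortid_sketch({"plm": "demo-plm", "k": 5})
second = axis_tuple_shortid_sketch({"k": 5, "plm": "demo-plm"})
```

Because the keys are sorted before hashing, first and second are the same 12-character hex string.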

Domain types

protea.core.domain.aspect defines the Aspect enum used throughout the codebase to identify the three GO namespaces: MFO (Molecular Function), BPO (Biological Process), and CCO (Cellular Component).

GO aspect (namespace) — the three-way partition of the Gene Ontology.

Two encodings circulate in the codebase, both of them load-bearing:

Single-char codes ("P" / "F" / "C") — the wire format used by GOTerm.aspect in PostgreSQL and the go-basic.obo file itself. The go_term table CHECK constraint is on these codes.
Three-char CAFA codes ("BPO" / "MFO" / "CCO") — the format expected by cafaeval (the upstream Fmax / AuPRC evaluator) and surfaced in the UI for human readers.

Until this module landed, both encodings appeared as bare string literals in 30+ places — a textbook Primitive Obsession smell. The enum is the single source of truth; everything else converts at the boundary.

Typical usage:

from protea.core.domain.aspect import Aspect

# Iterate the three aspects in a stable canonical order
for aspect in Aspect:
    ...

# Convert from a DB row
aspect = Aspect.from_code(row.aspect)

# Hand off to cafaeval
result_dict[aspect.cafa] = ...

class protea.core.domain.aspect.Aspect(*values)

Bases: Enum

Gene Ontology aspect / namespace.

The three GO sub-ontologies. Iteration order is the canonical PROTEA order (P → F → C), matching the historical _ASPECTS = ("P", "F", "C") tuples this enum replaces.

BIOLOGICAL_PROCESS = 'P'

CELLULAR_COMPONENT = 'C'

MOLECULAR_FUNCTION = 'F'

property cafa: str

Three-char CAFA code ("BPO" / "MFO" / "CCO").

Format expected by the upstream cafaeval package and the evaluation results JSON; also the canonical UI label.

property code: str

Single-char code ("P" / "F" / "C").

Wire format in PostgreSQL (go_term.aspect column) and the go-basic.obo file. Use this when reading/writing the DB or comparing against the ORM column.

classmethod from_cafa(cafa: str) → Aspect: Build an Aspect from its three-char CAFA code.

classmethod from_code(code: str) → Aspect: Build an Aspect from its single-char wire code.
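The two encodings and the boundary conversions can be sketched as a plain Enum. The member values, property names, and classmethod names follow the reference above; the class name AspectSketch, the lookup table, and the error message are illustrative assumptions rather than the actual implementation.

```python
from enum import Enum

# Hypothetical lookup table: single-char wire code -> three-char CAFA code.
_CODE_TO_CAFA = {"P": "BPO", "F": "MFO", "C": "CCO"}


class AspectSketch(Enum):
    # Definition order gives the P -> F -> C iteration order described above.
    BIOLOGICAL_PROCESS = "P"
    MOLECULAR_FUNCTION = "F"
    CELLULAR_COMPONENT = "C"

    @property
    def code(self) -> str:
        """Single-char code, as stored in the database column."""
        return self.value

    @property
    def cafa(self) -> str:
        """Three-char CAFA code, as expected by the evaluator."""
        return _CODE_TO_CAFA[self.value]

    @classmethod
    def from_code(cls, code: str) -> "AspectSketch":
        return cls(code)  # Enum lookup by value; raises ValueError if unknown

    @classmethod
    def from_cafa(cls, cafa: str) -> "AspectSketch":
        for aspect in cls:
            if aspect.cafa == cafa:
                return aspect
        raise ValueError(f"unknown CAFA aspect code: {cafa!r}")


iteration_codes = [aspect.code for aspect in AspectSketch]
```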

Contracts

The contracts module defines the interfaces that every operation must satisfy and the shared types used across the entire codebase.

Operation is a structural Protocol: any class that exposes a name string and an execute(session, payload, *, emit) method conforms to it, without needing to inherit from a base class. ProteaPayload is the immutable, strictly-typed Pydantic base class for all operation payloads: strict mode prevents silent type coercion, and frozen configuration prevents accidental mutation after validation. OperationResult is the return value of every execute call; its deferred flag tells BaseWorker that completion will be signalled by child workers rather than immediately. RetryLaterError is raised when a shared resource (e.g. the GPU) is occupied; BaseWorker catches it, resets the job to QUEUED, and re-publishes the message after a configurable delay.

OperationRegistry is a simple dict-backed mapping from operation name strings to instances. Workers resolve the correct operation at message dispatch time; new operations are registered at process startup in scripts/worker.py without modifying any worker code.
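The structural-typing idea can be illustrated with a small standard-library sketch. Everything below is an assumption made for illustration: the class names carry a Sketch suffix, the emit callback is modelled as a plain callable taking an event name, and the payload is a dict rather than a ProteaPayload subclass.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Protocol, runtime_checkable


@dataclass(frozen=True)
class OperationResultSketch:
    result: Dict[str, Any]
    deferred: bool = False  # True would mean child workers signal completion


@runtime_checkable
class OperationSketch(Protocol):
    name: str

    def execute(
        self, session: Any, payload: Dict[str, Any], *, emit: Callable[[str], None]
    ) -> OperationResultSketch: ...


class PingSketch:
    """Conforms structurally: no base class, just a name and an execute method."""

    name = "ping"

    def execute(self, session, payload, *, emit):
        emit("ping.done")
        return OperationResultSketch(result={"ok": True})


# Dict-backed registry keyed by operation name, resolved at dispatch time.
registry: Dict[str, OperationSketch] = {}
op = PingSketch()
registry[op.name] = op

events = []
outcome = registry["ping"].execute(None, {}, emit=events.append)
```

PingSketch never inherits from OperationSketch, yet it passes an isinstance check because the protocol is runtime-checkable and the required members are present.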

parent_progress exposes the shared _update_parent_progress helper used by every coordinator operation (compute_embeddings, predict_go_terms) to advance the parent job’s progress as child workers finish their batches. Extracted to its own module in F0 (T0.7) to remove duplicated copies across coordinators.

Retry middleware

protea.core.retry exposes with_retry, a wrapper function that BaseWorker uses to re-run the execute session when it hits transient database errors (deadlocks, connection drops, serialisation failures) or brief network blips. Retries use exponential backoff with jitter; all knobs (max_attempts, base_delay, max_delay, jitter_ratio, predicate, on_retry) are bundled in a RetryPolicy frozen dataclass passed via the policy keyword argument (T-CONTEXTS, PR #237). BaseWorker instantiates a fixed policy at the call site (RetryPolicy(max_attempts=3, base_delay=1.0, max_delay=10.0, jitter_ratio=0.3)); there is no global TuningSettings field for these values. Added as part of F0 (T0.3) of the master plan revision 3.
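A rough sketch of exponential backoff with jitter, driven by a frozen policy object, is given below. The field names mirror the knobs listed above, but the exact delay formula, the on_retry signature, and the injectable sleep parameter are assumptions made so the example is self-contained and testable; the real with_retry may differ.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryPolicySketch:
    max_attempts: int = 3
    base_delay: float = 1.0
    max_delay: float = 10.0
    jitter_ratio: float = 0.3
    predicate: Callable[[BaseException], bool] = lambda exc: True  # retry everything
    on_retry: Optional[Callable[[int, BaseException, float], None]] = None


def with_retry_sketch(
    fn: Callable[[], T],
    *,
    policy: RetryPolicySketch,
    sleep: Callable[[float], None] = time.sleep,  # hypothetical test hook
) -> T:
    if policy.max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or when the error is not retryable.
            if attempt == policy.max_attempts or not policy.predicate(exc):
                raise
            # Assumed formula: base * 2^(attempt-1), capped, then +/- jitter_ratio.
            delay = min(policy.max_delay, policy.base_delay * 2 ** (attempt - 1))
            delay *= 1 + random.uniform(-policy.jitter_ratio, policy.jitter_ratio)
            if policy.on_retry is not None:
                policy.on_retry(attempt, exc, delay)
            sleep(delay)
    raise AssertionError("unreachable")


calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


delays = []
result = with_retry_sketch(flaky, policy=RetryPolicySketch(), sleep=delays.append)
```

Here flaky fails twice and succeeds on the third attempt, so two delays are recorded instead of actually sleeping.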

Operation catalogue

protea.core.operation_catalog builds the singleton OperationRegistry that workers consult at message dispatch. The public function build_operation_registry() instantiates each operation class and registers it under its canonical name. Adding a new operation is a one-line edit here plus a new module under protea/core/operations/.

Plugin discovery

protea.core.plugins centralises importlib.metadata.entry_points discovery for every PROTEA plugin group (protea.backends, protea.sources, protea.runners). discover_plugins(group) returns a cached {name: plugin} map and hard-errors with RuntimeError if a plugin’s name attribute drifts from its entry-point name. reset_plugin_cache is a test-only seam for suites that install/uninstall plugins between cases. Added in T2A.5 for backend dispatch and generalised in T2A.8 (PR #240).

Generic entry-point plugin discovery (T2A.5 + T2A.8).

Both backends (T2A.5) and runners (T2A.8) follow the same shape: a package declares plugins in [tool.poetry.plugins."protea.<group>"] mapping name = "module:plugin", PROTEA discovers them via importlib.metadata.entry_points, and a resolver maps a string identifier to a plugin instance for dispatch.

This module centralises that discovery + cache + name-mismatch hard error so the backend and runner code paths share one implementation instead of forking. See protea.core.runners and the _get_backend_plugins shim in protea.core.operations.compute_embeddings for the call sites.

protea.core.plugins.discover_plugins(group: str) → dict[str, Any]

Discover and cache plugins in the given entry_points group.

Returns a dict keyed by plugin.name. Each plugin must declare a name class attribute that matches its entry-point name; mismatches raise RuntimeError at discovery time so the failure surface is a clear “your declaration drifted from your entry_points file” rather than a confusing “Unknown <something>” later on.

Discovery happens once per process per group; subsequent calls return the cached map. The cache key is the group string itself, so the same group cannot be discovered twice (which keeps the import-time side effects stable across reload-heavy test runs).

protea.core.plugins.reset_plugin_cache(group: str | None = None) → None

Drop the cached plugin map for one (or every) group.

Test-only seam: lets unit tests force re-discovery without restarting the process. Passing group=None clears every group; passing a specific name clears only that one. Production callers should not invoke this; the cache is intentionally process-local.

Feature registry

protea.core.features.registry is the central registry that maps feature-family names to their producer functions. The ALL_FEATURES constant lists every column the pipeline may populate; adding a column without wiring a producer here causes a T1.8 invariant failure in the dataset-export pipeline.

Schema SHA

protea.core.schema_sha_v2 computes a 16-character deterministic fingerprint of the feature schema version. The hash is embedded in Dataset.schema_sha and RerankerModel.feature_schema_sha; a mismatch at inference time indicates that the booster was trained on a different feature set and must be rejected.

schema_sha_v2 dual-write feature flag (T1.6 of master plan v3).

The protea.infrastructure.orm.models Dataset, RerankerModel, and ExperimentRun rows carry both the legacy schema_sha / feature_schema_sha column and the canonical schema_sha_v2 column added by ADR D10. Until the production cut-over window, write paths fill only the legacy column; once the operator sets PROTEA_SCHEMA_SHA_V2_WRITE_ENABLED=true, new rows are written to both columns.

The flag defaults to false (off) so a normal deploy of this slice is a pure no-op for the data plane. The accompanying alembic migration ships the column and a one-shot backfill; production reads still come off schema_sha until a later slice flips the routers.

Truthy values follow the same convention as protea.api.auth._authn_required() so operators do not learn two different env-flag dialects: 1, true, yes, on (case insensitive). Anything else (including unset) is treated as false.

protea.core.schema_sha_v2.is_dual_write_enabled() → bool

Return True when schema_sha_v2 should be written on inserts.

The check is read-fresh each call so tests can flip the value via pytest.MonkeyPatch.setenv() without restarting the process and operators can roll the flag forward without redeploying.

protea.core.schema_sha_v2.maybe_v2(value: str | None) → str | None

Pass-through helper for dual-write call sites.

Returns value when the flag is on, None otherwise. Lets write paths stay terse: schema_sha_v2=maybe_v2(legacy_sha) instead of an inline ternary at every insert site.
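The flag check and the pass-through helper amount to very little code. The sketch below follows the documented variable name and truthy set; the function names with a _sketch suffix and the sample hash value are hypothetical.

```python
import os
from typing import Optional

FLAG_NAME = "PROTEA_SCHEMA_SHA_V2_WRITE_ENABLED"
_TRUTHY = frozenset({"1", "true", "yes", "on"})


def is_dual_write_enabled_sketch() -> bool:
    # Read on every call so the flag can be flipped without a restart.
    return os.environ.get(FLAG_NAME, "").lower() in _TRUTHY


def maybe_v2_sketch(value: Optional[str]) -> Optional[str]:
    return value if is_dual_write_enabled_sketch() else None


# Demonstrate both states, then restore the environment.
_previous = os.environ.get(FLAG_NAME)
os.environ[FLAG_NAME] = "TRUE"
when_on = maybe_v2_sketch("abc123")   # flag on: value passes through
os.environ[FLAG_NAME] = "0"
when_off = maybe_v2_sketch("abc123")  # flag off: None
if _previous is None:
    del os.environ[FLAG_NAME]
else:
    os.environ[FLAG_NAME] = _previous
```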

JSONB dual-write

protea.core.jsonb_dual_write provides helpers for writing structured data to both a typed column and a JSONB fallback column in the same transaction. Used during schema-migration windows where old and new code may run concurrently against the same database.

Env-flag helper for the T3.1 GOPrediction.predictions_jsonb dual-write.

T3.1 scaffolds the prediction-tuple JSONB dual-write: writers continue filling the typed prediction columns (go_term_id, distance, evidence_code) and additionally serialise the same tuple into the predictions_jsonb blob when the environment opt-in is set. This keeps the scaffolding inert in production until a human-coordinated deploy window flips the flag.

Environment variable

PROTEA_GO_PREDICTION_JSONB_WRITE_ENABLED: Truthy values (1, true, yes, on; case-insensitive) enable the dual-write. Anything else (unset, empty, 0, false…) leaves the writer skipping the JSONB column so predictions_jsonb stays NULL on new rows. Default off.

Design notes

The helper API is intentionally tiny so every writer site can opt in with a single import + one-line call:

from protea.core.jsonb_dual_write import maybe_jsonb

row["predictions_jsonb"] = maybe_jsonb(
    [(pred["go_term_id"], pred["distance"], pred.get("evidence_code"))]
)

maybe_jsonb returns None when the flag is off (so the row dict carries an explicit NULL for SQL) and a compact dict when on. The compact shape is a single-level dict with two keys:

predictions: a list of {"go_term_id": int, "score": float, "evidence": str | None} records, preserving caller order.
count: len(predictions), redundant but useful for cheap GIN filtering and sanity checks (WHERE predictions_jsonb @> '{"count": 5}' etc.).

The list shape lets a single row carry one or more prediction tuples without re-shaping the helper at the T3.2 backfill / T3.3 reader cut-over.

protea.core.jsonb_dual_write.JSONB_DUAL_WRITE_ENV: str = 'PROTEA_GO_PREDICTION_JSONB_WRITE_ENABLED': Environment variable name; exported for tests / docs.

protea.core.jsonb_dual_write.is_jsonb_dual_write_enabled(env: dict[str, str] | None = None) → bool

Return True iff the env opt-in is set to a truthy value.

env defaults to os.environ; the explicit parameter is a test hook so callers can verify the helper without mutating process state.

protea.core.jsonb_dual_write.maybe_jsonb(predictions: Sequence[tuple[int, float, str | None]], *, env: dict[str, str] | None = None) → dict[str, Any] | None

Serialise a list of (go_term_id, score, evidence) tuples.

Returns None when the dual-write env opt-in is off so the caller can assign the result directly to a row dict / ORM attribute without an explicit guard at every writer site.

When the flag is on, returns the compact JSONB shape documented in the module docstring. Empty input still returns a well-formed dict ({"predictions": [], "count": 0}) so downstream consumers can rely on the keys being present.
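Put together, the documented behaviour can be sketched as below. The output shape and the env test hook follow the description above; the _sketch function name and the sample tuples are illustrative, and details such as whitespace handling in the flag value are not specified in this page and are left out.

```python
import os
from typing import Any, Dict, Mapping, Optional, Sequence, Tuple

JSONB_DUAL_WRITE_ENV = "PROTEA_GO_PREDICTION_JSONB_WRITE_ENABLED"
_TRUTHY = frozenset({"1", "true", "yes", "on"})

PredictionTuple = Tuple[int, float, Optional[str]]


def maybe_jsonb_sketch(
    predictions: Sequence[PredictionTuple],
    *,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Dict[str, Any]]:
    source = os.environ if env is None else env
    if source.get(JSONB_DUAL_WRITE_ENV, "").lower() not in _TRUTHY:
        return None  # flag off: the column stays NULL
    records = [
        {"go_term_id": go_term_id, "score": score, "evidence": evidence}
        for go_term_id, score, evidence in predictions
    ]
    return {"predictions": records, "count": len(records)}


enabled = {JSONB_DUAL_WRITE_ENV: "on"}
blob = maybe_jsonb_sketch([(42, 0.25, "IEA"), (7, 0.5, None)], env=enabled)
skipped = maybe_jsonb_sketch([(42, 0.25, "IEA")], env={})
```

Passing env explicitly keeps the example free of process-state mutation, which is the purpose of the test hook described above.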

Experiment runners

protea.core.runners adapts the generic plugin discovery to the protea.runners group. resolve_runner(name) maps an identifier ("knn" / "baseline" / "lightgbm") to a runner plugin instance implementing the protea_contracts.ExperimentRunner interface; unknown names raise ValueError listing the discovered set. PROTEA does not yet dispatch to runners at inference time (the active KNN + reranker path stays in PredictGOTermsBatchOperation until F2C of master plan revision 3 hoists the inference core into a shared package). The adapter exists so that the plugin registry endpoint (GET /v1/registry/runners) has a stable resolver and future code has a one-line entry point. Closes T2A.8 (PR #240).

Experiment-runner plugin adapter (T2A.8).

Mirrors the backend-plugin adapter in protea.core.operations.compute_embeddings: discovers runners via importlib.metadata.entry_points("protea.runners") and exposes a resolve_runner() lookup that maps a string identifier (e.g. "knn" / "baseline" / "lightgbm") to a runner plugin instance implementing protea_contracts.ExperimentRunner.

The protea-runners package declares the canonical plugin set:

knn: KNN-only baseline (no reranker), evaluates against a frozen dataset for ablation.
baseline: predict-by-frequency baseline (most-frequent GO terms in the training set).
lightgbm: the v9 / v18 LightGBM reranker, currently trained out-of-tree in protea-reranker-lab.

PROTEA does not yet dispatch to runners at inference time (the active KNN + reranker path lives in PredictGOTermsBatchOperation and stays there until F2C of master plan v3 hoists the inference core into a package both PROTEA and protea-runners can depend on). The adapter exists today so:

The plugin registry endpoint (GET /v1/registry/runners) has a stable resolver under the hood (currently still using its own _discover for the API shape; that path is untouched here).
Future code that wants a runner instance has a one-line entry: from protea.core.runners import resolve_runner.
The discovery + name-mismatch hard error follows the same pattern as backends, so the lab + plugin authoring workflow is uniform across both groups.

protea.core.runners.get_runner_plugins() → dict[str, Any]

Return the discovered runner plugins keyed by plugin.name.

Wraps protea.core.plugins.discover_plugins() so the runner callers do not need to know the entry-point group string; the constant lives here next to the consumers.

protea.core.runners.resolve_runner(name: str) → Any

Resolve a runner identifier to a plugin instance.

Raises ValueError listing the discovered runners when the identifier is unknown, so the failure message is actionable. The returned object implements protea_contracts.ExperimentRunner.

Utilities

protea.core.utils provides two shared utilities used across multiple operations.

utcnow() returns a timezone-aware UTC datetime, avoiding the common mistake of calling datetime.utcnow(), which returns a naive object (and is deprecated since Python 3.12). chunks(seq, n) splits any sequence into fixed-size chunks; coordinator operations use it to partition work into batches.

The previous UniProtHttpMixin (exponential backoff with jitter, Retry-After header parsing, cursor extraction for paginated UniProt REST endpoints) was inlined into InsertProteinsOperation and FetchUniProtMetadataOperation when those operations were rewritten; the retry/backoff/cursor logic now lives directly in each operation.

protea.core.utils.chunks(seq: Sequence[Any], n: int) → Iterable[Sequence[Any]]: Yield successive n-sized chunks from seq.

protea.core.utils.utcnow() → datetime: Return the current UTC datetime (timezone-aware).
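Both utilities are small enough to sketch in full. The versions below are plausible equivalents rather than the shipped code; the _sketch names and the guard against a non-positive chunk size are additions made for this example.

```python
from datetime import datetime, timezone
from typing import Any, Iterable, Sequence


def utcnow_sketch() -> datetime:
    # datetime.now(timezone.utc) carries tzinfo; datetime.utcnow() does not.
    return datetime.now(timezone.utc)


def chunks_sketch(seq: Sequence[Any], n: int) -> Iterable[Sequence[Any]]:
    if n <= 0:
        raise ValueError("n must be positive")  # guard added for the sketch
    for start in range(0, len(seq), n):
        yield seq[start:start + n]


batches = list(chunks_sketch(["P1", "P2", "P3", "P4", "P5"], 2))
now = utcnow_sketch()
```

With five items and a chunk size of two, slicing yields two full batches and one final batch holding the single remaining item.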

KNN search

protea.core.knn_search provides the nearest-neighbour search layer used during GO term prediction. The single public entry point is search_knn(), which dispatches to one of two backends based on the backend parameter.

The numpy backend computes exact cosine or L2 distances via matrix multiplication. It requires no additional dependencies and is the default. For cosine distance, query and reference matrices are L2-normalised and the distance is computed as \(D = 1 - \cos(\theta) \in [0, 2]\). For \(Q\) queries against \(N\) references this evaluates \(O(NQ)\) distances (each costing time proportional to the embedding dimension) and is appropriate for reference sets up to approximately 100 000 proteins when embeddings fit in RAM as float16.
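The exact cosine path can be sketched in a few lines of NumPy. This is an illustration of the technique, not the search_knn implementation: the function name, argument names, and return layout are assumptions, and zero-norm vectors are not handled.

```python
import numpy as np


def cosine_knn_sketch(queries: np.ndarray, references: np.ndarray, k: int):
    """Return (indices, distances) of the k nearest references per query."""
    # astype copies, so the in-place normalisation leaves the inputs untouched.
    q = queries.astype(np.float32)
    r = references.astype(np.float32)
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    r /= np.linalg.norm(r, axis=1, keepdims=True)

    dist = 1.0 - q @ r.T  # shape (Q, N), values in [0, 2]
    k = min(k, dist.shape[1])

    # argpartition finds the k smallest per row without a full sort ...
    idx = np.argpartition(dist, k - 1, axis=1)[:, :k]
    rows = np.arange(dist.shape[0])[:, None]
    # ... then only those k candidates are sorted by distance.
    order = np.argsort(dist[rows, idx], axis=1)
    idx = idx[rows, order]
    return idx, dist[rows, idx]


refs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], dtype=np.float16)
query = np.array([[2.0, 0.0]], dtype=np.float16)
nearest_idx, nearest_dist = cosine_knn_sketch(query, refs, k=2)
```

The query points along the first axis, so reference 0 is at distance 0, reference 1 (orthogonal) at distance 1, and reference 2 (opposite) at distance 2; the two nearest are references 0 and 1.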

The faiss backend wraps the FAISS library and supports three index types: Flat (exact), IVFFlat (approximate, Voronoi partitioning), and HNSW (approximate, hierarchical graph). IVFFlat is recommended for datasets above 100 000 vectors: it restricts search to the nprobe nearest Voronoi cells, reducing query time from \(O(N)\) to approximately \(O(\sqrt{N})\) with negligible recall loss at default settings.

Important

KNN search is never performed at the database layer. pgvector index types (HNSW, IVFFlat) are not used. All search happens in Python after loading reference embeddings into a numpy array. See Predict GO terms in the how-to guides.

Feature engineering

protea.core.feature_engineering enriches each query–reference pair in a prediction result with sequence-level and phylogenetic signals. These features are opt-in: they are computed only when compute_alignments=true and/or compute_taxonomy=true are set in the prediction payload.

Pairwise alignment is computed via the parasail library using the BLOSUM62 substitution matrix with gap-open/extend penalties of 10/1. Both global (Needleman–Wunsch) and local (Smith–Waterman) alignments are run for each pair, producing identity, similarity, raw score, gap percentage, and alignment length for each. These metrics capture sequence similarity beyond what the embedding distance alone encodes, which is especially valuable for distant homologues where embedding geometry may be unreliable.

Taxonomic distance is computed via ete3 and the NCBI taxonomy tree (local SQLite, downloaded on first use). For each (query, reference) pair where taxonomy IDs are available from UniProt metadata, PROTEA finds the lowest common ancestor and computes the edge count through it. Results are cached with an LRU cache keyed by taxon-ID pair to avoid redundant tree traversals across a batch.
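The edge count through the lowest common ancestor can be illustrated without ete3 by working on root-to-taxon lineages directly. The lineage table below uses made-up taxon IDs and stands in for the NCBI lookup; the function name and return shape are likewise assumptions for this sketch.

```python
from functools import lru_cache
from typing import Dict, Tuple

# Hypothetical root-to-taxon lineages (toy IDs, not real NCBI taxa).
_TOY_LINEAGES: Dict[int, Tuple[int, ...]] = {
    101: (1, 10, 100, 101),
    102: (1, 10, 100, 102),
    201: (1, 20, 201),
}


@lru_cache(maxsize=4096)
def taxonomic_distance_sketch(t1: int, t2: int) -> Tuple[int, int]:
    """Return (lca, edge_count) for two taxon IDs; cached per ordered pair."""
    lineage1, lineage2 = _TOY_LINEAGES[t1], _TOY_LINEAGES[t2]
    shared = 0
    for a, b in zip(lineage1, lineage2):
        if a != b:
            break
        shared += 1
    if shared == 0:
        raise ValueError("lineages share no common ancestor")
    lca = lineage1[shared - 1]
    # Edges from each taxon up to the LCA, summed.
    distance = (len(lineage1) - shared) + (len(lineage2) - shared)
    return lca, distance


siblings = taxonomic_distance_sketch(101, 102)
distant = taxonomic_distance_sketch(101, 201)
```

Note that lru_cache treats (101, 102) and (102, 101) as different keys; sorting the pair before the lookup would be one way to share the cached entry.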

Feature engineering utilities for functional annotation enrichment.

Provides pairwise alignment metrics (Needleman–Wunsch and Smith–Waterman) via parasail and taxonomic distance computation via ete3 NCBITaxa.

These features complement the embedding-space KNN distance stored in GOPrediction.distance with sequence-level and phylogenetic signals.

Performance notes:

Alignment is O(m*n) per pair; parasail uses SIMD acceleration.
Taxonomy lookups use an LRU cache over lineage queries (ete3 local SQLite). First call may trigger a DB download if the ete3 database is absent.

protea.core.feature_engineering.compute_alignment(seq1: str, seq2: str) → dict[str, Any]: Compute both NW and SW alignment metrics in one call.

protea.core.feature_engineering.compute_nw(seq1: str, seq2: str, *, gap_open: int = 10, gap_extend: int = 1) → dict[str, Any]

Global alignment (Needleman–Wunsch) via parasail/BLOSUM62.

Returns a dict with keys: identity_nw, similarity_nw, alignment_score_nw, gaps_pct_nw, alignment_length_nw, length_query, length_ref

protea.core.feature_engineering.compute_sw(seq1: str, seq2: str, *, gap_open: int = 10, gap_extend: int = 1) → dict[str, Any]

Local alignment (Smith–Waterman) via parasail/BLOSUM62.

Returns a dict with keys: identity_sw, similarity_sw, alignment_score_sw, gaps_pct_sw, alignment_length_sw

protea.core.feature_engineering.compute_taxonomy(t1_raw: Any, t2_raw: Any) → dict[str, Any]

Compute taxonomic distance between two NCBI taxonomy IDs.

Returns a dict with keys: taxonomic_lca, taxonomic_distance, taxonomic_common_ancestors, taxonomic_relation

protea.core.feature_engineering.warmup_taxonomy_db() → None

Pre-initialize the NCBITaxa database.

Call at worker startup so the download (~100 MB on first run) happens before any batch is processed, not mid-flight.

Re-ranker

protea.core.reranker implements a LightGBM binary classifier that re-scores GO term predictions using 20 numeric features (embedding distance, NW/SW alignment metrics, sequence lengths, taxonomic distance and common ancestors, and 5 aggregate re-ranker signals) plus 3 categorical features (qualifier, evidence code, taxonomic relation). The full feature list is documented in train_reranker.

The module provides:

prepare_dataset(df): extracts and coerces feature columns. Numeric columns are coerced with errors="coerce" (invalid strings become NaN); categorical columns are converted to pandas category dtype, which LightGBM consumes directly without manual label encoding.
train(df): stratified positive/negative split with early-stopping on a held-out validation set (default 20 %). Returns a TrainResult with the Booster, validation metrics (AUC, logloss, precision, recall, F1 at the 0.5 threshold), the best boosting iteration, and gain-based feature importance.
predict(model, df): returns probability scores in [0, 1].
model_to_string() / model_from_string(): serialization for DB storage in the RerankerModel table.
load_training_tsv(): parses a training data TSV as produced by the /scoring/prediction-sets/{id}/training-data.tsv endpoint.

Note

load_reranker / apply_reranker / infer_active_feature_families were originally split into a sibling protea.core.reranking module; they were folded back into protea.core.reranker to remove a naming trap (reranker vs reranking were impossible to grep apart). This module is now the single inference-side surface.

Parquet export (`protea.core.parquet_export`)

protea.core.parquet_export consolidates per-split, per-category parquet shards produced by the KNN + feature pipeline into the frozen dataset layout consumed by protea-reranker-lab: exactly train.parquet, eval.parquet and manifest.json under a single directory. The manifest follows ManifestV1 (schema version 2) and records PROTEA’s producer_version + producer_git_sha.

The single public function export_reranker_parquets(...) is shared by two callers:

training_dump_helpers._dump_frozen_dataset: thin wrapper that uses this helper to emit the dataset alongside a training-data dump.
ExportResearchDatasetOperation: stand-alone operation that only materialises and publishes the dataset, without running LightGBM.

When store is provided, the three consolidated files are additionally uploaded under key_prefix using the ArtifactStore interface, and the returned dict includes train_uri / eval_uri / manifest_uri.

Scoring

protea.core.scoring implements the scoring engine that applies weighted formulas to GO predictions. A ScoringConfig defines a set of weights for each feature column (embedding distance, alignment metrics, taxonomy, re-ranker features). The engine computes a composite score per prediction row and can stream scored results as TSV or compute CAFA-style metrics (Fmax, AUC-PR) against an evaluation set.

Metrics

protea.core.metrics implements CAFA-style precision-recall evaluation. Provides functions for computing Fmax (maximum F-measure over all thresholds), weighted precision/recall, and coverage for a set of predictions against ground-truth annotations.
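For orientation, the standard protein-centric Fmax can be sketched in pure Python. This is a simplified illustration, not the module's code: it assumes every benchmark protein has a non-empty truth set, sweeps thresholds from 0.01 to 1.00 in steps of 0.01, and omits ontology propagation and any weighting. Precision is averaged over proteins with at least one prediction at the threshold; recall is averaged over all benchmark proteins.

```python
from typing import Dict, Set, Tuple


def fmax_sketch(
    predictions: Dict[str, Dict[str, float]],  # protein -> {term: score}
    truth: Dict[str, Set[str]],                # protein -> true terms (non-empty)
) -> Tuple[float, float]:
    """Return (fmax, threshold at which it is first reached)."""
    best_f, best_tau = 0.0, 0.0
    n_proteins = len(truth)
    for step in range(1, 101):
        tau = step / 100
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, score in scored.items() if score >= tau}
            hits = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with predictions
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / n_proteins
        if precision + recall > 0:
            f = 2 * precision * recall / (precision + recall)
            if f > best_f:
                best_f, best_tau = f, tau
    return best_f, best_tau


toy_predictions = {"P1": {"GO:1": 0.9, "GO:2": 0.4}, "P2": {"GO:3": 0.8}}
toy_truth = {"P1": {"GO:1"}, "P2": {"GO:4"}}
toy_fmax, toy_tau = fmax_sketch(toy_predictions, toy_truth)
```

In the toy data the best trade-off occurs once the threshold exceeds 0.8: only P1 still has a prediction, and it is correct, so precision is 1 and recall is 0.5, giving F = 2/3.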

Evidence codes

protea.core.evidence_codes provides mappings between ECO (Evidence and Conclusion Ontology) identifiers and GO evidence codes used in GAF files. Used by the QuickGO annotation loader to resolve ECO IDs to standard three-letter evidence codes.

GO evidence code utilities.

Provides normalisation from ECO IDs to canonical GO evidence codes and classification of codes as experimental vs. non-experimental.

Mapping source (canonical ECO → GO code): https://github.com/evidenceontology/evidenceontology/blob/master/gaf-eco-mapping.txt

Additional ECO IDs used by UniProt GAF (not in the canonical mapping file) are resolved here using the EBI QuickGO ECO term definitions; all are “used in automatic assertion” and therefore map to IEA.

protea.core.evidence_codes.is_experimental(code: str) → bool

Return True if code (GO or ECO) represents experimental evidence.

Examples:

is_experimental("IDA")          # True
is_experimental("ECO:0000314")  # True  (IDA)
is_experimental("IEA")          # False
is_experimental("ECO:0000501")  # False (IEA)

protea.core.evidence_codes.normalize(code: str) → str

Return the canonical GO evidence code for code.

If code is already a GO root evidence code it is returned as-is. ECO IDs are translated via ECO_TO_CODE. Unknown codes are returned unchanged so that no information is silently discarded.

Examples:

normalize("IDA")           # → "IDA"
normalize("ECO:0000314")   # → "IDA"
normalize("ECO:0000256")   # → "IEA"
normalize("UNKNOWN")       # → "UNKNOWN"
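The lookup-then-classify behaviour shown in the examples can be sketched as follows. The mapping contains only the three ECO IDs that appear in the examples above, and the experimental set lists the six classic GO experimental codes; whether the real module also treats the high-throughput codes as experimental is not stated here, so the sketch leaves them out. All names carry a sketch suffix to mark them as hypothetical.

```python
# Partial mapping: only the ECO IDs used in the documented examples.
ECO_TO_CODE_SKETCH = {
    "ECO:0000314": "IDA",
    "ECO:0000501": "IEA",
    "ECO:0000256": "IEA",
}

# Assumed experimental set (classic GO experimental evidence codes).
EXPERIMENTAL_SKETCH = frozenset({"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"})


def normalize_sketch(code: str) -> str:
    # Unknown inputs fall through unchanged, so nothing is silently dropped.
    return ECO_TO_CODE_SKETCH.get(code, code)


def is_experimental_sketch(code: str) -> bool:
    return normalize_sketch(code) in EXPERIMENTAL_SKETCH


normalised = [normalize_sketch(c) for c in ("IDA", "ECO:0000314", "ECO:0000256", "UNKNOWN")]
flags = [is_experimental_sketch(c) for c in ("IDA", "ECO:0000314", "IEA", "ECO:0000501")]
```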

Evaluation

protea.core.evaluation implements the CAFA5 evaluation protocol for computing the ground-truth delta between two annotation snapshots.

The module’s central data structure is EvaluationData, a frozen dataclass that holds the NK, LK, PK, known, and pk_known annotation dictionaries. Each dictionary maps a protein accession to a set of GO term IDs.

EvaluationData fields:

nk: delta annotations for No-Knowledge proteins (no prior annotations in any namespace at t0).
lk: delta annotations for Limited-Knowledge proteins (had annotations in some namespaces but gained new terms in a previously empty namespace).
pk: novel annotations for Partial-Knowledge proteins (gained new terms in a namespace where they already had annotations).
pk_known: old experimental annotations for PK proteins in the relevant namespaces; passed as -known to cafaeval to exclude them from scoring.
known: all old experimental annotations flattened across namespaces; available for download via the reference endpoint.

The public entry point is compute_evaluation_data(session, old_annotation_set_id, new_annotation_set_id, ontology_snapshot_id). It loads the GO DAG for NOT-propagation, builds a per-namespace annotation map for both old and new sets, and classifies each (protein, namespace) pair into NK, LK, or PK. The same protein can appear in multiple categories across different namespaces simultaneously (e.g., LK in CCO and PK in BPO).
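The per-namespace classification rule can be sketched on plain dictionaries. This illustrates only the NK / LK / PK decision described above; it assumes the inputs are already restricted to experimental evidence and skips NOT-propagation and database access. The function name and the nested-dict input layout are assumptions.

```python
from typing import Dict, Set, Tuple

# protein -> namespace -> set of GO term IDs
Annotations = Dict[str, Dict[str, Set[str]]]
Delta = Dict[str, Set[str]]


def classify_delta_sketch(old: Annotations, new: Annotations) -> Tuple[Delta, Delta, Delta]:
    nk: Delta = {}
    lk: Delta = {}
    pk: Delta = {}
    for protein, per_namespace in new.items():
        old_per_namespace = old.get(protein, {})
        had_any_annotation = any(old_per_namespace.values())
        for namespace, terms in per_namespace.items():
            prior = old_per_namespace.get(namespace, set())
            gained = terms - prior
            if not gained:
                continue
            if not had_any_annotation:
                target = nk   # nothing known in any namespace at t0
            elif not prior:
                target = lk   # known elsewhere, this namespace was empty
            else:
                target = pk   # this namespace already had annotations
            target.setdefault(protein, set()).update(gained)
    return nk, lk, pk


old_set = {"A": {"BPO": {"GO:1"}}, "B": {}}
new_set = {
    "A": {"BPO": {"GO:1", "GO:2"}, "CCO": {"GO:3"}},
    "B": {"MFO": {"GO:4"}},
}
nk_delta, lk_delta, pk_delta = classify_delta_sketch(old_set, new_set)
```

Protein A lands in two categories at once, LK for its new CCO term and PK for its new BPO term, which matches the multi-category case mentioned above; protein B had no prior annotations and is NK.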

Provenance

protea.core.provenance provides capture_provenance(extra=None), a side-effect-free runtime snapshot for jobs / experiments / artefacts to carry an audit trail without DB or network probes. Returns a fresh dict[str, Any] with auto-keys protea_version (from importlib.metadata), protea_git_sha (delegates to parquet_export.resolve_protea_git_sha), python_version, platform, hostname, and captured_at (ISO-8601 UTC). Any caller-supplied extra mapping is overlaid last, so callers always win on key collisions.

Every probe is wrapped: missing distribution metadata, a non-git checkout, or an absent git binary all degrade to None rather than raising. Added in T3.11 of master plan v3.2 §24 Fase 4.

Operations

PROTEA registers a curated set of operation instances at worker startup via protea.core.operation_catalog.build_operation_registry, which is the authoritative list (read the function body for the live count). The catalog splits into job-backed entries (reachable through POST /jobs) and ephemeral consumers (dispatched internally by the compute_embeddings and predict_go_terms coordinators; see Operations for that taxonomy). Each operation is a class that implements the Operation protocol: a name string and an execute method. Operations are stateless with respect to infrastructure: they receive a session and emit structured events, but do not open connections or manage transactions. The job-backed entries are documented below; the four ephemeral siblings (compute_embeddings_batch, store_embeddings, predict_go_terms_batch, store_predictions) live in Operations.

ping: Smoke-test operation. Returns immediately with a success result. Used to verify end-to-end connectivity between the API, RabbitMQ, and worker processes.

insert_proteins: Fetches protein sequences from the UniProt REST API using cursor-based FASTA streaming. Sequences are deduplicated by MD5 hash before upsert; proteins are upserted by accession. Exponential backoff with jitter and Retry-After header handling are implemented inline in the operation. Isoforms are parsed and stored separately, sharing the canonical sequence where the amino-acid string is identical.

fetch_uniprot_metadata: Downloads TSV functional annotation data from UniProt and upserts ProteinUniProtMetadata rows keyed by canonical accession. Fields include functional description, EC numbers, pathway membership, and kinetics. Isoforms inherit metadata through the canonical_accession join; no duplicate rows are created.

load_ontology_snapshot: Downloads a GO OBO file and populates OntologySnapshot, GOTerm, and GOTermRelationship rows. The obo_version field carries a unique constraint so that re-importing the same release is idempotent. If a snapshot already exists but its relationships are missing, they are backfilled automatically.

load_goa_annotations: Bulk-loads a GAF (Gene Association Format) file. Annotations are filtered against canonical accessions present in the database, avoiding orphaned foreign keys. Each batch is committed independently to bound transaction size.

load_quickgo_annotations: Streams GO annotations from the QuickGO bulk download API (paginated TSV). Supports optional ECO→GO evidence code mapping and per-page commits. Filters out annotations whose accessions are not already in the database.

compute_embeddings: Coordinator operation that partitions the target sequence set into batches and dispatches one compute_embeddings_batch message per batch to protea.embeddings.batch. The coordinator serialises on the protea.embeddings queue (one at a time) to prevent concurrent model loads from exhausting GPU memory. Batch and write workers scale independently. Returns deferred=True; the parent job is closed by the last write worker.

predict_go_terms: Coordinator operation that loads reference embeddings into a process-level float16 cache, partitions the query set into batches, and dispatches one predict_go_terms_batch message per batch to protea.predictions.batch. Feature engineering (alignments, taxonomy) is opt-in via payload flags. Returns deferred=True; the parent job is closed by the last write worker.

generate_evaluation_set: Computes the NK/LK/PK evaluation delta between two annotation sets using the CAFA5 protocol (experimental evidence only, NOT-propagation through the GO DAG, per-namespace classification). Stores an EvaluationSet row with summary statistics. Ground-truth files are generated on-demand by the download endpoints.

run_cafa_evaluation: Runs cafaeval for NK, LK, and PK settings against a given prediction set. Downloads the OBO file, writes ground-truth and prediction TSVs, calls cafa_eval() three times (NK and LK without -known, PK with pk_known_terms.tsv as -known), and persists an EvaluationResult row with per-namespace Fmax, precision, recall, τ, and coverage.

load_interpro_go_mapping: Downloads and persists the InterPro-to-GO mapping file, linking InterPro domain entries to their associated GO terms. Used as a prerequisite step before running InterProScan-based annotation.

run_interproscan_batch: Submits sequences to the EBI InterProScan REST API in configurable batch sizes. Stores InterproAnnotation rows linking each protein to its domain signatures. Supports resume via previously stored job IDs.

predict_go_terms_from_interpro: Derives GO term predictions from stored InterproAnnotation rows and the InterproGoMapping table, without running KNN search. Produces a PredictionSet whose entries are directly evidence-backed by domain-matching rather than embedding distance.

export_research_dataset: Publishes the frozen re-ranker dataset (train.parquet / eval.parquet / manifest.json) consumed by protea-reranker-lab. Runs the KNN + feature-generation pipeline via TrainRerankerAutoOperation in dump_only mode and uploads the resulting artefacts through the configured ArtifactStore (local FS by default, MinIO when the storage compose profile is active). Manifest records PROTEA’s producer_version / producer_git_sha for full traceability from lab runs back to PROTEA HEAD.

Training-dump helpers¶

protea.core.training_dump_helpers is the home of the KNN / feature-generation helpers that were extracted in F0 (T0.6) when protea.core.operations.train_reranker was deleted. The module is deliberately not an operation: it is reused in-process by ExportResearchDatasetOperation to materialise train and eval shards before the parquet_export consolidation pass. LightGBM training itself lives in protea-reranker-lab, which consumes the published Dataset rows produced by export_research_dataset.

As of T-RES.1 (PR #368), parent_map_str is passed to the KNN transfer runner unconditionally, regardless of the expand_votes_to_ancestors flag. The lineage producer (protea_method.lineage.compute_lineage_features) requires the parent map to compute its lineage_* columns; the flag only controls ancestor-vote expansion, not the availability of the map itself.

Internal helpers¶

These modules are imported by the operations and the feature engineering layer; they are documented here for completeness but are not part of the public API.

protea.core.anc2vec_embeddings: anc2vec ancestry embeddings for GO terms, used as features by the re-ranker.
protea.core.annotation_intern: string interning helper for reducing memory pressure when loading large annotation sets.
protea.core.disk_cache: generic on-disk cache with TTL used by the KNN reference loader and the PCA cache.
protea.core.feature_enricher: orchestrator that combines alignment, taxonomy and anc2vec features into a single per-candidate row.
protea.core.pca_cache: per-PLM PCA projection cache, used to pre-compute the emb_pca feature family.
protea.core._anc2vec_phases: phase helpers for the anc2vec embedding pipeline.
protea.core._feature_enricher_helpers: low-level helpers extracted from feature_enricher to keep per-phase logic self-contained.
protea.core.features._bindings: internal feature-column binding declarations consumed by features.registry.
protea.core._knn_transfer_runner: worker-side KNN transfer runner dispatched by the predictions coordinator.
protea.core._leaf_record_builder: builder for leaf-prediction records in the KNN transfer pipeline.
protea.core._pair_feature_compute: pair-level feature computation with optional process-pool parallelism and SQLite alignment cache.
protea.core._training_dump_loaders: data-loading helpers used by the training-dump pipeline.
protea.core.training_dump._constants: shared constants for the training-dump subpackage.
protea.core.training_dump._contexts: context dataclasses bundling session, settings, and configuration for training-dump passes.
protea.core.training_dump._data_loaders: loaders that hydrate protein, embedding, and annotation data for the training-dump pipeline.
protea.core.training_dump._knn_transfer: KNN search and GO-term transfer logic specific to the training-dump path.
protea.core.training_dump._payload: Pydantic payload model for the export_research_dataset operation.
protea.core.training_dump._runner: top-level runner that coordinates the training-dump passes.
protea.core.training_dump._test_split: helper that materialises the eval/test shard of the frozen dataset.
protea.core.training_dump._train_split: helper that materialises the train shard of the frozen dataset.
protea.core.operations._compute_embeddings_backends: backend dispatch logic for the compute_embeddings coordinator.
protea.core.operations._compute_embeddings_helpers: batch construction and progress helpers for compute_embeddings.
protea.core.operations._load_ontology_helpers: OBO parsing helpers used by load_ontology_snapshot.
protea.core.operations._predict_go_terms_adapter: adapter that routes the predict_go_terms coordinator to the unified-path or batch-op implementations.
protea.core.operations.predict_go_terms._aspect_helpers: per-aspect splitting and merging helpers for the predict pipeline.
protea.core.operations.predict_go_terms._batch_op: batch operation executed by OperationConsumer workers.
protea.core.operations.predict_go_terms._batch_op_feature: feature generation phase of the batch operation.
protea.core.operations.predict_go_terms._batch_op_reference: reference-embedding loading and KNN search for batch operations.
protea.core.operations.predict_go_terms._batch_op_reranker: reranker scoring phase of the batch operation.
protea.core.operations.predict_go_terms._common: shared types and constants used across the predict-go-terms subpackage.
protea.core.operations.predict_go_terms._coordinator: coordinator that partitions the query set and dispatches batch messages.
protea.core.operations.predict_go_terms._post_knn_pipeline: post-KNN pipeline steps (feature enrichment, reranker, aggregation).
protea.core.operations.predict_go_terms._reranker_scorer: RerankerScorer composite class that applies a loaded booster to batch prediction DataFrames.
protea.core.operations.predict_go_terms._store: bulk-insert helpers for writing GOPrediction rows.
protea.core.operations.predict_go_terms._unified_path: unified in-process execution path used when a single worker handles both KNN and write phases.
protea.core.operations._run_cafa_artifacts: artifact download and staging helpers for run_cafa_evaluation.
protea.core.operations._run_cafa_data_helpers: data-loading helpers for the CAFA evaluation pipeline.
protea.core.operations._run_cafa_eval_driver: driver that calls cafaeval.cafa_eval and collects per-namespace results.
protea.core.operations._run_cafa_helpers: shared helper functions used across the _run_cafa_* modules.
protea.core.operations._run_cafa_reranker_loader: loads and applies the reranker model to CAFA prediction TSVs.
protea.core.operations._run_cafa_setup: environment and directory setup for a CAFA evaluation run.
protea.core.operations._run_cafa_summary: summarises cafaeval results into EvaluationResult rows.

Process-local string interning for annotation hot loops.

The predict / train pipelines build millions of per-row dicts of the shape {"go_term_id": int, "qualifier": str|None, "evidence_code": str|None} when they materialise annotations from PostgreSQL. SQLAlchemy returns each qualifier and evidence_code as a fresh Python string — even though across a million rows there are only ~5-10 distinct values ("IEA", "IDA", "EXP", "TAS", …) and most rows have qualifier = None.

Without interning, each duplicate string allocates ~50 B in CPython, so a 5 M-row batch can carry ~500 MB of redundant string objects. Interning collapses every duplicate to a single shared instance, a Flyweight-style intrinsic-state share. Python already does this implicitly for short identifier-like literals; this module forces the same dedup for the strings that come back from the DB driver.

Process-local by design (one cache per worker), trivially small (the domain has fewer than 50 distinct codes/qualifiers in practice), and thread-safe via the GIL — setdefault is atomic for built-in dicts.

protea.core.annotation_intern.intern_string(value: str | None) → str | None¶

Return a shared instance of value: the pooled copy if the string has been seen before, otherwise value itself after it is registered in the pool.

Pass None through unchanged so callers don’t need a guard. The pool grows monotonically; reset only when the worker restarts.

protea.core.annotation_intern.pool_size() → int¶: Diagnostic — how many distinct strings the pool has retained.
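
The setdefault-based pooling described above can be sketched in a few lines. This is a minimal illustration rather than the module's actual source: the module-level _POOL dict is an assumed name, and only the two function signatures are taken from the reference entries above.

```python
from typing import Dict, Optional

# Assumed name: one module-level pool per worker process.
_POOL: Dict[str, str] = {}


def intern_string(value: Optional[str]) -> Optional[str]:
    """Return the pooled instance of ``value``; ``None`` passes through."""
    if value is None:
        return None
    # dict.setdefault stores ``value`` on first sight and returns the
    # previously stored object on every later call with an equal string.
    return _POOL.setdefault(value, value)


def pool_size() -> int:
    """Number of distinct strings retained so far."""
    return len(_POOL)


# Two equal strings built at runtime collapse to one shared object.
first = intern_string("".join(["I", "EA"]))
second = intern_string("".join(["IE", "A"]))
```

After these calls, first and second refer to the same object, so a million rows carrying "IEA" would hold a million references to one string instead of a million separate allocations.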

On-disk caches for the predict_go_terms pipeline.

The reference embedding pool for an (EmbeddingConfig, AnnotationSet) pair is large (millions of rows × ~1280 dims × 2 bytes = several GB). Re-fetching it from PostgreSQL on every batch worker is the main bottleneck, so the pool is materialised once into data/ref_cache/ and read back via numpy mmap on subsequent runs. Annotations are stored in a CSR-style layout (offsets + flat go_term_ids / qualifiers / evidence_codes arrays) so per-protein lookups stay O(1) without the memory overhead of a dictionary-of-list-of-dict structure.

Files (under _DISK_CACHE_DIR):

{cfg}__{ann}_embeddings.npy — float16 reference embeddings
{cfg}__{ann}_accessions.npy — aligned accession list
{cfg}__{ann}__{aspect}_indices.npy — int32 indices into the unified pool
{cfg}__{ann}__{aspect}_anno_*.npy — CSR annotation arrays

Invalidation: AnnotationSet rows are immutable once loaded, so the cache stays valid for as long as the files exist. Delete the files manually to force a reload after a model change.
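
The CSR-style layout mentioned above can be illustrated with a small sketch. It uses plain Python lists to stay dependency-free, whereas the cache itself stores numpy arrays; the helper names build_csr and terms_for are hypothetical and do not come from the module.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def build_csr(per_protein: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Flatten per-protein GO term ids into (offsets, flat) arrays."""
    # offsets[i] .. offsets[i + 1] delimits the slice owned by protein i.
    offsets = [0, *accumulate(len(terms) for terms in per_protein)]
    flat = [term for terms in per_protein for term in terms]
    return offsets, flat


def terms_for(offsets: List[int], flat: List[int], row: int) -> List[int]:
    """Look up one protein's terms with two index reads and one slice."""
    return flat[offsets[row]:offsets[row + 1]]


# Three proteins: two terms, no terms, one term.
offsets, flat = build_csr([[11, 12], [], [13]])
```

Finding a protein's slice bounds costs two array reads regardless of pool size, and a protein without annotations is represented by two equal consecutive offsets rather than an empty container object.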

Per-group helpers extracted from expand_predictions_to_ancestors.

The ancestor-expansion loop in feature_enricher mutates either an existing leaf record (when an ancestor coincides with a sibling leaf prediction) or a synthetic ancestor record (cloned from the leaf with go_id overridden). Both branches were inlined in the orchestrator, pushing it well past the master plan v3.2 §3 method-LOC ceiling. This module hosts the two branch helpers + the small label-config bundle so the orchestrator can stay readable and short.

class protea.core._feature_enricher_helpers.LabelConfig(q_acc: str, gt_pairs: set[tuple[str, str]] | None, column: str, present: bool)¶

Bases: NamedTuple

Bundle for the per-query label-injection knobs.

Carries the query accession (used as half of the (q_acc, anc) ground-truth lookup) plus the optional gt_pairs set and the column / presence flags that drive whether a synthesized ancestor record should carry a label at all. The training dump path passes a populated set; the live predict_go_terms path leaves gt_pairs as None and present as False.

column: str¶: Alias for field number 2

gt_pairs: set[tuple[str, str]] | None¶: Alias for field number 1

present: bool¶: Alias for field number 3

q_acc: str¶: Alias for field number 0

protea.core._feature_enricher_helpers.compute_ia_weight(anc_gid: str, leaf_gid: str, ia_weights: dict[str, float] | None) → float¶

Weight contributed by leaf_gid to its ancestor anc_gid.

Without IA weights every edge counts as 1.0. When weights are provided, the contribution is the IA ratio anc/leaf (with a guard against zero-weight leaves to avoid division blow-ups when an incomplete IA file leaves a term at IA=0).
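
A minimal sketch of that rule follows. Two details are assumptions and are not stated in the description above: the zero-weight guard is shown falling back to 1.0, and a term missing from the IA map is treated as IA=0. The function name ia_edge_weight_sketch is hypothetical.

```python
from typing import Dict, Optional


def ia_edge_weight_sketch(
    anc_gid: str,
    leaf_gid: str,
    ia_weights: Optional[Dict[str, float]],
) -> float:
    """Weight a leaf contributes to one of its ancestors."""
    if ia_weights is None:
        return 1.0  # no IA file: every edge counts equally
    leaf_ia = ia_weights.get(leaf_gid, 0.0)  # assumption: missing means 0
    if leaf_ia <= 0.0:
        return 1.0  # assumption: guard value for zero-weight leaves
    return ia_weights.get(anc_gid, 0.0) / leaf_ia


weights = {"GO:anc": 2.0, "GO:leaf": 4.0, "GO:zero": 0.0}
ratio = ia_edge_weight_sketch("GO:anc", "GO:leaf", weights)
guarded = ia_edge_weight_sketch("GO:anc", "GO:zero", weights)
unweighted = ia_edge_weight_sketch("GO:anc", "GO:leaf", None)
```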

protea.core._feature_enricher_helpers.make_ancestor_closure(parent_map: dict[str, set[str]] | dict[str, list[str]] | None) → Callable[[str], frozenset[str]]¶

Return a memoized go_id -> ancestor closure callable.

The orchestrator calls the result repeatedly for every leaf GO id; caching is_a / part_of closures here keeps the inner loop O(1) after the first visit. parent_map is normalised to frozenset parents up front so subsequent traversal cannot accidentally mutate the caller’s dict.
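
The memoised closure can be sketched as below. This is an illustration under stated assumptions, not the helper's actual body: the closure is taken to exclude the term itself, traversal is an iterative depth-first walk, and the name make_ancestor_closure_sketch is hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Mapping, Optional


def make_ancestor_closure_sketch(
    parent_map: Optional[Mapping[str, Iterable[str]]],
) -> Callable[[str], FrozenSet[str]]:
    # Copy parents into frozensets so the caller's dict is never mutated.
    parents: Dict[str, FrozenSet[str]] = {
        go_id: frozenset(ps) for go_id, ps in (parent_map or {}).items()
    }
    cache: Dict[str, FrozenSet[str]] = {}

    def closure(go_id: str) -> FrozenSet[str]:
        cached = cache.get(go_id)
        if cached is not None:
            return cached  # O(1) after the first visit
        seen = set()
        stack = list(parents.get(go_id, ()))
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(parents.get(node, ()))
        result = frozenset(seen)
        cache[go_id] = result
        return result

    return closure


# Accepts both list-valued and set-valued parent maps.
ancestors_of = make_ancestor_closure_sketch({"C": ["B"], "B": {"A"}})
```

Returning frozensets lets the cached result be handed out directly, since no caller can modify it in place.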

protea.core._feature_enricher_helpers.merge_into_existing_leaf(leaf_anc: dict[str, Any], leaf_rec: dict[str, Any], vote_increment: float, leaf_d: float) → None¶

Fold a leaf-as-ancestor vote into the existing leaf candidate.

Called when the ancestor of a leaf prediction is itself already a leaf candidate for the same (protein, aspect) group. Bumps the target’s neighbor_vote_fraction and lowers its neighbor_min_distance if the contributing leaf is closer.
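
The two updates can be shown on a plain dict. The sketch below is simplified to a single target record and is not the helper's actual signature; only the neighbor_vote_fraction and neighbor_min_distance keys come from the description above, and merge_vote_sketch is a hypothetical name.

```python
from typing import Any, Dict


def merge_vote_sketch(target: Dict[str, Any], vote_increment: float, leaf_d: float) -> None:
    """Fold one contributing leaf into an existing candidate, in place."""
    target["neighbor_vote_fraction"] = (
        target.get("neighbor_vote_fraction", 0.0) + vote_increment
    )
    # Keep the smallest distance seen across all contributing leaves.
    if leaf_d < target.get("neighbor_min_distance", float("inf")):
        target["neighbor_min_distance"] = leaf_d


candidate = {"neighbor_vote_fraction": 0.25, "neighbor_min_distance": 0.5}
merge_vote_sketch(candidate, vote_increment=0.25, leaf_d=0.75)  # farther leaf
merge_vote_sketch(candidate, vote_increment=0.5, leaf_d=0.125)  # closer leaf
```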

protea.core._feature_enricher_helpers.update_synth_entry(synth: dict[str, dict[str, Any]], anc: str, leaf_rec: dict[str, Any], leaf_d: float, vote_increment: float, label_ctx: LabelConfig) → None¶

Insert or update a synthetic ancestor record.

When the ancestor is not already a leaf candidate the record is fabricated from the closest contributing leaf. If a synthetic entry already exists, the closer leaf wins the per-pair feature payload (alignment, taxonomy, anc2vec, emb_pca) so the booster sees a consistent training-time view; otherwise only the vote fraction is bumped.
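
The insert-or-update behaviour can be sketched as follows. Label injection via label_ctx is omitted, the initial vote fraction of a new entry is assumed to equal the first increment, and the identity key stands in for the per-pair feature payload; none of these details are confirmed by the description above.

```python
from typing import Any, Dict


def update_synth_sketch(
    synth: Dict[str, Dict[str, Any]],
    anc: str,
    leaf_rec: Dict[str, Any],
    leaf_d: float,
    vote_increment: float,
) -> None:
    entry = synth.get(anc)
    if entry is None:
        # First contributor: clone the leaf record and override go_id.
        entry = dict(leaf_rec)
        entry["go_id"] = anc
        entry["neighbor_vote_fraction"] = vote_increment  # assumption
        entry["neighbor_min_distance"] = leaf_d
        synth[anc] = entry
        return
    votes = entry["neighbor_vote_fraction"] + vote_increment
    if leaf_d < entry["neighbor_min_distance"]:
        # The closer leaf wins the whole per-pair feature payload.
        entry.clear()
        entry.update(leaf_rec)
        entry["go_id"] = anc
        entry["neighbor_min_distance"] = leaf_d
    entry["neighbor_vote_fraction"] = votes


synth: Dict[str, Dict[str, Any]] = {}
update_synth_sketch(synth, "GO:anc", {"go_id": "GO:l1", "identity": 0.9}, 0.5, 0.25)
update_synth_sketch(synth, "GO:anc", {"go_id": "GO:l2", "identity": 0.4}, 0.75, 0.25)
after_farther = dict(synth["GO:anc"])
update_synth_sketch(synth, "GO:anc", {"go_id": "GO:l3", "identity": 0.7}, 0.125, 0.5)
after_closer = dict(synth["GO:anc"])
```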

Parallel + persistent computation of per-pair alignment features.

The export pipeline’s hotspot is the per-(query, reference) alignment in _KnnTransferRunner._compute_pair_features: a single-threaded triple loop calling compute_alignment (parasail NW+SW with traceback) at a few hundred pairs/s. Two structural wins live here:

Process-level parallelism over the unique alignment pairs. parasail’s traceback variants do not release the GIL well enough for threads to scale, so a ProcessPoolExecutor is used (benchmarked ~3x on a 12-core box). The taxonomy lookups stay in the parent process because they are cheap (an lru_cache over ete3 lineages) and process re-init of the ete3 sqlite handle would dwarf the work.
A persistent on-disk cache keyed by the ordered sequence pair plus the alignment parameters. Intra-PLM the K=10 neighbour set is a superset of K=5 / K=3, so alignments computed once for the largest-K dataset make the smaller-K datasets (and any re-run) near-free. protea.training is serialised (one job at a time) so there is no concurrent-write contention; a plain sqlite file is enough.

This module is value-preserving by construction: the alignment feature dict it returns is exactly what compute_alignment returns, only computed concurrently and memoised. Disabling parallelism (workers <= 1) and the cache (PROTEA_ALIGN_CACHE_DIR unset / cache-miss) reduces to the original serial code path.
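
The persistent cache idea can be sketched with the standard library alone. The table layout, the SHA-256 key derivation, the JSON value encoding and the gap_open parameter are all assumptions made for illustration; fake_align stands in for compute_alignment, and the process pool is left out so the example stays self-contained.

```python
import hashlib
import json
import sqlite3
from typing import Any, Callable, Dict, List, Tuple


def pair_key(q_seq: str, r_seq: str, params: Dict[str, Any]) -> str:
    """Key on the ordered pair plus parameters: (q, r) differs from (r, q)."""
    payload = json.dumps([q_seq, r_seq, params], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class AlignmentCacheSketch:
    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS align (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def get_or_compute(
        self,
        q_seq: str,
        r_seq: str,
        params: Dict[str, Any],
        align_fn: Callable[[str, str], Dict[str, Any]],
    ) -> Dict[str, Any]:
        key = pair_key(q_seq, r_seq, params)
        row = self._db.execute("SELECT value FROM align WHERE key = ?", (key,)).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no alignment work
        result = align_fn(q_seq, r_seq)
        self._db.execute(
            "INSERT OR REPLACE INTO align (key, value) VALUES (?, ?)",
            (key, json.dumps(result)),
        )
        self._db.commit()
        return result

    def close(self) -> None:
        self._db.close()


calls: List[Tuple[str, str]] = []


def fake_align(q_seq: str, r_seq: str) -> Dict[str, Any]:
    calls.append((q_seq, r_seq))  # record every real computation
    return {"length": min(len(q_seq), len(r_seq))}


cache = AlignmentCacheSketch()
params = {"gap_open": 10}  # hypothetical alignment parameter
miss = cache.get_or_compute("MKT", "MKVL", params, fake_align)
hit = cache.get_or_compute("MKT", "MKVL", params, fake_align)
swapped = cache.get_or_compute("MKVL", "MKT", params, fake_align)
cache.close()
```

Because the key includes the parameters, changing an alignment setting produces a new key instead of returning a stale entry.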

class protea.core._pair_feature_compute.PairAlignmentComputer¶

Bases: object

Compute alignment-feature dicts for many pairs, cached + parallel.

Use as a context manager so the sqlite handle and process pool are released deterministically.

compute(pairs: list[tuple[str, str]]) → dict[tuple[str, str], dict[str, Any]]¶

Return {(q_seq, r_seq): alignment_dict} for every input pair.

Empty-sequence pairs are skipped by the caller; here every pair is assumed to have two non-empty sequences. Deduplicates on the ordered sequence pair so identical pairs are aligned once.

protea.core._pair_feature_compute.build_pair_feature_dict(q_acc: str, ref_acc: str, align_by_pair: dict[tuple[str, str], dict[str, Any]], *, do_alignments: bool, do_taxonomy: bool, tax_ids: tuple[dict[str, Any] | None, dict[str, Any] | None]) → dict[str, Any]¶

Merge the precomputed alignment + per-pair taxonomy for one pair.

Value-identical to the original inline compute_alignment + compute_taxonomy merge: the alignment half is looked up from the batch-computed map, the taxonomy half is computed here (cheap, lru_cache-backed). Either family is skipped when its flag is off.

protea.core._pair_feature_compute.precompute_alignment_features(*, aspects: Sequence[str], neighbors_by_aspect: dict[str, list[list[tuple[str, float]]]], valid_queries: Sequence[str], query_sequences: dict[str, str], ref_sequences: dict[str, str]) → dict[tuple[str, str], dict[str, Any]]¶

Batch-align every unique (q_acc, ref_acc) neighbour pair.

Drives PairAlignmentComputer (parallel + on-disk cache) over the deduplicated pair set and remaps the result back onto the (q_acc, ref_acc) accession keys the runner indexes by. Returns an empty map when there are no alignable pairs.
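
The deduplicate-then-remap step can be sketched as below. The input is simplified to a flat list of accession pairs instead of the per-aspect neighbour structure, and align_fn is a stand-in for the cached, parallel computer; both simplifications are assumptions for illustration.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def precompute_sketch(
    neighbour_pairs: Iterable[Pair],
    query_sequences: Dict[str, str],
    ref_sequences: Dict[str, str],
    align_fn: Callable[[str, str], Dict[str, Any]],
) -> Dict[Pair, Dict[str, Any]]:
    # Map each accession pair to its sequence pair, dropping unalignable ones.
    seq_by_acc: Dict[Pair, Pair] = {}
    for q_acc, r_acc in neighbour_pairs:
        q_seq = query_sequences.get(q_acc)
        r_seq = ref_sequences.get(r_acc)
        if q_seq and r_seq:
            seq_by_acc[(q_acc, r_acc)] = (q_seq, r_seq)
    # Align each distinct ordered sequence pair exactly once.
    unique = dict.fromkeys(seq_by_acc.values())
    aligned = {seq_pair: align_fn(*seq_pair) for seq_pair in unique}
    # Remap results onto the accession keys the runner indexes by.
    return {acc_pair: aligned[seq_pair] for acc_pair, seq_pair in seq_by_acc.items()}


align_calls: List[Pair] = []


def counting_align(q_seq: str, r_seq: str) -> Dict[str, Any]:
    align_calls.append((q_seq, r_seq))
    return {"q_len": len(q_seq), "r_len": len(r_seq)}


features = precompute_sketch(
    [("Q1", "R1"), ("Q2", "R1"), ("Q1", "R2")],
    query_sequences={"Q1": "MKT", "Q2": "MKT"},  # identical sequences
    ref_sequences={"R1": "MKVL", "R2": ""},  # R2 has no sequence
    align_fn=counting_align,
)
```

Here Q1 and Q2 share a sequence, so both accession pairs resolve to a single alignment, and the pair involving the empty R2 sequence is dropped.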

protea.core._pair_feature_compute.resolve_workers() → int¶

Number of process-pool workers for pair-feature computation.

PROTEA_PAIR_FEATURE_WORKERS overrides; default is the CPU count. A value <= 1 forces the serial path (no pool spin-up).
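
A minimal sketch of this resolution logic is shown below. The environ parameter is added here only to make the example testable, and falling back to the CPU count on an unparsable value is an assumption; the documented function takes no arguments.

```python
import os
from typing import Mapping, Optional


def resolve_workers_sketch(environ: Optional[Mapping[str, str]] = None) -> int:
    env = os.environ if environ is None else environ
    raw = env.get("PROTEA_PAIR_FEATURE_WORKERS")
    if raw is not None:
        try:
            return int(raw)
        except ValueError:
            pass  # assumption: ignore unparsable overrides
    # os.cpu_count() may return None on unusual platforms.
    return os.cpu_count() or 1


def use_process_pool(workers: int) -> bool:
    """A value <= 1 selects the serial path, so no pool is created."""
    return workers > 1


overridden = resolve_workers_sketch({"PROTEA_PAIR_FEATURE_WORKERS": "4"})
serial = resolve_workers_sketch({"PROTEA_PAIR_FEATURE_WORKERS": "1"})
default = resolve_workers_sketch({})
```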

Shared constants for the dump pipeline submodules.

Kept here so the split helpers, the test-split helpers, and the runner can share them without forming an import cycle (T2B.6).

Parameter objects for the dump pipeline.

Originally lived in protea/core/training_dump_helpers.py. Extracted to a leaf submodule (T2B.6) so the type-only consumers in protea/core/_knn_transfer_runner.py can import them without dragging the larger orchestration code into the import graph.

class protea.core.training_dump._contexts.KnnTransferContext(valid_queries: list[str], query_emb: np.ndarray, ref_by_aspect: dict[str, dict[str, Any]], go_id_map: dict[int, str], aspect_map: dict[int, str], gt_pairs: set[tuple[str, str]], query_known_gos: dict[str, set[str]] | None = None, parent_map_str: dict[str, set[str]] | None = None, ia_weights: dict[str, float] | None = None, pca_state: tuple[np.ndarray, np.ndarray] | None = None, pivot_go_ids: set[str] | frozenset[str] | None = None, embedding_pool: np.ndarray | None = None)

Bases: object

Bundle of KNN inputs + enrichment maps for _knn_transfer_and_label.

Groups the 12 per-call data arguments (queries, references, ontology maps, optional enrichment helpers) so the entry-point signature stays under flake8-bugbear’s parameter ceiling. session, payload p, sequence_context, and stream_output remain standalone arguments because they are configuration / IO concerns, not data.

aspect_map: dict[int, str]

embedding_pool: np.ndarray | None = None

go_id_map: dict[int, str]

gt_pairs: set[tuple[str, str]]

ia_weights: dict[str, float] | None = None

parent_map_str: dict[str, set[str]] | None = None

pca_state: tuple[np.ndarray, np.ndarray] | None = None

pivot_go_ids: set[str] | frozenset[str] | None = None

query_emb: np.ndarray

query_known_gos: dict[str, set[str]] | None = None

ref_by_aspect: dict[str, dict[str, Any]]

valid_queries: list[str]

Bases: object

Per-protein sequence and taxonomy lookups.

All four attributes are optional; passing None disables the corresponding feature family (alignment / taxonomy).

query_sequences: dict[str, str] | None = None

query_tax_ids: dict[str, int | None] | None = None

ref_sequences: dict[str, str] | None = None

ref_tax_ids: dict[str, int | None] | None = None

class protea.core.training_dump._contexts.StreamOutput(output_parquet: Path, chunk_rows: int = 100000)

Bases: object

Streaming parquet output for memory-bounded dataset generation.

When provided, _knn_transfer_and_label writes labeled rows directly to output_parquet in chunks of chunk_rows instead of accumulating the full result list in memory.

chunk_rows: int = 100000

output_parquet: Path
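
The chunked-write pattern behind StreamOutput can be sketched without a parquet dependency. In the sketch below a flush callback stands in for the parquet writer; the class name ChunkedSinkSketch and its methods are hypothetical and only illustrate how memory stays bounded by chunk_rows.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class ChunkedSinkSketch:
    """Buffer rows and hand them to ``flush_fn`` every ``chunk_rows`` rows."""

    def __init__(self, flush_fn: Callable[[List[Row]], None], chunk_rows: int = 100_000) -> None:
        self._flush_fn = flush_fn
        self._chunk_rows = chunk_rows
        self._buffer: List[Row] = []

    def add(self, row: Row) -> None:
        self._buffer.append(row)
        if len(self._buffer) >= self._chunk_rows:
            self.flush()

    def flush(self) -> None:
        # Write whatever is buffered, then start a fresh buffer.
        if self._buffer:
            self._flush_fn(self._buffer)
            self._buffer = []


chunk_sizes: List[int] = []
sink = ChunkedSinkSketch(lambda rows: chunk_sizes.append(len(rows)), chunk_rows=3)
for i in range(7):
    sink.add({"row": i})
sink.flush()  # write the final partial chunk
```

At most chunk_rows rows are held in memory at any time, and the explicit final flush covers the trailing partial chunk.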