Operations¶

An Operation is the fundamental unit of domain logic in PROTEA. Every task that a worker can execute (from a health-check ping to a full UniProt ingest) is encapsulated in a class that satisfies the Operation protocol.

The Operation protocol ¶

class Operation(Protocol):
    name: str

    def execute(
        self,
        session: Session,
        payload: dict[str, Any],
        *,
        emit: EmitFn,
    ) -> OperationResult: ...

name: A stable string identifier used to route jobs. Must be unique across all registered operations and must match the operation field in the Job row.
execute: Receives an open SQLAlchemy session, a raw dict payload (validated internally), and an emit callback. Returns an OperationResult. Must not manage sessions, queue connections, or threads.
EmitFn: Type alias for Callable[[str, str | None, dict[str, Any], Level], None]. Calling emit(event, message, fields, level) writes a JobEvent row in real time, visible on the frontend timeline.
OperationResult: Frozen dataclass with four fields: result (stored in Job.meta), optional progress_current / progress_total written back to the Job row for the progress bar, and deferred (bool) which tells BaseWorker that job completion will be signalled by child workers rather than immediately.

Payload validation ¶

Every operation defines a payload class that extends ProteaPayload:

class ProteaPayload(BaseModel, frozen=True):
    model_config = ConfigDict(strict=True)

ProteaPayload is an immutable, strictly-typed Pydantic (2.x) base. Strict mode prevents silent coercions ("yes" is not a valid bool). Each operation calls MyPayload.model_validate(payload) at the top of execute(); validation errors surface as FAILED jobs with a clear error message, before any DB writes occur.

Shared HTTP / retry behaviour ¶

Both insert_proteins and fetch_uniprot_metadata implement an identical resilience strategy against the UniProt REST API:

Cursor-based pagination. The link response header carries the next cursor token. Iteration stops when no rel="next" link is present.
Exponential backoff with jitter. On retriable errors (429, 5xx, network exceptions), the wait time is min(base × 2^(attempt-1), max) + uniform(0, jitter).
Retry-After header. If UniProt returns a 429 with a Retry-After header, that duration (capped at backoff_max_seconds) is used directly.
max_retries. After this many attempts the exception is re-raised, transitioning the job to FAILED.

Every retry is logged via emit("http.retry", ...) so the frontend timeline always shows when and why a delay occurred.

insert_proteins ¶

Operation name: insert_proteins; queue: protea.jobs

How to invoke this: see Submit a job via the API in How-to Guides.

Tables touched: Protein and Sequence (see Data Model).

Fetches protein sequences from the UniProt REST API in FASTA format and upserts them into the protein and sequence tables.

Payload fields ¶

Field	Default	Description
`search_criteria`	(required)	Raw UniProt query string. Example: `reviewed:true AND organism_id:9606`
`page_size`	`500`	Results per page (1 – ∞). Larger values reduce round-trips.
`total_limit`	`null`	Stop after this many records (useful for testing).
`timeout_seconds`	`60`	HTTP request timeout per page.
`include_isoforms`	`true`	Append `includeIsoform=true` to the UniProt query.
`compressed`	`false`	Request gzip-compressed responses.
`max_retries`	`6`	Maximum retry attempts per page before raising.
`backoff_base_seconds`	`0.8`	Exponential backoff base (seconds).
`backoff_max_seconds`	`20.0`	Maximum wait between retries (seconds).
`jitter_seconds`	`0.4`	Random jitter added to each backoff wait.
`user_agent`	PROTEA/insert_proteins …	`User-Agent` header sent to UniProt.

Execution flow ¶

1. validate payload (Pydantic)
2. for each FASTA page from UniProt:
   a. parse FASTA → list of records (accession, sequence, metadata)
   b. compute MD5 hash per sequence
   c. bulk-load existing Sequence rows by hash (chunks of 5 000)
   d. INSERT missing Sequence rows → obtain IDs
   e. bulk-load existing Protein rows by accession (chunks of 5 000)
   f. INSERT new Protein rows / conservative UPDATE existing rows
   g. session.flush(): no commit per page (commit on job success)
   h. emit("insert_proteins.page_done", ...)
3. if total_limit reached → emit warning + break
4. emit("insert_proteins.done", ...)
5. return OperationResult(result={counts and timing})

Sequence deduplication ¶

Sequence rows are keyed by sequence_hash (MD5 of the amino-acid string). Many Protein rows can point to the same Sequence row; sequence_id is deliberately non-unique. This eliminates redundant storage for identical sequences across species or isoforms.

Conservative protein update ¶

For proteins that already exist, insert_proteins applies a fill-in-blanks policy: it only overwrites a field if the current DB value is None or empty. Existing non-null values are never overwritten. This prevents a re-ingestion from degrading data that was enriched by a later step.

Isoform handling ¶

Accessions of the form <canonical>-<n> are parsed by Protein.parse_isoform(). Both the isoform accession and the canonical accession are stored. is_canonical = False for isoforms; isoform_index stores the numeric suffix.

fetch_uniprot_metadata ¶

Operation name: fetch_uniprot_metadata; queue: protea.jobs

How to invoke this: see Fetch UniProt metadata for existing proteins in How-to Guides.

Tables touched: ProteinUniProtMetadata (see Data Model).

Fetches functional annotations from the UniProt REST API in TSV format and upserts ProteinUniProtMetadata rows, one per canonical accession.

Payload fields ¶

Field	Default	Description
`search_criteria`	(required)	Raw UniProt query string.
`page_size`	`500`	Results per TSV page.
`total_limit`	`null`	Stop after this many rows.
`timeout_seconds`	`60`	HTTP timeout per page.
`compressed`	`true`	Request gzip-compressed TSV.
`max_retries`	`6`	Maximum retry attempts.
`backoff_base_seconds`	`0.8`	Backoff base (seconds).
`backoff_max_seconds`	`20.0`	Maximum wait (seconds).
`jitter_seconds`	`0.4`	Jitter added to backoff.
`commit_every_page`	`true`	Commit after each page (reduces memory pressure on large ingests).
`update_protein_core`	`true`	Backfill `reviewed`, `organism`, `gene_name`, `length` on existing `Protein` rows if those fields are currently `null`.
`user_agent`	PROTEA/fetch_uniprot_metadata …	`User-Agent` header.

TSV field mapping ¶

The operation requests 25 TSV fields from UniProt and maps them to ProteinUniProtMetadata columns:

DB column	UniProt TSV header
`function_cc`	Function [CC]
`catalytic_activity`	Catalytic activity
`ec_number`	EC number
`pathway`	Pathway
`kinetics`	Kinetics
`absorption`	Absorption
`active_site`	Active site
`binding_site`	Binding site
`cofactor`	Cofactor
`dna_binding`	DNA binding
`activity_regulation`	Activity regulation
`ph_dependence`	pH dependence
`redox_potential`	Redox potential
`rhea_id`	Rhea ID
`site`	Site
`temperature_dependence`	Temperature dependence
`keywords`	Keywords
`features`	Features

Isoform scoping ¶

Because ProteinUniProtMetadata is keyed by canonical_accession, all isoforms of a protein share a single metadata record. The canonical accession is resolved via Protein.parse_isoform(accession)[0] for each row in the TSV response.

load_ontology_snapshot ¶

Operation name: load_ontology_snapshot; queue: protea.jobs

How to invoke this: see Load a GO ontology snapshot in How-to Guides.

Tables touched: OntologySnapshot, GOTerm, GOTermRelationship (see GO ontology in Data Model).

Downloads a GO OBO file and populates OntologySnapshot + GOTerm + GOTermRelationship rows. Idempotent: if a snapshot with the same obo_version already exists and its relationships are present, the operation returns immediately without writing anything.

Payload fields ¶

Field	Default	Description
`obo_url`	(required)	Direct HTTP(S) URL to the `.obo` file (e.g. the EBI GO release).
`timeout_seconds`	`120`	HTTP download timeout in seconds.
`force_relationships`	`false`	Re-insert relationships even if the snapshot already exists with relationships.

Execution flow ¶

validate payload
download OBO text (HTTP GET, single request)
extract ``data-version`` header → obo_version
check DB for existing OntologySnapshot with that obo_version
   a. exists + has relationships → skip (idempotent)
   b. exists + no relationships → backfill relationships only
   c. does not exist → full insert
parse OBO stanzas → GOTerm rows (aspect mapped from namespace)
session.add_all(go_terms)
parse relationship edges → GOTermRelationship rows
session.add_all(relationships)
emit("load_ontology_snapshot.done", ...)
return OperationResult(result={snapshot_id, term_count, rel_count})

load_goa_annotations ¶

Operation name: load_goa_annotations; queue: protea.jobs

How to invoke this: see Load GOA annotations in How-to Guides.

Tables touched: AnnotationSet, ProteinGOAnnotation (see Annotation sets in Data Model).

Streams a GOA GAF 2.2 file (plain or gzip) and bulk-inserts AnnotationSet + ProteinGOAnnotation rows. Only accessions already present in the protein table are retained; all others are silently skipped.

Payload fields ¶

Field	Default	Description
`ontology_snapshot_id`	(required)	UUID of the `OntologySnapshot` row these annotations belong to.
`gaf_url`	(required)	HTTP(S) URL to the GAF file (plain or `.gz`).
`source_version`	(required)	Human-readable version label stored in `AnnotationSet.source_version`.
`page_size`	`10000`	Lines buffered per commit cycle.
`timeout_seconds`	`300`	HTTP stream timeout.
`commit_every_page`	`true`	Commit after each page to bound memory use (recommended for large GAFs).
`total_limit`	`null`	Stop after this many annotation rows (for testing).

Execution flow ¶

1. validate payload; resolve OntologySnapshot and canonical accession set
2. create AnnotationSet row (source="goa")
3. stream GAF lines:
   a. skip comment lines (starting with "!")
   b. parse 15-column tab-separated record
   c. filter against canonical accessions; skip unknown
   d. resolve go_term_id from go_id; skip if term unknown in snapshot
   e. buffer ProteinGOAnnotation rows
   f. flush + commit every page_size rows
4. emit("load_goa_annotations.done", ...)
5. return OperationResult(result={annotation_set_id, inserted, skipped})

load_quickgo_annotations ¶

Operation name: load_quickgo_annotations; queue: protea.jobs

How to invoke this: see Load QuickGO annotations in How-to Guides.

Tables touched: AnnotationSet, ProteinGOAnnotation (see Annotation sets in Data Model).

Streams GO annotations from the QuickGO bulk download TSV API. Proteins are determined by the canonical accessions already in the DB; no external accession list is needed. Supports optional ECO ID → evidence code mapping, taxon filtering, and aspect filtering.

Payload fields ¶

Field	Default	Description
`ontology_snapshot_id`	(required)	UUID of the `OntologySnapshot` row.
`source_version`	(required)	Version label for `AnnotationSet.source_version`.
`quickgo_base_url`	EBI QuickGO	Base URL for the QuickGO download endpoint.
`gene_product_ids`	`null`	Explicit accession filter; `null` = use DB accessions.
`use_db_accessions`	`true`	Pull the accession filter from the `protein` table.
`eco_mapping_url`	`null`	URL to a GAF-ECO mapping file for evidence code resolution.
`page_size`	`10000`	Rows buffered per commit.
`timeout_seconds`	`300`	HTTP stream timeout per QuickGO request, in seconds.
`commit_every_page`	`true`	Commit after each page.
`total_limit`	`null`	Row cap (for testing).
`gene_product_batch_size`	`200`	Accessions per QuickGO API request when using `use_db_accessions`.

compute_embeddings ¶

Operation name: compute_embeddings; queue: protea.embeddings (coordinator, serialised; one at a time via RetryLaterError if GPU busy)

How to invoke this: see Compute sequence embeddings in How-to Guides.
Tables touched: EmbeddingConfig, SequenceEmbedding (see Embeddings in Data Model).
Lifecycle pattern: parent-child coordinator with deferred completion (see Job Lifecycle).

Coordinator operation: determines which sequences need embeddings and fans out ComputeEmbeddingsBatchOperation messages to protea.embeddings.batch. Does not run GPU inference directly; returns OperationResult(deferred=True) immediately after publishing batch messages.

Payload fields ¶

Field	Default	Description
`embedding_config_id`	(required)	UUID of the `EmbeddingConfig` that defines model, layers, pooling.
`accessions`	`null`	List of UniProt accessions to embed; `null` = embed all proteins.
`query_set_id`	`null`	UUID of a `QuerySet` (alternative to `accessions`).
`sequences_per_job`	`64`	Sequences per batch message. Tune to GPU memory.
`device`	`"cuda"`	Device for batch workers (`"cuda"` or `"cpu"`).
`skip_existing`	`true`	Skip sequences that already have an embedding for this config.
`batch_size`	`1`	Model forward-pass batch size inside each batch worker. Default of `1` is deliberate: the largest supported backend (`prot_t5_xl_uniref50` at `max_length=2048`) OOMs on a 12 GB GPU with anything higher. Callers running smaller models on roomier GPUs can raise this explicitly.

Execution flow ¶

resolve embedding config; raise RetryLaterError(delay=60s) if another
   embedding job is RUNNING (GPU exclusive lock)
resolve sequence IDs from accessions or query_set_id
if skip_existing: filter out sequence IDs that already have embeddings
partition sequence IDs into batches of sequences_per_job
publish N ComputeEmbeddingsBatch messages to protea.embeddings.batch
publish N StoreEmbeddings slots to protea.embeddings.write
update Job.progress_total = N
return OperationResult(deferred=True, result={"batches": N})

Batch and write workers ¶

The compute_embeddings_batch (OperationConsumer on protea.embeddings.batch) runs GPU inference per batch and publishes float32 vectors to protea.embeddings.write. The store_embeddings (OperationConsumer on protea.embeddings.write) bulk-inserts SequenceEmbedding rows and atomically increments Job.progress_current, closing the parent job when all batches are done.

predict_go_terms ¶

Operation name: predict_go_terms; queue: protea.predictions (coordinator; fans out KNN batch workers on protea.predictions.batch)

How to invoke this: see Predict GO terms in How-to Guides.
Tables touched: PredictionSet, GOPrediction (see Predictions in Data Model).
Lifecycle pattern: parent-child coordinator with deferred completion (see Job Lifecycle).
KNN backend rationale: see ADR-001: KNN on CPU, not pgvector or GPU.

Coordinator operation: loads reference embeddings into a process-level cache, partitions query proteins into batches, and fans out PredictGOTermsBatch messages to protea.predictions.batch. Returns OperationResult(deferred=True).

Payload fields ¶

Field	Default	Description
`embedding_config_id`	(required)	UUID of the `EmbeddingConfig` used for both query and reference embeddings.
`annotation_set_id`	(required)	UUID of the `AnnotationSet` supplying GO labels for reference proteins.
`ontology_snapshot_id`	(required)	UUID of the `OntologySnapshot` used to resolve GO terms.
`query_accessions`	`null`	List of query protein accessions; `null` = use all proteins.
`query_set_id`	`null`	UUID of a `QuerySet` (alternative to `query_accessions`).
`limit_per_entry`	`5`	Maximum GO predictions per query protein.
`distance_threshold`	`null`	Discard neighbors beyond this distance; `null` = no threshold.
`batch_size`	`1024`	Query proteins per batch message.
`search_backend`	`"numpy"`	KNN backend: `"numpy"` (brute-force) or `"faiss"`.
`metric`	`"cosine"`	Distance metric (`"cosine"` or `"l2"`).
`faiss_index_type`	`"Flat"`	FAISS index type: `"Flat"`, `"IVFFlat"`, or `"HNSW"`.
`faiss_nlist`	`100`	Number of Voronoi cells when `faiss_index_type="IVFFlat"`. Larger values give finer partitioning at higher build cost. Ignored for `Flat` and `HNSW`.
`faiss_nprobe`	`10`	Number of Voronoi cells visited per query when `faiss_index_type="IVFFlat"`. Trades recall against query latency. Ignored for `Flat` and `HNSW`.
`faiss_hnsw_m`	`32`	HNSW graph degree (edges per node) when `faiss_index_type="HNSW"`. Larger values raise recall and memory. Ignored for `Flat` and `IVFFlat`.
`faiss_hnsw_ef_search`	`64`	HNSW search-time queue size when `faiss_index_type="HNSW"`. Larger values raise recall at the cost of query latency. Ignored for `Flat` and `IVFFlat`.
`compute_alignments`	`true`	Compute NW + SW pairwise alignments (parasail) for each prediction.
`compute_taxonomy`	`true`	Compute taxonomic distance (ete3 NCBITaxa) for each prediction.
`compute_reranker_features`	`true`	Compute 5 aggregate re-ranker features per prediction: `vote_count`, `k_position`, `go_term_frequency`, `ref_annotation_density`, and `neighbor_distance_std`.
`compute_v6_features`	`false`	Compute the 25-column `v6_features` family: 6 anc2vec ancestry signals, 3 taxonomic-voter aggregates, 16 PCA-projected embedding columns. PCA state is fit once per `EmbeddingConfig` and persisted; enabling this flag adds those 25 columns to every `GOPrediction` row.
`expand_votes_to_ancestors`	`false`	Synthesise the `is_a` / `part_of` ancestor closure of every leaf candidate as additional records. Matches the candidate distribution a lab booster sees when trained with ancestor expansion enabled.
`aspect_separated_knn`	`true`	Build three separate KNN indices, one per GO aspect (BPO/MFO/CCO), so every query receives candidates in every aspect even when its nearest neighbours in a unified index happen to be annotated only in one or two. A common cause of BPO recall ceilings.
`reranker_model_id`	`null`	Optional UUID of a registered `RerankerModel` (typically produced by `protea-reranker-lab` and registered via `scripts/register_reranker.py`). When set, the coordinator looks up `artifact_uri` and `feature_schema_sha`, validates both are present (raises `ValueError` otherwise), and emits a `predict_go_terms.reranker_bound` event. The snapshotted `artifact_uri` / `feature_schema_sha` are then propagated into every batch payload so workers do not have to re-query the row.

Reranker scoring and schema-SHA guard ¶

As of T2B.4 the reranker scoring path lives in a dedicated compositive class, RerankerScorer, rather than in the former _RerankerMixin hierarchy. PredictGOTermsBatchOperation receives a RerankerScorer instance via its constructor (default: one instance per operation) and delegates the full score-and-emit cycle to it. The legacy protea.core.operations.predict_go_terms._batch_op_reranker module is retained as a backward-compat shim that re-exports load_reranker, apply_reranker, and friends so existing tests patching the legacy module path continue to work.

When reranker_model_id is set, RerankerScorer.score is invoked after prediction_dicts has been built. The flow is:

Compute live_sha = compute_feature_schema_sha(active_families) from the live compute_alignments / compute_taxonomy / compute_v6_features flags.
If live_sha != reranker_feature_schema_sha, the scorer emits reranker.schema_mismatch (level error) and raises SchemaShaMismatchError (FARM-EXP.5 guard). The base worker catches this as an operation failure, transitions the batch job to FAILED with error_code="SchemaShaMismatchError", and does not continue with KNN ordering. Silent prediction with a stale booster layout is the exact failure mode this guard exists to prevent.
On match, the scorer attaches the GOTerm aspect to each dict, calls protea.core.reranker.apply_reranker(), and writes the reranker_score field into every prediction dict in memory.

reranker_score is in-memory only: GOPrediction does not yet have a column for it, so the score is surfaced through the predict_go_terms_batch.done event (nested reranker block with applied/skipped/mean_score/etc.) but is not persisted.

Note

The lab contract is imported lazily in the batch worker. If the production image ships without protea_reranker_lab installed the worker emits reranker.skipped with reason=contracts_unavailable and proceeds with KNN ordering.

Reference cache ¶

Reference embeddings are loaded once per (embedding_config_id, annotation_set_id) pair and stored as a process-level float16 numpy array (max 1 entry, LRU-evicted on config change). This avoids reloading hundreds of thousands of vectors for every batch.

Batch and write workers ¶

The predict_go_terms_batch (OperationConsumer on protea.predictions.batch) runs KNN search and GO transfer per batch. The store_predictions (OperationConsumer on protea.predictions.write) bulk-inserts GOPrediction rows and atomically increments Job.progress_current.

ping ¶

Operation name: ping; queue: protea.ping

Smoke-test operation. Accepts no required payload fields, emits a single ping.pong event, and returns immediately. Used to verify end-to-end connectivity of the job queue without touching the protein data tables.

generate_evaluation_set ¶

Operation name: generate_evaluation_set; queue: protea.jobs

Computes the CAFA-style evaluation delta between two AnnotationSet rows (old → new) and stores an EvaluationSet row with summary statistics. Follows the CAFA5 evaluation protocol exactly.

Payload fields ¶

Field	Default	Description
`old_annotation_set_id`	(required)	UUID of the older annotation set (t0, used as reference).
`new_annotation_set_id`	(required)	UUID of the newer annotation set (t1, ground truth source).
`pivot_ontology_snapshot_id`	`null`	Optional UUID of an `OntologySnapshot` to use as the pivot for NOT-propagation when the two annotation sets reference different snapshots. Ignored when both sets share a snapshot.

If pivot_ontology_snapshot_id is not set, both annotation sets must share the same ontology_snapshot_id; the operation raises ValueError otherwise.

Execution flow ¶

0. validate payload; load AnnotationSet rows; assert same ontology snapshot
1. idempotency check: if an EvaluationSet already exists for the
   (old, new) pair, return its summary immediately (no recompute).
   The DB-level UNIQUE constraint (alembic b8e3f1a7c2d9) enforces the
   same invariant at the schema layer; the short-circuit avoids paying
   the delta compute cost on re-submission.
2. call compute_evaluation_data(); see protea.core.evaluation:
   a. load GO DAG children map (is_a / part_of only)
   b. build NOT-propagation exclusion set (negated terms + GO descendants)
   c. load experimental annotations for old and new sets, filtered by
      exclusion set and evidence codes (EXP, IDA, IMP, IGI, IEP, IPI, …)
   d. classify per (protein, namespace) into NK / LK / PK:
      - NK: protein had NO annotations in any namespace at t0
      - LK: protein had annotations in some namespaces but not in namespace S;
            novel terms in S are LK ground truth
      - PK: protein had annotations in namespace S at t0 and gained new terms
            in S; novel terms are PK ground truth, old terms go to pk_known
3. emit stats (delta_proteins, nk/lk/pk counts and annotations)
4. INSERT EvaluationSet row with stats stored in JSONB
5. return OperationResult(result={evaluation_set_id, ...stats})

NK/LK/PK classification note ¶

Classification is per (protein, namespace), not globally per protein. A protein that had MFO and BPO annotations at t0 and gains new CCO and BPO annotations at t1 will be LK in CCO (no prior CCO annotations) and PK in BPO (had prior BPO annotations) simultaneously. This matches the CAFA5 evaluation protocol.

Ground truth files are computed on-demand by the download endpoints (not stored in the DB) using the same compute_evaluation_data logic.

run_cafa_evaluation ¶

Operation name: run_cafa_evaluation; queue: protea.evaluations

How to invoke this: see Run a CAFA evaluation in How-to Guides.
Tables touched: EvaluationSet, EvaluationResult (see Evaluation in Data Model).
Protocol: CAFA Evaluation Protocol for the NK/LK/PK semantics.

Runs the cafaeval evaluator against NK, LK, and PK ground-truth settings for a given EvaluationSet × PredictionSet pair. Persists an EvaluationResult row with per-namespace Fmax, precision, recall, τ, and coverage for each setting.

Payload fields ¶

Field	Default	Description
`evaluation_set_id`	(required)	UUID of the `EvaluationSet` to evaluate against.
`prediction_set_id`	(required)	UUID of the `PredictionSet` containing predictions to score.
`max_distance`	`null`	Discard predictions with cosine distance above this threshold before scoring (range 0 – 2). `null` = no threshold.
`scoring_config_id`	`null`	UUID of a `ScoringConfig` to apply to the prediction set before evaluation. `null` = score with the raw KNN-distance derived probability.
`reranker_id_nk`	`null`	UUID of a `RerankerModel` to apply to NK-category predictions before evaluation. Validated against the live feature schema; falls back to KNN ordering on `feature_schema_sha` mismatch.
`reranker_id_lk`	`null`	Same as `reranker_id_nk` but for the LK category.
`reranker_id_pk`	`null`	Same as `reranker_id_nk` but for the PK category.
`rerankers`	`null`	Nested mapping of category → aspect → reranker_model_id, e.g. `{"nk": {"bpo": "<uuid>", "mfo": "<uuid>"}, "lk": {...}}`. Overrides the flat `reranker_id_*` fields when present and lets a single run mix per-aspect boosters within a category.
`ia_file`	`null`	Path to an Information Accretion (IA) TSV file (two columns: `go_id`, `ia_value`). When provided, `cafaeval` weights each GO term by its IC so rare, specific terms count more than common, easy-to-predict ones. Without this file `cafaeval` runs with uniform IC=1, which inflates Fmax. For CAFA6 evaluations use the `IA_cafa6.tsv` file shipped with the benchmark.
`restrict_gt_to_predicted`	`true`	Standard CAFA practice: drop ground-truth proteins not present in the `PredictionSet` before evaluation, so coverage / Fmax measure performance on the actually-predicted cohort. Disable only when the eval set is guaranteed to be a subset of the predicted query set (e.g. a re-eval of a frozen lab dump where this filter has already been applied).

Execution flow ¶

1. load EvaluationSet + PredictionSet + OntologySnapshot from DB
2. compute_evaluation_data(): same delta as generate_evaluation_set
3. download OBO file from snapshot.obo_url to a temp directory
4. write ground-truth TSVs: gt_NK.tsv, gt_LK.tsv, gt_PK.tsv,
   known_terms.tsv, pk_known_terms.tsv
5. write predictions in CAFA format (protein \\t go_id \\t score),
   score = max(0, 1 - distance), filtered to delta proteins only
6. session.commit(): releases DB connection before cafaeval forks workers
7. for each setting in (NK, LK, PK):
   a. NK pass: cafa_eval(obo, predictions/, gt_NK.tsv)
   b. LK pass: cafa_eval(obo, predictions/, gt_LK.tsv)
   c. PK pass: cafa_eval(obo, predictions/, gt_PK.tsv, exclude=pk_known_terms.tsv)
   d. parse per-namespace Fmax / precision / recall / τ / coverage
   e. write cafaeval outputs (PR curves, metric TSVs) into a persistent
      staging dir under ``settings.storage.artifacts_dir`` and upload them
      through the configured ``ArtifactStore``
8. INSERT EvaluationResult row with results dict (NK → {BPO, MFO, CCO})
9. return OperationResult(result={evaluation_result_id, results})

PK known-terms ¶

For the PK pass, pk_known_terms.tsv is passed to cafa_eval as the -known (exclude) argument. This file contains all experimental annotations from the old snapshot for PK proteins in the relevant namespace. cafaeval uses this to exclude known annotations from scoring; methods that simply repeat prior annotations receive no credit for PK predictions.

SIGTERM handling ¶

cafaeval uses Python multiprocessing.Pool internally. Before each cafa_eval() call, the operation temporarily resets SIGTERM and SIGINT handlers to defaults so that forked pool workers can be terminated cleanly. The original handlers are restored afterwards.

train_reranker ¶

Deprecated since version contract-first-lab-integration: LightGBM training has been decoupled from PROTEA and now lives in protea-reranker-lab. TrainRerankerOperation and TrainRerankerAutoOperation are no longer registered in the OperationRegistry; a POST /jobs request with operation_name: train_reranker or train_reranker_auto will be rejected. The classes remain importable as internal helpers that export_research_dataset reuses in-process to run the shared KNN + feature-generation pipeline in dump_only mode.

The narrative below (feature matrix, LightGBM hyperparameters, validation metrics) is retained as the canonical specification of the booster that the lab trains; the lab consumes the exact same feature schema PROTEA produces. For the live end-to-end flow, see export_research_dataset below and the /reranker-models/import HTTP surface.

Operation name: train_reranker (internal helper, not queued)

How to invoke this: no longer invocable via /jobs. See
export_research_dataset and the POST /datasets
endpoint; booster training runs in protea-reranker-lab.
Tables touched: RerankerModel (written by /reranker-models/import, not by this helper directly).
Evaluation context: CAFA Evaluation Protocol and Results.

Trains a LightGBM binary classifier that re-scores GO term predictions on top of the embedding-based KNN baseline. The input is a (PredictionSet, EvaluationSet) pair: predictions supply the feature matrix (stored in GOPrediction columns populated at prediction time through compute_alignments=True and compute_taxonomy=True), the evaluation set supplies the ground-truth delta used to derive binary labels.

The design follows the principle of keeping the training signal interpretable: the re-ranker is not an end-to-end deep network over raw embeddings, but a gradient-boosted decision tree over hand-engineered features that each have a well-known meaning (alignment identity, taxonomic distance, neighbour agreement). Feature importance can therefore be inspected directly and used to drive the iterative training loop documented in Results.

Payload fields ¶

Field	Default	Description
`prediction_set_id`	(required)	UUID of the `PredictionSet` to use as training data.
`evaluation_set_id`	(required)	UUID of the `EvaluationSet` providing ground-truth labels.

Feature matrix ¶

The feature set is defined statically in protea.core.reranker.NUMERIC_FEATURES and protea.core.reranker.CATEGORICAL_FEATURES. It comprises 20 numeric and 3 categorical columns, grouped by origin:

Numeric features (20)¶
Feature	Group	Description
`distance`	KNN	Cosine or L2 distance from query to nearest-neighbour reference. Lower = closer in embedding space.
`identity_nw`	Alignment	Needleman–Wunsch global identity percentage (parasail, BLOSUM62).
`similarity_nw`	Alignment	NW similarity percentage (identical + positive-scoring substitutions).
`alignment_score_nw`	Alignment	Raw NW alignment score.
`gaps_pct_nw`	Alignment	Fraction of aligned positions that are gaps in the NW alignment.
`alignment_length_nw`	Alignment	Length of the NW alignment (number of columns).
`identity_sw`	Alignment	Smith–Waterman local identity percentage.
`similarity_sw`	Alignment	SW similarity percentage.
`alignment_score_sw`	Alignment	Raw SW alignment score.
`gaps_pct_sw`	Alignment	Gap fraction in the SW alignment.
`alignment_length_sw`	Alignment	Length of the SW alignment.
`length_query`	Length	Residue count of the query protein sequence.
`length_ref`	Length	Residue count of the matched reference sequence.
`taxonomic_distance`	Taxonomy	Distance between query and reference in the NCBI taxonomy tree (ete3 `NCBITaxa`).
`taxonomic_common_ancestors`	Taxonomy	Number of common ancestors shared between query and reference taxa.
`vote_count`	Aggregate	Number of top-`k` neighbours that transfer the same GO term to the query (higher = stronger consensus).
`k_position`	Aggregate	Rank of the first neighbour supporting the term (1 = top hit).
`go_term_frequency`	Aggregate	Frequency of the GO term in the entire reference set (prior).
`ref_annotation_density`	Aggregate	Average number of GO annotations per reference protein in the batch.
`neighbor_distance_std`	Aggregate	Standard deviation of KNN distances for the query (low = tight cluster, high = uncertain).

Categorical features (3)¶
Feature	Description
`qualifier`	GO annotation qualifier propagated from the reference (e.g. `enables`, `involved_in`, `part_of`). `NOT`-qualified references are already filtered out upstream by the evaluation pipeline.
`evidence_code`	GO evidence code of the reference annotation (EXP, IDA, IMP, …, or electronic IEA). Acts as a prior on annotation reliability.
`taxonomic_relation`	Coarse taxonomic relation between query and reference (`same_species`, `same_genus`, `same_family`, …, `unrelated`).

All three categorical features are converted to pandas category dtype in protea.core.reranker.prepare_dataset(); LightGBM consumes them directly, so no manual label encoding is required. Missing values in numeric columns are left as NaN; LightGBM handles them natively by learning an optimal direction at each split.

Training protocol ¶

The training loop (now in protea-reranker-lab) proceeds as follows:

Label derivation. Each row of the training DataFrame carries a binary label column: 1 if the predicted (protein, go_term) pair is present in the NK ∪ LK ∪ PK ground truth of the associated EvaluationSet, 0 otherwise. The label is computed upstream by the /scoring/prediction-sets/{id}/training-data.tsv endpoint.
Stratified split. Positives and negatives are split independently into train and validation fractions (default val_fraction = 0.2) so that the positive rate is preserved on both sides. Shuffling uses a fixed seed (default 42) for reproducibility.
Optional class-balance control. Because the raw training set is extremely imbalanced (positive rate typically ≤ 1 %), the caller can pass neg_pos_ratio to subsample negatives independently on the train and validation splits. The default is None (keep all negatives); the results chapter documents the ratios used for each re-ranker iteration.
Optional per-sample weighting. When sample_weight is provided (for example, the Information Accretion of each GO term), the weights are attached to both the training and validation datasets so that high-weight samples contribute more to the LightGBM loss. This lets the re-ranker be trained directly against the IA-weighted Fmax metric used in CAFA Evaluation Protocol.
LightGBM training. The Booster is trained with the default hyperparameters below, merged with any overrides passed in the params argument. Early stopping is enabled on the validation split.

LightGBM default hyperparameters¶
Parameter	Default	Notes
`objective`	`binary`	Binary classification.
`metric`	`["binary_logloss", "auc"]`	Both are reported; early stopping uses the first metric.
`boosting_type`	`gbdt`	Standard gradient-boosted decision trees.
`num_leaves`	`31`	LightGBM default.
`learning_rate`	`0.01`	Low LR + early stopping; see `num_boost_round`.
`feature_fraction`	`0.8`	Column sub-sampling per tree (reduces overfitting).
`bagging_fraction`	`0.8`	Row sub-sampling per boosting round.
`bagging_freq`	`5`	Apply bagging every 5 iterations.
`seed`	`42`	Fixed for reproducibility.
`num_boost_round`	`1000`	Maximum boosting iterations.
`early_stopping_rounds`	`50`	Stop if validation logloss does not improve for 50 rounds.

Validation metrics ¶

After training, the validation set is scored and the following metrics are written to RerankerModel.metrics as a JSONB dict:

best_iteration: boosting round at which early stopping triggered.
val_auc: ROC-AUC on the validation split.
val_logloss: binary logloss at the best iteration.
val_precision / val_recall / val_f1: computed at the p ≥ 0.5 decision threshold. These are reported for quick inspection but are not the primary objective: the re-ranker is ultimately scored through the full CAFA evaluation pipeline (CAFA Evaluation Protocol), which uses IA-weighted Fmax per (category, namespace) cell.
train_samples / val_samples: row counts after any negative subsampling.
positive_rate: fraction of rows with label == 1 before any subsampling.

In addition, the training step returns a gain-based feature-importance dictionary (feature_name → total gain), which is persisted in RerankerModel.feature_importance. This is the signal used in Results to drive the three documented iterations of the re-ranker.

Note

Cross-validation is not performed inside the training operation. The temporal-holdout re-ranker design uses 13 historical GOA splits (releases 160 through 220) as independent training folds; that cross-validation loop now lives in protea-reranker-lab, which consumes the frozen parquet datasets PROTEA publishes via export_research_dataset (see export_research_dataset). The lab fits one booster per split and emits the run directory consumed by scripts/register_reranker.py. See Results for the aggregate numbers.

train_reranker_auto ¶

Operation name: train_reranker_auto (internal helper, not queued)

Note

Formerly a convenience variant that auto-selected the most recent PredictionSet and EvaluationSet and forwarded them to train_reranker. Like train_reranker it is no longer registered in the OperationRegistry. The class survives only as the in-process engine that ExportResearchDatasetOperation drives in dump_only mode to produce the frozen parquet triple consumed by the lab.

export_research_dataset ¶

Operation name: export_research_dataset; queue: protea.training

How to invoke this: see Registering a reranker from protea-reranker-lab in How-to Guides.

Tables touched: read-only over existing PredictionSet / feature data. Writes blobs via the configured ArtifactStore (no ORM writes).

Runs the same KNN + feature-generation pipeline as train_reranker_auto but skips LightGBM training entirely and publishes the resulting train.parquet / eval.parquet / manifest.json triple through the configured ArtifactStore. The produced dataset is the canonical input consumed by protea-reranker-lab for offline re-ranker training.

The manifest embeds schema_version=2, the PROTEA producer_version (protea.__version__) and producer_git_sha so any lab run can be traced back to the exact PROTEA HEAD that produced its training data.

Payload fields ¶

Field	Default	Description
`embedding_config_id`	(required)	UUID of the `EmbeddingConfig` used for KNN.
`ontology_snapshot_id`	(required)	UUID of the `OntologySnapshot` used to resolve GO terms.
`train_versions`	(required)	List of historical annotation-set versions forming the training snapshot pairs (must contain at least 2 entries; sorted ascending).
`test_versions`	(required)	List of annotation-set versions reserved for evaluation (non-empty; sorted ascending).
`annotation_source`	`"goa"`	Annotation source tag recorded in the manifest.
`output_name`	(required)	Human label for the published dataset. Used as `name` in the manifest and as the store key prefix `datasets/<output_name>/`.
`k`	`5`	KNN neighbour count.
`search_backend`	`"faiss"`	KNN backend (`"numpy"` or `"faiss"`).
`compute_alignments`	`false`	Include NW alignment and length feature families.
`compute_taxonomy`	`false`	Include the taxonomy-pair feature family.
`expand_votes_to_ancestors`	`false`	Propagate neighbour votes up the GO DAG before feature materialisation.
`use_embedding_pca`	`false`	Include embedding PCA and `v6_features` family (enables `anc2vec_*`, `emb_pca`, `taxonomy_voters`, `go_context`).

Execution flow ¶

1. validate payload (Pydantic 2.x, strict mode)
2. load Settings; resolve ArtifactStore via get_artifact_store(settings)
3. run train_reranker_auto in dump-only mode against a temp dir
   (per_cell training_scope, no booster trained)
4. for each of train.parquet / eval.parquet / manifest.json:
     store.put("datasets/<output_name>/<file>", path)
5. emit export_research_dataset.published {backend, key_prefix, files}
6. return OperationResult(result={
     output_name, key_prefix, storage_backend,
     train_uri, eval_uri, manifest_uri, ...auto_result
   })

Emitted events ¶

The operation relays the underlying train_reranker_auto events under its own namespace (train_reranker_auto.* → export_research_dataset.*) and additionally emits:

export_research_dataset.started: payload echoed back with resolved backend.
export_research_dataset.rows_written: n_train_rows / n_eval_rows per shard once the parquets are materialised.
export_research_dataset.published: {backend, key_prefix, files} when the three artefacts have been uploaded successfully.
export_research_dataset.completed: final summary including the three resulting URIs (file://… or s3://bucket/key).

InterPro annotation pipeline (IP.2/3/4)

Three operations form the InterPro-based functional-enrichment stage. They run sequentially: first load the InterPro2GO mapping (IP.4a), then annotate proteins with InterProScan (IP.2/3), then propagate GO terms (IP.4b). All three are registered in the OperationRegistry and run on the protea.jobs queue. The only prerequisite outside PROTEA is an InterProScan binary install on the host (see the binary_path payload field below).

load_interpro_go_mapping ¶

Operation name: load_interpro_go_mapping; queue: protea.jobs

Tables touched: writes interpro_go_mapping (upsert).

Downloads the EBI interpro2go flat file and upserts (ipr_accession, go_id, source_version) rows into the interpro_go_mapping table. The operation is idempotent: the unique constraint uq_interpro_go_mapping_pair on (ipr_accession, go_id, source_version) means that re-running for the same release is a no-op.

Payload fields¶
Field	Default	Description
`source_version`	(required)	InterPro release tag stored with every row, e.g. `"InterPro-104.0"`. Used as the join key by `predict_go_terms_from_interpro`.
`mapping_url`	EBI FTP endpoint	URL of the `interpro2go` flat file. Defaults to `https://ftp.ebi.ac.uk/pub/databases/GO/goa/external2go/interpro2go`.
`evidence`	`"IEA"`	Evidence code stored with each mapping row.
`timeout_seconds`	`120`	HTTP request timeout in seconds.
`chunk_size`	`1000`	Upsert batch size.

The result dict reports rows_parsed, rows_inserted, and rows_skipped_duplicate.

run_interproscan_batch ¶

Operation name: run_interproscan_batch; queue: protea.jobs

Tables touched: writes interpro_annotation (insert).

Annotates proteins in fixed-size chunks via an InterProScan subprocess (IP.3). Inputs are resolved from either a QuerySet or an explicit accession list. An optional ipr_release_floor makes the run resumable: proteins whose latest persisted annotation already meets the floor are skipped, so re-dispatching the same payload picks up where the prior run left off.

Payload fields¶
Field	Default	Description
`query_set_id`	`null`	UUID of a `QuerySet`. Exactly one of `query_set_id` / `accessions` must be provided.
`accessions`	`null`	Explicit list of UniProt accessions. Exactly one of `query_set_id` / `accessions` must be provided.
`ipr_release_floor`	`null`	Lexicographic floor on `ipr_release`. Proteins already annotated at or above this release are skipped.
`chunk_size`	`50`	Proteins per `interproscan.sh` invocation. Small enough to bound subprocess wall-clock; large enough to amortise JVM startup.
`timeout_seconds`	`3600`	Per-chunk subprocess timeout in seconds.
`binary_path`	`null`	Full path to the `interproscan.sh` executable. Required if it is not on `$PATH`.
`extra_args`	`[]`	Additional arguments forwarded to `interproscan.sh`.
`commit_every_chunk`	`true`	Commit after each chunk so partial progress is durable.

The result dict reports chunks_done, proteins_processed, proteins_skipped, annotations_inserted, and ipr_releases_seen. Progress events (run_interproscan_batch.chunk_done) are emitted after each chunk.

predict_go_terms_from_interpro ¶

Operation name: predict_go_terms_from_interpro; queue: protea.jobs

Tables touched: writes one PredictionSet row and GOPrediction rows.

A parallel-signal GO term predictor (IP.4b). Given a set of query proteins whose InterProAnnotation rows are already in the DB, the operation:

Joins each protein’s distinct InterPro accessions against interpro_go_mapping at the requested source_version.
Resolves GO IDs against the target OntologySnapshot.
Aggregates per-protein votes: N distinct InterPro entries on the same protein mapping to the same GO term yield a vote count of N, with distance = 1.0 / N (matching the KNN convention).
Persists the result as a new PredictionSet tagged meta["algorithm"] = "interpro_propagation".

The operation is single-stage (no batch fan-out): the three-table join is a cheap index scan and the per-protein InterPro hit count is O(10-100).

Payload fields¶
Field	Default	Description
`embedding_config_id`	(required)	UUID of the `EmbeddingConfig` used as the trace pointer for the new `PredictionSet`. The model is not re-run; this field satisfies the NOT NULL FK on the table.
`annotation_set_id`	(required)	UUID of the `AnnotationSet` that acts as the trace pointer for the new `PredictionSet`.
`ontology_snapshot_id`	(required)	UUID of the `OntologySnapshot` used to resolve GO IDs.
`source_version`	(required)	InterPro2GO mapping release tag (must match a previously loaded `load_interpro_go_mapping` run, e.g. `"InterPro-104.0"`).
`query_set_id`	`null`	UUID of a `QuerySet`. At least one of `query_set_id` / `query_accessions` is required; both may be set (union used).
`query_accessions`	`null`	Explicit list of UniProt accessions.
`ipr_release_floor`	`null`	If set, only `InterProAnnotation` rows whose `ipr_release` is at or above this floor contribute to the join.
`chunk_size`	`1000`	Accession batch size for the three-table join.

The result dict reports prediction_set_id, proteins_with_ipr, proteins_with_predictions, candidate_pairs, and predictions_inserted.

Ephemeral consumer operations ¶

Four operations run as OperationConsumer workers on the high-throughput batch queues. They differ from the operations documented above in three ways:

No ``Job`` row. The coordinator publishes payloads directly to the queue; the consumer executes them without creating a child Job row in the DB. Progress is tracked by atomically incrementing Job.progress_current on the parent job.
No HTTP endpoint. These operations cannot be submitted through POST /jobs. They are only reachable through the fan-out logic of the coordinator operations compute_embeddings and predict_go_terms.
Payload is the full message. Because there is no Job row to read from, the full ProteaPayload is serialised into the AMQP body.

They still implement the Operation protocol and are registered alongside the other eleven in protea/core/operation_catalog.py. Bringing the total to 18 registered operations (14 job-backed + 4 ephemeral).

compute_embeddings_batch ¶

Queue: protea.embeddings.batch; consumer: OperationConsumer

Runs the protein language model forward pass for one batch of sequences. Loads the model on first use, caches it at process level, performs GPU inference (ESM-2, ESM-C, ProtT5/ProstT5, or Ankh base/large), pools residue-level representations according to the EmbeddingConfig strategy, converts to float32, and publishes a StoreEmbeddings message to protea.embeddings.write.

Backends are selected via EmbeddingConfig.model_backend:

esm: ESM-2 family. Dispatched through the protea-backends plugin (protea.backends entry_points group, T2A.1). The plugin calls plugin.embed_chunks; the legacy _embed_esm shim is used only as a fallback when the installed plugin predates T2A.1.
esm3c: ESM SDK ESMC (ESM3c family); FP16 on GPU, no external tokenizer.
t5: HuggingFace T5EncoderModel (ProstT5, prot_t5_xl_uniref50…); the <AA2fold> prefix is auto-injected when the model name contains prostt5. Runs through the local _embed_t5 shim (plugin migration pending T2A.2).
ankh: HuggingFace T5EncoderModel loaded via AutoTokenizer (ElnaggarLab/ankh-base, ElnaggarLab/ankh-large). Shares the batched T5 pipeline but with two mandatory deviations from the ProstT5 path, both verified end-to-end with real weights on 2026-04-10:
1. bfloat16 on CUDA, never float16. Ankh was pre-trained on TPU in bfloat16 and its T5 LayerNorm overflows to NaN under FP16 on every forward pass. The loader pins torch_dtype=torch.bfloat16 on GPU and torch.float32 on CPU.
2. Char-level tokenisation with is_split_into_words=True. Ankh’s SentencePiece vocabulary treats literal spaces as <unk>, so the ProstT5-style " ".join(seq) path emits ~50 % <unk> tokens and destroys the embedding. _embed_ankh instead passes [list(seq) for seq in batch] with is_split_into_words=True (verified 0 <unk>). The <AA2fold> prefix is never injected.
auto: Dispatched through the protea-backends plugin (same path as esm, T2A.1). Falls back to the _embed_esm shim when the plugin is absent.

Residue-tensor convention¶

Every backend strips all special tokens before pooling and chunking, so the residue tensor has exactly N rows where N is the amino-acid length of the input sequence. This makes chunk_index_s and chunk_index_e mean the same thing on every backend: indices into the amino-acid sequence, not into a backend-specific token tensor.

Backend	Tokens stripped from residue slice	residue_len
`esm`	CLS (position 0) + EOS (last position)	`N`
`esm3c`	BOS (position 0) + EOS (last position)	`N`
`t5` (`prot_t5_xl_uniref50`)	EOS (last position)	`N`
`t5` (ProstT5)	`<AA2fold>` (position 0) + EOS (last position)	`N`
`ankh`	EOS (last position)	`N`

This convention was unified on 2026-04-10. Before that date, the T5 family kept the trailing EOS and ProstT5 additionally kept the leading <AA2fold> prefix token, so chunk indices on those backends were offset by 0–2 positions relative to the amino-acid sequence. Embeddings computed under the old convention are not directly comparable with new ones and must be recomputed.

A single GPU worker is typically sufficient because each batch already saturates the device; the coordinator serialises dispatch with RetryLaterError(delay=60s) when another embedding job is RUNNING.

store_embeddings ¶

Queue: protea.embeddings.write; consumer: OperationConsumer

Bulk-inserts SequenceEmbedding rows (one per chunk) into the pgvector VECTOR column, using PostgreSQL INSERT ... ON CONFLICT DO NOTHING to tolerate re-runs. After each successful insert it atomically increments Job.progress_current on the parent compute_embeddings job and closes the job when progress_current == progress_total.

This stage is deliberately decoupled from GPU inference so that slow disk writes never block the GPU, and so the write worker can be scaled independently.

predict_go_terms_batch ¶

Queue: protea.predictions.batch; consumer: OperationConsumer

Runs the KNN + GO transfer pipeline for one batch of query proteins. As of F2C.5 the heavy lifting is delegated to protea_method.pipeline.predict() (protea-method >= 0.2.0); the in-tree adapter protea.core.operations._predict_go_terms_adapter re-derives PROTEA-only fields from the PredictDiagnostics the unified path returns.

Loads query embeddings from the DB.
Resolves the reference cache (process-level float16 array; loaded from disk at data/ref_cache/<embedding_config>__<annotation_set>_*.npy if present, otherwise streamed from PostgreSQL in chunks of 2 000 rows and persisted).
Calls protea_method.pipeline.predict() with return_diagnostics=True: per-aspect KNN search via numpy or FAISS (Flat / IVFFlat / HNSW) using the configured metric (cosine or l2), neighbour vote accumulation, and optional ancestor expansion live in the pure inference library.
The PROTEA adapter walks neighbors_by_aspect and go_map_by_aspect to attach prediction_set_id, ref_protein_accession, qualifier, evidence_code plus the legacy re-ranker aggregates (k_position, go_term_frequency, ref_annotation_density, neighbor_distance_std, neighbor_vote_fraction). Optional alignment features (NW/SW via parasail) and taxonomic features (ete3 NCBITaxa) are computed on top, plus the v6_features shim (enrich_v6_features) when compute_v6_features=true.
Publishes a StorePredictions message to protea.predictions.write.

GPU is not required: KNN search runs on CPU unless a FAISS GPU index is configured at process startup. The re-ranker (booster) is applied after ancestor expansion by the injected RerankerScorer collaborator (T2B.4), not inside protea_method.pipeline.predict, so the historical scoring order remains bit-exact.

store_predictions ¶

Queue: protea.predictions.write; consumer: OperationConsumer

Bulk-inserts GOPrediction rows (one row per predicted (protein, go_term) pair) and atomically increments Job.progress_current on the parent predict_go_terms job. Uses INSERT ... ON CONFLICT DO NOTHING so that re-delivered messages are a no-op.

All feature-engineering columns declared in GOPrediction (identity_nw, similarity_sw, taxonomic_distance, etc.) are written in the same insert when they are present in the payload.

GPU torch installation

PROTEA’s pyproject.toml pins torch to the pytorch-cpu PyPI source so that CI runners and the slim production Docker image do not pull the ~6 GB NVIDIA / triton wheel stack. Hosts that run GPU inference (compute_embeddings_batch workers, or the torch KNN backend in the export pipeline) must override torch with a CUDA-capable wheel after every poetry install or poetry update.

# Default: cu128 (NVIDIA driver 570+)
bash scripts/install_gpu_torch.sh

# Override for older drivers
CUDA_VARIANT=cu121 bash scripts/install_gpu_torch.sh
CUDA_VARIANT=cu118 bash scripts/install_gpu_torch.sh

# Point at an explicit venv (CI, smoke tests)
VENV_PATH=/tmp/torch-smoke bash scripts/install_gpu_torch.sh

The CUDA_VARIANT default is cu128 as of PR #564. The script validates compatibility against the protea-method torch-GPU KNN backend; cu128 is tested against NVIDIA driver 570 and 580 series.

Re-running after poetry sync

Both poetry install (sync mode) and poetry update reinstall the CPU torch wheel from pytorch-cpu, silently downgrading the CUDA build. You must re-run scripts/install_gpu_torch.sh after every sync to restore GPU capability. A quick self-check is included in the script output:

>>> import torch self-check:
torch 2.x.y+cu128 cuda_available=True

If cuda_available=False after the script, the driver is too old for the selected variant or the CUDA runtime is not found. Try an older variant (cu121) or check that nvidia-smi shows a live device.

Why runtime deps are included

Torch on Linux ships its CUDA runtime through the nvidia-* PyPI packages (nvidia-cudnn-cu12, nvidia-cublas-cu12, etc.). Skipping dependency resolution (e.g. via pip install with the no-deps flag) leaves those absent and import torch fails at load time with a missing libcudnn.so error. The script lets pip resolve the runtime deps automatically; any unrelated version bumps (sympy, networkx, MarkupSafe) stay inside the permissive ranges declared in pyproject.toml.

Registering a new operation ¶

Create a module under protea/core/operations/.
Define a payload class extending ProteaPayload.
Implement the Operation protocol (name attribute + execute).
Register the instance in the worker startup script:
```
registry.register(MyOperation())
```
Route jobs to the correct queue by setting queue_name in the POST /jobs request body.

No changes to BaseWorker, the API, or the infrastructure layer are required.