Operations¶
An Operation is the fundamental unit of domain logic in PROTEA. Every task
that a worker can execute (from a health-check ping to a full UniProt ingest)
is encapsulated in a class that satisfies the Operation protocol.
See also
How-to Guides: task-oriented recipes that submit each of these operations through the HTTP API with concrete payloads.
Data Model: the ORM tables that operations read from and write to (
Sequence,Protein,GOTerm,AnnotationSet,EmbeddingConfig,GOPrediction,EvaluationSet…).Job Lifecycle: how the worker layer dispatches these operations and tracks their progress.
The Operation protocol¶
class Operation(Protocol):
name: str
def execute(
self,
session: Session,
payload: dict[str, Any],
*,
emit: EmitFn,
) -> OperationResult: ...
nameA stable string identifier used to route jobs. Must be unique across all registered operations and must match the
operationfield in theJobrow.executeReceives an open SQLAlchemy session, a raw
dictpayload (validated internally), and anemitcallback. Returns anOperationResult. Must not manage sessions, queue connections, or threads.EmitFnType alias for
Callable[[str, str | None, dict[str, Any], Level], None]. Callingemit(event, message, fields, level)writes aJobEventrow in real time, visible on the frontend timeline.OperationResultFrozen dataclass with four fields:
result(stored inJob.meta), optionalprogress_current/progress_totalwritten back to theJobrow for the progress bar, anddeferred(bool) which tellsBaseWorkerthat job completion will be signalled by child workers rather than immediately.
Payload validation¶
Every operation defines a payload class that extends ProteaPayload:
class ProteaPayload(BaseModel, frozen=True):
model_config = ConfigDict(strict=True)
ProteaPayload is an immutable, strictly-typed Pydantic (2.x) base. Strict
mode prevents silent coercions ("yes" is not a valid bool). Each
operation calls MyPayload.model_validate(payload) at the top of
execute(); validation errors surface as FAILED jobs with a clear
error message, before any DB writes occur.
insert_proteins¶
Operation name: insert_proteins; queue: protea.jobs
Fetches protein sequences from the UniProt REST API in FASTA format and
upserts them into the protein and sequence tables.
Payload fields¶
Field |
Default |
Description |
|---|---|---|
|
(required) |
Raw UniProt query string. Example: |
|
|
Results per page (1 – ∞). Larger values reduce round-trips. |
|
|
Stop after this many records (useful for testing). |
|
|
HTTP request timeout per page. |
|
|
Append |
|
|
Request gzip-compressed responses. |
|
|
Maximum retry attempts per page before raising. |
|
|
Exponential backoff base (seconds). |
|
|
Maximum wait between retries (seconds). |
|
|
Random jitter added to each backoff wait. |
|
PROTEA/insert_proteins … |
|
Execution flow¶
1. validate payload (Pydantic)
2. for each FASTA page from UniProt:
a. parse FASTA → list of records (accession, sequence, metadata)
b. compute MD5 hash per sequence
c. bulk-load existing Sequence rows by hash (chunks of 5 000)
d. INSERT missing Sequence rows → obtain IDs
e. bulk-load existing Protein rows by accession (chunks of 5 000)
f. INSERT new Protein rows / conservative UPDATE existing rows
g. session.flush(): no commit per page (commit on job success)
h. emit("insert_proteins.page_done", ...)
3. if total_limit reached → emit warning + break
4. emit("insert_proteins.done", ...)
5. return OperationResult(result={counts and timing})
Sequence deduplication¶
Sequence rows are keyed by sequence_hash (MD5 of the amino-acid
string). Many Protein rows can point to the same Sequence row;
sequence_id is deliberately non-unique. This eliminates redundant storage
for identical sequences across species or isoforms.
Conservative protein update¶
For proteins that already exist, insert_proteins applies a
fill-in-blanks policy: it only overwrites a field if the current DB value
is None or empty. Existing non-null values are never overwritten. This
prevents a re-ingestion from degrading data that was enriched by a later step.
Isoform handling¶
Accessions of the form <canonical>-<n> are parsed by
Protein.parse_isoform(). Both the isoform accession and the canonical
accession are stored. is_canonical = False for isoforms;
isoform_index stores the numeric suffix.
fetch_uniprot_metadata¶
Operation name: fetch_uniprot_metadata; queue: protea.jobs
ProteinUniProtMetadata (see Data Model).Fetches functional annotations from the UniProt REST API in TSV format and
upserts ProteinUniProtMetadata rows, one per canonical accession.
Payload fields¶
Field |
Default |
Description |
|---|---|---|
|
(required) |
Raw UniProt query string. |
|
|
Results per TSV page. |
|
|
Stop after this many rows. |
|
|
HTTP timeout per page. |
|
|
Request gzip-compressed TSV. |
|
|
Maximum retry attempts. |
|
|
Backoff base (seconds). |
|
|
Maximum wait (seconds). |
|
|
Jitter added to backoff. |
|
|
Commit after each page (reduces memory pressure on large ingests). |
|
|
Backfill |
|
PROTEA/fetch_uniprot_metadata … |
|
TSV field mapping¶
The operation requests 25 TSV fields from UniProt and maps them to
ProteinUniProtMetadata columns:
DB column |
UniProt TSV header |
|---|---|
|
Function [CC] |
|
Catalytic activity |
|
EC number |
|
Pathway |
|
Kinetics |
|
Absorption |
|
Active site |
|
Binding site |
|
Cofactor |
|
DNA binding |
|
Activity regulation |
|
pH dependence |
|
Redox potential |
|
Rhea ID |
|
Site |
|
Temperature dependence |
|
Keywords |
|
Features |
Isoform scoping¶
Because ProteinUniProtMetadata is keyed by canonical_accession, all
isoforms of a protein share a single metadata record. The canonical accession
is resolved via Protein.parse_isoform(accession)[0] for each row in the
TSV response.
load_ontology_snapshot¶
Operation name: load_ontology_snapshot; queue: protea.jobs
Downloads a GO OBO file and populates OntologySnapshot + GOTerm +
GOTermRelationship rows. Idempotent: if a snapshot with the same
obo_version already exists and its relationships are present, the
operation returns immediately without writing anything.
Payload fields¶
Field |
Default |
Description |
|---|---|---|
|
(required) |
Direct HTTP(S) URL to the |
|
|
HTTP download timeout in seconds. |
|
|
Re-insert relationships even if the snapshot already exists with relationships. |
Execution flow¶
1. validate payload
2. download OBO text (HTTP GET, single request)
3. extract ``data-version`` header → obo_version
4. check DB for existing OntologySnapshot with that obo_version
a. exists + has relationships → skip (idempotent)
b. exists + no relationships → backfill relationships only
c. does not exist → full insert
5. parse OBO stanzas → GOTerm rows (aspect mapped from namespace)
6. session.add_all(go_terms)
7. parse relationship edges → GOTermRelationship rows
8. session.add_all(relationships)
9. emit("load_ontology_snapshot.done", ...)
10. return OperationResult(result={snapshot_id, term_count, rel_count})
load_goa_annotations¶
Operation name: load_goa_annotations; queue: protea.jobs
Streams a GOA GAF 2.2 file (plain or gzip) and bulk-inserts
AnnotationSet + ProteinGOAnnotation rows. Only accessions already
present in the protein table are retained; all others are silently
skipped.
Payload fields¶
Field |
Default |
Description |
|---|---|---|
|
(required) |
UUID of the |
|
(required) |
HTTP(S) URL to the GAF file (plain or |
|
(required) |
Human-readable version label stored in |
|
|
Lines buffered per commit cycle. |
|
|
HTTP stream timeout. |
|
|
Commit after each page to bound memory use (recommended for large GAFs). |
|
|
Stop after this many annotation rows (for testing). |
Execution flow¶
1. validate payload; resolve OntologySnapshot and canonical accession set
2. create AnnotationSet row (source="goa")
3. stream GAF lines:
a. skip comment lines (starting with "!")
b. parse 15-column tab-separated record
c. filter against canonical accessions; skip unknown
d. resolve go_term_id from go_id; skip if term unknown in snapshot
e. buffer ProteinGOAnnotation rows
f. flush + commit every page_size rows
4. emit("load_goa_annotations.done", ...)
5. return OperationResult(result={annotation_set_id, inserted, skipped})
load_quickgo_annotations¶
Operation name: load_quickgo_annotations; queue: protea.jobs
Streams GO annotations from the QuickGO bulk download TSV API. Proteins are determined by the canonical accessions already in the DB; no external accession list is needed. Supports optional ECO ID → evidence code mapping, taxon filtering, and aspect filtering.
Payload fields¶
Field |
Default |
Description |
|---|---|---|
|
(required) |
UUID of the |
|
(required) |
Version label for |
|
EBI QuickGO |
Base URL for the QuickGO download endpoint. |
|
|
Explicit accession filter; |
|
|
Pull the accession filter from the |
|
|
URL to a GAF-ECO mapping file for evidence code resolution. |
|
|
Rows buffered per commit. |
|
|
HTTP stream timeout per QuickGO request, in seconds. |
|
|
Commit after each page. |
|
|
Row cap (for testing). |
|
|
Accessions per QuickGO API request when using |
compute_embeddings¶
Operation name: compute_embeddings; queue: protea.embeddings
(coordinator, serialised; one at a time via RetryLaterError if GPU busy)
Coordinator operation: determines which sequences need embeddings and fans
out ComputeEmbeddingsBatchOperation messages to protea.embeddings.batch.
Does not run GPU inference directly; returns OperationResult(deferred=True)
immediately after publishing batch messages.
Payload fields¶
Field |
Default |
Description |
|---|---|---|
|
(required) |
UUID of the |
|
|
List of UniProt accessions to embed; |
|
|
UUID of a |
|
|
Sequences per batch message. Tune to GPU memory. |
|
|
Device for batch workers ( |
|
|
Skip sequences that already have an embedding for this config. |
|
|
Model forward-pass batch size inside each batch worker. Default of
|
Execution flow¶
1. resolve embedding config; raise RetryLaterError(delay=60s) if another
embedding job is RUNNING (GPU exclusive lock)
2. resolve sequence IDs from accessions or query_set_id
3. if skip_existing: filter out sequence IDs that already have embeddings
4. partition sequence IDs into batches of sequences_per_job
5. publish N ComputeEmbeddingsBatch messages to protea.embeddings.batch
6. publish N StoreEmbeddings slots to protea.embeddings.write
7. update Job.progress_total = N
8. return OperationResult(deferred=True, result={"batches": N})
Batch and write workers¶
The compute_embeddings_batch (OperationConsumer on protea.embeddings.batch)
runs GPU inference per batch and publishes float32 vectors to protea.embeddings.write.
The store_embeddings (OperationConsumer on protea.embeddings.write) bulk-inserts
SequenceEmbedding rows and atomically increments Job.progress_current,
closing the parent job when all batches are done.
predict_go_terms¶
Operation name: predict_go_terms; queue: protea.predictions
(coordinator; fans out KNN batch workers on protea.predictions.batch)
Coordinator operation: loads reference embeddings into a process-level cache,
partitions query proteins into batches, and fans out PredictGOTermsBatch
messages to protea.predictions.batch. Returns OperationResult(deferred=True).
Payload fields¶
Field |
Default |
Description |
|---|---|---|
|
(required) |
UUID of the |
|
(required) |
UUID of the |
|
(required) |
UUID of the |
|
|
List of query protein accessions; |
|
|
UUID of a |
|
|
Maximum GO predictions per query protein. |
|
|
Discard neighbors beyond this distance; |
|
|
Query proteins per batch message. |
|
|
KNN backend: |
|
|
Distance metric ( |
|
|
FAISS index type: |
|
|
Number of Voronoi cells when |
|
|
Number of Voronoi cells visited per query when
|
|
|
HNSW graph degree (edges per node) when
|
|
|
HNSW search-time queue size when |
|
|
Compute NW + SW pairwise alignments (parasail) for each prediction. |
|
|
Compute taxonomic distance (ete3 NCBITaxa) for each prediction. |
|
|
Compute 5 aggregate re-ranker features per prediction: |
|
|
Compute the 25-column |
|
|
Synthesise the |
|
|
Build three separate KNN indices, one per GO aspect (BPO/MFO/CCO), so every query receives candidates in every aspect even when its nearest neighbours in a unified index happen to be annotated only in one or two. A common cause of BPO recall ceilings. |
|
|
Optional UUID of a registered |
Reranker scoring and schema-SHA guard¶
As of T2B.4 the reranker scoring path lives in a dedicated compositive
class, RerankerScorer,
rather than in the former _RerankerMixin hierarchy.
PredictGOTermsBatchOperation receives a RerankerScorer instance
via its constructor (default: one instance per operation) and delegates
the full score-and-emit cycle to it. The legacy
protea.core.operations.predict_go_terms._batch_op_reranker module
is retained as a backward-compat shim that re-exports load_reranker,
apply_reranker, and friends so existing tests patching the legacy
module path continue to work.
When reranker_model_id is set, RerankerScorer.score is invoked
after prediction_dicts has been built. The flow is:
Compute
live_sha = compute_feature_schema_sha(active_families)from the livecompute_alignments/compute_taxonomy/compute_v6_featuresflags.If
live_sha != reranker_feature_schema_sha, the scorer emitsreranker.schema_mismatch(levelerror) and raisesSchemaShaMismatchError(FARM-EXP.5 guard). The base worker catches this as an operation failure, transitions the batch job toFAILEDwitherror_code="SchemaShaMismatchError", and does not continue with KNN ordering. Silent prediction with a stale booster layout is the exact failure mode this guard exists to prevent.On match, the scorer attaches the GOTerm aspect to each dict, calls
protea.core.reranker.apply_reranker(), and writes thereranker_scorefield into every prediction dict in memory.
reranker_score is in-memory only: GOPrediction does not yet
have a column for it, so the score is surfaced through the
predict_go_terms_batch.done event (nested reranker block with
applied/skipped/mean_score/etc.) but is not persisted.
Note
The lab contract is imported lazily in the batch worker. If the
production image ships without protea_reranker_lab installed the
worker emits reranker.skipped with reason=contracts_unavailable
and proceeds with KNN ordering.
Reference cache¶
Reference embeddings are loaded once per (embedding_config_id,
annotation_set_id) pair and stored as a process-level float16 numpy
array (max 1 entry, LRU-evicted on config change). This avoids reloading
hundreds of thousands of vectors for every batch.
Batch and write workers¶
The predict_go_terms_batch (OperationConsumer on protea.predictions.batch)
runs KNN search and GO transfer per batch. The store_predictions
(OperationConsumer on protea.predictions.write) bulk-inserts
GOPrediction rows and atomically increments Job.progress_current.
ping¶
Operation name: ping; queue: protea.ping
Smoke-test operation. Accepts no required payload fields, emits a single
ping.pong event, and returns immediately. Used to verify end-to-end
connectivity of the job queue without touching the protein data tables.
generate_evaluation_set¶
Operation name: generate_evaluation_set; queue: protea.jobs
Computes the CAFA-style evaluation delta between two AnnotationSet rows
(old → new) and stores an EvaluationSet row with summary statistics.
Follows the CAFA5 evaluation protocol exactly.
Payload fields¶
Field |
Default |
Description |
|---|---|---|
|
(required) |
UUID of the older annotation set (t0, used as reference). |
|
(required) |
UUID of the newer annotation set (t1, ground truth source). |
|
|
Optional UUID of an |
If pivot_ontology_snapshot_id is not set, both annotation sets must
share the same ontology_snapshot_id; the operation raises
ValueError otherwise.
Execution flow¶
0. validate payload; load AnnotationSet rows; assert same ontology snapshot
1. idempotency check: if an EvaluationSet already exists for the
(old, new) pair, return its summary immediately (no recompute).
The DB-level UNIQUE constraint (alembic b8e3f1a7c2d9) enforces the
same invariant at the schema layer; the short-circuit avoids paying
the delta compute cost on re-submission.
2. call compute_evaluation_data(); see protea.core.evaluation:
a. load GO DAG children map (is_a / part_of only)
b. build NOT-propagation exclusion set (negated terms + GO descendants)
c. load experimental annotations for old and new sets, filtered by
exclusion set and evidence codes (EXP, IDA, IMP, IGI, IEP, IPI, …)
d. classify per (protein, namespace) into NK / LK / PK:
- NK: protein had NO annotations in any namespace at t0
- LK: protein had annotations in some namespaces but not in namespace S;
novel terms in S are LK ground truth
- PK: protein had annotations in namespace S at t0 and gained new terms
in S; novel terms are PK ground truth, old terms go to pk_known
3. emit stats (delta_proteins, nk/lk/pk counts and annotations)
4. INSERT EvaluationSet row with stats stored in JSONB
5. return OperationResult(result={evaluation_set_id, ...stats})
NK/LK/PK classification note¶
Classification is per (protein, namespace), not globally per protein. A protein that had MFO and BPO annotations at t0 and gains new CCO and BPO annotations at t1 will be LK in CCO (no prior CCO annotations) and PK in BPO (had prior BPO annotations) simultaneously. This matches the CAFA5 evaluation protocol.
Ground truth files are computed on-demand by the download endpoints (not
stored in the DB) using the same compute_evaluation_data logic.
run_cafa_evaluation¶
Operation name: run_cafa_evaluation; queue: protea.evaluations
Runs the cafaeval evaluator against NK, LK, and PK ground-truth settings
for a given EvaluationSet × PredictionSet pair. Persists an
EvaluationResult row with per-namespace Fmax, precision, recall, τ, and
coverage for each setting.
Payload fields¶
Field |
Default |
Description |
|---|---|---|
|
(required) |
UUID of the |
|
(required) |
UUID of the |
|
|
Discard predictions with cosine distance above this threshold before
scoring (range 0 – 2). |
|
|
UUID of a |
|
|
UUID of a |
|
|
Same as |
|
|
Same as |
|
|
Nested mapping of category → aspect → reranker_model_id, e.g.
|
|
|
Path to an Information Accretion (IA) TSV file (two columns:
|
|
|
Standard CAFA practice: drop ground-truth proteins not present in
the |
Execution flow¶
1. load EvaluationSet + PredictionSet + OntologySnapshot from DB
2. compute_evaluation_data(): same delta as generate_evaluation_set
3. download OBO file from snapshot.obo_url to a temp directory
4. write ground-truth TSVs: gt_NK.tsv, gt_LK.tsv, gt_PK.tsv,
known_terms.tsv, pk_known_terms.tsv
5. write predictions in CAFA format (protein \\t go_id \\t score),
score = max(0, 1 - distance), filtered to delta proteins only
6. session.commit(): releases DB connection before cafaeval forks workers
7. for each setting in (NK, LK, PK):
a. NK pass: cafa_eval(obo, predictions/, gt_NK.tsv)
b. LK pass: cafa_eval(obo, predictions/, gt_LK.tsv)
c. PK pass: cafa_eval(obo, predictions/, gt_PK.tsv, exclude=pk_known_terms.tsv)
d. parse per-namespace Fmax / precision / recall / τ / coverage
e. write cafaeval outputs (PR curves, metric TSVs) into a persistent
staging dir under ``settings.storage.artifacts_dir`` and upload them
through the configured ``ArtifactStore``
8. INSERT EvaluationResult row with results dict (NK → {BPO, MFO, CCO})
9. return OperationResult(result={evaluation_result_id, results})
PK known-terms¶
For the PK pass, pk_known_terms.tsv is passed to cafa_eval as the
-known (exclude) argument. This file contains all experimental
annotations from the old snapshot for PK proteins in the relevant namespace.
cafaeval uses this to exclude known annotations from scoring; methods
that simply repeat prior annotations receive no credit for PK predictions.
SIGTERM handling¶
cafaeval uses Python multiprocessing.Pool internally. Before each
cafa_eval() call, the operation temporarily resets SIGTERM and
SIGINT handlers to defaults so that forked pool workers can be terminated
cleanly. The original handlers are restored afterwards.
train_reranker¶
Deprecated since version contract-first-lab-integration: LightGBM training has been decoupled from PROTEA and now lives in
protea-reranker-lab.
TrainRerankerOperation and TrainRerankerAutoOperation are
no longer registered in the OperationRegistry; a POST /jobs
request with operation_name: train_reranker or train_reranker_auto
will be rejected. The classes remain importable as internal helpers
that export_research_dataset reuses in-process to run
the shared KNN + feature-generation pipeline in dump_only mode.
The narrative below (feature matrix, LightGBM hyperparameters,
validation metrics) is retained as the canonical specification of
the booster that the lab trains; the lab consumes the exact same
feature schema PROTEA produces. For the live end-to-end flow, see
export_research_dataset below and the
/reranker-models/import HTTP surface.
Operation name: train_reranker (internal helper, not queued)
/jobs. See
export_research_dataset and the POST /datasets
endpoint; booster training runs in protea-reranker-lab.RerankerModel (written by /reranker-models/import, not by this helper directly).Trains a LightGBM binary classifier that re-scores GO term predictions on top
of the embedding-based KNN baseline. The input is a
(PredictionSet, EvaluationSet) pair: predictions supply the feature matrix
(stored in GOPrediction columns populated at prediction time through
compute_alignments=True and compute_taxonomy=True), the evaluation set
supplies the ground-truth delta used to derive binary labels.
The design follows the principle of keeping the training signal interpretable: the re-ranker is not an end-to-end deep network over raw embeddings, but a gradient-boosted decision tree over hand-engineered features that each have a well-known meaning (alignment identity, taxonomic distance, neighbour agreement). Feature importance can therefore be inspected directly and used to drive the iterative training loop documented in Results.
Payload fields¶
Field |
Default |
Description |
|---|---|---|
|
(required) |
UUID of the |
|
(required) |
UUID of the |
Feature matrix¶
The feature set is defined statically in
protea.core.reranker.NUMERIC_FEATURES and
protea.core.reranker.CATEGORICAL_FEATURES. It comprises 20 numeric
and 3 categorical columns, grouped by origin:
Feature |
Group |
Description |
|---|---|---|
|
KNN |
Cosine or L2 distance from query to nearest-neighbour reference. Lower = closer in embedding space. |
|
Alignment |
Needleman–Wunsch global identity percentage (parasail, BLOSUM62). |
|
Alignment |
NW similarity percentage (identical + positive-scoring substitutions). |
|
Alignment |
Raw NW alignment score. |
|
Alignment |
Fraction of aligned positions that are gaps in the NW alignment. |
|
Alignment |
Length of the NW alignment (number of columns). |
|
Alignment |
Smith–Waterman local identity percentage. |
|
Alignment |
SW similarity percentage. |
|
Alignment |
Raw SW alignment score. |
|
Alignment |
Gap fraction in the SW alignment. |
|
Alignment |
Length of the SW alignment. |
|
Length |
Residue count of the query protein sequence. |
|
Length |
Residue count of the matched reference sequence. |
|
Taxonomy |
Distance between query and reference in the NCBI taxonomy tree
(ete3 |
|
Taxonomy |
Number of common ancestors shared between query and reference taxa. |
|
Aggregate |
Number of top- |
|
Aggregate |
Rank of the first neighbour supporting the term (1 = top hit). |
|
Aggregate |
Frequency of the GO term in the entire reference set (prior). |
|
Aggregate |
Average number of GO annotations per reference protein in the batch. |
|
Aggregate |
Standard deviation of KNN distances for the query (low = tight cluster, high = uncertain). |
Feature |
Description |
|---|---|
|
GO annotation qualifier propagated from the reference
(e.g. |
|
GO evidence code of the reference annotation (EXP, IDA, IMP, …, or electronic IEA). Acts as a prior on annotation reliability. |
|
Coarse taxonomic relation between query and reference
( |
All three categorical features are converted to pandas category dtype in
protea.core.reranker.prepare_dataset(); LightGBM consumes them directly,
so no manual label encoding is required. Missing values in numeric columns
are left as NaN; LightGBM handles them natively by learning an optimal
direction at each split.
Training protocol¶
The training loop (now in protea-reranker-lab) proceeds as follows:
Label derivation. Each row of the training DataFrame carries a binary
labelcolumn:1if the predicted(protein, go_term)pair is present in the NK ∪ LK ∪ PK ground truth of the associatedEvaluationSet,0otherwise. The label is computed upstream by the/scoring/prediction-sets/{id}/training-data.tsvendpoint.Stratified split. Positives and negatives are split independently into train and validation fractions (default
val_fraction = 0.2) so that the positive rate is preserved on both sides. Shuffling uses a fixed seed (default42) for reproducibility.Optional class-balance control. Because the raw training set is extremely imbalanced (positive rate typically ≤ 1 %), the caller can pass
neg_pos_ratioto subsample negatives independently on the train and validation splits. The default isNone(keep all negatives); the results chapter documents the ratios used for each re-ranker iteration.Optional per-sample weighting. When
sample_weightis provided (for example, the Information Accretion of each GO term), the weights are attached to both the training and validation datasets so that high-weight samples contribute more to the LightGBM loss. This lets the re-ranker be trained directly against the IA-weighted Fmax metric used in CAFA Evaluation Protocol.LightGBM training. The Booster is trained with the default hyperparameters below, merged with any overrides passed in the
paramsargument. Early stopping is enabled on the validation split.
Parameter |
Default |
Notes |
|---|---|---|
|
|
Binary classification. |
|
|
Both are reported; early stopping uses the first metric. |
|
|
Standard gradient-boosted decision trees. |
|
|
LightGBM default. |
|
|
Low LR + early stopping; see |
|
|
Column sub-sampling per tree (reduces overfitting). |
|
|
Row sub-sampling per boosting round. |
|
|
Apply bagging every 5 iterations. |
|
|
Fixed for reproducibility. |
|
|
Maximum boosting iterations. |
|
|
Stop if validation logloss does not improve for 50 rounds. |
Validation metrics¶
After training, the validation set is scored and the following metrics are
written to RerankerModel.metrics as a JSONB dict:
best_iteration: boosting round at which early stopping triggered.val_auc: ROC-AUC on the validation split.val_logloss: binary logloss at the best iteration.val_precision/val_recall/val_f1: computed at thep ≥ 0.5decision threshold. These are reported for quick inspection but are not the primary objective: the re-ranker is ultimately scored through the full CAFA evaluation pipeline (CAFA Evaluation Protocol), which uses IA-weighted Fmax per(category, namespace)cell.train_samples/val_samples: row counts after any negative subsampling.positive_rate: fraction of rows withlabel == 1before any subsampling.
In addition, the training step returns a
gain-based feature-importance dictionary (feature_name → total gain),
which is persisted in RerankerModel.feature_importance. This is the
signal used in Results to drive the three documented
iterations of the re-ranker.
Note
Cross-validation is not performed inside the training operation.
The temporal-holdout re-ranker design uses 13 historical GOA splits
(releases 160 through 220) as independent training folds; that
cross-validation loop now lives in
protea-reranker-lab,
which consumes the frozen parquet datasets PROTEA publishes via
export_research_dataset (see
export_research_dataset). The lab fits one booster
per split and emits the run directory consumed by
scripts/register_reranker.py. See Results for the
aggregate numbers.
train_reranker_auto¶
Operation name: train_reranker_auto (internal helper, not queued)
Note
Formerly a convenience variant that auto-selected the most recent
PredictionSet and EvaluationSet and forwarded them to
train_reranker. Like train_reranker it is no longer
registered in the OperationRegistry. The class survives only
as the in-process engine that ExportResearchDatasetOperation
drives in dump_only mode to produce the frozen parquet triple
consumed by the lab.
export_research_dataset¶
Operation name: export_research_dataset; queue: protea.training
PredictionSet / feature data. Writes blobs via the configured ArtifactStore (no ORM writes).Runs the same KNN + feature-generation pipeline as train_reranker_auto
but skips LightGBM training entirely and publishes the resulting
train.parquet / eval.parquet / manifest.json triple through the
configured ArtifactStore. The
produced dataset is the canonical input consumed by
protea-reranker-lab for offline re-ranker training.
The manifest embeds schema_version=2, the PROTEA
producer_version (protea.__version__) and producer_git_sha so
any lab run can be traced back to the exact PROTEA HEAD that produced
its training data.
Payload fields¶
Field |
Default |
Description |
|---|---|---|
|
(required) |
UUID of the |
|
(required) |
UUID of the |
|
(required) |
List of historical annotation-set versions forming the training snapshot pairs (must contain at least 2 entries; sorted ascending). |
|
(required) |
List of annotation-set versions reserved for evaluation (non-empty; sorted ascending). |
|
|
Annotation source tag recorded in the manifest. |
|
(required) |
Human label for the published dataset. Used as |
|
|
KNN neighbour count. |
|
|
KNN backend ( |
|
|
Include NW alignment and length feature families. |
|
|
Include the taxonomy-pair feature family. |
|
|
Propagate neighbour votes up the GO DAG before feature materialisation. |
|
|
Include embedding PCA and |
Execution flow¶
1. validate payload (Pydantic 2.x, strict mode)
2. load Settings; resolve ArtifactStore via get_artifact_store(settings)
3. run train_reranker_auto in dump-only mode against a temp dir
(per_cell training_scope, no booster trained)
4. for each of train.parquet / eval.parquet / manifest.json:
store.put("datasets/<output_name>/<file>", path)
5. emit export_research_dataset.published {backend, key_prefix, files}
6. return OperationResult(result={
output_name, key_prefix, storage_backend,
train_uri, eval_uri, manifest_uri, ...auto_result
})
Emitted events¶
The operation relays the underlying train_reranker_auto events
under its own namespace (train_reranker_auto.* →
export_research_dataset.*) and additionally emits:
export_research_dataset.started: payload echoed back with resolved backend.export_research_dataset.rows_written: n_train_rows / n_eval_rows per shard once the parquets are materialised.export_research_dataset.published:{backend, key_prefix, files}when the three artefacts have been uploaded successfully.export_research_dataset.completed: final summary including the three resulting URIs (file://…ors3://bucket/key).
InterPro annotation pipeline (IP.2/3/4)
Three operations form the InterPro-based functional-enrichment stage.
They run sequentially: first load the InterPro2GO mapping (IP.4a), then
annotate proteins with InterProScan (IP.2/3), then propagate GO terms
(IP.4b). All three are registered in the OperationRegistry and run on
the protea.jobs queue. The only prerequisite outside PROTEA is an
InterProScan binary install on the host (see the binary_path payload
field below).
load_interpro_go_mapping¶
Operation name: load_interpro_go_mapping; queue: protea.jobs
interpro_go_mapping (upsert).Downloads the EBI interpro2go flat file and upserts
(ipr_accession, go_id, source_version) rows into the
interpro_go_mapping table. The operation is idempotent: the unique
constraint uq_interpro_go_mapping_pair on (ipr_accession, go_id,
source_version) means that re-running for the same release is a no-op.
Field |
Default |
Description |
|---|---|---|
|
(required) |
InterPro release tag stored with every row, e.g.
|
|
EBI FTP endpoint |
URL of the |
|
|
Evidence code stored with each mapping row. |
|
|
HTTP request timeout in seconds. |
|
|
Upsert batch size. |
The result dict reports rows_parsed, rows_inserted, and
rows_skipped_duplicate.
run_interproscan_batch¶
Operation name: run_interproscan_batch; queue: protea.jobs
interpro_annotation (insert).Annotates proteins in fixed-size chunks via an InterProScan subprocess
(IP.3). Inputs are resolved from either a QuerySet or an explicit
accession list. An optional ipr_release_floor makes the run resumable:
proteins whose latest persisted annotation already meets the floor are
skipped, so re-dispatching the same payload picks up where the prior run
left off.
Field |
Default |
Description |
|---|---|---|
|
|
UUID of a |
|
|
Explicit list of UniProt accessions. Exactly one of
|
|
|
Lexicographic floor on |
|
|
Proteins per |
|
|
Per-chunk subprocess timeout in seconds. |
|
|
Full path to the |
|
|
Additional arguments forwarded to |
|
|
Commit after each chunk so partial progress is durable. |
The result dict reports chunks_done, proteins_processed,
proteins_skipped, annotations_inserted, and
ipr_releases_seen. Progress events (run_interproscan_batch.chunk_done)
are emitted after each chunk.
predict_go_terms_from_interpro¶
Operation name: predict_go_terms_from_interpro; queue: protea.jobs
PredictionSet row and GOPrediction rows.A parallel-signal GO term predictor (IP.4b). Given a set of query proteins
whose InterProAnnotation rows are already in the DB, the operation:
Joins each protein’s distinct InterPro accessions against
interpro_go_mappingat the requestedsource_version.Resolves GO IDs against the target
OntologySnapshot.Aggregates per-protein votes: N distinct InterPro entries on the same protein mapping to the same GO term yield a vote count of N, with
distance = 1.0 / N(matching the KNN convention).Persists the result as a new
PredictionSettaggedmeta["algorithm"] = "interpro_propagation".
The operation is single-stage (no batch fan-out): the three-table join is a cheap index scan and the per-protein InterPro hit count is O(10-100).
Field |
Default |
Description |
|---|---|---|
|
(required) |
UUID of the |
|
(required) |
UUID of the |
|
(required) |
UUID of the |
|
(required) |
InterPro2GO mapping release tag (must match a previously loaded
|
|
|
UUID of a |
|
|
Explicit list of UniProt accessions. |
|
|
If set, only |
|
|
Accession batch size for the three-table join. |
The result dict reports prediction_set_id, proteins_with_ipr,
proteins_with_predictions, candidate_pairs, and
predictions_inserted.
Ephemeral consumer operations¶
Four operations run as OperationConsumer workers on the high-throughput
batch queues. They differ from the operations documented above in three ways:
No ``Job`` row. The coordinator publishes payloads directly to the queue; the consumer executes them without creating a child
Jobrow in the DB. Progress is tracked by atomically incrementingJob.progress_currenton the parent job.No HTTP endpoint. These operations cannot be submitted through
POST /jobs. They are only reachable through the fan-out logic of the coordinator operationscompute_embeddingsandpredict_go_terms.Payload is the full message. Because there is no
Jobrow to read from, the fullProteaPayloadis serialised into the AMQP body.
They still implement the Operation protocol and are registered alongside
the other eleven in protea/core/operation_catalog.py. Bringing the
total to 18 registered operations (14 job-backed + 4 ephemeral).
compute_embeddings_batch¶
Queue: protea.embeddings.batch; consumer: OperationConsumer
Runs the protein language model forward pass for one batch of sequences.
Loads the model on first use, caches it at process level, performs GPU
inference (ESM-2, ESM-C, ProtT5/ProstT5, or Ankh base/large), pools
residue-level representations according to the EmbeddingConfig strategy,
converts to float32, and publishes a StoreEmbeddings message to
protea.embeddings.write.
Backends are selected via EmbeddingConfig.model_backend:
esm: ESM-2 family. Dispatched through theprotea-backendsplugin (protea.backendsentry_points group, T2A.1). The plugin callsplugin.embed_chunks; the legacy_embed_esmshim is used only as a fallback when the installed plugin predates T2A.1.esm3c: ESM SDKESMC(ESM3c family); FP16 on GPU, no external tokenizer.t5: HuggingFaceT5EncoderModel(ProstT5,prot_t5_xl_uniref50…); the<AA2fold>prefix is auto-injected when the model name containsprostt5. Runs through the local_embed_t5shim (plugin migration pending T2A.2).ankh: HuggingFaceT5EncoderModelloaded viaAutoTokenizer(ElnaggarLab/ankh-base,ElnaggarLab/ankh-large). Shares the batched T5 pipeline but with two mandatory deviations from the ProstT5 path, both verified end-to-end with real weights on 2026-04-10:bfloat16 on CUDA, never float16. Ankh was pre-trained on TPU in bfloat16 and its T5 LayerNorm overflows to
NaNunder FP16 on every forward pass. The loader pinstorch_dtype=torch.bfloat16on GPU andtorch.float32on CPU.Char-level tokenisation with
is_split_into_words=True. Ankh’s SentencePiece vocabulary treats literal spaces as<unk>, so the ProstT5-style" ".join(seq)path emits ~50 %<unk>tokens and destroys the embedding._embed_ankhinstead passes[list(seq) for seq in batch]withis_split_into_words=True(verified 0<unk>). The<AA2fold>prefix is never injected.
auto: Dispatched through theprotea-backendsplugin (same path asesm, T2A.1). Falls back to the_embed_esmshim when the plugin is absent.
Residue-tensor convention¶
Every backend strips all special tokens before pooling and chunking, so
the residue tensor has exactly N rows where N is the amino-acid
length of the input sequence. This makes chunk_index_s and
chunk_index_e mean the same thing on every backend: indices into the
amino-acid sequence, not into a backend-specific token tensor.
Backend |
Tokens stripped from residue slice |
residue_len |
|---|---|---|
|
CLS (position 0) + EOS (last position) |
|
|
BOS (position 0) + EOS (last position) |
|
|
EOS (last position) |
|
|
|
|
|
EOS (last position) |
|
This convention was unified on 2026-04-10. Before that date, the T5 family
kept the trailing EOS and ProstT5 additionally kept the leading
<AA2fold> prefix token, so chunk indices on those backends were offset
by 0–2 positions relative to the amino-acid sequence. Embeddings computed
under the old convention are not directly comparable with new ones and
must be recomputed.
A single GPU worker is typically sufficient because each batch already
saturates the device; the coordinator serialises dispatch with
RetryLaterError(delay=60s) when another embedding job is RUNNING.
store_embeddings¶
Queue: protea.embeddings.write; consumer: OperationConsumer
Bulk-inserts SequenceEmbedding rows (one per chunk) into the pgvector
VECTOR column, using PostgreSQL INSERT ... ON CONFLICT DO NOTHING to
tolerate re-runs. After each successful insert it atomically increments
Job.progress_current on the parent compute_embeddings job and closes
the job when progress_current == progress_total.
This stage is deliberately decoupled from GPU inference so that slow disk writes never block the GPU, and so the write worker can be scaled independently.
predict_go_terms_batch¶
Queue: protea.predictions.batch; consumer: OperationConsumer
Runs the KNN + GO transfer pipeline for one batch of query proteins. As
of F2C.5 the heavy lifting is delegated to
protea_method.pipeline.predict() (protea-method >= 0.2.0); the
in-tree adapter protea.core.operations._predict_go_terms_adapter
re-derives PROTEA-only fields from the PredictDiagnostics the
unified path returns.
Loads query embeddings from the DB.
Resolves the reference cache (process-level float16 array; loaded from disk at
data/ref_cache/<embedding_config>__<annotation_set>_*.npyif present, otherwise streamed from PostgreSQL in chunks of 2 000 rows and persisted).Calls
protea_method.pipeline.predict()withreturn_diagnostics=True: per-aspect KNN search via numpy or FAISS (Flat/IVFFlat/HNSW) using the configured metric (cosineorl2), neighbour vote accumulation, and optional ancestor expansion live in the pure inference library.The PROTEA adapter walks
neighbors_by_aspectandgo_map_by_aspectto attachprediction_set_id,ref_protein_accession,qualifier,evidence_codeplus the legacy re-ranker aggregates (k_position,go_term_frequency,ref_annotation_density,neighbor_distance_std,neighbor_vote_fraction). Optional alignment features (NW/SW viaparasail) and taxonomic features (ete3NCBITaxa) are computed on top, plus thev6_featuresshim (enrich_v6_features) whencompute_v6_features=true.Publishes a
StorePredictionsmessage toprotea.predictions.write.
GPU is not required: KNN search runs on CPU unless a FAISS GPU index is
configured at process startup. The re-ranker (booster) is applied
after ancestor expansion by the injected
RerankerScorer
collaborator (T2B.4), not inside protea_method.pipeline.predict,
so the historical scoring order remains bit-exact.
store_predictions¶
Queue: protea.predictions.write; consumer: OperationConsumer
Bulk-inserts GOPrediction rows (one row per predicted (protein, go_term)
pair) and atomically increments Job.progress_current on the parent
predict_go_terms job. Uses INSERT ... ON CONFLICT DO NOTHING so that
re-delivered messages are a no-op.
All feature-engineering columns declared in GOPrediction
(identity_nw, similarity_sw, taxonomic_distance, etc.) are written
in the same insert when they are present in the payload.
GPU torch installation
PROTEA’s pyproject.toml pins torch to the pytorch-cpu PyPI source so
that CI runners and the slim production Docker image do not pull the ~6 GB
NVIDIA / triton wheel stack. Hosts that run GPU inference
(compute_embeddings_batch workers, or the torch KNN backend in the
export pipeline) must override torch with a CUDA-capable wheel after every
poetry install or poetry update.
# Default: cu128 (NVIDIA driver 570+)
bash scripts/install_gpu_torch.sh
# Override for older drivers
CUDA_VARIANT=cu121 bash scripts/install_gpu_torch.sh
CUDA_VARIANT=cu118 bash scripts/install_gpu_torch.sh
# Point at an explicit venv (CI, smoke tests)
VENV_PATH=/tmp/torch-smoke bash scripts/install_gpu_torch.sh
The CUDA_VARIANT default is cu128 as of PR #564. The script
validates compatibility against the protea-method torch-GPU KNN backend;
cu128 is tested against NVIDIA driver 570 and 580 series.
Re-running after poetry sync
Both poetry install (sync mode) and poetry update reinstall the CPU torch
wheel from pytorch-cpu, silently downgrading the CUDA build. You must
re-run scripts/install_gpu_torch.sh after every sync to restore GPU
capability. A quick self-check is included in the script output:
>>> import torch self-check:
torch 2.x.y+cu128 cuda_available=True
If cuda_available=False after the script, the driver is too old for
the selected variant or the CUDA runtime is not found. Try an older
variant (cu121) or check that nvidia-smi shows a live device.
Why runtime deps are included
Torch on Linux ships its CUDA runtime through the nvidia-* PyPI
packages (nvidia-cudnn-cu12, nvidia-cublas-cu12, etc.). Skipping
dependency resolution (e.g. via pip install with the no-deps flag)
leaves those absent and import torch fails at
load time with a missing libcudnn.so error.
The script lets pip resolve the runtime deps automatically; any unrelated
version bumps (sympy, networkx, MarkupSafe) stay inside the permissive
ranges declared in pyproject.toml.
Registering a new operation¶
Create a module under
protea/core/operations/.Define a payload class extending
ProteaPayload.Implement the
Operationprotocol (nameattribute +execute).Register the instance in the worker startup script:
registry.register(MyOperation())Route jobs to the correct queue by setting
queue_namein thePOST /jobsrequest body.
No changes to BaseWorker, the API, or the infrastructure layer are required.