How-to Guides¶

Audience and scope

Read this if: you have one specific task to accomplish (load an ontology, upload a FASTA, train a re-ranker, scale a worker) and you want the shortest path from a clean stack to a finished job.

Read Reproduction guide instead if: you want to regenerate every figure and table in Results end-to-end.

Submit a job via the API ¶

Every job requires operation, queue_name, and an optional payload dict. The payload must match the fields expected by the target operation’s ProteaPayload subclass.

Example: insert Swiss-Prot human proteins.

curl -s -X POST http://127.0.0.1:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "operation": "insert_proteins",
    "queue_name": "protea.jobs",
    "payload": {
      "search_criteria": "reviewed:true AND organism_id:9606",
      "page_size": 500,
      "include_isoforms": true
    }
  }' | python -m json.tool

The response contains the job UUID:

{"id": "3fa85f64-5717-4562-b3fc-2c963f66afa6", "status": "queued"}

Fetch UniProt metadata for existing proteins ¶

Run fetch_uniprot_metadata with the same query used during ingestion. The operation uses canonical_accession as the upsert key so it is safe to re-run at any time.

curl -s -X POST http://127.0.0.1:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "operation": "fetch_uniprot_metadata",
    "queue_name": "protea.jobs",
    "payload": {
      "search_criteria": "reviewed:true AND organism_id:9606",
      "page_size": 200,
      "commit_every_page": true,
      "update_protein_core": true
    }
  }'

Monitor job progress ¶

Poll the job status endpoint:

curl -s http://127.0.0.1:8000/jobs/<job-id> | python -m json.tool

Stream the event timeline:

curl -s http://127.0.0.1:8000/jobs/<job-id>/events | python -m json.tool

The frontend at http://127.0.0.1:3000 auto-refreshes every 2 seconds while a job is active and renders the event timeline in chronological order.

Cancel a queued job ¶

curl -s -X POST http://127.0.0.1:8000/jobs/<job-id>/cancel

Jobs in terminal states (SUCCEEDED, FAILED) are unaffected; the endpoint is a no-op. Cancelling a RUNNING job marks the DB row as CANCELLED but does not interrupt the worker process (soft cancel).

Run a single worker manually ¶

Useful for debugging a specific queue without the full manage.sh stack:

poetry run python scripts/worker.py --queue protea.jobs

Add a new operation ¶

Create protea/core/operations/my_operation.py:

from protea.core.contracts.operation import (
    EmitFn, Operation, OperationResult, ProteaPayload
)
from sqlalchemy.orm import Session
from typing import Any

class MyPayload(ProteaPayload, frozen=True):
    some_param: str

class MyOperation(Operation):
    name = "my_operation"

    def execute(
        self, session: Session, payload: dict[str, Any], *, emit: EmitFn
    ) -> OperationResult:
        p = MyPayload.model_validate(payload)
        emit("my_operation.start", None, {"param": p.some_param}, "info")
        # ... domain logic ...
        return OperationResult(result={"done": True})

from protea.core.operations.my_operation import MyOperation
registry.register(MyOperation())

Route jobs to the appropriate queue (protea.jobs or a new dedicated queue).

No changes to BaseWorker, the FastAPI router, or the DB schema are needed.

Generate and apply a database migration ¶

After modifying an ORM model, generate an Alembic migration:

alembic revision --autogenerate -m "add my_column to protein"
alembic upgrade head

Always review auto-generated migrations before applying them to production. Alembic’s autogenerate detects column additions and removals but may miss index changes or server-default modifications.

Load a GO ontology snapshot ¶

Download and parse a GO OBO file release. The obo_version extracted from the file header is used as the unique key, so re-running with the same URL is safe (idempotent).

curl -s -X POST http://127.0.0.1:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "operation": "load_ontology_snapshot",
    "queue_name": "protea.jobs",
    "payload": {
      "obo_url": "https://purl.obolibrary.org/obo/go.obo"
    }
  }'

Load GOA annotations ¶

Load all UniProt-GOA annotations for a specific organism. Replace <snapshot-uuid> with the ontology_snapshot_id returned by the load_ontology_snapshot job.

curl -s -X POST http://127.0.0.1:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "operation": "load_goa_annotations",
    "queue_name": "protea.jobs",
    "payload": {
      "ontology_snapshot_id": "<snapshot-uuid>",
      "gaf_url": "https://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz",
      "source_version": "2024-01"
    }
  }'

Load QuickGO annotations ¶

Stream annotations from the QuickGO API for all proteins present in the DB:

curl -s -X POST http://127.0.0.1:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "operation": "load_quickgo_annotations",
    "queue_name": "protea.jobs",
    "payload": {
      "ontology_snapshot_id": "<snapshot-uuid>",
      "source_version": "quickgo-2024-01"
    }
  }'

Upload a custom FASTA query set ¶

Use the /query-sets endpoint to upload a FASTA file for custom predictions. The returned id is used as query_set_id in subsequent jobs.

curl -s -X POST http://127.0.0.1:8000/query-sets \
  -F "file=@my_proteins.fasta" \
  -F "name=My dataset" \
  -F "description=Custom proteins for GO prediction" | python -m json.tool

Compute sequence embeddings ¶

Compute ESM-2 embeddings for all proteins (or a specific query set). Replace <config-uuid> with the UUID of an EmbeddingConfig row.

# Embed all proteins in the DB
curl -s -X POST http://127.0.0.1:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "operation": "compute_embeddings",
    "queue_name": "protea.embeddings",
    "payload": {
      "embedding_config_id": "<config-uuid>",
      "device": "cuda",
      "skip_existing": true
    }
  }'

# Embed only a FASTA query set
curl -s -X POST http://127.0.0.1:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "operation": "compute_embeddings",
    "queue_name": "protea.embeddings",
    "payload": {
      "embedding_config_id": "<config-uuid>",
      "query_set_id": "<query-set-uuid>",
      "device": "cuda"
    }
  }'

The coordinator returns immediately (deferred=True). Progress is tracked on the parent job via progress_current / progress_total.

Predict GO terms ¶

Run KNN-based GO function transfer. All three UUID references must exist in the DB before submitting.

curl -s -X POST http://127.0.0.1:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "operation": "predict_go_terms",
    "queue_name": "protea.predictions",
    "payload": {
      "embedding_config_id": "<config-uuid>",
      "annotation_set_id": "<annotation-set-uuid>",
      "ontology_snapshot_id": "<snapshot-uuid>",
      "limit_per_entry": 5,
      "distance_threshold": 0.3,
      "search_backend": "numpy",
      "compute_alignments": true,
      "compute_taxonomy": false,
      "compute_reranker_features": true
    }
  }'

Generate an evaluation set (temporal holdout)¶

Create a CAFA-style evaluation delta between an old and new annotation set. Both must share the same ontology_snapshot_id.

curl -s -X POST http://127.0.0.1:8000/annotations/evaluation-sets/generate \
  -H "Content-Type: application/json" \
  -d '{
    "old_annotation_set_id": "<old-uuid>",
    "new_annotation_set_id": "<new-uuid>"
  }'

The job classifies proteins into NK (no-knowledge), LK (limited-knowledge), and PK (partial-knowledge) categories per namespace. Download ground-truth files via GET /annotations/evaluation-sets/{id}/ground-truth-{NK|LK|PK}.tsv.

Run a CAFA evaluation ¶

Evaluate a prediction set against an evaluation set using the cafaeval evaluator:

curl -s -X POST http://127.0.0.1:8000/annotations/evaluation-sets/<eval-id>/run \
  -H "Content-Type: application/json" \
  -d '{
    "prediction_set_id": "<prediction-set-uuid>"
  }'

Results include per-namespace Fmax, precision, recall, and coverage for NK, LK, and PK settings. Download metrics via GET /annotations/evaluation-sets/{id}/results/{rid}/metrics.tsv.

Train a re-ranker ¶

In-process re-ranker training was retired in F0 (T0.6): the train_reranker and train_reranker_auto operations are no longer registered. LightGBM training has moved to the sibling repo protea-reranker-lab, which consumes the frozen parquet dataset that PROTEA publishes via export_research_dataset. The four-step workflow is described in Register a reranker from protea-reranker-lab below.

Apply a trained re-ranker to new predictions via GET /scoring/prediction-sets/{id}/rerank.tsv?reranker_id=<uuid>.

Register a reranker from `protea-reranker-lab`¶

Re-rankers trained offline in protea-reranker-lab (separate repo, contract-first integration) are registered in PROTEA in four steps: export a frozen dataset, train in the lab, register the resulting run, and invoke predict_go_terms with the new reranker_model_id.

Step 1. Export the frozen dataset. Submit an export_research_dataset job. The operation generates train.parquet / eval.parquet / manifest.json and uploads them through the configured ArtifactStore under datasets/<output_name>/.

curl -s -X POST http://127.0.0.1:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "operation": "export_research_dataset",
    "queue_name": "protea.training",
    "payload": {
      "embedding_config_id": "<config-uuid>",
      "ontology_snapshot_id": "<snapshot-uuid>",
      "train_versions": [200, 210, 215],
      "test_versions": [220],
      "annotation_source": "goa",
      "output_name": "rkv8-full-aa-multisnap",
      "k": 5,
      "search_backend": "faiss",
      "compute_alignments": true,
      "compute_taxonomy": true,
      "use_embedding_pca": true
    }
  }'

The job emits export_research_dataset.published once the three files have been uploaded. The result dict contains train_uri, eval_uri and manifest_uri: file://… URIs with the local backend, s3://bucket/… with MinIO.

Step 2. Train in the lab. In a protea-reranker-lab checkout, point the lab’s spec at the dataset URI and run its training CLI:

cd ../protea-reranker-lab
poetry run python -m protea_reranker_lab.train \
  --manifest file:///abs/path/storage/artifacts/datasets/rkv8-full-aa-multisnap/manifest.json \
  --output-name rkv8-full-aa-multisnap

The lab writes runs/<name>/ containing run.json, spec.yaml and model.txt (the LightGBM booster).

Step 3. Register the run in PROTEA. scripts/register_reranker.py parses the run directory, uploads the booster to the configured ArtifactStore under rerankers/<run_id>/model.txt, computes feature_schema_sha via protea_reranker_lab.contracts.compute_feature_schema_sha, and inserts a RerankerModel row.

poetry run python scripts/register_reranker.py \
  --run-dir ../protea-reranker-lab/runs/rkv8-full-aa-multisnap

The script prints the new RerankerModel UUID to stdout. Use --prediction-set-id / --evaluation-set-id to back-link the row to existing DB artefacts, --name-override to pick a custom name, and --force to replace an existing row with the same name.

Step 4. Predict with the reranker. Reference the new model by UUID in the predict_go_terms payload:

curl -s -X POST http://127.0.0.1:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "operation": "predict_go_terms",
    "queue_name": "protea.jobs",
    "payload": {
      "embedding_config_id": "<config-uuid>",
      "annotation_set_id": "<annotation-set-uuid>",
      "ontology_snapshot_id": "<snapshot-uuid>",
      "limit_per_entry": 5,
      "compute_alignments": true,
      "compute_taxonomy": true,
      "compute_v6_features": true,
      "reranker_model_id": "<reranker-model-uuid>"
    }
  }'

The coordinator validates that the row has both artifact_uri and feature_schema_sha set, emits predict_go_terms.reranker_bound, and snapshots both fields into every batch payload. Each batch worker computes a live feature-schema sha from its active feature flags and applies the booster only when shas match exactly; on mismatch it emits reranker.schema_mismatch and falls back to KNN distance ordering without crashing.

Note

reranker_score is currently surfaced in-memory only and exposed through the predict_go_terms_batch.done event; GOPrediction has no column for it yet. Persistence is tracked for a future schema change.

Use one-click annotation ¶

The /annotate endpoint accepts a FASTA file and automatically selects the best available embedding config, annotation set, and ontology snapshot:

curl -s -X POST http://127.0.0.1:8000/annotate \
  -F "file=@my_proteins.fasta" \
  -F "name=Quick annotation" | python -m json.tool

The response includes all IDs needed to monitor the embedding job and chain the prediction step. The frontend uses this endpoint to power the one-click annotation wizard.

Score predictions with a ScoringConfig ¶

Create a scoring config and apply it to a prediction set:

# Create scoring config. ``weights`` keys must come from
# ``DEFAULT_WEIGHTS``: ``embedding_similarity``, ``identity_nw``,
# ``identity_sw``, ``evidence_weight``, ``taxonomic_proximity``,
# ``neighbor_vote_fraction`` (any other key triggers a 422 since the
# request body is ``extra=forbid``). Omitted keys default to 0.
curl -s -X POST http://127.0.0.1:8000/scoring/configs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "alignment-weighted",
    "weights": {"embedding_similarity": 0.5, "identity_nw": 0.3, "identity_sw": 0.2}
  }' | python -m json.tool

# Download scored predictions
curl -s "http://127.0.0.1:8000/scoring/prediction-sets/<id>/score.tsv?scoring_config_id=<config-id>" \
  -o scored.tsv

# Compute CAFA metrics for scored predictions
curl -s "http://127.0.0.1:8000/scoring/prediction-sets/<id>/metrics?scoring_config_id=<config-id>&evaluation_set_id=<eval-id>"

Scale batch workers ¶

Add extra batch workers to a queue without restarting the full stack:

bash scripts/manage.sh scale protea.embeddings.batch 2
bash scripts/manage.sh scale protea.predictions.batch 3

Use bash scripts/manage.sh status to verify running workers and their memory consumption.

Build the documentation locally ¶

Sphinx and the theme stack live in the optional docs Poetry group. Install once, then build:

poetry install --with docs       # one-time: pulls Sphinx, furo, etc.
poetry run task html_docs
# or directly:
cd docs && poetry run sphinx-build -b html source build/html

Open docs/build/html/index.html in a browser. The html_docs task is defined under [tool.taskipy.tasks] in pyproject.toml; it requires taskipy (installed via --with lint).