HTTP API

The PROTEA HTTP API is a FastAPI application that exposes a set of routers under protea/api/routers/ (the live OpenAPI schema is regenerated into docs/openapi.json and is authoritative for the exact endpoint list). All state mutations flow through this layer: it writes Job rows to PostgreSQL and publishes messages to RabbitMQ. The API is stateless between requests; the session factory and AMQP URL are injected via app.state at startup, keeping every router free of global state and infrastructure imports.

All endpoints return JSON. Error responses follow the RFC 7807 application/problem+json shape (T4.4 / D4): every error body includes type (relative URI under /problems/{slug}, e.g. /problems/not-found), title (short stable summary), status (mirror of the HTTP code), and optionally detail and instance (the request URI). Validation errors carry an extra errors array with the offending field paths. Existing route code keeps raising HTTPException exactly as before; only the wire format changed. Timestamps are ISO 8601 UTC strings. UUID identifiers are lowercase hyphenated strings.
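As an illustration of the error shape described above, the following minimal sketch builds such a body with the standard library only. The helper name problem, the title text, and the shape of each item in errors are assumptions made for the example; they are not taken from protea.api.problem_details.

```python
import json
from typing import Optional


def problem(status: int, slug: str, title: str,
            detail: Optional[str] = None,
            instance: Optional[str] = None,
            errors: Optional[list] = None) -> dict:
    """Build an RFC 7807-style body with the fields listed in the text."""
    body = {"type": f"/problems/{slug}", "title": title, "status": status}
    # detail and instance are optional, so only emit them when provided.
    if detail is not None:
        body["detail"] = detail
    if instance is not None:
        body["instance"] = instance
    # Validation errors carry an extra "errors" array (item shape assumed).
    if errors:
        body["errors"] = errors
    return body


not_found = problem(404, "not-found", "Not Found",
                    detail="Job not found", instance="/v1/jobs/123")
print(json.dumps(not_found, indent=2))
```

A client can therefore branch on the stable type or status fields rather than parsing the human-readable detail.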

Every client request body is strict (model_config = ConfigDict(extra="forbid"), PR #215): unknown keys raise a 422 instead of being silently dropped, so {"oepration": "ping"} on POST /jobs (typo for operation) fails fast against the schema rather than parsing as if operation were missing. The contract covers every documented request body (CreateJobRequest / CreateJobCommentRequest / ScoringConfigCreate / CreateExperimentRunRequest / UpdateExperimentRunRequest / CreateDatasetRequest / ImportDatasetByReferenceRequest / ImportRerankerByReferenceRequest / SupportCreate); response models are not constrained because they are server-built and never parse client input.
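To make the strict-body behaviour concrete, here is a standard-library stand-in for what extra="forbid" does to a POST /jobs body. It is a sketch of the observable semantics, not the server's Pydantic model: the field set is taken from the "Request body for POST /jobs" section, while check_strict and the error-item shape are hypothetical.

```python
# Field names come from the POST /jobs request body documented on this page.
ALLOWED = {"operation", "queue_name", "payload", "meta", "description", "tags"}
REQUIRED = {"operation", "queue_name"}


def check_strict(body: dict) -> list:
    """Return field errors; an empty list means the body passes this check."""
    errors = []
    # Unknown keys are rejected instead of being silently dropped.
    for key in sorted(set(body) - ALLOWED):
        errors.append({"loc": ["body", key], "msg": "extra field not permitted"})
    # Required keys must still be present.
    for key in sorted(REQUIRED - set(body)):
        errors.append({"loc": ["body", key], "msg": "field required"})
    return errors


# The typo is reported by name, alongside the fields that are truly missing.
for err in check_strict({"oepration": "ping"}):
    print(err)
```

Because the misspelled key is named explicitly, the client sees the typo itself rather than only a "missing operation" message.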

Versioning under the /v1/ prefix

Every router is mounted twice (T4.1, decision D4):

  • Canonical under the /v1/ prefix (the first major URL segment): surfaced in OpenAPI / Swagger and the only path schema exporters and codegen tools see. All new clients should target this form.

  • Legacy alias at the root path: the same handler reachable without a prefix, include_in_schema=False so OpenAPI does not advertise it. This exists for the deprecation window so existing frontend, CLI, and CI traffic keeps working without a coordinated cutover.

The endpoint paths in the per-router sections and the Endpoints summary below are listed without the prefix for terseness; both the bare and the prefixed paths resolve to the same handler today. Health endpoints (/health, /health/ready) stay at the root by convention. When the legacy aliases are retired the second include_router call in protea.api.app._register_routers will be removed; this page is the source of truth for that timing.
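A client that wants to target the canonical form can normalise the bare paths listed on this page before sending a request. The helper below is a small sketch of that rule; canonical_path is a hypothetical name, and the only facts it encodes are the /v1/ prefix and the root-mounted health endpoints.

```python
def canonical_path(path: str) -> str:
    """Map a bare path from this page to its canonical /v1/ form."""
    if not path.startswith("/"):
        path = "/" + path
    # Health endpoints stay at the root by convention.
    if path == "/health" or path.startswith("/health/"):
        return path
    # Already canonical: leave untouched.
    if path == "/v1" or path.startswith("/v1/"):
        return path
    return "/v1" + path


print(canonical_path("/jobs"))          # /v1/jobs
print(canonical_path("/health/ready"))  # /health/ready
```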

Application factory

protea.api.app creates the FastAPI application, registers all routers, and wires the session factory and AMQP URL into app.state at startup. It also configures CORS and mounts any static middleware.

Application lifecycle and startup stages

protea.api.stages orchestrates the FastAPI lifespan: it opens the SQLAlchemy engine, publishes the session factory into app.state, and tears down the AMQP connection pool on shutdown.

Jobs router

The /jobs router is the primary interface for job lifecycle management. Jobs are created by POST /jobs with an operation name, a queue_name, and an optional JSON payload. The API creates a Job row in QUEUED status, commits, then publishes the UUID to RabbitMQ (in that order, so workers always find the row before they try to claim it).
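The commit-then-publish ordering can be sketched with in-memory stand-ins: a dict plays the role of the Job table and a list plays the role of the RabbitMQ queue. The function names are hypothetical and the real router uses a SQLAlchemy session and an AMQP publisher instead.

```python
import uuid


def create_job(rows: dict, queue: list, operation: str, queue_name: str,
               payload=None) -> str:
    job_id = str(uuid.uuid4())
    # 1. Persist the row first (stands in for session.add + commit).
    rows[job_id] = {"operation": operation, "queue_name": queue_name,
                    "payload": payload or {}, "status": "QUEUED"}
    # 2. Publish the UUID only after the row is durable.
    queue.append(job_id)
    return job_id


def worker_claim(rows: dict, queue: list) -> str:
    job_id = queue.pop(0)
    # The lookup cannot miss, because the row was written before the publish.
    rows[job_id]["status"] = "RUNNING"
    return job_id


rows, queue = {}, []
jid = create_job(rows, queue, "insert_proteins", "protea.jobs")
print(rows[jid]["status"], queue == [jid])
```

Reversing the two steps would open a window in which a worker receives a UUID that does not yet exist in the table.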

Job status and the structured event timeline can be polled via GET /jobs/{id} and GET /jobs/{id}/events respectively. The frontend uses 2-second polling on the events endpoint to render a live progress timeline.
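A polling client along these lines might look like the sketch below. The two fetch callables are injected so the loop stays transport-agnostic; the ts field on each event and the terminal status names other than CANCELLED are assumptions, so check them against docs/openapi.json before relying on them.

```python
import time
from typing import Callable, Iterator, Optional

# Assumed terminal statuses; only CANCELLED is named on this page.
DEFAULT_TERMINAL = frozenset({"SUCCEEDED", "FAILED", "CANCELLED"})


def follow_events(fetch_events: Callable[[Optional[str]], list],
                  fetch_status: Callable[[], str],
                  interval: float = 2.0,
                  sleep: Callable[[float], None] = time.sleep,
                  terminal: frozenset = DEFAULT_TERMINAL) -> Iterator[dict]:
    """Yield events as they appear, polling every `interval` seconds."""
    cursor = None  # passed as after=<ts> on the events endpoint
    while True:
        status = fetch_status()
        events = fetch_events(cursor)
        for event in events:
            cursor = event["ts"]  # advance the cursor past this event
            yield event
        # Stop once the job is finished and the timeline is drained.
        if status in terminal and not events:
            return
        sleep(interval)
```

Reading the status before the events means a job that finishes mid-poll is drained on the next iteration rather than cut short.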

Proteins router

The /proteins router provides read access to the protein and sequence catalogue. Proteins are not created directly through this router; they are inserted asynchronously by the insert_proteins operation. The router exposes list and detail endpoints with filtering by organism and review status.

Annotations router

The /annotations router exposes the GO ontology and annotation set data. It provides:

  • Ontology snapshot listing and detail, including GO term counts per aspect.

  • Annotation set listing and detail.

  • A BFS ancestor subgraph endpoint (GET /annotations/snapshots/{id}/subgraph) that returns the ancestor closure for a given set of GO term IDs within a snapshot. Used by the frontend to render the GO hierarchy for a prediction result.
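The ancestor closure behind the subgraph endpoint is a plain breadth-first traversal over child-to-parent edges. The sketch below shows the idea on a toy graph; the parents mapping, the function name, and the returned (nodes, edges) pair are illustrative assumptions rather than the router's actual data model.

```python
from collections import deque


def ancestor_closure(parents: dict, seeds):
    """Return (nodes, edges) reachable by walking parent links from seeds."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicates, keep order
    seen = set(seeds)
    queue = deque(seeds)
    edges = []
    while queue:
        term = queue.popleft()
        for parent in parents.get(term, ()):
            edges.append((term, parent))
            if parent not in seen:  # visit each ancestor once
                seen.add(parent)
                queue.append(parent)
    return seen, edges


# Toy identifiers, not real GO terms.
toy = {"c": ["b"], "b": ["a"], "d": ["a"]}
nodes, edges = ancestor_closure(toy, ["c"])
print(sorted(nodes), edges)
```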

The annotations router is split into four sub-modules, each handling one endpoint group.

Embeddings router

The /embeddings router manages embedding configurations and prediction sets. Embedding configurations are immutable recipes: once created, they can be referenced by any number of embedding computation and prediction jobs. Creating a new configuration with different parameters produces a new UUID, preserving reproducibility.

Prediction sets are created by submitting a predict_go_terms job and are queryable once the job completes. The GET /embeddings/prediction-sets/{id}/predictions.tsv endpoint streams prediction results as a tab-separated file (32 columns including re-ranker features) using StreamingResponse with yield_per(1000), avoiding loading the full result set into memory.
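The memory-bounded streaming idea can be illustrated without FastAPI or SQLAlchemy: a generator emits the header, then one chunk per batch of rows, so at most batch_size rows are held at a time. In the real endpoint a generator of this kind would be handed to StreamingResponse; the function name and the two-column header below are placeholders, not the 32-column layout.

```python
from typing import Iterable, Iterator, Sequence


def stream_tsv(header: Sequence[str], rows: Iterable[Sequence],
               batch_size: int = 1000) -> Iterator[str]:
    """Yield TSV text in chunks of at most `batch_size` rows."""
    yield "\t".join(header) + "\n"
    batch = []
    for row in rows:
        # None is rendered as an empty cell.
        batch.append("\t".join("" if v is None else str(v) for v in row))
        if len(batch) >= batch_size:
            yield "\n".join(batch) + "\n"
            batch.clear()
    if batch:  # flush the final partial batch
        yield "\n".join(batch) + "\n"


chunks = list(stream_tsv(["accession", "go_id"],
                         [("P1", "GO:1"), ("P2", None), ("P3", "GO:3")],
                         batch_size=2))
print(chunks)
```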

Scoring router

The /scoring router exposes scoring configurations, the training-data export, and read-only endpoints for applying LightGBM re-ranker models. In-process re-ranker training was retired in F0/T0.6: boosters are now trained offline in protea-reranker-lab and registered through the Reranker models router (POST /reranker-models/import).

Key endpoints:

  • GET /scoring/prediction-sets/{id}/training-data.tsv: generates a 31-column TSV with binary labels from temporal ground truth, consumed by protea-reranker-lab to fit a booster.

  • GET /scoring/rerankers / GET /scoring/rerankers/{id} / DELETE /scoring/rerankers/{id}: read/delete operations for registered re-ranker models. Creation lives at POST /reranker-models/import.

  • GET /scoring/prediction-sets/{id}/rerank.tsv: applies a trained re-ranker to a prediction set, streaming re-scored predictions.

  • GET /scoring/prediction-sets/{id}/reranker-metrics: computes CAFA-style Fmax and AUC-PR using re-ranker probability scores.

Query sets router

The /query-sets router handles user-uploaded FASTA files. On POST /query-sets, the server parses the multipart upload, creates a QuerySet row, upserts one Sequence row per unique amino-acid string (deduplicating by MD5 hash), and creates QuerySetEntry rows preserving the original FASTA headers. The returned query set ID can then be referenced in compute_embeddings and predict_go_terms job payloads.
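The upload handling described above boils down to parsing FASTA records and keying sequences by the MD5 of the amino-acid string. The following sketch uses only the standard library; the function names and the in-memory return shape are assumptions, and details such as case normalisation are deliberately left out because the page does not specify them.

```python
import hashlib


def parse_fasta(text: str) -> list:
    """Return a list of (header, sequence) pairs in file order."""
    entries, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


def build_query_set(text: str):
    sequences = {}  # md5 -> sequence, one entry per unique sequence
    entries = []    # (original header, md5), one entry per FASTA record
    for header, seq in parse_fasta(text):
        digest = hashlib.md5(seq.encode("utf-8"), usedforsecurity=False).hexdigest()
        sequences.setdefault(digest, seq)
        entries.append((header, digest))
    return sequences, entries


seqs, ents = build_query_set(">a\nMKT\nAA\n>b\nMKTAA\n>c\nGG\n")
print(len(seqs), len(ents))  # 2 3
```

Two records with identical residues share one sequence entry while both headers are preserved.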

Annotate router

The /annotate router provides a one-click annotation endpoint. It accepts a FASTA file (or raw text), auto-selects the best available embedding config, annotation set, and ontology snapshot, creates a QuerySet, and queues a compute_embeddings job. It returns all the IDs the frontend needs to chain predict_go_terms once embeddings finish.

Maintenance router

The /maintenance router provides housekeeping endpoints for identifying and removing orphaned data. Two pairs of preview/execute endpoints handle orphan sequences (not referenced by any Protein or QuerySetEntry) and unindexed embeddings (for sequences not referenced by any Protein). Preview endpoints are read-only; execute endpoints perform the actual deletion.

Admin router

The /admin router exposes destructive administrative operations. It currently provides POST /admin/reset-db, which drops and recreates the public schema and re-applies all Alembic migrations. The endpoint is protected by the admin role via require_role() (FARM-AUTH.4).

Showcase router

The /showcase router aggregates platform statistics and best evaluation results for the landing page. It returns protein counts, embedding counts, prediction counts, best Fmax per aspect per evaluation category (NK/LK/PK), and a method comparison table, all in a single JSON response.

Support router

The /support router handles community feedback. GET /support returns the total thumbs-up count and recent comments. POST /support submits a new thumbs-up with an optional comment (max 500 characters).

Benchmark router

The /benchmark router powers the per-PLM comparison grid in the UI. Where /showcase collapses every model into a few buckets and reports the maximum, this router preserves which embedding produced each number and which scoring config was used, exposing one stage per distinct ScoringConfig.name plus an implicit "reranker" stage for evaluations that used a re-ranker. Stage labels, GO categories, and the baseline tag are read from protea/config/benchmark.yaml; no hardcoded constants.

Datasets router

The /datasets router is the registry for frozen re-ranker training datasets. POST /datasets enqueues an export_research_dataset job that runs the KNN + feature pipeline, publishes the train.parquet / eval.parquet / manifest.json triple to the configured ArtifactStore (local FS or MinIO), and inserts a Dataset row once the upload completes. GET /datasets and GET /datasets/{id_or_name} expose the registry to protea-reranker-lab’s pull_dataset.py and to UI consumers.

POST /datasets/import-by-reference (LB.1) is the lightweight registration path for datasets whose artefacts already reside in the artifact store. The caller supplies the name, storage backend, artifact URIs, content fingerprints (schema_sha, manifest_sha), and dump parameters verbatim from the lab’s manifest.json; PROTEA inserts a Dataset row pointing at those URIs without re-running the KNN pipeline or enqueueing a job. Typical use cases are: replay after a DB wipe while artefacts remain in MinIO, lab-side dumps produced before export_research_dataset existed, and the FARM-EXP.2a placeholder-digest backfill. Optional FK columns (embedding_config_id, ontology_snapshot_id) are silently set to NULL when the referenced row is absent in the local DB, matching the same defensive pattern used by POST /reranker-models/import-by-reference. The resulting Dataset row is content-identical to one produced by an in-PROTEA export; the only visible difference is meta.imported_by_reference = true.

Registry router

The /backends, /sources, and /runners endpoints list the plugins discovered at runtime via importlib.metadata.entry_points for the three plugin groups: embedding backends, annotation sources, and experiment runners. The router is intentionally stateless: it re-scans entry points on every call rather than caching, so a worker that has just been restarted with a newly-installed extra surfaces in the next request without an API restart.
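Entry-point discovery of this kind can be sketched with importlib.metadata alone. The group name used below is a placeholder, since this page does not state the actual group names; the fallback branch covers older Python versions where entry_points() returns a plain mapping.

```python
from importlib.metadata import entry_points


def list_plugins(group: str) -> list:
    """Re-scan installed distributions and return plugin names for a group."""
    eps = entry_points()
    if hasattr(eps, "select"):      # Python 3.10+
        selected = eps.select(group=group)
    else:                           # older mapping-style return value
        selected = eps.get(group, [])
    return sorted({ep.name for ep in selected})


# "protea.backends" is a hypothetical group name used for illustration.
print(list_plugins("protea.backends"))
```

Because nothing is cached, each call reflects whatever is installed in the environment at that moment.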

Reranker models router

The /reranker-models router accepts boosters trained offline in protea-reranker-lab (or any compatible trainer) and registers them in PROTEA. POST /reranker-models/import is the multipart flow: the lab sends model.txt + spec.yaml + run.json inline and the server uploads model.txt to the artifact store under rerankers/<run_id>/. POST /reranker-models/import-by-reference is the production flow: the lab pre-uploads model.txt to MinIO under its own key and posts JSON with artifact_uri + run_json + spec_yaml. Both flows share _register_model so the resulting RerankerModel row is identical.

Stack router

The /stack router exposes metadata about the eight-repo PROTEA stack to the UI. GET /stack returns the registry from docs/source/_data/stack.yaml. GET /stack/pulls aggregates open pull requests across every repo in the stack via the GitHub REST API and caches the result in-process to stay under the unauthenticated 60 req/h rate limit (set PROTEA_GITHUB_TOKEN to lift to 5000 req/h).

Experiment runs router

The /experiment-runs router exposes CRUD over the ExperimentRun ORM (T4.7-T4.9, decision D11). One row aggregates multiple Job / EvaluationResult / RerankerModel rows under a unique human name and carries the narrative trio (description / hypothesis / findings) plus JSONB config / provenance and Text[] tags. PATCH /experiment-runs/{run_id} accepts partial updates; status transitions stamp started_at (on planned → running) and finished_at (on running → done or abandoned) idempotently: re-entering a state never resets its timestamp.
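The idempotent stamping rule can be expressed in a few lines. This is a sketch over a plain dict rather than the ExperimentRun ORM, and apply_status is a hypothetical name; the status values are the ones named in the paragraph above.

```python
from datetime import datetime, timezone


def apply_status(run: dict, new_status: str, now=None) -> dict:
    """Set the status and stamp timestamps only the first time."""
    now = now or datetime.now(timezone.utc)
    # Entering "running" stamps started_at once.
    if new_status == "running" and run.get("started_at") is None:
        run["started_at"] = now
    # Entering a final state stamps finished_at once.
    if new_status in {"done", "abandoned"} and run.get("finished_at") is None:
        run["finished_at"] = now
    run["status"] = new_status
    return run


t1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 2, tzinfo=timezone.utc)
run = apply_status({"status": "planned"}, "running", now=t1)
run = apply_status(run, "running", now=t2)  # re-entry keeps the first stamp
print(run["started_at"] == t1)
```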

Services layer

Each router delegates non-trivial business logic to a service module. Services are pure Python: they accept a SQLAlchemy session and return domain objects or raise domain exceptions. Routers map those exceptions to HTTP status codes. This separation allows the same logic to be exercised from CLI tools or batch scripts without importing FastAPI. Full symbol-level documentation lives in Services.
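The split between services and routers can be pictured as follows. Everything in this sketch is hypothetical (the exception classes, the status map, and the dict standing in for a SQLAlchemy session); it only illustrates that the service raises domain exceptions and the HTTP layer translates them.

```python
class NotFoundError(Exception):
    """Domain exception: the requested object does not exist."""


class ConflictError(Exception):
    """Domain exception: the operation clashes with current state."""


# Router-side mapping from domain exceptions to HTTP status codes.
STATUS_BY_ERROR = {NotFoundError: 404, ConflictError: 409}


def get_job(jobs: dict, job_id: str) -> dict:
    """Service function: knows nothing about HTTP."""
    try:
        return jobs[job_id]
    except KeyError:
        raise NotFoundError(f"job {job_id} not found") from None


def handle_get_job(jobs: dict, job_id: str):
    """Router-style wrapper: returns (status_code, body)."""
    try:
        return 200, get_job(jobs, job_id)
    except tuple(STATUS_BY_ERROR) as exc:
        return STATUS_BY_ERROR[type(exc)], {"detail": str(exc)}


print(handle_get_job({}, "missing"))
```

A CLI tool can call get_job directly and handle NotFoundError itself, with no web framework imported.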

Authentication helpers

protea.api.auth implements the credential-verification layer. It exposes require_api_key_or_bearer, a FastAPI dependency that accepts three header forms (Authorization: ApiKey, X-Api-Key, or Authorization: Bearer). The API-key path computes a SHA-256 hash of the raw key and compares it against the database; the Bearer path verifies an HS256 JWT. A missing or invalid credential returns 401 with a WWW-Authenticate challenge.
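The header handling and the API-key comparison can be sketched with the standard library. The function names are hypothetical, and storing the hash as a hex digest is an assumption; the three header forms and the SHA-256 step come from the paragraph above.

```python
import hashlib
import hmac


def extract_credential(headers: dict):
    """Return ("api_key", value), ("bearer", token), or None."""
    lowered = {k.lower(): v for k, v in headers.items()}
    scheme, _, value = lowered.get("authorization", "").partition(" ")
    value = value.strip()
    if scheme.lower() == "apikey" and value:
        return ("api_key", value)
    if scheme.lower() == "bearer" and value:
        return ("bearer", value)
    if lowered.get("x-api-key"):
        return ("api_key", lowered["x-api-key"])
    return None  # the caller would answer 401 with a WWW-Authenticate challenge


def api_key_matches(raw_key: str, stored_sha256_hex: str) -> bool:
    """Hash the presented key and compare in constant time."""
    digest = hashlib.sha256(raw_key.encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, stored_sha256_hex)


stored = hashlib.sha256(b"example-key").hexdigest()
print(api_key_matches("example-key", stored))  # True
```

Only the hash is compared, so the raw key never needs to be stored server-side.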

protea.api.bearer provides the HS256 JWT verification utilities used by auth.require_api_key_or_bearer. Minimum required claims are sub, iat, and exp.
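For readers unfamiliar with HS256, the verification steps can be written out with the standard library. This is an illustrative sketch, not the code in protea.api.bearer: the function names are hypothetical, and a production service would normally rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time

REQUIRED_CLAIMS = {"sub", "iat", "exp"}


def _b64url_decode(part: str) -> bytes:
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: str) -> str:
    """Helper for the example: mint a token to verify."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256)
    return f"{header}.{body}.{_b64url_encode(sig.digest())}"


def verify_hs256(token: str, secret: str, now=None) -> dict:
    """Return the claims, or raise ValueError if the token is unacceptable."""
    try:
        header_b64, body_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(body_b64))
        signature = _b64url_decode(sig_b64)
    except ValueError:
        raise ValueError("malformed token") from None
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret.encode(), f"{header_b64}.{body_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    if not isinstance(claims, dict) or REQUIRED_CLAIMS - claims.keys():
        raise ValueError("missing required claims")
    if claims["exp"] <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims


token = sign_hs256({"sub": "alice", "iat": 1000, "exp": 2000}, "dev-secret")
print(verify_hs256(token, "dev-secret", now=1500)["sub"])  # alice
```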

protea.api.auth_api_keys is the router for managing API key creation and revocation.

Request caching and rate limiting

protea.api.cache provides in-process caching utilities for expensive read-only endpoints (showcase statistics, benchmark matrix). Results are stored with a configurable TTL, reducing redundant database queries on frequently-polled pages.
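A TTL cache of this kind needs little more than a dict and a clock. The class below is a minimal sketch under that assumption; TTLCache and get_or_compute are hypothetical names and do not describe the actual interface of protea.api.cache.

```python
import time


class TTLCache:
    """Keep computed values for `ttl_seconds`, then recompute on demand."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the expensive call
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value


cache = TTLCache(ttl_seconds=30)
stats = cache.get_or_compute("showcase", lambda: {"proteins": 0})
print(stats)
```

Injecting the clock keeps the expiry logic testable without real waiting.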

protea.api.rate_limit configures the slowapi limiter and exposes the per-principal rate-limit rules applied to the five write routes protected by authentication (POST /jobs, POST /datasets, POST /datasets/import-by-reference, POST /reranker-models/import, POST /reranker-models/import-by-reference).

Shared dependencies and error handling

protea.api.deps provides FastAPI Depends callables shared across multiple routers: database session injection, current-user extraction, and pagination helpers.

protea.api.problem_details implements RFC 7807 application/problem+json error serialisation. Every exception handler in the application calls into this module to produce a consistent {"type", "title", "status", "detail", "instance"} body. Validation errors carry an additional errors array with the offending field paths.

Middleware

protea.api.middleware.visitor_counter is the ASGI middleware that logs one VisitorEvent row per HTTP GET to a non-asset path. It extracts the client IP, combines it with a daily salt, and stores the first 16 hex characters of the resulting SHA-256 hash.
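The pseudonymisation step can be sketched as below. How the IP and the salt are combined is not specified on this page, so the "salt:ip" concatenation and the function name are assumptions; the SHA-256 digest and the 16-character truncation follow the description above.

```python
import hashlib


def visitor_hash(client_ip: str, daily_salt: str) -> str:
    """Return the first 16 hex characters of SHA-256(salt + IP)."""
    material = f"{daily_salt}:{client_ip}".encode("utf-8")  # assumed layout
    return hashlib.sha256(material).hexdigest()[:16]


today = visitor_hash("203.0.113.7", "salt-for-today")
tomorrow = visitor_hash("203.0.113.7", "salt-for-tomorrow")
print(today != tomorrow)  # True: the same IP is unlinkable across days
```

Rotating the salt daily means repeat visits can be counted within a day without keeping a stable identifier for the address.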

Metrics router

The /metrics router exposes Prometheus-compatible scrape metrics for the API process. Response time histograms, active-connection gauges, and job-state counters are surfaced at GET /metrics.

Authentication and rate limits

Five POST routes require a credential (T5.6a + T5.6b):

  • POST /v1/jobs

  • POST /v1/datasets

  • POST /v1/datasets/import-by-reference

  • POST /v1/reranker-models/import

  • POST /v1/reranker-models/import-by-reference

Three header forms are accepted, any one of which satisfies the gate:

Authorization: ApiKey <raw_key>
X-Api-Key: <raw_key>
Authorization: Bearer <jwt>

The API key path uses protea.api.auth.require_api_key_or_bearer() (SHA-256 hash verification). The Bearer path uses HS256 with the PROTEA_JWT_SECRET env var; minimum token claims are sub, iat, and exp. A missing or invalid credential returns 401 with WWW-Authenticate: ApiKey, Bearer. Rate limits on these routes are enforced by slowapi per principal (API-key prefix or JWT sub); exceeding the limit returns 429 with a Retry-After header. See Authentication for the complete auth and rate-limit reference, and Configuration Reference for the PROTEA_AUTHN_REQUIRED, PROTEA_JWT_SECRET, and PROTEA_RATELIMIT_* knobs.

Endpoints summary

Method | Path | Description

Health
GET | /health | Liveness probe: returns 200 if the API process is up.
GET | /health/ready | Readiness probe: verifies database and RabbitMQ connections.

Jobs
POST | /jobs | Create a job and publish its UUID to RabbitMQ.
GET | /jobs | List jobs; filter by status and/or operation. Max 500 rows. Cursor pagination (T4.2): pass after=<created_at> to walk forward past the limit.
GET | /jobs/{id} | Retrieve a single job with full payload and meta.
GET | /jobs/{id}/events | Retrieve the event timeline for a job (up to 2 000 events). Cursor pagination (T4.2): pass after=<ts> to walk forward.
POST | /jobs/{id}/cancel | Transition a QUEUED or RUNNING job to CANCELLED.
DELETE | /jobs/{id} | Delete a job that is not in RUNNING status.
POST | /jobs/{id}/comments | Append a JobComment (T3.10 / D11). Body fields: body (required, non-empty), author (optional). Returns 201.
GET | /jobs/{id}/comments | List the JobComment thread chronologically (created_at ASC, id ASC tiebreaker). Cursor pagination (T4.2): pass after=<created_at> to walk forward past the limit.

Proteins
GET | /proteins/stats | Aggregate protein statistics (total, canonical, reviewed, organisms).
GET | /proteins | List proteins with pagination; filter by organism / reviewed.
GET | /proteins/{accession} | Retrieve a single protein with its UniProt metadata.
GET | /proteins/{accession}/annotations | List GO annotations for a protein across all annotation sets.

Annotations
GET | /annotations/snapshots | List ontology snapshots with GO term counts per aspect.
GET | /annotations/snapshots/{id} | Retrieve a snapshot with its full list of GO terms.
PATCH | /annotations/snapshots/{id}/ia-url | Set the Information Accretion (IA) file URL on an ontology snapshot.
POST | /annotations/snapshots/load | Queue a load_ontology_snapshot job.
GET | /annotations/snapshots/{id}/subgraph | BFS ancestor subgraph for a given set of GO term IDs.
GET | /annotations/sets | List annotation sets with protein GO annotation counts.
GET | /annotations/sets/{id} | Retrieve a single annotation set with summary statistics.
DELETE | /annotations/sets/{id} | Delete an annotation set and all its annotations.
POST | /annotations/sets/load-goa | Queue a load_goa_annotations job.
POST | /annotations/sets/load-quickgo | Queue a load_quickgo_annotations job.
POST | /annotations/evaluation-sets/generate | Queue a generate_evaluation_set job.
GET | /annotations/evaluation-sets | List evaluation sets with summary statistics.
GET | /annotations/evaluation-sets/{id} | Get evaluation set details.
DELETE | /annotations/evaluation-sets/{id} | Delete an evaluation set.
GET | /annotations/evaluation-sets/{id}/ground-truth-NK.tsv | Download NK ground truth in CAFA format.
GET | /annotations/evaluation-sets/{id}/ground-truth-LK.tsv | Download LK ground truth in CAFA format.
GET | /annotations/evaluation-sets/{id}/ground-truth-PK.tsv | Download PK ground truth in CAFA format.
GET | /annotations/evaluation-sets/{id}/known-terms.tsv | Download known terms from old annotation set (for PK evaluation).
GET | /annotations/evaluation-sets/{id}/delta-proteins.fasta | Download delta proteins as FASTA.
POST | /annotations/evaluation-sets/{id}/run | Queue a run_cafa_evaluation job.
GET | /annotations/evaluation-sets/{id}/results | List evaluation results for an evaluation set.
GET | /annotations/evaluation-sets/{id}/results/{rid}/metrics.tsv | Download evaluation metrics as TSV.
GET | /annotations/evaluation-sets/{id}/results/{rid}/artifacts.zip | Download all cafaeval artifacts as a zip.
DELETE | /annotations/evaluation-sets/{id}/results/{rid} | Delete an evaluation result.

Embeddings
GET | /embeddings/configs | List all embedding configurations.
POST | /embeddings/configs | Create a new (immutable) embedding configuration.
GET | /embeddings/configs/{id} | Retrieve an embedding configuration by UUID.
DELETE | /embeddings/configs/{id} | Delete an embedding configuration.
POST | /embeddings/predict | Queue a predict_go_terms job.
GET | /embeddings/prediction-sets | List prediction sets with entry counts.
GET | /embeddings/prediction-sets/{id} | Retrieve a prediction set with summary statistics.
GET | /embeddings/prediction-sets/{id}/proteins | List proteins in a prediction set.
GET | /embeddings/prediction-sets/{id}/proteins/{accession} | Get predictions for one protein.
GET | /embeddings/prediction-sets/{id}/go-terms | GO term distribution in a prediction set.
GET | /embeddings/prediction-sets/{id}/predictions.tsv | Stream all predictions as TSV (filtered by accession / aspect / distance).
GET | /embeddings/prediction-sets/{id}/predictions-cafa.tsv | Download predictions in CAFA submission format.
DELETE | /embeddings/prediction-sets/{id} | Delete a prediction set.

Scoring
GET | /scoring/configs | List scoring configurations.
POST | /scoring/configs | Create a scoring configuration.
POST | /scoring/configs/presets | Create preset scoring configurations.
GET | /scoring/configs/{id} | Retrieve a scoring configuration.
DELETE | /scoring/configs/{id} | Delete a scoring configuration.
GET | /scoring/prediction-sets/{id}/score.tsv | Stream scored predictions as TSV.
GET | /scoring/prediction-sets/{id}/metrics | Compute CAFA-style metrics for scored predictions.
GET | /scoring/prediction-sets/{id}/training-data.tsv | Export labeled training data for the re-ranker.
GET | /scoring/rerankers | List all trained re-ranker models.
GET | /scoring/rerankers/{id} | Retrieve a re-ranker model’s metadata, metrics, and feature importance.
DELETE | /scoring/rerankers/{id} | Delete a trained re-ranker model.
GET | /scoring/prediction-sets/{id}/rerank.tsv | Apply a re-ranker to a prediction set and stream re-scored TSV.
GET | /scoring/prediction-sets/{id}/reranker-metrics | Compute CAFA Fmax and AUC-PR using re-ranker scores.

Query Sets
POST | /query-sets | Upload a FASTA file and create a QuerySet.
GET | /query-sets | List all query sets with entry counts.
GET | /query-sets/{id} | Retrieve a query set with its full entry list.
DELETE | /query-sets/{id} | Delete a query set and all its entries.

Annotate
POST | /annotate | One-click annotation: upload FASTA, auto-run the full pipeline.

Maintenance
GET | /maintenance/vacuum-sequences/preview | Count orphan sequences (preview).
POST | /maintenance/vacuum-sequences | Delete orphan sequences.
GET | /maintenance/vacuum-embeddings/preview | Count unindexed embeddings (preview).
POST | /maintenance/vacuum-embeddings | Delete unindexed embeddings.

Admin
POST | /admin/reset-db | Drop and recreate the public schema (requires admin token).

Showcase
GET | /showcase | Platform statistics and best evaluation results.

Support
GET | /support | Total thumbs-up count and recent comments.
POST | /support | Submit a thumbs-up with optional comment.

Benchmark
GET | /benchmark/embeddings | List embedding configs with persisted display metadata.
GET | /benchmark/matrix | Per-embedding / per-stage Fmax matrix across all evaluation results.

Datasets
POST | /datasets | Enqueue an export_research_dataset job.
POST | /datasets/import-by-reference | Register a Dataset row pointing at already-staged artefacts (no job, no KNN re-run). Requires auth (LB.1).
GET | /datasets | List registered re-ranker datasets. Cursor pagination (T4.2): pass after=<created_at> to walk forward past the limit.
GET | /datasets/{id_or_name} | Get a dataset by id or name.

Plugin Registry
GET | /backends | List installed embedding-backend plugins.
GET | /sources | List installed annotation-source plugins.
GET | /runners | List installed experiment-runner plugins.

Reranker Models
POST | /reranker-models/import | Import a lab-trained booster (multipart).
POST | /reranker-models/import-by-reference | Import a booster already uploaded to the artifact store (JSON).

Stack
GET | /stack | Return the eight-repo PROTEA stack registry.
GET | /stack/pulls | Aggregate open pull requests across every repo in the stack.

Experiment Runs
POST | /experiment-runs | Create an ExperimentRun (T4.7). Body: name required + optional narrative trio + status + JSONB / tags.
GET | /experiment-runs | List experiment runs newest-first; filter by status (T4.8). Cursor pagination (T4.2): pass after=<created_at> from the previous page’s last row.
GET | /experiment-runs/{run_id} | Retrieve one experiment run.
PATCH | /experiment-runs/{run_id} | Partial update (T4.9). Status transitions stamp started_at / finished_at idempotently.
DELETE | /experiment-runs/{run_id} | Delete an experiment run (returns 204).

Request body for POST /jobs

The operation and queue_name fields are required. payload is passed verbatim to the operation’s execute method after Pydantic validation; its schema depends on the operation. meta is stored on the Job row and never interpreted by the API. description and tags are optional D11 narrative fields surfaced on the GET /jobs and GET /jobs/{id} responses; they let any caller attach human intent and ad-hoc grouping tokens at submission time without round-tripping through a separate metadata endpoint.

{
  "operation": "insert_proteins",
  "queue_name": "protea.jobs",
  "payload": {
    "search_criteria": "reviewed:true AND organism_id:9606"
  },
  "meta": {},
  "description": "Backfill reviewed Swiss-Prot for benchmark_v1",
  "tags": ["ablation", "benchmark_v1"]
}

Common payload examples by operation:

{ "operation": "fetch_uniprot_metadata", "queue_name": "protea.jobs",
  "payload": { "search_criteria": "reviewed:true AND organism_id:9606" } }

{ "operation": "compute_embeddings", "queue_name": "protea.embeddings",
  "payload": { "embedding_config_id": "<uuid>", "sequences_per_job": 64 } }

{ "operation": "predict_go_terms", "queue_name": "protea.predictions",
  "payload": {
    "embedding_config_id": "<uuid>",
    "annotation_set_id": "<uuid>",
    "ontology_snapshot_id": "<uuid>",
    "query_set_id": "<uuid>",
    "limit_per_entry": 5
  }
}
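A client-side sketch of submitting such a body with the standard library is shown here. It only builds the request object; the base URL, the API key value, and the helper name are placeholders, while the path, the field names, and the X-Api-Key header follow this page.

```python
import json
import urllib.request


def build_create_job_request(base_url: str, api_key: str, operation: str,
                             queue_name: str, payload=None,
                             description=None, tags=None) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/jobs request."""
    body = {"operation": operation, "queue_name": queue_name}
    # Optional fields are only included when set; unknown keys would be
    # rejected with 422 by the strict request schema.
    if payload is not None:
        body["payload"] = payload
    if description is not None:
        body["description"] = description
    if tags is not None:
        body["tags"] = tags
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/jobs",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", "X-Api-Key": api_key},
    )


req = build_create_job_request(
    "http://localhost:8000",  # placeholder host
    "example-key",            # placeholder credential
    "insert_proteins",
    "protea.jobs",
    payload={"search_criteria": "reviewed:true AND organism_id:9606"},
    tags=["benchmark_v1"],
)
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen would send it; a 4xx answer then raises urllib.error.HTTPError, whose body carries the problem+json document.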

See also

  • Operations: every operation referenced in a payload, with field-level documentation.

  • How-to Guides: concrete curl recipes that submit each endpoint end-to-end.

  • Job Lifecycle: how the API turns a request into a persistent Job row and a queue message.