HTTP API¶
The PROTEA HTTP API is a FastAPI application that exposes a set of routers
under protea/api/routers/ (the live OpenAPI is regenerated into
docs/openapi.json and is authoritative for the exact endpoint list).
All state mutations flow through this layer: it writes Job rows to
PostgreSQL and publishes messages to RabbitMQ. The API is stateless between
requests; the session factory and AMQP URL are injected via app.state
at startup, keeping every router free of global state and infrastructure
imports.
All endpoints return JSON. Error responses follow the RFC 7807
application/problem+json shape (T4.4 / D4): every error body
includes type (relative URI under /problems/{slug},
e.g. /problems/not-found), title (short stable summary), and
status (mirror of the HTTP code), plus optional detail and
instance (request URI) fields. Validation errors carry an extra
errors array with the offending field paths. Existing route code
keeps raising HTTPException exactly as before; only the wire
format changed. Timestamps are ISO 8601 UTC strings. UUID identifiers
are lowercase hyphenated strings.
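As a rough illustration of the body shape described above, the following sketch assembles a problem document with only the standard library. The helper name problem_body and the sample values are hypothetical; the real serialiser lives in protea.api.problem_details and may differ in detail.

```python
import json
from typing import Optional


def problem_body(
    slug: str,
    title: str,
    status: int,
    detail: Optional[str] = None,
    instance: Optional[str] = None,
    errors: Optional[list] = None,
) -> dict:
    """Build an RFC 7807-style body (hypothetical helper, not PROTEA code)."""
    body = {
        "type": f"/problems/{slug}",  # relative URI under /problems/{slug}
        "title": title,               # short stable summary
        "status": status,             # mirrors the HTTP status code
    }
    # detail and instance are optional, so emit them only when provided.
    if detail is not None:
        body["detail"] = detail
    if instance is not None:
        body["instance"] = instance
    # Validation errors carry an extra array with the offending field paths.
    if errors is not None:
        body["errors"] = errors
    return body


example = problem_body(
    "not-found",
    "Not Found",
    404,
    detail="Job does not exist.",  # illustrative wording only
    instance="/v1/jobs/00000000-0000-0000-0000-000000000000",
)
print(json.dumps(example, indent=2))
```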
Every client request body is strict (model_config =
ConfigDict(extra="forbid"), PR #215): unknown keys raise a 422
instead of being silently dropped, so {"oepration": "ping"} on
POST /jobs (typo for operation) fails fast against the
schema rather than parsing as if operation were missing. The
contract covers every documented request body
(CreateJobRequest / CreateJobCommentRequest /
ScoringConfigCreate / CreateExperimentRunRequest /
UpdateExperimentRunRequest / CreateDatasetRequest /
ImportDatasetByReferenceRequest /
ImportRerankerByReferenceRequest / SupportCreate); response
models are not constrained because they are server-built and never
parse client input.
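The effect of the strict contract can be pictured with a small standard-library emulation. This is not Pydantic and not the PROTEA schema code: the function name and the error-record shape are assumptions, and only the field names are taken from the POST /jobs description on this page.

```python
# Field names as documented for POST /jobs; everything else is hypothetical.
ALLOWED_KEYS = {"operation", "queue_name", "payload", "meta", "description", "tags"}
REQUIRED_KEYS = {"operation", "queue_name"}


def validate_create_job(body: dict) -> list:
    """Return a list of error records; an empty list means the body is accepted."""
    errors = []
    # Unknown keys are reported instead of being silently dropped,
    # which is the behaviour extra="forbid" gives a Pydantic model.
    for key in sorted(set(body) - ALLOWED_KEYS):
        errors.append({"loc": ["body", key], "msg": "unknown field"})
    for key in sorted(REQUIRED_KEYS - set(body)):
        errors.append({"loc": ["body", key], "msg": "field required"})
    return errors


# The typo "oepration" is flagged by name rather than ignored.
print(validate_create_job({"oepration": "ping", "queue_name": "protea.jobs"}))
```

In the real API any non-empty error list would surface as a 422 response.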
Versioning under the /v1/ prefix¶
Every router is mounted twice (T4.1, decision D4):
Canonical under the /v1/ prefix (the first major URL segment): surfaced in OpenAPI / Swagger and the only path schema exporters and codegen tools see. All new clients should target this form.
Legacy alias at the root path: the same handler reachable without a prefix, registered with include_in_schema=False so OpenAPI does not advertise it. This exists for the deprecation window so existing frontend, CLI, and CI traffic keeps working without a coordinated cutover.
The endpoint paths in the per-router sections and the Endpoints
summary below are listed without the prefix for terseness; both the
bare and the prefixed paths resolve to the same handler today. Health
endpoints (/health, /health/ready) stay at the root by
convention. When the legacy aliases are retired the second
include_router call in
protea.api.app._register_routers will be removed; this page is
the source of truth for that timing.
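A client that wants to stay on the canonical form can normalise the bare paths used on this page before sending a request. The helper below is a hypothetical client-side sketch, not part of PROTEA; it only encodes the two rules stated above (prefix with /v1, leave health endpoints at the root).

```python
def canonical_path(path: str) -> str:
    """Map a bare endpoint path to its /v1/ form (hypothetical client helper)."""
    if not path.startswith("/"):
        path = "/" + path
    # Health endpoints stay at the root by convention.
    if path == "/health" or path.startswith("/health/"):
        return path
    # Already canonical: leave untouched.
    if path == "/v1" or path.startswith("/v1/"):
        return path
    return "/v1" + path


print(canonical_path("/jobs"))          # /v1/jobs
print(canonical_path("/v1/jobs"))       # /v1/jobs
print(canonical_path("/health/ready"))  # /health/ready
```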
Application factory¶
protea.api.app creates the FastAPI application, registers all routers,
and wires the session factory and AMQP URL into app.state at startup.
It also configures CORS and mounts any static middleware.
Application lifecycle and startup stages
protea.api.stages orchestrates the FastAPI lifespan: it opens the
SQLAlchemy engine, publishes the session factory into app.state, and
tears down the AMQP connection pool on shutdown.
Jobs router¶
The /jobs router is the primary interface for job lifecycle management.
Jobs are created by POST /jobs with an operation name, a
queue_name, and an optional JSON payload. The API creates a Job
row in QUEUED status, commits, then publishes the UUID to RabbitMQ
(in that order, so workers always find the row before they try to claim it).
Job status and the structured event timeline can be polled via
GET /jobs/{id} and GET /jobs/{id}/events respectively. The frontend
uses 2-second polling on the events endpoint to render a live progress
timeline.
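The commit-then-publish ordering can be sketched with standard-library stand-ins: sqlite3 plays the role of PostgreSQL and queue.Queue the role of RabbitMQ. Table and function names are hypothetical; the point is only that the message is sent after the row is durable.

```python
import queue
import sqlite3
import uuid

db = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
db.execute("CREATE TABLE job (id TEXT PRIMARY KEY, operation TEXT, status TEXT)")
broker = queue.Queue()            # stand-in for RabbitMQ


def create_job(operation: str) -> str:
    """Insert the row, commit, and only then publish the job id."""
    job_id = str(uuid.uuid4())
    db.execute(
        "INSERT INTO job (id, operation, status) VALUES (?, ?, 'QUEUED')",
        (job_id, operation),
    )
    db.commit()         # 1. the row is durable first
    broker.put(job_id)  # 2. the message goes out second
    return job_id


def claim_next() -> tuple:
    """Worker side: the row must already exist when the message arrives."""
    job_id = broker.get_nowait()
    row = db.execute("SELECT id, status FROM job WHERE id = ?", (job_id,)).fetchone()
    assert row is not None, "message was published before the commit"
    return row


create_job("ping")
print(claim_next())
```

Reversing the two numbered steps would open a window in which a worker receives an id that no committed row matches yet.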
Proteins router¶
The /proteins router provides read access to the protein and sequence
catalogue. Proteins are not created directly through this router; they are
inserted asynchronously by the insert_proteins operation. The router
exposes list and detail endpoints with filtering by organism and review
status.
Annotations router¶
The /annotations router exposes the GO ontology and annotation set data.
It provides:
Ontology snapshot listing and detail, including GO term counts per aspect.
Annotation set listing and detail.
A BFS ancestor subgraph endpoint (GET /annotations/snapshots/{id}/subgraph) that returns the ancestor closure for a given set of GO term IDs within a snapshot. Used by the frontend to render the GO hierarchy for a prediction result.
The annotations router is split into four sub-modules, each handling one endpoint group.
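The ancestor closure returned by the subgraph endpoint is a breadth-first walk over child-to-parent edges. The sketch below shows the idea on a toy graph; the term labels and the in-memory parents map are placeholders, whereas the real endpoint reads the relationships of a stored snapshot.

```python
from collections import deque

# Placeholder child -> parents map; not real GO terms.
PARENTS = {
    "c": ["b"],
    "b": ["a"],
    "d": ["b", "a"],
    "a": [],
}


def ancestor_subgraph(seeds, parents):
    """Return (nodes, edges) covering the seeds and all of their ancestors."""
    seen = set(seeds)
    edges = set()
    frontier = deque(seeds)
    while frontier:
        term = frontier.popleft()
        for parent in parents.get(term, []):
            edges.add((term, parent))
            if parent not in seen:  # enqueue each term at most once
                seen.add(parent)
                frontier.append(parent)
    return seen, edges


nodes, edges = ancestor_subgraph(["c"], PARENTS)
print(sorted(nodes))  # ['a', 'b', 'c']
print(sorted(edges))  # [('b', 'a'), ('c', 'b')]
```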
Embeddings router¶
The /embeddings router manages embedding configurations and prediction
sets. Embedding configurations are immutable recipes: once created, they
can be referenced by any number of embedding computation and prediction
jobs. Creating a new configuration with different parameters produces a
new UUID, preserving reproducibility.
Prediction sets are created by submitting a predict_go_terms job and
are queryable once the job completes. The
GET /embeddings/prediction-sets/{id}/predictions.tsv endpoint streams
prediction results as a tab-separated file (32 columns including re-ranker
features) using StreamingResponse with yield_per(1000), avoiding
loading the full result set into memory.
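The batching idea behind yield_per(1000) can be illustrated without SQLAlchemy. The generator below uses sqlite3's fetchmany as a stand-in, so at most one batch of rows is held in memory while lines are emitted; the three columns are a hypothetical subset of the real 32.

```python
import sqlite3
from typing import Iterator


def stream_tsv(cursor: sqlite3.Cursor, header: list, batch_size: int = 1000) -> Iterator[str]:
    """Yield TSV lines while holding at most batch_size rows in memory."""
    yield "\t".join(header) + "\n"
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            # NULL values become empty cells.
            yield "\t".join("" if v is None else str(v) for v in row) + "\n"


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE prediction (accession TEXT, go_id TEXT, distance REAL)")
db.executemany(
    "INSERT INTO prediction VALUES (?, ?, ?)",
    [("P1", "GO:1", 0.1), ("P2", "GO:2", None)],
)
cur = db.execute("SELECT accession, go_id, distance FROM prediction ORDER BY accession")
lines = list(stream_tsv(cur, ["accession", "go_id", "distance"], batch_size=1))
print("".join(lines), end="")
```

A generator of this kind is what a streaming response would consume, so the client starts receiving rows before the query has been fully read.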
Scoring router¶
The /scoring router exposes scoring configurations, the training-data
export, and read-only endpoints for applying LightGBM re-ranker models.
In-process re-ranker training was retired in F0/T0.6: boosters are now
trained offline in protea-reranker-lab and registered through the
Reranker models router (POST /reranker-models/import).
Key endpoints:
GET /scoring/prediction-sets/{id}/training-data.tsv: generates a 31-column TSV with binary labels from temporal ground truth, consumed by protea-reranker-lab to fit a booster.
GET /scoring/rerankers, GET /scoring/rerankers/{id}, DELETE /scoring/rerankers/{id}: read/delete operations for registered re-ranker models. Creation lives at POST /reranker-models/import.
GET /scoring/prediction-sets/{id}/rerank.tsv: applies a trained re-ranker to a prediction set, streaming re-scored predictions.
GET /scoring/prediction-sets/{id}/reranker-metrics: computes CAFA-style Fmax and AUC-PR using re-ranker probability scores.
Query sets router¶
The /query-sets router handles user-uploaded FASTA files. On
POST /query-sets, the server parses the multipart upload, creates a
QuerySet row, upserts one Sequence row per unique amino-acid string
(deduplicating by MD5 hash), and creates QuerySetEntry rows preserving
the original FASTA headers. The returned query set ID can then be referenced
in compute_embeddings and predict_go_terms job payloads.
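The upload handling described above amounts to parsing FASTA records, keying each sequence by its MD5 digest, and keeping one entry per original header. The following is a minimal in-memory sketch of that flow; the function names are hypothetical and the dictionaries stand in for the Sequence and QuerySetEntry tables.

```python
import hashlib


def parse_fasta(text: str):
    """Yield (header, sequence) pairs from FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_query_set(text: str):
    """Return (unique sequences keyed by MD5, entries keeping original headers)."""
    sequences = {}  # md5 hex -> amino-acid string, one per unique sequence
    entries = []    # (original header, md5 hex) in upload order
    for header, seq in parse_fasta(text):
        digest = hashlib.md5(seq.encode("ascii")).hexdigest()
        sequences.setdefault(digest, seq)  # "upsert": keep the first copy
        entries.append((header, digest))
    return sequences, entries


fasta = ">sp|A|first\nMKT\nAYI\n>sp|B|second\nMKTAYI\n>sp|C|third\nMAAA\n"
sequences, entries = build_query_set(fasta)
print(len(sequences), len(entries))  # 2 3
```

Two headers that carry the same amino-acid string share one sequence record, while both headers survive as separate entries.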
Annotate router¶
The /annotate router provides a one-click annotation endpoint. It accepts
a FASTA file (or raw text), auto-selects the best available embedding config,
annotation set, and ontology snapshot, creates a QuerySet, and queues a
compute_embeddings job. The response carries all the IDs the frontend needs to chain
predict_go_terms once embeddings finish.
Maintenance router¶
The /maintenance router provides housekeeping endpoints for identifying
and removing orphaned data. Two pairs of preview/execute endpoints handle
orphan sequences (not referenced by any Protein or QuerySetEntry) and
unindexed embeddings (for sequences not referenced by any Protein).
Preview endpoints are read-only; execute endpoints perform the actual deletion.
Admin router¶
The /admin router exposes destructive administrative operations.
It currently provides POST /admin/reset-db, which drops and recreates
the public schema and re-applies all Alembic migrations. The endpoint is
protected by the admin role via require_role() (FARM-AUTH.4).
Showcase router¶
The /showcase router aggregates platform statistics and best evaluation
results for the landing page. It returns protein counts, embedding counts,
prediction counts, best Fmax per aspect per evaluation category (NK/LK/PK),
and a method comparison table, all in a single JSON response.
Support router¶
The /support router handles community feedback. GET /support returns
the total thumbs-up count and recent comments. POST /support submits a
new thumbs-up with an optional comment (max 500 characters).
Benchmark router¶
The /benchmark router powers the per-PLM comparison grid in the UI.
Where /showcase collapses every model into a few buckets and reports
the maximum, this router preserves which embedding produced each number
and which scoring config was used, exposing one stage per distinct
ScoringConfig.name plus an implicit "reranker" stage for evaluations
that used a re-ranker. Stage labels, GO categories, and the baseline tag are
read from protea/config/benchmark.yaml rather than from hardcoded constants.
Datasets router¶
The /datasets router is the registry for frozen re-ranker training
datasets. POST /datasets enqueues an export_research_dataset job
that runs the KNN + feature pipeline, publishes the
train.parquet / eval.parquet / manifest.json triple to the
configured ArtifactStore (local FS or MinIO), and inserts a
Dataset row once the upload completes. GET /datasets and
GET /datasets/{id_or_name} expose the registry to
protea-reranker-lab’s pull_dataset.py and to UI consumers.
POST /datasets/import-by-reference (LB.1) is the lightweight
registration path for datasets whose artefacts already reside in the
artifact store. The caller supplies the name, storage backend, artifact
URIs, content fingerprints (schema_sha, manifest_sha), and dump
parameters verbatim from the lab’s manifest.json; PROTEA inserts a
Dataset row pointing at those URIs without re-running the KNN
pipeline or enqueueing a job. Typical use cases are: replay after a DB
wipe while artefacts remain in MinIO, lab-side dumps produced before
export_research_dataset existed, and the FARM-EXP.2a
placeholder-digest backfill. Optional FK columns
(embedding_config_id, ontology_snapshot_id) are silently set to
NULL when the referenced row is absent in the local DB, matching the
same defensive pattern used by
POST /reranker-models/import-by-reference. The resulting Dataset
row is content-identical to one produced by an in-PROTEA export; the
only visible difference is meta.imported_by_reference = true.
Registry router¶
The /backends, /sources, and /runners endpoints list the plugins
discovered at runtime via importlib.metadata.entry_points for the three
plugin groups: embedding backends, annotation sources, and experiment
runners. The router is intentionally stateless: it re-scans entry points
on every call rather than caching, so a worker that has just been restarted
with a newly-installed extra surfaces in the next request without an API
restart.
Reranker models router¶
The /reranker-models router accepts boosters trained offline in
protea-reranker-lab (or any compatible trainer) and registers them
in PROTEA. POST /reranker-models/import is the multipart flow:
the lab sends model.txt + spec.yaml + run.json inline and
the server uploads model.txt to the artifact store under
rerankers/<run_id>/. POST /reranker-models/import-by-reference
is the production flow: the lab pre-uploads model.txt to MinIO under
its own key and posts JSON with artifact_uri + run_json +
spec_yaml. Both flows share _register_model so the resulting
RerankerModel row is identical.
Stack router¶
The /stack router exposes metadata about the eight-repo PROTEA stack
to the UI. GET /stack returns the registry from
docs/source/_data/stack.yaml. GET /stack/pulls aggregates open
pull requests across every repo in the stack via the GitHub REST API and
caches the result in-process to stay under the unauthenticated 60 req/h
rate limit (set PROTEA_GITHUB_TOKEN to lift to 5000 req/h).
Experiment runs router¶
The /experiment-runs router exposes CRUD over the
ExperimentRun ORM (T4.7-T4.9, decision D11). One row aggregates
multiple Job / EvaluationResult / RerankerModel rows
under a unique human name and carries the narrative trio
(description / hypothesis / findings) plus JSONB
config / provenance and Text[] tags.
PATCH /experiment-runs/{run_id} accepts partial updates; status
transitions stamp started_at (on planned → running) and
finished_at (on running → done or → abandoned)
idempotently: re-entering a state never resets its timestamp.
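The idempotent stamping rule can be expressed in a few lines. This sketch works on a plain dict rather than the ExperimentRun ORM, and it simplifies the rule to "stamp only if the timestamp is still empty"; the function name is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def apply_status(run: dict, new_status: str, now: Optional[datetime] = None) -> dict:
    """Set the status and stamp timestamps without ever overwriting them."""
    now = now or datetime.now(timezone.utc)
    # Entering "running" stamps started_at once.
    if new_status == "running" and run.get("started_at") is None:
        run["started_at"] = now
    # Entering a terminal state stamps finished_at once.
    if new_status in ("done", "abandoned") and run.get("finished_at") is None:
        run["finished_at"] = now
    run["status"] = new_status
    return run


t1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 2, tzinfo=timezone.utc)
run = {"status": "planned"}
apply_status(run, "running", now=t1)
apply_status(run, "running", now=t2)  # re-entering does not reset started_at
print(run["started_at"] == t1)        # True
```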
Services layer¶
Each router delegates non-trivial business logic to a service module. Services are pure Python: they accept a SQLAlchemy session and return domain objects or raise domain exceptions. Routers map those exceptions to HTTP status codes. This separation allows the same logic to be exercised from CLI tools or batch scripts without importing FastAPI. Full symbol-level documentation lives in Services.
Authentication helpers
protea.api.auth implements the credential-verification layer. It
exposes require_api_key_or_bearer, a FastAPI dependency that accepts
three header forms (Authorization: ApiKey, X-Api-Key, or
Authorization: Bearer). The API-key path computes a SHA-256 hash of
the raw key and compares it against the database; the Bearer path verifies
an HS256 JWT. A missing or invalid credential returns 401 with a
WWW-Authenticate challenge.
protea.api.bearer provides the HS256 JWT verification utilities used
by auth.require_api_key_or_bearer. Minimum required claims are
sub, iat, and exp.
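HS256 verification with the three required claims can be sketched with the standard library alone. This is not the protea.api.bearer code: the function names are hypothetical, and sign_hs256 exists only so the verifier can be exercised.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_decode(part: str) -> bytes:
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _sign(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()


def sign_hs256(claims: dict, secret: str) -> str:
    """Build a token (used here only to exercise the verifier)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    return f"{header}.{payload}.{_b64url_encode(_sign(f'{header}.{payload}', secret))}"


def verify_hs256(token: str, secret: str, now: Optional[float] = None) -> dict:
    """Return the claims if the token is valid, otherwise raise ValueError."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except Exception as exc:
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    # Constant-time comparison of the recomputed signature.
    if not hmac.compare_digest(signature, _sign(f"{header_b64}.{payload_b64}", secret)):
        raise ValueError("bad signature")
    if not isinstance(claims, dict):
        raise ValueError("claims must be an object")
    for claim in ("sub", "iat", "exp"):  # minimum required claims
        if claim not in claims:
            raise ValueError(f"missing claim: {claim}")
    if claims["exp"] <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims


token = sign_hs256({"sub": "alice", "iat": 0, "exp": 100}, "dev-secret")
print(verify_hs256(token, "dev-secret", now=50)["sub"])  # alice
```

In the API, any of these ValueError cases would correspond to the 401 response described above.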
protea.api.auth_api_keys is the router for managing API key creation
and revocation.
Request caching and rate limiting
protea.api.cache provides in-process caching utilities for expensive
read-only endpoints (showcase statistics, benchmark matrix). Results are
stored with a configurable TTL, reducing redundant database queries on
frequently-polled pages.
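An in-process TTL cache of the kind described here can be as small as the sketch below. The class and method names are hypothetical and the cached value is a placeholder; the real utilities in protea.api.cache may expose a different interface.

```python
import time


class TTLCache:
    """Keep computed values for ttl_seconds, then recompute on the next read."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}     # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]    # still fresh: skip the expensive call
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value


cache = TTLCache(ttl_seconds=30)
stats = cache.get_or_compute("showcase", lambda: {"proteins": 0})  # placeholder payload
print(stats)
```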
protea.api.rate_limit configures the slowapi limiter and exposes
the per-principal rate-limit rules applied to the five write routes
protected by authentication (POST /jobs, POST /datasets,
POST /datasets/import-by-reference, POST /reranker-models/import,
POST /reranker-models/import-by-reference).
Shared dependencies and error handling
protea.api.deps provides FastAPI Depends callables shared across
multiple routers: database session injection, current-user extraction,
and pagination helpers.
protea.api.problem_details implements RFC 7807
application/problem+json error serialisation. Every exception handler
in the application calls into this module to produce a consistent
{"type", "title", "status", "detail", "instance"} body. Validation
errors carry an additional errors array with the offending field paths.
Middleware
protea.api.middleware.visitor_counter is the ASGI middleware that
logs one VisitorEvent row per HTTP GET to a non-asset path. It
extracts the client IP, combines it with a daily salt, and stores the
first 16 hex characters of the resulting SHA-256 hash.
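The anonymisation step can be sketched as follows. How the salt and IP are combined is not specified on this page, so the colon-separated concatenation below is an assumption, as is the function name.

```python
import hashlib


def visitor_token(client_ip: str, daily_salt: str) -> str:
    """Return the first 16 hex characters of SHA-256(salt + IP)."""
    # Assumption: salt and IP are joined with ":" before hashing.
    digest = hashlib.sha256(f"{daily_salt}:{client_ip}".encode("utf-8")).hexdigest()
    return digest[:16]


# 203.0.113.7 is a documentation-range address; the salts are placeholders.
today = visitor_token("203.0.113.7", "salt-for-day-1")
tomorrow = visitor_token("203.0.113.7", "salt-for-day-2")
print(len(today), today != tomorrow)
```

Because the salt rotates daily, the same address yields unrelated tokens on different days, which limits long-term tracking.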
Metrics router
The /metrics router exposes Prometheus-compatible scrape metrics for
the API process. Response time histograms, active-connection gauges, and
job-state counters are surfaced at GET /metrics.
Authentication and rate limits¶
Five POST routes require a credential (T5.6a + T5.6b):
POST /v1/jobs
POST /v1/datasets
POST /v1/datasets/import-by-reference
POST /v1/reranker-models/import
POST /v1/reranker-models/import-by-reference
Three header forms are accepted, any one of which satisfies the gate:
Authorization: ApiKey <raw_key>
X-Api-Key: <raw_key>
Authorization: Bearer <jwt>
The API key path uses protea.api.auth.require_api_key_or_bearer()
(SHA-256 hash verification). The Bearer path uses HS256 with the
PROTEA_JWT_SECRET env var; minimum token claims are sub,
iat, and exp. A missing or invalid credential returns 401 with
WWW-Authenticate: ApiKey, Bearer. Rate limits on these routes are
enforced by slowapi per principal (API-key prefix or JWT sub);
exceeding the limit returns 429 with a Retry-After header.
See Authentication for the complete auth and rate-limit
reference, and Configuration Reference for the
PROTEA_AUTHN_REQUIRED, PROTEA_JWT_SECRET, and
PROTEA_RATELIMIT_* knobs.
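On the client side, the header forms and the 429 behaviour described above might be handled with helpers like these. Both functions are hypothetical; the retry helper only understands the delta-seconds form of Retry-After.

```python
from typing import Mapping, Optional


def auth_headers(api_key: Optional[str] = None, jwt: Optional[str] = None) -> dict:
    """Pick one accepted header form; any single one satisfies the gate."""
    if jwt is not None:
        return {"Authorization": f"Bearer {jwt}"}
    if api_key is not None:
        # Equivalent alternative: {"Authorization": f"ApiKey {api_key}"}
        return {"X-Api-Key": api_key}
    raise ValueError("either api_key or jwt is required")


def retry_delay(status: int, headers: Mapping[str, str], default: float = 1.0) -> Optional[float]:
    """Seconds to wait before retrying a 429, or None when no retry applies."""
    if status != 429:
        return None
    value = headers.get("Retry-After", "")
    # An HTTP-date value is not parsed here and falls back to the default.
    return float(value) if value.isdigit() else default


print(auth_headers(api_key="<raw_key>"))
print(retry_delay(429, {"Retry-After": "30"}))  # 30.0
```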
Endpoints summary¶
| Method | Path | Description |
|---|---|---|
| Health | | |
| | | Liveness probe: returns 200 if the API process is up. |
| | | Readiness probe: verifies database and RabbitMQ connections. |
| Jobs | | |
| | | Create a job and publish its UUID to RabbitMQ. |
| | | List jobs; filter by |
| | | Retrieve a single job with full payload and meta. |
| | | Retrieve the event timeline for a job (up to 2 000 events). Cursor pagination (T4.2): pass |
| | | Transition a |
| | | Delete a job that is not in |
| | | Append a |
| | | List the |
| Proteins | | |
| | | Aggregate protein statistics (total, canonical, reviewed, organisms). |
| | | List proteins with pagination; filter by |
| | | Retrieve a single protein with its UniProt metadata. |
| | | List GO annotations for a protein across all annotation sets. |
| Annotations | | |
| | | List ontology snapshots with GO term counts per aspect. |
| | | Retrieve a snapshot with its full list of GO terms. |
| | | Set the Information Accretion (IA) file URL on an ontology snapshot. |
| | | Queue a |
| | | BFS ancestor subgraph for a given set of GO term IDs. |
| | | List annotation sets with protein GO annotation counts. |
| | | Retrieve a single annotation set with summary statistics. |
| | | Delete an annotation set and all its annotations. |
| | | Queue a |
| | | Queue a |
| | | Queue a |
| | | List evaluation sets with summary statistics. |
| | | Get evaluation set details. |
| | | Delete an evaluation set. |
| | | Download NK ground truth in CAFA format. |
| | | Download LK ground truth in CAFA format. |
| | | Download PK ground truth in CAFA format. |
| | | Download known terms from old annotation set (for PK evaluation). |
| | | Download delta proteins as FASTA. |
| | | Queue a |
| | | List evaluation results for an evaluation set. |
| | | Download evaluation metrics as TSV. |
| | | Download all cafaeval artifacts as a zip. |
| | | Delete an evaluation result. |
| Embeddings | | |
| | | List all embedding configurations. |
| | | Create a new (immutable) embedding configuration. |
| | | Retrieve an embedding configuration by UUID. |
| | | Delete an embedding configuration. |
| | | Queue a |
| | | List prediction sets with entry counts. |
| | | Retrieve a prediction set with summary statistics. |
| | | List proteins in a prediction set. |
| | | Get predictions for one protein. |
| | | GO term distribution in a prediction set. |
| | | Stream all predictions as TSV (filtered by accession / aspect / distance). |
| | | Download predictions in CAFA submission format. |
| | | Delete a prediction set. |
| Scoring | | |
| | | List scoring configurations. |
| | | Create a scoring configuration. |
| | | Create preset scoring configurations. |
| | | Retrieve a scoring configuration. |
| | | Delete a scoring configuration. |
| | | Stream scored predictions as TSV. |
| | | Compute CAFA-style metrics for scored predictions. |
| | | Export labeled training data for the re-ranker. |
| | | List all trained re-ranker models. |
| | | Retrieve a re-ranker model’s metadata, metrics, and feature importance. |
| | | Delete a trained re-ranker model. |
| | | Apply a re-ranker to a prediction set and stream re-scored TSV. |
| | | Compute CAFA Fmax and AUC-PR using re-ranker scores. |
| Query Sets | | |
| | | Upload a FASTA file and create a |
| | | List all query sets with entry counts. |
| | | Retrieve a query set with its full entry list. |
| | | Delete a query set and all its entries. |
| Annotate | | |
| | | One-click annotation: upload FASTA, auto-run the full pipeline. |
| Maintenance | | |
| | | Count orphan sequences (preview). |
| | | Delete orphan sequences. |
| | | Count unindexed embeddings (preview). |
| | | Delete unindexed embeddings. |
| Admin | | |
| | | Drop and recreate the public schema (requires admin token). |
| Showcase | | |
| | | Platform statistics and best evaluation results. |
| Support | | |
| | | Total thumbs-up count and recent comments. |
| | | Submit a thumbs-up with optional comment. |
| Benchmark | | |
| | | List embedding configs with persisted display metadata. |
| | | Per-embedding / per-stage Fmax matrix across all evaluation results. |
| Datasets | | |
| | | Enqueue an |
| | | Register a |
| | | List registered re-ranker datasets. Cursor pagination (T4.2): pass |
| | | Get a dataset by id or name. |
| Plugin Registry | | |
| | | List installed embedding-backend plugins. |
| | | List installed annotation-source plugins. |
| | | List installed experiment-runner plugins. |
| Reranker Models | | |
| | | Import a lab-trained booster (multipart). |
| | | Import a booster already uploaded to the artifact store (JSON). |
| Stack | | |
| | | Return the eight-repo PROTEA stack registry. |
| | | Aggregate open pull requests across every repo in the stack. |
| Experiment Runs | | |
| | | Create an |
| | | List experiment runs newest-first; filter by |
| | | Retrieve one experiment run. |
| | | Partial update (T4.9). Status transitions stamp |
| | | Delete an experiment run (returns 204). |
Request body for POST /jobs¶
The operation and queue_name fields are required. payload is
passed verbatim to the operation’s execute method after Pydantic
validation; its schema depends on the operation. meta is stored on
the Job row and never interpreted by the API. description and
tags are optional D11 narrative fields surfaced on the
GET /jobs and GET /jobs/{id} responses; they let any caller
attach human intent and ad-hoc grouping tokens at submission time
without round-tripping through a separate metadata endpoint.
{
    "operation": "insert_proteins",
    "queue_name": "protea.jobs",
    "payload": {
        "search_criteria": "reviewed:true AND organism_id:9606"
    },
    "meta": {},
    "description": "Backfill reviewed Swiss-Prot for benchmark_v1",
    "tags": ["ablation", "benchmark_v1"]
}
Common payload examples by operation:
{
    "operation": "fetch_uniprot_metadata",
    "queue_name": "protea.jobs",
    "payload": { "search_criteria": "reviewed:true AND organism_id:9606" }
}

{
    "operation": "compute_embeddings",
    "queue_name": "protea.embeddings",
    "payload": { "embedding_config_id": "<uuid>", "sequences_per_job": 64 }
}

{
    "operation": "predict_go_terms",
    "queue_name": "protea.predictions",
    "payload": {
        "embedding_config_id": "<uuid>",
        "annotation_set_id": "<uuid>",
        "ontology_snapshot_id": "<uuid>",
        "query_set_id": "<uuid>",
        "limit_per_entry": 5
    }
}
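A client might assemble one of these submissions as shown below. The base URL and the helper name are assumptions for a local deployment, the API key is a placeholder, and the request is only built here, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # hypothetical deployment address


def build_create_job_request(operation, queue_name, payload=None, api_key="<raw_key>", **extra):
    """Build (but do not send) a POST /v1/jobs request with a JSON body."""
    body = {"operation": operation, "queue_name": queue_name, "payload": payload or {}}
    body.update(extra)  # optional fields such as meta, description, tags
    return urllib.request.Request(
        f"{BASE_URL}/v1/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Api-Key": api_key},
        method="POST",
    )


req = build_create_job_request(
    "insert_proteins",
    "protea.jobs",
    payload={"search_criteria": "reviewed:true AND organism_id:9606"},
    tags=["ablation", "benchmark_v1"],
)
print(req.get_method(), req.full_url)
# To submit against a running server: urllib.request.urlopen(req)
```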
See also
Operations: every operation referenced in a payload, with field-level documentation.
How-to Guides: concrete curl recipes that submit each endpoint end-to-end.
Job Lifecycle: how the API turns a request into a persistent Job row and a queue message.