Architecture
This section describes the runtime architecture of PROTEA: its components, data model, job lifecycle, and extension points. Each page focuses on one concern and links to the others where they intersect.
- System Overview
The four horizontal layers (presentation, API, worker, data), the ten RabbitMQ queues that connect them, and how a typical request flows through the stack from FASTA upload to stored prediction.
- Job Lifecycle
The two-session `BaseWorker` pattern, parent-child coordinator jobs, `RetryLaterError` for serialised resources, atomic progress counters, and the soft-cancellation contract.
- Data Model
The relational schema in five logical groups (sequences and proteins, ontology and annotations, embeddings, predictions, query sets and jobs) with the deduplication and versioning rules that make every prediction reproducible.
- Operations
The `Operation` protocol that unifies every unit of domain logic, the `OperationRegistry`, and reference documentation for every operation shipped with PROTEA (ingestion, embeddings, predictions, evaluation).
- CAFA Evaluation Protocol
The CAFA temporal-holdout protocol, the NK/LK/PK classification, and the end-to-end evaluation workflow used to produce the figures in Results.
- Orchestration
How PROTEA relates to the rest of the working tree: the satellite repositories, the optional `agent-farm` orchestration system, and the contract surface (HTTP API + artefact store) the platform exposes for automated consumption.
- Authentication
The four-role authentication system (guest/researcher/operator/admin) shipped in FARM-AUTH.1-11 (ADR D37): human email+password login, API-key programmatic access, session revocation, per-user quotas, optional SMTP, and an audit log.
- Multi-Stage Pipeline Contract
The shared coordinator/fan-out/collect contract (`MultiStagePayload`, `StageArtifactStore`, `PipelineStage`, `Coordinator`) that the three production pipelines will converge onto.
- Export Coordinator
The minijob-based dataset export pipeline: the `PROTEA_EXPORT_MINIJOBS` env gate, fan-out into KNN batch minijobs, fast-fail pre-flight, aggregate failure semantics, accepted search backends (numpy/faiss/torch), and `annotation_set_id` auto-derivation.
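As a rough illustration of the idea behind the Operations page (a single protocol for every unit of domain logic plus a registry that looks operations up by name), here is a minimal sketch. It is not PROTEA's code: the `name` attribute, the `execute` method, the `register`/`get` methods, and the `CountSequences` toy operation are all assumptions made for the example; only the names `Operation` and `OperationRegistry` come from the text above.

```python
from typing import Any, Protocol


class Operation(Protocol):
    """Hypothetical shape of a unit of domain logic (assumed, not PROTEA's API)."""

    name: str

    def execute(self, payload: dict[str, Any]) -> dict[str, Any]: ...


class OperationRegistry:
    """Maps operation names to instances so a worker can dispatch by name."""

    def __init__(self) -> None:
        self._operations: dict[str, Operation] = {}

    def register(self, operation: Operation) -> None:
        # Refuse silent overwrites: two operations must not share a name.
        if operation.name in self._operations:
            raise ValueError(f"duplicate operation: {operation.name}")
        self._operations[operation.name] = operation

    def get(self, name: str) -> Operation:
        try:
            return self._operations[name]
        except KeyError:
            raise KeyError(f"unknown operation: {name}") from None


class CountSequences:
    """Toy operation used only for illustration."""

    name = "count_sequences"

    def execute(self, payload: dict[str, Any]) -> dict[str, Any]:
        return {"count": len(payload["sequences"])}


registry = OperationRegistry()
registry.register(CountSequences())
result = registry.get("count_sequences").execute({"sequences": ["MKT", "MVL"]})
```

Because `Operation` is a structural protocol, `CountSequences` satisfies it without inheriting from anything; the registry only depends on the attribute and method being present.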
Architecture Decision Records
The pages above describe what the architecture looks like today. The Architecture Decision Records explain why each major decision was taken: the constraint, the rejected alternatives, and the trade-off that closed the question.
| ADR | Decision | Problem it solves |
|---|---|---|
| | KNN on CPU, not pgvector or GPU | pgvector does not scale to 500K+ vectors; the GPU must stay free for embedding inference |
| | Two-session worker pattern | A mid-operation crash used to leave the job invisible to monitoring |
| | | Thousands of batch jobs per pipeline flooded the |
| | Dead-letter queue and retry strategy | Failed messages were silently lost; retries without backoff amplified failures |
| | Reusable RabbitMQ connections | A coordinator dispatching 500 batches opened 500 TCP connections |
| | Sequence deduplication by MD5 | Tens of thousands of duplicate Swiss-Prot sequences wasted GPU hours |
The full ADR index lives at Architecture Decision Records.
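To make the last decision in the table concrete, the sketch below shows one way sequence deduplication by MD5 can work: identical sequences hash to the same digest, so each unique sequence needs to be embedded only once while every accession still resolves to it. The function names, the `(accession, sequence)` record shape, and the whitespace/upper-case normalisation are assumptions for illustration, not PROTEA's actual rules.

```python
import hashlib


def sequence_md5(sequence: str) -> str:
    """Return the MD5 hex digest of a sequence after (assumed) normalisation."""
    normalised = sequence.strip().upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


def deduplicate(records):
    """Collapse (accession, sequence) pairs onto unique sequences.

    Returns two dicts: digest -> sequence, and accession -> digest.
    """
    unique = {}
    accession_to_digest = {}
    for accession, sequence in records:
        digest = sequence_md5(sequence)
        # Keep the first copy of each sequence; later duplicates only add a mapping.
        unique.setdefault(digest, sequence.strip().upper())
        accession_to_digest[accession] = digest
    return unique, accession_to_digest


# Hypothetical accessions; the first two share one sequence.
records = [
    ("ACC_A", "MKTAYIAKQR"),
    ("ACC_B", "mktayiakqr"),
    ("ACC_C", "MVLSPADKTN"),
]
unique, accession_to_digest = deduplicate(records)
```

With this layout, expensive per-sequence work (such as embedding inference) can iterate over `unique` only, and results are joined back to proteins through `accession_to_digest`.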