Architecture

This section describes the runtime architecture of PROTEA: its components, data model, job lifecycle, and extension points. Each page focuses on one concern and links to the others where they intersect.

System Overview: The four horizontal layers (presentation, API, worker, data), the ten RabbitMQ queues that connect them, and how a typical request flows through the stack from FASTA upload to stored prediction.
Job Lifecycle: The two-session BaseWorker pattern, parent-child coordinator jobs, RetryLaterError for serialised resources, atomic progress counters, and the soft-cancellation contract.
Data Model: The relational schema in five logical groups (sequences and proteins, ontology and annotations, embeddings, predictions, query sets and jobs) with the deduplication and versioning rules that make every prediction reproducible.
Operations: The Operation protocol that unifies every unit of domain logic, the OperationRegistry, and reference documentation for every operation shipped with PROTEA (ingestion, embeddings, predictions, evaluation).
CAFA Evaluation Protocol: The CAFA temporal-holdout protocol, the NK/LK/PK classification, and the end-to-end evaluation workflow used to produce the figures in Results.
Orchestration: How PROTEA relates to the rest of the working tree: the satellite repositories, the optional agent-farm orchestration system, and the contract surface (HTTP API + artefact store) the platform exposes for automated consumption.
Authentication: The four-role authentication system (guest/researcher/operator/admin) shipped in FARM-AUTH.1-11 (ADR D37): human email+password login, API-key programmatic access, session revocation, per-user quota, optional SMTP, and an audit log.
Multi-Stage Pipeline Contract: The shared coordinator/fan-out/collect contract (MultiStagePayload, StageArtifactStore, PipelineStage, Coordinator) that the three production pipelines will converge onto.
Export Coordinator: The minijob-based dataset export pipeline: PROTEA_EXPORT_MINIJOBS env gate, fan-out into KNN batch minijobs, fast-fail pre-flight, aggregate failure semantics, accepted search backends (numpy/faiss/torch), and annotation_set_id auto-derivation.
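
To make the Operations entry above more concrete, the following is a minimal sketch of how a protocol plus a name-keyed registry can unify units of domain logic. It is an illustration under stated assumptions, not PROTEA's actual API: the `name` attribute, the `execute` method and its signature, the `register`/`get` methods, and the `CountSequences` toy operation are all hypothetical. Only the names `Operation` and `OperationRegistry` come from this page.

```python
from typing import Any, Protocol


class Operation(Protocol):
    """Hypothetical shape of an operation: a name plus an execute method."""

    name: str

    def execute(self, payload: dict[str, Any]) -> dict[str, Any]: ...


class OperationRegistry:
    """Maps operation names to instances so a worker can dispatch by name."""

    def __init__(self) -> None:
        self._operations: dict[str, Operation] = {}

    def register(self, operation: Operation) -> None:
        # Reject duplicates so two operations cannot claim the same name.
        if operation.name in self._operations:
            raise ValueError(f"operation already registered: {operation.name}")
        self._operations[operation.name] = operation

    def get(self, name: str) -> Operation:
        try:
            return self._operations[name]
        except KeyError:
            raise KeyError(f"unknown operation: {name}") from None


class CountSequences:
    """Toy operation used only for illustration; not part of PROTEA."""

    name = "count_sequences"

    def execute(self, payload: dict[str, Any]) -> dict[str, Any]:
        return {"count": len(payload["sequences"])}


registry = OperationRegistry()
registry.register(CountSequences())
result = registry.get("count_sequences").execute({"sequences": ["MKT", "MKV"]})
```

Because `Operation` is a structural protocol, `CountSequences` satisfies it without inheriting from anything; the Operations page documents the real contract.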

Architecture Decision Records

The pages above describe what the architecture looks like today. The Architecture Decision Records explain why each major decision was taken: the constraint, the rejected alternatives, and the trade-off that closed the question.

ADR	Decision	Problem it solves
001	KNN on CPU, not pgvector or GPU	pgvector does not scale to 500K+ vectors; the GPU must stay free for embedding inference
002	Two-session worker pattern	A mid-operation crash used to leave the job invisible to monitoring
003	`QueueConsumer` vs. `OperationConsumer`	Thousands of batch jobs per pipeline flooded the `jobs` table
004	Dead-letter queue and retry strategy	Failed messages were silently lost; retries without backoff amplified failures
005	Reusable RabbitMQ connections	A coordinator dispatching 500 batches opened 500 TCP connections
006	Sequence deduplication by MD5	Tens of thousands of duplicate Swiss-Prot sequences wasted GPU hours

The full ADR index lives at Architecture Decision Records.
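
As an illustration of the idea behind ADR 006, the sketch below deduplicates sequences by their MD5 digest so that each distinct sequence would be processed only once, however many accessions share it. It is a minimal example under assumptions, not the PROTEA implementation: the function names, the upper-casing and whitespace stripping before hashing, and the accessions in the usage lines are all hypothetical.

```python
import hashlib


def sequence_md5(sequence: str) -> str:
    # Assumption: sequences are normalised (trimmed, upper-cased) before hashing.
    normalised = sequence.strip().upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


def deduplicate(
    records: list[tuple[str, str]],
) -> tuple[dict[str, str], dict[str, str]]:
    """Return (digest -> unique sequence, accession -> digest)."""
    unique: dict[str, str] = {}
    accession_to_digest: dict[str, str] = {}
    for accession, sequence in records:
        digest = sequence_md5(sequence)
        # Keep the first sequence seen for each digest; later duplicates are skipped.
        unique.setdefault(digest, sequence.strip().upper())
        accession_to_digest[accession] = digest
    return unique, accession_to_digest


# Hypothetical accessions: ACC_A and ACC_B share one sequence.
records = [("ACC_A", "MKTAYIAK"), ("ACC_B", "mktayiak"), ("ACC_C", "MSLLTEVE")]
unique, accession_to_digest = deduplicate(records)
```

Here three records collapse to two unique sequences, and both `ACC_A` and `ACC_B` point at the same digest, so any per-sequence work keyed on the digest runs once for the pair.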