Architecture

This section describes the runtime architecture of PROTEA: its components, data model, job lifecycle, and extension points. Each page focuses on one concern and links to the others where they intersect.

System Overview

The four horizontal layers (presentation, API, worker, data), the ten RabbitMQ queues that connect them, and how a typical request flows through the stack from FASTA upload to stored prediction.

Job Lifecycle

The two-session BaseWorker pattern, parent-child coordinator jobs, RetryLaterError for serialised resources, atomic progress counters, and the soft-cancellation contract.
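As a rough illustration of the two-session idea and RetryLaterError, the sketch below commits a status change in one short session before running the work in a second one. It is not PROTEA's BaseWorker: the `jobs` and `results` tables, the status strings, and the helpers `set_status`, `run_job`, and `demo` are assumptions made for this example, and SQLite stands in for the real database.

```python
import os
import sqlite3
import tempfile


class RetryLaterError(Exception):
    """Hypothetical stand-in: a serialised resource is busy, so requeue the job."""


def set_status(db_path, job_id, status):
    # Short-lived session: open, update, commit, close.
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))
    finally:
        conn.close()


def run_job(db_path, job_id, operation):
    # Session 1: mark the job as running and commit before any work starts,
    # so monitoring can see it even if the process dies mid-operation.
    set_status(db_path, job_id, "running")

    # Session 2: run the operation in its own transaction.
    conn = sqlite3.connect(db_path)
    try:
        with conn:
            operation(conn)
        outcome = "succeeded"
    except RetryLaterError:
        outcome = "queued"  # partial writes rolled back; job returns to the queue
    except Exception:
        outcome = "failed"  # partial writes rolled back; the failure stays visible
    finally:
        conn.close()

    set_status(db_path, job_id, outcome)
    return outcome


def demo():
    def failing_operation(conn):
        conn.execute("INSERT INTO results (job_id, value) VALUES (1, 'partial')")
        raise RuntimeError("simulated mid-operation failure")

    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "jobs.db")
        conn = sqlite3.connect(db_path)
        with conn:
            conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
            conn.execute("CREATE TABLE results (job_id INTEGER, value TEXT)")
            conn.execute("INSERT INTO jobs (id, status) VALUES (1, 'queued')")
        conn.close()

        outcome = run_job(db_path, 1, failing_operation)

        conn = sqlite3.connect(db_path)
        status = conn.execute("SELECT status FROM jobs WHERE id = 1").fetchone()[0]
        rows = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
        conn.close()
        return outcome, status, rows
```

Because the "running" status is committed separately from the work, a hard crash during the second session would still leave a visible job row, while the partial result is rolled back.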

Data Model

The relational schema in five logical groups (sequences and proteins, ontology and annotations, embeddings, predictions, query sets and jobs) with the deduplication and versioning rules that make every prediction reproducible.
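The deduplication rule can be pictured with a small sketch. ADR 006 below names MD5 as the deduplication key; everything else here is an assumption for illustration, including the normalisation step (strip whitespace, upper-case) and the helper names `sequence_md5` and `deduplicate`.

```python
import hashlib


def sequence_md5(sequence: str) -> str:
    # Assumed normalisation: drop whitespace and upper-case before hashing.
    normalised = "".join(sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


def deduplicate(records):
    """records: iterable of (accession, sequence) pairs.

    Returns (unique, accessions): each distinct sequence stored once under
    its MD5 key, and every accession pointing at the key of its sequence.
    """
    unique = {}      # md5 -> normalised sequence
    accessions = {}  # accession -> md5
    for accession, sequence in records:
        key = sequence_md5(sequence)
        unique.setdefault(key, "".join(sequence.split()).upper())
        accessions[accession] = key
    return unique, accessions
```

Under this scheme, expensive per-sequence work such as embedding would run once per key in `unique` rather than once per accession.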

Operations

The Operation protocol that unifies every unit of domain logic, the OperationRegistry, and reference documentation for every operation shipped with PROTEA (ingestion, embeddings, predictions, evaluation).
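To make the protocol-plus-registry idea concrete, here is a minimal sketch using `typing.Protocol`. The member names (`name`, `execute`), the registry methods (`register`, `get`), and the `CountSequences` example operation are hypothetical; the Operations page documents the actual interface.

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class Operation(Protocol):
    """Assumed shape: a named unit of domain logic taking a payload dict."""

    name: str

    def execute(self, payload: Dict[str, Any]) -> Dict[str, Any]: ...


class OperationRegistry:
    def __init__(self) -> None:
        self._operations: Dict[str, Operation] = {}

    def register(self, operation: Operation) -> None:
        # Refuse duplicates so a name always resolves to one operation.
        if operation.name in self._operations:
            raise ValueError(f"operation already registered: {operation.name}")
        self._operations[operation.name] = operation

    def get(self, name: str) -> Operation:
        try:
            return self._operations[name]
        except KeyError:
            raise KeyError(f"unknown operation: {name}") from None


class CountSequences:
    """Toy operation; it satisfies the protocol structurally, without inheritance."""

    name = "count_sequences"

    def execute(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        return {"count": len(payload.get("sequences", []))}
```

A worker built this way only needs the operation name from the job message: it looks the operation up in the registry and calls `execute` with the payload.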

CAFA Evaluation Protocol

The CAFA temporal-holdout protocol, the NK/LK/PK classification, and the end-to-end evaluation workflow used to produce the figures in Results.
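The NK/LK/PK split can be sketched as a small function, assuming the usual CAFA reading: no-knowledge (NK) proteins had no experimental annotation in any aspect at the earlier snapshot, limited-knowledge (LK) proteins had none in the target aspect but some in another, and partial-knowledge (PK) proteins already had some in the target aspect and gained new terms. The data layout and the name `classify_protein` are assumptions for this example; the CAFA Evaluation Protocol page is authoritative for PROTEA's exact rules.

```python
# The three Gene Ontology aspects: molecular function, biological process,
# cellular component.
ASPECTS = ("MF", "BP", "CC")


def classify_protein(before, after, aspect):
    """Classify one protein for one target aspect.

    before / after: dict mapping aspect -> set of experimentally annotated
    GO term ids at the earlier and later snapshot (assumed layout).
    Returns "NK", "LK", "PK", or None if no new terms were gained.
    """
    gained = after.get(aspect, set()) - before.get(aspect, set())
    if not gained:
        return None  # not a benchmark protein for this aspect
    if before.get(aspect):
        return "PK"  # prior knowledge in the target aspect
    if any(before.get(other) for other in ASPECTS if other != aspect):
        return "LK"  # prior knowledge only in other aspects
    return "NK"      # no prior experimental knowledge at all
```

For instance, a protein with experimental BP terms at the earlier snapshot that gains its first MF term later would count as LK for MF under these assumptions.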

Orchestration

How PROTEA relates to the rest of the working tree: the satellite repositories, the optional agent-farm orchestration system, and the contract surface (HTTP API + artefact store) the platform exposes for automated consumption.

Authentication

The four-role authentication system (guest/researcher/operator/admin) shipped in FARM-AUTH.1-11 (ADR D37). It covers human email+password login, API-key programmatic access, session revocation, per-user quota, optional SMTP, and the audit log.

Multi-Stage Pipeline Contract

The shared coordinator/fan-out/collect contract (MultiStagePayload, StageArtifactStore, PipelineStage, Coordinator) that the three production pipelines will converge on.

Export Coordinator

The minijob-based dataset export pipeline: PROTEA_EXPORT_MINIJOBS env gate, fan-out into KNN batch minijobs, fast-fail pre-flight, aggregate failure semantics, accepted search backends (numpy/faiss/torch), and annotation_set_id auto-derivation.

Architecture Decision Records

The pages above describe what the architecture looks like today. The Architecture Decision Records explain why each major decision was taken: the constraint, the rejected alternatives, and the trade-off that closed the question.

ADR  Decision                              Problem it solves
---  ------------------------------------  ----------------------------------------------------------------------------------------
001  KNN on CPU, not pgvector or GPU       pgvector does not scale to 500K+ vectors; the GPU must stay free for embedding inference
002  Two-session worker pattern            A mid-operation crash used to leave the job invisible to monitoring
003  QueueConsumer vs. OperationConsumer   Thousands of batch jobs per pipeline flooded the jobs table
004  Dead-letter queue and retry strategy  Failed messages were silently lost; retries without backoff amplified failures
005  Reusable RabbitMQ connections         A coordinator dispatching 500 batches opened 500 TCP connections
006  Sequence deduplication by MD5         Tens of thousands of duplicate Swiss-Prot sequences wasted GPU hours
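
The retry problem behind ADR 004 can be illustrated with a generic capped exponential backoff. This is a sketch of the general technique, not the strategy the ADR records: the function name `retry_decision`, the attempt limit, and the delay values are all assumptions.

```python
def retry_decision(attempt, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Decide what to do with a message after a failed delivery.

    attempt: how many deliveries have failed so far (1 after the first failure).
    Returns ("retry", delay_seconds) or ("dead_letter", None).
    All default values are illustrative.
    """
    if attempt >= max_attempts:
        # Give up and park the message where it can be inspected,
        # instead of dropping it silently.
        return ("dead_letter", None)
    # Double the wait after each failure, up to a ceiling, so a struggling
    # dependency is not hit by immediate retries.
    delay = min(max_delay, base_delay * 2 ** (attempt - 1))
    return ("retry", delay)
```

With these defaults the waits grow as 1, 2, 4, and 8 seconds, and the fifth failure routes the message to the dead-letter queue.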

The full ADR index lives at Architecture Decision Records.