Introduction
The legacy coupling problem
The Protein Information System (PIS) and FANTASIA established foundational infrastructure for protein data ingestion and functional annotation at scale. However, both systems share a structural limitation: their workers conflate multiple concerns into single classes.
A typical PIS/FANTASIA worker manages its own database session, connects directly to the message broker, orchestrates task sequencing, and executes domain logic, all in the same class. This coupling produces code that is difficult to unit-test (because all infrastructure must be mocked at once), hard to extend (because adding a new operation requires understanding the entire execution context), and fragile under failure (because a queue disconnect or DB error can leave jobs in ambiguous states with no audit trail).
The PROTEA approach
PROTEA is architected around a deliberate separation of three layers:
- Infrastructure layer (protea/infrastructure/)
  Manages database sessions, connection factories, configuration loading, and the RabbitMQ transport. This layer knows nothing about domain logic.
- Execution layer (protea/workers/)
  Orchestrates the job lifecycle: claiming a job, dispatching it to the correct operation, and recording the outcome. The BaseWorker uses two independent sessions by design, one to claim the job (QUEUED → RUNNING) and one to execute it, so that even a mid-execution crash leaves the DB in a consistent, inspectable state.
- Domain layer (protea/core/)
  Pure domain logic. Each Operation receives an open session and an emit callback; it returns an OperationResult. Operations do not manage sessions, queues, or HTTP routing. They are individually testable with a mocked session and a no-op emit function.
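The two-session design of the execution layer can be illustrated with a minimal sketch. It is not PROTEA's actual BaseWorker: it uses the standard-library sqlite3 module in place of the real session factory, and the job table, the function names (make_db, claim_job, execute_job, job_status), and the SUCCEEDED and FAILED status values are assumptions made for illustration. Only the QUEUED → RUNNING transition comes from the description above.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import Callable


def make_db() -> str:
    """Create a throwaway database holding one QUEUED job (hypothetical schema)."""
    path = str(Path(tempfile.mkdtemp()) / "jobs.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("INSERT INTO job (id, status) VALUES (1, 'QUEUED')")
    conn.commit()
    conn.close()
    return path


def job_status(db_path: str, job_id: int) -> str:
    """Read the job status through a fresh connection, as an outside observer would."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute("SELECT status FROM job WHERE id = ?", (job_id,)).fetchone()
        return row[0]
    finally:
        conn.close()


def claim_job(db_path: str, job_id: int) -> bool:
    """Session 1: move the job from QUEUED to RUNNING and commit immediately."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "UPDATE job SET status = 'RUNNING' WHERE id = ? AND status = 'QUEUED'",
            (job_id,),
        )
        conn.commit()
        # Exactly one updated row means this caller won the claim.
        return cur.rowcount == 1
    finally:
        conn.close()


def execute_job(
    db_path: str, job_id: int, operation: Callable[[sqlite3.Connection], None]
) -> str:
    """Session 2: run the operation and record the outcome."""
    conn = sqlite3.connect(db_path)
    try:
        try:
            operation(conn)
            status = "SUCCEEDED"  # assumed status name
        except Exception:
            conn.rollback()  # discard partial writes made by the operation
            status = "FAILED"  # assumed status name
        conn.execute("UPDATE job SET status = ? WHERE id = ?", (status, job_id))
        conn.commit()
        return status
    finally:
        conn.close()


def failing_operation(conn: sqlite3.Connection) -> None:
    """Stand-in for domain logic that fails partway through."""
    raise RuntimeError("simulated failure during execution")


db = make_db()
claim_job(db, 1)
# The claim is already committed, so a crash at this point would still
# leave the job visibly RUNNING rather than silently QUEUED.
print(job_status(db, 1))
execute_job(db, 1, failing_operation)
print(job_status(db, 1))
```

Because the claim is committed in its own session, the recorded state never depends on the execution session completing, which is the property the execution layer relies on.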
An incremental migration
The goal of PROTEA is not a complete rewrite. PIS tables (protein, sequence,
protein_uniprot_metadata) and FANTASIA computation workflows are progressively migrated
into this architecture as new capabilities are added. Each migration step must preserve or
improve computational efficiency and must not introduce regressions in the data model.
Design principle
New operations are added by implementing the Operation protocol and registering them
at worker startup. No changes to the infrastructure or execution layers are required.
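As a rough sketch of what this principle could look like in practice, the example below defines an Operation protocol, a toy operation, and a registry populated at startup. The method name execute, the payload argument, the OperationResult fields, the REGISTRY dictionary, the register helper, and the CountSequences operation are all hypothetical. The source only states that an Operation receives an open session and an emit callback and returns an OperationResult.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Protocol
from unittest.mock import MagicMock

EmitFn = Callable[[str], None]


@dataclass
class OperationResult:
    """Outcome of one operation (field names are assumptions)."""
    ok: bool
    details: dict[str, Any] = field(default_factory=dict)


class Operation(Protocol):
    """Structural interface: any class with this shape qualifies."""
    name: str

    def execute(
        self, session: Any, payload: dict[str, Any], emit: EmitFn
    ) -> OperationResult: ...


class CountSequences:
    """Toy operation: uses the session it is given and never opens its own."""
    name = "count_sequences"

    def execute(
        self, session: Any, payload: dict[str, Any], emit: EmitFn
    ) -> OperationResult:
        emit("counting sequences")
        count = session.execute("SELECT COUNT(*) FROM sequence").scalar()
        return OperationResult(ok=True, details={"count": count})


REGISTRY: dict[str, Operation] = {}


def register(op: Operation) -> None:
    """Add an operation to the registry; called once per operation at startup."""
    if op.name in REGISTRY:
        raise ValueError(f"operation already registered: {op.name}")
    REGISTRY[op.name] = op


def noop_emit(message: str) -> None:
    """Emit callback that discards progress messages, for use in tests."""


# Worker startup: registering an operation touches no infrastructure code.
register(CountSequences())

# Unit test style usage: a mocked session and a no-op emit function.
session = MagicMock()
session.execute.return_value.scalar.return_value = 42
result = REGISTRY["count_sequences"].execute(session, {}, noop_emit)
print(result)
```

Using a structural Protocol rather than a base class means an operation does not need to import anything from the execution layer, which keeps the domain layer free of infrastructure dependencies.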