System Overview¶
Requirements and design goals¶
The design of PROTEA is governed by five requirements derived from the limitations of its predecessors (PIS and FANTASIA):
- R1 — Reproducibility
A prediction produced today must be exactly reproducible in the future. This requires recording the ontology version, reference annotation set, and embedding model configuration used for every prediction run.
- R2 — Scalability
The system must handle reference sets of hundreds of thousands of proteins and query sets of thousands without holding all data in memory simultaneously.
- R3 — Separation of concerns
Domain logic (what to compute), execution flow (how jobs are dispatched and tracked), and infrastructure (database, message queue) must be independently replaceable.
- R4 — Observability
Every job must produce a structured audit trail so that failures can be diagnosed without replaying the computation.
- R5 — Accessibility
Researchers without machine-learning infrastructure expertise must be able to submit sequences and retrieve predictions through a web interface or a REST API.
Runtime stack¶
PROTEA runs as a set of cooperative processes managed by scripts/manage.sh:
┌──────────────────────────────────────────────────────────────────────┐
│ PROTEA Stack │
│ │
│ Next.js (port 3000) ──HTTP──▶ FastAPI (port 8000) │
│ │ │
│ publishes UUID / payload │
│ │ │
│ ▼ │
│ RabbitMQ │
│ ┌─────────────────────────┐ │
│ │ protea.ping │ │
│ │ protea.jobs │ │
│ │ protea.embeddings │ coordinator │
│ │ protea.embeddings.batch│ ephemeral │
│ │ protea.embeddings.write│ ephemeral │
│ │ protea.predictions.batch│ ephemeral │
│ │ protea.predictions.write│ ephemeral │
│ └───────────┬─────────────┘ │
│ │ │
│ Worker processes │
│ (one or more per queue) │
│ │ │
│ ▼ │
│ PostgreSQL + pgvector │
└──────────────────────────────────────────────────────────────────────┘
Services and data stores¶
FastAPI (port 8000)
RESTful HTTP API. Handles job creation, status queries, event retrieval, and cancellation. On
POST /jobs, it creates aJobrow in QUEUED status, commits, then publishes the job UUID to RabbitMQ. The session factory and AMQP URL are injected viaapp.stateat startup, keeping the router free of global state.
RabbitMQ (port 5672 / 15672)
Message broker. Standard queues carry the job UUID — all state lives in PostgreSQL. Ephemeral batch queues carry the full operation payload (no DB row per message). Durable queues ensure messages survive broker restarts.
Queue routing¶ Queue
Consumer type
Operations
protea.pingQueueConsumer
ping
protea.jobsQueueConsumer
insert_proteins,fetch_uniprot_metadata,load_ontology_snapshot,load_goa_annotations,load_quickgo_annotations,compute_embeddings(coordinator),predict_go_terms(coordinator)
protea.embeddingsQueueConsumer
compute_embeddingscoordinator (serialised: one at a time, 60 s retry delay if GPU busy)
protea.embeddings.batchOperationConsumer
compute_embeddings_batch— GPU inference per batch (ephemeral, no DB Job row)
protea.embeddings.writeOperationConsumer
store_embeddings— bulk pgvector insert (ephemeral, no DB Job row)
protea.predictions.batchOperationConsumer
predict_go_terms_batch— KNN search + GO transfer (ephemeral, no DB Job row)
protea.predictions.writeOperationConsumer
store_predictions— bulk GOPrediction insert (ephemeral, no DB Job row)
QueueConsumer vs OperationConsumer
Two consumer patterns exist in
protea/infrastructure/queue/consumer.py:
QueueConsumer — reads a job UUID from the queue, delegates to
BaseWorker.handle_job(). Creates a full Job row with status transitions and event log.OperationConsumer — reads a raw operation payload from the queue and executes it directly. Used for high-throughput batch workers where creating thousands of child Job rows would cause queue bloat. Progress is tracked at the parent level only.
Worker processes
Long-running Python processes, one per queue. Launched and managed by
scripts/manage.sh. Workers reconnect automatically on broker disconnection and can be scaled horizontally:bash scripts/manage.sh scale protea.predictions.batch 2 # add 2 more batch workers
PostgreSQL + pgvector (port 5432)
Persistent store for all state. Holds job queues, event logs, protein sequences, UniProt metadata, GO ontologies, annotation sets, sequence embeddings (pgvector), and GO predictions. SQLAlchemy 2.x ORM with
Mapped[]annotations.Note
pgvector is used only for storage of embeddings (VECTOR type columns). KNN search is performed in Python using numpy or FAISS, never at the DB layer. See Predict GO terms in the howto guides.
Next.js frontend (port 3000)
Single-page application for job management. Displays job list with status filtering, live auto-refresh (2 s polling while a job is active), progress bar, and structured event timeline. Built with React 19 and Tailwind CSS v4.
Stack management¶
All processes are managed through scripts/manage.sh:
bash scripts/manage.sh start [N] # start full stack (N batch workers per pipeline)
bash scripts/manage.sh stop # stop all processes
bash scripts/manage.sh status # show PID, RAM, running/dead per worker
bash scripts/manage.sh logs [name] # tail logs (interactive picker or name fragment)
bash scripts/manage.sh scale <queue> [N] # add N extra workers to a queue without restart
Logs are written to logs/<name>.log. PIDs are tracked in logs/pids/.
Code layout¶
protea/
api/ FastAPI application and routers
routers/ jobs, proteins, annotations, embeddings,
query_sets, maintenance, admin
core/
contracts/ Operation protocol, ProteaPayload, OperationResult
operations/ Domain logic (all 9 operations)
knn_search.py KNN backends: numpy brute-force and FAISS (Flat/IVFFlat/HNSW)
feature_engineering.py Alignment (parasail NW/SW) and taxonomy (ete3 NCBITaxa)
utils.py UniProtHttpMixin, chunks(), utcnow()
infrastructure/
orm/models/ SQLAlchemy 2.x ORM models (protein, sequence, annotation,
embedding, prediction, query, job)
queue/ RabbitMQ consumer (QueueConsumer, OperationConsumer) and publisher
session.py session_scope context manager
settings.py YAML + env-var config loader
workers/
base_worker.py Two-session job lifecycle orchestrator
apps/
web/ Next.js frontend
scripts/
manage.sh Unified stack manager (start/stop/status/logs/scale)
worker.py Worker entry point (registers all operations)
init_db.py Schema initialisation
run_one_job.py Manual job runner (debugging)
Technology stack¶
Component |
Technology |
Version |
|---|---|---|
API framework |
FastAPI |
0.115+ |
ORM / migrations |
SQLAlchemy 2.0 + Alembic |
2.0 / 1.13 |
Database |
PostgreSQL 16 + pgvector |
16 / 0.7 |
Message broker |
RabbitMQ + aio-pika |
3.x / 9.x |
Data validation |
Pydantic v2 |
2.x |
Protein LM inference |
Hugging Face Transformers |
4.x |
Alignment |
parasail-python (BLOSUM62) |
1.x |
Taxonomy |
ete3 + NCBITaxa |
3.x |
ANN search |
NumPy / FAISS |
— |
Frontend |
Next.js 19 + Tailwind v4 |
19 / 4 |
Dependency management |
Poetry |
1.x |
All Python dependencies are declared in pyproject.toml with pinned version
ranges; poetry.lock guarantees reproducible installs. The dev dependency
group adds pytest, pytest-cov, and related tooling without affecting production.
Testing strategy¶
The test suite is split into two categories:
- Unit tests
Run with plain
pytest. Mock external services (HTTP, RabbitMQ) and use minimal fixtures. Cover operation logic, alignment and taxonomy utilities, FASTA parsing, and API router behaviour. Currently 283 tests passing across 17 test files; coverage enforced at 70 % bypytest-cov.- Integration tests
Run with
pytest --with-postgres. Theconftest.pyfixture pulls apgvector/pgvector:pg16Docker image, initialises the schema, and tears down the container after the session. These tests exercise the full round-trip from job submission to database state.
poetry run pytest # unit tests only
poetry run pytest --with-postgres # full suite including integration tests