Deployment Guide
PROTEA supports five deployment modes. Choose the mode that matches your infrastructure; all modes run the same service set (postgres, rabbitmq, api, workers, frontend) with the same environment variables.
Deployment modes at a glance
| Mode | When to use | Entry point |
|---|---|---|
| docker-compose dev | Local development, single host | scripts/deploy.sh |
| docker-compose bundle | Smoke-test from pre-built images, laptop or CI | docker-compose.bundle.yml |
| Docker Swarm | Multi-node production cluster, no Kubernetes (T-OPS.4) | deploy/swarm/stack.yml |
| Helm / Kubernetes | Kubernetes cluster (T-OPS.3, CI pending) | deploy/helm/protea/ |
| SLURM | HPC cluster, Slurm workload manager (T-OPS.5, in flight) | deploy/slurm/ |
Telemetry variables apply across all modes; see
Observability: OpenTelemetry SDK for the PROTEA_OTEL_* variable reference.
Prerequisites (all modes)
Docker 24+ installed on every node that runs containers.
Python 3.11+ and Poetry (dev mode only).
Access to ghcr.io/frapercan/protea (bundle, Swarm, and Helm modes pull pre-built images; dev mode builds locally).
A valid .env or environment override for non-default secrets (see per-mode sections below).
Mode 1: docker-compose dev stack
The canonical local development workflow. Images are built from the
local source tree; a dedicated deploy slot is kept at
~/Thesis2/worktrees/protea-deploy so the developer’s working tree
can move freely without disturbing a running stack.
Setup (once)
Create the deploy slot if it does not exist:
git worktree add ~/Thesis2/worktrees/protea-deploy \
-b feat/deploy-tooling origin/develop
Deploy
# Update the slot to origin/develop, build frontend, start stack.
bash scripts/deploy.sh
# Deploy a specific branch or SHA.
bash scripts/deploy.sh my-feature-branch
# Skip the frontend build (faster for back-end iteration).
bash scripts/deploy.sh --no-build
# Deploy from a local folder snapshot (skips git).
bash scripts/deploy.sh --from /path/to/snapshot
Status and stop
bash scripts/deploy.sh --status
bash scripts/deploy.sh --stop
--status prints the active branch/SHA, API health, and frontend
health. The API is available at http://localhost:8000; the frontend
at http://localhost:3000.
GPU detection
deploy.sh auto-detects the presence of NVIDIA drivers via
nvidia-smi. When a GPU is available, it swaps the PyTorch wheel to
the cu128 build post-install. Override with
PROTEA_DEPLOY_GPU=1|0|auto (default auto). Older CUDA wheels are
still reachable via CUDA_VARIANT=cu121 (or cu118) when invoking
scripts/install_gpu_torch.sh directly.
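The override semantics can be sketched as a small shell function. This is only an illustration of the documented 1|0|auto behaviour, not the code in deploy.sh; the function name resolve_gpu_mode and the gpu/cpu output strings are assumptions made for the example.

```shell
# Hypothetical helper: decide whether the GPU wheel should be installed.
# Mirrors the documented PROTEA_DEPLOY_GPU=1|0|auto semantics; the function
# name and the "gpu"/"cpu" outputs are assumptions, not deploy.sh internals.
resolve_gpu_mode() {
    case "${PROTEA_DEPLOY_GPU:-auto}" in
        1) echo gpu ;;
        0) echo cpu ;;
        auto)
            # Treat a working nvidia-smi as evidence of a usable GPU.
            if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
                echo gpu
            else
                echo cpu
            fi
            ;;
        *)
            echo "invalid PROTEA_DEPLOY_GPU: ${PROTEA_DEPLOY_GPU}" >&2
            return 1
            ;;
    esac
}
```

Forcing 0 is useful on a GPU host when a CPU-only stack is wanted; forcing 1 skips detection entirely.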
Key environment variables (override via shell or .env in the
deploy slot):
| Variable | Purpose |
|---|---|
| … | Target worktree path (default …) |
| … | Default git ref when none is passed as argument (default …) |
| PROTEA_DEPLOY_GPU | GPU wheel selection: 1, 0, or auto |
For telemetry configuration in this mode see Observability: OpenTelemetry SDK.
Mode 2: docker-compose bundle (T-OPS.2)
A single, self-contained compose file that pulls pre-built images from
ghcr.io without requiring a local build context or any plugin
repository to be cloned. Designed for smoke-testing the production
service wiring on a laptop or in a CI job.
Source: docker-compose.bundle.yml
Quick start
cp .env.bundle.example .env.bundle # adjust credentials if needed
docker compose -f docker-compose.bundle.yml --env-file .env.bundle up -d
# Verify the API is healthy.
curl -s http://localhost:8000/health # {"status":"ok"}
# Tear down (preserves the postgres volume by default).
docker compose -f docker-compose.bundle.yml down
# Full tear down including volumes.
docker compose -f docker-compose.bundle.yml down -v
Image tag
Override the image tag via the PROTEA_IMAGE_TAG variable in
.env.bundle:
PROTEA_IMAGE_TAG=v1.2.0
The default tag is latest.
Key environment variables (.env.bundle / shell overrides):
| Variable | Purpose |
|---|---|
| PROTEA_IMAGE_TAG | Image tag for the … |
| … | Image tag for the frontend image (default …) |
| … | CORS allowed origins (default …) |
| … | Host port for the API (default …) |
| … | Host port for the frontend (default …) |
| … | Host port for postgres (default …) |
| … | AMQP and management ports (default …) |
Differences versus the dev stack
The bundle uses pre-built images (no build: keys), a trimmed worker
set (worker-jobs, worker-ping, worker-embeddings) sufficient
for smoke runs, and an explicit named network (protea-bundle) so
external monitoring stacks can attach. The full worker set lives in
docker-compose.yml.
For telemetry configuration in this mode see Observability: OpenTelemetry SDK.
Mode 3: Docker Swarm stack (T-OPS.4)
The production deployment target for operators running a Docker Swarm
cluster. Source: deploy/swarm/stack.yml.
Full annotated documentation is in deploy/swarm/README.md. This
section summarises the key operational steps.
Prerequisites
Docker 24+ on every node; Swarm initialised:
docker swarm init        # on the manager
docker swarm join ...    # on each worker node
Every node able to pull from ghcr.io. Authenticate once:
docker login ghcr.io
A node labelled for the GPU embedding worker (optional):
docker node update --label-add gpu=true <node-id>
Create Swarm secrets (once per cluster)
The stack reads credentials from Docker Swarm secrets, not from plaintext
environment variables. The canonical source of those credentials is the
sops + age encrypted secrets/secrets.prod.enc.yaml in the repo;
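see Secrets management runbook (sops + age onboarding) for how to obtain a private key, decrypt
the file, and pipe values into docker secret create. Create the secrets
before the first deploy, replacing the change-me-* placeholders with real
values. The stack expects six secrets; five are shown here, so consult
deploy/swarm/README.md for the full set: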
printf 'change-me-pg' | docker secret create protea_postgres_password -
printf 'change-me-rmq' | docker secret create protea_rabbitmq_password -
printf 'change-me-minio' | docker secret create protea_minio_password -
printf 'postgresql+psycopg://protea:change-me-pg@postgres:5432/protea' \
| docker secret create protea_db_url -
printf 'amqp://protea:change-me-rmq@rabbitmq:5672/' \
| docker secret create protea_amqp_url -
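The two URL secrets embed the same passwords as protea_postgres_password and protea_rabbitmq_password, so the values have to stay in sync. A minimal sketch of deriving the URLs from a single password each follows; the helper names are hypothetical, and the user, host, port, and database names are simply copied from the placeholder commands above.

```shell
# Hypothetical helpers that derive the URL secrets from one password each,
# so a password secret and its URL secret cannot drift apart.
# User, host, port and database names are copied from the placeholder
# commands in this guide; adjust them to match your stack.
build_db_url() {
    # $1 = postgres password. Characters such as '@', '/' or ':' must be
    # percent-encoded by the caller before being passed in.
    printf 'postgresql+psycopg://protea:%s@postgres:5432/protea' "$1"
}

build_amqp_url() {
    # $1 = rabbitmq password (same encoding caveat as above).
    printf 'amqp://protea:%s@rabbitmq:5672/' "$1"
}

# Example (not executed here):
#   build_db_url "$PG_PASSWORD" | docker secret create protea_db_url -
```

Piping the function output straight into docker secret create keeps the composed URL out of files on disk.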
Validate the stack file
docker stack config -c deploy/swarm/stack.yml > /dev/null
A zero exit code confirms the file is well-formed.
Deploy
export PROTEA_IMAGE_TAG=$(git describe --tags --always)
export PROTEA_FRONTEND_TAG=$PROTEA_IMAGE_TAG
export PROTEA_ALLOWED_ORIGINS=https://protea.example.org
docker stack deploy \
--with-registry-auth \
-c deploy/swarm/stack.yml \
protea
--with-registry-auth propagates the manager’s ghcr.io
credentials to worker nodes. The api and frontend services use
start-first rolling updates with automatic rollback on failure. The
migrate service runs alembic upgrade head on every deploy and
exits 0 if the schema is already current.
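The deploy command relies on three exported variables. A pre-flight check along the following lines can catch a forgotten export before the stack is updated. It is a sketch under the assumption that an unset variable would otherwise be interpolated as an empty value unless the stack file supplies a default; the function name require_env is hypothetical.

```shell
# Hypothetical pre-flight check: fail if any named variable is unset or empty.
require_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (not executed here):
#   require_env PROTEA_IMAGE_TAG PROTEA_FRONTEND_TAG PROTEA_ALLOWED_ORIGINS \
#       && docker stack deploy --with-registry-auth -c deploy/swarm/stack.yml protea
```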
Status, logs, scaling
docker stack services protea
docker service logs -f protea_api
docker service scale protea_worker-embeddings=4
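After a deploy or a scale operation it can help to check that every service has converged. The sketch below parses "NAME CURRENT/DESIRED" lines such as those produced by docker stack services with a --format template; the helper name all_replicas_ready is hypothetical, and one-shot services (for example a finished migrate task) may legitimately report 0/N and would need to be filtered out first.

```shell
# Hypothetical helper: succeed only when every service reports all replicas
# running. Reads "NAME CURRENT/DESIRED" lines on stdin, e.g. from
#   docker stack services protea --format '{{.Name}} {{.Replicas}}'
# Empty input is treated as "not ready".
all_replicas_ready() {
    awk '
        {
            split($2, r, "/")
            if (r[1] != r[2]) {
                bad = 1
                print "not ready: " $1 " (" $2 ")"
            }
        }
        END { if (bad || NR == 0) exit 1 }
    '
}

# Example (not executed here):
#   docker stack services protea --format '{{.Name}} {{.Replicas}}' | all_replicas_ready
```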
Tear down
docker stack rm protea
# Remove persistent volumes only when wiping the cluster.
docker volume rm protea_postgres_data protea_minio_data
Service placement
Stateful services (postgres, rabbitmq, minio) are pinned to
the manager node so their volumes stay co-located with a known host.
Operators running a multi-manager cluster should replace the
node.role == manager constraint with a custom label such as
node.labels.protea_data == true. Worker processes are constrained to
worker nodes (node.role == worker) to leave the manager headroom for
the API.
For telemetry configuration in this mode pass PROTEA_OTEL_* variables
via additional Swarm secrets or environment overrides in the stack file.
See Observability: OpenTelemetry SDK for variable definitions.
Mode 4: Helm chart on Kubernetes (T-OPS.3)
A Helm 3 chart for installing the full PROTEA stack on any
Kubernetes 1.24+ cluster. Source: deploy/helm/protea/.
Note
T-OPS.3 CI gates are pending. The chart has landed in the repository (PR #326, merged to develop) and can be installed manually; full automated promotion to production is blocked until CI passes.
Prerequisites
helm 3.x installed.
kubectl configured for the target cluster.
Access to ghcr.io/frapercan/protea from the cluster nodes.
Install
# Install with default values (all services enabled, internal postgres/rabbitmq).
helm install protea deploy/helm/protea/
# Override values inline.
helm install protea deploy/helm/protea/ \
--set image.tag=v1.2.0 \
--set api.replicaCount=2 \
--set workers.embeddingsBatch.gpu.enabled=true
# Override via a custom values file.
helm install protea deploy/helm/protea/ -f my-values.yaml
Upgrade and rollback
helm upgrade protea deploy/helm/protea/ --set image.tag=v1.3.0
helm rollback protea
Key chart knobs (deploy/helm/protea/values.yaml):
| Value path | Purpose |
|---|---|
| image.tag | Image tag for all PROTEA-owned containers (default …) |
| … | When … |
| … | Same pattern as … |
| … | Deploys MinIO when … |
| api.replicaCount | Number of API pod replicas (default …) |
| … | Enables an Ingress resource (default …) |
| … | Per-worker replica count. |
| … | Requests … |
| … | Runtime class for GPU workloads (e.g. …) |
The chart deploys a pre-install migration Job (alembic upgrade head)
that completes before the API and workers start, mirroring the migrate
service in compose.
For telemetry, pass PROTEA_OTEL_* variables via api.extraEnv in
your values override:
api:
extraEnv:
- name: PROTEA_OTEL_ENABLED
value: "true"
- name: PROTEA_OTEL_ENDPOINT
value: "http://otel-collector:4318"
See Observability: OpenTelemetry SDK for the full variable reference.
Mode 5: SLURM templates (T-OPS.5, in flight)
HPC/SLURM deployment templates will land in deploy/slurm/ as part
of T-OPS.5. This section is a placeholder; it will be filled in when
T-OPS.5 merges.
The templates will provide Slurm job scripts for:
Running the API as a Slurm batch job (or salloc interactive session).
Launching worker processes as Slurm array jobs with per-task queue assignments.
The GPU embedding worker on a GPU partition.
Once T-OPS.5 is merged, refer to deploy/slurm/ for the canonical
template files and update this section with the sbatch invocation
commands.
Telemetry applies identically to SLURM deployments; see Observability: OpenTelemetry SDK.
Common operations
Environment variables common to all modes
The following variables are recognised by the API and worker processes
across all deployment modes. Credential variables accept either a direct
value (e.g. PROTEA_DB_URL) or a file-path variant (e.g.
PROTEA_DB_URL_FILE) for secret-store integration.
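The value-or-file convention can be illustrated with a helper modelled on the file_env pattern found in several official Docker image entrypoints. This is a sketch of the convention only, not PROTEA's own loader; treating "both set" as an error is an assumption borrowed from that pattern.

```shell
# Hypothetical helper: export VAR from either VAR or VAR_FILE.
# Not PROTEA's loader; it only illustrates the *_FILE convention.
file_env() {
    var="$1"
    file_var="${var}_FILE"
    # Indirect lookups via eval; the names come from the caller.
    eval "val=\${$var:-}"
    eval "file_path=\${$file_var:-}"
    if [ -n "$val" ] && [ -n "$file_path" ]; then
        # Assumption: refuse ambiguous configuration rather than pick one.
        echo "error: both $var and $file_var are set" >&2
        return 1
    fi
    if [ -n "$file_path" ]; then
        # Command substitution strips trailing newlines from the secret file.
        val="$(cat "$file_path")" || return 1
    fi
    if [ -n "$val" ]; then
        export "$var=$val"
    fi
    unset "$file_var"
}
```

With Swarm secrets mounted under /run/secrets, setting PROTEA_DB_URL_FILE=/run/secrets/protea_db_url and calling file_env PROTEA_DB_URL would leave PROTEA_DB_URL exported for the process that follows.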
| Variable | Default | Description |
|---|---|---|
| PROTEA_DB_URL | (required) | SQLAlchemy connection URL for postgres. Format: … |
| … | (required) | AMQP connection URL for RabbitMQ. Format: … |
| PROTEA_ALLOWED_ORIGINS | (none) | Comma-separated list of allowed CORS origins. |
| … | … | Storage backend: … |
| … | (none) | Absolute path to the Anc2Vec npz artefact (…) |
| PROTEA_OTEL_ENABLED | … | Enable OpenTelemetry distributed tracing. See Observability: OpenTelemetry SDK. |
| PROTEA_OTEL_ENDPOINT | (none) | OTLP HTTP exporter endpoint. See Observability: OpenTelemetry SDK. |
| … | … | OTel service name. See Observability: OpenTelemetry SDK. |
| … | … | OTel head-sampling ratio. See Observability: OpenTelemetry SDK. |
Health checks
The API exposes GET /health which returns {"status": "ok"} when
the process is running. All deployment modes configure a health check
against this endpoint:
curl -s http://localhost:8000/health
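For scripted use, a single curl call is usually wrapped in a retry loop so that callers wait for the API to come up. The following is a minimal sketch; the function name wait_for_health and its default attempt count and delay are assumptions, not values taken from the deployment tooling.

```shell
# Hypothetical helper: poll the health endpoint until it answers
# {"status": "ok"} or the attempts run out.
# Usage: wait_for_health [url] [attempts] [delay_seconds]
wait_for_health() {
    url="${1:-http://localhost:8000/health}"
    attempts="${2:-30}"
    delay="${3:-2}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl fail on HTTP errors; stripping whitespace lets the
        # match accept both {"status":"ok"} and {"status": "ok"}.
        body="$(curl -fsS "$url" 2>/dev/null | tr -d '[:space:]')" || body=""
        if [ "$body" = '{"status":"ok"}' ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "API not healthy after $attempts attempts: $url" >&2
    return 1
}
```

Called with no arguments it polls http://localhost:8000/health, which suits the dev and bundle stacks; pass a different URL for Swarm or Kubernetes endpoints.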
Schema migrations
Every deployment mode runs alembic upgrade head before starting the
API and workers. In compose modes the migrate service handles this; in
Swarm the migrate task runs on every docker stack deploy; in Helm
the chart deploys a pre-install Job. Never start the API against an
un-migrated schema.
To run migrations manually:
# dev mode (inside deploy slot)
poetry run alembic upgrade head
# bundle / Swarm / Helm (via a one-off container)
docker run --rm \
-e PROTEA_DB_URL=postgresql+psycopg://... \
ghcr.io/frapercan/protea:latest \
alembic upgrade head
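To guard against starting the API on an un-migrated schema, a script can compare the database revision with the latest revision in the migration tree. The sketch below assumes a single Alembic head and that alembic current and alembic heads each print the revision identifier as the first token on stdout; the function name schema_is_current is hypothetical.

```shell
# Hypothetical guard: compare the revision reported by `alembic current`
# with the one reported by `alembic heads`. Assumes a single head and that
# the revision id is the first token of the first non-empty output line.
schema_is_current() {
    # $1 = stdout of `alembic current`, $2 = stdout of `alembic heads`
    current_rev="$(printf '%s\n' "$1" | awk 'NF { print $1; exit }')"
    head_rev="$(printf '%s\n' "$2" | awk 'NF { print $1; exit }')"
    # An empty current revision means no migration has been applied yet.
    [ -n "$current_rev" ] && [ "$current_rev" = "$head_rev" ]
}

# Example (not executed here):
#   schema_is_current "$(poetry run alembic current 2>/dev/null)" \
#                     "$(poetry run alembic heads 2>/dev/null)" || exit 1
```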
See also
deploy/swarm/README.md for Swarm-specific prerequisites and secret rotation.
deploy/helm/protea/values.yaml for the full Helm value reference.
scripts/deploy.sh (in-file comments) for dev-stack behaviour details.
Observability: OpenTelemetry SDK for telemetry environment variables (PROTEA_OTEL_*) that apply across all deployment modes.
Ngrok Deploy Recovery if the public demo endpoint becomes unreachable.
Disaster Recovery for the postgres dump and restore procedure (drill and real recovery paths).