Deployment Guide

PROTEA supports five deployment modes. Choose the mode that matches your infrastructure; all modes run the same service set (postgres, rabbitmq, api, workers, frontend) with the same environment variables.

Deployment modes at a glance

Mode	When to use	Entry point
docker-compose dev	Local development, single host	`bash scripts/deploy.sh` (see Mode 1: docker-compose dev stack)
docker-compose bundle	Smoke-test from pre-built images, laptop or CI	`docker compose -f docker-compose.bundle.yml up` (see Mode 2: docker-compose bundle (T-OPS.2))
Docker Swarm	Multi-node production cluster, no Kubernetes (T-OPS.4)	`docker stack deploy -c deploy/swarm/stack.yml protea` (see Mode 3: Docker Swarm stack (T-OPS.4))
Helm / Kubernetes	Kubernetes cluster (T-OPS.3, CI pending)	`helm install protea deploy/helm/protea/` (see Mode 4: Helm chart on Kubernetes (T-OPS.3))
SLURM	HPC cluster, Slurm workload manager (T-OPS.5, in flight)	`deploy/slurm/` templates (see Mode 5: SLURM templates (T-OPS.5, in flight))

Telemetry variables apply across all modes; see Observability: OpenTelemetry SDK for the PROTEA_OTEL_* variable reference.

Prerequisites (all modes)

Docker 24+ installed on every node that runs containers.
Python 3.11+ and Poetry (dev mode only).
Access to ghcr.io/frapercan/protea (bundle, Swarm, Helm modes pull pre-built images; dev mode builds locally).
A valid .env or environment override for non-default secrets (see per-mode sections below).

Mode 1: docker-compose dev stack

The canonical local development workflow. Images are built from the local source tree; a dedicated deploy slot is kept at ~/Thesis2/worktrees/protea-deploy so the developer’s working tree can move freely without disturbing a running stack.

Setup (once)

Create the deploy slot if it does not exist:

git worktree add ~/Thesis2/worktrees/protea-deploy \
  -b feat/deploy-tooling origin/develop

Deploy

# Update the slot to origin/develop, build frontend, start stack.
bash scripts/deploy.sh

# Deploy a specific branch or SHA.
bash scripts/deploy.sh my-feature-branch

# Skip the frontend build (faster for back-end iteration).
bash scripts/deploy.sh --no-build

# Deploy from a local folder snapshot (skips git).
bash scripts/deploy.sh --from /path/to/snapshot

Status and stop

bash scripts/deploy.sh --status
bash scripts/deploy.sh --stop

--status prints the active branch/SHA, API health, and frontend health. The API is available at http://localhost:8000; the frontend at http://localhost:3000.

GPU detection

deploy.sh auto-detects the presence of NVIDIA drivers via nvidia-smi. When a GPU is available, it swaps the PyTorch wheel to the cu128 build post-install. Override with PROTEA_DEPLOY_GPU=1|0|auto (default auto). Older CUDA wheels are still reachable via CUDA_VARIANT=cu121 (or cu118) when invoking scripts/install_gpu_torch.sh directly.
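The selection rule described above could be sketched as follows. This is an illustration only and not the contents of scripts/deploy.sh: the function name resolve_gpu_mode and the printed values gpu and cpu are hypothetical, and treating a successful nvidia-smi run as proof of a usable driver is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "gpu" or "cpu" for the current PROTEA_DEPLOY_GPU.
resolve_gpu_mode() {
  case "${PROTEA_DEPLOY_GPU:-auto}" in
    1) echo gpu ;;
    0) echo cpu ;;
    auto)
      # Assumption: a working nvidia-smi means an NVIDIA driver is usable.
      if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
        echo gpu
      else
        echo cpu
      fi
      ;;
    *)
      echo "PROTEA_DEPLOY_GPU must be 1, 0, or auto" >&2
      return 2
      ;;
  esac
}
```

Forcing the CPU wheel on a GPU host then amounts to running PROTEA_DEPLOY_GPU=0 bash scripts/deploy.sh.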

Key environment variables (override via shell or .env in the deploy slot):

Variable	Purpose
`PROTEA_DEPLOY_PATH`	Target worktree path (default `~/Thesis2/worktrees/protea-deploy`).
`PROTEA_DEPLOY_REF`	Default git ref when none is passed as argument (default `origin/develop`).
`PROTEA_DEPLOY_GPU`	GPU wheel selection: `auto` (default), `1`, or `0`.
`PROTEA_PUBLIC_API_URL`	`NEXT_PUBLIC_API_URL` injected into `apps/web/.env.local` if the file is absent (default `/api-proxy`).

For telemetry configuration in this mode see Observability: OpenTelemetry SDK.

Mode 2: docker-compose bundle (T-OPS.2)

A single, self-contained compose file that pulls pre-built images from ghcr.io without requiring a local build context or any plugin repository to be cloned. Designed for smoke-testing the production service wiring on a laptop or in a CI job.

Source: docker-compose.bundle.yml

Quick start

cp .env.bundle.example .env.bundle   # adjust credentials if needed
docker compose -f docker-compose.bundle.yml --env-file .env.bundle up -d

# Verify the API is healthy.
curl -s http://localhost:8000/health   # {"status":"ok"}

# Tear down (preserves the postgres volume by default).
docker compose -f docker-compose.bundle.yml down

# Full tear down including volumes.
docker compose -f docker-compose.bundle.yml down -v

Image tag

Override the image tag via the PROTEA_IMAGE_TAG variable in .env.bundle:

PROTEA_IMAGE_TAG=v1.2.0

The default tag is latest.

Key environment variables (.env.bundle / shell overrides):

Variable	Purpose
`PROTEA_IMAGE_TAG`	Image tag for the `ghcr.io/frapercan/protea` image (default `latest`).
`PROTEA_FRONTEND_TAG`	Image tag for the frontend image (default `latest`).
`PROTEA_ALLOWED_ORIGINS`	CORS allowed origins (default `http://localhost:3000`).
`PROTEA_API_PORT`	Host port for the API (default `8000`).
`PROTEA_WEB_PORT`	Host port for the frontend (default `3000`).
`POSTGRES_PORT`	Host port for postgres (default `5432`).
`RABBITMQ_PORT` / `RABBITMQ_MGMT_PORT`	AMQP and management ports (default `5672` / `15672`).

Differences versus the dev stack

The bundle uses pre-built images (no build: keys), a trimmed worker set (worker-jobs, worker-ping, worker-embeddings) sufficient for smoke runs, and an explicit named network (protea-bundle) so external monitoring stacks can attach. The full worker set lives in docker-compose.yml.

For telemetry configuration in this mode see Observability: OpenTelemetry SDK.

Mode 3: Docker Swarm stack (T-OPS.4)

The production deployment target for operators running a Docker Swarm cluster. Source: deploy/swarm/stack.yml.

Full annotated documentation is in deploy/swarm/README.md. This section summarises the key operational steps.

Prerequisites

Docker 24+ on every node; Swarm initialised:

docker swarm init          # on the manager
docker swarm join ...      # on each worker node

Every node able to pull from ghcr.io. Authenticate once:
```
docker login ghcr.io
```

A node labelled for the GPU embedding worker (optional):

docker node update --label-add gpu=true <node-id>

Create Swarm secrets (once per cluster)

The stack reads credentials from Docker Swarm secrets, not from plaintext environment variables. The canonical source of those credentials is the sops + age encrypted file secrets/secrets.prod.enc.yaml in the repo; see Secrets management runbook (sops + age onboarding) for how to obtain a private key, decrypt the file, and pipe values into docker secret create. Create the secrets before the first deploy. The commands below use placeholder values and show five secrets; check deploy/swarm/README.md in case the stack file declares others:

printf 'change-me-pg' | docker secret create protea_postgres_password -
printf 'change-me-rmq' | docker secret create protea_rabbitmq_password -
printf 'change-me-minio' | docker secret create protea_minio_password -
printf 'postgresql+psycopg://protea:change-me-pg@postgres:5432/protea' \
    | docker secret create protea_db_url -
printf 'amqp://protea:change-me-rmq@rabbitmq:5672/' \
    | docker secret create protea_amqp_url -
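Before deploying, it can help to confirm that the secrets exist on the cluster. The sketch below is a hypothetical helper, not part of the PROTEA tooling: the function name missing_secrets and the REQUIRED_SECRETS variable are made up for illustration, and the list only covers the five names used in the commands above.

```shell
#!/usr/bin/env bash
# Names taken from the secret-creation commands in this guide.
REQUIRED_SECRETS="protea_postgres_password protea_rabbitmq_password protea_minio_password protea_db_url protea_amqp_url"

# Reads existing secret names on stdin (one per line) and prints each
# required name that is absent. Prints nothing when all are present.
missing_secrets() {
  local existing name
  existing="$(cat)"
  for name in $REQUIRED_SECRETS; do
    printf '%s\n' "$existing" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}

# Usage on a Swarm manager:
#   docker secret ls --format '{{.Name}}' | missing_secrets
```

Reading the names from stdin keeps the check independent of Docker, so it can be exercised on a machine that is not part of the Swarm.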

Validate the stack file

docker stack config -c deploy/swarm/stack.yml > /dev/null

A zero exit code confirms the file is well-formed.

Deploy

export PROTEA_IMAGE_TAG="$(git describe --tags --always)"
export PROTEA_FRONTEND_TAG="$PROTEA_IMAGE_TAG"
export PROTEA_ALLOWED_ORIGINS=https://protea.example.org

docker stack deploy \
  --with-registry-auth \
  -c deploy/swarm/stack.yml \
  protea

--with-registry-auth propagates the manager’s ghcr.io credentials to worker nodes. The api and frontend services use start-first rolling updates with automatic rollback on failure. The migrate service runs alembic upgrade head on every deploy and exits 0 if the schema is already current.
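A pre-flight check along the following lines can catch a variable that was set but never exported. It is a sketch under two assumptions: that deploy/swarm/stack.yml interpolates the three variables exported above, and that printenv is available on the manager. The function name require_exported is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: returns 1 if any named variable is missing
# from the exported environment or is empty.
require_exported() {
  local name status=0
  for name in "$@"; do
    # printenv only sees exported variables, which is what child processes
    # such as the docker CLI receive.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing or empty: $name" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage before deploying:
#   require_exported PROTEA_IMAGE_TAG PROTEA_FRONTEND_TAG PROTEA_ALLOWED_ORIGINS \
#     && docker stack deploy --with-registry-auth -c deploy/swarm/stack.yml protea
```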

Status, logs, scaling

docker stack services protea
docker service logs -f protea_api
docker service scale protea_worker-embeddings=4

Tear down

docker stack rm protea

# Remove persistent volumes only when wiping the cluster.
docker volume rm protea_postgres_data protea_minio_data

Service placement

Stateful services (postgres, rabbitmq, minio) are pinned to the manager node so their volumes stay co-located with a known host. Operators running a multi-manager cluster should replace the node.role == manager constraint with a custom label such as node.labels.protea_data == true. Worker processes are constrained to worker nodes (node.role == worker) to leave the manager headroom for the API.

For telemetry configuration in this mode, pass PROTEA_OTEL_* variables via additional Swarm secrets or environment overrides in the stack file. See Observability: OpenTelemetry SDK for variable definitions.

Mode 4: Helm chart on Kubernetes (T-OPS.3)

A Helm 3 chart for installing the full PROTEA stack on any Kubernetes 1.24+ cluster. Source: deploy/helm/protea/.

Note

T-OPS.3 CI gates are pending. The chart has landed in the repository (PR #326, merged to develop) and can be installed manually; full automated promotion to production is blocked until CI passes.

Prerequisites

helm 3.x installed.
kubectl configured for the target cluster.
Access to ghcr.io/frapercan/protea from the cluster nodes.

Install

# Install with default values (all services enabled, internal postgres/rabbitmq).
helm install protea deploy/helm/protea/

# Override values inline.
helm install protea deploy/helm/protea/ \
  --set image.tag=v1.2.0 \
  --set api.replicaCount=2 \
  --set workers.embeddingsBatch.gpu.enabled=true

# Override via a custom values file.
helm install protea deploy/helm/protea/ -f my-values.yaml

Upgrade and rollback

helm upgrade protea deploy/helm/protea/ --set image.tag=v1.3.0
helm rollback protea

Key chart knobs (deploy/helm/protea/values.yaml):

Value path	Purpose
`image.tag`	Image tag for all PROTEA-owned containers (default `latest`).
`database.internal`	When `true` (default) the chart deploys postgres. Set to `false` and provide `database.externalUrl` to use an external database.
`amqp.internal`	Same pattern as `database.internal` for RabbitMQ.
`objectStore.enabled`	Deploys MinIO when `true` (default `false`).
`api.replicaCount`	Number of API pod replicas (default `1`).
`api.ingress.enabled`	Enables an Ingress resource (default `false`).
`workers.<name>.replicaCount`	Per-worker replica count.
`workers.embeddingsBatch.gpu.enabled`	Requests `nvidia.com/gpu: 1` on the batch embedding pod (default `false`).
`workers.embeddingsBatch.runtimeClassName`	Runtime class for GPU workloads (e.g. `nvidia`).

The chart deploys a pre-install migration Job (alembic upgrade head) that completes before the API and workers start, mirroring the migrate service in compose.

For telemetry, pass PROTEA_OTEL_* variables via api.extraEnv in your values override:

api:
  extraEnv:
    - name: PROTEA_OTEL_ENABLED
      value: "true"
    - name: PROTEA_OTEL_ENDPOINT
      value: "http://otel-collector:4318"

See Observability: OpenTelemetry SDK for the full variable reference.

Mode 5: SLURM templates (T-OPS.5, in flight)

HPC/SLURM deployment templates will land in deploy/slurm/ as part of T-OPS.5. This section is a placeholder; it will be filled in when T-OPS.5 merges.

The templates will provide Slurm job scripts for:

Running the API as a Slurm batch job (or salloc interactive session).
Launching worker processes as Slurm array jobs with per-task queue assignments.
The GPU embedding worker on a GPU partition.

Once T-OPS.5 is merged, refer to deploy/slurm/ for the canonical template files and update this section with the sbatch invocation commands.

Telemetry applies identically to SLURM deployments; see Observability: OpenTelemetry SDK.

Common operations

Environment variables common to all modes

The following variables are recognised by the API and worker processes across all deployment modes. Credential variables accept either a direct value (e.g. PROTEA_DB_URL) or a file-path variant (e.g. PROTEA_DB_URL_FILE) for secret-store integration.

Variable	Default	Description
`PROTEA_DB_URL`	(required)	SQLAlchemy connection URL for postgres. Format: `postgresql+psycopg://user:pass@host:5432/db`.
`PROTEA_AMQP_URL`	(required)	AMQP connection URL for RabbitMQ. Format: `amqp://user:pass@host:5672/`.
`PROTEA_ALLOWED_ORIGINS`	(none)	Comma-separated list of allowed CORS origins.
`PROTEA_STORAGE_BACKEND`	`local`	Storage backend: `local` (filesystem) or `minio`.
`PROTEA_ANC2VEC_PATH`	(none)	Absolute path to the Anc2Vec npz artefact (`anc2vec_2020-10.npz`). When unset, the API/worker process falls back to the repo-relative `artifacts/anc2vec/anc2vec_2020-10.npz` (gitignored, so absent on fresh deploy worktrees). On a deploy without the file in either location the process logs the resolution chain and raises `FileNotFoundError`. See Secrets management runbook (sops + age onboarding) for the resolution order.
`PROTEA_OTEL_ENABLED`	`false`	Enable OpenTelemetry distributed tracing. See Observability: OpenTelemetry SDK.
`PROTEA_OTEL_ENDPOINT`	(none)	OTLP HTTP exporter endpoint. See Observability: OpenTelemetry SDK.
`PROTEA_OTEL_SERVICE_NAME`	`protea-api`	OTel service name. See Observability: OpenTelemetry SDK.
`PROTEA_OTEL_SAMPLE_RATIO`	`1.0`	OTel head-sampling ratio. See Observability: OpenTelemetry SDK.
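The direct-value and file-path variants of a credential variable could be resolved in a wrapper script roughly as shown below. This is a sketch, not PROTEA's own resolution code: the function name resolve_secret is hypothetical, and giving the direct value precedence over the _FILE variant is an assumption, since this guide does not state which one wins.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of PROTEA: prints the value of NAME, or the
# contents of the file named by NAME_FILE when NAME itself is empty.
resolve_secret() {
  local name="$1" direct file
  # Accept only plain variable names before handing them to eval.
  case "$name" in
    ''|[0-9]*|*[!A-Za-z0-9_]*)
      echo "invalid variable name: $name" >&2
      return 2
      ;;
  esac
  eval "direct=\${$name:-}"
  eval "file=\${${name}_FILE:-}"
  if [ -n "$direct" ]; then
    # Assumption: the direct value takes precedence over the file variant.
    printf '%s' "$direct"
  elif [ -n "$file" ] && [ -r "$file" ]; then
    cat "$file"
  else
    echo "$name is unset and ${name}_FILE is not readable" >&2
    return 1
  fi
}
```

Docker mounts Swarm secrets under /run/secrets/, so a setting such as PROTEA_DB_URL_FILE=/run/secrets/protea_db_url would be a natural fit there, assuming the stack file wires the secret that way.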

Health checks

The API exposes GET /health which returns {"status": "ok"} when the process is running. All deployment modes configure a health check against this endpoint:

curl -s http://localhost:8000/health
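Right after a deploy the API may need a few seconds before it answers, so scripts often retry the probe instead of calling it once. The helper below is a generic sketch; the name wait_until and the attempt and delay values in the usage comment are hypothetical and not taken from the PROTEA tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a probe command until it succeeds.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage: up to 30 probes, 2 seconds apart; curl -f fails on HTTP errors.
#   wait_until 30 2 curl -fsS http://localhost:8000/health
```

Passing the probe as a command keeps the retry loop reusable for other checks and easy to exercise without a running stack.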

Schema migrations

Every deployment mode runs alembic upgrade head before starting the API and workers. In compose modes the migrate service handles this; in Swarm the migrate task runs on every docker stack deploy; in Helm the chart deploys a pre-install Job. Never start the API against an un-migrated schema.

To run migrations manually:

# dev mode (inside deploy slot)
poetry run alembic upgrade head

# bundle / Swarm / Helm (via a one-off container)
docker run --rm \
  -e PROTEA_DB_URL=postgresql+psycopg://... \
  ghcr.io/frapercan/protea:latest \
  alembic upgrade head