Observability Operator Runbook

This runbook covers the full PROTEA observability stack from an operator’s perspective: Prometheus metrics, Loki log aggregation, Grafana dashboards and alerting, and the OpenTelemetry tracing layer. It is a quick-reference companion to the component-specific runbooks (Observability: Prometheus metrics, Observability: Loki log aggregation, Observability: OpenTelemetry SDK).

Metrics, logs, dashboards, and alerting all ship from the same monitoring compose project (tracing needs a separately deployed collector, as described in the Tracing section):

docker compose -f docker-compose.monitoring.yml up -d

Architecture overview

The PROTEA process stack (uvicorn API + workers + Next.js) on the host emits three observable signals:

/metrics on port 8000 (Prometheus text format) scraped by prometheus:9090.
logs/*.log files tailed by the promtail sidecar and shipped to loki:3100.
OTLP HTTP spans (when PROTEA_OTEL_ENABLED=true) sent to an OTel collector (not yet deployed in the standard monitoring stack).

Grafana at port 3001 queries both datasources:

datasource protea-prometheus (uid protea-prometheus) for metric panels.
datasource protea-loki (uid protea-loki) for log panels and alerts.

Prometheus scrapes the host API on port 8000, RabbitMQ on 15692 (via container name protea-rabbitmq-1), and the Postgres exporter sidecar on port 9187. Workers do not expose their own /metrics port; all worker counters surface through the shared API registry.

Loki receives log lines via a promtail sidecar that tails logs/*.log files written by the process stack. The docker-driver plugin path (documented in Observability: Loki log aggregation) applies when the application itself runs in containers.

Tracing via the OpenTelemetry SDK is optional and disabled by default (PROTEA_OTEL_ENABLED=false). See the Tracing section below.
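
The endpoints named in this overview can be probed in one pass before digging into a specific component. The following is a minimal sketch, assuming the default host ports listed in this runbook and Grafana's standard /api/health endpoint (which the runbook itself does not mention). The names ENDPOINTS, probe, and probe_all are illustrative and not part of PROTEA.

```python
import urllib.error
import urllib.request

# Readiness endpoints taken from this runbook.
# Assumption: Grafana answers on its standard /api/health path.
ENDPOINTS = {
    "api": "http://localhost:8000/health",
    "prometheus": "http://localhost:9090/-/ready",
    "loki": "http://localhost:3100/ready",
    "grafana": "http://localhost:3001/api/health",
}


def probe(url, timeout=3.0):
    """Return True if the URL answers with HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, 4xx/5xx, or a malformed URL.
        return False


def probe_all(endpoints=ENDPOINTS, probe_fn=probe):
    """Probe every endpoint and return a {name: reachable} mapping."""
    return {name: probe_fn(url) for name, url in endpoints.items()}


# Usage on the monitoring host:
#   for name, ok in probe_all().items():
#       print(f"{name:<12}{'ok' if ok else 'DOWN'}")
```

Passing probe_fn as a parameter keeps the loop testable without a running stack.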

Prometheus

Scrape config: deploy/prometheus/prometheus.yml

Scrape interval: 15 s

Jobs defined in the scrape config:

Job	Target	Series powered
`protea-api`	`host.docker.internal:8000/metrics`	`protea_*` (HTTP latency, job counters, embedding/prediction histograms, DB pool gauge)
`rabbitmq`	`protea-rabbitmq-1:15692/metrics`	`rabbitmq_queue_messages_ready`, `rabbitmq_queue_messages_unacknowledged`, `rabbitmq_queue_consumers`, publish/deliver rate counters
`postgres`	`protea-postgres-exporter:9187/metrics`	`pg_stat_activity_count`, `pg_stat_database_xact_commit`, `pg_stat_database_deadlocks`, `pg_stat_activity_max_tx_duration`
`prometheus`	`localhost:9090`	Self-scrape: `up{}`, server health

Useful PromQL queries

API p99 latency (last 5 min)

histogram_quantile(
  0.99,
  rate(http_request_duration_seconds_bucket{job="protea-api"}[5m])
)

Worker queue depth (ready messages, all queues)

sum by (queue) (rabbitmq_queue_messages_ready)

Active Postgres sessions

pg_stat_activity_count{datname="protea"}

PROTEA DB connection pool utilisation

protea_db_pool_checked_out / protea_db_pool_size
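
Any of the queries above can also be run outside Grafana through the Prometheus HTTP API (/api/v1/query). The sketch below is one way to do that from a script, assuming Prometheus is reachable on the host port used elsewhere in this runbook. The helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # host port from this runbook


def build_query_url(expr, base=PROMETHEUS_URL):
    """Build an instant-query URL for the /api/v1/query endpoint."""
    return f"{base}/api/v1/query?{urllib.parse.urlencode({'query': expr})}"


def parse_vector(payload):
    """Turn an instant-vector response into [(labels, float_value), ...]."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample looks like {"metric": {...}, "value": [unix_ts, "string_value"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def instant_query(expr, base=PROMETHEUS_URL, timeout=5.0):
    """Run a PromQL instant query and return the parsed samples."""
    with urllib.request.urlopen(build_query_url(expr, base), timeout=timeout) as resp:
        return parse_vector(json.load(resp))


# Usage on the monitoring host:
#   for labels, value in instant_query("sum by (queue) (rabbitmq_queue_messages_ready)"):
#       print(labels.get("queue"), value)
```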

GPU utilisation: no scrape job is currently defined for GPU metrics, so series such as nvidia_smi_gpu_utilization_ratio are not available. When GPU monitoring is required, add a dcgm-exporter or nvidia_gpu_exporter target to prometheus.yml.

Hot-reload the scrape config without restarting Prometheus:

curl -X POST http://localhost:9090/-/reload

Loki

Loki API: http://localhost:3100

Promtail config: deploy/promtail/promtail.yml

Log files tailed: logs/*.log under the deploy working directory

Labels injected per line:

compose_project=protea (static)
service_name=<log-file-stem> (e.g., api, worker-jobs)

Top LogQL queries for operators

Worker errors in the last hour

{compose_project="protea", service_name=~"worker-.*"}
| json
| level="error"
| __error__=""

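API error-level log lines in the last 15 minutes (the filter matches level="error" rather than an HTTP status field, so it serves as a proxy for 5xx responses)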

{compose_project="protea", service_name="api"}
| json
| level="error"
| line_format "{{.message}}"

Stuck job filter (no progress on a known job_id)

Replace <JOB_ID> with the UUID visible in the UI:

{compose_project="protea"}
| json
| job_id="<JOB_ID>"

Verify log ingestion is live:

curl -sf http://localhost:3100/ready && echo "loki ready"

# Force a log line and check it arrives.
curl -sf http://localhost:8000/health > /dev/null
# Query the last line from Loki (URL-encode the query manually or use a tool):
curl -sG http://localhost:3100/loki/api/v1/query_range \
    -d "query=%7Bcompose_project%3D%22protea%22%7D&limit=1" \
    | python3 -m json.tool | head -20
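
Percent-encoding a LogQL selector by hand is error-prone once the query contains pipes and quotes. The sketch below uses the Python standard library to build the same query_range URL as the curl command above. The function names are illustrative, and the base URL assumes the host port from this runbook.

```python
import urllib.parse

LOKI_URL = "http://localhost:3100"  # host port from this runbook


def loki_query_range_url(logql, limit=1, base=LOKI_URL):
    """Build a /loki/api/v1/query_range URL with the LogQL percent-encoded."""
    params = urllib.parse.urlencode({"query": logql, "limit": limit})
    return f"{base}/loki/api/v1/query_range?{params}"


def job_selector(job_id):
    """LogQL for the stuck-job filter shown earlier in this section."""
    return '{compose_project="protea"} | json | job_id="' + job_id + '"'


# Usage: print the URL and pass it to curl, for example
#   print(loki_query_range_url(job_selector("<JOB_ID>"), limit=20))
```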

Grafana

Local URL: http://localhost:3001

Ngrok-exposed URL: https://protea.ngrok.app (tunnels to the Next.js frontend on port 3000; Grafana is accessible at its own port 3001 on the host but is not reverse-proxied through the ngrok tunnel).

Dashboard JSON files live in deploy/grafana/dashboards/.

Dashboards by audience

Developer dashboards (day-to-day pipeline work):

Dashboard	JSON file	Primary signal
PROTEA / API latency	`api-latency.json`	`protea_*` HTTP percentiles
PROTEA / Worker throughput	`worker-throughput.json`	Job counters per operation
PROTEA / Embeddings pipeline	`embeddings-pipeline.json`	Batch throughput + error rate
PROTEA / Logs (Loki)	`logs.json`	Structured log stream

Operator dashboards (infrastructure health):

Dashboard	JSON file	Primary signal
PROTEA / DB connections	`db-connections.json`	`pg_*` pool + activity
PROTEA / Queue depth	`queue-depth.json`	`rabbitmq_queue_messages_ready`
Agent-farm heartbeats	`agent-farm-heartbeats.json`	Per-task error bursts (FARM-UI.7)

The visitors.json dashboard tracks public endpoint traffic via nginx access log metrics (only active when an nginx proxy is in front of the stack).

Top 5 bookmarks

Quick links to check first when something goes wrong:

http://localhost:3001/d/api-latency (API request latency, p50/p95/p99).
http://localhost:3001/d/queue-depth (RabbitMQ queue depths across all workers).
http://localhost:3001/d/db-connections (Postgres connection pool and active sessions).
http://localhost:3001/d/worker-throughput (job success/failure counters per operation).
http://localhost:3001/explore (free-form Loki LogQL and Prometheus PromQL queries).

If Grafana shows “No data” on metric panels, confirm Prometheus is running (curl -sf http://localhost:9090/-/ready) and that the protea-api scrape target is up (curl -sf 'http://localhost:9090/api/v1/targets').
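
The /api/v1/targets response is verbose, so the sketch below keeps only the active targets that are not healthy. It assumes the standard Prometheus targets payload (activeTargets entries with labels, scrapeUrl, health, and lastError). The helper names are illustrative.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for each active target not reporting 'up'."""
    targets = payload["data"]["activeTargets"]
    return [
        (t["labels"].get("job", "?"), t["scrapeUrl"], t.get("lastError", ""))
        for t in targets
        if t["health"] != "up"
    ]


def fetch_targets(base="http://localhost:9090", timeout=5.0):
    """Fetch the raw targets payload from Prometheus."""
    with urllib.request.urlopen(f"{base}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


# Usage on the monitoring host:
#   for job, url, err in unhealthy_targets(fetch_targets()):
#       print(f"{job}: {url} -> {err or 'no error recorded'}")
```

An empty result means every scrape target is up, which points the "No data" investigation at the Grafana datasource or the panel query instead.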

Alerting

Alert rules file: deploy/grafana/provisioning/alerting/rules.yml

Contact points file: deploy/grafana/provisioning/alerting/contact-points.yml

Routing policy: deploy/grafana/provisioning/alerting/policies.yml

Currently one alert rule is provisioned (FARM-UI.7):

Alert	Severity	Condition
agent-farm error burst	warning	3 or more `level=error` agent-farm heartbeats in a 5-minute window for the same `task_id`
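
To make the condition concrete, the sketch below replays a list of heartbeat events and reports which task_id values would cross the threshold. It only illustrates the semantics in the table: Grafana evaluates the real rule from rules.yml against Loki. The event tuple layout and the function name are assumptions made for this example.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60  # 5-minute window from the rule description
THRESHOLD = 3            # "3 or more" error heartbeats


def bursting_tasks(events, window=WINDOW_SECONDS, threshold=THRESHOLD):
    """Return task_ids with at least `threshold` error events inside any window.

    events: iterable of (unix_ts, task_id, level) tuples, in any order.
    """
    recent = defaultdict(deque)  # task_id -> error timestamps still in the window
    tripped = set()
    for ts, task_id, level in sorted(events):
        if level != "error":
            continue
        q = recent[task_id]
        q.append(ts)
        # Drop errors that have fallen out of the window ending at ts.
        while q and ts - q[0] >= window:
            q.popleft()
        if len(q) >= threshold:
            tripped.add(task_id)
    return tripped
```

This can be handy for checking exported log lines offline when deciding whether a Slack notification was expected.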

The alert routes via the agent-farm Grafana folder to the Slack contact point keyed on SLACK_WEBHOOK_URL. Repeat interval is 30 minutes (one immediate fire, then at most one follow-up per half hour).

Setup (export the webhook URL, then start the stack):

export SLACK_WEBHOOK_URL='https://hooks.slack.com/services/...'
docker compose -f docker-compose.monitoring.yml up -d

When SLACK_WEBHOOK_URL is unset, Grafana still boots cleanly: the contact point is provisioned with an empty URL, so alerts go nowhere and there is no 4xx storm.

Acknowledging an alert in Grafana: open the alert rule in Alerting > Alert rules > agent-farm, click Silence and set a duration. Silences do not stop evaluation; they suppress Slack notifications.

Planned alerts (not yet provisioned):

The following conditions are currently detected only by log inspection or manual Prometheus queries. Adding Grafana alert rules for them is tracked under F-OPS:

worker-queue-stuck: rabbitmq_queue_messages_ready > N for more than 10 minutes with no consumer activity.
postgres-down: pg_up == 0 for more than 30 seconds.
ngrok-tunnel-down: the deploy-keeper supervisor already polls for this and escalates to a janitor agent; a Grafana alert is additive redundancy.

Tracing

Distributed tracing via the OpenTelemetry SDK is implemented but disabled by default. The four PROTEA_OTEL_* environment variables control it; see Observability: OpenTelemetry SDK for the full variable reference and quick-start.

Current scope (T5.1a): SDK boot, OTLP/HTTP exporter, FastAPI auto-instrumentation. Each HTTP request to the API generates a root span with the route template as the operation name.

No dedicated OTel collector is deployed. To enable tracing locally, point PROTEA_OTEL_ENDPOINT at a Jaeger all-in-one container (see Observability: OpenTelemetry SDK Quick-start). A production OTel collector is planned under ADR-D7; until it lands, set PROTEA_OTEL_ENABLED=false (the default) in production.
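
Before enabling tracing locally, a small pre-flight check can catch the most common misconfiguration (tracing switched on with no endpoint). This is a sketch only: it covers the two variables named in this runbook, and it assumes that only the literal value "true" enables tracing and that the endpoint must be an http(s) URL. PROTEA's actual parsing rules and defaults are documented in Observability: OpenTelemetry SDK and may differ.

```python
import os


def tracing_preflight(env=None):
    """Return (enabled, problems) for PROTEA_OTEL_ENABLED and PROTEA_OTEL_ENDPOINT."""
    env = os.environ if env is None else env
    # Assumption: only "true" (case-insensitive) switches tracing on.
    enabled = env.get("PROTEA_OTEL_ENABLED", "false").strip().lower() == "true"
    endpoint = env.get("PROTEA_OTEL_ENDPOINT", "").strip()
    problems = []
    if enabled and not endpoint:
        problems.append("PROTEA_OTEL_ENABLED=true but PROTEA_OTEL_ENDPOINT is unset")
    elif enabled and not endpoint.startswith(("http://", "https://")):
        # Assumption: the OTLP/HTTP exporter needs an http(s) URL.
        problems.append("PROTEA_OTEL_ENDPOINT should be an http(s) URL for OTLP/HTTP")
    return enabled, problems


# Usage:
#   enabled, problems = tracing_preflight()
#   print("tracing enabled" if enabled else "tracing disabled", problems)
```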

Upcoming (T5.1b): SQLAlchemy + pika instrumentation and traceparent propagation across the HTTP-to-queue-to-worker boundary.
