Observability: Prometheus metrics
PROTEA exposes a Prometheus scrape endpoint on the API and ships the
metric dashboards (api-latency, worker-throughput, embeddings-pipeline,
db-connections, queue-depth) in the Grafana monitoring stack. The missing piece for a
long time was the scraper itself: the Grafana protea-prometheus
datasource pointed at host.docker.internal:9090 but no Prometheus
server ran there, so every metric panel showed “No data”. This runbook
covers bringing that Prometheus server up.
Prometheus runs as a container inside
docker-compose.monitoring.yml alongside Grafana and Loki. The scrape
configuration is committed at deploy/prometheus/prometheus.yml.
How the pipeline fits together
PROTEA API (host process, :8000/metrics, protea_* series)
|
v
prometheus:9090 (monitoring compose, scrapes via host.docker.internal)
|
v
Grafana (PROTEA Prometheus datasource, uid protea-prometheus)
|
v
dashboards (api-latency, worker-throughput, embeddings-pipeline,
db-connections, queue-depth)
Two exporters bridge the gap for the queue-depth and db-connections dashboards:
RabbitMQ (rabbitmq_* series): the rabbitmq_prometheus plugin built into rabbitmq:3-management exposes metrics on port 15692. docker/rabbitmq/enabled_plugins activates the plugin explicitly so the image never needs a custom entrypoint.
Postgres (pg_* series): prometheuscommunity/postgres-exporter runs as a sidecar in docker-compose.monitoring.yml and connects to host-published Postgres on port 5432.
The application code does not need to know anything about Prometheus.
The API already serves the Prometheus text exposition format at
/metrics (and the canonical alias /v1/metrics) on port 8000; the
collector registry is built once at API startup in
protea.api.app.create_app and rendered on demand by
protea/api/routers/metrics.py.
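To make the exposition format concrete, here is a minimal sketch that parses a few sample lines with the Python standard library. The sample text is illustrative only: protea_jobs_total is named in this runbook, but the HELP text, label names, and values are assumptions. The parser handles simple lines only (no timestamps, no escaped quotes inside label values).

```python
import re

# Illustrative sample only: HELP text, labels, and values are assumptions.
SAMPLE = """\
# HELP protea_jobs_total Total jobs processed.
# TYPE protea_jobs_total counter
protea_jobs_total{status="succeeded"} 42
protea_jobs_total{status="failed"} 3
"""

# metric_name, optional {labels}, then a single value token
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples for simple exposition lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if match is None:
            continue  # ignore anything this simple sketch cannot parse
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_samples(SAMPLE):
        print(name, labels, value)
```

Feeding the function the body returned by the /metrics endpoint instead of SAMPLE gives a quick way to list which protea_* series the API currently exposes.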
Starting Prometheus
Prometheus runs inside docker-compose.monitoring.yml. Bring up the
full monitoring stack from the repo root:
docker compose -f docker-compose.monitoring.yml up -d
curl -sf http://localhost:9090/-/ready && echo "prometheus ready"
The container publishes 9090 on the host because the Grafana datasource
reaches it through host.docker.internal:9090 (the same host-gateway
convention used for Postgres). Grafana can also reach the same container by
service name (http://prometheus:9090) on the protea_monitoring
bridge network, if you prefer to point the datasource url at the
in-network name.
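Prometheus may take a few seconds to answer after the container starts, so a scripted bring-up usually polls the readiness endpoint rather than checking once. The following is a minimal sketch using only the standard library; the function names, attempt count, and delay are assumptions, not part of the PROTEA tooling.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx responses.
        return False


def wait_until_ready(url, attempts=30, delay=1.0, probe=http_ok, sleep=time.sleep):
    """Poll url until probe(url) succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to wait after the final attempt
    return False

# Example call, run from the host after `docker compose ... up -d`:
#   wait_until_ready("http://localhost:9090/-/ready")
```

The probe and sleep parameters exist so the loop can be exercised without a running server; in normal use the defaults are enough.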
Scrape targets
The committed scrape config defines these jobs:
protea-api
Scrapes host.docker.internal:8000/metrics. This is the host-run PROTEA API; the application stack is not part of the monitoring compose project, so the target uses host.docker.internal (mapped to the docker host gateway via extra_hosts). This job feeds every protea_* series: HTTP request latency, job counters, embedding and prediction batch histograms, and the DB pool gauge.
prometheus
Self-scrape so the up{} series and the server’s own health are visible.
rabbitmq
Scrapes host.docker.internal:15692/metrics via the rabbitmq_prometheus plugin. Port 15692 is published on the host by the rabbitmq service in docker-compose.yml and docker-compose.bundle.yml. This job powers the queue-depth dashboard (rabbitmq_queue_messages_ready, rabbitmq_queue_messages_unacknowledged, rabbitmq_queue_consumers, and the publish/deliver rate counters).
postgres
Scrapes host.docker.internal:9187/metrics from the prometheuscommunity/postgres-exporter sidecar in docker-compose.monitoring.yml. This job powers the server-side panels of the db-connections dashboard (pg_stat_activity_count, pg_stat_database_xact_commit/rollback, pg_stat_activity_max_tx_duration, pg_stat_database_deadlocks).
Only the API process serves /metrics. The queue workers do not open
their own metrics port, so worker-emitted counters surface through the
shared API registry rather than a per-worker scrape target. There is no
separate worker scrape job by design.
Verifying metrics reach Prometheus
1. Confirm the API is up and serving metrics on the host:
   curl -sf http://localhost:8000/metrics | head -20
   The output should be Prometheus text exposition lines such as
   # HELP protea_jobs_total ...
   A connection refused here means the PROTEA application stack is not running; Prometheus has nothing to scrape until it is.
2. Check the scrape target health from Prometheus:
   curl -sf 'http://localhost:9090/api/v1/targets' \
     | grep -o '"health":"[a-z]*"'
   The protea-api target should report "health":"up". A down state with a connection refused last-error means Prometheus cannot reach the host API; see Troubleshooting.
3. Open Grafana at http://localhost:3001 and confirm the PROTEA / API latency (and other metric) dashboards populate within a couple of scrape intervals (15s each).
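The grep in the target health check shows health values without saying which job each belongs to. As a rough sketch, the JSON from /api/v1/targets can be summarized per job instead. The job names come from the scrape targets described in this runbook; the helper names and the trimmed sample response are assumptions, and the code assumes one target per job.

```python
import json

# Job names taken from the scrape targets listed in this runbook.
EXPECTED_JOBS = ("protea-api", "prometheus", "rabbitmq", "postgres")

# Trimmed, illustrative response; real responses carry many more fields.
SAMPLE_RESPONSE = """
{"status": "success", "data": {"activeTargets": [
  {"labels": {"job": "protea-api"}, "health": "down",
   "lastError": "connection refused"},
  {"labels": {"job": "prometheus"}, "health": "up", "lastError": ""}
]}}
"""


def summarize_targets(payload, expected=EXPECTED_JOBS):
    """Map each expected job to its health, or 'missing' if not listed."""
    health = {job: "missing" for job in expected}
    for target in payload.get("data", {}).get("activeTargets", []):
        job = target.get("labels", {}).get("job")
        if job in health:
            # Assumes one target per job; a later target overwrites an earlier one.
            health[job] = target.get("health", "unknown")
    return health


def failing_jobs(summary):
    """Return the jobs that are not currently 'up', sorted by name."""
    return sorted(job for job, state in summary.items() if state != "up")


if __name__ == "__main__":
    summary = summarize_targets(json.loads(SAMPLE_RESPONSE))
    print(summary)
    print("not up:", failing_jobs(summary))
```

A job reported as missing is absent from the active targets altogether, which points at the scrape config rather than at the target itself.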
Reloading the scrape config without a restart
The container enables the lifecycle API (the --web.enable-lifecycle
flag in the compose command), so an edited
deploy/prometheus/prometheus.yml can be hot-reloaded:
curl -X POST http://localhost:9090/-/reload
Validate the file before reloading if promtool is available:
promtool check config deploy/prometheus/prometheus.yml
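The two commands above can be chained so that a reload is attempted only when validation passes. This is a minimal sketch, assuming promtool is on the PATH and the script runs from the repo root; the function names are hypothetical.

```python
import subprocess
import urllib.request

CONFIG_PATH = "deploy/prometheus/prometheus.yml"
RELOAD_URL = "http://localhost:9090/-/reload"


def promtool_ok(path):
    """Return True if `promtool check config` accepts the file."""
    # Raises FileNotFoundError when promtool is not installed.
    result = subprocess.run(["promtool", "check", "config", path])
    return result.returncode == 0


def post_reload(url):
    """POST to the lifecycle reload endpoint; True on HTTP 200."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 200


def validate_and_reload(path=CONFIG_PATH, url=RELOAD_URL,
                        check=promtool_ok, reload=post_reload):
    """Reload only when validation passes; return True if reloaded."""
    if not check(path):
        return False  # leave the running config untouched
    return reload(url)
```

Skipping the reload on a failed check keeps the running server on its last good configuration.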
Troubleshooting
Dashboards still show “No data” with Prometheus running
First confirm the application stack is up (step 1 above). Prometheus
only scrapes; it emits no protea_* series of its own. With no API
running, every protea-api panel is empty even though Prometheus is
healthy.
protea-api target is down with “connection refused”
Prometheus reaches the host API through host.docker.internal. On
this host the mapping is provided by the extra_hosts:
host.docker.internal:host-gateway entry on the prometheus service. If
the target is down, confirm the API listens on 0.0.0.0:8000 (not
127.0.0.1) so the docker bridge gateway can reach it, and that no
host firewall blocks the bridge subnet.
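One way to tell a loopback-only bind apart from a stopped API is to test the port on more than one address from the host. The sketch below is a plain TCP connect check using the standard library; the function name is an assumption.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable host, or timeout.
        return False

# Example, run on the docker host:
#   port_open("127.0.0.1", 8000)
#   port_open("<bridge gateway address>", 8000)
```

If the loopback check succeeds while the check against the docker bridge gateway address fails, the API is probably bound to 127.0.0.1 or a firewall is blocking the bridge subnet. The gateway address varies by host; docker network inspect on the monitoring network shows it.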
Datasource error in Grafana (“bad gateway” / “no such host”)
The protea-prometheus datasource url is
http://host.docker.internal:9090. Grafana resolves that via its own
extra_hosts host-gateway entry. If you removed the published 9090
port or renamed the container, point the datasource at
http://prometheus:9090 (the in-network service name) instead and
restart Grafana.
queue-depth panels still empty after bring-up
Confirm the RabbitMQ container is running and port 15692 is reachable from the host:
curl -sf http://localhost:15692/metrics | head -5
If this fails, the plugin may not be enabled in the container. Check
that docker/rabbitmq/enabled_plugins is mounted into the container
and that the file contains rabbitmq_prometheus. A container restart
is required after mounting the file for the first time.
db-connections “server-side” panels (pg_* series) empty
Confirm the postgres-exporter sidecar is running:
curl -sf http://localhost:9187/metrics | head -5
If it is not running, start the exporter service from the monitoring stack:
docker compose -f docker-compose.monitoring.yml up -d postgres-exporter
If the exporter starts but immediately exits, check its logs for a
connection-refused error. The DATA_SOURCE_NAME defaults to
postgresql://protea:protea@host.docker.internal:5432/protea?sslmode=disable.
Override via POSTGRES_EXPORTER_DATA_SOURCE_NAME in the environment if
the dev Postgres instance uses different credentials.
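When overriding the data source name, a mistyped host, port, or database is an easy cause of connection errors. As a small sketch, the DSN can be split into its components and inspected without printing the password. The default value is the one quoted above; the helper name is an assumption.

```python
import os
from urllib.parse import parse_qs, urlsplit

DEFAULT_DSN = (
    "postgresql://protea:protea@host.docker.internal:5432/protea"
    "?sslmode=disable"
)


def describe_dsn(dsn):
    """Return the non-secret components of a postgresql:// DSN."""
    parts = urlsplit(dsn)
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "sslmode": parse_qs(parts.query).get("sslmode", [None])[0],
    }


if __name__ == "__main__":
    # Falls back to the documented default when no override is set.
    dsn = os.environ.get("POSTGRES_EXPORTER_DATA_SOURCE_NAME", DEFAULT_DSN)
    print(describe_dsn(dsn))
```

Comparing the printed host and port with what the dev Postgres instance publishes is usually enough to spot a bad override.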
Operational notes
Retention
The compose command sets the TSDB retention to 15 days (the
--storage.tsdb.retention.time flag). The TSDB lives on the
prometheus_data named volume. Raise the flag value for
deployments that need more history, or configure remote write to a
long-term store (Thanos, Mimir, Cortex); that
migration is out of scope here.
Auth
Prometheus ships with no authentication. Keep port 9090 off the public internet. A reverse proxy with basic auth is the simplest hardening step, mirroring the Loki guidance.
See also
ADR-D7: Observability stack for the rationale behind the observability stack choice.
Observability: Loki log aggregation for the log-aggregation side of the same monitoring stack.
Observability: OpenTelemetry SDK for the OpenTelemetry tracing side of the stack.
deploy/prometheus/prometheus.yml for the committed scrape config.
deploy/grafana/provisioning/datasources/prometheus.yml for the Grafana datasource (uid protea-prometheus).
protea/api/routers/metrics.py and protea/infrastructure/telemetry.py for the metrics endpoint and the collector registry.