Observability: OpenTelemetry SDK
PROTEA ships an optional distributed tracing layer built on the
OpenTelemetry SDK (T5.1a, ADR-D7).
protea/infrastructure/telemetry.py is the single entry point: it
reads four PROTEA_OTEL_* environment variables, boots a global
TracerProvider, and instruments the
FastAPI application. When the OTel SDK packages are absent or tracing is
disabled, the module degrades to a no-op (logging a warning if tracing
was requested but the SDK is missing) rather than crashing the API boot.
T5.1a scope (current): SDK boot, OTLP/HTTP exporter, FastAPI auto-instrumentation.
T5.1b (next): SQLAlchemy and pika instrumentation plus
traceparent propagation across the HTTP-to-queue-to-worker boundary.
Environment variables
All four variables default to a safe “opt-in” state, so an environment set
up with a plain poetry install starts PROTEA without requiring a running
collector.
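| Variable | Default | Description |
|---|---|---|
| PROTEA_OTEL_ENABLED | | Truthy values |
| PROTEA_OTEL_ENDPOINT | (none) | OTLP HTTP exporter endpoint, e.g. |
| PROTEA_OTEL_SERVICE_NAME | | |
| PROTEA_OTEL_SAMPLE_RATIO | 1.0 | |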
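As a rough illustration of how four such variables could be resolved into one config object, here is a minimal sketch. It is not the module's actual code: TelemetryConfigSketch, resolve_config_sketch, the set of truthy strings, the "protea" service-name default, and the clamping of the ratio are all assumptions made for the example. Only the variable names and the 1.0 sample-ratio default come from this page.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: the accepted truthy strings are not listed on this page;
# this set is a common convention, not PROTEA's confirmed list.
_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class TelemetryConfigSketch:
    """Hypothetical stand-in for the module's TelemetryConfig."""

    enabled: bool
    endpoint: Optional[str]
    service_name: str
    sample_ratio: float


def resolve_config_sketch(env: Mapping[str, str] = os.environ) -> TelemetryConfigSketch:
    """Read the four PROTEA_OTEL_* variables without booting anything."""
    try:
        ratio = float(env.get("PROTEA_OTEL_SAMPLE_RATIO", "1.0"))
    except ValueError:
        ratio = 1.0  # assumption: fall back to the documented default
    ratio = min(1.0, max(0.0, ratio))  # assumption: clamp into [0.0, 1.0]
    return TelemetryConfigSketch(
        enabled=env.get("PROTEA_OTEL_ENABLED", "").strip().lower() in _TRUTHY,
        endpoint=env.get("PROTEA_OTEL_ENDPOINT") or None,
        # Assumption: the real default service name is not shown here.
        service_name=env.get("PROTEA_OTEL_SERVICE_NAME", "protea"),
        sample_ratio=ratio,
    )
```

Passing a plain dict instead of os.environ keeps the function easy to exercise in tests, for example resolve_config_sketch({"PROTEA_OTEL_ENABLED": "1"}).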
Installing the optional dependencies
The OTel SDK and FastAPI instrumentor live in the optional
telemetry poetry group and are not installed by a plain
poetry install:
# Install the base dependencies plus the telemetry extras.
poetry install --with telemetry
The group pins the following packages:
opentelemetry-api >=1.27.0,<2.0.0
opentelemetry-sdk >=1.27.0,<2.0.0
opentelemetry-exporter-otlp-proto-http >=1.27.0,<2.0.0
opentelemetry-instrumentation-fastapi >=0.48b0,<1.0.0
If the SDK is not installed but PROTEA_OTEL_ENABLED=1 is set, the
API boots normally and logs a single WARNING:
telemetry enabled but opentelemetry SDK not installed (...);
tracing will be a no-op. Install the `telemetry` extra or run
`poetry install --with telemetry`.
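The warn-and-continue behaviour described above is commonly implemented with a guarded import. The sketch below shows that pattern under stated assumptions: sdk_available and the logger name are hypothetical, and the real module may probe different packages or phrase the check differently.

```python
import importlib
import logging

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("protea.telemetry.sketch")


def sdk_available(module_name: str = "opentelemetry.sdk.trace") -> bool:
    """Return True if the module imports; otherwise warn once and return False."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        # Log instead of raising so that the API boot continues.
        logger.warning(
            "telemetry enabled but opentelemetry SDK not installed (%s); "
            "tracing will be a no-op. Install the `telemetry` extra or run "
            "`poetry install --with telemetry`.",
            exc,
        )
        return False
    return True
```

A caller would branch on the return value and skip all provider setup when it is False.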
Quick-start: enabling tracing locally
1. Install telemetry dependencies (one-off):
    poetry install --with telemetry
2. Start a local OTLP collector (Jaeger all-in-one is the simplest option):
    docker run --rm -p 4318:4318 -p 16686:16686 \
        jaegertracing/all-in-one:latest
3. Set the required variables and start the API:
    export PROTEA_OTEL_ENABLED=1
    export PROTEA_OTEL_ENDPOINT=http://localhost:4318
    export PROTEA_OTEL_SERVICE_NAME=protea-api
    poetry run uvicorn protea.api.app:create_app \
        --factory --host 0.0.0.0 --port 8000
4. Send any request to the API and open the Jaeger UI at
   http://localhost:16686 to inspect the trace.
To enable tracing in the docker-compose stack, add the variables to the
api service in docker-compose.yml (or a docker-compose.override.yml
that is not committed):
services:
  api:
    environment:
      PROTEA_OTEL_ENABLED: "1"
      PROTEA_OTEL_ENDPOINT: "http://otel-collector:4318"
      PROTEA_OTEL_SERVICE_NAME: "protea-api"
FastAPI auto-instrumentation behaviour
When the telemetry group is installed and PROTEA_OTEL_ENABLED=1 is
set, configure_telemetry(app) calls
FastAPIInstrumentor.instrument_app(app) before any middleware is
registered. As a result, every incoming HTTP request receives a root span
with the route template as the operation name (e.g.
GET /jobs/{job_id}). The span carries HTTP status, method, and
route attributes following the OTel HTTP semantic conventions.
configure_telemetry is idempotent: a second call on an already-booted
process is a no-op (logged at DEBUG). This matters during test runs
that reinitialise the FastAPI app.
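One simple way to obtain this kind of idempotence is a process-wide flag that is checked before instrumenting. The sketch below assumes that approach; the page does not say how configure_telemetry tracks its state, and configure_once, _BOOTED, and the injectable instrument parameter are hypothetical names introduced for illustration.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("protea.telemetry.sketch")  # hypothetical name

_BOOTED = False  # hypothetical process-wide "already configured" flag


def configure_once(app: Any, instrument: Optional[Callable[[Any], Any]] = None) -> bool:
    """Instrument the app on the first call; later calls are a logged no-op.

    Returns True when this call performed the boot, False otherwise.
    """
    global _BOOTED
    if _BOOTED:
        logger.debug("telemetry already configured; skipping")
        return False
    if instrument is None:
        # Imported lazily so the module still loads without the optional
        # telemetry group installed.
        from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

        instrument = FastAPIInstrumentor.instrument_app
    instrument(app)
    _BOOTED = True
    return True
```

The instrument parameter exists only so the guard can be exercised without the optional packages, for example by passing a list's append method and checking that it was called exactly once.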
Soft-degrade fallback
The module is designed to never block the API or worker boot:
- If PROTEA_OTEL_ENABLED is falsy (or unset), configure_telemetry returns immediately with the resolved TelemetryConfig and touches nothing.
- If the OTel SDK packages are missing, a single WARNING is emitted and the function returns the config without raising.
- If opentelemetry-instrumentation-fastapi is missing but the core SDK is present, the SDK boots successfully and only HTTP-server spans are disabled (a second WARNING is emitted).
The resolved config is stored on app.state.telemetry so it can be
surfaced through /health in a future iteration.
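The three fallback cases above amount to a short decision ladder. The following sketch encodes it as a pure function so that the ordering of the checks is explicit; BootOutcome and boot_outcome are hypothetical names, and the real module performs side effects (logging, provider setup) that are omitted here.

```python
from enum import Enum


class BootOutcome(Enum):
    """Hypothetical labels for the states described in this section."""

    DISABLED = "disabled"
    SDK_MISSING = "sdk-missing"
    TRACING_WITHOUT_HTTP_SPANS = "tracing-without-http-spans"
    FULL = "full"


def boot_outcome(enabled: bool, sdk_installed: bool,
                 fastapi_instrumentor_installed: bool) -> BootOutcome:
    """Mirror the documented order: the enabled flag, then SDK, then instrumentor."""
    if not enabled:
        return BootOutcome.DISABLED  # returns immediately, touches nothing
    if not sdk_installed:
        return BootOutcome.SDK_MISSING  # single WARNING, no exception
    if not fastapi_instrumentor_installed:
        # SDK boots; only HTTP-server spans are unavailable (second WARNING).
        return BootOutcome.TRACING_WITHOUT_HTTP_SPANS
    return BootOutcome.FULL
```

None of the branches raises, which is the property that keeps the API and worker boot unblocked.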
Diagnosis
Verify the current telemetry config at runtime
resolve_telemetry_config() can be called without booting anything:
python3 - <<'EOF'
import os
os.environ.setdefault("PROTEA_OTEL_ENABLED", "0")
from protea.infrastructure.telemetry import resolve_telemetry_config
cfg = resolve_telemetry_config()
print(cfg)
EOF
Check whether the provider was installed
from opentelemetry import trace
from opentelemetry.trace import ProxyTracerProvider
provider = trace.get_tracer_provider()
print("OTel active:", not isinstance(provider, ProxyTracerProvider))
A ProxyTracerProvider means no SDK boot happened in this process.
Inspect startup logs for the boot confirmation line
When tracing boots successfully, the API logs an INFO line:
telemetry boot: service=protea-api endpoint=http://otel-collector:4318 sample_ratio=1.0
If this line is absent and the DEBUG line
telemetry disabled (PROTEA_OTEL_ENABLED is not truthy) appears instead,
the feature is off.
Check the OTLP endpoint is reachable
# Replace ${OTEL_TRACES_URL} with the value of PROTEA_OTEL_ENDPOINT
# plus the OTLP traces path (see the OTLP/HTTP spec for the current
# API version), e.g. http://otel-collector:4318/<api-version>/traces.
curl -v "${OTEL_TRACES_URL}" \
-H "Content-Type: application/json" \
-d '{}' 2>&1 | grep -E "^< HTTP|Connection refused|Could not resolve"
An HTTP status line such as 405 Method Not Allowed confirms the endpoint
is reachable: the collector rejected the request, but the TCP connection
succeeded.
Connection refused or Could not resolve host points to a
misconfigured PROTEA_OTEL_ENDPOINT or a collector that is not running.
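The same reachability check can be scripted with the Python standard library when curl is not available in the container. This is a sketch, not part of PROTEA: probe_endpoint and its opener parameter are hypothetical, and the URL passed in must already include the OTLP traces path, exactly as for the curl command.

```python
import urllib.error
import urllib.request
from typing import Any, Callable


def probe_endpoint(url: str,
                   opener: Callable[..., Any] = urllib.request.urlopen,
                   timeout: float = 3.0) -> str:
    """POST an empty JSON body and report whether anything answered."""
    request = urllib.request.Request(
        url,
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout) as response:
            return f"reachable (HTTP {response.status})"
    except urllib.error.HTTPError as exc:
        # An HTTP error status still proves that the TCP connection worked.
        # HTTPError subclasses URLError, so it must be caught first.
        return f"reachable (HTTP {exc.code})"
    except urllib.error.URLError as exc:
        # DNS failures and refused connections end up here.
        return f"unreachable ({exc.reason})"
    except OSError as exc:
        # For example a socket timeout raised outside URLError.
        return f"unreachable ({exc})"
```

Any "reachable" result, whatever the status code, rules out the network path as the cause; an "unreachable" result points back at PROTEA_OTEL_ENDPOINT or the collector itself.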
No spans appearing in the collector
- Confirm the BatchSpanProcessor queue is not full. Under sustained load the batch queue can back up; this appears as dropped-span log lines from the OTel SDK internals (level WARNING, prefix Failed to export).
- Check the sample ratio. A PROTEA_OTEL_SAMPLE_RATIO=0.0 silently discards every trace:
    echo ${PROTEA_OTEL_SAMPLE_RATIO:-"(unset, default 1.0)"}
- Confirm the OTLP traces suffix (/<api-version>/traces per the OTLP/HTTP spec) is not duplicated. The module appends it automatically, so the variable should be set to the bare collector root (e.g. http://otel-collector:4318), not the full path with the traces suffix already attached.
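Two of the checks above, the zero sample ratio and the duplicated traces suffix, are easy to automate. The helper below is a hypothetical lint function, not something PROTEA ships; it only looks at whether the endpoint path ends in /traces and makes no assumption about the API version segment.

```python
from typing import List
from urllib.parse import urlsplit


def lint_telemetry_env(endpoint: str, sample_ratio: str = "1.0") -> List[str]:
    """Return human-readable problems found in the two values (empty if none)."""
    problems: List[str] = []
    parts = urlsplit(endpoint)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        problems.append("endpoint is not an http(s) URL")
    if parts.path.rstrip("/").endswith("/traces"):
        # The module appends the traces suffix itself, so this would double it.
        problems.append("endpoint already carries a traces suffix; "
                        "set the bare collector root")
    try:
        ratio = float(sample_ratio)
    except ValueError:
        problems.append("sample ratio is not a number")
    else:
        if ratio == 0.0:
            problems.append("sample ratio 0.0 discards every trace")
    return problems
```

For example, lint_telemetry_env("http://otel-collector:4318") returns an empty list, while an endpoint whose path ends in /traces combined with a ratio of "0.0" yields two entries.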
Operational notes
Worker service names
Each worker process should set PROTEA_OTEL_SERVICE_NAME at startup
so traces distinguish API spans from queue-worker spans:
PROTEA_OTEL_SERVICE_NAME=protea-worker-embeddings \
python3 -m protea.workers.run_worker embeddings
Production sample ratio
PROTEA_OTEL_SAMPLE_RATIO=1.0 (the default) records every trace. For
high-throughput production deployments, lower the ratio at the worker
level or configure tail sampling at the collector instead of head
sampling here:
# Record 10% of traces from this process.
export PROTEA_OTEL_SAMPLE_RATIO=0.1
The ParentBased wrapper ensures that child spans respect the
sampling decision made by the parent so partial traces do not appear in
the collector.
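To make the interaction between the ratio and the parent decision concrete, here is a dependency-free sketch of the logic. It illustrates the general technique (a deterministic ratio test on the trace id, wrapped so that an existing parent decision wins) and is not the SDK's code. The SDK's own samplers live in opentelemetry.sdk.trace.sampling as ParentBased and TraceIdRatioBased; that PROTEA combines exactly those two is an assumption, and both function names below are hypothetical.

```python
from typing import Optional

_TRACE_ID_MASK = (1 << 64) - 1  # only the low 64 bits take part in the test


def ratio_decision(trace_id: int, ratio: float) -> bool:
    """Head-sample deterministically: the same trace id always gets the same answer."""
    return (trace_id & _TRACE_ID_MASK) < round(ratio * (1 << 64))


def parent_based_decision(trace_id: int, ratio: float,
                          parent_sampled: Optional[bool]) -> bool:
    """Follow the parent's decision when there is one; otherwise apply the ratio.

    parent_sampled is None for a root span, True or False for a child span.
    """
    if parent_sampled is not None:
        return parent_sampled
    return ratio_decision(trace_id, ratio)
```

Because a child span never re-rolls the decision, a trace is either recorded in full or dropped in full, even when the services involved run with different ratios.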
T5.1b extension
The next sub-slice (T5.1b) adds:
- opentelemetry-instrumentation-sqlalchemy wrapping the session factory.
- opentelemetry-instrumentation-pika wrapping the AMQP publisher and consumer.
- traceparent header injection and extraction so a single API request that dispatches a queue message appears as one end-to-end trace in the collector.
This runbook will be extended in-place when T5.1b merges. No
configuration changes are required on the operator side; the same four
PROTEA_OTEL_* variables control the whole stack.
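For readers unfamiliar with traceparent propagation, the sketch below shows the mechanics on a plain dict of message headers, using the W3C Trace Context header layout (version, 32-hex trace id, 16-hex parent span id, 2-hex flags). It is an illustration only: T5.1b is not yet merged, the two function names are hypothetical, and a real implementation would more likely rely on the inject and extract helpers in opentelemetry.propagate than on hand-rolled parsing.

```python
import re
from typing import MutableMapping, Mapping, Optional, Tuple

# Version 00 layout: 00-<trace-id>-<parent-id>-<flags>, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject_traceparent(headers: MutableMapping[str, str], trace_id: int,
                       span_id: int, sampled: bool) -> None:
    """Publisher side: write the current span context into the message headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id:032x}-{span_id:016x}-{flags}"


def extract_traceparent(headers: Mapping[str, str]) -> Optional[Tuple[int, int, bool]]:
    """Consumer side: return (trace_id, parent_span_id, sampled), or None if invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = (int(group, 16) for group in match.groups())
    if trace_id == 0 or span_id == 0:
        return None  # all-zero ids are invalid in W3C Trace Context
    return trace_id, span_id, bool(flags & 0x01)
```

When the worker starts its span with the extracted trace id and parent span id, the collector stitches the API span and the worker span into one trace.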
See also
- ADR-D7: Observability stack for the design rationale behind the OTel stack choice (versus ELK/Jaeger-native SDK).
- protea/infrastructure/telemetry.py for the full module docstring and API surface.
- tests/test_telemetry.py for unit-level examples of resolve_telemetry_config and configure_telemetry usage.