ADR-004: Dead letter queue and retries
- Date: 2026-03-18
- Author: frapercan
- Status: Accepted
Context
Two related messaging problems:
- Lost messages: when a message failed permanently (invalid JSON, unknown operation), it was discarded with basic_nack. The payload disappeared, so no post-mortem was possible.
- Aggressive retries: transient failures (broker down, GPU busy) were retried immediately, amplifying load on a service that was already struggling.
Decision
Dead letter queue. All queues are declared with
x-dead-letter-exchange: protea.dlx. Rejected messages
(nack without requeue) end up in protea.dead-letter, a durable
queue where they can be inspected, fixed, and republished.
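The declaration described above might look roughly like the following sketch. It is not PROTEA's actual code: the exchange type (fanout), the durability of the work queue, and the helper names are assumptions made for illustration. The `channel` argument is expected to behave like a pika `BlockingChannel`.

```python
from typing import Any, Dict

DLX_NAME = "protea.dlx"          # exchange name from the ADR
DLQ_NAME = "protea.dead-letter"  # queue name from the ADR


def dead_letter_arguments() -> Dict[str, Any]:
    """Queue arguments that route rejected messages to the DLX."""
    return {"x-dead-letter-exchange": DLX_NAME}


def declare_with_dlq(channel: Any, queue_name: str) -> None:
    """Declare the DLX, the dead-letter queue, and one work queue.

    Hypothetical helper; `channel` should expose pika's
    exchange_declare / queue_declare / queue_bind methods.
    """
    # Assumption: a fanout exchange, so no routing key is needed.
    channel.exchange_declare(exchange=DLX_NAME, exchange_type="fanout", durable=True)
    channel.queue_declare(queue=DLQ_NAME, durable=True)
    channel.queue_bind(queue=DLQ_NAME, exchange=DLX_NAME)
    # Work queue (durability assumed): messages rejected with
    # requeue=False are dead-lettered to DLX_NAME.
    channel.queue_declare(
        queue=queue_name, durable=True, arguments=dead_letter_arguments()
    )
```

With queues declared this way, a consumer that calls `channel.basic_nack(delivery_tag=..., requeue=False)` on a permanently failed message sends it to the dead-letter queue instead of dropping it.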
Publisher retries. Exponential backoff: 5 attempts with delays of 1, 2, 4, 8, 16s (capped at 30s). If the connection is broken, it is discarded and a new one is created.
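A minimal sketch of this publisher schedule follows. The `publish` and `reconnect` callables are hypothetical stand-ins, and the built-in `ConnectionError` stands in for whatever exception the real client raises. The sketch waits after every failed attempt, which matches the five listed delays; the ADR does not say whether the real publisher waits after the last one.

```python
import time
from typing import Callable, List, Optional


def backoff_delays(attempts: int = 5, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Exponential delays base * 2**i, each capped at `cap` seconds."""
    return [min(base * 2 ** i, cap) for i in range(attempts)]


def publish_with_retry(
    publish: Callable[[], None],
    reconnect: Callable[[], None],
    attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Try `publish` up to `attempts` times, reconnecting between tries."""
    last_error: Optional[Exception] = None
    for delay in backoff_delays(attempts):
        try:
            publish()
            return
        except ConnectionError as exc:  # stand-in exception type (assumption)
            last_error = exc
            sleep(delay)
            # Discard the broken connection and open a fresh one.
            reconnect()
    assert last_error is not None
    raise last_error
```

With the defaults, `backoff_delays()` yields 1, 2, 4, 8 and 16 seconds; the 30 s cap only takes effect if the number of attempts is raised above five.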
Worker retries. Operations can raise
RetryLaterError("GPU busy", delay_seconds=60). The worker calculates
adaptive backoff based on how many previous retries have occurred:
delay = min(base * 2^retries, 600s). The job goes back to QUEUED
and is republished after the wait.
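The worker-side formula can be sketched as below. The `RetryLaterError` class here is a stand-in written for this example, not PROTEA's definition, and using the error's `delay_seconds` as the `base` of the formula is an assumption.

```python
class RetryLaterError(Exception):
    """Stand-in for the exception named in the ADR (hypothetical definition)."""

    def __init__(self, message: str, delay_seconds: float = 60.0) -> None:
        super().__init__(message)
        self.delay_seconds = delay_seconds


MAX_DELAY_SECONDS = 600.0  # cap from the ADR


def adaptive_delay(base: float, retries: int, cap: float = MAX_DELAY_SECONDS) -> float:
    """Return min(base * 2**retries, cap)."""
    return min(base * 2 ** retries, cap)


def delay_for(error: RetryLaterError, previous_retries: int) -> float:
    """Delay before republishing a job that raised `error`."""
    # Assumption: the error's delay_seconds is the `base` of the formula.
    return adaptive_delay(error.delay_seconds, previous_retries)
```

Under these assumptions, a job that keeps raising `RetryLaterError("GPU busy", delay_seconds=60)` would wait 60, 120, 240 and 480 seconds, then 600 seconds on every retry after that.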
Consequences
- The DLQ grows if nobody inspects it; it must be monitored (see runbook).
- Adaptive backoff makes one DB query per retry to count previous job.retry_later events. The cost is negligible.
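To illustrate why the per-retry query is cheap, here is a sketch of a single `COUNT(*)` lookup. The `job_event` table, its columns, and the use of `sqlite3` are hypothetical stand-ins; the ADR does not describe PROTEA's schema or database driver.

```python
import sqlite3


def count_retry_later_events(conn: sqlite3.Connection, job_id: int) -> int:
    """Count previous job.retry_later events for one job in a single query."""
    row = conn.execute(
        "SELECT COUNT(*) FROM job_event WHERE job_id = ? AND event = ?",
        (job_id, "job.retry_later"),
    ).fetchone()
    return row[0]


def demo() -> int:
    """Build a throwaway in-memory table and count retries for job 1."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, for illustration only.
    conn.execute("CREATE TABLE job_event (job_id INTEGER, event TEXT)")
    conn.executemany(
        "INSERT INTO job_event VALUES (?, ?)",
        [
            (1, "job.retry_later"),
            (1, "job.started"),
            (1, "job.retry_later"),
            (2, "job.retry_later"),
        ],
    )
    return count_retry_later_events(conn, 1)
```

The returned count would serve as the `retries` exponent in the backoff formula.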
Rejected alternatives
- TTL + delay queue in RabbitMQ: more complex to set up and debug than an application-level sleep().
- Celery retries: PROTEA does not use Celery; reimplementing its countdown over raw pika adds no value.