DLQ Triage
PROTEA declares one dead-letter exchange (protea.dlx) and one dead-letter
queue (protea.dead-letter). Every durable queue is wired to forward
undeliverable messages to this exchange via the x-dead-letter-exchange
argument. Messages land in the DLQ when:
- An OperationConsumer (protea/infrastructure/queue/consumer.py) nacks a message with requeue=False (invalid JSON, unregistered operation, or a non-OOM exception with the default requeue_on_failure=False option).
- CUDA OOM retries are exhausted (after up to 5 attempts by default; configurable via PROTEA_TUNING__QUEUE__OOM_MAX_RETRIES).
- A message body cannot be parsed as valid JSON.
The DLQ never drains automatically. It accumulates until an operator inspects and resolves the root cause.
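The routing rules above can be condensed into a small decision function. This is a minimal sketch for reasoning about which messages reach protea.dead-letter, not the code in consumer.py: the function name, the error_kind labels, and the exact comparison used for "retries exhausted" are assumptions made for illustration.

```python
from __future__ import annotations

import json

# Default taken from the text above (PROTEA_TUNING__QUEUE__OOM_MAX_RETRIES).
OOM_MAX_RETRIES = 5


def should_dead_letter(
    body: bytes,
    registered_operations: set[str],
    error_kind: str | None = None,   # hypothetical labels: None, "cuda_oom", or anything else
    oom_attempts: int = 0,
    requeue_on_failure: bool = False,
    oom_max_retries: int = OOM_MAX_RETRIES,
) -> bool:
    """Return True if the message would be nacked with requeue=False."""
    try:
        message = json.loads(body)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return True     # invalid JSON is never retried
    if not isinstance(message, dict):
        return True
    if message.get("operation") not in registered_operations:
        return True     # unregistered operation
    if error_kind is None:
        return False    # handler succeeded, message is acked
    if error_kind == "cuda_oom":
        # Assumption: the message is dead-lettered once the attempt count
        # reaches the configured maximum.
        return oom_attempts >= oom_max_retries
    # Any other exception: dead-letter unless requeue_on_failure is enabled.
    return not requeue_on_failure
```

For example, a well-formed compute_embeddings_batch message that hits a fourth CUDA OOM is still retried under these assumptions, while the same message failing with any other exception goes straight to the DLQ unless requeue_on_failure is set.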
Symptoms
- RabbitMQ management UI (http://localhost:15672, guest/guest) shows protea.dead-letter with a non-zero message count.
- A parent job is stuck in RUNNING with no forward progress and its events log contains child.failed or child.cuda_oom_dead_letter entries.
- GET /jobs/{id}/events returns "event": "child.cuda_oom_dead_letter" or "event": "child.failed" with no subsequent child.* activity.
- Worker logs on the relevant queue (e.g. logs/worker-embeddings-batch-1.log) show "Operation failed" or "CUDA OOM retries exhausted" lines.
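The event-log symptom can be checked programmatically. The sketch below assumes that GET /jobs/{id}/events returns a JSON list of objects in chronological order, each with an "event" key; only the two event names come from this page, and the function name is hypothetical.

```python
from __future__ import annotations

# Event names quoted in the symptoms above.
FAILURE_EVENTS = {"child.failed", "child.cuda_oom_dead_letter"}


def stalled_on_dead_letter(events: list[dict]) -> bool:
    """True if the most recent child.* event is a failure or dead-letter event.

    Assumes `events` is ordered oldest to newest.
    """
    child_events = [
        str(entry.get("event", ""))
        for entry in events
        if str(entry.get("event", "")).startswith("child.")
    ]
    # No child activity at all is not evidence of a dead-letter stall.
    return bool(child_events) and child_events[-1] in FAILURE_EVENTS
```

A job whose last child event is child.cuda_oom_dead_letter matches the symptom; a job that logged child.failed followed by further child.* activity does not.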
Diagnosis
1. Check DLQ depth via the RabbitMQ management API:

       curl -s -u guest:guest \
         http://localhost:15672/api/queues/%2F/protea.dead-letter \
         | python3 -m json.tool | grep '"messages"'

2. Inspect DLQ message content without consuming it (peek via the HTTP API; this does not remove the message):

       curl -s -u guest:guest \
         -X POST http://localhost:15672/api/queues/%2F/protea.dead-letter/get \
         -H "Content-Type: application/json" \
         -d '{"count":10,"ackmode":"ack_requeue_true","encoding":"auto","truncate":50000}' \
         | python3 -m json.tool

   The x-death header in each message shows the original queue name, the reason (rejected or expired), and the death count.

3. Correlate with the parent job. The message body contains "job_id": "<uuid>" (parent job) and "operation": "<name>". Use the job_id to retrieve the full event log:

       curl -s http://localhost:8000/jobs/<job-uuid>/events \
         | python3 -m json.tool

4. Determine the failure class from worker logs. For batch queues:

       # Embeddings batch worker
       bash scripts/manage.sh logs embeddings-batch
       # Predictions batch worker
       bash scripts/manage.sh logs predictions-batch

5. Check whether OOM retries were exhausted:

       grep "CUDA OOM retries exhausted\|dead-lettering" logs/worker-embeddings-batch-1.log
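When the DLQ holds more than a handful of messages, reading the raw JSON from the peek call gets tedious. The sketch below condenses the decoded response of the /get call into one row per message. It assumes the usual management API response shape (a list of objects with "payload" and "properties.headers") and that the first x-death entry is the most recent death; the function name is hypothetical.

```python
from __future__ import annotations

import json
from typing import Any


def summarize_dead_letters(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Extract origin queue, reason, death count, job_id and operation per message."""
    summaries = []
    for message in messages:
        headers = (message.get("properties") or {}).get("headers") or {}
        deaths = headers.get("x-death") or []
        latest = deaths[0] if deaths else {}
        try:
            body = json.loads(message.get("payload") or "")
        except ValueError:
            body = {}  # unparseable payload: leave job_id/operation empty
        if not isinstance(body, dict):
            body = {}
        summaries.append(
            {
                "queue": latest.get("queue"),
                "reason": latest.get("reason"),
                "count": latest.get("count"),
                "job_id": body.get("job_id"),
                "operation": body.get("operation"),
            }
        )
    return summaries
```

Save the peek output to a file, load it with json.load, and pass the list to summarize_dead_letters; rows with an empty job_id point to the malformed-message failure class.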
Fix
- Drain and discard dead-lettered messages (use only after you have identified the root cause and confirmed the messages are not recoverable):

      # Purge the entire DLQ
      curl -s -u guest:guest \
        -X DELETE http://localhost:15672/api/queues/%2F/protea.dead-letter/contents \
        -H "Content-Type: application/json" \
        -d '{}'
- Re-queue a message back to the original queue (if the root cause is fixed and the message is retryable). There is no built-in shovel in the dev stack; use the management UI:

  1. Open http://localhost:15672 > Queues > protea.dead-letter.
  2. Use the Get Messages form to peek and copy the message body.
  3. Use the Publish Message form on the target queue (e.g. protea.embeddings.batch) to re-publish.

  Alternatively, re-submit the parent job via the API, which will re-dispatch all batch messages:

      # Cancel the stuck parent job
      curl -s -X POST http://localhost:8000/jobs/<parent-uuid>/cancel
      # Re-submit the same operation with the same payload
      curl -s -X POST http://localhost:8000/jobs \
        -H "Content-Type: application/json" \
        -d '{"operation":"compute_embeddings","payload":{...}}'
- For CUDA OOM dead-letters: reduce batch size in the payload (batch_size field in compute_embeddings_batch / predict_go_terms_batch messages) or increase PROTEA_TUNING__QUEUE__OOM_MAX_RETRIES to allow more backoff cycles before giving up.
- For schema / parse errors ("Unparseable operation message" in logs): the message is malformed and cannot be recovered. Purge it from the DLQ and fix the publisher code that generated it.
- Mark the parent job FAILED once the dead-lettered children confirm the job cannot complete:

      UPDATE job
      SET status = 'failed',
          finished_at = now(),
          error_code = 'ChildDeadLettered',
          error_message = 'One or more child batch messages were dead-lettered'
      WHERE id = '<parent-uuid>'
        AND status = 'running';
- Verify the DLQ is drained:

      curl -s -u guest:guest \
        http://localhost:15672/api/queues/%2F/protea.dead-letter \
        | python3 -m json.tool | grep '"messages"'
      # Expect: "messages": 0
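The manual re-publish through the management UI can also be scripted against the management HTTP API's publish endpoint. The sketch below only builds the request; it is not part of PROTEA. It assumes the dead message was fetched with the peek call from Diagnosis, that the first x-death entry names the queue to return to, and that publishing through the default exchange (addressed as amq.default in the API) with the queue name as routing key reaches that queue. The function name and the persistent delivery mode are choices made for illustration.

```python
from __future__ import annotations

from urllib.parse import quote


def build_republish_request(dead_message: dict, vhost: str = "/") -> tuple[str, dict]:
    """Return (path, json_body) for a POST that re-publishes one peeked message."""
    headers = (dead_message.get("properties") or {}).get("headers") or {}
    deaths = headers.get("x-death") or []
    if not deaths:
        raise ValueError("message has no x-death header; target queue unknown")
    target_queue = deaths[0]["queue"]  # assumed: most recent death first

    # Default exchange routes by queue name; "/" must be encoded as %2F.
    path = f"/api/exchanges/{quote(vhost, safe='')}/amq.default/publish"
    body = {
        # Assumption: mark the copy persistent, since the queues are durable.
        "properties": {"delivery_mode": 2},
        "routing_key": target_queue,
        "payload": dead_message["payload"],
        "payload_encoding": dead_message.get("payload_encoding", "string"),
    }
    return path, body
```

POST the body as JSON to http://localhost:15672 plus the returned path, with the same guest/guest credentials. The re-published copy carries no x-death header, so its death count starts over, and the original stays in protea.dead-letter until it is purged.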
Prevention
- Set up a RabbitMQ alert on the protea.dead-letter queue depth. A count greater than zero warrants immediate investigation.
- Tune PROTEA_TUNING__QUEUE__OOM_MAX_RETRIES and OOM_BASE_DELAY before running large GPU inference batches to give the GPU time to recover between retries.
- For GPU OOM specifically, reduce embedding_batch_size / prediction_batch_size in the operation payload rather than increasing retries indefinitely.

Related source: protea/infrastructure/queue/consumer.py (_DLQ_NAME, _handle_cuda_oom, _handle_general_failure).
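Where no alerting stack is available, the queue-depth alert can be approximated with a small polling script. The sketch below reuses the dev-stack URL and guest/guest credentials from Diagnosis; the function names and the alert text are hypothetical, and delivery of the alert (mail, chat, pager) is left to the caller.

```python
from __future__ import annotations

import base64
import json
import urllib.request

# Same endpoint as the depth check in Diagnosis (dev-stack defaults).
DLQ_URL = "http://localhost:15672/api/queues/%2F/protea.dead-letter"


def fetch_queue_info(url: str = DLQ_URL, user: str = "guest", password: str = "guest") -> dict:
    """Fetch the queue description from the RabbitMQ management API."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def dlq_alert(queue_info: dict) -> str | None:
    """Return an alert message when the DLQ is non-empty, else None."""
    depth = int(queue_info.get("messages", 0))
    if depth > 0:
        return f"protea.dead-letter holds {depth} message(s); investigate before purging"
    return None
```

Run dlq_alert(fetch_queue_info()) from cron or a systemd timer and forward any non-None result to the on-call channel.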