ADR-D35: Canonical 8-PLM embedding config IDs and orphan classification
- Status: Accepted
- Date: 2026-05-18
Context
The LAFA KNN-8PLM ensemble (apps/lafa_knn_8plm/plm_encoders.py,
PLM_SPECS tuple) defines eight protein language models as the
production ensemble. Each PLM must have a corresponding
EmbeddingConfig row in the PROTEA database before embeddings can be
stored and retrieved.
On 2026-05-18 a bioinfo-quick dispatch agent sent eight
compute_embeddings jobs without first auditing the existing
EmbeddingConfig table. The table already contained 9 rows (11 after
the dispatch created 2 new ones), and most of the dispatched work was
redundant. The 8 jobs were immediately cancelled (all returned HTTP 200
cancelled; job IDs listed in the References section).
This ADR pins the canonical embedding_config_id for each of the 8
ensemble members, classifies the 3 non-canonical (orphan) rows, and
records the protocol divergences that require attention before any future
hydration dispatch.
Decision
Canonical 8-PLM embedding_config_id table (as of 2026-05-18):

| PLM key | HF / SDK checkpoint | embedding_config_id | embedding_count |
|---|---|---|---|
| esm2_150m | facebook/esm2_t30_150M_UR50D | 500a0c59 | 66 432 |
| esm2_650m | facebook/esm2_t33_650M_UR50D | c2e9dda3 | 527 422 |
| esm2_3b | facebook/esm2_t36_3B_UR50D | 55e43f1c | 551 918 |
| prot_t5 | Rostlab/prot_t5_xl_half_uniref50-enc | 084943c6 | 0 |
| prostt5 | Rostlab/ProstT5 | c0ae5b69 | 527 424 |
| ankh_base | ElnaggarLab/ankh-base | 08234f06 | 527 424 |
| ankh_large | ElnaggarLab/ankh-large | 238f79b1 | 527 424 |
| esmc_600m | esmc_600m (EvolutionaryScale SDK) | 2bf1e753 | 527 424 |

Match notes:
- esm2_150m (500a0c59): exact match. Newly created on 2026-05-18 by the dispatch agent. Has 66 432 embeddings (partial hydration completed before the cancel arrived). Needs a future targeted hydration to reach the full reference-set count (~527 k).
- esm2_650m (c2e9dda3): exact match. Full hydration present.
- esm2_3b (55e43f1c): exact match. Largest embedding_count (551 918, above the 527 k baseline due to the chunking-enabled config). Full hydration present.
- prot_t5 (084943c6): exact checkpoint match (half-precision encoder, Rostlab/prot_t5_xl_half_uniref50-enc). Newly created on 2026-05-18. embedding_count = 0. Hydration job was cancelled; needs a fresh dispatch to populate.
- prostt5 (c0ae5b69): exact match. Full hydration present.
- ankh_base (08234f06): exact match (layer_indices=[0], normalize=false). Full hydration present.
- ankh_large (238f79b1): exact match. Full hydration present.
- esmc_600m (2bf1e753): exact match. Full hydration present.
Protocol divergence flagged (prot_t5 vs orphan db4db5ed): The old
full-precision ProtT5 config (db4db5ed, checkpoint
Rostlab/prot_t5_xl_uniref50, no _half_) has 527 424 embeddings
and is classified as superseded (see orphan table below). The canonical
config 084943c6 uses the half-precision checkpoint and currently has
0 embeddings. A targeted compute_embeddings dispatch against
084943c6 is required before the 8-PLM ensemble can run end-to-end.
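
The hydration gaps described above can be derived mechanically from the recorded counts. The following is a minimal sketch, not PROTEA code: `CANONICAL_COUNTS`, `FULL_HYDRATION_MIN`, and `needs_hydration` are hypothetical names, and the 527 000 threshold is an assumption based on the "~527 k" reference-set figure quoted in this ADR.

```python
# Counts copied from this ADR (state as of 2026-05-18).
# Maps PLM key -> (embedding_config_id prefix, embedding_count).
CANONICAL_COUNTS = {
    "esm2_150m": ("500a0c59", 66_432),
    "esm2_650m": ("c2e9dda3", 527_422),
    "esm2_3b": ("55e43f1c", 551_918),
    "prot_t5": ("084943c6", 0),
    "prostt5": ("c0ae5b69", 527_424),
    "ankh_base": ("08234f06", 527_424),
    "ankh_large": ("238f79b1", 527_424),
    "esmc_600m": ("2bf1e753", 527_424),
}

# Assumed threshold: the ADR only states "~527 k" for the reference set,
# so any config at or above 527 000 rows is treated as fully hydrated.
FULL_HYDRATION_MIN = 527_000


def needs_hydration(counts, minimum=FULL_HYDRATION_MIN):
    """Return the subset of configs whose embedding_count is below `minimum`."""
    return {key: row for key, row in counts.items() if row[1] < minimum}


if __name__ == "__main__":
    for key, (config_id, count) in needs_hydration(CANONICAL_COUNTS).items():
        print(f"{key} ({config_id}): {count} embeddings, needs a hydration dispatch")
```

With the counts recorded in this ADR, the sketch flags only esm2_150m and prot_t5, matching the two configs that need a targeted dispatch.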
Orphan / non-canonical rows:

| embedding_config_id | model_name | embedding_count | classification |
|---|---|---|---|
| db4db5ed | Rostlab/prot_t5_xl_uniref50 (full precision) | 527 424 | superseded (full-precision ProtT5; canonical is half-precision 084943c6) |
| c2868c1a | ElnaggarLab/ankh-base (layers 0,1,2; normalize_residues=true) | 527 432 | experimental (multi-layer ablation config; not in 8-PLM protocol) |
| c85d1afe | esmc_300m | 527 855 | baseline (ESMC-300M single-PLM baseline; not in 8-PLM ensemble) |

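
The split between canonical and orphan rows amounts to a set-membership check against the pinned IDs. The sketch below assumes the EmbeddingConfig rows are available as a list of 8-character ID prefixes; `classify_rows` and both constants are hypothetical helpers for illustration, not part of the PROTEA schema.

```python
# Pinned canonical ID prefixes from the 8-PLM table in this ADR.
CANONICAL_IDS = {
    "500a0c59", "c2e9dda3", "55e43f1c", "084943c6",
    "c0ae5b69", "08234f06", "238f79b1", "2bf1e753",
}

# Classifications assigned to the known orphan rows in this ADR.
ORPHAN_CLASSIFICATION = {
    "db4db5ed": "superseded",
    "c2868c1a": "experimental",
    "c85d1afe": "baseline",
}


def classify_rows(row_ids):
    """Label each row ID as canonical, a known orphan class, or unclassified."""
    labels = {}
    for row_id in row_ids:
        if row_id in CANONICAL_IDS:
            labels[row_id] = "canonical"
        else:
            # Rows this ADR has not seen need a human decision.
            labels[row_id] = ORPHAN_CLASSIFICATION.get(row_id, "unclassified")
    return labels


if __name__ == "__main__":
    all_rows = sorted(CANONICAL_IDS | set(ORPHAN_CLASSIFICATION))
    for row_id, label in classify_rows(all_rows).items():
        print(row_id, label)
```

Matching on the pinned ID rather than on the checkpoint name matters here: ElnaggarLab/ankh-base appears in both a canonical row (08234f06) and an orphan row (c2868c1a), so a name-based lookup would be ambiguous.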
Recommended follow-up actions (deferred to a future user-led cleanup slice, not executed in this ADR):
- Label db4db5ed as superseded-prot-t5-full-precision in its description field. Do not delete; the 527 k embeddings are a valid reference for ablation comparisons.
- Label c2868c1a as experimental-ankh-base-multilayer in its description field. Do not delete; the multi-layer ablation is a distinct research cell.
- Label c85d1afe as baseline-esmc-300m in its description field. Do not delete; single-PLM baseline row is needed for the ESMC-300M vs ESMC-600M comparison in chapter 6.
- Dispatch a fresh compute_embeddings job targeting config 084943c6 (prot_t5 half-precision) to bring its embedding_count from 0 to the full reference-set size. This is a prerequisite for any 8-PLM ensemble inference run.
- Investigate why esm2_150m (500a0c59) has only 66 432 embeddings. A targeted top-up dispatch is needed to reach the full reference-set count before the config can be used in 8-PLM KNN queries.
Consequences
Positive
- Canonical IDs are pinned in version control; future dispatch agents must audit against this table before creating new configs.
- Orphan rows are classified and a label is recommended for each, eliminating ambiguity about which row to use for each PLM.
- The cancelled-job incident is documented; the cancellation freed the GPU immediately.
- Two newly created configs (500a0c59, 084943c6) are identified as the correct canonical rows, preventing future duplication.
Negative
- prot_t5 (084943c6) has 0 embeddings. A hydration dispatch is required before the full 8-PLM ensemble can be used in production.
- esm2_150m (500a0c59) has a partial hydration (66 432 rows), also requiring a top-up before full production use.
- Two dispatch jobs remain incomplete despite the cancellation; workers that had already started processing may have written partial rows before detecting the cancel signal.
Neutral
- The dispatch agent pattern (dispatch without auditing first) must be updated: the recommended change is to add a pre-dispatch audit step in the bioinfo-quick conductor flow.
- The 4 full-hydration canonical configs (prostt5, ankh_base, ankh_large, esmc_600m) plus esm2_650m and esm2_3b are production-ready today.
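
One possible shape for the pre-dispatch audit step is sketched below. It is an illustration under stated assumptions, not the bioinfo-quick implementation: `PINNED_CONFIG_IDS`, `DispatchAuditError`, and `audit_dispatch` are hypothetical names, and the existing rows are assumed to be available as a set of 8-character ID prefixes. The guard resolves each requested PLM key to its pinned ID and refuses to proceed, rather than creating a new config, when a key is unknown or its pinned row is absent.

```python
# Pinned PLM key -> embedding_config_id prefix, as recorded in this ADR.
PINNED_CONFIG_IDS = {
    "esm2_150m": "500a0c59",
    "esm2_650m": "c2e9dda3",
    "esm2_3b": "55e43f1c",
    "prot_t5": "084943c6",
    "prostt5": "c0ae5b69",
    "ankh_base": "08234f06",
    "ankh_large": "238f79b1",
    "esmc_600m": "2bf1e753",
}


class DispatchAuditError(ValueError):
    """Raised when a dispatch request cannot be mapped to pinned configs."""


def audit_dispatch(requested_keys, existing_ids):
    """Map requested PLM keys to pinned config IDs without creating any rows."""
    unknown = [key for key in requested_keys if key not in PINNED_CONFIG_IDS]
    if unknown:
        raise DispatchAuditError(f"not in the canonical 8-PLM table: {unknown}")

    absent = [
        key for key in requested_keys
        if PINNED_CONFIG_IDS[key] not in existing_ids
    ]
    if absent:
        # A pinned row missing from the database needs human review,
        # not silent re-creation under a new ID.
        raise DispatchAuditError(f"pinned config rows missing: {absent}")

    return {key: PINNED_CONFIG_IDS[key] for key in requested_keys}


if __name__ == "__main__":
    existing = set(PINNED_CONFIG_IDS.values()) | {"db4db5ed", "c2868c1a", "c85d1afe"}
    print(audit_dispatch(["prot_t5", "esm2_150m"], existing))
```

Raising instead of creating is a deliberate choice in this sketch: it keeps config creation a separate, explicit decision, which is the behaviour this ADR asks future dispatch agents to adopt.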
References
- apps/lafa_knn_8plm/plm_encoders.py (PLM_SPECS tuple, canonical 8-PLM checkpoint registry).
- Cancelled job IDs (2026-05-18):
  - esm2_150m: 05de343e-be81-4215-b79b-01dfb9cc394a
  - esm2_650m: af48df6c-6ff7-4a1f-b2f7-9bace15ea4e6
  - esm2_3b: 76e24a5b-5be6-489a-8610-bebb634e02e3
  - prot_t5 half: c6f5f812-91a2-4f94-9dd2-45db1f1408ad
  - prostt5: 387542b3-6579-4a1f-b700-1be7f2ae5b07
  - ankh_base: ce5ada6f-4fd6-4f60-9417-e5794189c951
  - ankh_large: 23cba402-f6c1-4ad1-a3be-5830cffc8e11
  - esmc_600m: 74da4ab9-ed14-4007-b855-323a9fdd9ffa
- Memory entry project_canonical_8plm_embedding_configs (canonical ID table + orphan list, full record with protocol notes).
- ADR D09 (009-cancellation-nack-before-dispatch.rst) for the cancel endpoint behaviour relied on in Task A.