ADR-D36: PLM axis explicit in dataset naming
- Status: Accepted
- Date: 2026-05-20
Context
The reranker benchmark dataset historically known as
bench-v1-K5-v226-lineage was built against ProstT5 embeddings
(embedding_config_id=c0ae5b69-d6dc-41cf-a711-1739d3d2e170), but the
PLM choice was not encoded in the dataset name. The same name was used
across publishable prose, lab YAMLs, scripts, ADR D34, and the PROTEA
Dataset registry row.
Now that the multi-PLM v226 sweep plan has landed
(memory project_multi_plm_v226_sweep_plan, 2026-05-19), each of the
eight canonical PLMs (ADR D35) needs its own per-K dataset. A
single name shared across PLMs is structurally ambiguous: it cannot
distinguish the existing ProstT5 build from the seven sibling builds
scheduled by FARM-EXP.13, and it silently breaks any axis-tuple
shortid lookup that takes eval_set_name as part of the canonical
payload.
The transversal catalog
(protea-reranker-lab/experiments/_catalog/transversal.yaml) made
the underlying problem visible: 96 cells carried the PLM-blind
eval_set: bench-v1-K5-v226-lineage value while their plm
axis ranged across eight PLMs, meaning the catalog implicitly
overloaded one dataset name to mean eight different things depending
on which other axis the reader was holding fixed.
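The overload described above can be checked mechanically. The following is a minimal sketch, assuming catalog cells are available as plain dicts with eval_set and plm fields; the real transversal.yaml schema is not shown in this ADR, so the field layout and the overloaded_eval_sets helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical catalog cells: the real transversal.yaml schema is not shown
# in this ADR, so these field names are illustrative only.
cells = [
    {"eval_set": "bench-v1-K5-v226-lineage", "plm": "prostt5"},
    {"eval_set": "bench-v1-K5-v226-lineage", "plm": "esm2_650m"},
    {"eval_set": "bench-v1-K5-v226-lineage-prostt5", "plm": "prostt5"},
]


def overloaded_eval_sets(cells):
    """Return {eval_set: sorted PLMs} for names shared by more than one PLM."""
    plms_by_name = defaultdict(set)
    for cell in cells:
        plms_by_name[cell["eval_set"]].add(cell["plm"])
    return {
        name: sorted(plms)
        for name, plms in plms_by_name.items()
        if len(plms) > 1
    }


overloaded = overloaded_eval_sets(cells)
print(overloaded)
```

A PLM-explicit name can only ever map to one PLM, so it never appears in the result; only the PLM-blind name does.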
Decision
Per-PLM dataset name template. The canonical name for a lineage reranker dataset is bench-v1-K{k}-v{val_band}-lineage-{plm_short}, where k is the KNN neighbour count, val_band is the evaluation snapshot band (226 for the v226 to v230 window), and plm_short is the canonical short key from ADR D35:

Canonical PLM short keys

plm_short     HF / SDK checkpoint
esm2_150m     facebook/esm2_t30_150M_UR50D
esm2_650m     facebook/esm2_t33_650M_UR50D
esm2_3b       facebook/esm2_t36_3B_UR50D
prot_t5       Rostlab/prot_t5_xl_half_uniref50-enc
prostt5       Rostlab/ProstT5
ankh_base     ElnaggarLab/ankh-base
ankh_large    ElnaggarLab/ankh-large
esmc_600m     esmc_600m (EvolutionaryScale SDK)

The esmc_300m baseline (orphan per ADR D35) is permitted as a ninth plm_short value for chapter-6 single-PLM comparisons.

The existing build is renamed to -prostt5. The pre-FARM-EXP.12 bench-v1-K5-v226-lineage row in the PROTEA Dataset registry refers to the ProstT5 build (c0ae5b69); it is renamed in place to bench-v1-K5-v226-lineage-prostt5 via the Alembic data-only migration f1a2b3c4d5e6_farm_exp_12_rename_lineage_dataset_to_prostt5.py (file path symbolic; see the migrations directory for the actual filename). The legacy name is recorded in the row’s meta JSONB under the alias_names key so any code path still resolving by the old name has a traceable fallback path (the column Dataset.meta already exists; no schema change is required).

Backward-compat handle. Lookups by the legacy name (bench-v1-K5-v226-lineage) continue to resolve through the alias_names field for one release cycle. A follow-up slice removes the alias once every caller has migrated to the canonical per-PLM name.

Lab-side rename. The lab repo (protea-reranker-lab) rewrites every prose / yaml / python reference to the legacy name to the canonical bench-v1-K5-v226-lineage-prostt5 form. The transversal catalog builder (scripts/build_study_specs.py) gains a _resolve_eval_set(family, plm, k) helper that expands the lineage family into one bench-v1-K{k}-v226-lineage-{plm_short} per PLM, so the catalog now carries 12 cells per PLM (96 cells total) rather than one PLM-blind block.

Pre-commit / CI linter. The lab repo ships scripts/lint_dataset_names.py which rejects any committed file containing bench-v1-K{k}-v{val_band}-lineage without a -{plm_short} or -mini suffix. The linter is wired into .pre-commit-config.yaml and a dedicated CI workflow (.github/workflows/dataset-name-lint.yml).

Champion records updated. champions.md is rewritten so every axis tuple that pointed at the legacy name now reads the -prostt5 form; the 9-cell selective-rerank champion row (study_v23, post-replication-artefact fix) keeps its measured numbers (selective_avg_cafaeval = 0.6215); only the dataset column text changes.

No data movement. Train / eval parquet contents and the manifest key_prefix are not rewritten in the artifact store as part of this slice; only the registry Dataset.name and lab prose change. The artifact key_prefix (e.g. datasets/bench-v1-K5-v226-lineage/) keeps its legacy form in the storage backend so prior pulls and pre-computed caches keep resolving. A future cleanup slice (post-FARM-EXP.13 sibling builds) may rename the storage prefix in lockstep.
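The name template and its per-PLM expansion can be sketched as follows. This is an illustration only, not the actual _resolve_eval_set helper in scripts/build_study_specs.py, whose body is not shown in this ADR; the function names lineage_dataset_name and expand_lineage_family are hypothetical, while the short keys are copied from the table above.

```python
# Short keys copied from the ADR's table of canonical PLM short keys.
CANONICAL_PLM_SHORT = (
    "esm2_150m", "esm2_650m", "esm2_3b", "prot_t5",
    "prostt5", "ankh_base", "ankh_large", "esmc_600m",
)
# esmc_300m is the permitted ninth value for single-PLM comparisons.
ALLOWED_PLM_SHORT = CANONICAL_PLM_SHORT + ("esmc_300m",)


def lineage_dataset_name(k: int, val_band: int, plm_short: str) -> str:
    """Format bench-v1-K{k}-v{val_band}-lineage-{plm_short} (hypothetical helper)."""
    if plm_short not in ALLOWED_PLM_SHORT:
        raise ValueError(f"unknown plm_short: {plm_short!r}")
    return f"bench-v1-K{k}-v{val_band}-lineage-{plm_short}"


def expand_lineage_family(k: int, val_band: int = 226) -> list[str]:
    """One dataset name per canonical PLM for a given K (hypothetical helper)."""
    return [lineage_dataset_name(k, val_band, p) for p in CANONICAL_PLM_SHORT]


names = expand_lineage_family(5)
print(names[4])
```

Rejecting unknown short keys at construction time keeps typos out of the registry; whether the real helper validates in this way is an assumption.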
Consequences
Positive
The PLM axis is now explicit at dataset-name level; multi-PLM sweeps (FARM-EXP.13) can name their eight sibling datasets without ambiguity.
Axis-tuple shortid lookups across repos stay consistent: an ExperimentRun.axis_tuple_shortid row produced before the rename is still recoverable via the alias_names field on the Dataset row.
A linter prevents the legacy untagged form from re-entering prose.
The transversal catalog goes from a PLM-blind 12-cell block to a PLM-explicit 96-cell block (12 per PLM), surfacing structural coverage the previous catalog had hidden.
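The linter rule from the Decision (reject the lineage name unless it carries a -{plm_short} or -mini suffix) can be expressed as one regular expression with a negative lookahead. The sketch below is not the content of scripts/lint_dataset_names.py, which this ADR does not reproduce; find_untagged and the pattern are hypothetical. It has no allowlist, so it would also flag the legacy storage key_prefix quoted in this ADR; how the real linter treats such cases is not stated here.

```python
import re

# Allowed suffixes: the eight canonical short keys, the esmc_300m baseline,
# and the -mini suffix named in the linter rule.
ALLOWED_SUFFIXES = (
    "esm2_150m", "esm2_650m", "esm2_3b", "prot_t5", "prostt5",
    "ankh_base", "ankh_large", "esmc_600m", "esmc_300m", "mini",
)

_SUFFIX = "|".join(re.escape(s) for s in ALLOWED_SUFFIXES)

# Match the lineage name only when it is NOT followed by an allowed suffix
# that ends at a non-identifier character (so "-prostt5x" is still flagged).
UNTAGGED = re.compile(
    r"bench-v1-K\d+-v\d+-lineage"
    rf"(?!-(?:{_SUFFIX})(?![A-Za-z0-9_]))"
)


def find_untagged(text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for every line with an untagged name."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if UNTAGGED.search(line):
            hits.append((lineno, line.strip()))
    return hits


sample = (
    "eval_set: bench-v1-K5-v226-lineage\n"
    "eval_set: bench-v1-K5-v226-lineage-prostt5\n"
    "eval_set: bench-v1-K5-v226-lineage-mini\n"
    "eval_set: bench-v1-K5-v226-lineage-prostt5x\n"
)
for lineno, line in find_untagged(sample):
    print(f"{lineno}: {line}")
```

A pre-commit hook built on this would exit non-zero whenever find_untagged returns any hits for a staged file.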
Negative
48 lab files plus 6 PROTEA files were rewritten in a single sweep; reviewers must read the diff for content drift (mostly search-and-replace, but the catalog regeneration changes 96 cell shortids because the eval-set name is part of the axis payload).
The Alembic migration is data-only against the live registry; rolling back the migration restores the legacy name but does not rewind the lab prose. The downgrade is a registry-only revert.
Any external consumer that pulled the dataset by the legacy name before the alias was wired must update its config.
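The rename, the alias fallback, and the registry-only downgrade can be illustrated with an in-memory stand-in for a single Dataset row. This is a sketch under stated assumptions, not the Alembic migration itself: the row is a plain dict, the upgrade, downgrade, and resolve functions are hypothetical, and the choice to drop the alias on downgrade is an assumption the ADR does not confirm.

```python
LEGACY = "bench-v1-K5-v226-lineage"
CANONICAL = "bench-v1-K5-v226-lineage-prostt5"

# Toy stand-in for one Dataset registry row; only name, key_prefix and meta
# are modelled here.
registry = [
    {
        "name": LEGACY,
        "key_prefix": "datasets/bench-v1-K5-v226-lineage/",
        "meta": {},
    }
]


def upgrade(rows):
    """Rename in place and record the legacy name under meta['alias_names']."""
    for row in rows:
        if row["name"] == LEGACY:
            row["name"] = CANONICAL
            aliases = row["meta"].setdefault("alias_names", [])
            if LEGACY not in aliases:
                aliases.append(LEGACY)


def downgrade(rows):
    """Registry-only revert: restore the legacy name (assumed to drop the alias)."""
    for row in rows:
        if row["name"] == CANONICAL:
            row["name"] = LEGACY
            aliases = row["meta"].get("alias_names", [])
            if LEGACY in aliases:
                aliases.remove(LEGACY)
            if not aliases:
                row["meta"].pop("alias_names", None)


def resolve(rows, name):
    """Look up by canonical name first, then fall back to alias_names."""
    for row in rows:
        if row["name"] == name:
            return row
    for row in rows:
        if name in row["meta"].get("alias_names", []):
            return row
    return None


upgrade(registry)
row = resolve(registry, LEGACY)
print(row["name"], row["key_prefix"])
```

Note that key_prefix is untouched by either direction, matching the decision that no data moves in this slice.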
Neutral
The artifact-store key_prefix keeps the legacy directory name (datasets/bench-v1-K5-v226-lineage/). Storage rename is scheduled as a follow-up so disk and S3 layouts stay stable while the rename propagates through downstream tools.
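The shortid change noted under Negative follows from any scheme that hashes the eval-set name as part of the axis payload. The actual shortid algorithm is not given in this ADR; the sketch below assumes a truncated SHA-256 over canonical JSON purely for illustration, and the axis names and axis_tuple_shortid function are hypothetical.

```python
import hashlib
import json


def axis_tuple_shortid(axes: dict, length: int = 8) -> str:
    """Hash a canonical JSON form of the axis payload (hypothetical scheme)."""
    # sort_keys makes the serialisation independent of dict insertion order.
    canonical = json.dumps(axes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


before = {"plm": "prostt5", "k": 5, "eval_set_name": "bench-v1-K5-v226-lineage"}
after = dict(before, eval_set_name="bench-v1-K5-v226-lineage-prostt5")

print(axis_tuple_shortid(before), axis_tuple_shortid(after))
```

Under such a scheme every cell whose eval-set name changes gets a new shortid even though no other axis moved, which is why the regenerated catalog needs a careful diff review.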
References
ADR D35 (canonical 8-PLM embedding config IDs).
ADR D34 (selective rerank resurrection; live PROTEA inference policy on the prostt5 build).
Memory entry project_canonical_8plm_embedding_configs (canonical PLM short keys).
Memory entry project_multi_plm_v226_sweep_plan (locked per-PLM dataset naming, 2026-05-19).
Slice FARM-EXP.12 (farm-platform loop) and the follow-up FARM-EXP.13 per-PLM dataset-family build.
Lab linter scripts/lint_dataset_names.py and CI workflow .github/workflows/dataset-name-lint.yml in protea-reranker-lab.