ADR-007: Contract-first integration with protea-reranker-lab
- Date: 2026-04-21
- Author: frapercan
- Status: Accepted
Context
Re-ranker development is iterative and research-shaped: we want to try
out new feature families, different boosting objectives
(binary vs LambdaRank), IA weighting schemes, and training protocols
(per-cell vs full, multi-snapshot vs single-snapshot) without reshaping
the production predict path every time. At the same time, a trained
re-ranker must eventually be usable from inside PROTEA’s
predict_go_terms batch worker.
Two natural but incorrect structures were considered and rejected:
- Single repository, shared runtime. Lab training code would live inside protea/core/ and be imported by predict-time workers. This forces every experimental LightGBM knob, callback, or eval harness into PROTEA’s dependency tree (optuna, seaborn, notebook code) and couples the schedule of research iteration to the schedule of production releases.
- Single repository, disjoint packages. Separate protea and protea_lab top-level packages under one repo. Still leaks lab imports into the production image (Python’s import system does not care about package boundaries at install time), and creates a single review/merge pipeline for two workflows with very different cadences.
Decision
PROTEA and protea-reranker-lab live in separate repositories
coupled only through a narrow contract:
- A file-format contract. The frozen dataset layout. PROTEA’s export_research_dataset operation writes exactly three files under an ArtifactStore key prefix:
  - train.parquet: all training shards concatenated with category and snapshot_pair columns;
  - eval.parquet: held-out evaluation shards;
  - manifest.json: schema version 2, producer version and git sha, snapshot pair list, schema_sha fingerprint.
- A code contract. The lab’s protea_reranker_lab.contracts module exposes two symbols PROTEA imports:
  - ManifestV1: Pydantic model for the manifest, used by protea.core.parquet_export at export time for best-effort validation (dev-only; silently skipped in production images that don’t install the lab);
  - compute_feature_schema_sha(feature_families: list[str]) -> str: deterministic 12-hex-char fingerprint used at predict time to verify the live feature set matches the booster’s expectations.
- An artefact contract. Lab training writes runs/<name>/{run.json, spec.yaml, model.txt}. PROTEA’s scripts/register_reranker.py parses those files, uploads the booster through the ArtifactStore, and inserts a RerankerModel row with provenance (producer_version / producer_git_sha / spec_yaml).
Strict feature-schema equality
feature_schema_sha is computed over the sorted list of feature
families active at training time. At predict time the batch worker
recomputes it from its own active flags and compares for strict
equality, not subset, not prefix.
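A minimal sketch of such a fingerprint is shown below. The real compute_feature_schema_sha lives in the lab and its hashing details are not stated here, so the choice of SHA-256, the newline separator, and the name sketch_feature_schema_sha are all assumptions; only the "sorted list in, 12 hex characters out" shape comes from the text above.

```python
import hashlib


def sketch_feature_schema_sha(feature_families: list[str]) -> str:
    """Hypothetical stand-in for compute_feature_schema_sha.

    Sorting first makes the fingerprint independent of the order in which
    families are listed; truncating to 12 hex chars matches the contract.
    """
    canonical = "\n".join(sorted(feature_families))  # assumed canonical form
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


trained = ["knn", "annotation_meta", "alignment_nw", "length"]
live_reordered = ["length", "knn", "alignment_nw", "annotation_meta"]
live_superset = trained + ["taxonomy_pair"]

sha_trained = sketch_feature_schema_sha(trained)
sha_reordered = sketch_feature_schema_sha(live_reordered)
sha_superset = sketch_feature_schema_sha(live_superset)

# Same families in a different order: identical fingerprint.
# One extra family: different fingerprint, so strict equality fails.
same_when_reordered = sha_trained == sha_reordered
differs_for_superset = sha_trained != sha_superset
```

Because the comparison is on the hash alone, a superset or prefix of the trained families can never pass by accident.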
Rejected alternative: subset compatibility. If the booster was trained
on {knn, annotation_meta, alignment_nw, length} and the live
pipeline has {knn, annotation_meta, alignment_nw, length,
taxonomy_pair}, a subset-match would feed the booster only its
trained columns. But LightGBM boosters are sensitive to the ordering
and distribution of the training feature matrix; a superset at
inference is still a drift in the implicit joint distribution the
model learned. Strict equality fails safe: the batch worker emits
reranker.schema_mismatch and falls back to KNN ordering, which is
always a legitimate baseline. A missed re-ranking hit is preferable to
silently miscalibrated scores.
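The fail-safe path can be sketched as follows. This is an illustration under assumptions, not the batch worker's code: order_candidates, rerank_fn and the local fingerprint helper are hypothetical, the candidates are assumed to arrive already in KNN order, and the event is emitted through the standard logging module because the worker's real event mechanism is not described here.

```python
import hashlib
import logging

logger = logging.getLogger("reranker_sketch")


def schema_sha(feature_families):
    """Hypothetical fingerprint: SHA-256 over the sorted families, 12 hex chars."""
    canonical = "\n".join(sorted(feature_families))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def order_candidates(knn_ordered, booster_sha, live_families, rerank_fn):
    """Re-rank only when the live schema strictly equals the trained one."""
    if schema_sha(live_families) != booster_sha:
        # Mismatch: emit the event and keep the KNN ordering untouched.
        logger.warning("reranker.schema_mismatch")
        return list(knn_ordered)
    return rerank_fn(knn_ordered)


trained_families = ["knn", "annotation_meta", "alignment_nw", "length"]
booster_sha = schema_sha(trained_families)
knn_ordered = ["GO:0000001", "GO:0000002", "GO:0000003"]  # made-up candidates


def reverse_for_demo(items):
    """Placeholder for a real booster-driven re-ranking."""
    return list(reversed(items))


# Live pipeline gained taxonomy_pair: strict equality fails, KNN order is kept.
fallback = order_candidates(
    knn_ordered, booster_sha, trained_families + ["taxonomy_pair"], reverse_for_demo
)
# Live pipeline matches the training families: the re-ranker is applied.
reranked = order_candidates(
    knn_ordered, booster_sha, trained_families, reverse_for_demo
)
```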
Consequences
- Two-repo friction. Changing a feature family requires coordinated commits in both repos plus a dataset re-export. Mitigated by keeping the contract surface tiny (three files + two symbols) and by making register_reranker.py reject runs whose schema_sha does not match any PROTEA-side snapshot of the same feature-family list.
- Production images ship without the lab. protea_reranker_lab is a dev-only dependency of PROTEA. The predict-time import is guarded; a missing lab install causes the batch worker to emit reranker.skipped with reason=contracts_unavailable and fall back to KNN ordering, without crashing. In production we ship the lab alongside PROTEA (single editable path dep) precisely so compute_feature_schema_sha is available.
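The guarded import can be sketched like this. The helper names, the events list used to record the emitted event, and the module_name parameter (added so the missing-install path can be exercised without uninstalling anything) are assumptions for illustration; only the module path, the symbol name, the event name and the reason come from the text above.

```python
import importlib


def guarded_contract_import(module_name="protea_reranker_lab.contracts"):
    """Return compute_feature_schema_sha, or None if the lab is not installed."""
    try:
        module = importlib.import_module(module_name)
        return module.compute_feature_schema_sha
    except (ImportError, AttributeError):
        return None


def prepare_rerank(knn_ordered, events, module_name="protea_reranker_lab.contracts"):
    """Hypothetical worker step: skip re-ranking when contracts are unavailable.

    Returns (ordering, compute_sha). When compute_sha is None the ordering is
    the final KNN fallback; otherwise the caller proceeds to the schema check.
    """
    compute_sha = guarded_contract_import(module_name)
    if compute_sha is None:
        events.append(("reranker.skipped", {"reason": "contracts_unavailable"}))
    return list(knn_ordered), compute_sha


# Exercise the missing-install path with a module name that cannot exist.
emitted = []
ordering, compute_sha = prepare_rerank(
    ["GO:0000001", "GO:0000002"],  # made-up candidates in KNN order
    emitted,
    module_name="definitely_not_installed_lab_pkg.contracts",
)
```

Catching ImportError around the import alone, rather than around the whole predict step, keeps genuine bugs elsewhere in the worker from being silently reported as a missing lab install.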
Rejected alternatives
- Dynamic feature-family negotiation. Letting the worker infer which columns the booster expects from booster.feature_name() and building them lazily. Too fragile: names in the booster do not carry family grouping, so every feature engineering change would need manual backwards-compat shims.
- Pickling ExperimentSpec objects across repos. Requires both sides to share Python class identity; defeats the point of decoupling and breaks on any lab refactor.
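The class-identity problem with pickling can be reproduced with the standard library alone. In this sketch the module name hypothetical_lab_specs and the one-field ExperimentSpec are stand-ins, not the lab's real layout. A pickle stores the class's import path rather than its definition, so the consumer must be able to import the same module under the same name.

```python
import pickle
import sys
import types

MODULE_NAME = "hypothetical_lab_specs"  # stand-in for a lab-side module


class ExperimentSpec:
    def __init__(self, objective):
        self.objective = objective


# Make the class look as if it were defined in the lab-side module.
ExperimentSpec.__module__ = MODULE_NAME
ExperimentSpec.__qualname__ = "ExperimentSpec"
lab_module = types.ModuleType(MODULE_NAME)
lab_module.ExperimentSpec = ExperimentSpec
sys.modules[MODULE_NAME] = lab_module

# Producer side: the payload embeds the module path, not the class body.
payload = pickle.dumps(ExperimentSpec("lambdarank"))
contains_module_path = MODULE_NAME.encode("ascii") in payload

# Consumer side: simulate a process where that module does not exist
# (or was renamed by a lab refactor).
del sys.modules[MODULE_NAME]
try:
    pickle.loads(payload)
    unpickle_failed = False
except ImportError:
    unpickle_failed = True
```

Plain-text artefacts such as run.json, spec.yaml and model.txt avoid this coupling because reading them requires no lab-side class to be importable.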