LightGBM Complexity
Note
Inference: O(C * depth) in total for a fixed-size ensemble, where C = q * k * T is the number of candidate rows.
For each (query, reference, GO-term) candidate triple, the LightGBM
booster traverses every tree of a fixed ensemble, each at most depth
levels deep. With q query proteins, k nearest neighbours, and T transferred
GO terms per (query, reference) pair, there are at most C = q * k * T rows
in the scoring DataFrame. Tree depth is a training hyperparameter
(typically 6); the feature vector has F columns (fixed at schema compile
time; currently on the order of 100). Export time is dominated by the KNN
and alignment steps, not by scoring: the LightGBM call is rarely more than
5% of total export wall clock.
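The row bound above can be checked with a few lines of arithmetic. This is a minimal sketch, not PROTEA code: the function names, the `n_trees` parameter, and the example values are assumptions chosen for illustration (400 trees is borrowed from the round count quoted under Training complexity, on the assumption of one tree per round).

```python
def candidate_rows(q: int, k: int, t: int) -> int:
    """Upper bound on scoring rows: C = q * k * T."""
    return q * k * t


def node_visits(q: int, k: int, t: int, n_trees: int, depth: int) -> int:
    """Upper bound on node comparisons for one scoring pass.

    Every row walks every tree from root to leaf, at most `depth` levels.
    """
    return candidate_rows(q, k, t) * n_trees * depth


# Illustrative numbers only (not measured PROTEA values).
rows = candidate_rows(q=1_000, k=5, t=20)
visits = node_visits(q=1_000, k=5, t=20, n_trees=400, depth=6)
print(rows)    # 100000
print(visits)  # 240000000
```

With the ensemble size and depth held fixed, the cost grows linearly in C, which is why the note treats them as constants.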
Architecture note
LightGBM training is not done inside PROTEA. The pipeline is:
1. export_research_dataset generates train.parquet + eval.parquet and uploads them to the artifact store.
2. protea-reranker-lab downloads the parquets, trains a booster, and writes model.txt.
3. POST /reranker-models/import or import-by-reference registers the booster in the PROTEA DB (RerankerModel row), linking it back to the source Dataset via dataset_id and pinning feature_schema_sha for schema-drift detection.
Inference calls predict from protea_method.reranker:
def load_reranker(
    """Fetch (once) and load a LightGBM booster by URI.
    The first call materialises the booster blob under
    ``cache_dir/<feature_schema_sha>_<uri_tag>.txt``; subsequent
    calls reuse the on-disk file *and* an in-process booster cache
    keyed by the URI.
    ``store`` is used only when the cached file does not exist;
    ``artifact_uri`` is expected to resolve to a store key but the
Schema-drift guard
The booster refuses to score a feature vector whose column layout
(captured by feature_schema_sha) differs from the layout it was
trained on. This makes the schema an explicit versioning contract between
PROTEA’s export pipeline and the lab’s training code.
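A guard of this kind can be sketched as a hash comparison over the ordered column names. The hashing scheme (SHA-256 over newline-joined names), the `SchemaDriftError` class, and the feature names below are assumptions for illustration; the source only states that the layout is captured by feature_schema_sha.

```python
import hashlib
from typing import Sequence


def schema_sha(columns: Sequence[str]) -> str:
    """Hash the ordered column names; column order is part of the layout."""
    joined = "\n".join(columns).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


class SchemaDriftError(ValueError):
    """Raised when scoring features do not match the training layout."""


def check_schema(columns: Sequence[str], trained_sha: str) -> None:
    """Raise before scoring if the layout differs from the trained one."""
    actual = schema_sha(columns)
    if actual != trained_sha:
        raise SchemaDriftError(
            f"feature schema {actual[:12]} != trained schema {trained_sha[:12]}"
        )


# Hypothetical feature names, for illustration only.
trained = schema_sha(["identity", "evalue", "knn_distance"])
check_schema(["identity", "evalue", "knn_distance"], trained)  # passes silently
```

Because the hash covers order as well as names, reordering two columns is detected just like adding or dropping one.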
Training complexity
Training happens in protea-reranker-lab (offline). The dominant cost is
the GBDT boosting loop: for m boosting rounds, n_train rows, and
F features, the complexity is O(m * n_train * F). For the v226 corpus a
single-PLM dataset has ~1M training rows; 400 rounds at 100 features runs in
under 10 minutes on CPU.
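Plugging the quoted figures into O(m * n_train * F) gives a sense of scale. This is back-of-the-envelope arithmetic only: the function name is hypothetical, constant factors are ignored, and the 600-second budget is simply the "under 10 minutes" bound restated.

```python
def training_ops(m: int, n_train: int, f: int) -> int:
    """Order-of-magnitude operation count for m rounds over n_train x f."""
    return m * n_train * f


ops = training_ops(m=400, n_train=1_000_000, f=100)
print(f"{ops:.1e}")        # 4.0e+10
# Finishing in under 600 s implies at least this many row-feature
# visits per second, ignoring constant factors.
print(f"{ops / 600:.1e}")  # 6.7e+07
```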
Cross-reference
Thesis Ch. 5.4 covers the feature importance analysis, the LM.3 stability result (9 aspect-stable features dominate), and the selective-deploy decision for NK + LK categories.