ADR-008: PK coverage fix in cafaeval fork¶

Date:: 2026-04-23
Author:: frapercan
Status:: Accepted

Context¶

Upstream cafaeval (pinned at claradepaolis/CAFA-evaluator-PK) reports coverage values greater than 1.0 for the Partial-Knowledge (PK) evaluation branch. Coverage is defined as the fraction of eligible proteins for which the predictor emits at least one scored term at a given threshold, which by definition is bounded in [0, 1]. Values of 1.3–1.9 are observed on every PROTEA benchmark run.

The root cause is an asymmetry between the numerator and the denominator of the coverage ratio in the PK kernel of cafaeval/evaluation.py:

Quantity	Definition in upstream code	Location
`metrics['n']` (numerator)	Number of proteins in `proteins_with_gt` (any GT annotation in TOI, pre-exclusion) that receive at least one valid prediction at threshold `τ`.	`compute_confusion_matrix_exclude_sparse`, line 304.
`ne` (denominator)	Number of proteins with at least one GT annotation in TOI that survives the per-protein `exclude_matrix` (i.e. novel annotations the predictor should recover).	`_count_proteins_in_toi` with `exclude_matrix != None`, line 627; materialised in `evaluate_prediction`, line 657.

Under the PK branch the exclusion mask is applied to prediction values (valid = (pred_sub != 0) & toi_mask & ~excluded_mask, line 266) and to the per-TP weighting, but not to the row count that becomes metrics['n']. Proteins whose TOI annotations are fully contained in t0 (no novel GT in t1) are therefore:

excluded from ne (correct),
but still counted in metrics['n'] whenever the predictor emits a non-excluded term for them (incorrect).

The observed coverage > 1 is the visible symptom. The silent secondary effect is that precision under normalization='cafa' uses metrics['n'] as its denominator (normalize(), line 569), so precision is under-divided, tightened by the same factor. On the 220→230 PROTEA benchmark this drags PK Fmax from its true value down by 30–40 %.

Decision¶

cafaeval-protea (fork commit cec8ccd) applies a one-line semantic fix inside compute_confusion_matrix_exclude_sparse:

# Restrict the row count to proteins that still have ≥1 GT annotation
# in TOI after the per-protein exclude mask. Without this, `n` counts
# proteins whose TOI annotations were all already known in t0, while
# the denominator `ne` drops them, producing coverage > 1.
eligible_rows = (
    (gt_sub != 0) & toi_mask[None, :] & (~excluded_mask)
).any(axis=1)

metrics[:, 0] = ((pred_at_tau > 0) & eligible_rows[:, None]).sum(axis=0)

The patch runs in numpy (bool mask broadcast + a sum), so it is bounded in cost by O(n_prot × n_terms), the same complexity as the sibling mask lines it mirrors (lines 266 and 273). No allocation changes, no new dependencies.

The NK/LK kernel path (compute_confusion_matrix_sparse) is not touched because it has no exclusion concept: proteins_with_gt already matches ne by construction, and coverage is bounded by [0, 1] without any extra masking.

Why this was not caught by the fork’s parity tests¶

The parity tests in tests/diff/test_oracle_parity.py compare the fork’s output against a frozen pickle of the upstream evaluator run on the same corpora. They enforce bit-/ULP-exact equality across every column (including n and cov), so any semantic correction in the PK branch will look like a regression.

To keep the parity gate honest, the fix is accompanied by two changes in the test suite:

tests/test_pk_coverage_bug.py: a positive regression gate with a synthetic three-protein PK scenario. One of the proteins has all its GT annotations pre-excluded. The test asserts that metrics['n'] ≤ ne and that TP/FP/FN/recall are unaffected. This test fails if the fix is ever reverted.
tests/diff/test_oracle_parity.py: a _maybe_xfail_pk() helper xfails the PK variants of the oracle parity with a documented reason. The NK/LK variants continue to enforce bit-exact parity with upstream. The xfail is intentional and load-bearing: it records the fact that the fork has deliberately diverged from upstream on PK semantics, not because of a numerical drift.

Consequences¶

The sections below document the measured impact and operational steps required after applying the fix.

Effect on the PROTEA 220→230 benchmark¶

After re-running the 15 PK evaluations under the patched fork:

Cell	Fmax (before)	Fmax (after)	Δ
PK BPO	0.130	0.198	+0.068
PK CCO	0.301	0.366	+0.065
PK MFO	0.210	0.291	+0.081
PK BPO coverage	1.94	0.97	−0.97
PK BPO precision	0.088	0.157	+0.069

NK and LK cells are unchanged within float noise. The thesis PK metrics reported hereafter reflect the corrected computation.

Operational implication¶

cafaeval-protea is installed in PROTEA via a file:// path dependency. After pulling a new fork commit, the venv must be force- reinstalled:
```
pip install --force-reinstall --no-deps /path/to/cafaeval-protea
```
because poetry install treats the local path as satisfied once the lockfile hash matches. A plain poetry install will silently leave the old module in site-packages/.
Every live worker-evaluations process holds the cafaeval module in memory. Reinstalling the package does not hot-patch running workers; they must be restarted to pick up the fix:
```
systemctl --user restart protea-worker-evaluations
```
EvaluationResult rows persisted before the fix carry the buggy cov / precision values and must be discarded (DELETE via /annotations/evaluation-sets/{id}/results/{rid}; the endpoint cascades to MinIO artifacts). The launcher re-fires new runs automatically for any prediction_set that loses its eval.

We should push this fix upstream¶

The bug exists verbatim in claradepaolis/CAFA-evaluator-PK and, as far as we can tell, has never been flagged in an issue. The fix is small, semantically justified, and strictly improves correctness for every downstream user of the PK branch. An upstream report with the minimal reproducer from tests/test_pk_coverage_bug.py is pending.