ADR-008: PK coverage fix in cafaeval fork
- Date: 2026-04-23
- Author: frapercan
- Status: Accepted
Context
Upstream cafaeval (pinned at claradepaolis/CAFA-evaluator-PK) reports
coverage values greater than 1.0 for the Partial-Knowledge (PK)
evaluation branch. Coverage is defined as the fraction of eligible proteins
for which the predictor emits at least one scored term at a given threshold,
which by definition is bounded in [0, 1]. Values of 1.3–1.9 are
observed on every PROTEA benchmark run.
The root cause is an asymmetry between the numerator and the denominator of
the coverage ratio in the PK kernel of cafaeval/evaluation.py:
| Quantity | Definition in upstream code | Location |
|---|---|---|
| | Number of proteins in | |
| | Number of proteins with at least one GT annotation in TOI that survives the per-protein | |
Under the PK branch the exclusion mask is applied to prediction values
(valid = (pred_sub != 0) & toi_mask & ~excluded_mask, line 266) and to
the per-TP weighting, but not to the row count that becomes
metrics['n']. Proteins whose TOI annotations are fully contained in
t0 (no novel GT in t1) are therefore:
- excluded from ne (correct),
- but still counted in metrics['n'] whenever the predictor emits a non-excluded term for them (incorrect).
The observed coverage > 1 is the visible symptom. The silent
secondary effect is that precision under
normalization='cafa' uses metrics['n'] as its denominator
(normalize(), line 569), so precision is divided by an inflated count
and deflated by the same factor. On the 220→230 PROTEA benchmark this
drags PK Fmax below its corrected value by roughly 18–34 %, depending
on the ontology (computed from the before/after Fmax values in the
benchmark table under Consequences).
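The arithmetic behind both symptoms can be sketched with invented counts. None of the numbers below come from the benchmark, and the sketch assumes that cafa-normalised precision is a sum of per-protein precisions divided by n, and that ineligible proteins contribute no true positives:

```python
# Hypothetical counts for a single threshold, chosen only to show the ratio.
ne = 100                 # eligible proteins (>= 1 non-excluded GT term in TOI)
n_eligible_pred = 97     # eligible proteins with >= 1 scored term
n_ineligible_pred = 97   # ineligible proteins that still receive a scored term

n_buggy = n_eligible_pred + n_ineligible_pred  # row count without the eligibility mask
n_fixed = n_eligible_pred                      # row count restricted to eligible proteins

cov_buggy = n_buggy / ne
cov_fixed = n_fixed / ne

# Assumption: the summed per-protein precision is identical in both cases,
# because ineligible proteins add only zero-precision rows.
precision_sum = 15.0
pr_buggy = precision_sum / n_buggy
pr_fixed = precision_sum / n_fixed

print(f"coverage : {cov_buggy:.2f} -> {cov_fixed:.2f}")
print(f"precision: {pr_buggy:.3f} -> {pr_fixed:.3f}")
```

Under these assumptions the inflated row count pushes coverage above 1 and shrinks precision by exactly the ratio n_buggy / n_fixed.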
Decision
cafaeval-protea (fork commit cec8ccd) applies a small semantic fix
(two statements) inside compute_confusion_matrix_exclude_sparse:
    # Restrict the row count to proteins that still have ≥1 GT annotation
    # in TOI after the per-protein exclude mask. Without this, `n` counts
    # proteins whose TOI annotations were all already known in t0, while
    # the denominator `ne` drops them, producing coverage > 1.
    eligible_rows = (
        (gt_sub != 0) & toi_mask[None, :] & (~excluded_mask)
    ).any(axis=1)
    metrics[:, 0] = ((pred_at_tau > 0) & eligible_rows[:, None]).sum(axis=0)
The patch runs in numpy (bool mask broadcast + a sum), so it is bounded
in cost by O(n_prot × n_terms), the same complexity as the sibling
mask lines it mirrors (lines 266 and 273). No allocation changes, no new
dependencies.
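A toy reproduction of the masking logic may help make the effect concrete. This is a sketch under stated assumptions, not the fork's actual code path: the array shapes, the way pred_at_tau is built, and the names n_upstream and n_patched are all invented for illustration. Only the eligible_rows expression is copied from the patch.

```python
import numpy as np

# Toy inputs: 3 proteins x 4 terms (assumed shapes, illustrative values).
gt_sub = np.array([[1, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 1, 0]])
toi_mask = np.array([True, True, True, False])
# Protein 1 has its only GT term already known in t0, so it is fully excluded.
excluded_mask = np.array([[True, False, False, False],
                          [True, False, False, False],
                          [False, False, False, False]])
pred_sub = np.array([[0.9, 0.8, 0.0, 0.0],
                     [0.0, 0.7, 0.0, 0.0],
                     [0.0, 0.0, 0.2, 0.6]])
taus = np.array([0.1, 0.5])

valid = (pred_sub != 0) & toi_mask & ~excluded_mask
# Assumed layout: pred_at_tau[i, k] = valid predicted terms of protein i at taus[k].
pred_at_tau = (
    (pred_sub[:, :, None] >= taus[None, None, :]) & valid[:, :, None]
).sum(axis=1)

# Expression taken from the patch.
eligible_rows = (
    (gt_sub != 0) & toi_mask[None, :] & (~excluded_mask)
).any(axis=1)
ne = int(eligible_rows.sum())

n_upstream = (pred_at_tau > 0).sum(axis=0)                           # no eligibility mask
n_patched = ((pred_at_tau > 0) & eligible_rows[:, None]).sum(axis=0)  # with the mask

print("coverage upstream:", n_upstream / ne)
print("coverage patched :", n_patched / ne)
```

In this toy case the unmasked count yields a coverage of 1.5 at the lower threshold, while the masked count stays at 1.0, because the fully excluded protein no longer contributes a row.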
The NK/LK kernel path (compute_confusion_matrix_sparse) is not
touched because it has no exclusion concept: proteins_with_gt already
matches ne by construction, and coverage is bounded by
[0, 1] without any extra masking.
Why this was not caught by the fork’s parity tests
The parity tests in tests/diff/test_oracle_parity.py compare the
fork’s output against a frozen pickle of the upstream evaluator run on
the same corpora. They enforce bit-/ULP-exact equality across every
column (including n and cov), so any semantic correction in
the PK branch will look like a regression.
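To illustrate why a semantic correction cannot pass a bit-/ULP-exact gate, here is a minimal sketch of a ULP-distance helper. It is not the comparison used in tests/diff/test_oracle_parity.py, whose implementation is not shown in this ADR; ulp_distance is a hypothetical name, and NaN handling is deliberately left out.

```python
import struct


def ulp_distance(a: float, b: float) -> int:
    """Count the representable doubles separating a and b (0 = same value)."""

    def ordered(x: float) -> int:
        # Reinterpret the IEEE-754 double as a signed 64-bit integer.
        (bits,) = struct.unpack("<q", struct.pack("<d", x))
        # Map sign-magnitude onto a monotonic integer line so that
        # negative floats sort below positive ones and -0.0 == +0.0.
        return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)

    return abs(ordered(a) - ordered(b))


# Rounding-level noise: adjacent doubles are exactly 1 ULP apart.
noise = ulp_distance(1.0, 1.0 + 2.0 ** -52)

# A semantic change (illustrative coverage values) is astronomically far apart.
semantic = ulp_distance(1.94, 0.97)

print(noise, semantic)
```

A gate that tolerates zero or a handful of ULPs will accept the first kind of difference and reject the second, which is why the corrected PK columns register as a regression against the frozen oracle.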
To keep the parity gate honest, the fix is accompanied by two changes in the test suite:
- tests/test_pk_coverage_bug.py: a positive regression gate with a synthetic three-protein PK scenario. One of the proteins has all its GT annotations pre-excluded. The test asserts that metrics['n'] ≤ ne and that TP/FP/FN/recall are unaffected. This test fails if the fix is ever reverted.
- tests/diff/test_oracle_parity.py: a _maybe_xfail_pk() helper xfails the PK variants of the oracle parity with a documented reason. The NK/LK variants continue to enforce bit-exact parity with upstream. The xfail is intentional and load-bearing: it records that the fork has deliberately diverged from upstream on PK semantics, rather than drifted numerically.
Consequences
The sections below document the measured impact and operational steps required after applying the fix.
Effect on the PROTEA 220→230 benchmark
After re-running the 15 PK evaluations under the patched fork:
| Cell | Before | After | Δ |
|---|---|---|---|
| PK BPO Fmax | 0.130 | 0.198 | +0.068 |
| PK CCO Fmax | 0.301 | 0.366 | +0.065 |
| PK MFO Fmax | 0.210 | 0.291 | +0.081 |
| PK BPO coverage | 1.94 | 0.97 | −0.97 |
| PK BPO precision | 0.088 | 0.157 | +0.069 |
NK and LK cells are unchanged within float noise. The thesis PK metrics reported hereafter reflect the corrected computation.
Operational implication
- cafaeval-protea is installed in PROTEA via a file:// path dependency. After pulling a new fork commit, the venv must be force-reinstalled, because poetry install treats the local path as satisfied once the lockfile hash matches. A plain poetry install will silently leave the old module in site-packages/. The reinstall command is:

      pip install --force-reinstall --no-deps /path/to/cafaeval-protea

- Every live worker-evaluations process holds the cafaeval module in memory. Reinstalling the package does not hot-patch running workers; they must be restarted to pick up the fix:

      systemctl --user restart protea-worker-evaluations

- EvaluationResult rows persisted before the fix carry the buggy cov/precision values and must be discarded (DELETE via /annotations/evaluation-sets/{id}/results/{rid}; the endpoint cascades to MinIO artifacts). The launcher re-fires new runs automatically for any prediction_set that loses its eval.
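After a reinstall it can be useful to confirm which copy of the module a fresh interpreter would import. The helper below is a small sketch using only the standard library; module_origin is a hypothetical name, not part of PROTEA or cafaeval, and it says nothing about what an already-running worker has loaded.

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a new interpreter would load for a top-level module."""
    spec = importlib.util.find_spec(name)
    # find_spec returns None when the top-level module cannot be located.
    return None if spec is None else spec.origin


# Demonstrated on a standard-library module; in the PROTEA venv one would
# call module_origin("cafaeval") and check that the path is the expected one.
print(module_origin("json"))
```

Because the lookup happens at call time in the current environment, it reflects the state of site-packages after the reinstall; workers still need the restart described above.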
We should push this fix upstream
The bug exists verbatim in claradepaolis/CAFA-evaluator-PK and, as
far as we can tell, has never been flagged in an issue. The fix is
small, semantically justified, and strictly improves correctness for
every downstream user of the PK branch. An upstream report with the
minimal reproducer from tests/test_pk_coverage_bug.py is pending.