Operational Insights and Lessons Learned

This appendix captures non-obvious gotchas that surfaced during the PROTEA F-EXP campaigns and operational history. Each entry is grounded in a real incident: an ADR, a commit record, an experiment run, or an operational failure that triggered a fix. The goal is to keep the lessons alive across handovers and to give future maintainers a head start on problems that were only obvious in retrospect.

The material is a companion to thesis chapter 6 (F-EXP evaluation) and appendix B (reproduction guide). Where numbers appear, they come from named experiment runs; cross-references to ADRs point to the decision record that closed each issue.

Feature-schema SHA drift across the platform-lab boundary

What it is. feature_schema_sha is a deterministic fingerprint over the sorted list of feature families active at training time. It is the load-bearing safety check that prevents inference from scoring with a booster trained against a different feature schema: the predict-time batch worker recomputes it from its own active flags and falls back to KNN ordering on a mismatch rather than producing miscalibrated scores.

The lesson. For one study iteration in 2026-05, two independent implementations of compute_schema_sha (one in PROTEA, one in protea-reranker-lab) used different normalisations of the same feature list. The booster import accepted the model but the batch worker rejected it at scoring time. The fallback to KNN was silent enough to mask the drift for a full sweep.
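The drift can be reproduced in miniature. The sketch below is not the protea-contracts implementation: the normalisation rules, the second "divergent" variant, the family names other than anc2vec_query, and the choose_ordering helper are all assumptions made for illustration. It only shows why two reasonable-looking fingerprint functions over the same feature list disagree, and how a strict-equality check then routes scoring to the KNN fallback.

```python
import hashlib


def compute_schema_sha(feature_families):
    """Fingerprint the active feature families, independent of input order."""
    # Assumed normalisation: strip, lowercase, de-duplicate, sort, join with newlines.
    normalised = sorted({name.strip().lower() for name in feature_families})
    return hashlib.sha256("\n".join(normalised).encode("utf-8")).hexdigest()


def divergent_schema_sha(feature_families):
    """Hypothetical second implementation: same idea, different separator."""
    return hashlib.sha256(",".join(sorted(feature_families)).encode("utf-8")).hexdigest()


def choose_ordering(model_sha, worker_sha):
    """Strict equality: use the booster only when both fingerprints match."""
    return "booster" if model_sha == worker_sha else "knn_fallback"


# Family names other than anc2vec_query are placeholders.
families = ["anc2vec_query", "knn_distance", "taxonomy"]

order_independent = compute_schema_sha(families) == compute_schema_sha(families[::-1])
drifted = choose_ordering(compute_schema_sha(families), divergent_schema_sha(families))
```

Both functions are deterministic and order-independent, yet `drifted` is `"knn_fallback"`: the only way to guarantee agreement is for both sides to call the same function.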

The fix moved the canonical implementation to protea-contracts and backfilled a parallel column (schema_sha_v2) so historical rows could be compared without disturbing the live schema_sha. Production inference reads schema_sha_v2. See ADR-D10: schema_sha_v2 parallel migration for the dual-write rollout and ADR-007: Contract-first integration with protea-reranker-lab for the broader contract-first rationale.

Prevention rule. Anything that both sides of the platform-lab boundary must agree on lives in protea-contracts, not duplicated in each repo. The strict-equality check was the right design; the only mistake was implementing it twice.

Replication artefact in the anc2vec_query feature family

Note

Earlier drafts of this section described the incident as “temporal label leakage”. That framing was wrong and has been corrected. The CAFA temporal partition itself (NK / LK / PK in CAFA Evaluation Protocol) is mathematically clean: PK simply records that the protein already had experimental annotations in some namespace at t0, which is a legitimate evaluation split, not a leak. The incident below was a feature-construction artefact in one feature family, not a flaw in the temporal protocol.

What it is. anc2vec_query_known_count counts the protein’s t0 annotations in the query namespace. As a stand-alone feature this is unproblematic. The issue arose from how the training table was assembled in the early export_research_dataset pipeline.

How it broke. The export materialised the training parquet by replicating each (protein, aspect) row across categories (NK, LK, PK) so the booster could see all three label streams in one pass. The replication step did not filter on whether the protein actually belonged to that category in that aspect: a protein that was genuinely NK in F appeared as a synthetic negative row in P and C, and a protein that was PK in P got a synthetic NK row in aspects where it had no terms. Because anc2vec_query_known_count is deterministic in the t0 annotation profile, its value silently became a perfect bucket identifier: low value → “this is a genuine NK row”, high value → “this is a synthetic negative replicated from another category”. The booster learned the bucket, not the biology.

The effect was dramatic and initially invisible. In the study_v9 leave-one-out ablation, dropping anc2vec_query cost 0.2565 Fmax on nk-bpo and 0.1524 on lk-cco, for a mean loss of 0.1449 across three representative cells. Those effects are three to four times larger than that of any legitimate feature family in the study. At the time, the size of the delta was misread as evidence that ancestry features were unusually informative.

How it was caught. During the study_v9 cafaeval re-validation phase, the ratio between lab Fmax and cafaeval Fmax for lk-cco reached 3.00x and for lk-mfo 2.57x. Those ratios were plausible given that cafaeval propagates predictions through the GO DAG and the lab evaluator does not, but the magnitude prompted a closer look at the training table. Inspecting the row distribution per (protein, aspect, category) exposed the cross-category replication.

The fix. PROTEA commit 223299c filters (protein, aspect) by category membership before the replication step: a row is emitted in category \(c\) only if there is at least one label-1 row for that (protein, aspect) in category \(c\). The synthetic-negative buckets disappear; anc2vec_query_known_count reverts to a normal feature with normal importance. After the rebuild, the cafaeval re-validation ratios returned to the 1.1-1.6x range expected from GO propagation alone, and per-category Fmax values aligned across evaluators.
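The membership rule can be sketched in a few lines. This is an illustration of the rule as stated, not the code in commit 223299c: the row layout (plain dicts with protein, aspect, category, term, and label keys) and the function name are assumptions.

```python
from collections import defaultdict


def filter_by_category_membership(rows):
    """Keep a row only if its (protein, aspect, category) has a label-1 row."""
    has_positive = defaultdict(bool)
    for row in rows:
        key = (row["protein"], row["aspect"], row["category"])
        has_positive[key] = has_positive[key] or row["label"] == 1
    return [
        row
        for row in rows
        if has_positive[(row["protein"], row["aspect"], row["category"])]
    ]


rows = [
    # Genuine NK rows for P1 in aspect F.
    {"protein": "P1", "aspect": "F", "category": "NK", "term": "GO:0000001", "label": 1},
    {"protein": "P1", "aspect": "F", "category": "NK", "term": "GO:0000002", "label": 0},
    # Synthetic negatives of the kind produced by unfiltered replication.
    {"protein": "P1", "aspect": "P", "category": "NK", "term": "GO:0000003", "label": 0},
    {"protein": "P1", "aspect": "F", "category": "PK", "term": "GO:0000001", "label": 0},
]

kept = filter_by_category_membership(rows)
```

Only the two genuine NK rows survive. The negative row in (P1, F, NK) is kept because that group has a positive, so the filter removes synthetic buckets without stripping legitimate negatives.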

Prevention. Any cross-category replication in dataset assembly must be gated by an explicit per-category membership filter. The filter_provenance block in the manifest records the rule. More generally, any feature derived from the t0 annotation profile must be constructed against an explicitly time-stamped snapshot, passed as a payload parameter rather than resolved at construction time against the live database view, so its values cannot accidentally encode metadata about the row that carries them.

cafaeval PK coverage bug

What it is. Upstream cafaeval (the CAFA evaluator fork at claradepaolis/CAFA-evaluator-PK) computes a coverage metric for the Partial-Knowledge (PK) evaluation branch. Coverage is the fraction of eligible proteins for which the predictor emits at least one scored term at a given threshold. By definition it is bounded in [0, 1].

How it broke. On every PROTEA benchmark run against the 220 to 230 temporal window, PK coverage values of 1.3 to 1.9 were observed. The upstream bug is an asymmetry between numerator and denominator inside compute_confusion_matrix_exclude_sparse: the denominator ne correctly excludes proteins whose TOI annotations were all pre-excluded (already known at t0), but the numerator metrics['n'] counted those same proteins whenever the predictor emitted any non-excluded term for them. The visible symptom was coverage greater than one. The silent secondary effect was that precision used the inflated metrics['n'] as its denominator, so precision was divided by too large a count and came out too low. On the 220 to 230 benchmark this depressed PK Fmax by roughly 18-34% relative to the corrected values (see the table below).

Effect on the 220 to 230 PK cells (from ADR-008: PK coverage fix in cafaeval fork)
Cell and metric	Before fix	After fix	Delta
PK BPO Fmax	0.130	0.198	+0.068
PK CCO Fmax	0.301	0.366	+0.065
PK MFO Fmax	0.210	0.291	+0.081
PK BPO coverage	1.94	0.97	-0.97

NK and LK cells were unchanged within float noise.

The fix. The cafaeval-protea fork (commit cec8ccd) applies a one-line semantic correction inside the PK confusion matrix kernel: restrict the numerator row count to proteins that still have at least one GT annotation in the TOI after the per-protein exclude mask. The NK/LK path was not touched. The fork’s parity test suite was updated to xfail only the PK variants (intentional divergence), while NK/LK variants continue to enforce bit-exact parity with upstream. See ADR-008: PK coverage fix in cafaeval fork for the full root-cause analysis and patch.
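A toy model makes the asymmetry and its correction easier to see. The sketch below is not the cafaeval kernel: it replaces the sparse-matrix logic with plain sets, and the function name, argument names, and data are assumptions. The denominator counts proteins that still have ground truth after the exclude mask; the flag switches the numerator between the buggy count and the restricted one.

```python
def pk_coverage(ground_truth, excluded, predicted, restrict_numerator):
    """Toy PK coverage: proteins with predictions / proteins with remaining GT."""

    def remaining(terms, protein):
        # Drop terms that were already known at t0 for this protein.
        return terms - excluded.get(protein, set())

    eligible = {p for p, terms in ground_truth.items() if remaining(terms, p)}
    with_predictions = {p for p, terms in predicted.items() if remaining(terms, p)}
    if restrict_numerator:
        # The correction: count only proteins that are also in the denominator.
        with_predictions &= eligible
    return len(with_predictions) / len(eligible)


ground_truth = {
    "P1": {"GO:0000001"},
    "P2": {"GO:0000002"},
    "P3": {"GO:0000003"},
    "P4": {"GO:0000004"},
}
# P3 and P4 have all of their TOI annotations excluded.
excluded = {"P3": {"GO:0000003"}, "P4": {"GO:0000004"}}
# The predictor still emits non-excluded terms for P3 and P4.
predicted = {
    "P1": {"GO:0000001"},
    "P2": {"GO:0000009"},
    "P3": {"GO:0000008"},
    "P4": {"GO:0000007"},
}

buggy = pk_coverage(ground_truth, excluded, predicted, restrict_numerator=False)
fixed = pk_coverage(ground_truth, excluded, predicted, restrict_numerator=True)
```

With two eligible proteins and four proteins carrying predictions, the unrestricted count gives a coverage of 2.0, while the restricted count gives 1.0. Any precision averaged over the unrestricted count is diluted by the same factor.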

Operational fallout. Three operational implications followed directly from this fix.

First, cafaeval-protea is installed as a file:// path dependency. After pulling a new fork commit, the venv must be force-reinstalled with pip install --force-reinstall --no-deps, because poetry install treats the local path as satisfied once the lockfile hash matches and will silently leave the old module in site-packages/.

Second, running workers hold the cafaeval module in memory. Reinstalling the package does not hot-patch running workers; the evaluation worker must be restarted to pick up the fix.

Third, every EvaluationResult row persisted before the fix carries the buggy cov and precision values and must be discarded. The delete endpoint cascades to MinIO artifacts and the launcher re-fires new evaluation runs automatically for any prediction set that loses its result.

Prevention. Coverage outside [0, 1] is a useful invariant assertion. Adding a post-hoc check on run_cafa_evaluation output that logs a warning for any out-of-range coverage value would have surfaced this earlier. The upstream bug exists verbatim in claradepaolis/CAFA-evaluator-PK and has been flagged for an upstream report with the minimal reproducer from the fork’s regression test.
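A check of this kind could look like the sketch below. The shape of the evaluation output is an assumption: each row is taken to be a dict with cell, tau, and cov keys, and the logger name is a placeholder. The actual run_cafa_evaluation return type may differ.

```python
import logging

logger = logging.getLogger("protea.evaluation")  # placeholder logger name


def check_coverage_invariant(results, tolerance=1e-9):
    """Return (cell, tau, cov) for every row whose coverage is outside [0, 1]."""
    violations = []
    for row in results:
        cov = row["cov"]
        if cov < -tolerance or cov > 1.0 + tolerance:
            violations.append((row["cell"], row["tau"], cov))
            logger.warning(
                "coverage out of range: cell=%s tau=%s cov=%.4f",
                row["cell"], row["tau"], cov,
            )
    return violations


# Illustrative rows; the coverage values echo the before/after figures for PK BPO.
results = [
    {"cell": "PK BPO (before fix)", "tau": 0.5, "cov": 1.94},
    {"cell": "PK BPO (after fix)", "tau": 0.5, "cov": 0.97},
]
violations = check_coverage_invariant(results)
```

Returning the violations as well as logging them lets a caller choose between a warning and a hard failure, for example in a regression test.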

Deploy infrastructure fragility

What it is. The public PROTEA endpoint is served via a static ngrok domain that tunnels to the Next.js frontend on port 3000. This arrangement has two single points of failure: the ngrok process can die silently, and the deploy slot worktree (~/Thesis2/worktrees/protea-deploy) can disappear while the deploy.sh script continues to assume it exists.

The ngrok tunnel death pattern. The ngrok process runs in the foreground under scripts/expose.sh. If the host goes to sleep, the network stack drops, or the process is killed by an OOM event, the public domain becomes unreachable while the local stack continues running normally. The symptom from the outside is a 502 Bad Gateway or a “Tunnel not found” page. The symptom from the inside is that curl -sf http://localhost:8000/jobs succeeds (HTTP 200) while curl -sf https://protea.ngrok.app fails.
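The inside/outside comparison can be automated with a small probe. The sketch below uses only the standard library and the two URLs named above; the function names and the three diagnosis labels are assumptions, not part of scripts/expose.sh.

```python
import urllib.error
import urllib.request

LOCAL_URL = "http://localhost:8000/jobs"
PUBLIC_URL = "https://protea.ngrok.app"


def is_reachable(url, timeout=5.0):
    """True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors such as 502, refused connections, and timeouts land here.
        return False


def diagnose(local_ok, public_ok):
    """Map the two probe results to a coarse diagnosis."""
    if local_ok and public_ok:
        return "healthy"
    if local_ok:
        return "tunnel_down"
    # Without a healthy local stack the tunnel state is not informative.
    return "local_stack_down"


def probe():
    """Run both probes; requires network access, so it is not called on import."""
    return diagnose(is_reachable(LOCAL_URL), is_reachable(PUBLIC_URL))
```

Keeping `diagnose` separate from the network calls makes the decision table testable without a running stack; `probe()` is what a supervisor or cron job would call.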

The deploy-keeper worktree-not-existing pattern. The deploy.sh script manages the production-mode stack inside the protea-deploy git worktree. On 2026-05-11, a supervisor session reached an agent that could not recreate the worktree because the parent branch had been fast-forwarded past the branch tip the worktree was pinned to. The script exited immediately with the message “deploy slot does not exist” on every invocation, leaving the demo endpoint dead until a manual git worktree add restored the slot.

The fix for each pattern. Tunnel deaths: restart expose.sh from the deploy slot, or from repositories/PROTEA if the slot is missing. If the script exits immediately, diagnose the local stack first. If ngrok authentication is missing, run ngrok config add-authtoken.

Worktree loss: recreate the worktree with a git worktree add against the current origin/develop tip, then run a full deploy cycle before re-opening the tunnel. See Ngrok Deploy Recovery for the exact command sequence.

Prevention. Run expose.sh under a process supervisor (a manage.sh-tracked background process or a systemd unit) rather than in a terminal tab that can be closed. Add a pre-flight check to deploy.sh that detects a missing worktree and either recreates it automatically or emits an actionable error message instead of exiting with a bare path error. As things stand, the deploy slot must be created manually once and is not recreated automatically if the directory is deleted.
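The pre-flight check could take roughly the following shape. deploy.sh is a shell script, so this Python version is only a sketch of the logic: the function name and message wording are assumptions, and the exact recovery command sequence is the one documented in Ngrok Deploy Recovery, not the hint printed here.

```python
import tempfile
from pathlib import Path


def preflight_deploy_slot(slot):
    """Return (ok, message) describing the state of the deploy worktree."""
    slot = Path(slot).expanduser()
    if not slot.is_dir():
        return False, (
            f"deploy slot missing at {slot}; recreate it with "
            f"'git worktree add {slot} origin/develop' and run a full deploy cycle"
        )
    # A linked worktree carries a .git file (not a directory) pointing at the main repo.
    if not (slot / ".git").exists():
        return False, f"{slot} exists but has no .git entry, so it is not a git worktree"
    return True, f"deploy slot present at {slot}"


# Demonstration against a throwaway directory instead of ~/Thesis2/worktrees/protea-deploy.
with tempfile.TemporaryDirectory() as tmp:
    demo_slot = Path(tmp) / "protea-deploy"
    missing_ok, missing_msg = preflight_deploy_slot(demo_slot)
    demo_slot.mkdir()
    (demo_slot / ".git").write_text("gitdir: /hypothetical/main/.git/worktrees/protea-deploy\n")
    present_ok, present_msg = preflight_deploy_slot(demo_slot)
```

The point of the message is that it names the missing path and the next action, which is what the bare error on 2026-05-11 did not do.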

Smell-budget enforcement and the Method Object reframe

What it is. PROTEA inherits from the PIS and FANTASIA codebases, where workers conflated database sessions, queue management, orchestration, and business logic into single classes. The smell-budget campaign (T2B.5) targeted the four largest method bodies in the new codebase for Method Object extraction.

Why the spec was stale. By the time PR #267 landed (2026-05-09), AST analysis revealed that three of the four listed execute methods had already been reduced to well below the 60 LOC budget by prior PRs (#162, #169, #170, #177). A plan entry that listed four Method Object classes as pending work had become misleading: most of the work had already shipped across eight partial merges that were discovered only during a shepherd scan.
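An AST audit of this kind needs only the standard library. The sketch below is an assumption about how such a scan might be written, not the script used for ADR-D31: it measures each function by the physical lines it spans (including blank lines, comments, and docstrings), which may differ from the LOC definition used in the audit.

```python
import ast


def oversized_functions(source, budget=60):
    """Return (name, line_count) for every function spanning more than `budget` lines."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno/end_lineno cover the def line through the last body line.
            length = node.end_lineno - node.lineno + 1
            if length > budget:
                findings.append((node.name, length))
    return sorted(findings)


sample = (
    "def short():\n"
    "    return 1\n"
    "\n"
    "def long_one():\n"
    "    a = 1\n"
    "    b = 2\n"
    "    return a + b\n"
)

# A tiny budget is used here only so the sample triggers a finding.
over_budget = oversized_functions(sample, budget=3)
```

Running the same function over each worker module with the default budget of 60 gives the list of methods that still need attention, which is the check a shepherd scan can run before a task is opened.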

What actually shipped. The remaining smell on predict_go_terms.py was the size of the parent class (1589 LOC), not any single method. The reframe extracted the largest cohesive sub-cluster of methods that share mutable state into _AspectSeparatedKnnRunner (347 LOC, 10 methods, all below budget), reducing the parent class from 1589 to 1283 LOC and closing T2B.5 without the redundant Method Object class that the original spec had prescribed for an already-budgeted entrypoint.

See ADR-D31: T2B.5 Method Object reframe for the full decision, the AST audit numbers, and the criterion for treating future oversized methods in the same files.

The lesson. Plan entries that list specific class or method names can become stale faster than the execution cadence. A shepherd scan before opening a task is cheaper than opening a PR against work that has already landed in a different form. The T2B.5 closure also validated that the acceptance criterion (no method over 60 LOC) is meaningful as a target, but the mechanism (Method Object vs. sub-cluster extraction) should be chosen based on the actual shape of the code at the time of the refactor, not on the shape it had when the spec was written.

Plan-store vs reality drift

What it is. The canonical plan store at agent-farm/plans/ records the status of every slice. When a slice lands via PR but the plan file is not updated in the same pass, the plan store and the repository diverge. Subsequent conductor sessions may dispatch executor agents to work that has already shipped.

How it happened. Multiple slices (T-OPS.12, T2B.1, T2B.2, and others) were marked as pending in the plan store after their corresponding PRs had merged to develop. The conductor had no mechanism to cross-check PR merge history against plan status. Executor agents opened worktrees, read the plan, and began implementing work that was already present in the codebase.

The downstream cost was low in each individual case (the agent discovered the existing code and exited cleanly), but the pattern accumulated: over several loops, a non-trivial fraction of conductor decisions were based on stale status data.

The fix. Two safeguards were added. A render.py script generates a human-readable status table from the plan files and exits non-zero if any slice marked done or in_review lacks a corresponding PR reference in its metadata. A lint guard in the CI workflow blocks plan files from being merged without an explicit pr: field once status is set to done. The conductor was also updated to check git log for recent merges before dispatching against a slice with no recent heartbeat.
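The consistency rule behind the first safeguard can be sketched as follows. The plan-file schema is not shown in this appendix, so the dict keys (id, status, pr), the function names, and the sample statuses are assumptions; only the rule itself (done or in_review requires a PR reference, non-zero exit otherwise) comes from the description above.

```python
def lint_plan(slices):
    """Return one error string per slice that claims progress without a PR reference."""
    errors = []
    for entry in slices:
        if entry.get("status") in {"done", "in_review"} and not entry.get("pr"):
            errors.append(f"{entry['id']}: status={entry['status']} but no pr reference")
    return errors


def exit_code_for(slices):
    """Mimic a CLI entry point: print errors and return a process exit code."""
    errors = lint_plan(slices)
    for error in errors:
        print(error)
    return 1 if errors else 0


# Illustrative plan entries; statuses and PR numbers are made up for the example.
plan = [
    {"id": "T2B.1", "status": "done", "pr": None},
    {"id": "T2B.2", "status": "pending", "pr": None},
    {"id": "T2B.5", "status": "done", "pr": 267},
]

plan_errors = lint_plan(plan)
plan_exit_code = exit_code_for(plan)
```

A pending slice without a PR is legitimate and passes; only a slice that claims to be done or in review without a reference fails the check.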

Prevention. Every slice closure must include a plan-store update in the same PR. The rule is: one PR closes one slice; the PR description contains the slice id; the plan file is updated as part of the PR diff. Slices that land across multiple PRs (like T2B.5) are closed in the final PR of the sequence, not in the first one. A post-merge hook that checks plan consistency takes the discipline requirement off the developer and puts it into the tooling.

Cross-repo plugin coordination

What it is. The PROTEA stack is split across eight repositories. Any change to protea-contracts (ABCs, payloads, schema) must propagate to every consumer repo before the consumer can be tested or released. Doing this in the wrong order causes integration failures that are hard to diagnose because the error surfaces in the consumer, not in the contract definition.

The MIL.1a incident. The MIL.1a delivery shipped new contract symbols in protea-contracts and new backend implementations in protea-backends in the same release window. The backends repo was pinned to the contracts feature branch rather than a released tag. During integration testing, a third consumer (protea-runners) that had not been updated picked up the old contracts tag and failed to import the new symbols. The failure appeared as an import error in the runner worker, not in the contracts or backends repos.

The protocol. Merge contracts first, then tag and release a semver patch or minor version. Consumer repos update their dependency pin to the new tag in a separate PR, with CI running against the released tag rather than a branch ref. A consumer that has not updated its contracts pin will fail loudly at import time (the Operation protocol version check raises on mismatch) rather than running with a silently wrong schema. This fail-loud design was intentional (see ADR-007: Contract-first integration with protea-reranker-lab). The batch worker is the one place where the failure is softened: it emits reranker.contracts_unavailable and falls back to KNN ordering rather than crashing, but it logs the fallback reason at warning level.

Prevention. The release checklist for any change that touches protea-contracts is: (1) merge and tag contracts, (2) open consumer PRs that bump the pin, (3) merge consumer PRs in dependency order (platform before plugins), (4) verify the integration test suite against the new tags. Skipping step 1 and opening consumer PRs against a branch ref is the failure mode to avoid.
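Step (2) of the checklist lends itself to a mechanical check. The sketch below flags consumers whose contracts pin is not a released tag. The tag format (an optional v prefix followed by MAJOR.MINOR.PATCH), the function names, and the example refs are assumptions; the real pins live in each consumer's dependency file.

```python
import re

# Assumed tag convention: optional "v" followed by MAJOR.MINOR.PATCH.
SEMVER_TAG = re.compile(r"^v?\d+\.\d+\.\d+$")


def is_released_tag(ref):
    """True if the ref looks like a released semver tag rather than a branch."""
    return bool(SEMVER_TAG.match(ref))


def consumers_on_branch_refs(pins):
    """Return the consumer repos whose protea-contracts pin is not a released tag."""
    return sorted(repo for repo, ref in pins.items() if not is_released_tag(ref))


# Hypothetical pins modelled on the MIL.1a situation.
pins = {
    "protea-backends": "feature/mil-1a",
    "protea-runners": "v0.4.1",
}

offenders = consumers_on_branch_refs(pins)
```

Here `offenders` is `["protea-backends"]`. Run in consumer CI, a check like this turns the branch-ref failure mode into a red build before integration testing starts.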

Reproducibility as a methodological contribution

Context. Protein function prediction has a reproducibility problem that is rarely acknowledged. Most CAFA-style methods publish a single Fmax number computed against a benchmark dataset, but the accompanying code does not specify which GO annotation release served as t0 for feature construction, which proteins were in the query set, or whether the GO DAG propagation was applied before or after the temporal split. Two groups running nominally the same method against the same CAFA benchmark can produce different numbers for legitimate reasons that are invisible in the paper.

What the PROTEA campaign discovered empirically. During the F-EXP campaign, running the same trained booster against two slightly different annotation snapshots (differing only in which GOA release was used to resolve the training pairs) produced Fmax differences of 0.03 to 0.05 across NK cells. These differences are not noise: they are deterministic given the snapshot pair, reproducible to four decimal places. But they would look like unexplained variance to a researcher who had not pinned the snapshot version in their run record.

The anc2vec_query artefact (see above) made this concrete. The cross-category replication in the early training-table assembly turned anc2vec_query_known_count into a bucket identifier and inflated its apparent importance. The corrected pipeline filters (protein, aspect) by category membership before replication, so each feature carries only the information its definition implies and not metadata about how its row was constructed.

The contribution. PROTEA’s experiment infrastructure was designed to make the full snapshot pair list, the annotation source, the embedding config id, the schema_sha, and the producer git sha part of every Dataset row. Every EvaluationResult row is linked back through PredictionSet to the specific prediction run and its parameters. The thesis (chapter 6) reports numbers that can be reproduced from the commands in appendix B because the snapshot versions are part of the record, not implicit.
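As a rough picture of what such a record carries, the provenance fields listed above can be grouped into an immutable value object. This is not the PROTEA Dataset model: the class name, field types, and every value below are placeholders.

```python
from dataclasses import FrozenInstanceError, asdict, dataclass


@dataclass(frozen=True)
class DatasetProvenance:
    """Provenance fields named in the text; types are assumptions."""

    snapshot_pairs: tuple          # ((t0_release, t1_release), ...)
    annotation_source: str
    embedding_config_id: str
    schema_sha: str
    producer_git_sha: str


record = DatasetProvenance(
    snapshot_pairs=((200, 210), (210, 220)),   # placeholder release numbers
    annotation_source="goa",                   # placeholder
    embedding_config_id="embedding-config-1",  # placeholder
    schema_sha="0" * 64,                       # placeholder fingerprint
    producer_git_sha="abc1234",                # placeholder commit
)

as_row = asdict(record)

# Frozen instances reject mutation, so a persisted record cannot be edited in place.
try:
    record.schema_sha = "changed"
    is_immutable = False
except FrozenInstanceError:
    is_immutable = True
```

Making the record frozen reflects the intent that provenance is written once with the dataset and never patched afterwards.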

This is not a novel insight in the machine learning literature, but it is novel in the protein function prediction literature, where the dominant publication format does not include this level of provenance. The PROTEA campaign’s methodological angle is that temporal holdout correctness is not a detail: it is the thing being measured, and it requires explicit infrastructure to get right.

Prevention for future campaigns. Any evaluation that involves a temporal split must specify: (1) the annotation release used as t0, (2) the annotation release used as t1, (3) whether GO lineage propagation was applied before or after splitting, (4) which proteins were in the query set, and (5) whether the same protein could appear in both training and evaluation sets (a common source of leakage in methods that aggregate across multiple snapshot pairs). The EvaluationSet row in PROTEA captures (1), (2), (3), and (4). The training dataset manifest captures the snapshot pair list. Cross-checking that the eval snapshot pair does not overlap with any training pair is enforced by the export_research_dataset payload validator. See Reproduction guide for the full ordered procedure.
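The overlap cross-check can be sketched under one reading of "overlap": two snapshot pairs clash when their release intervals intersect, with pairs that merely share an endpoint treated as disjoint. That reading, the function names, and the training pairs below are assumptions; the actual export_research_dataset validator may apply a different rule.

```python
def intervals_overlap(pair_a, pair_b):
    """True if two (t0, t1) release intervals intersect beyond a shared endpoint."""
    return pair_a[0] < pair_b[1] and pair_b[0] < pair_a[1]


def validate_eval_pair(eval_pair, training_pairs):
    """Raise ValueError if the evaluation pair overlaps any training pair."""
    clashes = [pair for pair in training_pairs if intervals_overlap(eval_pair, pair)]
    if clashes:
        raise ValueError(f"eval pair {eval_pair} overlaps training pairs {clashes}")
    return True


eval_pair = (220, 230)  # the benchmark window used in this appendix

# Hypothetical training pairs that end where the evaluation window begins.
clean = validate_eval_pair(eval_pair, [(200, 210), (210, 220)])

# A hypothetical training pair that reaches into the evaluation window.
try:
    validate_eval_pair(eval_pair, [(200, 210), (215, 225)])
    leak_detected = False
except ValueError:
    leak_detected = True
```

Raising rather than warning matches the fail-loud stance taken elsewhere in the stack: a leaking pair should stop the export, not annotate it.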