CAFA Evaluation Protocol¶
PROTEA implements the evaluation protocol used in the CAFA5 (Critical Assessment of protein Function Annotation) challenge. This page explains the protocol, the NK/LK/PK classification, and how to run an evaluation end-to-end within PROTEA.
Background: the CAFA temporal holdout¶
CAFA evaluates protein function prediction by exploiting the growth of experimental GO annotations over time:
t0 — an older annotation snapshot (the reference set). Methods may use these annotations as training signal.
t1 — a newer annotation snapshot (the ground truth). Proteins that gained new experimental GO annotations between t0 and t1 form the test set (the delta).
Only annotations with experimental evidence codes are considered
(EXP, IDA, IMP, IGI, IEP, IPI, and their ECO equivalents). Annotations with a
NOT qualifier — meaning the protein is not associated with that term — are
excluded, and their exclusion is propagated to all GO descendants through the
is_a and part_of relationships.
NK / LK / PK classification¶
A key feature of CAFA5 is that test proteins are not treated uniformly. Classification is determined per (protein, namespace), where namespace is one of Molecular Function (MFO), Biological Process (BPO), or Cellular Component (CCO).
- NK — No-Knowledge
The protein had no experimental annotations in any namespace at t0. All its new annotations across all namespaces form the NK ground truth. Evaluating NK targets tests a method’s ability to make predictions from sequence alone, without any prior functional signal.
- LK — Limited-Knowledge
The protein had experimental annotations in some namespaces at t0, but not in namespace S. It gained new annotations in S at t1. Those new annotations in S are the LK ground truth for that (protein, S) pair. Evaluating LK tests transfer across namespaces.
- PK — Partial-Knowledge
The protein already had experimental annotations in namespace S at t0, and gained additional annotations in S at t1. Only the novel terms are ground truth; the old terms are collected in a
pk_known_terms.tsvfile and passed tocafaevalwith the-knownflag, which excludes them from scoring. This prevents credit for simply repeating prior annotations.
Important
A single protein can be LK in one namespace and PK in another simultaneously. For example, a protein with MFO and BPO annotations at t0 that gains new CCO and BPO annotations at t1 will be LK for CCO and PK for BPO.
Toy example¶
Protein P1 at t0: MFO={GO:0003674} BPO={} CCO={}
Protein P1 at t1: MFO={GO:0003674} BPO={GO:0008150} CCO={GO:0005575}
had_anything_old = True (had MFO)
Namespace BPO: old_BPO={} → LK (empty at t0, gained GO:0008150)
Namespace CCO: old_CCO={} → LK (empty at t0, gained GO:0005575)
Namespace MFO: no new terms → not in test set for this namespace
Protein P2 at t0: BPO={GO:0006355} (all others empty)
Protein P2 at t1: BPO={GO:0006355, GO:0045893}
Namespace BPO: old_BPO={GO:0006355} delta={GO:0045893}
→ PK ground truth = {GO:0045893}
→ pk_known = {GO:0006355} (passed as -known)
Protein P3 at t0: (no annotations in any namespace)
Protein P3 at t1: MFO={GO:0003674} BPO={GO:0008150}
had_anything_old = False → NK
NK ground truth = {GO:0003674, GO:0008150} (all new terms)
Evaluation flow in PROTEA¶
1. Load two GOA annotation sets (old = t0, new = t1).
2. POST /annotations/evaluation-sets/generate
→ queues generate_evaluation_set job
→ computes delta and creates EvaluationSet row with stats
3. Download delta-proteins.fasta (all NK+LK+PK sequences).
4. POST /jobs (compute_embeddings, query_set_id=...)
→ compute ESM-2 embeddings for delta proteins
5. POST /embeddings/predict (predict_go_terms, query_set_id=...)
→ run KNN GO transfer; creates PredictionSet
6. POST /annotations/evaluation-sets/{id}/run
→ queues run_cafa_evaluation job
→ runs cafaeval for NK, LK, PK; creates EvaluationResult
7. View results in the Evaluation UI or download artifacts.zip.
The cafaeval command equivalent (for manual inspection):
python -m cafaeval go-basic.obo predictions/ ground_truth_NK.tsv -out_dir results/NK
python -m cafaeval go-basic.obo predictions/ ground_truth_LK.tsv -out_dir results/LK
python -m cafaeval go-basic.obo predictions/ ground_truth_PK.tsv \
-known pk_known_terms.tsv -out_dir results/PK
Data model¶
EvaluationSetStores the (old_annotation_set_id, new_annotation_set_id) pair and a JSONB
statsdict with delta/NK/LK/PK protein and annotation counts. Created bygenerate_evaluation_set.EvaluationResultStores per-setting (NK/LK/PK) and per-namespace (MFO/BPO/CCO) metrics: Fmax, precision, recall, τ (threshold), and coverage. Created by
run_cafa_evaluation. MultipleEvaluationResultrows can exist perEvaluationSet, one per (prediction_set, run).
See Infrastructure for the full ORM schema.
Implementation reference¶
Core logic:
protea.core.evaluation—EvaluationData,compute_evaluation_dataOperations:
protea.core.operations.generate_evaluation_set,protea.core.operations.run_cafa_evaluationAPI router:
protea/api/routers/annotations.py(download endpoints, generate and run routes)