Related work
PROTEA sits at the intersection of three lines of research that have so far evolved largely in isolation: (i) workflow and pipeline engineering for large-scale bioinformatics, (ii) automated Gene Ontology term prediction from sequence, and (iii) protein language models as general-purpose feature extractors. This chapter situates the system against each of these bodies of work and articulates the specific gap that motivates the platform.
Workflow systems for bioinformatics
General-purpose workflow engines such as Galaxy [Afgan et al., 2018], Snakemake [Köster and Rahmann, 2012], and Nextflow [Di Tommaso et al., 2017] provide reproducible pipeline execution, DAG-style task scheduling, and containerised tool encapsulation. They target the “compose existing tools” workflow pattern: each step wraps a CLI executable, inputs and outputs are files on disk, and provenance is captured by pinning container digests.
This pattern is poorly suited to PROTEA’s domain for three reasons. First, UniProt ingestion and GO annotation are not one-shot file conversions: they are stateful, long-running, partial-progress-tolerant processes that must survive broker disconnects and resume without re-downloading. Second, embedding computation and KNN prediction operate on shared in-memory state (the reference cache) that cannot be serialised to files between steps without prohibitive I/O overhead. Third, the consumers of the output are interactive users through an HTTP API, not batch scripts; they need sub-second status queries, structured event timelines, and the ability to cancel in-flight jobs.
PROTEA is therefore architected as an application with an internal job queue, not as a pipeline expressed in a DSL. The design draws on the job queue pattern familiar from web applications (Celery, Sidekiq, RQ), adapted to the scientific computing setting through a strict separation between domain operations and infrastructure (see System Overview).
The PIS and FANTASIA precursors
The Protein Information System (PIS) and FANTASIA codebases were developed at CBBIO as end-to-end systems for protein data management and functional annotation transfer. PIS established the ingestion side: a PostgreSQL schema, a RabbitMQ-backed job queue, and Python workers that paginate the UniProt REST API. FANTASIA extended this with GPU embedding computation and KNN-based annotation transfer using ProtT5 and ESM models.
Both systems proved that the pipeline was tractable at UniProtKB/Swiss-Prot
scale (500 000+ reviewed proteins), but share a structural weakness: each
worker conflates a database session, the AMQP channel, orchestration logic,
and domain code in a single class. The consequences are well known from
enterprise software [Fowler, 2002]: unit-testable pieces are hard
to isolate, new operations inherit boilerplate from an unrelated base class,
and partial failures (a broker reconnect mid-job) leave jobs in ambiguous
states because state transitions and business logic share the same session.
PROTEA is an explicit response: it keeps the data model, the queue topology,
and the empirical lessons from PIS/FANTASIA, but rebuilds the execution layer
around an Operation protocol that is pure domain logic and testable with
a mocked session.
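A minimal sketch of what "pure domain logic, testable with a mocked session" can look like in Python is given below. The names `Operation`, `CountProteins`, `run_with_mock`, the payload shape, and the query string are all illustrative assumptions; they are not taken from PROTEA's codebase.

```python
from typing import Any, Protocol
from unittest.mock import MagicMock


class Operation(Protocol):
    """Hypothetical protocol: domain logic only, no broker or job-state code."""

    name: str

    def execute(self, session: Any, payload: dict[str, Any]) -> dict[str, Any]:
        ...


class CountProteins:
    """Illustrative operation; the query string is a placeholder."""

    name = "count_proteins"

    def execute(self, session: Any, payload: dict[str, Any]) -> dict[str, Any]:
        # The operation only talks to the session it is handed.
        rows = session.execute("SELECT accession FROM protein").fetchall()
        limit = payload.get("limit")
        if limit is not None:
            rows = rows[:limit]
        return {"count": len(rows)}


def run_with_mock() -> dict[str, Any]:
    # No database or AMQP channel is needed: the session is a stand-in object.
    session = MagicMock()
    session.execute.return_value.fetchall.return_value = [("P12345",), ("Q67890",)]
    return CountProteins().execute(session, {"limit": None})
```

Because the operation never opens a connection or touches the queue, a unit test can exercise it with `run_with_mock()` alone, which is the property the monolithic worker classes lacked.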
Automated GO term prediction
The Critical Assessment of Functional Annotation (CAFA) challenges
[Radivojac et al., 2013, Zhou et al., 2019, CAFA Consortium, 2023] have established the reference benchmark
for automated GO term prediction: given a set of target proteins whose
experimental annotations are known at time t1 but not at time t0,
methods submit scored (protein, GO term) predictions and are evaluated
against the t1 − t0 delta using Information Accretion (IA) weighting.
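The temporal holdout can be illustrated with a small set difference. This is a simplified sketch under stated assumptions: annotations are plain (protein, GO term) pairs, the function name `holdout_ground_truth` is hypothetical, and the evidence-code filtering and ontology propagation that a real CAFA benchmark applies are omitted.

```python
from collections import defaultdict

Annotation = tuple[str, str]  # (protein accession, GO term id)


def holdout_ground_truth(
    at_t0: set[Annotation], at_t1: set[Annotation]
) -> dict[str, set[str]]:
    """Return annotations present at t1 but absent at t0, grouped by protein."""
    gained: dict[str, set[str]] = defaultdict(set)
    for protein, term in at_t1 - at_t0:
        gained[protein].add(term)
    return dict(gained)


# Toy snapshots; the accessions and term ids are placeholders.
t0 = {("P1", "GO:0000001")}
t1 = {("P1", "GO:0000001"), ("P1", "GO:0000002"), ("P2", "GO:0000003")}
ground_truth = holdout_ground_truth(t0, t1)
```

Only the annotations gained between the two snapshots are scored, so a method cannot earn credit for what was already known at t0.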
Published methods span three families:
Homology-based transfer. BLAST-style search against a reference set, followed by annotation transfer through sequence identity thresholds. The canonical open-source tools are Pannzer2 [Törönen et al., 2018], InterProScan [Jones et al., 2014] (domain-level signatures), and eggNOG-mapper [Cantalapiedra et al., 2021] (orthology groups).
Embedding-based transfer. Replace BLAST similarity with cosine distance in the embedding space of a protein language model, then transfer annotations from nearest neighbours. DeepGOPlus [Kulmanov and Hoehndorf, 2020] combines a CNN over sequence with DIAMOND hits; SPROF-GO [Yuan et al., 2023] uses ProtT5 embeddings and a learned aggregator.
Deep-learning classifiers. End-to-end networks that predict per-GO-term probabilities directly from sequence or embeddings, e.g. GoFormer and related transformer-based models.
PROTEA falls into the embedding-based family but makes three deliberate
choices that distinguish it from prior work. First, KNN search is performed
in Python (numpy or FAISS [Johnson et al., 2021]) rather than in the database, a
design decision motivated by the observed latency and memory behaviour of
pgvector on 500 000+ vectors (see ADR-001: KNN on CPU, not pgvector or GPU). Second, the reference
set is frozen at t0 by construction; the ingestion pipeline records the
OntologySnapshot OBO version and the AnnotationSet source version of
every reference annotation, so that a prediction produced today is exactly
reproducible against the same references tomorrow. Third, the LightGBM
re-ranker (trained offline in protea-reranker-lab and registered into
PROTEA via POST /reranker-models/import) operates on hand-engineered
features on top of KNN results (Needleman–Wunsch and Smith–Waterman
alignment metrics via
parasail [Daily, 2016], taxonomic distance via ete3
[Huerta-Cepas et al., 2016], and neighbour-aggregate signals) rather than on raw
embeddings, keeping the training signal interpretable.
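The embedding-based transfer step can be sketched with numpy as follows. This is an illustration rather than PROTEA's implementation: the function name `knn_transfer`, the brute-force search, and the scoring rule (each term keeps the highest cosine similarity among the neighbours that carry it) are assumptions chosen for brevity.

```python
import numpy as np


def knn_transfer(query, refs, ref_terms, k=2):
    """Score GO terms for one query from its k most similar references."""
    q = query / np.linalg.norm(query)
    r = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    sims = r @ q                     # cosine similarity to every reference
    top = np.argsort(-sims)[:k]      # indices of the k most similar references
    scores = {}
    for idx in top:
        for term in ref_terms[idx]:
            # Assumed rule: keep the best similarity seen for each term.
            scores[term] = max(scores.get(term, -1.0), float(sims[idx]))
    return scores


# Toy reference set: three 2-dimensional "embeddings" with placeholder terms.
refs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ref_terms = [{"GO:0000001"}, {"GO:0000002"}, {"GO:0000001", "GO:0000003"}]
scores = knn_transfer(np.array([1.0, 0.1]), refs, ref_terms, k=2)
```

In this toy case the query is closest to the first and third references, so the term annotated only on the second reference receives no score. A re-ranker of the kind described above would then take such candidate (protein, term) pairs as input.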
Information-theoretic evaluation with cafaeval
A recurring source of confusion in the GO prediction literature is the
difference between naive Fmax (averaged over predictions, independent of
term specificity) and the CAFA weighted Fmax that uses Information
Accretion [Clark and Radivojac, 2013] to down-weight trivially correct predictions of root
or near-root terms. The open-source cafaeval package [Piovesan and others, 2023]
is the reference implementation of the CAFA scoring protocol, including IA
propagation, NK/LK/PK partitioning, and per-namespace Fmax reporting.
PROTEA delegates scoring entirely to cafaeval: the run_cafa_evaluation
operation writes CAFA-format TSVs, invokes cafaeval as a subprocess, and
parses the resulting per-namespace metrics into an EvaluationResult row.
This design decision avoids the temptation of reimplementing the scoring
logic: any bug in IA computation would invalidate every reported number, and
the literature has already converged on a single trusted implementation. See
CAFA Evaluation Protocol for the full protocol.
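To make the effect of the weighting concrete, the sketch below computes an IA-weighted F-score at a single threshold and its maximum over a list of thresholds. It is a didactic approximation, not the cafaeval implementation: it assumes predictions and ground truth are already propagated up the ontology, it handles one namespace only, and the function names and input shapes are hypothetical.

```python
def weighted_f(pred, truth, ia, tau):
    """IA-weighted F-score at threshold tau.

    pred:  {protein: {term: score}}
    truth: {protein: set of true terms}
    ia:    {term: information accretion}
    """
    precisions, recalls = [], []
    for protein, true_terms in truth.items():
        kept = {t for t, s in pred.get(protein, {}).items() if s >= tau}
        hit = sum(ia[t] for t in kept & true_terms)
        true_w = sum(ia[t] for t in true_terms)
        kept_w = sum(ia[t] for t in kept)
        # Recall is averaged over every benchmark protein.
        recalls.append(hit / true_w if true_w > 0 else 0.0)
        # Precision is averaged only over proteins with weighted predictions.
        if kept_w > 0:
            precisions.append(hit / kept_w)
    if not precisions:
        return 0.0
    pr = sum(precisions) / len(precisions)
    rc = sum(recalls) / len(recalls)
    return 2 * pr * rc / (pr + rc) if pr + rc > 0 else 0.0


def weighted_fmax(pred, truth, ia, thresholds):
    """Maximum weighted F-score over the given thresholds."""
    return max(weighted_f(pred, truth, ia, tau) for tau in thresholds)


# Toy example: the root term carries zero information.
ia = {"root": 0.0, "a": 1.0, "b": 3.0}
truth = {"P1": {"root", "a", "b"}}
pred = {"P1": {"root": 1.0, "a": 0.9, "b": 0.2}}
f_at_half = weighted_f(pred, truth, ia, 0.5)
f_max = weighted_fmax(pred, truth, ia, [0.1, 0.5])
```

At a threshold of 0.5 the prediction recovers two of the three true terms, yet the weighted recall is only 0.25 because the missed term carries most of the information. An unweighted count would report a recall of 2/3.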
Protein language models
The embedding backends supported by PROTEA are all publicly released, pre-trained protein language models:
ProtT5 [Elnaggar et al., 2022]. Encoder/decoder transformer trained on UniRef50 with a T5 denoising objective. The prot_t5_xl_uniref50 checkpoint produces 1024-dimensional residue embeddings, which are mean-pooled to one vector per sequence in the default PROTEA configuration.
ESM-2 [Lin et al., 2023]. Encoder-only transformer from Meta AI trained on UniRef50 with a masked-language-modelling objective. Checkpoints of 35 M, 150 M, 650 M, 3 B, and 15 B parameters are available; PROTEA benchmarks use the 650 M esm2_t33_650M_UR50D variant.
ESM-C [EvolutionaryScale Team, 2024]. ESM Cambrian model family released by EvolutionaryScale in 2024. The 300 M checkpoint produces 960-dimensional embeddings at a fraction of the inference cost of ESM-2 650 M while preserving most of the downstream task performance. ESM-C is the default backend in PROTEA’s benchmarks because of this favourable cost/accuracy trade-off.
Ankh. Encoder/decoder T5-style protein model from ElnaggarLab, available as ankh-base and ankh-large checkpoints. Loaded via the same T5EncoderModel path as ProtT5, but the backend forces bfloat16 on CUDA (FP16 overflows to NaN) and tokenises character by character with is_split_into_words=True because Ankh’s SentencePiece vocabulary maps literal spaces to <unk>; the <AA2fold> prefix is never injected. Not yet used in the benchmark tables (see Results); included for parity with the upstream protein-PLM survey.
All four backends are wrapped by a single EmbeddingConfig row that
records the model checkpoint, layer selection, pooling strategy, and any
post-processing (L2 normalisation). This discipline is necessary for
reproducibility: a prediction run annotated with embedding_config_id can
always be replayed against the exact same model weights and pooling recipe.
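The pooling recipe recorded by such a configuration can be sketched as below. The `EmbeddingRecipe` dataclass is a hypothetical stand-in with fields named after the prose above; it is not PROTEA's EmbeddingConfig schema, and only mean pooling is implemented.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class EmbeddingRecipe:
    """Hypothetical stand-in for the fields an embedding config records."""

    checkpoint: str
    layer: int
    pooling: str
    l2_normalise: bool


def pool_sequence(residues: np.ndarray, recipe: EmbeddingRecipe) -> np.ndarray:
    """Reduce a (length, dim) residue matrix to one (dim,) sequence vector."""
    if recipe.pooling != "mean":
        raise ValueError(f"unsupported pooling: {recipe.pooling}")
    pooled = residues.mean(axis=0)
    if recipe.l2_normalise:
        norm = np.linalg.norm(pooled)
        if norm > 0:  # leave an all-zero vector untouched
            pooled = pooled / norm
    return pooled


# Toy input: two residues with 2-dimensional embeddings.
recipe = EmbeddingRecipe("prot_t5_xl_uniref50", layer=-1, pooling="mean", l2_normalise=True)
vector = pool_sequence(np.array([[3.0, 0.0], [3.0, 8.0]]), recipe)
```

Freezing the recipe in an immutable record means two runs that share it apply the same pooling and normalisation, so their vectors are directly comparable in a cosine-based KNN search.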
Positioning PROTEA
Against this backdrop, PROTEA’s contribution is not a new prediction
algorithm but a reproducible, auditable, and extensible platform that
turns the existing literature into an executable system. The three
architectural invariants (typed operations, two-session job lifecycle, and
versioned reference data) are specifically designed so that an
embedding-based GO predictor can be benchmarked against external tools
(Pannzer2, InterProScan, eggNOG-mapper) under a fair temporal holdout:
reference annotations frozen at t0, ground truth computed from
t1 − t0, and every data source tagged with an immutable version.
The Results chapter quantifies the effect of this discipline through a
data-leakage analysis of the external tools and a set of ablation studies
over the PROTEA pipeline itself.