Introduction
Motivation
The gap between the number of protein sequences deposited in public databases and the number that carry experimentally verified functional annotations has grown by several orders of magnitude over the past decade. UniProtKB/TrEMBL stores more than 250 million unreviewed sequences, while UniProtKB/Swiss-Prot (the manually curated subset) contains fewer than 600 000. Closing this gap by wet-lab experiments is economically infeasible, and the community has therefore invested in automated functional annotation: computational pipelines that transfer Gene Ontology (GO) terms [Radivojac et al., 2013] from a small set of well-characterised reference proteins to the much larger set of unannotated targets.
Automated GO term prediction is now a mature research area with well-defined
benchmarks (the CAFA challenges [Radivojac et al., 2013; Zhou et al., 2019; CAFA Consortium, 2023]), a
reference scoring tool (cafa-evaluator [Piovesan et al., 2023]), and a rich
catalogue of methods surveyed in Related work. Most open-source
pipelines, however, were released as research artefacts: single-purpose
scripts optimised for the paper that accompanies them. When integrated into a
production setting (a web server, a shared lab platform, a recurring batch
job) they expose a cluster of engineering issues that are rarely discussed in
the original publications:
Reproducibility under versioned data. Annotation transfer depends on three versioned inputs: the GO ontology release, the reference annotation set, and the embedding or similarity model. A prediction that is reproducible today often cannot be replayed one year later because none of these versions is recorded alongside the prediction.
Temporal integrity of evaluation. Benchmarking a tool against its own reference database (the default configuration for most open-source predictors) exposes the evaluation to data leakage: the tool has already seen the annotations that the benchmark treats as ground truth. This problem is acknowledged by the CAFA protocol but ignored by many method papers.
Architectural coupling. Pipelines that conflate database sessions, message-broker connections, orchestration logic, and domain computation in the same worker class are hard to extend, hard to unit-test, and fragile under the partial failures typical of long-running jobs against external APIs.
PROTEA is designed to address all three problems jointly, as a single system, so that each concern is enforced by construction rather than by convention.
The legacy coupling problem
The Protein Information System (PIS) and FANTASIA codebases were developed at the Andalusian Centre for Biomedical Bioinformatics (CBBIO) and established foundational infrastructure for protein data ingestion and functional annotation at scale. Both systems share a structural limitation that directly motivates PROTEA: their workers conflate multiple concerns into single classes.
A typical PIS/FANTASIA worker manages its own database session, connects directly to the message broker, orchestrates task sequencing, and executes domain logic, all in the same class. This coupling produces code that is difficult to unit-test (because all infrastructure must be mocked at once), hard to extend (because adding a new operation requires understanding the entire execution context), and fragile under failure (because a queue disconnect or database error can leave jobs in ambiguous states with no audit trail). These consequences are well documented in the enterprise software literature [Fowler, 2002].
Research questions
This thesis investigates whether the engineering issues above can be resolved by architectural discipline without sacrificing prediction quality or computational efficiency. It is organised around three research questions:
- RQ1. Reproducible architecture.
Can a protein functional annotation pipeline be architected so that every prediction is exactly reproducible given the same input data, without sacrificing horizontal scalability on a single GPU and a modest number of CPU workers?
- RQ2. Temporal integrity of evaluation.
To what extent does temporal data leakage inflate the apparent performance of homology-based GO prediction tools (Pannzer2, InterProScan, eggNOG-mapper) when they are benchmarked against their own current reference databases, and can a fair temporal holdout protocol be enforced at the data model level?
- RQ3. Feature engineering on top of KNN.
Does a learned re-ranker that exploits classical pairwise alignment metrics (Needleman–Wunsch, Smith–Waterman) and taxonomic distance features on top of embedding-based KNN results consistently outperform the baseline embedding similarity score, across the full CAFA NK/LK/PK partitioning and all three GO namespaces?
Hypotheses
The three questions are paired with three falsifiable hypotheses:
H1. A strict separation between an Operation protocol (pure domain
logic), a two-session job lifecycle (claim → execute), and a typed
infrastructure layer is sufficient to express every existing PIS/FANTASIA
workflow. No domain operation requires direct access to the message broker
or to the session-management layer.
H2. Open-source homology-based tools evaluated against their current reference databases exhibit measurable exact-match overlap with the ground-truth annotations of a temporal holdout. This overlap is quantifiable and large enough to account for a significant fraction of their apparent Fmax advantage over strictly temporal methods.
H3. A LightGBM binary classifier trained on 20 numeric and 3 categorical features derived from alignment and taxonomy can outperform the baseline cosine similarity score of an embedding-only pipeline across all 9 cells of the NK/LK/PK × BPO/MFO/CCO grid.
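The overlap that H2 refers to can be illustrated with a few lines of set arithmetic. This is a minimal sketch, not the analysis code used in the thesis: it assumes an annotation is identified by a (protein accession, GO term) pair, and the function name `exact_match_overlap` and the toy data are hypothetical.

```python
from typing import Iterable

# Assumption: one annotation = (protein accession, GO term identifier).
Annotation = tuple[str, str]


def exact_match_overlap(
    holdout_truth: Iterable[Annotation],
    tool_reference: Iterable[Annotation],
) -> float:
    """Fraction of holdout ground-truth pairs already present in a tool's reference set."""
    truth = set(holdout_truth)
    if not truth:
        return 0.0  # no ground truth means nothing can have leaked
    leaked = truth & set(tool_reference)
    return len(leaked) / len(truth)


# Toy data: annotations gained during the holdout window ...
truth = [
    ("P1", "GO:0003677"),
    ("P1", "GO:0005634"),
    ("P2", "GO:0016301"),
    ("P3", "GO:0005515"),
]
# ... and the annotations shipped in a tool's current reference database.
reference = [
    ("P1", "GO:0003677"),
    ("P2", "GO:0016301"),
    ("P9", "GO:0008150"),
]

overlap = exact_match_overlap(truth, reference)
print(f"{overlap:.2f}")  # 0.50: two of the four ground-truth pairs are already known
```

An overlap of zero would indicate a temporally clean reference set; any positive value means the tool can reproduce part of the ground truth by lookup rather than by prediction.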
Contributions
Provisional numbers in C2 and C3
The specific figures cited below (62.4 % NK leakage in C2, “improves Fmax across all 9 cells” in C3) come from the pre-2026-04-10 experimental run and will be regenerated for the Zenodo deposit accompanying the thesis. The direction of the findings (large data-leakage overlap for Pannzer2, and the re-ranker surpassing the heuristic) is stable; only the exact values may move slightly when the chapter is re-rendered. See Results for the full provisional notice.
The thesis makes three contributions, one per research question:
C1. A reproducible platform for protein functional annotation, built on
a typed operation protocol, a two-session job lifecycle, a RabbitMQ job queue
with ten routed queues, and a PostgreSQL + pgvector data model that
versions every input (OBO release, annotation set source, embedding config)
by UUID. The platform is released as open source and runs end-to-end on a
single workstation with one GPU. PROTEA currently consolidates eighteen
registered operations covering ingestion, embedding, prediction, evaluation,
re-ranking and provenance maintenance, as well as a one-click /annotate
endpoint that takes a FASTA upload and returns ranked GO predictions. The
authoritative list is the body of
protea.core.operation_catalog.build_operation_registry.
C2. A quantitative data-leakage analysis of Pannzer2, InterProScan, and eggNOG-mapper against a GOA 220 → 229 temporal holdout. The analysis measures exact-match overlap between each tool’s predictions and the ground-truth annotations and shows that up to 62.4 % of the NK ground truth is already present in the Pannzer2 reference database, fully explaining its apparent advantage over temporally strict methods. The chapter Results presents the full numbers and discusses the interpretation.
C3. A temporal-holdout re-ranking pipeline trained on 13 historical GOA splits (releases 160 through 220) using alignment and taxonomy features on top of ESM-C 300M KNN results. The final re-ranker (the iteration with full alignment and taxonomy features) is shown to improve Fmax over the embedding-only baseline across all 9 evaluation cells of the NK/LK/PK × BPO/MFO/CCO grid, while keeping the training signal interpretable (per-feature importances are reported in Results).
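The versioning idea behind C1 can be sketched as a record that travels with every prediction set. The class and field names below (`InputVersions`, `PredictionSet`, `replayable_as`) are hypothetical and only illustrate the principle that the three versioned inputs are pinned by UUID; the actual PROTEA data model is described in the data model chapter.

```python
from dataclasses import dataclass
from uuid import UUID, uuid4


@dataclass(frozen=True)
class InputVersions:
    # Hypothetical field names: one UUID per versioned input named in C1.
    ontology_snapshot_id: UUID   # OBO release
    annotation_set_id: UUID      # annotation set source
    embedding_config_id: UUID    # embedding configuration


@dataclass(frozen=True)
class PredictionSet:
    id: UUID
    inputs: InputVersions


def replayable_as(original: PredictionSet, candidate: PredictionSet) -> bool:
    """Two runs are comparable only if all three pinned inputs are identical."""
    return original.inputs == candidate.inputs


obo, goa, emb = uuid4(), uuid4(), uuid4()
first_run = PredictionSet(uuid4(), InputVersions(obo, goa, emb))
replay = PredictionSet(uuid4(), InputVersions(obo, goa, emb))
newer_goa = PredictionSet(uuid4(), InputVersions(obo, uuid4(), emb))

print(replayable_as(first_run, replay))     # True: same three inputs
print(replayable_as(first_run, newer_goa))  # False: the annotation set changed
```

Making the record immutable (`frozen=True`) mirrors the intent that provenance is written once, at prediction time, and never edited afterwards.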
The PROTEA approach
PROTEA realises these contributions through a deliberate separation of three layers:
- Infrastructure layer (protea/infrastructure/)
Manages database sessions, connection factories, configuration loading, and the RabbitMQ transport. This layer knows nothing about domain logic.
- Execution layer (protea/workers/)
Orchestrates the job lifecycle: claiming a job, dispatching it to the correct operation, and recording the outcome. The BaseWorker uses two independent sessions by design (one to claim with QUEUED → RUNNING, one to execute), ensuring that even a mid-execution crash leaves the database in a consistent, inspectable state.
- Domain layer (protea/core/)
Pure domain logic. Each Operation receives an open session and an emit callback; it returns an OperationResult. Operations do not manage sessions, queues, or HTTP routing. They are individually testable with a mocked session and a noop emit function.
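The domain-layer contract described above can be sketched with a structural type. The exact signature in PROTEA may differ: the `payload` argument, the shape of the `emit` callback, and the `result` field of `OperationResult` are assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Protocol
from unittest.mock import MagicMock

# Assumed callback shape: (event name, structured fields).
EmitFn = Callable[[str, dict[str, Any]], None]


@dataclass
class OperationResult:
    # Hypothetical field; the real class may carry more information.
    result: dict[str, Any] = field(default_factory=dict)


class Operation(Protocol):
    name: str

    def execute(
        self, session: Any, payload: dict[str, Any], *, emit: EmitFn
    ) -> OperationResult: ...


class Ping:
    """Smallest possible operation: no session use, no queue, no HTTP."""

    name = "ping"

    def execute(self, session, payload, *, emit):
        emit("ping.received", {"payload_keys": sorted(payload)})
        return OperationResult(result={"pong": True})


def noop_emit(event: str, fields: dict[str, Any]) -> None:
    """Test double that discards every event."""


# Unit test in miniature: a mocked session and a noop emit are all that is needed.
session = MagicMock()
outcome = Ping().execute(session, {}, emit=noop_emit)
print(outcome.result)  # {'pong': True}
```

Because the operation never opens a session or touches the broker itself, the test needs no running database or message queue.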
The three layers communicate only through well-defined interfaces. Chapter System Overview describes the runtime stack, chapter Job Lifecycle documents the two-session lifecycle in detail, and chapter Operations lists every registered operation together with its payload schema, execution flow, and side effects.
An incremental migration
The goal of PROTEA is not a complete rewrite. PIS tables (protein,
sequence, protein_uniprot_metadata) and FANTASIA computation
workflows are progressively migrated into the new architecture as new
capabilities are added. Each migration step must preserve or improve
computational efficiency and must not introduce regressions in the data model.
The discipline that makes this incremental evolution safe is the combination
of a typed operation protocol, an append-only audit log (JobEvent), and
database migrations managed by Alembic.
Current capabilities
PROTEA currently provides the following registered operations across
the protein functional annotation pipeline (the authoritative list is
the body of protea.core.operation_catalog.build_operation_registry):
- Data ingestion: insert_proteins, fetch_uniprot_metadata, load_ontology_snapshot, load_goa_annotations, load_quickgo_annotations.
- Embedding computation: compute_embeddings (coordinator), compute_embeddings_batch, store_embeddings.
- GO term prediction: predict_go_terms (coordinator), predict_go_terms_batch, store_predictions.
- InterPro-based prediction: load_interpro_go_mapping, run_interproscan_batch, predict_go_terms_from_interpro.
- Evaluation: generate_evaluation_set, run_cafa_evaluation.
- Re-ranker dataset publishing: export_research_dataset (LightGBM training itself lives in protea-reranker-lab; PROTEA only produces the frozen train/eval parquets and serves the registered boosters).
- Diagnostics: ping.
A scoring engine applies weighted formulas or trained LightGBM re-rankers to
prediction sets. The one-click /annotate endpoint automates the entire
workflow from FASTA upload to ranked GO term prediction. The full operation
catalogue is documented in Operations, including the
four ephemeral consumer operations that fan out GPU and KNN work across
batch queues.
Design principle
New operations are added by implementing the Operation protocol and
registering the instance at worker startup. No changes to the
infrastructure or execution layers are required. This is the property
that makes Hypothesis H1 testable: if a new workflow requires
modifications outside the domain layer, the architectural claim is
falsified.
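The design principle can be sketched as a small name-to-instance registry. This is an illustration only: `OperationRegistry`, `build_demo_registry`, and the `count_sequences` operation are hypothetical and are not the contents of protea.core.operation_catalog.build_operation_registry.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Protocol

EmitFn = Callable[[str, dict[str, Any]], None]  # assumed callback shape


@dataclass
class OperationResult:
    result: dict[str, Any] = field(default_factory=dict)  # hypothetical field


class Operation(Protocol):
    name: str

    def execute(
        self, session: Any, payload: dict[str, Any], *, emit: EmitFn
    ) -> OperationResult: ...


class OperationRegistry:
    """Maps operation names to instances; the execution layer only calls get()."""

    def __init__(self) -> None:
        self._operations: dict[str, Operation] = {}

    def register(self, operation: Operation) -> None:
        if operation.name in self._operations:
            raise ValueError(f"operation already registered: {operation.name!r}")
        self._operations[operation.name] = operation

    def get(self, name: str) -> Operation:
        if name not in self._operations:
            raise KeyError(f"unknown operation: {name!r}")
        return self._operations[name]


class CountSequences:
    """A new (hypothetical) operation: only domain code is written."""

    name = "count_sequences"

    def execute(self, session, payload, *, emit):
        n = len(payload["sequences"])
        emit("count_sequences.done", {"n": n})
        return OperationResult(result={"n": n})


def build_demo_registry() -> OperationRegistry:
    """Called once at worker startup; adding an operation is one extra line here."""
    registry = OperationRegistry()
    registry.register(CountSequences())
    return registry


events: list[tuple[str, dict[str, Any]]] = []
registry = build_demo_registry()
operation = registry.get("count_sequences")
outcome = operation.execute(
    None, {"sequences": ["MKT", "MVL"]}, emit=lambda e, f: events.append((e, f))
)
print(outcome.result)  # {'n': 2}
```

The dispatch code never names a concrete operation class, so registering `CountSequences` requires no change outside the domain layer.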
Thesis outline
The remainder of this thesis is organised as follows.
Related work situates PROTEA against existing workflow engines, the CAFA evaluation tradition, homology- and embedding-based GO prediction methods, and the protein language models that supply its embedding backends.
Architecture describes the system architecture in five chapters: the runtime stack and requirements (system_overview), the two-session job lifecycle (job_lifecycle), the versioned data model (data_model), the operation catalogue (operations), and the evaluation protocol with formal NK/LK/PK definitions (evaluation).
Results presents the experimental evaluation: ablations over k and the scoring function, the three re-ranker iterations, the external benchmark against Pannzer2 / InterProScan / eggNOG-mapper, and the quantitative data-leakage analysis.
Appendix contains installation and quickstart instructions, a configuration reference, how-to guides, operational runbooks, and architectural decision records (ADRs) for every non-obvious design choice.
References lists every cited work.
Readers interested in architecture should start with System Overview; readers interested in empirical results should jump directly to Results; readers interested in reproducing the pipeline end-to-end should follow Installation and Quickstart and then the how-to guides.