Related work

PROTEA sits at the intersection of three lines of research that have so far evolved largely in isolation: (i) workflow and pipeline engineering for large-scale bioinformatics, (ii) automated Gene Ontology term prediction from sequence, and (iii) protein language models as general-purpose feature extractors. This chapter situates the system against each of these bodies of work and articulates the specific gap that motivates the platform.

Workflow systems for bioinformatics

General-purpose workflow engines such as Galaxy [Afgan et al., 2018], Snakemake [Köster and Rahmann, 2012], and Nextflow [Di Tommaso et al., 2017] provide reproducible pipeline execution, DAG-style task scheduling, and containerised tool encapsulation. They target the “compose existing tools” workflow pattern: each step wraps a CLI executable, inputs and outputs are files on disk, and provenance is captured by pinning container digests.

This pattern is poorly suited to PROTEA’s domain for three reasons. First, UniProt ingestion and GO annotation are not one-shot file conversions: they are stateful, long-running, partial-progress-tolerant processes that must survive broker disconnects and resume without re-downloading. Second, embedding computation and KNN prediction operate on shared in-memory state (the reference cache) that cannot be serialised to files between steps without prohibitive I/O overhead. Third, the consumers of the output are interactive users through an HTTP API, not batch scripts; they need sub-second status queries, structured event timelines, and the ability to cancel in-flight jobs.

PROTEA is therefore architected as an application with an internal job queue, not as a pipeline expressed in a DSL. The design draws on the job queue pattern familiar from web applications (Celery, Sidekiq, RQ), adapted to the scientific computing setting through a strict separation between domain operations and infrastructure (see System Overview).

The PIS and FANTASIA precursors

The Protein Information System (PIS) and FANTASIA codebases were developed at CBBIO as end-to-end systems for protein data management and functional annotation transfer. PIS established the ingestion side: a PostgreSQL schema, a RabbitMQ-backed job queue, and Python workers that paginate the UniProt REST API. FANTASIA extended this with GPU embedding computation and KNN-based annotation transfer using ProtT5 and ESM models.

Both systems proved that the pipeline was tractable at UniProtKB/Swiss-Prot scale (500 000+ reviewed proteins), but they share a structural weakness: each worker conflates a database session, the AMQP channel, orchestration logic, and domain code in a single class. The consequences are well known from enterprise software [Fowler, 2002]: unit-testable pieces are hard to isolate, new operations inherit boilerplate from an unrelated base class, and partial failures (a broker reconnect mid-job) leave jobs in ambiguous states because state transitions and business logic share the same session. PROTEA is an explicit response: it keeps the data model, the queue topology, and the empirical lessons from PIS/FANTASIA, but rebuilds the execution layer around an Operation protocol that is pure domain logic and testable with a mocked session.
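The idea of an operation that is pure domain logic and testable with a mocked session can be illustrated with a minimal sketch. The names used here (Operation.execute, CountProteins, fetch_accessions, run) are hypothetical and chosen for illustration only; they are not taken from the PROTEA codebase.

```python
from dataclasses import dataclass
from typing import Any, Protocol
from unittest.mock import MagicMock


class Operation(Protocol):
    """Hypothetical shape of a pure-domain operation (assumed, not PROTEA's API)."""

    name: str

    def execute(self, session: Any, payload: dict) -> dict:
        ...


@dataclass
class CountProteins:
    """Toy operation: no broker access, no job-state handling, only domain logic."""

    name: str = "count_proteins"

    def execute(self, session: Any, payload: dict) -> dict:
        # `fetch_accessions` is an illustrative session method.
        rows = session.fetch_accessions(payload["taxon_id"])
        return {"count": len(rows)}


def run(op: Operation, session: Any, payload: dict) -> dict:
    # Infrastructure code would wrap this call with job-state transitions.
    return op.execute(session, payload)


# A mocked session is enough to exercise the operation in a unit test.
mock_session = MagicMock()
mock_session.fetch_accessions.return_value = ["P12345", "Q9Y6K9"]
result = run(CountProteins(), mock_session, {"taxon_id": 9606})
```

Because the operation never touches the AMQP channel or the job record, a broker reconnect cannot leave it half-applied; only the caller owns state transitions.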

Automated GO term prediction

The Critical Assessment of Functional Annotation (CAFA) challenges [Radivojac et al., 2013, Zhou et al., 2019, CAFA Consortium, 2023] have established the reference benchmark for automated GO term prediction: given a set of target proteins whose experimental annotations are known at time t1 but not at time t0, methods submit scored (protein, GO term) predictions and are evaluated against the t1 − t0 delta using Information Accretion (IA) weighting.
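The temporal holdout described above reduces to a set difference between two annotation snapshots. The sketch below is a simplification under stated assumptions: annotations are modelled as (protein, GO term) pairs, the function names are hypothetical, and the real protocol additionally propagates terms through the ontology and partitions targets per namespace.

```python
Annotation = tuple[str, str]  # (protein accession, GO term id)


def holdout_ground_truth(
    at_t0: set[Annotation], at_t1: set[Annotation]
) -> set[Annotation]:
    """Annotations gained between the two snapshots (the t1 - t0 delta)."""
    return at_t1 - at_t0


def no_knowledge_targets(
    at_t0: set[Annotation], at_t1: set[Annotation]
) -> set[str]:
    """Proteins that gained annotations by t1 and had none at all at t0."""
    known = {protein for protein, _ in at_t0}
    gained = {protein for protein, _ in holdout_ground_truth(at_t0, at_t1)}
    return gained - known


t0 = {("P1", "GO:0000001")}
t1 = {("P1", "GO:0000001"), ("P1", "GO:0000002"), ("P2", "GO:0000003")}
delta = holdout_ground_truth(t0, t1)
```

In this toy example P1 gains one term and P2 appears for the first time, so only P2 counts as a target with no prior knowledge.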

Published methods span three families:

  • Homology-based transfer. BLAST-style search against a reference set, followed by annotation transfer through sequence identity thresholds. The canonical open-source tools are Pannzer2 [Törönen et al., 2018], InterProScan [Jones et al., 2014] (domain-level signatures), and eggNOG-mapper [Cantalapiedra et al., 2021] (orthology groups).

  • Embedding-based transfer. Replace BLAST similarity with cosine distance in the embedding space of a protein language model, then transfer annotations from nearest neighbours. DeepGOPlus [Kulmanov and Hoehndorf, 2020] combines a CNN over sequence with DIAMOND hits; SPROF-GO [Yuan et al., 2023] uses ProtT5 embeddings and a learned aggregator.

  • Deep-learning classifiers. End-to-end networks that predict per-GO-term probabilities directly from sequence or embeddings, e.g. GoFormer and related transformer-based models.

PROTEA falls into the embedding-based family but makes three deliberate choices that distinguish it from prior work. First, KNN search is performed in Python (numpy or FAISS [Johnson et al., 2021]) rather than in the database, a decision motivated by the observed latency and memory behaviour of pgvector on 500 000+ vectors (see ADR-001: KNN on CPU, not pgvector or GPU). Second, the reference set is frozen at t0 by construction: the ingestion pipeline records the OntologySnapshot OBO version and the AnnotationSet source version of every reference annotation, so that a prediction produced today is exactly reproducible against the same references tomorrow. Third, the LightGBM re-ranker, trained offline in protea-reranker-lab and registered into PROTEA via POST /reranker-models/import, operates on hand-engineered features computed on top of the KNN results rather than on raw embeddings, which keeps the training signal interpretable. These features are Needleman–Wunsch and Smith–Waterman alignment metrics via parasail [Daily, 2016], taxonomic distance via ete3 [Huerta-Cepas et al., 2016], and neighbour-aggregate signals.
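Embedding-based annotation transfer can be sketched in a few lines. This is an illustrative toy, not PROTEA's implementation: it uses plain Python lists so that it stays self-contained (numpy or FAISS would replace the similarity loop at scale), and the scoring rule (each term receives the best similarity among the neighbours that carry it) is an assumption made for the example.

```python
import math
from collections import defaultdict


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_transfer(
    query: list[float],
    references: dict[str, list[float]],
    annotations: dict[str, set[str]],
    k: int = 2,
) -> dict[str, float]:
    """Transfer GO terms from the k most similar reference proteins."""
    sims = {acc: cosine(query, vec) for acc, vec in references.items()}
    nearest = sorted(sims, key=sims.get, reverse=True)[:k]
    scores: dict[str, float] = defaultdict(float)
    for acc in nearest:
        for term in annotations.get(acc, ()):
            scores[term] = max(scores[term], sims[acc])
    return dict(scores)


references = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
annotations = {
    "A": {"GO:0000001"},
    "B": {"GO:0000002"},
    "C": {"GO:0000001", "GO:0000003"},
}
scores = knn_transfer([1.0, 0.1], references, annotations, k=2)
```

With k = 2 the query's neighbours are A and C, so the term carried only by B is never proposed. A re-ranker of the kind described above would take such (protein, term, neighbour) candidates as input and rescore them from alignment and taxonomy features.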

Information-theoretic evaluation with cafaeval

A recurring source of confusion in the GO prediction literature is the difference between naive Fmax (averaged over predictions, independent of term specificity) and the CAFA weighted Fmax that uses Information Accretion [Clark and Radivojac, 2013] to down-weight trivially correct predictions of root or near-root terms. The open-source cafaeval package [Piovesan et al., 2023] is the reference implementation of the CAFA scoring protocol, including IA propagation, NK/LK/PK partitioning, and per-namespace Fmax reporting.
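The effect of IA weighting is easiest to see on a single protein. The sketch below is a deliberate simplification and not the cafaeval algorithm: it scores one protein only, skips ontology propagation and per-protein averaging, and uses made-up IA values and term names.

```python
def weighted_f(
    predicted: dict[str, float], truth: set[str], ia: dict[str, float], tau: float
) -> float:
    """IA-weighted F-measure for one protein at score threshold tau."""
    kept = {term for term, score in predicted.items() if score >= tau}
    hit = sum(ia.get(term, 0.0) for term in kept & truth)
    if hit == 0:
        return 0.0  # also covers empty predictions and empty ground truth
    precision = hit / sum(ia.get(term, 0.0) for term in kept)
    recall = hit / sum(ia.get(term, 0.0) for term in truth)
    return 2 * precision * recall / (precision + recall)


def weighted_fmax(
    predicted: dict[str, float], truth: set[str], ia: dict[str, float], steps: int = 100
) -> float:
    """Maximum weighted F over an evenly spaced grid of thresholds."""
    return max(
        weighted_f(predicted, truth, ia, i / steps) for i in range(1, steps + 1)
    )


ia = {"root": 0.0, "specific": 3.0, "other": 2.0}  # illustrative IA values
truth = {"root", "specific"}
predicted = {"root": 1.0, "other": 0.6, "specific": 0.4}
best = weighted_fmax(predicted, truth, ia)
```

At the strictest threshold only the root term survives; it is correct but carries zero information, so it earns no credit. The best score is reached only once the informative term is included, at the price of the false positive that outranks it.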

PROTEA delegates scoring entirely to cafaeval: the run_cafa_evaluation operation writes CAFA-format TSVs, invokes cafaeval as a subprocess, and parses the resulting per-namespace metrics into an EvaluationResult row. This design decision avoids the temptation of reimplementing the scoring logic: any bug in IA computation would invalidate every reported number, and the literature has already converged on a single trusted implementation. See CAFA Evaluation Protocol for the full protocol.
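The first step of that hand-off, serialising predictions to a tab-separated file, might look like the sketch below. The three-column layout (target, term, score) without a header is assumed here from the CAFA prediction format; the function name, the score precision, and the sort order are illustrative choices, and the subprocess call itself is omitted because its exact arguments are not given in this chapter.

```python
import csv
import io
from typing import Iterable, TextIO


def write_predictions_tsv(
    rows: Iterable[tuple[str, str, float]], handle: TextIO
) -> None:
    """Write (protein, GO term, score) rows as headerless TSV, best score first."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    ordered = sorted(rows, key=lambda row: (row[0], -row[2], row[1]))
    for protein, term, score in ordered:
        writer.writerow([protein, term, f"{score:.3f}"])


buffer = io.StringIO()
write_predictions_tsv(
    [("P1", "GO:0000002", 0.5), ("P1", "GO:0000001", 0.9)], buffer
)
tsv_text = buffer.getvalue()
```

Writing to a file handle rather than a path keeps the serialiser testable in memory; the operation would pass an open file in a temporary directory before invoking the evaluator.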

Protein language models

The embedding backends supported by PROTEA are all publicly released, pre-trained protein language models:

  • ProtT5 [Elnaggar et al., 2022]. Encoder/decoder transformer trained on UniRef50 with a T5 denoising objective. The prot_t5_xl_uniref50 checkpoint produces 1024-dimensional residue embeddings, which are mean-pooled to one vector per sequence in the default PROTEA configuration.

  • ESM-2 [Lin et al., 2023]. Encoder-only transformer from Meta AI trained on UniRef50 with a masked-language-modelling objective. Checkpoints of 8 M, 35 M, 150 M, 650 M, 3 B, and 15 B parameters are available; PROTEA benchmarks use the 650 M esm2_t33_650M_UR50D variant.

  • ESM-C [EvolutionaryScale Team, 2024]. ESM Cambrian, the representation-learning model family released by EvolutionaryScale in 2024. The 300 M checkpoint produces 960-dimensional embeddings at a fraction of the inference cost of ESM-2 650 M while preserving most of the downstream task performance. ESM-C is the default backend in PROTEA’s benchmarks because of this favourable cost/accuracy trade-off.

  • Ankh. Encoder/decoder T5-style protein model from ElnaggarLab, available as ankh-base and ankh-large checkpoints. Loaded via the same T5EncoderModel path as ProtT5, but the backend forces bfloat16 on CUDA (FP16 overflows to NaN) and tokenises char-by-char with is_split_into_words=True because Ankh’s SentencePiece vocabulary maps literal spaces to <unk>; the <AA2fold> prefix is never injected. Not yet used in the benchmark tables (see Results); included for parity with the upstream protein-PLM survey.

All four backends are configured the same way: each is described by an EmbeddingConfig row that records the model checkpoint, layer selection, pooling strategy, and any post-processing (L2 normalisation). This discipline is necessary for reproducibility: a prediction run annotated with embedding_config_id can always be replayed against the exact same model weights and pooling recipe.
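The pooling recipe that such a row pins down can be sketched as follows. The dataclass fields are hypothetical stand-ins for the recorded attributes (they are not the actual column names), and plain Python lists replace tensors to keep the example self-contained.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingConfig:
    """Illustrative stand-in for the recorded configuration (field names assumed)."""

    checkpoint: str
    layer: int  # -1 denotes the last hidden layer in this sketch
    pooling: str
    l2_normalise: bool


def pool(residues: list[list[float]], config: EmbeddingConfig) -> list[float]:
    """Collapse per-residue embeddings into one vector per sequence."""
    if config.pooling != "mean":
        raise ValueError(f"unsupported pooling: {config.pooling}")
    count = len(residues)
    vector = [sum(column) / count for column in zip(*residues)]
    if config.l2_normalise:
        norm = math.sqrt(sum(x * x for x in vector))
        if norm > 0:
            vector = [x / norm for x in vector]
    return vector


config = EmbeddingConfig(
    checkpoint="esm2_t33_650M_UR50D", layer=-1, pooling="mean", l2_normalise=True
)
pooled = pool([[3.0, 0.0], [3.0, 8.0]], config)
```

Making the configuration immutable mirrors the reproducibility argument: two runs that share a configuration identifier cannot silently diverge in pooling or normalisation.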

Positioning PROTEA

Against this backdrop, PROTEA’s contribution is not a new prediction algorithm but a reproducible, auditable, and extensible platform that turns the existing literature into an executable system. The three architectural invariants (typed operations, two-session job lifecycle, and versioned reference data) are specifically designed so that an embedding-based GO predictor can be benchmarked against external tools (Pannzer2, InterProScan, eggNOG-mapper) under a fair temporal holdout: reference annotations frozen at t0, ground truth computed from t1 − t0, and every data source tagged with an immutable version. The Results chapter quantifies the effect of this discipline through a data-leakage analysis of the external tools and a set of ablation studies over the PROTEA pipeline itself.


2025, frapercan
