PROTEA

PROtein funcTional Embedding-based Annotation

PROTEA is the target platform for the progressive consolidation of the Protein Information System (PIS) and FANTASIA codebases. It provides a clean, decoupled architecture for large-scale protein data ingestion, metadata enrichment, and job orchestration.

Quickstart

Start here: Bring up the full stack from a fresh checkout and run your first job in about ten minutes.

Installation and Quickstart

Architecture

Design: System layers, job lifecycle, data model, the full operation catalogue, the CAFA evaluation protocol, and the ADRs that explain why.

Architecture

API Reference

autodoc: Symbol-level documentation for protea.core, protea.infrastructure, the FastAPI routers, and every worker class.

API Reference

Complexity

Performance: Big-O profile per pipeline stage, measured hot paths, and a guide to profiling with scalene and pyinstrument.

Computational Complexity

Results

Evaluation: Benchmark numbers, ablation studies, the re-ranker training pipeline, and the figures that back the thesis.

Results

What is PROTEA?

PROTEA is a platform for protein functional annotation. It runs from sequence ingestion through GPU embedding computation (ESM-2, ESM-C, T5/ProstT5, Ankh) and KNN-based GO term prediction to CAFA evaluation and LightGBM re-ranking, keeping infrastructure, execution flow, and domain logic cleanly separated.
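To make the KNN-based GO term prediction step concrete, here is a minimal sketch of similarity-weighted annotation transfer over embeddings. It is an illustration under stated assumptions, not PROTEA's actual implementation: the function name `knn_go_scores`, the use of cosine similarity, the weighting scheme, and the toy two-dimensional embeddings are all hypothetical.

```python
import numpy as np
from collections import defaultdict


def knn_go_scores(query, ref_embeddings, ref_annotations, k=3):
    """Score GO terms for one query embedding (hypothetical helper).

    Each of the k most similar reference proteins votes for its GO terms,
    weighted by cosine similarity. Scores are normalised by the total
    weight, so a term carried by every neighbour scores 1.0.
    """
    # Normalise rows so the dot product equals cosine similarity.
    ref_norm = ref_embeddings / np.linalg.norm(ref_embeddings, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    sims = ref_norm @ query_norm

    # Indices of the k most similar reference proteins.
    top = np.argsort(-sims)[:k]

    scores = defaultdict(float)
    total_weight = 0.0
    for idx in top:
        weight = max(float(sims[idx]), 0.0)  # ignore negative similarities
        total_weight += weight
        for term in ref_annotations[idx]:
            scores[term] += weight

    if total_weight == 0.0:
        return {}
    return {term: s / total_weight for term, s in scores.items()}


# Toy reference set: three proteins with made-up 2-D embeddings.
ref_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
ref_annotations = [
    {"GO:0003824"},
    {"GO:0003824", "GO:0005515"},
    {"GO:0005215"},
]
query = np.array([1.0, 0.05])

predicted = knn_go_scores(query, ref_embeddings, ref_annotations, k=2)
for term, score in sorted(predicted.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{score:.3f}")
```

In this toy run, the two nearest neighbours both carry GO:0003824, so it gets the top score, while GO:0005215 belongs only to the distant third protein and is not predicted. At scale, the brute-force similarity scan would typically be replaced by an index-backed nearest-neighbour search, and the resulting scores would feed a downstream re-ranking stage.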

Documentation