PROTEA
PROtein funcTional Embedding-based Annotation
PROTEA is the target platform for the progressive consolidation of the Protein Information System (PIS) and FANTASIA codebases. It provides a clean, decoupled architecture for large-scale protein data ingestion, metadata enrichment, and job orchestration.
Start here: Bring up the full stack from a fresh checkout and run your first job in about ten minutes.
Design: System layers, job lifecycle, data model, the full operation catalogue, the CAFA evaluation protocol, and the ADRs that explain why.
autodoc: Symbol-level documentation for protea.core, protea.infrastructure, the FastAPI routers, and every worker class.
Performance: Big-O profile per pipeline stage, measured hot paths, and a guide to profiling with scalene and pyinstrument.
Evaluation: Benchmark numbers, ablation studies, the re-ranker training pipeline, and the figures that back the thesis.
What is PROTEA?
PROTEA is a platform for protein functional annotation. It covers the full pipeline, from sequence ingestion through GPU embedding computation (ESM-2, ESM-C, T5/ProstT5, Ankh) and KNN-based GO term prediction to CAFA evaluation and LightGBM re-ranking, with a clean separation of infrastructure, execution flow, and domain logic.
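To make the KNN-based GO term prediction step concrete, here is a minimal sketch of embedding-based annotation transfer. It is not PROTEA's implementation: the function names (`cosine_similarity`, `knn_go_scores`), the similarity-weighted scoring rule, and the toy two-dimensional embeddings are all assumptions chosen for illustration. Real embeddings from ESM-2 or similar models have hundreds or thousands of dimensions, and a production system would typically use a vector index rather than a linear scan.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_go_scores(query, reference, k=5):
    """Score GO terms for a query embedding from its k nearest annotated neighbours.

    `reference` is a list of (embedding, set_of_go_terms) pairs (hypothetical layout).
    Each neighbour votes for its GO terms with a weight equal to its cosine
    similarity (negative similarities are clipped to zero); scores are
    normalised by the total weight, so they fall in [0, 1].
    """
    # Linear scan: compute similarity to every reference protein, keep the top k.
    scored = [(cosine_similarity(query, emb), terms) for emb, terms in reference]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    neighbours = scored[:k]

    totals = defaultdict(float)
    weight_sum = 0.0
    for sim, terms in neighbours:
        weight = max(sim, 0.0)
        weight_sum += weight
        for term in terms:
            totals[term] += weight

    if weight_sum == 0.0:
        return {}
    return {term: score / weight_sum for term, score in totals.items()}


if __name__ == "__main__":
    # Toy reference set; the embeddings and annotations are illustrative only.
    reference = [
        ([1.0, 0.0], {"GO:0003677"}),
        ([0.9, 0.1], {"GO:0003677", "GO:0005634"}),
        ([0.0, 1.0], {"GO:0016301"}),
    ]
    for term, score in sorted(knn_go_scores([1.0, 0.0], reference, k=2).items()):
        print(term, round(score, 3))
```

In a sketch like this, the per-term scores are the kind of candidate list that a downstream re-ranker could reorder before evaluation; how PROTEA actually weights neighbours and feeds the re-ranker is described in the Architecture and Evaluation sections.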
Documentation
- Abstract
- Introduction
- Related work
- Architecture
- Computational Complexity
- Plugin authoring guide
- Results
- Appendix
- Runbooks
- Deployment Guide
- Secrets management runbook (sops + age onboarding)
- Disaster Recovery
- Stale Job Reaper
- DLQ Triage
- Ngrok Deploy Recovery
- Embedding Worker OOM
- schema_sha_v2 backfill
- schema_sha_v2 rollout (T1.6)
- Observability: OpenTelemetry SDK
- Observability Operator Runbook
- Observability: Loki log aggregation
- Observability: Prometheus metrics
- Process-Based Stack Deployment Guide
- Quality Engineering
- Operational Insights and Lessons Learned
- Glossary
- References