Data Model
All models use SQLAlchemy 2.x declarative style with Mapped[] type annotations.
The schema is managed by Alembic (8 migrations to date).
Protein and sequence deduplication
┌──────────────────────────┐        ┌────────────────────────┐
│ Protein                  │        │ Sequence               │
│──────────────────────────│  N→1   │────────────────────────│
│ accession (PK)           │───────▶│ id (PK, autoincrement) │
│ canonical_accession      │        │ sequence (Text)        │
│ is_canonical             │        │ sequence_hash (MD5)    │
│ isoform_index            │        └────────────────────────┘
│ entry_name               │
│ reviewed                 │        ┌──────────────────────────────┐
│ taxonomy_id              │  N→1   │ ProteinUniProtMetadata       │
│ organism                 │───────▶│──────────────────────────────│
│ gene_name                │ (view) │ canonical_accession (PK)     │
│ length                   │        │ function_cc, ec_number, ...  │
│ sequence_id (FK)         │        └──────────────────────────────┘
└──────────────────────────┘
- Sequence
Stores unique amino-acid sequences, deduplicated by MD5 hash (`sequence_hash`). Many `Protein` rows can reference the same `Sequence`; `sequence_id` is deliberately non-unique.
- Protein
One row per UniProt accession, including isoforms (`<canonical>-<n>`). Isoforms share the same `canonical_accession` and are differentiated by `is_canonical` and `isoform_index`. The relationship to `ProteinUniProtMetadata` is view-only (no foreign key), joined by `canonical_accession`.
- ProteinUniProtMetadata
One row per canonical accession. Stores raw UniProt functional annotations (functional description, EC numbers, pathways, kinetics, etc.) as `Text` fields. Isoforms inherit metadata via the `canonical_accession` join.
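
The deduplication described above can be illustrated with a small, self-contained sketch. It uses the standard-library `sqlite3` module as a stand-in for the real database and plain SQL instead of the project's SQLAlchemy models. The `get_or_create_sequence` helper, the UNIQUE constraint on `sequence_hash`, the reduced column set, and the sample accessions and sequences are all assumptions made for illustration.

```python
import hashlib
import sqlite3


def get_or_create_sequence(conn: sqlite3.Connection, residues: str) -> int:
    """Return the id of the row holding `residues`, inserting it if absent."""
    digest = hashlib.md5(residues.encode("ascii")).hexdigest()
    # Assumed UNIQUE constraint on sequence_hash: duplicates are ignored.
    conn.execute(
        "INSERT OR IGNORE INTO sequence (sequence, sequence_hash) VALUES (?, ?)",
        (residues, digest),
    )
    row = conn.execute(
        "SELECT id FROM sequence WHERE sequence_hash = ?", (digest,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sequence (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        sequence TEXT NOT NULL,
        sequence_hash TEXT NOT NULL UNIQUE
    );
    CREATE TABLE protein (
        accession TEXT PRIMARY KEY,
        canonical_accession TEXT NOT NULL,
        is_canonical INTEGER NOT NULL,
        sequence_id INTEGER NOT NULL REFERENCES sequence(id)  -- not UNIQUE
    );
    """
)

# Hypothetical accessions: two proteins share one sequence, one isoform differs.
proteins = [
    ("P00001", "P00001", 1, "MKTAYIAK"),
    ("Q00002", "Q00002", 1, "MKTAYIAK"),
    ("P00001-2", "P00001", 0, "MKTAY"),
]
for accession, canonical, is_canonical, residues in proteins:
    seq_id = get_or_create_sequence(conn, residues)
    conn.execute(
        "INSERT INTO protein VALUES (?, ?, ?, ?)",
        (accession, canonical, is_canonical, seq_id),
    )

n_sequences = conn.execute("SELECT COUNT(*) FROM sequence").fetchone()[0]
n_proteins = conn.execute("SELECT COUNT(*) FROM protein").fetchone()[0]
```

Three protein rows end up pointing at two sequence rows, which is why `sequence_id` cannot carry a uniqueness constraint.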
GO ontology
┌──────────────────────────┐    1→N    ┌────────────────────────┐
│ OntologySnapshot         │──────────▶│ GOTerm                 │
│──────────────────────────│           │────────────────────────│
│ id (UUID, PK)            │           │ id (PK)                │
│ obo_url                  │           │ go_id (e.g. GO:0003674)│
│ obo_version              │           │ name                   │
│ loaded_at                │           │ aspect (F/P/C)         │
└──────────────────────────┘           │ definition             │
                                       │ is_obsolete            │
                                       │ ontology_snapshot_id   │
                                       └──────────┬─────────────┘
                                                  │
                                       ┌──────────▼─────────────┐
                                       │ GOTermRelationship     │
                                       │────────────────────────│
                                       │ child_go_term_id (FK)  │
                                       │ parent_go_term_id (FK) │
                                       │ relation_type          │
                                       │ ontology_snapshot_id   │
                                       └────────────────────────┘
- OntologySnapshot
One row per loaded OBO file release, versioned by `obo_version` (unique constraint). Loading is idempotent: if a snapshot already exists with its relationships, it is skipped. If relationships are missing, they are backfilled automatically.
- GOTerm
One row per GO term per snapshot. `aspect` is one of `F` (molecular function), `P` (biological process), or `C` (cellular component).
- GOTermRelationship
Directed edge in the GO DAG. `relation_type` is one of `is_a`, `part_of`, `regulates`, `positively_regulates`, `negatively_regulates`. Used by `GET /annotations/snapshots/{id}/subgraph` for BFS ancestor traversal.
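
A BFS ancestor traversal over `GOTermRelationship` edges might look like the sketch below. It is not the endpoint's actual implementation: the `ancestors_bfs` function, the in-memory edge tuples, the placeholder term ids (`T1` to `T5`), and the choice to follow only `is_a` and `part_of` edges by default are assumptions for illustration.

```python
from collections import deque


def ancestors_bfs(start, edges, relation_types=("is_a", "part_of")):
    """Return the ancestors of `start` in breadth-first order.

    `edges` is an iterable of (child, parent, relation_type) tuples,
    mirroring the child/parent/relation_type columns of GOTermRelationship.
    """
    # Build a child -> [parents] adjacency map for the selected relations.
    parents = {}
    for child, parent, relation in edges:
        if relation in relation_types:
            parents.setdefault(child, []).append(parent)

    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        term = queue.popleft()
        for parent in parents.get(term, []):
            if parent not in seen:  # a DAG can reach one ancestor by many paths
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Placeholder term ids, not real GO identifiers.
edges = [
    ("T4", "T2", "is_a"),
    ("T4", "T3", "part_of"),
    ("T2", "T1", "is_a"),
    ("T3", "T1", "is_a"),
    ("T4", "T5", "regulates"),
]
result = ancestors_bfs("T4", edges)  # ["T2", "T3", "T1"]
```

The `seen` set matters because `T1` is reachable through both `T2` and `T3`; without it, shared ancestors would be reported more than once.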
Annotation sets
┌──────────────────────┐    1→N    ┌────────────────────────────────┐
│ AnnotationSet        │──────────▶│ ProteinGOAnnotation            │
│──────────────────────│           │────────────────────────────────│
│ id (UUID, PK)        │           │ id (PK)                        │
│ source (goa/quickgo) │           │ protein_accession              │
│ source_version       │           │ go_term_id (FK → GOTerm)       │
│ ontology_snapshot_id │           │ annotation_set_id (FK)         │
│ job_id               │           │ qualifier                      │
│ created_at           │           │ evidence_code                  │
│ meta (JSONB)         │           │ assigned_by                    │
└──────────────────────┘           │ db_reference                   │
                                   │ with_from                      │
                                   │ annotation_date                │
                                   └────────────────────────────────┘
- AnnotationSet
Groups a batch of protein GO annotations by source (`goa` or `quickgo`) and ontology snapshot version.
- ProteinGOAnnotation
One row per (protein, GO term, annotation set) triple. Stores all GAF/QuickGO evidence fields verbatim.
Embeddings
┌──────────────────────────┐    1→N    ┌──────────────────────────────┐
│ EmbeddingConfig          │──────────▶│ SequenceEmbedding            │
│──────────────────────────│           │──────────────────────────────│
│ id (UUID, PK)            │           │ id (PK)                      │
│ model_name               │           │ sequence_id (FK)             │
│ model_backend            │           │ embedding_config_id (FK)     │
│ layer_indices            │           │ embedding (VECTOR)           │
│ layer_agg                │           │ chunk_index_s (int)          │
│ pooling                  │           │ chunk_index_e (int, nullable)│
│ normalize                │           └──────────────────────────────┘
│ normalize_residues       │
│ max_length               │
│ use_chunking             │
│ chunk_size               │
│ chunk_overlap            │
│ description              │
│ created_at               │
└──────────────────────────┘
- EmbeddingConfig
Defines a reproducible embedding recipe (model, layer selection, pooling strategy, chunking). Referenced by both `SequenceEmbedding` rows and `PredictionSet` rows to ensure query and reference embeddings are always comparable.
- SequenceEmbedding
Stores a pgvector VECTOR for one (sequence, config, chunk) triple. When chunking is disabled: `chunk_index_s=0`, `chunk_index_e=NULL`. When chunking is enabled: each chunk is a separate row with its own start/end indices.

Note
KNN search is never performed at the DB layer. Embeddings are loaded into numpy arrays and searched via `protea.core.knn_search` using numpy or FAISS.
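
To make the chunk indexing concrete, the sketch below computes the `(chunk_index_s, chunk_index_e)` pairs that would be stored for one sequence. The `chunk_spans` function is hypothetical. It assumes that `chunk_size` and `chunk_overlap` are measured in residues, that consecutive chunks advance by `chunk_size - chunk_overlap`, and that the end index is exclusive; the source states none of these.

```python
def chunk_spans(length, use_chunking, chunk_size, chunk_overlap):
    """Return (chunk_index_s, chunk_index_e) pairs for a sequence of `length`."""
    if not use_chunking:
        # Chunking disabled: a single row with start 0 and a NULL end index.
        return [(0, None)]

    step = chunk_size - chunk_overlap  # assumed stride between chunk starts
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    spans = []
    start = 0
    while True:
        end = min(start + chunk_size, length)  # assumed exclusive end index
        spans.append((start, end))
        if end >= length:
            break
        start += step
    return spans


whole = chunk_spans(10, use_chunking=False, chunk_size=4, chunk_overlap=1)
# [(0, None)]
windows = chunk_spans(10, use_chunking=True, chunk_size=4, chunk_overlap=1)
# [(0, 4), (3, 7), (6, 10)]
```

Each returned pair corresponds to one `SequenceEmbedding` row for a given (sequence, config) combination.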
Query sets
┌──────────────────────┐    1→N    ┌──────────────────────────────┐
│ QuerySet             │──────────▶│ QuerySetEntry                │
│──────────────────────│           │──────────────────────────────│
│ id (UUID, PK)        │           │ id (PK)                      │
│ name                 │           │ query_set_id (FK)            │
│ description          │           │ accession (original header)  │
│ created_at           │           │ sequence_id (FK → Sequence)  │
└──────────────────────┘           └──────────────────────────────┘
- QuerySet
User-uploaded FASTA dataset for custom prediction queries. Created via `POST /query-sets` (multipart upload).
- QuerySetEntry
One row per FASTA entry. Preserves the original accession header from the FASTA file and links to the deduplicated `Sequence` row (reuses existing sequences if the amino-acid string is already in the DB).
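
The mapping from an uploaded FASTA file to `QuerySetEntry` rows can be sketched as follows. The `parse_fasta` and `link_entries` helpers are hypothetical, the `sequence_ids` dictionary stands in for the `Sequence` table, and storing the full header line (rather than only its first token) as `accession` is an assumption.

```python
import hashlib


def parse_fasta(text):
    """Yield-free FASTA parser: returns a list of (header, residues) pairs."""
    entries = []
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(parts)))
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)  # sequences may wrap over several lines
    if header is not None:
        entries.append((header, "".join(parts)))
    return entries


def link_entries(entries, sequence_ids):
    """Build QuerySetEntry-like dicts, reusing known sequences by MD5 hash."""
    rows = []
    for header, residues in entries:
        digest = hashlib.md5(residues.encode("ascii")).hexdigest()
        if digest not in sequence_ids:
            sequence_ids[digest] = len(sequence_ids) + 1  # new Sequence row
        rows.append({"accession": header, "sequence_id": sequence_ids[digest]})
    return rows


fasta = ">query_1 sample\nMKTA\nYIAK\n>query_2\nMKTAYIAK\n>query_3\nMSLL\n"
rows = link_entries(parse_fasta(fasta), sequence_ids={})
```

Here `query_1` and `query_2` carry the same residues, so both rows reference the same `sequence_id` while keeping their own headers.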
Predictions
┌──────────────────────────────┐    1→N    ┌───────────────────────────────────┐
│ PredictionSet                │──────────▶│ GOPrediction                      │
│──────────────────────────────│           │───────────────────────────────────│
│ id (UUID, PK)                │           │ id (PK)                           │
│ embedding_config_id (FK)     │           │ prediction_set_id (FK)            │
│ annotation_set_id (FK)       │           │ protein_accession (query)         │
│ ontology_snapshot_id (FK)    │           │ go_term_id (FK)                   │
│ query_set_id (FK, nullable)  │           │ distance (cosine/L2)              │
│ limit_per_entry              │           │ ref_protein_accession             │
│ distance_threshold           │           │ qualifier, evidence_code          │
│ created_at                   │           │ ── alignment (NW) ──              │
└──────────────────────────────┘           │ identity_nw, similarity_nw        │
                                           │ alignment_score_nw                │
                                           │ gaps_pct_nw, alignment_length_nw  │
                                           │ ── alignment (SW) ──              │
                                           │ identity_sw, similarity_sw        │
                                           │ alignment_score_sw                │
                                           │ gaps_pct_sw, alignment_length_sw  │
                                           │ ── lengths ──                     │
                                           │ length_query, length_ref          │
                                           │ ── taxonomy ──                    │
                                           │ query_taxonomy_id                 │
                                           │ ref_taxonomy_id                   │
                                           │ taxonomic_lca                     │
                                           │ taxonomic_distance                │
                                           │ taxonomic_common_ancestors        │
                                           │ taxonomic_relation                │
                                           └───────────────────────────────────┘
- PredictionSet
Groups all GO predictions for one run of `predict_go_terms`. References the `EmbeddingConfig`, `AnnotationSet`, and `OntologySnapshot` used. Optionally linked to a `QuerySet` when predictions were run from a FASTA upload.
- GOPrediction
One row per (query protein, GO term, reference protein) triple. The alignment and taxonomy columns are `NULL` unless `compute_alignments=true` and/or `compute_taxonomy=true` were set in the prediction payload.
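
The (query protein, GO term, reference protein) row shape can be illustrated with a rough sketch of annotation transfer from nearest neighbours. This is not the `predict_go_terms` implementation. The `transfer_annotations` function, its input dictionaries, and the reading of `limit_per_entry` as a cap on reference neighbours per query and of `distance_threshold` as an upper bound on distance are assumptions.

```python
def transfer_annotations(neighbors, annotations, limit_per_entry,
                         distance_threshold=None):
    """Build GOPrediction-like rows from precomputed nearest neighbours.

    neighbors:   {query_accession: [(ref_accession, distance), ...]}
    annotations: {ref_accession: [go_term_id, ...]}
    """
    rows = []
    for query, hits in neighbors.items():
        kept = sorted(hits, key=lambda hit: hit[1])  # closest first
        if distance_threshold is not None:
            kept = [hit for hit in kept if hit[1] <= distance_threshold]
        for ref, distance in kept[:limit_per_entry]:
            for go_term_id in annotations.get(ref, []):
                rows.append({
                    "protein_accession": query,
                    "go_term_id": go_term_id,
                    "ref_protein_accession": ref,
                    "distance": distance,
                    # Left as None (NULL) here, as when compute_alignments
                    # and compute_taxonomy are not requested.
                    "identity_nw": None,
                    "taxonomic_distance": None,
                })
    return rows


# Hypothetical accessions and integer GO term ids.
neighbors = {"Q1": [("R2", 0.30), ("R1", 0.10), ("R3", 0.90)]}
annotations = {"R1": [11, 12], "R2": [12], "R3": [13]}
predictions = transfer_annotations(
    neighbors, annotations, limit_per_entry=2, distance_threshold=0.5
)
```

GO term `12` appears twice for `Q1`, once per supporting reference protein, which is why the reference accession is part of the row identity.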
Job queue
┌────────────────────────────┐    1→N    ┌──────────────────────────┐
│ Job                        │──────────▶│ JobEvent                 │
│────────────────────────────│           │──────────────────────────│
│ id (UUID, PK)              │           │ id (BigInt, PK)          │
│ operation                  │           │ job_id (FK)              │
│ queue_name                 │           │ event (str)              │
│ status (enum)              │           │ message (str, nullable)  │
│ parent_job_id (FK, null)   │           │ fields (JSONB)           │
│ payload (JSONB)            │           │ level (info/warn/error)  │
│ meta (JSONB)               │           │ ts (timestamp)           │
│ progress_current           │           └──────────────────────────┘
│ progress_total             │
│ error_code                 │
│ error_message              │
│ created_at / started_at /  │
│ finished_at                │
└────────────────────────────┘
- Job
Central entity of the job queue. `parent_job_id` links child batch jobs to their coordinator parent (used in distributed pipelines). `progress_current` / `progress_total` track batch completion for progress bars.
- JobEvent
Append-only audit log. Written by the `emit` callback during execution. The frontend renders these as a chronological timeline. Events are never updated or deleted.
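
The append-only behaviour of `JobEvent` can be sketched with an in-memory stand-in. The `EventLog` class and the signature of its `emit` method are hypothetical; the real callback's parameters are not shown in this document. Only the stored fields (`event`, `message`, `fields`, `level`, `ts`) follow the schema above.

```python
from datetime import datetime, timezone


class EventLog:
    """In-memory stand-in for the JobEvent table: rows are only ever appended."""

    LEVELS = ("info", "warn", "error")

    def __init__(self, job_id):
        self.job_id = job_id
        self._events = []

    def emit(self, event, message=None, level="info", **fields):
        if level not in self.LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        self._events.append({
            "id": len(self._events) + 1,
            "job_id": self.job_id,
            "event": event,
            "message": message,
            "fields": fields,  # free-form structured data (JSONB in the schema)
            "level": level,
            "ts": datetime.now(timezone.utc),
        })

    def timeline(self):
        # Expose a read-only view in insertion (chronological) order;
        # there is deliberately no update or delete method.
        return tuple(self._events)


log = EventLog(job_id="job-123")  # hypothetical job id
log.emit("job.started")
log.emit("batch.done", message="first batch finished", processed=50)
log.emit("job.failed", message="worker raised", level="error")
events = log.timeline()
```

Returning a tuple keeps callers from reordering or removing entries, which mirrors the rule that events are never updated or deleted.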
Status enum
| Value | Meaning |
|---|---|
| | Created, waiting in RabbitMQ |
| | Worker has claimed the job |
| | Operation completed successfully |
| | Operation raised an exception |
| | Cancelled via API before or during execution |