Data Model¶

All models use SQLAlchemy 2.x declarative style with Mapped[] type annotations. The schema is managed by Alembic (35 migrations to date).

Protein and sequence deduplication ¶

┌──────────────────────────┐        ┌────────────────────────┐
│         Protein          │        │        Sequence        │
│──────────────────────────│  N→1   │────────────────────────│
│ accession (PK)           │───────▶│ id (PK, autoincrement) │
│ canonical_accession      │        │ sequence (Text)        │
│ is_canonical             │        │ sequence_hash (MD5)    │
│ isoform_index            │        └────────────────────────┘
│ entry_name               │
│ reviewed                 │        ┌──────────────────────────────┐
│ taxonomy_id              │  N→1   │   ProteinUniProtMetadata     │
│ organism                 │───────▶│──────────────────────────────│
│ gene_name                │ (view) │ canonical_accession (PK)     │
│ length                   │        │ function_cc, ec_number, ...  │
│ sequence_id (FK)         │        └──────────────────────────────┘
└──────────────────────────┘

Sequence: Stores unique amino-acid sequences, deduplicated by MD5 hash (sequence_hash). Many Protein rows can reference the same Sequence; sequence_id is deliberately non-unique.
Protein: One row per UniProt accession, including isoforms (<canonical>-<n>). Isoforms share the same canonical_accession and are differentiated by is_canonical and isoform_index. The relationship to ProteinUniProtMetadata is view-only (no foreign key), joined by canonical_accession.
ProteinUniProtMetadata: One row per canonical accession. Stores raw UniProt functional annotations (functional description, EC numbers, pathways, kinetics, etc.) as Text fields. Isoforms inherit metadata via the canonical_accession join.

GO ontology ¶

┌──────────────────────────┐     1→N    ┌────────────────────────┐
│    OntologySnapshot      │──────────▶│        GOTerm          │
│──────────────────────────│           │────────────────────────│
│ id (UUID, PK)            │           │ id (PK)                │
│ obo_url                  │           │ go_id (e.g. GO:0003674)│
│ obo_version              │           │ name                   │
│ loaded_at                │           │ aspect (F/P/C)         │
└──────────────────────────┘           │ definition             │
                                       │ is_obsolete            │
                                       │ ontology_snapshot_id   │
                                       └──────────┬─────────────┘
                                                  │
                                       ┌──────────▼─────────────┐
                                       │  GOTermRelationship    │
                                       │────────────────────────│
                                       │ child_go_term_id (FK)  │
                                       │ parent_go_term_id (FK) │
                                       │ relation_type          │
                                       │ ontology_snapshot_id   │
                                       └────────────────────────┘

OntologySnapshot: One row per loaded OBO file release, versioned by obo_version (unique constraint). Idempotent load: if a snapshot already exists with its relationships, it is skipped. If relationships are missing they are backfilled automatically.
GOTerm: One row per GO term per snapshot. aspect is one of F (molecular function), P (biological process), or C (cellular component).
GOTermRelationship: Directed edge in the GO DAG. relation_type is one of is_a, part_of, regulates, positively_regulates, negatively_regulates. Used by GET /annotations/snapshots/{id}/subgraph for BFS ancestor traversal.

Annotation sets ¶

┌──────────────────────┐     1→N    ┌────────────────────────────────┐
│    AnnotationSet     │──────────▶│     ProteinGOAnnotation        │
│──────────────────────│           │────────────────────────────────│
│ id (UUID, PK)        │           │ id (PK)                        │
│ source (goa/quickgo) │           │ protein_accession              │
│ source_version       │           │ go_term_id (FK → GOTerm)       │
│ ontology_snapshot_id │           │ annotation_set_id (FK)         │
│ job_id               │           │ qualifier                      │
│ created_at           │           │ evidence_code                  │
│ meta (JSONB)         │           │ assigned_by                    │
└──────────────────────┘           │ db_reference                   │
                                   │ with_from                      │
                                   │ annotation_date                │
                                   └────────────────────────────────┘

AnnotationSet: Groups a batch of protein GO annotations by source (goa or quickgo) and ontology snapshot version.
ProteinGOAnnotation: One row per (protein, GO term, annotation set) triple. Stores all GAF/QuickGO evidence fields verbatim.

Defines a reproducible embedding recipe (model, layer selection, pooling strategy, chunking). Referenced by both SequenceEmbedding rows and PredictionSet rows to ensure query and reference embeddings are always comparable.

Valid values for model_backend are esm (HuggingFace EsmModel / ESM-2), esm3c (ESM SDK ESMC), t5 (T5EncoderModel for ProstT5 and prot_t5_xl_uniref50), ankh (T5EncoderModel for ElnaggarLab/ankh-base / ankh-large, loaded via AutoTokenizer, forced to bfloat16 on CUDA because Ankh overflows to NaN under FP16, and tokenised char-by-char with is_split_into_words=True because its SentencePiece vocabulary maps literal spaces to <unk>; the <AA2fold> prefix is never injected), and auto (alias for esm).

SequenceEmbedding

Stores a pgvector VECTOR for one (sequence, config, chunk) triple. When chunking is disabled: chunk_index_s=0, chunk_index_e=NULL. When chunking is enabled: each chunk is a separate row with its own start/end indices.

chunk_index_s and chunk_index_e are amino-acid positions on every backend. The embedding operation strips all special tokens (CLS/BOS/EOS and ProstT5’s <AA2fold> prefix) from the residue tensor before chunking, so chunk_index_s=0 always refers to the first amino acid and chunk_index_e equals the amino-acid length for the final chunk of a full-length sequence. This convention was unified on 2026-04-10; embeddings computed before that date used backend-specific slicing and must be recomputed to be directly comparable.

Note

KNN search is never performed at the DB layer. Embeddings are loaded into numpy arrays and searched via protea.core.knn_search using numpy or FAISS.

Query sets ¶

┌──────────────────────┐     1→N    ┌──────────────────────────────┐
│      QuerySet        │──────────▶│       QuerySetEntry          │
│──────────────────────│           │──────────────────────────────│
│ id (UUID, PK)        │           │ id (PK)                      │
│ name                 │           │ query_set_id (FK)            │
│ description          │           │ accession (original header)  │
│ created_at           │           │ sequence_id (FK → Sequence)  │
└──────────────────────┘           └──────────────────────────────┘

QuerySet: User-uploaded FASTA dataset for custom prediction queries. Created via POST /query-sets (multipart upload).
QuerySetEntry: One row per FASTA entry. Preserves the original accession header from the FASTA file and links to the deduplicated Sequence row (reuses existing sequences if the amino-acid string is already in the DB).

Predictions ¶

┌──────────────────────────────┐     1→N    ┌───────────────────────────────────┐
│        PredictionSet         │──────────▶│          GOPrediction             │
│──────────────────────────────│           │───────────────────────────────────│
│ id (UUID, PK)                │           │ id (PK)                           │
│ embedding_config_id (FK)     │           │ prediction_set_id (FK)            │
│ annotation_set_id (FK)       │           │ protein_accession (query)         │
│ ontology_snapshot_id (FK)    │           │ go_term_id (FK)                   │
│ query_set_id (FK, nullable)  │           │ distance (cosine/L2)              │
│ limit_per_entry              │           │ ref_protein_accession             │
│ distance_threshold           │           │ qualifier, evidence_code          │
│ created_at                   │           │ ── alignment (NW) ──              │
└──────────────────────────────┘           │ identity_nw, similarity_nw        │
                                           │ alignment_score_nw                │
                                           │ gaps_pct_nw, alignment_length_nw  │
                                           │ ── alignment (SW) ──              │
                                           │ identity_sw, similarity_sw        │
                                           │ alignment_score_sw                │
                                           │ gaps_pct_sw, alignment_length_sw  │
                                           │ ── lengths ──                     │
                                           │ length_query, length_ref          │
                                           │ ── taxonomy ──                    │
                                           │ query_taxonomy_id                 │
                                           │ ref_taxonomy_id                   │
                                           │ taxonomic_lca                     │
                                           │ taxonomic_distance                │
                                           │ taxonomic_common_ancestors        │
                                           │ taxonomic_relation                │
                                           └───────────────────────────────────┘

PredictionSet

Groups all GO predictions for one run of predict_go_terms. References the EmbeddingConfig, AnnotationSet, and OntologySnapshot used. Optionally linked to a QuerySet when predictions were run from a FASTA upload.

GOPrediction

One row per (query protein, GO term, reference protein) triple. The alignment and taxonomy columns are NULL unless compute_alignments=true and/or compute_taxonomy=true were set in the prediction payload (both default to True).

When compute_reranker_features=true (also the default) the legacy re-ranker aggregate columns are populated: vote_count, k_position, go_term_frequency, ref_annotation_density, neighbor_distance_std, neighbor_vote_fraction, neighbor_min_distance, neighbor_mean_distance: eight fields in total.

When compute_v6_features=true (opt-in, default False) the v6_features family is materialised on top: 6 anc2vec columns (anc2vec_neighbor_cos / anc2vec_neighbor_maxcos / anc2vec_has_emb / anc2vec_query_known_cos / anc2vec_query_known_maxcos / anc2vec_query_known_count), 3 tax-voter aggregates (tax_voters_same_frac, tax_voters_close_frac, tax_voters_mean_common_ancestors) and 16 PCA-projected embedding columns (emb_pca_query_0 … emb_pca_query_15): 25 additional fields. PCA state is fit once per EmbeddingConfig and cached on disk (PROTEA_PCA_ARTIFACTS_DIR).

RerankerModel

Stores a trained LightGBM binary (or LambdaRank) re-ranker. References the PredictionSet and EvaluationSet used for training (both SET NULL on delete). Two storage modes coexist:

Inline (legacy). model_data (Text, now nullable) holds the serialized booster string. Rows created before the 2026-04 integration with protea-reranker-lab use this path.
Artifact-backed (preferred). artifact_uri (String(512)) points at a file:// or s3:// URI resolved by the ArtifactStore. Rows inserted via scripts/register_reranker.py always use this path and leave model_data NULL.

Additional provenance columns record the lab run that produced the model:

feature_schema_sha (String(16)): 12-hex-char fingerprint of the feature families the booster was trained on, computed via protea_reranker_lab.contracts.compute_feature_schema_sha. Load-bearing at inference time: predict_go_terms refuses to apply a booster whose feature_schema_sha does not match the live feature set, falling back to KNN ordering rather than scoring with NaN-filled columns.
embedding_config_id / ontology_snapshot_id (FKs, both SET NULL): the embedding recipe and ontology release the booster was trained against.
producer_version (String(64)) / producer_git_sha (String(40)): PROTEA __version__ and HEAD sha at export time, recorded in the dataset manifest and propagated here.
spec_yaml (Text): the full ExperimentSpec YAML used to drive the lab training run, for reproducibility.

ScoringConfig

Defines a named scoring recipe: a set of feature weights and parameters that can be applied to any prediction set. Immutable once created.

Re-ranker datasets ¶

┌──────────────────────────────────┐
│             Dataset              │
│──────────────────────────────────│
│ id (UUID, PK)                    │
│ name (str, unique)               │
│ operation (export_research_…)    │
│ job_id (FK, nullable)            │
│ ── storage layout ──             │
│ storage_backend (local|minio)    │
│ key_prefix                       │
│ train_uri / eval_uri             │
│ manifest_uri                     │
│ ── content fingerprints ──       │
│ schema_sha (16-char)             │
│ manifest_sha (sha256)            │
│ n_train_rows / n_eval_rows       │
│ ── dump parameters ──            │
│ k                                │
│ annotation_source                │
│ embedding_config_id (FK, null)   │
│ ontology_snapshot_id (FK, null)  │
│ train_snapshot_pairs (JSONB)     │
│ eval_snapshot_pair               │
│ ── reproducibility ──            │
│ producer_version                 │
│ producer_git_sha                 │
│ meta (JSONB)                     │
│ created_at                       │
└──────────────────────────────────┘

Dataset: The durable handle for a frozen re-ranker dataset published to the artifact store by export_research_dataset. One row per successful export run; protea-reranker-lab’s pull_dataset.py resolves either id or name against this table to download the exact train.parquet / eval.parquet / manifest.json triple that trained a given booster. schema_sha must equal the booster’s feature_schema_sha at inference (load-bearing); manifest_sha detects content drift across re-emissions.

Evaluation ¶

┌──────────────────────────────┐     1→N    ┌───────────────────────────────────┐
│       EvaluationSet          │──────────▶│        EvaluationResult           │
│──────────────────────────────│           │───────────────────────────────────│
│ id (UUID, PK)                │           │ id (UUID, PK)                     │
│ old_annotation_set_id (FK)   │           │ evaluation_set_id (FK)            │
│ new_annotation_set_id (FK)   │           │ prediction_set_id (FK)            │
│ ontology_snapshot_id (FK)    │           │ scoring_config_id (FK, nullable)  │
│ stats (JSONB)                │           │ reranker_model_id (FK, nullable)  │
│ job_id (FK)                  │           │ results (JSONB)                   │
│ created_at                   │           │ max_distance (Float, nullable)    │
└──────────────────────────────┘           │ job_id (FK)                       │
                                           │ created_at                        │
                                           └───────────────────────────────────┘

EvaluationSet: Stores the CAFA-style temporal holdout delta between two annotation sets (old → new). The stats JSONB column contains NK/LK/PK protein and annotation counts, delta protein count, and per-namespace breakdowns.
EvaluationResult: Stores the output of running cafaeval against a prediction set for a given evaluation set. The results JSONB column contains per-category (NK/LK/PK) per-namespace (BPO/MFO/CCO) Fmax, precision, recall, τ, and coverage. Optionally references a ScoringConfig or RerankerModel when predictions were scored or re-ranked before evaluation.

Support ¶

┌──────────────────────────┐
│      SupportEntry        │
│──────────────────────────│
│ id (UUID, PK)            │
│ comment (Text, nullable) │
│ created_at               │
└──────────────────────────┘

SupportEntry: Community feedback: a thumbs-up with an optional comment (max 500 chars).

Visitor events ¶

┌──────────────────────────────┐
│        VisitorEvent          │
│──────────────────────────────│
│ id (BigInt, PK)              │
│ day (Date, indexed)          │
│ visitor_hash (16-char)       │
│ path                         │
│ method                       │
│ status (int)                 │
│ created_at                   │
└──────────────────────────────┘

VisitorEvent: Append-only log that powers the Grafana “unique visitors” dashboard. The schema deliberately omits IP addresses: visitor_hash is the first 16 hex chars of sha256(daily_salt || client_ip), where daily_salt is a 32-byte random value held in process memory and regenerated every calendar day. Once the day rolls over the salt is gone, so cross-day correlation becomes cryptographically infeasible (the same no-PII model used by Plausible and Fathom). The (day, visitor_hash) index drives unique-visitor counts per day; the path index drives top-page reports.

Job queue ¶

┌────────────────────────────┐    1→N    ┌──────────────────────────┐
│           Job              │──────────▶│         JobEvent         │
│────────────────────────────│           │──────────────────────────│
│ id (UUID, PK)              │           │ id (BigInt, PK)          │
│ operation                  │           │ job_id (FK)              │
│ queue_name                 │           │ event (str)              │
│ status (enum)              │           │ message (str, nullable)  │
│ parent_job_id (FK, null)   │           │ fields (JSONB)           │
│ payload (JSONB)            │           │ level (info/warn/error)  │
│ meta (JSONB)               │           │ ts (timestamp)           │
│ progress_current           │           └──────────────────────────┘
│ progress_total             │
│ error_code                 │
│ error_message              │
│ description (Text, null)   │
│ findings (Text, null)      │
│ tags (ARRAY[Text])         │
│ created_at / started_at /  │
│ finished_at                │
└────────────────────────────┘

Job

Central entity of the job queue. parent_job_id links child batch jobs to their coordinator parent (used in distributed pipelines). progress_current / progress_total track batch completion for progress bars.

The three narrative columns (description, findings, tags) were added in T3.9 / D11 to give every run a place for human context: description carries the pre-execution intent supplied at POST /jobs time, findings is a free-form post-completion note the operator (or a downstream tool) can write back via the existing PATCH path, and tags is a free-form Text[] (default []) for ad-hoc grouping. None of the three are interpreted by BaseWorker; they are operator metadata only.

JobEvent

Append-only audit log. Written by the emit callback during execution. The frontend renders these as a chronological timeline. Events are never updated or deleted.

JobComment

Human-authored note thread attached to a Job (T3.10 / D11). One-to-many: job_id UUID FK to job(id) with ON DELETE CASCADE, author Text nullable, body Text required, created_at Timestamptz default now(). Indexed by (job_id, created_at) for the natural read pattern. Comments are written through POST /jobs/{job_id}/comments and read chronologically through GET /jobs/{job_id}/comments; they complement JobEvent (which is machine-emitted) by giving curators / operators a place for free-form annotations.

Experiment runs ¶

┌──────────────────────────────┐
│       ExperimentRun          │
│──────────────────────────────│
│ id (UUID, PK)                │
│ name (Text, UNIQUE)          │
│ description (Text, null)     │
│ hypothesis (Text, null)      │
│ findings (Text, null)        │
│ status (enum)                │
│ config (JSONB)               │
│ provenance (JSONB)           │
│ tags (ARRAY[Text])           │
│ plm (Text, null)             │
│ k (Integer, null)            │
│ reranker_spec_id (Text, null)│
│ feature_schema_sha (Text)    │
│ eval_set_name (Text, null)   │
│ eval_set_manifest_sha (Text) │
│ propagation (Text, null)     │
│ ensemble_spec (JSONB, null)  │
│ axis_tuple_shortid (Text)    │
│ created_at                   │
│ started_at (null)            │
│ finished_at (null)           │
└──────────────────────────────┘

ExperimentRun

Per-research-run narrative + provenance anchor (T3.8 / Fase 4). The row that F-EXP campaigns and the F8b Experiments page hang their per-run metadata off of: a single ExperimentRun typically aggregates multiple Job / EvaluationResult / RerankerModel rows under one human name. The narrative trio (description / hypothesis / findings) mirrors Job’s D11 columns; config and provenance are JSONB bags suitable for the capture_provenance snapshot from protea.core.provenance. Status enum is planned → running → done (or abandoned); planned rather than queued because experiments often live as drafts before any compute kicks off, and abandoned rather than failed because a research run can be stopped without a hard error.

The nine FARM-EXP.1 axis columns (plm, k, reranker_spec_id, feature_schema_sha, eval_set_name, eval_set_manifest_sha, propagation, ensemble_spec, axis_tuple_shortid) land in migration e1c4a7b2d8f3_farm_exp_1_experiment_run_axis. Together they form the axis tuple that the F-EXP-RESET re-benchmark addresses every cell by. axis_tuple_shortid is the canonical 12-hex digest sha256(canonical_json(axis_tuple))[:12] shared with protea-reranker-lab via protea.core.axis_tuple. A partial-unique index (uq_experiment_run_axis_tuple_shortid) enforces uniqueness only when the shortid is non-NULL, so legacy rows without a shortid can coexist until backfilled.

The status column uses values_callable=lambda e: [m.value for m in e] (FIX-EXP-RUN-ENUM) so SQLAlchemy persists and reads the DB-native lowercase labels ("planned", "running", …) instead of the Python enum names; without this fix all ORM inserts raised InvalidTextRepresentation.

Status enum ¶

Value	Meaning
`queued`	Created, waiting in RabbitMQ
`running`	Worker has claimed the job
`succeeded`	Operation completed successfully
`failed`	Operation raised an exception
`cancelled`	Cancelled via API before or during execution