Data Model¶
All models use SQLAlchemy 2.x declarative style with Mapped[] type annotations.
The schema is managed by Alembic (35 migrations to date).
Protein and sequence deduplication¶
┌──────────────────────────┐ ┌────────────────────────┐
│ Protein │ │ Sequence │
│──────────────────────────│ N→1 │────────────────────────│
│ accession (PK) │───────▶│ id (PK, autoincrement) │
│ canonical_accession │ │ sequence (Text) │
│ is_canonical │ │ sequence_hash (MD5) │
│ isoform_index │ └────────────────────────┘
│ entry_name │
│ reviewed │ ┌──────────────────────────────┐
│ taxonomy_id │ N→1 │ ProteinUniProtMetadata │
│ organism │───────▶│──────────────────────────────│
│ gene_name │ (view) │ canonical_accession (PK) │
│ length │ │ function_cc, ec_number, ... │
│ sequence_id (FK) │ └──────────────────────────────┘
└──────────────────────────┘
- Sequence
Stores unique amino-acid sequences, deduplicated by MD5 hash (
sequence_hash). ManyProteinrows can reference the sameSequence;sequence_idis deliberately non-unique.- Protein
One row per UniProt accession, including isoforms (
<canonical>-<n>). Isoforms share the samecanonical_accessionand are differentiated byis_canonicalandisoform_index. The relationship toProteinUniProtMetadatais view-only (no foreign key), joined bycanonical_accession.- ProteinUniProtMetadata
One row per canonical accession. Stores raw UniProt functional annotations (functional description, EC numbers, pathways, kinetics, etc.) as
Textfields. Isoforms inherit metadata via thecanonical_accessionjoin.
GO ontology¶
┌──────────────────────────┐ 1→N ┌────────────────────────┐
│ OntologySnapshot │──────────▶│ GOTerm │
│──────────────────────────│ │────────────────────────│
│ id (UUID, PK) │ │ id (PK) │
│ obo_url │ │ go_id (e.g. GO:0003674)│
│ obo_version │ │ name │
│ loaded_at │ │ aspect (F/P/C) │
└──────────────────────────┘ │ definition │
│ is_obsolete │
│ ontology_snapshot_id │
└──────────┬─────────────┘
│
┌──────────▼─────────────┐
│ GOTermRelationship │
│────────────────────────│
│ child_go_term_id (FK) │
│ parent_go_term_id (FK) │
│ relation_type │
│ ontology_snapshot_id │
└────────────────────────┘
- OntologySnapshot
One row per loaded OBO file release, versioned by
obo_version(unique constraint). Idempotent load: if a snapshot already exists with its relationships, it is skipped. If relationships are missing they are backfilled automatically.- GOTerm
One row per GO term per snapshot.
aspectis one ofF(molecular function),P(biological process), orC(cellular component).- GOTermRelationship
Directed edge in the GO DAG.
relation_typeis one ofis_a,part_of,regulates,positively_regulates,negatively_regulates. Used byGET /annotations/snapshots/{id}/subgraphfor BFS ancestor traversal.
Annotation sets¶
┌──────────────────────┐ 1→N ┌────────────────────────────────┐
│ AnnotationSet │──────────▶│ ProteinGOAnnotation │
│──────────────────────│ │────────────────────────────────│
│ id (UUID, PK) │ │ id (PK) │
│ source (goa/quickgo) │ │ protein_accession │
│ source_version │ │ go_term_id (FK → GOTerm) │
│ ontology_snapshot_id │ │ annotation_set_id (FK) │
│ job_id │ │ qualifier │
│ created_at │ │ evidence_code │
│ meta (JSONB) │ │ assigned_by │
└──────────────────────┘ │ db_reference │
│ with_from │
│ annotation_date │
└────────────────────────────────┘
- AnnotationSet
Groups a batch of protein GO annotations by source (
goaorquickgo) and ontology snapshot version.- ProteinGOAnnotation
One row per (protein, GO term, annotation set) triple. Stores all GAF/QuickGO evidence fields verbatim.
Embeddings¶
┌──────────────────────────┐ 1→N ┌──────────────────────────────┐
│ EmbeddingConfig │──────────▶│ SequenceEmbedding │
│──────────────────────────│ │──────────────────────────────│
│ id (UUID, PK) │ │ id (PK) │
│ model_name │ │ sequence_id (FK) │
│ model_backend │ │ embedding_config_id (FK) │
│ layer_indices │ │ embedding (VECTOR) │
│ layer_agg │ │ chunk_index_s (int) │
│ pooling │ │ chunk_index_e (int, nullable)│
│ normalize │ └──────────────────────────────┘
│ normalize_residues │
│ max_length │
│ use_chunking │
│ chunk_size │
│ chunk_overlap │
│ description │
│ created_at │
└──────────────────────────┘
- EmbeddingConfig
Defines a reproducible embedding recipe (model, layer selection, pooling strategy, chunking). Referenced by both
SequenceEmbeddingrows andPredictionSetrows to ensure query and reference embeddings are always comparable.Valid values for
model_backendareesm(HuggingFaceEsmModel/ ESM-2),esm3c(ESM SDKESMC),t5(T5EncoderModelfor ProstT5 andprot_t5_xl_uniref50),ankh(T5EncoderModelforElnaggarLab/ankh-base/ankh-large, loaded viaAutoTokenizer, forced tobfloat16on CUDA because Ankh overflows toNaNunder FP16, and tokenised char-by-char withis_split_into_words=Truebecause its SentencePiece vocabulary maps literal spaces to<unk>; the<AA2fold>prefix is never injected), andauto(alias foresm).- SequenceEmbedding
Stores a pgvector VECTOR for one (sequence, config, chunk) triple. When chunking is disabled:
chunk_index_s=0,chunk_index_e=NULL. When chunking is enabled: each chunk is a separate row with its own start/end indices.chunk_index_sandchunk_index_eare amino-acid positions on every backend. The embedding operation strips all special tokens (CLS/BOS/EOS and ProstT5’s<AA2fold>prefix) from the residue tensor before chunking, sochunk_index_s=0always refers to the first amino acid andchunk_index_eequals the amino-acid length for the final chunk of a full-length sequence. This convention was unified on 2026-04-10; embeddings computed before that date used backend-specific slicing and must be recomputed to be directly comparable.Note
KNN search is never performed at the DB layer. Embeddings are loaded into numpy arrays and searched via
protea.core.knn_searchusing numpy or FAISS.
Query sets¶
┌──────────────────────┐ 1→N ┌──────────────────────────────┐
│ QuerySet │──────────▶│ QuerySetEntry │
│──────────────────────│ │──────────────────────────────│
│ id (UUID, PK) │ │ id (PK) │
│ name │ │ query_set_id (FK) │
│ description │ │ accession (original header) │
│ created_at │ │ sequence_id (FK → Sequence) │
└──────────────────────┘ └──────────────────────────────┘
- QuerySet
User-uploaded FASTA dataset for custom prediction queries. Created via
POST /query-sets(multipart upload).- QuerySetEntry
One row per FASTA entry. Preserves the original accession header from the FASTA file and links to the deduplicated
Sequencerow (reuses existing sequences if the amino-acid string is already in the DB).
Predictions¶
┌──────────────────────────────┐ 1→N ┌───────────────────────────────────┐
│ PredictionSet │──────────▶│ GOPrediction │
│──────────────────────────────│ │───────────────────────────────────│
│ id (UUID, PK) │ │ id (PK) │
│ embedding_config_id (FK) │ │ prediction_set_id (FK) │
│ annotation_set_id (FK) │ │ protein_accession (query) │
│ ontology_snapshot_id (FK) │ │ go_term_id (FK) │
│ query_set_id (FK, nullable) │ │ distance (cosine/L2) │
│ limit_per_entry │ │ ref_protein_accession │
│ distance_threshold │ │ qualifier, evidence_code │
│ created_at │ │ ── alignment (NW) ── │
└──────────────────────────────┘ │ identity_nw, similarity_nw │
│ alignment_score_nw │
│ gaps_pct_nw, alignment_length_nw │
│ ── alignment (SW) ── │
│ identity_sw, similarity_sw │
│ alignment_score_sw │
│ gaps_pct_sw, alignment_length_sw │
│ ── lengths ── │
│ length_query, length_ref │
│ ── taxonomy ── │
│ query_taxonomy_id │
│ ref_taxonomy_id │
│ taxonomic_lca │
│ taxonomic_distance │
│ taxonomic_common_ancestors │
│ taxonomic_relation │
└───────────────────────────────────┘
- PredictionSet
Groups all GO predictions for one run of
predict_go_terms. References theEmbeddingConfig,AnnotationSet, andOntologySnapshotused. Optionally linked to aQuerySetwhen predictions were run from a FASTA upload.- GOPrediction
One row per (query protein, GO term, reference protein) triple. The alignment and taxonomy columns are
NULLunlesscompute_alignments=trueand/orcompute_taxonomy=truewere set in the prediction payload (both default toTrue).When
compute_reranker_features=true(also the default) the legacy re-ranker aggregate columns are populated:vote_count,k_position,go_term_frequency,ref_annotation_density,neighbor_distance_std,neighbor_vote_fraction,neighbor_min_distance,neighbor_mean_distance: eight fields in total.When
compute_v6_features=true(opt-in, defaultFalse) thev6_featuresfamily is materialised on top: 6 anc2vec columns (anc2vec_neighbor_cos/anc2vec_neighbor_maxcos/anc2vec_has_emb/anc2vec_query_known_cos/anc2vec_query_known_maxcos/anc2vec_query_known_count), 3 tax-voter aggregates (tax_voters_same_frac,tax_voters_close_frac,tax_voters_mean_common_ancestors) and 16 PCA-projected embedding columns (emb_pca_query_0…emb_pca_query_15): 25 additional fields. PCA state is fit once perEmbeddingConfigand cached on disk (PROTEA_PCA_ARTIFACTS_DIR).- RerankerModel
Stores a trained LightGBM binary (or LambdaRank) re-ranker. References the
PredictionSetandEvaluationSetused for training (bothSET NULLon delete). Two storage modes coexist:Inline (legacy).
model_data(Text, now nullable) holds the serialized booster string. Rows created before the 2026-04 integration withprotea-reranker-labuse this path.Artifact-backed (preferred).
artifact_uri(String(512)) points at afile://ors3://URI resolved by theArtifactStore. Rows inserted viascripts/register_reranker.pyalways use this path and leavemodel_dataNULL.
Additional provenance columns record the lab run that produced the model:
feature_schema_sha(String(16)): 12-hex-char fingerprint of the feature families the booster was trained on, computed viaprotea_reranker_lab.contracts.compute_feature_schema_sha. Load-bearing at inference time:predict_go_termsrefuses to apply a booster whosefeature_schema_shadoes not match the live feature set, falling back to KNN ordering rather than scoring with NaN-filled columns.embedding_config_id/ontology_snapshot_id(FKs, bothSET NULL): the embedding recipe and ontology release the booster was trained against.producer_version(String(64)) /producer_git_sha(String(40)): PROTEA__version__and HEAD sha at export time, recorded in the dataset manifest and propagated here.spec_yaml(Text): the fullExperimentSpecYAML used to drive the lab training run, for reproducibility.
- ScoringConfig
Defines a named scoring recipe: a set of feature weights and parameters that can be applied to any prediction set. Immutable once created.
Re-ranker datasets¶
┌──────────────────────────────────┐
│ Dataset │
│──────────────────────────────────│
│ id (UUID, PK) │
│ name (str, unique) │
│ operation (export_research_…) │
│ job_id (FK, nullable) │
│ ── storage layout ── │
│ storage_backend (local|minio) │
│ key_prefix │
│ train_uri / eval_uri │
│ manifest_uri │
│ ── content fingerprints ── │
│ schema_sha (16-char) │
│ manifest_sha (sha256) │
│ n_train_rows / n_eval_rows │
│ ── dump parameters ── │
│ k │
│ annotation_source │
│ embedding_config_id (FK, null) │
│ ontology_snapshot_id (FK, null) │
│ train_snapshot_pairs (JSONB) │
│ eval_snapshot_pair │
│ ── reproducibility ── │
│ producer_version │
│ producer_git_sha │
│ meta (JSONB) │
│ created_at │
└──────────────────────────────────┘
- Dataset
The durable handle for a frozen re-ranker dataset published to the artifact store by
export_research_dataset. One row per successful export run;protea-reranker-lab’spull_dataset.pyresolves eitheridornameagainst this table to download the exacttrain.parquet/eval.parquet/manifest.jsontriple that trained a given booster.schema_shamust equal the booster’sfeature_schema_shaat inference (load-bearing);manifest_shadetects content drift across re-emissions.
Evaluation¶
┌──────────────────────────────┐ 1→N ┌───────────────────────────────────┐
│ EvaluationSet │──────────▶│ EvaluationResult │
│──────────────────────────────│ │───────────────────────────────────│
│ id (UUID, PK) │ │ id (UUID, PK) │
│ old_annotation_set_id (FK) │ │ evaluation_set_id (FK) │
│ new_annotation_set_id (FK) │ │ prediction_set_id (FK) │
│ ontology_snapshot_id (FK) │ │ scoring_config_id (FK, nullable) │
│ stats (JSONB) │ │ reranker_model_id (FK, nullable) │
│ job_id (FK) │ │ results (JSONB) │
│ created_at │ │ max_distance (Float, nullable) │
└──────────────────────────────┘ │ job_id (FK) │
│ created_at │
└───────────────────────────────────┘
- EvaluationSet
Stores the CAFA-style temporal holdout delta between two annotation sets (old → new). The
statsJSONB column contains NK/LK/PK protein and annotation counts, delta protein count, and per-namespace breakdowns.- EvaluationResult
Stores the output of running
cafaevalagainst a prediction set for a given evaluation set. TheresultsJSONB column contains per-category (NK/LK/PK) per-namespace (BPO/MFO/CCO) Fmax, precision, recall, τ, and coverage. Optionally references aScoringConfigorRerankerModelwhen predictions were scored or re-ranked before evaluation.
Support¶
┌──────────────────────────┐
│ SupportEntry │
│──────────────────────────│
│ id (UUID, PK) │
│ comment (Text, nullable) │
│ created_at │
└──────────────────────────┘
- SupportEntry
Community feedback: a thumbs-up with an optional comment (max 500 chars).
Visitor events¶
┌──────────────────────────────┐
│ VisitorEvent │
│──────────────────────────────│
│ id (BigInt, PK) │
│ day (Date, indexed) │
│ visitor_hash (16-char) │
│ path │
│ method │
│ status (int) │
│ created_at │
└──────────────────────────────┘
- VisitorEvent
Append-only log that powers the Grafana “unique visitors” dashboard. The schema deliberately omits IP addresses:
visitor_hashis the first 16 hex chars ofsha256(daily_salt || client_ip), wheredaily_saltis a 32-byte random value held in process memory and regenerated every calendar day. Once the day rolls over the salt is gone, so cross-day correlation becomes cryptographically infeasible (the same no-PII model used by Plausible and Fathom). The(day, visitor_hash)index drives unique-visitor counts per day; thepathindex drives top-page reports.
Job queue¶
┌────────────────────────────┐ 1→N ┌──────────────────────────┐
│ Job │──────────▶│ JobEvent │
│────────────────────────────│ │──────────────────────────│
│ id (UUID, PK) │ │ id (BigInt, PK) │
│ operation │ │ job_id (FK) │
│ queue_name │ │ event (str) │
│ status (enum) │ │ message (str, nullable) │
│ parent_job_id (FK, null) │ │ fields (JSONB) │
│ payload (JSONB) │ │ level (info/warn/error) │
│ meta (JSONB) │ │ ts (timestamp) │
│ progress_current │ └──────────────────────────┘
│ progress_total │
│ error_code │
│ error_message │
│ description (Text, null) │
│ findings (Text, null) │
│ tags (ARRAY[Text]) │
│ created_at / started_at / │
│ finished_at │
└────────────────────────────┘
- Job
Central entity of the job queue.
parent_job_idlinks child batch jobs to their coordinator parent (used in distributed pipelines).progress_current/progress_totaltrack batch completion for progress bars.The three narrative columns (
description,findings,tags) were added in T3.9 / D11 to give every run a place for human context:descriptioncarries the pre-execution intent supplied atPOST /jobstime,findingsis a free-form post-completion note the operator (or a downstream tool) can write back via the existing PATCH path, andtagsis a free-formText[](default[]) for ad-hoc grouping. None of the three are interpreted byBaseWorker; they are operator metadata only.- JobEvent
Append-only audit log. Written by the
emitcallback during execution. The frontend renders these as a chronological timeline. Events are never updated or deleted.- JobComment
Human-authored note thread attached to a
Job(T3.10 / D11). One-to-many:job_idUUID FK tojob(id)withON DELETE CASCADE,authorText nullable,bodyText required,created_atTimestamptz defaultnow(). Indexed by(job_id, created_at)for the natural read pattern. Comments are written throughPOST /jobs/{job_id}/commentsand read chronologically throughGET /jobs/{job_id}/comments; they complementJobEvent(which is machine-emitted) by giving curators / operators a place for free-form annotations.
Experiment runs¶
┌──────────────────────────────┐
│ ExperimentRun │
│──────────────────────────────│
│ id (UUID, PK) │
│ name (Text, UNIQUE) │
│ description (Text, null) │
│ hypothesis (Text, null) │
│ findings (Text, null) │
│ status (enum) │
│ config (JSONB) │
│ provenance (JSONB) │
│ tags (ARRAY[Text]) │
│ plm (Text, null) │
│ k (Integer, null) │
│ reranker_spec_id (Text, null)│
│ feature_schema_sha (Text) │
│ eval_set_name (Text, null) │
│ eval_set_manifest_sha (Text) │
│ propagation (Text, null) │
│ ensemble_spec (JSONB, null) │
│ axis_tuple_shortid (Text) │
│ created_at │
│ started_at (null) │
│ finished_at (null) │
└──────────────────────────────┘
- ExperimentRun
Per-research-run narrative + provenance anchor (T3.8 / Fase 4). The row that F-EXP campaigns and the F8b Experiments page hang their per-run metadata off of: a single
ExperimentRuntypically aggregates multipleJob/EvaluationResult/RerankerModelrows under one humanname. The narrative trio (description/hypothesis/findings) mirrorsJob’s D11 columns;configandprovenanceare JSONB bags suitable for thecapture_provenancesnapshot fromprotea.core.provenance. Status enum isplanned→running→done(orabandoned); planned rather than queued because experiments often live as drafts before any compute kicks off, andabandonedrather thanfailedbecause a research run can be stopped without a hard error.The nine FARM-EXP.1 axis columns (
plm,k,reranker_spec_id,feature_schema_sha,eval_set_name,eval_set_manifest_sha,propagation,ensemble_spec,axis_tuple_shortid) land in migratione1c4a7b2d8f3_farm_exp_1_experiment_run_axis. Together they form the axis tuple that the F-EXP-RESET re-benchmark addresses every cell by.axis_tuple_shortidis the canonical 12-hex digestsha256(canonical_json(axis_tuple))[:12]shared withprotea-reranker-labviaprotea.core.axis_tuple. A partial-unique index (uq_experiment_run_axis_tuple_shortid) enforces uniqueness only when the shortid is non-NULL, so legacy rows without a shortid can coexist until backfilled.The
statuscolumn usesvalues_callable=lambda e: [m.value for m in e](FIX-EXP-RUN-ENUM) so SQLAlchemy persists and reads the DB-native lowercase labels ("planned","running", …) instead of the Python enum names; without this fix all ORM inserts raisedInvalidTextRepresentation.
Status enum¶
Value |
Meaning |
|---|---|
|
Created, waiting in RabbitMQ |
|
Worker has claimed the job |
|
Operation completed successfully |
|
Operation raised an exception |
|
Cancelled via API before or during execution |
See also
Operations: every operation lists the tables it touches.
Infrastructure: the SQLAlchemy
Mapped[]classes behind every table on this page.ADR-006: Sequence deduplication by MD5: why the
Sequence↔Proteinsplit exists.ADR-001: KNN on CPU, not pgvector or GPU: why
SequenceEmbeddinguses pgvector for storage but not for search.