Quality Engineering¶
PROTEA’s quality strategy is multi-layered. Hard gates (CI workflows that must be green before a PR can merge) sit alongside soft gates (the smell budget, which ratchets complexity without blocking emergency fixes). Both are complemented by structural patterns (refactoring slices, plugin architecture) that keep the codebase intrinsically testable, and by reproducibility invariants (SHA fingerprints, Alembic migrations, statistical regression benchmarks) that make every experiment result traceable to the exact code and data that produced it.
The inventory below covers every practice actually implemented in the stack. Items are grouped by concern. Each entry carries a one- or two-sentence description and a link to the canonical source: a file path, an ADR number, a CI workflow name, or a slice identifier from the plan store.
Testing layers¶
Unit tests¶
126 test modules and more than 2 200 test functions live under tests/.
Every subpackage has a corresponding test file; the suite runs with
poetry run pytest and requires no external services (database or broker
calls are mocked).
tests/
test_core.py, test_api.py, test_base_worker.py, ...
(126 test modules total)
Integration tests¶
Tests that need a real PostgreSQL instance are activated via the
with-postgres pytest flag. The conftest.py session fixture spins up a
pgvector/pgvector:pg16 Docker container, enables the vector
extension, and tears it down after the session. Key integration modules:
tests/test_infrastructure.py (session, engine, app factory)
tests/test_pair_feature_perf_equivalence.py (parallelised vs serial parity for the export pipeline, PR #421)
tests/test_predict_go_terms.py and test_predict_go_terms_coverage.py
tests/test_auth_session_revocation.py
tests/test_predict_go_terms_from_interpro_pg.py
tests/test_datasets_and_reranker_import_smoke.py
The integration workflow (.github/workflows/integration.yml) runs
pytest with the Postgres flag enabled on every PR.
End-to-end tests (Playwright)¶
Critical user flows are covered by Playwright specs in
apps/web/e2e/flows/. Flows include landing, jobs, annotate submission,
reranker, scoring, stack, maintenance, and support pages. Specs use
fully hermetic per-test API mocks (e2e/flows/fixtures/mock-api.ts)
so no live backend is required. CI workflow: .github/workflows/playwright.yml.
Property tests (F6.2)¶
Hypothesis-based tests in
tests/property/ exercise algorithmic invariants (output ranges,
ancestor max-monotonicity, sort order, parquet roundtrip) over
randomised inputs. Three modules are currently covered:
tests/property/test_scoring_property.py
tests/property/test_contract_payloads_property.py
tests/property/test_parquet_roundtrip_property.py
A fixed seed (derandomize=True) in tests/conftest.py makes runs
reproducible across CI workers.
Mutation tests (F6.3)¶
Cosmic Ray rewrites
protea/core/scoring.py (and any PR-touched module in
protea/core/) one AST operator at a time and checks whether the
Hypothesis kill criterion catches each mutant. Configuration:
cosmic-ray.toml. Reference baseline for scoring.py:
227 mutants, 76.21 % mutation score (83.17 % effective after 19
type-hint equivalent mutants are excluded). Full description:
Mutation Testing.
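The two percentages can be checked with a few lines of arithmetic. This is a sketch, not part of the Cosmic Ray tooling: the killed-mutant count is not stated in the text and is inferred here from the quoted 76.21 %, and the "effective" score is assumed to be the same killed count divided by the non-equivalent mutants.

```python
# Counts taken from the text above.
total_mutants = 227
equivalent_mutants = 19  # type-hint mutants that cannot change behaviour

# Assumption: the killed count is inferred from the quoted 76.21 %.
killed = round(0.7621 * total_mutants)

raw_score = killed / total_mutants
# Assumption: "effective" excludes equivalent mutants from the denominator.
effective_score = killed / (total_mutants - equivalent_mutants)

print(f"killed={killed} raw={raw_score:.2%} effective={effective_score:.2%}")
```

Under these assumptions the two quoted figures are consistent with each other.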
Boundary validation (T1.8)¶
tests/test_parquet_export_boundary.py and
tests/test_contracts_invariants.py enforce the
producer-consumer invariant on the canonical feature column set: every
column declared in ALL_FEATURES must have an unconditional producer
wired to the CanonicalFeatureRegistry, and the parquet exporter must
emit exactly that column set. Adding a column without a producer crashes
the export pipeline at the T1.8 invariant check before the job reaches
compute-intensive stages.
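A minimal sketch of how such a producer-consumer check could look. The feature names, the `FeatureRegistry` class, and `check_boundary` are hypothetical stand-ins for illustration; the real `ALL_FEATURES` and `CanonicalFeatureRegistry` live in the PROTEA codebase and may be shaped differently.

```python
# Hypothetical column declaration; the real ALL_FEATURES is defined in PROTEA.
ALL_FEATURES = ("identity", "evalue_log", "taxonomic_distance")


class FeatureRegistry:
    """Illustrative registry mapping a column name to its producer callable."""

    def __init__(self):
        self._producers = {}

    def register(self, name, producer):
        self._producers[name] = producer

    def missing_producers(self, declared):
        return [name for name in declared if name not in self._producers]


def check_boundary(declared, registry, exported_columns):
    """Fail fast if a declared column has no producer or the export drifts."""
    missing = registry.missing_producers(declared)
    if missing:
        raise RuntimeError(f"declared columns without a producer: {missing}")
    if set(exported_columns) != set(declared):
        raise RuntimeError("exported column set differs from declared set")
```

Running a check of this kind before any heavy work starts is what lets a missing producer surface as an immediate failure rather than as a malformed parquet file later.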
Cross-repo invariant tests (T1.7)¶
tests/test_contracts_invariants.py additionally validates that the
protea-contracts ABC surface (Operation protocol, payload
dataclasses) is satisfied by all registered operations in
build_operation_registry. This prevents plugin repos
(protea-runners, protea-backends) from shipping operations that
silently break the consumer interface.
Smoke tests¶
scripts/smoke.sh probes a running stack ($PROTEA_API_URL,
default http://127.0.0.1:8000): health endpoint, ping job dispatch,
and response validation. Runs in the Deploy E2E workflow after the
compose stack is brought up.
Reproducibility and statistical regression¶
scripts/bootstrap_fmax_ci.py runs a paired bootstrap (N = 10 000
iterations) comparing a champion prediction set against the KNN baseline.
Slice LB.3 established that six of six NK+LK confidence intervals are
strictly positive at 95 % on bench-v1-K5-v226-lineage, providing a
publishable statistical claim for Chapter 6. The multi-seed binary
classifier recipe (3 independent seeds) produced NK+LK cafaeval 0.7291 +/- 0.0028,
confirming the benchmark is not a one-seed artefact.
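The idea behind a paired bootstrap can be sketched as follows. This is a simplified illustration, not the contents of `scripts/bootstrap_fmax_ci.py`: it resamples per-item score differences and takes a percentile interval of their mean, whereas a real Fmax bootstrap would presumably resample proteins and recompute the metric on each resample. The function name and parameters are assumptions.

```python
import random


def paired_bootstrap_ci(champion, baseline, n_iter=10_000, alpha=0.05, seed=0):
    """Percentile CI for the mean paired difference (champion - baseline)."""
    if len(champion) != len(baseline):
        raise ValueError("paired samples must have equal length")
    diffs = [c - b for c, b in zip(champion, baseline)]
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(diffs)
    means = []
    for _ in range(n_iter):
        # Resample the paired differences with replacement.
        sample = rng.choices(diffs, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_iter)]
    upper = means[int((1 - alpha / 2) * n_iter) - 1]
    return lower, upper
```

An interval whose lower bound is above zero corresponds to the "strictly positive at 95 %" wording used above.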
Code-smell budget¶
scripts/check_smells.py enforces four structural thresholds:
| Threshold | Limit |
|---|---|
| File LOC | 800 |
| Class LOC | 500 |
| Method LOC | 60 |
| Parameter count | 6 |
Enforcement uses a ratchet model: a baseline JSON records existing
offenders at their current size. A run fails only if a new offender
appears or an existing one grows larger. Removing an offender
silently shrinks the baseline when the write-baseline flag is passed. The CI lint job
runs the script as part of the make lint step; the gate name in CI
is “smell budget OK”. The same script is distributed to all eight stack
repos for cross-repo consistency.
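The ratchet rule can be sketched in a few lines. This is an illustration of the rule as described, not the code in `scripts/check_smells.py`; the function name and the shape of the baseline (a mapping from offender name to measured size) are assumptions.

```python
def ratchet_violations(baseline, current, limit):
    """Return messages for new offenders and for offenders that grew.

    baseline and current map an offender name to its measured size.
    """
    violations = []
    for name, size in current.items():
        if size <= limit:
            continue  # within budget, never an offender
        if name not in baseline:
            violations.append(f"new offender: {name} ({size} > {limit})")
        elif size > baseline[name]:
            violations.append(f"grew: {name} ({baseline[name]} -> {size})")
    return violations
```

An existing offender that stays the same size or shrinks produces no violation, which is what lets legacy code remain in place while new code is held to the limit.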
Lint and static analysis¶
Python¶
ruff (
>=0.15.5): fast linting and auto-fix, configured in[tool.ruff]inpyproject.toml. Runs as a pre-commit hook and in the lint workflow (.github/workflows/lint.yml).ruff format: canonical formatter replacing black.
mypy (
>=1.19.1): type checking, configured in[tool.mypy]inpyproject.toml. Strict on new modules. Runs in the lint workflow.bandit: security static analysis at severity level
highand confidence levelhighonprotea/. Configuration in[tool.bandit]inpyproject.toml. Promoted to a blocking check (T5.7) in.github/workflows/security.yml.pip-audit: dependency vulnerability scanning. Also a blocking check (T5.7) in the security workflow.
Frontend¶
ESLint (
apps/web/eslint.config.mjs): Next.js / TypeScript rules on theapps/web/application.tsc (TypeScript compiler): type checking runs as part of
npm run buildin the Playwright and docs CI jobs.
Dataset naming convention (FARM-EXP.6)¶
The reranker-token-lint workflow (.github/workflows/reranker-token-lint.yml)
delegates to the shared linter in
frapercan/agent-farm/.github/workflows/reranker-token-lint.yml.
It blocks PRs that introduce shorthand reranker-recipe version tokens
in publishable prose under docs/ and README.md. Only the
nine canonical GOA snapshot tokens are allowed. CHANGELOG.md is
structurally excluded (per ADR D36 and FARM-EXP.6 decision).
OpenAPI spec drift¶
.github/workflows/openapi-drift.yml generates a fresh
docs/openapi.json from the live FastAPI app and diffs it against the
committed copy. A non-zero diff fails the check “docs/openapi.json
matches code”, preventing the API spec from silently diverging from the
implementation.
Pre-commit hook bundle¶
The project ships a .pre-commit-config.yaml that installs the
following hooks:
ruff and ruff-format: lint and format on every staged Python file.
trailing-whitespace, end-of-file-fixer, check-yaml, check-toml, check-added-large-files (500 KB cap), debug-statements, check-merge-conflict: standard hygiene guards from
pre-commit/pre-commit-hooks.
Additional project-specific guards are enforced at multiple levels.
Em-dash guard¶
scripts/check_no_em_dashes.py greps for em-dashes (both the ASCII
double-hyphen form and the Unicode character) after stripping
backtick-quoted spans. The hook is stricter than a simple string search:
it fires even when the pattern appears inside fenced code blocks, so CLI
flag documentation must use inline backticks rather than fenced bash
blocks. RST section underlines on this page use =, ~, and ^
throughout for that reason.
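A minimal sketch of a guard with the behaviour described above. It is not the code in `scripts/check_no_em_dashes.py`: the regular expressions and the function name are assumptions, and only single- and double-backtick inline spans are stripped.

```python
import re

# Inline literals: RST double-backtick spans first, then single-backtick spans.
INLINE_CODE = re.compile(r"``[^`\n]*``|`[^`\n]*`")
# ASCII double hyphen or the Unicode em-dash (written as an escape here).
EM_DASH = re.compile("--|\u2014")


def find_em_dashes(text):
    """Return 1-based line numbers that still contain a dash pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = INLINE_CODE.sub("", line)
        if EM_DASH.search(stripped):
            hits.append(lineno)
    return hits
```

Because the sketch works line by line and ignores fences, a flag written inside a fenced block is reported while the same flag in inline backticks is not. A heading underline made of hyphens would also match.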
Stash audit¶
.github/workflows/stash-audit.yml runs git stash list on the
checked-out PR branch and fails if any stash entries exist. This
enforces the project-wide policy of using git restore plus commits
on a WIP branch instead of stash (eight violations tracked across three
sessions before the CI gate was added).
CI gates (GitHub Actions)¶
Required checks on develop-targeted PRs:
| Workflow file | What it checks |
|---|---|
| | pytest unit suite (Python 3.12, no DB required) |
| | pytest with Postgres flag (Docker Postgres container) |
| | ruff, ruff-format, mypy, check_smells.py |
| | pip-audit (blocking) + bandit (blocking) |
| | Sphinx HTML build, zero warnings policy |
| | |
| | Playwright critical user flows |
| | Build Dockerfile + compose smoke |
| | Branch protection canary: fires when deploy paths are NOT changed, reports the same check name so the required check is always present |
| | Cosmic Ray on PR-touched core modules (informational, not blocking) |
| | Dataset naming convention in docs and README prose |
| | No git stash entries in the checked-out branch |
| | Enables squash auto-merge once all required checks pass on non-draft PRs targeting develop |
Refactoring patterns applied¶
The following slices applied named refactoring patterns to reduce complexity and improve testability. Each is traceable to an ADR or plan slice.
- T-CONTEXTS (Introduce Parameter Object)
The large _KnnTransferRunner dict-builder was refactored to accept a single parameter object instead of 14 positional arguments (protea/core/_knn_transfer_runner.py, referenced in docs/source/reference/core.rst).
- T2B.1 (FeatureRegistry / Strategy pattern)
CanonicalFeatureRegistry (protea/core/features/registry.py) registers named feature-compute callables as strategies. Both the parquet exporter (T2B.2) and the batch predictor (T2B.3) drive feature generation through the registry singleton, decoupling feature definition from call sites.
- T2B.3 (decompose _predict_batch)
The monolithic batch prediction path was decomposed into collaborating objects, with feature generation driven through CanonicalFeatureRegistry rather than inline dispatch.
- T2B.4 (RerankerScorer as a composed class)
The re-ranker scoring logic was extracted as RerankerScorer (protea/core/operations/predict_go_terms/_reranker_scorer.py), composed into PredictGOTermsBatchOperation rather than inherited as a mixin. This makes the scorer independently testable and swappable. See ADR-D34: Selective rerank resurrection, recompute not archaeology.
- T2B.5 (Method Object for 300+ LOC methods)
Long methods in the KNN transfer runner were extracted into _KnnTransferRunner (protea/core/_knn_transfer_runner.py) and supporting phase modules (protea/core/_anc2vec_phases.py, protea/core/_leaf_record_builder.py). See ADR-D31: T2B.5 Method Object reframe.
- T2B.6 (module split)
predict_go_terms and training_dump_helpers were split into focused sub-modules: protea/core/operations/predict_go_terms/ (package with _batch_op.py, _reranker_scorer.py, _batch_op_reranker.py, _common.py) and protea/core/training_dump_helpers.py.
- Plugin entry_points architecture
All eight stack repos (protea-sources, protea-runners, protea-backends, protea-contracts, etc.) expose operations through Python [project.entry-points] declarations. build_operation_registry() (protea/core/operation_catalog.py) resolves them at startup. This hexagonal-style boundary keeps the core independent of any specific annotation source, embedding backend, or runner implementation.
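The Introduce Parameter Object refactoring named in T-CONTEXTS can be illustrated with a small sketch. The class, its fields, and `build_record` are hypothetical and chosen only for illustration; they are not the actual signature used by `_KnnTransferRunner`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnnTransferContext:
    """Hypothetical parameter object; the field names are illustrative only."""

    query_id: str
    k: int = 5
    distance_threshold: float = 1.0
    include_ancestors: bool = True


def build_record(ctx: KnnTransferContext) -> dict:
    # One typed argument replaces a long positional list, so call sites
    # cannot silently swap two arguments of the same type.
    return {
        "query": ctx.query_id,
        "k": ctx.k,
        "threshold": ctx.distance_threshold,
        "ancestors": ctx.include_ancestors,
    }
```

A frozen dataclass also gives tests a single object to construct and vary, which is part of why this refactoring tends to improve testability.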
Reproducibility guardrails¶
SHA fingerprinting¶
Every RerankerModel row stores a feature_schema_sha: a
content-addressable fingerprint of the feature family and column order
used when the booster was trained. The inference path in
protea/core/reranker.py refuses to score with a booster whose schema
SHA does not match the live pipeline, preventing silent feature drift.
Similarly, Dataset rows carry schema_sha and manifest_sha.
These provenance fields are surfaced in the UI via the ProvenancePanel
component on the reranker page (apps/web/app/[locale]/reranker/page.tsx).
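A sketch of how a schema fingerprint and the matching guard could work. The hashing recipe (SHA-256 over a compact JSON document), the function names, and the exception type are assumptions; the text above does not specify how `feature_schema_sha` is actually computed.

```python
import hashlib
import json


def feature_schema_sha(family, columns):
    """Fingerprint a feature family plus its ordered column list."""
    # Column order is part of the payload, so reordering changes the hash.
    payload = json.dumps(
        {"family": family, "columns": list(columns)}, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SchemaMismatchError(RuntimeError):
    """Raised when a model was trained on a different feature schema."""


def assert_schema_matches(model_sha, family, live_columns):
    live_sha = feature_schema_sha(family, live_columns)
    if live_sha != model_sha:
        raise SchemaMismatchError(
            f"model schema {model_sha[:12]} != live schema {live_sha[:12]}"
        )
```

Refusing to score on a mismatch turns silent feature drift into an explicit, early failure.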
Dataset naming convention¶
The canonical naming form (ADR D36) is
bench-v1-K{K}-v{band}-lineage-{plm}, where K is the KNN
neighbourhood size, band is the GOA temporal snapshot, and plm
is the protein language model identifier. The reranker-token-lint
workflow enforces this pattern in prose; the field is stored in the
Dataset.name column with a UNIQUE constraint.
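The naming form can be checked with a regular expression along these lines. The pattern is an assumption built from the template above: it treats K and band as digit runs and allows letters, digits, dots, underscores, and hyphens in the PLM identifier. The PLM name in the usage example is illustrative, not one of the project's canonical identifiers.

```python
import re

# Assumed pattern derived from bench-v1-K{K}-v{band}-lineage-{plm}.
DATASET_NAME = re.compile(
    r"^bench-v1-K(?P<k>\d+)-v(?P<band>\d+)-lineage-(?P<plm>[A-Za-z0-9_.-]+)$"
)


def parse_dataset_name(name):
    """Return the name's components, or None if it is not canonical."""
    match = DATASET_NAME.match(name)
    if match is None:
        return None
    return {"k": int(match["k"]), "band": match["band"], "plm": match["plm"]}
```

For example, `parse_dataset_name("bench-v1-K5-v226-lineage-esm2")` yields K = 5, band "226", and PLM "esm2", while a name without the PLM suffix is rejected by this sketch.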
Alembic versioned migrations¶
The alembic/versions/ directory contains 56 migration scripts that
are the authoritative source of truth for the database schema. Schema
changes that affect parquet column layout increment the
dataset_schema_sha version (T1.6 migration). The
alembic upgrade head step is required before any operation that reads
or writes ORM models, and is part of the disaster recovery runbook
(Disaster Recovery).
Operational guardrails¶
Pre-warm and serve-stale-on-error¶
Slow aggregate endpoints (/v1/proteins/stats) use a background
cache with prewarm_all (protea/api/routers/proteins_stats.py)
and serve_stale_on_error=True. On a transient DB error or a cold
miss, the last-known value is returned rather than blocking the UI.
This prevents a 30-second DISTINCT-over-JOIN query from timing out
ngrok connections on cache cold-starts.
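The serve-stale behaviour can be sketched as a small cache wrapper. This is a synchronous simplification under stated assumptions: the `StaleCache` class and its parameters are hypothetical, and the real implementation refreshes in the background rather than inside `get`.

```python
import time


class StaleCache:
    """Synchronous sketch of a cache that prefers stale data over an error."""

    def __init__(self, loader, ttl_seconds, serve_stale_on_error=True,
                 clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._serve_stale = serve_stale_on_error
        self._clock = clock
        self._value = None
        self._loaded_at = None  # None means nothing has been cached yet

    def _refresh(self):
        self._value = self._loader()
        self._loaded_at = self._clock()

    def prewarm(self):
        """Populate the cache eagerly, for example during startup."""
        self._refresh()

    def get(self):
        has_value = self._loaded_at is not None
        if has_value and self._clock() - self._loaded_at < self._ttl:
            return self._value
        try:
            self._refresh()
        except Exception:
            # Fall back to the last-known value instead of failing the request.
            if self._serve_stale and has_value:
                return self._value
            raise
        return self._value
```

In this sketch, pre-warming is what guarantees that a last-known value exists by the time the first request arrives.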
SIGTERM handler and force-fail gating¶
protea/workers/shutdown.py registers a SIGTERM handler on worker
startup. When the deploy-keeper sends SIGTERM during a redeploy, the
handler waits for the current job to finish (up to a configurable grace
period), then calls force_fail to mark any job still in flight as FAILED,
and exits with code 143 (128 + 15, the conventional exit status for SIGTERM). This prevents orphaned RUNNING rows in the
Job table.
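A sketch of the shutdown sequence, assuming a worker loop that cooperates with the handler. The `GracefulShutdown` class and its method names are hypothetical and do not mirror `protea/workers/shutdown.py`; in particular, this sketch only sets a flag inside the signal handler and performs the wait in a separate `drain` step.

```python
import signal
import threading


class GracefulShutdown:
    """Hypothetical helper: wait for the job, then force-fail, then exit."""

    EXIT_CODE = 128 + signal.SIGTERM  # 143

    def __init__(self, grace_seconds, force_fail):
        self.grace_seconds = grace_seconds
        self.force_fail = force_fail  # callback that marks the job FAILED
        self.stop_requested = threading.Event()
        self._job_idle = threading.Event()
        self._job_idle.set()

    def install(self):
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Keep the handler tiny: record the request and return.
        self.stop_requested.set()

    def job_started(self):
        self._job_idle.clear()

    def job_finished(self):
        self._job_idle.set()

    def drain(self):
        """Wait up to the grace period, force-fail if needed, return the code."""
        if not self._job_idle.wait(timeout=self.grace_seconds):
            self.force_fail()
        return self.EXIT_CODE
```

A worker would call `install()` at startup and, once `stop_requested` is set, pass the result of `drain()` to `sys.exit`.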
Stale job reaper¶
protea/workers/stale_job_reaper.py periodically queries for Job
rows that have been in RUNNING state beyond their expected
wall-clock budget and transitions them to FAILED. This is the last
line of defence against jobs whose workers died without sending a final
state transition. Runbook: Stale Job Reaper.
Database and ORM patterns¶
- 3NF schema
The ORM models (protea/infrastructure/orm/models/) are in third normal form. Foreign keys, UNIQUE constraints, and partial indexes are declared in model metadata and materialised via Alembic migrations.
- ENUM case consistency
A pre-existing bug (ExperimentRun status stored as lowercase in Postgres but the ORM Enum sending uppercase names) was caught and fixed in the FIX-EXP-RUN-ENUM slice. A similar latent shape was identified in Job.status (protea/infrastructure/orm/models/job.py) and flagged for diagnosis before any ORM-level fix.
- Idempotent UNIQUE constraints
The EvaluationSet UNIQUE constraint (T-INFRA.EVAL-SET-UNIQUE) prevents duplicate ground-truth snapshots from being inserted by concurrent evaluation jobs. INSERT paths use ON CONFLICT DO NOTHING or manager-level deduplication.
- Producer-consumer invariant (T1.8)
Described in the Testing section above. The invariant is also documented in tests/test_contracts_invariants.py with an explicit assertion message referencing the memory note project_canonical_feature_producer_consumer.
- Two-session worker pattern
BaseWorker.handle_job uses two separate SQLAlchemy sessions: one to claim the job (QUEUED -> RUNNING), and a second to run it and record the outcome. This prevents a crashed execute session from rolling back the claim transition. See ADR-002: Two-session worker pattern.
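The idempotent-insert idea can be demonstrated with the standard library. Note the substitution: this sketch uses an in-memory SQLite database (which accepts the same `ON CONFLICT ... DO NOTHING` clause in SQLite 3.24 and later) rather than PostgreSQL, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evaluation_set ("
    " id INTEGER PRIMARY KEY,"
    " snapshot_key TEXT NOT NULL UNIQUE)"  # hypothetical natural key
)


def insert_snapshot(connection, key):
    """Insert once; a repeated key is silently skipped. Returns True if new."""
    cursor = connection.execute(
        "INSERT INTO evaluation_set (snapshot_key) VALUES (?) "
        "ON CONFLICT (snapshot_key) DO NOTHING",
        (key,),
    )
    return cursor.rowcount == 1
```

Two jobs racing to insert the same snapshot both complete without an error, and exactly one row exists afterwards.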
Architectural patterns¶
- Repository pattern
ORM model classes in protea/infrastructure/orm/models/ and corresponding manager/query helpers implement the repository pattern: domain logic calls typed query helpers rather than constructing raw SQL.
- Operation registry
build_operation_registry() in protea/core/operation_catalog.py is the single source of truth for which operations are available. Every POST /v1/jobs dispatch resolves the operation field through this registry. Ad-hoc endpoint calls outside the registry are a hard constraint violation per the project’s hard-constraints document (CLAUDE.md in the repo root).
- Event-driven architecture
protea.jobs, protea.embeddings, protea.predictions, protea.training, and protea.evaluations queues are powered by RabbitMQ and Pika consumers. Messages are durable, survive broker restarts, and support per-queue scaling via manage.sh scale.
- Lifespan-managed services
protea/api/app.py uses FastAPI’s lifespan handler (_build_lifespan) to open and close the session factory, initialise the operation registry, and call prewarm_all during startup. This ensures resources are released cleanly on shutdown and avoids global state initialised at import time.
- Plugin architecture via Python entry_points
The eight stack repos expose their operations through [project.entry-points] declarations. The PROTEA core does not import plugin code directly; it discovers it at runtime through importlib.metadata.
Security¶
- JWT + API key auth
protea/api/bearer.py implements JWT bearer authentication. assert_bearer_config() (called from create_app) aborts startup if JWT_SECRET is missing, preventing silent anonymous access. API key snapshots store the role field so that the inference gate correctly applies viewer / annotator / administrator policies. A one-line bug where role was dropped from the snapshot (PR #504) was caught and fixed before any annotator credentials were issued in production. See ADR-D6: Authentication strategy.
- bandit + pip-audit
Both run as blocking checks in the security workflow (.github/workflows/security.yml, T5.7). bandit targets severity level high and confidence level high; pip-audit scans the production dependency closure.
- md5 with usedforsecurity=False
Hash calls that use md5 for caching (not cryptographic) purposes pass usedforsecurity=False (protea/core/reranker.py, protea/core/_pair_feature_compute.py). This satisfies bandit’s B324 rule without a # noqa suppression.
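A short illustration of the `usedforsecurity=False` form, available in `hashlib` since Python 3.9. The `cache_key` helper is hypothetical and not taken from the PROTEA sources.

```python
import hashlib


def cache_key(payload: bytes) -> str:
    """Derive a cache key; md5 is used for speed here, not for security."""
    # The flag states the non-cryptographic intent explicitly.
    return hashlib.md5(payload, usedforsecurity=False).hexdigest()
```

The digest is identical to a plain `md5` call; only the declared intent changes.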
Coverage gates¶
pytest-cov is wired into the unit-test workflow
(.github/workflows/test.yml). Coverage is computed against
protea/ and emitted as coverage.xml for the GitHub Actions
summary. The current target is 80 % branch coverage for
protea/core and protea/api (the domain-critical layers); the
overall repo target is 70 % statement coverage. Coverage
regressions on a PR diff are surfaced as a comment by the
coverage-comment action but are advisory: a drop alone does not
block the merge. The hard requirement is the underlying test suite
passing; coverage is the visible side-channel that prevents silent
test-coverage erosion.
Mutation score (cosmic-ray) is a complementary signal: it answers
“are my tests actually killing bugs?” rather than “did my tests
execute the line?”. The 76 % / 83 % effective baseline on
protea/core/scoring.py (see Mutation Testing) is the
quality bar; the workflow runs on every PR that touches
protea/core/ modules but is informational, never blocking.
Type checking with mypy¶
The lint.yml workflow runs mypy --config-file pyproject.toml
against the full protea/ package. Strict mode is enabled
module by module ([[tool.mypy.overrides]] in pyproject.toml);
new modules are added in strict mode by default. SQLAlchemy 2.0
mapped types are typed via Mapped[...] annotations across the ORM
layer, so the static analysis catches refactors that would otherwise
fail only at runtime when a relationship name changes.
Third-party libraries without published stubs are stubbed in
stubs/ (project-local) or marked ignore_missing_imports per
module. The escape hatch is logged in pyproject.toml rather than
sprinkled as # type: ignore comments in source.
Schema migration testing¶
Alembic migrations under alembic/versions/ are reversible by
contract. Every migration must define both upgrade() and
downgrade(), paired with an op.create_index / op.drop_index
or op.add_column / op.drop_column mirror. The integration
workflow exercises this by performing a full alembic upgrade head
on a fresh Postgres container at the start of the run; selected PRs
that touch schema additionally run alembic downgrade -1 followed
by alembic upgrade head to verify the round-trip.
The unique partial index added for the job.dedup_key column
(F-OPS-JOBS.1a, migration a7b3c8d2e1f4) is a recent example of a
migration that ships a round-trip test in the same PR.
Branch protection and auto-merge policy¶
The develop branch is protected on GitHub with the following
configuration:
Required status checks: every workflow listed in the CI gates table above must pass before a PR can be merged. The exact list is enforced by GitHub branch-protection rules and is duplicated in .github/branch-protection.json for audit.
Required approvals: PRs must have at least one approving review.
Linear history: merge commits are disabled; the merge button squashes the PR commits into a single commit on develop.
Dismiss stale approvals: any push after an approval invalidates it.
The auto-merge workflow (.github/workflows/auto-merge.yml) enables
GitHub’s native auto-merge feature on non-draft PRs that target
develop. Once all required status checks pass and any required
reviews are submitted, the PR squash-merges automatically.
Two notes about the auto-merge policy:
Advisory checks (bandit at low severity, reranker-token linter on certain doc files, cosmic-ray mutation score) can be red without blocking the merge. The required checks list is conservative on purpose; advisory checks are visible signals that get fixed in follow-up PRs.
Hotfixes to main (production) follow a separate flow: cherry-picked from develop, gated by the same checks, manually merged by an operator.
Observability and SLO¶
Three observability surfaces are instrumented:
- Structured logs
Every worker and the API emit JSON-line logs with structured fields (timestamp, level, logger, job_id, operation, stage, fields). Logs are written to logs/ locally and shipped to Loki via the loki-docker-driver (T5.4) in deployed environments. The queue-consumer middleware emits one event per state transition and one per significant stage boundary inside an operation, creating an audit trail that survives worker restarts.
- Metrics
A /metrics endpoint (T5.2) exposes Prometheus counters and histograms: protea_job_state_transitions_total{from,to}, protea_operation_duration_seconds{op}, protea_queue_depth{queue}, protea_http_requests_total plus latency histograms. The Grafana dashboards (T5.3) graph these against an SLO of 95 % of jobs reach SUCCEEDED or FAILED within 6 h of dispatch, with a separate panel per long-running operation (export_research_dataset, compute_embeddings, run_cafa_evaluation).
- Traces (OpenTelemetry)
FastAPI + SQLAlchemy + pika are instrumented via OTel auto-instrumentation (T5.1). Spans are emitted to an OTLP collector; the collector is configured for Tempo in deployed environments and stdout in dev. A request that crosses the API → queue → worker → DB → artifact-store boundary produces a single trace tree, which is essential when diagnosing a slow export.
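A JSON-line log record with the fields listed under Structured logs could be produced by a formatter like the following. This is a sketch using only the standard `logging` module: the class name is hypothetical, the `message` key is an addition not listed above, and the real workers may use a different logging library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        document = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),  # assumption: not in the list above
            # Context fields arrive through logging's `extra=` mechanism.
            "job_id": getattr(record, "job_id", None),
            "operation": getattr(record, "operation", None),
            "stage": getattr(record, "stage", None),
            "fields": getattr(record, "fields", {}),
        }
        return json.dumps(document, sort_keys=True)
```

A caller would attach the formatter to a handler and pass context with `logger.info("stage done", extra={"job_id": ..., "stage": ...})`.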
Definition of done and PR checklist¶
A PR is considered ready to merge when:
All required CI checks are green (the workflows listed in CI gates).
Local CI was run before pushing (ruff, mypy, pytest, check_smells.py). The pre-commit hook bundle enforces the smallest set automatically; the full suite is the author’s responsibility.
The PR description references the slice id (e.g. F-OPS-JOBS.1a) and includes a one-line summary of what changed and why.
No new offenders in the smell budget. If new offenders are intentional (legitimate feature growth), the baseline is updated in the same PR via python scripts/check_smells.py --write-baseline.
OpenAPI is in sync if the change touched router code: poetry run python scripts/generate_openapi.py updates docs/openapi.json.
The plan store is updated if the PR closes a slice: the slice’s status moves to done and the pr: field records the PR number.
Documentation is updated when the change introduces or removes a public surface: Sphinx pages, ADR if architectural, runbook if operational.
Documentation hygiene¶
- ADR registry
docs/source/adr/ contains 38+ Architecture Decision Records numbered D01-D38, plus nine legacy numbered ADRs (001-009). Each ADR documents context, decision, and consequences in a consistent RST template. Recent entries: D34 (selective rerank resurrection), D35 (canonical 8-PLM embedding configs), D36 (PLM axis explicit in dataset naming), D37 (auth users/roles multi-instance), D38 (neural head deferred, dataset-pack pivot).
- Hard constraints document
The repo-root CLAUDE.md and agent-farm/CLAUDE.md list non-negotiable constraints for every session: no pgvector for KNN at scale, no direct push to main/develop, no stash, no em-dashes in prose, no force push, no skipping pre-commit hooks. These are read by all contributors and any automated tooling at session start.
- Plan store
agent-farm/plans/<loop>/PLAN.md is the canonical slice catalog. Each slice is a self-contained unit of work with a unique ID (e.g. FARM-EXP.13, F-DATA-PACK.1), a clear objective, and acceptance criteria. Completed slices are marked with a status and linked to the PR that delivered them.
- Sphinx docs build (zero warnings)
.github/workflows/docs.yml runs sphinx-build with -W (treat warnings as errors). Every autodoc cross-reference must resolve; orphaned RST files are detected.
How to read this page¶
Link conventions used above:
:file: or inline code paths (e.g. scripts/check_smells.py) refer to files in the PROTEA repository root.
:doc: cross-references point to other pages in this Sphinx site.
Workflow names in the CI table (e.g. test.yml) refer to files under .github/workflows/ in the PROTEA repository.
Slice identifiers (e.g. FARM-EXP.6, T2B.4) refer to entries in agent-farm/plans/ or the master plan v3.2. Use them as search keys in the plan store to find the full specification and linked PR.
ADR identifiers (e.g. D34, 002) refer to files under docs/source/adr/ and are cross-referenced with :doc:.
What “hard gate” vs “soft gate” means in this context:
A hard gate is a required CI status check: a PR cannot be merged to
develop until it passes. A soft gate (the smell budget, mutation
score, bandit advisory) produces a visible signal and blocks the
conversation without blocking the merge. Soft gates are intended to
prompt a fix-or-accept decision rather than to stop work cold.