Configuration Reference¶

PROTEA loads its configuration from two sources, merged in this order (later entries win):

protea/config/system.yaml (file-based defaults)
Environment variables (runtime overrides)
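
The "later entries win" rule can be pictured as a recursive dictionary merge. The following is a minimal sketch, not PROTEA's actual loader: `deep_merge` and both input dictionaries are hypothetical names and values chosen for illustration.

```python
from typing import Any, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Return a new dict in which values from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            # Both sides are sections: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalar (or type mismatch): the later source replaces the earlier one.
            merged[key] = value
    return merged


# Hypothetical parsed content of system.yaml.
file_defaults = {
    "database": {"url": "postgresql+psycopg://user:pass@host:5432/dbname"},
    "storage": {"backend": "local", "root": "storage/artifacts"},
}
# Hypothetical overrides derived from PROTEA_DB_URL and PROTEA_STORAGE_BACKEND.
env_overrides = {
    "database": {"url": "postgresql+psycopg://protea:secret@db:5432/protea"},
    "storage": {"backend": "minio"},
}

settings = deep_merge(file_defaults, env_overrides)
```

Keys that the environment does not mention (here `storage.root`) keep their file-based value.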

YAML structure¶

database:
  url: postgresql+psycopg://user:pass@host:5432/dbname

queue:
  amqp_url: amqp://guest:guest@localhost:5672/

storage:
  artifacts_dir: storage/evaluation_artifacts   # cafaeval output
  backend: local                                 # "local" | "minio"
  root: storage/artifacts                        # local backend root
  minio:
    endpoint: localhost:9000
    bucket: protea
    access_key: minioadmin
    secret_key: minioadmin
    secure: false                                # true for HTTPS

admin:
  token: protea-admin

Only database.url and queue.amqp_url are strictly required; the storage and admin sections have working defaults. The file is loaded by protea.infrastructure.settings.load_settings(project_root) at startup.

The storage block drives the ArtifactStore abstraction described in Infrastructure. With backend: local (the default), all blobs land under storage/artifacts/ on the API host. Setting backend: minio activates the S3-compatible path; this requires the [storage] extra (pip install 'protea[storage]') and a running MinIO instance (see docker compose --profile storage up). Paths under storage.* are resolved relative to the project root when not absolute.
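
The path rule in the last sentence can be sketched as follows. This is an illustration only: `resolve_storage_path` is a hypothetical helper, and `/srv/protea` is an assumed project root.

```python
from pathlib import PurePosixPath


def resolve_storage_path(value: str, project_root: PurePosixPath) -> PurePosixPath:
    """Anchor relative storage paths at the project root; keep absolute ones."""
    path = PurePosixPath(value)
    return path if path.is_absolute() else project_root / path


root = PurePosixPath("/srv/protea")  # assumed project root
relative = resolve_storage_path("storage/artifacts", root)
absolute = resolve_storage_path("/var/lib/protea/artifacts", root)
```

Here `relative` ends up under the project root, while `absolute` is returned unchanged.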

Environment variable overrides¶

Variable	Description
`PROTEA_DB_URL`	Overrides `database.url`. Must be a valid SQLAlchemy connection string using the `postgresql+psycopg` driver.
`PROTEA_AMQP_URL`	Overrides `queue.amqp_url`. Standard AMQP URL format.
`PROTEA_ARTIFACTS_DIR`	Overrides `storage.artifacts_dir` (the `cafaeval` artefacts directory used by `run_cafa_evaluation`).
`PROTEA_STORAGE_BACKEND`	Overrides `storage.backend`: `local` (default) or `minio`.
`PROTEA_STORAGE_ROOT`	Overrides `storage.root` (local backend root directory).
`PROTEA_MINIO_ENDPOINT`	Overrides `storage.minio.endpoint` (e.g. `localhost:9000`).
`PROTEA_MINIO_BUCKET`	Overrides `storage.minio.bucket`.
`PROTEA_MINIO_ACCESS_KEY`	Overrides `storage.minio.access_key`.
`PROTEA_MINIO_SECRET_KEY`	Overrides `storage.minio.secret_key`.
`PROTEA_MINIO_SECURE`	Overrides `storage.minio.secure`: truthy enables HTTPS.
`PROTEA_AUTHN_REQUIRED`	When `false`, the authentication gate is disabled (useful for local development without minted API keys). Default is `true`, so a deployment that forgets to set it stays protected. Accepted truthy values: `1`, `true`, `yes`, `on` (case-insensitive).
`PROTEA_JWT_SECRET`	Shared HS256 secret used to sign and verify `Authorization: Bearer <jwt>` tokens (T5.6b). Must be set when `PROTEA_AUTHN_REQUIRED=true`; the API process will refuse to start if the secret is absent and authentication is enabled. Minimum length: 32 bytes of randomness.
`PROTEA_RATELIMIT_JOBS`	slowapi rate-limit rule for `POST /v1/jobs`. Default `10/minute`. Accepts any slowapi syntax, e.g. `"100/minute"` or `"1000/hour;200/minute"` (T5.6b).
`PROTEA_RATELIMIT_DATASETS`	slowapi rate-limit rule for `POST /v1/datasets`. Default `5/minute` (T5.6b).
`PROTEA_RATELIMIT_API_KEYS`	slowapi rate-limit rule for `POST /v1/auth/api-keys`. Default `5/hour` (T5.6b).
`PROTEA_REF_CACHE_DIR`	Directory for the on-disk KNN reference cache (embedding + annotation matrices keyed by `(embedding_config_id, annotation_set_id)`). Defaults to `data/ref_cache`. Read by `protea.core.disk_cache`.
`PROTEA_PCA_ARTIFACTS_DIR`	Directory for per-PLM PCA projection states (`.npz`) used to pre-compute the `emb_pca` feature family. Defaults to `protea/artifacts/pca`. Read by `protea.core.pca_cache`.
`PROTEA_GITHUB_TOKEN`	GitHub token used by `GET /stack/pulls` to lift the unauthenticated 60 req/h rate limit to 5000 req/h. `GITHUB_TOKEN` and `GH_TOKEN` are honoured as fallbacks (in that order). Any value is accepted as long as the GitHub REST API recognises it.
`PROTEA_METHOD_NUMPY_QUERY_CHUNK`	Per-chunk query count for the numpy KNN backend (forwarded to `protea-method`). PROTEA auto-syncs this from `OperationTuning.numpy_query_chunk` (set via `PROTEA_TUNING__OPERATION__NUMPY_QUERY_CHUNK` or `system.yaml`). Set the env var directly only as an escape hatch, because it short-circuits the tuning sync.
`PROTEA_ALLOWED_ORIGINS`	Comma-separated CORS allowlist for the FastAPI app (T5.5). Priority: this env var overrides `cors.allowed_origins` in `system.yaml`, which in turn overrides the built-in default `http://localhost:3000, http://127.0.0.1:3000, https://protea.ngrok.app`. Empty values are stripped; the resolved tuple is read by `protea.api.app._register_middlewares` at startup.
`PROTEA_PAIR_FEATURE_WORKERS`	Number of parallel worker processes for pairwise alignment feature computation inside `export_research_dataset` (PR #421). Defaults to the host CPU count. Set to `1` to force serial execution (useful for debugging or memory-constrained hosts). Read by `protea.core._pair_feature_compute`.
`PROTEA_ALIGN_CACHE_DIR`	Directory for the persistent SQLite alignment cache used by `export_research_dataset` to avoid redundant NW/SW computations across K-slices of the same PLM (PR #421). Defaults to `protea/artifacts/align_cache` inside the project root. Set to an empty string to disable caching entirely. Read by `protea.core._pair_feature_compute`.
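
Three of the rules in the table above (truthy parsing, the CORS priority chain, and the GitHub token fallback order) can be sketched in a few lines. This is a hedged illustration, not the code in `protea.api.app`: the helper names are hypothetical, treating every non-truthy value as false is an assumption, and so is falling through to the next source when the CORS variable is set but contains no non-empty entries.

```python
import os
from typing import Mapping, Optional, Sequence

TRUTHY = {"1", "true", "yes", "on"}
DEFAULT_ORIGINS = (
    "http://localhost:3000",
    "http://127.0.0.1:3000",
    "https://protea.ngrok.app",
)


def env_flag(name: str, default: bool, env: Mapping[str, str] = os.environ) -> bool:
    """Case-insensitive truthy check; unset variables fall back to `default`."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY  # assumption: anything else is false


def resolve_allowed_origins(
    yaml_origins: Optional[Sequence[str]],
    env: Mapping[str, str] = os.environ,
) -> tuple[str, ...]:
    """Env var beats cors.allowed_origins in YAML, which beats the default."""
    raw = env.get("PROTEA_ALLOWED_ORIGINS", "")
    from_env = tuple(part.strip() for part in raw.split(",") if part.strip())
    if from_env:
        return from_env
    if yaml_origins:
        return tuple(yaml_origins)
    return DEFAULT_ORIGINS


def github_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """First non-empty token, checked in the documented order."""
    for name in ("PROTEA_GITHUB_TOKEN", "GITHUB_TOKEN", "GH_TOKEN"):
        if env.get(name):
            return env[name]
    return None
```

Passing a plain dictionary as `env` keeps the helpers easy to exercise without touching the real process environment.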

Frontend¶

# apps/web/.env.local
NEXT_PUBLIC_API_URL=http://127.0.0.1:8000
# NEXT_PUBLIC_FARM_API_URL=http://localhost:8801  # override only for non-standard deployments

NEXT_PUBLIC_API_URL is the only variable required for normal operation. It is injected at build time by Next.js and embedded in the client bundle.

NEXT_PUBLIC_FARM_API_URL overrides the farm dashboard API origin when the Next.js app cannot reach it through the default same-origin proxy (/farm-api/* rewrites to http://localhost:8801 server-side). Setting this variable is only necessary in non-standard deployments where the farm API runs on a different host or port (PR #443).

Integration test environment variables¶

The Docker-based integration test fixture is controlled by:

Variable	Default	Description
`PROTEA_PG_IMAGE`	`pgvector/pgvector:pg16`	Docker image for the ephemeral Postgres container.
`PROTEA_PG_USER`	`protea`	Database user.
`PROTEA_PG_PASSWORD`	`protea`	Database password.
`PROTEA_PG_DB`	`protea_test`	Database name.
`PROTEA_PG_PORT`	`15432`	Host port mapped to container port 5432.
`PROTEA_PG_TIMEOUT`	`30`	Seconds to wait for Postgres readiness.
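
As an illustration of how these variables combine, the sketch below assembles a connection URL for the ephemeral container from the defaults in the table. The helper name `integration_db_url`, the `localhost` host, and the reuse of the `postgresql+psycopg` driver are assumptions; the real fixture may build its URL differently.

```python
import os
from typing import Mapping

# Defaults copied from the table above.
PG_DEFAULTS = {
    "PROTEA_PG_USER": "protea",
    "PROTEA_PG_PASSWORD": "protea",
    "PROTEA_PG_DB": "protea_test",
    "PROTEA_PG_PORT": "15432",
}


def integration_db_url(env: Mapping[str, str] = os.environ) -> str:
    """Build a SQLAlchemy URL for the test container (host assumed to be localhost)."""
    cfg = {name: env.get(name, default) for name, default in PG_DEFAULTS.items()}
    return (
        "postgresql+psycopg://"
        f"{cfg['PROTEA_PG_USER']}:{cfg['PROTEA_PG_PASSWORD']}"
        f"@localhost:{cfg['PROTEA_PG_PORT']}/{cfg['PROTEA_PG_DB']}"
    )
```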

RabbitMQ management¶

The RabbitMQ management UI is available at http://localhost:15672 (default credentials guest / guest). The ten PROTEA queues are:

Queue	Consumer	Operations
`protea.ping`	QueueConsumer	`ping`
`protea.jobs`	QueueConsumer	`insert_proteins`, `fetch_uniprot_metadata`, `load_ontology_snapshot`, `load_goa_annotations`, `load_quickgo_annotations`, `generate_evaluation_set`
`protea.training`	QueueConsumer	`export_research_dataset` (serialised; GPU/RAM-intensive)
`protea.embeddings`	QueueConsumer	`compute_embeddings` coordinator (serialised, one at a time)
`protea.embeddings.batch`	OperationConsumer	`compute_embeddings_batch`: GPU inference (ephemeral)
`protea.embeddings.write`	OperationConsumer	`store_embeddings`: bulk pgvector insert (ephemeral)
`protea.predictions`	QueueConsumer	`predict_go_terms` coordinator
`protea.predictions.batch`	OperationConsumer	`predict_go_terms_batch`: KNN + GO transfer (ephemeral)
`protea.predictions.write`	OperationConsumer	`store_predictions`: bulk GOPrediction insert (ephemeral)
`protea.evaluations`	QueueConsumer	`run_cafa_evaluation`

Queues are declared at worker startup and survive broker restarts.

Tuning settings¶

PROTEA exposes throughput, retry policy and boundary limits through protea.config.tuning.TuningSettings (pydantic). Values are resolved per call from three sources of increasing priority: built-in defaults, the tuning: section in protea/config/system.yaml, and env vars.

Env var convention: PROTEA_TUNING__<group>__<field>. Double underscore is the path separator (matches pydantic-settings’ env_nested_delimiter) so it never collides with single underscores inside field names.
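
To make the naming convention concrete, here is a minimal sketch of how such variable names map onto a nested group/field structure. pydantic-settings performs this mapping itself; `tuning_overrides` is a hypothetical stand-in written only to show the splitting rule, and it leaves values as strings where pydantic would coerce them to the field type.

```python
from typing import Mapping

PREFIX = "PROTEA_TUNING__"


def tuning_overrides(env: Mapping[str, str]) -> dict[str, dict[str, str]]:
    """Group PROTEA_TUNING__<group>__<field> variables by tuning group."""
    overrides: dict[str, dict[str, str]] = {}
    for name, value in env.items():
        if not name.startswith(PREFIX):
            continue
        # Split on the double underscore only, so single underscores
        # inside field names (publisher_max_attempts) stay intact.
        parts = name[len(PREFIX):].lower().split("__")
        if len(parts) != 2:
            continue  # not a <group>__<field> pair
        group, field = parts
        overrides.setdefault(group, {})[field] = value
    return overrides


example = tuning_overrides({
    "PROTEA_TUNING__QUEUE__PUBLISHER_MAX_ATTEMPTS": "20",
    "PROTEA_TUNING__OPERATION__NUMPY_QUERY_CHUNK": "250",
    "PROTEA_DB_URL": "ignored: not a tuning variable",
})
```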

Categories are derived from docs/CONFIG_INVENTORY.md (T-CONF.1 of master plan revision 3) and migrated incrementally in T-CONF.2.

QueueTuning¶

RabbitMQ publisher and consumer policy.

Field	Default	Purpose
`publisher_max_attempts`	12	Maximum attempts when publishing to RabbitMQ. 12 attempts cover ~4 min of broker downtime with exponential backoff capped at 30 s.
`publisher_base_delay`	1.0	Initial publisher backoff in seconds. Doubles on each attempt.
`oom_max_retries`	5	Retries after hitting a CUDA OOM in a GPU worker.
`oom_base_delay`	5	Initial OOM backoff in seconds.
`oom_max_delay`	300	Cap on the OOM backoff in seconds (5 min).

YAML excerpt:

tuning:
  queue:
    publisher_max_attempts: 12
    oom_max_retries: 5

Env override example:

PROTEA_TUNING__QUEUE__PUBLISHER_MAX_ATTEMPTS=20
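
The "~4 min" figure for the publisher can be checked with a short calculation. This is a sketch under stated assumptions: the delay doubles from `publisher_base_delay`, is capped at 30 s, and one delay follows each of the 12 failed attempts; `backoff_schedule` is a hypothetical helper, not the publisher's actual code.

```python
def backoff_schedule(max_attempts: int, base_delay: float, cap: float) -> list[float]:
    """Delay after each failed attempt: base * 2**n, never above the cap."""
    return [min(base_delay * 2 ** attempt, cap) for attempt in range(max_attempts)]


delays = backoff_schedule(max_attempts=12, base_delay=1.0, cap=30.0)
total_seconds = sum(delays)  # 1 + 2 + 4 + 8 + 16 + 7 * 30 = 241 s, about 4 min
```

Raising `publisher_max_attempts` to 20 under the same assumptions adds eight more capped delays, that is, another 240 s of tolerated downtime.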

WorkerTuning¶

Pool sizes, in-process caches, reaper timeouts, HTTP cache TTL.

Field	Default	Purpose
`db_pool_size`	20	SQLAlchemy connection pool size.
`db_pool_max_overflow`	40	Extra connections allowed during load peaks.
`db_pool_recycle_seconds`	3600	Recycle connections after N seconds.
`model_cache_max`	1	PLM models cached per embeddings process.
`ref_cache_max`	1	Reference data sets cached per predict process.
`reaper_main_timeout_seconds`	21600	Hard timeout before jobs are marked FAILED in production (6 h).
`reaper_default_timeout_seconds`	3600	Default used by the StaleJobReaper constructor.
`reaper_stall_seconds`	1800	Time without a JobEvent before a job is considered stalled.
`api_cache_default_ttl_seconds`	300.0	Default TTL of the HTTP cache.

OperationTuning¶

Module-level chunk and batch sizes used inside operations.

Field	Default	Purpose
`annotation_chunk_size`	10_000	Rows per chunk when loading or iterating annotations.
`stream_chunk_size`	2_000	Chunk size for PyArrow streaming / SQLAlchemy yield_per.
`store_chunk_size`	10_000	Rows per chunk when publishing predictions to the store queue.
`numpy_query_chunk`	500	Query chunk size for the numpy KNN backend (caps the memory used by the distance matrix).
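
The memory effect of `numpy_query_chunk` can be estimated with simple arithmetic: each chunk materialises a distance matrix of chunk size by number of reference embeddings. The sketch below assumes 4-byte (float32) distances and a hypothetical reference set of one million embeddings; neither figure comes from PROTEA itself.

```python
import math


def distance_matrix_bytes(query_chunk: int, n_refs: int, bytes_per_value: int = 4) -> int:
    """Size of one (query_chunk x n_refs) distance matrix; float32 assumed."""
    return query_chunk * n_refs * bytes_per_value


def chunk_count(n_queries: int, query_chunk: int) -> int:
    """Number of chunks needed to cover all queries."""
    return math.ceil(n_queries / query_chunk)


n_refs = 1_000_000  # hypothetical reference set size
per_chunk = distance_matrix_bytes(500, n_refs)  # 2_000_000_000 bytes, about 1.9 GiB
chunks = chunk_count(1_200, 500)                # 3 chunks for 1,200 queries
```

Halving the chunk size halves the peak matrix size and doubles the number of passes over the reference set.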

HTTP retry policy and per-source timeouts (UniProt, GOA, QuickGO, ontology) live inside the respective pydantic payloads (InsertProteinsPayload, LoadGoaAnnotationsPayload, etc.) by design: callers pick them per-job rather than as global infra defaults.

APILimits¶

HTTP boundary limits enforced at the FastAPI router layer.

Field	Default	Purpose
`max_fasta_bytes`	52428800 (50 MB)	Upper limit for FASTA uploads in bytes. Applies to `annotate` and `query_sets`.
`max_comment_length`	500	Maximum characters per comment in /support.
`recent_limit`	20	Items returned by default in /support/recent.
`page_limit`	100	Hard cap on page size for the support list endpoints.
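
The sketch below shows how limits of this kind are typically applied at a request boundary. The two helpers are hypothetical and say nothing about how the PROTEA routers respond to a violation; they only illustrate the arithmetic behind the defaults.

```python
MAX_FASTA_BYTES = 50 * 1024 * 1024  # 52_428_800, the documented default
PAGE_LIMIT = 100


def fasta_within_limit(payload: bytes, limit: int = MAX_FASTA_BYTES) -> bool:
    """True when the upload fits inside the configured byte limit."""
    return len(payload) <= limit


def clamp_page_size(requested: int, cap: int = PAGE_LIMIT) -> int:
    """Keep a client-supplied page size between 1 and the hard cap."""
    return max(1, min(requested, cap))
```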

Config-exempt: research methodology constants¶

The following constants are deliberately not in TuningSettings because changing them would shift the canonical numbers reported in the thesis and papers:

EMBEDDING_PCA_DIM = 16 (core/reranker.py): part of the feature schema contract that protea-contracts will own; it gates compatibility with trained boosters.
N_THRESHOLDS = 101 (core/metrics.py): CAFA Fmax sweep granularity. Changing it produces non-comparable Fmax numbers.
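
To see why N_THRESHOLDS affects comparability, consider a minimal sketch of an Fmax sweep. It assumes the 101 thresholds form an evenly spaced grid on [0, 1] (steps of 0.01), which the value suggests but the source does not state; `fmax` and its input format are hypothetical, not the code in core/metrics.py.

```python
N_THRESHOLDS = 101

# Assumed grid: 0.00, 0.01, ..., 1.00.
thresholds = [i / (N_THRESHOLDS - 1) for i in range(N_THRESHOLDS)]


def fmax(pr_points: list[tuple[float, float]]) -> float:
    """Best F1 over (precision, recall) pairs, one pair per threshold."""
    best = 0.0
    for precision, recall in pr_points:
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Because the maximum is taken over the grid, a coarser or finer grid can land on a different best threshold and therefore report a different Fmax for the same predictions.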

Structural exempt¶

Format-spec positional indices live in code (e.g. GAF column indices in core/operations/load_goa_annotations.py). They are not configurable because doing so would mean PROTEA stops reading the GAF format.