Measuring Performance

This page describes how to profile PROTEA operations. It covers the structured event log that PROTEA writes to its database and three profilers: scalene, pyinstrument, and cProfile with snakeviz.

PROTEA’s built-in timing: JobEvent

Every Operation.execute call emits structured events via the emit callback. Timing information is available from the DB without any extra tooling:

SELECT event, created_at,
       payload->>'elapsed_s' AS elapsed_s
FROM   job_events
WHERE  job_id = '<your-job-uuid>'
ORDER  BY created_at;

The export_research_dataset operation emits events with the export_research_dataset.* prefix (e.g., export_research_dataset.knn_done, export_research_dataset.alignment_done) so each sub-step can be timed from the event log alone.
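When an event carries no elapsed_s value in its payload, the gap between consecutive created_at timestamps can stand in for the step duration. The sketch below assumes that each event is emitted when its sub-step finishes and that sub-steps run one after another; the rows are hypothetical, and the event name export_research_dataset.started is invented for illustration.

```python
from datetime import datetime


def step_durations(rows):
    """Return (event, seconds since the previous event) for every row after the first.

    `rows` must be (event, created_at) pairs already sorted by created_at,
    as returned by the query above.
    """
    rows = list(rows)
    durations = []
    for (_, prev_ts), (event, ts) in zip(rows, rows[1:]):
        durations.append((event, (ts - prev_ts).total_seconds()))
    return durations


# Hypothetical rows, not real measurements. The ".started" event name is an
# assumption; only knn_done and alignment_done are named on this page.
rows = [
    ("export_research_dataset.started", datetime(2024, 1, 1, 12, 0, 0)),
    ("export_research_dataset.knn_done", datetime(2024, 1, 1, 12, 0, 30)),
    ("export_research_dataset.alignment_done", datetime(2024, 1, 1, 12, 2, 0)),
]

for event, seconds in step_durations(rows):
    print(f"{event}: {seconds:.1f} s")
```

If sub-steps overlap or events are emitted at the start of a step, these deltas no longer map to a single step, and the elapsed_s payload field is the better source.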

scalene (line-level CPU + GPU + memory)

scalene is the recommended profiler for PROTEA workers. It samples CPU time, GPU time, and memory use per line without requiring code changes.

To profile the export operation module, pass it to scalene with the --cpu, --gpu, and --memory flags:

poetry run scalene --cpu --gpu --memory \
    protea/core/operations/export_research_dataset.py

To profile a specific worker invocation instead, using only the --cpu and --memory flags:

poetry run scalene --cpu --memory \
    scripts/run_one_job.py <job_uuid>

Output is an HTML report in the current directory. The PERF.1 slice will publish pre-computed flamegraphs from the FARM-EXP.13 run under docs/perf/ once that slice lands.

pyinstrument (call-stack sampling)

pyinstrument is quicker to set up when all you need is a call-stack snapshot:

poetry run pyinstrument scripts/run_one_job.py <job_uuid>

It groups time by call stack rather than by line, which makes it easier to identify which function family (alignment vs KNN vs DB IO) dominates.

cProfile + snakeviz (function-level)

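For function-level profiling, cProfile ships with the Python standard library, so recording a profile needs no extra install; snakeviz is a separate package used only to view the result: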

poetry run python -m cProfile -o /tmp/protea.prof \
    scripts/run_one_job.py <job_uuid>
snakeviz /tmp/protea.prof
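A saved profile can also be summarised in the terminal with the standard-library pstats module, which may help on a headless worker where opening a browser is inconvenient. This is a minimal sketch: top_functions and busy_work are hypothetical names, and the toy workload only exists so the example runs on its own.

```python
import cProfile
import io
import os
import pstats
import tempfile


def top_functions(prof_path, limit=10):
    """Return a text table of the `limit` most expensive functions by cumulative time."""
    buf = io.StringIO()
    stats = pstats.Stats(prof_path, stream=buf)
    stats.strip_dirs().sort_stats(pstats.SortKey.CUMULATIVE).print_stats(limit)
    return buf.getvalue()


def busy_work():
    """Hypothetical stand-in for a real job; burns a little CPU."""
    return sum(i * i for i in range(100_000))


# Record a small profile to a temporary file so the example is self-contained.
profiler = cProfile.Profile()
profiler.runcall(busy_work)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.prof")
profiler.dump_stats(demo_path)

report = top_functions(demo_path, limit=5)
print(report)
```

To inspect the profile recorded by the command above, call top_functions("/tmp/protea.prof") instead of the demo path.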

Interpreting hot paths

Based on FARM-EXP.13 measurements, the typical cost breakdown for a single export_research_dataset cell is:

  • GPU embedding pass: 70-90% of wall clock (PLM-dependent)

  • Pairwise alignment: 5-20% (cold cache); under 1% (warm cache, PR #421)

  • KNN search: 3-8%

  • DB queries + parquet IO: under 2%

If alignment dominates even on a warm cache, verify that PROTEA_PAIR_FEATURE_WORKERS is set and that PROTEA_ALIGN_CACHE_DIR points to a writable directory.
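The two checks above can be scripted before launching a worker. The sketch below is an assumption about what "set" and "writable" mean in practice (a non-empty value, and an existing directory the current user can write to); it does not reflect how PROTEA itself reads these variables, and check_alignment_env is a hypothetical helper.

```python
import os
import tempfile


def check_alignment_env(env=None):
    """Return a list of problems; an empty list means both checks passed."""
    env = os.environ if env is None else env
    problems = []

    # Treat an unset or empty value as "not set" (an assumption).
    if not env.get("PROTEA_PAIR_FEATURE_WORKERS"):
        problems.append("PROTEA_PAIR_FEATURE_WORKERS is not set")

    cache_dir = env.get("PROTEA_ALIGN_CACHE_DIR")
    if not cache_dir:
        problems.append("PROTEA_ALIGN_CACHE_DIR is not set")
    elif not os.path.isdir(cache_dir):
        problems.append(f"PROTEA_ALIGN_CACHE_DIR is not a directory: {cache_dir}")
    elif not os.access(cache_dir, os.W_OK):
        problems.append(f"PROTEA_ALIGN_CACHE_DIR is not writable: {cache_dir}")

    return problems


# Check the current process environment and report anything that looks wrong.
for problem in check_alignment_env():
    print(problem)

# Demonstration with an explicit mapping and a throwaway directory.
demo_dir = tempfile.mkdtemp()
demo_env = {"PROTEA_PAIR_FEATURE_WORKERS": "4", "PROTEA_ALIGN_CACHE_DIR": demo_dir}
print(check_alignment_env(demo_env))
```

Passing the environment as a plain mapping keeps the helper easy to exercise without touching the real process environment.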

Forward reference: PERF.1 flamegraphs

The upcoming PERF.1 slice will publish scalene HTML reports for each of the 24 FARM-EXP.13 cells under docs/perf/. This page will be updated with direct links once that slice ships.

Cross-reference

Thesis Ch. 5.6 summarises the profiling methodology and reproduces the top-line measurements used to motivate the process-pool + cache design in PR #421.