ADR-006: Sequence deduplication by MD5¶

Date:: 2025-12-10
Author:: frapercan
Status:: Accepted

Context¶

UniProt has ~570K accessions in Swiss-Prot, but only ~540K unique sequences. The remaining 30K are isoforms or cross-references sharing the same amino acid chain.

Computing the embedding for a sequence costs ~0.5s on GPU. Processing 30K duplicates wastes 4+ hours per full run.

Decision¶

When inserting proteins, we compute the MD5 hash of the amino acid string. The Sequence table has a unique constraint on ``sequence_hash``:

If the hash already exists -> reuse the existing Sequence.id.
If it does not exist -> insert a new row.

Multiple Protein rows (one per UniProt accession) point to the same Sequence. The FK Protein.sequence_id is intentionally non-unique.

When the embedding pipeline runs, it only processes Sequence rows without an embedding; duplicates are skipped automatically.

Consequences¶

MD5 is not cryptographically secure, but that does not matter here: there is no adversarial input, only biological sequences.
Sequences with a single mutation produce different hashes and are stored separately. This is correct: a mutation changes the embedding.

Rejected alternatives¶

SHA-256: digest twice as long, zero practical benefit.
UNIQUE on the sequence text column: indexing multi-kilobyte text columns is expensive; the 32-char hex digest is far more efficient.
CD-HIT clustering (90-95% identity): useful for reducing redundancy in evolutionary analysis, but here we need exact deduplication (100%).