ADR-001: KNN on CPU, not pgvector or GPU

Date: 2025-12-15
Author: frapercan
Status: Accepted

Context

GO term prediction requires K-nearest-neighbor search over 500K+ embeddings of 1280 dimensions. The natural options were pgvector (we already store vectors there) or PyTorch on GPU (we already have the GPU for inference). Both failed:

pgvector with an IVFFlat index on 527K vectors: the index build took more than 20 minutes, and each individual query cost 100-500 ms. At thousands of queries per job, that adds up to minutes of pure search time at best, which was unacceptable.
PyTorch on GPU: the GPU is busy with ESM-2/ESM-3c/T5 inference. Allocating the distance matrix there competes with model forward passes for GPU memory and causes CUDA out-of-memory (OOM) errors.

Decision

KNN runs on CPU, entirely in Python:

NumPy (brute-force search via matrix multiplication) for small datasets (<100K vectors).
FAISS (Flat, IVFFlat, HNSW) for large datasets (100K vectors and above). It uses SIMD and multithreading on the CPU without touching the GPU.
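
The two paths above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the squared-L2 metric, and the choice of `IndexFlatL2` are assumptions, and only the 100K threshold comes from the text. The NumPy path expands ||q - r||^2 into a single matrix multiplication; the FAISS path is imported lazily so the small-dataset path works without FAISS installed.

```python
import numpy as np

SMALL_DATASET_LIMIT = 100_000  # threshold from the decision above


def knn_numpy(queries: np.ndarray, refs: np.ndarray, k: int):
    """Brute-force squared-L2 KNN built around one matrix multiplication."""
    q = queries.astype(np.float32)
    r = refs.astype(np.float32)
    # ||q - r||^2 = ||q||^2 - 2 q.r + ||r||^2
    d2 = (q * q).sum(axis=1)[:, None] - 2.0 * (q @ r.T) + (r * r).sum(axis=1)[None, :]
    np.maximum(d2, 0.0, out=d2)  # clamp tiny negatives caused by rounding
    k = min(k, r.shape[0])
    # argpartition selects the k smallest per row; only those k are then sorted
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    part = np.take_along_axis(d2, idx, axis=1)
    order = np.argsort(part, axis=1)
    return np.take_along_axis(part, order, axis=1), np.take_along_axis(idx, order, axis=1)


def knn_faiss(queries: np.ndarray, refs: np.ndarray, k: int):
    """Exact CPU search with FAISS (assumed index type: IndexFlatL2)."""
    import faiss  # lazy import: only needed for large datasets

    r = np.ascontiguousarray(refs, dtype=np.float32)
    q = np.ascontiguousarray(queries, dtype=np.float32)
    index = faiss.IndexFlatL2(r.shape[1])
    index.add(r)
    return index.search(q, k)  # (squared distances, indices)


def knn(queries: np.ndarray, refs: np.ndarray, k: int):
    """Dispatch on reference-set size, as described in the decision."""
    if refs.shape[0] < SMALL_DATASET_LIMIT:
        return knn_numpy(queries, refs, k)
    return knn_faiss(queries, refs, k)
```

Both paths return squared L2 distances with neighbours sorted from nearest to farthest, so callers do not need to know which backend ran.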

Reference embeddings are loaded once from PostgreSQL into a process-level cache (_REF_CACHE, float16, ~4 GB for 500K vectors). pgvector remains as storage only: the VECTOR type is still used for the columns, but we never search with its <=> (cosine distance) operator.
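
A load-once cache of this kind might look like the sketch below. Only the name `_REF_CACHE` and the float16 storage come from the text; the key scheme, the `loader` callable (a stand-in for the PostgreSQL query), and the `raw_size_gb` helper are hypothetical.

```python
from typing import Callable, Dict

import numpy as np

# Process-level cache: lives as long as the worker process does.
_REF_CACHE: Dict[str, np.ndarray] = {}


def get_reference_embeddings(key: str, loader: Callable[[], np.ndarray]) -> np.ndarray:
    """Return cached embeddings, calling loader() only on the first request."""
    cached = _REF_CACHE.get(key)
    if cached is None:
        # Store as float16 to halve the footprint relative to float32.
        cached = np.ascontiguousarray(loader(), dtype=np.float16)
        _REF_CACHE[key] = cached
    return cached


def raw_size_gb(n_vectors: int, dim: int, dtype=np.float16) -> float:
    """Size of the bare embedding matrix in GB (10**9 bytes)."""
    return n_vectors * dim * np.dtype(dtype).itemsize / 1e9
```

For 500K vectors of 1280 dimensions, `raw_size_gb(500_000, 1280)` gives about 1.28 GB for the float16 matrix alone, so the ~4 GB figure presumably covers more than the matrix itself (for example float32 working copies for search, or per-protein metadata); that breakdown is a guess, not something this ADR states.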

Consequences

The cache consumes worker RAM (~4 GB). If the worker restarts, the first prediction takes about 15 s longer while the cache reloads from the database.
KNN and inference run in parallel without contention: CPU computes distances while GPU computes embeddings.
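
One way the overlap could look inside a single process is sketched below. This is an assumption about structure, not a description of the actual workers: `embed_batch` is a hypothetical stand-in for a GPU forward pass, and the search step is reduced to a nearest-neighbour lookup. The point is the scheduling: the search for batch i-1 runs in a CPU thread while batch i is being embedded.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def embed_batch(batch_id: int, dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in for a GPU forward pass; returns fake embeddings."""
    rng = np.random.default_rng(batch_id)
    return rng.normal(size=(4, dim)).astype(np.float32)


def nearest_refs(queries: np.ndarray, refs: np.ndarray) -> np.ndarray:
    """CPU step: index of the nearest reference by squared L2 distance."""
    d2 = ((queries[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def run_pipeline(n_batches: int, refs: np.ndarray) -> list:
    """Embed batch i while the search for batch i-1 runs in a worker thread."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as cpu_pool:
        pending = None
        for batch_id in range(n_batches):
            emb = embed_batch(batch_id)  # "GPU" work for the current batch
            if pending is not None:
                results.append(pending.result())  # collect the previous search
            pending = cpu_pool.submit(nearest_refs, emb, refs)
        if pending is not None:
            results.append(pending.result())
    return results
```

Threads are enough for this sketch because NumPy releases the GIL during its heavy array operations; separate worker processes would be an equally valid design.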

Rejected alternatives

Dedicated vector database (Milvus, Qdrant): one more infra dependency for something NumPy/FAISS solves in-process.
Persistent FAISS index on disk: IVFFlat training takes a few seconds; not worth the complexity of serialising/deserialising for now.