Anc2Vec Complexity
Note
Lookup: O(1) per GO term (hash table). Batch lookup: O(N) in the number of GO terms.
The anc2vec index is a 200-dimensional float32 embedding matrix for the GO release of 2020-10-06, loaded once into RAM and accessed via a Python dict (hash table). Single-term lookup is O(1); the batch variant allocates an (N, 200) output matrix and fills it row-by-row, so it is O(N) in the number of requested GO terms. The full index occupies approximately 25 MB in RAM.
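The lookup and batch costs described above can be illustrated with a minimal sketch. TinyIndex, row_of, and the zero-fill behaviour for unknown terms are hypothetical names and assumptions made for illustration; they are not taken from the real Anc2VecIndex class.

```python
import numpy as np


class TinyIndex:
    """Hypothetical stand-in for Anc2VecIndex: a dict of GO id -> row plus a matrix."""

    def __init__(self, terms, embeddings):
        # One float32 row per GO term, stored in C (row-major) order.
        self.embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
        self.row_of = {term: i for i, term in enumerate(terms)}
        self.dim = self.embeddings.shape[1]

    def lookup(self, term):
        # O(1) on average: one hash lookup, then one row access.
        row = self.row_of.get(term)
        return None if row is None else self.embeddings[row]

    def batch(self, terms):
        # O(N): allocate an (N, dim) matrix and fill it row by row.
        # Assumption: unknown terms are left as all-zero rows.
        out = np.zeros((len(terms), self.dim), dtype=np.float32)
        for i, term in enumerate(terms):
            row = self.row_of.get(term)
            if row is not None:
                out[i] = self.embeddings[row]
        return out


terms = ["GO:0008150", "GO:0003674", "GO:0005575"]
matrix = np.arange(3 * 200, dtype=np.float32).reshape(3, 200)
index = TinyIndex(terms, matrix)
vectors = index.batch(["GO:0003674", "GO:9999999"])
```

Here vectors has one row per requested term, so the cost of batch() grows linearly with the length of the request list regardless of how many terms the index holds.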
The implementation is in protea_method.anc2vec (extracted from
protea/core/anc2vec_embeddings.py):
def get_index(path: str | None = None) -> Anc2VecIndex:
    """Return a process-wide singleton index keyed by path.

    Resolves the default path through env var > repo-relative fallback
    when no path is passed. See the module docstring for the full
    contract and operator-facing failure mode.
    """
    if path is not None:
        return _get_index_lib(path)
    return _get_index_lib(str(_resolve_default_path()))
Runtime characteristics
The @lru_cache(maxsize=2) decorator on get_index ensures the
index is loaded from disk once per process (keyed by path), after which
all downstream callers share the same in-memory object. The .npz
file is loaded with np.load(..., allow_pickle=True); the embeddings
array is made contiguous with np.ascontiguousarray so that
row-wise reads in batch() operate on contiguous memory.
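The load-once behaviour can be sketched as follows. This is an illustration under stated assumptions, not the library's code: load_index is a hypothetical function, and the key names "terms" and "embeddings" are assumed rather than taken from the real .npz layout. A small throwaway file is written first so the sketch runs end to end.

```python
import os
import tempfile
from functools import lru_cache

import numpy as np


@lru_cache(maxsize=2)
def load_index(path: str):
    """Hypothetical loader: reads the .npz once per distinct path."""
    with np.load(path, allow_pickle=True) as data:
        # "terms" and "embeddings" are assumed key names for this sketch.
        terms = [str(t) for t in data["terms"]]
        embeddings = np.ascontiguousarray(data["embeddings"], dtype=np.float32)
    return terms, embeddings


# Write a tiny demo file so the loader has something to read.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.npz")
np.savez(
    demo_path,
    terms=np.array(["GO:0008150", "GO:0003674"]),
    embeddings=np.ones((2, 200), dtype=np.float32),
)

first = load_index(demo_path)   # cache miss: reads from disk
second = load_index(demo_path)  # cache hit: returns the same object
cache_stats = load_index.cache_info()
```

Because the cache is keyed by the path string, two spellings of the same file (for example a relative and an absolute path) would count as separate entries, which is one reason to normalise the path before calling a cached loader.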
Usage in the export pipeline
During feature engineering, batch() is called for the transferred GO
terms of each (query, reference) pair. For k = 5 neighbours and T ~ 30
GO terms per pair, a single query generates k × T = 150 anc2vec lookups.
Across q ~ 5000 queries per aspect this is 150 × 5000 = 750k lookups per
KNN call, each O(1) via the hash table, typically completing in milliseconds.
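The lookup counts quoted above follow directly from the three workload parameters. The short sketch below reproduces the arithmetic; the variable names are illustrative only.

```python
# Workload parameters quoted in the text (approximate values).
k = 5                  # neighbours per query
terms_per_pair = 30    # transferred GO terms per (query, reference) pair
queries = 5000         # queries per aspect

lookups_per_query = k * terms_per_pair
lookups_per_call = lookups_per_query * queries
```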
Artifact path
The NPZ file lives at PROTEA/artifacts/anc2vec/anc2vec_2020-10.npz.
Its location is resolved at runtime through the PROTEA_METHOD_ANC2VEC_PATH
environment variable (or PROTEA_ANC2VEC_PATH in the PROTEA worker
context). The embedding dimension is 200, not 256; verify it with
Anc2VecIndex.dim before consuming the vectors downstream.
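A minimal sketch of the "env var > repo-relative fallback" resolution and of a dimension check, assuming the environment variable simply holds a file path. resolve_path and check_dim are hypothetical helpers written for illustration; the real resolution logic lives in _resolve_default_path and may differ.

```python
import os
from pathlib import Path

ENV_VAR = "PROTEA_METHOD_ANC2VEC_PATH"
# Repo-relative fallback, as given in the text.
FALLBACK = Path("PROTEA/artifacts/anc2vec/anc2vec_2020-10.npz")


def resolve_path(env=None) -> Path:
    """Hypothetical resolver: the environment variable wins over the fallback."""
    env = os.environ if env is None else env
    override = env.get(ENV_VAR)
    return Path(override) if override else FALLBACK


def check_dim(dim: int, expected: int = 200) -> None:
    """Hypothetical guard: fail early if the embedding width is unexpected."""
    if dim != expected:
        raise ValueError(f"expected {expected}-dimensional embeddings, got {dim}")


default_path = resolve_path({})
custom_path = resolve_path({ENV_VAR: "/data/custom.npz"})
check_dim(200)  # passes silently
```

In real code the value passed to check_dim would come from Anc2VecIndex.dim, so a mismatched artifact is caught before any feature matrix is built.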
Cross-reference
Thesis Ch. 5.3 covers the role of anc2vec features in the LightGBM ensemble and the GeOKG no-go decision (2026-05-17): anc2vec remains the default GO embedding until a full-text Fmax gain greater than 0.005 can be demonstrated.